* fixes big KOPT, breaks opencl
* fix optimizer
* KernelCache
* oops, broke batchnorm
* hack to fix it
* fix llvm, less hacky gpu
* disable the cache
* cache just breaks things
* triton can add
* print stuff from triton
* write out file
* ops in triton working
* reduce ops
* sort of works
* Triton bugfixes & implementation of remaining ops (#490)
* padding
* support pow, max, relu, gt0
* allocate return buffer
* Fix reduce
* Add tests for power op
* Fix triton illegal memory accesses and memory leak (#512)
* Fix mypy issue
* Add triton to setup.py
* Replace torch with pycuda
* Use one cuda stream for data transfer and kernels
* Remove triton submodule
* Fix memory leak by using weakrefs for caching
* Fix memory access by adding valid as mask for load
* Fix invalid kernel launches by flattening the grid (#515)
---------
Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
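
A minimal sketch of the pattern behind the last two fixes in #512/#515, in the style of the standard Triton tutorial add kernel (names like `add_kernel` and `BLOCK` are illustrative, not tinygrad's actual kernel): out-of-range lanes are masked on `tl.load`/`tl.store` so partial blocks can't touch invalid memory, and the launch grid is a flat 1-D `(num_blocks,)` tuple.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # flattened 1-D grid: one program per block
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices this program covers
    mask = offs < n_elements                  # "valid" mask: off-the-end lanes do nothing
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

# launched over a flat grid rather than a multi-dimensional one:
# add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
```

The leak fix above is the host-side complement: cache entries are held through weak references, so the cache itself never keeps GPU allocations alive.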
* Refactor getenv into helpers
* Remove unused os
* Fix default value
* Fix more defaults for CI
* Fix bracket
* Revert changes to openpilot/compile.py
* Use getenv from helpers when possible
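
The refactor above centralizes environment lookups in helpers; a plausible shape for such a helper (the `getenv` name comes from the commits, the body is an assumption) casts the variable to the type of its default, so numeric flags like `DEBUG=4` come back as ints:

```python
import os

def getenv(key: str, default=0):
    # cast to the default's type: getenv("DEBUG", 0) returns an int
    return type(default)(os.getenv(key, default))

DEBUG = getenv("DEBUG", 0)  # example usage
```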
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem
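
A sketch of what a global op/memory counter can look like (class and field names are assumptions based on the `GlobalCounter`/`global_mem` commits): backends bump the tallies once per kernel launch, and the speed tests read and reset them.

```python
class GlobalCounters:
    # process-wide tallies, reset between speed tests
    global_ops: int = 0   # FLOPs issued by launched kernels
    global_mem: int = 0   # bytes moved by launched kernels

    @staticmethod
    def reset():
        GlobalCounters.global_ops, GlobalCounters.global_mem = 0, 0
```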
* chonker will make llvm fast
* work
* better speed tests, we will make them fast
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros_like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
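
The shapetracker commits above rely on the standard strided-view trick: a permute only reorders strides and an expand broadcasts by setting a stride to 0, so neither moves any data. A minimal sketch, with `-1` in expand resolved from the old shape (function names hypothetical):

```python
def permute(shape, strides, order):
    # reordering strides reindexes the same buffer; no copy needed
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)

def expand(shape, strides, new_shape):
    # -1 keeps the old size; broadcast dims (1 -> n) get stride 0
    out_shape = tuple(s if n == -1 else n for s, n in zip(shape, new_shape))
    out_strides = tuple(st if s == n else 0
                        for s, n, st in zip(shape, out_shape, strides))
    return out_shape, out_strides

# permute((2, 3), (3, 1), (1, 0))  -> ((3, 2), (1, 3))
# expand((1, 3), (3, 1), (4, -1))  -> ((4, 3), (0, 1))
```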
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
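
The `partialmethod` commit touches a real descriptor-protocol detail: `functools.partial` objects don't implement `__get__`, so they never receive `self` when attached to a class, while `partialmethod` does. A sketch of the pattern (the `Tensor.relu` registration is illustrative, not tinygrad's exact code):

```python
from functools import partialmethod

class Tensor:
    def _elementwise(self, op_name):
        print(f"apply {op_name} to {self!r}")

# partialmethod is a descriptor, so t.relu() correctly binds self;
# functools.partial(Tensor._elementwise, "relu") would not
Tensor.relu = partialmethod(Tensor._elementwise, "relu")

Tensor().relu()  # -> apply relu to <Tensor object at 0x...>
```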
* simple chonker is chonking
* remove junk from test speed vs torch
* fix linker and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
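
The exec_ast commits describe interpreting a lazy op AST by walking it recursively; a minimal self-contained sketch of that idea (the `LazyOp` fields mirror the concept, not tinygrad's exact classes):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class LazyOp:
    op: str                 # e.g. "ADD", "RELU"
    src: Tuple[Any, ...]    # child LazyOps or concrete buffers

class GenericExecAST:
    OPS = {"ADD": lambda a, b: a + b, "RELU": lambda a: max(a, 0.0)}

    @staticmethod
    def exec_ast(ast):
        # evaluate children first; anything that isn't a LazyOp is already a buffer
        srcs = [GenericExecAST.exec_ast(s) if isinstance(s, LazyOp) else s
                for s in ast.src]
        return GenericExecAST.OPS[ast.op](*srcs)

# GenericExecAST.exec_ast(LazyOp("RELU", (LazyOp("ADD", (2.0, -5.0)),)))  # -> 0.0
```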
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda
* in progress
* big conv test works
* that's unneeded
* fix opencl with reduce
* rewrite contiguous_view_constant_fold
* clean up mids in loop code
* subidx
* print cl kernel before run
* no reduce, no loop
* Revert "no reduce, no loop"
This reverts commit 92777e40e9.
* Fix OpenCL Metal texture issues
Tile CL images when needed to fit within Metal's 16384 maximum image size;
this gets SD on an M1 Pro to ~4.8s/iteration with OPENCL=1 FLOAT16=1.
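
A sketch of the tiling arithmetic implied above, assuming Metal's 16384-pixel maximum texture dimension (helper name and signature are hypothetical):

```python
import math

MAX_DIM = 16384  # Metal's maximum image width/height

def tile_rects(width: int, height: int):
    """Yield (x, y, w, h) tiles that each fit under Metal's size limit."""
    for ty in range(math.ceil(height / MAX_DIM)):
        for tx in range(math.ceil(width / MAX_DIM)):
            x, y = tx * MAX_DIM, ty * MAX_DIM
            yield x, y, min(MAX_DIM, width - x), min(MAX_DIM, height - y)
```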
* Minor cleanup
* Fix mish in CI, or no-op?
* Is mish being framed?
* It would help if any of this reproduced locally
* ???
* OPT is reverted; use original mish
* Cleanup post-review
* Fix some shape usage
* Tiler tests, shouldn't oom or overflow either
* Can't CL if there's no CL?
* Run tiler tests even if GPU=1
* relu6 segfault binary chop; revert test
* relu6 segfault binary chop; revert accel
* relu6 segfault binary chop; revert . (???)
* end relu6 segfault binary chop; repo's haunted