* fixes big KOPT, breaks opencl
* fix optimizer
* KernelCache
* oops, broke batchnorm
* hack to fix it
* fix llvm, less hacky gpu
* disable the cache
* cache just breaks things
* triton can add
* print stuff from triton
* write out file
* ops triton working
* reduce ops
* sort of works
* Triton bugfixes & implementation of remaining ops (#490)
* padding
* support pow, max, relu, gt0
* allocate return buffer
* Fix reduce
* Add tests for power op
* Fix triton illegal memory accesses and memory leak (#512)
* Fix mypy issue
* Add triton to setup.py
* Replace torch with pycuda
* Use one cuda stream for data transfer and kernels
* Remove triton submodule
* Fix memory leak by using weakrefs for caching
* Fix memory access by adding valid as mask for load
* Fix invalid kernel launches by flattening the grid (#515)
---------
Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
* Refactor getenv into helpers
* Remove unused os
* Fix default value
* Fix more defaults for CI
* Fix bracket
* Revert changes to openpilot/compile.py
* Use getenv from helpers when possible
* we typing
* types look good in theory
* most tests pass
* gpu tests pass
* TEST_AST
* delete comments
* i must have written that bug so many times
* bugfix
* don't merge the small ones
* add f to constants
* commits from reduce
* don't GCD the mod nodes
* broken and a hack IMAGE=3
* group for reduce
* fix linter + mypy
* move out test ast
* insource TENSOR_TYPE_TO_NP_TYPE
* does this fix it?
* move imports out
* add image
* load + store + boring stuff:
* image tests pass
* thneed print GFLOPS
* op conv test
* more debugging
* hack for multiview image
* shapetracker creates less views
* disable image tests
* working better
* ugh, lkey not key
* print in DEBUG, and allow views
* works
* simple padding conv2d
* use index for image
* that was bad code
* debug print
* fix types
* less lines
* save lines
* chonker will make llvm fast
* work
* better speed tests, we will make them fast
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
* smiple chonker is chonking
* remove junk from test speed vs torch
* fix linker and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
* gemm
* off by factor of 5
* 50 GFLOPS
* works
* 91 gflops
* working at 50G
* works
* iy
* 150 GFLOPS
* 150 GFLOPS
* N=2048 is still fast
* threading soon
* multithread
* pinning
* throttling is sad
* Align matrices to cacheline width (#361)
Co-authored-by: cloud <Cloud11665@gmail.com>