the correct condition is that PADTO cannot be applied to a reduce axis, not that Reduce.MAX is in the kernel's ops.
even for Reduce.SUM it's possible the reduce axis had a div before it: the padded 0 becomes inf, and the sum over it is incorrect.
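A minimal numpy sketch of that failure mode (illustrative only, not the actual kernel code):

```python
import numpy as np

# 3 real elements on the reduce axis, padded to 4 with a 0 (what PADTO does)
x = np.array([2.0, 4.0, 8.0, 0.0])

# if a div feeds the reduce, the padded 0 turns into inf...
y = 1.0 / x  # [0.5, 0.25, 0.125, inf]

# ...so even a SUM over the padded axis is corrupted, not merely unchanged
print(y.sum())  # inf, expected 0.875
```

A MAX over `y` breaks for the same reason, which is why the guard belongs on the axis, not on the op.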
* these asserts should pass
* fix that assert
* ALU dtypes
* acc dtype for group_for_reduce
* cast image ALUs to the base dtype
* remove all casts from linearizer
* fix argmax
* fix multinomial
* fix __getitem__
* Revert "fix __getitem__"
This reverts commit 62ad719bfa.
* fix MemBuffer outputs being wrong when there is an arange + ALU with a different dtype
e.g. fancy slicing (int, float), bert embeddings (int, long)
this should be fixed in lazy instead of having to break the kernel
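For context, a hedged numpy illustration of the dtype mix described above (the shapes and names are made up):

```python
import numpy as np

emb = np.random.rand(30522, 768).astype(np.float32)  # float weight buffer
ids = np.arange(4, dtype=np.int64)                   # int/long index side

# fancy slicing: int64 index math (the arange + ALU) feeds float32 loads,
# so the output buffer's dtype must follow the data, not the index computation
out = emb[ids]
print(ids.dtype, emb.dtype, out.dtype)  # int64 float32 float32
```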
* cleanup argmax fix
* fix matmul in ints
cast at the end
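A rough numpy analogue of "cast at the end" (an assumption about the fix: keep the accumulation in the integer dtype and cast only the result):

```python
import numpy as np

a = np.arange(6, dtype=np.int32).reshape(2, 3)
b = np.arange(6, dtype=np.int32).reshape(3, 2)

# the multiply-accumulate stays in int32; the cast happens once, at the end,
# instead of casting the inputs to float before the matmul
c = (a @ b).astype(np.float32)
print(c.dtype)  # float32, with exact integer results
```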
* fix llama
* skip wrong hardcoded asts in the world's dataset
* fix llama p2
* cleanup missing parts of the diff
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* autopad shapetracker for BEAM
* OptOps.PADTO
* skip that test for now
* correct padding reduce axis
* just 32
* avoid more than double the FLOPs
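A hypothetical sketch of the two guards above: pad an axis up to a multiple of 32 ("just 32"), and reject pads that more than double the work. The real checks live in tinygrad's optimizer; this only shows the arithmetic:

```python
from math import prod

def round_up(n: int, m: int = 32) -> int:
    # pad an axis length up to the next multiple of 32
    return (n + m - 1) // m * m

def padto_ok(shape, axis) -> bool:
    # per the commits above, axis must also not be a reduce axis
    padded = list(shape)
    padded[axis] = round_up(padded[axis])
    # avoid more than double the FLOPs: reject a pad that grows the
    # element count past 2x the original
    return prod(padded) <= 2 * prod(shape)

print(padto_ok((33, 8), axis=0))  # True:  64*8 is ~1.94x 33*8
print(padto_ok((9, 8), axis=0))   # False: 32*8 is ~3.56x 9*8
```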
* cleanups
* test case
* no support for triton and llvm yet
* typos
* symbolic shape would not work
* cannot PADTO with MAX kernel
* advance db version
* no breaking change - don't advance db version
* is triton just python?
* Revert "is triton just python?"
This reverts commit 17e776c25587615e33a3634c2fb0bb8591ce65d4.
* Revert "Revert "is triton just python?""
This reverts commit 6c434c01e1c4b0ea0431ec18632cd859fb3cf260.
* support llvm
* is it really passing in CI only?
* update tests
* oh triton test passed
* simpler
* revert that, with a test
* check if st are the same
* Revert "check if st are the same"
This reverts commit d2a5eac110a5da1af82a2728c883779ef69c3cad.
* update the db version
* rebase artifact
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* stable diffusion < 324ms
* revert swap action
* fix tests due to more sum splitting
* REDUCEOP_SPLIT_THRESHOLD env var
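A hedged sketch of how an env-var threshold like this is typically read (the default value and call site here are assumptions, not tinygrad's actual ones):

```python
import os

# hypothetical default; the real one is defined in tinygrad
REDUCEOP_SPLIT_THRESHOLD = int(os.getenv("REDUCEOP_SPLIT_THRESHOLD", "32768"))

def should_split_reduce(reduce_size: int) -> bool:
    # split one big reduce into two passes (partial reduces, then a final
    # reduce) once the axis is large enough to be worth it
    return reduce_size >= REDUCEOP_SPLIT_THRESHOLD
```

Setting a larger value in the environment would make splitting rarer; a smaller one, more aggressive.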
* added from unaligned np test (#2134)
* align cpu buffer before copy into cl buffer (#2135)
* remove shelve from handcode_resnet50_opt.py (#2139)
* Add dictionary keys to reduce db size (#2131)
* work
* ignore beam cache
* dictionary keys are generic
* minor db cleanups
* fix baseline and extract dataset
* fix training
* log likelihood
* more lin to feats
* sts
* training policynet
* net sort of works
* dedup
* refactor, stupid new actions
* fix uops deduping
* BEAM_ESTIMATE
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* start work on auto opt
* lin failure
* not beating hcopt
* greedy
* timing is fast
* codegen.search
* greedy search in handcode_opt
* track running gflops
* clean up those files
* no failure