* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* merge kernel and optimizer
* linearize is reentrant
* move global/local size
* clean up linearizer copy
* remove unneeded lin copies
* stop linearizing twice
* oops, that should be None
* optimizer: simplify GROUP and LOCAL to have one of each
Now that tensor cores only use LASTLOCAL, we can simplify and use
only that op everywhere.
The only use of GROUP is in the matvec hand-coded opts, and it makes
no performance difference, so switch to using only the top
behavior.
Also add asserts to prevent tensor core dims from being altered,
which causes bad kernels to be generated.
* search: remove duplicated actions
* wmma: refactor tensor cores using existing local dims
* optimizer: fix bad rebase and break after one late local
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* optimizer: add test for correctness of opts
Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.
* llvm: fix LinearizerOptions to correctly disable has_shared
* optimizer: remove premature test scaffold for TC opts
* search: fix the action space
* small changes
* expand in terms of substitute, directly expand g_idxs g_valid
* delete expand_ops
* don't compare using hash
* any instead of in
thanks gijskoning
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* support tc
* testing code
* no more create_rednode
* maxsize none in view/node
* oops
* undo
* typing
* oops
* oops
* lmao
* lmao
* add expand multi test
* Node.iter_idxs
* type
* type
* delete checks!
* clean up a little?
* expand_idx in symbolic
* un-golf
* play around with types >.>
* test_substitute and also remove an incorrect test?
* get rid of range
* Update symbolic.py
* split out view cache change
* split out flat components change
* reduce diff
* reduce diff
* add some float4 tests
* fix
---------
Co-authored-by: Gijs Koning <gijs-koning@live.nl>
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* dedup buffers
* hmm, bugfix
* fix tensor cores
* opts device
* add constant fold
* err, it's just zero folding
* self store fold + caching
* prints and more folds
* simpler winograd kernels
* remove childless uops
* cache loads across buffers (since they may share rawbufs)
* typing
* add test
* fix test
* small changes to test
* fix test
* one big cache
* whitespace
* golf a line?
* invalid is RawBuffer(0)[0], valid 1.
* new version
* fix abstractions
* try remove test
* Revert "try remove test"
This reverts commit 2fc18a9f8e.
* assert_allclose
* minimize the test
* minimize the test
* minimize the test
* minimize the test
* Revert "minimize the test"
This reverts commit e0c0929596.
* Revert "minimize the test"
This reverts commit 88240551b1.
* Revert "minimize the test"
This reverts commit 78328a7ce2.
* Revert "minimize the test"
This reverts commit 989523fded.
* skip test inside body
* oops
* oops