* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* dedup buffers
* hmm, bugfix
* fix tensor cores
* opts device
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU
* Context and Timing can now be used as decorators
* Using Timing decorator in quickstart.md
The time formating is better and is a useful tool to learn.
Old: Time: 3.5260659999912605
New: Time: 3526.14 ms
* Updated env_vars documentation for Context
* Added test for Context decorator
* Put new import on same line as others
* new version
* fix abstractions
* try remove test
* Revert "try remove test"
This reverts commit 2fc18a9f8e.
* assert_allclose
* minimize the test
* minimize the test
* minimize the test
* minimize the test
* Revert "minimize the test"
This reverts commit e0c0929596.
* Revert "minimize the test"
This reverts commit 88240551b1.
* Revert "minimize the test"
This reverts commit 78328a7ce2.
* Revert "minimize the test"
This reverts commit 989523fded.
* skip test inside body
* oops
* oops