* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* metal indirect command buffers
* sub 1ms gpt
* metal batch exec is good
* remove whitespace
* input_replace
* fix ci
* useResources
* very simple cacheallocator
* update_stats
* fix CI
* minor
* remove that from jit
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* Enable Multi-Output Export
* Add test
* Update examples and lint
* fix padding
* test ops
* dummy commit to rerun test
* revert cuda lint
* Enforce tuple/list of tensors
* subscripted generics
* put back webgpu test
* Re-enable WebGPU Efficientnet test
* Allow multi-input model export
* Add model export unit test
* Fix efficientnet compilation
* Only run model export test on JIT supported devices
* Skip export model test if not EXPORT_SUPPORTED_DEVICE
* Symbolic Shape JIT
update tests
2 variables symbolic ops, adding more tests
test passing
cleanup
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed ci
* CUDACPU