* init
* removed mulacc
* is uoptimize the problem?
* temporary hack to make it work; fix later
* revert extra/ changes
* clean up
* flaky metal tests?
* add back mulacc for metal
* revert last commit
* try skipping linearizer_failure tests
* skip flammit tests, since they all pass locally
* try to narrow down the exact failing linearizer test
* try 2
* try 4
* generated code is identical, so why does CI fail?
* code for tests 15 and 17 is identical with or without mulacc; this should pass
* try with only one failing test
* try garbage collecting
* try deleting the variables
* try garbage collecting after del
* is diskcache the problem?
* try disabling the opts cache
* try removing the hack
* try disabling the github metal cache
* try CACHELEVEL=0
* try increasing newCommandQueueWithMaxCommandBufferCount_ (see the sketch below)
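For context, `newCommandQueueWithMaxCommandBufferCount_` is the pyobjc spelling of Metal's `newCommandQueueWithMaxCommandBufferCount:`; it caps how many command buffers can be in flight before enqueueing blocks. A minimal sketch of raising it, assuming the pyobjc Metal bindings tinygrad uses (the 1024 value is illustrative, not the commit's actual number):

```python
# Sketch: raising the Metal command queue's in-flight command buffer limit
# through the pyobjc bindings. The 1024 here is an illustrative value.
import Metal

device = Metal.MTLCreateSystemDefaultDevice()
# Enqueueing more uncompleted command buffers than this count blocks the caller.
queue = device.newCommandQueueWithMaxCommandBufferCount_(1024)
```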
* revert
* actually not a HACK
* oops
* lazy rewrite, try 2
* minimal fixes for tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7 tolerance
* half is broken
* 1e-6 margin of error
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are really broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
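The LRUAllocator introduced above keeps freed device buffers around and hands them back for later allocations of the same size, so hot loops stop round-tripping through the backend allocator. A minimal sketch of the idea, with illustrative names rather than tinygrad's exact API:

```python
# Sketch of an LRU-style buffer allocator: freed buffers go into per-size
# bins and are reused on the next allocation of that size. A real
# implementation also evicts cached buffers under memory pressure.
from collections import defaultdict

class LRUAllocator:
  def __init__(self, backend):
    self.backend, self.cache = backend, defaultdict(list)
  def alloc(self, size:int):
    # reuse a cached buffer of this size if one is available
    if self.cache[size]: return self.cache[size].pop()
    return self.backend.alloc(size)
  def free(self, buf, size:int):
    # don't release to the backend; stash the buffer for reuse
    self.cache[size].append(buf)
```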
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* var_vals are global
* working with globals, more or less
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
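"use where for kvmask" builds the attention mask as an elementwise select instead of special-casing the masked region. A hedged sketch of the pattern (shapes and the float mask are illustrative; `Tensor.where` is the ternary op added in the where PR further down this log):

```python
# Sketch: masking attention logits with where instead of slicing in -inf.
# cond.where(a, b) picks a where cond is nonzero, b elsewhere.
import numpy as np
from tinygrad.tensor import Tensor

T = 8
scores = Tensor.rand(T, T)                                 # attention logits
mask = Tensor(np.tril(np.ones((T, T), dtype=np.float32)))  # 1 = may attend
masked = mask.where(scores, -float("inf"))                 # -inf -> 0 after softmax
weights = masked.softmax()
```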
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kind of works
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
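"diskcache is generic" plus "__wrapped__" point at the new compile API: each backend exposes a pure `compile_*(src) -> bytes` function wrapped in an on-disk cache, and `functools.wraps` keeps the raw function reachable as `.__wrapped__` when you need to bypass the cache. A sketch under those assumptions (the key derivation and storage layout here are illustrative, not tinygrad's actual diskcache):

```python
# Sketch of a generic on-disk compile cache keyed by function name + source.
import functools, hashlib, pathlib, pickle

CACHEDIR = pathlib.Path("/tmp/compile_cache")

def diskcache(func):
  @functools.wraps(func)   # wraps() also sets wrapper.__wrapped__ = func
  def wrapper(src:str):
    key = hashlib.sha256(f"{func.__name__}{src}".encode()).hexdigest()
    path = CACHEDIR / key
    if path.exists(): return pickle.loads(path.read_bytes())
    ret = func(src)
    CACHEDIR.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(ret))
    return ret
  return wrapper

@diskcache
def compile_clang(src:str) -> bytes:
  return src.encode()      # stand-in for the real clang invocation

compile_clang("int main(){}")              # cached on disk
compile_clang.__wrapped__("int main(){}")  # bypasses the cache
```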
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* lazy cleanups
* ast functions take in LazyOps
* op instead of self.op
* _base for mops
* fix contiguous
* start schedule
* test_schedule
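"start schedule" / "test_schedule" split realization into two phases: first flatten the lazy graph into an ordered list of items, then execute them. A sketch of what one item might carry (names illustrative, not tinygrad's exact dataclass):

```python
# Sketch: a schedule item pairs the kernel AST with the buffers it touches;
# realizing a tensor becomes "build the list, then execute it in order".
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class ScheduleItem:
  ast: Any                  # LazyOp tree describing the computation
  out: Any                  # buffer this item writes
  inputs: Tuple[Any, ...]   # buffers this item reads
```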
* fix openpilot
* more tests
* bugfix and test skip
* work
* make sure things get freed
* fix zerosized tensors
* fix failing test
* fix ceil and friends
* fix openpilot
* disable training
* disable test collectives
* loadop buffer on cpu
* works for GPU
* sort of working
* has bugs
* gpu tests pass
* fix some tests
* fix tensor cores
* fix test linearizer
* fix symbolic
* fix has_variable_shape
* non symbolic size
* disable weird test
* simple cache fix
* fix custom function
* fix kopt
* cleanups
* a bit broken on the assign
* contig check
* only buffer
* need that order
* idx
* dedup buffers
* hmm, bugfix
* fix tensor cores
* opts device
* Symbolic Shape JIT
update tests
two-variable symbolic ops, adding more tests
test passing
cleanup
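A hedged sketch of what the Symbolic Shape JIT enables, modeled on the era's test suite: bind a `Variable` into a tensor's shape so one jitted kernel serves every runtime value of that dimension (exact APIs like `Variable.bind` and the `tinygrad.jit.TinyJit` import path are assumptions about this snapshot of the codebase):

```python
# Sketch: jitting across a symbolic dimension. The reshape carries a bound
# Variable, so the compiled kernel is reused for each concrete size of i.
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit
from tinygrad.shape.symbolic import Variable

@TinyJit
def matvec(w:Tensor, x:Tensor) -> Tensor:
  return (x @ w).realize()

w = Tensor.rand(8, 4)
for i in range(2, 6):
  vi = Variable("i", 1, 8).bind(i)        # symbolic dim, bound to a value
  x = Tensor.rand(i, 8).reshape(vi, 8)    # shape now contains the Variable
  print(matvec(w, x).shape)               # one kernel, many sizes
```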
* more test cases
* single flag
* review update
* jit attention one piece
* realize
* symbolic_jit test for cuda
* old artifact
* works with cuda gpu but failed CI
* CUDACPU
* new version
* fix abstractions
* try remove test
* Revert "try remove test"
This reverts commit 2fc18a9f8e.
* assert_allclose
* minimize the test
* minimize the test
* minimize the test
* minimize the test
* Revert "minimize the test"
This reverts commit e0c0929596.
* Revert "minimize the test"
This reverts commit 88240551b1.
* Revert "minimize the test"
This reverts commit 78328a7ce2.
* Revert "minimize the test"
This reverts commit 989523fded.
* skip test inside body
* oops
* oops
* Rename FusedOps to TernaryOps
* Support ternary broadcast
* Add where llop and mlop
* Make where op work in cstyle codegen
* Don't skip test_inf_where
* Add backward path to where op
* Use bool in cstyle codegen
* Add LLVM where op
* Add numpy where op
* Add torch where op
* Simplify where mlop
* Update documentation
* Forgot a rename
* Merged relevant changes from PR #1195 onto PR #1196
* Add test to cover changes to linearizer.ast_parse for WHERE op
Without this, METAL will try to use the ternary op on float4 and fail
* Make where op work in wgsl backend
* Allow ternary ops to be merged
* Make mypy happy
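To make the where backward path concrete: the condition receives no gradient, and the upstream gradient is routed to whichever branch each element selected. A numpy stand-in for the mlop described above (function names are illustrative):

```python
# Sketch of where's forward/backward. Forward selects elementwise; backward
# sends grad to x where cond held and to y elsewhere, nothing to cond.
import numpy as np

def where_forward(cond, x, y):
  return np.where(cond, x, y)

def where_backward(cond, grad):
  # gradients for (cond, x, y)
  return None, np.where(cond, grad, 0.0), np.where(cond, 0.0, grad)
```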
---------
Co-authored-by: Francis Lam <flam@alum.mit.edu>