* init
* removed mulacc
* is uoptimize the problem?
* lol hax make work temporarily fix l8er
* revert extra/ changes
* clean up
* flaky metal tests?
* add back mulacc for metal
* revert last commit
* try skipping linearizer_failure tests
* skip flammit tests... cuz tests all work locally
* try narrow down exact linearizer failure test
* try 2
* try 4
* generated code is the exact same wtf why CI fails
* code for 15 and 17 are exact same with or without mulacc, this should pass
* try only 1 failure
* try garbage collecting lol...
* try del variables lol
* try gcing after del lol...
* is diskcache the problem???
* try disabling opts cache idk
* try remove hack
* try disable github metal cache...
* try CACHELEVEL=0 :D idk anymore
* try increase newCommandQueueWithMaxCommandBufferCount_, im almost out of ideas...
* revert
* actually not a HACK
* oops
* remove cpu and torch backends
* don't copy to cpu
* use clang instead of cpu
* multitensor gathers on the first device
* clang is cpu + use default
* fixup
* bugfix
* lazy rewrite, try 2
* min fix tests
* pass contig test
* put broken pads back
* move that to realize
* no contig child fixes array packing
* so wrong
* now that's correct
* base children
* fix bind issues
* disable to_image_idx
* fix tests
* that failure shouldn't break other tests
* more fixes
* fix torch
* skip failing tests in CI
* 1e-7
* half is broken
* 1e-6 margin of error
* handle reshape of contiguous subparts with explicit mask
* remove the add/remove ones logic in reshape
* accomodate ones in accumulate logic
* make multiply commutative
* fix linting
* make mypy happy
* add test for commutative mul
* merge dimensions in shape_strides for 1 range masks
* add offsets for merging
* fix linting
* add back explicit 1 reshapes
* fix mypy errors
* fix accumulate by includng state
* include non-zero stride dimension in acc
* small cleanup
* more compact to_shape_strides
* more logical cleanup
* compress more
* compress reshape mask
* adding some comments
* small bug fix
* improve test coverage
* remove explicit add remove ones
* small bug in test
* enable test_reshape_splitting_combining
* small fix
* 10 lines less to_shape_strides
* shorten reshape mask
* some more cleanup
* more cleanup
* introduce some symbols for compactness
* more symbols
* more cleaner
* lessen symbols, it became less readable
* remove merge_views from view.reshape
* change to_shape_strides to _merge_dims
* improve readability
* fix corner case
* cleanup
* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)
* rewrite _reshape_mask for readability
* fix white space
* add comment
* nice shorthands for readability
* add proof in docs
* small nit
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* cpu tests pass
* torch works
* works
* metal works
* fix ops_disk
* metal jit works
* fix openpilot
* llvm and clang work
* fix webgpu
* docs are rly broken
* LRU works on metal
* delete comment
* revert name to ._buf. LRU only on Compiled
* changes
* allocator
* allocator, getting closer
* lru alloc
* LRUAllocator
* all pass
* metal
* cuda
* test examples
* linearizer
* test fixes
* fix custom + clean realize
* fix hip
* skip tests
* fix tests
* fix size=0
* fix MOCKHIP
* fix thneed
* copy better
* simple
* old style metal copy
* fix thneed
* np reshape
* give cuda a device
* var_vals are global
* working with global ish
* better
* fix export model
* fix tests
* better kv cache
* does it run?
* use where for kvmask
* fix excessive var_vals
* fix import
* how does multigpu use this?
* llama kinda work
* faster and simpler
* cleanup
* fix conversation mode
* test cleanups
* fix one more test
* test cleanup
---------
Co-authored-by: George Hotz <geohot@gmail.com>
* move metal+clang to compile api
* all to the new style
* remove binary arg
* fix triton
* fixup tests
* fix clang
* diskcache is generic
* __wrapped__
* compile_gpu
* fix thneed
* keep the src in the ASTRunner
* lib
* move compile_gpu
* compile_gpu in device
* put compiler in astrunner
* test reverts
* triton compiler
* ugh, that too
* create cache for q learning
* make linter happy
* global beam
* where it belongs
* bugfix
* ditch the kopt, use the beam
* faster lin and DEBUG=2 okay
* remove kopt, move search to features