* test speed llama
* oops, put it back
* uses the real device codegen
* just do it on the mac
* pp
* is faster?
* Revert "is faster?"
This reverts commit 42db542010.
* disable docker again for less load on CI
* matrix strategy
* push env to GITHUB_ENV
* use printf instead of echo
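The `GITHUB_ENV` mechanism referenced above is GitHub Actions' supported way to pass variables between steps: any `NAME=value` line appended to the file that `$GITHUB_ENV` points at becomes part of the environment for every later step (the workflow does this in shell, where `printf` handles newlines and escapes more portably than `echo` across runners). A minimal sketch of the same idea from Python; the helper name is illustrative:

```python
import os

def export_to_later_steps(name: str, value: str) -> None:
    # GITHUB_ENV names a file; each "NAME=value" line appended to it is
    # injected into the environment of subsequent workflow steps.
    env_file = os.environ.get("GITHUB_ENV")
    if env_file is None:
        os.environ[name] = value  # not on CI: only affects this process
        return
    with open(env_file, "a") as f:
        f.write(f"{name}={value}\n")
```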
* use temp helper function for cross os paths
* use path join
* switched to using temp helper function
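The cross-OS path commits above replace hard-coded `/tmp/...` strings with a helper that joins against the platform's temp directory, so the same test paths resolve on Linux, macOS, and Windows runners. A sketch of such a helper (the name `temp` is an assumption, not necessarily the repo's):

```python
import os, tempfile

def temp(name: str) -> str:
    # tempfile.gettempdir() is /tmp on Linux but something like
    # C:\Users\...\AppData\Local\Temp on Windows; os.path.join
    # picks the right separator for the host OS.
    return os.path.join(tempfile.gettempdir(), name)

weights_path = temp("llama_weights.bin")  # resolves on any runner
```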
* skip test on windows due to memory limit
* small fix
* removed semi
* touchups
* clean up
* separate tests
* test changes to test_utils on windows
* small refactor
* more cleanups
* undo helpers change
* only skip if in CI and WINDOWS
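Skipping only under CI-and-Windows keeps the test runnable on a local Windows box while dodging the memory cap on GitHub's Windows runners. A sketch with `unittest` (class and test names are hypothetical; detecting CI via the `CI` env var is an assumption):

```python
import os, platform, unittest

CI = os.getenv("CI", "") != ""

class TestBigModel(unittest.TestCase):
    @unittest.skipIf(CI and platform.system() == "Windows",
                     "Windows CI runners hit the memory limit on this test")
    def test_large_alloc(self):
        ...  # allocation-heavy test body
```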
* safetensors test
* safe_save
* load back with real safetensors
* bugfix in device name. add simple torch_load
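The round-trip check above validates the `safe_save` output by loading it back with the reference `safetensors` package. A sketch of that kind of check using the real library's numpy interface (the tensor names and file path are illustrative):

```python
import numpy as np
from safetensors.numpy import save_file, load_file

tensors = {"weight": np.arange(12, dtype=np.float32).reshape(3, 4)}
save_file(tensors, "model.safetensors")   # write in safetensors format

loaded = load_file("model.safetensors")   # read back with the reference reader
assert np.array_equal(loaded["weight"], tensors["weight"])
```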
* it works for llama, but it's slower...
* mmap
* no intermediate
* load mmaped
* readinto speed
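The loading commits above are all about cutting copies out of the weight-load path: `mmap` gives a zero-copy view the OS pages in on demand, and `readinto` fills a preallocated buffer without the intermediate `bytes` object that `f.read()` would create. A sketch of both (file name and dtype are illustrative):

```python
import mmap, os
import numpy as np

# Zero-copy: numpy reads straight out of the mapping, no staging buffer.
with open("model.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
weights = np.frombuffer(mm, dtype=np.float32)

# readinto: one allocation up front, filled in place.
buf = bytearray(os.path.getsize("model.bin"))
with open("model.bin", "rb") as f:
    f.readinto(buf)
```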
* not ready yet
* revert that
* optimizations in symbolic.py
* fix infinite recursion when expanding sums
* add test case to make sure NumNodes are hoisted up in cases where MulNodes cancel each other out
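The hoisting property in that test: when multiplied terms cancel, the surviving constant should come back as a bare NumNode at the top of the expression rather than staying buried inside a sum. A self-contained toy of the property (not tinygrad's actual Node classes):

```python
from collections import Counter

def simplify_sum(terms):
    # terms: (varname, coeff) pairs, with "" marking constants.
    # 2*x and -2*x cancel; if no variable term survives, the sum
    # collapses to its constant, i.e. the NumNode gets hoisted up.
    coeffs = Counter()
    for name, c in terms:
        coeffs[name] += c
    const = coeffs.pop("", 0)
    live = {n: c for n, c in coeffs.items() if c != 0}
    return const if not live else (live, const)

# 2*x + 4 + (-2)*x  ->  4, a plain constant
assert simplify_sum([("x", 2), ("", 4), ("x", -2)]) == 4
```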
* no zeroview start
* closer
* stride mask
* st tests pass, delete ZeroView
* byebye zv
* close to working
* not contiguous with mask
* subtract, don't add
* mask on view
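The run of commits from "no zeroview start" down to here replaces ZeroView, which modeled padding as a separate shapetracker node, with a per-dimension mask carried on the View itself: an index is valid only if every coordinate lands inside its mask range, so padded regions are simply invalid reads. A toy sketch of the idea (not the actual View class):

```python
def index(coords, strides, offset):
    # Flat buffer offset for a strided view.
    return offset + sum(c * s for c, s in zip(coords, strides))

def valid(coords, mask):
    # mask holds one half-open (lo, hi) range per dimension;
    # anything outside a range addresses padding, not data.
    return all(lo <= c < hi for c, (lo, hi) in zip(coords, mask))

# A 1x4 row padded to 1x6, one column of padding on each side.
mask = [(0, 1), (1, 5)]
assert not valid((0, 0), mask)  # left padding
assert valid((0, 1), mask)      # first real element
assert not valid((0, 5), mask)  # right padding
```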
* ugh, that shouldn't have been in there
* shape merge
* bugfixes
* fuzzer + 4 fuzzer failures
* fuzzer for symbolic
* more fuzzing and nothing
* that fuzzer doesn't hit either
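The fuzzers above are differential tests: generate random expressions, push them through the simplifier, and check the simplified form still evaluates to the same value as the original at random points. A self-contained toy of the technique with a deliberately tiny simplifier (the real fuzzer targets tinygrad's own Node types):

```python
import random

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b,
       "//": lambda a, b: a // b, "%": lambda a, b: a % b}

def ev(e, x):
    if e == "x": return x
    if isinstance(e, int): return e
    op, a, b = e
    return OPS[op](ev(a, x), ev(b, x))

def simp(e):
    if e == "x" or isinstance(e, int): return e
    op, a, b = e[0], simp(e[1]), simp(e[2])
    if isinstance(a, int) and isinstance(b, int): return OPS[op](a, b)
    if op == "*" and b == 0: return 0
    if op in ("*", "//") and b == 1: return a
    if op == "+" and b == 0: return a
    return (op, a, b)

def rand_expr(depth):
    if depth == 0: return random.choice(["x", random.randint(0, 4)])
    return (random.choice(list(OPS)), rand_expr(depth - 1), rand_expr(depth - 1))

for _ in range(10000):
    e, x = rand_expr(3), random.randint(0, 16)
    try:
        ref = ev(e, x)           # reference: evaluate the raw expression
    except ZeroDivisionError:
        continue                 # expression divides by zero somewhere
    assert ev(simp(e), x) == ref, (e, x)  # simplified form must agree
```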
* fixes padding...ugh
* no more offsets
* working
* rewrite load and store
* all checks
* fix idxs
* progress
* bugfix
* float4_axis
* works
* cleanups
* complex valids_okay
* fix binop, other tests failure
* that was a bad idea
* better layernorm
* inference kernel count tests
* new style reshape pushing
* fixup replacement
* 199 kernels is okay. fix flops
* push reshape through unaryops only
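Reshape commutes with elementwise unary ops, since a unary op touches each element independently of layout; pushing the reshape underneath lets adjacent movement ops merge in the shapetracker. A quick numpy check of the property:

```python
import numpy as np

x = np.random.rand(4, 6).astype(np.float32)

a = np.exp(x).reshape(8, 3)   # reshape above the unary op
b = np.exp(x.reshape(8, 3))   # reshape pushed below it
assert np.allclose(a, b)      # elementwise ops don't care about layout
```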
* GRAPH=2 draws the phantom ops
* found resnet issue
* non working test
* mul is cheaper than div
* OPT inflation
* SHUFFLE_PAD_OPS in OPT=2
* linearizer outputs something
* working-ish
* cstyle codegen
* clang mostly works
* fix load valid
* fix numberless loop
* fancy gen
* working
* fix enet compiler
* cleanups
* float4 upcasting
* less lines
* supports_float4
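`supports_float4` gates the upcast per backend: when the innermost axis is unit-stride and the backend can do 4-wide loads, codegen emits one vector load instead of four scalar ones. A sketch of the shape of that decision (not the actual cstyle renderer):

```python
def render_load(buf: str, idx: str, stride: int, supports_float4: bool) -> str:
    # 4-wide vector load only when the data is contiguous and the
    # backend supports it; otherwise fall back to a scalar load.
    if supports_float4 and stride == 1:
        return f"vload4(0, {buf}+{idx})"   # OpenCL-style float4 load
    return f"{buf}[{idx}]"
```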
* constant folding
* mulacc
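mulacc fuses an elementwise multiply that feeds a sum-reduce into a single multiply-accumulate, the inner loop of every matmul and conv, so the product never gets materialized. A toy pattern match over a tuple AST (op names are illustrative):

```python
def fuse_mulacc(ast):
    # SUM(MUL(a, b)) -> MULACC(a, b): accumulate products directly
    # instead of writing out the elementwise product first.
    if ast[0] == "SUM" and isinstance(ast[1], tuple) and ast[1][0] == "MUL":
        _, a, b = ast[1]
        return ("MULACC", a, b)
    return ast

assert fuse_mulacc(("SUM", ("MUL", "x", "w"))) == ("MULACC", "x", "w")
```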
* internet tests flaky in CI
* 90% image support
* fix image generic
* bugs exposed with shapetracker and single view
* new llvm
* use vload, remove OLD
* that's really poorly done
* ending up being more lines
* runs one metal kernel
* conv2d works
* ops tests are passing
* const folding
* all ops work
* pre commit always passes
* torch works
* working still
* fix graph test
* tests passing
* image almost works
* image conv works
* most images
* fix custom
* fix assignment
* fix compile enet
* clean up comments
* fix realize return value
* include shapetracker in LB repr
* copy should make a copy
* reenable method cache
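The method cache memoizes the expensive codegen-plus-compile step, keyed on a canonical rendering of the kernel's op AST, so a graph shape that has been realized once never recompiles. A toy of the mechanism:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_kernel(ast_key: str) -> str:
    # Stand-in for codegen + backend compilation, the step the
    # cache exists to skip when the same kernel comes up again.
    return f"compiled<{ast_key}>"

# Identical AST rendering -> cache hit, same compiled object back.
assert compile_kernel("SUM(MUL(x,w))") is compile_kernel("SUM(MUL(x,w))")
```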
* fix lna
* dtypes in graph
* forward only for IMAGE=2
* simple realize
* getting close
* fixup new api, it's good except the kernel count
* back to 197 kernels
* tests should pass
* go to a real float
* no type_on_cpu
* fix the docs
* put shapetracker back in its proper place
* add dtype class
* dtypes
* buffers are lazy
* dtype is tracked by lazybuffer and GenericShape
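The dtype work makes element type a first-class value carried on every lazybuffer (and on GenericShape for flop counting) instead of assuming float32 everywhere. A sketch of the shape of such a class; the exact fields here are an assumption:

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class DType:
    itemsize: int   # bytes per element
    name: str       # name used by codegen, e.g. "float", "half"
    np: type        # matching numpy scalar type

float32 = DType(4, "float", np.float32)
float16 = DType(2, "half", np.float16)

@dataclass
class LazyBuffer:
    shape: tuple
    dtype: DType    # tracked from creation all the way to realize
```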
* fix types in llvm
* llvm store
* dtype tests
* fix tests maybe
* fix flop counter
* fix CI
* CI fix and check format
* fix dtype and dtype check
* fix custom test
* fix test graph
* Refactor contraction and add unit tests
* Fix typo; Fix TestConv.test_elu failure due to some ones in old_shape
* Add push permute test cases
* Fix mypy type annotation check error
* Add contraction unit test; Reshape to higher dimension is not contraction
* behavior is correct without VALIDHACKS
* simple div and mod
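The simple div and mod rules rest on one identity: for non-negative x, constant d > 0, and 0 <= r < d, `(x*d + r) // d == x` and `(x*d + r) % d == r`, which is exactly what collapses strided index expressions. A brute-force check of the identity:

```python
# (x*d + r) // d == x  and  (x*d + r) % d == r  whenever 0 <= r < d.
for x in range(50):
    for d in range(1, 10):
        for r in range(d):
            assert (x*d + r) // d == x
            assert (x*d + r) % d == r
```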
* fix tests
* no negative variables
* alt form is correct
* still correct
* bug in mulnode
* at least validhacks works now
* cleanups
* test validhacks, and to_image_idx
* cache compare key
* tests and __neg__
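Node equality and hashing run constantly during simplification, so the rendered string they compare on is computed once and cached; `__neg__` then falls out as multiplication by -1. A toy sketch of both (not the real Node class):

```python
class Node:
    def __init__(self, expr: str):
        self.expr = expr
        self._key = None

    @property
    def key(self) -> str:
        # Render once and reuse: this string is the compare key for
        # the very hot __eq__/__hash__ path.
        if self._key is None:
            self._key = self.render()
        return self._key

    def render(self) -> str:
        return self.expr  # real nodes render recursively from children

    def __eq__(self, other):
        return isinstance(other, Node) and self.key == other.key

    def __hash__(self):
        return hash(self.key)

    def __mul__(self, b: int):
        return Node(f"({self.expr}*{b})")

    def __neg__(self):
        return self * -1  # negation is just multiplication by -1

assert (-Node("x")).key == "(x*-1)"
```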