* bottom up linearizer
* late stores
* more complete
* remove broken heuristic
* upcast size
* opt
* more conservative
* it needs that
* disable opencl half on QCOM
* fix
* make that a real test
* cpu test okay
* ptx skip
* end is after the range
* add quantize fp8 in llama3
* don't truncate fp8 alu result
* cast to float32 before matmul
* --model weights/LLaMA-3/8B-SF-DPO/
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
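The fp8 commits above touch three related pitfalls: scaling weights into the e4m3 range when quantizing, not truncating intermediate ALU results back down to fp8, and casting to float32 before the matmul. A minimal NumPy sketch of that flow, assuming per-tensor scaling; `quantize_fp8` / `matmul_fp8` are illustrative names and float16 stands in for real fp8 storage, so this is not the llama3 script's actual code:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_fp8(w: np.ndarray):
    """Per-tensor scale into the e4m3 range (float16 stands in for fp8 storage)."""
    scale = float(np.abs(w).max()) / FP8_E4M3_MAX
    q = (w / scale).astype(np.float16)  # the lossy quantization step
    return q, scale

def matmul_fp8(x: np.ndarray, q: np.ndarray, scale: float) -> np.ndarray:
    # cast to float32 *before* the matmul; the accumulated ALU result stays
    # in float32 and is never truncated back to the low-precision type
    return x.astype(np.float32) @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64), dtype=np.float32)
x = rng.standard_normal((4, 64), dtype=np.float32)
q, s = quantize_fp8(w)
assert np.allclose(x @ w, matmul_fp8(x, q, s), atol=1e-1)
```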
* add 0.10.0 to comma benchmark
Disabled the 0.10.1 ones, which are pinned to master; they do not work because the benchmark uses the cached old version
* that's pinned
* late ifs try 2
* fix image
* fix that test
* panic
* ptx fixups
* preserve toposort
* those pass locally
* Revert "those pass locally"
This reverts commit 063409f828.
* no ls
* make that explicit
* Simpler compile3
* tests
* remove default args
* onnx file is still fp16
* self-test FP16 too
* allow test disable
* absurd tolerance
* Just do latest
* Try simplest
* use later models
* kernel count not relevant if speed is good
* dead imports
* Revert "dead imports"
This reverts commit f68c2cd15d.
* Revert "kernel count not relevant if speed is good"
This reverts commit 0955ca4ee0.
* add back kernel count check on latest model
* rtoposort is fast, can replace rangeify with this
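As context for the rtoposort commit above: a reverse topological sort over an acyclic UOp-style graph can be done with an iterative post-order DFS, which keeps it linear-time and recursion-free. A generic sketch only; the `Node` class and the dependencies-before-consumers ordering are assumptions here, not tinygrad's UOp API:

```python
from dataclasses import dataclass

@dataclass(eq=False)  # identity-based equality, so look-alike nodes stay distinct
class Node:
    op: str
    srcs: tuple = ()

def rtoposort(root: Node) -> list:
    """Iterative post-order DFS over an acyclic graph: every node is emitted
    only after all of its sources."""
    order, visited = [], set()
    stack = [(root, iter(root.srcs))]
    while stack:
        node, it = stack[-1]
        child = next(it, None)
        if child is None:            # all sources handled, emit this node
            stack.pop()
            visited.add(node)
            order.append(node)
        elif child not in visited:   # descend into an unvisited source
            stack.append((child, iter(child.srcs)))
    return order

a, b = Node("CONST"), Node("CONST")
mul = Node("MUL", (a, b))
out = Node("STORE", (mul,))
assert rtoposort(out) == [a, b, mul, out]
```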
* fast rangeify
* work
* fast rangeify works for mnist
* should work
* progress
* pad fix
* FAST
* tests passing
* don't delete those shape ops
* put in rangeify map
* ending ranges fix
* tests
* mstack/mselect no hacks
* move to indexing.py
* touch up tests + add comments
* disable failing test
* actually make the file readable
* failing
* error
* add ordering
* fix some tests
* fix more tests
* shorten comment
* update test
* add rule and test
* add rule and test
* remove check
* use fold_divmod_congruence instead of simplify
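The `fold_divmod_congruence` switch above rests on the integer congruences that let multiples of the divisor be folded out of a division or modulo. A quick property check of those identities under Python's floor-division semantics (the actual rewrite rule in the codebase presumably carries extra validity conditions, e.g. sign or range constraints on the operands):

```python
from itertools import product

# folding a multiple of k out of a div/mod:
#   (a*k + r) // k == a + r // k
#   (a*k + r) %  k == r % k
for a, r, k in product(range(-4, 5), range(-9, 10), [1, 2, 3, 5, 8]):
    assert (a * k + r) // k == a + r // k
    assert (a * k + r) % k == r % k
```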
* adjust tests
* shorten line
* new algo
* add test
* add function to un-nest the div
* add UOp.factor
* test UOp.factor
* uop_given_valid tries to factor simple expressions
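The `UOp.factor` / un-nest-the-div commits above revolve around pulling a shared multiplier out of a sum of terms so that an enclosing division collapses, e.g. `(x*4 + y*8) // 4 -> x + y*2`. A standalone toy sketch of that factoring step over (coefficient, symbol) pairs; the representation and the name `factor_sum` are made up for illustration and are not the UOp API:

```python
def factor_sum(terms, divisor):
    """terms is a list of (coefficient, symbol) pairs standing for sum(c*s).
    If every coefficient is divisible by `divisor`, return the terms of
    (sum) // divisor; otherwise return None (no clean factoring)."""
    if divisor == 0 or any(c % divisor for c, _ in terms):
        return None
    return [(c // divisor, s) for c, s in terms]

# (x*4 + y*8) // 4  ->  x*1 + y*2
assert factor_sum([(4, "x"), (8, "y")], 4) == [(1, "x"), (2, "y")]
# (x*3 + y*8) // 4 has no common factor of 4, so it is left as-is
assert factor_sum([(3, "x"), (8, "y")], 4) is None
```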
* shorten line
* symbolic_flat is back
* change that back
* fix those new tests
* new rule for ordering
* factor multiple factors
* no symbolic_flat
* symbolic_flat to there
* move that back
* fix imports
* merge correctly
* linter happy
* add rule
* add a test
* cleanup
* revert that for now
* UOp.factor returns self instead of None
* try all_candidates
* remove or_else
* post index symbolic
* add test
* make this closer to the original
* increase mac hlb_cifar min step time
* add some ordering tests
* cleanup
* increase pytest timeout time
* check dtype