* rtoposort is fast, can replace rangeify with this
* fast rangeify
* work
* fast rangeify works for mnist
* should work
* progress
* pad fix
* FAST
* tests passing
* don't delete those shape ops
* put in rangeify map
* ending ranges fix
* tests
* mstack/mselect no hacks
* move to indexing.py
* touch up tests + add comments
* disable failing test
* actually make the file readable
* failing
* error
* add ordering
* fix some tests
* fix more tests
* shorten comment
* update test
* add rule and test
* add rule and test
* remove check
* use fold_divmod_congruence instead of simplify
* adjust tests
* shorten line
* new algo
* add test
* add function to un-nest the div
* add UOp.factor
* test UOp.factor
* uop_given_valid tries to factor simplex expression
* shorten line
* symbolic_flat is back
* change that back
* fix those new tests
* new rule for ordering
* factor multiple factors
* no symbolic_flat
* symbolic_flat to there
* move that back
* fix imports
* merge correctly
* linter happy
* add rule
* add a test
* cleanup
* revert that for now
* UOp.factor returns self instead of None
* try all_candidates
* remove or_else
* post index symbolic
* add test
* make this closer to the original
* increase mac hlb_cifar min step time
* add some ordering tests
* cleanup
* increase pytest timeout time
* check dtype
* assert jitted times in openpilot
* better error
* better error
* add ASSERT_MIN_STEP_TIME to more models
* t is step_times
* update benchmark times
* update times
* make POSTOPT=2 the default
* more matching tc
* fix winograd
* fix that test
* add matvec to Scheduler
* flip tc sort order
* similar speed
* fix beam on image
* disable slow tests
* slow
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* BOOM
* cache extra/huggingface/models/
* why max buffer size is not 0
* override MAX_BUFFER_SIZE
* less models
* remove more models and change cache dir to already cached dir
* only metal
* less is more?
* remove check ops
* why is this not setting the ENVVAR
* ughhhhh just test in models
* only cpu and gpu
* only cpu actually
* just override it idk
* final
* move extra dependencies up top
* simplification
* fix print
* make README better
* revert ops_disk fix for now
* clean up test_onnx
* remove testing fashion clip model cuz sloooowwwwww
* actually let METAL run this
* fix comment mistake
* fix download path in run_models
* does this work?
* cleanup setup and teardown
* contextvar like this?
* prove model is cached
* do I need to increment DOWNLOAD_CACHE_VERSION?
* see if cached with incremented DOWNLOAD_CACHE_VERSION
* use warnings to see if the model exists
* revert DOWNLOAD_CACHE_VERSION stuff and clean up
* add retry to download
* nit
* Add mmapeak implementation for 7900 XTX
* Change indentation
* Use a template instead of multiple assembly files
* Fix output formatting
* Reduce register file bank conflicts
* More accurate measurement for quick instructions
* Add support for gfx1201
* RDNA4 wmma requires fewer VGPRs
* RDNA4 does not have s_cmpk instructions
* Add v_wmma_i32_16x16x32_iu4 for gfx1201
* Add sparse wmma instructions
* split to tinybox red MLPerf Benchmark
---------
Co-authored-by: Panagiotis Kourouklidis <panagiotis.kourouklidis@gmail.com>
hlb cifar is fast so added it, can add bert too if you think it's ok
6 real gpus to test multigraph and transfers + accuracy validation
should probably be added to tinystats too, i don't know how though
Co-authored-by: chenyu <chenyu@fastmail.com>
* add MobileNetV2 to comma CI
* symlink imagenet
* also the signature
* comment that out
* need imagenetmock
* same train and test set
* quantize on CPU=1
* verbose
* need __hexagon_divsf3
* 0x858d6c15
* quant cpu + CC=clang-19