* assert jitted times in openpilot
* better error
* better error
* add ASSERT_MIN_STEP_TIME to more models
* t is step_times
* update benchmark times
* update times
* make POSTOPT=2 the default
* more matching tc
* fix winograd
* fix that test
* add matvec to Scheduler
* flip tc sort order
* similar speed
* fix beam on image
* disable slow tests
* slow
* move device tests to test/device
* test speedups
* test device
* linalg to unit
* upd
* so pytest just works
* more divide and skip
* speed
* test devectorize
* add pillow
* BOOM
* cache extra/huggingface/models/
* why max buffer size is not 0
* override MAX_BUFFER_SIZE
* less models
* remove more models and change cache dir to already cached dir
* only metal
* less is more?
* remove check ops
* why is this not setting the ENVVAR
* ughhhhh just test in models
* only cpu and gpu
* only cpu actually
* just override it idk
* final
* move extra dependencies up top
* simplification
* fix print
* make README better
* revert ops_disk fix for now
* clean up test_onnx
* remove testing fashion clip model cuz sloooowwwwww
* actually let METAL run this
* fix comment mistake
* fix download path in run_models
* does this work?
* cleanup setup and teardown
* contextvar like this?
* prove model is cached
* do I need to increment DOWNLOAD_CACHE_VERSION?
* see if cached with incremented DOWNLOAD_CACHE_VERSION
* use warnings to see if the model exists
* revert DOWNLOAD_CACHE_VERSION stuff and clean up
* add retry to download
* nit
* Add mmapeak implementation for 7900 XTX
* Change indentation
* Use a template instead of multiple assembly files
* Fix output formatting
* Reduce register file bank conflicts
* More accurate measurement for quick instructions
* Add support for gfx1201
* RDNA4 wmma requires fewer VGPRs
* RDNA4 does not have s_cmpk instructions
* Add v_wmma_i32_16x16x32_iu4 for gfx1201
* Add sparse wmma instructions
* split to tinybox red MLPerf Benchmark
---------
Co-authored-by: Panagiotis Kourouklidis <panagiotis.kourouklidis@gmail.com>
hlb cifar is fast so added it, can add bert too if you think it's ok
6 real gpus to test multigraph and transfers + accuracy validation
should probably be added to tinystats too, i don't know how though
Co-authored-by: chenyu <chenyu@fastmail.com>
* add MobileNetV2 to comma CI
* symlink imagenet
* also the signature
* comment that out
* need imagenetmock
* same train and test set
* quantize on CPU=1
* verbose
* need __hexagon_divsf3
* 0x858d6c15
* quant cpu + CC=clang-19
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
* set pad to 3 for amd padded tc test
* change pad for amd regardless of CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup