280 Commits

Author SHA1 Message Date
nimlgen
4b4ba5454c ci: move driver start higher (#11431) 2025-07-30 10:48:38 +03:00
chenyu
204da24cfc increase driverbenchmark timeout-minutes to 15 (#11428) 2025-07-29 19:45:05 -04:00
nimlgen
c88e401d0e ci: fix typos in h machine benchmarks (#11423) 2025-07-29 22:11:47 +03:00
George Hotz
1f1f99c287 hotfix: add DEBUG=3 to driver CI 2025-07-29 11:03:47 -07:00
nimlgen
d38d285489 ci: add h machines (#11416)
* ci: add h machines

* more

* fix names

* names not collide

* 20

* 10
2025-07-29 19:21:51 +03:00
chenyu
2b48b961be fix a few broken AMX tests (#11204) 2025-07-12 21:42:38 -04:00
George Hotz
0597735f28 remove TC=3 not porting this (#11045) 2025-06-30 15:12:49 -07:00
chenyu
126fcf4129 clean up AMD_LLVM in tests (#11021) 2025-06-28 22:45:47 -04:00
chenyu
d71bb6a7b2 remove comma 0.9.4 from benchmark (#10867) 2025-06-18 12:43:59 -04:00
chenyu
4f535641f7 add one huggingface_onnx test to mac benchmark ci (#10700)
this crashed for me on onnx parser pr but seems fine for the author. see if ci mac is fine
2025-06-08 12:26:12 -04:00
wozeparrot
37e1ef1be3 feat: cleanup old AM processes (#10653) 2025-06-05 15:41:00 -07:00
wozeparrot
5e3c4a8431 fix: comma testsig (#10568) 2025-05-29 19:00:07 -07:00
George Hotz
6b8eb5fec2 split mlperf to its own red benchmark run (#10492)
* Add mmapeak implementation for 7900 XTX

* Change indentation

* Use a template instead of multiple assembly files

* Fix output formatting

* Reduce register file bank conflicts

* More accurate measurement for quick instructions

* Add support for gfx1201

* RDNA4 wmma requires fewer VGPRs

* RDNA4 does not have s_cmpk instructions

* Add v_wmma_i32_16x16x32_iu4 for gfx1201

* Add sparse wmma instructions

* split to tinybox red MLPerf Benchmark

---------

Co-authored-by: Panagiotis Kourouklidis <panagiotis.kourouklidis@gmail.com>
2025-05-23 17:12:41 -07:00
uuuvn
3ca5680920 Test remote in benchmark (#10304)
hlb cifar is fast so added it, can add bert too if you think it's ok

6 real gpus to test multigraph and transfers + accuracy validation

should probably be added to tinystats too, i don't know how though

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-23 12:12:57 -04:00
qazal
90eb3c0e5d add MobileNetV2 benchmark to comma CI (#10250)
* add MobileNetV2 to comma CI

* symlink imagenet

* also the signature

* comment that out

* need imagenetmock

* same train and test set

* quantize on CPU=1

* verbose

* need __hexagon_divsf3

* 0x858d6c15

* quant cpu + CC=clang-19
2025-05-19 18:22:50 +03:00
George Hotz
b06291077c no amdgpu kernel driver (#10408)
* no amdgpu kernel driver

* don't test hip

* lower req
2025-05-18 20:52:39 -07:00
wozeparrot
1ed04f993b move benchmark stat tracking to influxdb (#10185) 2025-05-15 16:14:56 -07:00
Ignacio Sica
47b3055fe2 set fail-fast behavior (#10336) 2025-05-15 11:24:45 -07:00
George Hotz
7a3d4de59a hotfix: add GRAPH_ONE_KERNEL=1 to UsbGPU openpilot test 2025-05-14 14:50:37 -07:00
George Hotz
f1130ab3d3 openpilot benchmark test (#10290)
* openpilot benchmark test

* that
2025-05-13 22:49:28 -07:00
chenyu
ad5cb2717d FUSE_ARANGE=1 in bert bench (#10263)
still fails, something multi related maybe

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-13 09:12:19 -04:00
chenyu
0015b3921f sleep more in CI Remove amdgpu (#10261)
see if this is less flaky
2025-05-12 08:13:44 -04:00
nimlgen
7d6ed1b1e9 hotfix: mac ci (#10210)
* fixed?

* cmnt
2025-05-08 14:13:23 +03:00
nimlgen
ba52fce4b2 usbgpu: benchmark in ci (#10208)
* usbgpu: benchmark

* usbgpu: benchmark
2025-05-08 12:02:04 +03:00
Ignacio Sica
bf5fb97498 fix AMD_LLVM bf16 tc for gfx1100 (#10102)
* fix amd_llvm bf16 tc

* cleanup pattern
2025-04-30 20:06:38 -03:00
chenyu
4a04098389 fix llama3 with nf4 quantize (#10107)
also int8 outputs is wrong
2025-04-29 15:14:36 -04:00
Ignacio Sica
9d5677c12c fix ptx linearizer bug 2 [pr] (#9967)
* check for local buffer

* hotfix

* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
Ignacio Sica
58cf8cd493 add support for "shared_mem" for LLVM (#10093)
* init llvm shared

* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
chenyu
e996584685 olmoe in mac benchmark (#10077) 2025-04-27 21:07:02 -04:00
George Hotz
b6d2effaf5 assign is contiguous (#10066)
* assign is contiguous

* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
Ignacio Sica
023b1c28a2 test_tensor_cores_padded refactor (#9724)
* set pad to 3 for amd padded tc test

* change pad for amd regardless CI

* test tc padded uops and correctness separately

* add test_tensor_cores_padded_uops test to ci

* remove redundant check for amd device

* cleanup
2025-04-18 17:05:54 -03:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero centered the random input and update atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
Ignacio Sica
58785181a8 AMD bf16xf32 TC (#9717)
* dont test bf16 for emulated amd tc

* skip bf16 tc test in ci

* skip bf16 for AMD in test_tensor_cores_codegen

* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
1d25844d44 Revert "disable CI red llama 3 4 gpu beam (#9690)" (#9709)
This reverts commit 6a5eacba8b.
2025-04-03 02:34:39 -04:00
chenyu
6a5eacba8b disable CI red llama 3 4 gpu beam (#9690)
device hangs and ci would fail
2025-04-02 03:19:09 -04:00
qazal
4df2b6347d hotfix: bump tinybox red training CI timeout to 30 minutes (#9426) 2025-03-13 09:31:44 +01:00
nimlgen
cd9d74f7ea use am in training benchmarks (#9357)
* am in training benchmarks

* fix

* not needed anymore
2025-03-05 19:13:47 +03:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
Ignacio Sica
aaed315fee add AMX support to LLVM (#8957)
* init amx support for llvm

* revert elf changes

* fix attributes for AMX asm calls

* add comments

* add llvm amx job to benchmarks

* cleanup

* cleanup

* hotfix: improve comments

* comment for aux buffers

* hotfix:

* move amx_tc to ClangRenderer

* merge master

* refactor

* add docs

* add corsix docs reference

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-12 16:01:18 +08:00
nimlgen
52a69dd5e9 Revert "use am in training benchmarks (#8965)" (#8981)
This reverts commit 107e616857.
2025-02-09 15:43:45 +03:00
nimlgen
107e616857 use am in training benchmarks (#8965)
* am in training benchmarks

* fix

* not needed anymore
2025-02-08 20:20:47 +03:00
George Hotz
0cbb7d7f1e hotfix: metal has known sync issue 2025-02-06 14:29:41 +08:00
chenyu
836cf42c2e fix rand_like for multi (#8880) 2025-02-03 19:00:14 -05:00
chenyu
0c759e1ff6 add bert to benchmark ci (#8741)
with `DISABLE_DROPOUT=1 BERT_LAYERS=2` for now
2025-01-24 14:45:11 -05:00
ignaciosica
d2234e308a tf32 tc for nv and ptx (#8635)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
f671da6755 ci: add AM start time to benchmark (#8637)
* ci: add AM start time to benchmark

* am: unlock it

* add AMD

* revert this
2025-01-16 14:47:36 +03:00
chenyu
4ee3243c93 JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
2025-01-14 19:52:38 -05:00