nimlgen
4b4ba5454c
ci: move driver start higher ( #11431 )
2025-07-30 10:48:38 +03:00
chenyu
204da24cfc
increase driverbenchmark timeout-minutes to 15 ( #11428 )
2025-07-29 19:45:05 -04:00
nimlgen
c88e401d0e
ci: fix typos in h machine benchmarks ( #11423 )
2025-07-29 22:11:47 +03:00
George Hotz
1f1f99c287
hotfix: add DEBUG=3 to driver CI
2025-07-29 11:03:47 -07:00
nimlgen
d38d285489
ci: add h machines ( #11416 )
...
* ci: add h machines
* more
* fix names
* names not collide
* 20
* 10
2025-07-29 19:21:51 +03:00
chenyu
2b48b961be
fix a few broken AMX tests ( #11204 )
2025-07-12 21:42:38 -04:00
George Hotz
0597735f28
remove TC=3 not porting this ( #11045 )
2025-06-30 15:12:49 -07:00
chenyu
126fcf4129
clean up AMD_LLVM in tests ( #11021 )
2025-06-28 22:45:47 -04:00
chenyu
d71bb6a7b2
remove comma 0.9.4 from benchmark ( #10867 )
2025-06-18 12:43:59 -04:00
chenyu
4f535641f7
add one huggingface_onnx test to mac benchmark ci ( #10700 )
...
this crashed for me on onnx parser pr but seems fine for the author. see if ci mac is fine
2025-06-08 12:26:12 -04:00
wozeparrot
37e1ef1be3
feat: cleanup old AM processes ( #10653 )
2025-06-05 15:41:00 -07:00
wozeparrot
5e3c4a8431
fix: comma testsig ( #10568 )
2025-05-29 19:00:07 -07:00
George Hotz
6b8eb5fec2
split mlperf to its own red benchmark run ( #10492 )
...
* Add mmapeak implementation for 7900 XTX
* Change indentation
* Use a template instead of multiple assembly files
* Fix output formatting
* Reduce register file bank conflicts
* More accurate measurement for quick instructions
* Add support for gfx1201
* RDNA4 wmma requires fewer VGPRs
* RDNA4 does not have s_cmpk instructions
* Add v_wmma_i32_16x16x32_iu4 for gfx1201
* Add sparse wmma instructions
* split to tinybox red MLPerf Benchmark
---------
Co-authored-by: Panagiotis Kourouklidis <panagiotis.kourouklidis@gmail.com>
2025-05-23 17:12:41 -07:00
uuuvn
3ca5680920
Test remote in benchmark ( #10304 )
...
hlb cifar is fast so added it, can add bert too if you think it's ok
6 real gpus to test multigraph and transfers + accuracy validation
should probably be added to tinystats too, i don't know how though
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-05-23 12:12:57 -04:00
qazal
90eb3c0e5d
add MobileNetV2 benchmark to comma CI ( #10250 )
...
* add MobileNetV2 to comma CI
* symlink imagenet
* also the signature
* comment that out
* need imagenetmock
* same train and test set
* quantize on CPU=1
* verbose
* need __hexagon_divsf3
* 0x858d6c15
* quant cpu + CC=clang-19
2025-05-19 18:22:50 +03:00
George Hotz
b06291077c
no amdgpu kernel driver ( #10408 )
...
* no amdgpu kernel driver
* don't test hip
* lower req
2025-05-18 20:52:39 -07:00
wozeparrot
1ed04f993b
move benchmark stat tracking to influxdb ( #10185 )
2025-05-15 16:14:56 -07:00
Ignacio Sica
47b3055fe2
set fail-fast behavior ( #10336 )
2025-05-15 11:24:45 -07:00
George Hotz
7a3d4de59a
hotfix: add GRAPH_ONE_KERNEL=1 to UsbGPU openpilot test
2025-05-14 14:50:37 -07:00
George Hotz
f1130ab3d3
openpilot benchmark test ( #10290 )
...
* openpilot benchmark test
* that
2025-05-13 22:49:28 -07:00
chenyu
ad5cb2717d
FUSE_ARANGE=1 in bert bench ( #10263 )
...
still fails, something multi related maybe
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-05-13 09:12:19 -04:00
chenyu
0015b3921f
sleep more in CI Remove amdgpu ( #10261 )
...
see if this is less flaky
2025-05-12 08:13:44 -04:00
nimlgen
7d6ed1b1e9
hotfix: mac ci ( #10210 )
...
* fixed?
* cmnt
2025-05-08 14:13:23 +03:00
nimlgen
ba52fce4b2
usbgpu: benchmark in ci ( #10208 )
...
* usbgpu: benchmark
* usbgpu: benchmark
2025-05-08 12:02:04 +03:00
Ignacio Sica
bf5fb97498
fix AMD_LLVM bf16 tc for gfx1100 ( #10102 )
...
* fix amd_llvm bf16 tc
* cleanup pattern
2025-04-30 20:06:38 -03:00
chenyu
4a04098389
fix llama3 with nf4 quantize ( #10107 )
...
also int8 outputs are wrong
2025-04-29 15:14:36 -04:00
Ignacio Sica
9d5677c12c
fix ptx linearizer bug 2 [pr] ( #9967 )
...
* check for local buffer
* hotfix
* add test_tensor_cores_emulation run for ptx
2025-04-29 14:30:07 -03:00
Ignacio Sica
58cf8cd493
add support for "shared_mem" for LLVM ( #10093 )
...
* init llvm shared
* add test_tensor_cores_emulation run for llvm
2025-04-29 08:56:36 -04:00
Ignacio Sica
bda116d773
fix use_tensor_cores propagation ( #10048 )
...
* propagate use_tensor_cores
* add use_tensor_core to arg in test and search
* bugfix
* get TC val from ContextVar in search
* revert minor space change
* add tc emulation test to ci and benchmark
* revert
* revert whitespace change
* remove test for ptx
* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
chenyu
e996584685
olmoe in mac benchmark ( #10077 )
2025-04-27 21:07:02 -04:00
George Hotz
b6d2effaf5
assign is contiguous ( #10066 )
...
* assign is contiguous
* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
Ignacio Sica
023b1c28a2
test_tensor_cores_padded refactor ( #9724 )
...
* set pad to 3 for amd padded tc test
* change pad for amd regardless of CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup
2025-04-18 17:05:54 -03:00
chenyu
c5db5b83b9
add SHOULD_USE_TC=1 check to simple_matmul ( #9802 )
...
* add SHOULD_USE_TC=1 check to simple_matmul
also zero centered the random input and update atol for tf32
* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
George Hotz
14928fecff
Revert "fix TF32 tensor core dropped in tc_sm89 ( #9798 )"
...
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f
fix TF32 tensor core dropped in tc_sm89 ( #9798 )
...
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
Ignacio Sica
58785181a8
AMD bf16xf32 TC ( #9717 )
...
* dont test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
1d25844d44
Revert "disable CI red llama 3 4 gpu beam ( #9690 )" ( #9709 )
...
This reverts commit 6a5eacba8b.
2025-04-03 02:34:39 -04:00
chenyu
6a5eacba8b
disable CI red llama 3 4 gpu beam ( #9690 )
...
device hangs and ci would fail
2025-04-02 03:19:09 -04:00
qazal
4df2b6347d
hotfix: bump tinybox red training CI timeout to 30 minutes ( #9426 )
2025-03-13 09:31:44 +01:00
nimlgen
cd9d74f7ea
use am in training benchmarks ( #9357 )
...
* am in training benchmarks
* fix
* not needed anymore
2025-03-05 19:13:47 +03:00
chenyu
2e7c2780a9
CLANG -> CPU ( #9189 )
2025-02-20 18:03:09 -05:00
Ignacio Sica
aaed315fee
add AMX support to LLVM ( #8957 )
...
* init amx support for llvm
* revert elf changes
* fix attributes for AMX asm calls
* add comments
* add llvm amx job to benchmarks
* cleanup
* cleanup
* hotfix: improve comments
* comment for aux buffers
* hotfix:
* move amx_tc to ClangRenderer
* merge master
* refactor
* add docs
* add corsix docs reference
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-12 16:01:18 +08:00
nimlgen
52a69dd5e9
Revert "use am in training benchmarks ( #8965 )" ( #8981 )
...
This reverts commit 107e616857.
2025-02-09 15:43:45 +03:00
nimlgen
107e616857
use am in training benchmarks ( #8965 )
...
* am in training benchmarks
* fix
* not needed anymore
2025-02-08 20:20:47 +03:00
George Hotz
0cbb7d7f1e
hotfix: metal has known sync issue
2025-02-06 14:29:41 +08:00
chenyu
836cf42c2e
fix rand_like for multi ( #8880 )
2025-02-03 19:00:14 -05:00
chenyu
0c759e1ff6
add bert to benchmark ci ( #8741 )
...
with `DISABLE_DROPOUT=1 BERT_LAYERS=2` for now
2025-01-24 14:45:11 -05:00
ignaciosica
d2234e308a
tf32 tc for nv and ptx ( #8635 )
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
f671da6755
ci: add AM start time to benchmark ( #8637 )
...
* ci: add AM start time to benchmark
* am: unlock it
* add AMD
* revert this
2025-01-16 14:47:36 +03:00
chenyu
4ee3243c93
JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] ( #8623 )
...
is it fast?
2025-01-14 19:52:38 -05:00