qazal
4df2b6347d
hotfix: bump tinybox red training CI timeout to 30 minutes (#9426)
2025-03-13 09:31:44 +01:00
nimlgen
cd9d74f7ea
use am in training benchmarks (#9357)
...
* am in training benchmarks
* fix
* not needed anymore
2025-03-05 19:13:47 +03:00
chenyu
2e7c2780a9
CLANG -> CPU (#9189)
2025-02-20 18:03:09 -05:00
Ignacio Sica
aaed315fee
add AMX support to LLVM (#8957)
...
* init amx support for llvm
* revert elf changes
* fix attributes for AMX asm calls
* add comments
* add llvm amx job to benchmarks
* cleanup
* cleanup
* hotfix: improve comments
* comment for aux buffers
* hotfix:
* move amx_tc to ClangRenderer
* merge master
* refactor
* add docs
* add corsix docs reference
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-12 16:01:18 +08:00
nimlgen
52a69dd5e9
Revert "use am in training benchmarks (#8965)" (#8981)
...
This reverts commit 107e616857.
2025-02-09 15:43:45 +03:00
nimlgen
107e616857
use am in training benchmarks (#8965)
...
* am in training benchmarks
* fix
* not needed anymore
2025-02-08 20:20:47 +03:00
George Hotz
0cbb7d7f1e
hotfix: metal has known sync issue
2025-02-06 14:29:41 +08:00
chenyu
836cf42c2e
fix rand_like for multi (#8880)
2025-02-03 19:00:14 -05:00
chenyu
0c759e1ff6
add bert to bechmark ci (#8741)
...
with `DISABLE_DROPOUT=1 BERT_LAYERS=2` for now
2025-01-24 14:45:11 -05:00
ignaciosica
d2234e308a
tf32 tc for nv and ptx (#8635)
...
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
f671da6755
ci: add AM start time to benchmark (#8637)
...
* ci: add AM start time to benchmark
* am: unlock it
* add AMD
* revert this
2025-01-16 14:47:36 +03:00
chenyu
4ee3243c93
JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
...
is it fast?
2025-01-14 19:52:38 -05:00
George Hotz
bfbe81df71
remove cast before view (#8613)
...
* remove cast before view
* greener
* indexing
* that passes too
* openpilot too
* ack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201
raise RuntimeError for uneven shard [pr] (#8593)
...
no 7B llama on 6 GPUs
skip 70B
2025-01-14 14:51:48 -05:00
nimlgen
1ff6862a3d
ci: sleep a bit to let the driver unload the prev pid (#8605)
2025-01-14 15:55:23 +03:00
nimlgen
74b83c4c41
am in ci (#8532)
...
* try am in ci
* no sudo
* temp
* run more am test
* run half on am
* insert amdgpu
* other machine as well
2025-01-13 19:55:17 +03:00
qazal
60503c8621
use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564)
2025-01-11 06:03:48 -05:00
chenyu
85a4397f27
fix create_schedule_with_vars usage in allreduce benchmark [pr] (#8522)
...
* fix create_schedule_with_vars usage in allreduce benchmark [pr]
because i didn't know how to use it...
* increase time limit because tiny17 is slow
2025-01-07 01:30:01 -05:00
chenyu
0061dc7447
fix benchmark allreduce and add to ci [pr] (#8521)
2025-01-07 00:37:59 -05:00
ignaciosica
0a00187dce
add real AMX tests to benchmark (#8216)
...
* add real amx to benchmark
* add debug=2 to check tc are triggered
2024-12-13 14:03:41 -05:00
chenyu
d462f8ace0
use HALF in cifar wino benchmarks (#8153)
...
more representative as it hits tensor cores on tinyboxes
2024-12-10 20:21:00 -05:00
George Hotz
f83d715f41
move checks into compile3, delete compile2 [pr] (#8127)
...
* move checks into compile3 [pr]
* test_vs_onnx
* test v torch works
* float16 won't compile on compile3
* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
87c360c4b5
hotfix: add --size 8B to llama3
2024-12-09 07:53:20 -08:00
chenyu
3c8c98253a
BEAM_DEBUG=1 in speed_v_theoretical (#7942)
...
* DEBUG=3 in speed_v_theoretical
* BEAM_DEBUG=1
2024-11-28 08:30:55 -05:00
chenyu
a6171cbe71
add stable diffusion v2 to mac benchmark (#7917)
...
this caught #7902
2024-11-26 22:09:43 -05:00
chenyu
ac57d82a13
test_tiny on real NV/CUDA/AMD/HIP (#7886)
...
simple tests that run on real CUDA and HIP
2024-11-24 16:34:54 -05:00
chenyu
5c5b1b994c
less flaky benchmarks (#7855)
...
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
chenyu
d5c9fafff5
default run stable diffusion benchmark with fp16 (#7831)
...
and keep the non-fp16 one in mac
2024-11-21 15:58:17 -05:00
chenyu
c815d7b56e
run bfloat16 tensor core in metal benchmark (#7808)
...
* run bfloat16 tensor core in metal benchmark
* separate task
2024-11-20 15:34:07 -05:00
chenyu
e6cfaaa496
metal benchmark JIT=2 -> JIT=1 (#7661)
2024-11-12 22:55:27 -05:00
chenyu
1884f021e3
add conv3x3 to speed_v_theoretical (#7658)
...
* add conv3x3 to speed_v_theoretical
* show test duration
2024-11-12 16:41:56 -05:00
chenyu
a88a15c7e8
setup perflevel in red CI (#7645)
...
runs v4.1 bert setup.
```
rocm-smi --setprofile compute
rocm-smi --setmclk 3
rocm-smi --setperflevel high
```
2024-11-11 18:44:55 -05:00
chenyu
773d5b60bf
beam benchmark tests (#7638)
...
* beam benchmark tests
* lower AMD number somehow
* less flaky
2024-11-11 18:11:18 -05:00
chenyu
bfab03288d
fix HALF=1 in test_speed_v_torch (#7642)
...
* fix HALF=1 in test_speed_v_torch
"operation cache defeats" adds 1 to all arg, which were centered around 0. adding 1 makes big matmul and matvec go inf.
fixed by subtract 1 after and bumpped tolerance for half input
* bigger tol for BIG=2, update CI too
* bigger tol
2024-11-11 14:29:37 -05:00
George Hotz
b4cb6b89f9
hotfix: CI mac uses python 3.11
2024-11-11 23:42:35 +08:00
George Hotz
9648372ee6
hotfix: mac uses python 3.12
2024-11-11 23:23:48 +08:00
George Hotz
6f93e91deb
hotfix: lower mnist threshold for non determinism
2024-11-03 11:05:12 +08:00
George Hotz
4fed358511
hotfix: timeouts to 20 minutes. better no stats update than a red x
2024-10-25 16:31:52 +08:00
chenyu
d4c94d0d32
disable llama 1 4gpu and 6gpu benchmark (#7276)
...
having llama3 4gpu and 6gpu should be good enough
2024-10-24 14:19:22 -04:00
chenyu
e6929f2402
RUN_PROCESS_REPLAY=0 on llama 70B and resnet training (#7272)
...
* RUN_PROCESS_REPLAY=0 on llama 70B and resnet training
also added a 15 minutes total timeout, this cannot grow indefinitely
* add a few more
* a few more just for NV
2024-10-24 12:09:54 -04:00
George Hotz
9f4ca88218
hotfix: relax target pct for beautiful_mnist
2024-10-17 12:36:07 +08:00
nimlgen
feb0bcb58b
qcom bench bind to perf cluster (#6996)
2024-10-11 12:21:52 +03:00
nimlgen
f9d454aed5
correct kernargs alignment (#6984)
2024-10-11 00:06:28 +03:00
qazal
b82023c97e
process replay cleanup to generic _pmap [pr] (#6929)
...
* process replay cleanup to generic _pmap [pr]
* delete `COMPARE_SCHEDULE`
2024-10-07 13:57:05 +08:00
George Hotz
f45d178a55
hotfix: support JIT_BATCH_SIZE=0, make that the default
2024-09-25 10:36:04 +08:00
George Hotz
52e7f1c108
add new model CI
2024-09-25 10:23:06 +08:00
George Hotz
de259e3f09
hotfix: add compile3 to comma CI
2024-09-23 18:25:49 +08:00
qazal
e2d6e10ddf
hotfix: reset benchmarks cache for process replay (#6671)
2024-09-23 15:13:02 +08:00
nimlgen
d22b46a2ac
qcom in benchmarks (#6337)
2024-09-02 19:59:11 +03:00
chenyu
7d46fb0c83
load balance NV benchmark ci (#6107)
2024-08-16 10:08:08 -04:00