chenyu
4ee3243c93
JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
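For context, JITBEAM sets the BEAM kernel-search level applied only to JIT-captured kernels. A hedged invocation sketch (script path and flags are assumptions pieced together from nearby commits, not verified against this revision):
```
# assumed flags: --size from the "--size 8B" hotfix below, --shard for the 4 GPUs in the title
JITBEAM=2 python3 examples/llama3.py --size 8B --shard 4 --benchmark
```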
2025-01-14 19:52:38 -05:00
George Hotz
bfbe81df71
remove cast before view (#8613)
* remove cast before view
* greener
* indexing
* that passes too
* openpilot too
* ack
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201
raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs
skip 70B
2025-01-14 14:51:48 -05:00
nimlgen
1ff6862a3d
ci: sleep a bit to let the driver unload the prev pid (#8605)
2025-01-14 15:55:23 +03:00
nimlgen
74b83c4c41
am in ci (#8532)
* try am in ci
* no sudo
* temp
* run more am test
* run half on am
* insert amdgpu
* other machine as well
2025-01-13 19:55:17 +03:00
qazal
60503c8621
use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564)
2025-01-11 06:03:48 -05:00
chenyu
85a4397f27
fix create_schedule_with_vars usage in allreduce benchmark [pr] (#8522)
* fix create_schedule_with_vars usage in allreduce benchmark [pr]
because i didn't know how to use it...
* increase time limit because tiny17 is slow
2025-01-07 01:30:01 -05:00
chenyu
0061dc7447
fix benchmark allreduce and add to ci [pr] (#8521)
2025-01-07 00:37:59 -05:00
ignaciosica
0a00187dce
add real AMX tests to benchmark (#8216)
* add real amx to benchmark
* add debug=2 to check tc are triggered
2024-12-13 14:03:41 -05:00
chenyu
d462f8ace0
use HALF in cifar wino benchmarks (#8153)
more representative as it hits tensor cores on tinyboxes
2024-12-10 20:21:00 -05:00
George Hotz
f83d715f41
move checks into compile3, delete compile2 [pr] (#8127)
* move checks into compile3 [pr]
* test_vs_onnx
* test v torch works
* float16 won't compile on compile3
* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
87c360c4b5
hotfix: add --size 8B to llama3
2024-12-09 07:53:20 -08:00
chenyu
3c8c98253a
BEAM_DEBUG=1 in speed_v_theoretical (#7942)
* DEBUG=3 in speed_v_theoretical
* BEAM_DEBUG=1
2024-11-28 08:30:55 -05:00
chenyu
a6171cbe71
add stable diffusion v2 to mac benchmark (#7917)
this caught #7902
2024-11-26 22:09:43 -05:00
chenyu
ac57d82a13
test_tiny on real NV/CUDA/AMD/HIP (#7886)
simple tests that run on real CUDA and HIP
2024-11-24 16:34:54 -05:00
chenyu
5c5b1b994c
less flaky benchmarks (#7855)
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
chenyu
d5c9fafff5
default run stable diffusion benchmark with fp16 (#7831)
and keep the non-fp16 one in mac
2024-11-21 15:58:17 -05:00
chenyu
c815d7b56e
run bfloat16 tensor core in metal benchmark (#7808)
* run bfloat16 tensor core in metal benchmark
* separate task
2024-11-20 15:34:07 -05:00
chenyu
e6cfaaa496
metal benchmark JIT=2 -> JIT=1 (#7661)
2024-11-12 22:55:27 -05:00
chenyu
1884f021e3
add conv3x3 to speed_v_theoretical (#7658)
* add conv3x3 to speed_v_theoretical
* show test duration
2024-11-12 16:41:56 -05:00
chenyu
a88a15c7e8
setup perflevel in red CI (#7645)
runs v4.1 bert setup.
```
rocm-smi --setprofile compute
rocm-smi --setmclk 3
rocm-smi --setperflevel high
```
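(`--setprofile compute` selects the compute power profile, `--setmclk 3` pins the memory clock to DPM level 3, and `--setperflevel high` disables automatic downclocking; together these reduce run-to-run variance in the benchmark numbers.)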
2024-11-11 18:44:55 -05:00
chenyu
773d5b60bf
beam benchmark tests (#7638)
* beam benchmark tests
* lower AMD number somehow
* less flaky
2024-11-11 18:11:18 -05:00
chenyu
bfab03288d
fix HALF=1 in test_speed_v_torch (#7642)
* fix HALF=1 in test_speed_v_torch
"operation cache defeats" adds 1 to all arg, which were centered around 0. adding 1 makes big matmul and matvec go inf.
fixed by subtract 1 after and bumpped tolerance for half input
* bigger tol for BIG=2, update CI too
* bigger tol
2024-11-11 14:29:37 -05:00
George Hotz
b4cb6b89f9
hotfix: CI mac uses python 3.11
2024-11-11 23:42:35 +08:00
George Hotz
9648372ee6
hotfix: mac uses python 3.12
2024-11-11 23:23:48 +08:00
George Hotz
6f93e91deb
hotfix: lower mnist threshold for non-determinism
2024-11-03 11:05:12 +08:00
George Hotz
4fed358511
hotfix: timeouts to 20 minutes. better no stats update than a red x
2024-10-25 16:31:52 +08:00
chenyu
d4c94d0d32
disable llama 1 4gpu and 6gpu benchmark (#7276)
having llama3 4gpu and 6gpu should be good enough
2024-10-24 14:19:22 -04:00
chenyu
e6929f2402
RUN_PROCESS_REPLAY=0 on llama 70B and resnet training (#7272)
* RUN_PROCESS_REPLAY=0 on llama 70B and resnet training
also added a 15-minute total timeout; this cannot grow indefinitely
* add a few more
* a few more just for NV
2024-10-24 12:09:54 -04:00
George Hotz
9f4ca88218
hotfix: relax target pct for beautiful_mnist
2024-10-17 12:36:07 +08:00
nimlgen
feb0bcb58b
qcom bench bind to perf cluster (#6996)
2024-10-11 12:21:52 +03:00
nimlgen
f9d454aed5
correct kernargs alignment (#6984)
2024-10-11 00:06:28 +03:00
qazal
b82023c97e
process replay cleanup to generic _pmap [pr] (#6929)
* process replay cleanup to generic _pmap [pr]
* delete `COMPARE_SCHEDULE`
2024-10-07 13:57:05 +08:00
George Hotz
f45d178a55
hotfix: support JIT_BATCH_SIZE=0, make that the default
2024-09-25 10:36:04 +08:00
George Hotz
52e7f1c108
add new model CI
2024-09-25 10:23:06 +08:00
George Hotz
de259e3f09
hotfix: add compile3 to comma CI
2024-09-23 18:25:49 +08:00
qazal
e2d6e10ddf
hotfix: reset benchmarks cache for process replay (#6671)
2024-09-23 15:13:02 +08:00
nimlgen
d22b46a2ac
qcom in benchmarks (#6337)
2024-09-02 19:59:11 +03:00
chenyu
7d46fb0c83
load balance NV benchmark ci (#6107)
2024-08-16 10:08:08 -04:00
nimlgen
8f787785d9
fix openpilot benchmark (#6049)
2024-08-12 21:12:32 +03:00
qazal
266afad8ed
hotfix: skip schedule capture in benchmarks (#6012)
2024-08-10 17:13:53 +03:00
chenyu
adba5efc64
enable llama 2 70B in tinybox green CI (#5905)
runnable with MAX_CONTEXT=256
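A hedged reproduction sketch (the llama example's flags are assumptions inferred from this log, not verified against this revision):
```
# assumed flags: --gen/--size/--benchmark; MAX_CONTEXT caps the context length to fit memory
MAX_CONTEXT=256 python3 examples/llama.py --gen 2 --size 70B --benchmark
```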
2024-08-04 18:48:46 -04:00
wozeparrot
acadccf344
comma benchmark (#5518)
2024-08-02 14:36:54 -07:00
wozeparrot
eebb1b9922
feat: temperature 0 llama3 benchmark (#5806)
2024-07-30 12:05:36 -07:00
qazal
3e49d86c01
process replay diffs 3 things now (#5731)
* github api infra
* process replay is 3 parts now
* parse benchmarks
* add gh_token
* complete diff
* move process replay tests
* last successful run
* add tempdir
* skip master
2024-07-27 12:52:20 +03:00
George Hotz
db1d093b29
reenable LLaMA-3 8B BEAM on NV (#5746)
2024-07-26 16:56:41 -07:00
wozeparrot
6ccb2390c3
feat: update_benchmark_staging (#5529)
2024-07-17 20:40:57 -07:00
wozeparrot
218e157f00
benchmark on update_benchmark_staging (#5541)
2024-07-17 17:11:52 -07:00
chenyu
b17e4adb3a
add -c advice.detachedHead=false to process replay git checkout (#5419)
remove the noisy `Note: switching to 'origin/master'. You are in 'detached HEAD' state. You can look around, make experimental changes...` in log
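With the flag, the checkout the replay tooling performs looks like this (`-c` scopes the config override to this single git invocation, so nothing is written to the repo config):
```
git -c advice.detachedHead=false checkout origin/master
```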
2024-07-12 15:13:26 -04:00
qazal
31fcc516dc
more process replay tooling (#5407)
* replays
* what's in there
* can it be up there
* sha is enough
* insert sha as the key
* fix str
* update reset utils
* that nested try/except was terrible
* github_context can go
2024-07-12 13:11:34 +03:00