chenyu
f7965f85aa
Revert "feat: faster index building ( #11462 )" ( #11478 )
...
This reverts commit 3a4deb08d2.
2025-08-02 12:50:48 -04:00
wozeparrot
3a4deb08d2
feat: faster index building ( #11462 )
...
* feat: faster index building
* feat: correct training samples
2025-08-02 11:50:18 -04:00
chenyu
9e8e6b45ab
grad acc train llama ( #11467 )
...
* grad acc train llama
* log step time
2025-08-01 15:54:50 -04:00
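Gradient accumulation runs several micro-batch backward passes per optimizer step to get a larger effective batch. A minimal sketch of the idea against tinygrad's public API; the model, shapes, and GRAD_ACC_STEPS below are illustrative stand-ins, not the code from this PR:
```python
from tinygrad import Tensor, nn

class TinyModel:
  def __init__(self): self.l = nn.Linear(16, 4)
  def __call__(self, x: Tensor) -> Tensor: return self.l(x)

model, GRAD_ACC_STEPS = TinyModel(), 4  # illustrative accumulation factor
opt = nn.optim.AdamW(nn.state.get_parameters(model), lr=1e-4)

with Tensor.train():
  opt.zero_grad()
  for i in range(8):
    x, y = Tensor.randn(8, 16), Tensor.randint(8, high=4)  # stand-in micro-batch
    # scale each micro-batch loss so the accumulated gradient is an average
    loss = model(x).sparse_categorical_crossentropy(y) / GRAD_ACC_STEPS
    loss.backward()  # backward() sums into .grad, so gradients accumulate across calls
    if (i+1) % GRAD_ACC_STEPS == 0:
      opt.step()     # one optimizer update per GRAD_ACC_STEPS micro-batches
      opt.zero_grad()
```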
chenyu
7ad7329257
data parallel train llama ( #11466 )
2025-08-01 12:13:51 -04:00
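Data parallelism here means splitting each batch across devices while replicating the weights. A rough sketch of the pattern using tinygrad's shard API; the device list, model, and shapes are placeholders, not this PR's actual training loop:
```python
from tinygrad import Tensor, Device, nn

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))  # placeholder device list

model = nn.Linear(16, 4)
for p in nn.state.get_parameters(model): p.shard_(GPUS)  # replicate weights on every device

x = Tensor.randn(8, 16).shard(GPUS, axis=0)  # split the batch along the batch axis
out = model(x)                               # each device computes on its shard
```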
George Hotz
8ff03806e8
add llama layers ( #11460 )
...
* add llama layers
* add contig bw for speed
2025-07-31 16:28:04 -07:00
wozeparrot
6252f7770e
feat: fake data ( #11447 )
2025-07-30 17:18:20 -07:00
chenyu
e300451f3a
update llama3 ( #11446 )
...
`LR=1e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 FUSE_ARANGE=1 JITBEAM=2 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=512 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py` trained to 7
2025-07-30 19:34:21 -04:00
wozeparrot
5fb975351a
feat: flag for training on val ( #11441 )
2025-07-30 14:29:45 -07:00
wozeparrot
825b6a2505
feat: llama3 dataloader ( #11340 )
2025-07-30 13:27:55 -07:00
chenyu
c14c9a8eff
llama3 grad clip ( #11003 )
2025-06-27 19:14:12 -04:00
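Global-norm gradient clipping rescales all gradients by a shared factor when their combined L2 norm exceeds a threshold. A generic sketch of the technique, not the code from this PR; the epsilon and the max_norm default are assumptions:
```python
from tinygrad import Tensor

def clip_grad_norm(grads: list[Tensor], max_norm: float = 1.0) -> list[Tensor]:
  # global L2 norm across all gradients, then a shared downscale factor capped at 1
  total_norm = sum((g * g).sum() for g in grads).sqrt()
  scale = (max_norm / (total_norm + 1e-6)).minimum(1.0)
  return [g * scale for g in grads]

grads = [Tensor.randn(3, 3), Tensor.randn(5)]  # stand-in gradients
clipped = clip_grad_norm(grads, max_norm=1.0)
```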
chenyu
f2548afeb5
bert grad clipping start with const 0 ( #11008 )
...
saved the init kernels
2025-06-27 18:02:23 -04:00
chenyu
6ab5a5cb6c
llama3 mlperf train ( #10983 )
...
work in progress. it can now overfit small examples, and vram usage roughly matches
2025-06-26 20:24:27 -04:00
chenyu
8751d47985
CosineAnnealingLRWithWarmup ( #10981 )
2025-06-25 17:45:21 -04:00
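The name describes the schedule: linear warmup up to the base LR, then cosine annealing. An illustrative standalone reimplementation, not the scheduler class itself; end_lr and the exact step conventions are assumptions:
```python
import math

def cosine_lr_with_warmup(step: int, base_lr: float, warmup_steps: int,
                          decay_steps: int, end_lr: float = 0.0) -> float:
  if step < warmup_steps:  # linear ramp from 0 up to base_lr
    return base_lr * (step + 1) / warmup_steps
  t = min((step - warmup_steps) / max(decay_steps - warmup_steps, 1), 1.0)
  return end_lr + 0.5 * (base_lr - end_lr) * (1 + math.cos(math.pi * t))

# e.g. with LR=1e-4, WARMUP_STEPS=36, DECAY_STEPS=360 as in the llama3 command above
lrs = [cosine_lr_with_warmup(s, 1e-4, 36, 360) for s in range(360)]
```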
chenyu
efad567ebd
ruff check whole examples/mlperf/ ( #10979 )
2025-06-25 12:57:48 -04:00
chenyu
0480139def
log_perplexity metrics ( #10912 )
2025-06-21 10:44:47 -04:00
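Log-perplexity is the mean per-token negative log-likelihood; exponentiating it gives perplexity. A small worked example with made-up NLL values:
```python
import math

def log_perplexity(token_nlls: list[float]) -> float:
  # mean per-token negative log-likelihood, i.e. the log of perplexity
  return sum(token_nlls) / len(token_nlls)

nlls = [2.1, 1.8, 2.4]   # hypothetical per-token cross-entropy losses, in nats
lp = log_perplexity(nlls)
print(lp, math.exp(lp))  # 2.1 nats -> perplexity ~8.17
```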
chenyu
62a540066e
remove DEBUG=2 in mi300x bert setup ( #10886 )
...
seems fine now, not sure what the issue was
2025-06-19 13:28:53 -04:00
chenyu
f377cc19cd
use AM for bert ( #10882 )
...
have trained 3 runs and all seem fine
2025-06-19 09:48:54 -04:00
chenyu
b70c7d3631
bert grad accumulation ( #10863 )
...
* bert grad accumulation
* realize grad
2025-06-18 12:17:07 -04:00
chenyu
075a74cf25
add global_batch_size to mlperf bert ( #10852 )
...
global_batch_size = grad_acc_steps * batch_size. no-op change to prep grad acc for bert
2025-06-17 17:54:15 -04:00
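The relationship stated in the commit body, with illustrative numbers:
```python
batch_size, grad_acc_steps = 24, 4              # illustrative values, not the real config
global_batch_size = grad_acc_steps * batch_size
assert global_batch_size == 96                  # samples consumed per optimizer step
```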
chenyu
81e296d7b8
remove Tensor.test() in retinanet ( #10770 )
...
test was removed
2025-06-10 22:14:57 -04:00
George Hotz
32e9949052
rename lazydata to uop ( #10698 )
2025-06-08 08:42:22 -07:00
chenyu
4ab3391e6f
set -o pipefail for mlperf run_and_time ( #10577 )
...
also run the 5.1 script in ci cron job
2025-05-30 16:36:44 -04:00
chenyu
baf482d314
copy mlperf stuff to 5.1 ( #10576 )
...
5.0 is finalized, new changes go to 5.1
2025-05-30 16:12:39 -04:00
George Hotz
b3b43a82c4
remove Tensor.no_grad, it's meaningless now [pr] ( #10556 )
2025-05-28 22:20:02 -07:00
chenyu
74cf5dbd9e
mlperf system updates ( #10550 )
...
standardized processor and accelerator names
2025-05-28 16:15:46 -04:00
chenyu
51dc7eedb0
correct use AM for resnet run_and_time ( #10524 )
2025-05-26 15:33:11 -04:00
chenyu
c1919ad55f
use AM for resnet run_and_time ( #10523 )
2025-05-26 14:50:49 -04:00
chenyu
2d50efb92b
set -e on mlperf run_and_time scripts ( #10519 )
2025-05-26 09:22:30 -04:00
chenyu
dc6309242d
WallTimeEvent for mlperf ci ( #10506 )
2025-05-24 10:56:03 -04:00
chenyu
67d1364106
update LOGMLPERF in red resnet run_and_time ( #10416 )
2025-05-19 13:23:33 -04:00
chenyu
485e80da69
run_and_time for resnet ci ( #10405 )
2025-05-18 23:39:57 -04:00
wozeparrot
1ed04f993b
move benchmark stat tracking to influxdb ( #10185 )
2025-05-15 16:14:56 -07:00
George Hotz
568d6d96e7
small changes from new multi [pr] ( #10318 )
2025-05-14 20:50:59 -07:00
George Hotz
bfc30fa6ea
hotfix: typo in shm_name
2025-05-14 19:34:52 -07:00
George Hotz
2bc54b3e22
manually handle OSX
2025-05-14 19:17:51 -07:00
George Hotz
ab460486d7
Revert "resnet dataloader osx ( #10316 )"
...
This reverts commit aef336930a.
2025-05-14 19:15:07 -07:00
George Hotz
aef336930a
resnet dataloader osx ( #10316 )
...
* mlperf dataloader on mac
* resnet dataloader [pr]
* simple should work
2025-05-14 18:31:26 -07:00
chenyu
610ee79b22
cherry pick mlperf5.0 branch to master ( #10089 )
2025-04-28 15:36:56 -04:00
chenyu
74c6cf8be3
lint mlperf model_train ( #10038 )
2025-04-24 16:19:44 -04:00
chenyu
a25abf55e3
retinanet only call postprocess_detections with RUNMLPERF ( #10017 )
...
during setup we only need to compile `_eval_step().numpy()`
2025-04-23 20:45:38 -04:00
chenyu
65faa1d94b
explicit device in mlperf scripts ( #10015 )
2025-04-23 17:11:52 -04:00
chenyu
a3f938dbee
remove retinanet INITMLPERF from beam script ( #10011 )
...
it only controls logging; whether real data is loaded is controlled solely by RUNMLPERF
2025-04-23 14:32:54 -04:00
Francis Lata
5542aeb0e4
RetinaNet MLPerf flag updates ( #10009 )
...
* add RUNMLPERF and update INITMLPERF usage
* update scripts to use RUNMLPERF
2025-04-23 13:00:34 -04:00
George Hotz
de0504276b
pop 0 is slow [pr] ( #10007 )
2025-04-23 17:00:59 +01:00
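list.pop(0) shifts every remaining element left, so draining a list from the front is quadratic overall; collections.deque.popleft() is O(1). A generic illustration of the title's point (this PR's actual fix may differ):
```python
from collections import deque

items = list(range(100_000))
while items: items.pop(0)  # O(n) per pop: every remaining element shifts left

q = deque(range(100_000))
while q: q.popleft()       # O(1) per pop from the left end of a deque
```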
chenyu
d3a8d5c128
print postprocess_detections time in retinanet eval ( #10005 )
...
`BS=96 BASEDIR="/raid/datasets/openimages" MODEL=retinanet python examples/mlperf/model_eval.py`
```
...
loaded dataset @ 8.64s
loaded initial data @ 12.57s
****** 619.97 ms to enqueue, 46042.13 ms to realize ( 116.22 ms fetching, 45399.58 ms postprocess_detections). 0.09 examples/sec. 0.83 TFLOPS @ 59.23s
****** 147.49 ms to enqueue, 37362.16 ms to realize ( 146.96 ms fetching, 36618.84 ms postprocess_detections). 0.11 examples/sec. 1.03 TFLOPS @ 96.74s
****** 152.85 ms to enqueue, 37244.08 ms to realize ( 120.67 ms fetching, 36235.19 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 134.14s
****** 146.39 ms to enqueue, 37279.85 ms to realize ( 65.07 ms fetching, 36233.56 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 171.56s
****** 152.41 ms to enqueue, 37264.04 ms to realize ( 127.08 ms fetching, 36196.10 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 208.98s
****** 151.29 ms to enqueue, 36868.08 ms to realize ( 142.73 ms fetching, 36153.07 ms postprocess_detections). 0.11 examples/sec. 1.05 TFLOPS @ 246.00s
****** 136.41 ms to enqueue, 37325.04 ms to realize ( 90.29 ms fetching, 36573.38 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 283.46s
```
2025-04-23 11:39:56 -04:00
chenyu
c39128133c
retinanet green scripts ( #9996 )
...
also removed realize in data_get and used empty for fake data. slightly bigger lr. https://wandb.ai/chenyuxyz/MLPerf-RetinaNet/runs/8skid0e8?nw=nwuserchenyuxyz
2025-04-23 08:28:03 -04:00
chenyu
fb89d9a584
retinanet eval combine output on GPUS[0] ( #9966 )
...
eval 35 sec -> 20 sec. it was spending 13 seconds assembling output tensor on CPU backend. GPUS[0] seems to have enough memory, otherwise we can lower EVAL_BS
2025-04-22 07:43:51 -04:00
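The pattern described is moving per-GPU eval outputs to one device and concatenating there instead of assembling on the CPU backend. A hedged sketch; the device list, shapes, and names are stand-ins:
```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))         # stand-in device list
per_gpu_outputs = [Tensor.randn(4, 8, device=d) for d in GPUS]  # stand-in eval outputs

parts = [t.to(GPUS[0]) for t in per_gpu_outputs]  # gather everything on GPUS[0]
combined = parts[0].cat(*parts[1:], dim=0)        # concatenate on-device, not on CPU
```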
chenyu
5294c32279
dev scripts for retinanet ( #9968 )
...
also BASE_DIR -> BASEDIR for consistency, and move wandb up a bit for more accurate timing
2025-04-21 17:54:56 -04:00
Francis Lata
defa1e77f6
get the proper dataset count ( #9962 )
2025-04-21 12:11:37 -04:00
Francis Lata
d7e247f329
RetinaNet INITMLPERF support ( #9950 )
...
* fixes to make fake data work
* fix eval beam
* fix merge issue
2025-04-21 10:32:05 -04:00