chenyu
6b3480ec70
update mi300x bert hparams ( #9716 )
...
* update mi300x bert hparams
borrowed from a previous submission that also used BS=1024
* update
2025-04-03 22:30:00 -04:00
chenyu
a6fec2f5ae
dev_run for bert on mi300x ( #9706 )
2025-04-02 21:12:55 -04:00
chenyu
f7cb2e8da3
bert dev_beam for mi300x box ( #9648 )
...
* bert dev_beam for mi300x box
* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
chenyu
d8d7ac1bb1
fix bert free_intermediates ( #9633 )
...
fix when only running eval `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
2025-03-30 22:42:52 -04:00
chenyu
a187dfd3df
bert BEAM_UOPS_MAX 3000->4000 ( #9603 )
...
more stable for the final step time
green 410ms (master) -> 397ms (BEAM=4) -> 392ms (this)
red 561ms (master) -> 550ms (this)
2025-03-27 11:58:47 -04:00
chenyu
62888614f6
lower bert eval bs to 24 ( #9590 )
...
OOM during eval
2025-03-26 21:25:23 -04:00
chenyu
c965f4c20b
update bert config ( #9555 )
...
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
Francis Lata
1a1087e3a0
cleanups on losses and dataset tests ( #9538 )
2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc
RetinaNet losses ( #9536 )
...
* add sigmoid_focal_loss and l1_loss
* update ref implementation comment
2025-03-21 15:52:54 -04:00
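A minimal sketch of the two losses this commit adds, written with basic tinygrad Tensor ops; the names, defaults, and mean reduction are illustrative rather than the repo's exact signatures:

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
  p = logits.sigmoid()
  # per-element binary cross entropy
  ce = -(targets * p.log() + (1 - targets) * (1 - p).log())
  p_t = targets * p + (1 - targets) * (1 - p)
  alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
  # down-weight easy examples by (1 - p_t)^gamma
  return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def l1_loss(pred: Tensor, target: Tensor) -> Tensor:
  return (pred - target).abs().mean()
```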
chenyu
b46b8ee15e
add a flag to log when beam surpassed max limit [pr] ( #9533 )
2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea
RetinaNet dataloader ( #9442 )
...
* retinanet dataloader
* remove batch_size from generate_anchors
* refactor kits19 dataset tests
* add tests for dataloader
* fix testing setup and cleanups
* remove unused import
2025-03-21 13:36:41 -04:00
chenyu
f53be010d7
lower bert learning rate ( #9481 )
...
slightly better. first sub 3hr run https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d
bert lower learning rate and total steps ( #9466 )
...
closer to the other submission with BS=240. converged with 10% fewer epochs
2025-03-16 17:21:20 -04:00
chenyu
4992958dae
update bert beam params ( #9423 )
...
BEAM_MIN_PROGRESS=5 for setup speed
2025-03-12 13:00:41 -04:00
chenyu
22fc0a2e36
bert sum acc in half ( #9412 )
...
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
01e8b60911
acc_dtype -> dtype ( #9402 )
...
matched numpy and torch
2025-03-10 16:05:30 -04:00
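Hedged example of what the rename means for callers, assuming the reduction ops take the accumulation dtype under the new keyword:

```python
from tinygrad import Tensor, dtypes

x = Tensor.ones(1024, dtype=dtypes.half)
# accumulate the half-precision sum in float32; the keyword is now `dtype`
# (previously `acc_dtype`), matching numpy's and torch's argument name
total = x.sum(dtype=dtypes.float32)
```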
chenyu
2af129c078
bert corealize multiple outputs ( #9359 )
...
1% faster step
2025-03-05 10:58:37 -05:00
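Illustrative sketch of realizing the step's outputs together so they are scheduled as one graph; the tensors are stand-ins and the multi-argument `realize` call is an assumption about the Tensor API at this point in the history:

```python
from tinygrad import Tensor

loss, global_norm, lr = Tensor.rand(1), Tensor.rand(1), Tensor.rand(1)  # placeholder step outputs
loss.realize(global_norm, lr)  # one scheduling call instead of three separate realizes
```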
chenyu
ad72269f08
bert put eval copy and getting lr in jit ( #9350 )
2025-03-04 20:57:03 -05:00
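A rough sketch of the pattern: push more of the step into TinyJit so those ops are captured once instead of being re-scheduled from Python every iteration. The body is a placeholder, not BERT's eval step:

```python
from tinygrad import Tensor, TinyJit

W = Tensor.rand(128, 10)  # placeholder weights

@TinyJit
def eval_step(x: Tensor) -> Tensor:
  # device copies and scalar reads done in here get captured by the jit
  return x.matmul(W).argmax(axis=-1).realize()

preds = eval_step(Tensor.rand(32, 128))
```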
chenyu
9eb45eb629
add a flag to skip bert train ( #9349 )
2025-03-04 17:13:00 -05:00
qazal
845814f396
revert buffer_view change ( #9311 )
...
* Revert "BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298 )"
This reverts commit 3210b656b6 .
* Revert "substitute ast from kernel op [pr] (#9293 )"
This reverts commit 5a9c788ae6 .
2025-03-01 11:00:12 +01:00
qazal
3210b656b6
BUFFER_VIEW is a node in the kernel graph + delete ViewOp ( #9298 )
2025-02-28 12:15:04 +02:00
George Hotz
3f4eb9006a
test for device mismatch [pr] ( #9250 )
...
* test for device mismatch [pr]
* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e
RESET_STEP in bert setup and beam ( #9248 )
...
running dev_beam might OOM without it but runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab
hotfix bert no shard with only one device ( #9243 )
...
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard though
2025-02-25 09:05:11 -05:00
chenyu
8c7be428e5
update bert BS to 78 ( #9236 )
...
fits 78 now. about 215 tflops on green
2025-02-24 22:47:35 -05:00
chenyu
2e7c2780a9
CLANG -> CPU ( #9189 )
2025-02-20 18:03:09 -05:00
chenyu
3b37cc898b
add bert tiny config ( #9177 )
...
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
chenyu
975c318dbc
bert use int32 for input ids ( #9173 )
...
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
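Sketch of the dtype point: keep the token ids as int32 end to end instead of round-tripping them through float (integers above 2**24 are not exactly representable in float32). The array contents are illustrative:

```python
import numpy as np
from tinygrad import Tensor, dtypes

ids_np = np.array([[101, 2023, 102]], dtype=np.int32)   # [CLS], a token, [SEP]
input_ids = Tensor(ids_np, dtype=dtypes.int32)          # no lossy float cast
```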
chenyu
ff05bff221
put bert data shard inside jit ( #9160 )
...
python time 45ms -> 9ms, it was spending time to schedule the shard
also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard it onto GPUS
2025-02-18 10:36:54 -05:00
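Hedged illustration of the second point: the batch originates from numpy, so it is built on the CPU-side backend first and only then sharded across the training devices (in the training script the shard call now sits inside the jitted step). GPUS and the device strings are placeholders:

```python
import numpy as np
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))   # placeholder device list

batch_np = np.zeros((8, 16), dtype=np.float32)
batch = Tensor(batch_np, device="CPU")   # build on the CPU backend (CLANG at the time of this commit), not the default device
sharded = batch.shard(GPUS, axis=0)      # split along the batch dimension across GPUS
```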
chenyu
5dc1257ce0
clean up bert fake data iterator [pr] ( #9145 )
...
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96
increase lr for bert ( #9098 )
...
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
chenyu
b58e7b1898
zero out the weight in bert init run ( #9076 )
...
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffers holding the random init weights caused the OOM.
2025-02-14 08:40:41 -05:00
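Illustrative sketch of the idea with a stand-in model: overwrite the parameters with zeros during the BENCHMARK init run so the buffers holding the random init don't stay resident:

```python
from tinygrad import Tensor, nn
from tinygrad.nn.state import get_parameters

model = nn.Linear(128, 128)   # stand-in for the BERT model
for p in get_parameters(model):
  p.assign(Tensor.zeros(*p.shape, dtype=p.dtype)).realize()
```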
chenyu
9e91898941
bert eval at the end of training ( #9070 )
...
always eval at the last epoch
2025-02-13 16:29:44 -05:00
chenyu
7b5ac2c15e
free_intermediates in bert ( #9040 )
...
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
chenyu
a092b6395d
Tuple -> tuple, List -> list [pr] ( #8936 )
2025-02-06 14:21:19 -05:00
chenyu
c7ca7959e6
set DISABLE_DROPOUT=1 in bert script for now ( #8799 )
2025-01-29 10:51:29 -05:00
chenyu
c99ae81f63
update default resnet LOSS_SCALER to 256 [pr] ( #8774 )
2025-01-27 16:59:05 -05:00
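Generic loss-scaling sketch for context (not the resnet script itself): scale the loss up before backward so small half-precision gradients don't flush to zero, then unscale before the optimizer step.

```python
from tinygrad import Tensor
from tinygrad.helpers import getenv

LOSS_SCALER = getenv("LOSS_SCALER", 256)

def scaled_backward(loss: Tensor, params: list[Tensor]):
  (loss * LOSS_SCALER).backward()          # scale up the backward pass
  for p in params:
    if p.grad is not None: p.grad = p.grad / LOSS_SCALER   # unscale before the optimizer step
```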
chenyu
af65331b76
update beam params for bert green [pr] ( #8726 )
...
increase BEAM_UPCAST_MAX and BEAM_LOCAL_MAX to the defaults, matching red. 3% faster step
2025-01-22 22:00:05 -05:00
chenyu
9a9079118e
envvar BERT_LAYERS [pr] ( #8709 )
...
default is 24 for large
2025-01-21 22:49:19 -05:00
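The env-var pattern, for reference (a sketch; the actual wiring into the model config lives in the training script):

```python
from tinygrad.helpers import getenv

BERT_LAYERS = getenv("BERT_LAYERS", 24)   # 24 = BERT-large; set e.g. BERT_LAYERS=4 for quick experiments
```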
chenyu
9f6d545a16
bert log global_norm in training step [pr] ( #8708 )
...
* bert log global_norm in training step [pr]
and minor cleanups
* .item()
2025-01-21 20:36:27 -05:00
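A sketch of the kind of logging this adds, assuming `grads` is the list of gradient tensors; the `.item()` read mirrors the commit's follow-up note:

```python
from tinygrad import Tensor

def global_norm(grads: list[Tensor]) -> Tensor:
  # sum of squares over every parameter's grad, accumulated in float32, then sqrt
  return Tensor.stack(*[g.float().square().sum() for g in grads]).sum().sqrt()

# inside the train loop, after backward():
# print(f"global_norm: {global_norm(grads).item():.4f}")
```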
chenyu
1e283c33d3
remove realize in bert model init [pr] ( #8707 )
2025-01-21 14:11:03 -05:00
chenyu
930728c069
bert BS 72->66 [pr] ( #8621 )
...
72 does not fit now
2025-01-14 18:41:41 -05:00
chenyu
994944920b
simpler batch_load_train_bert [pr] ( #8582 )
...
I don't think that buffer is really beneficial. 5% faster data_time and 1ms faster per step.
https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview
2025-01-12 20:25:05 -05:00
chenyu
def90b22f6
EVAL_BS=36 for bert [pr] ( #8576 )
...
3X faster eval compared to BS=6.
green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview
red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview
2025-01-12 09:43:56 -05:00
chenyu
64a917b7eb
remove LAZYCACHE ContextVar [pr] ( #8175 )
...
also removed from resnet latest script
2024-12-11 22:02:52 -05:00
chenyu
3e2430f822
use tqdm tqdm in mlperf training ( #7929 )
...
issue in benchmark dashboard logging; revert back to tqdm's tqdm for now
2024-11-27 21:57:05 -05:00
qazal
9828277c03
view doesn't have buffer, fix the tests [pr] ( #7841 )
...
* view doesn't have buffer, fix the tests [pr]
* need assigns
2024-11-22 20:41:55 +08:00
Francis Lata
90eff347e2
tinytqdm write support ( #6359 )
...
* add write support
* add test
* update test case to compare write outputs
* assert final write output
* flush when using write
* update write logic
* Revert "update write logic"
This reverts commit 5e0e611b46 .
---------
Co-authored-by: chenyu <chenyu@fastmail.com >
2024-10-16 14:51:41 -04:00
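Minimal usage example of the `write` pattern being added, assuming tinygrad's tqdm shim lives at `tinygrad.helpers.tqdm` and mirrors tqdm's interface: print a line without breaking the progress bar.

```python
from tinygrad.helpers import tqdm   # assumption: the shim's import path

for i in tqdm(range(100)):
  if i % 25 == 0:
    tqdm.write(f"checkpoint at step {i}")   # rendered above the bar instead of corrupting it
```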
chenyu
ed1ed9e4ff
bert use BS=72 ( #7015 )
...
memory 131 -> 138
green tflops 201 -> 209
red tflops 160 -> 169
2024-10-12 09:41:56 -04:00
chenyu
36056e0760
update mlperf systems and copy 4.1 to 5.0 ( #7004 )
2024-10-11 16:20:34 -04:00