Commit Graph

197 Commits

chenyu
6b3480ec70 update mi300x bert hparams (#9716)
* update mi300x bert hparams

borrowed from a previous submission that also did BS=1024

* update
2025-04-03 22:30:00 -04:00
chenyu
a6fec2f5ae dev_run for bert on mi300x (#9706) 2025-04-02 21:12:55 -04:00
chenyu
f7cb2e8da3 bert dev_beam for mi300x box (#9648)
* bert dev_beam for mi300x box

* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
chenyu
d8d7ac1bb1 fix bert free_intermediates (#9633)
fix when only run eval `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
2025-03-30 22:42:52 -04:00
chenyu
a187dfd3df bert BEAM_UOPS_MAX 3000->4000 (#9603)
more stable for the final step time

green 410ms (master) -> 397ms (BEAM=4) -> 392ms (this)
red 561ms (master) -> 550ms (this)
2025-03-27 11:58:47 -04:00
chenyu
62888614f6 lower bert eval bs to 24 (#9590)
oom during eval
2025-03-26 21:25:23 -04:00
chenyu
c965f4c20b update bert config (#9555)
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
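A minimal sketch of what the two losses typically compute, following the standard RetinaNet formulation; the signatures, reductions, and numerical-stability details of the actual tinygrad implementation are not assumed here:

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
  # focal loss on sigmoid outputs: down-weight easy examples by (1 - p_t)^gamma
  p = logits.sigmoid()
  ce = -(targets * p.log() + (1 - targets) * (1 - p).log())   # per-element BCE (unstable form, fine for a sketch)
  p_t = targets * p + (1 - targets) * (1 - p)
  loss = ce * ((1 - p_t) ** gamma)
  if alpha >= 0: loss = loss * (targets * alpha + (1 - targets) * (1 - alpha))
  return loss.sum()   # reduction is illustrative

def l1_loss(pred: Tensor, target: Tensor) -> Tensor:
  # mean absolute error on box regression targets; reduction is illustrative
  return (pred - target).abs().mean()
```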
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
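The dataloader work touches `generate_anchors`. A hypothetical numpy sketch of what anchor generation for a single feature-pyramid level conceptually produces (the function name, base size, scales, and ratios here are illustrative, not the repo's exact code):

```python
import numpy as np

def generate_anchors_level(feat_h: int, feat_w: int, stride: int,
                           scales=(1.0, 2**(1/3), 2**(2/3)), ratios=(0.5, 1.0, 2.0)) -> np.ndarray:
  # one (x1, y1, x2, y2) box per feature-map cell, per scale, per aspect ratio
  base = 4 * stride  # base anchor size for this pyramid level
  anchors = []
  for y in range(feat_h):
    for x in range(feat_w):
      cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
      for s in scales:
        for r in ratios:
          w, h = base * s / np.sqrt(r), base * s * np.sqrt(r)
          anchors.append([cx - w/2, cy - h/2, cx + w/2, cy + h/2])
  return np.array(anchors, dtype=np.float32)  # shape: (feat_h*feat_w*len(scales)*len(ratios), 4)
```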
chenyu
f53be010d7 lower bert learning rate (#9481)
slightly better. first sub 3hr run https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d bert lower learning rate and total steps (#9466)
closer to the other submission with BS=240. converged with 10% fewer epochs
2025-03-16 17:21:20 -04:00
chenyu
4992958dae update bert beam params (#9423)
BEAM_MIN_PROGRESS=5 for setup speed
2025-03-12 13:00:41 -04:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
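The two entries above accumulate the bert accuracy sum in half and rename the reduction's accumulator argument from `acc_dtype` to `dtype` so it reads like numpy and torch. A small sketch of the resulting pattern (assuming `Tensor.sum` takes the accumulator dtype via `dtype=` after the rename):

```python
import numpy as np
from tinygrad import Tensor, dtypes

preds, labels = Tensor([1, 2, 0, 2]), Tensor([1, 2, 2, 2])
correct = (preds == labels).sum(dtype=dtypes.half)        # accumulate the correct count in half
print(correct.item())                                     # 3.0
print(np.sum(np.array([1, 1, 0, 1]), dtype=np.float16))   # numpy spells the same thing the same way
```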
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
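Corealizing means scheduling and realizing all of a step's outputs in one call instead of one at a time, which is where the 1% step-time win comes from. A minimal sketch (assuming `Tensor.realize` accepts extra tensors to realize together, as recent tinygrad does):

```python
from tinygrad import Tensor

loss, lr, global_norm = Tensor.rand(1), Tensor.rand(1), Tensor.rand(1)
# one scheduling pass for every step output instead of three separate realizes
loss.realize(lr, global_norm)
print(loss.item(), lr.item(), global_norm.item())
```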
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
qazal
845814f396 revert buffer_view change (#9311)
* Revert "BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298)"

This reverts commit 3210b656b6.

* Revert "substitute ast from kernel op [pr] (#9293)"

This reverts commit 5a9c788ae6.
2025-03-01 11:00:12 +01:00
qazal
3210b656b6 BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298) 2025-02-28 12:15:04 +02:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard, though
2025-02-25 09:05:11 -05:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 tflops on green
2025-02-24 22:47:35 -05:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
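Keeping token ids in int32 avoids the precision worry the entry above mentions: once ids are stored as floats (especially half), large values stop being exactly representable. A minimal sketch of keeping them integral (assuming the usual `Tensor`/`dtypes` API):

```python
import numpy as np
from tinygrad import Tensor, dtypes

np_ids = np.array([[101, 7592, 2088, 102]], dtype=np.int32)  # tokenized input ids from the dataset
input_ids = Tensor(np_ids, dtype=dtypes.int32)               # keep int32 instead of casting to float
print(input_ids.dtype)                                       # stays an integer dtype
```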
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms; it was spending time scheduling the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard it into GPUS
2025-02-18 10:36:54 -05:00
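The pattern in the entry above: build the numpy-backed batch on a CPU-style device, then do the multi-GPU shard inside the jitted step so its scheduling cost is paid once at capture time rather than in python every step. A rough sketch (device names, batch shape, and the step body are illustrative):

```python
import numpy as np
from tinygrad import Tensor, TinyJit

GPUS = tuple(f"GPU:{i}" for i in range(2))   # illustrative device list

@TinyJit
def train_step(input_ids: Tensor) -> Tensor:
  x = input_ids.shard(GPUS, axis=0)          # shard inside the jit, not in the python loop
  return x.float().mean().realize()          # stand-in for the real forward/backward/optimizer

batch = Tensor(np.zeros((8, 128), dtype=np.int32), device="CPU")  # create on CPU since the data is numpy
out = train_step(batch)
```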
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer ooms. I think the buffers of the random init weights caused the oom.
2025-02-14 08:40:41 -05:00
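A hedged sketch of the idea in the entry above, zeroing parameters for the BENCHMARK init run so the random-init buffers don't have to stay live; `get_parameters` is tinygrad's state helper, while the function name and where it is gated on BENCHMARK are assumptions:

```python
from tinygrad import Tensor
from tinygrad.nn.state import get_parameters

def zero_weights_for_benchmark(model):
  # overwrite every parameter with zeros and realize, so no random-init buffers are kept around
  for p in get_parameters(model):
    p.assign(Tensor.zeros(*p.shape, dtype=p.dtype)).realize()
```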
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
chenyu
7b5ac2c15e free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
chenyu
c7ca7959e6 set DISABLE_DROPOUT=1 in bert script for now (#8799) 2025-01-29 10:51:29 -05:00
chenyu
c99ae81f63 update default resnet LOSS_SCALER to 256 [pr] (#8774) 2025-01-27 16:59:05 -05:00
chenyu
af65331b76 update beam params for bert green [pr] (#8726)
increase BEAM_UPCAST_MAX and BEAM_LOCAL_MAX to the defaults and match red. 3% faster step
2025-01-22 22:00:05 -05:00
chenyu
9a9079118e envvar BERT_LAYERS [pr] (#8709)
default is 24 for large
2025-01-21 22:49:19 -05:00
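A minimal sketch of the env-var override (using tinygrad's `getenv` helper; how the value is threaded into the model init is not shown):

```python
from tinygrad.helpers import getenv

BERT_LAYERS = getenv("BERT_LAYERS", 24)  # default 24, the layer count of BERT-large
# e.g. BERT_LAYERS=4 MODEL=bert python3 examples/mlperf/model_train.py for a quicker debug run
```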
chenyu
9f6d545a16 bert log global_norm in training step [pr] (#8708)
* bert log global_norm in training step [pr]

and minor cleanups

* .item()
2025-01-21 20:36:27 -05:00
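The global norm logged above is the usual square root of the summed squared gradient elements across all parameters, pulled back to python with `.item()` for logging. A sketch (the helper name and the wandb call in the comment are illustrative):

```python
from tinygrad import Tensor

def global_norm(grads: list[Tensor]) -> Tensor:
  # sqrt of the sum of squared elements over every gradient tensor
  return sum((g.float() * g.float()).sum() for g in grads).sqrt()

# logged after the jitted step, e.g.:
#   wandb.log({"global_norm": global_norm([p.grad for p in params]).item()})
```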
chenyu
1e283c33d3 remove realize in bert model init [pr] (#8707) 2025-01-21 14:11:03 -05:00
chenyu
930728c069 bert BS 72->66 [pr] (#8621)
72 does not fit now
2025-01-14 18:41:41 -05:00
chenyu
994944920b simpler batch_load_train_bert [pr] (#8582)
don't think that buffer was really beneficial; removing it gives 5% faster data_time and 1ms faster per step.
https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview
2025-01-12 20:25:05 -05:00
chenyu
def90b22f6 EVAL_BS=36 for bert [pr] (#8576)
3X faster eval compared to BS=6.
green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview
red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview
2025-01-12 09:43:56 -05:00
chenyu
64a917b7eb remove LAZYCACHE ContextVar [pr] (#8175)
also removed from resnet latest script
2024-12-11 22:02:52 -05:00
chenyu
3e2430f822 use tqdm tqdm in mlperf training (#7929)
issue in benchmark dashboard logging; revert back to tqdm's tqdm for now
2024-11-27 21:57:05 -05:00
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
Francis Lata
90eff347e2 tinytqdm write support (#6359)
* add write support

* add test

* update test case to compare write outputs

* assert final write output

* flush when using write

* update write logic

* Revert "update write logic"

This reverts commit 5e0e611b46.

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
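The write support mirrors tqdm's pattern of printing a line without corrupting the in-place progress bar, flushing on each write. A usage sketch shown with the upstream tqdm API, which tinytqdm aims to match (the tinytqdm import path itself is not assumed here):

```python
import time
from tqdm import tqdm

for i in tqdm(range(5), desc="train"):
  time.sleep(0.1)
  if i == 2:
    # printed above the bar and flushed, instead of mangling the \r-redrawn progress line
    tqdm.write(f"step {i}: logging an eval result")
```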
chenyu
ed1ed9e4ff bert use BS=72 (#7015)
memory 131 -> 138
green tflops 201 -> 209
red tflops 160 -> 169
2024-10-12 09:41:56 -04:00
chenyu
36056e0760 update mlperf systems and copy 4.1 to 5.0 (#7004) 2024-10-11 16:20:34 -04:00