Commit Graph

283 Commits

Author SHA1 Message Date
Francis Lata
27ec792c19 check for CKPT when target metric is reached before saving 2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870 hotfix: log metric and move target metric check outside of CKPT 2025-03-01 04:31:00 -08:00
Francis Lata
974309862d update dataloader seed 2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474 minor cleanups 2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b more typing fixes 2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2 undo more changes 2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834 undo changes 2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c update seeding on dataloader and the start of training script 2025-02-28 14:58:28 +00:00
Francis Lata
87bfa77f4a some typing cleanups 2025-02-28 14:47:29 +00:00
Francis Lata
7cb226d757 Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e006ae24ea Merge branch 'master' into retinanet_mlperf 2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it but runs fine in a real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197 Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard though
2025-02-25 09:05:11 -05:00
Francis Lata
ddf1f0d5dd add nan check during training 2025-02-25 10:53:31 +00:00
Francis Lata
8737020d75 add JIT reset support 2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121 Merge branch 'master' into retinanet_mlperf 2025-02-25 10:32:34 +00:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
Francis Lata
2c3417dfce Merge branch 'master' into retinanet_mlperf 2025-02-23 21:23:28 +00:00
Francis Lata
60c13c2932 update loss calculation for regression head and some cleanups 2025-02-23 21:22:33 +00:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
Francis Lata
7dba815c47 fix train script 2025-02-19 20:43:02 +00:00
Francis Lata
fc36f09b1e no need to return loaded keys for resnet 2025-02-19 20:35:03 +00:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
Francis Lata
41378e74a6 model init, hyperparam, and data preprocessing updates 2025-02-19 18:47:06 +00:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms, it was spending time to schedule the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard onto GPUS
2025-02-18 10:36:54 -05:00
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
Francis Lata
cfa1c2d50e hyperparameter adjustments and cleanups 2025-02-14 17:53:06 +00:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
Francis Lata
caf9b2baa2 Merge branch 'master' into retinanet_mlperf 2025-02-14 06:28:37 +00:00
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
Francis Lata
5f26692068 remove frozen layers from optimizer's params 2025-02-13 06:36:13 +00:00
Francis Lata
ff301f0be9 minor cleanups 2025-02-12 16:03:38 +00:00
Francis Lata
f61b10450e Merge branch 'master' into retinanet_mlperf 2025-02-12 15:47:05 +00:00
chenyu
7b5ac2c15e free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
Francis Lata
37aab697b8 adjust LR to be the ratio of the batch size 2025-02-07 19:46:54 +00:00
Francis Lata
041481f910 Merge branch 'master' into retinanet_mlperf 2025-02-07 15:28:29 +00:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
Francis Lata
83a2b84f55 add validation loop to training script 2025-02-03 19:54:22 +00:00
Francis Lata
f02cce0049 remove unnecessary targets from validation dataloader 2025-02-03 19:15:30 +00:00
Francis Lata
932cf4b7f2 fix img_ids repeating its values 2025-02-02 19:21:46 +00:00
Francis Lata
17ae62d741 cleanup boxes and labels in dataloader 2025-02-02 18:51:14 +00:00
Francis Lata
594d7126d8 return validation targets in dataloader 2025-02-02 06:50:21 -08:00
Francis Lata
811893a3bd cleanup train and validation dataloader 2025-01-31 16:59:37 -08:00
Francis Lata
6d70035c22 put back parallel testing and remove img_ids Tensor from dataloader 2025-01-31 16:13:02 -08:00
Francis Lata
9938a1aabc remove optional disk tensors in dataloader 2025-01-31 09:07:39 -08:00
Francis Lata
80fa9dd731 fix issue with realized on dataloader 2025-01-31 08:31:25 -08:00