Francis Lata
27ec792c19
check for CKPT when target metric is reached before saving
2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870
hotfix: log metric and move target metric check outside of CKPT
2025-03-01 04:31:00 -08:00
Francis Lata
974309862d
update dataloader seed
2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474
minor cleanups
2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b
more typing fixes
2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2
undo more changes
2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834
undo changes
2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c
update seeding on dataloader and the start of training script
2025-02-28 14:58:28 +00:00
Francis Lata
87bfa77f4a
some typing cleanups
2025-02-28 14:47:29 +00:00
Francis Lata
7cb226d757
Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e006ae24ea
Merge branch 'master' into retinanet_mlperf
2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a
test for device mismatch [pr] (#9250)
* test for device mismatch [pr]
* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e
RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but runs fine in a real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197
Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
chenyu
6610ad58ab
hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard though
2025-02-25 09:05:11 -05:00
Francis Lata
ddf1f0d5dd
add nan check during training
2025-02-25 10:53:31 +00:00
Francis Lata
8737020d75
add JIT reset support
2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121
Merge branch 'master' into retinanet_mlperf
2025-02-25 10:32:34 +00:00
chenyu
8c7be428e5
update bert BS to 78 (#9236)
fits 78 now. about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
Francis Lata
2c3417dfce
Merge branch 'master' into retinanet_mlperf
2025-02-23 21:23:28 +00:00
Francis Lata
60c13c2932
update loss calculation for regression head and some cleanups
2025-02-23 21:22:33 +00:00
chenyu
2e7c2780a9
CLANG -> CPU (#9189)
2025-02-20 18:03:09 -05:00
Francis Lata
7dba815c47
fix train script
2025-02-19 20:43:02 +00:00
Francis Lata
fc36f09b1e
no need to return loaded keys for resnet
2025-02-19 20:35:03 +00:00
chenyu
3b37cc898b
add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
Francis Lata
41378e74a6
model init, hyperparam, and data preprocessing updates
2025-02-19 18:47:06 +00:00
chenyu
975c318dbc
bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221
put bert data shard inside jit (#9160)
python time 45ms -> 9ms; it was spending time scheduling the shard
also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard it onto GPUS
2025-02-18 10:36:54 -05:00
chenyu
5dc1257ce0
clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96
increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
Francis Lata
cfa1c2d50e
hyperparameter adjustments and cleanups
2025-02-14 17:53:06 +00:00
chenyu
b58e7b1898
zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of random init weights caused the OOM.
2025-02-14 08:40:41 -05:00
Francis Lata
caf9b2baa2
Merge branch 'master' into retinanet_mlperf
2025-02-14 06:28:37 +00:00
chenyu
9e91898941
bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
Francis Lata
5f26692068
remove frozen layers from optimizer's params
2025-02-13 06:36:13 +00:00
Francis Lata
ff301f0be9
minor cleanups
2025-02-12 16:03:38 +00:00
Francis Lata
f61b10450e
Merge branch 'master' into retinanet_mlperf
2025-02-12 15:47:05 +00:00
chenyu
7b5ac2c15e
free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
Francis Lata
37aab697b8
adjust LR to be the ratio of the batch size
2025-02-07 19:46:54 +00:00
Francis Lata
041481f910
Merge branch 'master' into retinanet_mlperf
2025-02-07 15:28:29 +00:00
chenyu
a092b6395d
Tuple -> tuple, List -> list [pr] (#8936)
2025-02-06 14:21:19 -05:00
Francis Lata
83a2b84f55
add validation loop to training script
2025-02-03 19:54:22 +00:00
Francis Lata
f02cce0049
remove unnecessary targets from validation dataloader
2025-02-03 19:15:30 +00:00
Francis Lata
932cf4b7f2
fix img_ids repeating their values
2025-02-02 19:21:46 +00:00
Francis Lata
17ae62d741
cleanup boxes and labels in dataloader
2025-02-02 18:51:14 +00:00
Francis Lata
594d7126d8
return validation targets in dataloader
2025-02-02 06:50:21 -08:00
Francis Lata
811893a3bd
cleanup train and validation dataloader
2025-01-31 16:59:37 -08:00
Francis Lata
6d70035c22
put back parallel testing and remove img_ids Tensor from dataloader
2025-01-31 16:13:02 -08:00
Francis Lata
9938a1aabc
remove optional disk tensors in dataloader
2025-01-31 09:07:39 -08:00
Francis Lata
80fa9dd731
fix issue with realized on dataloader
2025-01-31 08:31:25 -08:00