Francis Lata
e807cf817d
add LR scheduler and the start of training step
2024-11-12 02:48:31 -08:00
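The two commits above are where the retinanet branch grows a real training step. A minimal sketch of a linear-warmup/cosine-decay schedule driving a tinygrad step; model, loss_fn, train_loader, and all hyperparameter values are placeholders, not the branch's actual code:

    import math
    from tinygrad import Tensor
    from tinygrad.nn.optim import SGD
    from tinygrad.nn.state import get_parameters

    def get_lr(step, base_lr=1e-2, warmup=1000, total=100_000):
      # linear warmup, then cosine decay to zero
      if step < warmup: return base_lr * (step + 1) / warmup
      t = (step - warmup) / max(1, total - warmup)
      return base_lr * 0.5 * (1 + math.cos(math.pi * t))

    opt = SGD(get_parameters(model), lr=get_lr(0), momentum=0.9)
    with Tensor.train():
      for step, (x, y) in enumerate(train_loader):
        opt.lr.assign(Tensor([get_lr(step)]))  # tinygrad schedulers mutate opt.lr in place
        opt.zero_grad()
        loss = loss_fn(model(x), y)            # placeholder loss
        loss.backward()
        opt.step()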
Francis Lata
50abdc22c8
add proper training loop over the training dataset
2024-11-09 17:45:55 -08:00
Francis Lata
6e3efd4ed6
add validation set test
2024-10-25 22:55:49 -07:00
Francis Lata
65c561a618
update image to be float32
2024-10-25 21:18:34 -07:00
Francis Lata
4b21a8fb8d
got dataloader with normalize working
2024-10-25 20:25:07 -07:00
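These two dataloader commits (cast images to float32, then normalize) are the standard preprocessing step. A hedged sketch assuming the usual ImageNet channel statistics; the branch's actual constants may differ:

    import numpy as np
    from tinygrad import Tensor, dtypes

    MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 3, 1, 1)
    STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 3, 1, 1)

    def normalize(batch: np.ndarray) -> Tensor:
      # uint8 NHWC images -> float32 NCHW in [0, 1], then channel-wise normalize
      x = batch.astype(np.float32).transpose(0, 3, 1, 2) / 255.0
      return Tensor((x - MEAN) / STD, dtype=dtypes.float32)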
Francis Lata
d9d65b9537
cleanup dataloader test and revert shm path
2024-10-19 17:32:58 -07:00
Francis Lata
4bebe61a9c
add dataloader + test
2024-10-16 15:38:47 -04:00
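Stripped of the shared-memory plumbing the commit above touches, the core of a dataloader is shuffle-and-batch. A toy sketch; files and decode_fn are placeholders:

    import random
    import numpy as np

    def batches(files, bs, decode_fn, shuffle=True):
      order = list(range(len(files)))
      if shuffle: random.shuffle(order)
      for i in range(0, len(order) - bs + 1, bs):
        # decode each image (plus its targets, in the real loader), then stack
        yield np.stack([decode_fn(files[j]) for j in order[i:i + bs]])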
Francis Lata
3d857d758e
Merge branch 'master' into retinanet_mlperf
2024-10-16 15:36:37 -04:00
Francis Lata
90eff347e2
tinytqdm write support (#6359)
...
* add write support
* add test
* update test case to compare write outputs
* assert final write output
* flush when using write
* update write logic
* Revert "update write logic"
This reverts commit 5e0e611b46.
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
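The write support added here mirrors tqdm.write: erase the in-place progress line, print the message, flush, and let the next update redraw the bar. Roughly, and not the actual tinytqdm internals:

    import sys

    def write(msg, file=sys.stderr):
      file.write("\r\033[K")  # return to column 0 and erase the progress line
      file.write(msg + "\n")
      file.flush()            # per the commit: flush when using write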
Francis Lata
d5813a3c42
Merge branch 'master' into retinanet_mlperf
2024-10-12 22:04:58 -04:00
chenyu
ed1ed9e4ff
bert use BS=72 (#7015)
...
memory 131 -> 138
green tflops 201 -> 209
red tflops 160 -> 169
2024-10-12 09:41:56 -04:00
Francis Lata
1295a3020f
Merge branch 'master' into retinanet_mlperf
2024-10-11 23:08:17 -04:00
Francis Lata
b802f74cee
add dataloader
2024-10-11 23:04:21 -04:00
chenyu
36056e0760
update mlperf systems and copy 4.1 to 5.0 (#7004)
2024-10-11 16:20:34 -04:00
chenyu
0e42662f2a
log seed at the right place for bert (#7000)
2024-10-11 10:39:40 -04:00
nimlgen
5496a36536
update red mlperf bert readme (#6969)
2024-10-11 13:08:06 +03:00
chenyu
b5546912e2
10% more TRAIN_STEPS for bert (#6971)
...
got two very close runs, adding more steps as a buffer
2024-10-09 19:21:43 -04:00
chenyu
35cf48659b
limit beam param for bert on green (#6966)
...
seems to mitigate the crash
2024-10-09 11:48:18 -04:00
chenyu
1ff2c98f8a
fix logfile name for bert red (#6952)
2024-10-08 05:37:52 -04:00
chenyu
a78c96273a
update bert epoch logging (#6940)
...
* update bert epoch logging
epoch for bert is simply the number of examples seen (which is used for the RCP check)
* update total steps too
* more changes
2024-10-08 00:34:06 -04:00
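Per the commit body, bert's "epoch" is just the running count of examples seen. A hedged sketch with the mlperf_logging package; the script's actual call sites may differ:

    from mlperf_logging import mllog

    mllogger = mllog.get_mllogger()

    def log_epoch(examples_seen, start):
      # "epoch_num" carries examples seen, which the RCP check consumes
      key = mllog.constants.EPOCH_START if start else mllog.constants.EPOCH_STOP
      fn = mllogger.start if start else mllogger.end
      fn(key=key, metadata={"epoch_num": examples_seen})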
chenyu
102dfe5510
back to 2**10 for bert loss scaler (#6934)
...
got 2 NaNs with this, reverting back to 2**10
2024-10-07 10:17:21 -04:00
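Static loss scaling at 2**10 multiplies the loss before backward and divides the gradients back out before the step, keeping half-precision gradients from underflowing without risking the overflow a larger scaler hits. A minimal sketch, reusing the placeholder model, opt, and loss_fn from the loop sketched near the top:

    LOSS_SCALER = 2**10  # per these commits, the larger scaler produced NaNs; 2**10 did not

    loss = loss_fn(model(x), y)
    (loss * LOSS_SCALER).backward()
    for p in opt.params:
      p.grad = p.grad / LOSS_SCALER  # unscale before the optimizer step
    opt.step()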
chenyu
0cf815a93a
bert use BS=66 and update hparams (#6932)
...
with the dropout memory improvement, we can fit BS=66 now. also reverted back to the hparams in #5891
2024-10-07 05:08:27 -04:00
chenyu
718b959349
log epoch start and stop for bert (#6912)
2024-10-06 06:39:46 -04:00
Francis Lata
5dbebf460e
continue with dataloader implementation
2024-10-05 23:51:16 -07:00
Francis Lata
8ca848d542
reuse existing prepare_target
2024-10-05 18:21:15 -07:00
Francis Lata
30f6f6a094
add first transform for train set
2024-10-05 18:12:38 -07:00
Francis Lata
12c4259b16
Merge branch 'master' into retinanet_mlperf
2024-10-05 16:22:58 -07:00
chenyu
16c1fa4208
use BEAM=3 for red box bert runs (#6904)
...
BEAM=4 slightly exceeded the 30-minute setup limit
2024-10-05 09:21:12 -04:00
chenyu
0e706227a2
add seed to bert result log filename (#6903)
...
* add seed to bert result log filename
* different name for different benchmark
2024-10-05 09:15:24 -04:00
chenyu
7391376528
update bert hparams (#6876)
...
4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview.
loss scaler 2**13 -> 2**10, matching the closest submission; no NaN in ~10 runs.
increased lr and total steps a bit.
`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
Francis Lata
5bc8230739
start dataloader work with input images
2024-10-02 11:47:37 -07:00
Francis Lata
f81603d983
minor cleanup
2024-10-02 06:57:23 -07:00
Francis Lata
d281e64411
add optimizer
2024-10-02 04:46:08 -07:00
Francis Lata
b8e24b4f4d
recursively go through backbone layers to freeze them
2024-10-02 04:21:44 -07:00
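Freezing the backbone recursively reduces to walking every tensor under it and turning off gradients; tinygrad's get_state_dict already recurses through nested layers. A sketch where model.backbone is a hypothetical attribute name:

    from tinygrad.nn.state import get_state_dict

    def freeze(module):
      # get_state_dict walks nested layers, so this reaches every parameter
      for _, tensor in get_state_dict(module).items():
        tensor.requires_grad = False

    freeze(model.backbone)  # frozen params are then skipped when building the optimizer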
Francis Lata
7432cd7857
Merge branch 'master' into retinanet_mlperf
2024-10-01 19:26:35 -07:00
chenyu
5f77217772
bert default CKPT to 0 (#6840)
...
not required
2024-10-01 21:55:56 -04:00
chenyu
f59517754e
add RESET_STEP in bert to control reset (#6818)
...
same as resnet
2024-09-30 09:39:04 -04:00
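RESET_STEP follows the env-var pattern these scripts use throughout, read via tinygrad's getenv helper. Roughly, with load_checkpoint_step as a placeholder:

    from tinygrad.helpers import getenv

    RESET_STEP = getenv("RESET_STEP", 1)
    # when set, training restarts from step 0 instead of resuming a checkpoint
    start_step = 0 if RESET_STEP else load_checkpoint_step()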
chenyu
494b20e886
bert BS back to 54 (#6791)
...
60 does not run end to end
2024-09-27 22:16:05 -04:00
chenyu
572d77d1d9
bert script delete eval data after eval (#6790)
...
fits BS=60, which is 2% faster than 54. also fixed wandb logging params
2024-09-27 20:54:00 -04:00
Francis Lata
35105e769b
small cleanup
2024-09-27 13:28:53 -07:00
Francis Lata
a879d19c82
update initializers for model
2024-09-27 13:25:28 -07:00
Francis Lata
52a1ba7fcb
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:32:44 -07:00
chenyu
f9c8e144ff
chmod +x mlperf bert script for red (#6789)
...
also disabled raising the power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issues on red
2024-09-27 11:27:32 -04:00
Francis Lata
9fbc3f1fc7
Merge branch 'master' into retinanet_mlperf
2024-09-27 08:16:24 -07:00
Francis Lata
d3a387be63
[MLPerf] Prepare openimages dataset script (#6747)
...
* prepare openimages for MLPerf
* cleanup
* fix issue when clearing jit_cache on retinanet eval
* revert pandas specific changes
2024-09-27 11:13:56 -04:00
Francis Lata
211b04ba2c
Merge branch 'master' into retinanet_mlperf
2024-09-26 15:03:00 -07:00
chenyu
bea7ed5986
add RUNMLPERF=1 to bert dev_run.sh (#6775)
...
it's already set in run_and_time.sh; RUNMLPERF=1 is needed for it to load real data
2024-09-26 11:00:49 -04:00
chenyu
12de203a43
add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769)
...
* update bert BEAM params
copied from resnet to start with
* just IGNORE_JIT_FIRST_BEAM
2024-09-26 05:38:24 -04:00
Francis Lata
ea05de325c
Merge branch 'master' into retinanet_mlperf
2024-09-26 02:20:28 -07:00
chenyu
5a5fbfa1eb
smaller bert script change (#6768)
...
only reorders WANDB and RUNMLPERF. BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00