Commit Graph

179 Commits

Author SHA1 Message Date
Francis Lata
e807cf817d add LR scheduler and the start of training step 2024-11-12 02:48:31 -08:00
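The schedule a commit like this typically wires in is linear warmup followed by decay. A minimal, framework-agnostic sketch of that shape (the constants and the `lr_at` name are hypothetical, not the repo's actual hparams):

```python
def lr_at(step, base_lr=1e-3, warmup=100, total=1000):
    # linear warmup to base_lr, then linear decay to zero; a common
    # MLPerf-style schedule shape (values illustrative only)
    if step < warmup:
        return base_lr * (step + 1) / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))
```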
Francis Lata
50abdc22c8 add proper training loop over the training dataset 2024-11-09 17:45:55 -08:00
Francis Lata
6e3efd4ed6 add validation set test 2024-10-25 22:55:49 -07:00
Francis Lata
65c561a618 update image to be float32 2024-10-25 21:18:34 -07:00
Francis Lata
4b21a8fb8d got dataloader with normalize working 2024-10-25 20:25:07 -07:00
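The float32 and normalize steps from the two commits above usually amount to casting the uint8 image and standardizing with per-channel mean/std. A hedged sketch, assuming numpy HWC images and the commonly used ImageNet constants (the `normalize` helper is hypothetical, not the repo's dataloader code):

```python
import numpy as np

# standard ImageNet per-channel mean/std, widely used for detection backbones
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(img_u8):
    # uint8 HWC image -> float32 in [0, 1], then standardize per channel
    x = img_u8.astype(np.float32) / 255.0
    return (x - MEAN) / STD
```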
Francis Lata
d9d65b9537 cleanup dataloader test and revert shm path 2024-10-19 17:32:58 -07:00
Francis Lata
4bebe61a9c add dataloader + test 2024-10-16 15:38:47 -04:00
Francis Lata
3d857d758e Merge branch 'master' into retinanet_mlperf 2024-10-16 15:36:37 -04:00
Francis Lata
90eff347e2 tinytqdm write support (#6359)
* add write support

* add test

* update test case to compare write outputs

* assert final write output

* flush when using write

* update write logic

* Revert "update write logic"

This reverts commit 5e0e611b46.

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
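The write-support pattern this PR describes (clear the bar line, print the message, flush, leave the bar intact) can be sketched roughly as follows; the class and method names are hypothetical stand-ins, not tinytqdm's actual API:

```python
import sys

class TinyTqdm:
    # minimal sketch of a progress bar with a tqdm.write-style method
    def __init__(self, total):
        self.total, self.n = total, 0

    def update(self, n=1):
        # redraw the bar in place on stderr
        self.n += n
        sys.stderr.write(f"\r{self.n}/{self.total}")
        sys.stderr.flush()

    def write(self, s):
        # clear the current bar line, print the message (flushed so it
        # appears immediately), then redraw the bar below it
        sys.stderr.write("\r\033[K")
        print(s, flush=True)
        self.update(0)
```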
Francis Lata
d5813a3c42 Merge branch 'master' into retinanet_mlperf 2024-10-12 22:04:58 -04:00
chenyu
ed1ed9e4ff bert use BS=72 (#7015)
memory 131 -> 138
green tflops 201 -> 209
red tflops 160 -> 169
2024-10-12 09:41:56 -04:00
Francis Lata
1295a3020f Merge branch 'master' into retinanet_mlperf 2024-10-11 23:08:17 -04:00
Francis Lata
b802f74cee add dataloader 2024-10-11 23:04:21 -04:00
chenyu
36056e0760 update mlperf systems and copy 4.1 to 5.0 (#7004) 2024-10-11 16:20:34 -04:00
chenyu
0e42662f2a log seed at the right place for bert (#7000) 2024-10-11 10:39:40 -04:00
nimlgen
5496a36536 update red mlperf bert readme (#6969) 2024-10-11 13:08:06 +03:00
chenyu
b5546912e2 10% more TRAIN_STEPS for bert (#6971)
got two very close runs, adding more steps as a buffer
2024-10-09 19:21:43 -04:00
chenyu
35cf48659b limit beam param for bert on green (#6966)
seems to mitigate the crash
2024-10-09 11:48:18 -04:00
chenyu
1ff2c98f8a fix logfile name for bert red (#6952) 2024-10-08 05:37:52 -04:00
chenyu
a78c96273a update bert epoch logging (#6940)
* update bert epoch logging

epoch for bert is simply number of examples seen (which is used for RCP check)

* update total steps too

* more changes
2024-10-08 00:34:06 -04:00
chenyu
102dfe5510 back to 2**10 for bert loss scaler (#6934)
getting 2 NaNs with this, reverting back to 2**10
2024-10-07 10:17:21 -04:00
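The "loss scaler" hparam in this commit is static loss scaling for mixed precision: backward runs on a scaled-up loss so small fp16 gradients don't underflow, and gradients are divided back down before the optimizer step; too large a scaler (e.g. 2**13) can overflow to NaN, which matches the revert here. A framework-agnostic toy sketch (the `scaled_grads` helper is hypothetical):

```python
LOSS_SCALER = 2 ** 10  # per the commit, 2**13 produced NaNs

def scaled_grads(compute_grads, loss):
    # backward on the scaled loss keeps tiny fp16 gradients representable...
    grads = compute_grads(loss * LOSS_SCALER)
    # ...then unscale before the optimizer consumes them
    return [g / LOSS_SCALER for g in grads]
```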
chenyu
0cf815a93a bert use BS=66 and update hparams (#6932)
with dropout memory improvement, we can fit BS=66 now. revert back to the hparams in #5891 too
2024-10-07 05:08:27 -04:00
chenyu
718b959349 log epoch start and stop for bert (#6912) 2024-10-06 06:39:46 -04:00
Francis Lata
5dbebf460e continue with dataloader implementation 2024-10-05 23:51:16 -07:00
Francis Lata
8ca848d542 reuse existing prepare_target 2024-10-05 18:21:15 -07:00
Francis Lata
30f6f6a094 add first transform for train set 2024-10-05 18:12:38 -07:00
Francis Lata
12c4259b16 Merge branch 'master' into retinanet_mlperf 2024-10-05 16:22:58 -07:00
chenyu
16c1fa4208 use BEAM=3 for red box bert runs (#6904)
BEAM=4 slightly exceeded the 30-minute setup window
2024-10-05 09:21:12 -04:00
chenyu
0e706227a2 add seed to bert result log filename (#6903)
* add seed to bert result log filename

* different name for different benchmark
2024-10-05 09:15:24 -04:00
chenyu
7391376528 update bert hparams (#6876)
4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview.

loss scaler 2**13->2**10. matched the closest submission, no nan for ~10 runs.

increased lr and total step a bit.

`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
Francis Lata
5bc8230739 start dataloader work with input images 2024-10-02 11:47:37 -07:00
Francis Lata
f81603d983 minor cleanup 2024-10-02 06:57:23 -07:00
Francis Lata
d281e64411 add optimizer 2024-10-02 04:46:08 -07:00
Francis Lata
b8e24b4f4d recursively go through backbone layers to freeze them 2024-10-02 04:21:44 -07:00
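One way such recursive freezing can look in a framework where layers are plain Python objects (as in tinygrad): walk each attribute, flip `requires_grad` off on leaf parameters, and recurse into nested layers. A hedged sketch; the `freeze` helper is hypothetical, not the repo's actual code:

```python
def freeze(module):
    # recursively disable gradients for every parameter reachable
    # from this module's attributes
    for v in vars(module).values():
        if hasattr(v, "requires_grad"):   # leaf tensor/parameter
            v.requires_grad = False
        elif hasattr(v, "__dict__"):      # nested layer: recurse
            freeze(v)
```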
Francis Lata
7432cd7857 Merge branch 'master' into retinanet_mlperf 2024-10-01 19:26:35 -07:00
chenyu
5f77217772 bert default CKPT to 0 (#6840)
not required
2024-10-01 21:55:56 -04:00
chenyu
f59517754e add RESET_STEP in bert to control reset (#6818)
same as resnet
2024-09-30 09:39:04 -04:00
chenyu
494b20e886 bert BS back to 54 (#6791)
60 does not run end to end
2024-09-27 22:16:05 -04:00
chenyu
572d77d1d9 bert script delete eval data after eval (#6790)
fits BS=60, which is 2% faster than 54. also fixed wandb logging params
2024-09-27 20:54:00 -04:00
Francis Lata
35105e769b small cleanup 2024-09-27 13:28:53 -07:00
Francis Lata
a879d19c82 update initializers for model 2024-09-27 13:25:28 -07:00
Francis Lata
52a1ba7fcb Merge branch 'master' into retinanet_mlperf 2024-09-27 08:32:44 -07:00
chenyu
f9c8e144ff chmod +x mlperf bert script for red (#6789)
also disabled raising the power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issues on red
2024-09-27 11:27:32 -04:00
Francis Lata
9fbc3f1fc7 Merge branch 'master' into retinanet_mlperf 2024-09-27 08:16:24 -07:00
Francis Lata
d3a387be63 [MLPerf] Prepare openimages dataset script (#6747)
* prepare openimages for MLPerf

* cleanup

* fix issue when clearing jit_cache on retinanet eval

* revert pandas specific changes
2024-09-27 11:13:56 -04:00
Francis Lata
211b04ba2c Merge branch 'master' into retinanet_mlperf 2024-09-26 15:03:00 -07:00
chenyu
bea7ed5986 add RUNMLPERF=1 to bert dev_run.sh (#6775)
already set in run_and_time.sh, need RUNMLPERF=1 for it to load real data
2024-09-26 11:00:49 -04:00
chenyu
12de203a43 add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769)
* update bert BEAM params

copied from resnet to start with

* just IGNORE_JIT_FIRST_BEAM
2024-09-26 05:38:24 -04:00
Francis Lata
ea05de325c Merge branch 'master' into retinanet_mlperf 2024-09-26 02:20:28 -07:00
chenyu
5a5fbfa1eb smaller bert script change (#6768)
only WANDB and RUNMLPERF order. BENCHMARK and BEAM will be done differently
2024-09-26 04:54:28 -04:00