Commit Graph

62 Commits

Author SHA1 Message Date
chenyu
5a5fbfa1eb smaller bert script change (#6768)
only reorders WANDB and RUNMLPERF; BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss (see the loss sketch after this commit)

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + clean up various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update default batch size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is run and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
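For reference, the DICE cross entropy loss added in this commit combines a soft Dice term with per-voxel cross entropy. Below is a minimal NumPy sketch of that combination; the function names, epsilon, and the equal 0.5/0.5 weighting are illustrative assumptions, not the repo's actual implementation.

```python
import numpy as np

def softmax(x, axis=1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, label, eps=1e-6):
    # logits: (N, C, D, H, W) raw scores, label: (N, D, H, W) integer class ids
    n_classes = logits.shape[1]
    pred = softmax(logits, axis=1)
    one_hot = np.eye(n_classes)[label].transpose(0, 4, 1, 2, 3)   # (N, C, D, H, W)
    spatial = (2, 3, 4)
    inter = (pred * one_hot).sum(axis=spatial)
    denom = pred.sum(axis=spatial) + one_hot.sum(axis=spatial)
    dice_loss = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    ce_loss = -(one_hot * np.log(pred + eps)).sum(axis=1).mean()
    return 0.5 * dice_loss + 0.5 * ce_loss
```

The DICE score used on the eval side is typically the same per-class ratio without the `1 -` and cross-entropy parts.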
Elias Wahl
c9b4602854 no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl
c9862e17d4 MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
Elias Wahl
937bf5fe12 better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
Elias Wahl
73bddc44f6 Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
Elias Wahl
e267f3161d Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
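The MLLogger here wraps the mlperf_logging package's mllog API. A rough usage sketch of that API follows; the keys, ordering, and placeholder values are illustrative rather than this repo's exact calls.

```python
from mlperf_logging import mllog
from mlperf_logging.mllog import constants

mllog.config(filename="bert.log")
mllogger = mllog.get_mllogger()

mllogger.event(key=constants.SUBMISSION_BENCHMARK, value="bert")
mllogger.start(key=constants.INIT_START)
# ... build the model and dataloaders ...
mllogger.end(key=constants.INIT_STOP)

mllogger.start(key=constants.RUN_START)
train_steps, eval_freq = 100, 10           # placeholder values
for step in range(1, train_steps + 1):     # the commit above starts counting at step 1
    # ... run one training step ...
    if step % eval_freq == 0:
        mllogger.event(key=constants.EVAL_ACCURACY, value=0.0,
                       metadata={"epoch_num": step})
mllogger.end(key=constants.RUN_STOP, metadata={"status": "success"})
```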
Elias Wahl
f31ef11537 Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl
7bfa9101c0 Float in scaled dot product attention (#4985)
* Monkeypatch scaled-dot-product-attention

* Use dot instead of matmul

* new api

* imports

* least_upper_dtype
2024-06-18 08:16:41 -04:00
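The idea in this commit is to keep the softmax inside scaled dot-product attention in a wider float dtype while activations stay in half. A framework-agnostic NumPy sketch of that gist (not the actual monkeypatch, which uses tinygrad's dot and least_upper_dtype):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq, head_dim), possibly float16
    scores = (q @ k.swapaxes(-1, -2)) / np.sqrt(q.shape[-1])
    scores = scores.astype(np.float32)                 # do the softmax math in float
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights.astype(q.dtype) @ v                 # cast back before the value matmul
```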
Elias Wahl
d2e3c391e8 Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Elias Wahl
e576aca044 Disable dropout (#4837) 2024-06-04 18:57:26 -04:00
Elias Wahl
bb248a0dd1 Optional half matmul (#4835)
* half linear

* move weight cast back

* oops

* matmul dtype var

* todo comment
2024-06-04 17:53:41 -04:00
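A small sketch of what an optional half matmul for a linear layer looks like in tinygrad style; the HALF_LINEAR flag name and the transpose convention are assumptions, not necessarily what this commit ships.

```python
from tinygrad import Tensor, dtypes
from tinygrad.helpers import getenv

HALF_LINEAR = getenv("HALF_LINEAR", 0)  # hypothetical flag name

def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor:
    # keep master weights in float, but optionally run the matmul itself in half
    if HALF_LINEAR:
        x, weight = x.cast(dtypes.half), weight.cast(dtypes.half)
    return x.linear(weight.transpose(), bias)
```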
Elias Wahl
04e237328b Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
Elias Wahl
c4b0acf095 Global norm + small changes (#4749)
* norm

* no empty

* default loss scaler in float
2024-05-27 18:35:27 -04:00
Elias Wahl
acc0039cfc Resume fix + scheduler for non weight decay params (#4679)
* move ckpt dir

* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
Elias Wahl
993091adfa loss scaler + nan fixes (#4661) 2024-05-20 17:08:35 -04:00
chenyu
bed70b130c mlperf bert getenv-able EVAL_STEP_FREQ (#4534) 2024-05-11 14:36:56 -04:00
chenyu
04a4980a51 touchup bert script (#4531)
small adjustments: remove a duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00
chenyu
b00b6b16f0 fix TRAIN_BEAM and Tensor.training for mlperf bert (#4525)
also hard-coded the bert model config instead of looking it up from a file
2024-05-11 00:18:36 -04:00
chenyu
b399d98e41 fix resnet eval (#4507) 2024-05-10 00:49:00 -04:00
chenyu
0e8aa0e288 use fake data in beam searching resnet (#4504) 2024-05-09 23:43:50 -04:00
chenyu
047c7f3e5b polish resnet mlperf logging (#4490)
don't include the final checkpoint save time in the run time, plus some cosmetic ordering changes
2024-05-09 13:04:24 -04:00
chenyu
d78e159aa3 resnet logging move RUN_START to start of the script (#4488) 2024-05-09 12:32:32 -04:00
chenyu
1f6bf9d2f7 real diskcache_clear in model_train resnet (#4445)
clear the cache if INITMLPERF is set or when running run_and_time. dev_beam and dev_run do not clear the cache
2024-05-08 19:06:09 -04:00
chenyu
1b4645bea6 hotfix resnet move init_start to start of the script (#4481) 2024-05-08 19:03:52 -04:00
chenyu
db7e15c46f hotfix resnet only log epoch start with RUNMLPERF (#4477) 2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d mlperf logging, truncate dir in logs and log seed (#4475) 2024-05-08 12:54:02 -04:00
chenyu
b62a65b617 redo faster sparse_categorical_crossentropy (#4461)
also update the default LR and DECAY for resnet, which help convergence
2024-05-08 11:21:43 -04:00
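For context, sparse categorical cross-entropy takes integer labels directly instead of one-hot targets; the commit reworks how it is computed for speed. The NumPy sketch below only shows the math being computed, with an ignore_index convention assumed for padded labels.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, labels, ignore_index=-1):
    # logits: (N, C), labels: (N,) integer class ids; ignore_index entries are masked out
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    mask = labels != ignore_index
    picked = log_probs[np.arange(len(labels)), np.where(mask, labels, 0)]
    return -(picked * mask).sum() / mask.sum()
```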
wozeparrot
603d3a351b feat: allow keeping multiple cookies (#4440) 2024-05-05 19:26:48 -07:00
David Hou
b767d59684 resnet trainer: keep old cookie around until next step has been queued (#4401)
* keep old cookie around until next step has been queued (-10ms 6gpu)

* also for eval

* drop cookie before data_get?

* Revert "drop cookie before data_get?"

This reverts commit b01e6aa2b2.

* Revert "Revert "drop cookie before data_get?""

This reverts commit 23464e73d4.
2024-05-03 12:15:21 -04:00
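The "cookie" returned by data_get pins the shared-memory slot a batch was loaded into; dropping it lets the dataloader recycle that slot. This commit delays that drop until the next step has been queued. A rough sketch of the resulting loop shape (train_step and the iterator protocol are stand-ins, not the exact trainer code):

```python
def train_loop(it, train_step, steps):
    # it yields (x, y, cookie); holding a cookie keeps that batch's buffer alive
    def data_get(it):
        x, y, cookie = next(it)
        return x, y, cookie

    old_cookie = None
    x, y, cookie = data_get(it)
    for _ in range(steps):
        loss = train_step(x, y)      # queue this batch's GPU work first...
        old_cookie = cookie          # ...then release the previous batch's cookie
        x, y, cookie = data_get(it)  # the held buffer can't be overwritten mid-step
    return loss
```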
chenyu
2c3b7f8e70 pad resnet training data with training data mean (#4369)
update model_train resnet to pad the training data
2024-05-02 20:26:15 -04:00
chenyu
ab01a9433d resnet eval 4n+3 if epoch < 33 (#4391)
the rule only requires evals at epochs 4n+k, and we can stop the clock as soon as eval hits the target. this can save 24 evals, or 12 minutes
2024-05-02 16:52:07 -04:00
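Concretely, the schedule described above only runs eval at epochs of the form 4n+3 until the cutover epoch, then every epoch. The helper below is an illustrative restatement; the helper name is hypothetical and the exact epoch indexing is as in the commit title.

```python
def should_eval(epoch: int, cutover: int = 33) -> bool:
    # eval only at epochs 4n+3 before the cutover, then every epoch until the target is hit
    return epoch % 4 == 3 if epoch < cutover else True
```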
chenyu
bf31837e6d resnet correct steps_in_val_epoch in logging (#4389)
also added random seed from system in scripts
2024-05-02 10:51:36 -04:00
chenyu
22376e53b7 resnet mlperf logging (#4361)
* resnet mlperf logging

* cropping too much?
2024-05-02 00:00:04 -04:00
chenyu
6628e13a5f pad resnet eval data in model_train (#4374)
assert if the eval sample count differs from the total eval file count.
2024-05-01 14:33:42 -04:00
chenyu
826cccd54d fix mean underflow for half tensor (#4377)
* fix mean underflow for half tensor

divide only by the reduce factor. added a unit test and a non-NaN assertion in resnet training. also added a failing test case for symbolic shape var

* skip for python backend
2024-05-01 13:38:57 -04:00
Elias Wahl
babe87a8ae BERT: Checkpoint loading tests (#4359)
* Move checkpoint init to helpers. Add test

* linters

* Move the steps outside of the main train loop

* Move data_get

* data_get belongs to helpers
2024-04-30 14:43:41 -04:00
Elias Wahl
71ff68b445 dropout after eval step (#4351) 2024-04-29 15:47:21 -04:00
Elias Wahl
27613dd881 MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers (a sketch of the initializer follows this commit)

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
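The "trunc normal initializers" mentioned in the first bullet can be sketched as simple rejection sampling; the std and ±2σ cutoff below are common BERT defaults assumed for illustration, not taken from the commit.

```python
import numpy as np

def trunc_normal(shape, std=0.02, limit=2.0):
    # draw from N(0, std) and redraw any sample beyond ±limit*std
    x = np.random.normal(0.0, std, size=shape)
    while (bad := np.abs(x) > limit * std).any():
        x[bad] = np.random.normal(0.0, std, size=int(bad.sum()))
    return x
```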
chenyu
ec65aea32f resnet stop the script once hit target (#4303)
* resnet stop the script once hit target

* comment
2024-04-25 23:54:56 -04:00
chenyu
f9a7badace use LR=7 for resnet with BS=1536 (#4299)
had 3 runs after the float32 LR change; it seems quite stable and converges at epochs 34 and 35
2024-04-25 15:23:10 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update the LR default to scale based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
chenyu
6637ecc5fe use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want different BEAM values for resnet train and eval, and the global JITBEAM cannot do this. added the flag to change beam behavior at cnt=0 (so by default it behaves the same with or without TinyJit); for cnt=1 it uses the existing BEAM.value.

Also updated the context var BEAM in resnet to be set outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
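The pattern this enables: keep separate TRAIN_BEAM and EVAL_BEAM values and enter the BEAM context outside of the TinyJit'd step functions, so each captured kernel set gets its own beam width. A minimal sketch, assuming train_step and eval_step are the jitted steps:

```python
from tinygrad.helpers import getenv, Context, BEAM

TRAIN_BEAM = getenv("TRAIN_BEAM", BEAM.value)
EVAL_BEAM = getenv("EVAL_BEAM", BEAM.value)

def run_train_step(train_step, x, y):
    with Context(BEAM=TRAIN_BEAM):   # context is set outside the jitted call
        return train_step(x, y)

def run_eval_step(eval_step, x, y):
    with Context(BEAM=EVAL_BEAM):
        return eval_step(x, y)
```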
chenyu
37f8be6450 resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable resetting jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
chenyu
f7416916df update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
chenyu
d5b67c1ca3 log resnet TRAIN_BEAM / EVAL_BEAM (#4181)
also run eval in benchmark mode if either one is positive
2024-04-15 19:29:08 -04:00
chenyu
6a2168e698 TRAIN_BEAM and EVAL_BEAM for resnet (#4177)
working on measuring compile time
2024-04-15 14:57:21 -04:00
chenyu
e20d6f9221 correct resnet estimate time (#4169)
7.99 hours was rendered as 7h0m.
2024-04-14 02:21:46 -04:00