Commit Graph

103 Commits

Author SHA1 Message Date
chenyu
e8024c8281 faster bert global_norm (#9901)
2% faster on tinyamd. Also updated beam params, which is another 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
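For orientation, a minimal sketch of a gradient global norm in tinygrad; the function name and the fusion strategy are assumptions for illustration, not the exact code this commit speeds up.

```python
from tinygrad import Tensor

# illustrative sketch: fuse all per-gradient squared sums into one expression,
# so the norm is computed in a single realize instead of per-tensor syncs
def global_norm(grads: list[Tensor]) -> Tensor:
    return Tensor.stack(*[g.float().square().sum() for g in grads]).sum().sqrt()
```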
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss (an illustrative sketch of sigmoid focal loss follows this commit entry)

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get the forward call to the model to work and set up multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor the training loop and start the work on the val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for the regression head and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
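Since this PR adds a focal loss (see the bullet above), here is an illustrative tinygrad sketch of sigmoid focal loss as defined in the RetinaNet paper; the signature and defaults are assumptions, not this PR's exact implementation.

```python
from tinygrad import Tensor

# illustrative sigmoid focal loss (Lin et al. 2017): down-weights easy examples
# by (1 - p_t)^gamma and balances classes with alpha; reduction is left to the caller
def sigmoid_focal_loss(logits: Tensor, targets: Tensor,
                       alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
    p = logits.sigmoid()
    # numerically stable binary cross-entropy with logits
    ce = logits.relu() - logits * targets + (1 + (-logits.abs()).exp()).log()
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return alpha_t * (1 - p_t) ** gamma * ce
```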
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
6b3480ec70 update mi300x bert hparams (#9716)
* update mi300x bert hparams

Borrowed from a previous submission that also used BS=1024.

* update
2025-04-03 22:30:00 -04:00
chenyu
a6fec2f5ae dev_run for bert on mi300x (#9706) 2025-04-02 21:12:55 -04:00
chenyu
f7cb2e8da3 bert dev_beam for mi300x box (#9648)
* bert dev_beam for mi300x box

* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
chenyu
d8d7ac1bb1 fix bert free_intermediates (#9633)
Fix for when only running eval: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
2025-03-30 22:42:52 -04:00
chenyu
f53be010d7 lower bert learning rate (#9481)
Slightly better; first sub-3hr run: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d bert lower learning rate and total steps (#9466)
Closer to the other submission with BS=240; converged with 10% fewer epochs.
2025-03-16 17:21:20 -04:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
Running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. It should not fail with a single-device shard, though.
2025-02-25 09:05:11 -05:00
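A hedged sketch of the kind of guard this hotfix implies: only shard when more than one device is configured. `GPUS` and the helper name are illustrative, not the script's actual code.

```python
from tinygrad import Tensor

# illustrative: shard a tensor across devices only when more than one is configured
def place(t: Tensor, GPUS: tuple[str, ...]) -> Tensor:
    return t.shard(GPUS, axis=0) if len(GPUS) > 1 else t.to(GPUS[0])
```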
chenyu
ff05bff221 put bert data shard inside jit (#9160)
Python time 45ms -> 9ms; it was spending the time scheduling the shard.

Also init bert data on CLANG since it comes from numpy, so we don't create the tensor on the default device and then shard it onto GPUS.
2025-02-18 10:36:54 -05:00
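A minimal sketch of the device placement described above, with assumed names: the numpy batch is materialized on CLANG first, then sharded onto the training GPUs (the sharding itself is what this commit moves inside the JIT).

```python
import numpy as np
from tinygrad import Tensor

# illustrative: build the batch on the CLANG (CPU) backend, since the data is
# numpy anyway, then shard along the batch axis instead of going through the
# default device first
def to_gpus(np_batch: np.ndarray, GPUS: tuple[str, ...]) -> Tensor:
    return Tensor(np_batch, device="CLANG").shard(GPUS, axis=0)
```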
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffers of the randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
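A hedged sketch of what "zero out the weight in the init run" can look like; the helper name is an assumption, and the point is that overwriting the random-init buffers lets them be freed.

```python
from tinygrad import Tensor
from tinygrad.nn.state import get_parameters

# illustrative: in the BENCHMARK/setup pass, replace randomly initialized
# weights with zeros so the original init buffers can be released
def zero_out_weights(model):
    for p in get_parameters(model):
        p.assign(Tensor.zeros(*p.shape, dtype=p.dtype, device=p.device)).realize()
```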
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
chenyu
7b5ac2c15e free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
chenyu
c99ae81f63 update default resnet LOSS_SCALER to 256 [pr] (#8774) 2025-01-27 16:59:05 -05:00
chenyu
9f6d545a16 bert log global_norm in training step [pr] (#8708)
* bert log global_norm in training step [pr]

and minor cleanups

* .item()
2025-01-21 20:36:27 -05:00
chenyu
1e283c33d3 remove realize in bert model init [pr] (#8707) 2025-01-21 14:11:03 -05:00
chenyu
3e2430f822 use tqdm tqdm in mlperf training (#7929)
Issue with benchmark dashboard logging; revert back to tqdm's tqdm for now.
2024-11-27 21:57:05 -05:00
Francis Lata
90eff347e2 tinytqdm write support (#6359)
* add write support

* add test

* update test case to compare write outputs

* assert final write output

* flush when using write

* update write logic

* Revert "update write logic"

This reverts commit 5e0e611b46.

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
chenyu
0e42662f2a log seed at the right place for bert (#7000) 2024-10-11 10:39:40 -04:00
chenyu
b5546912e2 10% more TRAIN_STEPS for bert (#6971)
Got two runs that were very close; adding more steps as a buffer.
2024-10-09 19:21:43 -04:00
chenyu
a78c96273a update bert epoch logging (#6940)
* update bert epoch logging

epoch for bert is simply the number of examples seen (which is what the RCP check uses); see the sketch after this entry

* update total steps too

* more changes
2024-10-08 00:34:06 -04:00
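As a hedged illustration of the bullet above: the "epoch" reported for BERT is the running count of training examples consumed, which the MLPerf RCP check compares against. The function and argument names are assumptions.

```python
# illustrative: for BERT, "epoch" is the running count of examples seen, not a
# pass over the dataset; this is the quantity the RCP check compares against
def bert_epoch_num(step: int, global_batch_size: int) -> int:
    return step * global_batch_size

assert bert_epoch_num(step=1000, global_batch_size=1024) == 1_024_000
```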
chenyu
102dfe5510 back to 2**10 for bert loss scaler (#6934)
Got 2 NaN runs with this; reverting back to 2**10.
2024-10-07 10:17:21 -04:00
chenyu
0cf815a93a bert use BS=66 and update hparams (#6932)
With the dropout memory improvement, we can fit BS=66 now. Also revert back to the hparams from #5891.
2024-10-07 05:08:27 -04:00
chenyu
718b959349 log epoch start and stop for bert (#6912) 2024-10-06 06:39:46 -04:00
chenyu
0e706227a2 add seed to bert result log filename (#6903)
* add seed to bert result log filename

* different name for different benchmark
2024-10-05 09:15:24 -04:00
chenyu
7391376528 update bert hparams (#6876)
4h32m with this: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview

Loss scaler 2**13 -> 2**10; matched the closest submission, no NaN across ~10 runs.

Increased lr and the total steps a bit.

`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
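A hedged sketch of the static loss scaling this entry tunes, under assumed names (`optimizer` is anything exposing `params`, `zero_grad`, and `step` like tinygrad's optimizers do): the loss is multiplied by the scaler before backward, and the gradients are divided back down before the update.

```python
from tinygrad import Tensor

LOSS_SCALER = 2**10  # value from this entry; previously 2**13

# illustrative: scale the half-precision loss up before backward to avoid
# underflow, then unscale the gradients before the optimizer applies them
def scaled_backward_step(loss: Tensor, optimizer) -> None:
    optimizer.zero_grad()
    (loss * LOSS_SCALER).backward()
    for p in optimizer.params:
        if p.grad is not None:
            p.grad = p.grad / LOSS_SCALER
    optimizer.step()
```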
chenyu
5f77217772 bert default CKPT to 0 (#6840)
not required
2024-10-01 21:55:56 -04:00
chenyu
f59517754e add RESET_STEP in bert to control reset (#6818)
same as resnet
2024-09-30 09:39:04 -04:00
chenyu
572d77d1d9 bert script delete eval data after eval (#6790)
Fits BS=60, which is 2% faster than 54. Also fixed the wandb logging params.
2024-09-27 20:54:00 -04:00
chenyu
5a5fbfa1eb smaller bert script change (#6768)
Only the WANDB and RUNMLPERF ordering. BENCHMARK and BEAM will be handled differently.
2024-09-26 04:54:28 -04:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss (an illustrative sketch follows this commit entry)

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + cleanup some various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update default batch size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is ran and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
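Since this PR adds a DICE cross-entropy loss (noted in the first bullet), here is an illustrative tinygrad sketch of that combination; the signature, epsilon, and equal weighting are assumptions, not this PR's exact code.

```python
from tinygrad import Tensor

# illustrative DICE + cross-entropy loss for segmentation; `pred` is assumed to
# be per-class probabilities (N, C, D, H, W) and `label` a one-hot target
def dice_ce_loss(pred: Tensor, label: Tensor, eps: float = 1e-6) -> Tensor:
    spatial = tuple(range(2, len(pred.shape)))          # reduce over D, H, W
    intersection = (pred * label).sum(axis=spatial)
    union = pred.sum(axis=spatial) + label.sum(axis=spatial)
    dice = 1 - ((2 * intersection + eps) / (union + eps)).mean()
    ce = -(label * pred.clip(eps, 1.0).log()).sum(axis=1).mean()
    return (dice + ce) / 2
```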
Elias Wahl
c9b4602854 no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl
c9862e17d4 MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
Elias Wahl
937bf5fe12 better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
Elias Wahl
73bddc44f6 Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
Elias Wahl
e267f3161d Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
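For reference, a short sketch of how the mlperf_logging compliance logger is typically driven; the keys shown are standard constants from that package, but the exact events and values emitted by this commit are not reproduced here.

```python
from mlperf_logging import mllog

# illustrative: configure the compliance logger and emit a few standard events
mllog.config(filename="bert_compliance.log")
mllogger = mllog.get_mllogger()

mllogger.start(key=mllog.constants.RUN_START)
mllogger.event(key=mllog.constants.GLOBAL_BATCH_SIZE, value=1024)
mllogger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```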
Elias Wahl
f31ef11537 Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00