Commit Graph

216 Commits

chenyu
e2ed673c94 FUSE_ARANGE_UINT to not fuse uint (#9915)
hack to bypass rand, so FUSE_ARANGE can run on green, saving 6ms per step
2025-04-16 18:49:38 -04:00
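
For context, a minimal sketch of toggling FUSE_ARANGE through tinygrad's Context (shapes are illustrative; the script's actual integration is not shown here):

```python
from tinygrad import Tensor
from tinygrad.helpers import Context

# FUSE_ARANGE turns gather-style indexing into a fused arange kernel;
# per this commit, FUSE_ARANGE_UINT=0 keeps uint chains (e.g. from rand)
# out of that fusion so the flag can stay on for the rest of the step
with Context(FUSE_ARANGE=1):
  table = Tensor.rand(30522, 128)  # e.g. an embedding table
  ids = Tensor([1, 5, 42])
  out = table[ids].realize()       # the gather fuses the index arange
```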
chenyu
e8024c8281 faster bert global_norm (#9901)
tinyamd 2% faster. also updated beam params, which are 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
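
As a hedged sketch (the commit's exact kernel-level change is not reproduced; `grads` is an assumed name), a global-norm computation in tinygrad looks like:

```python
from tinygrad import Tensor

def global_norm(grads: list[Tensor]) -> Tensor:
  # square each gradient, reduce to scalars, add them all, then sqrt;
  # one expression lets the scheduler plan the whole reduction together
  return sum((g.float() * g.float()).sum() for g in grads).sqrt()
```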
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim the dataloader-related code needed from the ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get the forward call to the model working and set up multi-GPU

* already passed device

* return matches from dataloader

* hotfix for a dataloader typo causing a hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor the training loop and start the work on the val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* scale LR by the batch-size ratio

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for the regression head and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
got a few non-converged runs, so try increasing steps. we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
for some reason, with random dropout it creates a different AST on each device, and beam-searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed
2025-04-10 04:31:19 -04:00
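
A hedged illustration of the workaround (module and variable names are assumptions, not the script's exact code):

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(30522, 128)   # bert-sized vocab, small width
ids = Tensor([[1, 2, 3, 4]])
# .contiguous() realizes the gather into its own kernel, so the downstream
# graph has an identical prefix on every device regardless of dropout masks
out = emb(ids).contiguous()
```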
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
2e1002e179 EVAL_BS=96 and BEAM=3 for bert green (#9819)
19m -> 13m setup and the same end-to-end time
2025-04-09 22:37:27 -04:00
chenyu
8fe83385ec add system json for mi300x mlperf (#9786)
* add system json for mi300x mlperf

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO -   System description checker passed for tinybox 8xMI300X
```

also removed the rocm from tinybox_red since we are not using it

* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4cc7422769 use AM driver in bert mlperf (#9775)
we should commit to using AM. it's 7ms slower in python time for now
2025-04-07 23:40:27 -04:00
Francis Lata
f8fe15e64e move BoxCoder to mlperf helpers (#9773) 2025-04-07 20:27:06 -04:00
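
For reference, a minimal sketch of the standard box encoding a BoxCoder performs (xyxy boxes; weight terms omitted, and the names are assumptions rather than the helper's exact API):

```python
from tinygrad import Tensor

def encode_boxes(anchors: Tensor, gt: Tensor) -> Tensor:
  # anchors, gt: (N, 4) in xyxy; targets are center/size offsets
  aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
  ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
  gw, gh = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
  gx, gy = gt[:, 0] + 0.5 * gw, gt[:, 1] + 0.5 * gh
  return Tensor.stack((gx - ax) / aw, (gy - ay) / ah,
                      (gw / aw).log(), (gh / ah).log(), dim=1)
```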
chenyu
7c4a739fe4 full script for bert mi300x (#9772) 2025-04-07 11:41:31 -04:00
chenyu
3069ebfad1 use BERT_LAYERS=2 in bert init (#9769)
saves 5 minutes of scheduling in setup so we can fit more search
2025-04-07 07:46:37 -04:00
Francis Lata
71b8890dd6 use validation dataloader inside retinanet eval (#9747) 2025-04-05 16:46:55 -04:00
chenyu
5a04f4d4ba revert bert hparams for green and red (#9744)
did more runs and it's not really better, so not worth the change. only useful for BS=1024
2025-04-05 07:38:01 -04:00
chenyu
640ff681c3 rename bert script to 8xMI300X (#9734)
and adds a script for a single MI300X
2025-04-03 23:36:24 -04:00
chenyu
6b3480ec70 update mi300x bert hparams (#9716)
* update mi300x bert hparams

borrowed from a previous submission that also did BS=1024

* update
2025-04-03 22:30:00 -04:00
chenyu
a6fec2f5ae dev_run for bert on mi300x (#9706) 2025-04-02 21:12:55 -04:00
chenyu
f7cb2e8da3 bert dev_beam for mi300x box (#9648)
* bert dev_beam for mi300x box

* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
chenyu
d8d7ac1bb1 fix bert free_intermediates (#9633)
fix for when only running eval: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
2025-03-30 22:42:52 -04:00
chenyu
a187dfd3df bert BEAM_UOPS_MAX 3000->4000 (#9603)
more stable for the final step time

green 410ms (master) -> 397ms (BEAM=4) -> 392ms (this)
red 561ms (master) -> 550ms (this)
2025-03-27 11:58:47 -04:00
chenyu
62888614f6 lower bert eval bs to 24 (#9590)
OOM during eval
2025-03-26 21:25:23 -04:00
chenyu
c965f4c20b update bert config (#9555)
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
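
A minimal sketch of the standard sigmoid focal loss formulation (the reference RetinaNet defaults of alpha=0.25, gamma=2; the PR's exact signature may differ):

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor,
                       alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
  p = logits.sigmoid()
  # per-element binary cross-entropy, written out explicitly
  ce = -(targets * p.clip(1e-7, 1.0).log() + (1 - targets) * (1 - p).clip(1e-7, 1.0).log())
  p_t = targets * p + (1 - targets) * (1 - p)      # prob of the true class
  loss = ce * (1 - p_t).pow(gamma)                 # down-weight easy examples
  loss = loss * (targets * alpha + (1 - targets) * (1 - alpha))
  return loss.mean()
```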
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
chenyu
f53be010d7 lower bert learning rate (#9481)
slightly better. first sub-3hr run https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d bert lower learning rate and total steps (#9466)
closer to the other submission with BS=240. converged with 10% fewer epochs
2025-03-16 17:21:20 -04:00
chenyu
4992958dae update bert beam params (#9423)
BEAM_MIN_PROGRESS=5 for setup speed
2025-03-12 13:00:41 -04:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matches numpy and torch
2025-03-10 16:05:30 -04:00
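
A hedged example of the renamed keyword in use (values illustrative; this assumes the rename applies to reductions like `Tensor.sum`, consistent with the commit):

```python
from tinygrad import Tensor, dtypes

x = Tensor.ones(4096, dtype=dtypes.half)
total = x.sum(dtype=dtypes.float32)  # accumulate in fp32, like numpy/torch
print(total.item())
```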
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
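
A sketch of the pattern (assuming `Tensor.realize` accepts additional tensors, which is how tinygrad co-realizes outputs):

```python
from tinygrad import Tensor

x = Tensor.rand(1024)
loss, acc = (x * x).sum(), x.mean()
# one realize call schedules both outputs together, so the scheduler can
# share work between the reductions instead of planning them separately
loss.realize(acc)
```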
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
qazal
845814f396 revert buffer_view change (#9311)
* Revert "BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298)"

This reverts commit 3210b656b6.

* Revert "substitute ast from kernel op [pr] (#9293)"

This reverts commit 5a9c788ae6.
2025-03-01 11:00:12 +01:00
qazal
3210b656b6 BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298) 2025-02-28 12:15:04 +02:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard, though
2025-02-25 09:05:11 -05:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms, it was spending the time scheduling the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard onto GPUS
2025-02-18 10:36:54 -05:00
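
A hedged sketch of the two changes (device names, shapes, and the step body are assumptions; CLANG is the device later renamed CPU):

```python
import numpy as np
from tinygrad import Tensor, TinyJit

GPUS = tuple(f"AMD:{i}" for i in range(8))  # assumption: an 8-GPU box

@TinyJit
def shard_batch(x: Tensor) -> Tensor:
  # sharding inside the jit means its scheduling cost is paid only once
  return x.shard(GPUS, axis=0).realize()

# numpy-backed data starts on the CPU device, so the tensor is not
# created on the default device and then moved
batch = Tensor(np.zeros((96, 512), dtype=np.int32), device="CPU")
out = shard_batch(batch)
```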
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and the real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00