* add support for a custom BASEDIR for openimages download
* make export step faster
* add focal loss (see the focal-loss sketch after this list)
* update model_eval with new dataloader
* generate_anchors in tinygrad
* update initializers for model
* small cleanup
* revert isin enhancements
* recursively go through backbone layers to freeze them
* add optimizer
* minor cleanup
* start dataloader work with input images
* add first transform for train set
* reuse existing prepare_target
* continue with dataloader implementation
* add dataloader
* separate out KiTS19 dataset test cases
* create mock data samples for test
* add dataloader + test
* cleanup dataloader test and revert shm path
* trim dataloader related code needed from ref
* got dataloader with normalize working
* update image to be float32
* add back normalization and negate it in test
* clean up reference dataset implementation + ruff changes
* add validation set test
* add proper training loop over the training dataset
* add LambdaLR support
* add LR scheduler and the start of training step
* get forward call to model to work and set up multi-GPU
* already passed device
* return matches from dataloader
* hotfix for dataloader typo causing some hang
* start some work on classification loss
* update focal loss to support masking
* add missing test and cleanup focal loss
* cleanup unit tests
* remove masking support for sigmoid_focal_loss
* make ClassificationHead loss work
* cleanups + fix dataloader tests
* remove sigmoid when computing loss
* make anchors use Tensors
* simplify anchors batching
* revert anchors to use np
* implement regression loss
* fix regression loss
* cleanup losses
* move BoxCoder to MLPerf helpers
* revert helper changes
* fixes after helper refactor cleanup
* add tests for l1_loss
* start re-enabling training step
* minor cleanup
* add pycocotools to testing dependencies
* make training work
* adjust regression loss to mask after L1 loss is calculated (see the regression-loss sketch after this list)
* reduce img and lbl sizes by half for KiTS19 dataset tests
* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"
This reverts commit d115b0c664.
* temporarily disable openimages dataset tests to debug CI
* enable openimages dataset test and create samples once
* temporarily disable openimages validation set test
* reenable test and add some debugging to the test
* add boto3 testing dependencies
* add pandas to testing dependencies
* This reverts commit 467704fec6.
* reenable test
* move sample creation to setup
* realize BoxCoder's encoding
* add wandb
* fix wandb resuming feature
* move anchors as part of dataloader
* fix dtype for anchor inside dataloader and fix horizontal flip transformation
* add support for BENCHMARK
* set seed
* debug dataset test failure
* Revert "debug dataset test failure"
This reverts commit 1b2f9d7f50.
* fix dataloader script
* do not realize when sharding model weights
* setup openimages samples differently
* create the necessary samples per test case
* enable lr scheduler and fix benchmark timing
* add jit to the training loop
* add checkpointing and training resume capabilities
* refactor the training loop and start work on the val loop
* add debug logging for dataloader test
* debug test
* assert boxes again
* update validation dataloader and more cleanups
* fix validation test case
* add multi device support to retinanet eval
* fix issue with realized on dataloader
* remove optional disk tensors in dataloader
* remove verbose debugging on datasets test
* put back parallel testing and remove img_ids Tensor from dataloader
* cleanup train and validation dataloader
* return validation targets in dataloader
* cleanup boxes and labels in dataloader
* fix img_ids repeating its values
* remove unnecessary targets from validation dataloader
* add validation loop to training script
* adjust LR to scale with the batch size
* minor cleanups
* remove frozen layers from optimizer's params
* hyperparameter adjustments and cleanups
* model init, hyperparam, and data preprocessing updates
* no need to return loaded keys for resnet
* fix train script
* update loss calculation for RegressionHead and some cleanups
* add JIT reset support
* add nan check during training
* Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
* Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
* some typing cleanups
* update seeding on dataloader and the start of training script
* undo changes
* undo more changes
* more typing fixes
* minor cleanups
* update dataloader seed
* hotfix: log metric and move target metric check outside of CKPT
* check for CKPT when target metric is reached before saving
* add TRAIN_BEAM and EVAL_BEAM
* minor cleanup
* update hyperparams and add support for EVAL_BS
* add green coloring to metric reached statement
* initial work to support f16
* update model initializers to be monkeypatched
* update layers to support float32 weight loading + float16 training
* don't return loss that's scaled
* run eval on benchmark beam
* move BEAM to their respective steps
* update layers to be compatible with fp16
* end BENCHMARK after first eval
* cleanups and adjust learning rate for fp16
* remove duplicated files from test
* revert losses changes
* Revert "revert losses changes"
This reverts commit aebccf93ac.
* go back to old LR
* cast batchnorm to float32
* set new loss scaler default value for float16 (see the fp16 training sketch after this list)
* remove LambdaLRScheduler
* remove runner and use dataloader on eval
* fix retinanet eval with new dataloader
* remove unused import
* revert lr_scheduler updates
* use BS=96 with new learning rate
* rename module initializers
* more cleanups on training loop
* remove contig from optim.step
* simplify sum when computing loss
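The classification side of the loss work above (focal loss computed directly from the logits, with no separate sigmoid before the loss) can be summarized with a minimal sketch; this is an illustrative reconstruction, not the code from these commits:

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
  # numerically stable binary cross entropy straight from the logits (no explicit sigmoid on the inputs)
  ce = logits.maximum(0) - logits * targets + (1 + (-logits.abs()).exp()).log()
  p = logits.sigmoid()
  p_t = p * targets + (1 - p) * (1 - targets)
  loss = ce * (1 - p_t) ** gamma                     # down-weight easy examples
  if alpha >= 0:
    loss = loss * (alpha * targets + (1 - alpha) * (1 - targets))
  return loss.sum()                                  # normalization by matched anchors is left to the caller
```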
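Likewise, a hedged sketch of the box-regression loss described above, where the elementwise L1 is computed first and the mask for unmatched anchors is applied afterwards; `matched_mask` and the normalization are illustrative assumptions:

```python
from tinygrad import Tensor

def regression_loss(pred_deltas: Tensor, target_deltas: Tensor, matched_mask: Tensor) -> Tensor:
  # pred_deltas/target_deltas: (num_anchors, 4); matched_mask: (num_anchors,) with 1 for matched anchors
  l1 = (pred_deltas - target_deltas).abs()           # plain L1 everywhere
  l1 = l1 * matched_mask.reshape(-1, 1)              # mask applied after the L1 is calculated
  return l1.sum() / matched_mask.sum().maximum(1)    # average over the matched anchors
```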
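Finally, a sketch of the float16 training arrangement the later commits describe (loss scaled before backward, gradients unscaled before the step, the unscaled loss returned); `model.loss`, the scaler value, and the parameter loop are assumptions for illustration:

```python
from tinygrad import Tensor

LOSS_SCALER = 2.0 ** 11  # illustrative value, not the default picked in these commits

def train_step(model, optim, x: Tensor, y: Tensor) -> Tensor:
  optim.zero_grad()
  loss = model.loss(x, y)                  # forward pass in float16
  (loss * LOSS_SCALER).backward()          # scale so float16 gradients don't underflow
  for p in optim.params:
    p.grad = p.grad / LOSS_SCALER          # unscale before the optimizer step
  optim.step()
  return loss.realize()                    # return the unscaled loss, per "don't return loss that's scaled"
```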
fix the case where only eval is run: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. It should not fail with a single-device shard, though.
Python time went from 45ms to 9ms; it was spending the time scheduling the shard.
Also init the BERT data on CLANG since it comes from numpy, so we don't create the tensor on the default device and then shard it onto the GPUs.
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of randomly initialized weights caused the OOM.
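A minimal illustration of that point, assuming a CLANG (CPU) device string and six GPU device names; not the exact code from the change:

```python
import numpy as np
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(6))

def batch_to_gpus(batch: np.ndarray) -> Tensor:
  t = Tensor(batch, device="CLANG")  # the numpy data lands on the CPU backend first...
  return t.shard(GPUS, axis=0)       # ...then gets split along the batch axis across the GPUs
```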
* add write support
* add test
* update test case to compare write outputs
* assert final write output
* flush when using write
* update write logic
* Revert "update write logic"
This reverts commit 5e0e611b46.
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* add training set transforms
* add DICE cross entropy loss (see the loss sketch after this list)
* convert pred and label to Tensor when calculating DICE score
* cleanups and allow train dataset batching
* fix DICE CE loss calculation
* jitted training step
* clean up DICE CE loss calculation
* initial support for sharding
* Revert "initial support for sharding"
This reverts commit e3670813b8.
* minor updates
* cleanup imports
* add support for sharding
* apply temp patch to try to avoid OOM
* revert cstyle changes
* add gradient acc
* hotfix
* add FP16 support
* add ability to train on smaller image sizes
* add support for saving and loading checkpoints + clean up various modes
* fix issue with using smaller patch size + update W&B logging
* disable LR_WARMUP_EPOCHS
* updates
* minor cleanups
* cleanup
* update order of transformations
* more cleanups
* realize loss
* cleanup
* more cleanup
* some cleanups
* add RAM usage
* minor cleanups
* add support for gradient accumulation (see the sketch after this list)
* cleanup imports
* minor updates to not use GA_STEPS
* remove FP16 option since it's available now globally
* update multi-GPU setup
* add timing logs for training loop
* go back to using existing dataloader and add ability to preprocess data to save time
* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation
* free train and eval steps memory
* cleanups and scale batch size based on the number of GPUs
* fix GlobalCounters import
* fix seed
* fix W&B setup
* update batch size default size
* add back metric divergence check
* put back JIT on UNet3d eval
* move dataset preprocessing inside training code
* add test for dice_loss
* add config logging support to W&B and other cleanups
* change how default float is getting retrieved
* remove TinyJit import duplicate
* update config logging to W&B and remove JIT on eval_step
* no need for caching preprocessed data anymore
* fix how evaluation is run and how often
* add support for LR scaling
* fix issue with gaussian being moved to scipy.signal.windows (see the import shim after this list)
* remove DICE loss unit test
* fix issue where loss isn't compatible with multi-GPU
* add individual BEAM control for train and eval steps
* fix ndimage scipy import
* add BENCHMARK
* cleanups on BENCHMARK + fix rand_flip augmentation during training
* cleanup train and eval BEAM envs
* add checkpointing support after every eval
* cleanup model_eval
* disable grad during eval
* use new preprocessing dataset mechanism
* remove unused import
* use training and inference_mode contexts
* start eval after benchmarking
* add data fetching time
* cleanup decorators
* more cleanups on training script
* add message during benchmarking mode
* realize when reassigning LR on scheduler and update default number of epochs
* add JIT on eval step
* remove JIT on eval_step
* add train dataloader for unet3d
* move checkpointing to be done after every epoch
* revert removal of JIT on unet3d inference
* save checkpoint if metric is not successful
* Revert "add train dataloader for unet3d"
This reverts commit c166d129df.
* Revert "Revert "add train dataloader for unet3d""
This reverts commit 36366c65d2.
* hotfix: seed was defaulting to a value of 0
* fix SEED value
* remove the usage of context managers for setting BEAM and going from training to inference
* support new stack API for calculating eval loss and metric
* Revert "remove the usage of context managers for setting BEAM and going from training to inference"
This reverts commit 2c0ba8d322.
* check training and test preprocessed folders separately
* clean up imports and log FUSE_CONV_BW
* use train and val preprocessing constants
* add kits19 dataset setup script
* update to use the new test decorator for disabling grad
* update kits19 dataset setup script
* add docs on how to train the model
* set default value for BASEDIR
* add detailed instruction about BASEDIR usage
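For reference, a hedged sketch of a combined DICE + cross-entropy loss of the kind the commits above describe; the shapes, smoothing term, and reductions are assumptions, not the actual implementation:

```python
from tinygrad import Tensor

def dice_ce_loss(pred: Tensor, label: Tensor, smooth: float = 1e-6) -> Tensor:
  # pred: (B, C, D, H, W) logits; label: (B, D, H, W) integer class ids
  num_classes = pred.shape[1]
  ce = pred.permute(0, 2, 3, 4, 1).reshape(-1, num_classes).sparse_categorical_crossentropy(label.reshape(-1))
  probs = pred.softmax(axis=1)
  one_hot = label.one_hot(num_classes).permute(0, 4, 1, 2, 3).float()
  intersection = (probs * one_hot).sum(axis=(2, 3, 4))
  denom = probs.sum(axis=(2, 3, 4)) + one_hot.sum(axis=(2, 3, 4))
  dice = (2 * intersection + smooth) / (denom + smooth)
  return ce + (1 - dice.mean())   # cross entropy plus (1 - mean Dice)
```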
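Similarly, a minimal sketch of gradient accumulation as the commits describe it; `model.loss` and the micro-batch handling are illustrative assumptions:

```python
from tinygrad import Tensor

def accumulate_and_step(model, optim, micro_batches, accum_steps: int) -> Tensor:
  optim.zero_grad()
  total = Tensor(0.0)
  for x, y in micro_batches[:accum_steps]:
    loss = model.loss(x, y) / accum_steps  # scale so the accumulated grads match one big batch
    loss.backward()                        # gradients sum across micro-batches until step()
    total = total + loss
  optim.step()                             # one optimizer step for the whole accumulated batch
  return total.realize()
```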
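And the "gaussian being moved to scipy.signal.windows" issue is the kind of thing a small compatibility shim can cover; this is illustrative, not necessarily the fix that landed:

```python
try:
  from scipy.signal.windows import gaussian  # location in recent SciPy releases
except ImportError:
  from scipy.signal import gaussian          # fallback for older SciPy versions
```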
---------
Co-authored-by: chenyu <chenyu@fastmail.com>