* add support for a custom BASEDIR for openimages download
* make export step faster
* add focal loss
* update model_eval with new dataloader
* generate_anchors in tinygrad
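A rough sketch of what `generate_anchors` involves, for context only (this is not the commit's code; the sizes, strides, and aspect ratios below are assumptions):

```python
# Illustrative anchor generation in numpy; the actual implementation may differ.
import numpy as np

def base_anchors(size, aspect_ratios=(0.5, 1.0, 2.0), scales=(1.0, 2**(1/3), 2**(2/3))):
  anchors = []
  for ar in aspect_ratios:
    for s in scales:
      area = (size * s) ** 2
      w, h = np.sqrt(area / ar), np.sqrt(area * ar)   # width/height for this ratio
      anchors.append([-w / 2, -h / 2, w / 2, h / 2])  # centered (x1, y1, x2, y2)
  return np.array(anchors, dtype=np.float32)

def grid_anchors(feat_h, feat_w, stride, size):
  base = base_anchors(size)
  shift_x = (np.arange(feat_w) + 0.5) * stride
  shift_y = (np.arange(feat_h) + 0.5) * stride
  sx, sy = np.meshgrid(shift_x, shift_y)
  shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)
  # every base anchor at every grid position -> (feat_h * feat_w * A, 4)
  return (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)

anchors_p3 = grid_anchors(100, 100, stride=8, size=32)  # e.g. one FPN level on an 800x800 input
```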
* update initializers for model
* small cleanup
* revert isin enhancements
* recursively go through backbone layers to freeze them
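A minimal sketch of the backbone-freezing idea in tinygrad (the toy model, layer names, and optimizer hyperparameters are assumptions, not this commit's code); it also lines up with the later "remove frozen layers from optimizer's params" entry:

```python
# Illustrative only: mark every parameter under the backbone as non-trainable,
# then build the optimizer from the remaining trainable parameters.
from tinygrad.nn import Conv2d, Linear
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

class TinyModel:
  def __init__(self):
    self.backbone = [Conv2d(3, 16, 3), Conv2d(16, 32, 3)]  # stand-in backbone
    self.head = Linear(32, 10)                              # stand-in head

def freeze(module):
  # get_parameters walks the structure recursively, so nested layers are covered
  for p in get_parameters(module): p.requires_grad = False

model = TinyModel()
freeze(model.backbone)
trainable = [p for p in get_parameters(model) if p.requires_grad is not False]
opt = SGD(trainable, lr=0.01, momentum=0.9)  # frozen params never reach the optimizer
```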
* add optimizer
* minor cleanup
* start dataloader work with input images
* add first transform for train set
* reuse existing prepare_target
* continue with dataloader implementation
* add dataloader
* separate out KiTS19 dataset test cases
* create mock data samples for test
* add dataloader + test
* cleanup dataloader test and revert shm path
* trim dataloader related code needed from ref
* got dataloader with normalize working
* update image to be float32
* add back normalization and negate it in test
* clean up reference dataset implementation + ruff changes
* add validation set test
* add proper training loop over the training dataset
* add LambdaLR support
* add LR scheduler and the start of training step
* get forward call to model to work and set up multi-GPU
* already passed device
* return matches from dataloader
* hotfix for dataloader typo causing some hang
* start some work on classification loss
* update focal loss to support masking
* add missing test and cleanup focal loss
* cleanup unit tests
* remove masking support for sigmoid_focal_loss
* make ClassificationHead loss work
* cleanups + fix dataloader tests
* remove sigmoid when computing loss
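Several entries above concern `sigmoid_focal_loss` for the classification head; per "remove sigmoid when computing loss", the loss is computed directly from logits. A hedged numpy sketch of the standard formulation (alpha/gamma defaults follow the RetinaNet paper; the sum reduction is an assumption):

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
  p = 1.0 / (1.0 + np.exp(-logits))
  # numerically stable binary cross-entropy computed on logits, no explicit sigmoid pass
  ce = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
  p_t = p * targets + (1 - p) * (1 - targets)       # probability of the true class
  loss = ce * (1 - p_t) ** gamma                    # down-weight easy examples
  if alpha >= 0: loss = loss * (alpha * targets + (1 - alpha) * (1 - targets))
  return loss.sum()

logits = np.array([[2.0, -1.0], [0.5, 0.3]], dtype=np.float32)
targets = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
print(sigmoid_focal_loss(logits, targets))
```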
* make anchors use Tensors
* simplify anchors batching
* revert anchors to use np
* implement regression loss
* fix regression loss
* cleanup losses
* move BoxCoder to MLPerf helpers
* revert helper changes
* fixes after helper refactor cleanup
* add tests for l1_loss
* start re-enabling training step
* minor cleanup
* add pycocotools to testing dependencies
* make training work
* adjust regression loss to mask after L1 loss is calculated
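For the regression-loss entries, "mask after L1 loss is calculated" means the foreground mask is applied to the per-anchor losses rather than to the inputs. A rough numpy sketch; normalizing by the number of foreground anchors is an assumption borrowed from the reference implementation:

```python
import numpy as np

def l1_loss(pred, target):
  # element-wise L1, summed over the 4 box coordinates -> one value per anchor
  return np.abs(pred - target).sum(axis=-1)

def regression_loss(box_pred, box_target, foreground_mask):
  per_anchor = l1_loss(box_pred, box_target)                   # (num_anchors,)
  masked = per_anchor * foreground_mask                        # zero out background anchors
  return masked.sum() / np.maximum(foreground_mask.sum(), 1.0)

box_pred   = np.random.randn(8, 4).astype(np.float32)
box_target = np.random.randn(8, 4).astype(np.float32)
mask = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=np.float32)    # 3 foreground anchors
print(regression_loss(box_pred, box_target, mask))
```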
* reduce img and lbl sizes by half for KiTS19 dataset tests
* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"
This reverts commit d115b0c664.
* temporarily disable openimages dataset tests to debug CI
* enable openimages dataset test and create samples once
* temporarily disable openimages validation set test
* reenable test and add some debugging to the test
* add boto3 testing dependencies
* add pandas to testing dependencies
* This reverts commit 467704fec6.
* reenable test
* move sample creation to setup
* realize boxcoder's encoding
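For context on the `BoxCoder` entries: it turns ground-truth boxes into regression targets relative to anchors. A condensed sketch of the usual encode step ((x1, y1, x2, y2) boxes; the unit weights are assumptions):

```python
import numpy as np

def encode_boxes(reference, anchors, weights=(1.0, 1.0, 1.0, 1.0)):
  wx, wy, ww, wh = weights
  aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
  acx, acy = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
  gw, gh = reference[:, 2] - reference[:, 0], reference[:, 3] - reference[:, 1]
  gcx, gcy = reference[:, 0] + 0.5 * gw, reference[:, 1] + 0.5 * gh
  # regression targets: center offsets scaled by anchor size, log width/height ratios
  dx, dy = wx * (gcx - acx) / aw, wy * (gcy - acy) / ah
  dw, dh = ww * np.log(gw / aw), wh * np.log(gh / ah)
  return np.stack([dx, dy, dw, dh], axis=1)
```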
* add wandb
* fix wandb resuming feature
* move anchors as part of dataloader
* fix dtype for anchor inside dataloader and fix horizontal flip transformation
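The usual pitfall behind the horizontal-flip fix is that flipping the image also requires mirroring the box x-coordinates and swapping x1/x2 so boxes stay well-formed. A small sketch of such a transform (HWC numpy image; the 50% probability is an assumption):

```python
import numpy as np

def random_horizontal_flip(img, boxes, p=0.5):
  # img: (H, W, C); boxes: (N, 4) as (x1, y1, x2, y2) in pixel coordinates
  if np.random.rand() >= p: return img, boxes
  w = img.shape[1]
  img = img[:, ::-1, :].copy()
  boxes = boxes.copy()
  x1, x2 = boxes[:, 0].copy(), boxes[:, 2].copy()
  boxes[:, 0], boxes[:, 2] = w - x2, w - x1   # mirror and swap so x1 < x2 still holds
  return img, boxes
```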
* add support for BENCHMARK
* set seed
* debug dataset test failure
* Revert "debug dataset test failure"
This reverts commit 1b2f9d7f50.
* fix dataloader script
* do not realize when sharding model weights
* setup openimages samples differently
* create the necessary samples per test case
* enable lr scheduler and fix benchmark timing
* add jit to the training loop
* add checkpointing and training resume capabilities
* refactor training loop and start the work on the val loop
* add debug logging for dataloader test
* debug test
* assert boxes again
* update validation dataloader and more cleanups
* fix validation test case
* add multi device support to retinanet eval
* fix issue with realized on dataloader
* remove optional disk tensors in dataloader
* remove verbose debugging on datasets test
* put back parallel testing and remove img_ids Tensor from dataloader
* cleanup train and validation dataloader
* return validation targets in dataloader
* cleanup boxes and labels in dataloader
* fix img_ids repeating its values
* remove unnecessary targets from validation dataloader
* add validation loop to training script
* adjust LR to be the ratio of the batch size
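i.e. linear LR scaling; roughly (the reference values here are placeholders, not this PR's hyperparameters):

```python
BASE_LR, BASE_BS = 1e-4, 32     # assumed reference hyperparameters (placeholders)
BS = 96
lr = BASE_LR * BS / BASE_BS     # scale the LR by the batch-size ratio
```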
* minor cleanups
* remove frozen layers from optimizer's params
* hyperparameter adjustments and cleanups
* model init, hyperparam, and data preprocessing updates
* no need to return loaded keys for resnet
* fix train script
* update loss calculation for RegressionHead and some cleanups
* add JIT reset support
* add nan check during training
* Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
* Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
* some typing cleanups
* update seeding on dataloader and the start of training script
* undo changes
* undo more changes
* more typing fixes
* minor cleanups
* update dataloader seed
* hotfix: log metric and move target metric check outside of CKPT
* check for CKPT when target metric is reached before saving
* add TRAIN_BEAM and EVAL_BEAM
* minor cleanup
* update hyperparams and add support for EVAL_BS
* add green coloring to metric reached statement
* initial work to support f16
* update model initializers to be monkeypatched
* update layers to support float32 weight loading + float16 training
* don't return loss that's scaled
* run eval on benchmark beam
* move BEAM to their respective steps
* update layers to be compatible with fp16
* end BENCHMARK after first eval
* cleanups and adjust learning rate for fp16
* remove duplicated files from test
* revert losses changes
* Revert "revert losses changes"
This reverts commit aebccf93ac.
* go back to old LR
* cast batchnorm to float32
* set new loss scaler default value for float16
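The loss-scaler entries follow the usual mixed-precision recipe: scale the loss before backward so fp16 gradients don't underflow, unscale the gradients before the update, and log the unscaled loss ("don't return loss that's scaled"). A hedged tinygrad-flavored sketch, shown in fp32 for brevity; LOSS_SCALER=128 and the toy model are assumptions:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

LOSS_SCALER = 128.0  # assumed default; 1.0 would disable scaling (e.g. for fp32)

model = Linear(16, 4)
opt = SGD(get_parameters(model), lr=0.01)

def train_step(x, y):
  opt.zero_grad()
  loss = ((model(x) - y) ** 2).mean()
  (loss * LOSS_SCALER).backward()                        # scale before backward
  for p in opt.params: p.grad = p.grad / LOSS_SCALER     # unscale before the update
  opt.step()
  return loss.realize()                                  # return the *unscaled* loss

with Tensor.train():
  x, y = Tensor.randn(8, 16), Tensor.randn(8, 4)
  print(train_step(x, y).item())
```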
* remove LambdaLRScheduler
* remove runner and use dataloader on eval
* fix retinanet eval with new dataloader
* remove unused import
* revert lr_scheduler updates
* use BS=96 with new learning rate
* rename module initializers
* more cleanups on training loop
* remove contig from optim.step
* simplify sum when computing loss
For some reason, with random dropout it creates a different AST on each device, and searching the embedding is slow. This workaround saved 6 minutes of setup time on MI300X (25->19) and resulted in similar speed.
* resnet individual layer benchmarks!
* small
* 1 and 2
* mem_used
* no ci
* better conv print
* defaults
* prints
* adjust
* adjust
* adjust
* benchmark only one layer example
* tensor.training, zero_grad, sum instead of mean, last mem, last kernel count
* default jitcnt=1
* scale flops/kernels with jitcnt
* add note about jitcnt memory
* touchup
* this is a lot of stuff
TEST_TRAIN env for less data
don't diskcache get_train_files
debug message
no lr_scaler for fp32
comment, typo
type stuff
don't destructure proc
make batchnorm parameters float
make batchnorm parameters float
resnet18, checkpointing
hack up checkpointing to keep the names in there
oops
wandb_resume
lower lr
eval/ckpt use e+1
lars
report top_1_acc
some wandb stuff
split fw and bw steps to save memory
oops
save model when reach target
formatting
make sgd hparams consistent
just always write the cats tag...
pass X and Y into backward_step to trigger input replace
shuffle eval set to fix batchnorm eval
dataset is sorted by class, so the means and variances are all wrong
small cleanup
hack restore only one copy of each tensor
do bufs from lin after cache check (lru should handle it fine)
record epoch in wandb
more digits for topk in eval
more env vars
small cleanup
cleanup hack tricks
cleanup hack tricks
don't save ckpt for testeval
cleanup
diskcache train file glob
clean up a little
device_str
SCE into tensor
small
small
log_softmax out of resnet.py
oops
hack :(
comments
HeNormal, track gradient norm
oops
log SYNCBN to wandb
real truncnorm
less samples for truncated normal
custom init for Linear
log layer stats
small
Revert "small"
This reverts commit 988f4c1cf3.
Revert "log layer stats"
This reverts commit 9d98224585.
rename BNSYNC to SYNCBN to be consistent with cifar
optional TRACK_NORMS
fix label smoothing :/
lars skip list
only weight decay if not in skip list
comment
default 0 TRACK_NORMS
don't allocate beam scratch buffers if in cache
clean up data pipeline, unsplit train/test, put back a hack
remove print
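Several notes in the entry above ("real truncnorm", "less samples for truncated normal", "custom init for Linear", "HeNormal") are about weight init. For illustration only, a rejection-sampled truncated normal in numpy; the +/-2 sigma bound and the He-style fan-in scaling below are conventional choices, not necessarily what the commits use:

```python
import numpy as np

def rand_truncated_normal(shape, std=1.0, bound=2.0, seed=0):
  # resample anything outside +/- bound standard deviations until all samples are inside
  rng = np.random.default_rng(seed)
  out = rng.standard_normal(shape)
  bad = np.abs(out) > bound
  while bad.any():
    out[bad] = rng.standard_normal(int(bad.sum()))
    bad = np.abs(out) > bound
  return (out * std).astype(np.float32)

fan_in = 512
w = rand_truncated_normal((256, fan_in), std=np.sqrt(2.0 / fan_in))  # He-style scaling
```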
run test_indexing on remu (#3404)
* emulated ops_hip infra
* add int4
* include test_indexing in remu
* Revert "Merge branch 'remu-dev-mac'"
This reverts commit 6870457e57, reversing
changes made to 3c4c8c9e16.
fix bad seeding
UnsyncBatchNorm2d but with synced trainable weights
label downsample batchnorm in Bottleneck
:/
:/
i mean... it runs... it hits the acc... it's fast...
new unsyncbatchnorm for resnet
small fix
don't do assign buffer reuse for axis change
* remove changes
* remove changes
* move LARS out of tinygrad/
* rand_truncn rename
* whitespace
* stray whitespace
* no more gnorms
* delete some dataloading stuff
* remove comment
* clean up train script
* small comments
* move checkpointing stuff to mlperf helpers
* if WANDB
* small comments
* remove whitespace change
* new unsynced bn
* clean up prints / loop vars
* whitespace
* undo nn changes
* clean up loops
* rearrange getenvs
* cpu_count()
* PolynomialLR whitespace
* move he_normal out
* cap warmup in polylr
* rearrange wandb log
* realize both x and y in data_get
* use double quotes
* combine prints in ckpts resume
* take UBN from cifar
* running_var
* whitespace
* whitespace
* typo
* if instead of ternary for resnet downsample
* clean up dataloader cleanup a little?
* separate rng for shuffle
* clean up imports in model_train
* clean up imports
* don't realize copyin in data_get
* remove TESTEVAL (train dataloader didn't get freed every loop)
* adjust wandb_config entries a little
* clean up wandb config dict
* reduce lines
* whitespace
* shorter lines
* put shm unlink back, but it doesn't seem to do anything
* don't pass seed per task
* monkeypatch batchnorm
* the reseed was wrong
* add epoch number to desc
* don't use UnsyncBatchNorm if SYNCBN=1
* put back downsample name
* eval every epoch
* Revert "the reseed was wrong"
This reverts commit 3440a07dff3f40e8a8d156ca3f1938558a59249f.
* cast lr in onecycle
* support fp16
* cut off kernel if expand after reduce
* test polynomial lr
* move polynomiallr to examples/mlperf
* working PolynomialDecayWithWarmup + tests.......
add lars_util.py, oops
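The `PolynomialDecayWithWarmup` scheduler above typically ramps the LR linearly over a capped warmup window and then decays it polynomially to an end value. A framework-agnostic sketch; power=2 and the example numbers are assumptions:

```python
def polynomial_decay_with_warmup(step, base_lr, end_lr, warmup_steps, total_steps, power=2.0):
  # linear warmup from 0 to base_lr, then polynomial decay down to end_lr
  if step < warmup_steps:
    return base_lr * (step + 1) / warmup_steps
  t = min(step - warmup_steps, total_steps - warmup_steps)   # cap at the decay horizon
  frac = 1.0 - t / (total_steps - warmup_steps)
  return (base_lr - end_lr) * frac ** power + end_lr

for e in range(6):  # usage example: LR over the first few epochs
  print(e, round(polynomial_decay_with_warmup(e, base_lr=8.0, end_lr=1e-4,
                                              warmup_steps=2, total_steps=40), 4))
```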
* keep lars_util.py as intact as possible, simplify our interface
* no more half
* polylr and lars were merged
* undo search change
* override Linear init
* remove half stuff from model_train
* update scheduler init with new args
* don't divide by input mean
* mistake in resnet.py
* restore whitespace in resnet.py
* add test_data_parallel_resnet_train_step
* move initializers out of resnet.py
* unused imports
* log_softmax to model output in test to fix precision flakiness
* log_softmax to model output in test to fix precision flakiness
* oops, don't realize here
* is None
* realize initializations in order for determinism
* BENCHMARK flag for number of steps
* add resnet to benchmark.yml
* return instead of break
* missing return
* cpu_count, rearrange benchmark.yml
* unused variable
* disable tqdm if BENCHMARK
* getenv WARMUP_EPOCHS
* unlink disktensor shm file if exists
* terminate instead of join
* properly shut down queues
* use hip in benchmark for now
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>