Commit Graph

62 Commits

Author SHA1 Message Date
chenyu
5a5fbfa1eb smaller bert script change (#6768)
only reorders WANDB and RUNMLPERF; BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss (see the loss sketch after this commit)

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + clean up various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update default batch size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is run and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
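For reference, the DICE cross entropy loss added in this commit combines a soft Dice term with per-voxel cross entropy. Below is a minimal NumPy sketch of that combination; the function names, epsilon, and the equal 0.5/0.5 weighting are illustrative assumptions, not the repo's actual implementation.

```python
import numpy as np

def softmax(x, axis=1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dice_ce_loss(logits, label, eps=1e-6):
    # logits: (N, C, D, H, W) raw scores, label: (N, D, H, W) integer class ids
    n_classes = logits.shape[1]
    pred = softmax(logits, axis=1)
    one_hot = np.eye(n_classes)[label].transpose(0, 4, 1, 2, 3)   # (N, C, D, H, W)
    spatial = (2, 3, 4)
    inter = (pred * one_hot).sum(axis=spatial)
    denom = pred.sum(axis=spatial) + one_hot.sum(axis=spatial)
    dice_loss = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    ce_loss = -(one_hot * np.log(pred + eps)).sum(axis=1).mean()
    return 0.5 * dice_loss + 0.5 * ce_loss
```

The DICE score used on the eval side is typically the same per-class ratio without the `1 -` and cross-entropy parts.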
Elias Wahl
c9b4602854 no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl
c9862e17d4 MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
Elias Wahl
937bf5fe12 better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
Elias Wahl
73bddc44f6 Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
Elias Wahl
e267f3161d Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
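The MLLogger here wraps the mlperf_logging package's mllog API. A rough usage sketch of that API follows; the keys, ordering, and placeholder values are illustrative rather than this repo's exact calls.

```python
from mlperf_logging import mllog
from mlperf_logging.mllog import constants

mllog.config(filename="bert.log")
mllogger = mllog.get_mllogger()

mllogger.event(key=constants.SUBMISSION_BENCHMARK, value="bert")
mllogger.start(key=constants.INIT_START)
# ... build the model and dataloaders ...
mllogger.end(key=constants.INIT_STOP)

mllogger.start(key=constants.RUN_START)
train_steps, eval_freq = 100, 10           # placeholder values
for step in range(1, train_steps + 1):     # the commit above starts counting at step 1
    # ... run one training step ...
    if step % eval_freq == 0:
        mllogger.event(key=constants.EVAL_ACCURACY, value=0.0,
                       metadata={"epoch_num": step})
mllogger.end(key=constants.RUN_STOP, metadata={"status": "success"})
```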
Elias Wahl
f31ef11537 Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl
7bfa9101c0 Float in scaled dot product attention (#4985)
* Monkeypatch scaled-dot-product-attention

* Use dot instead of matmul

* new api

* imports

* least_upper_dtype
2024-06-18 08:16:41 -04:00
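The idea in this commit is to keep the softmax inside scaled dot-product attention in a wider float dtype while activations stay in half. A framework-agnostic NumPy sketch of that gist (not the actual monkeypatch, which uses tinygrad's dot and least_upper_dtype):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq, head_dim), possibly float16
    scores = (q @ k.swapaxes(-1, -2)) / np.sqrt(q.shape[-1])
    scores = scores.astype(np.float32)                 # do the softmax math in float
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights.astype(q.dtype) @ v                 # cast back before the value matmul
```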
Elias Wahl
d2e3c391e8 Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Elias Wahl
e576aca044 Disable dropout (#4837) 2024-06-04 18:57:26 -04:00
Elias Wahl
bb248a0dd1 Optional half matmul (#4835)
* half linear

* move weight cast back

* oops

* matmul dtype var

* todo comment
2024-06-04 17:53:41 -04:00
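A small sketch of what an optional half matmul for a linear layer looks like in tinygrad style; the HALF_LINEAR flag name and the transpose convention are assumptions, not necessarily what this commit ships.

```python
from tinygrad import Tensor, dtypes
from tinygrad.helpers import getenv

HALF_LINEAR = getenv("HALF_LINEAR", 0)  # hypothetical flag name

def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor:
    # keep master weights in float, but optionally run the matmul itself in half
    if HALF_LINEAR:
        x, weight = x.cast(dtypes.half), weight.cast(dtypes.half)
    return x.linear(weight.transpose(), bias)
```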
Elias Wahl
04e237328b Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
Elias Wahl
c4b0acf095 Global norm + small changes (#4749)
* norm

* no empty

* default loss scaler in float
2024-05-27 18:35:27 -04:00
Elias Wahl
acc0039cfc Resume fix + scheduler for non weight decay params (#4679)
* move ckpt dir

* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
Elias Wahl
993091adfa loss scaler + nan fixes (#4661) 2024-05-20 17:08:35 -04:00
chenyu
bed70b130c mlperf bert getenv-able EVAL_STEP_FREQ (#4534) 2024-05-11 14:36:56 -04:00
chenyu
04a4980a51 touchup bert script (#4531)
small adjustments: remove a duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00
chenyu
b00b6b16f0 fix TRAIN_BEAM and Tensor.training for mlperf bert (#4525)
also hard-coded the bert model config instead of looking it up from a file
2024-05-11 00:18:36 -04:00
chenyu
b399d98e41 fix resnet eval (#4507) 2024-05-10 00:49:00 -04:00
chenyu
0e8aa0e288 use fake data in beam searching resnet (#4504) 2024-05-09 23:43:50 -04:00
chenyu
047c7f3e5b polish resnet mlperf logging (#4490)
don't include the final checkpoint save time in the run time, plus some cosmetic ordering changes
2024-05-09 13:04:24 -04:00
chenyu
d78e159aa3 resnet logging move RUN_START to start of the script (#4488) 2024-05-09 12:32:32 -04:00
chenyu
1f6bf9d2f7 real diskcache_clear in model_train resnet (#4445)
clear the cache if INITMLPERF is set or when running run_and_time. dev_beam and dev_run do not clear the cache
2024-05-08 19:06:09 -04:00
chenyu
1b4645bea6 hotfix resnet move init_start to start of the script (#4481) 2024-05-08 19:03:52 -04:00
chenyu
db7e15c46f hotfix resnet only log epoch start with RUNMLPERF (#4477) 2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d mlperf logging, truncate dir in logs and log seed (#4475) 2024-05-08 12:54:02 -04:00
chenyu
b62a65b617 redo faster sparse_categorical_crossentropy (#4461)
also update the default LR and DECAY for resnet, which help convergence
2024-05-08 11:21:43 -04:00
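For context, sparse categorical cross-entropy takes integer labels directly instead of one-hot targets; the commit reworks how it is computed for speed. The NumPy sketch below only shows the math being computed, with an ignore_index convention assumed for padded labels.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, labels, ignore_index=-1):
    # logits: (N, C), labels: (N,) integer class ids; ignore_index entries are masked out
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    mask = labels != ignore_index
    picked = log_probs[np.arange(len(labels)), np.where(mask, labels, 0)]
    return -(picked * mask).sum() / mask.sum()
```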
wozeparrot
603d3a351b feat: allow keeping multiple cookies (#4440) 2024-05-05 19:26:48 -07:00
David Hou
b767d59684 resnet trainer: keep old cookie around until next step has been queued (#4401)
* keep old cookie around until next step has been queued (-10ms 6gpu)

* also for eval

* drop cookie before data_get?

* Revert "drop cookie before data_get?"

This reverts commit b01e6aa2b2.

* Revert "Revert "drop cookie before data_get?""

This reverts commit 23464e73d4.
2024-05-03 12:15:21 -04:00
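The "cookie" returned by data_get pins the shared-memory slot a batch was loaded into; dropping it lets the dataloader recycle that slot. This commit delays that drop until the next step has been queued. A rough sketch of the resulting loop shape (train_step and the iterator protocol are stand-ins, not the exact trainer code):

```python
def train_loop(it, train_step, steps):
    # it yields (x, y, cookie); holding a cookie keeps that batch's buffer alive
    def data_get(it):
        x, y, cookie = next(it)
        return x, y, cookie

    old_cookie = None
    x, y, cookie = data_get(it)
    for _ in range(steps):
        loss = train_step(x, y)      # queue this batch's GPU work first...
        old_cookie = cookie          # ...then release the previous batch's cookie
        x, y, cookie = data_get(it)  # the held buffer can't be overwritten mid-step
    return loss
```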
chenyu
2c3b7f8e70 pad resnet training data with training data mean (#4369)
update model_train resnet to pad the training data
2024-05-02 20:26:15 -04:00
chenyu
ab01a9433d resnet eval 4n+3 if epoch < 33 (#4391)
the rule only requires evals at epochs 4n+k, and we can stop the clock as soon as eval hits the target. this can save 24 evals, or 12 minutes
2024-05-02 16:52:07 -04:00
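Concretely, the schedule described above only runs eval at epochs of the form 4n+3 until the cutover epoch, then every epoch. The helper below is an illustrative restatement; the helper name is hypothetical and the exact epoch indexing is as in the commit title.

```python
def should_eval(epoch: int, cutover: int = 33) -> bool:
    # eval only at epochs 4n+3 before the cutover, then every epoch until the target is hit
    return epoch % 4 == 3 if epoch < cutover else True
```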
chenyu
bf31837e6d resnet correct steps_in_val_epoch in logging (#4389)
also added random seed from system in scripts
2024-05-02 10:51:36 -04:00
chenyu
22376e53b7 resnet mlperf logging (#4361)
* resnet mlperf logging

* cropping too much?
2024-05-02 00:00:04 -04:00
chenyu
6628e13a5f pad resnet eval data in model_train (#4374)
assert if the eval sample count differs from the total eval file count.
2024-05-01 14:33:42 -04:00
chenyu
826cccd54d fix mean underflow for half tensor (#4377)
* fix mean underflow for half tensor

divide only by the reduce factor. added a unit test and a non-NaN assertion in resnet training. also added a failing test case for symbolic shape var

* skip for python backend
2024-05-01 13:38:57 -04:00
Elias Wahl
babe87a8ae BERT: Checkpoint loading tests (#4359)
* Move checkpoint init to helpers. Add test

* linters

* Move the steps outside of the main train loop

* Move data_get

* data_get belongs to helpers
2024-04-30 14:43:41 -04:00
Elias Wahl
71ff68b445 dropout after eval step (#4351) 2024-04-29 15:47:21 -04:00
Elias Wahl
27613dd881 MLPerf BERT: Main training loop (#4288)
* BERT language modeling head + trunc normal initializers (a sketch of the initializer follows this commit)

* add train loop + helpers

* shuffle in dataloaders + slight changes in main loop

* beam change

* Minor changes

* random.shuffle

* HParam update

* Use deque for dataloader

* wandb bert project name

* half fixes

* BENCHMARK + remove epoch

* cast + print()

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-29 14:35:27 -04:00
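The "trunc normal initializers" mentioned in the first bullet can be sketched as simple rejection sampling; the std and ±2σ cutoff below are common BERT defaults assumed for illustration, not taken from the commit.

```python
import numpy as np

def trunc_normal(shape, std=0.02, limit=2.0):
    # draw from N(0, std) and redraw any sample beyond ±limit*std
    x = np.random.normal(0.0, std, size=shape)
    while (bad := np.abs(x) > limit * std).any():
        x[bad] = np.random.normal(0.0, std, size=int(bad.sum()))
    return x
```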
chenyu
ec65aea32f resnet stop the script once hit target (#4303)
* resnet stop the script once hit target

* comment
2024-04-25 23:54:56 -04:00
chenyu
f9a7badace use LR=7 for resnet with BS=1536 (#4299)
had 3 runs after the float32 LR change; it seems quite stable and converges at epochs 34 and 35
2024-04-25 15:23:10 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update the LR default to scale based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
chenyu
6637ecc5fe use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want different BEAM values for resnet train and eval, and the global JITBEAM cannot do this. added the flag to change beam behavior at cnt=0 (so by default it behaves the same with or without TinyJit); for cnt=1 it uses the existing BEAM.value.

Also updated the context var BEAM in resnet to be set outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
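The pattern this enables: keep separate TRAIN_BEAM and EVAL_BEAM values and enter the BEAM context outside of the TinyJit'd step functions, so each captured kernel set gets its own beam width. A minimal sketch, assuming train_step and eval_step are the jitted steps:

```python
from tinygrad.helpers import getenv, Context, BEAM

TRAIN_BEAM = getenv("TRAIN_BEAM", BEAM.value)
EVAL_BEAM = getenv("EVAL_BEAM", BEAM.value)

def run_train_step(train_step, x, y):
    with Context(BEAM=TRAIN_BEAM):   # context is set outside the jitted call
        return train_step(x, y)

def run_eval_step(eval_step, x, y):
    with Context(BEAM=EVAL_BEAM):
        return eval_step(x, y)
```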
chenyu
37f8be6450 resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable resetting jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
chenyu
f7416916df update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
chenyu
d5b67c1ca3 log resnet TRAIN_BEAM / EVAL_BEAM (#4181)
also run eval in benchmark mode if either one is positive
2024-04-15 19:29:08 -04:00
chenyu
6a2168e698 TRAIN_BEAM and EVAL_BEAM for resnet (#4177)
working on measuring compile time
2024-04-15 14:57:21 -04:00
chenyu
e20d6f9221 correct resnet estimate time (#4169)
7.99 hours was rendered as 7h0m.
2024-04-14 02:21:46 -04:00