tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-04-29 03:00:14 -04:00

Author	SHA1	Message	Date
chenyu	ff05bff221	put bert data shard inside jit (#9160): python time 45ms -> 9ms; it was spending time scheduling the shard. Also init bert data on CLANG since it's from numpy, so we don't create the tensor on default device then shard into GPUS	2025-02-18 10:36:54 -05:00
chenyu	5dc1257ce0	clean up bert fake data iterator [pr] (#9145 ) reuse the same get_data_bert path in setup and real run	2025-02-17 20:03:38 -05:00
chenyu	81597ddd96	increase lr for bert (#9098 ) had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview	2025-02-14 19:10:35 -05:00
chenyu	b58e7b1898	zero out the weight in bert init run (#9076): `DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of random init weights caused the OOM.	2025-02-14 08:40:41 -05:00
chenyu	9e91898941	bert eval at the end of training (#9070 ) always eval at the last epoch	2025-02-13 16:29:44 -05:00
chenyu	7b5ac2c15e	free_intermediates in bert (#9040 ) also re-enable dropout and update EVAL_BS	2025-02-12 10:00:39 -05:00
chenyu	a092b6395d	Tuple -> tuple, List -> list [pr] (#8936 )	2025-02-06 14:21:19 -05:00
chenyu	c7ca7959e6	set DISABLE_DROPOUT=1 in bert script for now (#8799 )	2025-01-29 10:51:29 -05:00
chenyu	c99ae81f63	update default resnet LOSS_SCALER to 256 [pr] (#8774 )	2025-01-27 16:59:05 -05:00
chenyu	af65331b76	update beam params for bert green [pr] (#8726 ) increase BEAM_UPCAST_MAX and BEAM_LOCAL_MAX to default and matched red. 3% faster step	2025-01-22 22:00:05 -05:00
chenyu	9a9079118e	envvar BERT_LAYERS [pr] (#8709 ) default is 24 for large	2025-01-21 22:49:19 -05:00
chenyu	9f6d545a16	bert log global_norm in training step [pr] (#8708 ) * bert log global_norm in training step [pr] and minor cleanups * .item()	2025-01-21 20:36:27 -05:00
chenyu	1e283c33d3	remove realize in bert model init [pr] (#8707 )	2025-01-21 14:11:03 -05:00
chenyu	930728c069	bert BS 72->66 [pr] (#8621 ) 72 does not fit now	2025-01-14 18:41:41 -05:00
chenyu	994944920b	simpler batch_load_train_bert [pr] (#8582 ) don't think that buffer is really beneficial. 5% faster data_time and 1ms faster per step. https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview	2025-01-12 20:25:05 -05:00
chenyu	def90b22f6	EVAL_BS=36 for bert [pr] (#8576 ) 3X faster eval compared to BS=6. green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview	2025-01-12 09:43:56 -05:00
chenyu	64a917b7eb	remove LAZYCACHE ContextVar [pr] (#8175 ) also removed from resnet latest script	2024-12-11 22:02:52 -05:00
chenyu	3e2430f822	use tqdm tqdm in mlperf training (#7929 ) issue in benchmark dashboard logging, revert back to tqdm tqdm for now	2024-11-27 21:57:05 -05:00
qazal	9828277c03	view doesn't have buffer, fix the tests [pr] (#7841 ) * view doesn't have buffer, fix the tests [pr] * need assigns	2024-11-22 20:41:55 +08:00
Francis Lata	90eff347e2	tinytqdm write support (#6359 ) * add write support * add test * update test case to compare write outputs * assert final write output * flush when using write * update write logic * Revert "update write logic" This reverts commit `5e0e611b46`. --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-10-16 14:51:41 -04:00
chenyu	ed1ed9e4ff	bert use BS=72 (#7015): memory 131 -> 138; green tflops 201 -> 209; red tflops 160 -> 169	2024-10-12 09:41:56 -04:00
chenyu	36056e0760	update mlperf systems and copy 4.1 to 5.0 (#7004 )	2024-10-11 16:20:34 -04:00
chenyu	0e42662f2a	log seed at the right place for bert (#7000 )	2024-10-11 10:39:40 -04:00
nimlgen	5496a36536	update red mlperf bert readme (#6969 )	2024-10-11 13:08:06 +03:00
chenyu	b5546912e2	10% more TRAIN_STEPS for bert (#6971): got two very close runs, adding more steps for buffer	2024-10-09 19:21:43 -04:00
chenyu	35cf48659b	limit beam param for bert on green (#6966 ) seems to mitigate the crash	2024-10-09 11:48:18 -04:00
chenyu	1ff2c98f8a	fix logfile name for bert red (#6952 )	2024-10-08 05:37:52 -04:00
chenyu	a78c96273a	update bert epoch logging (#6940 ) * update bert epoch logging epoch for bert is simply number of examples seen (which is used for RCP check) * update total steps too * more changes	2024-10-08 00:34:06 -04:00
chenyu	102dfe5510	back to 210 for bert loss scaler (#6934 ) getting 2 NaN for this, revert back to 210	2024-10-07 10:17:21 -04:00
chenyu	0cf815a93a	bert use BS=66 and update hparams (#6932 ) with dropout memory improvement, we can fit BS=66 now. revert back to the hparams in #5891 too	2024-10-07 05:08:27 -04:00
chenyu	718b959349	log epoch start and stop for bert (#6912 )	2024-10-06 06:39:46 -04:00
chenyu	16c1fa4208	use BEAM=3 for red box bert runs (#6904 ) BEAM=4 slightly exceeded 30 minutes setup	2024-10-05 09:21:12 -04:00
chenyu	0e706227a2	add seed to bert result log filename (#6903 ) * add seed to bert result log filename * different name for different benchmark	2024-10-05 09:15:24 -04:00
chenyu	7391376528	update bert hparams (#6876 ) 4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview. loss scaler 213->210. matched the closest submission, no nan for ~10 runs. increased lr and total step a bit. `PARALLEL=0` after setup, same as resnet.	2024-10-04 00:39:06 -04:00
chenyu	5f77217772	bert default CKPT to 0 (#6840 ) not required	2024-10-01 21:55:56 -04:00
chenyu	f59517754e	add RESET_STEP in bert to control reset (#6818 ) same as resnet	2024-09-30 09:39:04 -04:00
chenyu	494b20e886	bert BS back to 54 (#6791 ) 60 does not run end to end	2024-09-27 22:16:05 -04:00
chenyu	572d77d1d9	bert script delete eval data after eval (#6790): fits BS=60, which is 2% faster than 54. Also fixed wandb logging params	2024-09-27 20:54:00 -04:00
chenyu	f9c8e144ff	chmod +x mlperf bert script for red (#6789 ) also disabled raising power cap in setup. wozeparrot mentioned that's unstable and might cause bert training issue on red	2024-09-27 11:27:32 -04:00
Francis Lata	d3a387be63	[MLPerf] Prepare openimages dataset script (#6747 ) * prepare openimages for MLPerf * cleanup * fix issue when clearing jit_cache on retinanet eval * revert pandas specific changes	2024-09-27 11:13:56 -04:00
chenyu	bea7ed5986	add RUNMLPERF=1 to bert dev_run.sh (#6775 ) already set in run_and_time.sh, need RUNMLPERF=1 for it to load real data	2024-09-26 11:00:49 -04:00
chenyu	12de203a43	add IGNORE_JIT_FIRST_BEAM to bert scripts (#6769 ) * update bert BEAM params copied from resnet to start with * just IGNORE_JIT_FIRST_BEAM	2024-09-26 05:38:24 -04:00
chenyu	5a5fbfa1eb	smaller bert script change (#6768 ) only WANDB and RUNMLPERF order. BENCHMARK and BEAM will be done differently	2024-09-26 04:54:28 -04:00
chenyu	396c96357b	update mlperf bert scripts (#6755): removed DISABLE_DROPOUT=1; updated BS to 54, which works on tinyboxes with dropout; used bert's sparse_categorical_crossentropy, which takes a Tensor ignore_index, in the accuracy method	2024-09-25 23:55:05 -04:00
Anurag Lamsal	568757e087	fix model_eval.py in the mlperf folder searching for bert vocab in the wrong directory (#6649 )	2024-09-24 11:20:44 +08:00
Francis Lata	b7ce9a1530	UNet3D MLPerf (#3470 ) * add training set transforms * add DICE cross entropy loss * convert pred and label to Tensor when calculating DICE score * cleanups and allow train dataset batching * fix DICE CE loss calculation * jitted training step * clean up DICE CE loss calculation * initial support for sharding * Revert "initial support for sharding" This reverts commit `e3670813b8`. * minor updates * cleanup imports * add support for sharding * apply temp patch to try to avoid OOM * revert cstyle changes * add gradient acc * hotfix * add FP16 support * add ability to train on smaller image sizes * add support for saving and loading checkpoints + cleanup some various modes * fix issue with using smaller patch size + update W&B logging * disable LR_WARMUP_EPOCHS * updates * minor cleanups * cleanup * update order of transformations * more cleanups * realize loss * cleanup * more cleanup * some cleanups * add RAM usage * minor cleanups * add support for gradient accumulation * cleanup imports * minor updates to not use GA_STEPS * remove FP16 option since it's available now globally * update multi-GPU setup * add timing logs for training loop * go back to using existing dataloader and add ability to preprocess data to save time * clean up optimization and re-enable JIT and multi-GPU support for training and evaluation * free train and eval steps memory * cleanups and scale batch size based on the number of GPUs * fix GlobalCounters import * fix seed * fix W&B setup * update batch size default size * add back metric divergence check * put back JIT on UNet3d eval * move dataset preprocessing inside training code * add test for dice_loss * add config logging support to W&B and other cleanups * change how default float is getting retrieved * remove TinyJit import duplicate * update config logging to W&B and remove JIT on eval_step * no need for caching preprocessed data anymore * fix how evaluation is ran and how often * add support for LR scaling * 
fix issue with gaussian being moved to scipy.signal.windows * remove DICE loss unit test * fix issue where loss isn't compatible with multiGPU * add individual BEAM control for train and eval steps * fix ndimage scipy import * add BENCHMARK * cleanups on BENCHMARK + fix on rand_flip augmentation during training * cleanup train and eval BEAM envs * add checkpointing support after every eval * cleanup model_eval * disable grad during eval * use new preprocessing dataset mechanism * remove unused import * use training and inference_mode contexts * start eval after benchmarking * add data fetching time * cleanup decorators * more cleanups on training script * add message during benchmarking mode * realize when reassigning LR on scheduler and update default number of epochs * add JIT on eval step * remove JIT on eval_step * add train dataloader for unet3d * move checkpointing to be done after every epoch * revert removal of JIT on unet3d inference * save checkpoint if metric is not successful * Revert "add train dataloader for unet3d" This reverts commit `c166d129df`. * Revert "Revert "add train dataloader for unet3d"" This reverts commit `36366c65d2`. * hotfix: seed was defaulting to a value of 0 * fix SEED value * remove the usage of context managers for setting BEAM and going from training to inference * support new stack API for calculating eval loss and metric * Revert "remove the usage of context managers for setting BEAM and going from training to inference" This reverts commit `2c0ba8d322`. * check training and test preprocessed folders separately * clean up imports and log FUSE_CONV_BW * use train and val preprocessing constants * add kits19 dataset setup script * update to use the new test decorator for disabling grad * update kits19 dataset setup script * add docs on how to train the model * set default value for BASEDIR * add detailed instruction about BASEDIR usage --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-09-10 04:37:28 -04:00
Elias Wahl	c9b4602854	no load in INITMLPERF (#5957 )	2024-08-08 11:28:24 -04:00
Elias Wahl	c9862e17d4	MLPERF BERT submission scripts (#5931 ) * green * red * fix benchmark * log * count train samples * oops. 4.0 -> 4.1 * note to todo * no pillow	2024-08-06 18:09:18 -04:00
chenyu	1dab75ae37	clean up mlperf dataloader import (#5940 ) use tinygrad tqdm for dataset, and PIL Image is only needed for resnet	2024-08-06 17:10:08 -04:00
Elias Wahl	937bf5fe12	better hparam (#5891 )	2024-08-03 12:38:53 -04:00

169 Commits