Commit Graph

93 Commits

Author SHA1 Message Date
chenyu
f53be010d7 lower bert learning rate (#9481)
slightly better. first sub-3hr run: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d bert lower learning rate and total steps (#9466)
closer to the other submission with BS=240. converged with 10% fewer epochs
2025-03-16 17:21:20 -04:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
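
A minimal sketch of what corealizing multiple outputs looks like, assuming the variadic `Tensor.realize(...)` form; the tensors, shapes, and names are illustrative, not this commit's diff:

```python
from tinygrad import Tensor

# two hypothetical outputs of one training step that share most of their graph
x = Tensor.rand(4, 8)
loss = (x * 2).sum()
global_norm = (x * x).sum().sqrt()

# realizing them together lets tinygrad schedule the combined graph once,
# instead of walking it separately for each output
Tensor.realize(loss, global_norm)
```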
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard, though
2025-02-25 09:05:11 -05:00
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms; it was spending the time scheduling the shard

also init bert data on CLANG since it comes from numpy, so we don't create the tensor on the default device and then shard it onto the GPUS
2025-02-18 10:36:54 -05:00
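
Roughly what the two changes in this commit look like; the device list, shapes, and variable names here are illustrative sketches, not the commit's code:

```python
import numpy as np
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))

# build the numpy-backed batch on CLANG (tinygrad's clang/CPU backend) instead of
# materializing it on the default (GPU) device first...
x = Tensor(np.zeros((8, 128), dtype=np.float32), device="CLANG")

# ...then shard it across the training devices; the commit moves this shard call
# inside the TinyJit-wrapped train step so its scheduling is captured once rather
# than redone in python every iteration (the 45ms -> 9ms above)
x = x.shard(GPUS, axis=0)
```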
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better: https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffers of the randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
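
A hedged sketch of the zero-out trick, using a stand-in `Linear` model rather than the real BERT:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.state import get_parameters

model = Linear(1024, 1024)   # stand-in for the BERT model

# overwrite the randomly initialized weights with zeros during the BENCHMARK/init
# pass, so the random-init buffers don't stay alive and push the real run into OOM
for p in get_parameters(model):
  p.assign(Tensor.zeros(*p.shape, dtype=p.dtype)).realize()
```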
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
chenyu
7b5ac2c15e free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
chenyu
c99ae81f63 update default resnet LOSS_SCALER to 256 [pr] (#8774) 2025-01-27 16:59:05 -05:00
chenyu
9f6d545a16 bert log global_norm in training step [pr] (#8708)
* bert log global_norm in training step [pr]

and minor cleanups

* .item()
2025-01-21 20:36:27 -05:00
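
A small sketch of global-norm logging with `.item()`; the gradients and the log line are illustrative:

```python
from tinygrad import Tensor

# hypothetical per-parameter gradients from one training step
grads = [Tensor.rand(16, 16), Tensor.rand(16)]

# global norm = sqrt of the summed squared gradient elements across all parameters
global_norm = Tensor.stack(*[g.float().square().sum() for g in grads]).sum().sqrt()

# .item() realizes it to a python float for the log line / wandb
print(f"global_norm: {global_norm.item():.4f}")
```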
chenyu
1e283c33d3 remove realize in bert model init [pr] (#8707) 2025-01-21 14:11:03 -05:00
chenyu
3e2430f822 use tqdm tqdm in mlperf training (#7929)
issue in benchmark dashboard logging; revert back to tqdm's tqdm for now
2024-11-27 21:57:05 -05:00
Francis Lata
90eff347e2 tinytqdm write support (#6359)
* add write support

* add test

* update test case to compare write outputs

* assert final write output

* flush when using write

* update write logic

* Revert "update write logic"

This reverts commit 5e0e611b46.

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-10-16 14:51:41 -04:00
chenyu
0e42662f2a log seed at the right place for bert (#7000) 2024-10-11 10:39:40 -04:00
chenyu
b5546912e2 10% more TRAIN_STEPS for bert (#6971)
got two very close runs; adding more steps as a buffer
2024-10-09 19:21:43 -04:00
chenyu
a78c96273a update bert epoch logging (#6940)
* update bert epoch logging

epoch for bert is simply the number of examples seen (which is what the RCP check uses)

* update total steps too

* more changes
2024-10-08 00:34:06 -04:00
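
An illustrative sketch of the bookkeeping described above; the batch size and step count are made up:

```python
# the logged "epoch" is just the number of training examples seen so far,
# which is what the MLPerf RCP check keys on, not passes over the dataset
BS, TRAIN_STEPS = 96, 1000
for step in range(1, TRAIN_STEPS + 1):
  samples_seen = step * BS
  epoch_num = samples_seen   # value reported as epoch_num in the result log
```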
chenyu
102dfe5510 back to 2**10 for bert loss scaler (#6934)
got NaN twice with this; reverting back to 2**10
2024-10-07 10:17:21 -04:00
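
One common way a static loss scaler is applied; this is a sketch with a stand-in model, not necessarily the script's exact mechanics (the unscale step in particular may differ):

```python
from tinygrad import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import LAMB
from tinygrad.nn.state import get_parameters

LOSS_SCALER = 2**10                      # 2**13 intermittently hit NaN; 2**10 did not

Tensor.training = True
model = Linear(8, 8)                     # stand-in for the BERT model
opt = LAMB(get_parameters(model))
x = Tensor.rand(4, 8)

loss = model(x).square().mean()
opt.zero_grad()
(loss * LOSS_SCALER).backward()          # scale up so half-precision grads don't underflow
for p in opt.params:
  p.grad = p.grad / LOSS_SCALER          # unscale before the optimizer update
opt.step()
```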
chenyu
0cf815a93a bert use BS=66 and update hparams (#6932)
with the dropout memory improvement, we can fit BS=66 now. also revert back to the hparams in #5891
2024-10-07 05:08:27 -04:00
chenyu
718b959349 log epoch start and stop for bert (#6912) 2024-10-06 06:39:46 -04:00
chenyu
0e706227a2 add seed to bert result log filename (#6903)
* add seed to bert result log filename

* different name for different benchmark
2024-10-05 09:15:24 -04:00
chenyu
7391376528 update bert hparams (#6876)
4h32m with this https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/q99frv1l/overview.

loss scaler 2**13 -> 2**10. matched the closest submission; no NaN for ~10 runs.

increased lr and total steps a bit.

`PARALLEL=0` after setup, same as resnet.
2024-10-04 00:39:06 -04:00
chenyu
5f77217772 bert default CKPT to 0 (#6840)
not required
2024-10-01 21:55:56 -04:00
chenyu
f59517754e add RESET_STEP in bert to control reset (#6818)
same as resnet
2024-09-30 09:39:04 -04:00
chenyu
572d77d1d9 bert script delete eval data after eval (#6790)
fits BS=60, which is 2% faster than BS=54. also fixed the wandb logging params
2024-09-27 20:54:00 -04:00
chenyu
5a5fbfa1eb smaller bert script change (#6768)
only reorders WANDB and RUNMLPERF. BENCHMARK and BEAM will be handled differently
2024-09-26 04:54:28 -04:00
Francis Lata
b7ce9a1530 UNet3D MLPerf (#3470)
* add training set transforms

* add DICE cross entropy loss

* convert pred and label to Tensor when calculating DICE score

* cleanups and allow train dataset batching

* fix DICE CE loss calculation

* jitted training step

* clean up DICE CE loss calculation

* initial support for sharding

* Revert "initial support for sharding"

This reverts commit e3670813b8.

* minor updates

* cleanup imports

* add support for sharding

* apply temp patch to try to avoid OOM

* revert cstyle changes

* add gradient acc

* hotfix

* add FP16 support

* add ability to train on smaller image sizes

* add support for saving and loading checkpoints + cleanup some various modes

* fix issue with using smaller patch size + update W&B logging

* disable LR_WARMUP_EPOCHS

* updates

* minor cleanups

* cleanup

* update order of transformations

* more cleanups

* realize loss

* cleanup

* more cleanup

* some cleanups

* add RAM usage

* minor cleanups

* add support for gradient accumulation

* cleanup imports

* minor updates to not use GA_STEPS

* remove FP16 option since it's available now globally

* update multi-GPU setup

* add timing logs for training loop

* go back to using existing dataloader and add ability to preprocess data to save time

* clean up optimization and re-enable JIT and multi-GPU support for training and evaluation

* free train and eval steps memory

* cleanups and scale batch size based on the number of GPUs

* fix GlobalCounters import

* fix seed

* fix W&B setup

* update batch size default size

* add back metric divergence check

* put back JIT on UNet3d eval

* move dataset preprocessing inside training code

* add test for dice_loss

* add config logging support to W&B and other cleanups

* change how default float is getting retrieved

* remove TinyJit import duplicate

* update config logging to W&B and remove JIT on eval_step

* no need for caching preprocessed data anymore

* fix how evaluation is ran and how often

* add support for LR scaling

* fix issue with gaussian being moved to scipy.signal.windows

* remove DICE loss unit test

* fix issue where loss isn't compatible with multiGPU

* add individual BEAM control for train and eval steps

* fix ndimage scipy import

* add BENCHMARK

* cleanups on BENCHMARK + fix on rand_flip augmentation during training

* cleanup train and eval BEAM envs

* add checkpointing support after every eval

* cleanup model_eval

* disable grad during eval

* use new preprocessing dataset mechanism

* remove unused import

* use training and inference_mode contexts

* start eval after benchmarking

* add data fetching time

* cleanup decorators

* more cleanups on training script

* add message during benchmarking mode

* realize when reassigning LR on scheduler and update default number of epochs

* add JIT on eval step

* remove JIT on eval_step

* add train dataloader for unet3d

* move checkpointing to be done after every epoch

* revert removal of JIT on unet3d inference

* save checkpoint if metric is not successful

* Revert "add train dataloader for unet3d"

This reverts commit c166d129df.

* Revert "Revert "add train dataloader for unet3d""

This reverts commit 36366c65d2.

* hotfix: seed was defaulting to a value of 0

* fix SEED value

* remove the usage of context managers for setting BEAM and going from training to inference

* support new stack API for calculating eval loss and metric

* Revert "remove the usage of context managers for setting BEAM and going from training to inference"

This reverts commit 2c0ba8d322.

* check training and test preprocessed folders separately

* clean up imports and log FUSE_CONV_BW

* use train and val preprocessing constants

* add kits19 dataset setup script

* update to use the new test decorator for disabling grad

* update kits19 dataset setup script

* add docs on how to train the model

* set default value for BASEDIR

* add detailed instruction about BASEDIR usage

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-09-10 04:37:28 -04:00
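
Not the PR's exact implementation, but a minimal sketch of a combined DICE + cross-entropy objective like the one added here; the shapes and the toy usage are illustrative:

```python
from tinygrad import Tensor

def dice_ce_loss(pred: Tensor, label: Tensor, eps: float = 1e-6) -> Tensor:
  """pred: class probabilities (N, C, D, H, W); label: one-hot with the same shape."""
  spatial = tuple(range(2, pred.ndim))
  # soft Dice per class: 2*|pred ∩ label| / (|pred| + |label|)
  dice = (2 * (pred * label).sum(axis=spatial) + eps) / \
         (pred.sum(axis=spatial) + label.sum(axis=spatial) + eps)
  # cross entropy against the one-hot labels
  ce = -(label * pred.clip(eps, 1.0).log()).sum(axis=1).mean()
  return ce + (1 - dice.mean())

# toy usage with random "probabilities" and a random binary mask as the label
pred = Tensor.rand(1, 3, 8, 8, 8).softmax(axis=1)
label = (Tensor.rand(1, 3, 8, 8, 8) > 0.5).float()
print(dice_ce_loss(pred, label).item())
```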
Elias Wahl
c9b4602854 no load in INITMLPERF (#5957) 2024-08-08 11:28:24 -04:00
Elias Wahl
c9862e17d4 MLPERF BERT submission scripts (#5931)
* green

* red

* fix benchmark

* log

* count train samples

* oops. 4.0 -> 4.1

* note to todo

* no pillow
2024-08-06 18:09:18 -04:00
Elias Wahl
937bf5fe12 better hparam (#5891) 2024-08-03 12:38:53 -04:00
Elias Wahl
4a114756f6 New BERT dataloader (#5881)
* One file == One topic

* update test

* new dataloader

* update train script

* get index is faster
2024-08-02 15:12:23 -04:00
Elias Wahl
73bddc44f6 Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
Elias Wahl
e267f3161d Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
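
A minimal sketch of MLPerf compliance logging with the `mlperf_logging` package that this commit wires in; the keys shown are standard constants, but the exact events the script emits may differ:

```python
from mlperf_logging import mllog

mllogger = mllog.get_mllogger()
mllogger.event(key=mllog.constants.SUBMISSION_BENCHMARK, value="bert")
mllogger.start(key=mllog.constants.RUN_START)
# eval accuracy events carry the epoch (samples seen) in their metadata
mllogger.event(key=mllog.constants.EVAL_ACCURACY, value=0.72, metadata={"epoch_num": 1})
mllogger.end(key=mllog.constants.RUN_STOP, metadata={"status": "success"})
```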
Elias Wahl
f31ef11537 Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl
7bfa9101c0 Float in scaled dot product attention (#4985)
* Monkeypatch scaled-dot-product-attention

* Use dot instead of matmul

* new api

* imports

* least_upper_dtype
2024-06-18 08:16:41 -04:00
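
In the spirit of the PR bullets, a hedged sketch of running scaled dot-product attention in float and monkeypatching it onto `Tensor`; the function and the patch shape are assumptions, not the commit's code:

```python
import math
from tinygrad import Tensor

def float_sdpa(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
  # do the attention matmuls and softmax in float32 even when q/k/v are half,
  # then cast the result back to the query dtype
  qf, kf, vf = q.float(), k.float(), v.float()
  attn = (qf @ kf.transpose(-2, -1)) * (1.0 / math.sqrt(q.shape[-1]))
  return (attn.softmax(-1) @ vf).cast(q.dtype)

# hypothetical monkeypatch, mirroring the "Monkeypatch scaled-dot-product-attention" bullet
Tensor.scaled_dot_product_attention = lambda q, k, v, *args, **kwargs: float_sdpa(q, k, v)
```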
Elias Wahl
d2e3c391e8 Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Elias Wahl
e576aca044 Disable dropout (#4837) 2024-06-04 18:57:26 -04:00
Elias Wahl
bb248a0dd1 Optional half matmul (#4835)
* half linear

* move weight cast back

* oops

* matmul dtype var

* todo comment
2024-06-04 17:53:41 -04:00
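
A sketch of an env-controlled half matmul for the linear layers; the flag name `HALF_MATMUL` is illustrative, not necessarily the commit's variable:

```python
from tinygrad import Tensor, dtypes
from tinygrad.helpers import getenv

HALF_MATMUL = getenv("HALF_MATMUL", 0)

def linear(x: Tensor, w: Tensor, b: Tensor) -> Tensor:
  # optionally run the big matmul in half to cut memory traffic, casting back to float after
  if HALF_MATMUL:
    return (x.cast(dtypes.half) @ w.cast(dtypes.half).T).float() + b
  return x @ w.T + b
```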
Elias Wahl
04e237328b Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
Elias Wahl
c4b0acf095 Global norm + small changes (#4749)
* norm

* no empty

* default loss scaler in float
2024-05-27 18:35:27 -04:00
Elias Wahl
acc0039cfc Resume fix + scheduler for non weight decay params (#4679)
* move ckpt dir

* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
Elias Wahl
993091adfa loss scaler + nan fixes (#4661) 2024-05-20 17:08:35 -04:00
chenyu
bed70b130c mlperf bert getenv-able EVAL_STEP_FREQ (#4534) 2024-05-11 14:36:56 -04:00
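
A minimal sketch of the getenv-able eval frequency; the default and the loop are illustrative:

```python
from tinygrad.helpers import getenv

# pull the eval cadence from the environment instead of hardcoding it;
# the default shown here is illustrative, not necessarily the script's
EVAL_STEP_FREQ = getenv("EVAL_STEP_FREQ", 1000)

for step in range(1, 5001):
  # ... training step ...
  if step % EVAL_STEP_FREQ == 0:
    print(f"running eval at step {step}")
```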
chenyu
04a4980a51 touchup bert script (#4531)
small adjustments: remove a duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00