Commit Graph

216 Commits

chenyu
e2ed673c94 FUSE_ARANGE_UINT to not fuse uint (#9915)
hack to bypass rand, so FUSE_ARANGE can run on green, saving 6ms per step
2025-04-16 18:49:38 -04:00
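
For context, a minimal sketch of toggling FUSE_ARANGE through tinygrad's Context (shapes are illustrative; the script's actual integration is not shown here):

```python
from tinygrad import Tensor
from tinygrad.helpers import Context

# FUSE_ARANGE turns gather-style indexing into a fused arange kernel;
# per this commit, FUSE_ARANGE_UINT=0 keeps uint chains (e.g. from rand)
# out of that fusion so the flag can stay on for the rest of the step
with Context(FUSE_ARANGE=1):
  table = Tensor.rand(30522, 128)  # e.g. an embedding table
  ids = Tensor([1, 5, 42])
  out = table[ids].realize()       # the gather fuses the index arange
```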
chenyu
e8024c8281 faster bert global_norm (#9901)
tinyamd 2% faster. also updated beam params, which are 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
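
As a hedged sketch (the commit's exact kernel-level change is not reproduced; `grads` is an assumed name), a global-norm computation in tinygrad looks like:

```python
from tinygrad import Tensor

def global_norm(grads: list[Tensor]) -> Tensor:
  # square each gradient, reduce to scalars, add them all, then sqrt;
  # one expression lets the scheduler plan the whole reduction together
  return sum((g.float() * g.float()).sum() for g in grads).sqrt()
```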
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim the dataloader-related code needed from the ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get the forward call to the model working and set up multi-GPU

* already passed device

* return matches from dataloader

* hotfix for a dataloader typo causing a hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor the training loop and start the work on the val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* scale LR by the batch-size ratio

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for the regression head and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
got a few non-converged runs, so try increasing steps. we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
for some reason, with random dropout it creates a different AST on each device, and beam-searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed
2025-04-10 04:31:19 -04:00
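
A hedged illustration of the workaround (module and variable names are assumptions, not the script's exact code):

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(30522, 128)   # bert-sized vocab, small width
ids = Tensor([[1, 2, 3, 4]])
# .contiguous() realizes the gather into its own kernel, so the downstream
# graph has an identical prefix on every device regardless of dropout masks
out = emb(ids).contiguous()
```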
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
2e1002e179 EVAL_BS=96 and BEAM=3 for bert green (#9819)
19m -> 13m setup and the same end-to-end time
2025-04-09 22:37:27 -04:00
chenyu
8fe83385ec add system json for mi300x mlperf (#9786)
* add system json for mi300x mlperf

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v5.0/tinycorp/systems/tinybox_8xMI300X.json training 4.1.0
INFO -   System description checker passed for tinybox 8xMI300X
```

also removed the rocm from tinybox_red since we are not using it

* update mlperf-logging version
2025-04-08 06:36:44 -04:00
chenyu
4cc7422769 use AM driver in bert mlperf (#9775)
we should commit to using AM. it's 7ms slower in python time for now
2025-04-07 23:40:27 -04:00
Francis Lata
f8fe15e64e move BoxCoder to mlperf helpers (#9773) 2025-04-07 20:27:06 -04:00
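
For reference, a minimal sketch of the standard box encoding a BoxCoder performs (xyxy boxes; weight terms omitted, and the names are assumptions rather than the helper's exact API):

```python
from tinygrad import Tensor

def encode_boxes(anchors: Tensor, gt: Tensor) -> Tensor:
  # anchors, gt: (N, 4) in xyxy; targets are center/size offsets
  aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
  ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
  gw, gh = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
  gx, gy = gt[:, 0] + 0.5 * gw, gt[:, 1] + 0.5 * gh
  return Tensor.stack((gx - ax) / aw, (gy - ay) / ah,
                      (gw / aw).log(), (gh / ah).log(), dim=1)
```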
chenyu
7c4a739fe4 full script for bert mi300x (#9772) 2025-04-07 11:41:31 -04:00
chenyu
3069ebfad1 use BERT_LAYERS=2 in bert init (#9769)
saves 5 minutes of scheduling in setup so we can fit more search
2025-04-07 07:46:37 -04:00
Francis Lata
71b8890dd6 use validation dataloader inside retinanet eval (#9747) 2025-04-05 16:46:55 -04:00
chenyu
5a04f4d4ba revert bert hparams for green and red (#9744)
did more runs and it's not really better, so not worth the change. only useful for BS=1024
2025-04-05 07:38:01 -04:00
chenyu
640ff681c3 rename bert script to 8xMI300X (#9734)
and adds a script for a single MI300X
2025-04-03 23:36:24 -04:00
chenyu
6b3480ec70 update mi300x bert hparams (#9716)
* update mi300x bert hparams

borrowed from a previous submission that also did BS=1024

* update
2025-04-03 22:30:00 -04:00
chenyu
a6fec2f5ae dev_run for bert on mi300x (#9706) 2025-04-02 21:12:55 -04:00
chenyu
f7cb2e8da3 bert dev_beam for mi300x box (#9648)
* bert dev_beam for mi300x box

* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
chenyu
d8d7ac1bb1 fix bert free_intermediates (#9633)
fix for when only running eval: `TRAIN=0 BERT_SIZE=tiny examples/mlperf/training_submission_v5.0/tinycorp/benchmarks/bert/implementations/tinybox_green/dev_beam.sh`
2025-03-30 22:42:52 -04:00
chenyu
a187dfd3df bert BEAM_UOPS_MAX 3000->4000 (#9603)
more stable for the final step time

green 410ms (master) -> 397ms (BEAM=4) -> 392ms (this)
red 561ms (master) -> 550ms (this)
2025-03-27 11:58:47 -04:00
chenyu
62888614f6 lower bert eval bs to 24 (#9590)
OOM during eval
2025-03-26 21:25:23 -04:00
chenyu
c965f4c20b update bert config (#9555)
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
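
A minimal sketch of the standard sigmoid focal loss formulation (the reference RetinaNet defaults of alpha=0.25, gamma=2; the PR's exact signature may differ):

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor,
                       alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
  p = logits.sigmoid()
  # per-element binary cross-entropy, written out explicitly
  ce = -(targets * p.clip(1e-7, 1.0).log() + (1 - targets) * (1 - p).clip(1e-7, 1.0).log())
  p_t = targets * p + (1 - targets) * (1 - p)      # prob of the true class
  loss = ce * (1 - p_t).pow(gamma)                 # down-weight easy examples
  loss = loss * (targets * alpha + (1 - targets) * (1 - alpha))
  return loss.mean()
```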
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
chenyu
f53be010d7 lower bert learning rate (#9481)
slightly better. first sub-3hr run https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/0or96ink/overview
2025-03-17 10:49:56 -04:00
chenyu
d2cfbd8a4d bert lower learning rate and total steps (#9466)
closer to the other submission with BS=240. converged with 10% fewer epochs
2025-03-16 17:21:20 -04:00
chenyu
4992958dae update bert beam params (#9423)
BEAM_MIN_PROGRESS=5 for setup speed
2025-03-12 13:00:41 -04:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matches numpy and torch
2025-03-10 16:05:30 -04:00
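
A hedged example of the renamed keyword in use (values illustrative; this assumes the rename applies to reductions like `Tensor.sum`, consistent with the commit):

```python
from tinygrad import Tensor, dtypes

x = Tensor.ones(4096, dtype=dtypes.half)
total = x.sum(dtype=dtypes.float32)  # accumulate in fp32, like numpy/torch
print(total.item())
```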
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
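
A sketch of the pattern (assuming `Tensor.realize` accepts additional tensors, which is how tinygrad co-realizes outputs):

```python
from tinygrad import Tensor

x = Tensor.rand(1024)
loss, acc = (x * x).sum(), x.mean()
# one realize call schedules both outputs together, so the scheduler can
# share work between the reductions instead of planning them separately
loss.realize(acc)
```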
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
qazal
845814f396 revert buffer_view change (#9311)
* Revert "BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298)"

This reverts commit 3210b656b6.

* Revert "substitute ast from kernel op [pr] (#9293)"

This reverts commit 5a9c788ae6.
2025-03-01 11:00:12 +01:00
qazal
3210b656b6 BUFFER_VIEW is a node in the kernel graph + delete ViewOp (#9298) 2025-02-28 12:15:04 +02:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it, but it runs fine in a real run.
2025-02-25 19:15:10 -05:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard, though
2025-02-25 09:05:11 -05:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms, it was spending the time scheduling the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard onto GPUS
2025-02-18 10:36:54 -05:00
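
A hedged sketch of the two changes (device names, shapes, and the step body are assumptions; CLANG is the device later renamed CPU):

```python
import numpy as np
from tinygrad import Tensor, TinyJit

GPUS = tuple(f"AMD:{i}" for i in range(8))  # assumption: an 8-GPU box

@TinyJit
def shard_batch(x: Tensor) -> Tensor:
  # sharding inside the jit means its scheduling cost is paid only once
  return x.shard(GPUS, axis=0).realize()

# numpy-backed data starts on the CPU device, so the tensor is not
# created on the default device and then moved
batch = Tensor(np.zeros((96, 512), dtype=np.int32), device="CPU")
out = shard_batch(batch)
```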
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and the real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00