Commit Graph

10417 Commits

Author SHA1 Message Date
Sieds Lykles
91ccf1c343 Off by one error in start_pos (#9792)
A Variable's upper bound is inclusive
2025-04-15 15:07:13 -04:00
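
The fix above hinges on tinygrad's Variable taking an inclusive [min, max] range. A minimal sketch of the bug class (names like max_context are illustrative, not the PR's code):

```python
# tinygrad's Variable range is inclusive on both ends, so for a KV cache of
# length max_context the last valid start_pos is max_context - 1.
from tinygrad import Variable

max_context = 4096
# off by one: this allows start_pos == max_context, one past the end of the cache
# start_pos = Variable("start_pos", 0, max_context)
# correct, since the upper bound is inclusive:
start_pos = Variable("start_pos", 0, max_context - 1)
v = start_pos.bind(0)  # bind a concrete value at runtime
```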
pkotzbach
5849c43382 FP8s part 1 (#9887)
* fp8s part 1

* prettier

* fixes

* fixes

* remove stuff that should be in next pr

* revert

* add creation

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
nimlgen
83ae83d871 compare amd and am to cpu as well (#9896) 2025-04-15 13:32:18 +03:00
nimlgen
23a95dd84d script to compare amd and am kerns (#9889)
* script to compare amd and am kerns

* tool

* is it used???
2025-04-15 00:11:22 +03:00
chenyu
ce454793e6 support specifying dtype for Tensor.linear (#9886) 2025-04-14 13:55:11 -04:00
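A sketch of the new option, assuming it landed as a dtype= keyword on Tensor.linear (the keyword name is inferred from the title, not checked against the diff):

```python
from tinygrad import Tensor, dtypes

x = Tensor.randn(4, 8)
w = Tensor.randn(8, 16)
b = Tensor.zeros(16)
y = x.linear(w, b, dtype=dtypes.half)  # assumed keyword: run the matmul in the given dtype
print(y.dtype)
```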
b1tg
e8a0aee88d add arch to AMDLLVMRenderer (#9884)
* add arch to AMDLLVMRenderer

* __reduce__ to match others

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-04-14 19:59:22 +03:00
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
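
The bullets above ("compile match function", "dynamic compile") describe replacing an interpreted pattern walk with per-pattern generated code. A generic, self-contained sketch of that technique (not tinygrad's actual matcher):

```python
# Instead of walking a pattern tree for every node, generate Python source for a
# specialized match function once and exec it, so the hot path is straight-line code.

def compile_pattern(op: str, nargs: int):
  # build source for: does node n match (op, nargs)? if so, return its sources
  src  = "def _match(n):\n"
  src += f"  if n[0] != {op!r} or len(n[1]) != {nargs}: return None\n"
  src += "  return n[1]\n"
  ns: dict = {}
  exec(src, ns)
  return ns["_match"]

match_add2 = compile_pattern("ADD", 2)
print(match_add2(("ADD", ("x", "y"))))  # ('x', 'y')
print(match_add2(("MUL", ("x", "y"))))  # None
```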
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
qazal
bf099520a4 add names to grouper rewrites + cleanups [pr] (#9881)
* add names to grouper rewrites + cleanups [pr]

* assign_targets
2025-04-14 19:47:36 +08:00
George Hotz
ca8aaadd00 clean up some patterns [pr] (#9880)
* clean up some patterns [pr]

* cleanest

* move that into upat_interpret
2025-04-14 11:33:22 +01:00
George Hotz
355739fc94 switch to universal match [pr] (#9879)
* switch to universal match [pr]

* 10 -> 15
2025-04-14 09:15:37 +01:00
Nishant Rajadhyaksha
32ed128598 fixing transformer training bug (#9877) 2025-04-13 19:34:20 -04:00
George Hotz
bd5939514d clean up a few patterns [pr] (#9873) 2025-04-13 20:19:37 +01:00
Alexey Zaytsev
78a6af3da7 Use $CUDA_PATH/include for CUDA headers (#9858) 2025-04-13 16:20:19 +01:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
qazal
e201bc3e93 process replay kernel asts in toposort order [pr] (#9869)
* process replay kernel asts in toposort order [pr]

* use HEAD replay
2025-04-13 17:20:34 +08:00
qazal
7191f88551 add asserts for KERNEL op ast [pr] (#9868) 2025-04-13 16:50:18 +08:00
qazal
5ee9c343e6 add device to NullRenderer [pr] (#9867) 2025-04-13 13:17:16 +08:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get forward call to model work and setup multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor the training loop and start the work on the val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for RegressionHead and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
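
The RetinaNet entry above adds a sigmoid focal loss. A sketch of the standard formulation (Lin et al. 2017) in tinygrad ops; not necessarily the exact code this PR landed:

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha=0.25, gamma=2.0) -> Tensor:
  # numerically stable binary cross-entropy with logits
  ce = logits.maximum(0) - logits * targets + (1 + (-logits.abs()).exp()).log()
  p = logits.sigmoid()
  p_t = p * targets + (1 - p) * (1 - targets)   # probability of the true class
  loss = ce * (1 - p_t) ** gamma                # down-weight easy examples
  if alpha >= 0:
    loss = (alpha * targets + (1 - alpha) * (1 - targets)) * loss
  return loss.sum()
```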
nimlgen
23b67f532c amd: minor comments and readme updates (#9865) 2025-04-12 23:24:05 +03:00
nimlgen
7c466c24f7 am_smi: refactor to support arches (#9864)
* am_smi: refactor to support arches

* shorter
2025-04-12 20:37:01 +03:00
nimlgen
a9430b4118 am: fix metrics table for smu14_0_2 (#9863) 2025-04-12 19:07:22 +03:00
Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7.
2025-04-12 10:59:11 +08:00
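
The usual way to keep clang from emitting the .comment section is the -fno-ident flag. A sketch of the kind of compile invocation this implies (the flags and file names here are illustrative, not the PR's diff):

```python
import subprocess

# -fno-ident suppresses the compiler identification string that clang would
# otherwise place in a .comment section of the output object
subprocess.check_call(["clang", "-shared", "-O2", "-fno-ident", "kernel.c", "-o", "kernel.so"])
```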
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
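
For reference, the fused kernel computes the standard numerically stable softmax. A sketch of that formulation in tinygrad ops; the scheduler change itself is not shown:

```python
from tinygrad import Tensor

def softmax(x: Tensor, axis=-1) -> Tensor:
  m = x.max(axis=axis, keepdim=True)    # subtract the row max so exp() can't overflow
  e = (x - m).exp()
  return e / e.sum(axis=axis, keepdim=True)

print(softmax(Tensor([[1.0, 2.0, 3.0]])).numpy())
```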
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
7045920786 give _apply_map_to_tensors substitutes name [pr] (#9840) 2025-04-11 10:38:57 +08:00
qazal
40ef2f2857 add ast fixup stage to tensor_map [pr] (#9839) 2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
b1tg
a35b475d18 fix am driver for gfx1201 (#9836) 2025-04-10 19:33:02 +03:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852 hotfix: remove tracebacklimit, it persists if you catch the exception and made webgpu flaky 2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
Got a few non-converged runs, so try increasing the steps; we need >= 90% of runs to converge.
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d hotfix: fix get_contraction_with_reduce 2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
Unify the test helper that skips CI devices that do not support multi-device.
2025-04-10 05:25:29 -04:00
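
A hypothetical shape for such a helper (the skipped device set below is an assumption for illustration, not taken from the PR):

```python
from tinygrad import Device
from tinygrad.helpers import CI

def not_support_multi_device() -> bool:
  # in CI, some backends cannot allocate more than one device (assumed set)
  return CI and Device.DEFAULT in {"WEBGPU", "CLOUD"}
```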
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
For some reason, with random dropout it creates a different AST on each device, and beam-searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed.
2025-04-10 04:31:19 -04:00
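
A minimal illustration of the workaround (module and shape names here are assumed):

```python
from tinygrad import Tensor, nn

class EmbeddingBert:
  def __init__(self, vocab_size=30522, dim=1024):
    self.tok = nn.Embedding(vocab_size, dim)
  def __call__(self, ids: Tensor) -> Tensor:
    # contiguous() cuts the embedding off from the downstream random dropout,
    # so every device sees the same AST and the slow kernel is searched once
    return self.tok(ids).contiguous()
```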
qazal
fd4f06e623 kernelize prereqs [pr] (#9811)
* kernelize prereqs [pr]

* work

* tensor maps to assign

* unwrap st

* process replay

* grouper changes

* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox; BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
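
How a script might read these knobs with tinygrad's getenv helper; a sketch of the pattern, the actual bert scripts may differ:

```python
from tinygrad.helpers import getenv

BS = getenv("BS", 16)                    # 16 on tinybox, 128 on mi300x
ACC_DTYPE = getenv("ACC_DTYPE", "half")  # "half" on tinybox, "float" on mi300x
print(f"training with BS={BS} ACC_DTYPE={ACC_DTYPE}")
```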