Commit Graph

10633 Commits

Author SHA1 Message Date
George Hotz
8919370c76 hotfix: fix test_save_all_dtypes on METAL 2025-04-18 08:42:31 +01:00
qazal
16dfe0a902 upstream remu (#9921) 2025-04-18 01:57:36 +03:00
qazal
d287afe3b1 remove shapeless const check in full_shape [pr] (#9911)
* remove shapeless const check in full_shape [pr]

* those can go too
2025-04-18 00:00:26 +03:00
chenyu
fe6a482f1d pin hypothesis version to 6.131.0 (#9920)
6.131.1 seems to cause timeouts in CI
2025-04-17 16:34:10 -04:00
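A pin like this normally lives in the test dependencies; a minimal sketch assuming it goes in setup.py's testing extras (the file and extras name are assumptions, not taken from the commit):

```python
# setup.py (sketch): keep hypothesis on the last known-good version
extras_require = {
    "testing": [
        "hypothesis==6.131.0",  # 6.131.1 appears to cause CI timeouts
    ],
}
```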
chenyu
f5256e0020 Kernel.apply_opts [pr] (#9917)
* Kernel.apply_opts [pr]

updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations

* not you yet
2025-04-17 08:00:56 -04:00
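A minimal sketch of what `Kernel.apply_opts` presumably wraps, given the description above (the exact signature and return value are assumptions):

```python
from typing import Sequence

class Kernel:  # stub standing in for tinygrad's Kernel
    def apply_opt(self, opt): ...
    def apply_opts(self, opts: Sequence):
        # apply each optimization in order, replacing ad-hoc `for opt in opts:` loops at call sites
        for opt in opts: self.apply_opt(opt)
        return self
```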
chenyu
e2ed673c94 FUSE_ARANGE_UINT to not fuse uint (#9915)
hack to bypass rand, can FUSE_ARANGE on green for 6ms per step
2025-04-16 18:49:38 -04:00
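A hedged illustration of how a flag pair like this is typically consulted in tinygrad; the guard location, defaults, and polarity here are assumptions, not the PR's actual logic:

```python
from tinygrad import dtypes
from tinygrad.helpers import getenv

def should_fuse_arange(dt) -> bool:
    if not getenv("FUSE_ARANGE", 0): return False  # arange fusion off by default
    # uint aranges (e.g. the ones rand produces) are assumed to stay unfused
    # unless FUSE_ARANGE_UINT is set as well
    if dtypes.is_unsigned(dt) and not getenv("FUSE_ARANGE_UINT", 0): return False
    return True
```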
qazal
497daa658a hotfix: edge-labels go above the overlay (#9910) 2025-04-16 23:38:12 +08:00
qazal
e8e43c6dad ensure edge labels are always on top (#9908) 2025-04-16 21:08:06 +08:00
qazal
5265f25088 add counter for incoming edges in viz (#9907) 2025-04-16 20:14:14 +08:00
Eitan Turok
2c7c205bc5 Fix dtype comparisons in vectorized transcendental + tests (#9794)
* init test

* cleanup

* init

* update

* fix

* fix python runtime for vectorized code

* awesome helper

* update

* update

* cleanup

* more cleaning

* cleanup more

* fix tests

* more cleaning

* cleanup more

* fix

* even cleaner

* failing tests is sad

* cleanup

* better name

* make tests pass

* remove vec from python runtime

* remove vec from eval_uop

* remove expected failures

* better name
2025-04-16 08:06:12 -04:00
qazal
929e5a9905 do not construct GrouperContext [pr] (#9906) 2025-04-16 18:26:31 +08:00
Xingyu
047c8fd70d Add amax support to Tensor operations in Torch Backend (#9905)
* Add amax support to Tensor operations
- Implemented amax function in backend.py for tensor max operations.
- Added unit tests for amax in test.py to ensure correct functionality.

* Fix formatting in amax output function
- Adjusted spacing in the amax output lambda function in backend.py
- Improved code readability for better maintenance
2025-04-16 10:35:50 +01:00
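A hedged sketch of the mapping this adds: torch's `amax(dim, keepdim)` is a pure max-reduction, so it translates directly to tinygrad's `Tensor.max` (the registration mechanism in backend.py is not shown and the wrapper name is an assumption):

```python
from tinygrad import Tensor

def amax(x: Tensor, dim=None, keepdim: bool = False) -> Tensor:
    # unlike torch.max, amax returns only values (no indices), so Tensor.max suffices
    return x.max(axis=dim, keepdim=keepdim)
```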
uuuvn
d7f623dac2 Use Buffer in cloud server instead of opaques (#9875)
Not strictly required, but it makes the cloud graph a *lot* cleaner: unlike
raw compiled programs, `GraphRunner` takes `Buffer`s like other runners do.

Otherwise one of the following would be required: adding a new option to not
free on `__del__`, (ab)using `external_ptr` to prevent the free, or making
something like a `FakeBuffer`.
2025-04-16 10:17:32 +01:00
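Purely as an illustration of the alternative argued against above (not code from the PR): keeping opaques would force a non-owning wrapper roughly like this, just to stop `__del__` from freeing server-owned memory.

```python
class FakeBuffer:
    """Hypothetical non-owning stand-in for a Buffer around a raw opaque."""
    def __init__(self, opaque, size: int):
        self._opaque, self.size = opaque, size
    def __del__(self):
        pass  # deliberately never free: the cloud server still owns this memory
```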
qazal
05334e0f3f construct children from UOp.toposort [pr] (#9882)
* construct children from UOp.toposort [pr]

* only for bases
2025-04-16 16:55:59 +08:00
geohotstan
4e8f25109a Revert "ONNX add output shape validation (#9720)" (#9904)
This reverts commit ac713e04db.
2025-04-16 03:15:56 -04:00
chenyu
e8024c8281 faster bert global_norm (#9901)
tinyamd 2% faster. also updated beam params, which are 2-3% faster.

update mlperf doc and steps too
2025-04-15 18:24:44 -04:00
Sieds Lykles
91ccf1c343 Off by one error in start_pos (#9792)
Variable upper bound is inclusive
2025-04-15 15:07:13 -04:00
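A worked illustration of the note above, with the names and context length assumed rather than taken from the fix: since a `Variable`'s bounds are inclusive, a start position indexing a length-N context must be bounded by N-1.

```python
from tinygrad import Variable

MAX_CONTEXT = 1024  # hypothetical context length
# both bounds are inclusive, so the largest legal start position is MAX_CONTEXT-1
start_pos = Variable("start_pos", 0, MAX_CONTEXT - 1)
```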
pkotzbach
5849c43382 FP8s part 1 (#9887)
* fp8s part 1

* prettier

* fixes

* fixes

* remove stuff that should be in next pr

* revert

* add creation

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
Francis Lata
31483050c0 add eval_freq flag (#9894) 2025-04-15 06:42:40 -04:00
nimlgen
83ae83d871 compare amd and am to cpu as well (#9896) 2025-04-15 13:32:18 +03:00
nimlgen
23a95dd84d script to compare amd and am kerns (#9889)
* script to compare amd and am kerns

* tool

* is it used???
2025-04-15 00:11:22 +03:00
chenyu
ce454793e6 support specifying dtype for Tensor.linear (#9886) 2025-04-14 13:55:11 -04:00
b1tg
e8a0aee88d add arch to AMDLLVMRenderer (#9884)
* add arch to AMDLLVMRenderer

* __reduce__ to match others

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-04-14 19:59:22 +03:00
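A hedged sketch of the `__reduce__` pattern the second bullet points at, written against a stub class rather than the real renderer:

```python
class AMDLLVMRenderer:  # stub; the real class lives in tinygrad's renderer code
    def __init__(self, arch: str):
        self.arch = arch
    def __reduce__(self):
        # pickle by re-calling the constructor with arch, matching the other renderers
        return self.__class__, (self.arch,)
```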
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
chenyu
43d3a75d6c increase bert max train_steps (#9883) 2025-04-14 08:53:44 -04:00
qazal
bf099520a4 add names to grouper rewrites + cleanups [pr] (#9881)
* add names to grouper rewrites + cleanups [pr]

* assign_targets
2025-04-14 19:47:36 +08:00
George Hotz
ca8aaadd00 clean up some patterns [pr] (#9880)
* clean up some patterns [pr]

* cleanest

* move that into upat_interpret
2025-04-14 11:33:22 +01:00
George Hotz
355739fc94 switch to universal match [pr] (#9879)
* switch to universal match [pr]

* 10 -> 15
2025-04-14 09:15:37 +01:00
Nishant Rajadhyaksha
32ed128598 fixing transformer training bug (#9877) 2025-04-13 19:34:20 -04:00
George Hotz
bd5939514d clean up a few patterns [pr] (#9873) 2025-04-13 20:19:37 +01:00
Alexey Zaytsev
78a6af3da7 Use $CUDA_PATH/include for CUDA headers (#9858) 2025-04-13 16:20:19 +01:00
chenyu
e2a40fb523 update bert mi300x script (#9872)
2 of 10 back-to-back runs failed to converge; increase total train steps and adjust some beam params (2% faster step)
2025-04-13 10:07:36 -04:00
qazal
e201bc3e93 process replay kernel asts in toposort order [pr] (#9869)
* process replay kernel asts in toposort order [pr]

* use HEAD replay
2025-04-13 17:20:34 +08:00
qazal
7191f88551 add asserts for KERNEL op ast [pr] (#9868) 2025-04-13 16:50:18 +08:00
qazal
5ee9c343e6 add device to NullRenderer [pr] (#9867) 2025-04-13 13:17:16 +08:00
Francis Lata
2793cca9a6 RetinaNet MLPerf (#8385)
* add support for a custom BASEDIR for openimages download

* make export step faster

* add focal loss

* update model_eval with new dataloader

* generate_anchors in tinygrad

* update initializers for model

* small cleanup

* revert isin enhancements

* recursively go through backbone layers to freeze them

* add optimizer

* minor cleanup

* start dataloader work with input images

* add first transform for train set

* reuse existing prepare_target

* continue with dataloader implementation

* add dataloader

* separate out KiTS19 dataset test cases

* create mock data samples for test

* add dataloader + test

* cleanup dataloader test and revert shm path

* trim dataloader related code needed from ref

* got dataloader with normalize working

* update image to be float32

* add back normalization and negate it in test

* clean up reference dataset implementation + ruff changes

* add validation set test

* add proper training loop over the training dataset

* add LambdaLR support

* add LR scheduler and the start of training step

* get forward call to model work and setup multi-GPU

* already passed device

* return matches from dataloader

* hotfix for dataloader typo causing some hang

* start some work on classification loss

* update focal loss to support masking

* add missing test and cleanup focal loss

* cleanup unit tests

* remove masking support for sigmoid_focal_loss

* make ClassificationHead loss work

* cleanups + fix dataloader tests

* remove sigmoid when computing loss

* make anchors use Tensors

* simplify anchors batching

* revert anchors to use np

* implement regression loss

* fix regression loss

* cleanup losses

* move BoxCoder to MLPerf helpers

* revert helper changes

* fixes after helper refactor cleanup

* add tests for l1_loss

* start re-enabling training step

* minor cleanup

* add pycocotools to testing dependencies

* make training work

* adjust regression loss to mask after L1 loss is calculated

* reduce img and lbl sizes by half for KiTS19 dataset tests

* Revert "reduce img and lbl sizes by half for KiTS19 dataset tests"

This reverts commit d115b0c664.

* temporarily disable openimages dataset tests to debug CI

* enable openimages dataset test and create samples once

* temporarily disable openimages validation set test

* reenable test and add some debugging to the test

* add boto3 testing dependencies

* add pandas to testing dependencies

* This reverts commit 467704fec6.

* reenable test

* move sample creation to setup

* realize boxcoder's encoding

* add wandb

* fix wandb resuming feature

* move anchors as part of dataloader

* fix dtype for anchor inside dataloader and fix horizontal flip transformation

* add support for BENCHMARK

* set seed

* debug dataset test failure

* Revert "debug dataset test failuire"

This reverts commit 1b2f9d7f50.

* fix dataloader script

* do not realize when sharding model weights

* setup openimages samples differently

* create the necessary samples per test case

* enable lr scheduler and fix benchmark timing

* add jit to the training loop

* add checkpointing and training resume capabilities

* refactor training loop and start the work on val loop

* add debug logging for dataloader test

* debug test

* assert boxes again

* update validation dataloader and more cleanups

* fix validation test case

* add multi device support to retinanet eval

* fix issue with realized on dataloader

* remove optional disk tensors in dataloader

* remove verbose debugging on datasets test

* put back parallel testing and remove img_ids Tensor from dataloader

* cleanup train and validation dataloader

* return validation targets in dataloader

* cleanup boxes and labels in dataloader

* fix img_ids repeating its values

* remove unnecessary targets from validation dataloader

* add validation loop to training script

* adjust LR to be the ratio of the batch size

* minor cleanups

* remove frozen layers from optimizer's params

* hyperparameter adjustments and cleanups

* model init, hyperparam, and data preprocessing updates

* no need to return loaded keys for resnet

* fix train script

* update loss calculation for RegressionHead and some cleanups

* add JIT reset support

* add nan check during training

* Revert "add nan check during training"

This reverts commit ddf1f0d5dd.

* Revert "Revert "add nan check during training""

This reverts commit b7b2943197.

* some typing cleanups

* update seeding on dataloader and the start of training script

* undo changes

* undo more changes

* more typing fixes

* minor cleanups

* update dataloader seed

* hotfix: log metric and move target metric check outside of CKPT

* check for CKPT when target metric is reached before saving

* add TRAIN_BEAM and EVAL_BEAM

* minor cleanup

* update hyperparams and add support for EVAL_BS

* add green coloring to metric reached statement

* initial work to support f16

* update model initializers to be monkeypatched

* update layers to support float32 weight loading + float16 training

* don't return loss that's scaled

* run eval on benchmark beam

* move BEAM to their respective steps

* update layers to be compatible with fp16

* end BENCHMARK after first eval

* cleanups and adjust learning rate for fp16

* remove duplicated files from test

* revert losses changes

* Revert "revert losses changes"

This reverts commit aebccf93ac.

* go back to old LR

* cast batchnorm to float32

* set new loss scaler default value for float16

* remove LambdaLRScheduler

* remove runner and use dataloader on eval

* fix retinanet eval with new dataloader

* remove unused import

* revert lr_scheduler updates

* use BS=96 with new learning rate

* rename module initializers

* more cleanups on training loop

* remove contig from optim.step

* simplify sum when computing loss
2025-04-12 22:11:51 -04:00
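The loss work in this PR centers on sigmoid focal loss plus L1 box regression; below is a generic hedged sketch of the standard RetinaNet focal loss in tinygrad terms, not the PR's actual implementation (alpha/gamma defaults follow the RetinaNet paper):

```python
from tinygrad import Tensor

def sigmoid_focal_loss(logits: Tensor, targets: Tensor, alpha: float = 0.25, gamma: float = 2.0) -> Tensor:
    # standard formulation: per-element BCE, down-weighted by (1 - p_t)**gamma,
    # with optional alpha class balancing
    p = logits.sigmoid()
    ce = -(targets * p.log() + (1 - targets) * (1 - p).log())
    p_t = targets * p + (1 - targets) * (1 - p)
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        loss = loss * (targets * alpha + (1 - targets) * (1 - alpha))
    return loss.mean()
```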
nimlgen
23b67f532c amd: minor comments and readme updates (#9865) 2025-04-12 23:24:05 +03:00
nimlgen
7c466c24f7 am_smi: refactor to support arches (#9864)
* am_smi: refactor to support arches

* shorter
2025-04-12 20:37:01 +03:00
nimlgen
a9430b4118 am: fix metrics table for smu14_0_2 (#9863) 2025-04-12 19:07:22 +03:00
Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7
2025-04-12 10:59:11 +08:00
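One common way to get this (whether it is what the commit actually does is an assumption) is `-fno-ident`, which stops clang from writing its version string into `.comment`; a minimal sketch of a compile invocation with hypothetical file names:

```python
import subprocess

# -fno-ident: do not emit the compiler ident string, so no .comment section
# ends up in the object that gets packed into the raw image
subprocess.run(["clang", "-O2", "-fno-ident", "-c", "prog.c", "-o", "prog.o"], check=True)
```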
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00