Commit Graph

11094 Commits

Alexey Zaytsev
3bce5ad2b4 clang should not emit the .comment section (#9859)
This section gets included in the final image, and we get a lot of garbage with DEBUG=7
2025-04-12 10:59:11 +08:00
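
For context: clang normally embeds its version string in a `.comment` section via the `.ident` directive, and `-fno-ident` suppresses it. A minimal sketch of a clang invocation with that flag (illustrative only, not necessarily the actual change in #9859):

```python
# Illustrative sketch: compile a C file with clang while suppressing the .ident
# directive that would otherwise end up in the .comment section of the object file.
import subprocess

def compile_no_comment(src: str, out: str) -> None:
  subprocess.run(["clang", "-fno-ident", "-c", src, "-o", out], check=True)
```
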
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
nimlgen
7919bb4f8a amd: do not use log2 (#9852) 2025-04-11 19:53:06 +03:00
nimlgen
ada0f67d3d am: fix speed of ring copies (#9854) 2025-04-11 17:28:06 +03:00
chenyu
4aab16ca6a bert script cleanup and assert nan loss (#9851) 2025-04-11 05:41:49 -04:00
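
A minimal sketch of what a NaN-loss assert in a training script can look like; the names here are hypothetical and the real bert script may structure this differently:

```python
# Hypothetical NaN-loss guard: abort the run early instead of wasting accelerator
# time on a diverged model.
import math

def check_loss(loss_value: float, step: int) -> None:
  assert not math.isnan(loss_value), f"loss became NaN at step {step}"
```
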
qazal
ad677f8e55 create_ast cleanups from kernelize [pr] (#9849) 2025-04-11 16:10:21 +08:00
qazal
cbc5e7ed45 unbind variables when creating ScheduleItems [pr] (#9846) 2025-04-11 15:23:53 +08:00
chenyu
6896197978 relax ATOL for TC half tests more (#9847) 2025-04-11 03:20:22 -04:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
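
Roughly what "softmax with cast" means from the user side, as a hedged sketch using the public tinygrad API (the single-kernel fusion itself happens inside the scheduler):

```python
# Sketch: run the softmax math in float32 and cast the result back to half.
from tinygrad import Tensor, dtypes

x = Tensor.randn(4, 128, dtype=dtypes.half)
y = x.cast(dtypes.float32).softmax(axis=-1).cast(dtypes.half)
print(y.numpy())
```
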
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
7045920786 give _apply_map_to_tensors substitutes name [pr] (#9840) 2025-04-11 10:38:57 +08:00
qazal
40ef2f2857 add ast fixup stage to tensor_map [pr] (#9839) 2025-04-11 09:24:01 +08:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
b1tg
a35b475d18 fix am driver for gfx1201 (#9836) 2025-04-10 19:33:02 +03:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
c3fa470852 hotfix: remove tracebacklimit; it persists if you catch the exception and it made webgpu flaky 2025-04-10 20:29:25 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
chenyu
995d20673a increase bert TRAIN_STEPS for mi300x (#9833)
got a few non-converged runs, so try increasing the steps; we need >= 90% of runs to converge
2025-04-10 08:25:09 -04:00
George Hotz
25e2a3cf5d hotfix: fix get_contraction_with_reduce 2025-04-10 20:18:19 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper that skips CI devices that do not support multi-device
2025-04-10 05:25:29 -04:00
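
A hypothetical sketch of such a helper; the actual name, device list, and location in the repo may differ, this only illustrates the unittest.skipIf pattern being unified:

```python
# Hypothetical skip helper for tests that need multi-device support.
import unittest
from tinygrad import Device

def not_support_multi_device():
  # device list is illustrative, not the repo's actual list
  return unittest.skipIf(Device.DEFAULT in {"WEBGPU", "DSP"},
                         f"{Device.DEFAULT} does not support multiple devices")
```
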
chenyu
817746b30e add contiguous to EmbeddingBert output (#9829)
for some reason, with random dropout it creates a different AST on each device, and searching the embedding kernel is slow. This workaround saved 6 minutes of setup time on mi300x (25->19) and resulted in similar speed
2025-04-10 04:31:19 -04:00
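
Illustrative sketch of the workaround: forcing a contiguous buffer after the embedding lookup so every device compiles the same AST (the real EmbeddingBert lives in the mlperf example code and may differ):

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(30522, 128)   # vocab size / embed size are illustrative
ids = Tensor([[1, 2, 3]])
# .contiguous() materializes the lookup into its own buffer
out = emb(ids).contiguous()
```
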
qazal
fd4f06e623 kernelize prereqs [pr] (#9811)
* kernelize prereqs [pr]

* work

* tensor maps to assign

* unwrap st

* process replay

* grouper changes

* replay
2025-04-10 15:22:20 +08:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox; BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
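
A sketch of how such knobs are typically read through tinygrad's getenv helper; the defaults below just mirror the values quoted in the commit message, not repo code:

```python
from tinygrad.helpers import getenv

BS = getenv("BS", 128)                    # 16 on tinybox, 128 on mi300x
ACC_DTYPE = getenv("ACC_DTYPE", "float")  # "half" on tinybox, "float" on mi300x
```
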
qazal
498a2bf738 add err handling tests to viz + cleanups (#9825)
* cleanup

* add err handling tests to viz + cleanups

* lint
2025-04-10 14:05:05 +08:00
chenyu
a0b72f066a don't free intermediate for bert mi300x (#9824) 2025-04-10 01:48:34 -04:00
chenyu
566e389585 more relaxed ATOL for HALF=1 simple_matmul test (#9823)
it's a function of N, so it's only updated in the test command
2025-04-10 00:46:16 -04:00
Francis Lata
eb2e59db42 RetinaNet model type annotations and loss functions (#9822)
* add type annotations and loss functions for training

* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
chenyu
06a928b341 higher ATOL for half input TC test (#9821)
flaky
2025-04-09 23:57:25 -04:00
Francis Lata
7bb36d71b2 remove openimages iterate (#9820) 2025-04-09 22:54:12 -04:00
chenyu
2e1002e179 EVAL_BS=96 and BEAM=3 for bert green (#9819)
19m -> 13m setup, with the same end-to-end time
2025-04-09 22:37:27 -04:00
uuuvn
3ee317ffed Fix kfd autogen and verify it in ci (#9818)
Had to autogen newer uapi headers for #9746 (the dmabuf export ioctl is missing); submitting just the fix without updating to the newer headers, as those are only needed for the infiniband stuff
2025-04-10 09:53:42 +08:00
nimlgen
d7330ea6ad amd: refactor sqtt into sep functions (#9816)
* amd: refactor sqtt into sep functions

* fix
2025-04-10 00:39:45 +03:00
nimlgen
0ca98b9f20 amd: gfx9 use cache ctrls in acquire_mem (#9815) 2025-04-09 20:17:02 +03:00
George Hotz
fce432d2e3 Ops.FUSE makes softmax a single kernel (#9808)
* KERNELIZE makes softmax a single kernel

* single kernel works

* softmax works

* broken

* correct

* skip that test

* kernelize tests

* rename to fuse

* better reduce_push_add_ones code

* correct now

* cleanups

* oops

* return None if we can't push ones

* rename + docs

* atol fixes group

* flash attention broken test
2025-04-09 22:56:28 +08:00
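
Illustrative usage, assuming only the public Tensor API: realize a softmax and inspect the kernel count (e.g. with DEBUG=2). With the fusion described here the whole softmax can lower to a single kernel, though exact behaviour depends on backend and flags:

```python
from tinygrad import Tensor

x = Tensor.randn(32, 1024)
# run with DEBUG=2 to see how many kernels this launches
x.softmax(axis=-1).realize()
```
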
nimlgen
1798ce7e52 amd: faster xcc sync (#9783)
* amd: faster xcc sync

* move to same cacheline

* comment

* keep it uncached + better poll timings

* revert this, should be fine

* fixed now?

* minor
2025-04-09 15:56:50 +03:00
qazal
3bd992dc95 multi stage graph_rewrite_map (#9803)
* multistage graph_rewrite_map

* s/merge_map/input_map

* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
chenyu
57f4bc3fbb add numpy to setup linting (#9806)
this would have caught the mypy error in the fp8 PR. keep ignore_missing_imports set to true, as we also import torch, which is fat
2025-04-09 03:47:03 -04:00
George Hotz
bf769fa5c5 label ranges with their number (#9805) 2025-04-09 14:31:18 +08:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
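
Hedged sketch of the zero-centering part only (the real simple_matmul script and its SHOULD_USE_TC check may differ): symmetric inputs keep the matmul output centered near zero, which makes a fixed ATOL comparison fairer:

```python
import numpy as np
from tinygrad import Tensor

N = 1024
# zero-centered random inputs in [-0.5, 0.5)
a, b = Tensor.rand(N, N) - 0.5, Tensor.rand(N, N) - 0.5
np.testing.assert_allclose((a @ b).numpy(), a.numpy() @ b.numpy(), atol=2e-2)
```
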
qazal
f27dbc8c35 becomes_map cleanups [pr] (#9790)
* cleanup becomes_map [pr]

* source
2025-04-09 14:11:53 +08:00
qazal
7d2349c827 track_rewrites in scheduler [pr] (#9801) 2025-04-09 12:48:14 +08:00
George Hotz
bb18adb0d5 reduce with a mul chain (#9799)
* reduce with a mul chain

* inside is just 1
2025-04-09 12:42:32 +08:00
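
An illustrative example of the pattern this commit targets, a reduce whose input is a chain of elementwise muls (sketch only, not the repo's test case):

```python
from tinygrad import Tensor

a, b, c = Tensor.rand(64), Tensor.rand(64), Tensor.rand(64)
# a sum reduce fed by a mul chain
out = (a * b * c).sum()
print(out.item())
```
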
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad Revert "move TestOpsFp8s skipTest (#9797)"
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
qazal
1ed4eae510 hotfix: don't add shape to SINK viz node (#9800) 2025-04-09 12:04:33 +08:00