tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-18 10:31:41 -05:00

Author	SHA1	Message	Date
gswangg	3cf507ae7f	remove extra.ops and LazyOp support from Kernel (#6267 ) * remove extra.ops and BufferOps * remove extra.ops and LazyOp support in Kernel	2024-08-24 16:44:38 +03:00
qazal	ccb05d8baa	fixup neg tests [run_process_replay] (#6268 )	2024-08-24 16:35:43 +03:00
qazal	bcb2f1caa3	init REDUCE_AXIS with BinaryOps (#6256 ) * REDUCE_AXIS arg with BinaryOps * more work in kernel.py fixup sops.gz * fix TestGraphRewriteEfficiency	2024-08-24 11:28:41 +03:00
chenyu	3fc8203475	remove NEG from handwritten ast in tests (#6234 ) * remove NEG from handwritten ast in tests * test_linearizer_failures	2024-08-22 09:06:59 -04:00
gswangg	c74b318458	migrate test_linearizer.py to UOp AST, pt. 2 (#6228 )	2024-08-21 22:16:11 +03:00
qazal	3b8cc5a3e0	more multireduce tests prep for neg removal [run_process_replay] (#6220 )	2024-08-21 12:45:24 +03:00
qazal	f03e5a4b3b	test_multireduce const has a shape (#6218 )	2024-08-21 11:02:45 +03:00
gswangg	0e6f057eae	migrate test_linearizer.py to UOP AST (pt. 1) (#6150 ) * migrate test_multioutput to UOP AST * inline buf declarations * migrate test_multireduce to UOp AST * update test_mid_dim_multireduce to UOp AST * update test_triple_multireduce with UOp AST * make global definitions more concise * update test_double_reduce_multireduce with UOp AST * update test_multireduce_with_parallel with UOp AST * update test_multiout_multireduce to UOp AST * make gidx style consistent across updated tests --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-08-20 10:02:20 +03:00
chenyu	b36a7273c6	RUF018 assignment-in-assert [run_process_replay] (#6172 ) assertion should not have side effect or `-O` breaks. initially just wanted to fix the one in rearrange, but it also made some long lines less long	2024-08-19 00:34:52 -04:00
Timmy	e3d14d1ccc	Lowerer Multireduce Grouping (#6097 ) * grouping changes to codegen * linters + tests * fix identical store issue on PTX * comment in grouping multireduce tests * cleaning up diff * cleaning up diff * comments * linters * hotfix: dont change kernels --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-08-18 19:57:51 +03:00
George Hotz	5048066e79	st_arg, never -1 [run_process_replay] (#6128 )	2024-08-16 22:46:56 -07:00
George Hotz	74ee9febec	remove iter from uopgraph (#6110 ) * remove iter from uopgraph * linearize returns uops * fix tests * linearize in linearize * tests fix * touchup * test failures	2024-08-16 15:58:29 -07:00
qazal	28c75bf2a6	merge uops with ops (#6111 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030 ) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit `1408a59f12`. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
qazal	4d38fec8c1	rename lazyops to parents [run_process_replay] (#6091 )	2024-08-15 17:27:32 +03:00
ignaciosica	164ca5632e	split tensor core tests (#6041 )	2024-08-12 09:42:02 -04:00
Timmy	a00994b423	Lowerer Multireduce Uopgraph (#6007 ) * uopgraph changes * fixing for non-reducing ranges * multireduce tests * linters * linters * removing comments * removing arg[1] * linters * prettier * linters * more linters * use any instead of intersection	2024-08-12 15:16:07 +03:00
Timmy	8c99bdab08	More Multireduce Tests (#5968 ) * multireduce tests * linters * more linters * more linters * seeing how it works with parallel	2024-08-08 22:04:08 +03:00
wozeparrot	97d708252a	remove realize from threefry (#5969 )	2024-08-07 15:08:49 -07:00
George Hotz	1417cc8df1	can reenable that test now (#5914 )	2024-08-06 13:38:21 -07:00
ignaciosica	81ae9fadc8	Float4 support for CLANG (#5915 ) * float4 support on clang * skip linearizer tests that require locals * add aligned attribute	2024-08-06 07:50:12 -07:00
George Hotz	159ac06b5b	remove unused reduce rules + improve unparented (#5908 ) * remove unused reduce rules [run_process_replay] * this work * those tests are meaningless now	2024-08-04 18:18:27 -07:00
George Hotz	877e0b4ba0	define global only has the index [run_process_replay] (#5869 ) * define global only has the index [run_process_replay] * fix that linearizer test * fix ptx * stupid ptx fix	2024-08-01 19:01:15 -07:00
chenyu	02f0be03f2	tests on UOp div negative number and arange opts (#5825 )	2024-07-30 20:06:57 -04:00
George Hotz	17a2f74412	new style load/store folder (#5784 ) * remove old index reorder * new style folder * works better * dedup * one failure * this is fine now... * expander_rewrite * images broken, but all else should work * cleanups * make tests work with old * fix images * cleanups + bugfix * minor fixes * fix gated store folding * flip gate_creator and expander * fix gated store * remove unneeded rules * lines getting close * line count good	2024-07-30 13:17:20 -07:00
George Hotz	4df46eac67	clean up tensor cores [run_process_replay] (#5736 ) * clean up tensor cores [run_process_replay] * remove tuple(wmma_sz), self.opts.device * remove tls, leave DEVICE	2024-07-26 13:21:23 -07:00
chenyu	16c27ae400	update UOp.SPECIAL arg spec [run_process_replay] (#5661 ) * update UOp.SPECIAL arg spec [run_process_replay] from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable * fix ptx	2024-07-23 16:58:12 -04:00
George Hotz	386fb5e7f8	folding without UNMUL (#5628 ) * folding without UNMUL * fix failures, index_collapse * import ReduceOps * test_arange_4096 isn't folding	2024-07-21 20:14:44 -07:00
qazal	3ab5fe4e1b	test argmax multireduce failure (#5609 )	2024-07-20 21:33:03 +08:00
chenyu	37dd233650	always reverse global dim (#5586 ) * always reverse global dim * one more test	2024-07-19 13:58:05 -04:00
George Hotz	10be05aae5	push contract through cast to fix test_float2_acc (try 2) (#5585 ) * push contract through cast to fix test_float2_acc (try 2) * contract push only on floats	2024-07-19 10:34:43 -07:00
George Hotz	51892c8fac	Revert "push contract through cast to fix test_float2_acc (#5581 )" (#5583 ) This reverts commit `ddda9420be`.	2024-07-19 09:44:30 -07:00
George Hotz	ddda9420be	push contract through cast to fix test_float2_acc (#5581 ) * push contract through cast to fix test_float2_acc * no_vectorized_alu applies to cast too	2024-07-19 09:30:26 -07:00
chenyu	3f590c3b31	some limit_dims to limit global merging (#5489 ) only supports merging dims in a way that does not surpass limit, no splitting yet	2024-07-19 12:17:46 -04:00
chenyu	2b2f8ad18c	failed example of float2 acc no long applies (#5573 ) * failed example of float2 acc no long applies * # noqa: E501	2024-07-19 02:40:04 -04:00
George Hotz	223d9283ee	fix float4 acc by moving contracts (#5559 )	2024-07-18 11:30:16 -07:00
chenyu	f5af98c450	failed test case that DEFINE_ACC no long uses float4 (#5555 ) * failed test case that DEFINE_ACC no long uses float4 * line	2024-07-18 10:55:59 -07:00
George Hotz	923e0fe0b8	fix half4 folding (#5556 )	2024-07-18 10:47:39 -07:00
chenyu	12e6771209	failed test case for unrolled half4 (#5552 )	2024-07-18 13:05:52 -04:00
kormann	2c4add6844	pretty print lazy op per default (#5505 ) * pretty lop * min diff * walrus * fix * min diff * simplify * pretty helper function * ws * pretty uop upat * tests * stricter tests * test passes * ws * stronger upat test * delete print_tree * min diff * stricter exp test * fix merge * stronger uops eval test * +readable and deep upat test * +readable and deep upat test * sort inv fix * fix * revert allowed_len	2024-07-18 09:34:08 -07:00
George Hotz	fa7e734b49	MetaOps.KERNEL (#5543 )	2024-07-17 19:41:23 -07:00
qazal	61ee02e93d	start multireduce lowerer work (var/std) (#5537 ) * multireduce no-opts works * passed test_var_multireduce * cleanup * double reduce * extra check for range_group * more checking for range_groups * cleaning up debug prints * cleanup diff * linters * revert kernel changes * these are uops toposort --------- Co-authored-by: timmy <timmy0x@proton.me>	2024-07-17 23:43:46 +03:00
qazal	173064c69c	(re)start multireduce in codegen/* (#5391 ) * test_var_multireduce * run verify_lazyop * test_var_multireduce * assert lazyop * add test_indexing_multireduce * arange fuses (crude) * note: extra reshape * start readble * test_arange_simple * test_arange_expanded * test_indexing_multireduce * cleanups * skip ptx * skip nv and amd ci * skip arange expanded too * GPU=1 is slow too in CI	2024-07-16 14:20:48 +03:00
chenyu	63990705b5	test kernel opts case for 4 local and 4 groups (#5499 ) make sure local grouped dim is correct	2024-07-15 20:09:38 -04:00
qazal	ac08f0eb00	reshape rawbufs in test_linearizer (#5492 ) * reshape rawbufs in test_linearizer * fix helper_linearizer_ast	2024-07-15 19:14:38 +03:00
chenyu	613a1dbeed	render lidx starting with 0 (#5478 ) * render lidx starting with 0 changed from ``` int gidx0 = gid.x; /* 4096 */ int lidx4 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx5 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx6 = lid.z; /* 2 */ ``` to ``` int gidx0 = gid.x; /* 4096 */ int lidx0 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx1 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx2 = lid.z; /* 2 */ ``` the existing one started from pre-limited global dims which skip number if there are more than 3 global dims don't need start_dim --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-07-14 16:34:04 -04:00
chenyu	28972418c4	s/get_linearizer/get_kernel [run_process_replay] (#5467 )	2024-07-13 20:32:22 -04:00
George Hotz	03c2dc8bd7	lowerer is kernel [run_process_replay] (#5437 )	2024-07-12 18:50:55 -07:00
George Hotz	b8342fb085	independent lowerer [run_process_replay] (#5434 ) * independent lowerer [run_process_replay] * don't relinearize PTX * fix ptx * Revert "fix ptx" This reverts commit `f4e8e059c0`. * Revert "don't relinearize PTX" This reverts commit `f6c12c506c`. * parents is fine, no need for linearization * remove loop local idxs * recover stupid loop_idxs	2024-07-12 18:08:43 -07:00

252 Commits