tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-25 14:58:46 -05:00

Author	SHA1	Message	Date
George Hotz	5048066e79	st_arg, never -1 [run_process_replay] (#6128 )	2024-08-16 22:46:56 -07:00
George Hotz	74ee9febec	remove iter from uopgraph (#6110 ) * remove iter from uopgraph * linearize returns uops * fix tests * linearize in linearize * tests fix * touchup * test failures	2024-08-16 15:58:29 -07:00
qazal	28c75bf2a6	merge uops with ops (#6111 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030 ) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit `1408a59f12`. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
qazal	4d38fec8c1	rename lazyops to parents [run_process_replay] (#6091 )	2024-08-15 17:27:32 +03:00
ignaciosica	164ca5632e	split tensor core tests (#6041 )	2024-08-12 09:42:02 -04:00
Timmy	a00994b423	Lowerer Multireduce Uopgraph (#6007 ) * uopgraph changes * fixing for non-reducing ranges * multireduce tests * linters * linters * removing comments * removing arg[1] * linters * prettier * linters * more linters * use any instead of intersection	2024-08-12 15:16:07 +03:00
Timmy	8c99bdab08	More Multireduce Tests (#5968 ) * multireduce tests * linters * more linters * more linters * seeing how it works with parallel	2024-08-08 22:04:08 +03:00
wozeparrot	97d708252a	remove realize from threefry (#5969 )	2024-08-07 15:08:49 -07:00
George Hotz	1417cc8df1	can reenable that test now (#5914 )	2024-08-06 13:38:21 -07:00
ignaciosica	81ae9fadc8	Float4 support for CLANG (#5915 ) * float4 support on clang * skip linearizer tests that require locals * add aligned attribute	2024-08-06 07:50:12 -07:00
George Hotz	159ac06b5b	remove unused reduce rules + improve unparented (#5908 ) * remove unused reduce rules [run_process_replay] * this work * those tests are meaningless now	2024-08-04 18:18:27 -07:00
George Hotz	877e0b4ba0	define global only has the index [run_process_replay] (#5869 ) * define global only has the index [run_process_replay] * fix that linearizer test * fix ptx * stupid ptx fix	2024-08-01 19:01:15 -07:00
chenyu	02f0be03f2	tests on UOp div negative number and arange opts (#5825 )	2024-07-30 20:06:57 -04:00
George Hotz	17a2f74412	new style load/store folder (#5784 ) * remove old index reorder * new style folder * works better * dedup * one failure * this is fine now... * expander_rewrite * images broken, but all else should work * cleanups * make tests work with old * fix images * cleanups + bugfix * minor fixes * fix gated store folding * flip gate_creator and expander * fix gated store * remove unneeded rules * lines getting close * line count good	2024-07-30 13:17:20 -07:00
George Hotz	4df46eac67	clean up tensor cores [run_process_replay] (#5736 ) * clean up tensor cores [run_process_replay] * remove tuple(wmma_sz), self.opts.device * remove tls, leave DEVICE	2024-07-26 13:21:23 -07:00
chenyu	16c27ae400	update UOp.SPECIAL arg spec [run_process_replay] (#5661 ) * update UOp.SPECIAL arg spec [run_process_replay] from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable * fix ptx	2024-07-23 16:58:12 -04:00
George Hotz	386fb5e7f8	folding without UNMUL (#5628 ) * folding without UNMUL * fix failures, index_collapse * import ReduceOps * test_arange_4096 isn't folding	2024-07-21 20:14:44 -07:00
qazal	3ab5fe4e1b	test argmax multireduce failure (#5609 )	2024-07-20 21:33:03 +08:00
chenyu	37dd233650	always reverse global dim (#5586 ) * always reverse global dim * one more test	2024-07-19 13:58:05 -04:00
George Hotz	10be05aae5	push contract through cast to fix test_float2_acc (try 2) (#5585 ) * push contract through cast to fix test_float2_acc (try 2) * contract push only on floats	2024-07-19 10:34:43 -07:00
George Hotz	51892c8fac	Revert "push contract through cast to fix test_float2_acc (#5581 )" (#5583 ) This reverts commit `ddda9420be`.	2024-07-19 09:44:30 -07:00
George Hotz	ddda9420be	push contract through cast to fix test_float2_acc (#5581 ) * push contract through cast to fix test_float2_acc * no_vectorized_alu applies to cast too	2024-07-19 09:30:26 -07:00
chenyu	3f590c3b31	some limit_dims to limit global merging (#5489 ) only supports merging dims in a way that does not surpass limit, no splitting yet	2024-07-19 12:17:46 -04:00
chenyu	2b2f8ad18c	failed example of float2 acc no long applies (#5573 ) * failed example of float2 acc no long applies * # noqa: E501	2024-07-19 02:40:04 -04:00
George Hotz	223d9283ee	fix float4 acc by moving contracts (#5559 )	2024-07-18 11:30:16 -07:00
chenyu	f5af98c450	failed test case that DEFINE_ACC no long uses float4 (#5555 ) * failed test case that DEFINE_ACC no long uses float4 * line	2024-07-18 10:55:59 -07:00
George Hotz	923e0fe0b8	fix half4 folding (#5556 )	2024-07-18 10:47:39 -07:00
chenyu	12e6771209	failed test case for unrolled half4 (#5552 )	2024-07-18 13:05:52 -04:00
kormann	2c4add6844	pretty print lazy op per default (#5505 ) * pretty lop * min diff * walrus * fix * min diff * simplify * pretty helper function * ws * pretty uop upat * tests * stricter tests * test passes * ws * stronger upat test * delete print_tree * min diff * stricter exp test * fix merge * stronger uops eval test * +readable and deep upat test * +readable and deep upat test * sort inv fix * fix * revert allowed_len	2024-07-18 09:34:08 -07:00
George Hotz	fa7e734b49	MetaOps.KERNEL (#5543 )	2024-07-17 19:41:23 -07:00
qazal	61ee02e93d	start multireduce lowerer work (var/std) (#5537 ) * multireduce no-opts works * passed test_var_multireduce * cleanup * double reduce * extra check for range_group * more checking for range_groups * cleaning up debug prints * cleanup diff * linters * revert kernel changes * these are uops toposort --------- Co-authored-by: timmy <timmy0x@proton.me>	2024-07-17 23:43:46 +03:00
qazal	173064c69c	(re)start multireduce in codegen/* (#5391 ) * test_var_multireduce * run verify_lazyop * test_var_multireduce * assert lazyop * add test_indexing_multireduce * arange fuses (crude) * note: extra reshape * start readble * test_arange_simple * test_arange_expanded * test_indexing_multireduce * cleanups * skip ptx * skip nv and amd ci * skip arange expanded too * GPU=1 is slow too in CI	2024-07-16 14:20:48 +03:00
chenyu	63990705b5	test kernel opts case for 4 local and 4 groups (#5499 ) make sure local grouped dim is correct	2024-07-15 20:09:38 -04:00
qazal	ac08f0eb00	reshape rawbufs in test_linearizer (#5492 ) * reshape rawbufs in test_linearizer * fix helper_linearizer_ast	2024-07-15 19:14:38 +03:00
chenyu	613a1dbeed	render lidx starting with 0 (#5478 ) * render lidx starting with 0 changed from ``` int gidx0 = gid.x; /* 4096 / int lidx4 = lid.x; / 8 / int gidx1 = gid.y; / 7 / int lidx5 = lid.y; / 8 / int gidx2 = gid.z; / 7 / int lidx6 = lid.z; / 2 / ``` to ``` int gidx0 = gid.x; / 4096 / int lidx0 = lid.x; / 8 / int gidx1 = gid.y; / 7 / int lidx1 = lid.y; / 8 / int gidx2 = gid.z; / 7 / int lidx2 = lid.z; / 2 / ``` the existing one started from pre-limited global dims which skip number if there are more than 3 global dims don't need start_dim --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-07-14 16:34:04 -04:00
chenyu	28972418c4	s/get_linearizer/get_kernel [run_process_replay] (#5467 )	2024-07-13 20:32:22 -04:00
George Hotz	03c2dc8bd7	lowerer is kernel [run_process_replay] (#5437 )	2024-07-12 18:50:55 -07:00
George Hotz	b8342fb085	independent lowerer [run_process_replay] (#5434 ) * independent lowerer [run_process_replay] * don't relinearize PTX * fix ptx * Revert "fix ptx" This reverts commit `f4e8e059c0`. * Revert "don't relinearize PTX" This reverts commit `f6c12c506c`. * parents is fine, no need for linearization * remove loop local idxs * recover stupid loop_idxs	2024-07-12 18:08:43 -07:00
George Hotz	870dc8c350	s/Linearizer/Lowerer [run_process_replay] (#5428 )	2024-07-12 15:54:07 -07:00
George Hotz	6707c778d0	scheduleitem is not Tuple [run_process_replay] (#5425 ) * scheduleitem is not Tuple [run_process_replay] * fix tests * fix op + fuzzers * fix mop test	2024-07-12 15:13:19 -07:00
chenyu	d37056f3b1	pass Renderer.global_max / local_max into get_grouped_dims (#5423 ) [run_process_replay]	2024-07-12 16:49:27 -04:00
George Hotz	f6ef283e6a	s/loadops/metaops [run_process_replay] (#5421 )	2024-07-12 13:26:50 -07:00
chenyu	76125c07be	make some grouped_dim test work (#5415 ) next need to support max size per dim, splitting and correct way to do reverse or arbitrary permute global dims	2024-07-12 14:22:50 -04:00
George Hotz	c2da4454cd	indexing getting better (#5389 ) * indexing getting better [run_process_replay] [no_assert] * fix test * test_arange_2_reduce is a simpler test * put that print back, NOOPT * don't merge reduces (they could be different reduces) * FUSE_AS_ONE_KERNEL * fix tests * fix test_var_multireduce * w/e put that there * fails on others too * fix test, revert UNMUL change * in case order matters * one kernel indexing works * one kernel indexing works (test other)	2024-07-11 16:41:51 -07:00
qazal	0421f5d83e	hotfix: compare test_var_multireduce against numpy (#5394 )	2024-07-11 18:57:08 -04:00
George Hotz	6972a2569f	Linearizer -> Lowerer (#4957 ) * st to uops function * lowerer * uops reduce * uops reduce * acc_number correct * reduce unroll * complete unroll * do upcasts * handle multioutput * define_accs * fix valid * get grouped dims * revert lin * minor * fixup_ast * group for reduce * group works now * all forwards pass * all ops tests pass * fix clang * mypy * lil cleanups, no image yet * ugh, variables everywhere * bugfix * counters and name fix * use symbolic, not uops * cleanups * Fix tests * linearizer tests * expands * float4 expand load * tests pass * woooo, float4 test * test ops works again * one more lin test * more lin tests * bypass * fix tests * something like this * const in defineacc * uops get_reduce_acc * move around * allow consts in the LOAD/STORE * each axis should only appear once, 21 failures * 16 failures * fix some image * optional float4 * onnx tests * gate the stores * add reorder * fix terrible skip function * tc work * opt add/mul merge * fix float4 tests * tiny tweak, 9 failing * 7 test failures * start tc, but i don't think this will work * progress on tensorcores * note * fix ops tests * closer on tc * weeee...one tensor core works * still works, more generic * large WMMA works * tc test passes * use WMMA as accumulator * basic tc tests passing * small gemm padded works * 4 failures * 3 tests failing * super barrier * now two tests failing * one test failing * cleanpus, add reduce to UopGraph * remove the linearizer * remove unused * lil cleanups * Lowerer everywhere * remove test that doesn't exist now * image indexing * llvm fix * fix metal * fix image * fix images * might fix ptx * fix image type mismatch * more tests pass * CAST -> VECTORIZE * forgot that one * fix TestOps.test_flip_eye_crash * locals shouldn't be image dtype * change less files * test fix * fix recursive expands * touches * MULACC support in python * delete unneeded * alu before contract * bug fixes * tests * no var multireduce * simpler tc * metal works in new style * working on AMD and METAL * fix amd * shot in the dark, fix amd * something for CUDA * CUDA WORKS from the docs * comment * correct merge * cleanups + ptx fix + get_reduce_acc * local alias isn't used anymore * add store sanity check * fix for AMD * cleanups and single expand pass * more correct with acc_cache * tests should pass * block on WMMA * tests pass * merge contract and reduce * contractor fixes issue * multicontract * pre expand wmma (same as a reduce) * expand wmma and only take one * all expands * comments and whitespace	2024-07-10 15:07:42 -07:00
qazal	1f5de80eba	multi reduce Tensor.var passing verify_lazyop (#5346 ) * what about this * reset late gate	2024-07-09 17:20:17 +03:00
chenyu	4ceab5d2b1	fix PTX match rule for gated LOAD (#5338 ) * test padto sum with bool tensor and bool acc dtype make sure bool tensor acc with gate is handled correctly * broken in PTX * fix ptx	2024-07-08 22:25:03 -04:00

1 2 3 4 5

242 Commits