tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-21 04:47:56 -05:00

Author	SHA1	Message	Date
George Hotz	66e7e51c79	Revert beam failure (#6376 ) * Revert "late gate creation for STORE [run_process_replay] (#6373)" This reverts commit `c26744de9f`. * Revert "gated store rewrite to UOps.IF (#5976)" This reverts commit `48061e8400`.	2024-09-06 09:36:44 +08:00
ignaciosica	c15506fc35	[WIP] amx support as TC (#5693 ) * almost working with relu, even hackable... but acc size is wrong, fix needed * upcast based on threads, change thread size to 4x4 * revert wrongfully commented assert * fix tc load indexing * modify for size 8 * fix bug for size 8 * Revert "fix bug for size 8" This reverts commit `cdb3f5df85`. * Revert "modify for size 8" This reverts commit `3ef0904bd9`. * good kernel with changes in lowerer * revert "good kernel with changes in lowerer" This reverts commit `975e2b5a4e`. * good kernel for relu! * refactor lowerer changes * add amx context var to helper * clean up amx flag * improve lowerer changes readability * improve check for amx * revert lowerer if * add float4 type rendering for clang * add amx definitions * enable indexing for clang if amx * working amx example, wrong because of dims * almost works for float 16, need to spot using double load in amx * cleaner render_kernel * revert chages in simple_matmul and delete env * add new var upcast_offset to get_optimized_ast * change axis for axes * invert if in rendering phi * fix some bugs * fix linearizer tests * fix vec/get pat for amx * remove clang tc if amx is disabled * add ops_python support * refactor into one complementary function in ops_python * add job for EMUALTE_AMX * improve checking for AMX in UPCAST and TC extra ops * fix lint issue * commit before refactor into autocontained AMX * start refactor by removing special rendering for AMX * all ready for amx handcoded kernel * working poc, most straightforward amx support * avoid local opts for tc if amx * fix merge bugs * skip test for clang * skip tc hand-coded opts if amx * remove hardcoded ops_python values * remove hardcoded sizes for amx kernel * fix ops_python bug where dim was hard-coded * change contract for vectorize * working without changes in lowerer * revert changes in gep rendering * fix ops_python * modify comment * skip test if clang for different type accumulation * move rename and bug for seperate pr * fix wrong path for test * addmm not implemented in torch for cpu * change struct for vector; equally slow but cleaner * revert modified test * simply wmma rendering * minor change * noqa:501 * add length 16 for AMX * fix vectorized half issue * fix error * remove comment * change set for dedup * split test of tensor_core_extra_ops so that cases that dont require locals run for AMX * add amx reference * load acc into amx registers * fix dtype rendering and remove noqa * moved tests change into another pr * add real AMX job for CI and fix bug * fix ops_python bug * fix test class * remove real AMX tests and fix uops_stats test * remove wrong test * acc folding * hotfix: bug * fix float4 tests for amx * hack for fixing flops counting * hotfix: mypy * add flop counts test for amx * improve test_float4_multidim_amx * improve test_float4_multidim_amx * improve test_float4_multidim_unaligned_load_amx * nits tests --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-09-06 09:01:10 +08:00
Ian Paul	48061e8400	gated store rewrite to UOps.IF (#5976 ) * Core change to gate stores in IFs * Updates to cstyle renderer to handle IFs around STOREs * Make uops asserts happy * Add tests and fix newly broken tests * make ruff happy * make mypy happy * Simplify renderer to have all gated stores use IF * Revert some changes * Make test_where_fold happy * Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out * Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE * Re-change broken test * Make ifs be grouped together * get non-merged IFs working. ALl tests pass except grouping related ifs together * Fix tests by making the IF UOp dependent on the correct node of the STORE UOp * Changes to uopgraph * Simplify graph rewrite logic * Changes to get test_padto_where_multireduce working * Simplify uops.store renderer * Make test_padto_where_multireduce pass but now other tests fail * Clean up uopgraph from scrach work * Ignore sudo IF srcs when rendering * Attempt to fix llvm tests * rm comment * reduce lines * Add line to make mypy happy :( * llvmir fix pt 1 * Mods after rebasing to master * Fix llvmir * Fix ptx tests * Fix other ptx tests * Move changes from uops.py to ops.py * rm uops.py * Fix TestGateStoreRewrite tests * Get multireduce tests working * reset to remote branch * Fix linearizer tests * uop_graph test patch * Add comment to create_gate * hotfix: uncomment those tests * Attempt to fix ptx tests by including whitespace inside if block * Patch from remote tinybox. Tests passing here * Min changes to get some ptx tests passsing * Changes after rebase * Exclude ifs and endifs from ptx * IF conditional branching within ptx * Save lines on delete_redundant_gates * Simplify merge_gates * rm noqa * Remove unnecessary checks when merging gates * Fix ops error msg * Smarter check for if/endif in llvmir * simplify delete redundant gates to only have 2 returns * spacing * Smarter check at beginning of merge_gates * patches from comments * Remove need for merge_gates * include proper srcs in IF from the get-go * test expand ifs dumb will result in 4 ifs, not 1 now * Make tests happy * Fix uops stats * rm merge_gates method. Will add back in separate PR * Spacing * cleaner error msg * Fix uops rendering when expanding. test_failure_43 * patch tests * undo changes in delete_redundant_gates * process replay attempt * re-intro deletion of redundant gates * fix addition of gates when they get nested in stores and loads * patch tests * smarter init of IF srcs when adding gate to STORE * make ruff happy * Resp to comment * include all src[2]'s srcs in IF for gated store * add reference of the storing value to the gate's src * minor patch after rebasing * change ptx renderer --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-09-06 01:05:30 +08:00
gswangg	3cf507ae7f	remove extra.ops and LazyOp support from Kernel (#6267 ) * remove extra.ops and BufferOps * remove extra.ops and LazyOp support in Kernel	2024-08-24 16:44:38 +03:00
qazal	ccb05d8baa	fixup neg tests [run_process_replay] (#6268 )	2024-08-24 16:35:43 +03:00
qazal	bcb2f1caa3	init REDUCE_AXIS with BinaryOps (#6256 ) * REDUCE_AXIS arg with BinaryOps * more work in kernel.py fixup sops.gz * fix TestGraphRewriteEfficiency	2024-08-24 11:28:41 +03:00
chenyu	3fc8203475	remove NEG from handwritten ast in tests (#6234 ) * remove NEG from handwritten ast in tests * test_linearizer_failures	2024-08-22 09:06:59 -04:00
gswangg	c74b318458	migrate test_linearizer.py to UOp AST, pt. 2 (#6228 )	2024-08-21 22:16:11 +03:00
qazal	3b8cc5a3e0	more multireduce tests prep for neg removal [run_process_replay] (#6220 )	2024-08-21 12:45:24 +03:00
qazal	f03e5a4b3b	test_multireduce const has a shape (#6218 )	2024-08-21 11:02:45 +03:00
gswangg	0e6f057eae	migrate test_linearizer.py to UOP AST (pt. 1) (#6150 ) * migrate test_multioutput to UOP AST * inline buf declarations * migrate test_multireduce to UOp AST * update test_mid_dim_multireduce to UOp AST * update test_triple_multireduce with UOp AST * make global definitions more concise * update test_double_reduce_multireduce with UOp AST * update test_multireduce_with_parallel with UOp AST * update test_multiout_multireduce to UOp AST * make gidx style consistent across updated tests --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-08-20 10:02:20 +03:00
chenyu	b36a7273c6	RUF018 assignment-in-assert [run_process_replay] (#6172 ) assertion should not have side effect or `-O` breaks. initially just wanted to fix the one in rearrange, but it also made some long lines less long	2024-08-19 00:34:52 -04:00
Timmy	e3d14d1ccc	Lowerer Multireduce Grouping (#6097 ) * grouping changes to codegen * linters + tests * fix identical store issue on PTX * comment in grouping multireduce tests * cleaning up diff * cleaning up diff * comments * linters * hotfix: dont change kernels --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-08-18 19:57:51 +03:00
George Hotz	5048066e79	st_arg, never -1 [run_process_replay] (#6128 )	2024-08-16 22:46:56 -07:00
George Hotz	74ee9febec	remove iter from uopgraph (#6110 ) * remove iter from uopgraph * linearize returns uops * fix tests * linearize in linearize * tests fix * touchup * test failures	2024-08-16 15:58:29 -07:00
qazal	28c75bf2a6	merge uops with ops (#6111 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030 ) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit `1408a59f12`. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
qazal	4d38fec8c1	rename lazyops to parents [run_process_replay] (#6091 )	2024-08-15 17:27:32 +03:00
ignaciosica	164ca5632e	split tensor core tests (#6041 )	2024-08-12 09:42:02 -04:00
Timmy	a00994b423	Lowerer Multireduce Uopgraph (#6007 ) * uopgraph changes * fixing for non-reducing ranges * multireduce tests * linters * linters * removing comments * removing arg[1] * linters * prettier * linters * more linters * use any instead of intersection	2024-08-12 15:16:07 +03:00
Timmy	8c99bdab08	More Multireduce Tests (#5968 ) * multireduce tests * linters * more linters * more linters * seeing how it works with parallel	2024-08-08 22:04:08 +03:00
wozeparrot	97d708252a	remove realize from threefry (#5969 )	2024-08-07 15:08:49 -07:00
George Hotz	1417cc8df1	can reenable that test now (#5914 )	2024-08-06 13:38:21 -07:00
ignaciosica	81ae9fadc8	Float4 support for CLANG (#5915 ) * float4 support on clang * skip linearizer tests that require locals * add aligned attribute	2024-08-06 07:50:12 -07:00
George Hotz	159ac06b5b	remove unused reduce rules + improve unparented (#5908 ) * remove unused reduce rules [run_process_replay] * this work * those tests are meaningless now	2024-08-04 18:18:27 -07:00
George Hotz	877e0b4ba0	define global only has the index [run_process_replay] (#5869 ) * define global only has the index [run_process_replay] * fix that linearizer test * fix ptx * stupid ptx fix	2024-08-01 19:01:15 -07:00
chenyu	02f0be03f2	tests on UOp div negative number and arange opts (#5825 )	2024-07-30 20:06:57 -04:00
George Hotz	17a2f74412	new style load/store folder (#5784 ) * remove old index reorder * new style folder * works better * dedup * one failure * this is fine now... * expander_rewrite * images broken, but all else should work * cleanups * make tests work with old * fix images * cleanups + bugfix * minor fixes * fix gated store folding * flip gate_creator and expander * fix gated store * remove unneeded rules * lines getting close * line count good	2024-07-30 13:17:20 -07:00
George Hotz	4df46eac67	clean up tensor cores [run_process_replay] (#5736 ) * clean up tensor cores [run_process_replay] * remove tuple(wmma_sz), self.opts.device * remove tls, leave DEVICE	2024-07-26 13:21:23 -07:00
chenyu	16c27ae400	update UOp.SPECIAL arg spec [run_process_replay] (#5661 ) * update UOp.SPECIAL arg spec [run_process_replay] from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable * fix ptx	2024-07-23 16:58:12 -04:00
George Hotz	386fb5e7f8	folding without UNMUL (#5628 ) * folding without UNMUL * fix failures, index_collapse * import ReduceOps * test_arange_4096 isn't folding	2024-07-21 20:14:44 -07:00
qazal	3ab5fe4e1b	test argmax multireduce failure (#5609 )	2024-07-20 21:33:03 +08:00
chenyu	37dd233650	always reverse global dim (#5586 ) * always reverse global dim * one more test	2024-07-19 13:58:05 -04:00
George Hotz	10be05aae5	push contract through cast to fix test_float2_acc (try 2) (#5585 ) * push contract through cast to fix test_float2_acc (try 2) * contract push only on floats	2024-07-19 10:34:43 -07:00
George Hotz	51892c8fac	Revert "push contract through cast to fix test_float2_acc (#5581 )" (#5583 ) This reverts commit `ddda9420be`.	2024-07-19 09:44:30 -07:00
George Hotz	ddda9420be	push contract through cast to fix test_float2_acc (#5581 ) * push contract through cast to fix test_float2_acc * no_vectorized_alu applies to cast too	2024-07-19 09:30:26 -07:00
chenyu	3f590c3b31	some limit_dims to limit global merging (#5489 ) only supports merging dims in a way that does not surpass limit, no splitting yet	2024-07-19 12:17:46 -04:00
chenyu	2b2f8ad18c	failed example of float2 acc no long applies (#5573 ) * failed example of float2 acc no long applies * # noqa: E501	2024-07-19 02:40:04 -04:00
George Hotz	223d9283ee	fix float4 acc by moving contracts (#5559 )	2024-07-18 11:30:16 -07:00
chenyu	f5af98c450	failed test case that DEFINE_ACC no long uses float4 (#5555 ) * failed test case that DEFINE_ACC no long uses float4 * line	2024-07-18 10:55:59 -07:00
George Hotz	923e0fe0b8	fix half4 folding (#5556 )	2024-07-18 10:47:39 -07:00
chenyu	12e6771209	failed test case for unrolled half4 (#5552 )	2024-07-18 13:05:52 -04:00
kormann	2c4add6844	pretty print lazy op per default (#5505 ) * pretty lop * min diff * walrus * fix * min diff * simplify * pretty helper function * ws * pretty uop upat * tests * stricter tests * test passes * ws * stronger upat test * delete print_tree * min diff * stricter exp test * fix merge * stronger uops eval test * +readable and deep upat test * +readable and deep upat test * sort inv fix * fix * revert allowed_len	2024-07-18 09:34:08 -07:00
George Hotz	fa7e734b49	MetaOps.KERNEL (#5543 )	2024-07-17 19:41:23 -07:00
qazal	61ee02e93d	start multireduce lowerer work (var/std) (#5537 ) * multireduce no-opts works * passed test_var_multireduce * cleanup * double reduce * extra check for range_group * more checking for range_groups * cleaning up debug prints * cleanup diff * linters * revert kernel changes * these are uops toposort --------- Co-authored-by: timmy <timmy0x@proton.me>	2024-07-17 23:43:46 +03:00
qazal	173064c69c	(re)start multireduce in codegen/* (#5391 ) * test_var_multireduce * run verify_lazyop * test_var_multireduce * assert lazyop * add test_indexing_multireduce * arange fuses (crude) * note: extra reshape * start readble * test_arange_simple * test_arange_expanded * test_indexing_multireduce * cleanups * skip ptx * skip nv and amd ci * skip arange expanded too * GPU=1 is slow too in CI	2024-07-16 14:20:48 +03:00
chenyu	63990705b5	test kernel opts case for 4 local and 4 groups (#5499 ) make sure local grouped dim is correct	2024-07-15 20:09:38 -04:00
qazal	ac08f0eb00	reshape rawbufs in test_linearizer (#5492 ) * reshape rawbufs in test_linearizer * fix helper_linearizer_ast	2024-07-15 19:14:38 +03:00
chenyu	613a1dbeed	render lidx starting with 0 (#5478 ) * render lidx starting with 0 changed from ``` int gidx0 = gid.x; /* 4096 */ int lidx4 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx5 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx6 = lid.z; /* 2 */ ``` to ``` int gidx0 = gid.x; /* 4096 */ int lidx0 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx1 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx2 = lid.z; /* 2 */ ``` the existing one started from pre-limited global dims which skip number if there are more than 3 global dims don't need start_dim --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-07-14 16:34:04 -04:00

405 Commits