tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-23 22:08:08 -05:00

Author	SHA1	Message	Date
qazal	173064c69c	(re)start multireduce in codegen/* (#5391 ) * test_var_multireduce * run verify_lazyop * test_var_multireduce * assert lazyop * add test_indexing_multireduce * arange fuses (crude) * note: extra reshape * start readble * test_arange_simple * test_arange_expanded * test_indexing_multireduce * cleanups * skip ptx * skip nv and amd ci * skip arange expanded too * GPU=1 is slow too in CI	2024-07-16 14:20:48 +03:00
chenyu	63990705b5	test kernel opts case for 4 local and 4 groups (#5499 ) make sure local grouped dim is correct	2024-07-15 20:09:38 -04:00
qazal	ac08f0eb00	reshape rawbufs in test_linearizer (#5492 ) * reshape rawbufs in test_linearizer * fix helper_linearizer_ast	2024-07-15 19:14:38 +03:00
chenyu	613a1dbeed	render lidx starting with 0 (#5478 ) * render lidx starting with 0 changed from ``` int gidx0 = gid.x; /* 4096 */ int lidx4 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx5 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx6 = lid.z; /* 2 */ ``` to ``` int gidx0 = gid.x; /* 4096 */ int lidx0 = lid.x; /* 8 */ int gidx1 = gid.y; /* 7 */ int lidx1 = lid.y; /* 8 */ int gidx2 = gid.z; /* 7 */ int lidx2 = lid.z; /* 2 */ ``` the existing one started from pre-limited global dims which skip number if there are more than 3 global dims don't need start_dim --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-07-14 16:34:04 -04:00
chenyu	28972418c4	s/get_linearizer/get_kernel [run_process_replay] (#5467 )	2024-07-13 20:32:22 -04:00
George Hotz	03c2dc8bd7	lowerer is kernel [run_process_replay] (#5437 )	2024-07-12 18:50:55 -07:00
George Hotz	b8342fb085	independent lowerer [run_process_replay] (#5434 ) * independent lowerer [run_process_replay] * don't relinearize PTX * fix ptx * Revert "fix ptx" This reverts commit `f4e8e059c0`. * Revert "don't relinearize PTX" This reverts commit `f6c12c506c`. * parents is fine, no need for linearization * remove loop local idxs * recover stupid loop_idxs	2024-07-12 18:08:43 -07:00
George Hotz	870dc8c350	s/Linearizer/Lowerer [run_process_replay] (#5428 )	2024-07-12 15:54:07 -07:00
George Hotz	6707c778d0	scheduleitem is not Tuple [run_process_replay] (#5425 ) * scheduleitem is not Tuple [run_process_replay] * fix tests * fix op + fuzzers * fix mop test	2024-07-12 15:13:19 -07:00
chenyu	d37056f3b1	pass Renderer.global_max / local_max into get_grouped_dims (#5423 ) [run_process_replay]	2024-07-12 16:49:27 -04:00
George Hotz	f6ef283e6a	s/loadops/metaops [run_process_replay] (#5421 )	2024-07-12 13:26:50 -07:00
chenyu	76125c07be	make some grouped_dim test work (#5415 ) next need to support max size per dim, splitting and correct way to do reverse or arbitrary permute global dims	2024-07-12 14:22:50 -04:00
George Hotz	c2da4454cd	indexing getting better (#5389 ) * indexing getting better [run_process_replay] [no_assert] * fix test * test_arange_2_reduce is a simpler test * put that print back, NOOPT * don't merge reduces (they could be different reduces) * FUSE_AS_ONE_KERNEL * fix tests * fix test_var_multireduce * w/e put that there * fails on others too * fix test, revert UNMUL change * in case order matters * one kernel indexing works * one kernel indexing works (test other)	2024-07-11 16:41:51 -07:00
qazal	0421f5d83e	hotfix: compare test_var_multireduce against numpy (#5394 )	2024-07-11 18:57:08 -04:00
George Hotz	6972a2569f	Linearizer -> Lowerer (#4957 ) * st to uops function * lowerer * uops reduce * uops reduce * acc_number correct * reduce unroll * complete unroll * do upcasts * handle multioutput * define_accs * fix valid * get grouped dims * revert lin * minor * fixup_ast * group for reduce * group works now * all forwards pass * all ops tests pass * fix clang * mypy * lil cleanups, no image yet * ugh, variables everywhere * bugfix * counters and name fix * use symbolic, not uops * cleanups * Fix tests * linearizer tests * expands * float4 expand load * tests pass * woooo, float4 test * test ops works again * one more lin test * more lin tests * bypass * fix tests * something like this * const in defineacc * uops get_reduce_acc * move around * allow consts in the LOAD/STORE * each axis should only appear once, 21 failures * 16 failures * fix some image * optional float4 * onnx tests * gate the stores * add reorder * fix terrible skip function * tc work * opt add/mul merge * fix float4 tests * tiny tweak, 9 failing * 7 test failures * start tc, but i don't think this will work * progress on tensorcores * note * fix ops tests * closer on tc * weeee...one tensor core works * still works, more generic * large WMMA works * tc test passes * use WMMA as accumulator * basic tc tests passing * small gemm padded works * 4 failures * 3 tests failing * super barrier * now two tests failing * one test failing * cleanpus, add reduce to UopGraph * remove the linearizer * remove unused * lil cleanups * Lowerer everywhere * remove test that doesn't exist now * image indexing * llvm fix * fix metal * fix image * fix images * might fix ptx * fix image type mismatch * more tests pass * CAST -> VECTORIZE * forgot that one * fix TestOps.test_flip_eye_crash * locals shouldn't be image dtype * change less files * test fix * fix recursive expands * touches * MULACC support in python * delete unneeded * alu before contract * bug fixes * tests * no var multireduce * simpler tc * 
metal works in new style * working on AMD and METAL * fix amd * shot in the dark, fix amd * something for CUDA * CUDA WORKS from the docs * comment * correct merge * cleanups + ptx fix + get_reduce_acc * local alias isn't used anymore * add store sanity check * fix for AMD * cleanups and single expand pass * more correct with acc_cache * tests should pass * block on WMMA * tests pass * merge contract and reduce * contractor fixes issue * multicontract * pre expand wmma (same as a reduce) * expand wmma and only take one * all expands * comments and whitespace	2024-07-10 15:07:42 -07:00
qazal	1f5de80eba	multi reduce Tensor.var passing verify_lazyop (#5346 ) * what about this * reset late gate	2024-07-09 17:20:17 +03:00
chenyu	4ceab5d2b1	fix PTX match rule for gated LOAD (#5338 ) * test padto sum with bool tensor and bool acc dtype make sure bool tensor acc with gate is handled correctly * broken in PTX * fix ptx	2024-07-08 22:25:03 -04:00
chenyu	a80f2df1bd	fix some PTX tests (#5337 ) fix broken PTX tests in test_linearizer and test_uops. there are tests that were skipped and broken because it runs only with CUDA=1 and we run PTX with NV=1 now	2024-07-08 21:33:05 -04:00
qazal	ae10e936e7	UOps.VECTORIZE cleanups [run_process_replay] (#5314 ) * still render_cast * one extra line ok * these are all just vectorize * save space * behavior change can go in a different diff	2024-07-07 10:49:08 +03:00
greg-niemeyer	77b2ce9fc9	Add UOps.VECTORIZE [run_process_replay] (#5289 ) * Add UOps.VECTORIZE to core * Update vectorized cast tests * Addresses code review comments - Removes VECTORIZE from LLVMRenderer - Add line breaks to unduly long lines - Add noop CAST rule back - Update asserts and add render_vectorize in CSytleLanguage renderer * Add missing const folding rule for VECTORIZE Also adds corresponding test * Fixes test_const_vectorize_fold and add assert - Use sane types with VECTORIZE in test_const_vectorize_fold - Add assert that sanity checks the types for VECTORIZE * Rename test_cast_vectorized_fold Renames test_cast_vectorized_fold to test_noop_vectorize_fold because the test targets a very specific rule and there are other tests for VECTORIZE. * Revert unrelated changes --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com> Co-authored-by: qazal <qazal.software@gmail.com>	2024-07-07 09:59:57 +03:00
qazal	8a99514462	generalize the uops toposort spec to ptx (#5309 ) * generalize spec to ptx * redundant assert * extra print	2024-07-07 00:06:30 +03:00
chenyu	3929a9dc94	fix UOp.cmp_tuple for ALU (#5280 ) * fix UOp.cmp_tuple for ALU for ALU, use self.arg instead of self.op to compare * skip that?	2024-07-03 14:59:05 -04:00
George Hotz	e53b164e1a	small changes from lowerer (#5266 )	2024-07-02 15:03:54 -07:00
qazal	3f4eeb8b54	late UOps.IF generation [run_process_replay] [no_assert] (#5027 ) * find all places * test gates * test * gate based on depths * add ctx * that cache was so wrong * delete useless things * dont double write if * self.if_cond * move UOps.IF to gated store * test_padto_where_multioutput * test_padto_group * minor cleanup * hmm this actually works? * need a good barrier * merge 2 * delete ctx * p1 * maybe p2 * p3 * minor fixup * fixup 2 * smart thing from the Lowerer branch * refactoring * refactoring 2 * maybe before graph_rewrite * slightly more acceptable Linearizer diff * more correct * [run_process_replay] [no_assert]	2024-06-29 12:22:14 -04:00
George Hotz	80ac21200b	hotfix: linearizer test fixup	2024-06-28 10:52:25 -07:00
George Hotz	d094a6828f	single pass rewrite (#5159 ) * single pass rewrite * claude cleanups * claude cleanups * skip those tests * restrict that to ints * comment * asserts i don't expect to fail do fail * simplest...rewrite...ever * simplest...rewrite...ever * add that rule back * tests pass? * only collapse reduce loops * second SHL/SHR arg must be 4 bytes * fix verify * no SHL/SHR in ptx * put that back * skip them in PTX...bad tests	2024-06-27 11:36:05 -07:00
Roelof van Dijk	f88f71d73a	ruff: unnecessary-comprehension (#5174 ) * enable ruff C416 unnecessary-comprehension * already a list	2024-06-27 07:45:29 -04:00
Jhenner Tigreros	fa78755f19	Add new patterns to unfold division (#5139 ) * Add new patterns to unfold division * Create regression test and fix pattern	2024-06-25 18:07:47 -07:00
qazal	c4fdb9c725	second iteration on verify_lazyop (#5140 )	2024-06-25 09:44:32 +03:00
qazal	18e70deec3	verify_lazyop (#5124 ) * start verify_lazyop * bfs order * assert * assert shapetrackers 2 * refactor * more iteration * skips * that ast was wrong too	2024-06-24 13:45:35 -07:00
Francis Lam	b563cd52ed	linearizer: change globals to merge into left axis/gridDims.x first (#5033 ) * linearizer: change order of collapse to be left-most also fixes Variable max size to be correct and add docs for the off parameter * fix multiple global dim oversizes * add passing variable test and reorganize tests * use assert RuntimeError for failing test	2024-06-23 18:53:15 -04:00
qazal	28bf8d86d8	test_linearizer with multi output ASTs (#5115 ) * ast is tuple * run test_phi_simplification * update reason * more tc * beam * a few more * use test_opt directly	2024-06-23 15:41:24 +03:00
qazal	5717a54b28	don't use Tensor.empty in kernel opts tests (#5086 )	2024-06-21 18:41:03 +03:00
George Hotz	6f6b3b10c9	import from uops, not linearizer (#5064 )	2024-06-20 08:08:44 -07:00
kormann	7c3b877216	rename uop [run_process_replay] (#5031 ) * rename * fix unittests * rename vin * fix test * fix type [run_process_replay] * rm pre commit hook change	2024-06-18 21:34:05 +03:00
Francis Lam	8d33998e0d	[run_process_replay] linearizer: fix get_grouping_dims to respect global/local max (#4855 ) * linearizer: fix get_grouping_dims to respect global/local max * fix lidx variable index offset and unrestrict clang/llvm global len * test reverse variable indexing when reverse_dims is true * change the collapse axis to be the right most if reversed	2024-06-18 16:51:27 +03:00
Junjun Dong	c8cd6e725c	Remove BinaryOps.SUB. Replace SUB by ADD and NEG in all tests. Regenerate dataset (#4977 ) * feat: remove BinaryOps.SUB * remove SUB in test_early_end_local * regenerate dataset. remove SUB in test_linearizer_* * reenable overflow tests * simplify tensor.sub function by returning a+(-b) * remove whitespaces --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-06-18 09:06:13 -04:00
chenyu	67e8df4969	remove numpy from dtype (#4969 ) replaced all dtype.np with _to_np_dtype defined in tensor.py. after this, the only numpy usages are (1) Tensor(np.ndarray), (2) construct .numpy() output, (3) numpy random buffer	2024-06-14 15:38:45 -04:00
Jhenner Tigreros	dc9e9e4363	Convert BinaryOps.DIV to UnaryOps.RECIP and BinaryOps.IDIV (#4887 ) * Create UnaryOps.RECIP and BinaryOps.IDIV and changing uses of BinaryOps.DIV * Delete unused import * Add cstyle renderer * Fix formatting text * Fix test error due to bad implementation of renderer * Add PTX support * Add RECIP to LLVMIR * Remove BinaryOps.DIV from symbolic test * Change some test and fix C floor division * Change references to DIV for the RECIP or IDIV * Add mimic idiv for symbolic test * Restore floor * Mimic idiv * cast to int * Fix some test and renderer * Remove DIV for render nodes * Resolve issue with div * Add TestRenderer * Fix test * fix error * Fix PAD test * Fix div implementation * Remove DIV * Add upcast to rshift, due to use of MUL and RECIP on DIV * Fix linter * Remove complete BinaryOps.DIV * Fix lint * Fix some test * Revert mul modification * Fix tests * Fix CLANG for uops * Revert IDIV function * Minor fix * modify pattern matching rule to support nan * Fix UNSAFE_PADS_OPS to add UnaryOps.RECIP * Remove const folding for IDIV and fix PTX * Complete remove IDIV from extra * Remove test_div from TestFloatUOps due to test on recip * Fix linearizer * fix * Fix test_22 * Fix llvm * Apply trunc function for llvmlit * use floor instead of trunc * Use correct type * Generate new fuzz db * Fix rshift, do not cast to float to support idiv * Return upcast=false to rshift * Add to unsafepad BinaryOps.IDIV * Remove RECIP override for CUDA * add atol / rtol for the test * Remove cast to int on IDIV * Regenerate sops * delete sops.gz * regenerate * regenerate * regenerate * Reduce margins * pass atol and rtol as parametersg for _test_metrics * regenerated dataset * Regenerate * Remove duplicated * Revert changes on extra * Remove changes extra and NOQA for test * Remove E501 * Remove and change line * Remove E501 * Fix atan2 * Revert import and E501 * Remove E501 * Add hrcp to halp ops * Remove 1 of hrcp * Remove last DIV and add type check on 
uops for IDIV * Fix new tests * Fix tests and custom function * Regenerate dataset * Regenerate dataset * Revert dataset * Change generate dataset script * Remove line * Change IDIV, type checker validate if x,y and z are int --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-06-14 02:43:46 -07:00
Timmy	720c700a8a	Multireduce-Kernels: Linearizer Changes and Tests (#4259 ) * basic tests * cleanup * pylint * ruff * use define acc as a proxy for rendered reductions * use define acc as a proxy for rendered reductions * recursive reduceop rendering via ast_parse * linters + cleanup * fixing late buf loading * plus linters * removing extra line * linters * does this break ci? * added tests and if add end change * typo in add_ends * linters * removing comments * allow endifs to be inserted before the end of the graph * find add ENDIF before next BARRIER * removing tests with manual ENDIF + linters * specifically the next barrier aftr the store of the local result * Revert "specifically the next barrier aftr the store of the local result" This reverts commit `b288a5c3ce`. * keeping up to date * linters + merge changes * cleaning up old bad decisions * linters and opts * mrged linearizer tests * fixing merge issues * removing the big ugly uop test (functionality tested end-to-end by test_linearizer additions * small diff fixes * updating linearizer to work without uops.add( ... cachable) * linters * comment in multireduce tests * skipping tests without locals * full tests * linters * load_cache[key] fix for multiple accs * linters * assert only one reduceop * fix loop_scope test to actually cause an issue * self.load_cache[key] key for DEFINE_ACC changed to use a string to make sure each acc is unique * updated tests * fixing merge * removing debug prints * complete merge fix * linters * diff cleanup * adding tests in * give each reduce it's own local buffer * gpu=1 changes * store and load locals with upcasting * modifying test? 
* make multireduce_netsted_local_upcast test match single reduce shapes * removing todo * cleaning up the diff * unroll test * unroll and upcast tests * fix gpu * seq and self.load_cache[key] cleaning * linters * padto works * merge fixes * fixes * add skips for amd * linters + seq * cleaning & more tests * softmax tests * linters * [run_process_replay] * add new tests back This reverts commit `19dec22e01`. * more hardcoded -1s * fix ptx * Fix name for loop in ptx * cleaning up the diff * cleaning up the uops diff * nv ci is too slow --------- Co-authored-by: qazal <qazal.software@gmail.com> Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com> Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-06-12 13:29:43 -04:00
Timmy	887643cf34	Multireduce atomic local load/store test (#4786 ) * atomic load/store test * tests for nested & unrolled * check barriers * linters * cleaning up diff * fix assert in _temp_create_multireduce_ast changes * cleaning up the check for redundant barriers * minor cleanups for the assert * always seed randn, helps with debuggability --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-06-05 14:41:19 +03:00
Szymon Ożóg	e47277d18a	Disable for PTX as well (#4838 ) Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-06-05 10:37:59 +03:00
chenyu	3afc914617	CMPEQ -> CMPNE and make it safe to pad (#4818 ) * CMPNE * new dataset	2024-06-03 18:02:15 -04:00
Timmy	ca32921f84	Multireduce PADTO Test (#4785 ) * padto test * expanded multireduce padto tests * cuda doesnt run on ci * moving padto_where_multireduce test to SUM so that we can check the reduce axis * cleaning up tests some more * add wanna_outputs * refactor test_padto_sum_multireduce * fix max and refactor where * fix axis --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-06-02 13:46:53 +03:00
chenyu	7cc883ecee	CMPLT is safe to pad (#4790 ) 0 < 0 evals to False	2024-05-30 22:50:48 -04:00
qazal	c2945be0a3	add fused tensor core opts tests (#4775 ) * add fused tc opts tests * n=64	2024-05-30 13:50:00 +03:00
qazal	0e824741c4	pre multi reduce codegen/* cleanup (#4755 ) * refactor self.reduceop * free lines * fix test	2024-05-28 08:15:48 -04:00
qazal	0e69b22629	multireduce OptOps tests (start) (#4733 ) * start * full tests * add skips * unrelated * notes	2024-05-27 12:21:33 +03:00
qazal	c7b1d802f1	delete duplicate tests in test_linearizer (#4723 ) * delete duplicate test test_simplify_uop isnt needed max works * ci * remove skip * add skip back	2024-05-26 08:11:42 +03:00
chenyu	31358cbea5	change Tensor.stack to method (#4719 )	2024-05-24 17:04:19 -04:00

209 Commits