tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-07 13:15:01 -05:00

Author	SHA1	Message	Date
chenyu	afe020710d	disable PADTO on upcasted axis (#4444 ) fixed test_failure_31. PADTO upcasted is at best a no-op, and might fail at edge cases.	2024-05-05 21:52:03 -04:00
Francis Lam	5c5b40880f	search: fix edge cases on screening potential ops (#4394 ) * search: fix edge cases on screening potential ops won't change correctness, but will save a little python time by properly deduplicating potential actions * check for de-duplication instead of exact valid actions * refactor long line	2024-05-02 14:53:05 -04:00
Francis Lam	0d33c54d99	kernel: change PADTO check to allow up to 4x padding (#4354 ) * kernel: change PADTO check to allow up to 4x padding also optionally remove PADTO from the search action space with BEAM_PADTO=0. * fix test_linearizer test_tensor_cores_padded tests * update resnet runs to use SPLIT_REDUCEOP=1 * fix up search TC axis and amt checking * fix up the dimensions of the TC tests	2024-04-30 15:29:34 -04:00
Francis Lam	c12bcabb07	search: fix actions space checks to ignore TC axis and amt (#4360 ) * search: fix actions space checks to ignore TC axis and amt * add test for number of actions in get_linearizer_actions	2024-04-30 14:02:22 -04:00
George Hotz	d325be2540	update docs (#4356 ) * update docs * nn.md * mnist cleanups * rhip test is very slow	2024-04-30 16:51:42 +09:00
Francis Lam	a9a1fa6bbf	wmma: add reduce axis choice to TC action space (#4328 ) * wmma: add reduce axis choice to TC action space * add test for TC multi-reduce axis choice	2024-04-29 19:15:39 -04:00
George Hotz	38f97aa0fe	rename rawbufs to bufs in ExecItem (#4274 )	2024-04-24 11:27:27 +08:00
Francis Lam	3f6c7ca8bf	test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258 ) * test: fix test_tensor_core_padded on CUDA and add to benchmarks * fix linter * run both tests in one call	2024-04-22 23:22:11 -04:00
Francis Lam	bbb0ad4800	wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216 ) * wmma: widen TC usage in search by using PADTO on TC axes when possible * test: start tests for the new padding TC behavior * search: upgrade padded TC search to TC_OPT >= 2 * test: add behavior and correctness test for padded TC added optional argument to apply_tensor_core to set TC_OPT level * linearizer: add tests for the PADTO behavior and docs	2024-04-22 16:50:31 -04:00
chenyu	31c9d9a228	fix test_linearizer tc opt tests for bf16 (#4237 ) bf16 tc has larger rtol	2024-04-20 11:51:50 -04:00
George Hotz	ebc94c9d6c	rewrite the jit in the context of new schedule (#4162 ) * rewrite the jit in the context of new schedule * mypy better * fix placeholder * tests * all functionality should work * fix tests * no CacheCollector	2024-04-12 21:54:36 -07:00
chenyu	06bcae13b4	PADTO SUM if parents of sum are all zero-preserving (#4140 ) * PADTO SUM if parents of sum are all zero-preserving * test case unsafe ops after sum is fine * reuse UNSAFE_PAD_OPS * update db version	2024-04-10 22:16:12 -04:00
chenyu	406cb5fd90	const fold ReduceOps (#4059 )	2024-04-03 14:39:28 -04:00
chenyu	f61ed869f5	Use exec_alu for lazy const folding (#4039 )	2024-04-02 20:52:05 -04:00
George Hotz	9eef44521b	ScheduleItem uses Buffer (#3995 ) * schedule Buffer * update * update tests * master * works * remove LoadOps.WAIT * fix compile2 * bad test * rename and note	2024-03-29 20:50:27 -07:00
George Hotz	8f1e34a2a0	early src delete (#3996 ) * early src delete * fix bad test * fix test_linearizer	2024-03-29 19:46:07 -07:00
chenyu	d9ff636cf5	use is to compare with enum (#3993 ) * use is to compare with enum currently it's mixed between `==` and `is`, moved all to `is` * more	2024-03-29 13:02:56 -04:00
chenyu	b47f6cebb2	LinearizerOptions -> CompilerOptions (#3978 )	2024-03-28 17:50:23 -04:00
George Hotz	42b9d999ea	Buffer isn't always allocated (#3974 ) * buffer alloc * allocate * missing allocates * last one	2024-03-28 13:33:47 -07:00
Francis Lam	7c5729a3bd	wmma: refactor to remove wmma_func and create TC funcs as needed (#3945 ) * wmma: refactor to remove wmma_func and create TC funcs as needed * test_linearizer: disable bf16 CUDA during emulation testing * cstyle: clean up creation of CUDA vec dtypes * extra/gemm: add option to accumulate to bfloat16 * cleanups * benchmark: add CUDA bfloat16 matmul * more cleanups	2024-03-27 16:43:09 -04:00
George Hotz	68ca4d4276	split to schedule.py (#3949 ) * split to schedule.py * split	2024-03-26 21:02:46 -07:00
George Hotz	150ea2eb76	create engine folder and move code (#3948 ) * retry * older tf * that	2024-03-26 20:38:03 -07:00
Patrick Tsai	b436c9792f	Fix factoring bug (O(n) arange related) (#3817 ) * Factoring bug * Another one in case * It works now so change tests back * large arange cumsum optimization * More cleanup * symbolic no factor div test * name change * Rename test --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>	2024-03-19 11:49:42 -04:00
chenyu	ac866eaf5a	disable simplify_phi_loops (#3812 ) * disable simplify_phi_loops this breaks BEAM search GPT2. * skip that	2024-03-18 19:25:26 -04:00
wozeparrot	a0ab755317	threefry again (#3785 ) * feat: initial xor * feat: initial threefly * feat: remove custom random * fix: really need to install precommit * feat: lmao forgot that this is rotate not a shift * clean: put that there * feat: numpy xor * feat: quick test for xor * feat: llvm xor * feat: slightly working xor in torch * feat: rand works in jit * clean: save a line * feat: match jax * feat: maybe test against jax * feat: requires_grad * fix: fix test_symbolic_ops * feat: lower alpha * feat: just pad * fix: maybe fix training tests? * fix: fix some llvm stuff * feat: cursed realize on the way out * feat: testing jax * fix: why is the jax install process not simple * fix: maybe passing test * fix: symbolic workarounds * clean: still need that precommit * fix: aaaa * fix: more test fixes * fix: quick fix for wgsl * feat: need to set requires_grad on the final tensor * feat: one more tensor * feat: don't take forever * feat: seeing y ci is brok * feat: can't allocate 64GiB lmao * fix: fix this * feat: hope this doesn't break smth before i go to bed * feat: don't destroy ram * feat: int * feat: remove jax * feat: properish workaround? * feat: skip slow webgpu tests * feat: no longer fails * feat: use dtypes * feat: real number * fix: torch * fix: don't test against reference for torch * feat: to device * feat: fix advanced indexing * feat: correct casting * feat: even rng_counter * feat: match master * feat: this was actually bad * fix: maybe? 
* feat: store * feat: remove realizes * feat: somehow this is important * feat: somehow this is also important * feat: save a line * fix: don't need that anymore * feat: restore this * fix: linter * feat: remove realizes * fix: realized is in base now * fix: add back cast * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: :( * fix: :( * fix: not being dumb * feat: try changing less tests * feat: shouldn't have to change that * feat: contiguous bumps it by one * fix: hmm * fix: numpy memory moment * fix: cl_khr_fp16 * fix: torch has different tensor count * fix: missing contiguous * hmm: hmm * fix: some fixes * fix: typing * feat: dont do that * feat: typing fixes * feat: why is this realize required? * feat: ngl kinda odd typing * feat: oh * feat: remove realizes * feat: why is this realize required? * fix: hacky patch for cudacpu * fix: without this realize pytest crashes????? * fix: shorter line * fix: cudacpu fixes * fix: cudacpu fixes * feat: real buffer * feat: don't search when searching lmao * fix: can't use contiguous things * fix: no more 100GB arrays * fix: revert * fix: skip 7 and 10 * feat: working ish beam * feat: minimize changes * feat: seed 0 stable diffusion example changed * fix: different on ci * fix: no beam * feat: make threefry optional * fix: check value * fix: unused import * feat: threefry default * fix: 5d * feat: allow non upcast div * fix: 5d better * fix: 5d better * fix: save all dtype * feat: proper error * feat: lazyop key * fix: check float * feat: try removing this realize now * feat: disable threefry for uops hip tensor cores * feat: don't need that * feat: only check upcast * fix: disable threefry for some metal tests * feat: disable for metal tensor uops as well * feat: disable for most uops * fix: disable threefry for new uops tests * feat: multitensor * fix: typing * feat: threefry default off * feat: skip threefry half rand * feat: restore old * fix: bad git * 
clean: ruff * feat: bfloat16 fix * fix: :\| * feat: restore old --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-18 16:47:07 -04:00
George Hotz	311cf2b7d3	Revert "threefry_2x32 (#2601 )" (#3784 ) This reverts commit `db3de54bc4`.	2024-03-17 10:27:20 -07:00
wozeparrot	db3de54bc4	threefry_2x32 (#2601 ) * feat: initial xor * feat: initial threefly * feat: remove custom random * fix: really need to install precommit * feat: lmao forgot that this is rotate not a shift * clean: put that there * feat: numpy xor * feat: quick test for xor * feat: llvm xor * feat: slightly working xor in torch * feat: rand works in jit * clean: save a line * feat: match jax * feat: maybe test against jax * feat: requires_grad * fix: fix test_symbolic_ops * feat: lower alpha * feat: just pad * fix: maybe fix training tests? * fix: fix some llvm stuff * feat: cursed realize on the way out * feat: testing jax * fix: why is the jax install process not simple * fix: maybe passing test * fix: symbolic workarounds * clean: still need that precommit * fix: aaaa * fix: more test fixes * fix: quick fix for wgsl * feat: need to set requires_grad on the final tensor * feat: one more tensor * feat: don't take forever * feat: seeing y ci is brok * feat: can't allocate 64GiB lmao * fix: fix this * feat: hope this doesn't break smth before i go to bed * feat: don't destroy ram * feat: int * feat: remove jax * feat: properish workaround? * feat: skip slow webgpu tests * feat: no longer fails * feat: use dtypes * feat: real number * fix: torch * fix: don't test against reference for torch * feat: to device * feat: fix advanced indexing * feat: correct casting * feat: even rng_counter * feat: match master * feat: this was actually bad * fix: maybe? 
* feat: store * feat: remove realizes * feat: somehow this is important * feat: somehow this is also important * feat: save a line * fix: don't need that anymore * feat: restore this * fix: linter * feat: remove realizes * fix: realized is in base now * fix: add back cast * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: :( * fix: :( * fix: not being dumb * feat: try changing less tests * feat: shouldn't have to change that * feat: contiguous bumps it by one * fix: hmm * fix: numpy memory moment * fix: cl_khr_fp16 * fix: torch has different tensor count * fix: missing contiguous * hmm: hmm * fix: some fixes * fix: typing * feat: dont do that * feat: typing fixes * feat: why is this realize required? * feat: ngl kinda odd typing * feat: oh * feat: remove realizes * feat: why is this realize required? * fix: hacky patch for cudacpu * fix: without this realize pytest crashes????? * fix: shorter line * fix: cudacpu fixes * fix: cudacpu fixes * feat: real buffer * feat: don't search when searching lmao * fix: can't use contiguous things * fix: no more 100GB arrays * fix: revert * fix: skip 7 and 10 * feat: working ish beam * feat: minimize changes * feat: seed 0 stable diffusion example changed * fix: different on ci * fix: no beam * feat: make threefry optional * fix: check value * fix: unused import * feat: threefry default * fix: 5d * feat: allow non upcast div * fix: 5d better * fix: 5d better * fix: save all dtype * feat: proper error * feat: lazyop key * fix: check float * feat: try removing this realize now * feat: disable threefry for uops hip tensor cores * feat: don't need that * feat: only check upcast * fix: disable threefry for some metal tests * feat: disable for metal tensor uops as well * feat: disable for most uops * fix: disable threefry for new uops tests * feat: multitensor * fix: typing * feat: threefry default off * feat: skip threefry half rand * feat: restore old * fix: bad git * 
clean: ruff * feat: bfloat16 fix * fix: :\| --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-17 10:19:33 -07:00
George Hotz	53adcb34f5	remove hip backend (#3783 ) * remove hip backend * remove unused * rhip * more RHIP	2024-03-17 10:12:16 -07:00
qazal	e3e89c244b	multioutput uoping infra (#3706 ) * linearize multioutput * add vars to copy	2024-03-15 21:56:59 -07:00
George Hotz	ca19eb3e82	where fold try 2 (#3748 ) * where fold try 2 * assign fold * test_where_fold works * add gated store support to ops_python --------- Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-15 07:46:26 -07:00
chenyu	90e55a9fd1	fix buf_index not found case in _apply_tc_opt (#3739 ) ValueError if src.src[0] is not a LOAD. Replaced with returning None in _apply_tc_opt and test to make sure the net output is KernelOptError.	2024-03-14 14:27:05 -04:00
nimlgen	6bf11a2ce3	fix incorrect direct store with gep (#3735 ) * fix incorrect direct store with gep * better comment * phi as well * dtype check there * mypy happy? * not used * renames * phi in phi	2024-03-14 20:58:50 +03:00
qazal	43953c0ba9	skip grouped store for unmatching upcasts (#3723 ) * skip if upcasts dont match * outputs match now * this ast is hardcoded --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-14 01:18:31 -04:00
qazal	337cd53444	multioutput ScheduleItem (#3699 ) * refactor realize.py * update docs * update test_sched * update runners and devices * update openpilot and unit tests * cleanup runner lowering * update more tests	2024-03-13 08:59:38 -07:00
Patrick Tsai	971d7f5d7c	O(n) arange attempt (#3530 ) * It works? * Clamp correctly * Refactor * Make code better * Undo some stuff * First step to trying to make floats work * Floats work in Python op but not metal because int div is different Python integer division was implemented as // which rounds towards negative infinity, but C integer division rounds towards 0 so there is an off-by-1 division error * arange does cumsum with ints and then multiplies by step This is so loop optimization can remain int only * Undo a lot of symbolic changes * Final check * Cleanup * There can be multiple phis * Fix multiple phi op removal * const sets dtype correctly * Fix bugs * Fix a couple bugs and add loop vars to resolve * missed one * Don't trim too many ops * Fix symbolic test * Use ones instead of full * Delete test * Lint passes * max node error * Small updates to loop logic * Remove unnecessary changes * We are getting somewhere * Simple case * Fix * rm, prn * Better * If NumNode doesn't work then continue * clamp is needed for arange(256) * Move everything into the optim fn * Replace correctly * Order optimizations better * Delete * mypy * Test for simplification * Rename * Fix test * update test description * Undo more * Cleanup * No replaced_ops map * Fix lint * AssertionError * back again * Reinstate assertion * Return true and make diff not as big * Bigger range for test * Change cumsum impl * fix bug * make big cumsum work * lint * Undo cumsum 2-stage removal * No while helper * optional min/max clamping * floats work * rm giant arange test * fix python cast None * Check phi parents * one phi allowed per where * Fix one phi per where * Rework iteration * Delete assertions * convert to int * Try mul -1 instead of neg for hip..? * Remove one phi per where requirements * one accum only * Lint * should simplify a loop at a time * Don't get rid of loop explicitly * Need to iterate backwards * lint * unary neg * Make optim work for onnx and sum_pad_collapse * Better message * filter alu ops correctly * Fix the limiter * lint and simplify * Add it back * off by one error * test wheres and phis * test max ops and non-if stuff * <= * cast_scalar * Oops * Change test * Pass loop uops instead of a modified map * Cut param transfer between linearizer and uops * Fix issues * Fix lint * fix efficientnet python 3.8 invalid syntax * distinct vars in seen_vars * accurate var names --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-11 16:09:20 -07:00
qazal	aec4c4f01b	linearizer ast as a tuple of lazyops (#3689 ) * multi store op linearizer * currently we do only one output per kernel * named opts	2024-03-11 15:39:04 -07:00
George Hotz	44a67bf783	constant folding (#3675 ) * constant fold * bool math * fix ptx	2024-03-10 14:47:24 -07:00
chenyu	915f98791c	use custom KernelOptError in kernel opt (#3661 ) be more specific about invalid kernel opt, used that in test_linearizer_failures. make BEAM kernel search work even with assertion disabled. `BEAM=2 python3 -O examples/llama.py --temperature=0 --count=10 --prompt="Hello." --timing`	2024-03-08 15:36:16 -05:00
chenyu	906cc3a69b	cleanup tests Device[Device.DEFAULT] is always Compiled (#3645 )	2024-03-07 11:15:42 -05:00
George Hotz	81baf3eed3	bring ptx back (#3623 ) * bring ptx back * ptx back * fix define var * fix a few bugs * bugfixes * fixes * fix llvm bug * fix test bug	2024-03-06 13:34:21 -08:00
qazal	94679322a3	simpler float4 direct store and locals support (#3592 ) * swap vins instead * delete the upcast * leave it to remove_childless try 1 * Revert "leave it to remove_childless try 1" This reverts commit `bf25e935f8`. * try 2, simpler * Revert "try 2, simpler" This reverts commit `d2472af711`. * add note	2024-03-04 06:28:28 -08:00
qazal	a89afd4ffa	Directly store float4 nodes (#3564 ) * float4 cast collapse * simplify cstyle * simplify uoptimizer * ci --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-02 15:58:20 -08:00
George Hotz	aa9b013d79	add constant folding for WHERE in uops (#3584 ) * add constant folding for WHERE in uops * prereqs for generic constant folding * fix test * disable slow overflow logic * make that test faster	2024-03-02 10:37:14 -08:00
George Hotz	6b29c70b3d	Refactor to UOpGraph class (#3566 ) * Refactor to UOpGraph class * fix test	2024-03-01 15:14:48 -08:00
Francis Lam	5d434801fa	search: add tensor core to beam search space (#3275 ) * search: add tensor core to beam search space * kernel: refactor apply_tensor_core into apply_opt and hand_coded * kernel: revert removal of apply_tensor_cores also revert BEAM search parameter changes	2024-02-29 13:05:10 -08:00
qazal	94fc0fd546	uop the float4 acc upcast in group_for_reduce kernels (#3466 ) * simplest one * but i can trust this will be cached correctly * wait that was wrong too * cleanup * test_reduce_upcast for single reduce case * a late accumulator always outputs to gds lint	2024-02-28 17:33:47 -08:00
David Hou	f513c37e64	support same uidx in multiple shape positions (#3205 ) * support same uidx in multiple shape positions * rename var * update comment * add contiguous index check to global_store too * update comment * small change * is this better? * smh * smaller change? * get rid of more changes * get rid of more changes * is this even making anything better * comment * fix test --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-02-21 19:37:03 +01:00
chenyu	86efdf0b34	remove create_rednode (#3444 ) handle Node collapsing into NumNode similar to OpNode	2024-02-18 21:08:19 -05:00
qazal	e1a57fe58a	test the behavior, not the implementation (#3419 )	2024-02-15 17:23:42 +01:00
qazal	7919a1e6ec	dtypes: delete the float cast in realize.py (#3401 ) * remove float cast * cast scalars to the correct value in creation time * cast scalar in the correct place * wrong, use y_dtype * make consts have a unique cache key * add cast_scalar back * test_load_cache_const_bufs * add bool dtype * test_const_dtype * fix linters	2024-02-15 14:20:30 +01:00
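
Note on commit 971d7f5d7c: its message says Python's `//` rounds toward negative infinity while C integer division rounds toward zero, which produces an off-by-one result for operands of mixed sign. A minimal sketch of that difference follows. The helper name `c_style_div` is hypothetical, chosen for illustration only, and is not taken from tinygrad.

```python
def c_style_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C's `/` does on ints.

    Hypothetical helper for illustration; not part of tinygrad.
    """
    q = abs(a) // abs(b)  # magnitude of the quotient, always rounded down
    # Negate when the operands have opposite signs.
    return q if (a >= 0) == (b >= 0) else -q


if __name__ == "__main__":
    # Compare Python floor division with C-style truncating division.
    for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
        print(f"a={a:>2} b={b:>2}  floor: {a // b:>2}  trunc: {c_style_div(a, b):>2}")
```

The two results differ only when the operands have opposite signs and the division is inexact, which matches the off-by-one case the commit message describes.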


129 Commits