tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-25 06:48:22 -05:00

Author	SHA1	Message	Date
chenyu	4ecd5789ab	#include <tgmath.h> in ops_clang (#3927 ) * different clang sqrt/log2/exp2/sin function based on dtype fixed softmax_argmax issue in #3552 for clang. * tgmath.h * revert those	2024-03-25 17:48:57 -04:00
Arseny Kapoulkine	514c43201d	Fix issues with pointer provenance in load/store through ALU (#3916 ) * Track pointer provenance in load/store through ALU Previously load/store could be incorrectly rendered into ld.global/st.global when the input was an ALU op that performed an address computation with DEFINE_LOCAL on one of the arguments. * Simplify the load provenance workaround The issue is that we can render the same code twice, and on the second run the opstream is already modified so that vin[0] isn't a DEFINE_, which overwrites initially correct .shared with .global. Add a couple tests for basic local use * Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL	2024-03-25 14:41:05 -07:00
chenyu	83f39a8ceb	env var to change default float (#3902 ) * env var to change default float to fp16 or bf16 looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights. working on default bf16 too. ``` RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined __bf16 cast0 = (nv_bfloat16)(val0); ``` remove that in cifar * DEFAULT_FLOAT * default of default * unit test * don't check default * tests work on linux	2024-03-24 20:33:57 -04:00
George Hotz	03899a74bb	increase atol on reset train	2024-03-24 15:17:31 -07:00
qazal	d8fafca13a	assign regression (#3907 ) * infra * track mutations * assign levels * add seen back * add test * infra 2.0 * add assign targets * dont need levels * delete * Update test_assign.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-24 15:12:31 -07:00
Patrick Tsai	e27129a798	Fix linearizer failure 26 test (#3906 ) * Adjust adds between WHERE and PHI * Not much better * undo recursive change * hm * iterate over where, not factored op * oo * consts only for loop * UNdo var name change * update --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>	2024-03-24 16:34:13 -04:00
wozeparrot	9a9cac58f9	add lars to nn (#3750 ) * feat: add lars * feat: don't remove this comment * clean: smaller diff * clean: shorter line * feat: remove mlperf lars, switch resnet * fix: fully remove mlperf lars * clean: comment * feat: contiguous * feat: no weight decay on skip params * feat: optimizergroup * feat: classic momentum * fix: pylint * clean: move comment * fix: correct algo * feat: lrschedulergroup * feat: skip list tests * feat: :\| forgot that params are a thing * feat: remove skip_list params from main params * feat: set moment --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-24 11:43:12 -04:00
chenyu	2c69888654	include negative float in test_dtype (#3884 ) * include negative float in test_dtype * that is ub * too annoying * pack can overflow	2024-03-24 02:39:15 -04:00
Francis Lam	0145366323	wmma: fix the AMD TC threads to split the first 16 threads (#3904 ) previously it was incorrectly aliasing 16 into the size 8 upcast on the store alias. now it splits it properly into 8 and the remaining 2 into the correct local stride	2024-03-23 21:17:42 -04:00
chenyu	a2b2597fc2	replace dtype.name str with render_dtype (#3903 ) fixed some bf16 cast issue since it does not have `.name`. also more robust if there are lang specific type override	2024-03-23 19:25:48 -04:00
Alejandro F Queiruga	556dcfb8f2	Fix the result permutation in einsum (#3895 ) * Fix permutation of result indices in einsum. * Delete stray line used for breaking tests * Fix linter error by renaming twice-used variable --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-23 15:48:19 -04:00
chenyu	2d3ce53348	touchup test_dtype.test_gradient_dtype (#3887 ) add back bad merge from #3613 and add float.double and float.bfloat16 to test	2024-03-22 20:56:45 -04:00
David Hou	fc11808a79	initialize Tensor grad same type as self (#3613 ) * initialize Tensor grad same type as self * also test different default float * check dtype + try/finally * don't test_gradient_dtype if f16 is not supported * fix bad merge --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-22 20:33:18 -04:00
Francis Lam	8db7a6bbcc	debug: add optional detailed BEAM_LOG logging (#3883 ) * debug: add optional detailed BEAM_LOG logging show uop count, compile and run times for each candidate in search also add --timing to verify_kernel.py to make it easier to explore hand-crafted applied opts * fix linter	2024-03-22 19:23:31 -04:00
George Hotz	54dc48aa47	fix assign (#3878 ) * fix assign * remove terrible optimizer hack * oops, not realized assigns	2024-03-22 11:48:48 -07:00
Francis Lam	5587594a00	fuzz_linearizer: add --ast and --file params to read kernels (#3877 ) also fix up ast_str_to_str to support the new tuple of LazyOps	2024-03-22 14:27:40 -04:00
chenyu	c5467e5bd6	diverse test value in test_dtype DATA based on dtype (#3864 ) * diverse test value in test_dtype DATA based on dtype * eh fix typo * that too? * PTX does not support i8 and s8 * skip that * unused line * put the hack back * remove that	2024-03-22 14:22:06 -04:00
George Hotz	86ee36e697	preschedule all (#3875 )	2024-03-22 11:20:06 -07:00
Szymon Ożóg	d8c3f1894a	Use UOpGraph in test (#3876 )	2024-03-22 14:12:38 -04:00
qazal	fe6ceff15f	proposal: multioutput JIT spec (#3856 ) * corealize JIT * requirements	2024-03-21 21:28:30 -07:00
uuuvn	6729f20aab	Ring allreduce try 2 (#3852 ) * Ring allreduce v3 * Configurable size, number of gpus and jit in benchmark * ScheduleBarrier v0 * GB/s that make sense * ScheduleBarrier v0.1 * Fallback on 2 GPUs * ScheduleBarrier v0.2 * ScheduleBarrier v0.3 * ScheduleBarrier v0.3.1 * ScheduleBarrier v0.3.2 * Replace ScheduleBarrier with automatic optimization * unused import * fix comment * typing * better fallback * python 3.8 * RING=2 and use ContextVar * DEBUG >= 2 and change name * linter * type --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-21 19:17:51 -04:00
Francis Lam	3c0478bfab	fuzz_linearizer: add additional DEBUG info for comparison errors (#3866 )	2024-03-21 18:58:10 -04:00
chenyu	e50b7abe4f	diversed buf inputs based on dtype in fuzz_linearizer (#3863 )	2024-03-21 16:23:11 -04:00
chenyu	30fa03243e	reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861 )	2024-03-21 14:12:27 -04:00
chenyu	33dd99acf4	remove helper_add_store from test_linearizer_failures (#3860 )	2024-03-21 12:53:31 -04:00
chenyu	6bf0b82267	alloc new output in fuzz_linearizer between baseline and real one (#3859 ) if the kernel is an assign `a += 1`, the rawbufs[0] is updated twice and gives false compare_error	2024-03-21 11:36:05 -04:00
nimlgen	85691c8e20	fix hsa sync issue (#3847 ) * fix hsa sync issue * linter	2024-03-21 04:00:30 +03:00
chenyu	f271cd682b	use _resolve_dim in argmax (#3846 ) also added comment of the behavior if there are multiple, and more tests	2024-03-20 20:17:30 -04:00
Francis Lam	6d5dec2fef	log optimized kernels and a script to compare with non-optimized ones (#3829 ) * search: add BEAM_VERIFY option to validate search results refactor fuzz_linearizer comparison to allow it to be used in for BEAM_VERIFY in device.py * search: fix to verify the beam_search result and not the fastest * search: fix typing and clean up * device: remove imports from test and add LOGKERN options LOGKERN output can be used with test/external/verify_kernel.py to validate correctness * fix example in verify_kernel.py * cleanup fixes * fix to use f-strings	2024-03-20 19:22:08 -04:00
chenyu	519336cfea	factor out partial in SumNode div int (#3841 ) * factor out partial in SumNode div int * div not rem * space	2024-03-20 16:34:33 -04:00
George Hotz	8cb5215885	Revert "Ring allreduce in multitensor (#3000 )" (#3840 ) This reverts commit `c5bf9e4c96`.	2024-03-20 11:41:49 -07:00
uuuvn	c5bf9e4c96	Ring allreduce in multitensor (#3000 ) * Ring allreduce v3 * Configurable size, number of gpus and jit in benchmark * ScheduleBarrier v0 * GB/s that make sense * ScheduleBarrier v0.1 * Fallback on 2 GPUs * ScheduleBarrier v0.2 * ScheduleBarrier v0.3 * ScheduleBarrier v0.3.1 * ScheduleBarrier v0.3.2 * Replace ScheduleBarrier with automatic optimization * unused import * fix comment * typing * better fallback * python 3.8 --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-20 11:20:01 -07:00
chenyu	455f7bea9b	test example from half resnet that idx has number outside of int32 (#3838 ) * test example from half resnet that idx has number outside of int32 * ruff	2024-03-20 13:44:20 -04:00
chenyu	d17900bc45	use int32 instead of default_int in simplify_phi_loops (#3828 ) * use int32 instead of default_int in simplify_phi_loops indices are in int32 now and is separated from buffer dtype. fix #3823 * return early if not supported * it's not that * why is it failing for RHIP	2024-03-19 17:49:58 -04:00
chenyu	99cbc24390	use dtypes.int32 as return dtype for functions that return indices (#3827 ) behavior matches jax. It's fine to have a tensor greater than max int8 size even if we set default int to int8	2024-03-19 17:06:57 -04:00
chenyu	fa1921ec7d	move test_dtype tests to test dtype and output value (#3826 )	2024-03-19 16:31:27 -04:00
Francis Lam	131bbb6563	test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825 ) * test_linearizer_failure: add failure 27 from a gpt2 kernel found during a full fuzz test of applied_opts combos to a depth of 4 on the gpt2 kernels w/o GROUPTOP. added additional examples to failure 26 that don't have GROUPTOP * add other platform failure	2024-03-19 16:29:50 -04:00
Francis Lam	9851e2c3b9	test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821 ) found during a full fuzz test of all applied_opts combos to a depth of 3 on the gpt2 kernels	2024-03-19 13:19:54 -04:00
Patrick Tsai	b436c9792f	Fix factoring bug (O(n) arange related) (#3817 ) * Factoring bug * Another one in case * It works now so change tests back * large arange cumsum optimization * More cleanup * symbolic no factor div test * name change * Rename test --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>	2024-03-19 11:49:42 -04:00
chenyu	a6ed2ae3c6	use old cumsum optimization for arange (#3813 ) revert to old cumsum opt while phi simplification is disabled. added a flops complexity test for this	2024-03-18 20:01:03 -04:00
chenyu	ac866eaf5a	disable simplify_phi_loops (#3812 ) * disable simplify_phi_loops this breaks BEAM search GPT2. * skip that	2024-03-18 19:25:26 -04:00
George Hotz	4c4d3cb3e3	restrict assignment to base (#3809 ) * restrict assignment to base * add some restrictions there * more restrictions	2024-03-18 15:33:06 -07:00
chenyu	20681d5c4a	remove old dist multigpu (#3811 )	2024-03-18 18:31:05 -04:00
chenyu	5dd048a378	remove HIP in core tinygrad (#3810 ) * remove HIP in core tinygrad ci test uses device RHIP and HSA compiler (LinearizerOpt), so fine to remove HIP from tc. Also updated README and EMULATE tc test flag * EMULATE_CUDA	2024-03-18 18:19:27 -04:00
Francis Lam	a7afd2f6bf	test_linearizer_failures: add failing kernel from GPT2 CUDA (#3808 ) * test_linearizer_failures: add failing kernel from GPT2 CUDA * test_linearizer_failure: remove "HIP" from failed_platforms	2024-03-18 17:16:40 -04:00
George Hotz	d8296d4a3f	simple assign tests (#3807 )	2024-03-18 13:57:01 -07:00
wozeparrot	a0ab755317	threefry again (#3785 ) * feat: initial xor * feat: initial threefly * feat: remove custom random * fix: really need to install precommit * feat: lmao forgot that this is rotate not a shift * clean: put that there * feat: numpy xor * feat: quick test for xor * feat: llvm xor * feat: slightly working xor in torch * feat: rand works in jit * clean: save a line * feat: match jax * feat: maybe test against jax * feat: requires_grad * fix: fix test_symbolic_ops * feat: lower alpha * feat: just pad * fix: maybe fix training tests? * fix: fix some llvm stuff * feat: cursed realize on the way out * feat: testing jax * fix: why is the jax install process not simple * fix: maybe passing test * fix: symbolic workarounds * clean: still need that precommit * fix: aaaa * fix: more test fixes * fix: quick fix for wgsl * feat: need to set requires_grad on the final tensor * feat: one more tensor * feat: don't take forever * feat: seeing y ci is brok * feat: can't allocate 64GiB lmao * fix: fix this * feat: hope this doesn't break smth before i go to bed * feat: don't destroy ram * feat: int * feat: remove jax * feat: properish workaround? * feat: skip slow webgpu tests * feat: no longer fails * feat: use dtypes * feat: real number * fix: torch * fix: don't test against reference for torch * feat: to device * feat: fix advanced indexing * feat: correct casting * feat: even rng_counter * feat: match master * feat: this was actually bad * fix: maybe? * feat: store * feat: remove realizes * feat: somehow this is important * feat: somehow this is also important * feat: save a line * fix: don't need that anymore * feat: restore this * fix: linter * feat: remove realizes * fix: realized is in base now * fix: add back cast * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: bump deadline * fix: :( * fix: :( * fix: not being dumb * feat: try changing less tests * feat: shouldn't have to change that * feat: contiguous bumps it by one * fix: hmm * fix: numpy memory moment * fix: cl_khr_fp16 * fix: torch has different tensor count * fix: missing contiguous * hmm: hmm * fix: some fixes * fix: typing * feat: dont do that * feat: typing fixes * feat: why is this realize required? * feat: ngl kinda odd typing * feat: oh * feat: remove realizes * feat: why is this realize required? * fix: hacky patch for cudacpu * fix: without this realize pytest crashes????? * fix: shorter line * fix: cudacpu fixes * fix: cudacpu fixes * feat: real buffer * feat: don't search when searching lmao * fix: can't use contiguous things * fix: no more 100GB arrays * fix: revert * fix: skip 7 and 10 * feat: working ish beam * feat: minimize changes * feat: seed 0 stable diffusion example changed * fix: different on ci * fix: no beam * feat: make threefry optional * fix: check value * fix: unused import * feat: threefry default * fix: 5d * feat: allow non upcast div * fix: 5d better * fix: 5d better * fix: save all dtype * feat: proper error * feat: lazyop key * fix: check float * feat: try removing this realize now * feat: disable threefry for uops hip tensor cores * feat: don't need that * feat: only check upcast * fix: disable threefry for some metal tests * feat: disable for metal tensor uops as well * feat: disable for most uops * fix: disable threefry for new uops tests * feat: multitensor * fix: typing * feat: threefry default off * feat: skip threefry half rand * feat: restore old * fix: bad git * clean: ruff * feat: bfloat16 fix * fix: :\| * feat: restore old --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-18 16:47:07 -04:00
nimlgen	629757eaa1	hotfix: update inputs of correct transfers in hsagraph (#3800 ) * hotfix: update inputs of correct transfers in hsagraph * test it * run in ci?	2024-03-18 15:52:27 -04:00
George Hotz	0183a05f0a	test assign (#3798 ) * Reapply "add failing assign test (#3796)" (#3797) This reverts commit `1e1beb888c`. * no realized check	2024-03-18 08:58:04 -07:00
George Hotz	1e1beb888c	Revert "add failing assign test (#3796 )" (#3797 ) This reverts commit `2dea12832c`.	2024-03-18 08:55:36 -07:00


4618 Commits