tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-26 07:18:40 -05:00

Author	SHA1	Message	Date
George Hotz	942c58be90	BEAM_COMPARE=2 validates the correctness of BEAM kernels (#5458 ) * beam compare 2 * found issue maybe * correct, not fail * full rand * less numpy * extra simplify doesn't fix it * reorder * no numpy * check in reverse * test new tensor behavior * better error msg	2024-07-13 13:53:43 -07:00
qazal	487ceff825	hotfix: ASSERT_PROCESS_REPLAY sometimes doesn't exist (#5456 )	2024-07-13 21:15:40 +03:00
qazal	40ec9410f9	simpler process replay (#5452 ) * remove check_process_replay * that can go to the top * add assert back * [run_process_replay] * checkout code [run_process_replay] * temp [run_process_replay] * revert temp [run_process_replay] * ahh this is why [run_process_replay] * revert temp [run_process_replay]	2024-07-13 19:55:06 +03:00
qazal	23b907efbb	restore process replay runs by their id (#5453 )	2024-07-13 19:32:34 +03:00
George Hotz	e638b0084f	smaller multitensor resnet test (#5450 ) * minor improvments to matcher speed [run_process_replay] * oh, put that back * make fake images smaller for resnet test	2024-07-13 07:31:28 -07:00
qazal	bb1a9ebf78	run process replay in parallel (#5443 )	2024-07-13 11:29:36 +03:00
chenyu	3ebf569f04	relax fuzz transend math threshold a bit (#5442 ) * relax fuzz transend math threshold a bit * fuzz more * fuzz 50k	2024-07-13 03:31:21 -04:00
chenyu	e398734890	fuzz test transcend math (#5383 ) * fuzz test transcend math found something wrong with float64 sin reduction ``` from tinygrad import Tensor, dtypes import numpy as np print(Tensor([39800.0], dtype=dtypes.float64).sin().numpy()) print(Tensor([39800.0], dtype=dtypes.float32).sin().numpy()) print(Tensor([39800.0], dtype=dtypes.float16).sin().numpy()) print(np.sin(np.array([39800.0], dtype=np.float64))) print(np.sin(np.array([39800.0], dtype=np.float32))) print(np.sin(np.array([39800.0], dtype=np.float16))) ``` ``` CLANG=1 python test.py [0.92785633] [0.7428573] [-0.7705] [0.74285722] [0.7428572] [-0.7705] ``` * fix test * abs * skip	2024-07-13 01:54:52 -04:00
hikettei	3a7262d923	[Patch] Fixed an invaild value of fp64 xlog(DBL_MIN) (#5441 ) * [Patch] Removed weird NaN Handling in xlog2 resulting in different output around 1e-203 * Patch: compare the value of xlog(x) using y, allowing x <= 1e-200 * mypy * fuzzer tests for log2 * fix tests: use approximate dbl_min, fp64 fails at nv * update: gradually increment the scale (if y is not inf)	2024-07-13 01:11:53 -04:00
wozeparrot	90f0e2fc49	db in wal mode (#5388 )	2024-07-12 20:43:36 -07:00
chenyu	4df63da190	clean up rest of the loadop [run_process_replay] (#5440 ) to metaop and filter_sink	2024-07-12 23:38:51 -04:00
hikettei	0795139f30	Fix TRANSCENDENTAL=2 fp64 sin (#5385 ) * fixes on transcendental: fix for fp64 payne hanek, refactor for fp16 sin * revert the changes on test * refactor on xsin: removed cody_waite_reduction, always use payne_hanek * Revert "refactor on xsin: removed cody_waite_reduction, always use payne_hanek" This reverts commit `2fd401f251`. * still need cody_waite_reduction for the very smaller range * test: added a regression test for transcendental sin * test: found the worse case ulp 3.5 only in numpy * give the input as a valid dtype --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-12 23:15:04 -04:00
George Hotz	fb3011ac61	improve matcher speed [run_process_replay] (#5438 ) * improve matcher speed [run_process_replay] * don't use arg set in ptx	2024-07-12 20:02:19 -07:00
George Hotz	03c2dc8bd7	lowerer is kernel [run_process_replay] (#5437 )	2024-07-12 18:50:55 -07:00
George Hotz	b8342fb085	independent lowerer [run_process_replay] (#5434 ) * independent lowerer [run_process_replay] * don't relinearize PTX * fix ptx * Revert "fix ptx" This reverts commit `f4e8e059c0`. * Revert "don't relinearize PTX" This reverts commit `f6c12c506c`. * parents is fine, no need for linearization * remove loop local idxs * recover stupid loop_idxs	2024-07-12 18:08:43 -07:00
wozeparrot	b80fd7d23c	allow benchmarking forward only (#5436 )	2024-07-12 17:37:49 -07:00
chenyu	00813a92a0	update Tensor.eye api to match torch (#5433 ) * update Tensor.eye api to match torch input is n for nrows and optional m for ncols * space * fix onnx	2024-07-12 20:25:12 -04:00
George Hotz	870dc8c350	s/Linearizer/Lowerer [run_process_replay] (#5428 )	2024-07-12 15:54:07 -07:00
George Hotz	6707c778d0	scheduleitem is not Tuple [run_process_replay] (#5425 ) * scheduleitem is not Tuple [run_process_replay] * fix tests * fix op + fuzzers * fix mop test	2024-07-12 15:13:19 -07:00
George Hotz	94599c0637	fixup ast in kernel to be MetaOps.SINK [run_process_replay] (#5424 ) * fixup ast in kernel to be MetaOps.SINK [run_process_replay] * fix tests * fix more tests	2024-07-12 14:01:03 -07:00
chenyu	d37056f3b1	pass Renderer.global_max / local_max into get_grouped_dims (#5423 ) [run_process_replay]	2024-07-12 16:49:27 -04:00
George Hotz	f6ef283e6a	s/loadops/metaops [run_process_replay] (#5421 )	2024-07-12 13:26:50 -07:00
chenyu	a0dbe20dbd	skip some redundant and slow tests in ci (#5416 )	2024-07-12 14:43:13 -04:00
chenyu	76125c07be	make some grouped_dim test work (#5415 ) next need to support max size per dim, splitting and correct way to do reverse or arbitrary permute global dims	2024-07-12 14:22:50 -04:00
uuuvn	3cb94a0a15	Rename tinygrad/runtime/driver to support (#5413 )	2024-07-12 11:06:42 -07:00
chenyu	497274f663	add float64 to test_dtype_alu dtypes_float (#5410 ) * add float64 to test_dtype_alu dtypes_float * CUDACPU float64 crashes * real NV failed	2024-07-12 10:21:32 -04:00
qazal	31fcc516dc	more process replay tooling (#5407 ) * replays * what's in there * can it be up there * sha is enough * insert sha as the key * fix str * update reset utils * that nested try/except was terrible * github_context can go	2024-07-12 13:11:34 +03:00
qazal	e22b377839	generalize FUSE_AS_ONE_KERNEL in the scheduler (#5397 ) * test: use const * hotfix: base * asserts * dont push through reshape * cleanup * dont need the cache * test_reduceop_reshape_dont_push and test_index_fused are next	2024-07-12 10:23:16 +03:00
chenyu	6e0a523078	repro slow resnet kernel with 4 global dims (#5402 ) * repro slow resnet kernel with 4 global dims * fix ruff	2024-07-11 23:31:15 -04:00
George Hotz	01fbd18209	metal compile fail	2024-07-11 19:27:05 -07:00
George Hotz	3a2b5a75d2	improve single kernel indexing (#5398 ) * improve single kernel indexing * metadata in graph (#5399) * indexing is O(1) * add failing test * ugh, that all needs to be replaced with symbolic * broken on ptx, it's fine --------- Co-authored-by: wozeparrot <wozeparrot@gmail.com>	2024-07-11 19:00:57 -07:00
George Hotz	c2da4454cd	indexing getting better (#5389 ) * indexing getting better [run_process_replay] [no_assert] * fix test * test_arange_2_reduce is a simpler test * put that print back, NOOPT * don't merge reduces (they could be different reduces) * FUSE_AS_ONE_KERNEL * fix tests * fix test_var_multireduce * w/e put that there * fails on others too * fix test, revert UNMUL change * in case order matters * one kernel indexing works * one kernel indexing works (test other)	2024-07-11 16:41:51 -07:00
qazal	9712d9ffb6	pass lowering errors if not asserting process replay (#5395 ) * pass lowering errors if not asserting process replay * ProcessReplayError	2024-07-11 19:09:12 -04:00
qazal	0421f5d83e	hotfix: compare test_var_multireduce against numpy (#5394 )	2024-07-11 18:57:08 -04:00
George Hotz	e8191479a3	add bigint type for indexing [run_process_replay] (#5387 )	2024-07-11 11:37:10 -07:00
George Hotz	3e40211e45	add UOP_IS_SYMBOLIC [run_process_replay] [no_assert] (#5386 ) * cleanup a few things in uops [run_process_replay] [no_assert] * add optional UOP_IS_SYMBOLIC	2024-07-11 10:48:45 -07:00
qazal	004366b193	context aware process replay [run_process_replay] (#5378 ) * test tc as ctx var * remove from opts * process replay * pop variable * B -> Variable * fix re-assign * pop temp vars * move TRANSCENDENTAL=2	2024-07-11 13:07:28 +03:00
chenyu	2396ab9b33	more transcend cleanup [run_process_replay] (#5369 ) fix test name, less # noqa: E501 and removed the cast	2024-07-10 23:05:03 -04:00
George Hotz	0215c952c5	Move transcendental to UOp level (#5367 ) * move uopgraph to file [run_process_replay] * transcendental uops * tests pass * no skip --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-10 19:06:25 -07:00
chenyu	64986f949c	more transcend math tests in ci (#5368 ) * more transcend math tests in ci test large input to trig functions that hit different reduction algo, and test TRANSCENDENTAL=2 for all backend * no CUDACPU * try that	2024-07-10 21:19:09 -04:00
George Hotz	d13654a820	move uopgraph to file [run_process_replay] (#5364 ) * move uopgraph to file [run_process_replay] * fix print tree test	2024-07-10 17:34:50 -07:00
hikettei	320e7ed935	Approximations for SIN/LOG2/EXP2 passing all tests. (#5187 ) * [WIP] Added an approximated implementation of Sin(FP32, FP64) passing all tests on Clang runtime * Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf) * [WIP] Added a support for LLVM IR * cleaned up the code for the mypy and linter * [WIP] Updated fp64 supports (bitwise shift causes the compilation error), fixed linter issue. * [Add] added fast=true mode which disables the payne-hanek reduction which is slow * [Fix] fails to compute elements when shape includes zero * [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly * [wip] update the assembly for ptx * Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required). * [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64) * [Fix] Cyclic dependencies existing in xlog2 * [Fix] Cycle dependency in the graph of exp2, and log2. (passing test_symbolic_ops.py) * [Fix] keep using higher precision for exp2, but cycle graph issue remained to be fixed... * [Refactor] removed is_metal option. xsin does not rely on fp64 when fp32 mode. * [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored) * [WIP] Added fp16 exp2 implementation * [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation. * stashed the changes for FP16 sin * [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower) * [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al. * [Refactor] Added the function polyN to clean-up N-terms polynomial approximation. * [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2 * [Patch] added bitcast_forward option * [Patch] resolved cycle graph * patch fix cycle graph * set bitcast_forward=True in ilogb2k * bitcast_forward for multi.py * E501 * Break into multiple small PRs * [Patch] FP16 -> FP64 upcast is not anymore required since xlog2 use quad precision polyN * [Patch] NV still required FP64 for xlog2 * updated schedule test * updated the count of kernels * [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD. * Bitcast: make them api-compatible * [update] force to use bitcast * updated the count of constant folding * [Patch] Creating a mask for exp2 using x <= Inf satisfies True as long as x is a real value * [Update] isNaN(x) Free log2 algorithm, passing PTX tests, METAL with fastmath enabled is able to handle nan well, amd backend will not crash. * xsin is reluctant to call payne_hanek_reduction which is slow to compile, passing stable diffusion compilation in a realistic time * some minor simplification to payne hanek reduction * [refactor] refactored some rebundant parts existing in payne hanek * [refactor] more readable payne hanek impl * [refactor] improved the code consistency of payne hanek * [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.) * Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)" This reverts commit `0eee08b87c`. * use allow_buffer_view * lets support multilazytensor * updated the count of kernels * [test] added the jit tests for approx ops * keep failed constant folding tests tested, added expectedFailure * explict the timeout deadline when testing approx jit timeout * [WIP] Simplified the implementation of xsin, never timeouts * [Refactor] Improved the consistency of approx sin implementation, passing time out tests * integrated xexp2_base into xexp2 * Set switch_over=39800.0 * delete: is_buffer_fastmath_supported * sin: compute against abs(x) * some cleanups * fix typo * removed the space between param and dtype * allow 514 kernels on CI for sd * [refactor] no need to upcast ad ldexp3k * [refactor] added some comments, references to help understanding the code. * [Fix] 1.0 ULP Sine Approximation for FP16 * [update] assume e != 0 * use pow2if instead of ldexp3k to fuse payne_hanek reduction into one * check if approximated sin/log2/exp are fused into one * clean up changes * test amd exp * some code cleanup and test sigmoid * fix: enabled payne_hanek for fp16 to achieve higher acc * fix: payne_hanek always accumlates the value with uint64, and fp16 sin is fused to a single kernel * [Refactor] Rename: fastmath -> transcendental * [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py * updated const folding tests * TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al. * Add: unittest.main() * Import TRANSCENDENTAL instead of getenv * Refactor: Added dtype check when TRANSCENDENTAL=2, more context var * Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2024-07-10 16:44:58 -07:00
George Hotz	6972a2569f	Linearizer -> Lowerer (#4957 ) * st to uops function * lowerer * uops reduce * uops reduce * acc_number correct * reduce unroll * complete unroll * do upcasts * handle multioutput * define_accs * fix valid * get grouped dims * revert lin * minor * fixup_ast * group for reduce * group works now * all forwards pass * all ops tests pass * fix clang * mypy * lil cleanups, no image yet * ugh, variables everywhere * bugfix * counters and name fix * use symbolic, not uops * cleanups * Fix tests * linearizer tests * expands * float4 expand load * tests pass * woooo, float4 test * test ops works again * one more lin test * more lin tests * bypass * fix tests * something like this * const in defineacc * uops get_reduce_acc * move around * allow consts in the LOAD/STORE * each axis should only appear once, 21 failures * 16 failures * fix some image * optional float4 * onnx tests * gate the stores * add reorder * fix terrible skip function * tc work * opt add/mul merge * fix float4 tests * tiny tweak, 9 failing * 7 test failures * start tc, but i don't think this will work * progress on tensorcores * note * fix ops tests * closer on tc * weeee...one tensor core works * still works, more generic * large WMMA works * tc test passes * use WMMA as accumulator * basic tc tests passing * small gemm padded works * 4 failures * 3 tests failing * super barrier * now two tests failing * one test failing * cleanpus, add reduce to UopGraph * remove the linearizer * remove unused * lil cleanups * Lowerer everywhere * remove test that doesn't exist now * image indexing * llvm fix * fix metal * fix image * fix images * might fix ptx * fix image type mismatch * more tests pass * CAST -> VECTORIZE * forgot that one * fix TestOps.test_flip_eye_crash * locals shouldn't be image dtype * change less files * test fix * fix recursive expands * touches * MULACC support in python * delete unneeded * alu before contract * bug fixes * tests * no var multireduce * simpler tc * metal works in new style * working on AMD and METAL * fix amd * shot in the dark, fix amd * something for CUDA * CUDA WORKS from the docs * comment * correct merge * cleanups + ptx fix + get_reduce_acc * local alias isn't used anymore * add store sanity check * fix for AMD * cleanups and single expand pass * more correct with acc_cache * tests should pass * block on WMMA * tests pass * merge contract and reduce * contractor fixes issue * multicontract * pre expand wmma (same as a reduce) * expand wmma and only take one * all expands * comments and whitespace	2024-07-10 15:07:42 -07:00
chenyu	322c37e621	use helpers.JIT in llama and gpt2 examples (#5350 ) * use helpers.JIT in llama and gpt2 examples replaced getenv("JIT"), effectively made gpt2 default jit * fix test_gpt2	2024-07-09 15:04:43 -04:00
Elias Wahl	097268fab3	Add layerwise performance bench for bert (#5349 ) * add bert bench * dont disable by defauöt * remove lr * linter	2024-07-09 15:03:25 -04:00
nimlgen	1678199b15	add update_copy to hcq spec (#5348 ) * add update_copy to hcq spec * fix amd	2024-07-09 20:44:44 +03:00
qazal	1f5de80eba	multi reduce Tensor.var passing verify_lazyop (#5346 ) * what about this * reset late gate	2024-07-09 17:20:17 +03:00
kormann	3d452195e4	[bug fix] nested commutative pattern _match [run_process_replay] [no_assert] (#5340 ) * deep pat test * lint * min diff * min lines * nothing * is res extra * cleanup2 * add res back * reduce lines * type anno --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-07-09 16:38:39 +03:00
qazal	bee96a19ff	fuzz uop schedules (#5345 ) * basic blocks + cleanups * fixups * elif is better for future me * fuzz_schedule_max_paths * fix linter	2024-07-09 15:24:56 +03:00
George Hotz	c13da83f12	tests from lowerer branch (#5339 ) * tests from lowerer branch * Update test_image_dtype.py * Update test_image_dtype.py * Update test_image_dtype.py	2024-07-08 21:23:19 -07:00

... 50 51 52 53 54 ...

4618 Commits