95 Commits

Author SHA1 Message Date
chenyu
72a3f78d19 jit includes tensor inputs in containers (#14043)
* jit includes tensor inputs in containers

* cleanup
2026-01-06 19:42:06 -05:00
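A minimal sketch of the calling pattern the title describes, assuming the usual TinyJit convention (early calls trace and capture, later calls replay): with this change, Tensors passed inside a container such as a list are tracked as jit inputs rather than baked into the capture.

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def step(xs: list[Tensor]) -> Tensor:
  # the Tensors inside the list are treated as jit inputs, not constants
  return (xs[0] * 2 + xs[1]).realize()

for i in range(3):
  # early calls trace/capture; later calls replay with the fresh inputs
  print(step([Tensor([float(i)] * 4), Tensor([1.0] * 4)]).tolist())
```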
chenyu
c714881832 don't allow jit input to be const (#14045)
* don't allow jit input to be unbuffered like const

* just const to fix multi

* fix rnnt
2026-01-06 18:15:22 -05:00
chenyu
7fb18f7e47 raise when jit fxn returns non-Tensor output (#14042) 2026-01-06 12:59:20 -05:00
chenyu
4491ec0c9e JitError (#14041)
* JitError

* test_symbolic_jit
2026-01-06 12:19:50 -05:00
chenyu
6ddddc68af test jit tolist failure (#14040)
also moved tests to test_jit_footguns
2026-01-06 11:16:57 -05:00
chenyu
b699b9f763 test case for jit a function with item call (#14039)
* test case for jit a function with item call

output is silently wrong now

* no dtype
2026-01-06 10:40:43 -05:00
chenyu
03600aef1e failed test case when init jit with empty inputs (#13641)
not related to bert grad acc, but still seems to be a bug
2025-12-10 22:03:06 -05:00
George Hotz
6bd355fa26 add needs_second_gpu decorator (#13543)
* add needs_second_gpu decorator

* more skips

* two more fixes
2025-12-02 19:08:23 -08:00
chenyu
f2c3a72b0c remove RANGEIFY flag [pr] (#12577) 2025-10-09 21:52:54 -04:00
George Hotz
fd2e4f2353 failing rng test (#12328)
* tighten spec: fixup devectorizer types / rangeify

* tighten assign

* failing rangeify test

* simpler

* otherwise contig

* more tolerance because the rng seed changed
2025-09-29 16:06:45 +08:00
nimlgen
4762a24022 test_free_intermediates force buffers (#12255)
* test_free_intermediates force buffers

* f

* fix for rangeify

* xx
2025-09-20 18:14:39 +03:00
qazal
57c7e0a8f8 RANGEIFY=1 test_jit (#12254)
* RANGEIFY=1 test_jit

* don't do any of that

* disk

* simple disk tensor

* more work

* run more tests

* it also doesn't copy every time

* skip tests that hang everything
2025-09-20 17:34:32 +03:00
nimlgen
1c6c42715f unify cpu and llvm (#11982)
* try unify cpu and llvm

* fixes

* fix

* ops

* no llvm

* fix

* rm

* llvm is out

* oops

* override

* no llvm

* ignore

* skip llvm

* ooops
2025-09-09 13:54:44 +03:00
nimlgen
d2bb1bcb97 cloud: a bit better err handling (#11616)
* cloud: err propagation to client

* fix

* print exc

* linter

* excs

* fix

* hm

* flaky
2025-08-11 15:51:22 +03:00
chenyu
c9225d22ce only disable flaky test_jit_multidev_xfer (#11523) 2025-08-05 22:17:25 -04:00
nimlgen
fc4e713d1c jit graph split tests (#11507)
* jit graph split tests

* fix

* one more test

* more tests

* fix

* xm

* remote
2025-08-05 21:32:37 +03:00
uuuvn
011ef8fa9d Fix incorrect jit current batch devs reset (#11505)
`current_batch_devs = []` (in `flush_batch()`) happens between
`new_batched_devs = ...` and `current_batch_devs = new_batched_devs`, so the
reset is immediately overwritten and nothing is actually cleared, leading to
things not jitting properly,

which doubles remote bert step time (and should have a similar effect on any
non-hcq backend)
2025-08-05 08:16:16 +03:00
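The ordering problem is easier to see in isolation. A minimal reproduction of the bug shape (variable names from the commit message; the surrounding logic is invented for illustration):

```python
current_batch_devs: list[str] = ["GPU:0"]

def flush_batch():
  global current_batch_devs
  current_batch_devs = []  # the intended reset

def add_to_batch(dev: str, must_flush: bool):
  global current_batch_devs
  new_batched_devs = current_batch_devs + [dev]  # snapshot taken BEFORE the flush
  if must_flush:
    flush_batch()                        # clears the global...
  current_batch_devs = new_batched_devs  # ...then the stale snapshot clobbers the clear

add_to_batch("GPU:1", must_flush=True)
print(current_batch_devs)  # ['GPU:0', 'GPU:1'] -- the flush had no effect
```

The fix is to recompute (or reread) `current_batch_devs` after the flush so the reset is actually observed.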
uuuvn
10c9ede6b7 Cloud graph (#9876) 2025-05-07 11:41:41 -07:00
uuuvn
dba073e5c0 Less messy broken graph on paravirtualized metal workaround (#10182)
* Less messy broken graph on paravirtualized metal workaround

GitHub CI macOS runners use paravirtualized Metal, which is broken with
graph (some comments say that ICB in particular is broken, but in my
testing it was sometimes fine and other times hit an assert inside
Metal's code related to resources, so not sure).

> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.

This can be reproduced locally with any virtualization software (like UTM)
that can create macOS VMs with Apple's own virtualization framework.

* unused import
2025-05-06 20:41:02 +03:00
nimlgen
37a7a99adb metal: fix graph when unrelated input buffers are not metal buffers (#10170)
* metal: fix graph when unrelated input buffers are not metal buffers

* tinier test
2025-05-06 11:37:16 +03:00
George Hotz
b6d2effaf5 assign is contiguous (#10066)
* assign is contiguous

* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
uuuvn
754d789f51 Fix and enable jit tests on CLOUD (#10031) 2025-04-24 18:39:31 +03:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper to skip CI devices that do not support multi
2025-04-10 05:25:29 -04:00
nimlgen
5f7c79676f jit: prune independent copies (#9749)
* jit: prune independent copies

* linter

* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c jit: rename optimize_weights -> replan_buffers_memory_layout (#9751) 2025-04-05 20:35:15 +03:00
nimlgen
949459fdd6 jit: fix deallocate on unallocated buffers in free_intermediates (#9699) 2025-04-03 18:32:51 +03:00
nimlgen
fa0ebbd237 jit: optimize before pickle (#9611)
* jit: optimize before pickle

* optimize weights

* fix

* mypy

* mypy2
2025-03-28 19:06:09 +07:00
nimlgen
dc9da1d917 memplan into one buffer (#9526)
* new memplanner

* new should work

* fix

* VALIDATE_MEMORY_PLANNER

* hm?

* ugh

* fix alignment

* fix2

* rm

* tiny fixes

* test

* comments and fixes

* fix2

* linter

* t

* fix
2025-03-27 01:46:50 +07:00
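A hedged sketch of the core idea behind planning many buffers into one backing allocation: buffers whose lifetimes don't overlap can share the same offset. The names and the first-fit policy here are illustrative, not tinygrad's actual memplanner.

```python
def plan_offsets(bufs: dict[str, tuple[int, int, int]]) -> tuple[dict[str, int], int]:
  """bufs maps name -> (size, first_use, last_use). Returns name -> offset into a
  single backing buffer plus its total size, reusing slots whose tenant is dead."""
  offsets: dict[str, int] = {}
  slots: list[tuple[int, int, int]] = []  # (offset, size, last_use) of placed buffers
  total = 0
  for name, (size, first, last) in sorted(bufs.items(), key=lambda kv: kv[1][1]):
    for i, (off, slot_size, slot_last) in enumerate(slots):
      if slot_last < first and size <= slot_size:  # first-fit into a dead slot
        slots[i] = (off, slot_size, last)
        offsets[name] = off
        break
    else:
      offsets[name] = total                        # no reusable slot: grow the buffer
      slots.append((total, size, last))
      total += size
  return offsets, total

# 'a' dies before 'c' is first used, so 'c' reuses a's slot
print(plan_offsets({"a": (64, 0, 1), "b": (64, 0, 3), "c": (32, 2, 3)}))
# ({'a': 0, 'b': 64, 'c': 0}, 128)
```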
chenyu
cddd750d68 add a failed test case for jit/nojit rand [pr] (#9574)
currently adding jit produced different rand values
2025-03-25 13:32:44 -04:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
George Hotz
46a8c5e1e5 delete forced_realize (#8615)
* delete forced_realize

* put that back

* expectedFailures

* cleaner create_subbuffer

* more comments

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-20 09:40:36 -08:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
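A hedged sketch of the idea: buffers the jit captured that are neither inputs nor outputs only matter within a single run, so they can be freed after capture and reallocated lazily. The `Buf` class is a stand-in for a device buffer, and the allocation guard mirrors the later #9699 fix.

```python
class Buf:
  """Stand-in for a device buffer; real tinygrad buffers differ."""
  def __init__(self): self.allocated = True
  def deallocate(self): self.allocated = False

def free_intermediates(captured: set, inputs: set, outputs: set) -> None:
  # anything captured that is neither an input nor an output is per-run scratch
  for buf in captured - inputs - outputs:
    if buf.allocated:    # don't deallocate unallocated buffers (cf. #9699)
      buf.deallocate()   # reclaimed now; reallocated lazily on the next run

a, b, tmp = Buf(), Buf(), Buf()
free_intermediates({a, b, tmp}, inputs={a}, outputs={b})
print(tmp.allocated, a.allocated, b.allocated)  # False True True
```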
nimlgen
c0240855b9 qcom has no transfer (#8075)
* qcom alloc is not hcq alloc

* maybe base?

* test
2024-12-06 14:45:01 +03:00
George Hotz
e37bff6c19 fix bug in jit prune with copy [pr] (#8073) 2024-12-06 18:38:23 +08:00
George Hotz
aae8557ada test copy inside jit [pr] (#8072) 2024-12-06 17:51:50 +08:00
ignaciosica
509c4a573f increase tolerance on test (#7972) 2024-11-30 11:50:10 -05:00
Ahmed Harmouche
2d11765295 Fix WebGPU atomic store (#7954) 2024-11-29 19:31:25 +08:00
George Hotz
4e5bf9dc7a test assignment in jit (#7906)
* test assignment in jit

* don't waste lines

* skip broken test in webgpu
2024-11-26 17:37:00 +08:00
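For context, the pattern under test is the usual in-place update inside a jitted step; a minimal sketch assuming the standard `Tensor.assign` idiom:

```python
from tinygrad import Tensor, TinyJit

w = Tensor.ones(4).contiguous().realize()

@TinyJit
def update(g: Tensor) -> Tensor:
  # the in-place assignment to the captured tensor must survive jit replay
  return w.assign(w - 0.1 * g).realize()

for _ in range(3):
  update(Tensor([1.0] * 4))
print(w.tolist())  # ~[0.7, 0.7, 0.7, 0.7]: each call subtracted 0.1
```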
chenyu
c805e3fff5 skip test_jit_batch_split if JIT >= 2 (#7561)
* skip test_jit_batch_split if JIT >= 2

only test graphs

* 1600
2024-11-05 14:59:04 -05:00
Tobias Fischer
1a9e145388 Tensor Clone Function (#7154)
* implemented clone function

* cleanup linting, single func

* added tests, cleaned up grad cloning

* fixed whitespace
2024-11-01 12:24:43 +08:00
wozeparrot
9eb6eef441 seed in tensor (#6869) 2024-10-06 14:46:58 -04:00
wozeparrot
97d708252a remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
hikettei
320e7ed935 Approximations for SIN/LOG2/EXP2 passing all tests. (#5187)
* [WIP] Added an approximate implementation of Sin(FP32, FP64) passing all tests on the Clang runtime

* Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf)

* [WIP] Added a support for LLVM IR

* cleaned up the code for the mypy and linter

* [WIP] Updated fp64 support (bitwise shift causes a compilation error), fixed linter issue.

* [Add] added fast=true mode, which disables the slow payne-hanek reduction

* [Fix] fails to compute elements when shape includes zero

* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly

* [wip] update the assembly for ptx

* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).

* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)

* [Fix] Cyclic dependencies existing in xlog2

* [Fix] Cyclic dependency in the graph of exp2 and log2. (passing test_symbolic_ops.py)

* [Fix] keep using higher precision for exp2, but the cyclic-graph issue remains to be fixed...

* [Refactor] removed is_metal option. xsin does not rely on fp64 in fp32 mode.

* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)

* [WIP] Added fp16 exp2 implementation

* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.

* stashed the changes for FP16 sin

* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)

* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.

* [Refactor] Added the function polyN to clean up N-term polynomial approximation (see the sketch after this entry).

* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2

* [Patch] added bitcast_forward option

* [Patch] resolved cycle graph

* patch fix cycle graph

* set bitcast_forward=True in ilogb2k

* bitcast_forward for multi.py

* E501

* Break into multiple small PRs

* [Patch] FP16 -> FP64 upcast is no longer required since xlog2 uses a quad-precision polyN

* [Patch] NV still requires FP64 for xlog2

* updated schedule test

* updated the count of kernels

* [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD.

* Bitcast: make them api-compatible

* [update] force to use bitcast

* updated the count of constant folding

* [Patch] Creating a mask for exp2 using x <= Inf, which evaluates to True as long as x is a real (non-NaN) value

* [Update] isNaN(x)-free log2 algorithm: passes PTX tests, METAL with fastmath enabled handles nan well, and the amd backend will not crash.

* xsin avoids calling payne_hanek_reduction (which is slow to compile) where possible, so stable diffusion compiles in a realistic time

* some minor simplification to payne hanek reduction

* [refactor] refactored some redundant parts existing in payne hanek

* [refactor] more readable payne hanek impl

* [refactor] improved the code consistency of payne hanek

* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)

* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"

This reverts commit 0eee08b87c.

* use allow_buffer_view

* let's support multilazytensor

* updated the count of kernels

* [test] added the jit tests for approx ops

* keep the failing constant folding tests running, marked expectedFailure

* make the timeout deadline explicit when testing approx jit timeout

* [WIP] Simplified the implementation of xsin, never times out

* [Refactor] Improved the consistency of approx sin implementation, passing time out tests

* integrated xexp2_base into xexp2

* Set switch_over=39800.0

* delete: is_buffer_fastmath_supported

* sin: compute against abs(x)

* some cleanups

* fix typo

* removed the space between param and dtype

* allow 514 kernels on CI for sd

* [refactor] no need to upcast at ldexp3k

* [refactor] added some comments and references to help in understanding the code.

* [Fix] 1.0 ULP Sine Approximation for FP16

* [update] assume e != 0

* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one kernel

* check if approximated sin/log2/exp are fused into one kernel

* clean up changes

* test amd exp

* some code cleanup and test sigmoid

* fix: enabled payne_hanek for fp16 to achieve higher accuracy

* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused to a single kernel

* [Refactor] Rename: fastmath -> transcendental

* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py

* updated const folding tests

* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.

* Add: unittest.main()

* Import TRANSCENDENTAL instead of getenv

* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var

* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 16:44:58 -07:00
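The `polyN` helper mentioned above is, at its core, Horner evaluation of an N-term polynomial; a hedged sketch of that idea on plain floats (not the exact tinygrad signature, which operates on graph nodes rather than Python floats):

```python
def polyN(x: float, coeffs: list[float]) -> float:
  """Horner scheme: ((c0*x + c1)*x + c2)*x + ..., one multiply-add per term."""
  acc = 0.0
  for c in coeffs:
    acc = acc * x + c
  return acc

# sin(x) ~ x - x**3/6 + x**5/120 for small x, as a polynomial in x**2
x = 0.5
print(x * polyN(x * x, [1/120, -1/6, 1.0]))  # ~0.479427, vs math.sin(0.5) ~0.479426
```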
chenyu
622b7bd556 simpler TinyJit inside TinyJit detection (#5219)
* simpler TinyJit inside TinyJit detection

suggested in 73395b998b (commitcomment-143660402)

* cannot repro...

* clear the way out

* finally clear
2024-07-03 12:28:53 -04:00
chenyu
73395b998b better error msg for TinyJit inside TinyJit (#5202)
it's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
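A minimal sketch of the situation the error now catches, assuming the behavior of raising when one TinyJit is entered while another is capturing:

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def inner(x: Tensor) -> Tensor:
  return (x + 1).realize()

@TinyJit
def outer(x: Tensor) -> Tensor:
  return (inner(x) * 2).realize()  # calls one jit from inside another

# expected to raise the TinyJit-inside-TinyJit error once capture begins
# (the exact call on which it triggers depends on the jit warm-up behavior)
for _ in range(3):
  outer(Tensor([1.0, 1.0]))
```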
chenyu
ad91962dcf CACHECOLLECTING -> CAPTURING and don't capture clear_l2 (#5190)
fixed first-time BEAM slowness
2024-06-27 12:32:28 -04:00
chenyu
5b8fda3c65 fix: JIT=0 means no JIT (#5188) 2024-06-27 10:31:37 -04:00
nimlgen
654a8b9ef7 retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
nimlgen
47bfd7c2b7 fix sync of offset buffers in graphs (#4850)
* correctly sync offset buffers

* test

* style

* run less

* just use base
2024-06-06 16:09:45 +03:00
nimlgen
eb9689336e nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugh, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00