* nak works
* TestOps::test_add works
* testop has no crashes
* fix bool casts
* fix typo
* add disassemble
* RANGE and locals/regs
* simplify NAKCompiler
* disass cleanup
* cleanup nir codegen
* almost all tests passing
* cleanup notes in extra/
* old notes
* only import nak if NIR=1 (see the import-gating sketch after this list)
* fix new SPECIAL syntax
* fix local/shared memory
* more tests passing
* add DEFINE_VAR support
* llvmpipe kinda works
* diskcache
* some mypy stuff
* lvp passing test_ops.py
* fix imports
* actually fix imports
* remove 'stdout'
* fix llvm import
* fix mypy issues
* nicer errors
* simpler test_dtype skips
* test lvp in CI
* fix github action syntax
* fix more actions typos
* switch to mesa 25.1.0
* diskcache_put
* better generation for lvp nir_options
* b64encode shader blobs
* Revert diskcache changes
This reverts commits 930fa3de8a and 8428c694b3.
* general cleanup
* better error messages
* fix llvm import
* fix windows tests
* link with libm and libgcc_s
* fix some errors
* don't check for 'float4'
* NIR uses pointer arithmetic
* use tinymesa
* bump tinymesa
* bump tinymesa again
* update lvp nir_options
* print nir shader with DEBUG
* simplify LVPCompiler
* more tests
* "gated" STORE
* NAK is cacheable
* more tests
* all tests pass locally for NAK
* test autogen in CI
* autogen deps
* more deps
* fix uop_gc
* fix macos
* mypy
* save 2 lines
* save two more lines
* save 1 line
* save 4 lines
* save more lines
* Revert "save more lines"
This reverts commit dd3a720c5a.
* save more lines
* fix LVP on windows
* refactor
* reorganize some code
* refactor lib_gpu
* move LVP check
* out of order loads
* remove support.mesa
* bump tinymesa version
* simplify LVP jit
* macos
* macos ci
* shell: bash
* testing
* more testing
* compute brew prefix
* stupid typo
* actually fix
* lib
* stdout on macos
* inline gallivm_compile_module
* Revert "inline gallivm_compile_module"
This reverts commit b65983b151.
* elf macos
* semicolon
* inherit from CPULLVMCompiler
* ruff
* disas test
* fix libm linking
* default is fine actually
* arm works
* add elf loader link test
* fix NAK beam
* pylint is too smart by half
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
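
Below is a minimal sketch of the NIR=1 import gate from the list above, assuming a hypothetical `nak` module and class name; the real backend selection is wired elsewhere, so this only illustrates the lazy-import pattern:

```python
# Sketch of the env-gated import: the heavy NAK/mesa toolchain is only
# imported when NIR codegen is explicitly requested via NIR=1.
# `nak` and `NAKCompiler` are stand-in names, not the project's real paths.
import os

if int(os.getenv("NIR", "0")):
  from nak import NAKCompiler  # hypothetical import, only needed for NIR codegen
  compiler = NAKCompiler()
else:
  compiler = None  # fall through to the default compiler for this device
```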
* WebGPU f16 support
* Don't enable f16 yet
* dtype tests passing after bitcast fix
* Maybe all WebGPU green?
* Require shader-f16 in examples
* Minor wgsl touchup
* 1 line shorter
* Simpler
* Add transcendental support
* log2 NaN location mismatch on Vulkan
* NaN skips
* init mockcuda
* run gpu ocelot
* fix
* fixes
* disable broken tests
* linter
* these fail as well
* pylint
* mypy
* this fails on real platforms as well
* mypy please
* [WIP] Added an approximate implementation of sin (FP32, FP64), passing all tests on the Clang runtime
* Map nan/-inf/inf to 1.0 to avoid doing as_const(math.inf)
* [WIP] Added support for LLVM IR
* cleaned up the code for mypy and the linter
* [WIP] Updated fp64 support (bitwise shift causes a compilation error), fixed a linter issue.
* [Add] added fast=True mode, which disables the slow Payne-Hanek reduction
* [Fix] failure to compute elements when the shape includes zero
* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly
* [WIP] update the assembly for PTX
* Enables fast=True when the device is one of PTX, NV, or CUDA, to avoid slow bitwise ops (as the lv3 reduction is not required).
* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)
* [Fix] Cyclic dependencies in xlog2
* [Fix] Cyclic dependency in the graphs of exp2 and log2 (passing test_symbolic_ops.py)
* [Fix] keep using higher precision for exp2, but the cyclic-graph issue remains to be fixed...
* [Refactor] removed the is_metal option; xsin does not rely on fp64 in fp32 mode.
* [WIP] fp16 xsin implementation passing all tests (still needs to be refactored)
* [WIP] Added fp16 exp2 implementation
* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added an FP16 Log2 approximation.
* stashed the changes for FP16 sin
* [Fix] Patch for FP16 Sin/Exp2 (updated dtype_via, fp32_p, and lower)
* [Refactor] migrated to fastmath.py, some code simplification, renamed APIs in fastmath, etc.
* [Refactor] Added the function polyN to clean up N-term polynomial approximations (see the sketch after this list)
* [Patch] Increase fp64 precision in ldexp3k when possible, and patch fp16 exp2
* [Patch] added bitcast_forward option
* [Patch] resolved the cyclic graph
* patch: fix the cyclic graph
* set bitcast_forward=True in ilogb2k
* bitcast_forward for multi.py
* E501
* Break into multiple small PRs
* [Patch] FP16 -> FP64 upcast is no longer required since xlog2 uses quad-precision polyN
* [Patch] NV still requires FP64 for xlog2
* updated schedule test
* updated the count of kernels
* [Update] Removed all bitwise ops (SHL/SHR), tweaked the NaN handling of log2; passing all tests except AMD.
* Bitcast: make them API-compatible
* [update] force the use of bitcast
* updated the constant folding count
* [Patch] Create the mask for exp2 using x <= Inf, which is True as long as x is a real value
* [Update] isnan(x)-free log2 algorithm: passing PTX tests, METAL with fastmath enabled handles NaN well, and the AMD backend will not crash.
* xsin avoids calling payne_hanek_reduction, which is slow to compile; Stable Diffusion now compiles in a realistic time
* some minor simplifications to the Payne-Hanek reduction
* [refactor] refactored some redundant parts of Payne-Hanek
* [refactor] more readable Payne-Hanek implementation
* [refactor] improved the code consistency of Payne-Hanek
* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)
* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"
This reverts commit 0eee08b87c.
* use allow_buffer_view
* let's support multilazytensor
* updated the count of kernels
* [test] added JIT tests for approx ops
* keep the failing constant folding tests in the suite, marked expectedFailure
* make the timeout deadline explicit when testing the approx JIT timeout
* [WIP] Simplified the implementation of xsin; it never times out
* [Refactor] Improved the consistency of the approx sin implementation, passing timeout tests
* integrated xexp2_base into xexp2
* Set switch_over=39800.0
* delete: is_buffer_fastmath_supported
* sin: compute against abs(x)
* some cleanups
* fix typo
* removed the space between param and dtype
* allow 514 kernels on CI for sd
* [refactor] no need to upcast at ldexp3k
* [refactor] added some comments and references to help in understanding the code.
* [Fix] 1.0 ULP Sine Approximation for FP16
* [update] assume e != 0
* use pow2if instead of ldexp3k to fuse the payne_hanek reduction into one kernel (see the sketch after this list)
* check if the approximated sin/log2/exp are fused into one kernel
* clean up changes
* test amd exp
* some code cleanup and test sigmoid
* fix: enabled payne_hanek for fp16 to achieve higher accuracy
* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused into a single kernel
* [Refactor] Rename: fastmath -> transcendental
* [Refactor] Added TRANSCENDENTAL, moved the gate function to function.py
* updated const folding tests
* TRANSCENDENTAL as a ContextVar, removed the old Cody-Waite reduction test, added assertions, etc.
* Add: unittest.main()
* Import TRANSCENDENTAL instead of getenv
* Refactor: Added a dtype check when TRANSCENDENTAL=2, more ContextVars
* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
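
For context on the polyN commit above, here is a plain-float analogue of an N-term Horner evaluation; the real helper operates on the project's IR values, so this is a sketch, not the actual implementation:

```python
from functools import reduce

def polyN(x: float, p: list[float]) -> float:
  # Horner's scheme: evaluates p[0]*x**(len(p)-1) + ... + p[-1]
  # with one multiply-add per coefficient.
  return reduce(lambda acc, c: acc * x + c, p, 0.0)

assert polyN(2.0, [1.0, 0.0, -1.0]) == 3.0  # x**2 - 1 at x = 2
```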
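
Likewise, a float sketch of the pow2if trick referenced in the Payne-Hanek commits: construct 2**i by writing the biased exponent directly into the FP32 bit pattern, which avoids the extra ops of an ldexp-style scale (assuming i stays in the normal exponent range):

```python
import struct

def pow2if(i: int) -> float:
  # Place the biased exponent i+127 into the FP32 exponent bits and bitcast.
  # Valid for normal numbers, i.e. -126 <= i <= 127.
  return struct.unpack("f", struct.pack("I", (i + 127) << 23))[0]

assert pow2if(10) == 1024.0
assert pow2if(-3) == 0.125
```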