95 Commits

Author SHA1 Message Date
chenyu
72a3f78d19 jit includes tensor inputs in containers (#14043)
* jit includes tensor inputs in containers

* cleanup
2026-01-06 19:42:06 -05:00
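A minimal sketch of the calling pattern the title describes, assuming the usual TinyJit convention (early calls trace and capture, later calls replay): with this change, Tensors passed inside a container such as a list are tracked as jit inputs rather than baked into the capture.

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def step(xs: list[Tensor]) -> Tensor:
  # the Tensors inside the list are treated as jit inputs, not constants
  return (xs[0] * 2 + xs[1]).realize()

for i in range(3):
  # early calls trace/capture; later calls replay with the fresh inputs
  print(step([Tensor([float(i)] * 4), Tensor([1.0] * 4)]).tolist())
```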
chenyu
c714881832 don't allow jit input to be const (#14045)
* don't allow jit input to be unbuffered like const

* just const to fix multi

* fix rnnt
2026-01-06 18:15:22 -05:00
chenyu
7fb18f7e47 raise when jit fxn returns non-Tensor output (#14042) 2026-01-06 12:59:20 -05:00
chenyu
4491ec0c9e JitError (#14041)
* JitError

* test_symbolic_jit
2026-01-06 12:19:50 -05:00
chenyu
6ddddc68af test jit tolist failure (#14040)
also moved tests to test_jit_footguns
2026-01-06 11:16:57 -05:00
chenyu
b699b9f763 test case for jit a function with item call (#14039)
* test case for jit a function with item call

output is silently wrong now

* no dtype
2026-01-06 10:40:43 -05:00
chenyu
03600aef1e failed test case when init jit with empty inputs (#13641)
not related to bert grad acc, but still seems to be a bug
2025-12-10 22:03:06 -05:00
George Hotz
6bd355fa26 add needs_second_gpu decorator (#13543)
* add needs_second_gpu decorator

* more skips

* two more fixes
2025-12-02 19:08:23 -08:00
chenyu
f2c3a72b0c remove RANGEIFY flag [pr] (#12577) 2025-10-09 21:52:54 -04:00
George Hotz
fd2e4f2353 failing rng test (#12328)
* tighten spec: fixup devectorizer types / rangeify

* tighten assign

* failing rangeify test

* simpler

* otherwise contig

* more tolerance because the rng seed changed
2025-09-29 16:06:45 +08:00
nimlgen
4762a24022 test_free_intermediates force buffers (#12255)
* test_free_intermediates force buffers

* f

* fix for rangeify

* xx
2025-09-20 18:14:39 +03:00
qazal
57c7e0a8f8 RANGEIFY=1 test_jit (#12254)
* RANGEIFY=1 test_jit

* don't do any of that

* disk

* simple disk tensor

* more work

* run more tests

* it also doesn't copy every time

* skip tests that hang everything
2025-09-20 17:34:32 +03:00
nimlgen
1c6c42715f unify cpu and llvm (#11982)
* try unify cpu and llvm

* fixes

* fix

* ops

* no llvm

* fix

* rm

* llvm is out

* oops

* override

* no llvm

* ignore

* skip llvm

* ooops
2025-09-09 13:54:44 +03:00
nimlgen
d2bb1bcb97 cloud: a bit better err handling (#11616)
* cloud: err propagation to client

* fix

* print exc

* linter

* excs

* fix

* hm

* flaky
2025-08-11 15:51:22 +03:00
chenyu
c9225d22ce only disable flaky test_jit_multidev_xfer (#11523) 2025-08-05 22:17:25 -04:00
nimlgen
fc4e713d1c jit graph split tests (#11507)
* jit graph split tests

* fix

* one more test

* more tests

* fix

* xm

* remote
2025-08-05 21:32:37 +03:00
uuuvn
011ef8fa9d Fix incorrect jit current batch devs reset (#11505)
`current_batch_devs = []` (in `flush_batch()`) happens between
`new_batched_devs = ...` and `current_batch_devs = new_batched_devs`, so the
reset is immediately overwritten and nothing is actually cleared, leading to
things not jitting properly,

which doubles remote bert step time (and should have a similar effect on any
non-hcq backend)
2025-08-05 08:16:16 +03:00
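The ordering problem is easier to see in isolation. A minimal reproduction of the bug shape (variable names from the commit message; the surrounding logic is invented for illustration):

```python
current_batch_devs: list[str] = ["GPU:0"]

def flush_batch():
  global current_batch_devs
  current_batch_devs = []  # the intended reset

def add_to_batch(dev: str, must_flush: bool):
  global current_batch_devs
  new_batched_devs = current_batch_devs + [dev]  # snapshot taken BEFORE the flush
  if must_flush:
    flush_batch()                        # clears the global...
  current_batch_devs = new_batched_devs  # ...then the stale snapshot clobbers the clear

add_to_batch("GPU:1", must_flush=True)
print(current_batch_devs)  # ['GPU:0', 'GPU:1'] -- the flush had no effect
```

The fix is to recompute (or reread) `current_batch_devs` after the flush so the reset is actually observed.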
uuuvn
10c9ede6b7 Cloud graph (#9876) 2025-05-07 11:41:41 -07:00
uuuvn
dba073e5c0 Less messy broken graph on paravirtualized metal workaround (#10182)
* Less messy broken graph on paravirtualized metal workaround

GitHub CI macOS runners use paravirtualized Metal, which is broken with
graph (some comments say that ICB in particular is broken, but in my
testing it was sometimes fine and other times hit an assert inside
Metal's code related to resources, so not sure).

> Assertion failed: (resource != nil), function -[IOGPUMetalResource initWithResource:], file IOGPUMetalResource.m, line 458.

This can be reproduced locally with any virtualization software (like UTM)
that can create macOS VMs with Apple's own virtualization framework.

* unused import
2025-05-06 20:41:02 +03:00
nimlgen
37a7a99adb metal: fix graph when unrelated input buffers are not metal buffers (#10170)
* metal: fix graph when unrelated input buffers are not metal buffers

* tinier test
2025-05-06 11:37:16 +03:00
George Hotz
b6d2effaf5 assign is contiguous (#10066)
* assign is contiguous

* disable process replay for SDXL
2025-04-27 08:40:33 -04:00
uuuvn
754d789f51 Fix and enable jit tests on CLOUD (#10031) 2025-04-24 18:39:31 +03:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper to skip CI devices that do not support multi
2025-04-10 05:25:29 -04:00
nimlgen
5f7c79676f jit: prune independent copies (#9749)
* jit: prune independent copies

* linter

* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c jit: rename optimize_weights -> replan_buffers_memory_layout (#9751) 2025-04-05 20:35:15 +03:00
nimlgen
949459fdd6 jit: fix deallocate on unallocated buffers in free_intermediates (#9699) 2025-04-03 18:32:51 +03:00
nimlgen
fa0ebbd237 jit: optimize before pickle (#9611)
* jit: optimize before pickle

* optimize weights

* fix

* mypy

* mypy2
2025-03-28 19:06:09 +07:00
nimlgen
dc9da1d917 memplan into one buffer (#9526)
* new memplanner

* new should work

* fix

* VALIDATE_MEMORY_PLANNER

* hm?

* ugh

* fix alignment

* fix2

* rm

* tiny fixes

* test

* comments and fixes

* fix2

* linter

* t

* fix
2025-03-27 01:46:50 +07:00
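A hedged sketch of the core idea behind planning many buffers into one backing allocation: buffers whose lifetimes don't overlap can share the same offset. The names and the first-fit policy here are illustrative, not tinygrad's actual memplanner.

```python
def plan_offsets(bufs: dict[str, tuple[int, int, int]]) -> tuple[dict[str, int], int]:
  """bufs maps name -> (size, first_use, last_use). Returns name -> offset into a
  single backing buffer plus its total size, reusing slots whose tenant is dead."""
  offsets: dict[str, int] = {}
  slots: list[tuple[int, int, int]] = []  # (offset, size, last_use) of placed buffers
  total = 0
  for name, (size, first, last) in sorted(bufs.items(), key=lambda kv: kv[1][1]):
    for i, (off, slot_size, slot_last) in enumerate(slots):
      if slot_last < first and size <= slot_size:  # first-fit into a dead slot
        slots[i] = (off, slot_size, last)
        offsets[name] = off
        break
    else:
      offsets[name] = total                        # no reusable slot: grow the buffer
      slots.append((total, size, last))
      total += size
  return offsets, total

# 'a' dies before 'c' is first used, so 'c' reuses a's slot
print(plan_offsets({"a": (64, 0, 1), "b": (64, 0, 3), "c": (32, 2, 3)}))
# ({'a': 0, 'b': 64, 'c': 0}, 128)
```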
chenyu
cddd750d68 add a failed test case for jit/nojit rand [pr] (#9574)
currently adding jit produced different rand values
2025-03-25 13:32:44 -04:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
George Hotz
46a8c5e1e5 delete forced_realize (#8615)
* delete forced_realize

* put that back

* expectedFailures

* cleaner create_subbuffer

* more comments

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-20 09:40:36 -08:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
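A hedged sketch of the idea: buffers the jit captured that are neither inputs nor outputs only matter within a single run, so they can be freed after capture and reallocated lazily. The `Buf` class is a stand-in for a device buffer, and the allocation guard mirrors the later #9699 fix.

```python
class Buf:
  """Stand-in for a device buffer; real tinygrad buffers differ."""
  def __init__(self): self.allocated = True
  def deallocate(self): self.allocated = False

def free_intermediates(captured: set, inputs: set, outputs: set) -> None:
  # anything captured that is neither an input nor an output is per-run scratch
  for buf in captured - inputs - outputs:
    if buf.allocated:    # don't deallocate unallocated buffers (cf. #9699)
      buf.deallocate()   # reclaimed now; reallocated lazily on the next run

a, b, tmp = Buf(), Buf(), Buf()
free_intermediates({a, b, tmp}, inputs={a}, outputs={b})
print(tmp.allocated, a.allocated, b.allocated)  # False True True
```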
nimlgen
c0240855b9 qcom has no transfer (#8075)
* qcom alloc is not hcq alloc

* maybe base?

* test
2024-12-06 14:45:01 +03:00
George Hotz
e37bff6c19 fix bug in jit prune with copy [pr] (#8073) 2024-12-06 18:38:23 +08:00
George Hotz
aae8557ada test copy inside jit [pr] (#8072) 2024-12-06 17:51:50 +08:00
ignaciosica
509c4a573f increase tolerance on test (#7972) 2024-11-30 11:50:10 -05:00
Ahmed Harmouche
2d11765295 Fix WebGPU atomic store (#7954) 2024-11-29 19:31:25 +08:00
George Hotz
4e5bf9dc7a test assignment in jit (#7906)
* test assignment in jit

* don't waste lines

* skip broken test in webgpu
2024-11-26 17:37:00 +08:00
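For context, the pattern under test is the usual in-place update inside a jitted step; a minimal sketch assuming the standard `Tensor.assign` idiom:

```python
from tinygrad import Tensor, TinyJit

w = Tensor.ones(4).contiguous().realize()

@TinyJit
def update(g: Tensor) -> Tensor:
  # the in-place assignment to the captured tensor must survive jit replay
  return w.assign(w - 0.1 * g).realize()

for _ in range(3):
  update(Tensor([1.0] * 4))
print(w.tolist())  # ~[0.7, 0.7, 0.7, 0.7]: each call subtracted 0.1
```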
chenyu
c805e3fff5 skip test_jit_batch_split if JIT >= 2 (#7561)
* skip test_jit_batch_split if JIT >= 2

only test graphs

* 1600
2024-11-05 14:59:04 -05:00
Tobias Fischer
1a9e145388 Tensor Clone Function (#7154)
* implemented clone function

* cleanup linting, single func

* added tests, cleaned up grad cloning

* fixed whitespace
2024-11-01 12:24:43 +08:00
wozeparrot
9eb6eef441 seed in tensor (#6869) 2024-10-06 14:46:58 -04:00
wozeparrot
97d708252a remove realize from threefry (#5969) 2024-08-07 15:08:49 -07:00
hikettei
320e7ed935 Approximations for SIN/LOG2/EXP2 passing all tests. (#5187)
* [WIP] Added an approximate implementation of Sin(FP32, FP64) passing all tests on the Clang runtime

* Map nan/-inf/inf as 1.0 in order to avoid doing as_const(math.inf)

* [WIP] Added a support for LLVM IR

* cleaned up the code for the mypy and linter

* [WIP] Updated fp64 support (bitwise shift causes a compilation error), fixed linter issue.

* [Add] added fast=true mode, which disables the slow payne-hanek reduction

* [Fix] fails to compute elements when shape includes zero

* [WIP] Added BinaryOps.ADD/BinaryOps.OR to assembly

* [wip] update the assembly for ptx

* Enables fast=True when device is one of PTX, NV, CUDA, to avoid slow bitwise ops (as lv3 reduction is not required).

* [WIP] Added an approximation of LOG2/EXP2 (FP32, FP64)

* [Fix] Cyclic dependencies existing in xlog2

* [Fix] Cyclic dependency in the graph of exp2 and log2. (passing test_symbolic_ops.py)

* [Fix] keep using higher precision for exp2, but the cyclic-graph issue remains to be fixed...

* [Refactor] removed is_metal option. xsin does not rely on fp64 in fp32 mode.

* [WIP] fp16 xsin implementation passing all tests. (still needs to be refactored)

* [WIP] Added fp16 exp2 implementation

* [WIP] Increased the precision of Log2 from 3.5 ULP to 1.0 ULP, and added FP16 Log2 approximation.

* stashed the changes for FP16 sin

* [Fix] Patch for FP16 Sin/Exp2. (updated the dtype_via, fp32_p, and lower)

* [Refactor] migration to fastmath.py, some code simplification, renamed apis in fastmath, et al.

* [Refactor] Added the function polyN to clean up N-term polynomial approximation (see the sketch after this entry).

* [Patch] Increase fp64 precision when ldexp3k if possible, and patch for fp16 exp2

* [Patch] added bitcast_forward option

* [Patch] resolved cycle graph

* patch fix cycle graph

* set bitcast_forward=True in ilogb2k

* bitcast_forward for multi.py

* E501

* Break into multiple small PRs

* [Patch] FP16 -> FP64 upcast is no longer required since xlog2 uses a quad-precision polyN

* [Patch] NV still requires FP64 for xlog2

* updated schedule test

* updated the count of kernels

* [Update] Removed all bitwise ops (SHL/SHR), tweaked the nan manipulation of log2, passing all tests except for AMD.

* Bitcast: make them api-compatible

* [update] force to use bitcast

* updated the count of constant folding

* [Patch] Creating a mask for exp2 using x <= Inf, which evaluates to True as long as x is a real (non-NaN) value

* [Update] isNaN(x)-free log2 algorithm: passes PTX tests, METAL with fastmath enabled handles nan well, and the amd backend will not crash.

* xsin avoids calling payne_hanek_reduction (which is slow to compile) where possible, so stable diffusion compiles in a realistic time

* some minor simplification to payne hanek reduction

* [refactor] refactored some redundant parts existing in payne hanek

* [refactor] more readable payne hanek impl

* [refactor] improved the code consistency of payne hanek

* [experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)

* Revert "[experiment] topological sort when doing _recursive_group (i dunno if this is good but at least it works.)"

This reverts commit 0eee08b87c.

* use allow_buffer_view

* let's support multilazytensor

* updated the count of kernels

* [test] added the jit tests for approx ops

* keep the failing constant folding tests running, marked expectedFailure

* make the timeout deadline explicit when testing approx jit timeout

* [WIP] Simplified the implementation of xsin, never times out

* [Refactor] Improved the consistency of approx sin implementation, passing time out tests

* integrated xexp2_base into xexp2

* Set switch_over=39800.0

* delete: is_buffer_fastmath_supported

* sin: compute against abs(x)

* some cleanups

* fix typo

* removed the space between param and dtype

* allow 514 kernels on CI for sd

* [refactor] no need to upcast at ldexp3k

* [refactor] added some comments and references to help in understanding the code.

* [Fix] 1.0 ULP Sine Approximation for FP16

* [update] assume e != 0

* use pow2if instead of ldexp3k to fuse payne_hanek reduction into one kernel

* check if approximated sin/log2/exp are fused into one kernel

* clean up changes

* test amd exp

* some code cleanup and test sigmoid

* fix: enabled payne_hanek for fp16 to achieve higher accuracy

* fix: payne_hanek always accumulates the value with uint64, and fp16 sin is fused to a single kernel

* [Refactor] Rename: fastmath -> transcendental

* [Refactor] Added TRANSCENDENTAL, Moved the gate function to function.py

* updated const folding tests

* TRANSCENDENTAL as a ContextVar, removed old test of cody waite reduction, added assertions, et al.

* Add: unittest.main()

* Import TRANSCENDENTAL instead of getenv

* Refactor: Added dtype check when TRANSCENDENTAL=2, more context var

* Patch: xlog2, break expt(2, 32) x 2 -> expt(2, 16) x 4 for fp16 math

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-07-10 16:44:58 -07:00
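The `polyN` helper mentioned above is, at its core, Horner evaluation of an N-term polynomial; a hedged sketch of that idea on plain floats (not the exact tinygrad signature, which operates on graph nodes rather than Python floats):

```python
def polyN(x: float, coeffs: list[float]) -> float:
  """Horner scheme: ((c0*x + c1)*x + c2)*x + ..., one multiply-add per term."""
  acc = 0.0
  for c in coeffs:
    acc = acc * x + c
  return acc

# sin(x) ~ x - x**3/6 + x**5/120 for small x, as a polynomial in x**2
x = 0.5
print(x * polyN(x * x, [1/120, -1/6, 1.0]))  # ~0.479427, vs math.sin(0.5) ~0.479426
```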
chenyu
622b7bd556 simpler TinyJit inside TinyJit detection (#5219)
* simpler TinyJit inside TinyJit detection

suggested in 73395b998b (commitcomment-143660402)

* cannot repro...

* clear the way out

* finally clear
2024-07-03 12:28:53 -04:00
chenyu
73395b998b better error msg for TinyJit inside TinyJit (#5202)
it's possible to support TinyJit inside TinyJit, but there are edge cases, like two TinyJit functions sharing another TinyJit function, so just give a more precise error for now
2024-06-27 18:09:19 -04:00
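A minimal sketch of the situation the error now catches, assuming the behavior of raising when one TinyJit is entered while another is capturing:

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def inner(x: Tensor) -> Tensor:
  return (x + 1).realize()

@TinyJit
def outer(x: Tensor) -> Tensor:
  return (inner(x) * 2).realize()  # calls one jit from inside another

# expected to raise the TinyJit-inside-TinyJit error once capture begins
# (the exact call on which it triggers depends on the jit warm-up behavior)
for _ in range(3):
  outer(Tensor([1.0, 1.0]))
```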
chenyu
ad91962dcf CACHECOLLECTING -> CAPTURING and don't capture clear_l2 (#5190)
fixed first-time BEAM slowness
2024-06-27 12:32:28 -04:00
chenyu
5b8fda3c65 fix: JIT=0 means no JIT (#5188) 2024-06-27 10:31:37 -04:00
nimlgen
654a8b9ef7 retire hsa (#4885)
* retire hsa

* EMULATE_AMD
2024-06-09 11:33:03 +03:00
nimlgen
47bfd7c2b7 fix sync of offset buffers in graphs (#4850)
* correctly sync offset buffers

* test

* style

* run less

* just use base
2024-06-06 16:09:45 +03:00
nimlgen
eb9689336e nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugh, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00