Commit Graph

12063 Commits

Author SHA1 Message Date
chenyu
b09dc646f5 revert some late_buffer_view change (#14578)
revert #14478 which breaks tinyfs
2026-02-05 22:51:40 -05:00
chenyu
d41836f135 remove KERNEL special case in realize_assign [pr] (#14573) 2026-02-05 21:55:44 -05:00
George Hotz
6cbcf98627 KernelInfo is required on get_program (#14571)
* rangeify always adds KernelInfo

* fix tests

* skip flaky test
2026-02-06 10:49:27 +08:00
George Hotz
28c56a783c add CallInfo and viz call toggle (#14570) 2026-02-06 09:30:58 +08:00
wozeparrot
f73468d516 fa: block skipping for fa kv bwd (#14569) 2026-02-05 16:13:53 -08:00
chenyu
b7ef775677 more cleanup in create_schedule [pr] (#14566)
fixed wrong comments and simplified queue building
2026-02-05 16:12:17 -05:00
Garret Castro
cee7ef7ab2 disable threads (#14555) 2026-02-05 16:11:32 -05:00
chenyu
79b7799dba clean up linearize schedule [pr] (#14565)
* clean up linearize schedule [pr]

don't mix ScheduleItem and UOp in schedule queue

* ok
2026-02-05 15:24:09 -05:00
chenyu
41a179f542 fix test_xlm_roberta_large (#14564)
onnxruntime does not allow symlinks outside the model dir. Update snapshot_download to use local_dir instead of cache_dir, with an ad hoc migration step to copy the existing model too.
2026-02-05 14:56:06 -05:00
Christopher Milan
aa9dc50577 dtype decomps don't require bitshifts (#14542)
* dtype decomps don't require bitshifts

* simplify shr/shl

* ruff
2026-02-05 14:42:30 -05:00
Christopher Milan
b47397ab17 list ml_dtypes as dependency for DSP (#14562)
* pin onnxruntime to 1.23.2 for DSP

* list ml_dtypes instead

This reverts commit 84bb2cc0fc.
2026-02-05 14:27:50 -05:00
chenyu
2b47a9a1b5 skip test_xlm_roberta_large (#14563)
symlinked model not allowed in latest onnxruntime
2026-02-05 14:00:24 -05:00
chenyu
42c18da88a add Ops asserts in toposort sched_sink [pr] (#14561)
more explicit
2026-02-05 12:40:02 -05:00
nimlgen
483bba4f05 nv: use prof_exec_counter (#14559) 2026-02-05 19:00:14 +03:00
qazal
190042358f llama: faster bf16 matmul / rope backward (#14558) 2026-02-05 23:57:25 +09:00
George Hotz
b398335f62 assembly/amd: fix saturation in python remu (#14557)
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp

* fix saturation in PYTHON_REMU

* simpler

* more tests, less lines

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5 fa: simpler is faster (#14548) 2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7 grad_b uses custom gemm (#14550)
* grad_b uses custom gemm

* fix multi backward, acc is in float32

* test_gemm_batched

* square gemm

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9 test asm_gemm in CI (#14551)
* test asm_gemm in CI

* default float16

* use a smaller shape for multi

* smaller size

* smaller for CI

* smaller for ci

* need half
2026-02-05 13:32:22 +09:00
chenyu
c0ca7f9c51 use more UOp.sum and UOp.prod [pr] (#14549) 2026-02-04 22:05:20 -05:00
chenyu
e8dace41b6 clean up UOp.vars [pr] (#14547) 2026-02-04 20:52:25 -05:00
Christopher Milan
232848d086 PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 (#14546)
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16

* put that back

* cleaner

* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834 feat: llama uses enable_gqa during training (#14545) 2026-02-04 16:22:31 -08:00
chenyu
664f1bf76d minor ops/jit cleanups [pr] (#14543) 2026-02-04 17:21:34 -05:00
chenyu
03d0fa9c3f merge as_buf into buf_uop [pr] (#14541) 2026-02-04 16:32:23 -05:00
chenyu
43ef24a8af remove buf_target [pr] (#14540)
not really needed
2026-02-04 15:03:47 -05:00
chenyu
8b7343b950 clean up is_realized [pr] (#14538)
base cannot be Ops.MULTI since MULTI is a view now
2026-02-04 14:24:10 -05:00
Christopher Milan
5338ce6b74 test S_PACK in extra/assembly/amd/test/hw (#14537)
* S_PACK_LL_B32_B16 in test/hw

* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f remove allow_shape_mismatch in Tensor.replace (#14536)
move all logic to torch_backend instead of hacking the Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
ec2b6bbda8 hcq: update signal logic (#14531) 2026-02-04 19:32:56 +03:00
nimlgen
62786d488a am: mi3xx perf (#14529) 2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4 Buffer.as_buffer -> Buffer.as_memoryview [pr] (#14535)
it casts to a memoryview. also inline the as_typed_buffer checks into Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
024f57ecf5 jit input_buffers cleanup [pr] (#14532) 2026-02-04 10:14:38 -05:00
chenyu
67f91e897b UOp.is_contiguous -> UOp.has_buffer_identity [pr] (#14530)
one more confusing buffer-related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
George Hotz
fb9df1e031 pretty print binary (#14520) 2026-02-04 18:04:35 +08:00
Christopher Milan
8c3c026d86 decomp float16 to float32 (#14417)
* decomp float16 to float32

* denormals aren't zero

* add test

* denormals are zero

* fix

* oops

* bitcast works

* fix LOADs

* test_dtype passing

* cleanup

* mypy

* debug print

* only emulate if EMULATED

* very ugly, but passes spec

* add test_dtype_alu tests

* Revert "very ugly, but passes spec"

This reverts commit fdc3999b65.

* bottom up decompositions

* that should have symbolic

* simplify a bit

* SPEC really works

* run with DEBUG

* debug=4

* rm debug
2026-02-04 01:37:47 -05:00
Christopher Milan
ecbce5269e PYTHONREMU properly supports S_PACK_LL_B32_B16 (#14527)
* PYTHONREMU properly supports S_PACK_LL_B32_B16

* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9 feat: llama uses is_causal on sdpa during training (#14528) 2026-02-03 20:24:30 -08:00
chenyu
9c2fc118ef relax setitem target check (#14526)
old check was too conservative
2026-02-03 22:32:49 -05:00
qazal
d1bfbe9ce3 isolate slow llama gemm (#14525) 2026-02-04 12:20:10 +09:00
nimlgen
2f55005ad9 qcom: sync cpu cache when from_blob (#14518)
* um

* fx

* d

* x

* x

* x

* x

* f

* ren
2026-02-03 21:51:03 +03:00
chenyu
ee9d6a1f36 remove DEFINE_VAR in to_define_global [pr] (#14522)
not needed
2026-02-03 10:12:33 -05:00
Nino Risteski
af4c74bb41 delete extra cast (#14517) 2026-02-03 08:29:04 -05:00
chenyu
9d1e9e643e removed a duplicated remove_bufferize rule [pr] (#14519) 2026-02-03 08:28:07 -05:00
George Hotz
d59e6e7a37 move more tests to test/null, split some existing ones (#14512)
* move more tests to test/null, split some existing ones

* null work

* null work

* move more

* fixes

* move PIL

* PIL in CLIP

* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a ASM_GEMM=1 runs the UOp gemm on non cdna (#14516)
* ASM_GEMM=1 runs the UOp gemm on non cdna

tests run on mac in 3 seconds

* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e viz: profiler command line tool (#14515) 2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838 rename all DEFINE_GLOBAL to PARAM (#14511) 2026-02-03 15:09:38 +08:00
George Hotz
dc77b3318b move files that pass with NULL=1 to test/null (#14508)
* move files that pass with NULL=1 to test/null

* fix windows

* cpu 0

* bugfix + durations
2026-02-03 13:52:36 +08:00
George Hotz
888819ee09 call autodiff gradient (#14510) 2026-02-03 13:51:02 +08:00