Christopher Milan
b47397ab17
list ml_dtypes as dependency for DSP ( #14562 )
...
* pin onnxruntime to 1.23.2 for DSP
* list ml_dtypes instead
This reverts commit 84bb2cc0fc.
2026-02-05 14:27:50 -05:00
chenyu
2b47a9a1b5
skip test_xlm_roberta_large ( #14563 )
...
symlink model not allowed in latest onnxruntime
2026-02-05 14:00:24 -05:00
chenyu
42c18da88a
add Ops asserts in toposort sched_sink [pr] ( #14561 )
...
more explicit
2026-02-05 12:40:02 -05:00
nimlgen
483bba4f05
nv: use prof_exec_counter ( #14559 )
2026-02-05 19:00:14 +03:00
qazal
190042358f
llama: faster bf16 matmul / rope backward ( #14558 )
2026-02-05 23:57:25 +09:00
George Hotz
b398335f62
assembly/amd: fix saturation in python remu ( #14557 )
...
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp
* fix saturation in PYTHON_REMU
* simpler
* more tests, less lines
---------
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
...
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
...
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
chenyu
c0ca7f9c51
use more UOp.sum and UOp.prod [pr] ( #14549 )
2026-02-04 22:05:20 -05:00
chenyu
e8dace41b6
clean up UOp.vars [pr] ( #14547 )
2026-02-04 20:52:25 -05:00
Christopher Milan
232848d086
PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 ( #14546 )
...
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16
* put that back
* cleaner
* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834
feat: llama uses enable_gqa during training ( #14545 )
2026-02-04 16:22:31 -08:00
chenyu
664f1bf76d
minor ops/jit cleanups [pr] ( #14543 )
2026-02-04 17:21:34 -05:00
chenyu
03d0fa9c3f
merge as_buf into buf_uop [pr] ( #14541 )
2026-02-04 16:32:23 -05:00
chenyu
43ef24a8af
remove buf_target [pr] ( #14540 )
...
not really needed
2026-02-04 15:03:47 -05:00
chenyu
8b7343b950
clean up is_realized [pr] ( #14538 )
...
base cannot be Ops.MULTI since MULTI is a view now
2026-02-04 14:24:10 -05:00
Christopher Milan
5338ce6b74
test S_PACK in extra/assembly/amd/test/hw ( #14537 )
...
* S_PACK_LL_B32_B16 in test/hw
* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f
remove allow_shape_mismatch in Tensor.replace ( #14536 )
...
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
ec2b6bbda8
hcq: update signal logic ( #14531 )
2026-02-04 19:32:56 +03:00
nimlgen
62786d488a
am: mi3xx perf ( #14529 )
2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
...
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
024f57ecf5
jit input_buffers cleanup [pr] ( #14532 )
2026-02-04 10:14:38 -05:00
chenyu
67f91e897b
UOp.is_contiguous -> UOp.has_buffer_identity [pr] ( #14530 )
...
one more confusing buffer related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
George Hotz
fb9df1e031
pretty print binary ( #14520 )
2026-02-04 18:04:35 +08:00
Christopher Milan
8c3c026d86
decomp float16 to float32 ( #14417 )
...
* decomp float16 to float32
* denormals arent zero
* add test
* denormals are zero
* fix
* oops
* bitcast works
* fix LOADs
* test_dtype passing
* cleanup
* mypy
* debug print
* only emulate if EMULATED
* very ugly, but passes spec
* add test_dtype_alu tests
* Revert "very ugly, but passes spec"
This reverts commit fdc3999b65.
* bottom up decompositions
* that should have symbolic
* simplify a bit
* SPEC really works
* run with DEBUG
* debug=4
* rm debug
2026-02-04 01:37:47 -05:00
Christopher Milan
ecbce5269e
PYTHONREMU properly supports S_PACK_LL_B32_B16 ( #14527 )
...
* PYTHONREMU properly supports S_PACK_LL_B32_B16
* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9
feat: llama uses is_causal on sdpa during training ( #14528 )
2026-02-03 20:24:30 -08:00
chenyu
9c2fc118ef
relax setitem target check ( #14526 )
...
old check was too conservative
2026-02-03 22:32:49 -05:00
qazal
d1bfbe9ce3
isolate slow llama gemm ( #14525 )
2026-02-04 12:20:10 +09:00
nimlgen
2f55005ad9
qcom: sync cpu cache when from_blob ( #14518 )
...
* um
* fx
* d
* x
* x
* x
* x
* f
* ren
2026-02-03 21:51:03 +03:00
chenyu
ee9d6a1f36
remove DEFINE_VAR in to_define_global [pr] ( #14522 )
...
not needed
2026-02-03 10:12:33 -05:00
Nino Risteski
af4c74bb41
delete extra cast ( #14517 )
2026-02-03 08:29:04 -05:00
chenyu
9d1e9e643e
removed a duplicated remove_bufferize rule [pr] ( #14519 )
2026-02-03 08:28:07 -05:00
George Hotz
d59e6e7a37
move more tests to test/null, split some existing ones ( #14512 )
...
* move more tests to test/null, split some existing ones
* null work
* null work
* move more
* fixes
* move PIL
* PIL in CLIP
* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a
ASM_GEMM=1 runs the UOp gemm on non cdna ( #14516 )
...
* ASM_GEMM=1 runs the UOp gemm on non cdna
tests run on mac in 3 seconds
* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e
viz: profiler command line tool ( #14515 )
2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838
rename all DEFINE_GLOBAL to PARAM ( #14511 )
2026-02-03 15:09:38 +08:00
George Hotz
dc77b3318b
move files that pass with NULL=1 to test/null ( #14508 )
...
* move files that pass with NULL=1 to test/null
* fix windows
* cpu 0
* bugfix + durations
2026-02-03 13:52:36 +08:00
George Hotz
888819ee09
call autodiff gradient ( #14510 )
2026-02-03 13:51:02 +08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
Christopher Milan
e579613b90
IR3 has aux ( #14509 )
2026-02-02 23:46:41 -05:00
George Hotz
85c7b23160
add pytest -nauto to benchmark for mac ( #14458 )
...
* add pytest -nauto to benchmark
* 3 minute timeout
* 3 min
* setup env
* comment
* fresh db
* in the pyenv
2026-02-03 12:26:09 +08:00
Christopher Milan
a5d7eb37db
IR3 works on versions earlier than 3.14 ( #14507 )
2026-02-02 23:10:19 -05:00
George Hotz
33c886cafa
disable copyout on NULL backend by default ( #14506 )
...
* disable copyout on NULL backend
* gate it
* allow copyout on some tests
2026-02-03 11:57:47 +08:00
chenyu
3c5845e8a5
remove cut_store_range ( #14505 )
...
special scheduling for CPU
2026-02-02 21:58:36 -05:00
chenyu
4f2e7aed24
fix multiple REDUCE on same RANGE ( #14504 )
...
each RANGE maps to one END, but reduce_to_acc is local and would not know this
2026-02-02 20:42:09 -05:00
chenyu
93c41a78fa
clean up NOOP [pr] ( #14503 )
...
should not be used as a COPY, started with removing from ALWAYS_RUN_OPS
2026-02-02 19:46:45 -05:00
chenyu
66d2b02f11
delete files that depends on extra.optimization.helpers ( #14499 )
2026-02-02 13:33:33 -05:00
George Hotz
ec0398fceb
test amd gpu crashes ( #14459 )
...
* test amd gpu crashes
* cleanup
* less sketch tests
2026-02-02 18:57:47 +03:00