chenyu
557134e1c7
model/test fix that failed with WEBGPU=1 DEBUG=2 ( #14706 )
2026-02-12 09:08:16 -05:00
George Hotz
4680247e35
renderer/amd: move in tree ( #14702 )
...
* renderer/amd: move in tree
* fix paths in tests
* 24000 lines
* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
d5fc3ea1ba
assembly/amd: mypy+ruff passes ( #14701 )
...
* assembly/amd: mypy+ruff passes
* touchups
2026-02-12 16:59:42 +08:00
George Hotz
025049c521
clean up sqtt / update src formatting in viz ( #14696 )
...
* update src formatting in viz
* rename to RDNA3/RDNA4 in sqtt
* wrap
* move sqttmap
* update readme
* why did that change?
* cdna
* that's just for test
2026-02-12 14:27:14 +08:00
George Hotz
befc1e800c
assembly/amd: disasm is test only ( #14694 )
...
* assembly/amd: disasm is test only
* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
c331798201
move tests to test/backend ( #14691 )
...
* move tests to test/backend
* fix imports
* fix CI
* revert that one
* Fix formatting in README for test command
2026-02-12 11:09:44 +08:00
George Hotz
3fab43c57c
add cache to asm gemm ( #14675 )
2026-02-11 08:26:30 +08:00
wozeparrot
69574542ab
fix: use correct fa implementation in eval ( #14651 )
2026-02-09 18:20:44 -08:00
qazal
80b0119cef
llama: add new asm gemm shape ( #14611 )
...
* llama: add new asm gemm shape
* work
* cleanup
* half dtype
* more comment
2026-02-10 00:34:29 +09:00
nimlgen
e087c58ae0
print tables in llama/profile.sh ( #14639 )
2026-02-09 12:32:54 +03:00
nimlgen
01a4ee4d66
do not hive_reset when amdgpu ( #14624 )
2026-02-08 19:14:13 +03:00
George Hotz
183d38b128
remove CUSTOM_KERNEL / directly construct it ( #14604 )
...
* remove CUSTOM_KERNEL / directly construct it
* clean that up
* simpler multi
* custom kernel spec
* remove Kernel
* fix multi
* use sharded shape
* explicit regression test
2026-02-08 18:43:33 +08:00
nimlgen
e29a88ca09
hive_reset respects lock ( #14618 )
2026-02-08 10:47:25 +03:00
wozeparrot
d87ae1c84c
feat: tinyfs load test in benchmark ( #14602 )
2026-02-06 18:00:00 -08:00
nimlgen
fbb67a3f95
am_smi: fix after regen ( #14594 )
2026-02-06 20:57:41 +03:00
qazal
b7e3fbe07e
llama: add VIZ=-1 to dev_run ( #14583 )
...
* llama: add VIZ=-1 to dev_run
* readme
* cleaner
* add profile.sh script
* better grouping of options
* add other row
* readme edits
* work
2026-02-06 22:59:22 +09:00
nimlgen
fbeb978170
diff devices for sdma ( #14589 )
...
* start
* x
* fix
* sdma
* c
* clean
* x
* hm
* cleaner
2026-02-06 16:39:12 +03:00
qazal
cf73d7e2a7
hotfix: disable slower asm gemm shape from llama seqlen 8192 ( #14582 )
2026-02-06 15:05:19 +09:00
qazal
be77873974
llama: contig backward for wk / wv matmul backward ( #14581 )
2026-02-06 14:54:00 +09:00
wozeparrot
f73468d516
fa: block skipping for fa kv bwd ( #14569 )
2026-02-05 16:13:53 -08:00
chenyu
41a179f542
fix test_xlm_roberta_large ( #14564 )
...
onnxruntime does not allow symlinks that point outside the model dir. update snapshot_download to use local_dir instead of cache_dir. also add an ad hoc migration step to copy the existing model
2026-02-05 14:56:06 -05:00
qazal
190042358f
llama: faster bf16 matmul / rope backward ( #14558 )
2026-02-05 23:57:25 +09:00
George Hotz
b398335f62
assembly/amd: fix saturation in python remu ( #14557 )
...
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp
* fix saturation in PYTHON_REMU
* simpler
* more tests, less lines
---------
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
...
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
...
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
Christopher Milan
232848d086
PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 ( #14546 )
...
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16
* put that back
* cleaner
* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834
feat: llama uses enable_gqa during training ( #14545 )
2026-02-04 16:22:31 -08:00
Christopher Milan
5338ce6b74
test S_PACK in extra/assembly/amd/test/hw ( #14537 )
...
* S_PACK_LL_B32_B16 in test/hw
* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f
remove allow_shape_mismatch in Tensor.replace ( #14536 )
...
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
62786d488a
am: mi3xx perf ( #14529 )
2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
...
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
67f91e897b
UOp.is_contiguous -> UOp.has_buffer_identity [pr] ( #14530 )
...
one more confusing buffer related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
Christopher Milan
ecbce5269e
PYTHONREMU properly supports S_PACK_LL_B32_B16 ( #14527 )
...
* PYTHONREMU properly supports S_PACK_LL_B32_B16
* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9
feat: llama uses is_causal on sdpa during training ( #14528 )
2026-02-03 20:24:30 -08:00
qazal
d1bfbe9ce3
isolate slow llama gemm ( #14525 )
2026-02-04 12:20:10 +09:00
George Hotz
d59e6e7a37
move more tests to test/null, split some existing ones ( #14512 )
...
* move more tests to test/null, split some existing ones
* null work
* null work
* move more
* fixes
* move PIL
* PIL in CLIP
* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a
ASM_GEMM=1 runs the UOp gemm on non cdna ( #14516 )
...
* ASM_GEMM=1 runs the UOp gemm on non cdna
tests run on mac in 3 seconds
* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e
viz: profiler command line tool ( #14515 )
2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838
rename all DEFINE_GLOBAL to PARAM ( #14511 )
2026-02-03 15:09:38 +08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
chenyu
66d2b02f11
delete files that depends on extra.optimization.helpers ( #14499 )
2026-02-02 13:33:33 -05:00
George Hotz
6e958dbfd4
assembly/amd: add RDNA4 support to emulator ( #14341 )
...
* start new rdna4
* work
* plus works
* more pass
* rdna4
* assembly/amd: fix RDNA4 emulator for float16 and VOP3 clamp
* stale
* rev
* rr
* rdna4 emu tests
* cleanup
* cleanup
* simp
* works
* better factorization
* hacks
* fix mockgpu
* guard both
* cleaner
* gate
* bug fix and a few tests
* all test_tiny
2026-02-02 21:35:59 +08:00
qazal
965940dd00
sqtt: update examples after event field change ( #14493 )
...
* regen sqtt examples
* cdna
* rdna4
* packet counts for rdna3
* sqttmap work
2026-02-02 21:39:48 +09:00
George Hotz
965149a46d
assembly/amd: add ds perm instructions ( #14486 )
...
* assembly/amd: add ds perm instructions
* NO SKIP
* fix preexisting RDNA3 issues
* pcode
* assert
* asserts
* unify
* simp
* good fix
2026-02-02 16:02:00 +08:00
Robbe Derks
d75a1b0d5a
usbgpu: use BOT interface for patch.py ( #13644 )
...
* BOT usage
* cleanup
* fix lint
* fix ruff
* fix -7?
2026-02-02 11:54:46 +08:00
qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
...
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
qazal
d69bc5aa1a
make DEV=NULL EMULATE=AMD amd_asm_matmul run ( #14460 )
2026-01-31 20:45:24 +09:00
George Hotz
b705c9143c
assembly/amd: test more instructions ( #14365 )
...
* assembly/amd: test more instructions
* more
* passing
* revert
* no const fold
* remove junk
* cleaner
2026-01-31 12:40:22 +08:00
Christopher Milan
e575dd8275
prevent UB in long decomp and more emulated tests ( #14447 )
2026-01-30 19:38:41 -05:00