1633 Commits

Author SHA1 Message Date
chenyu
557134e1c7 model/test fix that failed with WEBGPU=1 DEBUG=2 (#14706) 2026-02-12 09:08:16 -05:00
George Hotz
4680247e35 renderer/amd: move in tree (#14702)
* renderer/amd: move in tree

* fix paths in tests

* 24000 lines

* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
d5fc3ea1ba assembly/amd: mypy+ruff passes (#14701)
* assembly/amd: mypy+ruff passes

* touchups
2026-02-12 16:59:42 +08:00
George Hotz
025049c521 clean up sqtt / update src formatting in viz (#14696)
* update src formatting in viz

* rename to RDNA3/RDNA4 in sqtt

* wrap

* move sqttmap

* update readme

* why did that change?

* cdna

* that's just for test
2026-02-12 14:27:14 +08:00
George Hotz
befc1e800c assembly/amd: disasm is test only (#14694)
* assembly/amd: disasm is test only

* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
c331798201 move tests to test/backend (#14691)
* move tests to test/backend

* fix imports

* fix CI

* revert that one

* Fix formatting in README for test command
2026-02-12 11:09:44 +08:00
George Hotz
3fab43c57c add cache to asm gemm (#14675) 2026-02-11 08:26:30 +08:00
wozeparrot
69574542ab fix: use correct fa implementation in eval (#14651) 2026-02-09 18:20:44 -08:00
qazal
80b0119cef llama: add new asm gemm shape (#14611)
* llama: add new asm gemm shape

* work

* cleanup

* half dtype

* more comment
2026-02-10 00:34:29 +09:00
nimlgen
e087c58ae0 print tables in llama/profile.sh (#14639) 2026-02-09 12:32:54 +03:00
nimlgen
01a4ee4d66 do not hive_reset when amdgpu (#14624) 2026-02-08 19:14:13 +03:00
George Hotz
183d38b128 remove CUSTOM_KERNEL / directly construct it (#14604)
* remove CUSTOM_KERNEL / directly construct it

* clean that up

* simpler multi

* custom kernel spec

* remove Kernel

* fix multi

* use sharded shape

* explicit regression test
2026-02-08 18:43:33 +08:00
nimlgen
e29a88ca09 hive_reset respects lock (#14618) 2026-02-08 10:47:25 +03:00
wozeparrot
d87ae1c84c feat: tinyfs load test in benchmark (#14602) 2026-02-06 18:00:00 -08:00
nimlgen
fbb67a3f95 am_smi: fix after regen (#14594) 2026-02-06 20:57:41 +03:00
qazal
b7e3fbe07e llama: add VIZ=-1 to dev_run (#14583)
* llama: add VIZ=-1 to dev_run

* readme

* cleaner

* add profile.sh script

* better grouping of options

* add other row

* readme edits

* work
2026-02-06 22:59:22 +09:00
nimlgen
fbeb978170 diff devices for sdma (#14589)
* start

* x

* fix

* sdma

* c

* clean

* x

* hm

* cleaer
2026-02-06 16:39:12 +03:00
qazal
cf73d7e2a7 hotfix: disable slower asm gemm shape from llama seqlen 8192 (#14582) 2026-02-06 15:05:19 +09:00
qazal
be77873974 llama: contig backward for wk / wv matmul backward (#14581) 2026-02-06 14:54:00 +09:00
wozeparrot
f73468d516 fa: block skipping for fa kv bwd (#14569) 2026-02-05 16:13:53 -08:00
chenyu
41a179f542 fix test_xlm_roberta_large (#14564)
onnxruntime does not allow symlink that's outside model dir. update snapshot_download to use local_dir instead of cache_dir. some ad hoc migration step to copy the existing model too
2026-02-05 14:56:06 -05:00
qazal
190042358f llama: faster bf16 matmul / rope backward (#14558) 2026-02-05 23:57:25 +09:00
George Hotz
b398335f62 assembly/amd: fix saturation in python remu (#14557)
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp

* fix saturation in PYTHON_REMU

* simpler

* more tests, less lines

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5 fa: simpler is faster (#14548) 2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7 grad_b uses custom gemm (#14550)
* grad_b uses custom gemm

* fix multi backward, acc is in float32

* test_gemm_batched

* square gemm

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9 test asm_gemm in CI (#14551)
* test asm_gemm in CI

* default float16

* use a smaller shape for multi

* smaller size

* smaller for CI

* smaller for ci

* need half
2026-02-05 13:32:22 +09:00
Christopher Milan
232848d086 PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 (#14546)
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16

* put that back

* cleaner

* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834 feat: llama uses enable_gqa during training (#14545) 2026-02-04 16:22:31 -08:00
Christopher Milan
5338ce6b74 test S_PACK in extra/assembly/amd/test/hw (#14537)
* S_PACK_LL_B32_B16 in test/hw

* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f remove allow_shape_mismatch in Tensor.replace (#14536)
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00
nimlgen
62786d488a am: mi3xx perf (#14529) 2026-02-04 19:32:43 +03:00
chenyu
d57d24c7d4 Buffer.as_buffer -> Buffer.as_memoryview [pr] (#14535)
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
chenyu
67f91e897b UOp.is_contiguous -> UOp.has_buffer_identity [pr] (#14530)
one more confusing buffer related method, but it's definitely not is_contiguous
2026-02-04 09:21:26 -05:00
Christopher Milan
ecbce5269e PYTHONREMU properly supports S_PACK_LL_B32_B16 (#14527)
* PYTHONREMU properly supports S_PACK_LL_B32_B16

* default
2026-02-03 23:45:33 -05:00
wozeparrot
720c9597a9 feat: llama uses is_causal on sdpa during training (#14528) 2026-02-03 20:24:30 -08:00
qazal
d1bfbe9ce3 isolate slow llama gemm (#14525) 2026-02-04 12:20:10 +09:00
George Hotz
d59e6e7a37 move more tests to test/null, split some existing ones (#14512)
* move more tests to test/null, split some existing ones

* null work

* null work

* move more

* fixes

* move PIL

* PIL in CLIP

* don't move that
2026-02-03 20:20:20 +08:00
qazal
a98c53769a ASM_GEMM=1 runs the UOp gemm on non cdna (#14516)
* ASM_GEMM=1 runs the UOp gemm on non cdna

tests run on mac in 3 seconds

* min diff
2026-02-03 20:42:02 +09:00
qazal
5c1d21349e viz: profiler command line tool (#14515) 2026-02-03 19:51:25 +09:00
George Hotz
dd2de4f838 rename all DEFINE_GLOBAL to PARAM (#14511) 2026-02-03 15:09:38 +08:00
wozeparrot
bbcd3d67a3 fa: faster (#14453) 2026-02-02 21:34:17 -08:00
chenyu
66d2b02f11 delete files that depends on extra.optimization.helpers (#14499) 2026-02-02 13:33:33 -05:00
George Hotz
6e958dbfd4 assembly/amd: add RDNA4 support to emulator (#14341)
* start new rdna4

* work

* plus works

* more pass

* rdna4

* assembly/amd: fix RDNA4 emulator for float16 and VOP3 clamp

* stale

* rev

* rr

* rdna4 emu tests

* cleanup

* cleanup

* simp

* works

* better factorizaion

* hacks

* fix mockgpu

* guard both

* cleaner

* gate

* bug fix and a few tests

* all test_tiny
2026-02-02 21:35:59 +08:00
qazal
965940dd00 sqtt: update examples after event field change (#14493)
* regen sqtt examples

* cdna

* rdna4

* packet counts for rdna3

* sqttmap work
2026-02-02 21:39:48 +09:00
George Hotz
965149a46d assembly/amd: add ds perm instructions (#14486)
* assembly/amd: add ds perm instructions

* NO SKIP

* fix preexisting RDNA3 issues

* pcode

* assert

* asserts

* unify

* simp

* good fix
2026-02-02 16:02:00 +08:00
Robbe Derks
d75a1b0d5a usbgpu: use BOT interface for patch.py (#13644)
* BOT usage

* cleanup

* fix lint

* fix ruff

* fix -7?
2026-02-02 11:54:46 +08:00
qazal
616e9c1483 CDNA assembly gemm in tensor.py with flag (#14310)
* work

* work

* the assembly

* remove the old one

* remove ws bufs, assert splitk

* notes cleanup

* work

* gemm args

* gemm in mixins would be nice

* add gemm gradient

* print counters

* the realize is for DEBUG=2 aesthetics

* dedup

* rewrite to python dsl, no list copies

* leave that

* add B, M, N, K to gemm name

* it's M0 not NULL

* fp16 support

* test cleanup + more gemms

* work from viz

* more work

* gemm batch_size

* xccg path work

* tiny comments on the label naming

* s_waitcnt
2026-01-31 22:34:14 +09:00
qazal
d69bc5aa1a make DEV=NULL EMULATE=AMD amd_asm_matmul run (#14460) 2026-01-31 20:45:24 +09:00
George Hotz
b705c9143c assembly/amd: test more instructions (#14365)
* assembly/amd: test more instructions

* more

* passing

* revert

* no const fold

* remove junk

* cleaner
2026-01-31 12:40:22 +08:00
Christopher Milan
e575dd8275 prevent UB in long decomp and more emulated tests (#14447) 2026-01-30 19:38:41 -05:00