wozeparrot
9317e96881
fa: explicitly pass shapes ( #14857 )
2026-02-19 05:26:16 -08:00
nimlgen
3b95fa0ed4
am_smi: enable mem usage back ( #14858 )
2026-02-18 19:27:27 +03:00
wozeparrot
6d301ad2c4
feat: llama wqkv ( #14841 )
2026-02-17 23:01:33 -08:00
wozeparrot
95e97ec341
separate llama optim ( #14810 )
2026-02-17 13:02:35 -08:00
qazal
f8e485ee9e
nvcc/nvdisasm macos shim ( #14822 )
...
* move to backend
* and arch
* setup_nvcc_osx
* blackwell
* min test
* now getting dumb assert is_ptx
* support cubin.
* work
* remove that
* simpler
2026-02-17 20:07:05 +09:00
qazal
f590564bf7
gemm multiple is only for cdna4 asm ( #14814 )
...
* gemm multiple is only for cdna4 asm
* move to backend
* and arch
* path
2026-02-17 14:00:02 +09:00
George Hotz
5bd2862d1a
late compile the cdna gemm ( #14783 )
...
* late compile the cdna gemm
* remove old things
* finalize inplace
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-17 13:04:22 +09:00
George Hotz
f081f154ae
parameterize the CDNA asm gemm ( #14813 )
...
* parameterize the CDNA asm gemm
* fix llama test
* fix
* add more gemm tests
* confirm all match
* test these asm gemms
2026-02-17 11:35:18 +08:00
nimlgen
131bbbbfd8
am: smu_v13_0_12 ( #14800 )
2026-02-16 22:58:10 +03:00
wozeparrot
45aebe1572
hipkittens fa backward ( #14723 )
2026-02-16 00:38:44 -08:00
qazal
c7a4dbf918
viz: get program binary from the UOp ( #14787 )
...
* viz: get program binary from the UOp
* remove that
* less
* rename View Program to View Source
* two words
* fix
2026-02-16 15:46:58 +09:00
George Hotz
dff9cf35c2
amd asm emulator fixes + run it in CI ( #14786 )
...
* amd asm fix, try 2
* fix tests
2026-02-16 13:24:21 +08:00
qazal
55a4dfa2e0
cdna4 asm_gemm tests in CI on the null backend ( #14785 )
...
* cdna4 asm_gemm tests in CI on the null backend
* no .numpy() in null
* better
* gemm/asm: device comes from renderer
2026-02-16 14:06:23 +09:00
George Hotz
ac079e43d7
ElementwiseMixin ( #14777 )
2026-02-16 08:50:47 +08:00
qazal
33b31d9cd6
tinykittens flash attention dtype fix, add CI ( #14770 )
...
* don't hardcode amd device
* add failing tests, ci too
* fix: fix for dtype mixin
* bump to rocm 7.1
---------
Co-authored-by: Woze Parrot <wozeparrot@gmail.com>
2026-02-16 01:15:11 +09:00
qazal
9bb6014900
keep existing profile trace in viz cli ( #14757 )
2026-02-15 13:16:32 +09:00
nimlgen
4ab51b55bd
stream pma decoder ( #14746 )
2026-02-14 17:40:18 +03:00
George Hotz
c0de4f75b1
improve mmapeak, print names with sqtt ( #14726 )
2026-02-13 16:07:06 +08:00
wozeparrot
0613c0ac0c
hipkittens fa forward ( #14692 )
2026-02-12 20:16:43 -08:00
George Hotz
4088d686b2
remove llvm requirement from amd ( #14717 )
...
* remove llvm requirement from amd
* tests pass
* test
* sink kernarg_size
* move stuff
* amd_asm_matmul to new style
* default type
* fix tests, simpler
* cu mode is faster and simpler
* darken
2026-02-13 10:50:12 +08:00
chenyu
557134e1c7
model/test fix that failed with WEBGPU=1 DEBUG=2 ( #14706 )
2026-02-12 09:08:16 -05:00
George Hotz
4680247e35
renderer/amd: move in tree ( #14702 )
...
* renderer/amd: move in tree
* fix paths in tests
* 24000 lines
* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
d5fc3ea1ba
assembly/amd: mypy+ruff passes ( #14701 )
...
* assembly/amd: mypy+ruff passes
* touchups
2026-02-12 16:59:42 +08:00
George Hotz
025049c521
clean up sqtt / update src formatting in viz ( #14696 )
...
* update src formatting in viz
* rename to RDNA3/RDNA4 in sqtt
* wrap
* move sqttmap
* update readme
* why did that change?
* cdna
* that's just for test
2026-02-12 14:27:14 +08:00
George Hotz
befc1e800c
assembly/amd: disasm is test only ( #14694 )
...
* assembly/amd: disasm is test only
* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
c331798201
move tests to test/backend ( #14691 )
...
* move tests to test/backend
* fix imports
* fix CI
* revert that one
* Fix formatting in README for test command
2026-02-12 11:09:44 +08:00
George Hotz
3fab43c57c
add cache to asm gemm ( #14675 )
2026-02-11 08:26:30 +08:00
wozeparrot
69574542ab
fix: use correct fa implementation in eval ( #14651 )
2026-02-09 18:20:44 -08:00
qazal
80b0119cef
llama: add new asm gemm shape ( #14611 )
...
* llama: add new asm gemm shape
* work
* cleanup
* half dtype
* more comment
2026-02-10 00:34:29 +09:00
nimlgen
e087c58ae0
print tables in llama/profile.sh ( #14639 )
2026-02-09 12:32:54 +03:00
nimlgen
01a4ee4d66
do not hive_reset when amdgpu ( #14624 )
2026-02-08 19:14:13 +03:00
George Hotz
183d38b128
remove CUSTOM_KERNEL / directly construct it ( #14604 )
...
* remove CUSTOM_KERNEL / directly construct it
* clean that up
* simpler multi
* custom kernel spec
* remove Kernel
* fix multi
* use sharded shape
* explicit regression test
2026-02-08 18:43:33 +08:00
nimlgen
e29a88ca09
hive_reset respects lock ( #14618 )
2026-02-08 10:47:25 +03:00
wozeparrot
d87ae1c84c
feat: tinyfs load test in benchmark ( #14602 )
2026-02-06 18:00:00 -08:00
nimlgen
fbb67a3f95
am_smi: fix after regen ( #14594 )
2026-02-06 20:57:41 +03:00
qazal
b7e3fbe07e
llama: add VIZ=-1 to dev_run ( #14583 )
...
* llama: add VIZ=-1 to dev_run
* readme
* cleaner
* add profile.sh script
* better grouping of options
* add other row
* readme edits
* work
2026-02-06 22:59:22 +09:00
nimlgen
fbeb978170
diff devices for sdma ( #14589 )
...
* start
* x
* fix
* sdma
* c
* clean
* x
* hm
* cleaer
2026-02-06 16:39:12 +03:00
qazal
cf73d7e2a7
hotfix: disable slower asm gemm shape from llama seqlen 8192 ( #14582 )
2026-02-06 15:05:19 +09:00
qazal
be77873974
llama: contig backward for wk / wv matmul backward ( #14581 )
2026-02-06 14:54:00 +09:00
wozeparrot
f73468d516
fa: block skipping for fa kv bwd ( #14569 )
2026-02-05 16:13:53 -08:00
chenyu
41a179f542
fix test_xlm_roberta_large ( #14564 )
...
onnxruntime does not allow symlink that's outside model dir. update snapshot_download to use local_dir instead of cache_dir. some ad hoc migration step to copy the existing model too
2026-02-05 14:56:06 -05:00
qazal
190042358f
llama: faster bf16 matmul / rope backward ( #14558 )
2026-02-05 23:57:25 +09:00
George Hotz
b398335f62
assembly/amd: fix saturation in python remu ( #14557 )
...
* PYTHONREMU: failing test for V_SUB_NC_U32_E64 clamp
* fix saturation in PYTHON_REMU
* simpler
* more tests, less lines
---------
Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
2026-02-05 18:35:57 +08:00
wozeparrot
c1ea6687e5
fa: simpler is faster ( #14548 )
2026-02-05 01:13:17 -08:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
...
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
...
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
Christopher Milan
232848d086
PYTHONREMU: VOP3P integer operations with constants don't cast to fp16 ( #14546 )
...
* PYTHONREMU: VOP3P integer operations with constants don't cast to fp16
* put that back
* cleaner
* do that once
2026-02-04 20:10:59 -05:00
wozeparrot
2966619834
feat: llama uses enable_gqa during training ( #14545 )
2026-02-04 16:22:31 -08:00
Christopher Milan
5338ce6b74
test S_PACK in extra/assembly/amd/test/hw ( #14537 )
...
* S_PACK_LL_B32_B16 in test/hw
* add rest of S_PACK instructions
2026-02-04 14:17:16 -05:00
chenyu
9052db678f
remove allow_shape_mismatch in Tensor.replace ( #14536 )
...
move all logic to torch_backend and not hacking Tensor method
2026-02-04 12:38:18 -05:00