George Hotz
4680247e35
renderer/amd: move in tree ( #14702 )
* renderer/amd: move in tree
* fix paths in tests
* 24000 lines
* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
befc1e800c
assembly/amd: disasm is test only ( #14694 )
* assembly/amd: disasm is test only
* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
3fab43c57c
add cache to asm gemm ( #14675 )
2026-02-11 08:26:30 +08:00
qazal
80b0119cef
llama: add new asm gemm shape ( #14611 )
* llama: add new asm gemm shape
* work
* cleanup
* half dtype
* more comment
2026-02-10 00:34:29 +09:00
George Hotz
183d38b128
remove CUSTOM_KERNEL / directly construct it ( #14604 )
* remove CUSTOM_KERNEL / directly construct it
* clean that up
* simpler multi
* custom kernel spec
* remove Kernel
* fix multi
* use sharded shape
* explicit regression test
2026-02-08 18:43:33 +08:00
qazal
cf73d7e2a7
hotfix: disable slower asm gemm shape from llama seqlen 8192 ( #14582 )
2026-02-06 15:05:19 +09:00
George Hotz
43e7eda4e7
grad_b uses custom gemm ( #14550 )
* grad_b uses custom gemm
* fix multi backward, acc is in float32
* test_gemm_batched
* square gemm
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9
test asm_gemm in CI ( #14551 )
* test asm_gemm in CI
* default float16
* use a smaller shape for multi
* smaller size
* smaller for CI
* smaller for ci
* need half
2026-02-05 13:32:22 +09:00
chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
qazal
d1bfbe9ce3
isolate slow llama gemm ( #14525 )
2026-02-04 12:20:10 +09:00
qazal
a98c53769a
ASM_GEMM=1 runs the UOp gemm on non cdna ( #14516 )
* ASM_GEMM=1 runs the UOp gemm on non cdna
tests run on mac in 3 seconds
* min diff
2026-02-03 20:42:02 +09:00
qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
qazal
d69bc5aa1a
make DEV=NULL EMULATE=AMD amd_asm_matmul run ( #14460 )
2026-01-31 20:45:24 +09:00
George Hotz
e5df7e640b
fix branches in amd_asm_matmul ( #14369 )
2026-01-27 20:48:42 +08:00
qazal
dfefeddeed
add tflops to cdna gemm custom kernel ( #14281 )
2026-01-22 12:48:28 +09:00
George Hotz
79c1559f69
amd asm can still be simpler ( #14199 )
* amd asm can still be simpler
* simpler
* V_LANE_ID
* simpler
* simpler
* compact vgpr
2026-01-17 18:40:10 +09:00
George Hotz
50554115ee
fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul ( #14196 )
* fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul
* immed
* wave override
* restore ALT
* advance sgprs correctly
* no helpers
* decrease to 192 VGPRs
2026-01-17 11:58:34 +09:00
George Hotz
8a2549d42b
improve amd_asm_matmul + minor VIZ PKTS improvements ( #14186 )
* improve amd_asm_matmul + minor VIZ PKTS improvements
* fix waitcnt issue
* cleanups
2026-01-17 06:56:59 +09:00
qazal
b46da603fe
codegen/custom_kernel: do not attach KernelInfo to user program ( #14160 )
2026-01-15 14:01:48 +09:00
qazal
434dbafab5
optional Estimates in KernelInfo ( #14147 )
* optional Estimates in KernelInfo
* custom asm test plumbing
* s_code_end
* estimates test
* vaddr arg in global_store
* kernel desc
* Ops.DEVICE name
2026-01-14 22:55:03 +09:00
qazal
002ea39da7
assembly/amd: use Tensor.custom_kernel to run assembly ( #14125 )
* assembly/amd: use Tensor.custom_kernel to run assembly
* PRINT_ASM=1 is DEBUG=4
2026-01-14 08:29:25 +09:00
George Hotz
a28c8105a5
assembly/amd: 2% faster amd_uop_matmul + SQTT ( #14122 )
* assembly/amd: 2% faster amd_uop_matmul
* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL
* sqtt printer
* fix printer
* fast decode
* fast decoder
* test packet counts
* ugh it's not faster
* dead
2026-01-13 19:55:32 +09:00
George Hotz
330a0b686e
assembly/amd: clean up dsl and make type verification strict ( #14102 )
* assembly/amd: start newdsl
* work
* newdsl upd
* Reg is p nice
* cleaner
* work
* getting clean
* all fields
* more BitFields
* redo the pdfs with dsl2 syntax
* no lit
* cleanups
* more defaults
* fix get and remove crap
* aliases
* ugly but kind of works
* NULL, not rawimm
* clean up defaults
* only dsl
* asm fixes
* lit fixup
* more lit
* cleanups
* olddsl
* single pcode dict
* emu sort of works
* trash test
* global is global
* types property
* reg mods
* fix a few tests
* remove monkey patch
* fixes
* less hacks in tests
* less hacks in tests
* 4 test failures
* hw tests all pass
* fix compare emulator
* fix some tests
* 3 more
* fix and shorten sqtt
* handwritten
* fix validation
* test corrections
* all types validate
* fix dsl2 tests
* fix bugs in disasm
* skips on cdna
* work
* repr with reg[]
* fix bitfield tests
* merge pcodes in dsl
* remove override
* disasm uses inst.types
* simpler
2026-01-13 08:52:16 +09:00
George Hotz
cb500466c2
assembly/amd: amd_asm_matmul ( #13989 )
* amd_asm_matmul
* dsl transform
* asm roundtrip
* fixed
* less
* better
* more
* simpler
* simplify
* lil
* simpler
* compact
* work
* cleanups
* simplify
* simpler
* cleanup
* name the regs
* simp
* big simp
* big simp
* simp
* acc grid
* fast
* stuff
* fast
* simpler
* owrks
* save vgprs
* save vgprs
* Compact
* less VGPRs
* after
* SQTT support
* fastest
* faster
* lil faster
* tile regs
* faster
* readable
* one more
* simpler
* lil simpler
* NO_GLOBAL skips early globals
* stock kernel
* cleanups
* cleanups
* one b reg
* safe reg changes
* acc is compact now
* remove confusing stuff
* sregs
* lds cleanups
* vopd
2026-01-07 20:11:05 -08:00
qazal
bd55507ee4
RDNA3 fp16 assembly gemm 85 TFLOPS ( #13990 )
2026-01-03 18:34:23 +09:00
qazal
2cc64d71b0
simplify mi350x gemm / viz asm tests ( #13984 )
* mi350x gemm cleanup
* asm tests work
* simpler asm tests
2026-01-03 11:11:07 +09:00
qazal
5f52266225
mi350x gemm: use Tensor.custom_kernel in asm test ( #13969 )
* mi350x gemm: use Tensor.custom_kernel in asm test
* A @ B for baseline
2026-01-02 18:30:50 +09:00
qazal
c0f52c9dcb
split assembly gemm to per arch directory ( #13953 )
2026-01-02 00:10:22 +09:00
qazal
6a5430ab00
correct args order in mi350x gemm ( #13949 )
2026-01-01 23:01:46 +09:00
qazal
b23f4517ab
prep mi350x gemm for python dsl ( #13918 )
* start by pruning existing asm
* better branch names
* split to template and real instructions
2025-12-31 20:00:57 +09:00
qazal
b557c46233
assembly gemm clean ups, instructions for cli ( #13892 )
2025-12-30 16:14:06 +09:00
qazal
f541540129
variable N for asm gemm ( #13869 )
* variable N for asm gemm
* cleanup spacing
2025-12-29 19:35:50 +09:00
qazal
fc5278746f
mi350x assembly gemm cleanups ( #13867 )
2025-12-29 18:47:23 +09:00
qazal
066d96c397
print tflops in asm gemm test ( #13859 )
* print tflops in asm gemm test
* change order
2025-12-29 02:26:40 +09:00
qazal
2cfbabdc34
mi350x 1tflop bf16 gemm in extra ( #13702 )
2025-12-28 21:45:42 +09:00
George Hotz
744af193f0
remove ScheduleItem and merge it with ExecItem ( #13759 )
* remove ExecItem and merge it with ScheduleItem
* less diff
* fix issues
* min diff
* don't change bufs in _lower
* min diff
* update
* revert
* fixes
* diff
2025-12-19 17:04:24 -04:00
George Hotz
df6cde8a00
cleanup stale examples/extra ( #13764 )
* cleanup stale files
* examples
* move those back
* old
* delete more
2025-12-19 16:27:37 -04:00
George Hotz
bd4b9de7d2
use numpy in amd_uop_matmul for simpler tracing ( #13503 )
2025-11-30 08:04:38 -08:00
George Hotz
98e9e73286
hotfix: amd_uop_matmul getenvs
2025-11-17 13:26:01 -08:00
George Hotz
ba84d415fe
work from benchmarking tinybox red v2 ( #13264 )
* work from benchmarking tinybox red v2
* gpuburn
2025-11-13 16:38:40 -08:00
George Hotz
faf68c03a8
more mi350x matmul work ( #13138 )
* more mi350x matmul work
* broken compute
2025-11-13 09:09:28 -08:00
George Hotz
2d4f01fda0
move mixins to mixin dir ( #13105 )
* move mixins to mixin dir
* math
2025-11-05 10:18:33 -08:00
wozeparrot
4ed0f216b5
fix: make max_matmul run again ( #13085 )
2025-11-03 18:09:09 -08:00
George Hotz
416b15cc59
improve uop matmul syntax ( #13074 )
* improve uop matmul syntax
* store takes const
* copy
* cleanups
* faster and simpler
* label them reduce
* better syntax
* touchup
2025-11-03 21:34:26 +08:00
George Hotz
1e3d6e49a6
index slicing + allclose ( #13071 )
* continue work on slicing+allclose
* Revert "Revert "slicing + allclose""
This reverts commit 6c7a12f21c.
* fix tests + better syntax
* forgot an after
* slot is an integer
2025-11-03 13:01:48 +08:00
George Hotz
8cbef912d2
move reshape to MathTraits ( #13054 )
* move reshape to MathTraits
* confirm it works in amd_uop_matmul
2025-11-02 12:56:15 +08:00
George Hotz
267be7fc5e
fp16 acc
2025-11-02 12:53:04 +08:00
George Hotz
e98506735b
add CONTRACT support to UOp programs ( #13043 )
* add contract support
* use contract
* 342 tflops
2025-11-01 19:11:32 +08:00
George Hotz
65a0a31475
AMD mi350x matmul from stream ( #13040 )
* works
* working mfma
* 120 TFLOPS
* regs
* 192 TFLOPS
* try pipelining
* something
* notes
* contract
* linter to 3.11
* that was a bug
2025-11-01 17:55:19 +08:00
George Hotz
bc178d14a9
matmul example on metal showing off tensor core ( #13033 )
* matmul example on metal showing off tensor core
* flip the args of placeholder
* mat_idx
* imp
2025-10-31 19:40:36 +08:00