195 Commits

Author SHA1 Message Date
George Hotz
4680247e35 renderer/amd: move in tree (#14702)
* renderer/amd: move in tree

* fix paths in tests

* 24000 lines

* no delete for amd files
2026-02-12 18:09:16 +08:00
George Hotz
befc1e800c assembly/amd: disasm is test only (#14694)
* assembly/amd: disasm is test only

* viz uses str
2026-02-12 12:33:46 +08:00
George Hotz
3fab43c57c add cache to asm gemm (#14675)
2026-02-11 08:26:30 +08:00
qazal
80b0119cef llama: add new asm gemm shape (#14611)
* llama: add new asm gemm shape

* work

* cleanup

* half dtype

* more comment
2026-02-10 00:34:29 +09:00
George Hotz
183d38b128 remove CUSTOM_KERNEL / directly construct it (#14604)
* remove CUSTOM_KERNEL / directly construct it

* clean that up

* simpler multi

* custom kernel spec

* remove Kernel

* fix multi

* use sharded shape

* explicit regression test
2026-02-08 18:43:33 +08:00
qazal
cf73d7e2a7 hotfix: disable slower asm gemm shape from llama seqlen 8192 (#14582)
2026-02-06 15:05:19 +09:00
George Hotz
43e7eda4e7 grad_b uses custom gemm (#14550)
* grad_b uses custom gemm

* fix multi backward, acc is in float32

* test_gemm_batched

* square gemm

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2026-02-05 15:22:27 +09:00
qazal
f9cfb64cd9 test asm_gemm in CI (#14551)
* test asm_gemm in CI

* default float16

* use a smaller shape for multi

* smaller size

* smaller for CI

* smaller for ci

* need half
2026-02-05 13:32:22 +09:00
chenyu
d57d24c7d4 Buffer.as_buffer -> Buffer.as_memoryview [pr] (#14535)
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
qazal
d1bfbe9ce3 isolate slow llama gemm (#14525)
2026-02-04 12:20:10 +09:00
qazal
a98c53769a ASM_GEMM=1 runs the UOp gemm on non cdna (#14516)
* ASM_GEMM=1 runs the UOp gemm on non cdna

tests run on mac in 3 seconds

* min diff
2026-02-03 20:42:02 +09:00
qazal
616e9c1483 CDNA assembly gemm in tensor.py with flag (#14310)
* work

* work

* the assembly

* remove the old one

* remove ws bufs, assert splitk

* notes cleanup

* work

* gemm args

* gemm in mixins would be nice

* add gemm gradient

* print counters

* the realize is for DEBUG=2 aesthetics

* dedup

* rewrite to python dsl, no list copies

* leave that

* add B, M, N, K to gemm name

* it's M0 not NULL

* fp16 support

* test cleanup + more gemms

* work from viz

* more work

* gemm batch_size

* xccg path work

* tiny comments on the label naming

* s_waitcnt
2026-01-31 22:34:14 +09:00
qazal
d69bc5aa1a make DEV=NULL EMULATE=AMD amd_asm_matmul run (#14460)
2026-01-31 20:45:24 +09:00
George Hotz
e5df7e640b fix branches in amd_asm_matmul (#14369)
2026-01-27 20:48:42 +08:00
qazal
dfefeddeed add tflops to cdna gemm custom kernel (#14281)
2026-01-22 12:48:28 +09:00
George Hotz
79c1559f69 amd asm can still be simpler (#14199)
* amd asm can still be simpler

* simpler

* V_LANE_ID

* simpler

* simpler

* compact vgpr
2026-01-17 18:40:10 +09:00
George Hotz
50554115ee fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul (#14196)
* fix VALU_SALU / IMMED_MASK and improve amd_asm_matmul

* immed

* wave override

* restore ALT

* advance sgprs correctly

* no helpers

* decrease to 192 VGPRs
2026-01-17 11:58:34 +09:00
George Hotz
8a2549d42b improve amd_asm_matmul + minor VIZ PKTS improvements (#14186)
* improve amd_asm_matmul + minor VIZ PKTS improvements

* fix waitcnt issue

* cleanups
2026-01-17 06:56:59 +09:00
qazal
b46da603fe codegen/custom_kernel: do not attach KernelInfo to user program (#14160)
2026-01-15 14:01:48 +09:00
qazal
434dbafab5 optional Estimates in KernelInfo (#14147)
* optional Estimates in KernelInfo

* custom asm test plumbing

* s_code_end

* estimates test

* vaddr arg in global_store

* kernel desc

* Ops.DEVICE name
2026-01-14 22:55:03 +09:00
qazal
002ea39da7 assembly/amd: use Tensor.custom_kernel to run assembly (#14125)
* assembly/amd: use Tensor.custom_kernel to run assembly

* PRINT_ASM=1 is DEBUG=4
2026-01-14 08:29:25 +09:00
George Hotz
a28c8105a5 assembly/amd: 2% faster amd_uop_matmul + SQTT (#14122)
* assembly/amd: 2% faster amd_uop_matmul

* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL

* sqtt printer

* fix printer

* fast decode

* fast decoder

* test packet counts

* ugh it's not faster

* dead
2026-01-13 19:55:32 +09:00
George Hotz
330a0b686e assembly/amd: clean up dsl and make type verification strict (#14102)
* assembly/amd: start newdsl

* work

* newdsl upd

* Reg is p nice

* cleaner

* work

* getting clean

* all fields

* more BitFields

* redo the pdfs with dsl2 syntax

* no lit

* cleanups

* more defaults

* fix get and remove crap

* aliases

* ugly but kind of works

* NULL, not rawimm

* clean up defaults

* only dsl

* asm fixes

* lit fixup

* more lit

* cleanups

* olddsl

* single pcode dict

* emu sort of works

* trash test

* global is global

* types property

* reg mods

* fix a few tests

* remove monkey patch

* fixes

* less hacks in tests

* less hacks in tests

* 4 test failures

* hw tests all pass

* fix compare emulator

* fix some tests

* 3 more

* fix and shorten sqtt

* handwritten

* fix validation

* test corrections

* all types validate

* fix dsl2 tests

* fix bugs in disasm

* skips on cdna

* work

* repr with reg[]

* fix bitfield tests

* merge pcodes in dsl

* remove override

* disasm uses inst.types

* simpler
2026-01-13 08:52:16 +09:00
George Hotz
cb500466c2 assembly/amd: amd_asm_matmul (#13989)
* amd_asm_matmul

* dsl transform

* asm roundtrip

* fixed

* less

* better

* more

* simpler

* simplify

* lil

* simpler

* compact

* work

* cleanups

* simplify

* simpler

* cleanup

* name the regs

* simp

* big simp

* big simp

* simp

* acc grid

* fast

* stuff

* fast

* simpler

* owrks

* save vgprs

* save vgprs

* Compact

* less VGPRs

* after

* SQTT support

* fastest

* faster

* lil faster

* tile regs

* faster

* readable

* one more

* simpler

* lil simpler

* NO_GLOBAL skips early globals

* stock kernel

* cleanups

* cleanups

* one b reg

* safe reg changes

* acc is compact now

* remove confusing stuff

* sregs

* lds cleanups

* vopd
2026-01-07 20:11:05 -08:00
qazal
bd55507ee4 RDNA3 fp16 assembly gemm 85 TFLOPS (#13990)
2026-01-03 18:34:23 +09:00
qazal
2cc64d71b0 simplify mi350x gemm / viz asm tests (#13984)
* mi350x gemm cleanup

* asm tests work

* simpler asm tests
2026-01-03 11:11:07 +09:00
qazal
5f52266225 mi350x gemm: use Tensor.custom_kernel in asm test (#13969)
* mi350x gemm: use Tensor.custom_kernel in asm test

* A @ B for baseline
2026-01-02 18:30:50 +09:00
qazal
c0f52c9dcb split assembly gemm to per arch directory (#13953)
2026-01-02 00:10:22 +09:00
qazal
6a5430ab00 correct args order in mi350x gemm (#13949)
2026-01-01 23:01:46 +09:00
qazal
b23f4517ab prep mi350x gemm for python dsl (#13918)
* start by pruning existing asm

* better branch names

* split to template and real instructions
2025-12-31 20:00:57 +09:00
qazal
b557c46233 assembly gemm clean ups, instructions for cli (#13892)
2025-12-30 16:14:06 +09:00
qazal
f541540129 variable N for asm gemm (#13869)
* variable N for asm gemm

* cleanup spacing
2025-12-29 19:35:50 +09:00
qazal
fc5278746f mi350x assembly gemm cleanups (#13867)
2025-12-29 18:47:23 +09:00
qazal
066d96c397 print tflops in asm gemm test (#13859)
* print tflops in asm gemm test

* change order
2025-12-29 02:26:40 +09:00
qazal
2cfbabdc34 mi350x 1tflop bf16 gemm in extra (#13702)
2025-12-28 21:45:42 +09:00
George Hotz
744af193f0 remove ScheduleItem and merge it with ExecItem (#13759)
* remove ExecItem and merge it with ScheduleItem

* less diff

* fix issues

* min diff

* don't change bufs in _lower

* min diff

* update

* revert

* fixes

* diff
2025-12-19 17:04:24 -04:00
George Hotz
df6cde8a00 cleanup stale examples/extra (#13764)
* cleanup stale files

* examples

* move those back

* old

* delete more
2025-12-19 16:27:37 -04:00
George Hotz
bd4b9de7d2 use numpy in amd_uop_matmul for simpler tracing (#13503)
2025-11-30 08:04:38 -08:00
George Hotz
98e9e73286 hotfix: amd_uop_matmul getenvs
2025-11-17 13:26:01 -08:00
George Hotz
ba84d415fe work from benchmarking tinybox red v2 (#13264)
* work from benchmarking tinybox red v2

* gpuburn
2025-11-13 16:38:40 -08:00
George Hotz
faf68c03a8 more mi350x matmul work (#13138)
* more mi350x matmul work

* broken compute
2025-11-13 09:09:28 -08:00
George Hotz
2d4f01fda0 move mixins to mixin dir (#13105)
* move mixins to mixin dir

* math
2025-11-05 10:18:33 -08:00
wozeparrot
4ed0f216b5 fix: make max_matmul run again (#13085)
2025-11-03 18:09:09 -08:00
George Hotz
416b15cc59 improve uop matmul syntax (#13074)
* improve uop matmul syntax

* store takes const

* copy

* cleanups

* faster and simpler

* label them reduce

* better syntax

* touchup
2025-11-03 21:34:26 +08:00
George Hotz
1e3d6e49a6 index slicing + allclose (#13071)
* continue work on slicing+allclose

* Revert "Revert "slicing + allclose""

This reverts commit 6c7a12f21c.

* fix tests + better syntax

* forgot an after

* slot is an integer
2025-11-03 13:01:48 +08:00
George Hotz
8cbef912d2 move reshape to MathTraits (#13054)
* move reshape to MathTraits

* confirm it works in amd_uop_matmul
2025-11-02 12:56:15 +08:00
George Hotz
267be7fc5e fp16 acc
2025-11-02 12:53:04 +08:00
George Hotz
e98506735b add CONTRACT support to UOp programs (#13043)
* add contract support

* use contract

* 342 tflops
2025-11-01 19:11:32 +08:00
George Hotz
65a0a31475 AMD mi350x matmul from stream (#13040)
* works

* working mfma

* 120 TFLOPS

* regs

* 192 TFLOPS

* try pipelining

* something

* notes

* contract

* linter to 3.11

* that was a bug
2025-11-01 17:55:19 +08:00
George Hotz
bc178d14a9 matmul example on metal showing off tensor core (#13033)
* matmul example on metal showing off tensor core

* flip the args of placeholder

* mat_idx

* imp
2025-10-31 19:40:36 +08:00