qazal
76b577ee76
viz: only SIMD name in sqtt timeline rows (#14146)
2026-01-14 20:13:27 +09:00
nimlgen
86708ccac5
hip_ioctl: dump aql (#14142)
2026-01-14 13:15:10 +03:00
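AQL queues hold fixed-size 64-byte packets, so "dumping" them comes down to unpacking a known struct. The sketch below is a hedged illustration and not the code from the PR. It decodes an HSA kernel-dispatch packet using the field layout from the HSA Platform System Architecture specification. The name `decode_dispatch_packet` is hypothetical.

```python
import struct

# Little-endian HSA AQL kernel dispatch packet (64 bytes):
# header, setup, workgroup_size_x/y/z, reserved0            -> 6 x u16
# grid_size_x/y/z, private_segment_size, group_segment_size -> 5 x u32
# kernel_object, kernarg_address, reserved2, completion_signal -> 4 x u64
DISPATCH_FMT = "<6H5I4Q"
assert struct.calcsize(DISPATCH_FMT) == 64

PACKET_TYPES = {0: "vendor_specific", 1: "invalid", 2: "kernel_dispatch",
                3: "barrier_and", 4: "agent_dispatch", 5: "barrier_or"}

def decode_dispatch_packet(raw: bytes) -> dict:
    (header, setup, wg_x, wg_y, wg_z, _reserved0,
     grid_x, grid_y, grid_z, private_seg, group_seg,
     kernel_object, kernarg_address, _reserved2, completion_signal) = struct.unpack(DISPATCH_FMT, raw[:64])
    return {
        "type": PACKET_TYPES.get(header & 0xFF, "unknown"),  # header bits 0-7
        "barrier": (header >> 8) & 0x1,                        # header bit 8
        "acquire_scope": (header >> 9) & 0x3,                  # header bits 9-10
        "release_scope": (header >> 11) & 0x3,                 # header bits 11-12
        "dimensions": setup & 0x3,                             # setup bits 0-1
        "workgroup_size": (wg_x, wg_y, wg_z),
        "grid_size": (grid_x, grid_y, grid_z),
        "private_segment_size": private_seg,
        "group_segment_size": group_seg,
        "kernel_object": hex(kernel_object),
        "kernarg_address": hex(kernarg_address),
        "completion_signal": hex(completion_signal),
    }
```

Decoding into a dict keeps the dump readable. A real tool would also dispatch on the packet type before assuming the kernel-dispatch layout.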
wozeparrot
a92778aa0c
tk: fa multi fix (#14134)
2026-01-13 17:22:15 -08:00
George Hotz
2ab18ea7e3
assembly/amd: use xml instead of pdf (#14118)
* assembly/amd: use xml instead of pdf
* use amdxml to generate info about op sizes
* fix many tests with invalid instructions
* fix info generation
* chad xml fixes many bugs
* rename to operands
* simplify
* amdxml
* bug fix
2026-01-14 10:03:37 +09:00
qazal
002ea39da7
assembly/amd: use Tensor.custom_kernel to run assembly (#14125)
* assembly/amd: use Tensor.custom_kernel to run assembly
* PRINT_ASM=1 is DEBUG=4
2026-01-14 08:29:25 +09:00
chenyu
fe00682502
clean up svd tests (#14133)
removed from test_ops and added to TestTorchBackend
2026-01-13 16:32:21 -05:00
chenyu
e610821c52
Tensor.cummin and Tensor.nonzero (#14131)
2026-01-13 15:09:56 -05:00
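As a reference for what `nonzero` computes, here is a pure-Python sketch over rectangular nested lists. It assumes the PyTorch convention, where the result lists the coordinates of every non-zero element in row-major order. The helpers are hypothetical and are not tinygrad's implementation.

```python
from itertools import product

def shape_of(nested):
    # Recover the shape of a rectangular nested list by walking first elements.
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)

def get_item(nested, idx):
    # Index a nested list with a coordinate tuple.
    for i in idx:
        nested = nested[i]
    return nested

def nonzero_coords(nested):
    # Row-major scan over all coordinates, keeping those with non-zero values.
    ranges = (range(n) for n in shape_of(nested))
    return [idx for idx in product(*ranges) if get_item(nested, idx) != 0]
```

For example, `nonzero_coords([[0, 1], [2, 0]])` gives `[(0, 1), (1, 0)]`. That is one row per hit and one column per dimension.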
chenyu
176a934ddd
Tensor.diagonal support offset and dims (#14130)
2026-01-13 14:49:06 -05:00
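To make `offset` and the dims arguments concrete, here is a 2-D-only sketch of the semantics. It assumes the `torch.diagonal(input, offset, dim1, dim2)` convention: a positive offset selects a diagonal above the main one, a negative offset selects one below, and `dim1`/`dim2` pick the axes. The function is hypothetical and only handles 2-D inputs, where swapping the dims is the same as transposing first.

```python
def diagonal(matrix, offset=0, dim1=0, dim2=1):
    # Normalize negative dims for a 2-D input (e.g. -2, -1 -> 0, 1).
    d1, d2 = dim1 % 2, dim2 % 2
    if d1 == d2:
        raise ValueError("dim1 and dim2 must differ")
    if (d1, d2) == (1, 0):
        # Taking the diagonal over (1, 0) equals taking it over (0, 1) of the transpose.
        matrix = [list(col) for col in zip(*matrix)]
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if offset >= 0:
        # Elements matrix[i][i + offset], above (or on) the main diagonal.
        return [matrix[i][i + offset] for i in range(max(0, min(rows, cols - offset)))]
    # Elements matrix[i - offset][i], below the main diagonal.
    return [matrix[i - offset][i] for i in range(max(0, min(rows + offset, cols)))]
```

For `m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, offsets `1` and `-1` give `[2, 6]` and `[4, 8]`.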
chenyu
2a217ba206
tinybackend isin and log10 (#14120)
can use tinygrad directly
2026-01-13 14:14:09 -05:00
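The note that these "can use tinygrad directly" fits the fact that both ops reduce to simple primitives. `isin` is an equality test broadcast against the test elements and then reduced with `any`. `log10` is a change of base from a log that is already available. The pure-Python sketch below shows only those identities. The helper names are hypothetical and this is not the backend code. Base 2 is used here on the assumption that hardware often exposes `log2` natively.

```python
import math

def isin(elements, test_elements):
    # Compare every element against every test element, then reduce with any().
    return [any(e == t for t in test_elements) for e in elements]

def log10(x):
    # Change of base: log10(x) = log2(x) / log2(10).
    return math.log2(x) / math.log2(10)
```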
George Hotz
a28c8105a5
assembly/amd: 2% faster amd_uop_matmul + SQTT (#14122)
* assembly/amd: 2% faster amd_uop_matmul
* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL
* sqtt printer
* fix printer
* fast decode
* fast decoder
* test packet counts
* ugh it's not faster
* dead
2026-01-13 19:55:32 +09:00
chenyu
05fcb57696
also return index in Tensor.cummax (#14117)
* also return index in Tensor.cummax
* fix
2026-01-12 22:42:10 -05:00
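Returning the index alongside the running extreme follows the PyTorch convention, where `cummax`/`cummin` yield a `(values, indices)` pair. The sketch below covers both with one scan. The helpers are hypothetical. The tie-breaking rule (`>=`/`<=`, so ties move the index to the latest position) is a choice made for this sketch, not a claim about tinygrad's behavior.

```python
import operator

def cumulative_extreme(xs, better):
    # Track the running extreme and the index where it was last reached.
    values, indices = [], []
    for i, x in enumerate(xs):
        if i == 0 or better(x, values[-1]):
            values.append(x)
            indices.append(i)
        else:
            values.append(values[-1])
            indices.append(indices[-1])
    return values, indices

def cummax(xs):
    return cumulative_extreme(xs, operator.ge)

def cummin(xs):
    return cumulative_extreme(xs, operator.le)
```

For example, `cummax([1, 3, 2, 5, 4])` returns `([1, 3, 3, 5, 5], [0, 1, 1, 3, 3])`.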
George Hotz
330a0b686e
assembly/amd: clean up dsl and make type verification strict (#14102)
* assembly/amd: start newdsl
* work
* newdsl upd
* Reg is p nice
* cleaner
* work
* getting clean
* all fields
* more BitFields
* redo the pdfs with dsl2 syntax
* no lit
* cleanups
* more defaults
* fix get and remove crap
* aliases
* ugly but kind of works
* NULL, not rawimm
* clean up defaults
* only dsl
* asm fixes
* lit fixup
* more lit
* cleanups
* olddsl
* single pcode dict
* emu sort of works
* trash test
* global is global
* types property
* reg mods
* fix a few tests
* remove monkey patch
* fixes
* less hacks in tests
* less hacks in tests
* 4 test failures
* hw tests all pass
* fix compare emulator
* fix some tests
* 3 more
* fix and shorten sqtt
* handwritten
* fix validation
* test corrections
* all types validate
* fix dsl2 tests
* fix bugs in disasm
* skips on cdna
* work
* repr with reg[]
* fix bitfield tests
* merge pcodes in dsl
* remove override
* disasm uses inst.types
* simpler
2026-01-13 08:52:16 +09:00
George Hotz
91bde927ef
assembly/amd: split asm.py into asm.py and disasm.py (#14101)
* split asm.py into asm.py and disasm.py
* split decoder
* move to pcode
* tests
2026-01-12 07:22:02 +09:00
George Hotz
44135e2e84
assembly/amd: always use v_nop in test for rocprof-trace-decoder (#14100)
* assembly/amd: always use v_nop in test for rocprof-trace-decoder
* test touchups
2026-01-12 05:31:58 +09:00
George Hotz
8b1b15aec0
assembly/amd: SQTT support (#14099)
* assembly/amd: SQTT support
* simpler
* cmp wave
* instruction compare
* rocprof decode
* simpler
* no llvm
* no strcmp
2026-01-12 05:07:17 +09:00
nimlgen
3e2c05ee9f
hevc: decoder as iterator (#14091)
2026-01-10 14:57:56 +03:00
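Exposing a decoder as an iterator generally lets callers pull output one piece at a time instead of decoding an entire stream up front. The sketch below shows the same generator pattern at the bitstream layer. It lazily splits an HEVC Annex B byte stream on `0x000001` start codes and reads the 6-bit NAL unit type from the header. It is a hypothetical illustration and not the tinygrad hevc decoder. `iter_nal_units` and `nal_unit_type` are made-up names, and emulation-prevention bytes are deliberately left in place.

```python
START_CODE = b"\x00\x00\x01"

def iter_nal_units(stream: bytes):
    # Yield NAL units one at a time; nothing after the current unit is scanned early.
    start = stream.find(START_CODE)
    while start != -1:
        begin = start + len(START_CODE)
        nxt = stream.find(START_CODE, begin)
        end = nxt if nxt != -1 else len(stream)
        # Strip trailing zero bytes (trailing_zero_8bits, or the leading
        # zero_byte of a following 4-byte start code 00 00 00 01).
        nal = stream[begin:end].rstrip(b"\x00")
        if nal:
            # Note: 0x000003 emulation-prevention bytes are still present here
            # and must be removed before parsing the RBSP.
            yield nal
        start = nxt

def nal_unit_type(nal: bytes) -> int:
    # HEVC NAL header: forbidden_zero_bit(1) | nal_unit_type(6) | nuh_layer_id(6) | temporal_id_plus1(3)
    return (nal[0] >> 1) & 0x3F
```

In HEVC, types 32, 33, and 34 are VPS, SPS, and PPS. A decoder loop can therefore consume parameter sets and slices as they arrive.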
b1tg
0fbc551622
train bert with fp8 (#13874)
* fp8 train
* clean
* lint
* test fix from #13439
* skip first/last layer
* rm __init__, restore unroll <=32 check
* tests
* clean test, remove unused
* multi-gpu test, clean quantize_to_fp8
* remove bert contiguous
* run script
* test: better check
* run script search
* add seed in bert data shuffle
* move script to mi350x folder
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
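Training in FP8 generally means scaling each tensor into FP8's narrow range before casting. The pure-Python sketch below simulates that step for the E4M3 format: at most 448 in magnitude, 3 mantissa bits, subnormals down to 2^-9, round-to-nearest-even, and saturation on overflow. It rests on stated assumptions. `fp8_e4m3_round`, `quantize_per_tensor`, and `dequantize` are hypothetical, per-tensor amax scaling is one common choice among several, and nothing here describes what the PR's `quantize_to_fp8` actually does.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal number

def fp8_e4m3_round(x: float) -> float:
    # Round a float to the nearest E4M3-representable value, saturating at +-448.
    if x == 0 or math.isnan(x):
        return x
    if math.isinf(x):
        return math.copysign(E4M3_MAX, x)
    mag = abs(x)
    _, e = math.frexp(mag)              # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)      # subnormals share the minimum exponent
    quantum = 2.0 ** (exp - 3)          # spacing with 3 mantissa bits
    r = round(mag / quantum) * quantum  # Python's round() is half-to-even
    return math.copysign(min(r, E4M3_MAX), x)

def quantize_per_tensor(xs):
    # Scale so the largest magnitude maps to E4M3_MAX, then round each value.
    amax = max((abs(v) for v in xs), default=0.0)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [fp8_e4m3_round(v * scale) for v in xs], scale

def dequantize(qs, scale):
    return [q / scale for q in qs]
```

Keeping the scale per tensor means the largest value always lands on 448. Values far smaller than the maximum lose precision or flush to zero, which is one reason sensitive layers are often kept in higher precision.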
qazal
2917ed1616
roc: propagate decoder errors to main thread (#14081)
* roc: propagate decoder errors to main thread
* types
* add cause
2026-01-09 21:10:45 +09:00
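By default, an exception raised inside a `threading.Thread` is only reported through `threading.excepthook`, and the thread that started it carries on unaware. The usual fix, consistent with the "propagate to main thread" and "add cause" bullets, is to capture the exception in the worker and re-raise it in the caller with `raise ... from` so that the original error becomes `__cause__`. The sketch below is a generic illustration with hypothetical names (`run_decoder_in_thread`, `DecoderError`), not the roc code.

```python
import threading

class DecoderError(RuntimeError):
    """Hypothetical error type raised on the calling thread."""

def run_decoder_in_thread(decode, *args):
    outcome = {}

    def worker():
        try:
            outcome["result"] = decode(*args)
        except Exception as exc:
            # Capture instead of letting threading.excepthook just print it.
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join()
    if "error" in outcome:
        # "from" stores the worker's exception as __cause__, so both tracebacks are reported.
        raise DecoderError("decoder thread failed") from outcome["error"]
    return outcome["result"]
```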
George Hotz
e7b5d8a434
assembly/amd: more RDNA4 asm (#14062)
* rdna4 more
* asm
* fixes
* assembly/amd: handwritten wmma failing test
* passes
* wmma default hacks
* space
* 0 skips in rdna3/rdna4 disasm
* more RDNA4 tests
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2026-01-08 05:09:37 -08:00
nimlgen
e372c841ba
hevc: beam in decode (#14067)
* hevc: beam in decode
* fine
* g
2026-01-08 15:47:16 +03:00
nimlgen
f3aceaa08b
hevc: fast decoder (#14057)
2026-01-08 15:20:37 +03:00
qazal
309197bca5
assembly/amd: test_roundtrip for cdna/rdna4 (#14066)
2026-01-08 21:03:13 +09:00
qazal
15a056715d
fix amd assembly IDE tests on macbook (#14063)
2026-01-08 17:27:52 +09:00
George Hotz
2db04d0696
assembly/amd: start adding RDNA4 support (#14060)
* assembly/amd: start adding RDNA4 support
* rdna4 asm
2026-01-07 21:19:30 -08:00
George Hotz
cb500466c2
assembly/amd: amd_asm_matmul (#13989)
* amd_asm_matmul
* dsl transform
* asm roundtrip
* fixed
* less
* better
* more
* simpler
* simplify
* lil
* simpler
* compact
* work
* cleanups
* simplify
* simpler
* cleanup
* name the regs
* simp
* big simp
* big simp
* simp
* acc grid
* fast
* stuff
* fast
* simpler
* works
* save vgprs
* save vgprs
* Compact
* less VGPRs
* after
* SQTT support
* fastest
* faster
* lil faster
* tile regs
* faster
* readable
* one more
* simpler
* lil simpler
* NO_GLOBAL skips early globals
* stock kernel
* cleanups
* cleanups
* one b reg
* safe reg changes
* acc is compact now
* remove confusing stuff
* sregs
* lds cleanups
* vopd
2026-01-07 20:11:05 -08:00
nimlgen
5bd4593eda
hevc: cleaner decoder (#14056)
* hevc: cleaner decoder
* nn
2026-01-07 18:29:30 +03:00
chenyu
c714881832
don't allow jit input to be const (#14045)
* don't allow jit input to be unbuffered like const
* just const to fix multi
* fix rnnt
2026-01-06 18:15:22 -05:00
wozeparrot
2b3e01e79c
tk: support sliced local -> reg load (#14034)
2026-01-06 05:33:24 -05:00
George Hotz
45f7fd073d
assembly/amd: pcode bug fixes (#14032)
* bring over pcode parser
* fixes
* pdf test
* delay alu
2026-01-06 00:15:48 -08:00
wozeparrot
21d0f6bb76
tk: flat global -> local load (#14033)
2026-01-05 23:35:53 -08:00
George Hotz
20653d2996
assembly/amd: make pdf.py code shine (#14029)
* assembly/amd: make pdf.py code shine
* no merge
* pdf2 is the future
* something
* regen enums
* test
* work
* remove junk
* write
* pcode extraction
* pdf2 passes all tests
* simplify
* simpler pdf
* late filter
* remove hacks
* simplify pdf2.py
* field type
* remove defaults
* don't export srcenum
* simple pdf.py
* simpler
* cleaner
* less hack in PDF
2026-01-05 18:49:40 -08:00
qazal
ea7b149ca5
viz command line tool (#14030)
2026-01-06 10:19:47 +09:00
nimlgen
70405b4f3c
am_smi: mi350 (#14018)
2026-01-05 13:10:56 +03:00
George Hotz
404eed6172
assembly/amd: improve tests for asm (#14007)
* assembly/amd: improve tests for asm
* upd
* skip
* tests
* re bug
* more passing
* cleanups
* cdna fixups
* improve tests, better CDNA parsing
* fix CI
* no defs
* simpler
* all pass
* from pdf
* regen
2026-01-04 15:14:08 -08:00
George Hotz
7ebda28692
assembly/amd: add CDNA support to asm (#13982)
* add CDNA support
* more cdna tests
* something
* fix more stuff
* more work
* simpler
* simpler
* cdna
* disasm
* less skip
* fixes
* simpler
2026-01-04 08:53:56 -08:00
George Hotz
34ea053b26
assembly/amd: clean up pcode, jit pcode instead of static (#14001)
* assembly/amd: clean up pcode
* regen
* lil
* jit the pcode
* sendmsg
* cleanups
* inst prefetch lol
2026-01-03 23:06:15 -08:00
George Hotz
8328511808
assembly/amd: make the emu.py code shine (#13996)
* assembly/amd: make the code shine
* lil clean
* reg back in pcode
* cleanups
* gen fma_mix
* no writelane hacks
* fn cleanup
* dead vgpr_write
* readable
* smem
* cleanup bench_emu
* speedups
* simpler and faster
* direct inst._fn
* split fxn
* Revert "simpler and faster"
This reverts commit e85f6594b3.
* move lds to wavestate
* dispatcher
* pc in dispatch
* literal isn't wavestate
* cleanups + program
* one readlane
* exec_vop3sd in exec_vop
* cleaner exec_vopd
* fully merge VOP3P
* no special paths
* no SliceProxy
* low=0
* no bigint
* failing tests
* fma on python 3.13
2026-01-03 20:33:09 -08:00
qazal
bd55507ee4
RDNA3 fp16 assembly gemm 85 TFLOPS (#13990)
2026-01-03 18:34:23 +09:00
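A throughput figure like "85 TFLOPS" conventionally counts 2·M·N·K floating-point operations for an M×N×K matrix multiply, with one multiply and one add per inner-product term. The helpers below are hypothetical and only do that arithmetic. The 4096×4096×4096 shape in the usage note is an assumption for illustration, because the commit title does not state the benchmark shape.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    # m*n outputs, each a length-k dot product: k multiplies + k adds.
    return 2 * m * n * k

def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    # Achieved throughput in TFLOPS for one GEMM that took `seconds`.
    return gemm_flops(m, n, k) / seconds / 1e12

def time_at_tflops(m: int, n: int, k: int, tflops: float) -> float:
    # Time one GEMM would take at a given sustained throughput.
    return gemm_flops(m, n, k) / (tflops * 1e12)
```

Under that assumed shape, one multiply is about 1.37e11 FLOPs, so sustaining 85 TFLOPS corresponds to roughly 1.62 ms per call.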
wozeparrot
6242a9d151
tk: no global copy and clear ranges (#13988)
2026-01-02 23:45:15 -08:00
wozeparrot
9f082e8e25
fa: split kv bwd into 2 kernels (#13981)
2026-01-02 18:45:51 -08:00
qazal
2cc64d71b0
simplify mi350x gemm / viz asm tests (#13984)
* mi350x gemm cleanup
* asm tests work
* simpler asm tests
2026-01-03 11:11:07 +09:00
George Hotz
0e282025ff
assembly/amd: split test_emu into hw tests (#13966)
* assembly/amd: split test_emu into hw tests
* hw tests
* bugfixes
* more tests and fix
2026-01-02 08:04:56 -08:00
chenyu
2e2b5fed12
fix misspellings (#13976)
2026-01-02 10:37:38 -05:00
nietras
f49e4714af
Fix spelling errors in README for AMD assembly (#13975)
2026-01-02 10:15:20 -05:00
qazal
5f52266225
mi350x gemm: use Tensor.custom_kernel in asm test (#13969)
* mi350x gemm: use Tensor.custom_kernel in asm test
* A @ B for baseline
2026-01-02 18:30:50 +09:00
George Hotz
5a1a561e0f
assembly/amd: rdna4 autogen (#13967)
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
* assembly/amd: fix autogen for RDNA4
2026-01-01 23:12:18 -05:00
wozeparrot
b27527f05a
fix: missed inner tracked range (#13964)
2026-01-01 18:09:57 -08:00
wozeparrot
ecbac8a338
tk: fa cleanups + causal test (#13963)
2026-01-01 18:05:00 -08:00
George Hotz
dfb813b760
assembly/amd: add pcode ds ops (#13939)
* assembly/amd: add pcode ds ops
* refactors
* fix ds op
* update autogen
* fix flat bug
* more tests
* fix emu test
* that's a hack
* generic
* fix all tests
* two tests
* fix test failure
* better
* remove __all__
2026-01-01 16:24:13 -05:00
b1tg
24723327ac
fix tc_up in search (#13438)
* tensor_core is missing from Scheduler
* test upcast max
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-01 10:25:08 -05:00