11663 Commits

Author SHA1 Message Date
qazal
a071adffc0 viz: amdgpu disassembly register highlighting UI (#14059)
* viz: amdgpu disassembly register highlighting

* minor details

* details from IDA

* more details from IDA

* refactor token colors

* move tokenizer to python

* simplify

* minimal tokenizer for registers

* all the operand types
2026-01-09 11:27:09 +09:00
chenyu
b878f9d5a4 reuse Tensor init with const path [pr] (#14076) 2026-01-08 17:49:37 -05:00
chenyu
efcb32f6a9 unique const when requires_grad is set to True (#14075)
* unique const when requires_grad is set to True

* fix pyrender
2026-01-08 16:30:45 -05:00
chenyu
b34c637767 support bfloat16 for CL (#14073) 2026-01-08 14:14:29 -05:00
Garret Castro
16b652302e skip bf16 test if not supported by device (#14070) 2026-01-08 13:37:24 -05:00
nimlgen
3f61a96d79 am: SetSoftMaxByFreq on gfx10+ (#14068) 2026-01-08 17:00:03 +03:00
George Hotz
e7b5d8a434 assembly/amd: more RDNA4 asm (#14062)
* rdna4 more

* asm

* fixes

* assembly/amd: handwritten wmma failing test

* passes

* wmma default hacks

* space

* 0 skips in rdna3/rdna4 disasm

* more RDNA4 tests

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2026-01-08 05:09:37 -08:00
nimlgen
e372c841ba hevc: beam in decode (#14067)
* hevc: beam in decode

* fine

* g
2026-01-08 15:47:16 +03:00
nimlgen
1732a4ec4b am: rework set_clocks (#14065) 2026-01-08 15:33:32 +03:00
nimlgen
f3aceaa08b hevc: fast decoder (#14057) 2026-01-08 15:20:37 +03:00
qazal
309197bca5 assembly/amd: test_roundtrip for cdna/rdna4 (#14066) 2026-01-08 21:03:13 +09:00
qazal
15a056715d fix amd assembly IDE tests on macbook (#14063) 2026-01-08 17:27:52 +09:00
wozeparrot
027b935269 tk: fix grouped load store (#14035) 2026-01-07 22:38:02 -08:00
George Hotz
2db04d0696 assembly/amd: start adding RDNA4 support (#14060)
* assembly/amd: start adding RDNA4 support

* rdna4 asm
2026-01-07 21:19:30 -08:00
George Hotz
cb500466c2 assembly/amd: amd_asm_matmul (#13989)
* amd_asm_matmul

* dsl transform

* asm roundtrip

* fixed

* less

* better

* more

* simpler

* simplify

* lil

* simpler

* compact

* work

* cleanups

* simplify

* simpler

* cleanup

* name the regs

* simp

* big simp

* big simp

* simp

* acc grid

* fast

* stuff

* fast

* simpler

* works

* save vgprs

* save vgprs

* Compact

* less VGPRs

* after

* SQTT support

* fastest

* faster

* lil faster

* tile regs

* faster

* readable

* one more

* simpler

* lil simpler

* NO_GLOBAL skips early globals

* stock kernel

* cleanups

* cleanups

* one b reg

* safe reg changes

* acc is compact now

* remove confusing stuff

* sregs

* lds cleanups

* vopd
2026-01-07 20:11:05 -08:00
chenyu
3caa1e2c98 fix cast HALF with PYTHON backend (#14058) 2026-01-07 16:52:05 -05:00
chenyu
5f1ede7f7e clean up test_dtype (#14055)
use less lambda
2026-01-07 15:45:42 -05:00
nimlgen
5bd4593eda hevc: cleaner decoder (#14056)
* hevc: cleaner decoder

* nn
2026-01-07 18:29:30 +03:00
b1tg
241f0402b4 add seed in bert data shuffle (#14054) 2026-01-07 10:02:05 -05:00
nimlgen
25c82dd242 nv: profile nvdec (#14053) 2026-01-07 15:56:54 +03:00
qazal
35900290b2 viz: configure text height for cfg (#14052) 2026-01-07 18:58:56 +09:00
chenyu
87f4bc5446 update variable names around jit [pr] (#14049)
lbs, st_vars_dtype_device and rawbuffers no more
2026-01-06 22:32:41 -05:00
chenyu
2833c5a54b few more jit tests with multi tensor inputs (#14047) 2026-01-06 22:05:22 -05:00
chenyu
72a3f78d19 jit includes tensor inputs in containers (#14043)
* jit includes tensor inputs in containers

* cleanup
2026-01-06 19:42:06 -05:00
chenyu
c714881832 don't allow jit input to be const (#14045)
* don't allow jit input to be unbuffered like const

* just const to fix multi

* fix rnnt
2026-01-06 18:15:22 -05:00
chenyu
a8896f28e1 test_unrealized_const_input_frozen (#14044)
unrealized const is not replaced in jit
2026-01-06 14:17:43 -05:00
nimlgen
325f4006ff amd: copies w/o sdma (#14036)
* amd: copies w/o sdma

* as_args

* fixes

* f
2026-01-06 21:15:58 +03:00
chenyu
7fb18f7e47 raise when jit fxn returns non-Tensor output (#14042) 2026-01-06 12:59:20 -05:00
chenyu
4491ec0c9e JitError (#14041)
* JitError

* test_symbolic_jit
2026-01-06 12:19:50 -05:00
chenyu
6ddddc68af test jit tolist failure (#14040)
also moved tests to test_jit_footguns
2026-01-06 11:16:57 -05:00
chenyu
b699b9f763 test case for jit a function with item call (#14039)
* test case for jit a function with item call

output is silently wrong now

* no dtype
2026-01-06 10:40:43 -05:00
nimlgen
02084f5376 mockdsp: use dsp allocator (#14037)
* mockdsp: use dsp allocator

* fix

* ?
2026-01-06 16:04:47 +03:00
wozeparrot
2b3e01e79c tk: support sliced local -> reg load (#14034) 2026-01-06 05:33:24 -05:00
George Hotz
45f7fd073d assembly/amd: pcode bug fixes (#14032)
* bring over pcode parser

* fixes

* pdf test

* delay alu
2026-01-06 00:15:48 -08:00
wozeparrot
21d0f6bb76 tk: flat global -> local load (#14033) 2026-01-05 23:35:53 -08:00
qazal
3170365a5b visualize SQTT with the same cfg infrastructure (#13870)
* start

* rough sketch

* post render dag

* art

* intro g key

* work

* custom color scale

* colors

* more blue

* better

* smaller

* use for loop in test
2026-01-06 14:53:20 +09:00
Christopher Milan
0120d69caa autogen: avcodec (and simplify workflow) (#14031)
* simplify autogen workflow and add avcodec verification

- Consolidate all regeneration into single steps (delete + import)
- Remove continue-on-error and individual diff checks
- Use git diff at end to catch all differences
- Show artifact URL in failure message
- Add avcodec.py verification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* patch avcodec

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-05 23:30:25 -05:00
George Hotz
20653d2996 assembly/amd: make pdf.py code shine (#14029)
* assembly/amd: make pdf.py code shine

* no merge

* pdf2 is the future

* something

* regen enums

* test

* work

* remove junk

* write

* pcode extraction

* pdf2 passes all tests

* simplify

* simpler pdf

* late filter

* remove hacks

* simplify pdf2.py

* field type

* remove defaults

* don't export srcenum

* simple pdf.py

* simpler

* cleaner

* less hack in PDF
2026-01-05 18:49:40 -08:00
qazal
ea7b149ca5 viz command line tool (#14030) 2026-01-06 10:19:47 +09:00
Christopher Milan
f86c728440 load libclang as 'libclang.so' too (#14028) 2026-01-05 16:56:16 -05:00
chenyu
eda6a73897 clean up canonicalize_device (#14027)
centralize the type check
2026-01-05 10:29:55 -05:00
chenyu
ce464b147a clean up comments that mentioned outdated terms (#14026)
no MultiLazyBuffer and no ShapeTracker in comments
2026-01-05 09:42:58 -05:00
chenyu
83063cc3e4 onnx TensorScatter (#14024) 2026-01-05 09:05:22 -05:00
chenyu
9497ec00f2 fix onnx attention permute (#14025)
* fix onnx attention permute

* skip test_attention_4d_fp16_cpu too
2026-01-05 08:58:50 -05:00
qazal
5cff5698f7 viz: g key toggles graph and text view (#14023) 2026-01-05 22:41:45 +09:00
chenyu
7a81a3cb98 more passed onnx tests (#14022) 2026-01-05 07:46:27 -05:00
kim yongjin
34fe105386 remove unused LazySeq (#14020) 2026-01-05 07:38:33 -05:00
qazal
4f2f38bf64 viz: split cfg and table render (#14021) 2026-01-05 20:59:08 +09:00
nimlgen
70405b4f3c am_smi: mi350 (#14018) 2026-01-05 13:10:56 +03:00
Christopher Milan
b2a0b9c551 autogen: dump patch in CI (#14010)
* autogen: don't fast-fail, produce patch artifact on differences

All verification steps now use continue-on-error to run completely.
Each job generates a patch artifact containing all differences found.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* add gen from header test

* fix tests

* fail if diff

* add forward decl autogen test

* remove confusing/wrong comments

* macos unittests set LIBCLANG_PATH

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-04 22:38:12 -05:00