11702 Commits

Author SHA1 Message Date
nimlgen
1364449cab system: early pci perm check (#14126)
* system: early pci perm check

* l
2026-01-13 17:45:05 +03:00
George Hotz
a28c8105a5 assembly/amd: 2% faster amd_uop_matmul + SQTT (#14122)
* assembly/amd: 2% faster amd_uop_matmul

* SQTT_TOKEN_EXCLUDE + SQTT_SIMD_SEL

* sqtt printer

* fix printer

* fast decode

* fast decoder

* test packet counts

* ugh it's not faster

* dead
2026-01-13 19:55:32 +09:00
qazal
6cd318e377 viz: add link to graph from sqtt (#14123) 2026-01-13 17:31:03 +09:00
qazal
fd10fd245a viz: cfg tokenizer fix and unit tests (#14121)
* output Ops.BINARY

* failing test for the cfg

* dsl renamed to offset and sz

* add better asserts

* move the note
2026-01-13 15:08:55 +09:00
chenyu
05fcb57696 also return index in Tensor.cummax (#14117)
* also return index in Tensor.cummax

* fix
2026-01-12 22:42:10 -05:00
wozeparrot
7c967399a4 tk: add failing test for fa multidevice (#14116) 2026-01-12 19:11:09 -08:00
George Hotz
330a0b686e assembly/amd: clean up dsl and make type verification strict (#14102)
* assembly/amd: start newdsl

* work

* newdsl upd

* Reg is p nice

* cleaner

* work

* getting clean

* all fields

* more BitFields

* redo the pdfs with dsl2 syntax

* no lit

* cleanups

* more defaults

* fix get and remove crap

* aliases

* ugly but kind of works

* NULL, not rawimm

* clean up defaults

* only dsl

* asm fixes

* lit fixup

* more lit

* cleanups

* olddsl

* single pcode dict

* emu sort of works

* trash test

* global is global

* types property

* reg mods

* fix a few tests

* remove monkey patch

* fixes

* less hacks in tests

* less hacks in tests

* 4 test failures

* hw tests all pass

* fix compare emulator

* fix some tests

* 3 more

* fix and shorten sqtt

* handwritten

* fix validation

* test corrections

* all types validate

* fix dsl2 tests

* fix bugs in disasm

* skips on cdna

* work

* repr with reg[]

* fix bitfield tests

* merge pcodes in dsl

* remove override

* disasm uses inst.types

* simpler
2026-01-13 08:52:16 +09:00
C T
a8c821f45e add Tensor.log10 with test\test_ops.py::TestOps::test_log10 (#14113) 2026-01-12 13:45:47 -05:00
chenyu
6b0a9f5ee6 don't strip sink in to_uops_list [pr] (#14111) v0.12.0 2026-01-12 11:19:03 -05:00
chenyu
cad7feec02 more onnx ops (#14104)
HannWindow, HammingWindow, BlackmanWindow, Hardmax, LpNormalization
2026-01-12 09:11:13 -05:00
nimlgen
635ed2df9d system: use pci.PCI_VENDOR_ID instead of const (#14109) 2026-01-12 15:24:09 +03:00
qazal
6c0f0e29ff Revert "viz: loading... (#14107)" (#14108)
This reverts commit 9347757c2d.
2026-01-12 20:45:37 +09:00
nimlgen
9347757c2d viz: loading... (#14107) 2026-01-12 13:24:24 +03:00
wozeparrot
3a92df66ea feat: bump version to 0.12.0 (#14105) 2026-01-11 21:19:49 -08:00
chenyu
7c234a9c7c wgsl cleanup [pr] (#14103)
refactor common pack functions
2026-01-11 21:23:45 -05:00
George Hotz
91bde927ef assembly/amd: split asm.py into asm.py and disasm.py (#14101)
* split asm.py into asm.py and disasm.py

* split decoder

* move to pcode

* tests
2026-01-12 07:22:02 +09:00
George Hotz
44135e2e84 assembly/amd: always use v_nop in test for rocprof-trace-decoder (#14100)
* assembly/amd: always use v_nop in test for rocprof-trace-decoder

* test touchups
2026-01-12 05:31:58 +09:00
George Hotz
8b1b15aec0 assembly/amd: SQTT support (#14099)
* assembly/amd: SQTT support

* simpler

* cmp wave

* instruction compare

* rocprof decode

* simpler

* no llvm

* no strcmp
2026-01-12 05:07:17 +09:00
nimlgen
8b5ff403fa am: flag successful finalization (#14097)
* am: flag successful finalization

* import
2026-01-11 16:24:53 +03:00
qazal
d8aba24967 amd: use kernel descriptor struct in AMDProgram (#14096) 2026-01-11 18:25:16 +09:00
chenyu
9973a81356 add channels_last to QLinearGlobalAveragePool (#14094)
and other minor cleanups
2026-01-10 18:38:19 -05:00
chenyu
c5492f8f75 cstyle cleanup [pr] (#14093) 2026-01-10 09:44:50 -05:00
nimlgen
d5f954858d viz: show precise timings (#14092) 2026-01-10 16:21:08 +03:00
nimlgen
3e2c05ee9f hevc: decoder as iterator (#14091) 2026-01-10 14:57:56 +03:00
chenyu
35c9701df0 update outdated tests and comments (#14090) 2026-01-10 01:00:48 -05:00
chenyu
92246ea731 update tests, WEBGPU=1 pytest . passes (#14089)
* update tests, `WEBGPU=1 pytest .` passes

* minor update
2026-01-10 00:03:02 -05:00
chenyu
c34c6d9468 fix wgsl packed_store can drop valid (#14088)
* fix wgsl packed_store can drop valid

* fix
2026-01-09 15:22:06 -05:00
chenyu
eacccc5ace more disk assign tests (#14087)
covers more edge cases
2026-01-09 14:14:52 -05:00
chenyu
ed295e74dc don't skip gguf test if ggml is not installed (#14086)
* don't skip gguf test if ggml is not installed

should just let it fail

* fix
2026-01-09 12:05:58 -05:00
chenyu
cff33c8d78 add some disk assign tests (#14085) 2026-01-09 11:50:59 -05:00
chenyu
74fa3c7d09 decomp pow for LVP (#14084)
test failed due to undefined behavior, so use decomp instead
2026-01-09 10:50:28 -05:00
b1tg
0fbc551622 train bert with fp8 (#13874)
* fp8 train

* clean

* lint

* test fix from #13439

* skip first/last layer

* rm __init__, restore unroll <=32 check

* tests

* clean test, remove unused

* multi-gpu test, clean quantize_to_fp8

* remove bert contiguous

* run script

* test: better check

* run script search

* add seed in bert data shuffle

* move script to mi350x folder

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
nimlgen
ba209d6305 am: utc_l1_enable on all sdma inst (#14083) 2026-01-09 17:17:05 +03:00
nimlgen
6b308b89b7 viz: timeline time (#14080)
* viz: timeline time

* less lines

* cut
2026-01-09 16:43:45 +03:00
nimlgen
40f9fa2db4 autogen: new kfd (#14082) 2026-01-09 16:08:17 +03:00
qazal
2917ed1616 roc: propagate decoder errors to main thread (#14081)
* roc: propagate decoder errors to main thread

* types

* add cause
2026-01-09 21:10:45 +09:00
qazal
f3f4d9b387 viz: fix disasm node width (#14079) 2026-01-09 16:37:37 +09:00
anu
c70c112254 fix CUDA=1 disassembly (VIZ=1) by stripping null terminator (#14046)
* fix ptxas disassembly bug

* single '

* move fix to get_bytes

* move rstrip

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2026-01-09 15:19:59 +09:00
qazal
13e5d00d0e viz: exclude comma in register highlight (#14078)
* viz: exclude comma in register highlight

* simplify
2026-01-09 15:10:30 +09:00
qazal
a071adffc0 viz: amdgpu disassembly register highlighting UI (#14059)
* viz: amdgpu disassembly register highlighting

* minor details

* details from IDA

* more details from IDA

* refactor token colors

* move tokenizer to python

* simplify

* minimal tokenizer for registers

* all the operand types
2026-01-09 11:27:09 +09:00
chenyu
b878f9d5a4 reuse Tensor init with const path [pr] (#14076) 2026-01-08 17:49:37 -05:00
chenyu
efcb32f6a9 unique const when requires_grad is set to True (#14075)
* unique const when requires_grad is set to True

* fix pyrender
2026-01-08 16:30:45 -05:00
chenyu
b34c637767 support bfloat16 for CL (#14073) 2026-01-08 14:14:29 -05:00
Garret Castro
16b652302e skip bf16 test if not supported by device (#14070) 2026-01-08 13:37:24 -05:00
nimlgen
3f61a96d79 am: SetSoftMaxByFreq on gfx10+ (#14068) 2026-01-08 17:00:03 +03:00
George Hotz
e7b5d8a434 assembly/amd: more RDNA4 asm (#14062)
* rdna4 more

* asm

* fixes

* assembly/amd: handwritten wmma failing test

* passes

* wmma default hacks

* space

* 0 skips in rdna3/rdna4 disasm

* more RDNA4 tests

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2026-01-08 05:09:37 -08:00
nimlgen
e372c841ba hevc: beam in decode (#14067)
* hevc: beam in decode

* fine

* g
2026-01-08 15:47:16 +03:00
nimlgen
1732a4ec4b am: rework set_clocks (#14065) 2026-01-08 15:33:32 +03:00
nimlgen
f3aceaa08b hevc: fast decoder (#14057) 2026-01-08 15:20:37 +03:00
qazal
309197bca5 assembly/amd: test_roundtrip for cdna/rdna4 (#14066) 2026-01-08 21:03:13 +09:00