Commit Graph

1741 Commits

Author SHA1 Message Date
wozeparrot
7e54992bf6 fp8 llama (#15588)
Co-authored-by: qazal <qazal.software@gmail.com>
2026-04-04 18:24:57 -07:00
qazal
f7aed180e4 viz/cli: add Other row in profiler (#15600) 2026-04-04 22:40:53 +09:00
Christopher Milan
645d45d968 DEV has arch (#15577)
Co-authored-by: Comma Device <device@comma.ai>
2026-04-03 19:17:19 -04:00
nimlgen
237084b276 remote: support several hosts (#15585)
* remote: support several hosts

* f
2026-04-03 11:22:15 +03:00
Christopher Milan
0ed8d9271d Renderers accept Target or nothing (#15590) 2026-04-03 01:09:41 -04:00
nimlgen
046c3f1240 mlx: add loopback with send/recv (#15583) 2026-04-02 18:15:46 +03:00
qazal
fefb0ebc2a gemm/asm: fp8 cleanups (#15580)
* normal gemm here

* s/dtypes.fp8e4m3/FP8_DTYPE

* gemm_bw

* device UOp stays NULL
2026-04-02 19:02:38 +09:00
chenyu
1aa04eab08 simple CreationMixin (#15567)
start with full_like, zeros_like, ones_like
2026-04-01 23:00:56 -04:00
nimlgen
da12c2ea16 better install msg (#15570) 2026-04-01 20:09:37 +03:00
qazal
9275f283e5 viz: update flag and display names (#15566)
* rename to occ, other_simd

* se pkts

* match viz cli tool in names
2026-04-01 21:48:37 +09:00
Christopher Milan
acf239e4d2 specify renderer in DEV, <dev>_<ren>=1 is deprecated (#15551) 2026-03-31 18:35:14 -04:00
nimlgen
477d194630 hipcomgr and tinygpu scripts (#15549) 2026-03-31 20:07:52 +03:00
qazal
a15345a53e viz/cli: improve --help message (#15546)
* viz/cli: improve --help message

* not the default

* more work

* -s

* respect colored
2026-03-31 22:31:33 +09:00
qazal
8feb8edc68 gemm/asm: add fp8 support to cdna asm_gemm (#15542)
* work

* hmm, mixins

* rhs_transposed

* also fix the dtype

* check for hipcc

* Exception

* select dev

* default
2026-03-31 19:32:54 +09:00
nimlgen
ceb63c8c2f new bundle id (#15307)
* new bundle id

* new profiles
2026-03-31 12:16:03 +03:00
qazal
bc866a93f0 viz: rename exec to sqtt (#15527)
* viz: rename exec to sqtt

* more
2026-03-31 08:06:51 +09:00
nimlgen
9583489068 add mlx driver to extra (#15526)
* mlx driver

* x

* simpler
2026-03-30 20:28:49 +03:00
qazal
36a925e2a2 viz: color wmma, one color map for cli and web (#15519)
* viz: color wmma, one color map for cli and web

* op_type

* like uops

* mypy cli
2026-03-29 04:53:01 +09:00
qazal
266fb07721 viz: show exec duration (#15484)
* duration

* handwritten tests

* rdna3 pickle

* rdna4 pickle

* asserts

* rm that

* wmma work

* r4

* this shows the overlap well

* ohh okay it goes back

* are ds_load and ds_store different queues on RDNA4?

* print msg, v_mul_lo_u32 is 4 cycles?

* discover

* wmma something

* wmma comment

* less

* less

* better comments

* work

* inst st

* delay column

* better cli

* emit_alt

* update test_handwritten

* work
2026-03-28 22:48:59 +09:00
qazal
ccaa6bfc19 viz/cli cleanups (#15511)
* one less function

* work

* layout

* better handling of rewrites

* mypy passes
2026-03-28 08:50:38 +09:00
qazal
dcc2a5d23b viz/cli: simplify to --source and --item flags (#15510)
* viz/cli: simplify to --source and --item flags

* update viz cli test
2026-03-28 04:46:39 +09:00
qazal
586c49642f viz/cli: test in CI (#15501)
* viz cli work

* baseline test

* make cli test work without subprocess

* more checks

* check itrace

* s/return/return None

* change

* minimal

* colored
2026-03-27 06:47:15 +09:00
qazal
ec5b7a249e viz: refactor sqtt timeline builder (#15494)
* viz: refactor sqtt timeline builder

* barrier maps to waves

* clean up cli
2026-03-26 21:16:15 +09:00
Christopher Milan
bc180a963c deprecate <dev>=1 in favor of DEV=<dev> (#15467)
* start work on target

* add test

* update actions to use DEV

* update docs

* update readmes

* tests need that too

* update example

* update tests (comments)

* fix that test

* ruff

* mypy

* oops

* remove getenvs

* don't add Target yet

* and the test

* lint

* and docs

* more stuff

* assert

* few more fixes

* test assert
2026-03-26 03:48:03 -04:00
nimlgen
9d2d0774b4 remote: disk copies (#15482)
* remote: disk copies

* linter

* r

* nv

* x
2026-03-25 22:14:25 +03:00
qazal
c973b508b8 viz/cli: pass ctrlc (#15470) 2026-03-25 21:13:28 +09:00
qazal
1b3d00d6ac viz/cli: remove --offset and --limit flags (#15439)
* work

* also no more no-color

* reorder

* update llama

* sqtt readme

* itertools

* rm that

* signals back
2026-03-25 09:52:27 +09:00
Christopher Milan
d5320a9ddf QCOM cleanups (#15435) 2026-03-23 22:18:38 -04:00
George Hotz
85dee83f5d amd flash attention cleanups + emulator fixes (#15431)
* amd flash attention cleanups

* simpler

* params

* fix emulator bugs

* fix idiv bug

* remove that test

* more emu fixes
2026-03-24 10:10:46 +08:00
qazal
109472c37e sqtt: new s_barrier pickles, handle rdna4 barriers in emulator (#15437) 2026-03-24 03:25:28 +09:00
nimlgen
fa4cdb422e memplan on linears (#15422)
* memplan

* test

* x

* arenas

* correct

* set any size

* ugh

* make hevc happy

* x

* x

* held

* rm old

* del

* x

* fu

* f

* cl

* cl

* ok
2026-03-23 19:50:16 +08:00
George Hotz
c62dea6881 ai slop flash attention (it works) (#15401)
* ai slop flash attention (it works)

* speed up, 2 TFLOPS + 7 GB/s

* simpler

* simpler

* optimize

* faster

* warp shuffle

* sqtt: link dispatch to exec (#15396)

* sqtt packet linking infra

python

* javascript

* ~doubly linked list

* ui works

* work

* exec can also highlight the pc, coloring work

* more work

* rm sqtt/model.py, doesn't need to be upstreamed

* viz: no context enters in cli, update llama profile (#15404)

* removed unused named arg in rules [pr] (#15414)

* viz: sqtt printer in viz/cli.py (#15411)

* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long

* remove *0 hack in sign, gradient materializes zeros for unconnected nodes (#15416)

Amp-Thread-ID: https://ampcode.com/threads/T-019d1612-6322-706b-a94d-a812400a55cb

Co-authored-by: Amp <amp@ampcode.com>

* works

* cnt=20

* revert that

* uop slice tests

* simpler

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: gg <ggordbegli@gmail.com>
Co-authored-by: Amp <amp@ampcode.com>
2026-03-23 16:15:10 +08:00
qazal
fd3559103b viz/cli: better error message for empty itrace (#15425) 2026-03-23 15:50:20 +09:00
qazal
c7b18e6108 viz: sqtt printer in viz/cli.py (#15411)
* work

* sqtt timeline in CLI

* format all printers nicely

* s/Showed/Printed

* ansistrip

* sys.exit

* keep colors in list

* work from amd_copy_matmul

* has_more always gets returned

* linter

* don't print colors

* more colors

* wow this is so deep

* work

* minor details

* selected

* improve progress bar

* remove it

* 22, global_load_vaddr is so long
2026-03-23 00:17:05 +09:00
qazal
2363bceb47 viz: no context enters in cli, update llama profile (#15404) 2026-03-22 05:47:02 +09:00
George Hotz
c13d9d29ff add SHAPED_WMMA (#15400)
* add SHAPED_WMMA

* shaped wmma

* less bad
2026-03-21 16:16:03 +08:00
George Hotz
41a9b09683 minimal vec in amd_copy_matmul (#15398)
* minimal vec in amd_copy_matmul

* unified

* unify

* reshape/permute

* cleanups

* simpler

* move index

* cleanups

* more shared
2026-03-21 14:57:21 +08:00
qazal
71ccc69c52 FP8=1 llama works again, hipcc can run on macos (#15394)
* hipcc macos shim

* is_dtype_supported opens devices less
2026-03-20 23:43:15 +09:00
George Hotz
1a2a203f48 add wmma support to amd_copy_matmul (#15384)
* add wmma support to amd_copy_matmul

* 15 TFLOPS and merged

* unify

* simpler

* simpler

* simpler

* cleanups

* TM/TN is the full regs

* comments

* WAVES_PER_SH + SQTT_EVENT

* Add WAVERDY support

* no split warp

* 3 range
2026-03-20 19:02:19 +08:00
chenyu
c491345766 pass device into Tensor._frompy (#15385)
* pass device into Tensor._frompy

with this, canonicalize_device is the only usage of Device in tensor.py

* export_model.py
2026-03-20 05:09:01 -04:00
chenyu
da1700e16b dtypes.index -> dtypes.weakint (#15377) 2026-03-20 01:08:46 -04:00
George Hotz
4091d37e8e flat llama step work (#15355)
* flat llama step work

* fp8 support

* blacklisted matmul

* chestertons fence
2026-03-20 09:06:12 +08:00
George Hotz
70dad9d642 add PING to RemoteCmd (#15371)
* add PING to RemoteCmd

* cleanup
2026-03-19 18:57:40 +08:00
nimlgen
ff004d2114 remote: fix mmio (#15347) 2026-03-18 18:20:39 +08:00
George Hotz
6e196195d8 add test for flat llama (#15327)
* add test for flat llama

* simpler

* back to split w1/w3

* env

* still too much ram

* invalid
2026-03-18 15:16:33 +08:00
nimlgen
0315faf938 remote bench (#15331) 2026-03-18 14:03:51 +08:00
wozeparrot
b45edeb965 fix: rand supports large tensors (#15329) 2026-03-17 15:45:41 -07:00
qazal
00817cf65e viz: all tests can run on the NULL device (#15328)
* remove that

* move to test_viz

* get_cfg

* do not use os.environ

* hm

* it's always on NULL

* import renderer

* no import *
2026-03-18 04:14:20 +09:00
nimlgen
0a641ce17d system: remote (#15318)
* system: remote

* listen

* print

* fix

* minor
2026-03-17 19:25:37 +08:00
nimlgen
a50fdb0528 nvcc macos (#15308)
* fix nvcc install macos

* um

* arm

* per

* tm
2026-03-17 17:25:33 +08:00