1491 Commits

Author SHA1 Message Date
George Hotz
b46229ca51 use shrink in amd_matmul_uop (#13026)
* use shrink in amd_matmul_uop

* colors
2025-10-31 10:43:41 +08:00
wozeparrot
78f7650eec faster tk matmul (#13006) 2025-10-30 19:09:27 -07:00
George Hotz
512513c403 cleanup amd uop matmul (#13025)
* cleanup amd uop matmul

* remove mod

* move that out

* better variable names

* var names

* more

* render fallback

* colors
2025-10-31 10:04:45 +08:00
nimlgen
629b177b66 amd: sqtt works in profile mode (#13019) 2025-10-30 23:48:52 +08:00
nimlgen
4d7a7096c9 am: enable perfmon (#13013)
* am: enable perfmon

* try

* msg
2025-10-30 22:28:36 +08:00
George Hotz
4a741e8364 modernize amd uop matmul (#13011)
* modernize amd uop matmul

* progress

* comment

* more comments

* revert that

* mac cleanups

* fix estimates

* format
2025-10-30 17:02:38 +08:00
wozeparrot
92a87e37e4 fix: fetch_file (#13010) 2025-10-29 22:44:22 -07:00
nimlgen
a6f5b1482e amd: perf counters (#12975)
* amd: perf counters

* sq

* cleaner

* fix

* if enabled

* ruff

* mypy

* counters

* reset

* fix

* no cpu
2025-10-30 00:10:31 +08:00
wozeparrot
d66c997a39 feat: thunderkittens fa2 (#12955) 2025-10-28 11:27:45 -07:00
wozeparrot
24884c6768 fix: don't use KITTENS_HOPPER for 4090 (#12954) 2025-10-27 17:19:53 -07:00
George Hotz
25c2da1579 check SPEC=2 in CI (#12945)
* check SPEC=2 in CI

* split SPEC=2

* fast enough
2025-10-27 21:53:57 +08:00
nimlgen
f4da94af28 system: reset is a method of pcidevice (#12936) 2025-10-27 16:21:10 +08:00
wozeparrot
6b54378eba working kitten matmul (#12935) 2025-10-26 23:40:49 -07:00
George Hotz
db5c918215 source extra/cl_android.sh to fix opencl on android 2025-10-26 15:27:51 +08:00
qazal
2f95c10702 remu new instructions / use volatile in emulator tests (#12862)
* remu new instructions

* start moving to volatile

* test_simple works

* test_exec_mov works and lid is still here

* test_exec_cmp_vopc

* clang did s_mov_b32 exec_lo, 1

* don't hardcode v1

* support volatile in tests

* hw_test passes

* only the volatile version

* subrev saturating behavior
2025-10-23 11:13:43 +08:00
chenyu
c5cee74706 remove BLOCK_REORDER (#12854)
not used
2025-10-21 19:10:14 -04:00
b1tg
60d7e232f2 cuda fp8 (#12782)
* cuda fp8

* tensor core

* tc test

* clean

* clean pm
2025-10-21 15:05:25 -04:00
chenyu
8baa61bd67 use torch 2.9 and its Muon in test (#12773)
* use torch 2.9 and its Muon in test

* relax and disable
2025-10-21 13:35:17 -04:00
chenyu
f51f9aaa16 muon ns_params -> ns_coefficients (#12850)
match the official torch one
2025-10-21 12:35:52 -04:00
nimlgen
1ad6598963 amd: trace all instructions (#12831) 2025-10-21 20:52:24 +08:00
George Hotz
cad3ada909 tinygpu: build with SIP off works 2025-10-20 09:11:09 +08:00
nimlgen
59784a5972 amd: ensure ts is written (#12794) 2025-10-19 23:55:49 +08:00
George Hotz
89e7f2fa00 mmapeak: gfx1103 support 2025-10-19 16:57:28 +08:00
George Hotz
617614beb7 add mi350x support to mmapeak (#12784) 2025-10-19 16:11:07 +08:00
nimlgen
037f6e8fa0 qcom: ioctl for 7xx (#12777) 2025-10-18 20:33:14 +08:00
geohotstan
5d209ee7ec onnx helper intermediate node output validation (#12740)
* start

* update comments

* good

* add comments and better printing

* done
2025-10-16 11:17:47 -04:00
nimlgen
3aa2277b8f nv: usb4 (#12696)
* hackish

* prog

* match

* l

* simpler

* refactor

* not osx

* apple things

* tiny changes

* fix mask

* match fix

* nn
2025-10-16 20:11:19 +08:00
wozeparrot
cc2dfe22f5 tinyfs: fetch file utility (#12719) 2025-10-15 23:38:56 -07:00
George Hotz
4a151e7533 make xcode signing happy, waiting for entitlement (#12712) 2025-10-16 10:20:34 +08:00
Daniel
d65bd669f8 update tiny torch backend hook (#12575)
* update the backend to fix torch deprecation warning

* use param_hook to avoid full backward hook needlessly firing on inputs which do not require gradients

* fix indentation

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-15 14:02:33 -04:00
Christopher Milan
0aabc1e938 Mesa NIR backend (NAK/LLVMpipe) (#12089)
* nak works

* TestOps::test_add works

* testop has no crashes

* fix bool casts

* fix typo

* add disassemble

* RANGE and locals/regs

* simplify NAKCompiler

* disass cleanup

* cleanup nir codegen

* almost all tests passing

* cleanup notes in extra/

* old notes

* only import nak if NIR=1

* fix new SPECIAL syntax

* fix local/shared memory

* more tests passing

* add DEFINE_VAR support

* llvmpipe kinda works

* diskcache

* some mypy stuff

* lvp passing test_ops.py

* fix imports

* actually fix imports

* remove 'stdout'

* fix llvm import

* fix mypy issues

* nicer errors

* simpler test_dtype skips

* test lvp in CI

* fix github action syntax

* fix more actions typos

* switch to mesa 25.1.0

* diskcache_put

* better generation for lvp nir_options

* b64encode shader blobs

* Revert diskcache changes

This reverts commits 930fa3de8a and 8428c694b3.

* general cleanup

* better error messages

* fix llvm import

* fix windows tests

* link with libm and libgcc_s

* fix some errors

* dont check for 'float4'

* NIR uses pointer arithmetic

* use tinymesa

* bump tinymesa

* bump tinymesa again

* update lvp nir_options

* print nir shader with DEBUG

* simplify LVPCompiler

* more tests

* "gated" STORE

* NAK is cacheable

* more tests

* all tests pass locally for NAK

* test autogen in CI

* autogen deps

* more deps

* fix uop_gc

* fix macos

* mypy

* save 2 lines

* save two more lines

* save 1 line

* save 4 lines

* save more lines

* Revert "save more lines"

This reverts commit dd3a720c5a.

* save more lines

* fix LVP on windows

* refactor

* reorganize some code

* refactor lib_gpu

* move LVP check

* out of order loads

* remove support.mesa

* bump tinymesa version

* simplify LVP jit

* macos

* macos ci

* shell: bash

* testing

* more testing

* compute brew prefix

* stupid typo

* actually fix

* lib

* stdout on macos

* inline gallivm_compile_module

* Revert "inline gallivm_compile_module"

This reverts commit b65983b151.

* elf macos

* semicolon

* inherit from CPULLVMCompiler

* ruff

* disas test

* fix libm linking

* default is fine actually

* arm works

* add elf loader link test

* fix NAK beam

* pylint is too smart by half

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-10-15 17:38:33 +08:00
nimlgen
aa81bde150 amd: usb4/thunderbolt on macs (#12641)
* tbgpu

* works

* cleaner

* this

* zero size

* h

* fix

* simpler

* prio over usb

* c

* not needed

* linter

* this way

* mappings

* mypy

* mypy

* mypy 2

* nn
2025-10-15 13:02:01 +08:00
wozeparrot
f228c03f9f fetch raid from cloud (#10799)
* feat: initial tinyfs device

* feat: don't allow compute on tinyfs device

* feat: tensor helpers to load and store

* feat: bufferview for tinyfs

* fix: keep copy sizes correct

* fix: recv large

* clean: unneeded

* feat: comment

* clean: unneeded

* clean: remove

* clean: remove

* feat: get request tag

* feat: rename to cloud

* feat: send request_id

* feat: start computing tree

* feat: compute store tree on this side

* feat: jank chunked load

* feat: more debugging

* feat: rename to just load and store

* feat: correct chunk count

* fix: fix load for < 1mb

* feat: comments

* feat: don't truncate on block devices

* feat: better way of testing block device

* feat: don't need to pad that much

* feat: connect to nodes directly on load

* feat: cache connections

* feat: don't hard code chunk size

* feat: close mmap when closing file handle

* feat: don't overwrite stuff on disk if storing from disk

* clean: debug print

* fix: close mmap

* feat: await workers

* feat: fast copy from tinyfs to disk

* feat: don't copy to device on last

* feat: use single socket per device

* feat: raid in tinyfs

* clean: remove import

* clean: type

* feat: maintain single event loop

* feat: lower worker count

* feat: use connection pool

* feat: fetch mapping in its own process

* fix: release lock

* feat: don't fetch if exists

* feat: req id only on stores

* feat: always fetch

* fix: rangeify

* feat: allow specifying raid root

* fix: dealloc buffer

* feat: start support non 0 offset

* clean: use cleaner

* feat: don't pass to threadpool

* clean: typing
2025-10-14 07:53:55 -07:00
George Hotz
fb61f3519f remove assign contiguous hack (#12659)
* remove assign contiguous hack

* remove bad contiguous usage in torch backend

* assign
2025-10-14 16:42:14 +08:00
qazal
cd6aeebfee sqtt: osx decoder installer (#12637) 2025-10-13 17:26:12 +08:00
nimlgen
89be3590aa amd: sqtt on gfx12 (#12564)
* amd: sqtt on gfx12

* cleaner

* thi

* and this

* ops

* ugh

* back

* rm this

* rm
2025-10-10 17:54:14 +08:00
wozeparrot
f12e2a75db feat: add thunderkittens (#12590) 2025-10-10 00:32:33 -07:00
nimlgen
1309cea247 rocprof parser in extra (#12569)
* rocprof parser

* viewer

* vw

* skip
2025-10-10 14:56:42 +08:00
chenyu
c8dfd10257 ShapeTracker.real_strides -> is_expanded [pr] (#12579)
only keep the used part
2025-10-09 22:52:45 -04:00
George Hotz
9b66c2b0b7 fix weekly commits table (i didn't know we linted extra) 2025-10-10 09:23:33 +08:00
George Hotz
658b96cbfb weekly commits table 2025-10-10 09:15:41 +08:00
nimlgen
a11b686c71 amd: sqtt for all gfx11 (#12546)
* amd: general sqtt for gfx11

* target

* ops

* no gfx12 here
2025-10-09 17:04:06 +08:00
chenyu
ae51bdd06a remove trivial use of RANGEIFY flag (#12550)
some tests need update still
2025-10-09 02:29:38 -04:00
George Hotz
2653147cb7 delete the lowerer (#12526) 2025-10-08 21:58:18 +08:00
chenyu
e701106a64 remove FUSE_ARANGE (#12511)
it was the default already
2025-10-08 04:54:07 -04:00
nimlgen
4a756a37d8 amd: support rocm7 (#12502)
* amd: support rocm7

* mock
2025-10-08 14:30:39 +08:00
George Hotz
514d2a0774 merge tagless reshapes (#12474)
* merge tagless reshapes

* cleanup
2025-10-07 13:57:58 +08:00
George Hotz
b4509fba31 thundermittens (#12471)
* thundermittens

* give device a type
2025-10-07 11:47:39 +08:00
George Hotz
0f25b4b289 move frontend dir to nn [pr] (#12470) 2025-10-07 10:42:22 +08:00
hooved
0f804c9a83 Stable Diffusion model init for mlperf (#12314)
* include clip pr diff

* updated unet and sd init

* dehardcode default device

* revert beam hang workaround

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-02 02:28:41 -04:00