Commit Graph

8216 Commits

Author SHA1 Message Date
George Hotz
74d98eafb8 add onnx frontend stub [pr] (#9558) 2025-03-24 12:24:34 +08:00
George Hotz
de7d6cec3a hotfix: DEBUG 5 prints the ast 2025-03-24 11:43:11 +08:00
chenyu
ba41076e94 update embedding test to not use dtypes.long [pr] (#9556) 2025-03-23 21:33:38 -04:00
chenyu
c965f4c20b update bert config (#9555)
BEAM 4->5 for green, 2% faster
use AMD driver instead of AM for red, 5% faster
2025-03-23 16:14:41 -04:00
chenyu
d734e24c01 minor WEBGPU_PATH cleanup [pr] (#9552)
also mypy recognizes `sys.platform == 'win32'` but does not recognize it if wrapped inside a helper...
2025-03-23 09:10:02 -04:00
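The mypy behavior noted above can be seen in a minimal sketch (the helper name is hypothetical, not from tinygrad): mypy narrows on a literal `sys.platform` comparison, but the same check wrapped in a function is opaque to its platform narrowing.

```python
import sys

# mypy understands a literal sys.platform comparison and treats the
# branch as Windows-only code:
if sys.platform == 'win32':
    SEP = '\\'
else:
    SEP = '/'

# ...but the identical check behind a helper (hypothetical name) is not
# recognized, so mypy still type-checks the "Windows" branch against the
# current platform:
def is_win32() -> bool:
    return sys.platform == 'win32'

SEP2 = '\\' if is_win32() else '/'
```

At runtime both forms behave the same; only the static narrowing differs.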
Ahmed Harmouche
7ce7fe0574 Refactor webgpu_dawn lib finding (#9547)
* Refactor webgpu_dawn lib finding

* Fix ruff
2025-03-23 08:23:29 -04:00
uuuvn
c631c72f22 HCQ: Increment timeline signal before submitting (#9550)
`AMDComputeQueue.__del__` frees `hw_page`, which is safe because
`AMDAllocator._free` calls `self.dev.synchronize()`, which is supposed
to wait for execution of the IB to finish. However, that wait does not
happen if the AMDComputeQueue is dropped right after submit, before the
timeline signal is incremented, which it is in most places. This leads
to a race when `.bind()` is also used (required for multi-XCC, because
a bug in the MEC firmware treats all PACKET3_PRED_EXECs outside IBs as
if they had an EXEC_COUNT of zero).
2025-03-23 18:30:38 +07:00
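The ordering issue described above can be sketched with a toy model (hypothetical names, not tinygrad's HCQ API): `synchronize()` waits until a completion signal reaches the expected timeline value, so the expected value must be bumped before submitting, or a synchronize racing with submit can return while work is still in flight.

```python
import threading, time

class FakeHCQDevice:
    """Toy model of an HCQ-style timeline signal (illustrative only)."""
    def __init__(self):
        self.timeline_value = 0   # completions synchronize() must wait for
        self.signal = 0           # value the "hardware" writes on completion
        self._lock = threading.Lock()

    def submit(self, work):
        # The fix from this commit, in miniature: increment the timeline
        # BEFORE handing work to the hardware. If this happened after
        # submit, a __del__ -> synchronize() racing with us could return
        # while the "IB" is still executing.
        self.timeline_value += 1
        target = self.timeline_value
        def hw():
            work()
            with self._lock:
                self.signal = max(self.signal, target)
        threading.Thread(target=hw).start()

    def synchronize(self):
        # Spin until every submitted packet has signaled completion.
        while True:
            with self._lock:
                if self.signal >= self.timeline_value:
                    return
            time.sleep(0.001)

results = []
dev = FakeHCQDevice()
dev.submit(lambda: (time.sleep(0.05), results.append("ib done")))
dev.synchronize()  # returns only after the in-flight "IB" has finished
```

With increment-after-submit, `synchronize()` could observe `signal >= timeline_value` trivially and free memory the hardware is still using.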
nimlgen
d5667419af am: move out pte creation logic (#9548)
* am: move out pte creation logic

* emu

* ops
2025-03-23 18:29:10 +07:00
geohotstan
309afa20b7 add Tensor.max_unpool2d (#9518)
* why does max_unpool2d feel slower than out.gradient ...

* slightly cleaner

* what happened to ruff

* need to think about this some more

* slightly faster now?

* clean up, 1 more failing edge case

* ok good

* working TINY_BACKEND

* nit doc wording

* retry CI
2025-03-22 12:11:33 -04:00
quortus
bdd44d4255 Fix DSP transcendentals (#9542) 2025-03-22 11:08:18 +08:00
Ignacio Sica
eddafb84e5 Bugfix for TC=3 (#9464)
* wrong but uses less shared

* for size 8 tc1 with devectorize in 0 loads into local before wmma and works

* improvements over tc1 devectorize

* fix tc=3

* works for handcoded tc opts

* clean bugfix tc=3

* fix

* revert changes
2025-03-21 16:43:42 -07:00
chenyu
6da78164f9 assert Kernel ast.op to be Ops.SINK [pr] (#9539)
rest of the code assumes self.ast is defined anyway
2025-03-21 18:09:44 -04:00
chenyu
c33679c47b increase size in test_multinomial_counterexample (#9540)
should be less flaky
2025-03-21 17:46:52 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
Francis Lata
e6389184c5 update comment for retinanet dataloader implementations (#9534)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
chenyu
ee3d313b34 Revert "update ruff to 0.11.2 (#9531)" (#9535)
This reverts commit d8d65e2747.
2025-03-21 14:52:25 -04:00
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
b1tg
58206fa8a9 add amd llvm compiler (#9519)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
chenyu
d8d65e2747 update ruff to 0.11.2 (#9531)
0.11.2 fixed the false positive from 0.11.1. Also pinned the version in setup for now to prevent a ruff upgrade from breaking CI
2025-03-21 10:32:59 -04:00
qazal
ee3ed73ed1 add reorder_view matcher to scheduler [pr] (#9528) 2025-03-21 17:46:20 +08:00
George Hotz
8e555c586c switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
nimlgen
a35b0a88bf am: just rename and reorder ip init funcs (#9504) 2025-03-21 15:57:32 +08:00
nimlgen
8a131ab271 am: allow allocations as small as a page (#9523)
* am: fix allocs

* bettermsg

* comment

* next time
2025-03-21 15:53:32 +08:00
Sieds Lykles
3ad3ac4d1e Change dtypes.int to dtypes.ints (#9517) 2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914 pin ruff to 0.11.0 in CI (#9520)
0.11.1 had a bug (https://github.com/astral-sh/ruff/issues/16874) that breaks CI
2025-03-20 13:12:50 -04:00
George Hotz
3c5161b4cb add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX

* do mask properly

* more validation

* correct

* fix gated

* add CAST support to vmin/vmax

* fix ptx and image

* ptx no diff

* upat.index also stays

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7 remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
qazal
9302738263 hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
* hotfix: more consistent wgsl.py spacing + cleanups [pr]

* free things up
2025-03-20 11:07:15 +08:00
George Hotz
68053d0510 dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338 add UPat.or_casted [pr] (#9513) 2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3 place masks in INDEX for TestGatedStoreRewrite [pr] (#9512) 2025-03-20 09:46:53 +08:00
b1tg
bd731a8624 AMDCompiler refactor (no_comgr prereq) (#9497)
* add amdgpu_disassemble to helpers

* refactor hip compiler

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
chenyu
189f62d44f add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0  1000it/s]' != ' 1.00/10.0  1.00kit/s]'`
2025-03-19 12:08:46 -04:00
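The failing assertion above comes from picking an SI prefix before rounding, so a rate just under 1000 formats as `1000it/s` instead of rolling over to `1.00kit/s`. A minimal sketch of rounding first (hypothetical helper, not tqdm's actual code):

```python
def fmt_rate(rate: float) -> str:
    """Format a rate with an SI prefix, rounding BEFORE choosing the
    prefix so e.g. 999.96 displays as 1.00kit/s rather than 1000it/s."""
    for prefix, scale in (("G", 1e9), ("M", 1e6), ("k", 1e3)):
        if round(rate / scale, 2) >= 1.0:
            return f"{rate / scale:.2f}{prefix}it/s"
    return f"{rate:.2f}it/s"

print(fmt_rate(999.96))  # 1.00kit/s, not 1000.00it/s
```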
nimlgen
a5c971ff3a am: prereqs for rdna4 1/n (#9495)
* am: ip_ver rename for acc

* am: refactor this

* fix version

* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7 extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
George Hotz
865f23dd7b olmoe memory usage cleanups 2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2 fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
b1tg
1d71436e6a use libllvm19 in ci (#9494)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55 nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb enable more webgpu tests (#9502)
OSX has a larger buffer count limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103 simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]

* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145 nn.state docs cleanup (#8332)
* doc cleanup

* extension cleanup

* manual definition

* bring back accept_filename for gguf_load

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa olmoe touchups (#9499)
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25 JIT OLMoE (#9396)
* jit the forward

* might timeout, idk just send it

* this is dumb

* naive bitonic lol

* idk if this is correct, but that squeeze before is definitely not

* vectorized bitonic sort, but still slow

* yay 1 layer is correct

* alright its pretty good

* good enough

* rerun CI

* nit improve comment
2025-03-18 14:49:02 -04:00
Ignacio Sica
5c56cac0a0 MI300 mfma support (#9417)
* add f16/f32 mfma support for MI300

- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer

* minor cleanup

* minor cleanup

* add mfma emulation task to ci

* add back todo

* hotfix: comment

* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
hooved
5500887eed improve reproducibility of WebGPU CI puppeteer test (#9496)
* try to make CI test fail with slow JS import

* prevent race between model import and reference

* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3 do not view_left assign + elementwise sources always have a shape [pr] (#9491) 2025-03-18 17:42:51 +08:00