Commit Graph

8417 Commits

Author SHA1 Message Date
Francis Lata
aebccf93ac revert losses changes 2025-03-21 20:20:36 +00:00
Francis Lata
3f0134156e Merge branch 'master' into retinanet_mlperf 2025-03-21 20:05:04 +00:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
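For context on what `sigmoid_focal_loss` computes: a minimal NumPy sketch of the focal loss from the RetinaNet paper (Lin et al., "Focal Loss for Dense Object Detection"), purely illustrative and not the tinygrad implementation in this commit:

```python
# Minimal sketch of sigmoid focal loss; illustrative only, not the
# tinygrad implementation added in #9536.
import numpy as np

def sigmoid_focal_loss(logits: np.ndarray, targets: np.ndarray,
                       alpha: float = 0.25, gamma: float = 2.0) -> np.ndarray:
  p = 1 / (1 + np.exp(-logits))                                 # sigmoid probabilities
  ce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))   # binary cross-entropy
  p_t = targets * p + (1 - targets) * (1 - p)                   # prob of the true class
  alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
  return alpha_t * (1 - p_t) ** gamma * ce                      # down-weight easy examples

logits, targets = np.array([2.0, -1.0]), np.array([1.0, 0.0])
print(sigmoid_focal_loss(logits, targets))
```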
Francis Lata
64cff4f41d Merge branch 'master' into retinanet_mlperf 2025-03-21 19:20:03 +00:00
Francis Lata
e6389184c5 update comment for retinanet dataloader implementations (#9534)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
Francis Lata
3d893da3a6 Merge branch 'master' into retinanet_mlperf 2025-03-21 18:56:01 +00:00
chenyu
ee3d313b34 Revert "update ruff to 0.11.2 (#9531)" (#9535)
This reverts commit d8d65e2747.
2025-03-21 14:52:25 -04:00
Francis Lata
5408eabef4 remove duplicated files from test 2025-03-21 18:41:27 +00:00
Francis Lata
95bb6a9d06 Merge branch 'master' into retinanet_mlperf 2025-03-21 18:32:39 +00:00
chenyu
b46b8ee15e add a flag to log when beam surpassed max limit [pr] (#9533) 2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
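One bullet above mentions `generate_anchors`; as a rough sketch of RetinaNet-style anchor generation for a single feature-map level (the scales, ratios, and signature here are illustrative assumptions, not the dataloader's actual code):

```python
# Rough sketch of RetinaNet-style anchor generation for one feature level;
# scales/ratios/signature are assumptions, not the values from #9442.
import itertools, math

def generate_anchors(stride: int, sizes=(32,), ratios=(0.5, 1.0, 2.0), feat_hw=(2, 2)):
  anchors = []
  for y, x in itertools.product(range(feat_hw[0]), range(feat_hw[1])):
    cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor center in image coords
    for size, ratio in itertools.product(sizes, ratios):
      w, h = size * math.sqrt(1 / ratio), size * math.sqrt(ratio)  # area stays size**2
      anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))  # x1, y1, x2, y2
  return anchors

print(len(generate_anchors(stride=8)))  # 2*2 positions * 1 size * 3 ratios = 12
```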
Francis Lata
2447fad0be Merge branch 'master' into retinanet_mlperf 2025-03-21 17:04:45 +00:00
Francis Lata
7939186e7d cleanups and adjust learning rate for fp16 2025-03-21 17:03:07 +00:00
Francis Lata
da97696498 end BENCHMARK after first eval 2025-03-21 15:18:08 +00:00
b1tg
58206fa8a9 add amd llvm compiler (#9519)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
chenyu
d8d65e2747 update ruff to 0.11.2 (#9531)
0.11.2 fixed the false alert from 0.11.1. Also pinned the version in setup for now to prevent ruff upgrades from breaking CI
2025-03-21 10:32:59 -04:00
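Pinning a dev tool in packaging metadata is a one-liner; a generic sketch (the extra name and package layout here are hypothetical, the real tinygrad `setup.py` may organize this differently):

```python
# Generic sketch of pinning a linter in setup.py extras; the extra name
# "linting" and the exact pin are hypothetical.
from setuptools import setup

setup(
  name="example",
  version="0.0.1",
  extras_require={"linting": ["ruff==0.11.2"]},  # exact pin shields CI from upstream regressions
)
```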
qazal
ee3ed73ed1 add reorder_view matcher to scheduler [pr] (#9528) 2025-03-21 17:46:20 +08:00
George Hotz
8e555c586c switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
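For the unsigned/unsigned part, the standard affine uint8 scheme is worth recalling; a generic sketch of quantize/dequantize, not the kernel change in this commit:

```python
# Generic affine uint8 quantize/dequantize sketch; illustrates an
# unsigned scheme, not the actual change in #9527.
import numpy as np

def quantize_u8(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
  return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize_u8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
  return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.5, 0.0, 0.7], dtype=np.float32)
q = quantize_u8(x, scale=0.01, zero_point=128)
print(q, dequantize_u8(q, 0.01, 128))  # round-trips to within one scale step
```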
nimlgen
a35b0a88bf am: just rename and reorder ip init funcs (#9504) 2025-03-21 15:57:32 +08:00
nimlgen
8a131ab271 am: allow allocations as small as a page (#9523)
* am: fix allocs

* bettermsg

* comment

* next time
2025-03-21 15:53:32 +08:00
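Allowing allocations at page granularity usually comes down to rounding requests up to the page size; a generic sketch of the arithmetic, not the am driver code:

```python
# Generic sketch of page-granularity allocation sizing; not the am driver code.
PAGE_SIZE = 0x1000  # 4 KiB, the usual smallest mappable unit

def round_up_to_page(size: int) -> int:
  assert size > 0, "allocation must be non-empty"
  return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)  # next multiple of PAGE_SIZE

assert round_up_to_page(1) == 0x1000        # one byte still consumes a page
assert round_up_to_page(0x1000) == 0x1000   # exact multiples are unchanged
assert round_up_to_page(0x1001) == 0x2000
```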
Sieds Lykles
3ad3ac4d1e Change dtypes.int to dtypes.ints (#9517) 2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914 pin ruff to 0.11.0 in CI (#9520)
0.11.1 had a bug (https://github.com/astral-sh/ruff/issues/16874) that broke CI
2025-03-20 13:12:50 -04:00
George Hotz
3c5161b4cb add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX

* do mask properly

* more validation

* correct

* fix gated

* add CAST support to vmin/vmax

* fix ptx and image

* ptx no diff

* upat.index also stays

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
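Bounds validation of an index expression typically works off its provable value range; a toy sketch (the vmin/vmax idea mirrors the bullets above, everything else is assumed for illustration):

```python
# Toy sketch of bounds-checking an index by its provable value range; the
# vmin/vmax naming mirrors #9503, the rest is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class IdxExpr:
  vmin: int  # smallest value the expression can take
  vmax: int  # largest value the expression can take

def validate_index(idx: IdxExpr, size: int) -> None:
  if idx.vmin < 0 or idx.vmax >= size:
    raise IndexError(f"index range [{idx.vmin}, {idx.vmax}] escapes buffer of size {size}")

validate_index(IdxExpr(0, 15), size=16)    # ok: provably in bounds
# validate_index(IdxExpr(0, 16), size=16)  # would raise: vmax can reach 16
```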
qazal
0b20f91ce7 remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
Francis Lata
74ad538f9d Merge branch 'master' into retinanet_mlperf 2025-03-20 03:16:38 +00:00
qazal
9302738263 hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
* hotfix: more consistent wgsl.py spacing + cleanups [pr]

* free things up
2025-03-20 11:07:15 +08:00
George Hotz
68053d0510 dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338 add UPat.or_casted [pr] (#9513) 2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3 place masks in INDEX for TestGatedStoreRewrite [pr] (#9512) 2025-03-20 09:46:53 +08:00
b1tg
bd731a8624 AMDCompiler refactor (no_comgr prereq) (#9497)
* add amdgpu_disassemble to helpers

* refactor hip compiler

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
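The first bullet hints at an argmax-based approach; a NumPy sketch of max pooling that also returns flat indices of the winners (illustrative only, not the tinygrad implementation):

```python
# Sketch of max pooling that also returns indices via argmax over each
# window; NumPy illustration, not the tinygrad code from #9506.
import numpy as np

def max_pool2d_with_indices(x: np.ndarray, k: int = 2):
  H, W = x.shape
  vals = np.empty((H // k, W // k), x.dtype)
  idxs = np.empty((H // k, W // k), np.int64)
  for i in range(H // k):
    for j in range(W // k):
      win = x[i*k:(i+1)*k, j*k:(j+1)*k]
      flat = np.argmax(win)                     # winner within the window
      vals[i, j] = win.flat[flat]
      di, dj = divmod(flat, k)
      idxs[i, j] = (i*k + di) * W + (j*k + dj)  # flat index into x, torch-style
  return vals, idxs

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d_with_indices(x))  # values [[5,7],[13,15]] with matching flat indices
```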
Francis Lata
7ef36cad7f Merge branch 'master' into retinanet_mlperf 2025-03-19 18:13:46 +00:00
Francis Lata
81f2336e08 update layers to be compatible with fp16 2025-03-19 18:13:00 +00:00
chenyu
189f62d44f add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0  1000it/s]' != ' 1.00/10.0  1.00kit/s]'`
2025-03-19 12:08:46 -04:00
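The fixed assertion shows the failure mode: a rate just under 1000 rounds up in display and should be promoted to the next SI prefix. A sketch of unit scaling that rounds before picking the prefix (illustrative, not tqdm's actual code):

```python
# Sketch of SI unit scaling with rounding applied before choosing the prefix,
# so 999.996 it/s formats as 1.00kit/s rather than 1000.00it/s; illustrative,
# not tqdm's actual code.
def si_format(rate: float, unit: str = "it/s") -> str:
  for prefix in ["", "k", "M", "G"]:
    if round(rate, 2) < 1000:      # round first: 999.996 -> 1000.0 -> promote to 'k'
      return f"{rate:.2f}{prefix}{unit}"
    rate /= 1000
  return f"{rate:.2f}T{unit}"

print(si_format(999.996))  # 1.00kit/s
print(si_format(3.5))      # 3.50it/s
```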
nimlgen
a5c971ff3a am: prereqs for rdna4 1/n (#9495)
* am: ip_ver rename for acc

* am: refactor this

* fix version

* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7 extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
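These kernels pipeline MMA tiles through shared memory; as a much simpler reference point, a blocked GEMM with fp16 inputs and an fp32 accumulator (a toy sketch of the tiling structure only, nothing like the actual CUDA kernels in this PR):

```python
# Toy blocked GEMM sketch: fp16 inputs, fp32 accumulator. Shows the tiling
# structure only, nothing like the pipelined MMA kernels in #6926.
import numpy as np

def gemm_blocked(A: np.ndarray, B: np.ndarray, T: int = 16) -> np.ndarray:
  M, K = A.shape
  K2, N = B.shape
  assert K == K2
  C = np.zeros((M, N), dtype=np.float32)
  for i in range(0, M, T):
    for j in range(0, N, T):
      for k in range(0, K, T):  # accumulate one K-tile at a time, like a pipeline stage
        C[i:i+T, j:j+T] += A[i:i+T, k:k+T].astype(np.float32) @ B[k:k+T, j:j+T].astype(np.float32)
  return C

A = np.random.rand(32, 32).astype(np.float16)
B = np.random.rand(32, 32).astype(np.float16)
assert np.allclose(gemm_blocked(A, B), A.astype(np.float32) @ B.astype(np.float32), atol=1e-3)
```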
George Hotz
865f23dd7b olmoe memory usage cleanups 2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2 fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
b1tg
1d71436e6a use libllvm19 in ci (#9494)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55 nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb enable more webgpu tests (#9502)
OSX has a larger buffer count limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103 simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]

* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145 nn.state docs cleanup (#8332)
* doc cleanup

* extension cleanup

* manual definition

* bring back accept_filename for gguf_load

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa olmoe touchups (#9499)
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25 JIT OLMoE (#9396)
* jit the forward

* might timeout, idk just send it

* this is dumb

* naive bitonic lol

* idk if this is correct, but that squeeze before is definitely not

* vectorized bitonic sort, but still slow

* yay 1 layer is correct

* alright its pretty good

* good enough

* rerun CI

* nit improve comment
2025-03-18 14:49:02 -04:00
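Several bullets above mention a bitonic sort; the classic network (a minimal in-place Python sketch, unrelated to the vectorized tensor version in the PR) looks like:

```python
# Minimal in-place bitonic sort for power-of-two lengths; the PR's vectorized
# tensor version differs, this just shows the network structure.
def bitonic_sort(a, lo=0, n=None, ascending=True):
  if n is None: n = len(a)
  if n <= 1: return
  mid = n // 2
  bitonic_sort(a, lo, mid, True)         # build an ascending half...
  bitonic_sort(a, lo + mid, mid, False)  # ...and a descending half
  bitonic_merge(a, lo, n, ascending)     # then merge the bitonic sequence

def bitonic_merge(a, lo, n, ascending):
  if n <= 1: return
  mid = n // 2
  for i in range(lo, lo + mid):          # compare-exchange across the halves
    if (a[i] > a[i + mid]) == ascending:
      a[i], a[i + mid] = a[i + mid], a[i]
  bitonic_merge(a, lo, mid, ascending)
  bitonic_merge(a, lo + mid, mid, ascending)

xs = [5, 1, 7, 3, 8, 2, 6, 4]
bitonic_sort(xs)
print(xs)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The fixed compare-exchange pattern is what makes the network data-independent, which is why it vectorizes well on tensors where a data-dependent quicksort would not.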
Ignacio Sica
5c56cac0a0 MI300 mfma support (#9417)
* add f16/f32 mfma support for MI300

- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer

* minor cleanup

* minor cleanup

* add mfma emulation task to ci

* add back todo

* hotfix: comment

* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
hooved
5500887eed improve reproducibility of WebGPU CI puppeteer test (#9496)
* try to make CI test fail with slow JS import

* prevent race between model import and reference

* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3 do not view_left assign + elementwise sources always have a shape [pr] (#9491) 2025-03-18 17:42:51 +08:00
George Hotz
117b7a16ef VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56 simple failing test for graph_rewrite children [pr] (#9489)
* simple failing test for graph_rewrite children [pr]

* lint

* update too
2025-03-18 13:07:21 +08:00