Francis Lata
aebccf93ac
revert losses changes
2025-03-21 20:20:36 +00:00
Francis Lata
3f0134156e
Merge branch 'master' into retinanet_mlperf
2025-03-21 20:05:04 +00:00
Francis Lata
8cbe4009fc
RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss
* update ref implementation comment
2025-03-21 15:52:54 -04:00
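For context: sigmoid_focal_loss is the RetinaNet classification loss from Lin et al., FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), and l1_loss is plain mean absolute error. A minimal NumPy sketch of the standard formulas, for illustration only; this is not the code the PR added.

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
  # FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), per the RetinaNet paper
  p = 1 / (1 + np.exp(-logits))                                # sigmoid probabilities
  ce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))  # elementwise binary cross-entropy
  p_t = targets * p + (1 - targets) * (1 - p)                  # probability assigned to the true class
  alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
  return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def l1_loss(pred, target):
  # mean absolute error
  return np.abs(pred - target).mean()
```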
Francis Lata
64cff4f41d
Merge branch 'master' into retinanet_mlperf
2025-03-21 19:20:03 +00:00
Francis Lata
e6389184c5
update comment for retinanet dataloader implementations (#9534)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
Francis Lata
3d893da3a6
Merge branch 'master' into retinanet_mlperf
2025-03-21 18:56:01 +00:00
chenyu
ee3d313b34
Revert "update ruff to 0.11.2 (#9531)" (#9535)
This reverts commit d8d65e2747.
2025-03-21 14:52:25 -04:00
Francis Lata
5408eabef4
remove duplicated files from test
2025-03-21 18:41:27 +00:00
Francis Lata
95bb6a9d06
Merge branch 'master' into retinanet_mlperf
2025-03-21 18:32:39 +00:00
chenyu
b46b8ee15e
add a flag to log when beam surpassed max limit [pr] (#9533)
2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea
RetinaNet dataloader (#9442)
* retinanet dataloader
* remove batch_size from generate_anchors
* refactor kits19 dataset tests
* add tests for dataloader
* fix testing setup and cleanups
* remove unused import
2025-03-21 13:36:41 -04:00
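The generate_anchors mentioned above follows the usual RetinaNet recipe: 9 anchors per feature-map cell (3 octave scales x 3 aspect ratios), with one base size per FPN level. A self-contained sketch of that math, assuming the conventional P3-P7 base sizes; the dataloader's actual code may differ.

```python
import numpy as np

def anchor_shapes(base_size, scales=(2**0, 2**(1/3), 2**(2/3)), ratios=(0.5, 1.0, 2.0)):
  # 9 (width, height) pairs per cell, chosen so w*h == area and h/w == ratio
  shapes = []
  for s in scales:
    area = (base_size * s) ** 2
    for r in ratios:
      w = np.sqrt(area / r)
      shapes.append((w, w * r))
  return np.array(shapes)  # shape (9, 2)

# one base size per FPN level P3..P7
for level, base in zip(range(3, 8), (32, 64, 128, 256, 512)):
  print(f"P{level}:", anchor_shapes(base).round(1))
```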
Francis Lata
2447fad0be
Merge branch 'master' into retinanet_mlperf
2025-03-21 17:04:45 +00:00
Francis Lata
7939186e7d
cleanups and adjust learning rate for fp16
2025-03-21 17:03:07 +00:00
Francis Lata
da97696498
end BENCHMARK after first eval
2025-03-21 15:18:08 +00:00
b1tg
58206fa8a9
add amd llvm compiler (#9519)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
chenyu
d8d65e2747
update ruff to 0.11.2 (#9531)
0.11.2 fixed the false alert from 0.11.1. Also pinned the version in setup for now to prevent a ruff upgrade from breaking CI.
2025-03-21 10:32:59 -04:00
qazal
ee3ed73ed1
add reorder_view matcher to scheduler [pr] (#9528)
2025-03-21 17:46:20 +08:00
George Hotz
8e555c586c
switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE
* tests
* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
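"unsigned/unsigned" means uint8 activations and uint8 weights. For reference, standard affine uint8 quantization is q = clip(round(x / scale) + zero_point, 0, 255); a small NumPy sketch assuming per-tensor scale and zero point, not tinygrad's actual quantization path.

```python
import numpy as np

def quantize_u8(x):
  # affine quantization: map [x.min(), x.max()] onto the uint8 range
  scale = (x.max() - x.min()) / 255.0
  zero_point = np.round(-x.min() / scale)
  return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8), scale, zero_point

def dequantize_u8(q, scale, zero_point):
  return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(16).astype(np.float32)
q, s, zp = quantize_u8(x)
print(np.abs(dequantize_u8(q, s, zp) - x).max())  # round-trip error is at most ~scale/2
```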
nimlgen
a35b0a88bf
am: just rename and reorder ip init funcs (#9504)
2025-03-21 15:57:32 +08:00
nimlgen
8a131ab271
am: allow allocations as small as a page (#9523)
* am: fix allocs
* better msg
* comment
* next time
2025-03-21 15:53:32 +08:00
Sieds Lykles
3ad3ac4d1e
Change dtypes.int to dtypes.ints (#9517)
2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914
pin ruff to 0.11.0 in CI (#9520)
0.11.1 had a bug (https://github.com/astral-sh/ruff/issues/16874) that breaks CI
2025-03-20 13:12:50 -04:00
George Hotz
3c5161b4cb
add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX
* do mask properly
* more validation
* correct
* fix gated
* add CAST support to vmin/vmax
* fix ptx and image
* ptx no diff
* upat.index also stays
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
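Bounds validation of this kind is interval arithmetic: every expression carries a conservative (vmin, vmax), a CAST passes the range through when it fits the target dtype (otherwise wraparound forces the full dtype range), and an INDEX is provably safe when its interval fits the buffer. A hand-rolled sketch of the idea with made-up helpers, not tinygrad's UOp vmin/vmax code.

```python
DTYPE_RANGE = {"uint8": (0, 255), "int32": (-2**31, 2**31 - 1)}

def add_bounds(a, b): return (a[0] + b[0], a[1] + b[1])

def mul_bounds(a, b):
  prods = [x * y for x in a for y in b]
  return (min(prods), max(prods))

def cast_bounds(a, dtype):
  lo, hi = DTYPE_RANGE[dtype]
  # if the source range fits the target dtype it survives the cast;
  # otherwise wraparound loses all bounds information
  return a if lo <= a[0] and a[1] <= hi else (lo, hi)

def index_in_bounds(idx, buf_size):
  return idx[0] >= 0 and idx[1] < buf_size

# e.g. idx = cast(gidx * 4 + lidx, int32) with gidx in [0, 7] and lidx in [0, 3]
idx = cast_bounds(add_bounds(mul_bounds((0, 7), (4, 4)), (0, 3)), "int32")
print(idx, index_in_bounds(idx, 32))  # (0, 31) True -> no gate needed
```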
qazal
0b20f91ce7
remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer
* add (wrong) ptx
* reason
* enable index addition in PTX, we won't have the INDEX anyways
* space
2025-03-20 11:53:12 +08:00
Francis Lata
74ad538f9d
Merge branch 'master' into retinanet_mlperf
2025-03-20 03:16:38 +00:00
qazal
9302738263
hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
* hotfix: more consistent wgsl.py spacing + cleanups [pr]
* free things up
2025-03-20 11:07:15 +08:00
George Hotz
68053d0510
dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe
* dump input buffers
* snpe logs from dsp
* NHWC support
* knum 3
* this run?
* revert those
---------
Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338
add UPat.or_casted [pr] (#9513)
2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3
place masks in INDEX for TestGatedStoreRewrite [pr] (#9512)
2025-03-20 09:46:53 +08:00
b1tg
bd731a8624
AMDCompiler refactor (no_comgr prereq) (#9497)
* add amdgpu_disassemble to helpers
* refactor hip compiler
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c
Add return_indices to max_pool (#9506)
* wow argmax is so good
* 1 less line
* clean up and better variable names
* is this torch thing right...?
* add more tests
* slap a TODO on it
* clean ups
* prettier looking code and fix ceil mode test
* add return types and some docs
* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
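The argmax bullets hint at the mechanism: take the max over each pooling window and recover the flat input index of the winner, torch-style. A NumPy sketch of the 1D case; max_pool1d_with_indices is a hypothetical helper, not the tinygrad API.

```python
import numpy as np

def max_pool1d_with_indices(x, k, stride=None):
  # returns pooled maxima plus the flat input index of each max
  stride = stride or k
  n = (len(x) - k) // stride + 1
  starts = np.arange(n) * stride
  windows = x[starts[:, None] + np.arange(k)]  # gather the (n, k) windows
  return windows.max(axis=1), starts + windows.argmax(axis=1)

x = np.array([1., 5., 2., 4., 4., 3.])
vals, idx = max_pool1d_with_indices(x, k=2)
print(vals, idx)  # [5. 4. 4.] [1 3 4]
```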
Francis Lata
7ef36cad7f
Merge branch 'master' into retinanet_mlperf
2025-03-19 18:13:46 +00:00
Francis Lata
81f2336e08
update layers to be compatible with fp16
2025-03-19 18:13:00 +00:00
chenyu
189f62d44f
add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0 1000it/s]' != ' 1.00/10.0 1.00kit/s]'`
2025-03-19 12:08:46 -04:00
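The assertion shows the bug class: a rate just under 1000 it/s rounds up at display precision to "1000it/s" instead of being promoted to "1.00kit/s". Rounding before choosing the SI suffix fixes it; a standalone sketch of that logic with a hypothetical format_rate helper, not tinygrad's tqdm code.

```python
def format_rate(rate):
  # round at display precision *before* picking the suffix, so 999.9999 becomes "1.00k"
  for suffix in ("", "k", "M", "G"):
    if round(rate, 2) < 1000: return f"{rate:.2f}{suffix}it/s"
    rate /= 1000
  return f"{rate:.2f}Tit/s"

print(format_rate(999.9999))  # 1.00kit/s, not 1000.00it/s
```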
nimlgen
a5c971ff3a
am: prereqs for rdna4 1/n (#9495)
* am: ip_ver rename for acc
* am: refactor this
* fix version
* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
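Several bullets distinguish fp16-accumulate from fp32-accumulate variants. The accuracy gap is easy to demonstrate without any MMA hardware: an fp16 accumulator stalls once the running sum outgrows the format's integer spacing. A deterministic NumPy demonstration, unrelated to the kernels themselves.

```python
import numpy as np

x = np.ones(4096, dtype=np.float16)
acc16 = np.float16(0.0)
for v in x:
  acc16 = np.float16(acc16 + v)  # stalls at 2048, where fp16 spacing between values is 2
acc32 = x.astype(np.float32).sum()
print(acc16, acc32)  # 2048.0 vs 4096.0
```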
George Hotz
865f23dd7b
olmoe memory usage cleanups
2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2
fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
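The PR body isn't shown, so purely as an illustrative aside: size accounting over mapped ranges generally requires coalescing adjacent ranges into contiguous runs before measuring. A generic sketch of that interval bookkeeping, with hypothetical names.

```python
def total_mapped_size(ranges):
  # ranges: (start, end) pairs; merge touching/overlapping ones, then sum lengths
  merged = []
  for start, end in sorted(ranges):
    if merged and start <= merged[-1][1]:  # touches or overlaps the previous run
      merged[-1] = (merged[-1][0], max(merged[-1][1], end))
    else:
      merged.append((start, end))
  return sum(end - start for start, end in merged)

print(total_mapped_size([(0, 4096), (4096, 8192)]))  # 8192: two adjacent pages, one run
```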
b1tg
1d71436e6a
use libllvm19 in ci (#9494)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55
nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works
* move aten.where.self_out to tiny_backend_out
* fix memory leak
* corealize in the backward_hook
* Update backend.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb
enable more webgpu tests (#9502)
OSX has a larger limit on the number of buffers, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103
simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]
* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145
nn.state docs cleanup (#8332)
* doc cleanup
* extension cleanup
* manual definition
* bring back accept_filename for gguf_load
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa
olmoe touchups (#9499)
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25
JIT OLMoE (#9396)
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitely not
* vectorized bitonic sort, but still slow
* yay 1 layer is correct
* alright its pretty good
* good enough
* rerun CI
* nit improve comment
2025-03-18 14:49:02 -04:00
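The bullets step from a "naive bitonic" to a vectorized bitonic sort; bitonic sort suits JITed accelerator code because its compare-exchange pattern is fixed and data-independent. A plain-Python sketch of the classic network for power-of-two lengths; the PR's tensorized version is different.

```python
def bitonic_sort(a):
  # in-place bitonic sorting network; len(a) must be a power of two
  n = len(a)
  k = 2
  while k <= n:      # size of the bitonic sequences being merged
    j = k // 2
    while j >= 1:    # compare-exchange distance within this merge pass
      for i in range(n):
        l = i ^ j
        if l > i and (a[i] > a[l]) == ((i & k) == 0):  # direction alternates per block
          a[i], a[l] = a[l], a[i]
      j //= 2
    k *= 2
  return a

print(bitonic_sort([7, 3, 1, 8, 5, 2, 6, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```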
Ignacio Sica
5c56cac0a0
MI300 mfma support (#9417)
* add f16/f32 mfma support for MI300
- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer
* minor cleanup
* minor cleanup
* add mfma emulation task to ci
* add back todo
* hotfix: comment
* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
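The ops_python emulation mentioned in the bullets reduces to the instruction's mathematical contract: f16 inputs, f32 accumulation, D = A @ B + C on 16x16 tiles. A NumPy sketch of that contract (the function name and plain row-major layout are assumptions; MI300's real per-lane register mapping is more involved).

```python
import numpy as np

def emu_mfma_f32_16x16x16_f16(a, b, c):
  # D = A @ B + C, with f16 inputs upcast to f32 before multiply-accumulate
  assert a.shape == b.shape == c.shape == (16, 16)
  return a.astype(np.float32) @ b.astype(np.float32) + c

a = np.random.randn(16, 16).astype(np.float16)
b = np.random.randn(16, 16).astype(np.float16)
d = emu_mfma_f32_16x16x16_f16(a, b, np.zeros((16, 16), np.float32))
np.testing.assert_allclose(d, a.astype(np.float64) @ b.astype(np.float64), atol=1e-3)
```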
hooved
5500887eed
improve reproducibility of WebGPU CI puppeteer test (#9496)
* try to make CI test fail with slow JS import
* prevent race between model import and reference
* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3
do not view_left assign + elementwise sources always have a shape [pr] (#9491)
2025-03-18 17:42:51 +08:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]
* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56
simple failing test for graph_rewrite children [pr] (#9489)
* simple failing test for graph_rewrite children [pr]
* lint
* update too
2025-03-18 13:07:21 +08:00