quortus
bdd44d4255
Fix DSP transcendentals (#9542)
2025-03-22 11:08:18 +08:00
Ignacio Sica
eddafb84e5
Bugfix for TC=3 (#9464)
...
* wrong but uses less shared
* for size 8 tc1 with devectorize in 0 loads into local before wmma and works
* improvements over tc1 devectorize
* fix tc=3
* works for handcoded tc opts
* clean bugfix tc=3
* fix
* revert changes
2025-03-21 16:43:42 -07:00
chenyu
6da78164f9
assert Kernel ast.op to be Ops.SINK [pr] (#9539)
...
rest of the code assumes self.ast is defined anyway
2025-03-21 18:09:44 -04:00
chenyu
c33679c47b
increase size in test_multinomial_counterexample (#9540)
...
should be less flaky
2025-03-21 17:46:52 -04:00
Francis Lata
1a1087e3a0
cleanups on losses and dataset tests (#9538)
2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc
RetinaNet losses (#9536)
...
* add sigmoid_focal_loss and l1_loss
* update ref implementation comment
2025-03-21 15:52:54 -04:00
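The sigmoid_focal_loss added above presumably follows the standard focal loss of Lin et al. 2017; a minimal NumPy sketch of that formula (a hypothetical helper, not the commit's actual code, which may differ in reduction and numerical details):

```python
import numpy as np

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sketch of the standard focal loss (Lin et al. 2017); the commit's
    # reference implementation may differ in reduction/stability details.
    p = 1 / (1 + np.exp(-logits))                      # sigmoid
    ce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    p_t = p * targets + (1 - p) * (1 - targets)        # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```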
Francis Lata
e6389184c5
update comment for retinanet dataloader implementations (#9534)
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
chenyu
ee3d313b34
Revert "update ruff to 0.11.2 ( #9531 )" (#9535)
...
This reverts commit d8d65e2747.
2025-03-21 14:52:25 -04:00
chenyu
b46b8ee15e
add a flag to log when beam surpassed max limit [pr] (#9533)
2025-03-21 13:37:02 -04:00
Francis Lata
eb95825eea
RetinaNet dataloader (#9442)
...
* retinanet dataloader
* remove batch_size from generate_anchors
* refactor kits19 dataset tests
* add tests for dataloader
* fix testing setup and cleanups
* remove unused import
2025-03-21 13:36:41 -04:00
b1tg
58206fa8a9
add amd llvm compiler (#9519)
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
chenyu
d8d65e2747
update ruff to 0.11.2 (#9531)
...
0.11.2 fixed the false alert from 0.11.1. Also pinned the version in setup for now so a future ruff upgrade cannot break CI.
2025-03-21 10:32:59 -04:00
qazal
ee3ed73ed1
add reorder_view matcher to scheduler [pr] (#9528)
2025-03-21 17:46:20 +08:00
George Hotz
8e555c586c
switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
...
* switch quantization to unsigned/unsigned + add Ops.REDUCE
* tests
* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
nimlgen
a35b0a88bf
am: just rename and reorder ip init funcs (#9504)
2025-03-21 15:57:32 +08:00
nimlgen
8a131ab271
am: allow allocations as small as a page (#9523)
...
* am: fix allocs
* bettermsg
* comment
* next time
2025-03-21 15:53:32 +08:00
Sieds Lykles
3ad3ac4d1e
Change dtypes.int to dtypes.ints (#9517)
2025-03-20 17:24:26 -04:00
chenyu
b9fab9b914
pin ruff to 0.11.0 in CI (#9520)
...
0.11.1 had a bug https://github.com/astral-sh/ruff/issues/16874 that broke CI
2025-03-20 13:12:50 -04:00
George Hotz
3c5161b4cb
add validation of the bounds of Ops.INDEX (#9503)
...
* add validation of the bounds of Ops.INDEX
* do mask properly
* more validation
* correct
* fix gated
* add CAST support to vmin/vmax
* fix ptx and image
* ptx no diff
* upat.index also stays
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7
remove move_mask from the devectorizer (#9511)
...
* remove move_mask from the devectorizer
* add (wrong) ptx
* reason
* enable index addition in PTX, we won't have the INDEX anyways
* space
2025-03-20 11:53:12 +08:00
qazal
9302738263
hotfix: more consistent wgsl.py spacing + cleanups [pr] (#9515)
...
* hotfix: more consistent wgsl.py spacing + cleanups [pr]
* free things up
2025-03-20 11:07:15 +08:00
George Hotz
68053d0510
dsp stuff / sniff ioctls from snpe (#9490)
...
* sniff ioctls from snpe
* dump input buffers
* snpe logs from dsp
* NHWC support
* knum 3
* this run?
* revert those
---------
Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
qazal
2223b93338
add UPat.or_casted [pr] (#9513)
2025-03-20 10:08:32 +08:00
qazal
1839e8c9b3
place masks in INDEX for TestGatedStoreRewrite [pr] (#9512)
2025-03-20 09:46:53 +08:00
b1tg
bd731a8624
AMDCompiler refactor (no_comgr prereq) (#9497)
...
* add amdgpu_disassemble to helpers
* refactor hip compiler
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-20 09:44:07 +08:00
geohotstan
8c0d0a122c
Add return_indices to max_pool (#9506)
...
* wow argmax is so good
* 1 less line
* clean up and better variable names
* is this torch thing right...?
* add more tests
* slap a TODO on it
* clean ups
* prettier looking code and fix ceil mode test
* add return types and some docs
* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
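The return_indices option above appears, from the commit's bullets, to be built on argmax; assuming torch-style semantics (each pooled maximum is returned alongside its flat index into the input), a minimal 1-D NumPy sketch with a hypothetical helper name:

```python
import numpy as np

def max_pool1d_with_indices(x: np.ndarray, k: int):
    # Torch-style return_indices semantics (assumed here): alongside the
    # pooled maxima, return the flat index into x of each selected
    # element, recovered with an argmax per pooling window.
    n = len(x) // k
    windows = x[: n * k].reshape(n, k)
    idx = windows.argmax(axis=1) + np.arange(n) * k
    return windows.max(axis=1), idx
```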
chenyu
189f62d44f
add rounding to tqdm unit scale (#9507)
...
fixed `AssertionError: ' 1.00/10.0 1000it/s]' != ' 1.00/10.0 1.00kit/s]'`
2025-03-19 12:08:46 -04:00
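The assertion above shows the symptom: a rate that rounds up to 1000 was printed as `1000it/s` instead of being bumped to the next SI prefix. A minimal sketch of the idea behind the fix (a hypothetical helper, not tinygrad's actual tqdm code): round to the displayed precision before choosing the prefix.

```python
def fmt_rate(rate: float) -> str:
    # Hypothetical helper (not tinygrad's actual tqdm code) showing the
    # fix's idea: round to the displayed precision *before* picking the
    # SI prefix, so e.g. 999.999 it/s renders as "1.00kit/s", not "1000it/s".
    prefixes = ["", "k", "M", "G", "T"]
    i = 0
    while round(rate, 2) >= 1000 and i < len(prefixes) - 1:
        rate /= 1000
        i += 1
    return f"{rate:.2f}{prefixes[i]}it/s"
```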
nimlgen
a5c971ff3a
am: prereqs for rdna4 1/n (#9495)
...
* am: ip_ver rename for acc
* am: refactor this
* fix version
* ugh
2025-03-19 17:14:57 +08:00
Francis Lam
1e5d9ad8f7
extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
...
* extra/gemm/max_matmul: start of custom kernels for GEMM
* add an unoptimized FP16/FP16 MMA example
* add slow 3-stage fp16 acc example
* add correct 3-stage pipeline with unswizzled/flat smem input (slow)
* add acc fp16 example with 3 stages and swizzle (no bank conflicts)
* add max version of NV fp16_fp16_fp16
* fix up comments and removed unused code in max variations
* add start of no_xor example
* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
George Hotz
865f23dd7b
olmoe memory usage cleanups
2025-03-19 12:28:18 +08:00
b1tg
2c87a22cf2
fix prg size calculation when there are adjacent mapped ranges (#9498)
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
b1tg
1d71436e6a
use libllvm19 in ci (#9494)
...
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:53:32 +08:00
b1tg
a95b489a55
nanoGPT train works with tiny torch backend (#9283)
...
* train_shakespeare_char.py works
* move aten.where.self_out to tiny_backend_out
* fix memory leak
* corealize in the backward_hook
* Update backend.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
chenyu
f8976dd2eb
enable more webgpu tests (#9502)
...
OSX has a larger buffer count limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103
simple failing test for scheduling parallel reduce [pr] (#9501)
...
* simple failing test for scheduling parallel reduce [pr]
* atol
2025-03-19 10:52:13 +08:00
leopf
e4dad99145
nn.state docs cleanup (#8332)
...
* doc cleanup
* extension cleanup
* manual definition
* bring back accept_filename for gguf_load
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-18 17:16:40 -04:00
chenyu
1ea4876dfa
olmoe touchups (#9499)
...
GlobalCounters.reset() and only validate if temperature is 0
2025-03-18 15:25:45 -04:00
geohotstan
f7506c6c25
JIT OLMoE (#9396)
...
* jit the forward
* might timeout, idk just send it
* this is dumb
* naive bitonic lol
* idk if this is correct, but that squeeze before is definitely not
* vectorized bitonic sort, but still slow
* yay 1 layer is correct
* alright its pretty good
* good enough
* rerun CI
* nit improve comment
2025-03-18 14:49:02 -04:00
Ignacio Sica
5c56cac0a0
MI300 mfma support (#9417)
...
* add f16/f32 mfma support for MI300
- add 16x16 mfma shape support for f16 with f32 acc
- add ops_python mfma emulation
- add arch to AMDRenderer
* minor cleanup
* minor cleanup
* add mfma emulation task to ci
* add back todo
* hotfix: comment
* add tc=3 job to ci
2025-03-18 14:33:30 -03:00
hooved
5500887eed
improve reproducibility of WebGPU CI puppeteer test (#9496)
...
* try to make CI test fail with slow JS import
* prevent race between model import and reference
* revert artificial delay in JS module import
2025-03-18 09:27:38 -04:00
qazal
cde4fd3be3
do not view_left assign + elementwise sources always have a shape [pr] (#9491)
2025-03-18 17:42:51 +08:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
...
* VALIDATE_WITH_CPU [pr]
* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56
simple failing test for graph_rewrite children [pr] (#9489)
...
* simple failing test for graph_rewrite children [pr]
* lint
* update too
2025-03-18 13:07:21 +08:00
George Hotz
d20494e6d7
move buffer logic to Buffer [pr] (#9487)
...
* move buffer logic to Buffer [pr]
* pass shape into as_typed_buffer
* pass shape into as_typed_buffer
* work
* cleaner
* fix tests
2025-03-18 11:21:21 +08:00
qazal
3be228182f
unbind Tensor variables last [pr] (#9486)
...
* reorder do_realize [pr]
* move merge_views
* unbind all variables at the end [pr]
2025-03-18 09:52:01 +08:00
qazal
b44f9c409a
reorder do_realize [pr] (#9485)
...
* reorder do_realize [pr]
* move merge_views
2025-03-18 09:30:10 +08:00
nimlgen
a82c9332d3
am: rename soc21 to soc (#9482)
2025-03-18 08:54:26 +08:00
qazal
b100fc0b20
split the rule that uses context in scheduler simplifier [pr] (#9484)
...
* split the rule that uses context in scheduler simplifier [pr]
* add
2025-03-18 08:12:26 +08:00
Anish Umale
5e58f4b65b
Tiny backend test_ops fix part 3 (#9483)
...
* extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302
* pass dtype and device for ones_like
2025-03-17 18:01:51 -04:00
TJ
9fcef4d009
add masked_select to tensor.py (#9468)
...
* add masked_select to tensor.py
* fix tests
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-17 16:05:36 -04:00
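Assuming the new Tensor.masked_select follows torch semantics (an assumption here, not confirmed by the commit text): broadcast a boolean mask against the tensor and return the selected elements as a flat 1-D result. A minimal NumPy sketch:

```python
import numpy as np

def masked_select(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Torch-style semantics (assumed for the new Tensor method too):
    # broadcast the boolean mask against x, then return the selected
    # elements flattened into a 1-D array, in row-major order.
    x_b, m_b = np.broadcast_arrays(x, mask.astype(bool))
    return x_b[m_b]

# masked_select(np.array([[1, 2], [3, 4]]), np.array([[True, False], [False, True]]))
# → array([1, 4])
```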