Commit Graph

3498 Commits

Author SHA1 Message Date
George Hotz
74d98eafb8 add onnx frontend stub [pr] (#9558) 2025-03-24 12:24:34 +08:00
chenyu
ba41076e94 update embedding test to not use dtypes.long [pr] (#9556) 2025-03-23 21:33:38 -04:00
nimlgen
d5667419af am: move out pte creation logic (#9548)
* am: move out pte creation logic

* emu

* ops
2025-03-23 18:29:10 +07:00
geohotstan
309afa20b7 add Tensor.max_unpool2d (#9518)
* why does max_unpool2d feel slower than out.gradient ...

* slightly cleaner

* what happened to ruff

* need to think about this some more

* slightly faster now?

* clean up, 1 more failing edge case

* ok good

* working TINY_BACKEND

* nit doc wording

* retry CI
2025-03-22 12:11:33 -04:00
quortus
bdd44d4255 Fix DSP transcendentals (#9542) 2025-03-22 11:08:18 +08:00
chenyu
c33679c47b increase size in test_multinomial_counterexample (#9540)
should be less flaky
2025-03-21 17:46:52 -04:00
Francis Lata
1a1087e3a0 cleanups on losses and dataset tests (#9538) 2025-03-21 17:03:18 -04:00
Francis Lata
8cbe4009fc RetinaNet losses (#9536)
* add sigmoid_focal_loss and l1_loss

* update ref implementation comment
2025-03-21 15:52:54 -04:00
Francis Lata
e6389184c5 update comment for retinanet dataloader implementations (#9534)
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 15:07:45 -04:00
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
b1tg
58206fa8a9 add amd llvm compiler (#9519)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-21 23:13:27 +08:00
George Hotz
8e555c586c switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
George Hotz
3c5161b4cb add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX

* do mask properly

* more validation

* correct

* fix gated

* add CAST support to vmin/vmax

* fix ptx and image

* ptx no diff

* upat.index also stays

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7 remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
qazal
1839e8c9b3 place masks in INDEX for TestGatedStoreRewrite [pr] (#9512) 2025-03-20 09:46:53 +08:00
geohotstan
8c0d0a122c Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
chenyu
189f62d44f add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0  1000it/s]' != ' 1.00/10.0  1.00kit/s]'`
2025-03-19 12:08:46 -04:00
b1tg
2c87a22cf2 fix prg size calculation when there are adjacent mapped ranges (#9498)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-19 11:55:03 +08:00
chenyu
f8976dd2eb enable more webgpu tests (#9502)
OSX has larger buffer number limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
qazal
ae688e4103 simple failing test for scheduling parallel reduce [pr] (#9501)
* simple failing test for scheduling parallel reduce [pr]

* atol
2025-03-19 10:52:13 +08:00
George Hotz
117b7a16ef VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
qazal
935cd01f56 simple failing test for graph_rewrite children [pr] (#9489)
* simple failing test for graph_rewrite children [pr]

* lint

* update too
2025-03-18 13:07:21 +08:00
Anish Umale
5e58f4b65b Tiny backend test_ops fix part 3 (#9483)
* extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302

* pass dtype and device for ones_like
2025-03-17 18:01:51 -04:00
TJ
9fcef4d009 add masked_select to tensor.py (#9468)
* add masked_select to tensor.py

* fix tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-17 16:05:36 -04:00
chenyu
4f8eac59ea failed test case for threefry (#9469)
* failed test case for threefry

not sure if it's always like this, but increment before _threefry_random_bits is incorrect. the counts should start with random numbers generated so far.

use jax to generate 20 + 20 + 10 random numbers, the first 20 + 20 matches and the last 10 are different. just moving increment after _threefry_random_bits matches the number but jit test failes

* workaround

* why is this different?

* revert those

* and that
2025-03-17 14:52:10 -04:00
geohotstan
53d6f1e1bb Add bitonic cat sort (#9422)
* poc

* repeated values fail, sigh

* is this being timed out?

* fix up down names

* bitonic v2, does this run?

* bitonic v3, faster

* bitonic v3.1, faster

* bitonic v3.1.1, same speed unlucky

* support dim and indices

* bitonic v3.2, simpler code, TODO repeated indices

* bruv gimme green for once cmon

* cat (stack) implementation, slow but maybe one day when cat is fast meow

* revert to v3.2

* bitonic v4, who let the cats out edition

* clean up variable names

* figured out repeated indices :D

* ruff check --fix

* use sort for topk

* add Tensor.sort everywhere

* fix docs and add some types

* slightly better variable names

* am I doing torch inplace correctly?

* delegate sort to values_stable

* add a contig, faster first sort

* maybe don't test_inplace

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-17 12:01:23 -04:00
qazal
e03c0aacf2 more explicit DONT_PUSH_VIEWS [pr] (#9479)
* more explicit DONT_PUSH_VIEWS [pr]

* update tests to not handcode ast

* lint

* test_recursive_swizzle and test_simple_store_reshape
2025-03-17 20:43:21 +08:00
qazal
3b00a778ba fix view_left for unsafe pad ops [pr] (#9478) 2025-03-17 19:02:02 +08:00
qazal
813f713edc merge_views for buffer ops + create valids last (#9472)
* merge_views for buffer ops + create valids last

* view.arg

* pass
2025-03-17 17:15:44 +08:00
qazal
bd1f71c1e2 simple failing test for extra ops in VALID [pr] (#9474)
* simple failing test for extra valids [pr]

* this has DEBUG=4
2025-03-17 17:02:40 +08:00
qazal
e26caf4c3a hotfix: skip test_mean_half_precision_underflow on amd ci (#9476)
The global size is very large (781250 gidx) and the emulated version takes more than 1
minute to execute the kernel.
2025-03-17 16:47:48 +08:00
George Hotz
15ee742afa add get_children_map to uop (#9470)
* add get_children_map to uop

* update_children

* fix new children
2025-03-17 14:36:13 +08:00
George Hotz
cb7a7f69c7 quantization preprocessor from DSP, should be universal (#9437)
* quantization preprocessor from DSP, should be universal

* touchups

* fix tests
2025-03-15 07:49:37 +08:00
chenyu
99b0287e4e add GROUP and GROUPTOP to test_arange (#9432)
it does not grow quadratically, but it's not 0 ops now
2025-03-13 11:28:38 -04:00
qazal
90ffa9bd45 swizzle without buffer ops try 2 [pr] (#9427)
* add DONT_PUSH_VIEWS to matchers

* swizzle without buffer ops try 2 [pr]

* swizzle reduceop

* simple failing test

* fix failing test

* s/on/for
2025-03-13 10:00:40 +01:00
chenyu
22fc0a2e36 bert sum acc in half (#9412)
also BS=96
2025-03-11 23:03:15 -04:00
George Hotz
e174c6c3bc new devectorizer (#9331)
* new devectorizer

* lidx

* test linearizer passes

* fix images

* fix unfoldable image load

* delete unused

* improve fix_unfoldable_image_load

* working for image

* fixup types

* fixup transcendental

* cast_vec

* cleaner transcendental

* skip failing test

* err, flip that

* not devec

* sqrt
2025-03-11 18:47:56 +08:00
George Hotz
2780e2027e devectorize prereqs [pr] (#9404) 2025-03-11 12:33:29 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
qazal
59dfb234eb replace hardcoded ast with tensors in TestSwizzle [pr] (#9401) 2025-03-10 19:33:57 +01:00
geohotstan
1d64c12f2b add Topk to tensor (#9343)
* terrible but somewhat working impl

* linux behaves differently than macos?

* slightly better impl

* small clean up; haven't figured this out yet

* better

* torch has different behavior on linux and macos for duplicated values

* add sum docs

* fix test

* add torch return_type test

* add an exception test

* wrap_fxn instead, and move op lower in order

* better repeated values test

* rerun ci
2025-03-09 20:01:42 -04:00
qazal
a1f41fadf6 test_schedule cleanups + add DONT_GROUP_REDUCES [pr] (#9392)
* test_schedule cleanups + add DONT_GROUP_REDUCES [pr]

* replace with test_swizzle_reduceop

* delete duplicate tests

* test_allow_push_permutes

* one kernel tests
2025-03-09 15:01:08 +01:00
qazal
286b480f82 do not replace assign with the offset buffer [pr] (#9387) 2025-03-08 11:57:44 +01:00
qazal
0d2762c010 prep refactor for adding buffer ops last [pr] (#9383)
* prep refactor for adding buffer ops last [pr]

* freeze buffers

* add swizzle_reduceop

* shape for reduceop_view_right

* simpler elementwise_view_right

* add shapetracker to const

* only const

* from process replay
2025-03-08 08:00:14 +01:00
nimlgen
243078dda9 am: optimize tlb usage (#9049)
* am: optimize tlb usage

* fxies

* comments

* tiny
2025-03-07 19:37:29 +03:00
geohotstan
088d86691b fix onnx gather and onnx auto_pad VALID mode (#9375)
* fix gather and auto_pad

* long -> int64
2025-03-07 10:27:23 -05:00
hooved
136cf7b8b1 hotfix: load >2 GiB from disk on macOS (#9361)
* enable loading >2 GiB buffer from disk on macOS

* handle None case raised by mypy

* add test

* revert fix to repro bug in CI

* tell CI to run a unit test for macOS

* reapply fix
2025-03-07 14:51:58 +08:00
nimlgen
9bd13de44c lower test_gemv_4096_16384 to 750 for red (#9367) 2025-03-05 22:44:48 +03:00
uuuvn
b75f307234 amd: autogen ip bases (#9360) 2025-03-05 22:30:38 +03:00
chenyu
2cb2fce8d9 lower test_gemm_8192 amd_tflops to 65 (#9364) 2025-03-05 14:06:11 -05:00