Commit Graph

1248 Commits

Francis Lata
eb2e59db42 RetinaNet model type annotations and loss functions (#9822)
* add type annotations and loss functions for training

* combine sum of multiple dims inside loss functions
2025-04-10 00:31:37 -04:00
Francis Lata
7bb36d71b2 remove openimages iterate (#9820) 2025-04-09 22:54:12 -04:00
chenyu
c5db5b83b9 add SHOULD_USE_TC=1 check to simple_matmul (#9802)
* add SHOULD_USE_TC=1 check to simple_matmul

also zero-centered the random input and updated atol for tf32

* ATOL=2e-2 for HALF
2025-04-09 02:24:42 -04:00
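
A rough sketch of the kind of check this describes; only the `SHOULD_USE_TC` and `HALF` flag names and the 2e-2 tolerance come from the commits, the rest is a stand-in for simple_matmul, not the actual script:

```python
# hypothetical stand-in for the simple_matmul check; only SHOULD_USE_TC
# and the HALF atol come from the commits above
import os
import numpy as np

N = 512
# zero-centered inputs: products cancel on average, so |C| stays small
# and a fixed absolute tolerance is a fair accuracy bar
a = np.random.uniform(-0.5, 0.5, (N, N)).astype(np.float32)
b = np.random.uniform(-0.5, 0.5, (N, N)).astype(np.float32)

c = a @ b  # the kernel under test would go here
ref = a.astype(np.float64) @ b.astype(np.float64)

atol = 2e-2 if os.getenv("HALF") else 1e-2  # looser bar for half/tf32 accumulation
np.testing.assert_allclose(c, ref, atol=atol)

if os.getenv("SHOULD_USE_TC") == "1":
  # in the real script this would assert that the generated kernel
  # actually used a tensor core, failing the run otherwise
  pass
```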
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
14928fecff Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)"
This reverts commit 7c9a96824f.
2025-04-09 12:27:39 +08:00
chenyu
7c9a96824f fix TF32 tensor core dropped in tc_sm89 (#9798)
also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul
2025-04-08 23:20:50 -04:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
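
The Python emulation path in this PR pulls in the ml_dtypes package (only under EMULATE_CUDA_SM89, per the bullets). A minimal sketch of the two fp8 formats involved; tinygrad's own dtype names are not shown here:

```python
import numpy as np
import ml_dtypes  # numpy extension types for fp8/bfloat16

x = np.array([0.1, 1.5, 300.0, 1000.0], dtype=np.float32)
# e4m3: 4 exponent / 3 mantissa bits -- more precision, max finite ~448
# e5m2: 5 exponent / 2 mantissa bits -- wider range, less precision
print(x.astype(ml_dtypes.float8_e4m3fn))  # 1000.0 is out of range for e4m3
print(x.astype(ml_dtypes.float8_e5m2))
```

Mixing the two formats is why the PR needs a least_upper_dtype rule and the "e4m3 + e5m2 result dtype test" above.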
Francis Lata
f8fe15e64e move BoxCoder to mlperf helpers (#9773) 2025-04-07 20:27:06 -04:00
Francis Lata
71b8890dd6 use validation dataloader inside retinanet eval (#9747) 2025-04-05 16:46:55 -04:00
geohotstan
ac713e04db ONNX add output shape validation (#9720)
* add output shape validation and remove support for sequence_type

* nit better err msg

* add sequence_type back

* improve err msg

* Revert "improve err msg"

This reverts commit dc9eaea4bb.

* Revert "add sequence_type back"

This reverts commit 288170b2d9.

* do explicit shape equality

* small nit
2025-04-03 05:44:53 -04:00
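
A minimal sketch of the explicit shape-equality check the PR settles on; the helper name is hypothetical, and the expected shape would come from the ONNX graph's declared outputs:

```python
def validate_output_shape(name: str, got: tuple, expected: tuple) -> None:
  # explicit equality, no broadcasting or partial matching
  if got != expected:
    raise RuntimeError(f"ONNX output '{name}': got shape {got}, model declares {expected}")
```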
chenyu
7dadbf3697 insert float() in bert acc (#9726)
sum of a bool tensor by default uses default_float for the accumulator, so without float() it can overflow with a large BS and default_float=HALF.

fixes clsf_accuracy coming out as inf in mi300x bert
2025-04-03 05:44:09 -04:00
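
A numpy illustration of the failure mode (not the bert code itself): summing a bool mask with a half-precision accumulator saturates past 65504, which is how an accuracy metric becomes inf at large batch size.

```python
import numpy as np

correct = np.ones(70000, dtype=bool)     # stand-in for (pred == target)
print(correct.sum(dtype=np.float16))     # inf: float16 tops out at 65504
print(correct.astype(np.float32).sum())  # 70000.0 -- the inserted float() fix
```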
George Hotz
5c7b549eab use functools.cache instead of lru_cache(None) [pr] (#9714)
* use functools.cache instead of lru_cache(None) [pr]

* more cache
2025-04-03 11:47:13 +08:00
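
For context, functools.cache (Python 3.9+) is defined as lru_cache(maxsize=None): an unbounded memoizing cache with no LRU bookkeeping, so the rewrite is behavior-preserving:

```python
import functools

@functools.cache              # new spelling
def fib(n: int) -> int:
  return n if n < 2 else fib(n - 1) + fib(n - 2)

@functools.lru_cache(None)    # old spelling, same behavior
def fib2(n: int) -> int:
  return n if n < 2 else fib2(n - 1) + fib2(n - 2)
```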
geohotstan
e1d7e47cca fix ONNX IsInf unintended dtype promotion (#9711)
* add IsInf

* add corresponding test

* that float16 is kinda silly
2025-04-02 22:46:15 -04:00
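
A sketch, in numpy rather than tinygrad's ONNX runner, of an IsInf that honors ONNX's detect_positive/detect_negative attributes while staying in the input dtype; promoting float16 to float32 just to compare against inf is the "kinda silly" detour being removed:

```python
import numpy as np

def is_inf(x: np.ndarray, detect_negative=True, detect_positive=True) -> np.ndarray:
  # comparisons against inf work in any float dtype, including float16,
  # so no promotion is needed
  out = np.zeros(x.shape, dtype=bool)
  if detect_positive: out |= (x == np.inf)
  if detect_negative: out |= (x == -np.inf)
  return out
```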
George Hotz
f72a87fd0e add proper support for Ops.IGNORE to remove store masks (#9692)
* add proper support for Ops.IGNORE to remove store masks

* remove useless NHWC

* revert that
2025-04-02 16:38:01 +08:00
George Hotz
6f812d3f2f fixes from the dsp branch + 12500 lines (#9683)
* fixes from the dsp branch

* more changes

* those are gep pushing
2025-04-02 13:07:17 +08:00
qazal
eee0dcc37a merge viz back into one file (#9672)
* merge viz back into one file

* work

* rename lib to js directory

* fix diff

* less indenting

* memory graph is back

* viz_sz.py
2025-04-01 19:52:02 +08:00
nimlgen
3e2f42c2e8 autogen: remove am headers from extra (#9666) 2025-04-01 14:45:30 +07:00
Anish Umale
a1ee4d587f Fix test_ops for tiny backend (#9302)
* fix some tests in test_ops for torch backend(171 failing)

* fix more tests (135 failures)

* fix tests (126 failing)

* handle transposed convs (109 tests failing)

* fix slice

* fix lshift & rshift and more tests (87 tests failing)

* revert accidental change

* remove unnecessary changes (82 failures)

* fix backward for avg_pool2d (78 failures)

* fix backward for avg_pool2d (78 failures)

* fix replication pad backward pass

* fix reflection pad backward pass (71 failures)

* cummax with indices, aten.mv and move out methods (67 failures)

* extract avg_pool2d and avg_pool3d to separate functions (62 failures)

* revert changes for cat_out

* rewrite avg_pool and pad without repetition

* remove duplicates from decomps

* slice rewrite and add slice_backward (59 failures)

* add dtype fixup from https://github.com/tinygrad/tinygrad/pull/9297

* fix linter error and remove Tensor.pad (48 failures)

* add select_backward and index_put (40 failures)

* fix some more tests (36 failures)

* fix more tests (12 failures)

* some cleanups and fix couple more tests (10 failures)

* cleaner way to write upsample

* some more upsample cleanups

* use lambda for upsample

* add autowrapper for upsample forward

* cumsum and max_dim without aten functions

* revert _log_softmax

* fix more tests (1 failure)

* make linter happy

* move import to appropriate func

* make linter happy

* add codes for noqa

* some more refactors

* remove comment

* remove dependency on aten function for conv backward

* some more refactors

* add returns

* revert a change from merge

* some cleanups

* remove whitespace

* remove ruff change

* revert upsample

* add masked_fill_.Tensor and scatter.src_out

* add todo

* fix test_biased_conv2d

* fix test_var_one_in_axis & test_std_one_in_axis but break test_biased_conv2d :(

* revert torch_debug

* revert torch_debug

* skip test_gather_failure for the tiny backend

* make padding registration more concise

* add nonzero

* remove scatter_add since we already have the out

* fix scatter

* remove some repetition

* make upsample backward registrations more concise

* remove select.int

* use Tensor.cumsum

* realize conv2d outputs before backward to fix test_biased_conv2d

* add a todo for realize(1 failure)

* add new_empty and new_empty_strided

* make test_pad_circular_mode forward only and remove redundant stuff

* fix linter errors

* remove expect failure

* just tb

* slice is a view_op

* contiguous only when lazydata.is_realized

* fix backward for test_pad_circular_mode

* revert torch.nn.functional.pad override

* add transpose.int and make constant_pad_nd contiguous

* slice_backwards has no kwargs

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-31 21:13:09 -04:00
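
For orientation on entries like "aten.mv" and "masked_fill_.Tensor": PyTorch lets an out-of-tree backend claim the PrivateUse1 dispatch key and register an implementation per aten op. A rough sketch of that mechanism using torch's public registration API; the details of the actual tiny backend differ:

```python
import torch

# claim the out-of-tree backend slot under a friendly name (sketch)
torch.utils.rename_privateuse1_backend("tiny")

# each aten op gets an implementation registered for that dispatch key;
# the real backend does this for hundreds of ops, backed by tinygrad Tensors
@torch.library.impl("aten::mv", "PrivateUse1")
def mv(mat, vec):
  raise NotImplementedError("sketch: convert to tinygrad, matmul, convert back")
```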
Priyank Patel
e2d9322d21 torch backend: partial fix for strided related test fails (#9642)
* partial fix for strided related test fails

* cleanup

* fix lint
2025-03-31 05:45:18 -04:00
George Hotz
49b1c46d16 good changes from the dsp branch (#9638) 2025-03-31 13:02:53 +08:00
Yvon Manzi
6652003839 Add cumprod to Tensor (#9629)
* probably how cumprod should look

* update _cumalu to work with MUL

* shorter

* cumprod testing

* clean

* more cleanup

* add cumprod to torch backend.

* make it look like cumsum

* mypy fix

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-30 21:49:18 -04:00
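
Usage, assuming the interface mirrors cumsum as the bullets suggest:

```python
from tinygrad import Tensor

t = Tensor([1., 2., 3., 4.])
print(t.cumprod(0).tolist())  # [1.0, 2.0, 6.0, 24.0] -- running product along axis 0
```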
geohotstan
d52e91db7b ONNX ops clean ups (#9622)
* combine work from remove numpy and onnx ops tests

* clippy

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-30 21:39:22 -04:00
uuuvn
2a4247b8c2 RDNA 3.5 support (#9627) 2025-03-31 01:15:20 +08:00
geohotstan
a08b07b4da Bump onnx==1.17.0 (#9618)
* bump

* remove resize tf_crop_and_resize

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-30 03:21:51 -04:00
nimlgen
54e1e59b44 am: rdna 4 support (#9621)
* hm

* fix

* return this

* fine

* g

* ruff

* fix
2025-03-29 23:16:27 +07:00
uuuvn
5908b89f71 MI300X support (WIP) (#9585) 2025-03-29 19:46:42 +08:00
uuuvn
dd9aae02c3 Refactor ops_amd.py (MI300X prereq) (#9428) 2025-03-29 00:17:20 +07:00
George Hotz
1e6e75e39a little changes from dsp branch (#9582)
* little changes from dsp branch

* not that one

* need the where

* Revert "need the where"

This reverts commit 140f89c878.
2025-03-26 20:01:21 +08:00
Andrey
7b865ed03d use tuple in isinstance for type checking (#9583) 2025-03-26 19:36:48 +08:00
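
The pattern this cleanup applies: isinstance accepts a tuple of types, so chained checks collapse into one call.

```python
def describe(x):
  # one call with a tuple of types instead of two chained isinstance checks
  if isinstance(x, (int, float)): return "number"
  return "other"

print(describe(3), describe(2.5), describe("s"))  # number number other
```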
nimlgen
4cf2b68ca8 am_smi: fix init for newer versions (#9559) 2025-03-25 23:48:05 +07:00
Priyank Patel
4f5e03bd60 better fix inplace detach (#9557) 2025-03-24 22:50:28 +08:00
George Hotz
74d98eafb8 add onnx frontend stub [pr] (#9558) 2025-03-24 12:24:34 +08:00
geohotstan
309afa20b7 add Tensor.max_unpool2d (#9518)
* why does max_unpool2d feel slower than out.gradient ...

* slightly cleaner

* what happened to ruff

* need to think about this some more

* slightly faster now?

* clean up, 1 more failing edge case

* ok good

* working TINY_BACKEND

* nit doc wording

* retry CI
2025-03-22 12:11:33 -04:00
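
A usage sketch assuming a torch-like pairing with the return_indices support from #9506 further down: pool, keep the indices, then scatter the maxima back.

```python
from tinygrad import Tensor

x = Tensor.arange(16).reshape(1, 1, 4, 4).float()
out, idx = x.max_pool2d(kernel_size=2, return_indices=True)
restored = out.max_unpool2d(idx, kernel_size=2)
# restored is zero everywhere except the original argmax positions
```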
Francis Lata
eb95825eea RetinaNet dataloader (#9442)
* retinanet dataloader

* remove batch_size from generate_anchors

* refactor kits19 dataset tests

* add tests for dataloader

* fix testing setup and cleanups

* remove unused import
2025-03-21 13:36:41 -04:00
George Hotz
8e555c586c switch quantization to unsigned/unsigned + add Ops.REDUCE (#9527)
* switch quantization to unsigned/unsigned + add Ops.REDUCE

* tests

* nhwc + replay pkl
2025-03-21 17:02:37 +08:00
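
For context on "unsigned/unsigned": in affine quantization with a uint8 representation, the zero point is itself an unsigned value in [0, 255]. A generic sketch, not tinygrad's implementation:

```python
import numpy as np

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
  # uint8 representation: both the stored values and the zero point are unsigned
  return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
  return (q.astype(np.float32) - zero_point) * scale
```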
George Hotz
68053d0510 dsp stuff / sniff ioctls from snpe (#9490)
* sniff ioctls from snpe

* dump input buffers

* snpe logs from dsp

* NHWC support

* knum 3

* this run?

* revert those

---------

Co-authored-by: Comma Device <device@comma.ai>
2025-03-20 10:38:23 +08:00
geohotstan
8c0d0a122c Add return_indices to max_pool (#9506)
* wow argmax is so good

* 1 less line

* clean up and better variable names

* is this torch thing right...?

* add more tests

* slap a TODO on it

* clean ups

* prettier looking code and fix ceil mode test

* add return types and some docs

* ok that was a bad example since indices == value, just no example
2025-03-19 15:25:37 -04:00
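
"argmax is so good" likely refers to deriving the indices from an argmax over each pooling window. A numpy sketch of that idea for a 4x4 input and 2x2 windows, mapping window-local argmax positions back to flat input indices:

```python
import numpy as np

x = np.arange(16).reshape(4, 4).astype(np.float32)
k = 2
# carve the input into non-overlapping k*k windows
win = x.reshape(2, k, 2, k).transpose(0, 2, 1, 3).reshape(2, 2, k * k)
local = win.argmax(-1)                         # argmax within each window
rows = np.arange(2)[:, None] * k + local // k  # back to input row/col
cols = np.arange(2)[None, :] * k + local % k
indices = rows * 4 + cols                      # flat indices, torch-style
print(indices)  # [[ 5  7] [13 15]]
```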
Francis Lam
1e5d9ad8f7 extra/gemm/max_matmul: start of custom kernels for GEMM (#6926)
* extra/gemm/max_matmul: start of custom kernels for GEMM

* add an unoptimized FP16/FP16 MMA example

* add slow 3-stage fp16 acc example

* add correct 3-stage pipeline with unswizzled/flat smem input (slow)

* add acc fp16 example with 3 stages and swizzle (no bank conflicts)

* add max version of NV fp16_fp16_fp16

* fix up comments and removed unused code in max variations

* add start of no_xor example

* fix to account for UOps to Ops
2025-03-19 15:04:57 +08:00
b1tg
a95b489a55 nanoGPT train works with tiny torch backend (#9283)
* train_shakespeare_char.py works

* move aten.where.self_out to tiny_backend_out

* fix memory leak

* corealize in the backward_hook

* Update backend.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-19 11:51:02 +08:00
George Hotz
117b7a16ef VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
nimlgen
a82c9332d3 am: rename soc21 to soc (#9482) 2025-03-18 08:54:26 +08:00
Anish Umale
5e58f4b65b Tiny backend test_ops fix part 3 (#9483)
* extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302

* pass dtype and device for ones_like
2025-03-17 18:01:51 -04:00
TJ
9fcef4d009 add masked_select to tensor.py (#9468)
* add masked_select to tensor.py

* fix tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-17 16:05:36 -04:00
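
Usage, assuming semantics mirroring torch.masked_select: the result is 1-D because each row may contribute a different number of elements.

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
print(t.masked_select(t > 2).tolist())  # [3, 4]
```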
geohotstan
53d6f1e1bb Add bitonic cat sort (#9422)
* poc

* repeated values fail, sigh

* is this being timed out?

* fix up down names

* bitonic v2, does this run?

* bitonic v3, faster

* bitonic v3.1, faster

* bitonic v3.1.1, same speed unlucky

* support dim and indices

* bitonic v3.2, simpler code, TODO repeated indices

* bruv gimme green for once cmon

* cat (stack) implementation, slow but maybe one day when cat is fast meow

* revert to v3.2

* bitonic v4, who let the cats out edition

* clean up variable names

* figured out repeated indices :D

* ruff check --fix

* use sort for topk

* add Tensor.sort everywhere

* fix docs and add some types

* slightly better variable names

* am I doing torch inplace correctly?

* delegate sort to values_stable

* add a contig, faster first sort

* maybe don't test_inplace

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-17 12:01:23 -04:00
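
A usage sketch assuming a torch-like signature for the new Tensor.sort; per the bullets, topk is delegated to it and repeated values (the tricky case for a bitonic network) are handled:

```python
from tinygrad import Tensor

t = Tensor([3., 1., 2., 1.])
values, indices = t.sort()    # ascending by default, returns (values, indices)
print(values.tolist(), indices.tolist())
top_vals, top_idx = t.topk(2) # now routed through sort
```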
George Hotz
824c5f41ac dsp work try 3 (#9475)
* dsp work try 3

* padding
2025-03-17 16:42:12 +08:00
George Hotz
52ae9af4dd Fast DSP for MobileNetV2 (try 2) (#9467)
* Fast DSP for MobileNetV2 (try 2)

* enable fast path on uchar

* fix tests
2025-03-17 15:10:36 +08:00
George Hotz
09e7708b49 minimum change for rdna4 [pr] (#9455) 2025-03-16 13:39:24 +08:00
George Hotz
cb7a7f69c7 quantization preprocessor from DSP, should be universal (#9437)
* quantization preprocessor from DSP, should be universal

* touchups

* fix tests
2025-03-15 07:49:37 +08:00
chenyu
0e591baf43 redo simple_matmul change (#9450)
numpy does not support bfloat16
2025-03-14 17:53:52 -04:00
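
The constraint behind the revert-and-redo: numpy has no native bfloat16, so a script generating random inputs with numpy must either stay in a dtype numpy supports or pull in an extension type. A sketch of the latter, assuming ml_dtypes (the package the fp8 work above uses):

```python
import numpy as np
import ml_dtypes

# np.dtype("bfloat16") raises TypeError -- numpy doesn't ship the type
x = np.random.uniform(-1, 1, 16).astype(np.float32)
xb = x.astype(ml_dtypes.bfloat16)  # extension dtype, usable for test data
```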
chenyu
b0f63d3c04 Revert "simple_matmul.py uses np to generate random (#9438)" (#9449)
This reverts commit 14018050c1.
2025-03-14 17:14:22 -04:00