Commit Graph

10633 Commits

Author SHA1 Message Date
Ignacio Sica
bc91fffc5d fix gated store with index in python backend (#9703)
* add default gate in index

* assert store

* add TestRendererFailures

- move test_gated_store_with_alu to new TestRendererFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py, run on Python CI

* add test for gated index in 2d

* test TestRendererFailures
2025-04-03 12:48:28 +08:00
qazal
f2bd65ccfc delete Ops.EMPTY and Tensor._metaop (#9715)
* delete Ops.EMPTY and Tensor._metaop [pr]

* test_creation

* arg=

* abstractions2
2025-04-03 12:29:02 +08:00
George Hotz
5c7b549eab use functools.cache instead of lru_cache(None) [pr] (#9714)
* use functools.cache instead of lru_cache(None) [pr]

* more cache
2025-04-03 11:47:13 +08:00
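For context on this change: in the standard library, `functools.cache` (added in Python 3.9) is exactly an unbounded `lru_cache`, so the swap is behavior-preserving and slightly shorter. A minimal illustration:

```python
import functools

# functools.cache is equivalent to functools.lru_cache(maxsize=None):
# an unbounded memoizing cache with no LRU eviction bookkeeping.
@functools.cache
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, linear time thanks to memoization
```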
qazal
bbd13191f4 cleanup tensor BIND + remove outdated comments in tensor.py [pr] (#9712)
* cleanup tensor BIND + remove outdated comments in tensor.py [pr]

* from_blob whitespace

* assert
2025-04-03 11:21:53 +08:00
geohotstan
e1d7e47cca fix ONNX IsInf unintended dtype promotion (#9711)
* add IsInf

* add corresponding test

* that float16 is kinda silly
2025-04-02 22:46:15 -04:00
qazal
11ae254dc5 construct BUFFER UOps directly when device is known [pr] (#9710)
* construct BUFFER UOps directly when device is known [pr]

* diff
2025-04-03 10:41:44 +08:00
George Hotz
1714fc3ba4 start work on speed [pr] (#9707)
* fix get_location

* fix get_location try 2

* clean up split_load_store [pr]

* SHR fixup [pr]
2025-04-03 10:39:01 +08:00
George Hotz
0f1ffc2050 hotfix: cat tests 2048 instead of 256 2025-04-03 10:37:56 +08:00
uuuvn
5bd485c027 Fix double SDMA_OP_FENCE (#9705)
Introduced in #9585, probably when I incorrectly resolved a merge conflict
while rebasing an old, mi300x-only branch. Seems to be the source of the
multi-GPU BEAM llama hangs.
2025-04-03 09:43:37 +08:00
chenyu
a6fec2f5ae dev_run for bert on mi300x (#9706) 2025-04-02 21:12:55 -04:00
nimlgen
d96b4983ac amd: support rdna4 in runtime again (#9702) 2025-04-03 01:19:23 +07:00
Ignacio Sica
2d6d8b7355 add bf16 mfma support (#9695)
* add bf16 mfma support

* skip tc if emulated_amd and dtypes is bf16

* hotfix
2025-04-02 21:44:49 +08:00
nimlgen
a6733f519f dsp: make relro sections contiguous (#9701) 2025-04-02 18:02:16 +07:00
George Hotz
ea5caefef0 gep should look at count, not vcount (#9698)
* gep should look at count, not vcount

* gep in order is a rule

* min change

* gep on void
2025-04-02 18:10:57 +08:00
George Hotz
f72a87fd0e add proper support for Ops.IGNORE to remove store masks (#9692)
* add proper support for Ops.IGNORE to remove store masks

* remove useless NHWC

* revert that
2025-04-02 16:38:01 +08:00
chenyu
3b8d923692 remove skip LLVM in test_div_int (#9686) 2025-04-02 04:15:00 -04:00
chenyu
bc3bfcbad4 update install gpuocelot (#9693)
`-DCMAKE_POLICY_VERSION_MINIMUM=3.5`
2025-04-02 04:10:34 -04:00
George Hotz
e78e8722dc Revert "LDS noop and spec (#9669)" (#9691)
This reverts commit 870b545ace.

Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com>
2025-04-02 15:31:32 +08:00
George Hotz
4514fd91c1 more stuff from DSP (#9689)
* more good stuff from dsp branch

* test pkl imagenet
2025-04-02 15:27:48 +08:00
chenyu
6a5eacba8b disable CI red llama 3 4 gpu beam (#9690)
device hangs and CI would fail
2025-04-02 03:19:09 -04:00
Ignacio Sica
876a8be97a Debug env var breakdown (#9663)
* add debug level breakdown

* hotfix

* Update env_vars.md
2025-04-02 14:34:07 +08:00
George Hotz
6f812d3f2f fixes from the dsp branch + 12500 lines (#9683)
* fixes from the dsp branch

* more changes

* those are gep pushing
2025-04-02 13:07:17 +08:00
chenyu
c20f112e9f example test use z3 to verify valid simplification (#9684) 2025-04-02 01:05:52 -04:00
chenyu
bca0c85193 skip CI CPU test_data_parallel_resnet_train_step (#9685)
flaky
2025-04-02 01:04:54 -04:00
qazal
bb94f13e58 add RECORD_TRACEBACKS=1 option to process replay (#9679)
* add RECORD_TRACEBACKS=1 option to process replay

* stack
2025-04-02 11:58:27 +08:00
chenyu
3acc1b928a minor div_and_mod_folding cleanup [pr] (#9681)
it's not wrong, since the dtype is never used, but `x.const_like` is more readable
2025-04-01 23:51:36 -04:00
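To illustrate the readability point, here is a hedged sketch of the `const_like` pattern (the class below is illustrative, not tinygrad's actual `UOp`):

```python
# Sketch: const_like takes the dtype from the receiver, so the reader never
# has to wonder whether an explicitly passed dtype was load-bearing.
class Node:
    def __init__(self, dtype: str): self.dtype = dtype
    @staticmethod
    def const(dtype: str, value): return (dtype, value)
    def const_like(self, value):
        # same constant, dtype inherited from self
        return Node.const(self.dtype, value)

x = Node("int32")
assert x.const_like(0) == Node.const(x.dtype, 0)  # equivalent, clearer intent
```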
chenyu
c672716b38 improve vmin/vmax for IDIV (#9678) 2025-04-01 23:16:01 -04:00
chenyu
8dd88ad476 don't div_and_mod_folding for negative numerator with remainder (#9674)
can be wrong in C div since it truncates towards zero
2025-04-01 16:26:23 -04:00
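The pitfall is standard: C integer division truncates toward zero, while the floor division the folding assumes rounds toward negative infinity, and the two disagree exactly when the numerator is negative with a nonzero remainder. A quick demonstration:

```python
# C's `/` truncates toward zero; Python's `//` floors toward negative infinity.
# They agree for non-negative numerators and differ when x < 0 with a remainder.
c = 2
for x in (7, -7):
    c_style = int(x / c)  # trunc, like C: 7/2 -> 3, -7/2 -> -3
    floored = x // c      # floor:         7/2 -> 3, -7/2 -> -4
    print(x, c_style, floored)
# 7 3 3
# -7 -3 -4
```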
chenyu
0e34f9082e helper functions for cstyle div mod [pr] (#9673) 2025-04-01 08:06:56 -04:00
qazal
eee0dcc37a merge viz back into one file (#9672)
* merge viz back into one file

* work

* rename lib to js directory

* fix diff

* less indenting

* memory graph is back

* viz_sz.py
2025-04-01 19:52:02 +08:00
Ignacio Sica
870b545ace LDS noop and spec (#9669)
* init lds noop and lds_0 spec

* refactor lds helper test

* fix typo

* test all lds at the same time

* change comment

* comment

* start test_lds_full

* test_lds_tc

* add tc spec
2025-04-01 18:44:55 +08:00
uuuvn
609a006242 AMDComputeQueue.wreg (#9628)
* AMDComputeQueue.wreg

Used to be part of #9428. I think it's much more readable than repeating
the ~same PM4 things over and over again, especially with separate .encode

* fix indentation
2025-04-01 17:01:33 +07:00
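The design idea, sketched below with purely illustrative names and packet words (an assumption, not the actual tinygrad PM4 encoding): fold the repeated "build a write-register packet" boilerplate into one helper so each call site reads as a plain register assignment.

```python
# Hypothetical sketch of the wreg pattern; the header word and register
# offsets are placeholders, not real PM4 values.
class ComputeQueueSketch:
    def __init__(self): self.cmds: list[int] = []
    def wreg(self, reg: int, *values: int) -> "ComputeQueueSketch":
        # one helper emits header, register offset, and payload, instead of
        # repeating the same encode sequence at every call site
        self.cmds += [0xC0DE0000 | len(values), reg, *values]
        return self  # chainable

q = ComputeQueueSketch()
q.wreg(0x100, 1, 2).wreg(0x104, 3)
print(q.cmds)  # two packets, each: header word, register offset, payload
```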
qazal
fa373e15a3 hotfix: NULL=1 Buffer does not have _buf (#9661) 2025-04-01 17:43:55 +08:00
nimlgen
3e2f42c2e8 autogen: remove am headers from extra (#9666) 2025-04-01 14:45:30 +07:00
Ignacio Sica
cfad139189 bump assembly debug to 7 (#9662) 2025-04-01 11:51:33 +08:00
Ignacio Sica
ac533e89a2 remove duplicated ast print (#9660) 2025-04-01 10:29:24 +08:00
Ignacio Sica
846ef84cda move uops print to debug >= 6 (#9659) 2025-04-01 10:29:09 +08:00
b1tg
d9af4cfc1b AMD_LLVM: tensor cores support (#9613)
* tensor cores support

* test tensor cores codegen

* use rewrite rules

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-04-01 09:56:27 +08:00
qazal
1658eb4e63 always fit fresh viz graph into view [pr] (#9657) 2025-04-01 09:34:26 +08:00
Anish Umale
a1ee4d587f Fix test_ops for tiny backend (#9302)
* fix some tests in test_ops for torch backend (171 failing)

* fix more tests (135 failures)

* fix tests (126 failing)

* handle transposed convs (109 tests failing)

* fix slice

* fix lshift & rshift and more tests (87 tests failing)

* revert accidental change

* remove unnecessary changes (82 failures)

* fix backward for avg_pool2d (78 failures)

* fix backward for avg_pool2d (78 failures)

* fix replication backpass

* fix reflection pad back pass (71 failures)

* cummax with indices, aten.mv and move out methods (67 failures)

* extract avg_pool2d and avg_pool3d to separate functions (62 failures)

* revert changes for cat_out

* rewrite avg_pool and pad without repetition

* remove duplicates from decomps

* slice rewrite and add slice_backward (59 failures)

* add dtype fixup from https://github.com/tinygrad/tinygrad/pull/9297

* fix linter error and remove Tensor.pad (48 failures)

* add select_backward and index_put (40 failures)

* fix some more tests (36 failures)

* fix more tests (12 failures)

* some cleanups and fix couple more tests (10 failures)

* cleaner way to write upsample

* some more upsample cleanups

* use lambda for upsample

* add autowrapper for upsample forward

* cumsum and max_dim without aten functions

* revert _log_softmax

* fix more tests (1 failure)

* make linter happy

* move import to appropriate func

* make linter happy

* add codes for noqa

* some more refactors

* remove comment

* remove dependency on aten function for conv backward

* some more refactors

* add returns

* revert a change from merge

* some cleanups

* remove whitespace

* remove ruff change

* revert upsample

* add masked_fill_.Tensor and scatter.src_out

* add todo

* fix test_biased_conv2d

* fix test_var_one_in_axis & test_std_one_in_axis but break test_biased_conv2d :(

* revert torch_debug

* revert torch_debug

* skip test_gather_failure for the tiny backend

* make padding registration more concise

* add nonzero

* remove scatter_add since we already have the out

* fix scatter

* remove some repetition

* make upsample backward registrations more concise

* remove select.int

* use Tensor.cumsum

* realize conv2d outputs before backward to fix test_biased_conv2d

* add a todo for realize (1 failure)

* add new_empty and new_empty_strided

* make test_pad_circular_mode forward only and remove redundant stuff

* fix linter errors

* remove expect failure

* just tb

* slice is a view_op

* contiguous only when lazydata.is_realized

* fix backward for test_pad_circular_mode

* revert torch.nn.functional.pad override

* add transpose.int and make constant_pad_nd contiguous

* slice_backwards has no kwargs

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-31 21:13:09 -04:00
qazal
a0b4465412 bring GroupOp.Meta back (#9656) 2025-04-01 01:02:29 +08:00
Ignacio Sica
f277f407f2 remove smem_prefix_for_cast for amd (#9651) 2025-03-31 23:03:35 +08:00
chenyu
f7cb2e8da3 bert dev_beam for mi300x box (#9648)
* bert dev_beam for mi300x box

* terminate BENCHMARK properly
2025-03-31 08:35:51 -04:00
qazal
5171b098e5 merge_double_reduce without asserts [pr] (#9650) 2025-03-31 19:17:05 +08:00
Ignacio Sica
1444069c09 Uppercase K for dimension and lowercase k for kernel in linearizer tc helper test (#9649) 2025-03-31 19:05:36 +08:00
Ignacio Sica
baa67fd124 Uppercase N and M (standalone syntax change) (#9647) 2025-03-31 18:45:30 +08:00
chenyu
aca0f1befb print idx when OUT OF BOUNDS ACCESS (#9646)
in some cases (if there's a where in idx) the vmin/vmax might not be tight
2025-03-31 06:12:44 -04:00
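Why the static range can be loose: interval analysis of a where typically takes the union of the two branches' ranges and drops their correlation with the condition, so the derived vmin/vmax can be wider than any value the index actually takes. A hedged sketch (illustrative, not tinygrad's actual analysis):

```python
# Bound of where(cond, a, b) as the union of the branch ranges: sound but
# conservative, since it ignores how cond constrains each branch.
def where_bounds(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    return (min(a[0], b[0]), max(a[1], b[1]))

# idx = where(i < 4, i, 0) with i in [0, 7]: every concrete idx lies in [0, 3],
# but the union of branch ranges is [0, 7], so printing the concrete idx on an
# out-of-bounds trap is more informative than the loose static bound.
print(where_bounds((0, 7), (0, 0)))  # (0, 7), not the tight (0, 3)
```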
Priyank Patel
e2d9322d21 torch backend: partial fix for strided related test fails (#9642)
* partial fix for strided related test fails

* cleanup

* fix lint
2025-03-31 05:45:18 -04:00
qazal
76c1b1edf6 viz kernel list cleanup (#9643) 2025-03-31 15:53:39 +08:00
George Hotz
e4c545b396 linearizer fix from dsp branch (#9641)
* linearizer fix from dsp branch

* revert that
2025-03-31 14:26:39 +08:00