Commit Graph

352 Commits

Author SHA1 Message Date
qazal
5b59728c75 refactor LOAD(DEFINE_GLOBAL, VIEW) in kernels to LOAD(VIEW(DEFINE_GLOBAL)) (#10541)
* changes to core tinygrad

* fixups pt1

TC=3
docs/abstractions2.py
IMAGE=2
test_quantize_dsp
test_schedule

* more tests

* green now

* images stay images
2025-05-30 14:27:58 +03:00
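
The title describes a shape change in the kernel UOp graph. A minimal sketch of the two shapes, assuming current constructor conventions (this is not the PR's actual code):

```python
# minimal sketch of the graph shapes named in the title above;
# constructor details are assumptions, not the PR's code.
from tinygrad.uop.ops import UOp, Ops
from tinygrad import dtypes

g = UOp(Ops.DEFINE_GLOBAL, dtypes.float.ptr(), arg=0)  # a kernel input buffer
# before: the ShapeTracker view rode along as a sibling source of LOAD
#   LOAD(DEFINE_GLOBAL, VIEW)
# after: the view wraps the buffer, so LOAD has a single indexed source
#   LOAD(VIEW(DEFINE_GLOBAL))
```
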
qazal
bbf05110a2 use kernelize in TestLinearizer.test_indexing_multireduce [pr] (#10571) 2025-05-30 11:27:09 +03:00
qazal
9169dcfb49 do not create kernels with more inputs than the backend allows (#10510)
* work

* no itertools + top down pass

* clean viz

* python can do that

* webgpu

* gbarrier of gbarrier is gbarrier

* device can be tuple

* bug in toposort

* failing test for gated toposort

* contiguous of gbarrier is gbarrier

* check for binops

* Revert "check for binops"

This reverts commit 53e3cdf720.

* viz + match on gbarrier, self exists by default

* alt

* green now

* cleanup
2025-05-26 18:02:03 +03:00
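
Two bullets above ("gbarrier of gbarrier is gbarrier", "contiguous of gbarrier is gbarrier") read as graph-rewrite rules. A hedged sketch of how such rules look in tinygrad's PatternMatcher style; the exact op spelling and module path are assumptions, not the PR's code:

```python
# hedged sketch of the idempotence rules in the bullets above;
# Ops.GBARRIER spelling and module path are assumptions.
from tinygrad.uop.ops import UPat, PatternMatcher, Ops

dedup_gbarrier = PatternMatcher([
  # GBARRIER(GBARRIER(x)) -> GBARRIER(x)
  (UPat(Ops.GBARRIER, src=(UPat(Ops.GBARRIER, name="g"),)), lambda g: g),
  # CONTIGUOUS(GBARRIER(x)) -> GBARRIER(x)
  (UPat(Ops.CONTIGUOUS, src=(UPat(Ops.GBARRIER, name="g"),)), lambda g: g),
])
```
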
George Hotz
411392dfb7 move files into uop dir (#10399)
* move files into uop dir [pr]

* tinygrad.uop is a thing

* fix uop docs, no pr

* fix viz
2025-05-18 11:38:28 -07:00
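
Per "tinygrad.uop is a thing", imports move to the new package path; a one-line sketch (exact symbols assumed):

```python
from tinygrad.uop.ops import UOp, Ops  # was: from tinygrad.ops import UOp, Ops
```
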
Ignacio Sica
8f79492c75 fix test_tensor_cores_codegen for ptx renderer (#10119) 2025-05-01 21:52:36 -03:00
Ignacio Sica
bf5fb97498 fix AMD_LLVM bf16 tc for gfx1100 (#10102)
* fix amd_llvm bf16 tc

* cleanup pattern
2025-04-30 20:06:38 -03:00
Ignacio Sica
bda116d773 fix use_tensor_cores propagation (#10048)
* propagate use_tensor_cores

* add use_tensor_core to arg in test and search

* bugfix

* get TC val from ContextVar in search

* revert minor space change

* add tc emulation test to ci and benchmark

* revert

* revert whitespace change

* remove test for ptx

* add comment and remove llvm test run
2025-04-28 19:30:50 -03:00
George Hotz
4c242b0483 hotfix: tests all pass on metal local 2025-04-28 12:09:00 -04:00
qazal
d13c100981 don't sort dims in verify_sink_dims [pr] (#10059)
* don't sort dims in verify_sink_dims [pr]

* 1 can exist with n

* put process_replay warn last

* assert shape is the same

* bring that back
2025-04-26 23:24:30 +08:00
Ignacio Sica
76a86735c0 hotfix amd bf16 is supported case (#10039)
* hotfix amd and amd_llvm

* bf16 not supported in ci

* hotfix amd_llvm is not a device

* remove default

* dont gate on ci and amd_llvm

* minor cleanup

* skip bf16 tc test for amd_llvm
2025-04-24 21:29:27 -03:00
Ignacio Sica
b4f823acbe fix helper_tc_allclose (#9606)
* fix helper_tc_allclose

* cleanup

* hotfix

* cleanup

* cleanup

* check real buffer and add cast for bf16

* cleanup

* fix padded for ops_python

* avoid assert on amd emulated tc

* swap dimensions

* revert, should have nothing to do with padded

* revert fix, should not go in this pr

* remove skip
2025-04-24 18:36:40 -03:00
Ignacio Sica
51ca19d061 set test_tensor_cores_padded_amd to expectedFailure (#10036)
* init

* add expected failure to correctly track progress

* hotfix

* skip for amd_llvm as well

* add skip

* add pr number

* move comment to amd test

* change reason
2025-04-24 17:11:40 -03:00
Ignacio Sica
373ca59b7f use is_dtype_supported to check dtype support in tc tests (#10035) 2025-04-24 14:59:14 -03:00
George Hotz
2ed3acd767 toposort is a function [pr] (#10004) 2025-04-23 16:25:03 +01:00
chenyu
6c30948df6 hand_coded_optimizations returns list[Opt] [pr] (#9938)
new api looks like `k.apply_opts(hand_coded_optimizations(k))`
2025-04-19 20:26:59 -04:00
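
A hedged sketch of the new call shape quoted above, with assumed module paths (heuristic.py placement per #9844 below) and a matmul schedule item standing in for a real kernel:

```python
from tinygrad import Tensor
from tinygrad.codegen.kernel import Kernel
from tinygrad.codegen.heuristic import hand_coded_optimizations  # path assumed

si = (Tensor.rand(64, 64) @ Tensor.rand(64, 64)).schedule()[-1]  # a matmul kernel
k = Kernel(si.ast)
k.apply_opts(hand_coded_optimizations(k))  # heuristic returns list[Opt]; kernel applies it
```
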
Ignacio Sica
023b1c28a2 test_tensor_cores_padded refactor (#9724)
* set pad to 3 for amd padded tc test

* change pad for amd regardless of CI

* test tc padded uops and correctness separately

* add test_tensor_cores_padded_uops test to ci

* remove redundant check for amd device

* cleanup
2025-04-18 17:05:54 -03:00
George Hotz
aa98aff4cd don't use ops name, just keep sink (#9922)
* don't use ops name, just keep sink

* fix test

* endif sink
2025-04-18 08:59:18 +01:00
chenyu
f5256e0020 Kernel.apply_opts [pr] (#9917)
* Kernel.apply_opts [pr]

updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimizations (see the sketch after this entry)

* not you yet
2025-04-17 08:00:56 -04:00
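
The "updated all `for opt in`" note refers to call sites collapsing into the new method; a minimal before/after sketch (`k` is a Kernel, `opts` a list[Opt]):

```python
# before: callers applied opts one by one
for opt in opts: k.apply_opt(opt)
# after: one call
k.apply_opts(opts)
```
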
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
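
The bullets reference a python emulator built on ml_dtypes and an "e4m3 + e5m2 result dtype test". A small standalone sketch of those two FP8 formats using that package (not tinygrad's code):

```python
import numpy as np
import ml_dtypes  # the package the python-emulator bullets reference

a = np.float32(0.3).astype(ml_dtypes.float8_e4m3fn)  # 1 sign / 4 exponent / 3 mantissa bits
b = np.float32(0.3).astype(ml_dtypes.float8_e5m2)    # 1 sign / 5 exponent / 2 mantissa bits
print(a, b)  # e4m3 keeps more precision; e5m2 keeps more range
```
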
Ignacio Sica
58785181a8 AMD bf16xf32 TC (#9717)
* dont test bf16 for emulated amd tc

* skip bf16 tc test in ci

* skip bf16 for AMD in test_tensor_cores_codegen

* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
George Hotz
cac8bcf8b5 use Ops.REDUCE (#9721)
* decrease bert python time [pr]

* order copies

* Revert "order copies"

This reverts commit 3f62c8693b.

* rewrite count

* Ops.REDUCE

* acc first in the add chain

* Fix tensor core acc

* arange patterns look good

* fix multireduce gate

* reduce rewrite rule

* bump that to 15 minutes

* multiwmma isn't fusing

* gep through wmma is gep pushing

* bump that timeout too, it's all env setup

* add failing test
2025-04-04 10:14:34 +08:00
Ignacio Sica
2d6d8b7355 add bf16 mfma support (#9695)
* add bf16 mfma support

* skip tc if emulated_amd and dtypes is bf16

* hotfix
2025-04-02 21:44:49 +08:00
George Hotz
e78e8722dc Revert "LDS noop and spec (#9669)" (#9691)
This reverts commit 870b545ace.

Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com>
2025-04-02 15:31:32 +08:00
Ignacio Sica
870b545ace LDS noop and spec (#9669)
* init lds noop and lds_0 spec

* refactor lds helper test

* fix typo

* test all lds at the same time

* change comment

* comment

* start test_lds_full

* test_lds_tc

* add tc spec
2025-04-01 18:44:55 +08:00
b1tg
d9af4cfc1b AMD_LLVM: tensor cores support (#9613)
* tensor cores support

* test tensor cores codegen

* use rewrite rules

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-04-01 09:56:27 +08:00
Ignacio Sica
1444069c09 Uppercase K for dimension and lowercase k for kernel in linearizer tc helper test (#9649) 2025-03-31 19:05:36 +08:00
Ignacio Sica
baa67fd124 Uppercase N and M (standalone syntax change) (#9647) 2025-03-31 18:45:30 +08:00
chenyu
f8976dd2eb enable more webgpu tests (#9502)
OSX has a larger buffer count limit, and it supports fp16 now
2025-03-18 23:03:54 -04:00
George Hotz
117b7a16ef VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]

* fix test
2025-03-18 15:15:04 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
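
A hedged sketch of the rename on a reduce (the keyword now matches numpy/torch, per the note):

```python
from tinygrad import Tensor, dtypes

x = Tensor.ones(8, dtype=dtypes.half)
y = x.sum(dtype=dtypes.float)  # accumulate/return in fp32; previously acc_dtype=dtypes.float
```
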
George Hotz
ece0a0f305 use empty for test instead of rand (#9332) 2025-03-03 16:19:06 +08:00
George Hotz
2cc4cb74f0 reorder binops (#9328)
* reorder binops

* test improvements + fix string tests

* ugh, okay this
2025-03-03 14:58:18 +08:00
qazal
2eab8021fb remove inputs+outputs attributes from ScheduleItem [pr] (#9192)
* remove inputs/outputs from ScheduleItem

* fix test_linearizer

* fix test_conv_shapetracker

* fix test_schedule + lint

* test_image_dtype + multitensor + search
2025-02-21 13:48:11 +01:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
George Hotz
a4dab3ec3f add name uop (#9149)
* add name uop, TODO: refactor renderer to use

* renderer uses name uop

* fix tests

* render

* ptx
2025-02-18 15:26:58 +08:00
Ahmed Harmouche
59fe45f947 Solve "get_grouped_dims does not split" issue (#9085)
* Solve dims too large errors on webgpu

* Simplify divisor find

* Test square root divisor

* Fix lint

* Refactor into group_dims and split_dims

* Refactor

* Fix lint

* Add back max check in _group_dims

* Prefer grouping over split

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-16 19:57:29 -05:00
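
A minimal sketch of the "square root divisor" idea in the bullets above, not the repo's actual helper: factor an oversized launch dimension into two dims that both fit the device limit.

```python
import math

# minimal sketch of the divisor-splitting idea (not the repo's helper)
def split_dim(n: int, limit: int) -> tuple[int, int]:
  for d in range(math.isqrt(n), 0, -1):  # start near sqrt(n) for a balanced split
    if n % d == 0 and n // d <= limit:
      return n // d, d                   # d <= n//d <= limit, so both factors fit
  raise ValueError(f"cannot split {n} to fit {limit}")

print(split_dim(1_000_000, 65535))  # -> (1000, 1000)
```
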
chenyu
f53b819648 UOps. -> Ops. [pr] (#9044)
updated the comments and docs except extra/
2025-02-12 12:53:23 -05:00
Ignacio Sica
aaed315fee add AMX support to LLVM (#8957)
* init amx support for llvm

* revert elf changes

* fix attributes for AMX asm calls

* add comments

* add llvm amx job to benchmarks

* cleanup

* cleanup

* hotfix: improve comments

* comment for aux buffers

* hotfix:

* move amx_tc to ClangRenderer

* merge master

* refactor

* add docs

* add corsix docs reference

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-12 16:01:18 +08:00
George Hotz
a3c78d47b3 speed docs + upgrades [pr] (#8964)
* add some docs about speed [pr]

* better torch gemm

* enable locals on llvm/clang

* disable locals for beam speed on LLVM/CLANG

* 0x20 alignment in llvm allows ymm use
2025-02-08 17:28:52 +08:00
George Hotz
c2b4c43edb handle stride 0 reduce (#8068)
* handle stride 0 reduce [pr]

* more test fixups

* a few more

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-02-07 15:40:58 +01:00
Ahmed Harmouche
133cacadde Autogen webgpu dawn, removing wgpu-py dependency (f16 support part 1) (#8646)
* Switch to dawn, all tests passing locally

* Use dawn-python

* Skip failing test

* Skip midcast and fix timestamp on metal ci

* Autogen webgpu

* Try fetch dawn lib again

* /usr/lib

* Without lib prefix

* Test autogen diff

* Delete webgpu support, move everything to ops_webgpu

* mypy fix

* Simplify, refactor

* Line savings

* No ResultContainer

* Type annotation for result

* Some more simplifications

* Why was this explicit sync used at all?

* Refactor: delete functions that are only used once

* Create shader module inline

* Clear unit tests cache, maybe that solves it

* That wasn't it

* Try deleting cache to pass failing weight compare

* weights_only=False for pytorch 2.6

* Simplify ctype array creation

* Remove nanosecond precision timestamps

* Simplify error handling

* Refactor, add back type annotations

* Deleted custom submit function, refactor

* read_buffer simplify

* Fix use after free, refactor

* Simplify supported_features

* Runtime docs

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-07 15:16:59 +08:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
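
The rename in one line (PEP 585 builtin generics replace the typing aliases):

```python
def shape(x) -> tuple[int, ...]: ...  # was: Tuple[int, ...] via `from typing import Tuple`
```
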
Ignacio Sica
15f94ac964 TC_SEARCH_OVER_SHAPE to search multiple TC shapes (#8793)
* squash search over search

* refactor assert

* init benchmark

* cleaner get_kernel_actions

* cleaner get_kernel_actions

* add comment
2025-02-05 11:03:46 -05:00
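
A hedged usage sketch, assuming the flag from the title is settable as a ContextVar like BEAM (that availability is an assumption):

```python
from tinygrad import Tensor
from tinygrad.helpers import Context

# search BEAM actions across multiple tensor-core shapes, per the title;
# TC_SEARCH_OVER_SHAPE as a ContextVar is an assumption.
with Context(BEAM=2, TC_SEARCH_OVER_SHAPE=1):
  (Tensor.rand(512, 512) @ Tensor.rand(512, 512)).realize()
```
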
Ignacio Sica
260df1a17f tc_select noop (#8801)
* tc_select noop

* revert changes in test
2025-01-29 13:53:23 -05:00
qazal
ba17786068 do not construct unmasked VALID (#8759)
* new lines that exist in codegen/ops

* update tests

* update sops.gz (13071 -> 13070 asts)

* fix viz too

* remove that TODO

* diff pruning

* mask assert + device

* work

* diff pruning

* re: fix viz too

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 20:51:21 +02:00
Ignacio Sica
b240f12593 [TIP-9] rename Opt's amt to arg 2 (#8770)
* rename Opt amt to arg

* ignore_beam_cache for test_tiny

* move ignore_beam_cache to test_tiny

* move to separate pr

* revert space change

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-27 14:19:04 -05:00
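
The rename as a call-site sketch (import path assumed):

```python
from tinygrad.codegen.kernel import Opt, OptOps  # import path assumed

opt = Opt(op=OptOps.UPCAST, axis=0, arg=4)  # the field was `amt` before this rename
```
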
George Hotz
3ed146a5ff Revert "rename Opt amt to arg (#8767)" (#8769)
This reverts commit bf041659a5.
2025-01-27 23:46:37 +09:00
Ignacio Sica
bf041659a5 rename Opt amt to arg (#8767) 2025-01-27 23:36:47 +09:00