chenyu
8c6299bced
move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]
also folded all long lines
* make a copy and rename self -> k
* fix test
2025-04-10 23:40:16 -04:00
George Hotz
78caf55154
Revert "FP8 support on NVIDIA ( #8631 )"
...
This reverts commit 2c8e4ea865 .
2025-04-09 12:27:41 +08:00
pkotzbach
2c8e4ea865
FP8 support on NVIDIA (#8631)
* squashed fp8 commits
* tensorcore start
* minor changes
* pre-commit
* pylint
* Delete fp8mul.cu
* clean
* small bugfix
* fix test_dtype
* fix test_dtype_alu
* add EMULATE_CUDA_SM89
* fix ci
* fix test_linearizer
* fix test_linearizer
* fix swizzle
* add debug to simple_matmul
* fixed swizzle
* python emulator
* refactor python emulator
* setup fix
* numpy setup
* ml_dtypes only in emulate_cuda_sm89
* fix pylint
* fix tests
* fix mypy
* fix mypy
* fix ruff
* done python emulator
* add acc type
* tests
* mypy
* clean code
* add cuda tensor core tests to CI
* minor fix
* clean test_dtype.py
* clean cstyle.py
* clean test_ops.py
* fix test
* fix test
* whitespaces
* pylint
* pylint
* amd?
* amd?
* amd
* reduce lines
* mockgpu remove
* fix
* ruff
* ruff
* fix mypy
* ruff
* test only for cuda
* fixed formatting
* small fixes
* small fix
* least_upper_dtype if fp8s not supported
* log and reciprocal are supported for fp8s
* ops python fixes
* dtypes.fp8s use
* e4m3 + e5m2 result dtype test
* truncate linter fix
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
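A quick sketch of the two FP8 formats this PR wires up, using the ml_dtypes package named in the EMULATE_CUDA_SM89 bullets; the sample values are illustrative only:

```python
import numpy as np
from ml_dtypes import float8_e4m3fn, float8_e5m2  # the e4m3/e5m2 pair NVIDIA supports

x = np.array([0.1, 1.5, 300.0], dtype=np.float32)
# e4m3fn: 4 exponent bits, 3 mantissa bits, max finite value 448 (more precision, less range)
print(x.astype(float8_e4m3fn).astype(np.float32))
# e5m2: 5 exponent bits, 2 mantissa bits, max finite value 57344 (more range, less precision)
print(x.astype(float8_e5m2).astype(np.float32))
```

The "e4m3 + e5m2 result dtype test" bullet above is about how the two formats promote when mixed in a single op.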
Ignacio Sica
58785181a8
AMD bf16xf32 TC (#9717)
* don't test bf16 for emulated amd tc
* skip bf16 tc test in ci
* skip bf16 for AMD in test_tensor_cores_codegen
* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
George Hotz
cac8bcf8b5
use Ops.REDUCE (#9721)
* decrease bert python time [pr]
* order copies
* Revert "order copies"
This reverts commit 3f62c8693b.
* rewrite count
* Ops.REDUCE
* acc first in the add chain
* Fix tensor core acc
* arange patterns look good
* fix multireduce gate
* reduce rewrite rule
* bump that to 15 minutes
* multiwmma isn't fusing
* gep through wmma is gep pushing
* bump that timeout too, it's all env setup
* add failing test
2025-04-04 10:14:34 +08:00
Ignacio Sica
2d6d8b7355
add bf16 mfma support (#9695)
* add bf16 mfma support
* skip tc if emulated_amd and dtypes is bf16
* hotfix
2025-04-02 21:44:49 +08:00
George Hotz
e78e8722dc
Revert "LDS noop and spec ( #9669 )" ( #9691 )
...
This reverts commit 870b545ace .
Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com >
2025-04-02 15:31:32 +08:00
Ignacio Sica
870b545ace
LDS noop and spec (#9669)
* init lds noop and lds_0 spec
* refactor lds helper test
* fix typo
* test all lds at the same time
* change comment
* comment
* start test_lds_full
* test_lds_tc
* add tc spec
2025-04-01 18:44:55 +08:00
b1tg
d9af4cfc1b
AMD_LLVM: tensor cores support (#9613)
* tensor cores support
* test tensor cores codegen
* use rewrite rules
---------
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-04-01 09:56:27 +08:00
Ignacio Sica
1444069c09
Uppercase K for dimension and lowercase k for kernel in linearizer tc helper test (#9649)
2025-03-31 19:05:36 +08:00
Ignacio Sica
baa67fd124
Uppercase N and M (standalone syntax change) (#9647)
2025-03-31 18:45:30 +08:00
chenyu
f8976dd2eb
enable more webgpu tests (#9502)
OSX has a larger limit on the number of buffers, and it now supports fp16
2025-03-18 23:03:54 -04:00
George Hotz
117b7a16ef
VALIDATE_WITH_CPU [pr] (#9488)
* VALIDATE_WITH_CPU [pr]
* fix test
2025-03-18 15:15:04 +08:00
chenyu
01e8b60911
acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
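The rename lines tinygrad's reduction keyword up with numpy and torch; a minimal sketch of the shared spelling (the tinygrad line assumes the post-rename API):

```python
import numpy as np

a = np.arange(10, dtype=np.float16)
print(a.sum(dtype=np.float32))  # numpy spells the accumulator dtype "dtype": 45.0
# torch matches: torch.arange(10.).sum(dtype=torch.float32)
# tinygrad after this PR: Tensor.arange(10).sum(dtype=dtypes.float32), previously acc_dtype=...
```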
George Hotz
ece0a0f305
use empty for test instead of rand (#9332)
2025-03-03 16:19:06 +08:00
George Hotz
2cc4cb74f0
reorder binops (#9328)
* reorder binops
* test improvements + fix string tests
* ugh, okay this
2025-03-03 14:58:18 +08:00
qazal
2eab8021fb
remove inputs+outputs attributes from ScheduleItem [pr] (#9192)
* remove inputs/outputs from ScheduleItem
* fix test_linearizer
* fix test_conv_shapetracker
* fix test_schedule + lint
* test_image_dtype + multitensor + search
2025-02-21 13:48:11 +01:00
chenyu
2e7c2780a9
CLANG -> CPU (#9189)
2025-02-20 18:03:09 -05:00
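For code that pinned the old device name, the rename is mechanical; a minimal sketch (values are arbitrary):

```python
from tinygrad import Tensor

t = Tensor([1.0, 2.0, 3.0], device="CPU")  # was device="CLANG" before this PR
print((t + t).tolist())                    # [2.0, 4.0, 6.0]
```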
George Hotz
a4dab3ec3f
add name uop (#9149)
* add name uop, TODO: refactor renderer to use
* renderer uses name uop
* fix tests
* render
* ptx
2025-02-18 15:26:58 +08:00
Ahmed Harmouche
59fe45f947
Solve get_grouped_dims does not split issue (#9085)
* Solve dims too large errors on webgpu
* Simplify divisor find
* Test square root divisor
* Fix lint
* Refactor into group_dims and split_dims
* Refactor
* Fix lint
* Add back max check in _group_dims
* Prefer grouping over split
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-16 19:57:29 -05:00
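The "square root divisor" bullet is the core idea: when one launch dimension exceeds the backend's per-dimension cap (as on webgpu), split it into two balanced factors, preferring to group dims and falling back to a split. A minimal sketch of the divisor search; split_dim and its constants are hypothetical, not tinygrad's actual helper:

```python
def split_dim(n: int, limit: int) -> tuple[int, int]:
  # scan downward from sqrt(n) so the two factors stay as balanced as possible
  for a in range(int(n ** 0.5), 0, -1):
    if n % a == 0 and a <= limit and n // a <= limit:
      return a, n // a
  raise ValueError(f"no divisor pair of {n} fits under {limit}")

print(split_dim(131072, 65535))  # (256, 512): both factors fit a 65535 per-dim cap
```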
chenyu
f53b819648
UOps. -> Ops. [pr] (#9044)
updated the comments and docs except extra
2025-02-12 12:53:23 -05:00
Ignacio Sica
aaed315fee
add AMX support to LLVM (#8957)
* init amx support for llvm
* revert elf changes
* fix attributes for AMX asm calls
* add comments
* add llvm amx job to benchmarks
* cleanup
* cleanup
* hotfix: improve comments
* comment for aux buffers
* hotfix:
* move amx_tc to ClangRenderer
* merge master
* refactor
* add docs
* add corsix docs reference
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-12 16:01:18 +08:00
George Hotz
a3c78d47b3
speed docs + upgrades [pr] (#8964)
* add some docs about speed [pr]
* better torch gemm
* enable locals on llvm/clang
* disable locals for beam speed on LLVM/CLANG
* 0x20 alignment in llvm allows ymm use
2025-02-08 17:28:52 +08:00
George Hotz
c2b4c43edb
handle stride 0 reduce (#8068)
* handle stride 0 reduce [pr]
* more test fixups
* a few more
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-02-07 15:40:58 +01:00
Ahmed Harmouche
133cacadde
Autogen webgpu dawn, removing wgpu-py dependency (f16 support part 1) (#8646)
* Switch to dawn, all tests passing locally
* Use dawn-python
* Skip failing test
* Skip midcast and fix timestamp on metal ci
* Autogen webgpu
* Try fetch dawn lib again
* /usr/lib
* Without lib prefix
* Test autogen diff
* Delete webgpu support, move everything to ops_webgpu
* mypy fix
* Simplify, refactor
* Line savings
* No ResultContainer
* Type annotation for result
* Some more simplifications
* Why was this explicit sync used at all?
* Refactor: delete functions that are only used once
* Create shader module inline
* Clear unit tests cache, maybe that solves it
* That wasn't it
* Try deleting cache to pass failing weight compare
* weights_only=False for pytorch 2.6
* Simplify ctype array creation
* Remove nanosecond precision timestamps
* Simplify error handling
* Refactor, add back type annotations
* Deleted custom submit function, refactor
* read_buffer simplify
* Fix use after free, refactor
* Simplify supported_features
* Runtime docs
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-02-07 15:16:59 +08:00
chenyu
a092b6395d
Tuple -> tuple, List -> list [pr] (#8936)
2025-02-06 14:21:19 -05:00
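The mechanical change this PR applies across the codebase, shown on a toy signature: the typing-module aliases give way to builtin generics (Python 3.9+):

```python
from typing import Tuple, List  # old spelling, dropped by this PR

def pair_old(xs: List[int]) -> Tuple[int, int]: return xs[0], xs[-1]
def pair_new(xs: list[int]) -> tuple[int, int]: return xs[0], xs[-1]  # builtin generics
```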
Ignacio Sica
15f94ac964
TC_SEARCH_OVER_SHAPE to search multiple TC shapes (#8793)
* squash search over shape
* refactor assert
* init benchmark
* cleaner get_kernel_actions
* cleaner get_kernel_actions
* add comment
2025-02-05 11:03:46 -05:00
Ignacio Sica
260df1a17f
tc_select noop (#8801)
* tc_select noop
* revert changes in test
2025-01-29 13:53:23 -05:00
qazal
ba17786068
do not construct unmasked VALID (#8759)
* new lines that exist in codegen/ops
* update tests
* update sops.gz (13071 -> 13070 asts)
* fix viz too
* remove that TODO
* diff pruning
* mask assert + device
* work
* diff pruning
* re: fix viz too
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-28 20:51:21 +02:00
Ignacio Sica
b240f12593
[TIP-9] rename Opt's amt to arg 2 (#8770)
* rename Opt amt to arg
* ignore_beam_cache for test_tiny
* move ignore_beam_cache to test_tiny
* move to separate pr
* revert space change
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-27 14:19:04 -05:00
George Hotz
3ed146a5ff
Revert "rename Opt amt to arg ( #8767 )" ( #8769 )
...
This reverts commit bf041659a5 .
2025-01-27 23:46:37 +09:00
Ignacio Sica
bf041659a5
rename Opt amt to arg (#8767)
2025-01-27 23:36:47 +09:00
George Hotz
b4bf6a7dea
switch backward to use gradient [pr] (#8235)
* switch backward to use gradient [pr]
* set device correctly, dedup
* why does that fail?
* add noop cast
* simple backward
* fix beautiful_mnist
* touchups
* set in compute_gradient
* uop_count
* uop_count was wrong
* collections
* no note
* skip that test
* update sched kernel counts
* train mnist is 65
* fix metadata and gc
* fixes
* materialize_grads
* no pathlib stuff
* add contiguous_backward, fix bugs
* add some realize
* fix multi
2025-01-26 09:12:16 +09:00
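The user-facing call is unchanged; backward is now layered on a graph-level gradient computation (compute_gradient in the bullets above). A minimal sketch, with the functional spelling assumed from this PR's naming:

```python
from tinygrad import Tensor

w = Tensor([2.0], requires_grad=True)
loss = (w * w).sum()
loss.backward()          # classic API: populates w.grad
print(w.grad.tolist())   # [4.0], since d(w^2)/dw = 2w
# the functional path this PR switches to, roughly: (dw,) = loss.gradient(w)
```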
George Hotz
46a8c5e1e5
delete forced_realize (#8615)
* delete forced_realize
* put that back
* expectedFailures
* cleaner create_subbuffer
* more comments
---------
Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-20 09:40:36 -08:00
qazal
d957a4f108
add tests for div buffer collapsing in the scheduler [pr] (#8671)
* add tests for mul/div buffer collapsing in the scheduler [pr]
* lint
* merge with test_linearizer's version of this
* 4*3
2025-01-18 14:15:29 -05:00
ignaciosica
d2234e308a
tf32 tc for nv and ptx (#8635)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
qazal
ae2229d727
assert kernel buffer limit at compile time [pr] (#8595)
* remove the BUF_LIMIT assert
* skip the base one
2025-01-13 16:32:07 -05:00
qazal
586e730d32
use UOp.st for kernel reduce axes (#8499)
* use UOp.st for kernel reduce axes [pr]
* do not return dict
2025-01-13 06:24:11 -05:00
qazal
866dfa1f23
create_schedule([x.lazydata]) -> x.schedule() in tests (#8449)
2024-12-31 03:15:52 +08:00
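The shape of the test migration, sketched; the schedule contents depend on the graph, but the call change is one line:

```python
from tinygrad import Tensor

x = (Tensor.ones(4, 4) + 1).contiguous()
# old: sched = create_schedule([x.lazydata])
sched = x.schedule()     # new: a method on Tensor
print(len(sched))        # at least one kernel for the add
```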
George Hotz
29c14f1cbf
hotfix: update tests for no uop mut
2024-12-30 10:05:37 -05:00
ignaciosica
ba0c844a83
special tol when f16 and bf16 are tc input dtypes (#8183)
2024-12-21 11:32:26 -05:00
George Hotz
bd9c015b09
tests from grad uop path [pr] (#8313)
2024-12-18 09:25:05 -08:00
Ahmed Harmouche
a73e3677d0
Test linearizer on webgpu (#8159)
* Test linearizer on wgpu
* Skip tests due to exceeded dims
2024-12-11 17:03:26 +01:00
qazal
6be388be86
failing test for const folding breaking indexing [pr] (#8103)
2024-12-07 19:55:02 +08:00
George Hotz
0c7477b108
no bool in range [pr] (#7988)
* no bool in range [pr]
* fix llvm
* add arg to range spec
* fix broken test
* forgot this one
* hotfix: test_tiny jit is a real test
2024-12-02 19:05:16 +08:00
George Hotz
f17af70d17
replace all sparents with toposort (#7983)
2024-12-02 15:00:30 +08:00
George Hotz
c5c3b05b5a
block lin: only the test changes (#7933)
2024-11-28 13:19:00 +08:00
George Hotz
32dbab945c
Revert "add block uops and modify tests ( #7931 )" ( #7932 )
...
This reverts commit 6f4519ff45 .
2024-11-28 13:15:41 +08:00
George Hotz
6f4519ff45
add block uops and modify tests (#7931)
2024-11-28 13:11:18 +08:00
chenyu
a58e289d77
Revert "prereqs for new block lin so PR works ( #7919 )" ( #7921 )
...
This reverts commit c53261b541 .
2024-11-27 08:41:09 -05:00