Commit Graph

4433 Commits

Author SHA1 Message Date
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
qazal
e201bc3e93 process replay kernel asts in toposort order [pr] (#9869)
* process replay kernel asts in toposort order [pr]

* use HEAD replay
2025-04-13 17:20:34 +08:00
Alexey Zaytsev
7dda6aae7d Skip CLOUD in external_test_example (#9857)
Closes #9814
2025-04-12 10:17:44 +08:00
George Hotz
dd52951dd0 fix single kernel softmax with cast (#9842)
* fix single kernel softmax with cast

* tolerate none

* 3e-4

* skip on dtype
2025-04-11 12:12:02 +08:00
chenyu
8c6299bced move hand_coded_optimizations to heuristic.py [pr] (#9844)
* move hand_coded_optimizations to heuristic.py [pr]

also folded all long lines

* make a copy and rename self -> k

* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
fbc6aa53d4 script for local process_replay + fix viz name [pr] (#9837) 2025-04-11 00:39:18 +08:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
chenyu
7fa5f29582 add test_embedding to test_softmax_fusion (#9832) 2025-04-10 08:25:34 -04:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45 move process replay to grouper (#9830)
* simpler

* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07 not_support_multi_device helper (#9831)
unify the test helper to skip ci device that does not support multi
2025-04-10 05:25:29 -04:00
chenyu
c462162db8 update benchmark bert scripts with BS and ACC_DTYPE (#9826)
BS=16, ACC_DTYPE=half for tinybox, BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
qazal
498a2bf738 add err handling tests to viz + cleanups (#9825)
* cleanup

* add err handling tests to viz + cleanups

* lint
2025-04-10 14:05:05 +08:00
George Hotz
fce432d2e3 Ops.FUSE makes softmax a single kernel (#9808)
* KERNELIZE makes softmax a single kernel

* single kernel works

* softmax works

* broken

* correct

* skip that test

* kernelize tests

* rename to fuse

* better reduce_push_add_ones code

* correct now

* cleanups

* oops

* return None if we can't push ones

* rename + docs

* atol fixes group

* flash attention broken test
2025-04-09 22:56:28 +08:00
qazal
3bd992dc95 multi stage graph_rewrite_map (#9803)
* multistage graph_rewrite_map

* s/merge_map/input_map

* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
George Hotz
d1505137ad Revert "move TestOpsFp8s skipTest (#9797)"
This reverts commit a3aaf92b21.
2025-04-09 12:27:40 +08:00
chenyu
a3aaf92b21 move TestOpsFp8s skipTest (#9797)
so get_available_devices is not called when running other tests
2025-04-08 22:44:07 -04:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
qazal
f13e9cf2d9 move view_left to grouper.py + tiny reorders [pr] (#9780)
* move view_left to grouper.py [pr]

* reorder grouper

* test_schedule
2025-04-08 15:39:28 +08:00
chenyu
7a28133b37 failed test for single softmax backward (#9778)
getting RecursionError with DONT_GROUP_REDUCES=1
2025-04-08 02:36:32 -04:00
George Hotz
fefee5d3ab single kernel softmax (#9776)
* real single kernel softmax

* cleanup

* fix blockend insertion

* add to bert test
2025-04-08 12:35:48 +08:00
qazal
9963bb51e0 grouper tests cleanups [pr] (#9777)
* grouper tests cleanups [pr]

* viz

* tuple

* whitespace
2025-04-08 12:33:11 +08:00
George Hotz
db22094d35 hotfix: update softmax fusion test 2025-04-08 11:23:19 +08:00
Eitan Turok
bb7922b95f Vectorize Transcendental Regression Tests (#9753)
* init test

* cleanup
2025-04-08 01:27:39 +08:00
Sieds Lykles
07d1aefaf4 fast idiv (#9755)
* fast idiv with tests and fuzzer

* Add todo comment

* Add env variable to toggle fast_idiv

* Move env check

* Add fuzz fast_idiv to ci

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-07 08:32:24 -04:00
nimlgen
fa888ee077 minor test cleanups (#9770)
* fix test_graph on max

* pcie5
2025-04-07 15:29:12 +03:00
qazal
891322fd51 split into grouper.py (#9768)
* split into grouper.py

* update tests

* reorder
2025-04-07 18:40:59 +08:00
qazal
8ddb1357c0 fix UPat.location after pickle (#9763)
* fix UPat.location after pickle [pr]

* named upat test
2025-04-07 15:16:42 +08:00
chenyu
b190d85ad7 benchmark script bert softmax (#9759) 2025-04-07 00:31:18 -04:00
Ignacio Sica
58785181a8 AMD bf16xf32 TC (#9717)
* dont test bf16 for emulated amd tc

* skip bf16 tc test in ci

* skip bf16 for AMD in test_tensor_cores_codegen

* add simple bf16 gemm test to benchmark
2025-04-07 11:41:04 +08:00
chenyu
43e4565148 weighted linear in external_benchmark_bert_matmuls (#9757)
include the linear to get qkv, and permute so that stride matches with the real run
2025-04-06 23:35:42 -04:00
chenyu
8a585dc5c1 benchmark script for matmuls in bert (#9752)
2 main matmuls in the bert layers. getting these to be fast makes bert fast
2025-04-06 19:34:25 +08:00
nimlgen
5f7c79676f jit: prune independent copies (#9749)
* jit: prune independent copies

* linter

* check kernel cnt
2025-04-05 20:50:28 +03:00
nimlgen
c2573b247c jit: rename optimize_weights -> replan_buffers_memory_layout (#9751) 2025-04-05 20:35:15 +03:00
chenyu
407ca54382 symbolic fold double where (#9436)
* symbolic fold double where

a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer

* test case
2025-04-05 05:12:17 -04:00
Sieds Lykles
9c2fc695b5 cond.logical_not().where(a,b) -> cond.where(b,a) (#9741)
* Add rule for negation in where, simplifies arange patterns

* 0 becomes 0.0 again

* Only if cond is bool

* ne is never None

* Add a test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
chenyu
fe998798fb linearizer failure test for OUT OF BOUNDS ACCESS (#9738) 2025-04-04 03:48:43 -04:00
George Hotz
8b5a523743 fix minimum length in pattern matcher (#9736) 2025-04-04 14:57:01 +08:00
George Hotz
926b0bcc57 cache folded upcast [pr] (#9733) 2025-04-04 11:23:19 +08:00
George Hotz
cac8bcf8b5 use Ops.REDUCE (#9721)
* decrease bert python time [pr]

* order copies

* Revert "order copies"

This reverts commit 3f62c8693b.

* rewrite count

* Ops.REDUCE

* acc first in the add chain

* Fix tensor core acc

* arange patterns look good

* fix multireduce gate

* reduce rewrite rule

* bump that to 15 minutes

* multiwmma isn't fusing

* gep through wmma is gep pushing

* bump that timeout too, it's all env setup

* add failing test
2025-04-04 10:14:34 +08:00
nimlgen
949459fdd6 jit: fix deallocate on unallocated buffers in free_intermediates (#9699) 2025-04-03 18:32:51 +03:00
geohotstan
ac713e04db ONNX add output shape validation (#9720)
* add output shape validation and remove support for sequence_type

* nit better err msg

* add sequence_type back

* improve err msg

* Revert "improve err msg"

This reverts commit dc9eaea4bb.

* Revert "add sequence_type back"

This reverts commit 288170b2d9.

* do explicit shape equality

* small nit
2025-04-03 05:44:53 -04:00
chenyu
79145e3d40 cleanup truncate_bf16 [pr] (#9725)
use torch bfloat16 for groundtruth in test. also a TODO for discrepancy
2025-04-03 05:43:49 -04:00
Ignacio Sica
bc2d86195e increase test tolerance (#9719) 2025-04-03 15:24:09 +08:00
George Hotz
49dafe6d43 add gc tests [pr] (#9718)
* add gc tests [pr]

* del

* more gc tests

* add NullGraph
2025-04-03 14:08:32 +08:00
Ignacio Sica
bc91fffc5d fix gated store with index in python backend (#9703)
* add default gate in index

* assert store

* add TestRendererFailures

- move test_gated_store_with_alu to new TestRenderFailures class for
tests that fail on multiple renderers
- add test_renderer_failures.py run on python CI

* add test for gated index in 2d

* test TestRenderFailures
2025-04-03 12:48:28 +08:00
geohotstan
e1d7e47cca fix ONNX IsInf unintended dtype promotion (#9711)
* add IsInf

* add corresponding test

* that float16 is kinda silly
2025-04-02 22:46:15 -04:00