952 Commits

Author SHA1 Message Date
George Hotz
984f09ac74 flip Ops.COPY order [pr] (#10120) 2025-04-30 16:50:18 -04:00
George Hotz
c3ff308abb range has only one src now [pr] (#10100)
* range has only one op now

* fix z3 checker

* ci fix

* needs shell

* try pip ensure update

* that ensurepip is useless

* upgrade pip before cache

* windows happy?
2025-04-29 10:31:05 -04:00
qazal
cbf7347cd6 display viz rewrites with tabbing if they are subrewrites (#10097)
* display viz rewrites with tabbing if they are subrewrites

* update viz api
2025-04-29 17:57:21 +08:00
Sieds Lykles
dbb7aee02e Split constant in div with negative x (#10088)
* add rule

* change test

* lower complexity limit

* remove offset in fold_unrolled_divs

* remove import

* add one more condition
2025-04-28 16:24:14 -04:00
George Hotz
690dac79b5 don't modify the ranges on reduce rewrite (#10062)
* bug in div range folding

* simpler

* oh, this is right for indexing, but the div mod folding needs to be fixed

* reenable

* Passing test_complexity_w_unroll2 (#10068)

* Passing

* remove non_folded_divs

* Add check for negative term in div folding

* Add test

* bump that limit

* fix casted

---------

Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
chenyu
4c1ce1a299 don't simplify if div folding resulted in negative numerator (#10064)
* don't simplify if div folding resulted in negative numerator

* test
2025-04-26 17:01:18 -04:00
George Hotz
2ed3acd767 toposort is a function [pr] (#10004) 2025-04-23 16:25:03 +01:00
George Hotz
d1f6701eb7 hotfix: lower amd threshold + improve block reorder test 2025-04-22 20:44:29 +01:00
George Hotz
c1539b0319 putting add first orders loads as expected (#9991) 2025-04-22 20:12:05 +01:00
George Hotz
feee6986c9 faster block reorder (#9990)
* faster block reorder [pr]

* that shouldn't change order

* key just in sorted

* ind
2025-04-22 19:18:57 +01:00
chenyu
9e5e371999 make DISABLE_COMPILER_CACHE a ContextVar [pr] (#9983) 2025-04-22 10:32:54 -04:00
George Hotz
c519b553db non recursive toposort is 2x+ faster (#9979)
* non recursive toposort is 2x+ faster

* don't change the order
2025-04-22 13:59:38 +01:00
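For readers unfamiliar with the technique named in the commit above, here is a minimal sketch of a non-recursive topological sort using an explicit stack. It is not the code from the commit: the `Node` class, its `src` field, and the `toposort` signature are hypothetical stand-ins chosen for illustration. It shows the general idea only, which is that an explicit stack avoids Python's recursion limit on deep graphs.

```python
from dataclasses import dataclass

@dataclass(frozen=True, eq=False)
class Node:
    # Hypothetical stand-in for a graph node; `src` holds the nodes it depends on.
    name: str
    src: tuple = ()

def toposort(root):
    """Return every node reachable from `root`, dependencies before dependents."""
    order, visited = [], set()
    # Each stack entry is (node, expanded). A node is emitted only on its
    # second visit, after all of its sources have been handled.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if node in visited:
            continue
        if expanded:
            visited.add(node)
            order.append(node)
        else:
            stack.append((node, True))
            # Push sources in reverse so they are processed in their original order.
            for s in reversed(node.src):
                if s not in visited:
                    stack.append((s, False))
    return order
```

Because no Python call frames are nested, a chain of several thousand nodes sorts without raising `RecursionError`, which a naive recursive version would hit at the default recursion limit.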
George Hotz
f5dc70c624 microbenchmarks + micro speed ups (#9972)
* microbenchmarks

* forgot the ubenchs

* clean up type verify
2025-04-22 11:30:46 +01:00
qazal
9a9aba4cd5 setitem tests (some failing) from kernelize (#9940) 2025-04-20 18:47:55 +08:00
George Hotz
8919370c76 hotfix: fix test_save_all_dtypes on METAL 2025-04-18 08:42:31 +01:00
Eitan Turok
2c7c205bc5 Fix dtype comparisons in vectorized transcendental + tests (#9794)
* init test

* cleanup

* init

* update

* fix

* fix python runtime for vectorized code

* awesome helper

* update

* update

* cleanup

* more cleaning

* cleanup more

* fix tests

* more cleaning

* cleanup more

* fix

* even cleaner

* failing tests is sad

* cleanup

* better name

* make tests pass

* remove vec from python runtime

* remove vec from eval_uop

* remove expected failures

* better name
2025-04-16 08:06:12 -04:00
George Hotz
44e4934167 fast pattern matcher [pr] (#9737)
* FastPatternMatcher

* works without that

* fix test pickle

* strict len

* compile match function

* dynamic compile

* fast

* faster

* compile

* track

* a lot faster

* clean up

* dup or

* faster and simpler

* fast match doesn't support store

* plane

* minor refactor

* real speed

* don't imply return None

* upat

* fix test

* heard you wanted more speed

* no generator

* split cf

* early fixup

* fxn fixup

* reconstruct_function

* Revert "reconstruct_function"

This reverts commit 37dac010ab.

* simpler stuff

* too big

* upat compile error

* cleanups

* don't cache that

* cleanups

* 10 -> 15
2025-04-14 15:24:41 +01:00
chenyu
e0ec8be37d use CPU for test_schedule_ring (#9843)
* use CPU for test_schedule_ring

* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
16956b79de canonicalize Device.DEFAULT (#9835) 2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb fix get reduce contraction with test (#9834) 2025-04-10 22:24:21 +08:00
George Hotz
53f0b2aad7 fix infinite loop in flash attention (#9827)
* fix infinite loop in flash attention

* get_contraction_with_reduce

* skip that test

* SINGLE_KERNEL_SOFTMAX + fix multi

* default IGNORE_OOB

* print change
2025-04-10 20:06:44 +08:00
qazal
498a2bf738 add err handling tests to viz + cleanups (#9825)
* cleanup

* add err handling tests to viz + cleanups

* lint
2025-04-10 14:05:05 +08:00
qazal
3bd992dc95 multi stage graph_rewrite_map (#9803)
* multistage graph_rewrite_map

* s/merge_map/input_map

* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
Eitan Turok
bb7922b95f Vectorize Transcendental Regression Tests (#9753)
* init test

* cleanup
2025-04-08 01:27:39 +08:00
chenyu
407ca54382 symbolic fold double where (#9436)
* symbolic fold double where

a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer

* test case
2025-04-05 05:12:17 -04:00
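The rewrite quoted in the commit above, `a.where(b.where(c, d), d) -> (a & b).where(c, d)`, can be checked exhaustively over boolean conditions. The sketch below models `where` on plain Python scalars rather than tensors; the helper names `where`, `nested`, `folded`, and `rule_holds` are made up for this illustration and are not part of any library.

```python
from itertools import product

def where(cond, x, y):
    # Scalar model of an elementwise select: x when cond is true, else y.
    return x if cond else y

def nested(a, b, c, d):
    # Left-hand side of the rule: a.where(b.where(c, d), d)
    return where(a, where(b, c, d), d)

def folded(a, b, c, d):
    # Right-hand side of the rule: (a & b).where(c, d)
    return where(a and b, c, d)

def rule_holds():
    # Try all four combinations of the two boolean conditions.
    return all(nested(a, b, "c", "d") == folded(a, b, "c", "d")
               for a, b in product([False, True], repeat=2))
```

Both forms yield `c` only when `a` and `b` are both true and `d` otherwise, so the folded form saves one select.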
Sieds Lykles
9c2fc695b5 cond.logical_not().where(a,b) -> cond.where(b,a) (#9741)
* Add rule for negation in where, simplifies arange patterns

* 0 becomes 0.0 again

* Only if cond is bool

* ne is never None

* Add a test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
George Hotz
8b5a523743 fix minimum length in pattern matcher (#9736) 2025-04-04 14:57:01 +08:00
George Hotz
cac8bcf8b5 use Ops.REDUCE (#9721)
* decrease bert python time [pr]

* order copies

* Revert "order copies"

This reverts commit 3f62c8693b.

* rewrite count

* Ops.REDUCE

* acc first in the add chain

* Fix tensor core acc

* arange patterns look good

* fix multireduce gate

* reduce rewrite rule

* bump that to 15 minutes

* multiwmma isn't fusing

* gep through wmma is gep pushing

* bump that timeout too, it's all env setup

* add failing test
2025-04-04 10:14:34 +08:00
chenyu
c20f112e9f example test use z3 to verify valid simplification (#9684) 2025-04-02 01:05:52 -04:00
chenyu
c672716b38 improve vmin/vmax for IDIV (#9678) 2025-04-01 23:16:01 -04:00
chenyu
8dd88ad476 don't div_and_mod_folding for negative numerator with remainder (#9674)
can be wrong in C div since it truncates towards zero
2025-04-01 16:26:23 -04:00
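The note in the commit above concerns the difference between C integer division, which truncates toward zero, and floor division as in Python's `//`. The sketch below illustrates that difference. The helpers `cdiv` and `cmod` are written here for illustration and are assumptions, not the code from the commit. The folding `(x + 4) / 4 -> x / 4 + 1` is an example chosen for this note to show how a rewrite that is valid under floor division can fail under truncation when the numerator is negative.

```python
def cdiv(x, y):
    # C-style integer division: truncate toward zero (assumes y != 0).
    q = abs(x) // abs(y)
    return q if (x < 0) == (y < 0) else -q

def cmod(x, y):
    # C-style remainder: the sign follows the dividend,
    # so x == cdiv(x, y) * y + cmod(x, y) always holds.
    return x - cdiv(x, y) * y

x = -1
# Under floor division, pulling the constant out of the division is always valid.
floor_ok = (x + 4) // 4 == x // 4 + 1
# Under truncating division it breaks: cdiv(3, 4) is 0, but cdiv(-1, 4) + 1 is 1.
trunc_ok = cdiv(x + 4, 4) == cdiv(x, 4) + 1
```

Here `floor_ok` is true and `trunc_ok` is false, which is why such a folding has to be skipped when the numerator may be negative and leaves a remainder.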
chenyu
0e34f9082e helper functions for cstyle div mod [pr] (#9673) 2025-04-01 08:06:56 -04:00
chenyu
5358b0904b update uop_given_valid if a node becomes const (#9604)
* update uop_given_valid if a node becomes const

* cleanup
2025-03-27 14:57:46 -04:00
qazal
bf94924d5a fix viz with nested graph_rewrite (#9595) 2025-03-27 13:14:28 +08:00
qazal
e5ff7b23d7 refactor to @track_matches + add failing test_nested_rewrite (#9592)
* test_nested_rewrite

* refactor to track_matches

* positional arg
2025-03-27 11:11:56 +08:00
George Hotz
3c5161b4cb add validation of the bounds of Ops.INDEX (#9503)
* add validation of the bounds of Ops.INDEX

* do mask properly

* more validation

* correct

* fix gated

* add CAST support to vmin/vmax

* fix ptx and image

* ptx no diff

* upat.index also stays

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7 remove move_mask from the devectorizer (#9511)
* remove move_mask from the devectorizer

* add (wrong) ptx

* reason

* enable index addition in PTX, we won't have the INDEX anyways

* space
2025-03-20 11:53:12 +08:00
chenyu
189f62d44f add rounding to tqdm unit scale (#9507)
fixed `AssertionError: ' 1.00/10.0  1000it/s]' != ' 1.00/10.0  1.00kit/s]'`
2025-03-19 12:08:46 -04:00
hooved
136cf7b8b1 hotfix: load >2 GiB from disk on macOS (#9361)
* enable loading >2 GiB buffer from disk on macOS

* handle None case raised by mypy

* add test

* revert fix to repro bug in CI

* tell CI to run a unit test for macOS

* reapply fix
2025-03-07 14:51:58 +08:00
George Hotz
2cc4cb74f0 reorder binops (#9328)
* reorder binops

* test improvements + fix string tests

* ugh, okay this
2025-03-03 14:58:18 +08:00
qazal
e162aa862d is_realized only if buffer is allocated (#9253)
* is_realized only if the buffer is allocated

* fix the image check too

* assert test_lil_model after ExecItems run
2025-02-26 08:58:08 +01:00
Sieds Lykles
9c4d9d9f10 Acc first (#9232)
* put acc in front of the add chain

* handle the other case

* Make loop collapse more generic

* Remove mulacc_unrolled

* Actually remove it

---------

Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 22:10:15 -05:00
chenyu
90c3ed17c5 move cast to before softmax in attention (#9213)
* move cast to before softmax in attention

saves some memory because exp (which is used for backward) is done in half. training bert seems fine and can fit BS=78 now (from 66)

* test
2025-02-24 17:24:59 -05:00
qazal
14aa2395d0 allow VIEW(BUFFER) in Tensor UOps [pr] (#9210)
* allow VIEW(BUFFER) in Tensor UOps [pr]

* still reshapes

* update becomes_map tests

* bring copy folder to the scheduler

* lint

* only sgd left

* optimizer assign

* 13 kernels

* rename to test_reorder_expand + assert VIEW
2025-02-24 13:06:15 +01:00
qazal
d12efc95d4 support custom name function in viz [pr] (#9219)
* support custom name function in viz [pr]

* title case

* assert name count in test_track_rewrites_name_fxn
2025-02-24 03:03:25 +02:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
chenyu
3e22747799 run unit test on windows ci (#9187)
* factor out testing_minimal in setup.py [pr]

* testing_unit + windows
2025-02-20 14:40:41 -05:00
chenyu
287de4ecc6 use torch in test_gradient (#9186)
used torch.autograd.grad, but not sure if it can be a template like jax
2025-02-20 12:26:11 -05:00
George Hotz
df3b320f46 rewriter -> devectorizer [pr] (#9147) 2025-02-18 12:42:08 +08:00
Ali Ladjevardi
35e9c4657b Use proper units when printing beam time (#9103)
* use proper units when printing beam time

* refactor DEBUG=2
2025-02-17 23:41:38 +08:00