George Hotz
984f09ac74
flip Ops.COPY order [pr] ( #10120 )
2025-04-30 16:50:18 -04:00
George Hotz
c3ff308abb
range has only one src now [pr] ( #10100 )
...
* range has only one op now
* fix z3 checker
* ci fix
* needs shell
* try pip ensure update
* that ensurepip is useless
* upgrade pip before cache
* windows happy?
2025-04-29 10:31:05 -04:00
qazal
cbf7347cd6
display viz rewrites with tabbing if they are subrewrites ( #10097 )
...
* display viz rewrites with tabbing if they are subrewrites
* update viz api
2025-04-29 17:57:21 +08:00
Sieds Lykles
dbb7aee02e
Split constant in div with negative x ( #10088 )
...
* add rule
* change test
* lower complexity limit
* remove offset in fold_unrolled_divs
* remove import
* add one more condition
2025-04-28 16:24:14 -04:00
George Hotz
690dac79b5
don't modify the ranges on reduce rewrite ( #10062 )
...
* bug in div range folding
* simpler
* oh, this is right for indexing, but the div mod folding needs to be fixed
* reenable
* Passing test_complexity_w_unroll2 (#10068)
* Passing
* remove non_folded_divs
* Add check for negative term in div folding
* Add test
* bump that limit
* fix casted
---------
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
2025-04-28 12:01:19 -04:00
chenyu
4c1ce1a299
don't simplify if div folding resulted in negative numerator ( #10064 )
...
* don't simplify if div folding resulted in negative numerator
* test
2025-04-26 17:01:18 -04:00
George Hotz
2ed3acd767
toposort is a function [pr] ( #10004 )
2025-04-23 16:25:03 +01:00
George Hotz
d1f6701eb7
hotfix: lower amd threshold + improve block reorder test
2025-04-22 20:44:29 +01:00
George Hotz
c1539b0319
putting add first orders loads as expected ( #9991 )
2025-04-22 20:12:05 +01:00
George Hotz
feee6986c9
faster block reorder ( #9990 )
...
* faster block reorder [pr]
* that shouldn't change order
* key just in sorted
* ind
2025-04-22 19:18:57 +01:00
chenyu
9e5e371999
make DISABLE_COMPILER_CACHE a ContextVar [pr] ( #9983 )
2025-04-22 10:32:54 -04:00
George Hotz
c519b553db
non recursive toposort is 2x+ faster ( #9979 )
...
* non recursive toposort is 2x+ faster
* don't change the order
2025-04-22 13:59:38 +01:00
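For readers unfamiliar with the technique named in the commit above, here is a minimal sketch of a non-recursive topological sort using an explicit stack. It is an illustration only, not the code from the commit: the function name `toposort`, the `srcs` callback, and the dictionary graph are all assumptions made for this example, and the input is assumed to be acyclic.

```python
from typing import Callable, Hashable, List, Sequence

def toposort(root: Hashable, srcs: Callable[[Hashable], Sequence[Hashable]]) -> List[Hashable]:
    """Return every node reachable from root, each one after all of its sources.

    Hypothetical helper for illustration; assumes the graph is a DAG.
    """
    order: List[Hashable] = []
    visited = set()
    # Each stack entry is (node, expanded). A node is emitted only on its
    # second visit, once all of its sources have been emitted.
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if node in visited:
            continue
        if expanded:
            visited.add(node)
            order.append(node)
        else:
            stack.append((node, True))
            # Push sources in reverse so they are processed in their listed order.
            for s in reversed(srcs(node)):
                if s not in visited:
                    stack.append((s, False))
    return order

# Small diamond-shaped graph: d depends on b and c, which both depend on a.
graph = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}
print(toposort("d", graph.__getitem__))  # ['a', 'b', 'c', 'd']
```

Because the traversal keeps its own stack, very deep graphs do not hit Python's recursion limit, and the resulting order matches what a recursive post-order walk would produce.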
George Hotz
f5dc70c624
microbenchmarks + micro speed ups ( #9972 )
...
* microbenchmarks
* forgot the ubenchs
* clean up type verify
2025-04-22 11:30:46 +01:00
qazal
9a9aba4cd5
setitem tests (some failing) from kernelize ( #9940 )
2025-04-20 18:47:55 +08:00
George Hotz
8919370c76
hotfix: fix test_save_all_dtypes on METAL
2025-04-18 08:42:31 +01:00
Eitan Turok
2c7c205bc5
Fix dtype comparisons in vectorized transcendental + tests ( #9794 )
...
* init test
* cleanup
* init
* update
* fix
* fix python runtime for vectorized code
* awesome helper
* update
* update
* cleanup
* more cleaning
* cleanup more
* fix tests
* more cleaning
* cleanup more
* fix
* even cleaner
* failing tests is sad
* cleanup
* better name
* make tests pass
* remove vec from python runtime
* remove vec from eval_uop
* remove expected failures
* better name
2025-04-16 08:06:12 -04:00
George Hotz
44e4934167
fast pattern matcher [pr] ( #9737 )
...
* FastPatternMatcher
* works without that
* fix test pickle
* strict len
* compile match function
* dynamic compile
* fast
* faster
* compile
* track
* a lot faster
* clean up
* dup or
* faster and simpler
* fast match doesn't support store
* plane
* minor refactor
* real speed
* don't imply return None
* upat
* fix test
* heard you wanted more speed
* no generator
* split cf
* early fixup
* fxn fixup
* reconstruct_function
* Revert "reconstruct_function"
This reverts commit 37dac010ab.
* simpler stuff
* too big
* upat compile error
* cleanups
* don't cache that
* cleanups
* 10 -> 15
2025-04-14 15:24:41 +01:00
chenyu
e0ec8be37d
use CPU for test_schedule_ring ( #9843 )
...
* use CPU for test_schedule_ring
* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
16956b79de
canonicalize Device.DEFAULT ( #9835 )
2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb
fix get reduce contraction with test ( #9834 )
2025-04-10 22:24:21 +08:00
George Hotz
53f0b2aad7
fix infinite loop in flash attention ( #9827 )
...
* fix infinite loop in flash attention
* get_contraction_with_reduce
* skip that test
* SINGLE_KERNEL_SOFTMAX + fix multi
* default IGNORE_OOB
* print change
2025-04-10 20:06:44 +08:00
qazal
498a2bf738
add err handling tests to viz + cleanups ( #9825 )
...
* cleanup
* add err handling tests to viz + cleanups
* lint
2025-04-10 14:05:05 +08:00
qazal
3bd992dc95
multi stage graph_rewrite_map ( #9803 )
...
* multistage graph_rewrite_map
* s/merge_map/input_map
* build up kernel_map from the tensor_map
2025-04-09 15:59:45 +08:00
Eitan Turok
bb7922b95f
Vectorize Transcendental Regression Tests ( #9753 )
...
* init test
* cleanup
2025-04-08 01:27:39 +08:00
chenyu
407ca54382
symbolic fold double where ( #9436 )
...
* symbolic fold double where
a.where(b.where(c, d), d) -> (a & b).where(c, d). a pattern in optimizer
* test case
2025-04-05 05:12:17 -04:00
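The rewrite named in the commit above, a.where(b.where(c, d), d) -> (a & b).where(c, d), can be checked exhaustively for boolean conditions. The sketch below uses a plain-Python stand-in for `where`; the helper names `where`, `nested`, and `folded` are hypothetical and exist only for this illustration, not in the library.

```python
from itertools import product

def where(cond, x, y):
    # Scalar stand-in for an elementwise select: x if cond holds, else y.
    return x if cond else y

def nested(a, b, c, d):
    # Left-hand side of the rule: a.where(b.where(c, d), d)
    return where(a, where(b, c, d), d)

def folded(a, b, c, d):
    # Right-hand side of the rule: (a & b).where(c, d)
    return where(a and b, c, d)

# Try every combination of the two boolean conditions with distinct branch values.
mismatches = [
    (a, b)
    for a, b in product([False, True], repeat=2)
    if nested(a, b, "c", "d") != folded(a, b, "c", "d")
]
print(mismatches)  # []
```

The two forms agree because the inner select only matters when `a` is true, and in every other case both sides fall through to `d`.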
Sieds Lykles
9c2fc695b5
cond.logical_not().where(a,b) -> cond.where(b,a) ( #9741 )
...
* Add rule for negation in where, simplifies arange patterns
* 0 becomes 0.0 again
* Only if cond is bool
* ne is never None
* Add a test
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-04 19:13:32 -04:00
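The rule in the commit above, cond.logical_not().where(a, b) -> cond.where(b, a), is easy to confirm for boolean conditions. The sketch below uses a plain-Python `where` stand-in with hypothetical helper names. The integer case at the end uses Python's bitwise `~` purely as an assumed example of a non-boolean negation, to suggest why a bullet like "Only if cond is bool" might be needed; it makes no claim about how the library negates integers.

```python
def where(cond, x, y):
    # Scalar stand-in for an elementwise select: x if cond holds, else y.
    return x if cond else y

def negated(cond, a, b):
    # Left-hand side of the rule: cond.logical_not().where(a, b)
    return where(not cond, a, b)

def swapped(cond, a, b):
    # Right-hand side of the rule: cond.where(b, a)
    return where(cond, b, a)

# For booleans the two forms always agree.
bool_ok = all(negated(c, "a", "b") == swapped(c, "a", "b") for c in (False, True))

# Assumed non-boolean case: bitwise NOT of 2 is -3, which is still truthy,
# so swapping the branches would change the result.
int_lhs = where(~2, "a", "b")
int_rhs = where(2, "b", "a")
print(bool_ok, int_lhs, int_rhs)  # True a b
```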
George Hotz
8b5a523743
fix minimum length in pattern matcher ( #9736 )
2025-04-04 14:57:01 +08:00
George Hotz
cac8bcf8b5
use Ops.REDUCE ( #9721 )
...
* decrease bert python time [pr]
* order copies
* Revert "order copies"
This reverts commit 3f62c8693b.
* rewrite count
* Ops.REDUCE
* acc first in the add chain
* Fix tensor core acc
* arange patterns look good
* fix multireduce gate
* reduce rewrite rule
* bump that to 15 minutes
* multiwmma isn't fusing
* gep through wmma is gep pushing
* bump that timeout too, it's all env setup
* add failing test
2025-04-04 10:14:34 +08:00
chenyu
c20f112e9f
example test use z3 to verify valid simplification ( #9684 )
2025-04-02 01:05:52 -04:00
chenyu
c672716b38
improve vmin/vmax for IDIV ( #9678 )
2025-04-01 23:16:01 -04:00
chenyu
8dd88ad476
don't div_and_mod_folding for negative numerator with remainder ( #9674 )
...
can be wrong in C div since it truncates towards zero
2025-04-01 16:26:23 -04:00
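The note in the commit above, that C division truncates towards zero, can be illustrated with a small sketch. The helpers `c_div` and `c_mod` below are hypothetical names written for this example (they are not the helper functions mentioned in the neighbouring commit). They emulate C integer semantics so the result can be compared with Python's floor division.

```python
def c_div(a: int, b: int) -> int:
    # C-style integer division: the quotient is truncated towards zero.
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def c_mod(a: int, b: int) -> int:
    # C-style remainder: takes the sign of the dividend.
    return a - b * c_div(a, b)

# Floor and truncating division disagree for a negative numerator with a remainder.
print(-7 // 2, c_div(-7, 2))   # -4 -3
print(-7 % 2, c_mod(-7, 2))    # 1 -1

# Folding (x*d + r) // d into x + r // d is valid for floor division,
# but not for truncating division when the numerator is negative.
x, d, r = -1, 4, 1
print((x * d + r) // d == x + r // d)             # True
print(c_div(x * d + r, d) == x + c_div(r, d))     # False
```

In the last case the numerator is -3: truncation gives 0, while the folded form gives -1, which is the kind of mismatch the commit avoids by skipping the fold.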
chenyu
0e34f9082e
helper functions for cstyle div mod [pr] ( #9673 )
2025-04-01 08:06:56 -04:00
chenyu
5358b0904b
update uop_given_valid if a node becomes const ( #9604 )
...
* update uop_given_valid if a node becomes const
* cleanup
2025-03-27 14:57:46 -04:00
qazal
bf94924d5a
fix viz with nested graph_rewrite ( #9595 )
2025-03-27 13:14:28 +08:00
qazal
e5ff7b23d7
refactor to @track_matches + add failing test_nested_rewrite ( #9592 )
...
* test_nested_rewrite
* refactor to track_matches
* positional arg
2025-03-27 11:11:56 +08:00
George Hotz
3c5161b4cb
add validation of the bounds of Ops.INDEX ( #9503 )
...
* add validation of the bounds of Ops.INDEX
* do mask properly
* more validation
* correct
* fix gated
* add CAST support to vmin/vmax
* fix ptx and image
* ptx no diff
* upat.index also stays
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2025-03-20 12:15:55 +08:00
qazal
0b20f91ce7
remove move_mask from the devectorizer ( #9511 )
...
* remove move_mask from the devectorizer
* add (wrong) ptx
* reason
* enable index addition in PTX, we won't have the INDEX anyways
* space
2025-03-20 11:53:12 +08:00
chenyu
189f62d44f
add rounding to tqdm unit scale ( #9507 )
...
fixed `AssertionError: ' 1.00/10.0 1000it/s]' != ' 1.00/10.0 1.00kit/s]'`
2025-03-19 12:08:46 -04:00
hooved
136cf7b8b1
hotfix: load >2 GiB from disk on macOS ( #9361 )
...
* enable loading >2 GiB buffer from disk on macOS
* handle None case raised by mypy
* add test
* revert fix to repro bug in CI
* tell CI to run a unit test for macOS
* reapply fix
2025-03-07 14:51:58 +08:00
George Hotz
2cc4cb74f0
reorder binops ( #9328 )
...
* reorder binops
* test improvements + fix string tests
* ugh, okay this
2025-03-03 14:58:18 +08:00
qazal
e162aa862d
is_realized only if buffer is allocated ( #9253 )
...
* is_realized only if the buffer is allocated
* fix the image check too
* assert test_lil_model after ExecItems run
2025-02-26 08:58:08 +01:00
Sieds Lykles
9c4d9d9f10
Acc first ( #9232 )
...
* put acc in front of the add chain
* handle the other case
* Make loop collapse more generic
* Remove mulacc_unrolled
* Actually remove it
---------
Co-authored-by: George Hotz <geohot@gmail.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 22:10:15 -05:00
chenyu
90c3ed17c5
move cast to before softmax in attention ( #9213 )
...
* move cast to before softmax in attention
saves some memory because exp (which is used for backward) is done in half. training bert seems fine and can fit BS=78 now (from 66)
* test
2025-02-24 17:24:59 -05:00
qazal
14aa2395d0
allow VIEW(BUFFER) in Tensor UOps [pr] ( #9210 )
...
* allow VIEW(BUFFER) in Tensor UOps [pr]
* still reshapes
* update becomes_map tests
* bring copy folder to the scheduler
* lint
* only sgd left
* optimizer assign
* 13 kernels
* rename to test_reorder_expand + assert VIEW
2025-02-24 13:06:15 +01:00
qazal
d12efc95d4
support custom name function in viz [pr] ( #9219 )
...
* support custom name function in viz [pr]
* title case
* assert name count in test_track_rewrites_name_fxn
2025-02-24 03:03:25 +02:00
chenyu
2e7c2780a9
CLANG -> CPU ( #9189 )
2025-02-20 18:03:09 -05:00
chenyu
3e22747799
run unit test on windows ci ( #9187 )
...
* factor out testing_minimal in setup.py [pr]
* testing_unit + windows
2025-02-20 14:40:41 -05:00
chenyu
287de4ecc6
use torch in test_gradient ( #9186 )
...
used torch.autograd.grad, but not sure if it can be a template like jax
2025-02-20 12:26:11 -05:00
George Hotz
df3b320f46
rewriter -> devectorizer [pr] ( #9147 )
2025-02-18 12:42:08 +08:00
Ali Ladjevardi
35e9c4657b
Use proper units when printing beam time ( #9103 )
...
* use proper units when printing beam time
* refactor DEBUG=2
2025-02-17 23:41:38 +08:00