qazal
f4ec57baff
new schedule linearizer enqueues KERNEL UOps [pr] ( #9993 )
...
* new schedule linearizer enqueues kernels [pr]
* no defaultdict
* diff
* minor
2025-04-23 05:17:58 +08:00
George Hotz
d1f6701eb7
hotfix: lower amd threshold + improve block reorder test
2025-04-22 20:44:29 +01:00
nimlgen
db51133537
rename HWInterface -> FileIOInterface ( #9989 )
...
* rename HWInterface -> FileIOInterface
* ugh
2025-04-22 22:18:57 +03:00
George Hotz
c1539b0319
putting add first orders loads as expected ( #9991 )
2025-04-22 20:12:05 +01:00
nimlgen
bd580d8ea4
hcq: use mmio interface in nv ( #9986 )
...
* hcq: start mmio interface
* allow double cast
* revert
* faster?
* simpler, not needed more now
* dd
* types
* fix
2025-04-22 21:58:12 +03:00
George Hotz
feee6986c9
faster block reorder ( #9990 )
...
* faster block reorder [pr]
* that shouldn't change order
* key just in sorted
* ind
2025-04-22 19:18:57 +01:00
qazal
6cb2d18c03
refactor schedule linearize to defaultdict [pr] ( #9984 )
...
* refactor schedule linearize to defaultdict [pr]
* skip that
* don't need .get
2025-04-23 00:00:23 +08:00
chenyu
9e5e371999
make DISABLE_COMPILER_CACHE a ContextVar [pr] ( #9983 )
2025-04-22 10:32:54 -04:00
qazal
bbc324f5dc
remove CAST_AFTER_EXPAND ( #9980 )
2025-04-22 21:06:11 +08:00
George Hotz
c519b553db
non recursive toposort is 2x+ faster ( #9979 )
...
* non recursive toposort is 2x+ faster
* don't change the order
2025-04-22 13:59:38 +01:00
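Note on the toposort commit above: the log does not show the actual implementation, so the following is only a minimal sketch of the general technique of replacing a recursive post-order walk with an explicit stack. The names `toposort`, `srcs`, and the example graph are hypothetical and are not taken from the repository.

```python
def toposort(root, srcs):
    """Return every node reachable from root, each one after all of its sources.

    srcs maps a node to the list of nodes it depends on (hypothetical layout).
    """
    order = {}  # dict keeps insertion order and doubles as the visited set
    stack = [(root, False)]  # (node, sources_already_pushed)
    while stack:
        node, expanded = stack.pop()
        if node in order:
            continue
        if expanded:
            # all sources were emitted before we came back to this node
            order[node] = None
        else:
            # revisit the node after its sources have been handled
            stack.append((node, True))
            # push in reverse so sources are visited in their original order
            for s in reversed(srcs.get(node, ())):
                if s not in order:
                    stack.append((s, False))
    return list(order)

graph = {"sink": ["add", "mul"], "add": ["a", "b"], "mul": ["a", "c"]}
result = toposort("sink", graph)
print(result)
```

Because the traversal state lives in a list rather than on the call stack, a sketch like this is not limited by Python's recursion depth on long dependency chains, and it yields the same order a recursive post-order walk would.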
qazal
7b55846e08
prep STORE UOp creation for multi output [pr] ( #9975 )
...
* prep STORE UOp creation for multi output [pr]
* test_multioutput_ast
2025-04-22 19:34:52 +08:00
George Hotz
e358e0a0c6
move metadata set to tensor [pr] ( #9976 )
...
* move metadata set to tensor [pr]
* only track that in tensor.py
2025-04-22 12:30:35 +01:00
George Hotz
f5dc70c624
microbenchmarks + micro speed ups ( #9972 )
...
* microbenchmarks
* forgot the ubenchs
* clean up type verify
2025-04-22 11:30:46 +01:00
qazal
1cf4e24ca5
fix kernelize usage with pm_gradient ( #9953 )
...
* fix kernelize usage with pm_gradient
* remove that
2025-04-22 17:26:05 +08:00
qazal
36ed3c3253
fix kernelize with VIEW children ( #9961 )
2025-04-21 23:38:46 +08:00
qazal
e8910540f6
Kernelize can be called multiple times on a Tensor ( #9949 )
...
* Kernelize can be called multiple times on a Tensor
* add (failing) test_kernelize_bw
2025-04-21 06:28:47 +08:00
qazal
1d90be2cff
match kernelize API in process replay ( #9948 )
2025-04-21 05:23:41 +08:00
qazal
e20ef7196a
Tensor.kernelize ( #9845 )
...
* add kernelize
* remove that
* kernelize returns self
* update abstractions2.py
* kernelize in test_schedule
* temp: assert BUFFER_VIEW's existence
* ASSIGN must have a buffer or subbuffer target
* assert and shrink
* fix
* padded setitem
* var
* toposort once
* extra
* base_buffer
* end with BUFFER_VIEW
* setitem for disk
* test_setitem_becomes_subbuffer
* mul slice test
* torch backend fix 1
* non-deterministic
* keep subbuffer
2025-04-20 20:53:49 +08:00
qazal
dd16087f62
fold double ASSIGN to same target ( #9941 )
2025-04-20 19:06:38 +08:00
qazal
9a9aba4cd5
setitem tests (some failing) from kernelize ( #9940 )
2025-04-20 18:47:55 +08:00
chenyu
6c30948df6
hand_coded_optimizations returns list[Opt] [pr] ( #9938 )
...
new api looks like `k.apply_opts(hand_coded_optimizations(k))`
2025-04-19 20:26:59 -04:00
chenyu
720f20865b
remove required_optimizations ( #9848 )
2025-04-19 16:51:16 -04:00
Ignacio Sica
023b1c28a2
test_tensor_cores_padded refactor ( #9724 )
...
* set pad to 3 for amd padded tc test
* change pad for amd regardless CI
* test tc padded uops and correctness separately
* add test_tensor_cores_padded_uops test to ci
* remove redundant check for amd device
* cleanup
2025-04-18 17:05:54 -03:00
qazal
b58decac0c
fix diamond assigns before mapping tensors UOps to assigns ( #9855 )
...
* keep tensor_map until diamond assign fixup
* ctx
2025-04-18 14:17:43 +03:00
George Hotz
aa98aff4cd
don't use ops name, just keep sink ( #9922 )
...
* don't use ops name, just keep sink
* fix test
* endif sink
2025-04-18 08:59:18 +01:00
George Hotz
8919370c76
hotfix: fix test_save_all_dtypes on METAL
2025-04-18 08:42:31 +01:00
qazal
16dfe0a902
upstream remu ( #9921 )
2025-04-18 01:57:36 +03:00
chenyu
f5256e0020
Kernel.apply_opts [pr] ( #9917 )
...
* Kernel.apply_opts [pr]
updated all `for opt in`. also updated a few test_linearizer tests to not implicitly depend on hand_coded_optimization
* not you yet
2025-04-17 08:00:56 -04:00
Eitan Turok
2c7c205bc5
Fix dtype comparisons in vectorized transcendental + tests ( #9794 )
...
* init test
* cleanup
* init
* update
* fix
* fix python runtime for vectorized code
* awesome helper
* update
* update
* cleanup
* more cleaning
* cleanup more
* fix tests
* more cleaning
* cleanup more
* fix
* even cleaner
* failing tests is sad
* cleanup
* better name
* make tests pass
* remove vec from python runtime
* remove vec from eval_uop
* remove expected failures
* better name
2025-04-16 08:06:12 -04:00
geohotstan
4e8f25109a
Revert "ONNX add output shape validation ( #9720 )" ( #9904 )
...
This reverts commit ac713e04db.
2025-04-16 03:15:56 -04:00
pkotzbach
5849c43382
FP8s part 1 ( #9887 )
...
* fp8s part 1
* prettier
* fixes
* fixes
* remove stuff that should be in next pr
* revert
* add creation
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
nimlgen
83ae83d871
compare amd and am to cpu as well ( #9896 )
2025-04-15 13:32:18 +03:00
nimlgen
23a95dd84d
script to compare amd and am kerns ( #9889 )
...
* script to compare amd and am kerns
* tool
* is it used???
2025-04-15 00:11:22 +03:00
chenyu
ce454793e6
support specifying dtype for Tensor.linear ( #9886 )
2025-04-14 13:55:11 -04:00
George Hotz
44e4934167
fast pattern matcher [pr] ( #9737 )
...
* FastPatternMatcher
* works without that
* fix test pickle
* strict len
* compile match function
* dynamic compile
* fast
* faster
* compile
* track
* a lot faster
* clean up
* dup or
* faster and simpler
* fast match doesn't support store
* plane
* minor refactor
* real speed
* don't imply return None
* upat
* fix test
* heard you wanted more speed
* no generator
* split cf
* early fixup
* fxn fixup
* reconstruct_function
* Revert "reconstruct_function"
This reverts commit 37dac010ab.
* simpler stuff
* too big
* upat compile error
* cleanups
* don't cache that
* cleanups
* 10 -> 15
2025-04-14 15:24:41 +01:00
qazal
e201bc3e93
process replay kernel asts in toposort order [pr] ( #9869 )
...
* process replay kernel asts in toposort order [pr]
* use HEAD replay
2025-04-13 17:20:34 +08:00
Alexey Zaytsev
7dda6aae7d
Skip CLOUD in external_test_example ( #9857 )
...
Closes #9814
2025-04-12 10:17:44 +08:00
George Hotz
dd52951dd0
fix single kernel softmax with cast ( #9842 )
...
* fix single kernel softmax with cast
* tolerate none
* 3e-4
* skip on dtype
2025-04-11 12:12:02 +08:00
chenyu
8c6299bced
move hand_coded_optimizations to heuristic.py [pr] ( #9844 )
...
* move hand_coded_optimizations to heuristic.py [pr]
also folded all long lines
* make a copy and rename self -> k
* fix test
2025-04-10 23:40:16 -04:00
chenyu
e0ec8be37d
use CPU for test_schedule_ring ( #9843 )
...
* use CPU for test_schedule_ring
* why pre-commit is good
2025-04-10 23:20:53 -04:00
qazal
fbc6aa53d4
script for local process_replay + fix viz name [pr] ( #9837 )
2025-04-11 00:39:18 +08:00
qazal
16956b79de
canonicalize Device.DEFAULT ( #9835 )
2025-04-10 23:02:11 +08:00
George Hotz
f666dd14eb
fix get reduce contraction with test ( #9834 )
2025-04-10 22:24:21 +08:00
chenyu
7fa5f29582
add test_embedding to test_softmax_fusion ( #9832 )
2025-04-10 08:25:34 -04:00
George Hotz
53f0b2aad7
fix infinite loop in flash attention ( #9827 )
...
* fix infinite loop in flash attention
* get_contraction_with_reduce
* skip that test
* SINGLE_KERNEL_SOFTMAX + fix multi
* default IGNORE_OOB
* print change
2025-04-10 20:06:44 +08:00
qazal
16afe04f45
move process replay to grouper ( #9830 )
...
* simpler
* sched
2025-04-10 18:27:42 +08:00
chenyu
c8f47c1d07
not_support_multi_device helper ( #9831 )
...
unify the test helper to skip ci device that does not support multi
2025-04-10 05:25:29 -04:00
chenyu
c462162db8
update benchmark bert scripts with BS and ACC_DTYPE ( #9826 )
...
BS=16, ACC_DTYPE=half for tinybox, BS=128, ACC_DTYPE=float for mi300x
2025-04-10 02:06:02 -04:00
qazal
498a2bf738
add err handling tests to viz + cleanups ( #9825 )
...
* cleanup
* add err handling tests to viz + cleanups
* lint
2025-04-10 14:05:05 +08:00
George Hotz
fce432d2e3
Ops.FUSE makes softmax a single kernel ( #9808 )
...
* KERNELIZE makes softmax a single kernel
* single kernel works
* softmax works
* broken
* correct
* skip that test
* kernelize tests
* rename to fuse
* better reduce_push_add_ones code
* correct now
* cleanups
* oops
* return None if we can't push ones
* rename + docs
* atol fixes group
* flash attention broken test
2025-04-09 22:56:28 +08:00