gswangg
3cf507ae7f
remove extra.ops and LazyOp support from Kernel ( #6267 )
...
* remove extra.ops and BufferOps
* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa
fixup neg tests [run_process_replay] ( #6268 )
2024-08-24 16:35:43 +03:00
qazal
bcb2f1caa3
init REDUCE_AXIS with BinaryOps ( #6256 )
...
* REDUCE_AXIS arg with BinaryOps
* more work in kernel.py
fixup sops.gz
* fix TestGraphRewriteEfficiency
2024-08-24 11:28:41 +03:00
chenyu
3fc8203475
remove NEG from handwritten ast in tests ( #6234 )
...
* remove NEG from handwritten ast in tests
* test_linearizer_failures
2024-08-22 09:06:59 -04:00
gswangg
c74b318458
migrate test_linearizer.py to UOp AST, pt. 2 ( #6228 )
2024-08-21 22:16:11 +03:00
qazal
3b8cc5a3e0
more multireduce tests prep for neg removal [run_process_replay] ( #6220 )
2024-08-21 12:45:24 +03:00
qazal
f03e5a4b3b
test_multireduce const has a shape ( #6218 )
2024-08-21 11:02:45 +03:00
gswangg
0e6f057eae
migrate test_linearizer.py to UOP AST (pt. 1) ( #6150 )
...
* migrate test_multioutput to UOP AST
* inline buf declarations
* migrate test_multireduce to UOp AST
* update test_mid_dim_multireduce to UOp AST
* update test_triple_multireduce with UOp AST
* make global definitions more concise
* update test_double_reduce_multireduce with UOp AST
* update test_multireduce_with_parallel with UOp AST
* update test_multiout_multireduce to UOp AST
* make gidx style consistent across updated tests
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-08-20 10:02:20 +03:00
chenyu
b36a7273c6
RUF018 assignment-in-assert [run_process_replay] ( #6172 )
...
assertions should not have side effects, or running with `-O` breaks.
initially just wanted to fix the one in rearrange, but it also made some long lines shorter
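A minimal sketch of the pattern that RUF018 flags, assuming two hypothetical helpers (`pop_checked_bad` and `pop_checked_good` are illustrative names, not code from this repository). Python strips `assert` statements when run with `-O`, so an assignment inside one silently disappears:

```python
# Hypothetical helpers for illustration only; not taken from the repository.

def pop_checked_bad(stack):
    # Anti-pattern: the walrus assignment lives inside the assert.
    # With `python -O` the whole assert is removed, so `top` is never bound
    # and the return below fails.
    assert (top := stack.pop()) is not None
    return top

def pop_checked_good(stack):
    # Side effect first, check second: the pop still happens under `-O`,
    # only the check is dropped.
    top = stack.pop()
    assert top is not None
    return top
```

With asserts enabled both helpers behave the same; only the second keeps working once asserts are stripped.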
2024-08-19 00:34:52 -04:00
Timmy
e3d14d1ccc
Lowerer Multireduce Grouping ( #6097 )
...
* grouping changes to codegen
* linters + tests
* fix identical store issue on PTX
* comment in grouping multireduce tests
* cleaning up diff
* cleaning up diff
* comments
* linters
* hotfix: dont change kernels
---------
Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-18 19:57:51 +03:00
George Hotz
5048066e79
st_arg, never -1 [run_process_replay] ( #6128 )
2024-08-16 22:46:56 -07:00
George Hotz
74ee9febec
remove iter from uopgraph ( #6110 )
...
* remove iter from uopgraph
* linearize returns uops
* fix tests
* linearize in linearize
* tests fix
* touchup
* test failures
2024-08-16 15:58:29 -07:00
qazal
28c75bf2a6
merge uops with ops ( #6111 )
...
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-08-16 18:17:57 -04:00
qazal
c23d44c779
AST is UOp ( #6030 )
...
* most of the work from the uops2 branch
* schedule
* realize
* kernel
* lowerer
* search
* green
* merge uops with ops
* Revert "merge uops with ops"
This reverts commit 1408a59f12.
* fix benchmark
* remove extra dedup
2024-08-16 22:09:00 +03:00
CaltropHungerton
38fb1e14a2
Intel XMX Tensor Core Support ( #5622 )
...
* fixed xmx demo
* i think i'm invoking the DPAS but it's slow
* compiler build arg to stop register spilling, indicated where to fix flop counter
* don't mind this
* do NOT mind me
* do not mind me
* do not view
* i will add bf16 later
* in process of figuring out tc fields
* we figured out the fields!!!
* added check for cl device vendor, added separate IntelRenderer
* remove tc thread_local_aliases
* cleaning debris before draft pr
* edits for linter
* deduping and checking device extensions
* i will find more line reductions in other places
* before merge upstream
* double grf size in compiler to fix register spilling (bandaid), device checking changes
* tc python emulation
* fixed emulation
* tests for emulated intel tensor core
* TC=0, 1 working on upstream, fixed perf
* test
* debris
* check for specialized cl device when we canonicalize device
* bf16 support, tc=3 test added
* address tests
* revert half2 loads on intel tc, cleanup
* linter
* fold_expanded revert
* lint, whitespace fix
* cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too
* make line shorter, no need for noqa E501
* removed device intel
* fix python emulation
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-08-16 09:19:21 -07:00
qazal
4d38fec8c1
rename lazyops to parents [run_process_replay] ( #6091 )
2024-08-15 17:27:32 +03:00
ignaciosica
164ca5632e
split tensor core tests ( #6041 )
2024-08-12 09:42:02 -04:00
Timmy
a00994b423
Lowerer Multireduce Uopgraph ( #6007 )
...
* uopgraph changes
* fixing for non-reducing ranges
* multireduce tests
* linters
* linters
* removing comments
* removing arg[1]
* linters
* prettier
* linters
* more linters
* use any instead of intersection
2024-08-12 15:16:07 +03:00
Timmy
8c99bdab08
More Multireduce Tests ( #5968 )
...
* multireduce tests
* linters
* more linters
* more linters
* seeing how it works with parallel
2024-08-08 22:04:08 +03:00
wozeparrot
97d708252a
remove realize from threefry ( #5969 )
2024-08-07 15:08:49 -07:00
George Hotz
1417cc8df1
can reenable that test now ( #5914 )
2024-08-06 13:38:21 -07:00
ignaciosica
81ae9fadc8
Float4 support for CLANG ( #5915 )
...
* float4 support on clang
* skip linearizer tests that require locals
* add aligned attribute
2024-08-06 07:50:12 -07:00
George Hotz
159ac06b5b
remove unused reduce rules + improve unparented ( #5908 )
...
* remove unused reduce rules [run_process_replay]
* this work
* those tests are meaningless now
2024-08-04 18:18:27 -07:00
George Hotz
877e0b4ba0
define global only has the index [run_process_replay] ( #5869 )
...
* define global only has the index [run_process_replay]
* fix that linearizer test
* fix ptx
* stupid ptx fix
2024-08-01 19:01:15 -07:00
chenyu
02f0be03f2
tests on UOp div negative number and arange opts ( #5825 )
2024-07-30 20:06:57 -04:00
George Hotz
17a2f74412
new style load/store folder ( #5784 )
...
* remove old index reorder
* new style folder
* works better
* dedup
* one failure
* this is fine now...
* expander_rewrite
* images broken, but all else should work
* cleanups
* make tests work with old
* fix images
* cleanups + bugfix
* minor fixes
* fix gated store folding
* flip gate_creator and expander
* fix gated store
* remove unneeded rules
* lines getting close
* line count good
2024-07-30 13:17:20 -07:00
George Hotz
4df46eac67
clean up tensor cores [run_process_replay] ( #5736 )
...
* clean up tensor cores [run_process_replay]
* remove tuple(wmma_sz), self.opts.device
* remove tls, leave DEVICE
2024-07-26 13:21:23 -07:00
chenyu
16c27ae400
update UOp.SPECIAL arg spec [run_process_replay] ( #5661 )
...
* update UOp.SPECIAL arg spec [run_process_replay]
from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable
* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz
386fb5e7f8
folding without UNMUL ( #5628 )
...
* folding without UNMUL
* fix failures, index_collapse
* import ReduceOps
* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
qazal
3ab5fe4e1b
test argmax multireduce failure ( #5609 )
2024-07-20 21:33:03 +08:00
chenyu
37dd233650
always reverse global dim ( #5586 )
...
* always reverse global dim
* one more test
2024-07-19 13:58:05 -04:00
George Hotz
10be05aae5
push contract through cast to fix test_float2_acc (try 2) ( #5585 )
...
* push contract through cast to fix test_float2_acc (try 2)
* contract push only on floats
2024-07-19 10:34:43 -07:00
George Hotz
51892c8fac
Revert "push contract through cast to fix test_float2_acc ( #5581 )" ( #5583 )
...
This reverts commit ddda9420be.
2024-07-19 09:44:30 -07:00
George Hotz
ddda9420be
push contract through cast to fix test_float2_acc ( #5581 )
...
* push contract through cast to fix test_float2_acc
* no_vectorized_alu applies to cast too
2024-07-19 09:30:26 -07:00
chenyu
3f590c3b31
some limit_dims to limit global merging ( #5489 )
...
only supports merging dims in a way that does not surpass limit, no splitting yet
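One way to picture merge-only limiting is the greedy sketch below. It is an illustration under stated assumptions, not the code from this commit: `merge_dims_within_limit` and its arguments are hypothetical names, and the real `limit_dims` logic may choose merges differently.

```python
# Hypothetical sketch; the function name and strategy are assumptions,
# not the implementation in this commit.

def merge_dims_within_limit(dims, max_sizes):
    """Merge adjacent dims until there are len(max_sizes) of them.

    Only merges are attempted (no splitting). Returns None when no merge
    keeps every dim within its per-axis limit.
    """
    dims = list(dims)
    while len(dims) > len(max_sizes):
        for i in range(len(dims) - 1):
            merged = dims[i] * dims[i + 1]
            # The merged dim would land on axis i, so it must fit that axis.
            if i < len(max_sizes) and merged <= max_sizes[i]:
                dims[i:i + 2] = [merged]
                break
        else:
            return None  # nothing mergeable without surpassing a limit
    if any(d > m for d, m in zip(dims, max_sizes)):
        return None  # an unmerged dim is already too large; would need a split
    return tuple(dims)
```

For example, four dims `(2, 3, 4, 5)` with three axes limited to 16 each collapse to `(6, 4, 5)`, while `(32, 32, 32, 32)` with limits of 64 cannot be merged and yields `None`.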
2024-07-19 12:17:46 -04:00
chenyu
2b2f8ad18c
failed example of float2 acc no longer applies ( #5573 )
...
* failed example of float2 acc no longer applies
* # noqa: E501
2024-07-19 02:40:04 -04:00
George Hotz
223d9283ee
fix float4 acc by moving contracts ( #5559 )
2024-07-18 11:30:16 -07:00
chenyu
f5af98c450
failed test case that DEFINE_ACC no longer uses float4 ( #5555 )
...
* failed test case that DEFINE_ACC no longer uses float4
* line
2024-07-18 10:55:59 -07:00
George Hotz
923e0fe0b8
fix half4 folding ( #5556 )
2024-07-18 10:47:39 -07:00
chenyu
12e6771209
failed test case for unrolled half4 ( #5552 )
2024-07-18 13:05:52 -04:00
kormann
2c4add6844
pretty print lazy op per default ( #5505 )
...
* pretty lop
* min diff
* walrus
* fix
* min diff
* simplify
* pretty helper function
* ws
* pretty uop upat
* tests
* stricter tests
* test passes
* ws
* stronger upat test
* delete print_tree
* min diff
* stricter exp test
* fix merge
* stronger uops eval test
* +readable and deep upat test
* +readable and deep upat test
* sort inv fix
* fix
* revert allowed_len
2024-07-18 09:34:08 -07:00
George Hotz
fa7e734b49
MetaOps.KERNEL ( #5543 )
2024-07-17 19:41:23 -07:00
qazal
61ee02e93d
start multireduce lowerer work (var/std) ( #5537 )
...
* multireduce no-opts works
* passed test_var_multireduce
* cleanup
* double reduce
* extra check for range_group
* more checking for range_groups
* cleaning up debug prints
* cleanup diff
* linters
* revert kernel changes
* these are uops toposort
---------
Co-authored-by: timmy <timmy0x@proton.me>
2024-07-17 23:43:46 +03:00
qazal
173064c69c
(re)start multireduce in codegen/* ( #5391 )
...
* test_var_multireduce
* run verify_lazyop
* test_var_multireduce
* assert lazyop
* add test_indexing_multireduce
* arange fuses (crude)
* note: extra reshape
* start readable
* test_arange_simple
* test_arange_expanded
* test_indexing_multireduce
* cleanups
* skip ptx
* skip nv and amd ci
* skip arange expanded too
* GPU=1 is slow too in CI
2024-07-16 14:20:48 +03:00
chenyu
63990705b5
test kernel opts case for 4 local and 4 groups ( #5499 )
...
make sure local grouped dim is correct
2024-07-15 20:09:38 -04:00
qazal
ac08f0eb00
reshape rawbufs in test_linearizer ( #5492 )
...
* reshape rawbufs in test_linearizer
* fix helper_linearizer_ast
2024-07-15 19:14:38 +03:00
chenyu
613a1dbeed
render lidx starting with 0 ( #5478 )
...
* render lidx starting with 0
changed from
```
int gidx0 = gid.x; /* 4096 */
int lidx4 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx5 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx6 = lid.z; /* 2 */
```
to
```
int gidx0 = gid.x; /* 4096 */
int lidx0 = lid.x; /* 8 */
int gidx1 = gid.y; /* 7 */
int lidx1 = lid.y; /* 8 */
int gidx2 = gid.z; /* 7 */
int lidx2 = lid.z; /* 2 */
```
the existing numbering started from the pre-limited global dims, which skips numbers if there are more than 3 global dims
* don't need start_dim
---------
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-14 16:34:04 -04:00
chenyu
28972418c4
s/get_linearizer/get_kernel [run_process_replay] ( #5467 )
2024-07-13 20:32:22 -04:00
George Hotz
03c2dc8bd7
lowerer is kernel [run_process_replay] ( #5437 )
2024-07-12 18:50:55 -07:00
George Hotz
b8342fb085
independent lowerer [run_process_replay] ( #5434 )
...
* independent lowerer [run_process_replay]
* don't relinearize PTX
* fix ptx
* Revert "fix ptx"
This reverts commit f4e8e059c0.
* Revert "don't relinearize PTX"
This reverts commit f6c12c506c.
* parents is fine, no need for linearization
* remove loop local idxs
* recover stupid loop_idxs
2024-07-12 18:08:43 -07:00