Commit Graph

225 Commits

Author SHA1 Message Date
chenyu
16c27ae400 update UOp.SPECIAL arg spec [run_process_replay] (#5661)
* update UOp.SPECIAL arg spec [run_process_replay]

from `(0, "gid0", 4)` to just `("gid0", 4)`. closer to a Variable

* fix ptx
2024-07-23 16:58:12 -04:00
George Hotz
386fb5e7f8 folding without UNMUL (#5628)
* folding without UNMUL

* fix failures, index_collapse

* import ReduceOps

* test_arange_4096 isn't folding
2024-07-21 20:14:44 -07:00
qazal
3ab5fe4e1b test argmax multireduce failure (#5609) 2024-07-20 21:33:03 +08:00
chenyu
37dd233650 always reverse global dim (#5586)
* always reverse global dim

* one more test
2024-07-19 13:58:05 -04:00
George Hotz
10be05aae5 push contract through cast to fix test_float2_acc (try 2) (#5585)
* push contract through cast to fix test_float2_acc (try 2)

* contract push only on floats
2024-07-19 10:34:43 -07:00
George Hotz
51892c8fac Revert "push contract through cast to fix test_float2_acc (#5581)" (#5583)
This reverts commit ddda9420be.
2024-07-19 09:44:30 -07:00
George Hotz
ddda9420be push contract through cast to fix test_float2_acc (#5581)
* push contract through cast to fix test_float2_acc

* no_vectorized_alu applies to cast too
2024-07-19 09:30:26 -07:00
chenyu
3f590c3b31 some limit_dims to limit global merging (#5489)
Only supports merging dims in a way that does not surpass the limit; no splitting yet.
2024-07-19 12:17:46 -04:00
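A minimal sketch of the constraint described in 3f590c3b31 (merge dims only when the result stays within the limit, never split). The function name `merge_dims_within_limit` and the greedy left-to-right rule are hypothetical and are not taken from the tinygrad source; they only illustrate the idea.

```python
from typing import List, Optional

def merge_dims_within_limit(dims: List[int], max_sizes: List[int]) -> Optional[List[int]]:
    """Merge adjacent dims until len(dims) <= len(max_sizes), never exceeding a limit.

    Returns the merged dims, or None if this greedy pass cannot fit them.
    Splitting an oversized dim is deliberately not attempted in this sketch.
    """
    dims = list(dims)
    while len(dims) > len(max_sizes):
        for i in range(len(dims) - 1):
            # After merging, the product sits at position i, so check limit i.
            if i < len(max_sizes) and dims[i] * dims[i + 1] <= max_sizes[i]:
                dims[i:i + 2] = [dims[i] * dims[i + 1]]
                break
        else:
            return None  # no adjacent pair can be merged within its limit
    # Every remaining dim must also fit its own limit.
    if any(d > m for d, m in zip(dims, max_sizes)):
        return None
    return dims

merged = merge_dims_within_limit([2, 3, 4, 5], [16, 16, 16])
too_big = merge_dims_within_limit([32, 32, 32, 32], [16, 16, 16])
```

Here four dims fit into three slots by merging the first pair, while the second call returns `None` because no merge stays under the assumed limits.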
chenyu
2b2f8ad18c failed example of float2 acc no longer applies (#5573)
* failed example of float2 acc no longer applies

* # noqa: E501
2024-07-19 02:40:04 -04:00
George Hotz
223d9283ee fix float4 acc by moving contracts (#5559) 2024-07-18 11:30:16 -07:00
chenyu
f5af98c450 failed test case that DEFINE_ACC no longer uses float4 (#5555)
* failed test case that DEFINE_ACC no longer uses float4

* line
2024-07-18 10:55:59 -07:00
George Hotz
923e0fe0b8 fix half4 folding (#5556) 2024-07-18 10:47:39 -07:00
chenyu
12e6771209 failed test case for unrolled half4 (#5552) 2024-07-18 13:05:52 -04:00
kormann
2c4add6844 pretty print lazy op per default (#5505)
* pretty lop

* min diff

* walrus

* fix

* min diff

* simplify

* pretty helper function

* ws

* pretty uop upat

* tests

* stricter tests

* test passes

* ws

* stronger upat test

* delete print_tree

* min diff

* stricter exp test

* fix merge

* stronger uops eval test

* +readable and deep upat test

* +readable and deep upat test

* sort inv fix

* fix

* revert allowed_len
2024-07-18 09:34:08 -07:00
George Hotz
fa7e734b49 MetaOps.KERNEL (#5543) 2024-07-17 19:41:23 -07:00
qazal
61ee02e93d start multireduce lowerer work (var/std) (#5537)
* multireduce no-opts works

* passed test_var_multireduce

* cleanup

* double reduce

* extra check for range_group

* more checking for range_groups

* cleaning up debug prints

* cleanup diff

* linters

* revert kernel changes

* these are uops toposort

---------

Co-authored-by: timmy <timmy0x@proton.me>
2024-07-17 23:43:46 +03:00
qazal
173064c69c (re)start multireduce in codegen/* (#5391)
* test_var_multireduce

* run verify_lazyop

* test_var_multireduce

* assert lazyop

* add test_indexing_multireduce

* arange fuses (crude)

* note: extra reshape

* start readable

* test_arange_simple

* test_arange_expanded

* test_indexing_multireduce

* cleanups

* skip ptx

* skip nv and amd ci

* skip arange expanded too

* GPU=1 is slow too in CI
2024-07-16 14:20:48 +03:00
chenyu
63990705b5 test kernel opts case for 4 local and 4 groups (#5499)
make sure local grouped dim is correct
2024-07-15 20:09:38 -04:00
qazal
ac08f0eb00 reshape rawbufs in test_linearizer (#5492)
* reshape rawbufs in test_linearizer

* fix helper_linearizer_ast
2024-07-15 19:14:38 +03:00
chenyu
613a1dbeed render lidx starting with 0 (#5478)
* render lidx starting with 0

changed from
```
  int gidx0 = gid.x; /* 4096 */
  int lidx4 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx5 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx6 = lid.z; /* 2 */
```
to
```
  int gidx0 = gid.x; /* 4096 */
  int lidx0 = lid.x; /* 8 */
  int gidx1 = gid.y; /* 7 */
  int lidx1 = lid.y; /* 8 */
  int gidx2 = gid.z; /* 7 */
  int lidx2 = lid.z; /* 2 */
```

The previous numbering started from the pre-limited global dims, which skips numbers if there are more than 3 global dims.

* don't need start_dim

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2024-07-14 16:34:04 -04:00
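The renaming in 613a1dbeed can be sketched as follows. This is an illustration only: `index_names` and its `local_start` parameter are hypothetical names, and the value 4 for the old starting offset is an assumption read off the `lidx4` in the commit's example (four global dims before limiting to three).

```python
from typing import List

def index_names(num_global: int, num_local: int, local_start: int = 0) -> List[str]:
    """Return rendered index names: global indices first, then local ones."""
    global_names = [f"gidx{i}" for i in range(num_global)]
    # local_start is the offset the local numbering begins at
    local_names = [f"lidx{local_start + i}" for i in range(num_local)]
    return global_names + local_names

# Old numbering (assumed): locals continue from the pre-limit global count.
old = index_names(3, 3, local_start=4)
# New numbering: locals always start at 0.
new = index_names(3, 3)
```

With the old offset the locals come out as `lidx4`, `lidx5`, `lidx6`; starting at 0 gives `lidx0`, `lidx1`, `lidx2`, matching the two listings in the commit message.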
chenyu
28972418c4 s/get_linearizer/get_kernel [run_process_replay] (#5467) 2024-07-13 20:32:22 -04:00
George Hotz
03c2dc8bd7 lowerer is kernel [run_process_replay] (#5437) 2024-07-12 18:50:55 -07:00
George Hotz
b8342fb085 independent lowerer [run_process_replay] (#5434)
* independent lowerer [run_process_replay]

* don't relinearize PTX

* fix ptx

* Revert "fix ptx"

This reverts commit f4e8e059c0.

* Revert "don't relinearize PTX"

This reverts commit f6c12c506c.

* parents is fine, no need for linearization

* remove loop local idxs

* recover stupid loop_idxs
2024-07-12 18:08:43 -07:00
George Hotz
870dc8c350 s/Linearizer/Lowerer [run_process_replay] (#5428) 2024-07-12 15:54:07 -07:00
George Hotz
6707c778d0 scheduleitem is not Tuple [run_process_replay] (#5425)
* scheduleitem is not Tuple [run_process_replay]

* fix tests

* fix op + fuzzers

* fix mop test
2024-07-12 15:13:19 -07:00
chenyu
d37056f3b1 pass Renderer.global_max / local_max into get_grouped_dims (#5423)
[run_process_replay]
2024-07-12 16:49:27 -04:00
George Hotz
f6ef283e6a s/loadops/metaops [run_process_replay] (#5421) 2024-07-12 13:26:50 -07:00
chenyu
76125c07be make some grouped_dim test work (#5415)
Next: need to support max size per dim, splitting, and a correct way to reverse or arbitrarily permute global dims.
2024-07-12 14:22:50 -04:00
George Hotz
c2da4454cd indexing getting better (#5389)
* indexing getting better [run_process_replay] [no_assert]

* fix test

* test_arange_2_reduce is a simpler test

* put that print back, NOOPT

* don't merge reduces (they could be different reduces)

* FUSE_AS_ONE_KERNEL

* fix tests

* fix test_var_multireduce

* w/e put that there

* fails on others too

* fix test, revert UNMUL change

* in case order matters

* one kernel indexing works

* one kernel indexing works (test other)
2024-07-11 16:41:51 -07:00
qazal
0421f5d83e hotfix: compare test_var_multireduce against numpy (#5394) 2024-07-11 18:57:08 -04:00
George Hotz
6972a2569f Linearizer -> Lowerer (#4957)
* st to uops function

* lowerer

* uops reduce

* uops reduce

* acc_number correct

* reduce unroll

* complete unroll

* do upcasts

* handle multioutput

* define_accs

* fix valid

* get grouped dims

* revert lin

* minor

* fixup_ast

* group for reduce

* group works now

* all forwards pass

* all ops tests pass

* fix clang

* mypy

* lil cleanups, no image yet

* ugh, variables everywhere

* bugfix

* counters and name fix

* use symbolic, not uops

* cleanups

* Fix tests

* linearizer tests

* expands

* float4 expand load

* tests pass

* woooo, float4 test

* test ops works again

* one more lin test

* more lin tests

* bypass

* fix tests

* something like this

* const in defineacc

* uops get_reduce_acc

* move around

* allow consts in the LOAD/STORE

* each axis should only appear once, 21 failures

* 16 failures

* fix some image

* optional float4

* onnx tests

* gate the stores

* add reorder

* fix terrible skip function

* tc work

* opt add/mul merge

* fix float4 tests

* tiny tweak, 9 failing

* 7 test failures

* start tc, but i don't think this will work

* progress on tensorcores

* note

* fix ops tests

* closer on tc

* weeee...one tensor core works

* still works, more generic

* large WMMA works

* tc test passes

* use WMMA as accumulator

* basic tc tests passing

* small gemm padded works

* 4 failures

* 3 tests failing

* super barrier

* now two tests failing

* one test failing

* cleanups, add reduce to UopGraph

* remove the linearizer

* remove unused

* lil cleanups

* Lowerer everywhere

* remove test that doesn't exist now

* image indexing

* llvm fix

* fix metal

* fix image

* fix images

* might fix ptx

* fix image type mismatch

* more tests pass

* CAST -> VECTORIZE

* forgot that one

* fix TestOps.test_flip_eye_crash

* locals shouldn't be image dtype

* change less files

* test fix

* fix recursive expands

* touches

* MULACC support in python

* delete unneeded

* alu before contract

* bug fixes

* tests

* no var multireduce

* simpler tc

* metal works in new style

* working on AMD and METAL

* fix amd

* shot in the dark, fix amd

* something for CUDA

* CUDA WORKS from the docs

* comment

* correct merge

* cleanups + ptx fix + get_reduce_acc

* local alias isn't used anymore

* add store sanity check

* fix for AMD

* cleanups and single expand pass

* more correct with acc_cache

* tests should pass

* block on WMMA

* tests pass

* merge contract and reduce

* contractor fixes issue

* multicontract

* pre expand wmma (same as a reduce)

* expand wmma and only take one

* all expands

* comments and whitespace
2024-07-10 15:07:42 -07:00
qazal
1f5de80eba multi reduce Tensor.var passing verify_lazyop (#5346)
* what about this

* reset late gate
2024-07-09 17:20:17 +03:00
chenyu
4ceab5d2b1 fix PTX match rule for gated LOAD (#5338)
* test padto sum with bool tensor and bool acc dtype

make sure bool tensor acc with gate is handled correctly

* broken in PTX

* fix ptx
2024-07-08 22:25:03 -04:00
chenyu
a80f2df1bd fix some PTX tests (#5337)
Fix broken PTX tests in test_linearizer and test_uops. Some tests were skipped and broken because they ran only with CUDA=1, and we now run PTX with NV=1.
2024-07-08 21:33:05 -04:00
qazal
ae10e936e7 UOps.VECTORIZE cleanups [run_process_replay] (#5314)
* still render_cast

* one extra line ok

* these are all just vectorize

* save space

* behavior change can go in a different diff
2024-07-07 10:49:08 +03:00
greg-niemeyer
77b2ce9fc9 Add UOps.VECTORIZE [run_process_replay] (#5289)
* Add UOps.VECTORIZE to core

* Update vectorized cast tests

* Addresses code review comments

- Removes VECTORIZE from LLVMRenderer
- Add line breaks to unduly long lines
- Add noop CAST rule back
- Update asserts and add render_vectorize in
  CStyleLanguage renderer

* Add missing const folding rule for VECTORIZE

Also adds corresponding test

* Fixes test_const_vectorize_fold and add assert

- Use sane types with VECTORIZE in test_const_vectorize_fold
- Add assert that sanity checks the types for VECTORIZE

* Rename test_cast_vectorized_fold

Renames test_cast_vectorized_fold to test_noop_vectorize_fold
because the test targets a very specific rule and there are
other tests for VECTORIZE.

* Revert unrelated changes

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
Co-authored-by: qazal <qazal.software@gmail.com>
2024-07-07 09:59:57 +03:00
qazal
8a99514462 generalize the uops toposort spec to ptx (#5309)
* generalize spec to ptx

* redundant assert

* extra print
2024-07-07 00:06:30 +03:00
chenyu
3929a9dc94 fix UOp.cmp_tuple for ALU (#5280)
* fix UOp.cmp_tuple for ALU

for ALU, use self.arg instead of self.op to compare

* skip that?
2024-07-03 14:59:05 -04:00
George Hotz
e53b164e1a small changes from lowerer (#5266) 2024-07-02 15:03:54 -07:00
qazal
3f4eeb8b54 late UOps.IF generation [run_process_replay] [no_assert] (#5027)
* find all places

* test gates

* test

* gate based on depths

* add ctx

* that cache was so wrong

* delete useless things

* dont double write if

* self.if_cond

* move UOps.IF to gated store

* test_padto_where_multioutput

* test_padto_group

* minor cleanup

* hmm this actually works?

* need a good barrier

* merge 2

* delete ctx

* p1

* maybe p2

* p3

* minor fixup

* fixup 2

* smart thing from the Lowerer branch

* refactoring

* refactoring 2

* maybe before graph_rewrite

* slightly more acceptable Linearizer diff

* more correct

* [run_process_replay] [no_assert]
2024-06-29 12:22:14 -04:00
George Hotz
80ac21200b hotfix: linearizer test fixup 2024-06-28 10:52:25 -07:00
George Hotz
d094a6828f single pass rewrite (#5159)
* single pass rewrite

* claude cleanups

* claude cleanups

* skip those tests

* restrict that to ints

* comment

* asserts i don't expect to fail do fail

* simplest...rewrite...ever

* simplest...rewrite...ever

* add that rule back

* tests pass?

* only collapse reduce loops

* second SHL/SHR arg must be 4 bytes

* fix verify

* no SHL/SHR in ptx

* put that back

* skip them in PTX...bad tests
2024-06-27 11:36:05 -07:00
Roelof van Dijk
f88f71d73a ruff: unnecessary-comprehension (#5174)
* enable ruff C416 unnecessary-comprehension

* already a list
2024-06-27 07:45:29 -04:00
Jhenner Tigreros
fa78755f19 Add new patterns to unfold division (#5139)
* Add new patterns to unfold division

* Create regression test and fix pattern
2024-06-25 18:07:47 -07:00
qazal
c4fdb9c725 second iteration on verify_lazyop (#5140) 2024-06-25 09:44:32 +03:00
qazal
18e70deec3 verify_lazyop (#5124)
* start verify_lazyop

* bfs order

* assert

* assert shapetrackers 2

* refactor

* more iteration

* skips

* that ast was wrong too
2024-06-24 13:45:35 -07:00
Francis Lam
b563cd52ed linearizer: change globals to merge into left axis/gridDims.x first (#5033)
* linearizer: change order of collapse to be left-most

also fixes Variable max size to be correct and adds docs for the off
parameter

* fix multiple global dim oversizes

* add passing variable test and reorganize tests

* use assert RuntimeError for failing test
2024-06-23 18:53:15 -04:00
qazal
28bf8d86d8 test_linearizer with multi output ASTs (#5115)
* ast is tuple

* run test_phi_simplification

* update reason

* more tc

* beam

* a few more

* use test_opt directly
2024-06-23 15:41:24 +03:00
qazal
5717a54b28 don't use Tensor.empty in kernel opts tests (#5086) 2024-06-21 18:41:03 +03:00
George Hotz
6f6b3b10c9 import from uops, not linearizer (#5064) 2024-06-20 08:08:44 -07:00