168 Commits

Author SHA1 Message Date
George Hotz
07b350a8f4 new uops is an actual graph (#4560)
* new uops is an actual graph

* it's way slower

* simpler

* fix define acc

* render_loop unique

* ops test pass

* add pattern matcher back, there's bugs

* rewrite

* use priority queue

* recursive children

* fix tests

* fix tests with SINK

* fix abstractions

* fix assembly

* simpler

* link define_acc

* fix DEFINE_ACC placement

* type verify

* full cmp

* fix cmp

* ACCESS_ACC

* insert DEFINE_ACC

* fix PHI

* recursive rewrite

* fix many tests

* sum collapse

* more patterns

* correct change

* fold arange

* fix that lin test

* space

* big folding rule works

* close

* has more maxes, meh

* cached node replace

* set changed

* simplest folding yet

* works

* works

* DIV

* all tests pass

* del

* fuzz linearizer fails

* sum_collapse

* test depth 2 cf

* fix lin test 14

* fix clang depth

* disable that

* failure 14 is fixed

* fix ptx

* failure 27 is fixed

* fix llama

* run_cnt

* Revert "Optimize PTX gated loads index calculation (#4304)"

This reverts commit d97d5a7689.

* fix uops loop

* fix ptx bugs

* add barrier

* print

* mem_type in ptx direct

* bypass tests that fail in CI but pass locally

* ptx remove ptr_ar

* more ptx passing

* fix ptx tests

* assert compile support

* remove  model inference benchmark from red
2024-05-17 18:00:18 -07:00
uuuvn
639ea5b0f2 Metal linearizer failure 22 is flaky not just on CI (#4617)
* METAL doesn't fail anymore, not just on CI

* oops
2024-05-16 11:31:23 -04:00
nimlgen
eb9689336e nv mockgpu (#4600)
* mockgpu nv

* works

* comment that out

* fix merge

* setup gpuocelot

* install packages

* not run all of them

* passes

* fix ci

* almost

* should pass

* linter

* linter 2

* try this?

* ugn, not supported

* ci

* remove ticket from description

* better descs
2024-05-15 23:46:08 +03:00
Ahmed Harmouche
662bca8134 Split UnaryOps.CAST into CAST and BITCAST (#4487)
* Separate cast and bitcast

* Fix lint

* No more arg[0]

* Revert "No more arg[0]"

This reverts commit dee6911335513f092fe2cbb9684e8a9d26aad964.

* CAST/BITCAST arg is the dtype only, no more tuple

* No image bitcast, regenerate dataset

* Small fixes
2024-05-15 11:43:31 -04:00
George Hotz
ff64bcab69 move graph/search to engine (#4596) 2024-05-14 23:12:59 -07:00
chenyu
afe020710d disable PADTO on upcasted axis (#4444)
fixed test_failure_31. PADTO upcasted is at best a no-op, and might fail at edge cases.
2024-05-05 21:52:03 -04:00
Francis Lam
c8595a9655 update sops.gz, fix tests and add new linearizer test (#4437)
* update sops.gz, fix tests and add new linearizer test

* remove METAL CI skip for test_failure_22

* re-add skip to METAL CI to test_failure_22
2024-05-05 17:31:25 -04:00
chenyu
3f3af0fb85 test_linearizer_failures 29 passes now (#4215)
TC + PADTO fixed
2024-04-18 19:49:23 -04:00
chenyu
1fa0351acb fix DEFINE_ACC invalid_value to have same type as localtype (#3980) 2024-03-28 19:21:17 -04:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issue since it does not have `.name`.
also more robust if there are lang specific type override
2024-03-23 19:25:48 -04:00
chenyu
30fa03243e reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861) 2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4 remove helper_add_store from test_linearizer_failures (#3860) 2024-03-21 12:53:31 -04:00
Francis Lam
131bbb6563 test_linearizer_failure: add failure 27 from a gpt2 kernel (#3825)
* test_linearizer_failure: add failure 27 from a gpt2 kernel

found during a full fuzz test of applied_opts combos to a
depth of 4 on the gpt2 kernels w/o GROUPTOP.

added additional examples to failure 26 that don't have GROUPTOP

* add other platform failure
2024-03-19 16:29:50 -04:00
Francis Lam
9851e2c3b9 test_linearizer_failure: add failure 26 from a gpt2 kernel (#3821)
found during a full fuzz test of all applied_opts combos to a
depth of 3 on the gpt2 kernels
2024-03-19 13:19:54 -04:00
chenyu
ac866eaf5a disable simplify_phi_loops (#3812)
* disble simplify_phi_loops

this breaks BEAM search GPT2.

* skip that
2024-03-18 19:25:26 -04:00
Francis Lam
a7afd2f6bf test_linearizer_failures: add failing kernel from GPT2 CUDA (#3808)
* test_linearizer_failures: add failing kernel from GPT2 CUDA

* test_linearizer_failure: remove "HIP" from failed_platforms
2024-03-18 17:16:40 -04:00
qazal
e3e89c244b multioutput uoping infra (#3706)
* linearize multioutput

* add vars to copy
2024-03-15 21:56:59 -07:00
chenyu
a2d3cf64a5 move is_dtype_supported to test.helpers (#3762)
* move is_dtype_supported to test.helpers

updated all places that check if float16 is supports

* fix tests
2024-03-15 14:33:26 -04:00
nimlgen
6b8c66e04f fix broken loops in llvm (#3751) 2024-03-15 11:57:51 +03:00
nimlgen
6bf11a2ce3 fix incorrect direct store with gep (#3735)
* fix incorrect direct store with gep

* better comment

* phi as well

* dtype check there

* mypy happy?

* not used

* renames

* phi in phi
2024-03-14 20:58:50 +03:00
qazal
43953c0ba9 skip grouped store for umatching upcasts (#3723)
* skip if upcasts dont match

* outputs match now

* this ast is hardcoded

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-14 01:18:31 -04:00
nimlgen
08064a0e29 add SEED env to fuzz_linearizer (#3713)
* add SEED env to test/external/fuzz_linearizer.py

* found some

* more platforms
2024-03-13 18:08:42 +03:00
chenyu
e1b2a82d89 fix st.real_size can be nagative if valid is always false (#3708)
two followups after this. (1) if a buffer is never accessed in kernel, it can be removed from input (2) real_size can be smaller conditional on valid being true (the old validhack stuff)
2024-03-12 20:34:07 -04:00
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
Patrick Tsai
971d7f5d7c O(n) arange attempt (#3530)
* It works?

* Clamp correctly

* Refactor

* Make code better

* Undo some stuff

* First step to trying to make floats work

* Floats work in Python op but not metal because int div is different

Python integerdivision was implemented as // which rounds towards
negative infinity, but C integer division rounds towards 0 so there
is an off-by-1 division error

* arange does cumsum with ints and then multiplies by step

This is so loop optimization can remain int only

* Undo a lot of symbolic changes

* Final check

* Cleanup

* There can be multiple phis

* Fix multiple phi op removal

* const sets dtype correctly

* Fix bugs

* Fix a couple bugs and add loop vars to resolve

* missed one

* Don't trim too many ops

* Fix symbolic test

* Use ones instead of full

* Delete test

* Lint passes

* max node error

* Small updates to loop logic

* Remove unnecessary changes

* We are getting somewhere

* Simple case

* Fix

* rm, prn

* Better

* If NumNode doesn't work then continue

* clamp is needed for arange(256)

* Move everything into the optim fn

* Replace correctly

* Order optimizations better

* Delete

* mypy

* Test for simplification

* Rename

* Fix test

* update test description

* Undo more

* Cleanup

* No replaced_ops map

* Fix lint

* AssertionError

* back again

* Reinstate assertion

* Return true and make diff not as big

* Bigger range for test

* Change cumsum impl

* fix bug

* make big cumsum work

* lint

* Undo cumsum 2-stage removal

* No while helper

* optional min/max clamping

* floats work

* rm giant arange test

* fix python cast None

* Check phi parents

* one phi allowed per where

* Fix one phi per where

* Rework iteration

* Delete assertions

* convert to int

* Try mul -1 instead of neg for hip..?

* Remove one phi per where requirements

* one accum only

* Lint

* should simplify a loop at a time

* Don't get rid of loop explcitly

* Need to iterate backwards

* lint

* unary neg

* Make optim work for onnx and sum_pad_collapse

* Better message

* filter alu ops correctly

* Fix the limiter

* lint and simplify

* Add it back

* off by one error

* test wheres and phis

* test max ops and non-if stuff

* <=

* cast_scalar

* Oops

* Change test

* Pass loop uops instead of a modified map

* Cut param transfer between linearizer and uops

* Fix issues

* Fix lint

* fix efficientnet python 3.8 invalid syntax

* distinct vars in seen_vars

* accurate var names

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-11 16:09:20 -07:00
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
be more specific about invalid kernel opt, used that in test_linearizer_failures.

make BEAM kernel search work even with assertion disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
chenyu
1130c73844 add FUZZ_NTH to fuzz_linearizer (#3656)
* add FUZZ_NTH to fuzz_linearizer

also update tests in test_linearizer_failures to not just run on METAL

* update failures for HIP/HSA

* test_failure_21 LLVM PADTO
2024-03-08 09:16:49 -05:00
chenyu
b282a45e39 fix direct store float4 with same vin (#3652)
In a kernel that stores expanded value, the vin of float4 can come from same source, and we only remove once in that case.
2024-03-07 18:11:50 -05:00
chenyu
d33311ebe0 remove parens of ALU if it has associative property (#3635)
need to remove SUB since it's possible to have (const - (const - const)) in test/test_ops.py::TestOps::test_cos,
in which case cannot remove the parens of children
2024-03-06 21:12:11 -05:00
chenyu
fe6b6e38c1 remove parentheses of GEP if it's from SSA (#3634)
fixed some bracket nesting level exceeded maximum of 256 errors
2024-03-06 20:22:46 -05:00
chenyu
48d22067ca clean up test_linearizer_failures (#3562)
* cleanup test_linearizer_failures

* fix test_failure_8

* fix that

* better assert message
2024-03-01 15:57:17 -05:00
chenyu
1136e2a82a skipIf(not( -> skipUnless( in test_linearizer_failures (#3519)
if these behaves weirdly in CI might need to disable them in CI
2024-02-28 13:48:47 -05:00
Francis Lam
39d75f0d58 test_linearizer_failures: add more METAL examples (#3495)
these were obtained from running fuzz_linearizer on METAL
2024-02-26 10:19:05 +01:00
George Hotz
871ba73e65 _reduce_op is axis based now (#3462)
* _reduce_op is axis based now

* axis_

* update lin failures

* disable that

* fix shape
2024-02-21 16:36:31 +01:00
xarkes
28a8b72024 Remove Interpreted device & remaining CPU/TORCH ref (#3423)
* Remove Interpreted device & remaining CPU/TORCH ref

* Oops

* supports_device was useful

* Fix doc wording

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-02-16 00:30:21 -05:00
Francis Lam
668324d92b wmma: protect TC locals from modification and use only LOCAL (#3379)
also remove unnecesssary upcast_dim from tensor_core and calculate
it from the dimensions and thread sizes
2024-02-13 10:19:35 +01:00
Francis Lam
f1ad01fd91 test_linearizer_failures: add new linearizer compile failure on METAL (#3380) 2024-02-12 20:28:34 -05:00
Francis Lam
2266152b28 linearizer: added FUZZ_BEAM to fuzz_linearizer and additional tests (#3340)
Fixed test_tensor_core_opts to test all the TCs.

Added commented out failing tests in test_color_shapes_with_local.
2024-02-08 16:12:58 +01:00
nimlgen
5097d5b808 fix padto when with late reduce (#3180)
* fix padto test

* no long comment
2024-01-19 14:01:44 -05:00
nimlgen
f87ecbb0f3 fuzzer validates outputs + (partially) oob accesses (#3178)
* fuzzer validates outputs + (partially) oob accesses

* +random

* oob check only for compiled

* type cmp fixes

* fix zeroing

* no prints

* add seed
2024-01-19 13:34:51 -05:00
George Hotz
c003be7309 Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
a8ba1ac08f track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
qazal
5f07ef455e update dtypes (#2872) 2023-12-20 15:04:02 -05:00
George Hotz
051402625e remove pushing contig + fix linearizer bug (#2798)
* remove that logic

* fix test, move LOADs

* fix repeat issue on LLVM

* with_phi
2023-12-16 09:36:31 -08:00
chenyu
765f8b05e5 TernaryOps.WHERE has vin[0] as bool and BinaryOps.CMPLT always outputs bool (#2782)
* vin[0] to where is always bool

* due to better hack

* update test

* fix test_uops
2023-12-15 14:51:51 -05:00
George Hotz
c6eb618013 tests from new lazy branch (#2774)
* tests from new lazy branch

* fix lin 11

* that was needed

* doesn't fail

* mark

* meant that

* llvm passes
2023-12-14 23:06:39 -08:00
George Hotz
6d6eb9302d ruff checks the max line length is 150 (#2734)
* ruff checks the max line length is 150

* fix tensor.py

* a lot more

* done
2023-12-12 17:34:47 -08:00