Commit Graph

93 Commits

Author SHA1 Message Date
George Hotz
44a67bf783 constant folding (#3675)
* constant fold

* bool math

* fix ptx
2024-03-10 14:47:24 -07:00
chenyu
915f98791c use custom KernelOptError in kernel opt (#3661)
be more specific about invalid kernel opt, used that in test_linearizer_failures.

make BEAM kernel search work even with assertion disabled.

`BEAM=2 python3 -O examples/llama.py  --temperature=0 --count=10 --prompt="Hello." --timing`
2024-03-08 15:36:16 -05:00
chenyu
906cc3a69b cleanup tests Device[Device.DEFAULT] is always Compiled (#3645) 2024-03-07 11:15:42 -05:00
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
qazal
94679322a3 simpler float4 direct store and locals support (#3592)
* swap vins instead

* delete the upcast

* leave it to remove_childless try 1

* Revert "leave it to remove_childless try 1"

This reverts commit bf25e935f8.

* try 2, simpler

* Revert "try 2, simpler"

This reverts commit d2472af711.

* add note
2024-03-04 06:28:28 -08:00
qazal
a89afd4ffa Directly store float4 nodes (#3564)
* float4 cast collapse

* simplify cstyle

* simplify uoptimizer

* ci

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-02 15:58:20 -08:00
George Hotz
aa9b013d79 add constant folding for WHERE in uops (#3584)
* add constant folding for WHERE in uops

* prereqs for generic constant folding

* fix test

* disable slow overflow logic

* make that test faster
2024-03-02 10:37:14 -08:00
George Hotz
6b29c70b3d Refactor to UOpGraph class (#3566)
* Refactor to UOpGraph class

* fix test
2024-03-01 15:14:48 -08:00
Francis Lam
5d434801fa search: add tensor core to beam search space (#3275)
* search: add tensor core to beam search space

* kernel: refactor apply_tensor_core into apply_opt and hand_coded

* kernel: revert removal of apply_tensor_cores

also revert BEAM search parameter changes
2024-02-29 13:05:10 -08:00
qazal
94fc0fd546 uop the float4 acc upcast in group_for_reduce kernels (#3466)
* simplest one

* but i can trust this will be cached correctly

* wait that was wrong too

* cleanup

* test_reduce_upcast for single reduce case

* a late accumulator always outputs to gds

lint
2024-02-28 17:33:47 -08:00
David Hou
f513c37e64 support same uidx in multiple shape positions (#3205)
* support same uidx in multiple shape positions

* rename var

* update comment

* add contiguous index check to global_store too

* update comment

* small change

* is this better?

* smh

* smaller change?

* get rid of more changes

* get rid of more changes

* is this even making anything better

* comment

* fix test

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-02-21 19:37:03 +01:00
chenyu
86efdf0b34 remove create_rednode (#3444)
handle Node collapsing into NumNode similar to OpNode
2024-02-18 21:08:19 -05:00
qazal
e1a57fe58a test the behavior, not the implementation (#3419) 2024-02-15 17:23:42 +01:00
qazal
7919a1e6ec dtypes: delete the float cast in realize.py (#3401)
* remove float cast

* cast scalars to the correct value in creation time

* cast scalar in the correct place

* wrong, use y_dtype

* make consts have a unique cache key

* add cast_scalar back

* test_load_cache_const_bufs

* add bool dtype

* test_const_dtype

* fix linters
2024-02-15 14:20:30 +01:00
Francis Lam
668324d92b wmma: protect TC locals from modification and use only LOCAL (#3379)
also remove unnecesssary upcast_dim from tensor_core and calculate
it from the dimensions and thread sizes
2024-02-13 10:19:35 +01:00
George Hotz
2e60012bcf move create schedule and delete old API (#3377)
* move create schedule and delete old API

* fix test multitensor
2024-02-12 18:10:45 +01:00
George Hotz
41efaa848c move graph.py and jit.py into features (#3376)
* move graph.py into features

* move jit into features

* fix quickstart
2024-02-12 17:34:34 +01:00
qazal
c8fd66a131 Run RDNA3 tensor core tests in CI (#3367)
* add test_linearizer

* skip test_padto_matmul
2024-02-11 19:54:06 -05:00
Francis Lam
ddb22a60c8 linearizer: fix up edge case bugs in UNROLL opt (#3362)
Fully UNROLLing the first_reduce should not change the number of
local_dims.

Fully UNROLLing a GROUP dim should reduce the number of
group_for_reduces by one.

Also changed group_for_reduces to be a count as the axis number
isn't used anywhere (they are always the first reduce dims).
2024-02-10 11:49:25 +01:00
Francis Lam
ce21fdfb67 ops_python: add HIP tensor core mock and refactor METAL (#3354)
* ops_python: add HIP tensor core mock and refactor METAL

* Add tests to CI

* add DEBUG=2 to full tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-02-09 12:46:06 +01:00
George Hotz
c32ea95d7d Python uop emulator (#3327)
* start uop emu

* tiny_add passes

* more ops

* emulate the whole warp

* test_gemm passes

* metal gemm test pass

* works on big gemm

* works on big gemm

* more tests pass

* touch ups

* fix mypy

* cleanups

* exp2 mypy

* arch is where it belongs

* actually emulate tensor cores

* fix test

* new style
2024-02-08 19:24:55 +01:00
Francis Lam
2266152b28 linearizer: added FUZZ_BEAM to fuzz_linearizer and additional tests (#3340)
Fixed test_tensor_core_opts to test all the TCs.

Added commented out failing tests in test_color_shapes_with_local.
2024-02-08 16:12:58 +01:00
David Hou
aebaab011f faster wino compile by catting consts across data expand dim (#3293)
* PoC faster wino compile by catting consts across data expand dim

* fix fusions

* faster + golf it

* noqa 501

* implicit broadcast

* Revert "implicit broadcast"

This reverts commit 5915a9083d045ec1e6be84dcb492333325d48666.

* shorter

* shorter

* oops

* 216 upcasts is probably fine

* wino kernel count test

* test winograd number of sts

* specify device for apply_matrix mat elements
2024-02-02 03:47:45 -05:00
Francis Lam
927f2dd24d wmma: add HIP FP16 to FP16 tensor core (#3287)
* wmma: add HIP FP16 to FP16 tensor core

* test: fix test_tensor_core to use separate tolerances for half
2024-01-31 23:00:51 -05:00
Francis Lam
861d5ac224 wmma: fix the upcasts after WMMA to be hcopt ordering invariant (#3250)
will correctly handle and permutation of optops after the TC one
2024-01-29 11:51:57 -08:00
George Hotz
9e17378b60 Fix metal tests (#3266)
* small fixes for tests on mac

* remove device from TensorCore
2024-01-27 18:09:42 -08:00
George Hotz
3c728d1082 compiler support (#3260)
* compiler support

* revert that

* fix tests
2024-01-26 23:36:40 -08:00
Francis Lam
4273aabe31 extra/gemm: add a simple_conv.py along with correctness check (#3236)
* extra/gemm: add a simple_conv.py along with correctness check

The goal is to easily test tensor core triggering situations

* test: add tests for acc_dtype handling and fixed typing
2024-01-26 19:06:57 -08:00
Francis Lam
595d05a250 test: fix test_linearizer to use the correct tc_dims (#3218)
also re-enable the test_tensor_core_opts
2024-01-23 16:07:31 -05:00
David Hou
3378625773 name upcast variables (#3200)
* name upcast variables

* typing

* unused
2024-01-22 11:37:28 -05:00
chenyu
e52a609240 make WINO a context var, and LATEWINO in hlb_cifar (#3161) 2024-01-17 20:21:26 -05:00
George Hotz
1f9aee8b6f remove numpy from device (#3123)
* remove numpy from device

* fix tests

* np item

* cleanups

* simplify with as_buffer

* no toCPU

* tinygradic

* cast to scalar
2024-01-14 19:36:05 -08:00
Francis Lam
ddbdb52f77 wmma: enable METAL half tensor cores and clean up cstyle (#3095)
* wmma: enable METAL half tensor cores and clean up cstyle

* revert simple_matmul rand changes and break line in tensor

* added metal fp16->fp32 tensor core
2024-01-12 16:25:28 -05:00
George Hotz
c003be7309 Revert "track size in shapetracker" (#3043)
* Revert "track size in shapetracker (#3026)"

This reverts commit a8ba1ac08f.

* st.size
2024-01-08 13:13:39 -08:00
George Hotz
a8ba1ac08f track size in shapetracker (#3026)
* track size in shapetracker

* shapetracker adapter

* size is an int

* create Buffer with st.size

* only compare the views for the jit

* fix webgpu
2024-01-05 20:15:53 -08:00
chenyu
91665ef143 rewrite MUL CAST SUM to CAST MULACC 2024-01-04 13:12:22 -05:00
chenyu
ab7dfd637b use float for acc dtype for half tensor sum
we previously only upcast uint and int, and half was using half for acc.
change to acc in float for precision. but cast the result back to half to match torch/jax output dtype
2024-01-04 13:12:22 -05:00
chenyu
ae112c9dbe fix some long lines in tests (#3006)
* fix some long lines in tests

* better
2024-01-03 23:53:33 -05:00
George Hotz
a280cfe169 move dtypes to dtype.py (#2964)
* move dtypes to dtype.py

* fix urllib
2024-01-01 14:58:48 -08:00
chenyu
8291986959 Variable.sum -> Node.sum, Variable.ands -> Node.ands (#2961) 2024-01-01 16:21:28 -05:00
chenyu
3d720b5761 move expand_idx, iter_idxs and expand_node from symbolic to linearizer (#2959) 2024-01-01 14:41:21 -05:00
chenyu
50f2e31d26 cleanup float4 grouping in global_load and global_store (#2942)
* cleanup float4 grouping in global_load and global_store

* fix test decorator
2023-12-27 14:10:04 -05:00
chenyu
820f2e054e fix PADTO optimization (#2935)
the correct condition is that PADTO cannot be applied to reduce axis, not Reduce.MAX in ops.
even for Reduce.SUM it's possible that the reduce axis had a div before, and the padded 0 became inf then sum over it is incorrect.
2023-12-25 22:52:49 -05:00
chenyu
50927defad s/lazydata.realized/lazydata.base.realized/g (#2914)
* s/lazydata.realized/lazydata.base.realized/g

* not that
2023-12-22 14:45:13 -05:00
George Hotz
1765849937 new lazy, benchmark (#2878)
* lazy rewrite, try 2

* min fix tests

* pass contig test

* put broken pads back

* move that to realize

* no contig child fixes array packing

* so wrong

* now that's correct

* base children

* fix bind issues

* disable to_image_idx

* fix tests

* that failure shouldn't break other tests

* more fixes

* fix torch

* skip failing tests in CI

* 1e-7

* half is broken

* 1e-6 margin of error
2023-12-20 14:33:21 -08:00
George Hotz
8fe24038d8 Revert "mulacc fusion cleanup (#2871)" (#2876)
This reverts commit 863c5b26ed.
2023-12-20 13:26:25 -08:00
qazal
863c5b26ed mulacc fusion cleanup (#2871)
* add mulacc fusion tests

* cleanup the implementation

* fix indent in the test utility

* less verbose
2023-12-20 15:39:54 -05:00
qazal
5f07ef455e update dtypes (#2872) 2023-12-20 15:04:02 -05:00
George Hotz
90fb09b55c remove unused _device_extra_args 2023-12-18 22:14:58 -08:00
chenyu
e4bbbc5bc3 Revert "Use the reduceop dtype to define the acc in linearizer (#2625)" (#2783)
This reverts commit f3ed96a929.
2023-12-15 16:29:10 -05:00