1100 Commits

Author SHA1 Message Date
George Hotz
4164d0ebbd multitensor start (#2676)
* multitensor work

* early gen fixes the tests

* atol for flaky test
2023-12-07 17:07:05 -08:00
Ahmed Harmouche
4b01839774 support vals on WebGPU, run more tests (#2668)
* Vals on webgpu, run more tests

* Skip slow tests, run symbolic ops tests

* Balance out tests
2023-12-07 16:45:21 -08:00
geohotstan
d02ff21f1a enable test_index and test_advancedindex (#2648)
* enable test_index and test_advancedindex with pretty diff

* removed contig

* created set_ helper function

* comment change

* del empty line

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-07 19:44:39 -05:00
George Hotz
00d9eda961 FROM -> COPY, move vars_from_ast (#2675) 2023-12-07 16:32:30 -08:00
chenyu
51af99367f fix fuzz_linearizer using new device Buffer (#2674) 2023-12-07 19:21:47 -05:00
nimlgen
650117a8f6 split large jit into several graphs (#2650)
* jit graph split

* update

* that's fine, not all buffers are there now

* use logarithmic tho, seems good

* no keep it simple

* add test

* simplify

* split graph when jit item cannot be graphed
2023-12-07 10:58:25 -08:00
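
The last bullet of the commit above ("split graph when jit item cannot be graphed") can be pictured with a small standalone sketch. This is not the tinygrad implementation: `split_into_graphs`, `can_graph`, and the string items are hypothetical names used only for illustration, and the sketch assumes the split simply groups consecutive graphable items into one batch each.

```python
from itertools import groupby

def split_into_graphs(jit_items, can_graph):
    """Group consecutive items by whether they can be graphed.

    Returns a list of (graphable, items) runs. Each graphable run would
    become one graph; non-graphable runs are executed outside a graph.
    `can_graph` is a hypothetical predicate supplied by the caller.
    """
    key = lambda item: bool(can_graph(item))
    return [(graphable, list(run)) for graphable, run in groupby(jit_items, key=key)]

# Hypothetical jit cache: kernels are graphable, the copy is not.
items = ["k0", "k1", "copy", "k2", "k3"]
print(split_into_graphs(items, lambda name: name.startswith("k")))
# [(True, ['k0', 'k1']), (False, ['copy']), (True, ['k2', 'k3'])]
```

The item that cannot be graphed ends the current graph, and a new graph starts at the next graphable item.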
chenyu
fd21eced74 reduce gpt2 kernel count in test_real_world (#2663) 2023-12-06 21:57:04 -05:00
chenyu
371005cb2d use one kvcache tensor in gpt2 instead of two separate caches (#2662)
* use one kvcache tensor in gpt2

* test case

* is None

* better test cases
2023-12-06 20:59:17 -05:00
George Hotz
5a7b2ff1b2 masked shapetrackers (#2657) 2023-12-06 11:22:26 -08:00
chenyu
b931a20882 minor shapetracker cleanup (#2652) 2023-12-06 11:43:52 -05:00
qazal
c704a77ca0 green dtypes ALU tests (#2617)
* dtypes alu test

* those types don't exist in torch

* floats

* more tests

* disable those

* a couple unary tests

* skip float16 tests in CI for GPU

* fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM

* remove hardcoded float for LLVM ALU fns

* less sensitive atol for fp32: 1e-10 is flaky and sometimes fails even if you revert the merge commit for non-fp32 math; nothing has changed in our kernels for fp32.

* return on overflows

* fix CUDA exp2

* compute results of op regardless of bounds in a python backend

* skip fp16 in GPU and CUDACPU

* fuzz a smaller range in the float_midcast_int32 test

I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.

* remove CUDA exp2 overload it's already there now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-06 08:15:46 -08:00
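
The LLVM bool fix listed in the commit above ("True+True=1+1=2 which truncates to False") comes down to 1-bit integer arithmetic. A minimal sketch in plain Python, not tinygrad or LLVM code, assuming a bool is lowered to a 1-bit integer whose sum wraps modulo 2. Both helper names are hypothetical, and `add_widened` is only one possible way to avoid the wraparound, not necessarily what the commit does.

```python
def add_i1(a: bool, b: bool) -> bool:
    """Add two bools as 1-bit integers: the sum keeps only its lowest bit."""
    return bool((int(a) + int(b)) & 0b1)

def add_widened(a: bool, b: bool) -> bool:
    """Add in a wider integer first, then treat any nonzero sum as True."""
    return (int(a) + int(b)) != 0

print(add_i1(True, True))       # False: 1 + 1 = 2 wraps to 0 in one bit
print(add_widened(True, True))  # True
```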
Amrit Sahu
71d989b476 adding test to cover #2644 failure (#2645) 2023-12-06 11:00:30 -05:00
Ahmed Harmouche
50dcd532d5 Get all WEBGPU test_ops passing (#2646)
* Get all WEBGPU tests passing

* Custom render store is not needed in wgsl
2023-12-06 07:40:37 -08:00
qazal
be09cc87c1 Bitcast support / fast bf16 load (#2011)
* bitcast renderers

* fast llama load

* make it one kernel

* regression testing p1: re-enable test_dtype for all backends

fix GPU

* regression testing p2: fuzz all possible cases against numpy

remove handcoded tests since the fuzzer covers them

* define ushort

* fix indent, probably need flake8 back for CI to catch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-05 16:19:28 -08:00
George Hotz
232ed2af3f more test cleanups (#2631)
* more test cleanups

* move test example back
2023-12-05 16:17:57 -08:00
wozeparrot
6d58c19736 binaryops xor (#2627)
* feat: initial xor

* feat: numpy xor

* feat: llvm xor

* feat: quick test for xor

* feat: slightly working xor in torch

* feat: xor in tensor

* feat: slightly better test
2023-12-05 13:21:42 -08:00
George Hotz
c53e854687 cast image doesn't work on nvidia (#2626)
* cast image doesn't work on nvidia

* hmm, interpreteds use buffer size 0

* fix type

* no lru
2023-12-05 12:48:19 -08:00
George Hotz
8c67eb1c92 GPT bugfixes (#2624)
* simple fixes

* fix exp2

* fixed

* parallel beam for CUDA

* fix image dtypes
2023-12-05 11:42:28 -08:00
chenyu
8903a40541 update the onnx test so cuda local run passes (#2623) 2023-12-05 14:04:17 -05:00
George Hotz
35b5e95097 parallel beam search (#2610)
* better print

* fix beam search with vars

* cleanups

* parallel is not default

* restore that

* bugfix

* cleanups

* bugfix
2023-12-05 10:09:45 -08:00
chenyu
dd8b4632a4 regression test for reshape fix #2616 (#2620) 2023-12-05 11:46:33 -05:00
chenyu
c257a0dd99 minor reshape cleanups (#2619)
* minor reshape cleanups

* mea culpa
2023-12-05 11:23:17 -05:00
geohotstan
fc00da538d helper functions for test_indexing.py (#2615)
* add some helpers

* I think it should all work..

* fixed get_set_tensor

* done

* del import

* bye bye typing

* style

* remove empty lines lol

* deleted dtype arg

* del trailing space
2023-12-05 02:00:41 -05:00
chenyu
7322ab8dfd onnx tests with different dtypes (#2612) 2023-12-05 00:04:08 -05:00
geohotstan
f12bcccb87 [ready] refactor getitem round 2 :D (#2568)
* new getitem

* go

* add temporary simple tests

* better

* comments

* WOW that took awhile

* save 1 line lol

* work

* still need to add comprehensive tests, but i think getitem looks nice :D

* GIMME GREEN CI CHECKMARK PLS

* try..

* k idk

* added tests for errors

* fixed small hack

* added tests

* almost good

* try no contig?

* yay no more contig + comments and spacing

* finishing touches (comments)

* revert regex unittests lol

* add suggested change

* oops I fell asleep yesterday
2023-12-04 22:36:32 -05:00
George Hotz
09b6e254a3 hip compile speed (#2606) 2023-12-04 13:47:40 -08:00
Amrit Sahu
e8d6a6ef2e view.reshape without symbolic (#2218)
* handle reshape of contiguous subparts with explicit mask

* remove the add/remove ones logic in reshape

* accommodate ones in accumulate logic

* make multiply commutative

* fix linting

* make mypy happy

* add test for commutative mul

* merge dimensions in shape_strides for 1 range masks

* add offsets for merging

* fix linting

* add back explicit 1 reshapes

* fix mypy errors

* fix accumulate by including state

* include non-zero stride dimension in acc

* small cleanup

* more compact to_shape_strides

* more logical cleanup

* compress more

* compress reshape mask

* adding some comments

* small bug fix

* improve test coverage

* remove explicit add remove ones

* small bug in test

* enable test_reshape_splitting_combining

* small fix

* 10 lines less to_shape_strides

* shorten reshape mask

* some more cleanup

* more cleanup

* introduce some symbols for compactness

* more symbols

* more cleaner

* lessen symbols, it became less readable

* remove merge_views from view.reshape

* change to_shape_strides to _merge_dims

* improve readability

* fix corner case

* cleanup

* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)

* rewrite _reshape_mask for readability

* fix white space

* add comment

* nice shorthands for readability

* add proof in docs

* small nit

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-04 12:46:53 -05:00
George Hotz
664475f247 vals is an argument (#2599)
* vals is an argument

* don't even know how that's legal python
2023-12-03 21:50:43 -08:00
George Hotz
fcd0b2ee6c fix multigpu on tinybox (#2595)
* fix multigpu on tinybox

* fixed multigpu
2023-12-03 16:48:07 -08:00
George Hotz
61c0113928 test external_multi_gpu.py (and works in CUDA) 2023-12-03 15:57:13 -08:00
George Hotz
bbeba8ec85 use default dict for external_model_benchmark (#2592)
* device default

* Device.DEFAULT

* half max for cuda

* CUDA_INCLUDE_PATH

* closer to working

* cuda fixups

* Update ops_cuda.py
2023-12-03 15:25:43 -08:00
chenyu
550817389a enable test_sample for all backend (#2593) 2023-12-03 17:20:27 -05:00
qazal
4380ccb169 Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extension

you can run them locally to confirm they're green (the GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)):

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extension

This passes on a real GPU since the extension is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we can't do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
chenyu
1ac958a058 update pytest marks and CI test filters (#2587)
* remove pytest marks

* test more stuff

* fine revert some

* add that mark back

* skip that

* hmm LLVM does not work on ubuntu

* too slow on CUDA CI

* dup test
2023-12-03 15:20:44 -05:00
qazal
ab2d4d8d29 Fix cl import in the copy_speed test and cifar example (#2586)
* fix CL import

* update test to only run on GPU

* update hlb_cifar too
2023-12-03 09:22:07 -08:00
chenyu
3226b3d96b enable the jit random test (#2580) 2023-12-02 20:25:23 -05:00
chenyu
09c9794f3f clean external_test_opt.py (#2578) 2023-12-02 19:51:08 -05:00
George Hotz
171543fc8d cleanups to save lines and files (#2577)
* runtime/graph -> features/graph

* put all the cstyle renderers in cstyle

* same line for those

* how did that pass mypy
2023-12-02 16:29:56 -08:00
George Hotz
d6b404ac11 No dtype alloc (#2570)
* fix all allocs

* improve docs

* ugh fix fake alloc
2023-12-02 13:29:40 -08:00
chenyu
c8774713c5 lazy cleanup (#2567) 2023-12-02 13:21:43 -05:00
George Hotz
5068e99d18 refactor to remove extra kernel params (#2563)
* refactor to have compiled kernel

* bugfixes

* docs/beautiful.py

* revert that

* fix tests
2023-12-02 00:32:25 -08:00
George Hotz
27481b9206 Switch ops_gpu -> gpuctypes (#2532)
* ops_gpu is go

* fix size 0

* fix image, and add more tests

* nerf openpilot test, doesn't test thneed

* run the schedule

* better

* oops, new inputs

* delete pyopencl

* Update ops_gpu.py
2023-12-01 22:30:21 -08:00
George Hotz
6733425095 lower schedule (#2559)
* lower schedule

* remove RAND, and don't put load in the JIT yet

* better fix for that test
2023-12-01 19:17:46 -08:00
Christopher Mauri Milan
077567f62d Remove as_buffer for TORCH (#2554)
* remove as_buffer for torch

* enable torch zerocopy if on cpu

* remove as_buffer even on torch:cpu
2023-12-01 18:51:38 -08:00
chenyu
86fbd413f3 update test_real_world configs (#2557) 2023-12-01 20:03:52 -05:00
andresgit
00523d5656 New fix accessing elements created by padding (#2529)
* pad slice test cases, many failing

* fix failing test cases

check mask if we are outside the base buffer
also create a multi-view if in that case we reshape to an empty shape

* real_offset calculation more readable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-01 19:08:10 -05:00
chenyu
67f4e03724 rewrite 0 size loadop into a CONST (#2556)
* rewrite 0 size loadop into a CONST

* check alloc size

* EMPTY is better

* Revert "EMPTY is better"

This reverts commit 574fe0f9ed28f1b97da5a81afdfd2cd5d9a94ff9.

* no ast is created

* fix test
2023-12-01 18:29:06 -05:00
George Hotz
4447188051 gate METAL_FAST_LOAD 2023-12-01 15:28:40 -08:00
chenyu
e9426f4fe4 simpler get_contraction (#2552)
* simpler get_contraction

* and test
2023-12-01 18:02:52 -05:00
George Hotz
f5de21e753 fast path for copy (#2548)
* fast copy

* ruff first

* flat_mv on malloc

* order + webgpu test
2023-12-01 11:34:47 -08:00