Commit Graph

10417 Commits

Author SHA1 Message Date
George Hotz
5a7b2ff1b2 masked shapetrackers (#2657) 2023-12-06 11:22:26 -08:00
chenyu
b931a20882 minor shapetracker cleanup (#2652) 2023-12-06 11:43:52 -05:00
qazal
c704a77ca0 green dtypes ALU tests (#2617)
* dtypes alu test

* those types don't exist in torch

* floats

* more tests

* disable those

* a couple unary tests

* skip float16 tests in CI for GPU

* fix LLVM bool add: True+True = 1+1 = 2, which truncates to False in native LLVM (see the sketch after this commit message)

* remove hardcoded float for LLVM ALU fns

* less sensitive atol for fp32: 1e-10 is flaky and sometimes failed even with the non-fp32 math merge commit reverted; nothing has changed in our kernels for fp32.

* return on overflows

* fix CUDA exp2

* compute results of op regardless of bounds in a python backend

* skip fp16 in GPU and CUDACPU

* fuzz a smaller range in the float_midcast_int32 test

I sampled this and we overflow ~70% of the time.
Because numpy behaves differently on different devices for overflows, and Metal seems to do the same, I'm opting to eliminate the non-determinism here.

* remove CUDA exp2 overload, it's already there now

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-12-06 08:15:46 -08:00
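
A small, self-contained Python illustration of the LLVM bool-add bug mentioned in the commit above (the 1-bit wraparound is the point; this is not the tinygrad or LLVM code itself):

```
# in 1-bit (i1) arithmetic, 1 + 1 wraps to 0, so True + True would come out
# False unless the bools are widened to a larger int type before the add
assert True + True == 2      # the intended semantics (Python promotes bool to int)
assert (1 + 1) & 0b1 == 0    # what a 1-bit add without widening produces
```
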
Amrit Sahu
71d989b476 adding test to cover #2644 failure (#2645) 2023-12-06 11:00:30 -05:00
Ahmed Harmouche
50dcd532d5 Get all WEBGPU test_ops passing (#2646)
* Get all WEBGPU tests passing

* Custom render store is not needed in wgsl
2023-12-06 07:40:37 -08:00
chenyu
0978c24b8e fast gpt2 embedding with variable bs=1 (#2596) 2023-12-05 23:01:17 -05:00
chenyu
229ada5fe5 Gpt2 benchmark with HALF and BEAM (#2636)
* benchmark gpt2 with half and beam

* BEAM=4

* optional validation

* green is good

* we care
2023-12-05 22:15:16 -05:00
George Hotz
a73579919f mlx benchmark, a lil slower than tg 2023-12-05 19:00:43 -08:00
Oleg Rybalko
7c427d738c don't apply padding on script call (#2585)
* don't apply padding on script call

* no need for a new param because the batch_size value can be used for the check

* fixed argument naming
2023-12-05 16:34:10 -08:00
George Hotz
9d7ead84e1 hotfix: no need for model cache in examples/coder.py 2023-12-05 16:27:36 -08:00
qazal
be09cc87c1 Bitcast support / fast bf16 load (#2011)
* bitcast renderers

* fast llama load

* make it one kernel

* regression testing p1: re-enable test_dtype for all backends

fix GPU

* regression testing p2: fuzz all possible cases against numpy

remove hardcoded tests since the fuzzer covers them

* define ushort

* fix indent, probably need flake8 back for CI to catch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-05 16:19:28 -08:00
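
The "fast bf16 load" above rests on bfloat16 being the top 16 bits of a float32. A minimal numpy sketch of that widening trick (how the actual tinygrad kernel does it is not shown in the commit message, so this is only the general idea):

```
import numpy as np
# bfloat16 is the high half of a float32: widen by treating the raw bits as
# uint16, shifting left 16 bits into a uint32, and bitcasting to float32
bf16_raw = np.array([0x3F80, 0xC000], dtype=np.uint16)   # 1.0 and -2.0 in bfloat16
as_f32 = (bf16_raw.astype(np.uint32) << 16).view(np.float32)
assert as_f32.tolist() == [1.0, -2.0]
```
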
George Hotz
232ed2af3f more test cleanups (#2631)
* more test cleanups

* move test example back
2023-12-05 16:17:57 -08:00
chenyu
a63f48d3db gpt2 half for kvcache and output logits (#2630)
* gpt2 more half

* half is fine after softmax
2023-12-05 16:54:56 -05:00
George Hotz
0be5d16950 only 62 gflops (#2629) 2023-12-05 13:28:24 -08:00
wozeparrot
6d58c19736 binaryops xor (#2627)
* feat: initial xor

* feat: numpy xor

* feat: llvm xor

* feat: quick test for xor

* feat: slightly working xor in torch

* feat: xor in tensor

* feat: slightly better test
2023-12-05 13:21:42 -08:00
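
For reference, the element-wise xor semantics the commit above adds as a BinaryOp, shown with plain numpy; how it is surfaced on Tensor isn't spelled out in the commit message, so no tinygrad API is assumed here:

```
import numpy as np
a = np.array([0b1100, 0b1010], dtype=np.int32)
b = np.array([0b1010, 0b0110], dtype=np.int32)
assert (a ^ b).tolist() == [0b0110, 0b1100]   # bitwise xor, element-wise
```
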
George Hotz
c53e854687 cast image doesn't work on nvidia (#2626)
* cast image doesn't work on nvidia

* hmm, interpreteds use buffer size 0

* fix type

* no lru
2023-12-05 12:48:19 -08:00
George Hotz
8c67eb1c92 GPT bugfixes (#2624)
* simple fixes

* fix exp2

* fixed

* parallel beam for CUDA

* fix image dtypes
2023-12-05 11:42:28 -08:00
chenyu
8903a40541 update the onnx test so cuda local run passes (#2623) 2023-12-05 14:04:17 -05:00
George Hotz
ec594cf03c hotfix: tasteful ctrl-c in parallel beam 2023-12-05 18:20:10 +00:00
George Hotz
35b5e95097 parallel beam search (#2610)
* better print

* fix beam search with vars

* cleanups

* parallel is not default

* restore that

* bugfix

* cleanups

* bugfix
2023-12-05 10:09:45 -08:00
chenyu
9996f1adf9 no document prs (#2622) 2023-12-05 13:05:36 -05:00
chenyu
dd8b4632a4 regression test for reshape fix #2616 (#2620) 2023-12-05 11:46:33 -05:00
chenyu
c257a0dd99 minor reshape cleanups (#2619)
* minor reshape cleanups

* mea culpa
2023-12-05 11:23:17 -05:00
Amrit Sahu
a6b68e8e40 fix for false merge (#2616) 2023-12-05 10:47:18 -05:00
geohotstan
fc00da538d helper functions for test_indexing.py (#2615)
* add some helpers

* I think it should all work..

* fixed get_set_tensor

* done

* del import

* bye bye typing

* style

* remove empty lines lol

* deleted dtype arg

* del trailing space
2023-12-05 02:00:41 -05:00
chenyu
7322ab8dfd onnx tests with different dtypes (#2612) 2023-12-05 00:04:08 -05:00
geohotstan
f12bcccb87 [ready] refactor getitem round 2 :D (#2568)
* new getitem

* go

* add temporary simple tests

* better

* comments

* WOW that took a while

* save 1 line lol

* work

* still need to add comprehensive tests, but i think getitem looks nice :D

* GIMME GREEN CI CHECKMARK PLS

* try..

* k idk

* added tests for errors

* fixed small hack

* added tests

* almost good

* try no contig?

* yay no more contig + comments and spacing

* finishing touches (comments)

* revert regex unittests lol

* add suggested change

* oops I fell asleep yesterday
2023-12-04 22:36:32 -05:00
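
A hedged sketch of the numpy-style __getitem__ patterns the refactor above deals with (int indices, slices, None insertion); the fancy/tensor-indexing and error cases the PR adds tests for are omitted, and exact behavior at this commit is assumed to match numpy indexing rules:

```
from tinygrad.tensor import Tensor

t = Tensor.arange(24).reshape(2, 3, 4)
assert t[1].shape == (3, 4)               # int index collapses a dim
assert t[:, 1:3].shape == (2, 2, 4)       # slice keeps the dim
assert t[:, None].shape == (2, 1, 3, 4)   # None injects a new dim of size 1
```
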
chenyu
6ba6349c97 JIT=0 llama.py should not jit (#2609) 2023-12-04 20:21:07 -05:00
George Hotz
41d696145d hotfix: forking works okay in HIP now 2023-12-04 21:59:18 +00:00
George Hotz
09b6e254a3 hip compile speed (#2606) 2023-12-04 13:47:40 -08:00
nimlgen
19a0a839db fix used resources in metal graph (#2604) 2023-12-04 13:45:51 -08:00
Yixiang Gao
fde44aed76 update hip_matmul with new abstraction (#2605) 2023-12-04 13:37:10 -08:00
George Hotz
5540f6e966 hotfix: make_half4 2023-12-04 09:58:34 -08:00
Amrit Sahu
e8d6a6ef2e view.reshape without symbolic (#2218)
* handle reshape of contiguous subparts with explicit mask

* remove the add/remove ones logic in reshape

* accommodate ones in accumulate logic

* make multiply commutative

* fix linting

* make mypy happy

* add test for commutative mul

* merge dimensions in shape_strides for 1 range masks

* add offsets for merging

* fix linting

* add back explicit 1 reshapes

* fix mypy errors

* fix accumulate by including state

* include non-zero stride dimension in acc

* small cleanup

* more compact to_shape_strides

* more logical cleanup

* compress more

* compress reshape mask

* adding some comments

* small bug fix

* improve test coverage

* remove explicit add remove ones

* small bug in test

* enable test_reshape_splitting_combining

* small fix

* 10 lines less to_shape_strides

* shorten reshape mask

* some more cleanup

* more cleanup

* introduce some symbols for compactness

* more symbols

* even cleaner

* lessen symbols, it became less readable

* remove merge_views from view.reshape

* change to_shape_strides to _merge_dims (a simplified sketch of the idea follows this commit message)

* improve readability

* fix corner case

* cleanup

* better handling of 1 <= Variable('i',1,10) & new_dim = Variable('i',1,10)

* rewrite _reshape_mask for readability

* fix white space

* add comment

* nice shorthands for readability

* add proof in docs

* small nit

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2023-12-04 12:46:53 -05:00
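
A simplified, hypothetical sketch of the dimension-merging idea referenced in the commit above (merge_dims below is illustrative only, not tinygrad's _merge_dims): adjacent dims describe one contiguous run of memory when the outer stride equals the inner size times the inner stride, so they can be folded together.

```
from typing import List, Tuple

def merge_dims(shape: Tuple[int, ...], strides: Tuple[int, ...]) -> List[Tuple[int, int]]:
  merged: List[Tuple[int, int]] = []          # list of (size, stride)
  for size, stride in zip(shape, strides):
    if size == 1: continue                    # size-1 dims carry no information
    if merged and merged[-1][1] == size * stride:
      merged[-1] = (merged[-1][0] * size, stride)   # fold into the previous dim
    else:
      merged.append((size, stride))
  return merged or [(1, 0)]

assert merge_dims((3, 4), (4, 1)) == [(12, 1)]         # one contiguous block of 12
assert merge_dims((3, 4), (8, 1)) == [(3, 8), (4, 1)]  # a sliced view can't merge
```
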
George Hotz
664475f247 vals is an argument (#2599)
* vals is an argument

* don't even know how that's legal python
2023-12-03 21:50:43 -08:00
George Hotz
fcd0b2ee6c fix multigpu on tinybox (#2595)
* fix multigpu on tinybox

* fixed multigpu
2023-12-03 16:48:07 -08:00
George Hotz
61c0113928 test external_multi_gpu.py (and works in CUDA) 2023-12-03 15:57:13 -08:00
George Hotz
bbeba8ec85 use default dict for external_model_benchmark (#2592)
* device default

* Device.DEFAULT

* half max for cuda

* CUDA_INCLUDE_PATH

* closer to working

* cuda fixups

* Update ops_cuda.py
2023-12-03 15:25:43 -08:00
chenyu
550817389a enable test_sample for all backend (#2593) 2023-12-03 17:20:27 -05:00
chenyu
a58736fdf1 print DEBUG=2 stats after stats update (#2590) 2023-12-03 17:13:37 -05:00
George Hotz
bc012f26b9 hotfix, disable model inference benchmark on NVIDIA 2023-12-03 13:52:41 -08:00
qazal
4380ccb169 Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extension

you can run them locally to confirm they're green (the GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)):

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extension

This passes on a real GPU since the extension is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186); I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes (a small illustration follows this commit message)

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we can't do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops: cmplt is always a bool in the real linearizer, and assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
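
A hypothetical Python illustration of the "update the sign definition to avoid division by zero" bullet from the commit above (neither function is the tinygrad implementation):

```
def sign_unsafe(x): return x / abs(x)        # divides by zero at x == 0
def sign_safe(x): return (x > 0) - (x < 0)   # -1, 0, or 1 with no division, works for any dtype

assert sign_safe(0.0) == 0 and sign_safe(-3) == -1 and sign_safe(2.5) == 1
```
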
chenyu
1ac958a058 update pytest marks and CI test filters (#2587)
* remove pytest marks

* test more stuff

* fine revert some

* add that mark back

* skip that

* hmm LLVM does not work on ubuntu

* too slow on CUDA CI

* dup test
2023-12-03 15:20:44 -05:00
nimlgen
88a5c368d4 fix metal graph with var_vals (#2583) 2023-12-03 09:24:36 -08:00
qazal
f180cac8f0 wgsl renderer cleanup: use the same const render, reuse cast render logic (#2579)
* share const render

* should just use render_cast here
2023-12-03 09:24:00 -08:00
qazal
ab2d4d8d29 Fix cl import in the copy_speed test and cifar example (#2586)
* fix CL import

* update test to only run on GPU

* update hlb_cifar too
2023-12-03 09:22:07 -08:00
chenyu
3226b3d96b enable the jit random test (#2580) 2023-12-02 20:25:23 -05:00
chenyu
09c9794f3f clean external_test_opt.py (#2578) 2023-12-02 19:51:08 -05:00
George Hotz
171543fc8d cleanups to save lines and files (#2577)
* runtime/graph -> features/graph

* put all the cstyle renderers in cstyle

* same line for those

* how did that pass mypy
2023-12-02 16:29:56 -08:00
George Hotz
a9a76639c8 that's not needed (#2574) 2023-12-02 16:01:29 -08:00