Commit Graph

34 Commits

Author SHA1 Message Date
George Hotz
1765849937 new lazy, benchmark (#2878)
* lazy rewrite, try 2

* min fix tests

* pass contig test

* put broken pads back

* move that to realize

* no contig child fixes array packing

* so wrong

* now that's correct

* base children

* fix bind issues

* disable to_image_idx

* fix tests

* that failure shouldn't break other tests

* more fixes

* fix torch

* skip failing tests in CI

* 1e-7

* half is broken

* 1e-6 margin of error
2023-12-20 14:33:21 -08:00
chenyu
4e2a92cee1 run HALF GPT2 in nvidia benchmark in addition to HALF/BEAM (#2811)
easier to separate the issue between HALF and BEAM when it failed
2023-12-17 02:24:55 -05:00
chenyu
a044125c39 validate stable diffusion for seed 0 (#2773)
* validate stable diffusion for seed 0

the closest false positive i can get is with the setup and one less step. dist = 0.0036
same setup with fp16 has dist=5e-6.
so setting validation threshold to 1e-4 should be good

* run with --seed 0
2023-12-15 00:07:09 -05:00
chenyu
229ada5fe5 Gpt2 benchmark with HALF and BEAM (#2636)
* benchmark gpt2 with half and beam

* BEAM=4

* optional validation

* green is good

* we care
2023-12-05 22:15:16 -05:00
George Hotz
bbeba8ec85 use default dict for external_model_benchmark (#2592)
* device default

* Device.DEFAULT

* half max for cuda

* CUDA_INCLUDE_PATH

* closer to working

* cuda fixups

* Update ops_cuda.py
2023-12-03 15:25:43 -08:00
George Hotz
bc012f26b9 hotfix, disable model inference benchmark on NVIDIA 2023-12-03 13:52:41 -08:00
qazal
4380ccb169 Non fp32 math (#2264)
* `global_load` and `global_store` using buffer dtype

* `UOps.PHI` in all dtypes

* `UOps.ALU` in all dtypes

* `UOps.CONST` & `UOps.DEFINE_ACC` in all dtypes

* -- endof implementation --
+tiny lint changes

* these tests require the fp16 extention

you can run them locally to confirm they're green: (GPT2 test is broken in master for mac, see [this](https://discord.com/channels/1068976834382925865/1069001075828469790/1177993277958533261)

`GPU=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_max_float16_cpu test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_min_float16_cpu test/models/test_real_world.py::TestRealWorld::test_llama test/models/test_real_world.py::TestRealWorld::test_gpt2 test/models/test_whisper.py test/test_specific_conv.py::TestSpecific::test_big_vec_mul`

skip the new test_linearizer_failures in CI GPU because of the fp16 extention

This passes on a real GPU since the extention is available:
`GPU=1 python3 -m pytest test/test_linearizer_failures.py::TestLinearizerFailures::test_failure_8`

see CI logs [here](https://github.com/tinygrad/tinygrad/actions/runs/6996590597/job/19032641427#step:14:644)

* these tests fail in CI due to segfaults and CPU crashes

To confirm they're green locally, you can run the following commands:

1. For the tests skipped in test_ops.py (note: CLANG is very slow)

`for var in GPU CUDA CLANG; do export $var=1; for test in test/test_ops.py::TestOps::test_slice_fancy_indexing_no_dim_collapse test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_collapse_int test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_none test/test_ops.py::TestOps::test_slice_fancy_indexing_dim_inject_and_collapse; do python3 -m pytest $test; done; unset $var; done`

2. For the ONNX tests skipped in CLANG:

```
CLANG=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_ai_onnx_ml_array_feature_extractor_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_0_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_1_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_none_no_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_3d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_gather_elements_negative_indices_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1_mean_weight_negative_ii_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_no_weight_reduction_mean_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_NCd1d2d3d4d5_mean_weight_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_mean_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_mean_weight_negative_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_sce_mean_weight_ii_4d_log_prob_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1_weight_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_sum_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3d4d5_none_no_weight_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2d3_sum_weight_high_ii_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_reduction_mean_expanded_cpu \
test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_nllloss_NCd1d2_with_weight_expanded_cpu
```

3. The LLVM test I skipped here is already [skipped in master for all backends](https://github.com/tinygrad/tinygrad/blob/master/test/external/external_test_onnx_backend.py#L186), I just made it more specific

`LLVM=1 python3 -m pytest test/external/external_test_onnx_backend.py::OnnxBackendNodeModelTest::test_dequantizelinear_e4m3fn_float16_cpu`

* Revert "these tests fail in CI due to segfaults and CPU crashes"

This reverts commit 15db570143.

* merge with cleanup-vectorized-hip-renders

* barely working HIP P1, ALU ops need a refactor?

* manage the fact that in HIP [half2 is actually an unsigned int vec](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L59)) and half is a totally different __half that [has an unsigned int element in it](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L50)) but can't be accessed [because it's private](f921880387/hip/include/hip/amd_detail/amd_hip_fp16.h (L86)). If you just do this:

```
half2 val0 = // ...
half val1 = // ...
```
then you can't do:
```
val0.x + val1 // error: use of overloaded operator '+' is ambiguous (with operand types 'unsigned short' and 'half' (aka '__half'))
```

* update the sign definition to avoid division by zero in all dtypes

* diff cleanup p1: why were these in the diff anyways

* less hacky HIP, enable CIFAR fp16 benchmark, test ops for HIP in CI!

add ALU ops overloads for HIP

this will make HIP max work

handle mod

Revert "handle mod"

This reverts commit 370fd4b3fbe99b6ae8cc293d005b106628205933.

update max to use hmax

add HIP GEP render logic

enable CIFAR fp16 benchmark

test ops for HIP

back to store as float because this only works for float4 grouping right now

test_ops for hip!!

always sign

* back to the sign we had before because we cant do a backward pass on a Less node

* remove old hacks

HIP compiling test_ops in CI takes ~9 mins, not doing it for now

new HIP ALUs

* reduce accs done right

* refactor to function

* no device hacks

hacks p2

the other way

* LLVM ALU ops

half, float and double are all float

update max

* update test_uops, cmplt is always a bool in the real linearizer. assertAlmostEqual is wrong when ret is bool

* cleanup LLVM wrong code

* dummy change for the CUDA install glitch

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-12-03 13:45:49 -08:00
George Hotz
857d440ea7 fail means fail (#2391)
* flip order

* cleanup and comment out failing test
2023-11-24 08:27:39 -08:00
George Hotz
1f4231a8f9 global pipefail 2023-11-24 08:03:49 -08:00
George Hotz
095e2ced61 add name support to fetch (#2407)
* add name support

* use fetch in gpt2

* remove requests from main lib, networkx also optional

* umm, keep that assert

* updates to fetch

* i love the walrus so much

* stop bundling mnist with tinygrad

* err, https

* download cache names

* add DOWNLOAD_CACHE_VERSION

* need env.

* ugh, wrong path

* replace get_child
2023-11-23 14:16:17 -08:00
George Hotz
c60c3b467a clean up symlinking in benchmark (#2219)
* clean up symlinking

* make torch deterministic
2023-11-05 16:46:05 -08:00
George Hotz
12dd165d38 add WINO/HALF/HIP to AMD benchmark 2023-10-25 13:22:45 -04:00
George Hotz
0e3e2bac13 amd wino: upload results 2023-09-09 13:57:14 -07:00
George Hotz
6f95c5f284 winograd speed test for AMD (#1826) 2023-09-09 13:56:33 -07:00
George Hotz
0f2bd10d00 add winograd CIFAR to mac tests (#1825)
* add winograd CIFAR to mac tests

* symlink already done
2023-09-09 13:45:24 -07:00
George Hotz
fb1cc6bf4b llama jit is default, print tok/sec (#1774)
* llama jit is default, print tok/sec

* jit not default in CI
2023-09-05 10:12:16 -07:00
George Hotz
89cd380bfc add nvidia CI (#1737)
* add nvidia

* speed(nvidia)
2023-09-01 22:02:30 -07:00
George Hotz
fdd7f282cb Reenable tensor cores for self-hosted Mac CI (#1717)
* debug 5 matmul

* allow tensor cores in CI

* tensor cores on arm64

* put debug back
2023-08-30 07:53:04 -07:00
wozeparrot
2f768e386d stable diffusion benchmark artifact (#1714) 2023-08-29 21:08:40 -04:00
George Hotz
0ea22bf249 remove DEBUG=1 from stable diffusion AMD since jit cache is fixed 2023-08-29 12:46:12 -07:00
George Hotz
ab9b9ff3e2 pipefail benchmark (#1709) (#1710)
* feat: specify shell

* feat: specify shell for mac

Co-authored-by: wozeparrot <wozeparrot@gmail.com>
2023-08-29 08:15:02 -07:00
George Hotz
aa7c98722b sd timing (#1706) 2023-08-28 20:22:57 -07:00
George Hotz
ad7d26c393 fix __launch_bounds__ and benchmark TC MATMUL (#1575)
* fix

* benchmark matmul
2023-08-19 10:54:39 -07:00
George Hotz
e3c6c0c6db add GPT2 example (#1511) (#1514)
* add gpt2 to examples

* some cleanup

* fixes

* argparse + scaled_dot_product_attention

* add timing

* add to benchmark

Co-authored-by: YassineYousfi <yassine.y10@gmail.com>
2023-08-10 09:09:47 -07:00
wozeparrot
351684395c dont run on fork (#1510) 2023-08-09 13:06:45 -04:00
wozeparrot
88e2e0c8a3 Revert "don't try to run benchmark on forks" (#1508) 2023-08-09 12:59:49 -04:00
wozeparrot
65b65b760b don't try to run benchmark on forks (#1507) 2023-08-09 12:59:19 -04:00
George Hotz
5fdd248617 don't download cifar (#1472) 2023-08-06 21:38:59 -07:00
George Hotz
d78fb8f4ed add stable diffusion and llama (#1471)
* add stable diffusion and llama

* pretty in CI

* was CI not true

* that

* CI=true, wtf

* pythonpath

* debug=1

* oops, wrong place

* uops test broken for wgpu

* wgpu tests flaky
2023-08-06 21:31:51 -07:00
George Hotz
486a9dbfd9 speed v torch (#1464)
* speed v torch

* always print

* change print

* torch speed tee

* all exposed
2023-08-06 09:32:33 -07:00
George Hotz
2ab282bfec run on update_benchmark too (#1460)
* run on update_benchmark too

* amd inference test

* name it better

* add 10 CIFAR training steps
2023-08-06 08:58:37 -07:00
George Hotz
943b227cb1 only on push to master 2023-08-06 00:10:07 -07:00
George Hotz
2274e3e757 Fix benchmark (#1454)
* do benchmarking

* system

* artifact

* go

* name artifact

* only on push
2023-08-05 23:44:36 -07:00
George Hotz
bf21aec81f do benchmarking (#1451)
* do benchmarking

* system

* artifact

* go

* name artifact
2023-08-05 23:35:01 -07:00