Commit Graph

6949 Commits

Author SHA1 Message Date
qazal
ceda43ce75 always swizzle load st in wmma [pr] (#7908) 2024-11-26 20:00:58 +08:00
George Hotz
4e5bf9dc7a test assignment in jit (#7906)
* test assignment in jit

* don't waste lines

* skip broken test in webgpu
2024-11-26 17:37:00 +08:00
mesozoic-egg
0cd1cc29dc PTX simplify: use a dict matcher for prefix [pr] (#7890)
* use a dict matcher for prefix

* simplify tuple unpack

* simplify tuple unpack

* debug pr

* Revert "debug pr"

This reverts commit 3aa9f77517.

* define_acc boolean case

* remove commented lines

* wip

* no need for .scalar in define_acc

* indentation

* linter fix

* add keys to matcher from GroupOps directly

* put dtype in tuple directly

* cast, line too long fix

* check ptrdtype with isinstance

* dtype is always ptr for define_global

wip

* blank commit to trigger CI

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-26 17:32:48 +08:00
Ahmed Harmouche
10618aba98 Bring back WebGPU (#7063)
* Start from andredaprato:webgpu-clean

* Fix infs

* inf wgsl function is not needed

* Emulated ulong for threefry, more tests passing

* Randomness tests passing

* Update model export to support new changes in webgpu; efficientnet export works again

* Simplify shift emulation in wgsl

* Delete test file

* Fix bigger than u32 u32 literal

* Why was skip copies added here?

* Python3.12 for webgpu tests

* Fix model export syntax error

* Get test ops passing with some skips

* Fix lint

* Much simpler shift

* Run more tests

* Timestamp queries are not supported in CI, so skip search tests

* All fancy indexing passing

* r is ctx

* Run more dtype tests by using is_dtype_supported

* Cleanup ulong shift rendering

* UPat -> Pat, UOps -> Ops

* Pat -> UPat

* Refactor render_ushift if-else

* Pattern to avoid ulong mul

* Remove vals_dtype

* is_nan trick + rewrite, test_isnan passing

* Rewrite a * select(1, nan, gate) -> select(a, nan, gate)

* No arg, just op

* Support char, uchar, short, ushort

* Run test_index_mnis now that we have uint8

* Fix pylint

* Save 3 lines by using base Compiler

* No more long emulation

* Remove fixup_binops

* No more external_local_bufx wgsl specific cstyle modif, use base extra_pm

* Simpler, faster copyin/out

* Skip some new tests that use long

* Fix typo

* copyout touchup

* Save lines by using render_cast

* WebGL is not supported in core, delete it from is_dtype_supported

* More narrow test skips for some unary tests

* TernaryOps, UnaryOps -> Ops

* TinyGrad supports WebGPU

* StableDiffusion demo: f16tof32 gpu is a lib, update UI

* Packed load/store, no more scale_size, no core tinygrad changes

* Rename copyin, copyout

* Device -> dev

* Fix lint

* Pattern matcher rule for packed load/store

* Refactor

* Shorter packed load/store

* this should fix lint

* Fix mypy

* SD compile script working

* New SD webgpu UI

* New default prompt

* New SD weights

* Fix title when webgpu not available

* Run symbolic tests, simplify is_nan, use round_up

* Show step time on UI

* Bump minimum wgpu version to v0.19

* Fix latent

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-26 12:26:40 +08:00
chenyu
ff3f2a9c1a Revert "move attention upcast (#7830)" (#7903)
This reverts commit c07daf40e7.
2024-11-25 18:59:51 -05:00
chenyu
04bee97d2a hotfix ctypes.c_ulong(size) for metal _alloc (#7902)
fix `Tensor.ones(1000, 1000, 1000).contiguous().realize()` on METAL
2024-11-25 18:25:33 -05:00
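The hotfix above passes the allocation size through `ctypes.c_ulong`. The failure mode it guards against: `Tensor.ones(1000, 1000, 1000)` in float32 is a 4 GB buffer, and a too-narrow signed 32-bit C type silently wraps that to a negative number. A minimal sketch of the wraparound (illustrative only, not tinygrad's actual `_alloc` code):

```python
import ctypes

size = 1000 * 1000 * 1000 * 4  # 4e9 bytes: a 1000^3 float32 buffer

# ctypes integer types mask out-of-range values rather than raising,
# so a signed 32-bit type wraps 4e9 to a negative number:
assert ctypes.c_int32(size).value == size - 2**32  # -294967296

# c_ulong is unsigned (at least 32 bits on every platform), so 4e9 survives:
assert ctypes.c_ulong(size).value == size
```

This is why wrapping the size explicitly matters when handing byte counts to a C allocation API.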
chenyu
631dc98b52 validate llama quantize output (#7901)
mac benchmark already runs quantize, this adds output validation
2024-11-25 16:46:23 -05:00
qazal
e8777cb8db assert view on uops without shape [pr] (#7898)
* assert view on uops without shape [pr]

* lint
2024-11-25 20:43:50 +08:00
chenyu
a49ca0c2ff clean up fully_flatten [pr] (#7885)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-25 06:53:18 -05:00
qazal
e823de3828 viz with bottom_up=True (#7894)
* add failing test

* single pass it

* linter
2024-11-25 17:56:48 +08:00
qazal
2ca41d6a44 ops metadata map try 2, early fuse [pr] (#7893)
* make this return early

* delete that

* ops metadata map try 2, early fuse [pr]
2024-11-25 17:08:38 +08:00
qazal
9295c86ddc delete base op cast [pr] (#7891) 2024-11-25 16:38:32 +08:00
qazal
26784c45c6 delete cast arg 2 [pr] (#7881) 2024-11-25 16:15:57 +08:00
George Hotz
9d0038bccb small changes from block linearizer [pr] (#7888)
* small changes from block linearizer [pr]

* fix test_gc
2024-11-25 15:27:04 +08:00
mesozoic-egg
9e958f2b10 Ptx simplify [pr] (#7877)
* simplify render_kernel

* cvar in const

* Revert "simplify render_kernel"

This reverts commit 1c8817bea2.

* CMPNE src match

* src match in cast

* cvar in define_acc

* simplify render_store

* simplify render_kernel

* whitespace

* render_kernel fix fstring

* render newline

* do not embed newline in Ops.WHERE render

* WHERE op fix

* missed a comma

* whitespace

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-25 15:01:47 +08:00
nib9888
e9c681c839 fix missing final rewrite in viz (#7883) 2024-11-25 14:13:33 +08:00
Sieds Lykles
a49a7c4784 Improved mod folding (#7887)
* Remove unnecessary if statement

In all paths where something_changed was set to True, remainder is
appended so the list can't be empty

* Working version of improved mod folding

* Fix offset calculation

Passing fuzz_symbolic.py to 130_000 so far
Added an extra test

* Cleaner offset calculation
2024-11-24 22:21:34 -05:00
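The mod-folding idea behind the commit above: for a symbolic expression `sum(c_i * x_i) + k` taken mod `n`, any term whose coefficient is a multiple of `n` contributes nothing, remaining coefficients reduce mod `n`, and the constant offset folds to `k % n` (valid when the variables are non-negative integers). A minimal sketch with a hypothetical helper, not tinygrad's actual symbolic code:

```python
def fold_mod(terms: dict, const: int, n: int):
    """Fold (sum(c * x for x, c in terms.items()) + const) % n symbolically.

    Terms with coefficients divisible by n vanish mod n, other
    coefficients reduce mod n, and the constant reduces to const % n
    (assuming non-negative integer variables).
    """
    kept = {x: c % n for x, c in terms.items() if c % n != 0}
    return kept, const % n

# (4*a + 3*b + 10) % 4 folds to (3*b + 2) % 4
folded = fold_mod({"a": 4, "b": 3}, 10, 4)  # ({"b": 3}, 2)
```

Fuzzing (as in `fuzz_symbolic.py` mentioned above) is the natural way to gain confidence that such rewrites preserve the value for all variable assignments.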
leopf
5d92efb121 [BUGFIX] Tensor([]).data() (#7884)
* added test, fix

* fix only for (0,) shape

* Revert "fix only for (0,) shape"

* test_data_empty_multi_dim
2024-11-24 16:42:57 -05:00
chenyu
ac57d82a13 test_tiny on real NV/CUDA/AMD/HIP (#7886)
simple tests that run on real CUDA and HIP
2024-11-24 16:34:54 -05:00
qazal
06a28d83f5 delete extra dtype check in uop const [pr] (#7880) 2024-11-25 00:06:52 +08:00
chenyu
31337b49e3 cleanup Embedding call [pr] (#7869)
reshape on self.weight is a no-op, and there is no need to special-case numel 0.
2024-11-24 07:32:26 -05:00
geohotstan
ad9df26fba add test for inconsistent behavior in float to int casting (#7870)
* found teeny bug

* no healthcheck

* change function name
2024-11-24 07:31:34 -05:00
qazal
6b8a657085 cleanup group_realizes [pr] (#7878) 2024-11-24 18:16:46 +08:00
qazal
5aee78a0a6 fix uop swizzle on BUFFER, new tests (#7875)
* fix uop swizzle on BUFFER, new tests

* can have view of view
2024-11-24 17:11:09 +08:00
George Hotz
5d28a202b5 make tinychat local (#7871) 2024-11-24 14:45:48 +08:00
chenyu
22d5def113 download llama3 70B (#7868)
use "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF".
```
PYTHONPATH=. JITBEAM=2 python3 examples/llama3.py --download_model --size 70B --quantize int8 --benchmark
```

on M4 Max, 40 sec to load the model and
```
enqueue in 165.15 ms
total 328.54 ms, 3.04 tok/s, 247.46 GB/s, param 221.20 GB/s

enqueue in   5.31 ms
total 168.48 ms, 5.94 tok/s, 482.54 GB/s, param 431.34 GB/s

enqueue in   5.32 ms
total 168.77 ms, 5.93 tok/s, 481.71 GB/s, param 430.60 GB/s

enqueue in   5.69 ms
total 169.51 ms, 5.90 tok/s, 479.61 GB/s, param 428.72 GB/s

enqueue in   5.41 ms
total 168.60 ms, 5.93 tok/s, 482.20 GB/s, param 431.04 GB/s

enqueue in   5.18 ms
total 168.98 ms, 5.92 tok/s, 481.12 GB/s, param 430.08 GB/s

enqueue in   5.43 ms
total 168.82 ms, 5.92 tok/s, 481.59 GB/s, param 430.49 GB/s

enqueue in   5.27 ms
total 168.94 ms, 5.92 tok/s, 481.23 GB/s, param 430.17 GB/s
```
2024-11-23 12:18:31 -05:00
qazal
6a8be3ca1e don't change lazy state in schedule [pr] (#7867) 2024-11-24 00:18:50 +08:00
JaSpa99
28e83e662e least controversial (#7863) 2024-11-23 21:23:30 +08:00
George Hotz
8c3d3181dd bottom up rewrite fixes substitute [pr] (#7862)
* single pass rewrite fixes substitute [pr]

* caching for single_pass_rewrite

* allow multiple rewrites

* a simple test

* bottom_up_rewrite is fully flexible
2024-11-23 20:53:37 +08:00
mesozoic-egg
54d8f75d0c vectorized define_acc does not seem to get used (#7858)
Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-23 19:46:34 +08:00
qazal
40be9177ba move swizzle upats to ops, prereq for swizzle tc [pr] (#7861) 2024-11-23 18:34:45 +08:00
qazal
27a6cd7822 cleanup swizzle upats [pr] (#7860)
* cleanup swizzle upats [pr]

* match the rest
2024-11-23 15:19:06 +08:00
qazal
5b2c03e865 defer realize folding to kernel splitting [pr] (#7849)
* defer realize folding to schedule breaking [pr]

* this is init

* p2

* need to lookup edges

* refactor image cast folding [pr]

* Ops.LOAD diff

* image works

* refactor can_pad

* fix fold_img_cast
2024-11-23 14:29:14 +08:00
George Hotz
144e9f00df viz is local, new test, and new quantize [pr] (#7859)
* viz is local, new test, and new quantize [pr]

* fix mime types

* remove font

* after index
2024-11-23 14:27:10 +08:00
qazal
d43613e113 refactor image cast folding [pr] (#7852)
* refactor image cast folding [pr]

* Ops.LOAD diff
2024-11-23 13:59:21 +08:00
chenyu
c07daf40e7 move attention upcast (#7830)
still upcasts before softmax, but is faster because the intermediate buffer can be stored in half (as long as qk is within half range).
2024-11-22 17:10:51 -05:00
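Why upcasting before softmax matters: `exp` overflows half precision almost immediately (half max is 65504, and exp(12) already exceeds it), so softmax is computed at higher precision, typically with the row max subtracted first so every exponent is non-positive. A minimal float sketch of the stable form (illustrative, not tinygrad's attention code):

```python
import math

def softmax(xs):
    # subtract the max so every exponent is <= 0; exp then stays in (0, 1]
    # and cannot overflow, regardless of how large the logits are
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
```

The commit's observation is that the *storage* of the intermediate qk buffer can still be half, as long as the qk values themselves fit in half range.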
chenyu
5c5b1b994c less flaky benchmarks (#7855)
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
chenyu
3b26e51fce Tensor.cummax (#7854)
generalized the existing cumsum to take Ops.MAX in addition to Ops.ADD
2024-11-22 15:55:02 -05:00
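The generalization above is the classic scan: cumsum and cummax are the same cumulative reduction with a different binary op. A minimal sketch of the idea in plain Python (not tinygrad's implementation):

```python
from itertools import accumulate
import operator

def scan(xs, op):
    # one cumulative reduction covers both: add gives cumsum, max gives cummax
    return list(accumulate(xs, op))

xs = [3, 1, 4, 1, 5]
cumsum = scan(xs, operator.add)  # [3, 4, 8, 9, 14]
cummax = scan(xs, max)           # [3, 3, 4, 4, 5]
```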
ignaciosica
fb10ea563e typedef bf16 amd (#7850) 2024-11-22 14:29:01 -05:00
chenyu
a352a6938f simplify group_for_reduces in get_index [pr] (#7851)
what was that
2024-11-22 11:53:21 -05:00
chenyu
af5d77f684 move sint_to_uop from view.py to ops.py [pr] (#7848)
both sint and uop are in ops.py
2024-11-22 11:15:02 -05:00
chenyu
f6d1201c48 variable_to_uop -> sint_to_uop [pr] (#7847)
and added a type annotation to it
2024-11-22 10:54:59 -05:00
chenyu
40d7535eeb clean up DTYPES_DICT [pr] (#7845) 2024-11-22 10:01:34 -05:00
chenyu
4453ab51e1 use ceildiv in View.stride [pr] (#7844) 2024-11-22 08:38:05 -05:00
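For context on the commit above: ceildiv is the standard ceiling-division helper, commonly written in Python without floats by floor-dividing a negation (tinygrad defines its own helper; this exact form is an assumption):

```python
def ceildiv(num: int, amt: int) -> int:
    # ceiling division without floats: negate, floor-divide, negate again
    return -(num // -amt)

ceildiv(10, 3)  # 4
ceildiv(9, 3)   # 3
```

Avoiding `math.ceil(num / amt)` sidesteps float rounding error on large operands, which matters for shape/stride arithmetic.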
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
qazal
7e8777eee9 faster assign scheduling [pr] (#7839)
* baseline 87 ms

* 86 ms, only PRELOAD assigns

* refactor to assign_adjacents

* ops_folding
2024-11-22 19:23:59 +08:00
chenyu
6229d87f45 simpler reshape symbolic shape check [pr] (#7837) 2024-11-21 22:53:57 -05:00
George Hotz
1d6d842887 move DSP to extra (room for webgpu) [pr] (#7836) 2024-11-22 11:32:57 +08:00
chenyu
8ff6cba9f0 simpler swizzle_r new_axis [pr] (#7835)
the new axes are the ones permuted to the end
2024-11-21 22:26:41 -05:00
George Hotz
6fc7013463 put all DSP in dsp file [pr] (#7833) 2024-11-22 11:22:59 +08:00