Commit Graph

6940 Commits

Author SHA1 Message Date
qazal
e823de3828 viz with bottom_up=True (#7894)
* add failing test

* single pass it

* linter
2024-11-25 17:56:48 +08:00
qazal
2ca41d6a44 ops metadata map try 2, early fuse [pr] (#7893)
* make this return early

* delete that

* ops metadata map try 2, early fuse [pr]
2024-11-25 17:08:38 +08:00
qazal
9295c86ddc delete base op cast [pr] (#7891) 2024-11-25 16:38:32 +08:00
qazal
26784c45c6 delete cast arg 2 [pr] (#7881) 2024-11-25 16:15:57 +08:00
George Hotz
9d0038bccb small changes from block linearizer [pr] (#7888)
* small changes from block linearizer [pr]

* fix test_gc
2024-11-25 15:27:04 +08:00
mesozoic-egg
9e958f2b10 Ptx simplify [pr] (#7877)
* simplify render_kernel

* cvar in const

* Revert "simplify render_kernel"

This reverts commit 1c8817bea2.

* CMPNE src match

* src match in cast

* cvar in define_acc

* simplify render_store

* simplify render_kernel

* whitespace

* render_kernel fix fstring

* render newline

* do not embed newline in Ops.WHERE render

* WHERE op fix

* missed a comma

* whitespace

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-25 15:01:47 +08:00
nib9888
e9c681c839 fix missing final rewrite in viz (#7883) 2024-11-25 14:13:33 +08:00
Sieds Lykles
a49a7c4784 Improved mod folding (#7887)
* Remove unnecessary if statement

In all paths where something_changed was set to True, remainder is
appended so the list can't be empty

* Working version of improved mod folding

* Fix offset calculation

Passing fuzz_symbolic.py to 130_000 so far
Added an extra test

* Cleaner offset calculation
2024-11-24 22:21:34 -05:00
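The commit above improves mod folding in the symbolic simplifier. For context, a minimal sketch of the basic identity such folding relies on; the PR's offset calculation generalizes this, and the code below is not tinygrad's implementation:
```python
# Minimal illustration of the core mod-folding identity: for integers
# a >= 0, b >= 0, c > 0, (a*c + b) % c == b % c, so the multiple-of-c
# term can be dropped from the symbolic expression.
def fold_mod(a: int, b: int, c: int) -> int:
  return b % c  # the folded form: the a*c term is gone

for a in range(5):
  for b in range(20):
    for c in range(1, 7):
      assert (a*c + b) % c == fold_mod(a, b, c)
```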
leopf
5d92efb121 [BUGFIX] Tensor([]).data() (#7884)
* added test, fix

* fix only for (0,) shape

* Revert "fix only for (0,) shape"

* test_data_empty_multi_dim
2024-11-24 16:42:57 -05:00
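A small usage sketch of the cases the commit above fixes, assuming `.data()` returns a memoryview for empty tensors the same way it does for non-empty ones (the empty-case semantics are inferred from the commit message):
```python
from tinygrad import Tensor

# the reported failure: .data() on an empty tensor; after the fix both the
# flat and the multi-dimensional empty cases should yield an empty buffer
flat = Tensor([])                  # shape (0,)
print(len(flat.data()))            # 0
multi = Tensor.empty(0, 3)         # multi-dim empty case (test_data_empty_multi_dim)
print(len(multi.data()))           # 0
```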
chenyu
ac57d82a13 test_tiny on real NV/CUDA/AMD/HIP (#7886)
simple tests that run on real CUDA and HIP
2024-11-24 16:34:54 -05:00
qazal
06a28d83f5 delete extra dtype check in uop const [pr] (#7880) 2024-11-25 00:06:52 +08:00
chenyu
31337b49e3 cleanup Embedding call [pr] (#7869)
reshape on self.weight is a no-op, and the special case for numel 0 is not needed.
2024-11-24 07:32:26 -05:00
geohotstan
ad9df26fba add test for inconsistent behavior in float to int casting (#7870)
* found teeny bug

* no healthcheck

* change function name
2024-11-24 07:31:34 -05:00
qazal
6b8a657085 cleanup group_realizes [pr] (#7878) 2024-11-24 18:16:46 +08:00
qazal
5aee78a0a6 fix uop swizzle on BUFFER, new tests (#7875)
* fix uop swizzle on BUFFER, new tests

* can have view of view
2024-11-24 17:11:09 +08:00
George Hotz
5d28a202b5 make tinychat local (#7871) 2024-11-24 14:45:48 +08:00
chenyu
22d5def113 download llama3 70B (#7868)
use "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF".
```
PYTHONPATH=. JITBEAM=2 python3 examples/llama3.py --download_model --size 70B --quantize int8 --benchmark
```

on M4 Max, 40 sec to load the model and
```
enqueue in 165.15 ms
total 328.54 ms, 3.04 tok/s, 247.46 GB/s, param 221.20 GB/s

enqueue in   5.31 ms
total 168.48 ms, 5.94 tok/s, 482.54 GB/s, param 431.34 GB/s

enqueue in   5.32 ms
total 168.77 ms, 5.93 tok/s, 481.71 GB/s, param 430.60 GB/s

enqueue in   5.69 ms
total 169.51 ms, 5.90 tok/s, 479.61 GB/s, param 428.72 GB/s

enqueue in   5.41 ms
total 168.60 ms, 5.93 tok/s, 482.20 GB/s, param 431.04 GB/s

enqueue in   5.18 ms
total 168.98 ms, 5.92 tok/s, 481.12 GB/s, param 430.08 GB/s

enqueue in   5.43 ms
total 168.82 ms, 5.92 tok/s, 481.59 GB/s, param 430.49 GB/s

enqueue in   5.27 ms
total 168.94 ms, 5.92 tok/s, 481.23 GB/s, param 430.17 GB/s
```
2024-11-23 12:18:31 -05:00
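As a rough consistency check on the steady-state numbers above (assumption: each generated token reads every int8-quantized parameter, about one byte each, once):
```python
# param bandwidth divided by token rate gives bytes read per token
gb_per_token = 431.34 / 5.94
print(f"{gb_per_token:.1f} GB/token")  # ~72.6 GB, consistent with a ~70B-param model at int8
```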
qazal
6a8be3ca1e don't change lazy state in schedule [pr] (#7867) 2024-11-24 00:18:50 +08:00
JaSpa99
28e83e662e least controversial (#7863) 2024-11-23 21:23:30 +08:00
George Hotz
8c3d3181dd bottom up rewrite fixes substitute [pr] (#7862)
* single pass rewrite fixes substitute [pr]

* caching for single_pass_rewrite

* allow multiple rewrites

* a simple test

* bottom_up_rewrite is fully flexible
2024-11-23 20:53:37 +08:00
mesozoic-egg
54d8f75d0c vectorized define_acc does not seem to get used (#7858)
Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-23 19:46:34 +08:00
qazal
40be9177ba move swizzle upats to ops, prereq for swizzle tc [pr] (#7861) 2024-11-23 18:34:45 +08:00
qazal
27a6cd7822 cleanup swizzle upats [pr] (#7860)
* cleanup swizzle upats [pr]

* match the rest
2024-11-23 15:19:06 +08:00
qazal
5b2c03e865 defer realize folding to kernel splitting [pr] (#7849)
* defer realize folding to schedule breaking [pr]

* this is init

* p2

* need to lookup edges

* refactor image cast folding [pr]

* Ops.LOAD diff

* image works

* refactor can_pad

* fix fold_img_cast
2024-11-23 14:29:14 +08:00
George Hotz
144e9f00df viz is local, new test, and new quantize [pr] (#7859)
* viz is local, new test, and new quantize [pr]

* fix mime types

* remove font

* after index
2024-11-23 14:27:10 +08:00
qazal
d43613e113 refactor image cast folding [pr] (#7852)
* refactor image cast folding [pr]

* Ops.LOAD diff
2024-11-23 13:59:21 +08:00
chenyu
c07daf40e7 move attention upcast (#7830)
still upcasts before softmax, but is faster because the intermediate buffer can be stored in half (as long as qk stays within half range).
2024-11-22 17:10:51 -05:00
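A sketch of the idea in the commit above, not the actual scaled_dot_product_attention change: keep the large q@k intermediate in half and upcast only for the numerically sensitive softmax.
```python
from tinygrad import Tensor, dtypes

def attention_half_qk(q: Tensor, k: Tensor, v: Tensor) -> Tensor:
  # qk stays in the input dtype (half), keeping the (seq, seq) buffer small;
  # this is only safe while qk values remain within half range
  qk = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
  # upcast just for softmax, then return to the original dtype
  return qk.cast(dtypes.float32).softmax(-1).cast(q.dtype) @ v
```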
chenyu
5c5b1b994c less flaky benchmarks (#7855)
JIT=2 for metal cifar with HALF, and a lower tflops threshold for nv test_gemm_4096. Failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
chenyu
3b26e51fce Tensor.cummax (#7854)
generalizes the existing cumsum to take Ops.MAX in addition to Ops.ADD
2024-11-22 15:55:02 -05:00
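Usage sketch of the new op, assuming Tensor.cummax mirrors cumsum's interface as the message suggests:
```python
from tinygrad import Tensor

t = Tensor([[1, 3, 2], [4, 0, 5]])
print(t.cumsum(axis=1).tolist())  # [[1, 4, 6], [4, 4, 9]]  (running sum)
print(t.cummax(axis=1).tolist())  # [[1, 3, 3], [4, 4, 5]]  (running maximum)
```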
ignaciosica
fb10ea563e typedef bf16 amd (#7850) 2024-11-22 14:29:01 -05:00
chenyu
a352a6938f simplify group_for_reduces in get_index [pr] (#7851)
what was that
2024-11-22 11:53:21 -05:00
chenyu
af5d77f684 move sint_to_uop from view.py to ops.py [pr] (#7848)
both sint and uop are in ops.py
2024-11-22 11:15:02 -05:00
chenyu
f6d1201c48 variable_to_uop -> sint_to_uop [pr] (#7847)
and added a type annotation to it
2024-11-22 10:54:59 -05:00
chenyu
40d7535eeb clean up DTYPES_DICT [pr] (#7845) 2024-11-22 10:01:34 -05:00
chenyu
4453ab51e1 use ceildiv in View.stride [pr] (#7844) 2024-11-22 08:38:05 -05:00
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
qazal
7e8777eee9 faster assign scheduling [pr] (#7839)
* baseline 87 ms

* 86 ms, only PRELOAD assigns

* refactor to assign_adjacents

* ops_folding
2024-11-22 19:23:59 +08:00
chenyu
6229d87f45 simpler reshape symbolic shape check [pr] (#7837) 2024-11-21 22:53:57 -05:00
George Hotz
1d6d842887 move DSP to extra (room for webgpu) [pr] (#7836) 2024-11-22 11:32:57 +08:00
chenyu
8ff6cba9f0 simpler swizzle_r new_axis [pr] (#7835)
the new axes are the ones permuted to the end
2024-11-21 22:26:41 -05:00
George Hotz
6fc7013463 put all DSP in dsp file [pr] (#7833) 2024-11-22 11:22:59 +08:00
George Hotz
e39af63156 no loop assert in ops_python [pr] (#7834) 2024-11-22 11:17:36 +08:00
George Hotz
d18b948f48 ptxcompiler isn't a cudacompiler [pr] (#7832)
* ptxcompiler isn't a cudacompiler [pr]

* hcq types
2024-11-22 10:57:22 +08:00
mesozoic-egg
855f9a767a add restype for msg method for type annotation and consistency (#7828)
* no need to explicitly set objc_id as restype

* add restype for type annotation

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.me>
2024-11-22 09:17:58 +08:00
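Background on why setting restype matters in ctypes bindings like the msg helper touched here; a generic illustration, not the ops_metal code:
```python
import ctypes, ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
# without an explicit restype, ctypes assumes a C int and truncates 64-bit pointers
libc.strdup.argtypes = [ctypes.c_char_p]
libc.strdup.restype = ctypes.c_void_p
ptr = libc.strdup(b"hello")
print(ctypes.cast(ptr, ctypes.c_char_p).value)  # b'hello'
libc.free(ctypes.c_void_p(ptr))
```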
chenyu
d5c9fafff5 default run stable diffusion benchmark with fp16 (#7831)
and keep the non-fp16 one on Mac
2024-11-21 15:58:17 -05:00
chenyu
69e382216d fix wino conv output dtype for half inputs (#7829) 2024-11-21 12:13:54 -05:00
geohotstan
cf1ec90ad4 add inverse trig functions to Tensor (#7805)
* implement inverse trig functions

* guess we should still test nans?

* magnitude as variable name :D

* reorder onnx_ops ops

* approximation -> x for consistency

* address feedback

* simpler acos

* improvement?

* actually just have asin depend on atan

* actually this is nicer

* remove a comment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-11-21 09:13:36 -05:00
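The last few bullets settle on deriving asin from atan; these are the standard identities involved (the PR's exact formulation and edge-case handling may differ):
```python
import math

def asin_via_atan(x: float) -> float:
  # asin(x) = atan(x / sqrt(1 - x^2)) for |x| < 1
  return math.atan(x / math.sqrt(1.0 - x * x))

def acos_via_asin(x: float) -> float:
  # acos(x) = pi/2 - asin(x)
  return math.pi / 2 - asin_via_atan(x)

for x in [-0.9, -0.5, 0.0, 0.3, 0.99]:
  assert abs(asin_via_atan(x) - math.asin(x)) < 1e-9
  assert abs(acos_via_asin(x) - math.acos(x)) < 1e-9
```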
qazal
5399ff6d06 add UOp.const_with_shape [pr] (#7825)
* add UOp.const_with_shape [pr]

* lines
2024-11-21 21:13:23 +08:00
qazal
2f884b2384 good suggestions from mypy lineprecision-report for schedule.py [pr] (#7823)
* good suggestions from mypy lineprecision-report [pr]

* ok if metadata doesn't exist

* same for store

* that's buf_uop
2024-11-21 19:59:51 +08:00
qazal
e378aeb94e assert view degrade to const tests post scheduler graph_rewrite [pr] (#7822)
* assert view degrade to const tests post scheduler graph_rewrite [pr]

* low pri, probably tricky, todo
2024-11-21 19:00:41 +08:00