Commit Graph

7547 Commits

Author SHA1 Message Date
George Hotz
0d7bd4f389 empty graph rewrite to VIZ tensor graph [pr] (#8658)
* empty graph rewrite to VIZ tensor graph [pr]

* fix lint
2025-01-17 11:29:33 -08:00
George Hotz
8609b880bd hotfix: test_backward_sum 2025-01-17 10:25:02 -08:00
chenyu
f8cc971c3b raise RuntimeError for uneven shards in Tensor.shard [pr] (#8656) 2025-01-17 12:48:39 -05:00
mesozoic-egg
3506a7585f upcast overflowed idx to int64 [pr] (#8268)
* use full_shape to determine if index can potentially overflow

* update comment

* use shapetracker to check max index value

* wip

* lint

* handle mask

* upcast to int64 by st is noop on WGSL

* fix comments

* Handle negative overflow, intermediaries overflow, int64 support

handle negative overflow

handle symbolic

wip

handle intermediate values

wip

check if typemap support int64

lint

comment

* add invalid_dtype

lint

* Fix bug on checking mask overflow

wip

wip

* Add more tests, need to resolve partial upcast

test Valid_view_dup

test valid op overflow

refine test cases

clean up

cleanup

wip

refine tests

lint

* Upcast is handled by lower_load_store

upcast as graph_rewrite to backtrack

update test

wip

cleanup

wip

cleanup

do upcast in lower_load_store

lint

* cleanup

* do upcast within lower_load_store and mutate ctx

* do upcast in get_idx and view

revert

lint

* cleanup

* Upcast in vec, const

upcast to const

test case 3

upcast on vector

lint

* simplify idx with symbolic in case of fake overflow

test case4

test case 4

update test

* test case4 is only for metal

* try: upcast inside graph_rewrite instead of shapetracker

wip

* checking overflow can just be done directly on all views, with idxs

* cleanup

* REMOVE hard coded uop test for idx upcast

* refactor

cleanup

refactor

* do actual casting when necessary, instead of rewriting all idx

hard code uop test

new upcast

* check dtype for int64 in webgpu

* cleanup

cleanup

* cleanup

* update tests

cleanup

comment

cleanup

cleanup

* comment

* comment

* update comment

update comment

* refactor

* typo

* keep the scope to only upcasting

* white space

* Revert "white space"

This reverts commit 314d7eb184.

* Revert "keep the scope to only upcasting"

This reverts commit 1ef701dd85.

* sym folding is not necessary

lint1

* fold symbolic

lint

* use symbolic simple when folding shapetracker idx

* full sym folding is required after all...

* Ops.CAST should retain the src min max

* put rewrite to lowerer

wip

* start testing on higher level

wip

test higher level in test_tensor

* find Ops.STORE in list instead of recursively

* check dtype support when upcasting

* remove invalid_dtype

* lint

* fix int64 support checks in upcast

lint

* skipif skipunless

* revert fold to find test case

* Revert "revert fold to find test case"

This reverts commit 225bb6e801.

* test sym folding

* handle ptx

* wip

* wip

* delete hard coded uop test

* lint fixes

* wip

* fix checking for None

* lint

* handle ptx

* comment

* dtype for overflow()

* update skipIf skipUnless

* assert in wgsl renderer for int64

wip

* do folded_upcast in to_indexed_op, real_size uses views_to_indexed_ops

* assert in lowerer for dtype support

lint

* Revert "assert in lowerer for dtype support"

This reverts commit 8e9b1b79bf.

* assert dtype in kernel.py

* Revert "assert dtype in kernel.py"

This reverts commit e29b9a9893.

* wip

* assert in render

* remove old assert

* check dtype from renderer, assert in upcast

wip

* smaller arange for sym fold case

* linearize directly

* use expand directly

* lint

* lint

* rename

* no need to check dtype in device.py

* trigger pr

* remove dtype assert in upcast, make wgpu fail in render

* use DType for type hint instead of dtypes

* assert on KeyError in tests for webgpu backend int64

* use a tuple for src

* test real kernel run

wip

* lint error

* restore

* fix real_size

* update test example

* resolve merge stuff

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2025-01-17 11:52:31 -05:00
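
A minimal sketch of the idea behind this change, with hypothetical helper and dtype names (not tinygrad's actual implementation): compute the largest and smallest flat index a view can produce and only upcast the index dtype to int64 when that range does not fit in int32.

```python
# Illustrative sketch only: detect potential int32 index overflow and upcast.
# index_dtype_for and the dtype strings are hypothetical stand-ins, not tinygrad APIs.
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def index_dtype_for(shape: tuple[int, ...], strides: tuple[int, ...]) -> str:
  # Largest and smallest flat index reachable for this view.
  max_index = sum((s - 1) * st for s, st in zip(shape, strides) if st > 0)
  min_index = sum((s - 1) * st for s, st in zip(shape, strides) if st < 0)
  # Upcast only if the index range cannot be represented in int32.
  return "int64" if max_index > INT32_MAX or min_index < INT32_MIN else "int32"

print(index_dtype_for((65536, 65536), (65536, 1)))  # int64: 2**32 - 1 overflows int32
print(index_dtype_for((1024, 1024), (1024, 1)))     # int32
```
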
qazal
23f0ff0ed8 add bitcast to multi [pr] (#8652) 2025-01-17 03:17:19 -05:00
qazal
2b7db9b45d delete unused cast/bitcast lines from ops.py [pr] (#8651)
* move cast and bitcast out

* more deletion of bitcast arg

* fix test_bitcast_fuses

* update tests

* work
2025-01-17 03:04:18 -05:00
Mike Ashcroft
4f0d1b4759 Disable graphs by default if using an intel macbook (#8648) (#8649) 2025-01-16 18:24:56 -08:00
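
A hedged sketch of how such a platform check could look in Python; the flag name and how the default is actually wired up in tinygrad are assumptions for illustration only.

```python
import platform

# Illustrative only: detect an Intel Mac so a feature (e.g. graph execution)
# can be disabled by default on that platform.
def is_intel_mac() -> bool:
  return platform.system() == "Darwin" and platform.machine() == "x86_64"

GRAPH_DEFAULT = 0 if is_intel_mac() else 1  # hypothetical default, not the real env handling
```
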
eliotgolding
0289fbb1c2 limit real_size to the size of first View of ShapeTracker (#8628)
* fix real_size

* add fuzzer; typing

* spacing

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-16 16:27:39 -05:00
nimlgen
f91ca508cf am: bind for sdma (#8633)
* am: bind for sdma

* fix
2025-01-16 15:22:27 +03:00
nimlgen
f671da6755 ci: add AM start time to benchmark (#8637)
* ci: add AM start time to benchmark

* am: unlock it

* add AMD

* revert this
2025-01-16 14:47:36 +03:00
qazal
81a84aa85a remove is_unrealized_unmasked_const [pr] (#8644) 2025-01-16 05:27:47 -05:00
uuuvn
00e5979897 Use full soname for libgcc_s in CPUProgram (#8642)
The number after .so is the ABI version; it is always 1 for libgcc_s.
Most Linux systems point the unversioned library name at the versioned one
via symlinks, which are simply followed to reach the actual ELF. Conda
instead uses a GNU ld linker script, which ctypes doesn't follow
(contents of libgcc_s.so below):
```
/* GNU ld script
   Use the shared library, but some functions are only in
   the static library.  */
GROUP ( libgcc_s.so.1 -lgcc )
```
ctypes.util.find_library treats this text file as the actual ELF, and
ctypes.CDLL then tries to load it as a shared library. The result is:
```
  File "/home/me/src/tinygrad/tinygrad/device.py", line 223, in CPUProgram
    helper_handle = ctypes.CDLL(ctypes.util.find_library('System' if OSX else 'gcc_s'))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniforge3/envs/tinygrad/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/me/miniforge3/envs/tinygrad/lib/libgcc_s.so: invalid ELF header
```
2025-01-16 12:56:52 +03:00
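
The fix described here is to load libgcc_s by its full soname so ctypes never tries to dlopen the ld script. A minimal sketch of that idea follows; the exact fallback chain in tinygrad's `device.py` may differ.

```python
import ctypes, ctypes.util, platform

OSX = platform.system() == "Darwin"

# Illustrative sketch: prefer the full soname on Linux so ctypes never dlopens
# a GNU ld script; the abi version is always 1 for libgcc_s.
if OSX:
  helper_handle = ctypes.CDLL(ctypes.util.find_library("System"))
else:
  helper_handle = ctypes.CDLL("libgcc_s.so.1")
```
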
qazal
611208cd8a Revert "Revert "move subbuffer to a rewrite rule in the scheduler (#8639)" (…" (#8643)
This reverts commit 82ef956cb8.
2025-01-16 04:30:11 -05:00
qazal
82ef956cb8 Revert "move subbuffer to a rewrite rule in the scheduler (#8639)" (#8641)
This reverts commit d5c90da286.
2025-01-16 03:29:07 -05:00
qazal
d5c90da286 move subbuffer to a rewrite rule in the scheduler (#8639)
* delete buffer_view from tensor

* add to the scheduler

* move buffer_view to the scheduler

* gradient doesn't care.

* for/with
2025-01-16 03:14:28 +02:00
nimlgen
b3efeeb717 docs: start am docs (#8638)
* docs: init am docs

* missing
2025-01-16 00:22:35 +03:00
uuuvn
7ecced7f6d LLVM JIT prereqs (#8634)
* LLVM JIT prereqs

This commit moves JIT loading, disassembling, and CPUProgram logic from
`ops_clang.py` to `elf.py`, `helpers.py`, and `device.py` respectively.

I don't quite like `helpers.py` as the destination for capstone_flatdump,
but that is where cpu_objdump lives, so presumably this is how it's
supposed to be.

* Types
2025-01-15 09:47:08 -08:00
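
Since the commit mentions capstone_flatdump, here is a hedged sketch of what a capstone-based flat disassembly dump can look like with the capstone Python bindings; the function name, the fixed x86-64 architecture, and the sample bytes are illustrative, not tinygrad's actual helper.

```python
# Illustrative sketch of a capstone-based disassembly dump (not tinygrad's exact helper).
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

def flat_dump(machine_code: bytes, base_addr: int = 0) -> None:
  md = Cs(CS_ARCH_X86, CS_MODE_64)  # a real CPUProgram helper would pick the host arch
  for ins in md.disasm(machine_code, base_addr):
    print(f"{ins.address:#06x}: {ins.mnemonic} {ins.op_str}")

flat_dump(b"\x48\x89\xf8\xc3")  # mov rax, rdi ; ret
```
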
qazal
a1f70ce7d0 only use BUFFER_VIEW in disk [pr] (#8629)
* only use BUFFER_VIEW in disk [pr]

* delete can_view

* BUFFER_VIEW op on DISK

* remove that allow_buffer_view=False

* notes

* bitcast is a low-level op too

* this passes on AMD and LLVM
2025-01-15 12:34:15 -05:00
ignaciosica
bae20e5043 Generic PTX wmma rendering [pr] (#8632)
* make wmma rendering dtype size generic

* use var instead of calculating multiple times

* compact rendering
2025-01-15 09:31:48 -08:00
qazal
6193e279d4 isolate simple failing test for subbuffer on CONST [pr] (#8630)
* simple failing test for subbuffer on CONST [pr]

* add view_supported_devices check
2025-01-15 05:45:03 -05:00
George Hotz
e1f7c90459 gradient is a set [pr] (#8626)
* gradient is a set [pr]

* typing for deepwalk
2025-01-14 20:48:23 -08:00
chenyu
7fb1c7af61 minor multi cleanups [pr] (#8625) 2025-01-14 22:25:23 -05:00
George Hotz
504ad08e73 hotfix: add test_example_matmul_same 2025-01-14 19:03:17 -08:00
George Hotz
f29d6f54b8 support multilb gradient [pr] (#8624) 2025-01-14 18:33:33 -08:00
chenyu
4ee3243c93 JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
2025-01-14 19:52:38 -05:00
chenyu
7860a80801 simpler MultiLazyBuffer alu [pr] (#8622) 2025-01-14 19:19:13 -05:00
chenyu
930728c069 bert BS 72->66 [pr] (#8621)
72 does not fit now
2025-01-14 18:41:41 -05:00
chenyu
0790d8059f remove MultiLazyBuffer.from_sharded [pr] (#8620)
it's equivalent to taking the lazydata from Tensor.split, then copying to devices
2025-01-14 18:00:49 -05:00
George Hotz
c85737c200 assert to prepare for grad uop [pr] (#8280)
* assert to prepare for grad uop [pr]

* fix test_nn

* fix most of test_tensor

* few more tests

* fix multi

* uniform gradient

* acc_dtype

* any for multi

* fix typing

* fix assert, CAST_BEFORE_VIEW is still the issue

* explict test for CAST_BEFORE_VIEW

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 13:26:56 -08:00
George Hotz
fdd46c9f28 delete view instant rule (#8616)
* remove cast before view

* greener

* indexing

* delete view instant rule

* that passes too

* openpilot too

* ack

* base on cast_before_view

* add it as a rewrite rule

* VIEW(DEVICE) is also fine

* test_shard_memory depends on forced_realize removal

* put that back, will go soon

* UOp representations change once we don't instantly fold things

* do not duplicate tests

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 16:15:13 -05:00
qazal
dddd4e5f9f hotfix: remove duplicate TestTensorMutates [pr] (#8619)
* hotfix: remove duplicate TestTensorMutates [pr]

* imports
2025-01-14 16:03:17 -05:00
nimlgen
c5782e85d2 tlsf: optimize alloc (#8608) 2025-01-14 23:48:07 +03:00
George Hotz
bfbe81df71 remove cast before view (#8613)
* remove cast before view

* greener

* indexing

* that passes too

* openpilot too

* ack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201 raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs

skip 70B
2025-01-14 14:51:48 -05:00
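
A minimal sketch of the even-shard check this commit describes, using a hypothetical helper rather than Tensor.shard itself: the sharded dimension must divide evenly across the devices, otherwise raise RuntimeError.

```python
# Illustrative sketch of the even-shard check (hypothetical helper, not Tensor.shard).
def check_even_shard(dim_size: int, devices: tuple[str, ...]) -> int:
  if dim_size % len(devices) != 0:
    raise RuntimeError(f"cannot evenly shard dim of size {dim_size} across {len(devices)} devices")
  return dim_size // len(devices)

check_even_shard(8192, ("GPU:0", "GPU:1", "GPU:2", "GPU:3"))  # ok: 2048 per device
# check_even_shard(8192, ("GPU:0",) * 6) would raise RuntimeError
```
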
ignaciosica
d5a646d492 CUDA Turing TC (#8597)
* init turing tc

* reorder tc

* hotfix: remove some spaces

* revert var name to x

* consistent order of factors

* revert order of terms to match old stuff

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-14 10:35:14 -08:00
chenyu
cbfd51f5a5 make MultiLazyBuffer.bounds a property [pr] (#8614)
determined by lbs shapes and axis
2025-01-14 13:25:54 -05:00
chenyu
52e7003414 Revert "make kits19 dataset samples have small sizes (#8591)" (#8610)
This reverts commit 76a03e950a.
2025-01-14 12:24:27 -05:00
Francis Lata
76a03e950a make kits19 dataset samples have small sizes (#8591) 2025-01-14 08:27:45 -08:00
ignaciosica
4057b98f7f rename i and j into k and row/col (#8607) 2025-01-14 08:27:05 -08:00
nimlgen
1ff6862a3d ci: sleep a bit to let the driver unload the prev pid (#8605) 2025-01-14 15:55:23 +03:00
qazal
97ec564b03 noop changes from the block_assign branch [pr] (#8606) 2025-01-14 07:47:17 -05:00
qazal
5aab2806f0 rename to test_tensor_uop + use upats for asserting [pr] (#8604)
* rename to test_tensor_uop + use upats for asserting [pr]

* fix pr
2025-01-14 05:09:56 -05:00
qazal
863abc7140 scheduling graph_rewrite prereqs for BLOCK in ASSIGN (#8598)
* remove the BUF_LIMIT assert

* skip the base one

* work

* work

* good error

* ok comment

* shorter check
2025-01-14 03:01:59 -05:00
chenyu
05e54f00d3 remove bounds from MultiLazyBuffer.from_sharded [pr] (#8603)
without a custom bound, the bound is uniquely determined by shape and axis
2025-01-13 23:40:05 -05:00
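
A hedged sketch of why the bound is uniquely determined: with even shards, the per-device bounds along the sharded axis follow directly from the dimension size and the number of shards. The helper below is hypothetical, not MultiLazyBuffer's actual code.

```python
# Illustrative sketch: even shards imply the bounds, so no custom bound is needed.
def shard_bounds(dim_size: int, n_shards: int) -> tuple[tuple[int, int], ...]:
  assert dim_size % n_shards == 0, "even shards only"
  step = dim_size // n_shards
  return tuple((i * step, (i + 1) * step) for i in range(n_shards))

print(shard_bounds(8, 4))  # ((0, 2), (2, 4), (4, 6), (6, 8))
```
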
chenyu
d443e91d82 remove custom splits in Tensor.shard [pr] (#8602)
towards even split only
2025-01-13 21:29:13 -05:00
chenyu
227d96d7a3 remove unused src from metaop [pr] (#8601) 2025-01-13 20:28:14 -05:00
chenyu
c4e33048c6 test Tensor.clone has a different lazydata [pr] (#8600) 2025-01-13 20:13:44 -05:00
qazal
ae2229d727 assert kernel buffer limit at compile time [pr] (#8595)
* remove the BUF_LIMIT assert

* skip the base one
2025-01-13 16:32:07 -05:00
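
A sketch of what asserting a buffer limit at compile time can look like; the names (CompileError, max_bufs) are hypothetical and not tinygrad's actual constants or exception types.

```python
# Illustrative sketch: fail at compile time if a kernel references more buffers
# than the backend allows, instead of asserting later at run time.
class CompileError(RuntimeError): pass

def check_buf_limit(bufs: list, max_bufs: int | None) -> None:
  if max_bufs is not None and len(bufs) > max_bufs:
    raise CompileError(f"kernel uses {len(bufs)} buffers, backend limit is {max_bufs}")
```
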
nimlgen
c2504357af am: lock to access dev (#8594)
* am: lock to access dev

* wording

* just works

* disable
2025-01-13 23:53:13 +03:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00