Commit Graph

4433 Commits

Author SHA1 Message Date
George Hotz
98d01a059d rename uopgraph to rewriter [pr] (#8682) 2025-01-19 17:03:12 -08:00
chenyu
2d0842386d fix parse_valid for float uop (#8681)
x < c -> x <= c-1 only works for int (see the sketch below)
2025-01-19 18:15:49 -05:00
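The rule this commit gates, in miniature: for integer x, the predicates x < c and x <= c-1 describe the same set, but floats have values strictly between c-1 and c, so the rewrite is unsound there. In plain Python:

    # sound for ints: nothing lies strictly between c-1 and c
    assert all((x < 4) == (x <= 3) for x in range(-10, 10))

    # unsound for floats: 3.5 < 4.0 holds, but 3.5 <= 3.0 does not
    x = 3.5
    assert (x < 4.0) and not (x <= 3.0)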
George Hotz
168c16646a change create_schedule_with_vars api to big_sink [pr] (#8677) 2025-01-19 13:30:26 -08:00
chenyu
beba490ba8 update mask in scaled_dot_product_attention (#8674)
built the is_causal mask with ones_like starting from boolean, and reversed the mask/-inf order
2025-01-19 15:19:23 -05:00
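A plain-Python sketch of the construction the message describes (illustrative only, not the actual tinygrad Tensor code): start from a boolean lower-triangular "allowed" mask and write -inf into the disallowed scores before softmax.

    import math

    L = 4
    # causal mask: query i may attend to key j only when j <= i
    allowed = [[j <= i for j in range(L)] for i in range(L)]
    scores = [[0.0] * L for _ in range(L)]
    masked = [[s if ok else -math.inf for s, ok in zip(srow, arow)]
              for srow, arow in zip(scores, allowed)]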
chenyu
5842ee56c6 raise if attn_mask is set when is_causal=True in sdpa [pr] (#8675)
matches torch; also fixed incorrect usage in tests
2025-01-19 12:55:04 -05:00
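The guard itself is simple and mirrors torch's semantics: is_causal=True builds its own mask, so also passing attn_mask is ambiguous. In spirit (hypothetical signature, not the exact tinygrad one):

    def scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False):
        # is_causal constructs the causal mask internally; an explicit
        # attn_mask on top of it has no well-defined meaning
        if is_causal and attn_mask is not None:
            raise RuntimeError("attn_mask cannot be set when is_causal=True")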
qazal
2faf8774fe replace DEVICE of CONST after copy folding (#8673) 2025-01-19 11:33:39 -05:00
qazal
d957a4f108 add tests for div buffer collapsing in the scheduler [pr] (#8671)
* add tests for mul/div buffer collapsing in the scheduler [pr]

* lint

* merge with test_linearizer's version of this

* 4*3
2025-01-18 14:15:29 -05:00
ignaciosica
d2234e308a tf32 tc for nv and ptx (#8635)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-17 17:43:57 -08:00
nimlgen
5afb0a4a81 metal: fix transfer profiling (#8659) 2025-01-17 23:47:01 +03:00
George Hotz
8609b880bd hotfix: test_backward_sum 2025-01-17 10:25:02 -08:00
chenyu
f8cc971c3b raise RuntimeError for uneven shards in Tensor.shard [pr] (#8656) 2025-01-17 12:48:39 -05:00
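The condition behind this (and #8593/#8602 below) is just that the sharded axis must divide evenly by the device count; a minimal illustration, not tinygrad's code:

    def check_even_shard(axis_size: int, ndev: int) -> int:
        if axis_size % ndev != 0:
            raise RuntimeError(f"cannot evenly shard {axis_size} across {ndev} devices")
        return axis_size // ndev

    check_even_shard(4096, 8)     # ok: 512 rows per device
    # check_even_shard(4096, 6) raises: 4096 = 6*682 + 4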
mesozoic-egg
3506a7585f upcast overflowed idx to int64 [pr] (#8268)
* use full_shape to determine if index can potentially overflow

* update comment

* use shapetracker to check max index value

* wip

* lint

* handle mask

* upcast to int64 by st is noop on WGSL

* fix comments

* Handle negative overflow, intermediaries overflow, int64 support

handle negative overflow

handle symbolic

wip

handle intermediate values

wip

check if typemap support int64

lint

comment

* add invalid_dtype

lint

* Fix bug on checking mask overflow

wip

wip

* Add more tests, need to resolve partial upcast

test Valid_view_dup

test valid op overflow

refine test cases

clean up

cleanup

wip

refine tests

lint

* Upcast is handled by lower_load_store

upcast as graph_rewrite to backtrack

update test

wip

cleanup

wip

cleanup

do upcast in lower_load_store

lint

* cleanup

* do upcast within lower_load_store and mutate ctx

* do upcast in get_idx and view

revert

lint

* cleanup

* Upcast in vec, const

upcast to const

test case 3

upcast on vector

lint

* simplify idx with symbolic in case of fake overflow

test case4

test case 4

update test

* test case4 is only for metal

* try: upcast inside graph_rewrite instead of shapetracker

wip

* checking overflow can just be done directly on all views, with idxs

* cleanup

* REMOVE hard coded uop test for idx upcast

* refactor

cleanup

refactor

* do actual casting when necessary, instead of rewriting all idx

hard code uop test

new upcast

* check dtype for int64 in webgpu

* cleanup

cleanup

* cleanup

* update tests

cleanup

comment

cleanup

cleanup

* comment

* comment

* update comment

update comment

* refactor

* typo

* keep the scope to only upcasting

* white space

* Revert "white space"

This reverts commit 314d7eb184.

* Revert "keep the scope to only upcasting"

This reverts commit 1ef701dd85.

* sym folding is not necessary

lint1

* fold symbolic

lint

* use symbolic simple when folding shapetracker idx

* full sym folding is required after all...

* Ops.CAST should retain the src min max

* put rewrite to lowerer

wip

* start testing on higher level

wip

test higher level in test_tensor

* find Ops.STORE in list instead of recursively

* check dtype support when upcasting

* remove invalid_dtype

* lint

* fix int64 support checks in upcast

lint

* skipif skipunless

* revert fold to find test case

* Revert "revert fold to find test case"

This reverts commit 225bb6e801.

* test sym folding

* handle ptx

* wip

* wip

* delete hard coded uop test

* lint fixes

* wip

* fix checking for None

* lint

* handle ptx

* comment

* dtype for overflow()

* update skipIf skipUnless

* assert in wgsl renderer for int64

wip

* do folded_upcast in to_indexed_op, real_size uses views_to_indexed_ops

* assert in lowerer for dtype support

lint

* Revert "assert in lowerer for dtype support"

This reverts commit 8e9b1b79bf.

* assert dtype in kernel.py

* Revert "assert dtype in kernel.py"

This reverts commit e29b9a9893.

* wip

* assert in render

* remove old assert

* check dtype from renderer, assert in upcast

wip

* smaller arange for sym fold case

* linearize directly

* use expand directly

* lint

* lint

* rename

* no need to check dtype in device.py

* trigger pr

* remove dtype assert in upcast, make wgpu fail in render

* use DType for type hint instead of dtypes

* assert on KeyError in tests for webgpu backend int64

* use a tuple for src

* test real kernel run

wip

* lint error

* restore

* fix real_size

* update test example

* resolve merge stuff

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2025-01-17 11:52:31 -05:00
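The heart of the change: index math defaults to int32, and the views are checked for whether the worst-case flat index can exceed int32 range; only then is the index expression upcast to int64 (with the renderer asserting on backends like WGSL that lack int64). A minimal sketch of that overflow check (hypothetical helper, not the actual tinygrad code):

    import math

    INT32_MAX = 2**31 - 1

    def needs_int64(shape: tuple) -> bool:
        # worst-case flat index of a contiguous view is prod(shape) - 1
        return math.prod(shape) - 1 > INT32_MAX

    assert not needs_int64((1024, 1024))   # ~1e6 elements, fits in int32
    assert needs_int64((65536, 65536))     # 2**32 elements, overflows int32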
qazal
23f0ff0ed8 add bitcast to multi [pr] (#8652) 2025-01-17 03:17:19 -05:00
qazal
2b7db9b45d delete unused cast/bitcast lines from ops.py [pr] (#8651)
* move cast and bitcast out

* more deletion of bitcast arg

* fix test_bitcast_fuses

* update tests

* work
2025-01-17 03:04:18 -05:00
eliotgolding
0289fbb1c2 limit real_size to the size of first View of ShapeTracker (#8628)
* fix real_size

* add fuzzer; typing

* spacing

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-16 16:27:39 -05:00
qazal
81a84aa85a remove is_unrealized_unmasked_const [pr] (#8644) 2025-01-16 05:27:47 -05:00
qazal
a1f70ce7d0 only use BUFFER_VIEW in disk [pr] (#8629)
* only use BUFFER_VIEW in disk [pr]

* delete can_view

* BUFFER_VIEW op on DISK

* remove that allow_buffer_view=False

* notes

* bitcast is a low-level op too

* this passes on AMD and LLVM
2025-01-15 12:34:15 -05:00
qazal
6193e279d4 isolate simple failing test for subbuffer on CONST [pr] (#8630)
* simple failing test for subbuffer on CONST [pr]

* add view_supported_devices check
2025-01-15 05:45:03 -05:00
George Hotz
504ad08e73 hotfix: add test_example_matmul_same 2025-01-14 19:03:17 -08:00
George Hotz
f29d6f54b8 support multilb gradient [pr] (#8624) 2025-01-14 18:33:33 -08:00
chenyu
0790d8059f remove MultiLazyBuffer.from_sharded [pr] (#8620)
it's equivalent to taking the lazydata from Tensor.split, then copying to devices
2025-01-14 18:00:49 -05:00
George Hotz
c85737c200 assert to prepare for grad uop [pr] (#8280)
* assert to prepare for grad uop [pr]

* fix test_nn

* fix most of test_tensor

* few more tests

* fix multi

* uniform gradient

* acc_dtype

* any for multi

* fix typing

* fix assert, CAST_BEFORE_VIEW is still the issue

* explicit test for CAST_BEFORE_VIEW

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 13:26:56 -08:00
George Hotz
fdd46c9f28 delete view instant rule (#8616)
* remove cast before view

* greener

* indexing

* delete view instant rule

* that passes too

* openpilot too

* ack

* base on cast_before_view

* add it as a rewrite rule

* VIEW(DEVICE) is also fine

* test_shard_memory depends on forced_realize removal

* put that back, will go soon

* UOp representations change once we don't instantly fold things

* do not duplicate tests

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 16:15:13 -05:00
qazal
dddd4e5f9f hotfix: remove duplicate TestTensorMutates [pr] (#8619)
* hotfix: remove duplicate TestTensorMutates [pr]

* imports
2025-01-14 16:03:17 -05:00
George Hotz
bfbe81df71 remove cast before view (#8613)
* remove cast before view

* greener

* indexing

* that passes too

* openpilot too

* ack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201 raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs

skip 70B
2025-01-14 14:51:48 -05:00
chenyu
52e7003414 Revert "make kits19 dataset samples have small sizes (#8591)" (#8610)
This reverts commit 76a03e950a.
2025-01-14 12:24:27 -05:00
Francis Lata
76a03e950a make kits19 dataset samples have small sizes (#8591) 2025-01-14 08:27:45 -08:00
qazal
5aab2806f0 rename to test_tensor_uop + use upats for asserting [pr] (#8604)
* rename to test_tensor_uop + use upats for asserting [pr]

* fix pr
2025-01-14 05:09:56 -05:00
qazal
863abc7140 scheduling graph_rewrite prereqs for BLOCK in ASSIGN (#8598)
* remove the BUF_LIMIT assert

* skip the base one

* work

* work

* good error

* ok comment

* shorter check
2025-01-14 03:01:59 -05:00
chenyu
d443e91d82 remove custom splits in Tensor.shard [pr] (#8602)
towards even split only
2025-01-13 21:29:13 -05:00
chenyu
c4e33048c6 test Tensor.clone has a different lazydata [pr] (#8600) 2025-01-13 20:13:44 -05:00
qazal
ae2229d727 assert kernel buffer limit at compile time [pr] (#8595)
* remove the BUF_LIMIT assert

* skip the base one
2025-01-13 16:32:07 -05:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00
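For context, ONNX's quantized operators use affine scale/zero-point quantization, real ≈ (q - zero_point) * scale, with integer accumulation, which is presumably what "make sure qgemm comes out as int" refers to. A round trip of the standard scheme (not this commit's code):

    scale, zero_point = 0.1, 128
    x = 3.4
    q = round(x / scale) + zero_point      # quantize: q == 162
    x_hat = (q - zero_point) * scale       # dequantize
    assert abs(x_hat - x) <= scale / 2     # error bounded by half a quantization step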
George Hotz
d19c1c7f03 bump 75 -> 73 for test failure 2025-01-13 09:18:38 -08:00
nimlgen
d224d0ed7f nv: fix fault info (#8587)
* nv: fix fault info

* and emu for amd

* skip if not mock
2025-01-13 14:38:43 +03:00
qazal
586e730d32 use UOp.st for kernel reduce axes (#8499)
* use UOp.st for kernel reduce axes [pr]

* do not return dict
2025-01-13 06:24:11 -05:00
qazal
7562cc0399 better test for reduce swizzle + don't use double dtype [pr] (#8586)
* better test_permute_rewrite

* use float32
2025-01-13 05:02:21 -05:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
George Hotz
d817dc10db start on test rewrite map [pr] (#8432)
* start on test rewrite map [pr]

* chatgpt writes dumb tests

* comment out failing

* fix that test

* fix gc issue

* oh, frame 2

* remove uop mutability

* map is only the map

* simplier + more tests

* test tiny passes

* tests that need to pass

* parent test passes

* child test passes

* remove uop mutability [pr]

* test fixups

* most tests pass

* more tests pass

* lil test fixups

* them too

* fix test

* unneeded

* err, that

* fix test_hcq

* fix test failures

* fix that test

* tensor universe

* does this pass test

* Revert "does this pass test"

This reverts commit ed516b3169.

* Revert "tensor universe"

This reverts commit c21301852a.

* test_mutate_add passes

* this can pass

* Revert "Merge remote-tracking branch 'origin/no_uop_mutability' into test_rewrite_map"

This reverts commit 657822dcdc, reversing
changes made to 2a126c145b.

* Revert "test_mutate_add passes"

This reverts commit ab4fc4c78e.

* correct enough

* remove test_rewrite_map_schedule.py

* viz

* uops are immutable

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-12 13:13:51 -05:00
qazal
cde18fddce fix DEBUG=2 output for copy runners [pr] (#8579)
* fix DEBUG=2 output for copy runners [pr]

* itemsize is constant
2025-01-12 12:03:01 -05:00
eliotgolding
867004fbeb use unravel in views_to_indexed_uops [pr] (#8560)
* use unravel in shape

* make process replay work

* earlier View.minify()

* fix

* fix tests

* mypy

* get rid of early minify

* fix

* linter

* clean and add test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-12 10:25:55 -05:00
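Unravelling converts a flat index into per-dimension coordinates by repeated divmod. A minimal reference version (not tinygrad's implementation):

    def unravel(idx: int, shape: tuple) -> tuple:
        coords = []
        for dim in reversed(shape):
            idx, c = divmod(idx, dim)
            coords.append(c)
        return tuple(reversed(coords))

    assert unravel(5, (2, 3)) == (1, 2)    # row-major: 5 == 1*3 + 2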
nimlgen
38b5ac4d4a mypy for mockgpu/cuda & dsp/run (#8575) 2025-01-12 18:25:39 +03:00
qazal
ae241e96db fix half4 on qcom and gpu (#8573)
* add test_setitem_half

* this fixes comma benchmark
2025-01-12 06:23:05 -05:00
qazal
cff1ee9038 add SINK folding from the tensor_map branch [pr] (#8562)
* delete is_constant from the scheduler

* add sink folding

* always give BUFFER uops Buffers [pr]

* spec for view, var (bind) and const

* add test_buffer_only_after_realize

* work

* 3 lines

* more work
2025-01-12 03:39:34 -05:00
qazal
87cbff3ac0 always give BUFFER uops Buffers [pr] (#8572)
* always give BUFFER uops Buffers [pr]

* add test_buffer_only_after_realize
2025-01-11 23:17:09 +02:00
qazal
79738d768c do not require PYTHONPATH=. for process replay [pr] (#8567) 2025-01-11 09:45:34 -05:00
qazal
a70d1bf439 move print_diff to process replay [pr] (#8566)
* move print_diff to process replay [pr]

* ruff rightfully complains
2025-01-11 09:28:45 -05:00
qazal
60503c8621 use CAPTURE_PROCESS_REPLAY=1 in CI [pr] (#8564) 2025-01-11 06:03:48 -05:00
chenyu
d09897c2aa allow double copy [pr] (#8559)
fixed the ring allreduce pattern and recovered most of the bert step time regression (10% faster); will double-check all benchmarks (see the toy sketch below)
2025-01-10 18:21:01 -05:00
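Ring allreduce moves partial results device-to-device around a ring, so its schedule naturally chains copies back to back; that chained device-to-device transfer is the double-copy pattern this change allows. A toy reduce-scatter over plain lists standing in for device buffers (illustrative only, not tinygrad's scheduler):

    n = 4
    bufs = [[float(d)] * n for d in range(n)]    # bufs[d][c]: chunk c on device d
    for step in range(n - 1):
        for d in range(n):                       # each device sends one chunk right
            c = (d - step) % n
            bufs[(d + 1) % n][c] += bufs[d][c]   # copy chunk to neighbor and reduce
    total = float(sum(range(n)))
    for c in range(n):                           # device (c-1) % n now owns chunk c
        assert bufs[(c - 1) % n][c] == total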