7531 Commits

Author SHA1 Message Date
uuuvn
7ecced7f6d LLVM JIT prereqs (#8634)
* LLVM JIT prereqs

This commit moves jit loading, disassembling and CPUProgram logic from
`ops_clang.py` to `elf.py`, `helpers.py` and `device.py` respectively

I don't quite like the `helpers.py` destination for capstone_flatdump
but this is where cpu_objdump is so presumably this is how it's supposed
to be

* Types
2025-01-15 09:47:08 -08:00
qazal
a1f70ce7d0 only use BUFFER_VIEW in disk [pr] (#8629)
* only use BUFFER_VIEW in disk [pr]

* delete can_view

* BUFFER_VIEW op on DISK

* remove that allow_buffer_view=False

* notes

* bitcast is a low-level op too

* this passes on AMD and LLVM
2025-01-15 12:34:15 -05:00
ignaciosica
bae20e5043 Generic PTX wmma rendering [pr] (#8632)
* make wmma rendering dtype size generic

* use var instead of calculating multiple times

* compact rendering
2025-01-15 09:31:48 -08:00
qazal
6193e279d4 isolate simple failing test for subbuffer on CONST [pr] (#8630)
* simple failing test for subbuffer on CONST [pr]

* add view_supported_devices check
2025-01-15 05:45:03 -05:00
George Hotz
e1f7c90459 gradient is a set [pr] (#8626)
* gradient is a set [pr]

* typing for deepwalk
2025-01-14 20:48:23 -08:00
chenyu
7fb1c7af61 minor multi cleanups [pr] (#8625) 2025-01-14 22:25:23 -05:00
George Hotz
504ad08e73 hotfix: add test_example_matmul_same 2025-01-14 19:03:17 -08:00
George Hotz
f29d6f54b8 support multilb gradient [pr] (#8624) 2025-01-14 18:33:33 -08:00
chenyu
4ee3243c93 JITBEAM=2 for LLaMA-3 8B on 4 GPUs [pr] (#8623)
is it fast?
2025-01-14 19:52:38 -05:00
chenyu
7860a80801 simpler MultiLazyBuffer alu [pr] (#8622) 2025-01-14 19:19:13 -05:00
chenyu
930728c069 bert BS 72->66 [pr] (#8621)
72 does not fit now
2025-01-14 18:41:41 -05:00
chenyu
0790d8059f remove MultiLazyBuffer.from_sharded [pr] (#8620)
it's equivalent to taking the lazydata from Tensor.split, then copying to devices
2025-01-14 18:00:49 -05:00
George Hotz
c85737c200 assert to prepare for grad uop [pr] (#8280)
* assert to prepare for grad uop [pr]

* fix test_nn

* fix most of test_tensor

* few more tests

* fix multi

* uniform gradient

* acc_dtype

* any for multi

* fix typing

* fix assert, CAST_BEFORE_VIEW is still the issue

* explicit test for CAST_BEFORE_VIEW

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 13:26:56 -08:00
George Hotz
fdd46c9f28 delete view instant rule (#8616)
* remove cast before view

* greener

* indexing

* delete view instant rule

* that passes too

* openpilot too

* ack

* base on cast_before_view

* add it as a rewrite rule

* VIEW(DEVICE) is also fine

* test_shard_memory depends on forced_realize removal

* put that back, will go soon

* UOp representations change once we don't instantly fold things

* do not duplicate tests

---------

Co-authored-by: qazal <qazal.software@gmail.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 16:15:13 -05:00
qazal
dddd4e5f9f hotfix: remove duplicate TestTensorMutates [pr] (#8619)
* hotfix: remove duplicate TestTensorMutates [pr]

* imports
2025-01-14 16:03:17 -05:00
nimlgen
c5782e85d2 tlsf: optimize alloc (#8608) 2025-01-14 23:48:07 +03:00
George Hotz
bfbe81df71 remove cast before view (#8613)
* remove cast before view

* greener

* indexing

* that passes too

* openpilot too

* ack

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-14 15:04:58 -05:00
chenyu
393eec3201 raise RuntimeError for uneven shard [pr] (#8593)
no 7B llama on 6 GPUs

skip 70B
2025-01-14 14:51:48 -05:00
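
To illustrate the commit above, here is a minimal sketch of the kind of check it describes, not the project's actual code. The function name `even_shard_size` and its parameters are hypothetical; the only detail taken from the commit message is that an uneven shard raises `RuntimeError`.

```python
def even_shard_size(dim, n_devices):
    """Return the per-device size when `dim` splits evenly across devices.

    Hypothetical helper: raises RuntimeError for an uneven split,
    mirroring the behavior named in the commit title.
    """
    if dim % n_devices != 0:
        raise RuntimeError(
            f"cannot shard a dimension of size {dim} evenly across {n_devices} devices"
        )
    return dim // n_devices


# An even split succeeds; a size that does not divide by the device count raises.
print(even_shard_size(4096, 4))
try:
    even_shard_size(4096, 6)
except RuntimeError as e:
    print("error:", e)
```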
ignaciosica
d5a646d492 CUDA Turing TC (#8597)
* init turing tc

* reorder tc

* hotfix: remove some spaces

* revert var name to x

* consistent order of factors

* revert order of terms to match old stuff

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-01-14 10:35:14 -08:00
chenyu
cbfd51f5a5 make MultiLazyBuffer.bounds a property [pr] (#8614)
determined by lbs shapes and axis
2025-01-14 13:25:54 -05:00
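
The commit above notes that the bounds are determined by the shard shapes and the axis. One way to picture this, as a sketch only and assuming bounds are (start, end) offsets along the sharded axis, is a running sum of each shard's size on that axis. The name `bounds_from_shapes` is hypothetical and not taken from the repository.

```python
from itertools import accumulate


def bounds_from_shapes(shapes, axis):
    """Compute (start, end) offsets along `axis` for a list of shard shapes.

    Hypothetical helper: each shard covers the half-open range [start, end)
    along the sharded axis, obtained from a cumulative sum of shard sizes.
    """
    sizes = [shape[axis] for shape in shapes]
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return tuple(zip(starts, ends))


# Three shards of shape (2, 4) sharded on axis 0.
print(bounds_from_shapes([(2, 4), (2, 4), (2, 4)], axis=0))
```

Because the result depends only on the shapes and the axis, it can be recomputed on demand instead of being stored, which is consistent with making it a property.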
chenyu
52e7003414 Revert "make kits19 dataset samples have small sizes (#8591)" (#8610)
This reverts commit 76a03e950a.
2025-01-14 12:24:27 -05:00
Francis Lata
76a03e950a make kits19 dataset samples have small sizes (#8591) 2025-01-14 08:27:45 -08:00
ignaciosica
4057b98f7f rename i and j into k and row/col (#8607) 2025-01-14 08:27:05 -08:00
nimlgen
1ff6862a3d ci: sleep a bit to let the driver unload the prev pid (#8605) 2025-01-14 15:55:23 +03:00
qazal
97ec564b03 noop changes from the block_assign branch [pr] (#8606) 2025-01-14 07:47:17 -05:00
qazal
5aab2806f0 rename to test_tensor_uop + use upats for asserting [pr] (#8604)
* rename to test_tensor_uop + use upats for asserting [pr]

* fix pr
2025-01-14 05:09:56 -05:00
qazal
863abc7140 scheduling graph_rewrite prereqs for BLOCK in ASSIGN (#8598)
* remove the BUF_LIMIT assert

* skip the base one

* work

* work

* good error

* ok comment

* shorter check
2025-01-14 03:01:59 -05:00
chenyu
05e54f00d3 remove bounds from MultiLazyBuffer.from_sharded [pr] (#8603)
without a custom bound, the bound is uniquely determined by shape and axis
2025-01-13 23:40:05 -05:00
chenyu
d443e91d82 remove custom splits in Tensor.shard [pr] (#8602)
towards even split only
2025-01-13 21:29:13 -05:00
chenyu
227d96d7a3 remove unused src from metaop [pr] (#8601) 2025-01-13 20:28:14 -05:00
chenyu
c4e33048c6 test Tensor.clone has a different lazydata [pr] (#8600) 2025-01-13 20:13:44 -05:00
qazal
ae2229d727 assert kernel buffer limit at compile time [pr] (#8595)
* remove the BUF_LIMIT assert

* skip the base one
2025-01-13 16:32:07 -05:00
nimlgen
c2504357af am: lock to access dev (#8594)
* am lock to access dev

* wording

* just works

* disable
2025-01-13 23:53:13 +03:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00
George Hotz
d19c1c7f03 bump 75 -> 73 for test failure 2025-01-13 09:18:38 -08:00
Francis Lata
c25d5d3101 improve isin checks (#8589) 2025-01-13 12:12:31 -05:00
nimlgen
74b83c4c41 am in ci (#8532)
* try am in ci

* no sudo

* temp

* run more am test

* run half on am

* insert amdgpu

* other machine as well
2025-01-13 19:55:17 +03:00
nimlgen
d224d0ed7f nv: fix fault info (#8587)
* nv: fix fault info

* and emu for amd

* skip if not mock
2025-01-13 14:38:43 +03:00
qazal
586e730d32 use UOp.st for kernel reduce axes (#8499)
* use UOp.st for kernel reduce axes [pr]

* do not return dict
2025-01-13 06:24:11 -05:00
qazal
7562cc0399 better test for reduce swizzle + don't use double dtype [pr] (#8586)
* better test_permute_rewrite

* use float32
2025-01-13 05:02:21 -05:00
George Hotz
df59b072db rename to top_down_rewrite [pr] (#8583) 2025-01-12 18:36:38 -08:00
chenyu
994944920b simpler batch_load_train_bert [pr] (#8582)
don't think that buffer is really beneficial. 5% faster data_time and 1ms faster per step.
https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview
2025-01-12 20:25:05 -05:00
George Hotz
05e5de6a91 ugh, remove that binary blob 2025-01-12 17:02:28 -08:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
George Hotz
d817dc10db start on test rewrite map [pr] (#8432)
* start on test rewrite map [pr]

* chatgpt writes dumb tests

* comment out failing

* fix that test

* fix gc issue

* oh, frame 2

* remove uop mutability

* map is only the map

* simpler + more tests

* test tiny passes

* tests that need to pass

* parent test passes

* child test passes

* remove uop mutability [pr]

* test fixups

* most tests pass

* more tests pass

* lil test fixups

* them too

* fix test

* unneeded

* err, that

* fix test_hcq

* fix test failures

* fix that test

* tensor universe

* does this pass test

* Revert "does this pass test"

This reverts commit ed516b3169.

* Revert "tensor universe"

This reverts commit c21301852a.

* test_mutate_add passes

* this can pass

* Revert "Merge remote-tracking branch 'origin/no_uop_mutability' into test_rewrite_map"

This reverts commit 657822dcdc, reversing
changes made to 2a126c145b.

* Revert "test_mutate_add passes"

This reverts commit ab4fc4c78e.

* correct enough

* remove test_rewrite_map_schedule.py

* viz

* uops are immutable

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2025-01-12 13:13:51 -05:00
qazal
2f71a00236 remove PYTHONPATH=. from mypy ci [pr] (#8578) 2025-01-12 09:52:03 -08:00
qazal
cde18fddce fix DEBUG=2 output for copy runners [pr] (#8579)
* fix DEBUG=2 output for copy runners [pr]

* itemsize is constant
2025-01-12 12:03:01 -05:00
eliotgolding
867004fbeb use unravel in views_to_indexed_uops [pr] (#8560)
* use unravel in shape

* make process replay work

* earlier View.minify()

* fix

* fix tests

* mypy

* get rid of early minify

* fix

* linter

* clean and add test

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-12 10:25:55 -05:00
nimlgen
38b5ac4d4a mypy for mockgpu/cuda & dsp/run (#8575) 2025-01-12 18:25:39 +03:00
chenyu
def90b22f6 EVAL_BS=36 for bert [pr] (#8576)
3X faster eval compared to BS=6.
green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview
red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview
2025-01-12 09:43:56 -05:00