Commit Graph

2555 Commits

qazal
b6aaf12df7 Internal cast 2 with more tests (#2257)
* Change linearizer to parse CAST

* Oneliner renders for cstyle and triton

* LLVM cast and ALU implementation

* pylint fixes

* cast in gep

* remove printbufs

* use cast for post-load ops

* get rid of parse_cast

* partially supported vectorized dtypes for initial dev

* render phi as the dtype

* Revert "partially supported vectorized dtypes for initial dev"

This reverts commit 1bf1a818a3.

* Revert "render phi as the dtype"

This reverts commit d08cb270b4.

* reenable triton tests

* no vstore_half if dtype is already half

* upcast max
2023-11-10 10:42:39 -08:00
chenyu
75f6e9ab54 one more fuzz linearizer failed example (#2260) 2023-11-10 09:17:37 -05:00
George Hotz
330484c072 Revert "Internal casting support (#2046)" (#2256)
This reverts commit 7e1d08b2ae.
2023-11-09 21:27:13 -08:00
qazal
7e1d08b2ae Internal casting support (#2046)
* Change linearizer to parse CAST

* Oneliner renders for cstyle and triton

* LLVM cast and ALU implementation

* pylint fixes

* cast in gep

* remove printbufs

* use cast for post-load ops

* get rid of parse_cast

* partially supported vectorized dtypes for initial dev

* render phi as the dtype

* Revert "partially supported vectorized dtypes for initial dev"

This reverts commit 1bf1a818a3.

* Revert "render phi as the dtype"

This reverts commit d08cb270b4.

* reenable triton tests

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-09 21:02:32 -08:00
vish-pr
6051f0ce82 For cuda get current free space from device, and retry alloc failures (#2197)
* For cuda get current free space from device, and retry alloc failures

* type ignore for mypy

* add init to get free mem in cuda

* Move retry logic in common lib.

Fix typo in override _get_cur_free_space

* linter error fix in test file

* Don't catch all exceptions, as that would also catch KeyboardInterrupt

* fix unintended line changes
2023-11-09 15:53:50 -08:00
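The retry-on-allocation-failure pattern this commit describes is generic: attempt the allocation, free what you can, and try again, catching only the allocation error so KeyboardInterrupt still propagates (the point of one fix above). A minimal sketch of the idea; `alloc_fn` is a placeholder callback, not tinygrad's allocator API:

```python
import gc

def alloc_with_retry(alloc_fn, size: int, retries: int = 2):
  # Catch a specific allocation error rather than everything, so that
  # KeyboardInterrupt is not swallowed.
  for attempt in range(retries + 1):
    try:
      return alloc_fn(size)
    except MemoryError:
      if attempt == retries: raise
      gc.collect()  # stand-in for releasing cached buffers before retrying
```
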
qazal
2465d5d267 fix ops tests in test_dtype (#2237)
* fix test ops

* decompose the err from test_ops

* skipTest skips the entire test, we don't want that

* handle cases with the same priority

* add int16 to torch map
2023-11-09 15:17:43 -08:00
George Hotz
80bf0b8586 proper wmma (#2245)
* proper wmma

* hip cast

* bugfixes

* bugfix

* that bug is fixed

---------

Co-authored-by: George Hotz <george@tinygrad.org>
2023-11-09 15:15:18 -08:00
chenyu
10d642e174 fuzz linearizer transformation (#2188)
* fuzz linearizer transformation

* no standard normal for fp16

* work

* Interpreted start

* CPU and TORCH work

* fix MemBuffer with same idx

* id for failed kernels

* no image and variable for Interpreted

* symbolic shape

* IMAGE only for GPU

* Interpreted almost all good

* cleanup

* fix bufs_from_lin

* zero size

* some failed examples

* just Exception

* just test that it doesn't pass
2023-11-09 08:03:27 -08:00
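The fuzzer's core idea is property-based: run a kernel both untransformed and with a random transformation applied, compare the outputs, and record the seed of any mismatch so the failure can be replayed. A generic sketch of that loop; `baseline_fn` and `transform_fns` are placeholders, not tinygrad's fuzz_linearizer interface:

```python
import random
import numpy as np

def fuzz_compare(make_inputs, baseline_fn, transform_fns, seeds=range(20)):
  failed = []
  for seed in seeds:
    random.seed(seed)
    inputs = make_inputs()
    want = baseline_fn(*inputs)                  # untransformed reference
    got = random.choice(transform_fns)(*inputs)  # randomly transformed run
    if not np.allclose(want, got, rtol=1e-4, atol=1e-4):
      failed.append(seed)                        # replayable failure id
  return failed
```
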
George Hotz
38b7f5a7fd less phi, proper phi (#2241)
* less phi, proper phi

* disable flaky whisper test
2023-11-08 16:13:43 -08:00
wozeparrot
4c44d1344b feat: remove cache_id (#2236) 2023-11-08 08:09:21 -08:00
George Hotz
c0a033f01d remove real_offset (#2234)
* remove real_offset

* pass in numnode

* remove that real_offset

* sample only for variable
2023-11-07 17:30:53 -08:00
nimlgen
ae5d1407ee Fix mmaped in jit (#2225)
* fix reuse for mmaped buffers in jit

* comment
2023-11-06 14:54:21 -08:00
George Hotz
2f7aab3d13 move optimize_local_size (#2221)
* move optimize_local_size

* interpret_ast
2023-11-05 21:00:52 -08:00
George Hotz
c60c3b467a clean up symlinking in benchmark (#2219)
* clean up symlinking

* make torch deterministic
2023-11-05 16:46:05 -08:00
George Hotz
baeb77a403 Make the JIT simple (no batch exec, no cache collector) (#2215)
* remove batch exec

* simple cachecollector

* remove cache collector test

* less lr
2023-11-05 16:23:43 -08:00
chenyu
719a97b337 fix IMAGE=2 failed with NOOPT=1 (#2209)
* IMAGE=2 failed with NOOPT=1

* fix it
2023-11-05 13:16:37 -08:00
chenyu
680cbfdba4 less broken limit_dims_to_max (#2214) 2023-11-04 08:38:06 -07:00
chenyu
f582ec56d5 Replace (getenv("CI", "") != "") with helpers.CI (#2213) 2023-11-03 15:20:44 -07:00
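Hoisting the env check into a module-level constant means it is evaluated once at import time instead of at every call site; `helpers.CI` presumably amounts to something like this one-liner (a sketch, not necessarily the exact definition):

```python
import os

CI = os.getenv("CI", "") != ""  # evaluated once; call sites just read helpers.CI
```
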
George Hotz
f17bc16f46 simple runtime args (#2211)
* simple runtime args

* fix some tests

* fix abstractions and triton

* fix search
2023-11-03 12:31:29 -07:00
George Hotz
03cf0afa4f move all to compile api (#2203)
* move metal+clang to compile api

* all to the new style

* remove binary arg

* fix triton

* fixup tests

* fix clang

* diskcache is generic

* __wrapped__

* compile_gpu

* fix thneed

* keep the src in the ASTRunner

* lib

* move compile_gpu

* compile_gpu in device

* put compiler in astrunner

* test reverts

* triton compiler

* ugh, that too
2023-11-01 23:01:32 -07:00
George Hotz
7103b716c4 merge kernel and optimizer (#2200)
* merge kernel and optimizer

* linearize is reentrant

* move global/local size

* clean up linearizer copy

* remove unneeded lin copies

* stop linearizing twice

* oops, that should be None
2023-11-01 15:20:01 -07:00
George Hotz
8ba7ced7f9 extract const if it's const (#2193)
* extract const if it's const

* fix if statement

* fast math issue

* fix graphing and casting

* disable flaky copyout test
2023-10-31 18:52:35 -07:00
George Hotz
b245f1307e add exp2 (#2192) 2023-10-31 17:48:42 -07:00
qazal
e2428b63a6 external (#2191) 2023-10-31 13:57:24 -07:00
nimlgen
8c07c73a9b Fix cl map buffer (#2190)
* fix gpu enqueue_map_buffer out of space

* add test
2023-10-31 12:02:46 -07:00
qazal
be5f185ac0 Higher test coverage for dtypes (#2156)
* refactor unit tests for dtypes

* add missing dtypes in llvmir.py and lib.py

* skip torch tests

* webgpu

* cleaner skips

* fix llvm bool casting issue using compare

* llvm 100% passing

* llvm segfault

* TEMP decrease timeout mins to 11

debug

* add bf16 to setup

* skip half tests in cuda cpu

* check for CUDACPU instead

* add int16 to triton dtypes

* u16 for triton

* remove debug - diff is still hard to read

* derive from base class TestDType

* enhance test_upcast and downcast by running on every possible version

* dummy commit to rerun the flaky test

* skip the correct tests for CUDA

* bf16 should be skipped in the common TestDType cases

* re-enable bf16

* more consistent structure

* tiny changes to is_dtype_supported 1

* tiny changes 2

add reason

* fuzz

* fuzzer p2

* run fp32 twice

* remove duplicate fp32 run

* clang: use stdbool

* skip triton on bool casts

* merge and resolve conflicts
2023-10-30 22:38:42 -07:00
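The `is_dtype_supported` helper mentioned above centralizes per-backend skips, so each dtype test asks one question instead of duplicating device checks. A hedged sketch of the shape of such a helper; the policy table is illustrative, not tinygrad's actual support matrix:

```python
import unittest

def is_dtype_supported_sketch(device: str, dtype: str) -> bool:
  # illustrative table only; the real helper inspects the actual backend
  unsupported = {"WEBGPU": {"float64", "float16"}, "LLVM": {"bfloat16"}}
  return dtype not in unsupported.get(device, set())

class TestHalfSketch(unittest.TestCase):
  def setUp(self):
    if not is_dtype_supported_sketch("WEBGPU", "float16"):
      self.skipTest("float16 not supported on this device")
```
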
Akshay Kashyap
018bd29e37 Enable Multi-Output Export (#2179)
* Enable Multi-Output Export

* Add test

* Update examples and lint

* fix padding

* test ops

* dummy commit to rerun test

* revert cuda lint

* Enforce tuple/list of tensors

* subscripted generics

* put back webgpu test

* Re-enable WebGPU Efficientnet test
2023-10-30 18:42:26 -07:00
qazal
a7439af786 Fix llvm int->bool cast (#2164)
* add to ir

* add test case

* minimize diff

* todo

* enable fast math

* added both False and True case
2023-10-30 15:28:23 -07:00
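Background for this fix: truncating an integer to LLVM's `i1` keeps only the lowest bit, so 2 would cast to False; the correct int-to-bool cast is a compare against zero. A minimal sketch with llvmlite (the library tinygrad's LLVM backend is built on), not the actual llvmir.py code:

```python
from llvmlite import ir

i32, i1 = ir.IntType(32), ir.IntType(1)
fn = ir.Function(ir.Module(), ir.FunctionType(i1, [i32]), name="int_to_bool")
bld = ir.IRBuilder(fn.append_basic_block())
# wrong: bld.trunc(fn.args[0], i1) keeps only bit 0, so 2 casts to False
bld.ret(bld.icmp_unsigned("!=", fn.args[0], ir.Constant(i32, 0)))  # right: x != 0
```
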
chenyu
3c88af5071 use unique table name for each disk_cache test (#2184) 2023-10-30 13:49:49 -07:00
George Hotz
194e4ad6f8 Revert "optimizer: simplify GROUP and LOCAL to have one of each (#2162)" (#2182)
This reverts commit 8cf0bb9351.
2023-10-30 10:22:26 -07:00
Francis Lam
8cf0bb9351 optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in the matvec hand-coded opts, and it doesn't
make a performance difference there, so we switch to using only the top
behavior.

Also adds additional asserts to prevent tensor core dims from being
altered, which would cause bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
George Hotz
e0201922e3 Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
chenyu
9215bccb41 Tensor.uniform set default to standard uniform (#2158)
* Tensor.uniform set default to standard uniform

* clean up test to reuse function
2023-10-27 16:15:30 -04:00
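After this change, `Tensor.uniform` with no bounds samples from the standard uniform distribution U[0, 1); usage assuming the post-change defaults:

```python
from tinygrad.tensor import Tensor

t = Tensor.uniform(2, 3)                      # standard uniform: low=0.0, high=1.0
u = Tensor.uniform(2, 3, low=-1.0, high=1.0)  # explicit bounds still available
```
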
Roelof van Dijk
36ab04ae35 perf: lazyop as dataclass (#1603)
* perf: lazyop as dataclass

fix: linter

fix: restore eq

* use builtin methods, buffers to property to allow freezing

* fix: reduce diff

* fix: can't freeze due to KOPT tests, mypy

* fix: explicit hash

* can freeze if tests are fixed

* fix: typo

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 17:54:30 -04:00
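The pattern landed here is the standard one for making graph nodes value-comparable and usable as dict keys: a dataclass with `eq=True` plus an explicit `__hash__` (since `eq=True` without `frozen=True` sets `__hash__` to None). A simplified sketch, not the real LazyOp field layout:

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(eq=True)
class LazyOpSketch:
  op: str
  src: Tuple[Any, ...] = ()
  arg: Any = None
  # eq=True alone would set __hash__ to None; defining it explicitly
  # (or using frozen=True, as the commit discusses) restores hashability
  def __hash__(self): return hash((self.op, self.src, self.arg))

cache = {LazyOpSketch("ADD"): "kernel"}  # usable as a dict key for deduping
```
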
Francis Lam
bf3490cdf9 wmma: refactor tensor cores using existing local dims (#2097)
* wmma: refactor tensor cores using existing local dims

* optimizer: fix bad rebase and break after one late local

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 13:10:46 -04:00
wozeparrot
c29653605e hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen
2e89fd264f Refactor hipgraph (#2141)
* refactor hip graph

* linter happy

* happy linter
2023-10-24 15:45:56 -04:00
George Hotz
cea2bc7964 Add dictionary keys to reduce db size (#2131)
* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood
2023-10-24 10:49:22 -04:00
imaolo
6ee0435263 added from unaligned np test (#2134) 2023-10-23 11:38:57 -04:00
George Hotz
6dc8eb5bfd universal disk cache (#2130)
* caching infra for tinygrad

* non-str key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
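The disk-cache idea is a memoizing decorator whose store survives across processes: hash the function name and arguments into a key, and keep pickled results in a small database. A minimal sketch of that mechanism; the sqlite schema and path are illustrative, not tinygrad's actual cache layout:

```python
import functools, hashlib, pickle, sqlite3

_conn = sqlite3.connect("/tmp/cache_sketch.db")
_conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, val BLOB)")

def diskcache_sketch(fn):
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    key = hashlib.sha256(pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
    row = _conn.execute("SELECT val FROM cache WHERE key=?", (key,)).fetchone()
    if row is not None: return pickle.loads(row[0])  # disk hit
    val = fn(*args, **kwargs)                        # miss: compute and store
    _conn.execute("INSERT OR REPLACE INTO cache VALUES (?,?)", (key, pickle.dumps(val)))
    _conn.commit()
    return val
  return wrapper
```
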
Francis Lam
ace6b2a151 optimizer: add test for correctness of opts (#2124)
* optimizer: add test for correctness of opts

Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.

* llvm: fix LinearizerOptions to correctly report has_shared=False

* optimizer: remove premature test scaffold for TC opts

* search: fix the action space
2023-10-22 08:02:22 -07:00
George Hotz
cb508e6923 uops graphing + phi (#2120)
* uops graphing

* add_phi_node

* less phi nodes

* where graph uops should live

* naming

* move it to external

* fix triton yolo

* fix clang and preserve behavior
2023-10-19 22:26:28 -07:00
20kdc
bedd028061 waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117) 2023-10-19 22:07:15 -07:00
qazal
36d4001b4f add test coverage for search (#2104)
* add test coverage for search

* only in compiled backends

* dont use device.default in decorator

* time_til is the other way around xd
2023-10-19 17:06:47 -07:00
David Hou
95e17ff0d4 fix wino mask upcast calculation (#2057)
* fix wino mask upcast calculation

* add tests for wino upcast hcopt

* add info to note

* real world wino hcopt test

* wino backward test

* whitespace
2023-10-18 16:54:48 -07:00
George Hotz
87b714b8cb split test_conv2d 2023-10-18 14:00:50 -07:00
George Hotz
15da96f393 print test durations and add speed (#2107)
* print test durations

* decrease sizes to increase speed

* faster

* GPU/CLANG onnx in separate runner

* test split, move ONNX CPU CI

* simpler tests

* simpler uops test

* faster

* less cuda apt

* running ninja install

* apt install

* split fancy indexing
2023-10-18 13:46:42 -07:00
George Hotz
881fd7c141 add mops to graph, refactor IMAGE (#2100)
* add mops to graph, refactor IMAGE

* no reshape pushing

* add todo

* fix openpilot model alt

* push reshapes reduces kernels in new op

* IMAGE=2 is a first class citizen now
2023-10-17 21:27:51 -07:00
Umut Zengin
01b98b7f42 MulNode.__lt__ rule (#2086)
* Added the rule

* Added tests

* flake8

* self.b == -1 shortcut
2023-10-17 13:18:35 -07:00
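The rule being added: for integer x and positive b, `x*b < c` is equivalent to `x < ceil(c/b)`, so the comparison can be pushed through the multiplication (the `self.b == -1` shortcut similarly flips `x*-1 < c` into `-c < x`). A standalone check of the arithmetic, not tinygrad's Node classes:

```python
def mul_lt_bound(b: int, c: int) -> int:
  # For integer x and b > 0:  x*b < c  <=>  x < ceil(c/b), since
  # x*b <= c-1  <=>  x <= (c-1)//b  and  (c-1)//b + 1 == (c+b-1)//b.
  assert b > 0
  return (c + b - 1) // b

assert all((x*3 < 7) == (x < mul_lt_bound(3, 7)) for x in range(-100, 100))
```
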
Szymon Ożóg
4bef1591f0 Disable ocelot cache + fix matvec in triton (#2010)
* Revert "disable flaky triton test"

This reverts commit 1e15fdaee7.

* Update test.yml

* check if has shared for matvec

* disable ocelot cache for triton

* disable ocelot cache

* disable ocelot cache

* pass shared to triton uops tests

* temporary debugs for CI crash

* Revert "temporary debugs for CI crash"

This reverts commit fee3ea96c8.

* Revert "triton isn't tested, and allows this refactor (#2007)"

This reverts commit dea8bb0938.

* add runtime_args to every renderer, move triton local size override to runtime args

* Add binary to args, correct type returned

* update to new loops

* Update test.yml
2023-10-17 10:33:32 -07:00