Commit Graph

2732 Commits

Author SHA1 Message Date
George Hotz
c0a033f01d remove real_offset (#2234)
* remove real_offset

* pass in numnode

* remove that real_offset

* sample only for variable
2023-11-07 17:30:53 -08:00
George Hotz
4d95e6d070 move cache out of tmp (#2232) 2023-11-07 11:41:00 -08:00
George Hotz
a48ccdb359 cleanup deps, no pyyaml, pillow to testing (#2231) 2023-11-07 10:32:23 -08:00
nimlgen
ae5d1407ee Fix mmaped in jit (#2225)
* fix reuse for mmaped buffers in jit

* comment
2023-11-06 14:54:21 -08:00
George Hotz
0c9b4ab885 no to_underlying (#2222)
* no to_underlying

* context is no longer used

* no more optimizing

* update docs
2023-11-05 21:34:20 -08:00
George Hotz
fbe7f0c62b metal: unwrap lib write 2023-11-05 21:02:31 -08:00
George Hotz
2f7aab3d13 move optimize_local_size (#2221)
* move optimize_local_size

* interpret_ast
2023-11-05 21:00:52 -08:00
George Hotz
c60c3b467a clean up symlinking in benchmark (#2219)
* clean up symlinking

* make torch deterministic
2023-11-05 16:46:05 -08:00
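The "make torch deterministic" step is worth a note: benchmark comparisons against torch are only meaningful if torch produces the same numbers run to run. A hedged sketch of the usual settings (whether this commit uses exactly these calls is an assumption, but they are real torch APIs):

```python
# Typical knobs for deterministic torch runs in a benchmark; an assumption
# about this commit's exact approach.
import torch

torch.manual_seed(0)                      # fix RNG for weight init / dropout
torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
torch.backends.cudnn.benchmark = False    # stop cudnn autotuner from varying algos
```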
George Hotz
baeb77a403 Make the JIT simple (no batch exec, no cache collector) (#2215)
* remove batch exec

* simple cachecollector

* remove cache collector test

* less lr
2023-11-05 16:23:43 -08:00
chenyu
719a97b337 fix IMAGE=2 failed with NOOPT=1 (#2209)
* IMAGE=2 failed with NOOPT=1

* fix it
2023-11-05 13:16:37 -08:00
chenyu
680cbfdba4 less broken limit_dims_to_max (#2214) 2023-11-04 08:38:06 -07:00
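For context on limit_dims_to_max: devices cap how many launch-grid dimensions a kernel may use, so oversized shapes have to be folded down. A minimal sketch of the dimension-folding half of the idea, with all names and behavior assumed rather than taken from the actual helper:

```python
# Assumed sketch: fold launch dimensions beyond the device's maximum into
# the last allowed one, preserving the total number of work items.
def limit_dims(global_size: list[int], max_dims: int) -> list[int]:
    folded = global_size[:max_dims]
    for extra in global_size[max_dims:]:
        folded[-1] *= extra  # merge each overflow dim multiplicatively
    return folded

assert limit_dims([4, 4, 2, 2], max_dims=3) == [4, 4, 4]
```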
Ahmed Harmouche
265304e7fd Stable diffusion WebGPU port (#1370)
* WIP: Stable diffusion WebGPU port

* Load whole model: split safetensor to avoid Chrome allocation limit

* Gitignore .DS_Store, remove debug print

* Clip tokenizer in JS

* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS

* e2e stable diffusion flow

* Create initial random latent tensor in JS

* SD working e2e

* Log if some weights were not loaded properly

* Remove latent_tensor.npy used for debugging

* Cleanup, remove useless logs

* Improve UI

* Add progress bar

* Remove .npy files used for debugging

* Add clip tokenizer as external dependency

* Remove alphas_cumprod.js and load it from safetensors

* Refactor

* Simplify a lot

* Dedup base when limiting elementwise merge (webgpu)

* Add return type to safe_load_metadata

* Do not allow run when webgpu is not supported

* Add progress bar, refactor, fix special names

* Add option to choose from local vs huggingface weights

* lowercase tinygrad :)

* fp16 model dl, decompression client side

* Cache f16 model in browser, better progress

* Cache miss recovery

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-03 18:29:16 -07:00
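One detail of this port worth spelling out: browsers limit the size of a single allocation, so loading the whole model means splitting the safetensors file into parts small enough for Chrome to hold. A minimal Python-side sketch of such a split; the 1 GiB chunk size and the .part naming are assumptions for illustration:

```python
# Assumed sketch: split a large safetensors file into fixed-size parts so
# each part stays under the browser's per-allocation limit.
def split_file(path: str, chunk_bytes: int = 1 << 30) -> None:
    with open(path, "rb") as f:
        part = 0
        while chunk := f.read(chunk_bytes):
            with open(f"{path}.part{part}", "wb") as out:
                out.write(chunk)
            part += 1

split_file("net.safetensors")  # hypothetical filename
```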
chenyu
f582ec56d5 Replace (getenv("CI", "") != "") with helpers.CI (#2213) 2023-11-03 15:20:44 -07:00
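The replacement this commit describes is a one-liner worth showing: evaluate the CI environment check once at import time and share the constant, instead of repeating the getenv comparison at every call site. A sketch, assuming helpers defines it as a plain module-level constant:

```python
# In helpers (sketch): compute the CI flag once at import time.
import os

CI = os.getenv("CI", "") != ""  # True when running under CI

# At call sites, instead of (getenv("CI", "") != ""):
# from tinygrad.helpers import CI
if CI:
    print("running in CI, skipping slow paths")
```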
George Hotz
f17bc16f46 simple runtime args (#2211)
* simple runtime args

* fix some tests

* fix abstractions and triton

* fix search
2023-11-03 12:31:29 -07:00
George Hotz
9ea0448103 compile interpreted to python code (#2208)
* sort of works

* interpreted

* fix flopcounter

* interpreted

* simpler

* type

* functools compile ast

* lose a line

* delete extra file

* no self.method_cache
2023-11-03 09:16:12 -07:00
George Hotz
ddbc6eecaf some refactors in the realization (#2206)
* some refactors

* delete old kernel search
2023-11-02 19:51:28 -07:00
George Hotz
51fd993f1f pin onnx to 1.14.1 2023-11-02 18:03:21 -07:00
George Hotz
6621d2eb98 Revert "Modernize setup.py (#2187)"
This reverts commit 7e8c5f1a0f.
2023-11-03 01:01:15 +00:00
nimlgen
6e06adcb95 fix hip segfault (#2204) 2023-11-02 08:40:56 -07:00
George Hotz
03cf0afa4f move all to compile api (#2203)
* move metal+clang to compile api

* all to the new style

* remove binary arg

* fix triton

* fixup tests

* fix clang

* diskcache is generic

* __wrapped__

* compile_gpu

* fix thneed

* keep the src in the ASTRunner

* lib

* move compile_gpu

* compile_gpu in device

* put compiler in astrunner

* test reverts

* triton compiler

* ugh, that too
2023-11-01 23:01:32 -07:00
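A rough illustration of what "all to the new style" converges on, with every name here hypothetical rather than tinygrad's actual signature: each backend reduces to a function from kernel source to a binary blob, so caching and the runtime can treat every compiler uniformly:

```python
# Hypothetical sketch of a unified compile API: each backend is just
# source -> bytes, so callers and caches never special-case a device.
from typing import Callable, Dict

def compile_clang(src: str) -> bytes:
    return src.encode()  # placeholder: a real backend invokes clang here

def compile_metal(src: str) -> bytes:
    return src.encode()  # placeholder: a real backend calls the Metal compiler

COMPILERS: Dict[str, Callable[[str], bytes]] = {
    "CLANG": compile_clang,
    "METAL": compile_metal,
}

def compile_kernel(device: str, src: str) -> bytes:
    return COMPILERS[device](src)
```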
George Hotz
8932816816 remove arm64, caching for cuda (#2201)
* remove arm64, caching for cuda

* caching in llvm

* switch cache_compiled to new cache

* fix clang

* caching for metal

* fix pylint

* cleanups

* perf_counter and binary
2023-11-01 18:44:00 -07:00
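The caching side pairs naturally with a source -> bytes compile API: key the cache on a hash of the source and skip compilation on a hit. A hedged sketch of a diskcache-style wrapper; the on-disk layout (and the /tmp location, which commit #2232 above later moves away from) are assumptions:

```python
# Assumed sketch of compiled-binary caching keyed on a source hash.
import functools, hashlib, pathlib

CACHE_DIR = pathlib.Path("/tmp/kernel_cache")  # assumed location

def diskcache(compile_fn):
    @functools.wraps(compile_fn)  # note the __wrapped__ attr this sets
    def wrapper(src: str) -> bytes:
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        key = hashlib.sha256(src.encode()).hexdigest()
        path = CACHE_DIR / f"{compile_fn.__name__}_{key}"
        if path.exists():
            return path.read_bytes()  # hit: skip the compiler entirely
        binary = compile_fn(src)      # miss: compile once, then store
        path.write_bytes(binary)
        return binary
    return wrapper
```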
George Hotz
7103b716c4 merge kernel and optimizer (#2200)
* merge kernel and optimizer

* linearize is reentrant

* move global/local size

* clean up linearizer copy

* remove unneeded lin copies

* stop linearizing twice

* oops, that should be None
2023-11-01 15:20:01 -07:00
George Hotz
33bb650e94 use mad in opencl (#2198)
Co-authored-by: Comma Device <device@comma.ai>
2023-11-01 10:40:08 -07:00
George Hotz
c8b6a811ea no locals as opt action (#2196)
* switch barrier, add clear_l2

* no locals can be searched

* revert barrier

* fix ci

* put it there
2023-11-01 09:47:44 -07:00
Comma Device
2e9982fe2d fastvits example that's 10% faster 2023-10-31 21:48:23 -07:00
George Hotz
8ba7ced7f9 extract const if it's const (#2193)
* extract const if it's const

* fix if statement

* fast math issue

* fix graphing and casting

* disable flaky copyout test
2023-10-31 18:52:35 -07:00
George Hotz
b245f1307e add exp2 (#2192) 2023-10-31 17:48:42 -07:00
qazal
e2428b63a6 external (#2191) 2023-10-31 13:57:24 -07:00
Elias Wahl
7e8c5f1a0f Modernize setup.py (#2187)
* Added pyproject.toml

* Pin onnx
2023-10-31 13:55:45 -07:00
nimlgen
8c07c73a9b Fix cl map buffer (#2190)
* fix gpu enqueue_map_buffer out of space

* add test
2023-10-31 12:02:46 -07:00
George Hotz
c59ea32f90 prevent over-unrolling in optimizer 2023-10-31 11:45:18 -07:00
George Hotz
5aaa8a0cc1 fix shape 2023-10-31 11:36:19 -07:00
George Hotz
a27c9f9de5 openpilot compile2 (#2189)
* try compile2

* pass to thneed

* fix tanh onnx
2023-10-31 11:08:58 -07:00
qazal
be5f185ac0 Higher test coverage for dtypes (#2156)
* refactor unit tests for dtypes

* add missing dtypes in llvmir.py and lib.py

* skip torch tests

* webgpu

* cleaner skips

* fix llvm bool casting issue using compare

* llvm 100% passing

* llvm segfault

* TEMP decrease timeout mins to 11

debug

* add bf16 to setup

* skip half tests in cuda cpu

* check for CUDACPU instead

* add int16 to triton dtypes

* u16 for triton

* remove debug - diff is still hard to read

* derive from base class TestDType

* enhance test_upcast and downcast by running on every possible version

* dummy commit to rerun the flaky test

* skip the correct tests for CUDA

* bf16 should be skipped in the common TestDType cases

* re-enable bf16

* more consistent structure

* tiny changes to is_dtype_supported 1

* tiny changes 2

add reason

* fuzz

* fuzzer p2

* run fp32 twice

* remove duplicate fp32 run

* clang: use stdbool

* skip triton on bool casts

* merge and resolve conflicts
2023-10-30 22:38:42 -07:00
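The is_dtype_supported mentioned in these bullets is the hinge of the whole refactor: one predicate decides which (device, dtype) combinations the shared TestDType cases should skip. A sketch with illustrative rules; the actual support matrix is an assumption:

```python
# Assumed sketch: a single predicate gates dtype tests per device, so
# skips stay consistent across every TestDType subclass.
import unittest

def is_dtype_supported(device: str, dtype: str) -> bool:
    if dtype == "bfloat16":
        return False  # assumed: bf16 unsupported on most backends here
    if dtype == "float16" and device == "CUDACPU":
        return False  # matches the "skip half tests in cuda cpu" bullet
    return True

class TestHalfExample(unittest.TestCase):
    def test_cast(self):
        if not is_dtype_supported("CUDACPU", "float16"):
            self.skipTest("float16 unsupported on this device")
        # ... the actual cast assertions would run here ...
```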
forcefieldsovereign
f294bdd681 fixed imports (#2185) 2023-10-30 22:07:17 -07:00
Akshay Kashyap
018bd29e37 Enable Multi-Output Export (#2179)
* Enable Multi-Output Export

* Add test

* Update examples and lint

* fix padding

* test ops

* dummy commit to rerun test

* revert cuda lint

* Enforce tuple/list of tensors

* subscripted generics

* put back webgpu test

* Re-enable WebGPU Efficientnet test
2023-10-30 18:42:26 -07:00
qazal
a7439af786 Fix llvm int->bool cast (#2164)
* add to ir

* add test case

* minimize diff

* todo

* enable fast math

* added both False and True case
2023-10-30 15:28:23 -07:00
George Hotz
94cf652b6b don't use locals applies to GROUP also 2023-10-30 13:56:43 -07:00
George Hotz
5cc536bcc0 don't use locals applies to LASTLOCAL 2023-10-30 13:53:42 -07:00
chenyu
3c88af5071 use unique table name for each disk_cache test (#2184) 2023-10-30 13:49:49 -07:00
George Hotz
608e3ee800 fix no locals search and search both (#2171)
* fix no locals search and search both

* pretty print

* nolocals default no other search
2023-10-30 10:22:50 -07:00
George Hotz
194e4ad6f8 Revert "optimizer: simplify GROUP and LOCAL to have one of each (#2162)" (#2182)
This reverts commit 8cf0bb9351.
2023-10-30 10:22:26 -07:00
Ahmed Harmouche
95f7183c3a Reenable global, local limiting (#2095) 2023-10-30 10:17:23 -07:00
chenyu
8548b20b23 fix codellama params and repeat_kv (#2181) 2023-10-30 10:16:26 -07:00
George Hotz
c7f4dd6cb0 CACHELEVEL for smaller caches 2023-10-28 07:26:03 -10:00
chenyu
6c58bf3e9c in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152)
* in time_linearizer, allocate a scratch buffer if output buffer is also input

* move scratch buffer creation outside search
2023-10-28 07:17:41 -10:00
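The reasoning behind this fix deserves a sentence: when the buffer being written is also read as an input, repeated timing runs feed each run the previous run's output, so the measurement is not of the kernel on stable inputs. A rough sketch with hypothetical buffer and allocator names:

```python
# Rough sketch: when timing a kernel, never write into a buffer that is
# also read as an input; benchmark into a scratch buffer instead.
def time_kernel(run_kernel, output, inputs, alloc, runs=10):
    # run_kernel(out, inputs) executes once and returns elapsed seconds
    out = alloc(output.size) if any(output is i for i in inputs) else output
    return min(run_kernel(out, inputs) for _ in range(runs))
```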
Yixiang Gao
902f00b095 adding cuda TC headers (#2165)
* split cuda to renderer and add headers for tc

* fix TritonRenderer

* remove unused import
2023-10-27 14:25:59 -10:00
David Hou
7f4f925385 fix hip del on compile fail (#2163)
* fix hip del on compile fail

* the test doesn't actually work
2023-10-27 11:38:07 -10:00
Francis Lam
8cf0bb9351 optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in the hand-coded matvec opts, where it makes
no performance difference, so we switch to using only the top behavior.

Also adds asserts to prevent tensor core dims from being altered,
which would cause bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
George Hotz
e0201922e3 Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
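Taken together, these bullets sketch a pipeline: log kernel/timing pairs into a dataset, train a small net on kernel features, then use it during BEAM search so cheap estimates prune candidates before real on-device timing. A heavily simplified sketch of that last step; every name here is hypothetical:

```python
# Heavily simplified sketch: beam search over optimization actions where a
# learned estimator prunes candidates before they pay for a real timing run.
def beam_search(start, expand, time_fn, estimate_fn, width=4, depth=3, keep=16):
    # expand(k) yields candidate kernels one optimization action away from k
    best, best_t = start, time_fn(start)
    beam = [start]
    for _ in range(depth):
        candidates = [c for k in beam for c in expand(k)]
        candidates.sort(key=estimate_fn)  # cheap learned estimate prunes first
        timed = [(time_fn(c), c) for c in candidates[:keep]]  # time survivors only
        if not timed:
            break
        timed.sort(key=lambda tc: tc[0])
        beam = [c for _, c in timed[:width]]
        if timed[0][0] < best_t:
            best_t, best = timed[0]
    return best
```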