Commit Graph

3089 Commits

Author SHA1 Message Date
chenyu
8548b20b23 fix codellama params and repeat_kv (#2181) 2023-10-30 10:16:26 -07:00
George Hotz
c7f4dd6cb0 CACHELEVEL for smaller caches 2023-10-28 07:26:03 -10:00
chenyu
6c58bf3e9c in time_linearizer, allocate a scratch buffer if output buffer is also input (#2152)
* in time_linearizer, allocate a scratch buffer if output buffer is also input

* move scratch buffer creation outside search
2023-10-28 07:17:41 -10:00
Yixiang Gao
902f00b095 adding cuda TC headers (#2165)
* split cuda to renderer and add headers for tc

* fix TritonRenderer

* remove unused import
2023-10-27 14:25:59 -10:00
David Hou
7f4f925385 fix hip del on compile fail (#2163)
* fix hip del on compile fail

* the test doesn't actually work
2023-10-27 11:38:07 -10:00
Francis Lam
8cf0bb9351 optimizer: simplify GROUP and LOCAL to have one of each (#2162)
* optimizer: simplify GROUP and LOCAL to have one of each

Now that tensor cores only use LASTLOCAL, we can simplify to use
only that op everywhere.

The only use of GROUP is in matvec hand-coded opts and it doesn't
make a performance difference so switching to use only the top
behavior.

Also adds additional asserts to prevent tensor core dims from
being altered which causes bad kernels to be generated.

* search: remove duplicated actions
2023-10-27 11:37:44 -10:00
George Hotz
e0201922e3 Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
will
bc0829b677 Fix llama json loading (#2160) 2023-10-27 10:21:56 -10:00
nimlgen
8d41b3eb3f beam=16 makes gpt2 gpu-time < 5ms on 3090 (#2154) 2023-10-27 10:21:27 -10:00
nimlgen
5204864eca init cudagraph (#2153)
* init cudagraph

* linter happy

* print warning when cuda graph creation failed
2023-10-27 16:19:50 -04:00
chenyu
9215bccb41 Tensor.uniform set default to standard uniform (#2158)
* Tensor.uniform set default to standard uniform

* clean up test to reuse function
2023-10-27 16:15:30 -04:00
Roelof van Dijk
36ab04ae35 perf: lazyop as dataclass (#1603)
* perf: lazyop as dataclass

fix: linter

fix: restore eq

* use builtin methods, buffers to property to allow freezing

* fix: reduce diff

* fix: can't freeze due to KOPT tests, mypy

* fix: explicit hash

* can freeze if tests are fixed

* fix: typo

---------

Co-authored-by: Roelof van Dijk <roelof.van.dijk@vitestro.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 17:54:30 -04:00
chenyu
0ca0e9ee5e exclude ast with variables from beam search (#2140)
* exclude ast with variables from beam search

* test that

* add to CI
2023-10-25 16:35:29 -04:00
Szymon Ożóg
a52b420fb3 switch ocelot back to main repo (#2147)
* return to ocelot main branch

* cd before checkout
2023-10-25 15:14:26 -04:00
George Hotz
12dd165d38 add WINO/HALF/HIP to AMD benchmark 2023-10-25 13:22:45 -04:00
Francis Lam
bf3490cdf9 wmma: refactor tensor cores using existing local dims (#2097)
* wmma: refactor tensor cores using existing local dims

* optimizer: fix bad rebase and break after one late local

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-10-25 13:10:46 -04:00
wozeparrot
c29653605e hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen
2e89fd264f Refactor hipgraph (#2141)
* refactor hip graph

* linter happy

* happy liner
2023-10-24 15:45:56 -04:00
nimlgen
e21bf776c8 fix debug=1 llama/gpt2 timings (#2143) 2023-10-24 15:45:00 -04:00
chenyu
4444e6d4b3 stable diffusion < 324ms (#2129)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var
2023-10-24 14:56:12 -04:00
George Hotz
cea2bc7964 Add dictionary keys to reduce db size (#2131)
* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood
2023-10-24 10:49:22 -04:00
chenyu
d5e2fdea22 remove shelve from handcode_resnet50_opt.py (#2139) 2023-10-24 10:37:30 -04:00
imaolo
228b310478 align cpu buffer before copy into cl buffer (#2135) 2023-10-23 21:04:35 -04:00
imaolo
6ee0435263 added from unaligned np test (#2134) 2023-10-23 11:38:57 -04:00
George Hotz
3c56c181f6 string formatting 25 -> 30 to fit 2023-10-22 10:57:34 -07:00
George Hotz
6dc8eb5bfd universal disk cache (#2130)
* caching infra for tinygrad

* nons tr key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
Francis Lam
ace6b2a151 optimizer: add test for correctness of opts (#2124)
* optimizer: add test for correctness of opts

Also added OptOps.UPCASTMID to constrain valid axes for opts with
group_for_reduce.

* llvm: fix LinearizerOptions to correctly not has_shared

* optimizer: remove premature test scaffold for TC opts

* search: fix the action space
2023-10-22 08:02:22 -07:00
George Hotz
abeba8f1fc optimization: get actions in CI (#2125)
* get actions in CI

* actually run the test

* pythonpath
2023-10-20 12:22:01 -07:00
qazal
14625721e9 minor triton casting refactor (#2118)
* minor refactor

* render_cast taking an x like cstyle

* fix fmt strings

* tl.where

* fix alu render

* use dtype

* newline eof

* better diff
2023-10-20 12:11:55 -07:00
George Hotz
cb508e6923 uops graphing + phi (#2120)
* uops graphing

* add_phi_node

* less phi nodes

* where graph uops should live

* naming

* move it to external

* fix triton yolo

* fix clang and preserve behavior
2023-10-19 22:26:28 -07:00
20kdc
bedd028061 waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117) 2023-10-19 22:07:15 -07:00
Szymon Ożóg
e0b2bf46b4 Improve triton generated code quality (#2119) 2023-10-19 22:06:19 -07:00
qazal
36d4001b4f add test coverage for search (#2104)
* add test coverage for search

* only in compiled backends

* dont use device.default in decorator

* time_til is the other way around xd
2023-10-19 17:06:47 -07:00
Szymon Ożóg
7268b3c6fb make triton not write to disk (#2116) 2023-10-18 23:06:47 -07:00
David Hou
95e17ff0d4 fix wino mask upcast calculation (#2057)
* fix wino mask upcast calculation

* add tests for wino upcast hcopt

* add info to note

* real world wino hcopt test

* wino backward test

* whitespace
2023-10-18 16:54:48 -07:00
George Hotz
5cfec59abc hlb cifar touchups (#2113)
* types and cnt and EVAL_STEPS

* eval time + always print eval
2023-10-18 16:26:15 -07:00
chenyu
5d5921d2c8 small doc env update (#2112) 2023-10-18 14:49:25 -07:00
George Hotz
4526891db7 parallel apt (#2111) 2023-10-18 14:49:00 -07:00
George Hotz
87b714b8cb split test_conv2d 2023-10-18 14:00:50 -07:00
George Hotz
15da96f393 print test durations and add speed (#2107)
* print test durations

* decrease sizes to increase speed

* faster

* GPU/CLANG onnx in seperate runner

* test split, move ONNX CPU CI

* simpler tests

* simpler uops test

* faster

* less cuda apt

* running ninja install

* apt install

* split fancy indexing
2023-10-18 13:46:42 -07:00
George Hotz
e2a1c2aaa6 force ruff reinstall 2023-10-18 11:40:46 -07:00
George Hotz
0d2b3a9d33 full path for ruff 2023-10-18 11:27:49 -07:00
George Hotz
8940c89d13 tests: remove 2 runners, make cache reliable (#2106)
* remove 2 runners

* device.DEFAULT printing

* explain rebuild

* disable ocelot rebuild

* try again to fix workflow

* this? fix cache hash

* force no rebuild

* fix pylint
2023-10-18 11:10:41 -07:00
George Hotz
b3afe0106b typo, src printing, and no verbose on triton (#2105) 2023-10-18 09:44:36 -07:00
20kdc
967a88a505 examples/waifu2x: Cleanup waifu2x vgg7 model format (now uses safetensors) (#2082) 2023-10-18 09:20:11 -07:00
George Hotz
881fd7c141 add mops to graph, refactor IMAGE (#2100)
* add mops to graph, refactor IMAGE

* no reshape pushing

* add todo

* fix openpilot model alt

* push reshapes reduces kernels in new op

* IMAGE=2 is a first class citizen now
2023-10-17 21:27:51 -07:00
George Hotz
2498802b46 fix beam search for llvm, this needs tests (#2101) 2023-10-17 20:09:42 -07:00
wozeparrot
4d1e59abfd fix: only when distributed (#2102) 2023-10-17 20:09:04 -07:00
Sean D'Souza
999c95ea29 fix: hlb cifar types (#2099) 2023-10-17 19:23:50 -07:00
George Hotz
9b1c3cd9ca hlb_cifar: support EVAL_STEPS=1000, print when dataset is shuffled 2023-10-18 01:11:08 +00:00