Commit Graph

1207 Commits

chenyu
e9847be790 remove whisper +1-1 hack (#2360)
* remove whisper +1-1 hack

* Revert "remove whisper +1-1 hack"

This reverts commit 5db3800f09.

* update whisper tests

* comment context
2023-11-19 17:56:36 -05:00
George Hotz
c8c5212dce a lil more beautiful_mnist 2023-11-17 19:53:06 -08:00
George Hotz
c7b38b324b A beautiful MNIST training example (#2272)
* beautiful mnist

* beautiful mnist example

* from tinygrad import Tensor

* more beautiful

* the jit is super core tinygrad

* globalcounters reset on jit run

* symlinks and exclude

* beautiful_cartpole

* evaluate is its own function

* no symlinks

* more beautiful

* jit reset for double speed

* type hinting for JIT

* beautiful_mnist gets 98%

* beautiful_mnist < 4s with BEAM=2

* better cartpole

* use actor critic

* zero_grad got lost

* delete double relu

* stable cartpole with PPO

* beautiful_cartpole is more beautiful

* REPLAY_BUFFER

* beautiful stuff typechecks

* None support in shape

* hp tuning
2023-11-17 19:42:43 -08:00
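
The bullets above sketch the example's shape: a tiny model, a jitted train step, and evaluation as its own function. A minimal sketch in that spirit (not the actual examples/beautiful_mnist.py; the TinyJit import path is an assumption about the API of this era):

```python
# A minimal sketch in the spirit of beautiful_mnist.py, NOT the committed file.
from tinygrad import Tensor
from tinygrad.jit import TinyJit          # import path circa this commit
from tinygrad.nn import Linear
from tinygrad.nn.optim import Adam
from tinygrad.nn.state import get_parameters

class Model:
  def __init__(self):
    self.l1, self.l2 = Linear(784, 128), Linear(128, 10)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).relu())

model = Model()
opt = Adam(get_parameters(model))

@TinyJit
def train_step(x: Tensor, y: Tensor) -> Tensor:
  opt.zero_grad()
  loss = model(x).sparse_categorical_crossentropy(y).backward()
  opt.step()
  return loss.realize()

def evaluate(x: Tensor, y: Tensor) -> float:
  # "evaluate is its own function": accuracy on a held-out batch
  return (model(x).argmax(axis=1) == y).mean().numpy()

Tensor.training = True
x, y = Tensor.randn(32, 784), Tensor([float(i % 10) for i in range(32)])
print(train_step(x, y).numpy())
```

Wrapping the whole train step in TinyJit is what "the jit is super core tinygrad" refers to: after a couple of warm-up calls, the step replays as cached kernels.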
Friedrich Carl Eichenroth
75676ab8e1 Profiling-helper (#2321)
* change profiler

* remove unused imports

* remove unused imports

* change lazybuffer references

* remove unused line

* remove unused import

* remove unused stuff

* add types

* typing

* typing

* typing

* trigger actions

* -1 loc

* fixup

* trigger actions

* revert lazy typing changes

* WIP profiler helper

* replace old start & stop profiler

* fixup

* linting

* Update llama.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-16 14:15:56 -08:00
mmmkkaaayy
8235da11dd whisper: support batch inference, add librispeech WER test (#2074)
* whisper: support batch inference, add librispeech WER test, add kv caching and JIT

* remove JIT_SUPPORTED_DEVICE

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-16 13:50:08 -08:00
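
KV caching, named in the first bullet, is the key idea: keep the keys and values from previous decode steps so each new token only computes attention against the cache. A sketch of that idea with illustrative names, not whisper.py itself:

```python
# Illustrative kv-cache sketch (hypothetical class, not tinygrad's whisper.py).
from tinygrad.tensor import Tensor

class CachedAttention:
  """Keep past keys/values so each decode step only feeds in the new token."""
  def __init__(self, dim: int):
    self.dim = dim
    self.cache_k = self.cache_v = None

  def __call__(self, q: Tensor, k: Tensor, v: Tensor) -> Tensor:
    # append this step's k/v to the cache, then attend over everything so far
    if self.cache_k is None:
      self.cache_k, self.cache_v = k, v
    else:
      self.cache_k = self.cache_k.cat(k, dim=1).realize()
      self.cache_v = self.cache_v.cat(v, dim=1).realize()
    attn = ((q @ self.cache_k.transpose(1, 2)) / self.dim**0.5).softmax(-1)
    return attn @ self.cache_v
```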
George Hotz
3baaf298d6 two stage cumsum in tensor.py (#2331)
* two stage cumsum in tensor.py

* 2 more kernels for llama cumsum

* gpt-2 and llama use fast multinomial
2023-11-16 12:09:53 -08:00
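
The two-stage trick: instead of one huge kernel scanning all n elements, split into blocks, scan within each block, then scan the block totals and add them back. The idea in plain numpy (the commit implements it with tinygrad ops inside Tensor.cumsum):

```python
# Two-stage cumsum in plain numpy: block-local scans + a scan of block sums.
import numpy as np

def two_stage_cumsum(x: np.ndarray, block: int) -> np.ndarray:
  n = len(x)
  xb = np.pad(x, (0, (-n) % block)).reshape(-1, block)
  local = xb.cumsum(axis=1)                                  # stage 1: scan inside each block
  offsets = np.concatenate([[0], local[:-1, -1].cumsum()])   # stage 2: scan of block totals
  return (local + offsets[:, None]).reshape(-1)[:n]

x = np.arange(10.0)
assert np.allclose(two_stage_cumsum(x, 4), x.cumsum())
```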
George Hotz
70a65c201e JIT support in Interpreted (#2314)
* factor that out

* jit is supported everywhere

* fix some tests

* there's no jit supported device, the jit is everywhere

* fix test uops
2023-11-15 11:13:38 -08:00
George Hotz
01f8781c26 fix CI (#2300)
* might work

* might work 2

* might work 3

* sneak that into llama too

* pin them all
2023-11-14 11:02:59 -08:00
George Hotz
0cbf6c1811 move things, clean up extra (#2292)
* move things

* idk why pylint needs that now

* delete unused
2023-11-13 20:18:40 -08:00
chenyu
a72b370066 llama take int and convert to Variable internally (#2284) 2023-11-12 17:11:37 -05:00
chenyu
5ef8d682e3 clean up attentions in stable diffusion (#2275) 2023-11-11 14:25:36 -05:00
chenyu
453f48ce02 pad None means (0,0) (#2273) 2023-11-11 09:50:26 -08:00
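
Concretely, None in a pad spec is now shorthand for (0, 0), leaving that axis untouched:

```python
# None in a pad argument means "don't pad this axis", i.e. (0, 0).
from tinygrad.tensor import Tensor

t = Tensor.ones(2, 3)
assert t.pad(((1, 1), None)).shape == t.pad(((1, 1), (0, 0))).shape == (4, 3)
```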
chenyu
880e693207 fix llama n_kv_heads in kvcache (#2267)
* fix llama n_kv_heads in kvcache

* trigger ci
2023-11-10 21:44:39 -05:00
chenyu
a753c8e071 examples of new GPT2 and JIT change (#2261)
* var_vals are global

* working with global ish

* better

* fix export model

* fix tests

* better kv cache

* does it run?

* use where for kvmask

* fix excessive var_vals

* fix import

* how does multigpu use this?

* llama kinda work

* faster and simpler

* cleanup

* fix conversation mode

* test cleanups

* fix one more test

* test cleanup

---------

Co-authored-by: George Hotz <geohot@gmail.com>
2023-11-10 15:07:02 -05:00
wozeparrot
4c44d1344b feat: remove cache_id (#2236) 2023-11-08 08:09:21 -08:00
George Hotz
2f7aab3d13 move optimize_local_size (#2221)
* move optimize_local_size

* interpret_ast
2023-11-05 21:00:52 -08:00
Ahmed Harmouche
265304e7fd Stable diffusion WebGPU port (#1370)
* WIP: Stable diffusion WebGPU port

* Load whole model: split safetensor to avoid Chrome allocation limit

* Gitignore .DS_Store, remove debug print

* Clip tokenizer in JS

* WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS

* e2e stable diffusion flow

* Create initial random latent tensor in JS

* SD working e2e

* Log if some weights were not loaded properly

* Remove latent_tensor.npy used for debugging

* Cleanup, remove useless logs

* Improve UI

* Add progress bar

* Remove .npy files used for debugging

* Add clip tokenizer as external dependency

* Remove alphas_cumprod.js and load it from safetensors

* Refactor

* Simplify a lot

* Dedup base when limiting elementwise merge (webgpu)

* Add return type to safe_load_metadata

* Do not allow run when webgpu is not supported

* Add progress bar, refactor, fix special names

* Add option to choose from local vs huggingface weights

* lowercase tinygrad :)

* fp16 model dl, decompression client side

* Cache f16 model in browser, better progress

* Cache miss recovery

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2023-11-03 18:29:16 -07:00
George Hotz
7103b716c4 merge kernel and optimizer (#2200)
* merge kernel and optimizer

* linearize is reentrant

* move global/local size

* clean up linearizer copy

* remove unneeded lin copies

* stop linearizing twice

* oops, that should be None
2023-11-01 15:20:01 -07:00
George Hotz
b245f1307e add exp2 (#2192) 2023-10-31 17:48:42 -07:00
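
exp2 computes 2**x elementwise; a one-line check:

```python
# exp2 is elementwise 2**x
from tinygrad.tensor import Tensor
print(Tensor([0.0, 1.0, 3.0]).exp2().numpy())  # -> [1. 2. 8.]
```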
Akshay Kashyap
018bd29e37 Enable Multi-Output Export (#2179)
* Enable Multi-Output Export

* Add test

* Update examples and lint

* fix padding

* test ops

* dummy commit to rerun test

* revert cuda lint

* Enforce tuple/list of tensors

* subscripted generics

* put back webgpu test

* Re-enable WebGPU Efficientnet test
2023-10-30 18:42:26 -07:00
chenyu
8548b20b23 fix codellama params and repeat_kv (#2181) 2023-10-30 10:16:26 -07:00
George Hotz
e0201922e3 Q network for pruning BEAM / uops deduping / BEAM_ESTIMATE (#2142)
* stable diffusion < 324ms

* revert swap action

* fix tests due to more sum splitting

* REDUCEOP_SPLIT_THRESHOLD env var

* added from unaligned np test (#2134)

* align cpu buffer before copy into cl buffer (#2135)

* remove shelve from handcode_resnet50_opt.py (#2139)

* Add dictionary keys to reduce db size (#2131)

* work

* ignore beam cache

* dictionary keys are generic

* minor db cleanups

* fix baseline and extract dataset

* fix training

* log likelihood

* more lin to feats

* sts

* training policynet

* net sort of works

* dedup

* refactor, stupid new actions

* fix uops deduping

* BEAM_ESTIMATE

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: imaolo <56898718+imaolo@users.noreply.github.com>
2023-10-27 10:53:06 -10:00
will
bc0829b677 Fix llama json loading (#2160) 2023-10-27 10:21:56 -10:00
nimlgen
8d41b3eb3f beam=16 makes gpt2 gpu-time < 5ms on 3090 (#2154) 2023-10-27 10:21:27 -10:00
wozeparrot
c29653605e hip multigpu training (#1878)
* feat: move to hip

* feat: special path for RawBufferTransfer

* feat: initial rawbuffertransfer

* feat: hip ipc

* feat: working hip ipc

* feat: need to base device without args

* feat: close mem handle

* feat: modified test

* feat: more multihip stuff

* clean: cleanup

* feat: cleaner

* feat: don't crash

* feat: test more

* clean: way cleaner hip wrapper

* feat: barrier

* feat: barrier

* feat: this breaks stuff

* feat: we can use empty here

* feat: maybe fix tests

* feat: maybe fix tests again?

* fix: probably fix tests

* feat: no waiting here

* feat: wait here

* feat: much larger test

* feat: need to sync here

* feat: make this async

* feat: no waiting!

* feat: cut here

* feat: sync copy

* feat: random imports

* feat: much cleaner world

* feat: restore this

* feat: restore this

* clean: cleanup

* feat: set this
2023-10-24 17:35:53 -04:00
nimlgen
e21bf776c8 fix debug=1 llama/gpt2 timings (#2143) 2023-10-24 15:45:00 -04:00
chenyu
d5e2fdea22 remove shelve from handcode_resnet50_opt.py (#2139) 2023-10-24 10:37:30 -04:00
George Hotz
6dc8eb5bfd universal disk cache (#2130)
* caching infra for tinygrad

* non-str key

* fix linter

* no shelve in beam search

* beam search caching

* check tensor cores with beam too

* pretty print

* LATEBEAM in stable diffusion
2023-10-22 10:56:57 -07:00
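
The general shape of such a cache, as an illustrative sketch rather than tinygrad's actual helpers: a sqlite-backed memoization decorator keyed by pickled arguments, replacing per-feature shelve files with one database:

```python
# Illustrative sqlite-backed disk cache (NOT tinygrad's helpers): memoize
# expensive results (e.g. beam search outcomes) keyed by a pickled arg blob.
import functools, hashlib, pickle, sqlite3

db = sqlite3.connect("cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, val BLOB)")

def diskcache(fn):
  @functools.wraps(fn)
  def wrapper(*args):
    key = hashlib.sha256(pickle.dumps((fn.__name__, args))).hexdigest()
    row = db.execute("SELECT val FROM cache WHERE key=?", (key,)).fetchone()
    if row is not None: return pickle.loads(row[0])
    val = fn(*args)
    db.execute("INSERT INTO cache VALUES (?, ?)", (key, pickle.dumps(val)))
    db.commit()
    return val
  return wrapper
```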
20kdc
bedd028061 waifu2x vgg7: testcase, auto-RGBA->RGB, function to grab pretrained models, training "fix" (#2117) 2023-10-19 22:07:15 -07:00
George Hotz
5cfec59abc hlb cifar touchups (#2113)
* types and cnt and EVAL_STEPS

* eval time + always print eval
2023-10-18 16:26:15 -07:00
20kdc
967a88a505 examples/waifu2x: Cleanup waifu2x vgg7 model format (now uses safetensors) (#2082) 2023-10-18 09:20:11 -07:00
wozeparrot
4d1e59abfd fix: only when distributed (#2102) 2023-10-17 20:09:04 -07:00
Sean D'Souza
999c95ea29 fix: hlb cifar types (#2099) 2023-10-17 19:23:50 -07:00
George Hotz
9b1c3cd9ca hlb_cifar: support EVAL_STEPS=1000, print when dataset is shuffled 2023-10-18 01:11:08 +00:00
George Hotz
c36d306606 KOPT is over, BEAM is upstream (#2071)
* create cache for q learning

* make linter happy

* global beam

* where it belongs

* bugfix

* ditch the kopt, use the beam

* faster lin and DEBUG=2 okay

* remove kopt, move search to features
2023-10-16 09:46:03 -07:00
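
With KOPT gone, BEAM is the single knob: set it in the environment and every kernel is beam-searched before compilation. A usage sketch (BEAM must be set before tinygrad reads it):

```python
# Usage sketch: enable width-2 beam search for every compiled kernel.
import os
os.environ["BEAM"] = "2"   # must be set before tinygrad is imported

from tinygrad.tensor import Tensor
(Tensor.randn(512, 512) @ Tensor.randn(512, 512)).realize()
```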
Ahmed Harmouche
0d3410d93f Stable diffusion: Make guidance modifiable (#2077) 2023-10-15 14:36:43 -07:00
George Hotz
49bcfec383 0s in the action space (#2070)
* 0s in the action space

* simpler

* skip duplicate actions
2023-10-14 11:22:48 -07:00
mmmkkaaayy
91168a28c4 whisper: make file transcription work, add basic CI test (#2042) 2023-10-13 17:13:35 -07:00
George Hotz
6f1810af2d with unroll, the action space goes from 161 -> 127 (#2060)
* with unroll, the action space goes from 161 -> 127

* more reliable instrumentation

* beam search is so op

* beam bugfix
2023-10-12 20:52:23 -07:00
Yixiang Gao
3187962476 CIFAR HALF mode (#2041)
* load weights in fp16

* add dtype option in nn

* fix test

* no need for dtype in nn

* add option to load weights in FP16, but it NaNs

* change loss scaler

* cast to float32 for norm layer

* add a todo for the forward pass padding

* fix transform
2023-10-12 10:19:51 -07:00
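
The "loss scaler" bullet is standard fp16 hygiene: scale the loss up before backward so small gradients don't underflow in half precision, then unscale before the update. An illustrative sketch, not hlb_cifar10.py itself:

```python
# Illustrative loss-scaling sketch (not the committed hlb_cifar10.py code).
from tinygrad.tensor import Tensor
from tinygrad.nn.optim import SGD

LOSS_SCALE = 2048.0  # a typical power-of-two scale, chosen for illustration

def scaled_step(opt: SGD, loss: Tensor) -> None:
  opt.zero_grad()
  (loss * LOSS_SCALE).backward()   # grads come out LOSS_SCALE too large
  for p in opt.params:
    p.grad = p.grad / LOSS_SCALE   # unscale before the optimizer update
  opt.step()
```

The "cast to float32 for norm layer" bullet is the complementary half: keep normalization statistics in fp32 even when weights and activations are fp16.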
George Hotz
c5edb3c374 train value net, improve API, add BCE (#2047)
* api cleanups, BCE losses

* valuenet

* fixup examples

* learning okay

* add valuenet runner

* net improvements

* net improvements

* 40% win rate
2023-10-12 07:56:38 -07:00
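
The BCE this commit adds takes probabilities in (0, 1) and 0/1 targets; a quick usage sketch (assuming Tensor.binary_crossentropy reduces with a mean):

```python
# BCE usage sketch: predictions are probabilities, targets are 0/1.
from tinygrad.tensor import Tensor

pred = Tensor([0.9, 0.2, 0.7])
target = Tensor([1.0, 0.0, 1.0])
print(pred.binary_crossentropy(target).numpy())  # mean BCE over the batch
```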
George Hotz
41bfeb2c1e start work on auto opt (#2034)
* start work on auto opt

* lin failure

* not beating hcopt

* greedy

* timing is fast

* codegen.search

* greedy search in handcode_opt

* track running gflops

* clean up those files

* no failure
2023-10-11 12:54:53 -07:00
chenyu
e2b83f1b42 Variable.bind newer (#2017)
* Variable.bind attempt 2

* ShapeTracker.unbind

* fix llama

* fix types

* test case

* View.vars cleanup

* include mask in symbolic source

* mask can be sint

* st.unbind in bufferops

* assert ast contain free Variable only

* cleanup

* conservative unbinding reduce op arg

* move reduceop unbind

* fix llama JIT arg behavior
2023-10-10 10:03:01 -07:00
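
A usage sketch of the bind/unbind API these bullets describe (hypothetical values): a Variable carries its bounds symbolically, bind attaches a concrete value for one run, and unbind splits the two again so a jitted kernel can be replayed with fresh var_vals:

```python
# Sketch of Variable.bind/unbind as described in the commit bullets.
from tinygrad.shape.symbolic import Variable

start_pos = Variable("start_pos", 1, 1024)  # symbolic dim with known bounds
bound = start_pos.bind(42)                  # attach a concrete value for this run
var, val = bound.unbind()                   # split back into Variable + value
assert val == 42
```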
Ahmed Harmouche
e27fedfc7b Fix stable diffusion output error on WebGPU (#2032)
* Fix stable diffusion on WebGPU

* Remove hack, numpy cast only on webgpu

* No-copy numpy cast
2023-10-10 06:40:51 -07:00
chenyu
25555c836f llama default to JIT only if device supports JIT (#2028) 2023-10-09 17:26:02 -07:00
George Hotz
16ca8410f8 op logger + replay (#2021)
* logops

* fix dtype printing

* needs inf

* ops dataset

* minor improvements

* 12k kernels

* opt can compile

* graph flops
2023-10-08 15:10:18 -07:00
mmmkkaaayy
af6e2f31ca whisper: cast model output token to int32 (#2013)
Co-authored-by: mmmkkaaayy <mmmkkaaayy@users.noreply.github.com>
2023-10-08 05:56:22 -07:00
George Hotz
44ed94ef5c use the device abstraction in handcode_resnet50_opt 2023-10-07 13:22:20 -07:00
George Hotz
121f7aa8c5 Schedule item (#2012)
* ScheduleItem

* put var_vals in the schedule

* fix tests, wow that proliferated quickly

* not ready to be in the schedule
2023-10-07 08:59:25 -07:00
George Hotz
f54959e5cd move print tree into graph (#2003)
* move print tree into graph

* add winograd profiling test

* change pre-commit to run ruff first
2023-10-07 04:39:21 -07:00