Commit Graph

942 Commits

chenyu
9f6d545a16 bert log global_norm in training step [pr] (#8708)
* bert log global_norm in training step [pr]

and minor cleanups

* .item()
2025-01-21 20:36:27 -05:00
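The commit above logs the global gradient norm and pulls it to the host with .item(). A minimal sketch of that pattern, assuming a tinygrad-style parameter list; the actual training-step code in the PR differs:

```
from tinygrad import Tensor

def global_norm(params: list[Tensor]) -> Tensor:
  # L2 norm over all gradients, concatenated into one flat vector
  return Tensor.cat(*[p.grad.flatten() for p in params]).square().sum().sqrt()

w1 = Tensor.ones(3, requires_grad=True)
w2 = Tensor.ones(2, requires_grad=True)
((w1 * w1).sum() + (w2 * w2).sum()).backward()
print(global_norm([w1, w2]).item())  # .item() syncs and returns a plain float for logging
```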
chenyu
1e283c33d3 remove realize in bert model init [pr] (#8707) 2025-01-21 14:11:03 -05:00
geohotstan
dd82b4c913 make onnx runner a class (#8647)
* this

* clean up

* more clean ups and improve debug msg

* more correct training toggler

* remove manual training toggling

* change some variable names

* actually just add the training toggle for LIMIT envvar too

* more refinement

* __call__ and OnnxRunner

* fix half of the pylint errors; the other half comes from importing from onnx while this file is onnx.py, figure out later

* ahhhh found another mistake

* remove limit from __call__

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-20 10:11:05 -08:00
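The refactor above turns the ONNX runner into a class whose instances are invoked via __call__. A toy sketch of that shape, not the real OnnxRunner (which parses an ONNX ModelProto and handles inputs differently):

```
from tinygrad import Tensor

class ToyRunner:
  def __init__(self, graph):                       # graph: list of (name, fn) pairs here
    self.graph = graph                             # setup happens once, at construction
  def __call__(self, inputs: dict[str, Tensor]) -> dict[str, Tensor]:
    x = inputs["input"]                            # running the model is just calling the instance
    for _, fn in self.graph: x = fn(x)
    return {"output": x}

runner = ToyRunner([("relu", Tensor.relu)])
print(runner({"input": Tensor([-1.0, 2.0])})["output"].tolist())  # [0.0, 2.0]
```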
chenyu
c49e0fca60 GlobalCounters.reset() in sdxl step [pr] (#8664) 2025-01-17 21:10:28 -05:00
chenyu
930728c069 bert BS 72->66 [pr] (#8621)
BS=72 no longer fits.
2025-01-14 18:41:41 -05:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00
chenyu
994944920b simpler batch_load_train_bert [pr] (#8582)
that buffer doesn't seem beneficial: without it, data_time is 5% faster and each step is 1ms faster.
https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview
2025-01-12 20:25:05 -05:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
chenyu
def90b22f6 EVAL_BS=36 for bert [pr] (#8576)
3X faster eval compared to BS=6.
green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview
red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview
2025-01-12 09:43:56 -05:00
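EVAL_BS here is an environment knob. In tinygrad code such knobs are typically read with getenv; a sketch, as the exact line in the training script may differ:

```
from tinygrad.helpers import getenv
EVAL_BS = getenv("EVAL_BS", 36)  # override on the command line: EVAL_BS=6 python3 ...
```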
George Hotz
9833fe83d8 more work on onnx imagenet [pr] (#8552)
* more work on onnx imagenet [pr]

* working quantization

* static quant

* benchmark onnx 0 dim
2025-01-09 20:28:18 -08:00
George Hotz
e172b759f0 more working (#8550) 2025-01-09 18:40:08 -08:00
chenyu
b6be407bc6 fix handcode_opt bert [pr] (#8509)
* fix handcode_opt bert [pr]

* too slow
2025-01-05 19:14:12 -05:00
George Hotz
24de25b52f example to benchmark onnx [pr] (#8459)
* example to benchmark onnx [pr]

* reset global count
2024-12-31 11:38:33 -05:00
qazal
866dfa1f23 create_schedule([x.lazydata]) -> x.schedule() in tests (#8449) 2024-12-31 03:15:52 +08:00
Calum
d8b08790b9 Fix examples/conversation.py (#8425)
* fix: conversation example

* remove slice func

* remove unused import

* use Tensor.split
2024-12-26 12:45:19 -05:00
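The last bullet swaps a hand-rolled slice helper for Tensor.split. A minimal illustration of that API (the sizes and dim here are arbitrary):

```
from tinygrad import Tensor

t = Tensor.arange(10)
a, b = t.split(5)              # two chunks of 5 along dim 0
print(a.tolist(), b.tolist())  # [0..4] [5..9]
```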
chenyu
4712847766 make self_tokenize output more like a python file (#8411)
use a comment for the file name and join with newlines instead of null bytes when exporting to a file
2024-12-25 14:16:30 -05:00
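A hedged sketch of the joining scheme described above; the real script's exact comment format may differ:

```
files = {"tinygrad/tensor.py": "class Tensor: ...", "tinygrad/ops.py": "# ops"}
# one comment line naming each file, contents joined with newlines instead of null bytes
out = "\n".join(f"# {name}\n{src}" for name, src in files.items())
print(out)
```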
chenyu
a35eef8d58 optionally output to file in self_tokenize.py (#8399)
this way the whole of tinygrad can be pasted into Gemini
2024-12-24 21:09:26 -05:00
Harald Schäfer
7059459648 Openpilot compile: fix for openpilot use (#8338)
* compile3 changes

* merge conflict

* merge conflict

* give dm npy for now

* Revert "give dm npy for now"

This reverts commit bfd980da7d2c2bab5b073127442c361922032ba1.

* updates

* Always float32 floats

* Update compile3.py

* Update compile3.py

---------

Co-authored-by: ZwX1616 <zwx1616@gmail.com>
2024-12-19 19:43:15 -05:00
George Hotz
8f95b578f6 use Estimates class [pr] (#8319)
* use Estimates class [pr]

* frozen dataclass
2024-12-18 10:19:32 -08:00
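For reference, a frozen dataclass in the spirit of the Estimates class mentioned above; the field names are assumptions, not the actual tinygrad definition:

```
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimates:
  ops: int = 0  # estimated operation count
  mem: int = 0  # estimated bytes moved

e = Estimates(ops=1000, mem=4096)
# e.ops = 0  # would raise FrozenInstanceError: frozen instances are immutable
```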
George Hotz
37fa38d272 Revert "switch beautiful_mnist to use new optimizer [pr] (#8231)" (#8233)
This reverts commit e9ee39df22.
2024-12-13 19:07:09 -08:00
George Hotz
e9ee39df22 switch beautiful_mnist to use new optimizer [pr] (#8231)
* switch beautiful_mnist to use new optimizer [pr]

* fix abstractions3 + docs

* fix OptimizerGroup with schedule_step api
2024-12-13 18:27:16 -08:00
Ahmed Harmouche
651f72442c encapsulate the exported webgpu model (#8203) 2024-12-13 10:55:37 +01:00
chenyu
64a917b7eb remove LAZYCACHE ContextVar [pr] (#8175)
also removed from resnet latest script
2024-12-11 22:02:52 -05:00
chenyu
26e049ab40 add ALLOWED_READ_IMAGE=2131 to openpilot (#8166)
added as an exact-number check for now, since it's not clear whether more or fewer than the allowed count is any better
2024-12-11 12:14:17 -08:00
Maxim Zakharov
e53a5bf0c3 Stable Diffusion UI - convenient send via Enter (#8160) 2024-12-11 19:05:24 +01:00
George Hotz
f83d715f41 move checks into compile3, delete compile2 [pr] (#8127)
* move checks into compile3 [pr]

* test_vs_onnx

* test v torch works

* float16 won't compile on compile3

* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
a773c5a571 hotfix: default llama3 is 1B with download_model 2024-12-09 07:23:35 -08:00
Ahmed Harmouche
c6277fce09 Remove f16 decompression lib from SD compile.py (#8121)
* Remove f16-to-f32-gpu lib, use tinygrad exported decompression

* No need to create new instance
2024-12-09 14:09:00 +01:00
George Hotz
00ac0db9d4 np tensors have the memory from numpy in compile3 [pr] (#8098) 2024-12-07 14:01:51 +08:00
George Hotz
22feb3a2f1 move copy into the JIT for openpilot compile3 (#7937)
* move copy into the JIT, test fails

* ahh, prune was the issue
2024-12-07 13:26:26 +08:00
Ahmed Harmouche
f3983f6743 Move efficientnet example (#8087) 2024-12-06 15:48:16 +01:00
Ahmed Harmouche
fad3eaa35e Use atomicLoad builtin when loading atomic type (#8084) 2024-12-06 15:33:11 +01:00
George Hotz
e2fe7f0d2f hotfix: actually fix pylint, it's a python 3.10 issue 2024-12-06 13:53:46 +08:00
George Hotz
b28d660172 update self_tokenize, fix pylint maybe 2024-12-06 13:49:41 +08:00
George Hotz
344fd4845c example: self_tokenize. someday tinygrad will be recursively self improving 2024-12-06 13:35:02 +08:00
leopf
65b6696f3b refactor safe_load (#8035)
* refactor safe_load

* cleanup
2024-12-06 12:08:21 +08:00
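safe_load maps a .safetensors file to a dict of Tensors; typical usage (the path is illustrative):

```
from tinygrad import nn
from tinygrad.nn.state import safe_load, load_state_dict

model = nn.Linear(4, 2)                      # any module with named parameters
state_dict = safe_load("model.safetensors")  # {param_name: Tensor}
load_state_dict(model, state_dict)           # copies into the model's tensors
```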
Ahmed Harmouche
c6f5bb03fa YoloV8 WebGPU fixes (#8057)
* Bump input size up to 416; show a notice if webgpu is not supported

* Minor fix in export_model
2024-12-05 16:23:45 +01:00
Francis Lata
c3187087f7 QwQ-32B-Preview support (#7962)
* load weights with some debugging

* start running a prompt

* cleanup

* optionally permute layers and cleanup

* add validation for simple prompt

* small cleanup

* minor cleanup with formatting download links

* add a longer prompt

* add timing option

* some typings

* remove unused arg

* reset GlobalCounters

* minor cleanups
2024-12-04 21:46:37 -05:00
Ahmed Harmouche
c9e7701417 Fast YoloV8 on WebGPU (#8036)
* Fast yolov8 with downscaled input

* Faster + FPS meter

* Add loader while model is downloading/compiling

* Title touchup
2024-12-04 15:23:09 +01:00
Ahmed Harmouche
db330a3110 Remove WebGL (#8012) 2024-12-03 16:02:53 +01:00
Ahmed Harmouche
8818046940 YoloV8 on WebGPU (#8007)
Port YoloV8 to WebGPU
2024-12-03 15:10:41 +01:00
Ahmed Harmouche
8909dbd82c Remove wgpu specific checks from stable diffusion example (#7991) 2024-12-02 11:31:14 +01:00
chenyu
3e2430f822 use tqdm tqdm in mlperf training (#7929)
due to an issue in benchmark dashboard logging, revert to tqdm's tqdm for now
2024-11-27 21:57:05 -05:00
Ahmed Harmouche
10618aba98 Bring back WebGPU (#7063)
* Start from andredaprato:webgpu-clean

* Fix infs

* inf wgsl function is not needed

* Emulated ulong for threefry, more tests passing

* Randomness tests passing

* Update model export to support new changes in webgpu, efficientnet export works again

* Simplify shift emulation in wgsl

* Delete test file

* Fix u32 literals bigger than u32 max

* Why was skip copies added here?

* Python3.12 for webgpu tests

* Fix model export syntax error

* Get test ops passing with some skips

* Fix lint

* Much simpler shift

* Run more tests

* Timestamp queries are not supported in CI, so skip search tests

* All fancy indexing passing

* r is ctx

* Run more dtype tests by using is_dtype_supported

* Cleanup ulong shift rendering

* UPat -> Pat, UOps -> Ops

* Pat -> UPat

* Refactor render_ushift if-else

* Pattern to avoid ulong mul

* Remove vals_dtype

* is_nan trick + rewrite, test_isnan passing

* Rewrite a * select(1, nan, gate) -> select(a, nan, gate) (see the sketch after this commit)

* No arg, just op

* Support char, uchar, short, ushort

* Run test_index_mnist now that we have uint8

* Fix pylint

* Save 3 lines by using base Compiler

* No more long emulation

* Remove fixup_binops

* No more external_local_bufx wgsl-specific cstyle modification, use base extra_pm

* Simpler, faster copyin/out

* Skip some new tests that use long

* Fix typo

* copyout touchup

* Save lines by using render_cast

* WebGL is not supported in core, delete it from is_dtype_supported

* More narrow test skips for some unary tests

* TernaryOps, UnaryOps -> Ops

* TinyGrad supports WebGPU

* StableDiffusion demo: f16tof32 gpu is a lib, update UI

* Packed load/store, no more scale_size, no core tinygrad changes

* Rename copyin, copyout

* Device -> dev

* Fix lint

* Pattern matcher rule for packed load/store

* Refactor

* Shorter packed load/store

* this should fix lint

* Fix mypy

* SD compile script working

* New SD webgpu UI

* New default prompt

* New SD weights

* Fix title when webgpu not available

* Run symbolic tests, simplify is_nan, use round_up

* Show step time on UI

* Bump minimum wgpu version to v0.19

* Fix latent

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-26 12:26:40 +08:00
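The a * select(1, nan, gate) -> select(a, nan, gate) rewrite in the commit above rests on a simple identity (WGSL's select(f, t, cond) returns t when cond is true): multiplying by 1 is a no-op when the gate is false, and nan absorbs the product when it is true. A scalar sketch of the identity, not the UOp pattern itself:

```
import math

def select(f, t, cond): return t if cond else f  # WGSL argument order

for a in (2.0, -3.5):
  for gate in (False, True):
    lhs = a * select(1.0, math.nan, gate)
    rhs = select(a, math.nan, gate)
    assert lhs == rhs or (math.isnan(lhs) and math.isnan(rhs))
```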
chenyu
631dc98b52 validate llama quantize output (#7901)
the Mac benchmark already runs quantize; this adds output validation
2024-11-25 16:46:23 -05:00
George Hotz
5d28a202b5 make tinychat local (#7871) 2024-11-24 14:45:48 +08:00
chenyu
22d5def113 download llama3 70B (#7868)
use "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF".
```
PYTHONPATH=. JITBEAM=2 python3 examples/llama3.py --download_model --size 70B --quantize int8 --benchmark
```

on an M4 Max, it takes 40 sec to load the model, and:
```
enqueue in 165.15 ms
total 328.54 ms, 3.04 tok/s, 247.46 GB/s, param 221.20 GB/s

enqueue in   5.31 ms
total 168.48 ms, 5.94 tok/s, 482.54 GB/s, param 431.34 GB/s

enqueue in   5.32 ms
total 168.77 ms, 5.93 tok/s, 481.71 GB/s, param 430.60 GB/s

enqueue in   5.69 ms
total 169.51 ms, 5.90 tok/s, 479.61 GB/s, param 428.72 GB/s

enqueue in   5.41 ms
total 168.60 ms, 5.93 tok/s, 482.20 GB/s, param 431.04 GB/s

enqueue in   5.18 ms
total 168.98 ms, 5.92 tok/s, 481.12 GB/s, param 430.08 GB/s

enqueue in   5.43 ms
total 168.82 ms, 5.92 tok/s, 481.59 GB/s, param 430.49 GB/s

enqueue in   5.27 ms
total 168.94 ms, 5.92 tok/s, 481.23 GB/s, param 430.17 GB/s
```
2024-11-23 12:18:31 -05:00
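A quick sanity check on those numbers: param bandwidth divided by throughput gives the bytes read per token, which should land near the int8 weight size of a 70B-parameter model:

```
param_gb_s, tok_s = 431.34, 5.94
print(param_gb_s / tok_s)  # ~72.6 GB/token, consistent with ~70B params at ~1 byte each after int8 quantize
```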
George Hotz
144e9f00df viz is local, new test, and new quantize [pr] (#7859)
* viz is local, new test, and new quantize [pr]

* fix mime types

* remove font

* after index
2024-11-23 14:27:10 +08:00
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
chenyu
69e382216d fix wino conv output dtype for half inputs (#7829) 2024-11-21 12:13:54 -05:00
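What that fix guarantees, sketched as a check; this assumes WINO=1 selects the Winograd path for 3x3 stride-1 convs:

```
from tinygrad import Tensor, dtypes

x = Tensor.rand(1, 4, 8, 8, dtype=dtypes.half)
w = Tensor.rand(4, 4, 3, 3, dtype=dtypes.half)
assert x.conv2d(w).dtype == dtypes.half  # output stays half, not promoted to float32
```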