Commit Graph

942 Commits

chenyu
9f6d545a16 bert log global_norm in training step [pr] (#8708)
* bert log global_norm in training step [pr]

and minor cleanups

* .item()
2025-01-21 20:36:27 -05:00
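The commit above logs the global gradient norm and pulls it to the host with .item(). A minimal sketch of that pattern, assuming a tinygrad-style parameter list; the actual training-step code in the PR differs:

```
from tinygrad import Tensor

def global_norm(params: list[Tensor]) -> Tensor:
  # L2 norm over all gradients, concatenated into one flat vector
  return Tensor.cat(*[p.grad.flatten() for p in params]).square().sum().sqrt()

w1 = Tensor.ones(3, requires_grad=True)
w2 = Tensor.ones(2, requires_grad=True)
((w1 * w1).sum() + (w2 * w2).sum()).backward()
print(global_norm([w1, w2]).item())  # .item() syncs and returns a plain float for logging
```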
chenyu
1e283c33d3 remove realize in bert model init [pr] (#8707) 2025-01-21 14:11:03 -05:00
geohotstan
dd82b4c913 make onnx runner a class (#8647)
* this

* clean up

* more clean ups and improve debug msg

* more correct training toggler

* remove manual training toggling

* change some variable names

* actually just add the training toggle for LIMIT envvar too

* more refinement

* __call__ and OnnxRunner

* fix half of the pylint errors; the other half comes from importing from onnx while this file is onnx.py, figure out later

* ahhhh found another mistake

* remove limit from __call__

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-01-20 10:11:05 -08:00
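The refactor above turns the ONNX runner into a class whose instances are invoked via __call__. A toy sketch of that shape, not the real OnnxRunner (which parses an ONNX ModelProto and handles inputs differently):

```
from tinygrad import Tensor

class ToyRunner:
  def __init__(self, graph):                       # graph: list of (name, fn) pairs here
    self.graph = graph                             # setup happens once, at construction
  def __call__(self, inputs: dict[str, Tensor]) -> dict[str, Tensor]:
    x = inputs["input"]                            # running the model is just calling the instance
    for _, fn in self.graph: x = fn(x)
    return {"output": x}

runner = ToyRunner([("relu", Tensor.relu)])
print(runner({"input": Tensor([-1.0, 2.0])})["output"].tolist())  # [0.0, 2.0]
```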
chenyu
c49e0fca60 GlobalCounters.reset() in sdxl step [pr] (#8664) 2025-01-17 21:10:28 -05:00
chenyu
930728c069 bert BS 72->66 [pr] (#8621)
BS=72 no longer fits.
2025-01-14 18:41:41 -05:00
geohotstan
4abe631b56 fix onnx mobilenetv2-7-quantized.onnx (#8574)
* is 67% considered fixed?

* move test up

* share function

* add qgemm too

* make sure qgemm comes out as int

* actually that note is not right

* remove qgemm (I did it wrong) and add it later lol.
2025-01-13 09:25:06 -08:00
chenyu
994944920b simpler batch_load_train_bert [pr] (#8582)
that buffer doesn't seem beneficial: without it, data_time is 5% faster and each step is 1ms faster.
https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/69c9lx8y/overview
2025-01-12 20:25:05 -05:00
George Hotz
4ac4c1415a free intermediate buffers in the jit [pr] (#8581)
* free intermediate buffers in the jit [pr]

* intermediates_freed

* deallocate if not allocated

* self._first_run is simpler
2025-01-12 15:41:41 -08:00
chenyu
def90b22f6 EVAL_BS=36 for bert [pr] (#8576)
3X faster eval compared to BS=6.
green https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/ka5p5sm9/overview
red https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/a7maxsxd/overview
2025-01-12 09:43:56 -05:00
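EVAL_BS here is an environment knob. In tinygrad code such knobs are typically read with getenv; a sketch, as the exact line in the training script may differ:

```
from tinygrad.helpers import getenv
EVAL_BS = getenv("EVAL_BS", 36)  # override on the command line: EVAL_BS=6 python3 ...
```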
George Hotz
9833fe83d8 more work on onnx imagenet [pr] (#8552)
* more work on onnx imagenet [pr]

* working quantization

* static quant

* benchmark onnx 0 dim
2025-01-09 20:28:18 -08:00
George Hotz
e172b759f0 more working (#8550) 2025-01-09 18:40:08 -08:00
chenyu
b6be407bc6 fix handcode_opt bert [pr] (#8509)
* fix handcode_opt bert [pr]

* too slow
2025-01-05 19:14:12 -05:00
George Hotz
24de25b52f example to benchmark onnx [pr] (#8459)
* example to benchmark onnx [pr]

* reset global count
2024-12-31 11:38:33 -05:00
qazal
866dfa1f23 create_schedule([x.lazydata]) -> x.schedule() in tests (#8449) 2024-12-31 03:15:52 +08:00
Calum
d8b08790b9 Fix examples/conversation.py (#8425)
* fix: conversation example

* remove slice func

* remove unused import

* use Tensor.split
2024-12-26 12:45:19 -05:00
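The last bullet swaps a hand-rolled slice helper for Tensor.split. A minimal illustration of that API (the sizes and dim here are arbitrary):

```
from tinygrad import Tensor

t = Tensor.arange(10)
a, b = t.split(5)              # two chunks of 5 along dim 0
print(a.tolist(), b.tolist())  # [0..4] [5..9]
```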
chenyu
4712847766 make self_tokenize output more like a python file (#8411)
use a comment for the file name and join with newlines instead of null bytes when exporting to a file
2024-12-25 14:16:30 -05:00
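A hedged sketch of the joining scheme described above; the real script's exact comment format may differ:

```
files = {"tinygrad/tensor.py": "class Tensor: ...", "tinygrad/ops.py": "# ops"}
# one comment line naming each file, contents joined with newlines instead of null bytes
out = "\n".join(f"# {name}\n{src}" for name, src in files.items())
print(out)
```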
chenyu
a35eef8d58 optionally output to file in self_tokenize.py (#8399)
this way the whole of tinygrad can be pasted into Gemini
2024-12-24 21:09:26 -05:00
Harald Schäfer
7059459648 Openpilot compile: fix for openpilot use (#8338)
* compile3 changes

* merge conflict

* merge conflict

* give dm npy for now

* Revert "give dm npy for now"

This reverts commit bfd980da7d2c2bab5b073127442c361922032ba1.

* updates

* Always float32 floats

* Update compile3.py

* Update compile3.py

---------

Co-authored-by: ZwX1616 <zwx1616@gmail.com>
2024-12-19 19:43:15 -05:00
George Hotz
8f95b578f6 use Estimates class [pr] (#8319)
* use Estimates class [pr]

* frozen dataclass
2024-12-18 10:19:32 -08:00
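For reference, a frozen dataclass in the spirit of the Estimates class mentioned above; the field names are assumptions, not the actual tinygrad definition:

```
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimates:
  ops: int = 0  # estimated operation count
  mem: int = 0  # estimated bytes moved

e = Estimates(ops=1000, mem=4096)
# e.ops = 0  # would raise FrozenInstanceError: frozen instances are immutable
```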
George Hotz
37fa38d272 Revert "switch beautiful_mnist to use new optimizer [pr] (#8231)" (#8233)
This reverts commit e9ee39df22.
2024-12-13 19:07:09 -08:00
George Hotz
e9ee39df22 switch beautiful_mnist to use new optimizer [pr] (#8231)
* switch beautiful_mnist to use new optimizer [pr]

* fix abstractions3 + docs

* fix OptimizerGroup with schedule_step api
2024-12-13 18:27:16 -08:00
Ahmed Harmouche
651f72442c encapsulate the exported webgpu model (#8203) 2024-12-13 10:55:37 +01:00
chenyu
64a917b7eb remove LAZYCACHE ContextVar [pr] (#8175)
also removed from resnet latest script
2024-12-11 22:02:52 -05:00
chenyu
26e049ab40 add ALLOWED_READ_IMAGE=2131 to openpilot (#8166)
added as an exact-number check for now, since it's not clear whether more or fewer than the allowed count is any better
2024-12-11 12:14:17 -08:00
Maxim Zakharov
e53a5bf0c3 Stable Diffusion UI - convenient send via Enter (#8160) 2024-12-11 19:05:24 +01:00
George Hotz
f83d715f41 move checks into compile3, delete compile2 [pr] (#8127)
* move checks into compile3 [pr]

* test_vs_onnx

* test v torch works

* float16 won't compile on compile3

* actually delete compile2
2024-12-09 14:21:42 -08:00
George Hotz
a773c5a571 hotfix: default llama3 is 1B with download_model 2024-12-09 07:23:35 -08:00
Ahmed Harmouche
c6277fce09 Remove f16 decompression lib from SD compile.py (#8121)
* Remove f16-to-f32-gpu lib, use tinygrad exported decompression

* No need to create new instance
2024-12-09 14:09:00 +01:00
George Hotz
00ac0db9d4 np tensors have the memory from numpy in compile3 [pr] (#8098) 2024-12-07 14:01:51 +08:00
George Hotz
22feb3a2f1 move copy into the JIT for openpilot compile3 (#7937)
* move copy into the JIT, test fails

* ahh, prune was the issue
2024-12-07 13:26:26 +08:00
Ahmed Harmouche
f3983f6743 Move efficientnet example (#8087) 2024-12-06 15:48:16 +01:00
Ahmed Harmouche
fad3eaa35e Use atomicLoad builtin when loading atomic type (#8084) 2024-12-06 15:33:11 +01:00
George Hotz
e2fe7f0d2f hotfix: actually fix pylint, it's a python 3.10 issue 2024-12-06 13:53:46 +08:00
George Hotz
b28d660172 update self_tokenize, fix pylint maybe 2024-12-06 13:49:41 +08:00
George Hotz
344fd4845c example: self_tokenize. someday tinygrad will be recursively self improving 2024-12-06 13:35:02 +08:00
leopf
65b6696f3b refactor safe_load (#8035)
* refactor safe_load

* cleanup
2024-12-06 12:08:21 +08:00
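safe_load maps a .safetensors file to a dict of Tensors; typical usage (the path is illustrative):

```
from tinygrad import nn
from tinygrad.nn.state import safe_load, load_state_dict

model = nn.Linear(4, 2)                      # any module with named parameters
state_dict = safe_load("model.safetensors")  # {param_name: Tensor}
load_state_dict(model, state_dict)           # copies into the model's tensors
```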
Ahmed Harmouche
c6f5bb03fa YoloV8 WebGPU fixes (#8057)
* Bump input size up to 416; show a notice if webgpu is not supported

* Minor fix in export_model
2024-12-05 16:23:45 +01:00
Francis Lata
c3187087f7 QwQ-32B-Preview support (#7962)
* load weights with some debugging

* start running a prompt

* cleanup

* optionally permute layers and cleanup

* add validation for simple prompt

* small cleanup

* minor cleanup with formatting download links

* add a longer prompt

* add timing option

* some typings

* remove unused arg

* reset GlobalCounters

* minor cleanups
2024-12-04 21:46:37 -05:00
Ahmed Harmouche
c9e7701417 Fast YoloV8 on WebGPU (#8036)
* Fast yolov8 with downscaled input

* Faster + FPS meter

* Add loader while model is downloading/compiling

* Title touchup
2024-12-04 15:23:09 +01:00
Ahmed Harmouche
db330a3110 Remove WebGL (#8012) 2024-12-03 16:02:53 +01:00
Ahmed Harmouche
8818046940 YoloV8 on WebGPU (#8007)
Port YoloV8 to WebGPU
2024-12-03 15:10:41 +01:00
Ahmed Harmouche
8909dbd82c Remove wgpu specific checks from stable diffusion example (#7991) 2024-12-02 11:31:14 +01:00
chenyu
3e2430f822 use tqdm tqdm in mlperf training (#7929)
due to an issue in benchmark dashboard logging, revert to tqdm's tqdm for now
2024-11-27 21:57:05 -05:00
Ahmed Harmouche
10618aba98 Bring back WebGPU (#7063)
* Start from andredaprato:webgpu-clean

* Fix infs

* inf wgsl function is not needed

* Emulated ulong for threefry, more tests passing

* Randomness tests passing

* Update model export to support new changes in webgpu, efficientnet export works again

* Simplify shift emulation in wgsl

* Delete test file

* Fix u32 literals bigger than u32 max

* Why was skip copies added here?

* Python3.12 for webgpu tests

* Fix model export syntax error

* Get test ops passing with some skips

* Fix lint

* Much simpler shift

* Run more tests

* Timestamp queries are not supported in CI, so skip search tests

* All fancy indexing passing

* r is ctx

* Run more dtype tests by using is_dtype_supported

* Cleanup ulong shift rendering

* UPat -> Pat, UOps -> Ops

* Pat -> UPat

* Refactor render_ushift if-else

* Pattern to avoid ulong mul

* Remove vals_dtype

* is_nan trick + rewrite, test_isnan passing

* Rewrite a * select(1, nan, gate) -> select(a, nan, gate) (see the sketch after this commit)

* No arg, just op

* Support char, uchar, short, ushort

* Run test_index_mnist now that we have uint8

* Fix pylint

* Save 3 lines by using base Compiler

* No more long emulation

* Remove fixup_binops

* No more external_local_bufx wgsl-specific cstyle modification, use base extra_pm

* Simpler, faster copyin/out

* Skip some new tests that use long

* Fix typo

* copyout touchup

* Save lines by using render_cast

* WebGL is not supported in core, delete it from is_dtype_supported

* More narrow test skips for some unary tests

* TernaryOps, UnaryOps -> Ops

* TinyGrad supports WebGPU

* StableDiffusion demo: f16tof32 gpu is a lib, update UI

* Packed load/store, no more scale_size, no core tinygrad changes

* Rename copyin, copyout

* Device -> dev

* Fix lint

* Pattern matcher rule for packed load/store

* Refactor

* Shorter packed load/store

* this should fix lint

* Fix mypy

* SD compile script working

* New SD webgpu UI

* New default prompt

* New SD weights

* Fix title when webgpu not available

* Run symbolic tests, simplify is_nan, use round_up

* Show step time on UI

* Bump minimum wgpu version to v0.19

* Fix latent

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-26 12:26:40 +08:00
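The a * select(1, nan, gate) -> select(a, nan, gate) rewrite in the commit above rests on a simple identity (WGSL's select(f, t, cond) returns t when cond is true): multiplying by 1 is a no-op when the gate is false, and nan absorbs the product when it is true. A scalar sketch of the identity, not the UOp pattern itself:

```
import math

def select(f, t, cond): return t if cond else f  # WGSL argument order

for a in (2.0, -3.5):
  for gate in (False, True):
    lhs = a * select(1.0, math.nan, gate)
    rhs = select(a, math.nan, gate)
    assert lhs == rhs or (math.isnan(lhs) and math.isnan(rhs))
```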
chenyu
631dc98b52 validate llama quantize output (#7901)
the Mac benchmark already runs quantize; this adds output validation
2024-11-25 16:46:23 -05:00
George Hotz
5d28a202b5 make tinychat local (#7871) 2024-11-24 14:45:48 +08:00
chenyu
22d5def113 download llama3 70B (#7868)
use "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF".
```
PYTHONPATH=. JITBEAM=2 python3 examples/llama3.py --download_model --size 70B --quantize int8 --benchmark
```

on an M4 Max, it takes 40 sec to load the model, and:
```
enqueue in 165.15 ms
total 328.54 ms, 3.04 tok/s, 247.46 GB/s, param 221.20 GB/s

enqueue in   5.31 ms
total 168.48 ms, 5.94 tok/s, 482.54 GB/s, param 431.34 GB/s

enqueue in   5.32 ms
total 168.77 ms, 5.93 tok/s, 481.71 GB/s, param 430.60 GB/s

enqueue in   5.69 ms
total 169.51 ms, 5.90 tok/s, 479.61 GB/s, param 428.72 GB/s

enqueue in   5.41 ms
total 168.60 ms, 5.93 tok/s, 482.20 GB/s, param 431.04 GB/s

enqueue in   5.18 ms
total 168.98 ms, 5.92 tok/s, 481.12 GB/s, param 430.08 GB/s

enqueue in   5.43 ms
total 168.82 ms, 5.92 tok/s, 481.59 GB/s, param 430.49 GB/s

enqueue in   5.27 ms
total 168.94 ms, 5.92 tok/s, 481.23 GB/s, param 430.17 GB/s
```
2024-11-23 12:18:31 -05:00
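A quick sanity check on those numbers: param bandwidth divided by throughput gives the bytes read per token, which should land near the int8 weight size of a 70B-parameter model:

```
param_gb_s, tok_s = 431.34, 5.94
print(param_gb_s / tok_s)  # ~72.6 GB/token, consistent with ~70B params at ~1 byte each after int8 quantize
```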
George Hotz
144e9f00df viz is local, new test, and new quantize [pr] (#7859)
* viz is local, new test, and new quantize [pr]

* fix mime types

* remove font

* after index
2024-11-23 14:27:10 +08:00
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
chenyu
69e382216d fix wino conv output dtype for half inputs (#7829) 2024-11-21 12:13:54 -05:00
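What that fix guarantees, sketched as a check; this assumes WINO=1 selects the Winograd path for 3x3 stride-1 convs:

```
from tinygrad import Tensor, dtypes

x = Tensor.rand(1, 4, 8, 8, dtype=dtypes.half)
w = Tensor.rand(4, 4, 3, 3, dtype=dtypes.half)
assert x.conv2d(w).dtype == dtypes.half  # output stays half, not promoted to float32
```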