tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-07 22:23:55 -05:00

Author	SHA1	Message	Date
George Hotz	fd6803000e	mutmut cfg (#13184 ) * mutmut cfg * coveragerc	2025-11-09 23:29:29 -08:00
chenyu	e8380968f2	add venv_sd_mlperf to gitignore (#12676 ) training stable diffusion stuff	2025-10-14 12:51:36 -04:00
qazal	40560e77c2	minor grouper + viz fixup [pr] (#10217 ) * minor grouper + viz fixup [pr] * gitignore mypy_cache * reorder create_kernels * replace with realized * use tensor_map + viz before spec * lint * add that back	2025-05-08 21:39:44 +03:00
qazal	16dfe0a902	upstream remu (#9921 )	2025-04-18 01:57:36 +03:00
geohotstan	0bed9b6cd2	benchmark huggingface onnx models (#8493 ) * add ability to ORT=1 * test_vs_ort * useless f * actually have benchmark take in modelproto for more flexibility in huggingface stuff * ok runs * good * oops fix benchmark_onnx __main__ * 224 as default * add ORT=1 option to huggingface_onnx * use Tensor to get_input * add abilty to do single onnx model testing * better names * merge properly... * copy in onnx_helpers * better * decent script * need to add debug tool first * new limit usage * why did narrowing_error come back.. * pretty decent * revert validate change * more ops bug fixes * revert unnecessary changes * fix InstanceNorm too * remove op from O4 * minimize diff * address old feedback * unsure of this, just revert * remove that assert * working attention * to_python_const Attention * cant init from np constant so just do this * final * fix bug in attention * attention clean ups * add hard TODOs and REPOPATH and TRUNCATE envvar * fix input_ids default value * final * fix scatter * cleaner _prepare_quantize * use new attention and tempfile for huggingface script * more stats * update * remove outdated code * big refactor to something usable by CI * booooooom * clean up * update to using yaml as env var input * add dry run * try * valid pad * use argparser and fix gather bug * ignore all yaml * tiny bit more polish * woah ignoring all yaml was not right * typo * decouple huggingface_onnx_run debug run with huggingface_onnx_download * bug fix for downloading single model * WOOOO ok much better * oops argparse 'required' is an invalid argument for positionals * oops argparse 'required' is an invalid argument for positionals * add assert * fix types --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-12 20:13:12 -04:00
qazal	95e0f069be	hotfix: gitignore *.log [pr] (#9410 )	2025-03-11 21:39:19 +01:00
hooved	304afe0d55	tinychat in browser, Part 3: browser app (#9276 ) * load llama3-1B to WEBGPU device * include compile script for loading llama3 to WEBGPU * parametrize max_context in build_transformer fxn * jit_model with two different args sets * compile for webgpu, split weights * load model weight parts in browser * export all tensors from initialized transformer * run transformer inference in browser * enable tiktoken with llama bpe in browser * count total tokens on client with tiktoken.js * full client-side chat streaming, eliminate server * revert change that enabled jitting with 2 argsets * llama without Variable or cache_kv, for webgpu * have client use mask tokens / whole context * cleanup staged weights * add tiktoken.js build script, README * export CLANG for Q6_k to float32 decompression * fix and test exported CLANG code for Q6_k to fp32 * revert changes to jit and export_model * isolate clang export * test Q6_K to float32 decompression in browser * gguf_load now also returns t_infos and data_start * prepare llama-1B Q6_K gguf chunks for browser * cache and decompress quantized llama in browser * enable separate deployment of large files * fix kv cache and symbolic with llama wgpu * eliminate browser lag during decompression * hash metadata and weight chunks * delete obsolete indexeddb cache to free disk * add progress bar, track model download/decompress * refactor progress callback * skip buffer hash verification for speed * Display progress for entire loading scope * Report page load errors to user * actually display errors * skip prompt tokens already seen by model * skip prefilling with last assistant message tokens * on page load tell user if webgpu not enabled * push deployed URL root to window.history * make note of bug sources with TODO items * isolate bug in CLANG with BEAM=2 * remove clang_bug.py from diff * decompress q6k to f32 on webgpu instead of clang * remove unused code * inter-weight decomp with larger wgpu kernels * parallelize decompression submissions * refactor dequantize scheduling * add progress bar back * fix bug * temp fix for loading GGUF Q6_K to fp16 not fp32 * fix rendering of exported CLANG * remove weight casts, sketch js functions for clang * get symbolic vars from jit_cache for model export * include symbolic vars in exported CLANG * render js for clang transformer * toggle clang/webgpu deployment; refactor decomp * compile and render clang Q6_K->fp16 and int8 quant * fix rendered clang for abs(fp16), to work in wasm * simplify clang js wrapping * run compiled clang in worker * prepare llama weights in workers, q6k to int8/fp16 * tinychat on clang in browser, f32/int8 weights * move wasm inference to (now flexible) worker * don't load redundant embeddings * modest wasm perf gain with compile flags * set default backend, enable backend choice/backup * render symbolic vars in exported WEBGPU * quantize webgpu llama to int8/f32 * improve UX arising from rendered WEBGPU * clean up webgpu launch * new weights split: smaller chunks, tinygrad quant. * switch webgpu inference to int8 quant * remove unneeded clang decompression * eliminate unneeded kv cache transfer to wasm * use 1 worker for simplified clang decompression * display launch errors * refactor: stream load weight chunks to WebGPU * show loading chunk completion * quantize embeddings to int8 * test float16 as input for quantization * webgpu: use f16 source, int8 embed, eliminate q6k * simplify split weights prep: all from state_dict * revert change to nn.state.gguf_load * remove unneeded decompression from webgpu client * remove unneeded code * decrease dl chunks from 47 to 16 MiB * improve stability of webgpu loading on mobile * autodetect mobile, improve load stability * refactor: progress closure * refactor: one unified progress bar * remove unneeded code * revert changes to tinygrad core library * enforce ios18.3 nerfed max buf size * BEAM=3 webgpu * cache integrity, mobile save throttling * improve mobile UX - no autozoom on prompt box * clang: int8 from f16, remove q6k * reduce concurrent dls on mobile to 2 for stability * refactor: wasm backend with stream loading * prevent race between wasm load and indexedb save * split wasm kernels into separate modules * js wrapper for multiple wasm module inference * revert multi-module wasm to single module * make mobile wasm load more stable/fast * refactor: copy weights into wasm without crashes * fix bug in download queue; increase mobile dls * refactor exported clang wrapper, split weights * remove unnecessary code * greatly improve int8 quant quality with rounding * eliminate mobile throttling * increase webgpu context to 4096 tokens * export webgpu js functions * enable separate hosted weights for mobile/pc * enable prompt-thread switching during generation * stop generation when max_context is reached * show progress bar for prefill * tell user if webgpu fails, while wasm loads * make loading messages more concise * update font * revert changes to tinychat python app launch * cleanup quantization, add scale_dtype param * cleanup kv cache code * cleanup compile code * link tok_embeddings with output in webgpu export * refactor: export_model webgpu: symbolic vars * refactor: export_model weight loading * forgot to commit export_model.py * change CLANG to CPU * deal with pylint incorrectly failing tests * simplify f-strings for older CI python version * fix pre-python3.12 parser errors * [Int32Array] not Int32Array * cleanup webgpu compile after refactor export_model * refactor WASM export into export_model * merge WebGPU/WASM compile scripts * simplify max_contexts for local deployment * fix parser issues and whitespace * deduplicate variable defs for non-wasm clang export * cleanup code * cleanup compile scripts * simplify wasm inference wrapping * simplify webgpu symbolic vars export * refactor: unify export of symbolic variables * simplify WASM export * simplify clang/wasm export * update README and build scripts * separate files for browser/python apps * restore original python tinychat app files * browser and python tinychats share assets * minor cleanup * isolate app layer diff * add .gitignore for generated files * validate CPU/WEBGPU models in python * prevent infinite generation if validation fails * check if exported weight files are unique --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-07 15:07:33 +08:00
George Hotz	05e5de6a91	ugh, remove that binary blob	2025-01-12 17:02:28 -08:00
qazal	24c7c41ce0	diff LazyBuffer schedules in process replay (#5996 ) * start diff printing * this should be 2 * add to process_replay.py * enable schedule capture * arange diff is process replay	2024-08-09 14:16:43 +03:00
wozeparrot	f33950f454	tracemeta fixups (#5904 )	2024-08-04 16:15:06 -07:00
George Hotz	e63701fbd4	RDNA3 assembly support (#3637 ) * amazing that i can use comgr for this * compile empty kernel * cleanups * tiny_add compiles * ugh * more work * put that in extra	2024-06-13 09:09:24 +02:00
George Hotz	fd02ab1e8b	move disassemblers and openpilot (#4592 ) * move disassemblers and openpilot * delete junk * put that in pre-commit * fixup readme	2024-05-14 19:30:02 -07:00
qazal	2094b3b327	graph ScheduleItems (#4224 ) * graph schedules * add logging * inplace --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-04-19 16:17:11 +04:00
Elias Wahl	6eef8ee22a	Wikipedia download script for MLPerf BERT training (#4202 ) * wikipedia download script * add link * checksum valueError * ops	2024-04-17 16:34:57 -04:00
George Hotz	8f749ae0eb	New docs are in mkdocs (#4178 ) * start mkdocs * simple docs for tensor * more docs * move those back * more docs * copy markdown extensions * docs legacy * docs building workflow * fix showcase links * only that? * install tinygrad * add docs to setup.py * Delete examples/llm.c/data	2024-04-16 10:59:51 +04:00
Francis Lam	9d2273235c	search: BEAM_UOPS_MAX to prune candidates with too many uops (#4088 ) * search: add better default settings for fast search not the highest possible performance, but adequate for most usage * search: revert BEAM_MIN_PROGRESS and BEAM_UPCAST_MAX default changes also sneak in a link to .gitignore for the unet3d dataset * revert BEAM_MAX_TASKS_PER_CHILD change and fix uops max condition	2024-04-15 18:56:22 -04:00
George Hotz	f916aadaea	external that test	2024-03-29 19:35:50 -07:00
reddyn12	9b5e15db6e	Mamba Implementation (#3456 ) * first commit * state back to orig * mamba comparisions * rm file * rename file * use Tensor.einsum and mke default model 370M * Cleaned code and made a comparision test * Simplyfy pull request. Only has 1 mamba implementation now. * Update prompt * rm whitespaces * last space * remove Einops dependency * rm unused code * add tests * rm print statement * rm imports * skip CLANG * Update skipIf description * skip model test in CI and add CLANG fix * rm Device import * don't be stupid * Fix conv assign When the prompt is too short, the logic for conv_state assign messes up. This can be fixed when padding the tokenized array to min length of 4. I padded using the empty string token, but idk if proper practice is to use the PAD token * fix p1 * temp * fix jit import --------- Co-authored-by: schlimeszn <schlimeszn@gmail.com> Co-authored-by: reddyn <nikidsniper@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-28 17:49:12 -07:00
George Hotz	2c6f2e899d	No extra vars call (#3054 ) * remove unused reciprocal * comment * remove unneeded call to vars * free speedup	2024-01-09 09:52:58 -08:00
George Hotz	77c98a1543	hotfix: remove weights directory	2024-01-03 13:40:39 -08:00
qazal	c704a77ca0	green dtypes ALU tests (#2617 ) * dtypes alu test * those types don't exist in torch * floats * more tests * disable those * a couple unary tests * skip float16 tests in CI for GPU * fix LLVM bool add True+True=1+1=2 which truncates to False in native LLVM * remove hardcoded float for LLVM ALU fns * less sensitive atol for fp32, 1e-10 is flaky and sometimes failed even if you revert the merge commit for non-fp32 math, nothing has changed in our kernels for fp32. * return on overflows * fix CUDA exp2 * compute results of op regardless of bounds in a python backend * skip fp16 in GPU and CUDACPU * fuzz a smaller range in the float_midcast_int32 test I sampled this and we overflow ~70% of the time. because numpy behaves differently on different devices for overflows and Metal seems to do the same, I'm opting to eliminate the non-determinism here * remove CUDA exp2 overload it's already there now --------- Co-authored-by: George Hotz <geohot@gmail.com>	2023-12-06 08:15:46 -08:00
George Hotz	2c363b5f0b	new style device (#2530 ) * cpu tests pass * torch works * works * metal works * fix ops_disk * metal jit works * fix openpilot * llvm and clang work * fix webgpu * docs are rly broken * LRU works on metal * delete comment * revert name to ._buf. LRU only on Compiled * changes * allocator * allocator, getting closer * lru alloc * LRUAllocator * all pass * metal * cuda * test examples * linearizer * test fixes * fix custom + clean realize * fix hip * skip tests * fix tests * fix size=0 * fix MOCKHIP * fix thneed * copy better * simple * old style metal copy * fix thneed * np reshape * give cuda a device	2023-11-30 17:07:16 -08:00
George Hotz	d87a246439	move to new cached fetch (#2493 ) * move to new cached fetch * extra.utils is over * loads * bump download cache * bump timeout	2023-11-28 17:36:55 -08:00
George Hotz	cbb8486779	ResNet training changes (update benchmark) (#2390 ) * default arg for chunk * bring back to_ * good changes * new set * unused hash * fix optim * new torch loader * fix test lr scheduler	2023-11-22 17:41:12 -08:00
Ahmed Harmouche	265304e7fd	Stable diffusion WebGPU port (#1370 ) * WIP: Stable diffusion WebGPU port * Load whole model: split safetensor to avoid Chrome allocation limit * Gitignore .DS_Store, remove debug print * Clip tokenizer in JS * WIP: Compile model in parts (text model, diffusor, get_x_prev_and_pred_x0, decoder), and recreate forward logic in JS * e2e stable diffusion flow * Create initial random latent tensor in JS * SD working e2e * Log if some weights were not loaded properly * Remove latent_tensor.npy used for debugging * Cleanup, remove useless logs * Improve UI * Add progress bar * Remove .npy files used for debugging * Add clip tokenizer as external dependency * Remove alphas_cumprod.js and load it from safetensors * Refactor * Simplify a lot * Dedup base when limiting elementwise merge (webgpu) * Add return type to safe_load_metadata * Do not allow run when webgpu is not supported * Add progress bar, refactor, fix special names * Add option to chose from local vs huggingface weights * lowercase tinygrad :) * fp16 model dl, decompression client side * Cache f16 model in browser, better progress * Cache miss recovery --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-11-03 18:29:16 -07:00
George Hotz	16ca8410f8	op logger + replay (#2021 ) * logops * fix dtype printing * needs inf * ops dataset * minor improvements * 12k kernels * opt can compile * graph flops	2023-10-08 15:10:18 -07:00
George Hotz	fdd7f282cb	Reenable tensor cores for self-hosted Mac CI (#1717 ) * debug 5 matmul * allow tensor cores in CI * tensor cores on arm64 * put debug back	2023-08-30 07:53:04 -07:00
George Hotz	739f327d2d	Shorter (#1582 ) * deleting lines * remove insert dims * if statement is never hit * bug fixes	2023-08-20 08:12:16 -07:00
wozeparrot	7e7c9001e9	distributed world (#1481 ) * feat: world * feat: tests * feat: no more backwards * feat: recv into * feat: whoops * feat: test in ci * feat: some debug logging * feat: workflow naming * feat: need to set pythonpath * feat: just send to same device	2023-08-10 10:00:51 -07:00
George Hotz	bf21aec81f	do benchmarking (#1451 ) * do benchmarking * system * artifact * go * name artifact	2023-08-05 23:35:01 -07:00
George Hotz	fc2303e520	gitignore in weights	2023-08-02 16:26:41 +00:00
Diogo	4dc8595069	simple exporting models (#1344 ) * unified exporting * json exporting * ignore more * simplified buffer export * added dtypes * added assert * swift example * fix tests * linter * remove whitespace * fixed tests * remove swift example * remove unintended changes * allow callable models to be used * whitespace * more readable json export * name change * whitespace * whitespace	2023-08-01 09:35:48 -07:00
Yixiang Gao	6e62dcfbf3	add check global dim limit in linearizer (#1299 ) * need a better place for reshape and permute * add permutation * cuda fixed * clean up * enable nvidia GPU with global max * fix order * fix CI * add check for global dim limit but need refactor * refactor * fix ignore	2023-07-31 11:14:54 -07:00
George Hotz	d6637623e3	torch test touchup	2023-07-19 09:37:23 -07:00
Diogo	a9a1df785f	Webgpu support (#1077 ) * initial commit * 81 passing * 105 passing tests * 148 passing * CI tests * install dep on ci * try opencl pkgs * try using vulkan * down to only 6 failing * refactor * cleaning up * another test skipped due to buffer limit * linter * segfault * indent fix * another segfault found * small touchups * Fix max and maxpool tests * Add constant folding * Add javascript export script * better asserts in codegen * manual upcasting * reverted token type change * skip safetensor test due to unsupported type * FIx efficientnet and all other model tests * Remove np copy * fixed indent and missing import * manually destroy the buffer * revert back to length * linter errors * removed extra val * skip broken tests * skipping more tests * Make the page pretty * Save model weights as safetensor * Fix imagenet to c test * Fix second imagenet to c bug * Async and paralel kernel compilation * workgroup support * reversed local size * fixed non local bug * correct local groups * ci experiment * removed typo * Fix define local by using shared memory * Refactor * try running on mac * match metal tests * add more workers * scope down tests * trying windows runner * fixed windows env * see how many it can do * merged master * refactor * missed refactor * increase test suite coverage * missing import * whitespace in test_efficientnet.py * getting there * fixed reset * fixed bufs * switched to cstyle * cleanup * min/max rename * one more linter issue * fixed demo * linter * testing ci chrome * add unsafe webgpu arg * add build step * remove WEBGPU from cmd line * use module * try forcing directx * trying forced metal backend * temp disable conv2d for CI * disable conv_trasnpose2d --------- Co-authored-by: 0x4d - Martin Loretz <20306567+martinloretzzz@users.noreply.github.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-07-12 12:52:06 -07:00
terafo	aa60feda48	Fix naming conflict with huggingface datasets (#1161 ) * Rename in files * Move files * Moved to extra/datasets as suggested * Changes to files * Fixed stupid mistake --------- Co-authored-by: terafo <terafo@protonmail.com>	2023-07-07 10:43:44 -07:00
George Hotz	1f5d45ca8c	imagenet loader minor cleanups	2023-06-28 05:08:09 +00:00
George Hotz	b78addf2f8	Whisper (#919 ) * no whispering yet * whispering * live whisper * small support	2023-06-03 18:55:14 -07:00
Diogo	1a5d72f812	Onnx ops And, Or, Xor, Not (#847 ) * onnx and, or, xor, not * added bool type to llvm and clang * removed float conversion * switched where op to use tensor func	2023-05-29 11:09:20 -07:00
Jacky Lee	5d212864b5	Add MLPerf UNet3D model (#775 ) * Add ResNet inference test and cannon * Test with ResNet50 * test_car works with resnet fix * Add KiTS19 dataset * KiTS19: Implement iterate * No batch load for this dataset * Save results on iterate * Implement dice score * Add data prep and eval functions * Resolve shape issue * Conversion works but wrong values * Segfaults when load_from_pretrained is called * Fix segfault and assign properly * Final result generated, though very slow * Store and load final result to save time * Fix typo in finalize * Score computes * More bug fixes, dice score is very low * Working broken code * Assign output values to result * Getting a much higher score now * Fix dataset preprocessing * Mean DICE score of 88.5 * Ugh, typo * Attempt to reimplement model * Rename layers * Tiny model works, kinda * Accuracy? gone * Implement InstanceNorm and match torch * Test instance norm 2d and 3d * Combined input block with downsample block * Tiny model works, support strided convtranspose * Commands to download dataset * Clean up a bit * unet3d_v2 -> unet3d * Remove duplicated code * Oops, put tests back	2023-05-28 20:38:19 -07:00
George Hotz	46327f7420	bugfix for stable diffusion	2023-05-29 00:03:09 +00:00
George Hotz	59f9bcd4a4	Disktensors! (#819 ) * make empty a real thing * start ops_disk * disk tensor works * interpreted cleanup * slice write to disk * preprocess imagenet * fix custom function	2023-05-28 15:40:37 -07:00
wozeparrot	67de3aa1de	Add mlperf bert model (#803 ) * feat: add mlperf bert model * feat: switch to nn.Embedding * clean+fix: fix formatting * feat: add simple downloader * feat: metrics * feat: don't actually need exact match * feat: doing a run * feat: set eps on the layernorms * clean+fix: cleaner impl + hopefully fixed * feat: move dataset initialization into iterate * feat: move tokenizer out of iterate * clean+fix: cleaner + working * clean: cleanup * fix: fix metrics * feat: need to use original bert gelu + download vocab * feat: make directory if it doesn't exist yet * feat: jit go brrr	2023-05-27 14:53:32 -07:00
George Hotz	a968c4c3a4	Cleanup mlperf (#797 ) * improve factorization * cleanups	2023-05-25 11:36:43 -07:00
wozeparrot	01ae45a43c	Add mlperf RNN-T model (#782 ) * feat: initial rnn-t * feat: working with BS>1 * feat: add lstm test * feat: test passing hidden * clean: cleanup * feat: specify start * feat: way faster lstm & model * fix: default batch size * feat: optimization * fix: fix metrics * fix: fix feature splicing * feat: cleaner stacktime * clean: remove unused import * clean: remove extra prints * fix: fix tests and happy llvm * feat: have the librispeech dataset in its own dir * clean: unused variable * feat: no longer need numpy for the embedding + slightly more memory efficient lstm * fix: forgot to remove something that broke tests * feat: use relative paths * feat: even faster * feat: remove pointless transposes in StackTime * fix: correct forward * feat: switch to soundfile for loading and fix some leaks * feat: add comment about initial dataset setup * feat: jit more things * feat: default batch size back to 1 larger than 1 is broken again :( and even in the reference implementation it gives worse results	2023-05-25 00:41:21 -07:00
George Hotz	f28df9900f	multidevice works (#763 ) * basic multigpu working * better multigpu test * upper * touchups * cl sync	2023-05-04 01:04:58 -07:00
Joqsan	0b9d4126d0	Add Tensor.stack() and Tensor.repeat() (...trying to make einops work with tinygrad) (#758 ) * add stack() and repeat() methods * make stack a static method	2023-05-01 09:37:46 -07:00
George Hotz	1240c12ac5	download cifar to datasets dir	2023-03-29 12:25:42 +04:00
George Hotz	1a039306d2	good changes from llama branch (#671 ) * good changes from llama * transpose behavior changed	2023-03-09 20:51:22 -08:00
George Hotz	b1ba78ac38	move applegpu disassembler	2023-03-05 11:21:12 -08:00

1 2

63 Commits