tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-13 17:08:11 -05:00

Author	SHA1	Message	Date
George Hotz	68053d0510	dsp stuff / sniff ioctls from snpe (#9490 ) * sniff ioctls from snpe * dump input buffers * snpe logs from dsp * NHWC support * knum 3 * this run? * revert those --------- Co-authored-by: Comma Device <device@comma.ai>	2025-03-20 10:38:23 +08:00
geohotstan	8c0d0a122c	Add return_indices to max_pool (#9506 ) * wow argmax is so good * 1 less line * clean up and better variable names * is this torch thing right...? * add more tests * slap a TODO on it * clean ups * prettier looking code and fix ceil mode test * add return types and some docs * ok that was a bad example since indices == value, just no example	2025-03-19 15:25:37 -04:00
Francis Lam	1e5d9ad8f7	extra/gemm/max_matmul: start of custom kernels for GEMM (#6926 ) * extra/gemm/max_matmul: start of custom kernels for GEMM * add an unoptimized FP16/FP16 MMA example * add slow 3-stage fp16 acc example * add correct 3-stage pipeline with unswizzled/flat smem input (slow) * add acc fp16 example with 3 stages and swizzle (no bank conflicts) * add max version of NV fp16_fp16_fp16 * fix up comments and removed unused code in max variations * add start of no_xor example * fix to account for UOps to Ops	2025-03-19 15:04:57 +08:00
b1tg	a95b489a55	nanoGPT train works with tiny torch backend (#9283 ) * train_shakespeare_char.py works * move aten.where.self_out to tiny_backend_out * fix memory leak * corealize in the backward_hook * Update backend.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-19 11:51:02 +08:00
George Hotz	117b7a16ef	VALIDATE_WITH_CPU [pr] (#9488 ) * VALIDATE_WITH_CPU [pr] * fix test	2025-03-18 15:15:04 +08:00
nimlgen	a82c9332d3	am: rename soc21 to soc (#9482 )	2025-03-18 08:54:26 +08:00
Anish Umale	5e58f4b65b	Tiny backend test_ops fix part 3 (#9483 ) * extract straightforward things from https://github.com/tinygrad/tinygrad/pull/9302 * pass dtype and device for ones_like	2025-03-17 18:01:51 -04:00
TJ	9fcef4d009	add masked_select to tensor.py (#9468 ) * add masked_select to tensor.py * fix tests --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-17 16:05:36 -04:00
geohotstan	53d6f1e1bb	Add bitonic cat sort (#9422 ) * poc * repeated values fail, sigh * is this being timed out? * fix up down names * bitonic v2, does this run? * bitonic v3, faster * bitonic v3.1, faster * bitonic v3.1.1, same speed unlucky * support dim and indices * bitonic v3.2, simpler code, TODO repeated indices * bruv gimme green for once cmon * cat (stack) implementation, slow but maybe one day when cat is fast meow * revert to v3.2 * bitonic v4, who let the cats out edition * clean up variable names * figured out repeated indices :D * ruff check --fix * use sort for topk * add Tensor.sort everywhere * fix docs and add some types * slightly better variable names * am I doing torch inplace correctly? * delegate sort to values_stable * add a contig, faster first sort * maybe don't test_inplace --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-17 12:01:23 -04:00
George Hotz	824c5f41ac	dsp work try 3 (#9475 ) * dsp work try 3 * padding	2025-03-17 16:42:12 +08:00
George Hotz	52ae9af4dd	Fast DSP for MobileNetV2 (try 2) (#9467 ) * Fast DSP for MobileNetV2 (try 2) * enable fast path on uchar * fix tests	2025-03-17 15:10:36 +08:00
George Hotz	09e7708b49	minimum change for rdna4 [pr] (#9455 )	2025-03-16 13:39:24 +08:00
George Hotz	cb7a7f69c7	quantization preprocessor from DSP, should be universal (#9437 ) * quantization preprocessor from DSP, should be universal * touchups * fix tests	2025-03-15 07:49:37 +08:00
chenyu	0e591baf43	redo simple_matmul change (#9450 ) numpy does not support bfloat16	2025-03-14 17:53:52 -04:00
chenyu	b0f63d3c04	Revert "`simple_matmul.py` uses np to generate random (#9438 )" (#9449 ) This reverts commit `14018050c1`.	2025-03-14 17:14:22 -04:00
Ignacio Sica	14018050c1	`simple_matmul.py` uses np to generate random (#9438 ) * np generates randoms * hotfix: use generator for int dtype * float32 as default dtype for float generator * use np.float32 instead of stirng * add dtype= to integers generator * change import _to_np_dtype source	2025-03-14 17:36:50 -03:00
geohotstan	0bed9b6cd2	benchmark huggingface onnx models (#8493 ) * add ability to ORT=1 * test_vs_ort * useless f * actually have benchmark take in modelproto for more flexibility in huggingface stuff * ok runs * good * oops fix benchmark_onnx __main__ * 224 as default * add ORT=1 option to huggingface_onnx * use Tensor to get_input * add abilty to do single onnx model testing * better names * merge properly... * copy in onnx_helpers * better * decent script * need to add debug tool first * new limit usage * why did narrowing_error come back.. * pretty decent * revert validate change * more ops bug fixes * revert unnecessary changes * fix InstanceNorm too * remove op from O4 * minimize diff * address old feedback * unsure of this, just revert * remove that assert * working attention * to_python_const Attention * cant init from np constant so just do this * final * fix bug in attention * attention clean ups * add hard TODOs and REPOPATH and TRUNCATE envvar * fix input_ids default value * final * fix scatter * cleaner _prepare_quantize * use new attention and tempfile for huggingface script * more stats * update * remove outdated code * big refactor to something usable by CI * booooooom * clean up * update to using yaml as env var input * add dry run * try * valid pad * use argparser and fix gather bug * ignore all yaml * tiny bit more polish * woah ignoring all yaml was not right * typo * decouple huggingface_onnx_run debug run with huggingface_onnx_download * bug fix for downloading single model * WOOOO ok much better * oops argparse 'required' is an invalid argument for positionals * oops argparse 'required' is an invalid argument for positionals * add assert * fix types --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-12 20:13:12 -04:00
Priyank Patel	4714c4f9ad	torch backend multigpu - add devices and tests (#9414 ) * add multi-device support and tests * simplify	2025-03-12 11:33:11 +08:00
uuuvn	e85001b6ee	SQTT profiling (#9278 ) * sqtt * docs * multi-device * ProfileSQTTEvent * exec update * 256mb default * don't let people hang their gpus * bitfields from autogen * asic info from mesa * more bitfields from autogen * SQTT_ITRACE_SE_MASK --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-11 13:19:56 +08:00
Priyank Patel	beed00eabe	fix torch backend memory leak (#9395 ) * fix leak, realize everything on torch optim step * only realize a subset --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-11 10:48:20 +08:00
chenyu	01e8b60911	acc_dtype -> dtype (#9402 ) matched numpy and torch	2025-03-10 16:05:30 -04:00
Priyank Patel	796c3bbb23	torch: support in-place operations on views (#9371 ) * add torch inplace tests * first set of tests passing * wrap all inplace funcs, add more tests * fixes and wrap more functions * fix all uint8 tests to avoid slow tests * fix the one test * another test, another fix * and one more, works for ddp now * something on contiguous, cleanup --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2025-03-10 23:29:00 +08:00
George Hotz	25847080f0	olmoe (from stream, wip) (#9390 ) * olmoest working (but not) * it's correct * compare ropes * old code wasn't wrong * default device * no metal * fix permute * working * more minimal	2025-03-10 13:46:33 +08:00
geohotstan	1d64c12f2b	add Topk to tensor (#9343 ) * terrible but somewhat working impl * linux behaves differently than macos? * slightly better impl * small clean up; haven't figured this out yet * better * torch has different behavior on linux and macos for duplicated values * add sum docs * fix test * add torch return_type test * add an exception test * wrap_fxn instead, and move op lower in order * better repeated values test * rerun ci	2025-03-09 20:01:42 -04:00
geohotstan	088d86691b	fix onnx gather and onnx auto_pad VALID mode (#9375 ) * fix gather and auto_pad * long -> int64	2025-03-07 10:27:23 -05:00
uuuvn	b75f307234	amd: autogen ip bases (#9360 )	2025-03-05 22:30:38 +03:00
Anish Umale	b3ac60ce53	Fix test_ops for tiny backend part 2 (#9358 ) * extact functions from https://github.com/tinygrad/tinygrad/pull/9302 * revert gather and add aten.elu_backward * address review --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-05 13:38:40 -05:00
Priyank Patel	f048256341	fix TORCH_DEBUG=1 sigsegv (#9352 )	2025-03-05 12:24:53 +03:00
nimlgen	993ef42bd5	am: hdp cg (#9346 )	2025-03-04 20:44:09 +03:00
hooved	01f7a4fadc	tinychat in browser, Part 2: model export (#9274 ) * load llama3-1B to WEBGPU device * include compile script for loading llama3 to WEBGPU * parametrize max_context in build_transformer fxn * jit_model with two different args sets * compile for webgpu, split weights * load model weight parts in browser * export all tensors from initialized transformer * run transformer inference in browser * enable tiktoken with llama bpe in browser * count total tokens on client with tiktoken.js * full client-side chat streaming, eliminate server * revert change that enabled jitting with 2 argsets * llama without Variable or cache_kv, for webgpu * have client use mask tokens / whole context * cleanup staged weights * add tiktoken.js build script, README * export CLANG for Q6_k to float32 decompression * fix and test exported CLANG code for Q6_k to fp32 * revert changes to jit and export_model * isolate clang export * test Q6_K to float32 decompression in browser * gguf_load now also returns t_infos and data_start * prepare llama-1B Q6_K gguf chunks for browser * cache and decompress quantized llama in browser * enable separate deployment of large files * fix kv cache and symbolic with llama wgpu * eliminate browser lag during decompression * hash metadata and weight chunks * delete obsolete indexeddb cache to free disk * add progress bar, track model download/decompress * refactor progress callback * skip buffer hash verification for speed * Display progress for entire loading scope * Report page load errors to user * actually display errors * skip prompt tokens already seen by model * skip prefilling with last assistant message tokens * on page load tell user if webgpu not enabled * push deployed URL root to window.history * make note of bug sources with TODO items * isolate bug in CLANG with BEAM=2 * remove clang_bug.py from diff * decompress q6k to f32 on webgpu instead of clang * remove unused code * inter-weight decomp with larger wgpu kernels * parallelize decompression submissions * refactor dequantize scheduling * add progress bar back * fix bug * temp fix for loading GGUF Q6_K to fp16 not fp32 * fix rendering of exported CLANG * remove weight casts, sketch js functions for clang * get symbolic vars from jit_cache for model export * include symbolic vars in exported CLANG * render js for clang transformer * toggle clang/webgpu deployment; refactor decomp * compile and render clang Q6_K->fp16 and int8 quant * fix rendered clang for abs(fp16), to work in wasm * simplify clang js wrapping * run compiled clang in worker * prepare llama weights in workers, q6k to int8/fp16 * tinychat on clang in browser, f32/int8 weights * move wasm inference to (now flexible) worker * don't load redundant embeddings * modest wasm perf gain with compile flags * set default backend, enable backend choice/backup * render symbolic vars in exported WEBGPU * quantize webgpu llama to int8/f32 * improve UX arising from rendered WEBGPU * clean up webgpu launch * new weights split: smaller chunks, tinygrad quant. * switch webgpu inference to int8 quant * remove unneeded clang decompression * eliminate unneeded kv cache transfer to wasm * use 1 worker for simplified clang decompression * display launch errors * refactor: stream load weight chunks to WebGPU * show loading chunk completion * quantize embeddings to int8 * test float16 as input for quantization * webgpu: use f16 source, int8 embed, eliminate q6k * simplify split weights prep: all from state_dict * revert change to nn.state.gguf_load * remove unneeded decompression from webgpu client * remove unneeded code * decrease dl chunks from 47 to 16 MiB * improve stability of webgpu loading on mobile * autodetect mobile, improve load stability * refactor: progress closure * refactor: one unified progress bar * remove unneeded code * revert changes to tinygrad core library * enforce ios18.3 nerfed max buf size * BEAM=3 webgpu * cache integrity, mobile save throttling * improve mobile UX - no autozoom on prompt box * clang: int8 from f16, remove q6k * reduce concurrent dls on mobile to 2 for stability * refactor: wasm backend with stream loading * prevent race between wasm load and indexedb save * split wasm kernels into separate modules * js wrapper for multiple wasm module inference * revert multi-module wasm to single module * make mobile wasm load more stable/fast * refactor: copy weights into wasm without crashes * fix bug in download queue; increase mobile dls * refactor exported clang wrapper, split weights * remove unnecessary code * greatly improve int8 quant quality with rounding * eliminate mobile throttling * increase webgpu context to 4096 tokens * export webgpu js functions * enable separate hosted weights for mobile/pc * enable prompt-thread switching during generation * stop generation when max_context is reached * show progress bar for prefill * tell user if webgpu fails, while wasm loads * make loading messages more concise * update font * revert changes to tinychat python app launch * cleanup quantization, add scale_dtype param * cleanup kv cache code * cleanup compile code * link tok_embeddings with output in webgpu export * refactor: export_model webgpu: symbolic vars * refactor: export_model weight loading * forgot to commit export_model.py * change CLANG to CPU * deal with pylint incorrectly failing tests * simplify f-strings for older CI python version * fix pre-python3.12 parser errors * [Int32Array] not Int32Array * cleanup webgpu compile after refactor export_model * refactor WASM export into export_model * merge WebGPU/WASM compile scripts * simplify max_contexts for local deployment * fix parser issues and whitespace * deduplicate variable defs for non-wasm clang export * cleanup code * cleanup compile scripts * simplify wasm inference wrapping * simplify webgpu symbolic vars export * refactor: unify export of symbolic variables * simplify WASM export * simplify clang/wasm export * update README and build scripts * separate files for browser/python apps * restore original python tinychat app files * browser and python tinychats share assets * minor cleanup * isolate compile/export model --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2025-03-04 15:53:30 +08:00
chenyu	019417743c	ruff torch backend (#9341 )	2025-03-03 15:15:23 -05:00
nimlgen	f9e4c638f1	torch_hook fixes (#9334 )	2025-03-03 23:07:30 +03:00
Anish Umale	bafa40fe12	Tiny backend test_ops fix part1 (#9338 ) * extract name methods from https://github.com/tinygrad/tinygrad/pull/9302 * t.grad.numpy() -> t.grad.cpu().numpy() * revert TORCH_DEBUG change * revert dtype change in aten.sum	2025-03-03 12:36:51 -05:00
Friedrich Carl Eichenroth	b4028e48ae	Torch Backend Refinement (#9327 ) * fix some torch tests * fixup * small change * fixup * fix test * use default function * add todo * bunch of small changes * fix tests * more tests * fix * fix * test fix * simplify	2025-03-03 10:24:02 -05:00
George Hotz	a73d8717f3	fast amd gemm (#9318 ) * 50 TFLOP AMD gemm * add lds tiling * register tiling * flip locals * work * comment * remove those	2025-03-03 12:01:14 +08:00
chenyu	ba4b8c2c23	Tensor.copysign (#9329 )	2025-03-02 21:33:49 -05:00
Friedrich Carl Eichenroth	06ef9cc9f4	aten leaky_relu, div.out_mode, clamp_max, clamp_min, copysign (#9323 ) * fix some torch tests * fixup * small change * fixup * fix test * use default function * add todo	2025-03-02 19:12:16 -05:00
nimlgen	91c421fb7d	adaptive am_smi (#9319 )	2025-03-02 15:45:07 +03:00
geohotstan	d9ec05cea6	Test Onnx quantization behavior (#9301 ) * add DynamicDequantizeLinear and corresponding tests * wow qlinearops are round away from zero * this passes locally... * again * try * try separate test * round to even again * also add QLinearMul --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-03-01 19:21:58 -05:00
Priyank Patel	f4148ac46a	torch fix casting and add ops for sd vae(s) (#9297 ) * torch fix copy casting and add upsample op * update cast and add test * fix lint * add pad for sdxl vae to work	2025-03-01 08:49:10 -05:00
chenyu	38d7aae3b7	onnx fmod (#9307 )	2025-02-28 14:09:22 -05:00
chenyu	3ae66e59a3	least_upper_float is at least default_float (#9303 ) * least_upper_float is at least default_float en route for div rounding mode. dtype of true int division would change from int32 to default_float, which matches torch too. * fix bert acc	2025-02-28 10:41:56 -05:00
nimlgen	052722a7bc	torch hook: address comments (#9295 ) * torch hook: address comments * failed test	2025-02-28 11:51:52 +03:00
Priyank Patel	8ae215dd3d	torch backend fix manual seed warning (#9292 )	2025-02-28 13:45:32 +08:00
George Hotz	ac40316692	hotfix: group cpu functions in torch backend	2025-02-28 10:39:00 +08:00
George Hotz	b32595dbbc	torch examples (#9290 ) * torch, fix examples/mnist * fix vae torch example * where out	2025-02-28 10:16:06 +08:00
hooved	3b9950241e	tinychat in browser, Part 1: llama (#9273 ) * load llama3-1B to WEBGPU device * include compile script for loading llama3 to WEBGPU * parametrize max_context in build_transformer fxn * jit_model with two different args sets * compile for webgpu, split weights * load model weight parts in browser * export all tensors from initialized transformer * run transformer inference in browser * enable tiktoken with llama bpe in browser * count total tokens on client with tiktoken.js * full client-side chat streaming, eliminate server * revert change that enabled jitting with 2 argsets * llama without Variable or cache_kv, for webgpu * have client use mask tokens / whole context * cleanup staged weights * add tiktoken.js build script, README * export CLANG for Q6_k to float32 decompression * fix and test exported CLANG code for Q6_k to fp32 * revert changes to jit and export_model * isolate clang export * test Q6_K to float32 decompression in browser * gguf_load now also returns t_infos and data_start * prepare llama-1B Q6_K gguf chunks for browser * cache and decompress quantized llama in browser * enable separate deployment of large files * fix kv cache and symbolic with llama wgpu * eliminate browser lag during decompression * hash metadata and weight chunks * delete obsolete indexeddb cache to free disk * add progress bar, track model download/decompress * refactor progress callback * skip buffer hash verification for speed * Display progress for entire loading scope * Report page load errors to user * actually display errors * skip prompt tokens already seen by model * skip prefilling with last assistant message tokens * on page load tell user if webgpu not enabled * push deployed URL root to window.history * make note of bug sources with TODO items * isolate bug in CLANG with BEAM=2 * remove clang_bug.py from diff * decompress q6k to f32 on webgpu instead of clang * remove unused code * inter-weight decomp with larger wgpu kernels * parallelize decompression submissions * refactor dequantize scheduling * add progress bar back * fix bug * temp fix for loading GGUF Q6_K to fp16 not fp32 * fix rendering of exported CLANG * remove weight casts, sketch js functions for clang * get symbolic vars from jit_cache for model export * include symbolic vars in exported CLANG * render js for clang transformer * toggle clang/webgpu deployment; refactor decomp * compile and render clang Q6_K->fp16 and int8 quant * fix rendered clang for abs(fp16), to work in wasm * simplify clang js wrapping * run compiled clang in worker * prepare llama weights in workers, q6k to int8/fp16 * tinychat on clang in browser, f32/int8 weights * move wasm inference to (now flexible) worker * don't load redundant embeddings * modest wasm perf gain with compile flags * set default backend, enable backend choice/backup * render symbolic vars in exported WEBGPU * quantize webgpu llama to int8/f32 * improve UX arising from rendered WEBGPU * clean up webgpu launch * new weights split: smaller chunks, tinygrad quant. * switch webgpu inference to int8 quant * remove unneeded clang decompression * eliminate unneeded kv cache transfer to wasm * use 1 worker for simplified clang decompression * display launch errors * refactor: stream load weight chunks to WebGPU * show loading chunk completion * quantize embeddings to int8 * test float16 as input for quantization * webgpu: use f16 source, int8 embed, eliminate q6k * simplify split weights prep: all from state_dict * revert change to nn.state.gguf_load * remove unneeded decompression from webgpu client * remove unneeded code * decrease dl chunks from 47 to 16 MiB * improve stability of webgpu loading on mobile * autodetect mobile, improve load stability * refactor: progress closure * refactor: one unified progress bar * remove unneeded code * revert changes to tinygrad core library * enforce ios18.3 nerfed max buf size * BEAM=3 webgpu * cache integrity, mobile save throttling * improve mobile UX - no autozoom on prompt box * clang: int8 from f16, remove q6k * reduce concurrent dls on mobile to 2 for stability * refactor: wasm backend with stream loading * prevent race between wasm load and indexedb save * split wasm kernels into separate modules * js wrapper for multiple wasm module inference * revert multi-module wasm to single module * make mobile wasm load more stable/fast * refactor: copy weights into wasm without crashes * fix bug in download queue; increase mobile dls * refactor exported clang wrapper, split weights * remove unnecessary code * greatly improve int8 quant quality with rounding * eliminate mobile throttling * increase webgpu context to 4096 tokens * export webgpu js functions * enable separate hosted weights for mobile/pc * enable prompt-thread switching during generation * stop generation when max_context is reached * show progress bar for prefill * tell user if webgpu fails, while wasm loads * make loading messages more concise * update font * revert changes to tinychat python app launch * cleanup quantization, add scale_dtype param * cleanup kv cache code * cleanup compile code * link tok_embeddings with output in webgpu export * refactor: export_model webgpu: symbolic vars * refactor: export_model weight loading * forgot to commit export_model.py * change CLANG to CPU * deal with pylint incorrectly failing tests * simplify f-strings for older CI python version * fix pre-python3.12 parser errors * [Int32Array] not Int32Array * cleanup webgpu compile after refactor export_model * refactor WASM export into export_model * merge WebGPU/WASM compile scripts * simplify max_contexts for local deployment * fix parser issues and whitespace * deduplicate variable defs for non-wasm clang export * cleanup code * cleanup compile scripts * simplify wasm inference wrapping * simplify webgpu symbolic vars export * refactor: unify export of symbolic variables * simplify WASM export * simplify clang/wasm export * update README and build scripts * separate files for browser/python apps * restore original python tinychat app files * browser and python tinychats share assets * minor cleanup * isolate diffs to llama files * minor cleanup * set default scale_dtype * set default scale_dtype for NF4 quantization * make quantization of tok_embeds optional * match output with tok_embeds if not quantizing * minor change	2025-02-27 15:57:37 -05:00
chenyu	184030168d	fix aten.reflection_pad2d (#9289 ) tested the torch doc example	2025-02-27 15:53:46 -05:00
chenyu	0de6585df0	fix aten.normal_ arg (#9288 ) should be mean and std.	2025-02-27 15:36:25 -05:00
chenyu	8ee2b460ee	Tensor.var_mean (#9287 )	2025-02-27 15:15:31 -05:00

... 6 7 8 9 10 ...

1363 Commits