Commit Graph

10417 Commits

Author SHA1 Message Date
nimlgen
f995b465b8 am: set doorbell offsets to nb (#9413) 2025-03-12 10:35:47 +08:00
qazal
95e0f069be hotfix: gitignore *.log [pr] (#9410) 2025-03-11 21:39:19 +01:00
nimlgen
78ebade125 Merge pull request #9408 from nimlgen/hcq_progress_during_wait
hcq: reset timer on progress in signal.wait
2025-03-11 19:40:23 +08:00
George Hotz
e174c6c3bc new devectorizer (#9331)
* new devectorizer

* lidx

* test linearizer passes

* fix images

* fix unfoldable image load

* delete unused

* improve fix_unfoldable_image_load

* working for image

* fixup types

* fixup transcendental

* cast_vec

* cleaner transcendental

* skip failing test

* err, flip that

* not devec

* sqrt
2025-03-11 18:47:56 +08:00
qazal
69fac5fe89 Merge pull request #9407 from tinygrad/no_const_after_sym
no const/view in schedule sink after sym [pr]
2025-03-11 12:24:09 +02:00
nimlgen
4d09ea4c06 hcq: reset timer on progress in signal.wait 2025-03-11 10:02:14 +00:00
qazal
fa69fd3afc no const/view in schedule sink after sym [pr] 2025-03-11 10:58:38 +01:00
George Hotz
68f062c8be cast_vec on transcendental (#9406) 2025-03-11 17:30:46 +08:00
uuuvn
e85001b6ee SQTT profiling (#9278)
* sqtt

* docs

* multi-device

* ProfileSQTTEvent

* exec update

* 256mb default

* don't let people hang their gpus

* bitfields from autogen

* asic info from mesa

* more bitfields from autogen

* SQTT_ITRACE_SE_MASK

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 13:19:56 +08:00
George Hotz
2780e2027e devectorize prereqs [pr] (#9404) 2025-03-11 12:33:29 +08:00
Priyank Patel
beed00eabe fix torch backend memory leak (#9395)
* fix leak, realize everything on torch optim step

* only realize a subset

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-11 10:48:20 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
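For context: numpy and torch both spell a reduction's accumulation dtype as a `dtype` keyword, which this rename matches. A minimal sketch; the tinygrad call is an assumption about the post-#9402 spelling:

    import numpy as np
    import torch
    from tinygrad import Tensor, dtypes

    np.ones(4, dtype=np.float16).sum(dtype=np.float32)           # numpy: accumulate in float32
    torch.ones(4, dtype=torch.float16).sum(dtype=torch.float32)  # torch: same keyword name
    Tensor.ones(4, dtype=dtypes.half).sum(dtype=dtypes.float32)  # tinygrad after the rename (was acc_dtype)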
qazal
59dfb234eb replace hardcoded ast with tensors in TestSwizzle [pr] (#9401) 2025-03-10 19:33:57 +01:00
Priyank Patel
796c3bbb23 torch: support in-place operations on views (#9371)
* add torch inplace tests

* first set of tests passing

* wrap all inplace funcs, add more tests

* fixes and wrap more functions

* fix all uint8 tests to avoid slow tests

* fix the one test

* another test, another fix

* and one more, works for ddp now

* something on contiguous, cleanup

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-03-10 23:29:00 +08:00
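The behavior this adds is easiest to see in plain torch on CPU, which already has view semantics; a minimal sketch of what an in-place op through a view must do (and what the tiny backend now matches):

    import torch

    base = torch.zeros(4)
    view = base[1:3]   # a view shares storage with its base tensor
    view.add_(1)       # an in-place op through the view...
    print(base)        # ...must show up in the base: tensor([0., 1., 1., 0.])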
qazal
2afc7759a7 sink in kernel op [pr] (#9397)
* sink in kernel op [pr]

* metadata
2025-03-10 13:13:42 +01:00
George Hotz
25847080f0 olmoe (from stream, wip) (#9390)
* olmoest working (but not)

* it's correct

* compare ropes

* old code wasn't wrong

* default device

* no metal

* fix permute

* working

* more minimal
2025-03-10 13:46:33 +08:00
geohotstan
1d64c12f2b add Topk to tensor (#9343)
* terrible but somewhat working impl

* linux behaves differently than macos?

* slightly better impl

* small clean up; haven't figured this out yet

* better

* torch has different behavior on linux and macos for duplicated values

* add sum docs

* fix test

* add torch return_type test

* add an exception test

* wrap_fxn instead, and move op lower in order

* better repeated values test

* rerun ci
2025-03-09 20:01:42 -04:00
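The new method mirrors torch's topk, shown below; the tinygrad signature itself is an assumption based on the torch API and the return_type test mentioned above:

    import torch

    t = torch.tensor([1., 4., 2., 3.])
    values, indices = torch.topk(t, k=2)  # values: [4., 3.], indices: [1, 3]
    # tie-breaking among duplicated values is platform-dependent in torch,
    # hence the repeated-values test in this PR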
qazal
a1f41fadf6 test_schedule cleanups + add DONT_GROUP_REDUCES [pr] (#9392)
* test_schedule cleanups + add DONT_GROUP_REDUCES [pr]

* replace with test_swizzle_reduceop

* delete duplicate tests

* test_allow_push_permutes

* one kernel tests
2025-03-09 15:01:08 +01:00
wozeparrot
b6fe5ab4dd fix: correct gfx10 ctl stack size (#9384) 2025-03-09 13:03:20 +08:00
qazal
456697d0be always create kernels for assign/contiguous/copy [pr] (#9388) 2025-03-08 15:32:06 +01:00
qazal
286b480f82 do not replace assign with the offset buffer [pr] (#9387) 2025-03-08 11:57:44 +01:00
qazal
ecfccdea8e remove views from the kernel graph minimum diff (#9385)
* remove views from the kernel graph

* notes
2025-03-08 10:14:42 +01:00
qazal
0d2762c010 prep refactor for adding buffer ops last [pr] (#9383)
* prep refactor for adding buffer ops last [pr]

* freeze buffers

* add swizzle_reduceop

* shape for reduceop_view_right

* simpler elementwise_view_right

* add shapetracker to const

* only const

* from process replay
2025-03-08 08:00:14 +01:00
b1tg
bde0347618 amd: support relocatable elf (#9380)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-03-08 02:21:49 +08:00
nimlgen
243078dda9 am: optimize tlb usage (#9049)
* am: optimize tlb usage

* fixes

* comments

* tiny
2025-03-07 19:37:29 +03:00
qazal
46720294d6 reorder ScheduleItem creation [pr] (#9379) 2025-03-07 17:20:53 +01:00
qazal
dc89dae994 remove unmasked valid after swizzles (#9377) 2025-03-07 16:43:16 +01:00
geohotstan
088d86691b fix onnx gather and onnx auto_pad VALID mode (#9375)
* fix gather and auto_pad

* long -> int64
2025-03-07 10:27:23 -05:00
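For reference, ONNX auto_pad=VALID means no padding at all, so a window op's output length is floor((in - k) / stride) + 1; e.g. in=7, k=3, stride=2 gives floor(4/2) + 1 = 3.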
qazal
3565c08df5 refactor to kernel ast fixup [pr] (#9376) 2025-03-07 15:47:38 +01:00
hooved
304afe0d55 tinychat in browser, Part 3: browser app (#9276)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexedb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate app layer diff

* add .gitignore for generated files

* validate CPU/WEBGPU models in python

* prevent infinite generation if validation fails

* check if exported weight files are unique

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-07 15:07:33 +08:00
hooved
136cf7b8b1 hotfix: load >2 GiB from disk on macOS (#9361)
* enable loading >2 GiB buffer from disk on macOS

* handle None case raised by mypy

* add test

* revert fix to repro bug in CI

* tell CI to run a unit test for macOS

* reapply fix
2025-03-07 14:51:58 +08:00
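Background on the bug: a single read() of more than 2 GiB fails on macOS, so large buffers have to be loaded in chunks. A generic sketch of that pattern (not the repo's actual fix):

    import os

    CHUNK = 1 << 30  # 1 GiB per syscall, safely under the macOS 2 GiB read limit

    def read_all(fd: int, size: int, offset: int = 0) -> bytes:
        parts = []
        while size > 0:
            buf = os.pread(fd, min(size, CHUNK), offset)
            if not buf: break  # EOF
            parts.append(buf)
            offset += len(buf)
            size -= len(buf)
        return b"".join(parts)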
Friedrich Carl Eichenroth
dbdefbbe54 Typed methods in tensor.py (#9356)
* types for tensor.py

* x

* more

* remove some casts

* more typing

* fix linting issues

* -1 line

* add last type

* cast 🤙🤙
2025-03-05 20:34:18 -05:00
nimlgen
77f7ddf62a gfx10 correct ctl stack size (#9365) 2025-03-05 23:04:16 +03:00
nimlgen
c8a74b11ed am: resize bar error msg (#9366) 2025-03-05 23:02:04 +03:00
nimlgen
9bd13de44c lower test_gemv_4096_16384 to 750 for red (#9367) 2025-03-05 22:44:48 +03:00
uuuvn
b75f307234 amd: autogen ip bases (#9360) 2025-03-05 22:30:38 +03:00
chenyu
2cb2fce8d9 lower test_gemm_8192 amd_tflops to 65 (#9364) 2025-03-05 14:06:11 -05:00
Anish Umale
b3ac60ce53 Fix test_ops for tiny backend part 2 (#9358)
* extract functions from https://github.com/tinygrad/tinygrad/pull/9302

* revert gather and add aten.elu_backward

* address review

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-03-05 13:38:40 -05:00
uuuvn
c6d76770e4 Increase timeout on macos tests (#9362)
Process replay timeouts: https://github.com/tinygrad/tinygrad/actions/runs/13682213444/job/38257133289?pr=9360
2025-03-05 13:04:16 -05:00
nimlgen
cd9d74f7ea use am in training benchmarks (#9357)
* am in training benchmarks

* fix

* not needed anymore
2025-03-05 19:13:47 +03:00
chenyu
2af129c078 bert corealize multiple outputs (#9359)
1% faster step
2025-03-05 10:58:37 -05:00
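Corealizing means scheduling several outputs in one pass instead of walking the graph once per tensor; a sketch under the assumption that Tensor.realize accepts extra tensors for this:

    from tinygrad import Tensor

    x = Tensor.rand(16, 16)
    loss, count = (x * 2).sum(), (x > 0.5).sum()
    loss.realize(count)  # one schedule for both, vs. loss.realize(); count.realize()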
nimlgen
d550583657 am: flush hdp (#9354)
* am: fix hdp

* and flush it
2025-03-05 16:31:59 +03:00
Priyank Patel
f048256341 fix TORCH_DEBUG=1 sigsegv (#9352) 2025-03-05 12:24:53 +03:00
chenyu
ad72269f08 bert put eval copy and getting lr in jit (#9350) 2025-03-04 20:57:03 -05:00
George Hotz
7576a1da23 hotfix: line count to 11500, lines for SQTT and AMDLLVM 2025-03-05 09:21:18 +08:00
chenyu
9eb45eb629 add a flag to skip bert train (#9349) 2025-03-04 17:13:00 -05:00
nimlgen
14c88abf27 add some options to allreduce bench (#9348) 2025-03-04 23:46:36 +03:00
nimlgen
993ef42bd5 am: hdp cg (#9346) 2025-03-04 20:44:09 +03:00
chenyu
e301f21f63 CI ubuntu-20.04 -> ubuntu-22.04 (#9345)
20.04 is removed now
2025-03-04 11:39:12 -05:00
hooved
01f7a4fadc tinychat in browser, Part 2: model export (#9274)
* load llama3-1B to WEBGPU device

* include compile script for loading llama3 to WEBGPU

* parametrize max_context in build_transformer fxn

* jit_model with two different args sets

* compile for webgpu, split weights

* load model weight parts in browser

* export all tensors from initialized transformer

* run transformer inference in browser

* enable tiktoken with llama bpe in browser

* count total tokens on client with tiktoken.js

* full client-side chat streaming, eliminate server

* revert change that enabled jitting with 2 argsets

* llama without Variable or cache_kv, for webgpu

* have client use mask tokens / whole context

* cleanup staged weights

* add tiktoken.js build script, README

* export CLANG for Q6_k to float32 decompression

* fix and test exported CLANG code for Q6_k to fp32

* revert changes to jit and export_model

* isolate clang export

* test Q6_K to float32 decompression in browser

* gguf_load now also returns t_infos and data_start

* prepare llama-1B Q6_K gguf chunks for browser

* cache and decompress quantized llama in browser

* enable separate deployment of large files

* fix kv cache and symbolic with llama wgpu

* eliminate browser lag during decompression

* hash metadata and weight chunks

* delete obsolete indexeddb cache to free disk

* add progress bar, track model download/decompress

* refactor progress callback

* skip buffer hash verification for speed

* Display progress for entire loading scope

* Report page load errors to user

* actually display errors

* skip prompt tokens already seen by model

* skip prefilling with last assistant message tokens

* on page load tell user if webgpu not enabled

* push deployed URL root to window.history

* make note of bug sources with TODO items

* isolate bug in CLANG with BEAM=2

* remove clang_bug.py from diff

* decompress q6k to f32 on webgpu instead of clang

* remove unused code

* inter-weight decomp with larger wgpu kernels

* parallelize decompression submissions

* refactor dequantize scheduling

* add progress bar back

* fix bug

* temp fix for loading GGUF Q6_K to fp16 not fp32

* fix rendering of exported CLANG

* remove weight casts, sketch js functions for clang

* get symbolic vars from jit_cache for model export

* include symbolic vars in exported CLANG

* render js for clang transformer

* toggle clang/webgpu deployment; refactor decomp

* compile and render clang Q6_K->fp16 and int8 quant

* fix rendered clang for abs(fp16), to work in wasm

* simplify clang js wrapping

* run compiled clang in worker

* prepare llama weights in workers, q6k to int8/fp16

* tinychat on clang in browser, f32/int8 weights

* move wasm inference to (now flexible) worker

* don't load redundant embeddings

* modest wasm perf gain with compile flags

* set default backend, enable backend choice/backup

* render symbolic vars in exported WEBGPU

* quantize webgpu llama to int8/f32

* improve UX arising from rendered WEBGPU

* clean up webgpu launch

* new weights split: smaller chunks, tinygrad quant.

* switch webgpu inference to int8 quant

* remove unneeded clang decompression

* eliminate unneeded kv cache transfer to wasm

* use 1 worker for simplified clang decompression

* display launch errors

* refactor: stream load weight chunks to WebGPU

* show loading chunk completion

* quantize embeddings to int8

* test float16 as input for quantization

* webgpu: use f16 source, int8 embed, eliminate q6k

* simplify split weights prep: all from state_dict

* revert change to nn.state.gguf_load

* remove unneeded decompression from webgpu client

* remove unneeded code

* decrease dl chunks from 47 to 16 MiB

* improve stability of webgpu loading on mobile

* autodetect mobile, improve load stability

* refactor: progress closure

* refactor: one unified progress bar

* remove unneeded code

* revert changes to tinygrad core library

* enforce ios18.3 nerfed max buf size

* BEAM=3 webgpu

* cache integrity, mobile save throttling

* improve mobile UX - no autozoom on prompt box

* clang: int8 from f16, remove q6k

* reduce concurrent dls on mobile to 2 for stability

* refactor: wasm backend with stream loading

* prevent race between wasm load and indexedb save

* split wasm kernels into separate modules

* js wrapper for multiple wasm module inference

* revert multi-module wasm to single module

* make mobile wasm load more stable/fast

* refactor: copy weights into wasm without crashes

* fix bug in download queue; increase mobile dls

* refactor exported clang wrapper, split weights

* remove unnecessary code

* greatly improve int8 quant quality with rounding

* eliminate mobile throttling

* increase webgpu context to 4096 tokens

* export webgpu js functions

* enable separate hosted weights for mobile/pc

* enable prompt-thread switching during generation

* stop generation when max_context is reached

* show progress bar for prefill

* tell user if webgpu fails, while wasm loads

* make loading messages more concise

* update font

* revert changes to tinychat python app launch

* cleanup quantization, add scale_dtype param

* cleanup kv cache code

* cleanup compile code

* link tok_embeddings with output in webgpu export

* refactor: export_model webgpu: symbolic vars

* refactor: export_model weight loading

* forgot to commit export_model.py

* change CLANG to CPU

* deal with pylint incorrectly failing tests

* simplify f-strings for older CI python version

* fix pre-python3.12 parser errors

* [Int32Array] not Int32Array

* cleanup webgpu compile after refactor export_model

* refactor WASM export into export_model

* merge WebGPU/WASM compile scripts

* simplify max_contexts for local deployment

* fix parser issues and whitespace

* deduplicate variable defs for non-wasm clang export

* cleanup code

* cleanup compile scripts

* simplify wasm inference wrapping

* simplify webgpu symbolic vars export

* refactor: unify export of symbolic variables

* simplify WASM export

* simplify clang/wasm export

* update README and build scripts

* separate files for browser/python apps

* restore original python tinychat app files

* browser and python tinychats share assets

* minor cleanup

* isolate compile/export model

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-03-04 15:53:30 +08:00