tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-09 15:08:02 -05:00

Author	SHA1	Message	Date
chenyu	0599e86186	replace hardcoded GPU in llama debug msg (#12102 )	2025-09-10 13:56:40 -04:00
George Hotz	32e9949052	rename lazydata to uop (#10698 )	2025-06-08 08:42:22 -07:00
George Hotz	b3b43a82c4	remove Tensor.no_grad, it's meaningless now [pr] (#10556 )	2025-05-28 22:20:02 -07:00
wozeparrot	1ed04f993b	move benchmark stat tracking to influxdb (#10185 )	2025-05-15 16:14:56 -07:00
qazal	a59d18da21	hack for VIZ=1 with examples/llama (#10103 ) * hack for VIZ=1 with examples/llama * move it alongside BEAM=0	2025-04-29 23:42:17 +08:00
chenyu	3eba3d6ee9	don't pass model in convert_from_huggingface and convert_from_gguf (#10094 ) it only needs n_layers	2025-04-28 20:11:19 -04:00
chenyu	631dc98b52	validate llama quantize output (#7901 ) mac benchmark already runs quantize, this adds output validation	2024-11-25 16:46:23 -05:00
chenyu	e6debda5c4	remove numpy from gpt2 and llama examples (#7778 )	2024-11-18 22:48:17 -05:00
David Hou	9a485f36e4	shard kvcache (#5830 )	2024-07-30 20:29:54 -07:00
George Hotz	21c5e8e1b7	extreme llama speed, 57.34 tok/s (#5827 ) * extreme llama speed * mergable	2024-07-30 18:32:09 -07:00
wozeparrot	c9b3ae6bbf	fix llama.py chat mode assert (#5366 )	2024-07-10 18:06:14 -07:00
chenyu	322c37e621	use helpers.JIT in llama and gpt2 examples (#5350 ) * use helpers.JIT in llama and gpt2 examples replaced getenv("JIT"), effectively made gpt2 default jit * fix test_gpt2	2024-07-09 15:04:43 -04:00
David Hou	3604642847	Llama shard axis 0 sometimes (#5123 ) * make buffer view optional with a flag [run_process_replay] * do not view when sharding to save memory [run_process_replay] * llama shard axis=0 sometimes --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2024-06-26 10:35:25 -04:00
chenyu	38bc38cdff	fix llama example quantize (#4699 ) * fix llama example quantize import quantize layers from new example llama3 add to mac benchmark * fix that * save the files	2024-05-23 15:35:26 -04:00
chenyu	2b0ee74bb6	lshift and rshift (#4591 )	2024-05-14 19:16:31 -04:00
wozeparrot	d7670f8141	quantized llama multilazybuffer fix (#4557 )	2024-05-12 14:19:21 -07:00
chenyu	01a0c1a948	slightly faster nf4 llama (#4542 )	2024-05-12 14:24:42 -04:00
wozeparrot	e07c7668b3	nf4 llama (#4540 )	2024-05-11 22:22:34 -07:00
George Hotz	7c630a9a53	hotfix: fix llama spacing + fix hcq	2024-05-10 15:10:13 +00:00
ym555	3113785604	Llama 3 Models (#4339 ) * Full Impl * fix test * Fix inference loop --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-05-02 06:06:07 -07:00
chenyu	c71627fee6	move GlobalCounter to helpers (#4002 ) break circular import between ops and buffer	2024-03-30 00:30:30 -04:00
Francis Lam	16a1d43f6f	llama: prevent device initialization outside of __main__ (#3966 ) * llama: prevent device initialization outside of __main__ causes HSA resources leakages in child compile processes * llama: fix loading with multiple devices	2024-03-27 19:19:38 -04:00
Arseny Kapoulkine	cb6e7b57a6	examples: Fix parameter bandwidth accounting for quantized LLama (#3930 ) Instead of assuming every parameter is 2 bytes, just add up tensor sizes in bytes	2024-03-25 18:41:05 -04:00
chenyu	f7f67e0cc5	simple fix llama shard with quantize (#3882 ) copy scale on all device for now. naive sharding does not work because scale needs expand to really save memory. 70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES. `python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize` 13B on 6 gpus uses 47 GB v.s. 34 GB quantized	2024-03-22 18:15:37 -04:00
chenyu	9d1d08fbb0	show llama bandwith with timing (#3844 )	2024-03-20 17:19:15 -04:00
chenyu	5ac1fa933f	apply the same fix_bf16 in llama and coder (#3789 ) * apply the same fix_bf16 in llama and coder did not realize the same logic was in llama too. really fix #2775 * flag for native SUPPORT_BF16 cast	2024-03-17 21:25:24 -04:00
chenyu	ad1d873f8d	fix llama shard convo mode (#3716 )	2024-03-13 12:07:02 -04:00
chenyu	d69170e27e	add llama 2 70B in ci and verify output (#3682 ) * add llama 2 70B in ci and verify output * ln -s llama2 dir	2024-03-11 12:48:22 -04:00
George Hotz	c80884884e	event driven hip (#3160 ) * event driven hip * simpler, src makes copy * pass mypy	2024-01-18 14:35:18 -08:00
George Hotz	a72b1b6d65	sharding for llama (#3151 ) * shard llama * sharding works * simpler * simpler * consume option * disable that test * save a line --------- Co-authored-by: George Hotz <george@tinygrad.org>	2024-01-16 19:28:00 -08:00
George Hotz	655c6f61d3	St real size (#3046 ) * track the size in the lazybuffer * shapetracker real size * lint	2024-01-08 14:44:53 -08:00
George Hotz	c003be7309	Revert "track size in shapetracker" (#3043 ) * Revert "track size in shapetracker (#3026)" This reverts commit `a8ba1ac08f`. * st.size	2024-01-08 13:13:39 -08:00
George Hotz	ebb81e8f11	hotfix: st.size() -> st.size in llama	2024-01-05 20:18:52 -08:00
chenyu	f88506e630	move gpt2/llama sampling inside the model call (#3013 ) * move gpt2/llama sampling inside the model call * argmax uses one more kernel	2024-01-04 17:01:50 -05:00
George Hotz	a280cfe169	move dtypes to dtype.py (#2964 ) * move dtypes to dtype.py * fix urllib	2024-01-01 14:58:48 -08:00
George Hotz	c81ce9643d	move globalcounters to ops (#2960 ) * move globalcounters to ops * missed a few * sick of that failing	2024-01-01 14:21:02 -08:00
George Hotz	0fd44259cd	bf16 fix + cleanups from mixtral (#2698 ) * bf16 fix + cleanups from mixtral * generic bf16 cast	2023-12-10 16:31:52 -08:00
chenyu	fae5394845	validate llama output (#2681 ) * validate llama output * does not work with quantize	2023-12-08 16:42:01 -05:00
chenyu	539b00a645	move llama getenv("JIT") from models to examples (#2671 ) Transformer class has a jit param so we should use that in the caller	2023-12-07 12:43:22 -05:00
Oleg Rybalko	5e87083783	Whisper + LLAMA + VITS (#2332 ) * feat: working voice 2 text using whisper * feat: added llama generation * feat: vits init * feat: more accurate voice conversion * feat: support for tts and working pipeline for the first pass * fix: linter checks * refactored vits initialization and inference, added mmts-tts support * fixed process sync and now we can have an infinite conversation * reuse output stream to remove overhead of creating a new one each time * added pre-prompt configuration with yaml files * adjusted code to merge PR which changed whisper * optimized whisper, now it's blazing fast and also reduced number of lines * added better debug printing * use jitted encode function for whisper, added timings and removed response delim to save speed on generating those tokens * fixed hf convert and now it's working with tinyllama * added tinyllama config * refactored code and made it work with all llama models * prettier order * prettier order * fixed suffix for tinyllama and refactored convert_from_hf * added missing parameters * fixed stream release and added missing params * jitted dp and encoder * jitted flow forward * removed re-init of espeak on each call to save up time * jitted generator forward for blazing fast tts * added contextmanager for displaying a chat log * removed whitespace for pylint * updated code to support latest fetch func * wait for llama eos token and pass params from cli to llama * listen for not fixed amount of time * refactored code a bit * removed thresholding and now the output streams directly to whisper * tokenize llama output for vits batch size to work and stream each sentence to a speaker * changed speaker * whisper is now printing on the same line * don't trigger llama on whisper output in parens * added tinyllama chat model * adjusted code to work with tinyllama chat model * removed unused cli arg * autofetch tokenizer and tinyllama model. 
add 3 chat tokens to the tokenizer * fixed issue with long sentences by chunking them * support for multiline llama output * prettified log output * adjusted sentence length * remove quote from response to avoid funny tts * fixed prompts * added missing parameter	2023-12-02 15:03:46 -08:00
Davi Silva	ddeec24fa8	Cleanup & fix llama.py (#2524 ) * docs, cleanup crap * comma AI * fix 70B * this is why lexical scope exists	2023-11-30 16:00:17 -05:00
George Hotz	9e07824542	move device to device.py (#2466 ) * move device to device.py * pylint test --disable R,C,W,E --enable E0611 * fix tests	2023-11-27 11:34:37 -08:00
George Hotz	7170a9a057	coder.py can write and run code (#2439 ) * wip mistral * coder * touchups * cleanups * mistral cleanups * clean up cache create * download the weights, fix tests * fix llama loading * global fixup * clean up all * move llama model * cleanups * Revert "cleanups" This reverts commit `a71c5d59eb`. * fine, leave it	2023-11-25 12:27:54 -08:00
Davi Silva	df41a57e09	Fix: missing n_kv_heads for smaller models from huggingface (#2438 ) * fix: missing n_kv_heads for smaller models from huggingface * a lil golfing	2023-11-25 10:29:04 -08:00
George Hotz	5bb720a777	Cocoa is no longer used	2023-11-23 14:31:21 -08:00
George Hotz	2dec86970a	hotfix: default remains gen 1 llama	2023-11-21 14:43:02 -08:00
Oleg Rybalko	7220f5c9fc	fixed hf convert and now it's working with tinyllama (#2374 ) * fixed hf convert and now it's working with tinyllama * added tinyllama config * refactored code and made it work with all llama models * prettier order * prettier order * fixed suffix for tinyllama and refactored convert_from_hf * dynamically update help if MODEL_PARAMS changes and default size is the 1st	2023-11-21 14:36:52 -08:00
Friedrich Carl Eichenroth	75676ab8e1	Profiling-helper (#2321 ) * change profiler * remove unused imports * remove unused imports * change lazybuffer references * remove unused line * remove unused import * remove unused stuff * add types * typing * typing * typing * trigger actions * -1 loc * fixup * trigger actions * revert lazy typing changes * WIP profiler helper * replace old start & stop profiler * fixup * linting * Update llama.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2023-11-16 14:15:56 -08:00
George Hotz	3baaf298d6	two stage cumsum in tensor.py (#2331 ) * two stage cumsum in tensor.py * 2 more kernels for llama cumsum * gpt-2 and llama use fast multinomial	2023-11-16 12:09:53 -08:00
George Hotz	70a65c201e	JIT support in Interpreted (#2314 ) * factor that out * jit is supported everywhere * fix some tests * there's no jit supported device, the jit is everywhere * fix test uops	2023-11-15 11:13:38 -08:00

112 Commits