tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-04-29 03:00:14 -04:00

Author	SHA1	Message	Date
wozeparrot	963c59ebdb	fix: pull fixes from gradacc branch (#14296 )	2026-01-22 23:07:54 -08:00
George Hotz	52b989c6c8	don't place consts early + fixes from anthropic challenge (#14286 ) * don't place consts early * add anthropic challenge * with ref * do we still have to devectorize bools? * tests pass * just WHERE * fine, revert that * fine, revert * only index * z3 validator doesn't support vectorized * Revert "z3 validator doesn't support vectorized" This reverts commit `1b7930ecb3`. * z3 not for vec * no spec * VLIWRenderer * loop unrolling * better comments * cleanups * skip cast * renderer * cleanups * prints * no hack * hacks * bump to 11 * reg warning * lil clean * cleaner renderer	2026-01-23 10:48:39 +09:00
wozeparrot	c1d14ea832	llama8b train fixes (#14264 )	2026-01-20 20:34:47 -08:00
wozeparrot	ba90e1b52e	feat: script to run llama8b training (#14239 )	2026-01-20 12:44:06 -08:00
C T	26f8b12e01	Whisper audio helpers (mel filters in tinygrad) (#13478 ) * add whisper audio helpers for stft/mel/resample * cleanup * add whisper stft test * make only stft test explicitly depend on librosa * extract sinc_window_kernel * dehardcode device * use same device argument * simplify * type annotate * ruff format audio_helpers.py * ruff format test_whisper.py * add WHISPER_NEW_STFT * rename * undo ruff format changes * use new stft and mel for whisper * remove stft test that depends on librosa * remove whitespace * add Tensor.log10 with test\test_ops.py::TestOps::test_log10 * use Tensor.log10 * fix lint * future: remove unused STFT class * future: remove resample code since it isn't used (yet) * match openai with pad_mode="reflect" * pad_to * future: cut resample leftovers * cleanup * add mel tests * future: cut stft * future: cut non-mel prep_audio changes * reduce diff * move audio_helpers.py to examples * reduce whitespace * fix imports * reduce whitespace --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2026-01-20 10:50:02 -05:00
wozeparrot	a879b54234	tk: fa jit fix (#14170 )	2026-01-16 16:38:45 -08:00
b1tg	0fbc551622	train bert with fp8 (#13874 ) * fp8 train * clean * lint * test fix from #13439 * skip first/last layer * rm __init__, restore unroll <=32 check * tests * clean test, remove unused * multi-gpu test, clean quantize_to_fp8 * remove bert contiguous * run script * test: better check * run script search * add seed in bert data shuffle * move script to mi350x folder --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2026-01-09 09:21:59 -05:00
b1tg	241f0402b4	add seed in bert data shuffle (#14054 )	2026-01-07 10:02:05 -05:00
chenyu	87f4bc5446	update variable names around jit [pr] (#14049 ) lbs, st_vars_dtype_device and rawbuffers no more	2026-01-06 22:32:41 -05:00
Francis Lata	fac137779e	remove flux1 seed image (#13843 )	2025-12-27 00:45:11 -05:00
chenyu	da1cb6a9ec	update llama dataloader (#13825 ) separate creating dataset from itererating over the dataset to not create eval data for each eval	2025-12-24 17:42:08 -05:00
chenyu	903753c60c	llama wandb logging (#13822 )	2025-12-24 10:24:59 -05:00
chenyu	27d899ce97	TRAIN=0 to only eval llama (#13804 )	2025-12-22 11:55:46 -05:00
chenyu	39d962106f	update llama logging (#13803 ) ``` REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py 1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS 2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS 3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS 4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS 5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS 6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS 7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS 8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS 9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS 10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS ```	2025-12-22 11:28:29 -05:00
George Hotz	45c459848d	remove more stale stuff (#13765 ) * remove more stale stuff * remove disassemblers/adreno * stale	2025-12-19 17:14:56 -04:00
George Hotz	df6cde8a00	cleanup stale examples/extra (#13764 ) * cleanup stale files * examples * move those back * old * delete more	2025-12-19 16:27:37 -04:00
chenyu	7cd7593c5d	add script to train bert on mi350x (#13743 ) adapted from mi300 config	2025-12-17 16:54:04 -05:00
chenyu	e428fbfab6	verify dtype of llama model params (#13719 )	2025-12-16 12:32:02 -05:00
chenyu	6cad622f59	don't FREE_INTERMEDIATE in bert (#13684 ) hangs green hcq consistently after an hour of training	2025-12-14 14:27:42 -05:00
chenyu	fcaed1e1dd	don't use empty in bert fake data (#13661 ) somehow jit does not count empty as input	2025-12-12 15:59:50 -05:00
chenyu	01e9ad0d52	clean up bert next_data (#13650 ) train iter was designed to never stop for both real and fake data	2025-12-11 22:56:28 -05:00
chenyu	5034c6fb37	reenable FREE_INTERMEDIATE for bert (#13639 ) * reenable FREE_INTERMEDIATE for bert * comment	2025-12-10 12:08:09 -05:00
chenyu	016a59cafa	remove contiguous and use where in EmbeddingBert (#13632 )	2025-12-09 15:49:21 -05:00
chenyu	2471b49e45	minor bert / llama change from grad acc branch (#13622 ) * minor bert / llama change from grad acc branch * revert those	2025-12-08 16:04:14 -05:00
chenyu	b981b6f89e	remove old llama grad_acc (#13611 ) * remove old llama grad_acc * GRADIENT_ACC_STEPS=1	2025-12-07 13:03:47 -05:00
chenyu	4562f217e1	more bert updates (#13597 ) prep split jit also lower BS to 72	2025-12-06 08:32:43 -05:00
chenyu	cb4c6324ef	revert bert grad accumulation (#13596 ) prep for the new split jit style	2025-12-05 17:30:08 -05:00
chenyu	89f9e1dcd5	add SGD to beautiful_mnist (#13571 )	2025-12-04 12:17:29 -05:00
George Hotz	96d16675fe	update examples/gradaccum_mnist.py to use the JIT	2025-12-03 16:11:42 -08:00
George Hotz	a4c4e48385	add LUNIQUE op (#13554 )	2025-12-03 14:34:34 -08:00
wozeparrot	8713ae6de9	fix: dead sdv2 download link (#13521 )	2025-12-01 22:50:53 -08:00
George Hotz	44104b0b7f	mnist with grad acc + Adam on CPU (#13520 ) * mnist with grad acc + Adam on CPU * still broken, but closer * works w/o jit * this works without the jit	2025-12-01 18:27:32 -08:00
George Hotz	8e8fec408e	fix n^2 _apply_map_to_tensors [pr] (#13443 ) * clean up slow rules * fix rule * non n^2 toposort * topovisit * state dict profile_marker	2025-11-24 18:59:16 -08:00
George Hotz	cc5e6323ac	stable diffusion profiling (#13441 ) * stable diffusion profiling Signed-off-by: George Hotz <geohot@gmail.com> * profile_marker * profile per step * fix slow Context * profile that --------- Signed-off-by: George Hotz <geohot@gmail.com>	2025-11-24 15:25:45 -08:00
chenyu	646372490c	move tiktoken import in llama3 (#13316 ) only Tokenizer requires that	2025-11-17 14:09:37 -05:00
George Hotz	17aa3379e9	hotfix: improve self_tokenize	2025-11-13 00:18:57 -08:00
chenyu	4e5a9132e7	JIT_BATCH_SIZE=0 in compile3 (#13245 ) fixed some enqueue time	2025-11-12 23:12:45 -05:00
chenyu	41e45c20ff	minor stuff reading the printed code [pr] (#13177 )	2025-11-09 00:58:51 -05:00
chenyu	834067d91f	move onnx import in compile3 (#13172 ) only used in test_vs_onnx	2025-11-08 09:44:34 -08:00
C T	0f9d7f650d	whisper: fix oob, explicit dtype (#13144 ) * fix dtype depending on numpy version numpy v2 np.array returns int64 which Tensor passed through for the first decode call, swallowing the <|notimestamps|> token and corrupting the sequence * fix whisper OOB global limit on whisper's context length * enforce whisper max_tokens_to_sample (match openai) local limit on max tokens decoded	2025-11-07 12:55:01 -05:00
chenyu	74db65cf72	update mlperf bert LOGMLPERF (#13065 )	2025-11-02 15:26:37 -05:00
b1tg	45e2f916a3	add quantize fp8 in llama3 (#12893 ) * add quantize fp8 in llama3 * don't truncate fp8 alu result * cast to float32 before matmul * --model weights/LLaMA-3/8B-SF-DPO/ --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2025-10-27 10:22:57 -04:00
Harald Schäfer	587ccc0e5c	compile3: make selftests opt-in (#12851 )	2025-10-21 11:32:27 -07:00
wozeparrot	990e8b97ee	feat: log openpilot 0.10.1 times (#12816 )	2025-10-20 18:30:34 -07:00
Sieds Lykles	1e93d19ee3	stable diffusion --fakeweights (#12810 )	2025-10-20 12:41:06 +02:00
Harald Schäfer	addc54b96c	Simplify openpilot compile3.py (#12748 ) * Simpler compile3 * tests * remove default args * onnx file is still fp16 * self-test FP16 too * allow test disable * absurd tolerance * Just do latest * Try simplest * use later models * kernel count not relevant if speed is good * dead improts * Revert "dead improts" This reverts commit `f68c2cd15d`. * Revert "kernel count not relevant if speed is good" This reverts commit `0955ca4ee0`. * add back kernal count check on latest model	2025-10-18 10:12:22 -04:00
chenyu	285534ce64	delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES (#12744 ) does nothing now	2025-10-16 14:11:33 -04:00
chenyu	f34f26bca0	fix gpt2 with benchmark (#12736 ) `CPU=1 python3 examples/gpt2.py --benchmark 128` works now	2025-10-16 09:55:20 -04:00
George Hotz	af4479c169	faster stable diffusion load (#12725 ) * faster stable diffusion load * failing tests	2025-10-16 18:31:59 +08:00
George Hotz	612e3d6143	replace mop arg with vectorized index (#12695 ) * replace mop arg with vectorized index * tests passing * better viz * no compile4	2025-10-15 20:50:06 +08:00


1241 Commits