Commit Graph

1224 Commits

Each entry lists the author, the abbreviated SHA1, the commit message (with its body, when present), and the commit date.
chenyu
e428fbfab6 verify dtype of llama model params (#13719) 2025-12-16 12:32:02 -05:00
chenyu
6cad622f59 don't FREE_INTERMEDIATE in bert (#13684)
with FREE_INTERMEDIATE, green (HCQ) hangs consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
fcaed1e1dd don't use empty in bert fake data (#13661)
somehow the JIT does not count Tensor.empty as an input
2025-12-12 15:59:50 -05:00
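
A minimal sketch of the pattern behind this entry, assuming realized random tensors stand in for Tensor.empty fake data; the shapes, ranges, and the `train_step` body below are made up, not the actual bert code.

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def train_step(x: Tensor, y: Tensor) -> Tensor:
  # stand-in for the jitted bert step; only the input handling matters here
  return (x.float().mean() + y.float().mean()).realize()

def fake_batch(bs=32, seqlen=128):
  # Tensor.empty hands back uninitialized memory and, per the commit message,
  # the JIT may not register it as an input; realized random tensors do.
  x = Tensor.randint(bs, seqlen, low=0, high=30000).realize()
  y = Tensor.randint(bs, low=0, high=2).realize()
  return x, y

for _ in range(4):
  loss = train_step(*fake_batch())
```
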
chenyu
01e9ad0d52 clean up bert next_data (#13650)
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
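
One way such a never-stopping train iterator can look: cycle real batches forever, or repeat a single fake batch forever, and let the training loop decide when to stop. `endless_batches` and its arguments are illustrative names, not the real next_data.

```python
import itertools

def endless_batches(real_batches, fake_batch=None):
  # repeat one fake batch forever, or cycle the real data forever
  if fake_batch is not None:
    while True:
      yield fake_batch
  else:
    yield from itertools.cycle(real_batches)

it = endless_batches([1, 2, 3])
print([next(it) for _ in range(7)])  # [1, 2, 3, 1, 2, 3, 1]
```
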
chenyu
5034c6fb37 reenable FREE_INTERMEDIATE for bert (#13639)
* reenable FREE_INTERMEDIATE for bert

* comment
2025-12-10 12:08:09 -05:00
chenyu
016a59cafa remove contiguous and use where in EmbeddingBert (#13632) 2025-12-09 15:49:21 -05:00
chenyu
2471b49e45 minor bert / llama change from grad acc branch (#13622)
* minor bert / llama change from grad acc branch

* revert those
2025-12-08 16:04:14 -05:00
chenyu
b981b6f89e remove old llama grad_acc (#13611)
* remove old llama grad_acc

* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
chenyu
4562f217e1 more bert updates (#13597)
prep for split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
chenyu
cb4c6324ef revert bert grad accumulation (#13596)
prep for the new split jit style
2025-12-05 17:30:08 -05:00
chenyu
89f9e1dcd5 add SGD to beautiful_mnist (#13571) 2025-12-04 12:17:29 -05:00
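
For context, the generic shape of plugging tinygrad's SGD optimizer into a training step; the two-layer model and batch here are placeholders, not the actual beautiful_mnist.py code.

```python
from tinygrad import Tensor, nn

class TinyNet:
  def __init__(self):
    self.l1, self.l2 = nn.Linear(784, 128), nn.Linear(128, 10)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).relu())

model = TinyNet()
opt = nn.optim.SGD(nn.state.get_parameters(model), lr=0.01, momentum=0.9)

with Tensor.train():
  x, y = Tensor.randn(64, 784), Tensor.randint(64, high=10)
  opt.zero_grad()
  loss = model(x).sparse_categorical_crossentropy(y).backward()
  opt.step()
```
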
George Hotz
96d16675fe update examples/gradaccum_mnist.py to use the JIT 2025-12-03 16:11:42 -08:00
George Hotz
a4c4e48385 add LUNIQUE op (#13554) 2025-12-03 14:34:34 -08:00
wozeparrot
8713ae6de9 fix: dead sdv2 download link (#13521) 2025-12-01 22:50:53 -08:00
George Hotz
44104b0b7f mnist with grad acc + Adam on CPU (#13520)
* mnist with grad acc + Adam on CPU

* still broken, but closer

* works w/o jit

* this works without the jit
2025-12-01 18:27:32 -08:00
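
This entry and the gradaccum_mnist.py one above revolve around gradient accumulation: run several micro-batches, let backward() add into .grad, then take one Adam step. A rough sketch of that idea, assuming backward() accumulates across calls; the model, batch sizes, and ACC_STEPS are placeholders, and the JIT version would wrap the micro-step in TinyJit.

```python
from tinygrad import Tensor, nn

model = nn.Linear(784, 10)
opt = nn.optim.Adam(nn.state.get_parameters(model), lr=1e-3)
ACC_STEPS = 4  # placeholder micro-batch count

with Tensor.train():
  for step in range(2):
    opt.zero_grad()
    for _ in range(ACC_STEPS):
      x, y = Tensor.randn(16, 784), Tensor.randint(16, high=10)
      # backward() adds into .grad, so repeated calls before step() accumulate
      (model(x).sparse_categorical_crossentropy(y) / ACC_STEPS).backward()
    opt.step()
```
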
George Hotz
8e8fec408e fix n^2 _apply_map_to_tensors [pr] (#13443)
* clean up slow rules

* fix rule

* non n^2 toposort

* topovisit

* state dict profile_marker
2025-11-24 18:59:16 -08:00
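
The "non n^2 toposort" is the interesting bit here: visit each node exactly once with an explicit stack instead of rescanning the graph. A generic sketch of that pattern, not the actual _apply_map_to_tensors code.

```python
def toposort(sinks, parents):
  # iterative post-order DFS: each node is expanded once, so the cost is
  # O(nodes + edges) rather than quadratic
  visited, order, stack = set(), [], [(n, False) for n in sinks]
  while stack:
    node, processed = stack.pop()
    if processed:
      order.append(node)                 # all parents are already in `order`
    elif node not in visited:
      visited.add(node)
      stack.append((node, True))
      stack.extend((p, False) for p in parents(node))
  return order

deps = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}
print(toposort(["d"], lambda n: deps[n]))  # ['a', 'c', 'b', 'd']: parents before children
```
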
George Hotz
cc5e6323ac stable diffusion profiling (#13441)
* stable diffusion profiling

Signed-off-by: George Hotz <geohot@gmail.com>

* profile_marker

* profile per step

* fix slow Context

* profile that

---------

Signed-off-by: George Hotz <geohot@gmail.com>
2025-11-24 15:25:45 -08:00
chenyu
646372490c move tiktoken import in llama3 (#13316)
only Tokenizer requires that
2025-11-17 14:09:37 -05:00
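
The pattern here is a local import: pull tiktoken in only where it is used, so importing the llama3 module does not require the package. A hedged sketch; the class body is illustrative, not the real Tokenizer.

```python
class Tokenizer:
  def __init__(self, model_path: str):
    import tiktoken  # moved from module level into the only place that needs it
    # placeholder: the real Tokenizer builds a tiktoken.Encoding from the model's BPE file,
    # and model_path is kept only to mirror the shape of the real constructor
    self.enc = tiktoken.get_encoding("cl100k_base")

  def encode(self, text: str) -> list[int]:
    return self.enc.encode(text)
```
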
George Hotz
17aa3379e9 hotfix: improve self_tokenize 2025-11-13 00:18:57 -08:00
chenyu
4e5a9132e7 JIT_BATCH_SIZE=0 in compile3 (#13245)
fixed some enqueue time
2025-11-12 23:12:45 -05:00
chenyu
41e45c20ff minor stuff reading the printed code [pr] (#13177) 2025-11-09 00:58:51 -05:00
chenyu
834067d91f move onnx import in compile3 (#13172)
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
C T
0f9d7f650d whisper: fix oob, explicit dtype (#13144)
* fix dtype depending on numpy version

under numpy v2, np.array returns int64, which Tensor passed through for the first decode call, swallowing the <|notimestamps|> token and corrupting the sequence

* fix whisper OOB

global limit on whisper's context length

* enforce whisper max_tokens_to_sample (match openai)

local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
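
Both fixes are easy to picture: build the token array with an explicit integer dtype so numpy 2 cannot silently promote it to int64, and cap the decode loop. The token ids, the cap, and the decode_step stub below are placeholders, not whisper's real values.

```python
import numpy as np
from tinygrad import Tensor

EOT = 50257                 # placeholder end-of-transcript id
MAX_TOKENS_TO_SAMPLE = 224  # placeholder "local limit" on decoded tokens

def decode_step(ctx: Tensor) -> int:  # stand-in for the real decoder call
  return EOT

# explicit dtype: under numpy v2 a bare np.array(...) would default to int64
tokens = list(np.array([50258, 50259, 50359], dtype=np.int32))
while len(tokens) < MAX_TOKENS_TO_SAMPLE:
  nxt = decode_step(Tensor(np.array(tokens, dtype=np.int32)))
  tokens.append(nxt)
  if nxt == EOT:
    break
```
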
chenyu
74db65cf72 update mlperf bert LOGMLPERF (#13065) 2025-11-02 15:26:37 -05:00
b1tg
45e2f916a3 add quantize fp8 in llama3 (#12893)
* add quantize fp8 in llama3

* don't truncate fp8 alu result

* cast to float32 before matmul

* --model weights/LLaMA-3/8B-SF-DPO/

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-27 10:22:57 -04:00
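
The shape of the change is classic weight quantization with dequant-before-matmul. The sketch below uses int8 as a stand-in dtype so it runs anywhere, whereas the real PR uses an fp8 dtype; QuantLinear and the per-channel scaling are made up for illustration.

```python
from tinygrad import Tensor, dtypes

QUANT_DTYPE = dtypes.int8  # stand-in for the fp8 dtype the real change uses

class QuantLinear:
  def __init__(self, w: Tensor):
    # per-output-channel scale so the quantized values fill the narrow range
    self.scale = w.abs().max(axis=1, keepdim=True) / 127.0
    self.qw = (w / self.scale).cast(QUANT_DTYPE)

  def __call__(self, x: Tensor) -> Tensor:
    # dequantize and cast back to float32 before the matmul, echoing the
    # "cast to float32 before matmul" note in the commit
    return x @ (self.qw.cast(dtypes.float32) * self.scale).T

y = QuantLinear(Tensor.randn(64, 32))(Tensor.randn(4, 32))
```
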
Harald Schäfer
587ccc0e5c compile3: make selftests opt-in (#12851) 2025-10-21 11:32:27 -07:00
wozeparrot
990e8b97ee feat: log openpilot 0.10.1 times (#12816) 2025-10-20 18:30:34 -07:00
Sieds Lykles
1e93d19ee3 stable diffusion --fakeweights (#12810) 2025-10-20 12:41:06 +02:00
Harald Schäfer
addc54b96c Simplify openpilot compile3.py (#12748)
* Simpler compile3

* tests

* remove default args

* onnx file is still fp16

* self-test FP16 too

* allow test disable

* absurd tolerance

* Just do latest

* Try simplest

* use later models

* kernel count not relevant if speed is good

* dead imports

* Revert "dead imports"

This reverts commit f68c2cd15d.

* Revert "kernel count not relevant if speed is good"

This reverts commit 0955ca4ee0.

* add back kernel count check on latest model
2025-10-18 10:12:22 -04:00
chenyu
285534ce64 delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES (#12744)
does nothing now
2025-10-16 14:11:33 -04:00
chenyu
f34f26bca0 fix gpt2 with benchmark (#12736)
`CPU=1 python3 examples/gpt2.py --benchmark 128` works now
2025-10-16 09:55:20 -04:00
George Hotz
af4479c169 faster stable diffusion load (#12725)
* faster stable diffusion load

* failing tests
2025-10-16 18:31:59 +08:00
George Hotz
612e3d6143 replace mop arg with vectorized index (#12695)
* replace mop arg with vectorized index

* tests passing

* better viz

* no compile4
2025-10-15 20:50:06 +08:00
chenyu
70dd297a05 BS=96 for bert (#12675)
96 trains fine now
2025-10-14 09:07:43 -04:00
chenyu
77b5e6774e fix bert training config (#12647)
FREE_INTERMEDIATE=0 REWRITE_STACK_LIMIT=500000
2025-10-13 15:03:47 -04:00
chenyu
0f776c6e46 examples/mlperf/training_submission_v6.0 (#12644)
copied from v5.1
2025-10-13 09:58:25 -04:00
nimlgen
658c566e22 vars in gated_read_image_count (#12486)
* vars in gated_read_image_count

* nc
2025-10-09 14:54:15 +08:00
George Hotz
6e6059dde0 clean up stable diffusion weight loading (#12452) 2025-10-09 11:13:11 +08:00
chenyu
be05028419 move ASSERT_MIN_STEP_TIME to compile3 (#12535)
threshold is current time +20%
2025-10-08 22:16:59 -04:00
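
The threshold arithmetic is simple enough to show; the numbers and the check function below are illustrative, not compile3.py's actual logic.

```python
CURRENT_STEP_TIME_MS = 45.0                        # hypothetical measured baseline
STEP_TIME_BUDGET_MS = CURRENT_STEP_TIME_MS * 1.20  # "threshold is current time +20%"

def check_step_time(measured_ms: float) -> None:
  assert measured_ms <= STEP_TIME_BUDGET_MS, \
    f"step took {measured_ms:.1f}ms, budget is {STEP_TIME_BUDGET_MS:.1f}ms"

check_step_time(48.0)  # passes; anything above 54.0ms would fail
```
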
chenyu
28edea5d67 delete FUSE_CONV_BW (#12527) 2025-10-08 10:41:38 -04:00
Rudeus
a65ec5c693 fix fromarray deprecation (#12512) 2025-10-08 09:13:26 -04:00
qazal
7e0b14243e delete grouper and kernelize (#12517)
* delete grouper and kernelize

* +sys.setrecursionlimit
2025-10-08 12:27:26 +03:00
chenyu
e701106a64 remove FUSE_ARANGE (#12511)
it was the default already
2025-10-08 04:54:07 -04:00
George Hotz
0f25b4b289 move frontend dir to nn [pr] (#12470) 2025-10-07 10:42:22 +08:00
qazal
1af05dae77 fix rangeify in compile4.py (#12467)
* fix rangeify in compile4.py

* fix type_verify
2025-10-06 13:37:46 +03:00
hooved
69857d0ab0 Stable Diffusion mlperf training (#11304)
* entrypoint for sd mlperf train development

* match sd-v2 mlperf reference unet

* implement dataloader from mlperf ref

* update dataloader reference

* implement LambdaLR scheduler from mlperf ref

* match tokenizer from mlperf reference

* sample latent

* add noise to latent

* complete training epoch

* run full training step

* jit training loop

* replicate mlperf ref. losses over 11 train steps

* save tinygrad loss checkpoints properly

* match out.2.bias.grad to reference

* match weights to ref after 1 step

* compare out.2.bias to ref over three train steps

* implement attn_mask; cleanup closeness testing

* correct mse loss

* update dev_run / dependencies

* setup validation config/checkpointing

* implement validation sampling

* test closeness of eval denoise step to mlperf ref

* test closeness of decoder to mlperf ref

* confirm inception matches mlperf ref

* resize w/ bicubic interpolation, test closeness

* confirm closeness of clip preprocess to mlperf ref

* confirm clip score matches mlperf ref

* confirm fid/clip scores match mlperf ref

* cleanup

* cleanup

* zero-init some unet params as in mlperf reference

* revert jit change

* uncomment dependencies

* move to tinybox red

* implement GradScaler from torch but jittable

* simplify lr_scheduler, ensure jittability

* instantiate GradScaler

* only check if grads are finite with fp16

* implement fp16 training loop

* refactor UNet: norm, gelu, mixed precision

* refactor clip_tokenizer to enable versioning

* make fp16 attention closer to torch

* remove comparisons to torch fp16 attention

* add globvars.py for reference

* confirm closeness of fp16 unet forward to mlperf

* test norm closeness to torch with precast

* remeasure e2e with master attention

* more detailed softmax upcast comparison to torch

* parameterize softmax upcast in attention and unet

* use fp32 weights with autocast to fp16

* cleanup

* add data/checkpoint download script

* debug kernel timeout on AMD

* fix finite grads check; start multigpu

* pass numpy arrays from dataloader

* include text encoder in jit train step

* use int32 for tokens instead of int64

* prevent multi bug in reshape within clip

* corealize more, del refs before

* add more logging and wandb

* use erf gelu in clip encoder

* minor changes to train step and logging

* save checkpoints for eval or resuming

* add eval-only logic to training script

* multigpu eval

* remove PARALLEL=0

* cleanup

* pad eval batches of size < EVAL_BS

* workaround silent multigpu bug in jit

* cleanup

* tokenize captions

* verify correctness of multigpu eval

* cleanup

* verify correctness of grads in train step

* verify correctness of training (20 steps)

* don't shard in the training jit

* training settings

* minor cleanup

* overfit train w/ eval on 6 samples

* offload to enable combined train and eval

* download to raid; use local rclone

* misc changes for mi300x / logging

* refactor eval for larger BS, verify correctness

* cleanup

* ckpt resuming and remove eval cats

* eval BEAM config on mi300x and red

* resume eval after crash

* confirm eval correctness (one iteration, 6 samples)

* verify eval correctness at full scale

* cleanup correctness testing

* training correctness (20 steps, BS=248 uniform)

* cleanup

* remove eval cache at end of run

* switch f16 for bf16, del grad scaler

* confirm bf16 training correctness

* timestamps, new jits

* merge jits in training

* realize loss/lr on CPU

* training correctness

* post-bf16 train/eval

* implement grad_acc with timing/logging

* beam offline; debug gradacc; use float32

* fix gradacc in jit, correctness test

* prepare f32 BS=512 gradacc=4 run

* workaround jit problem in diffusion eval

* scale lr by BS

* revert gradacc, prepare bf16 BS=336 lr*=BS train

* make checkpointing faster

* resume bf16 BS=336 base_lr=1.25e-7 run

* jit ckpt at beginning

* don't alloc more gpu mem in ckpt

* cleanup

* move script to mi300x dir

* cleanup

* cleanup unneeded files

* revert beam search to master

* minor changes

* fix regression: realize before assign in eval

* cleanup mlperf SD data/ckpt downloads

* workaround BEAM failure

* workaround bug in Tensor.stack

* minor changes

* revert gradscaler

* cleanup

* cleanup/validate dataloader

* ensure checksum of laion data

* simplify config

* load training state to jitted bufs

* simplify lr scheduler

* simplify train script

* cleanup comments

* refactor stable diffusion/unet init

* more refactoring of stable diffusion init

* fix import errors in tests

* refactor: separate train/eval

* fix import errors

* eval checkpoints in reverse chron. order

* save/load cycle in sd init

* refactor and verify eval

* verify training correctness

* prepare repro train run

* cleanup

* integrate beam retry, train, eval

* simplify wandb

* kill orphaned processes

* better logging

* train to 10 ckpts instead of 7

* remove optimizer/scheduler checkpointing/resume

* cleanup

* BEAM=2 7 ckpts

* add test to compare with torch softmax in amp

* cleanup

* stop eval early if checkpoint converged

* add test for lr scheduler

* add proper test method

* add test for training

* use venv name that is ignored by .gitignore

* linting

* add simple f32 softmax fxn

* revert change to scaled_dot_product_attention

* refactor gelu_erf init

* simplify mixed precision in unet

* add norm autocasting to fp32

* rm extra test

* test eval with NULL backend

* fix venv name

* simplify norm autocast

* use temp dir for training test

* actually add eval test

* remove parallel env variable from tests

* update clip with tests

* reorg init functions

* use np for testing

* remove unused var

* factor out GPUS

* add sd model init tests

* more unet tests

* match master

* rerun CI due to linux (remote) hang

* explain UNET_CKPTDIR

* rerun CI due to linux (remote) timeout

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-05 07:56:05 -04:00
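
Among the many steps in the entry above, "implement GradScaler from torch but jittable" and "only check if grads are finite with fp16" (later dropped in the bf16 switch) describe a recognizable loss-scaling pattern. A hypothetical, non-jitted sketch of that idea; the names, the finiteness trick, and the scale-update rule are assumptions, not the real code.

```python
from tinygrad import Tensor, nn

def grads_are_finite(grads: list[Tensor]) -> bool:
  # x*0 == 0 fails for nan and inf entries, so this flags non-finite grads
  return all(bool(((g * 0) == 0).all().item()) for g in grads)

class SimpleGradScaler:
  """Hypothetical sketch: scale the loss, unscale the grads, skip the step and
  shrink the scale on overflow, otherwise grow the scale back slowly."""
  def __init__(self, scale: float = 2.0**14): self.scale = scale
  def scale_loss(self, loss: Tensor) -> Tensor: return loss * self.scale
  def step(self, opt) -> bool:
    grads = [p.grad for p in opt.params if p.grad is not None]
    for g in grads: g.assign(g / self.scale)   # unscale in place
    if grads_are_finite(grads):
      opt.step(); self.scale *= 1.001
      return True
    self.scale /= 2.0                          # overflow: skip this step
    return False

model = nn.Linear(8, 8)
opt = nn.optim.SGD(nn.state.get_parameters(model), lr=1e-4)
scaler = SimpleGradScaler()
with Tensor.train():
  scaler.scale_loss(model(Tensor.randn(4, 8)).square().mean()).backward()
  stepped = scaler.step(opt)
```
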
hooved
1e8945a28c Training loop for Stable Diffusion mlperf (#12315)
* add diff

* fix edit error

* match master

* point reference to specific commit

* simplify wandb logging

* remove lr test, dehardcode device

* increase stack size limit
2025-10-03 02:45:38 -04:00
hooved
5d9035f5a6 Eval for Stable Diffusion mlperf (#12316)
* add diff

* rerun ci

* refactor beam workaround, add test

* fix conflict

* linting
2025-10-02 02:35:38 -04:00
hooved
0f804c9a83 Stable Diffusion model init for mlperf (#12314)
* include clip pr diff

* updated unet and sd init

* dehardcode default device

* revert beam hang workaround

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-02 02:28:41 -04:00
hooved
969a1b35ca LR scheduler for Stable Diffusion mlperf training (#12201)
* add lr scheduler for stable diffusion training

* add lr scheduler test

* rerun ci

* rerun CI

* use np for testing

* move test to CI path

* remove unneeded copy
2025-09-30 21:21:08 -04:00
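
The scheduler itself is a small piece: a LambdaLR-style linear warmup that writes the new rate into the optimizer's lr tensor each step. A sketch of that shape; WARMUP_STEPS, BASE_LR, and the tiny model are placeholders, not the mlperf reference values or the actual scheduler added here.

```python
from tinygrad import Tensor, nn

WARMUP_STEPS, BASE_LR = 1000, 1.25e-7   # placeholder schedule constants

model = nn.Linear(8, 8)
opt = nn.optim.AdamW(nn.state.get_parameters(model), lr=BASE_LR)

def set_lr(step: int) -> None:
  # linear warmup to BASE_LR, then flat; tinygrad keeps lr as a small tensor
  factor = min(1.0, (step + 1) / WARMUP_STEPS)
  opt.lr.assign(Tensor([BASE_LR * factor], device=opt.lr.device, dtype=opt.lr.dtype))

for step in (0, 499, 2000):
  set_lr(step)
  print(step, opt.lr.item())
```
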