10443 Commits

Author SHA1 Message Date
George Hotz
f082cbcb36 respect the 8x8 tiles 2025-10-07 17:51:24 +08:00
George Hotz
5ad62f130d split on tile_dim 2025-10-07 17:40:00 +08:00
George Hotz
f129d75ee5 fix on cpu 2025-10-07 16:43:52 +08:00
George Hotz
51f3a5cbb4 gpu 2025-10-07 16:00:10 +08:00
George Hotz
1d7a8b33c1 gemm works on pc 2025-10-07 15:52:00 +08:00
George Hotz
3fae886aa9 Merge branch 'master' into clone_tg 2025-10-07 14:02:36 +08:00
George Hotz
514d2a0774 merge tagless reshapes (#12474)
* merge tagless reshapes

* cleanup
2025-10-07 13:57:58 +08:00
chenyu
7b48f3cc45 failed test case repro for openpilot model (#12475)
* failed test case repro for openpilot model

* assertEqual
2025-10-07 13:46:43 +08:00
George Hotz
3f44ef699f Merge branch 'master' into clone_tg 2025-10-07 13:08:35 +08:00
George Hotz
fa23f37e33 clone thunderkittens in uops 2025-10-07 13:06:58 +08:00
George Hotz
284db26a12 cleanup 2025-10-07 12:13:01 +08:00
George Hotz
0a0cb0b9e8 merge tagless reshapes 2025-10-07 12:11:11 +08:00
chenyu
a5484b767e remove skipping cast in simplify_valid [pr] (#12472)
* remove skipping cast in simplify_valid [pr]

unsupported statements are handled in uop_given_valid already. the test failed because (100%x) somehow got simplified

* better test
2025-10-07 00:10:04 -04:00
George Hotz
b4509fba31 thundermittens (#12471)
* thundermittens

* give device a type
2025-10-07 11:47:39 +08:00
George Hotz
0f25b4b289 move frontend dir to nn [pr] (#12470) 2025-10-07 10:42:22 +08:00
qazal
f664bcc8bd use recursive_property in UOp tracing (#12469)
* test

* simple passing
2025-10-06 21:10:52 +03:00
qazal
1af05dae77 fix rangeify in compile4.py (#12467)
* fix rangeify in compile4.py

* fix type_verify
2025-10-06 13:37:46 +03:00
qazal
76e8a3250c rangeify: late zero folding (#12464)
* rangeify: late zero folding

* early

* not kernels

* none

* multi

* linter

* mstack is sink comment

* more comment
2025-10-06 12:52:33 +03:00
George Hotz
0c015a24fe use recursive_property to prevent RecursionError (#12465)
* use recursive_property to prevent RecursionError

* not slower

* fix tests

* faster

* simpler
2025-10-06 15:59:18 +08:00
chenyu
a1881b0c17 update test_chicken (#12466)
logits are close, just numerical
2025-10-06 03:58:44 -04:00
qazal
1b1978b9c0 early copy fixup (#12463)
* simple failing test

* early copy fixup
2025-10-06 06:38:29 +03:00
chenyu
c1e85f699c multi test case for sharded ring allreduce (#12462)
* multi test case for sharded ring allreduce

triggers `children not making progress` with RANGEIFY

* expect_rangeify_fails
2025-10-05 23:18:24 -04:00
chenyu
1823a5043f don't check MAX_BUFFER_SIZE on NULL (#12461) 2025-10-05 22:09:29 -04:00
George Hotz
46e8ea15c1 split pm_substitute_recurse (#12460) 2025-10-05 21:35:50 -04:00
nimlgen
1216fff781 remote: raise runtimeerror in checkz (#12453) 2025-10-05 21:22:53 +08:00
qazal
6ad9a688ed add failing test after "pend substitutes for speed" (#12457)
* add failing substitute test

* expect_rangeify_fails
2025-10-05 16:10:04 +03:00
chenyu
74b04f7dca test beautiful_mnist_multigpu (#12455)
* test beautiful_mnist_multigpu

another example that fails with RANGEIFY

* now i remember

* MAX_BUFFER_SIZE=0
2025-10-05 08:45:01 -04:00
hooved
69857d0ab0 Stable Diffusion mlperf training (#11304)
* entrypoint for sd mlperf train development

* match sd-v2 mlperf reference unet

* implement dataloader from mlperf ref

* update dataloader reference

* implement LambdaLR scheduler from mlperf ref

* match tokenizer from mlperf reference

* sample latent

* add noise to latent

* complete training epoch

* run full training step

* jit training loop

* replicate mlperf ref. losses over 11 train steps

* save tinygrad loss checkpoints properly

* match out.2.bias.grad to reference

* match weights to ref after 1 step

* compare out.2.bias to ref over three train steps

* implement attn_mask; cleanup closeness testing

* correct mse loss

* update dev_run / dependencies

* setup validation config/checkpointing

* implement validation sampling

* test closeness of eval denoise step to mlperf ref

* test closeness of decoder to mlperf ref

* confirm inception matches mlperf ref

* resize w/ bicubic interpolation, test closeness

* confirm closeness of clip preprocess to mlperf ref

* confirm clip score matches mlperf ref

* confirm fid/clip scores match mlperf ref

* cleanup

* cleanup

* zero-init some unet params as in mlperf reference

* revert jit change

* uncomment dependencies

* move to tinybox red

* implement GradScaler from torch but jittable

* simplify lr_scheduler, ensure jittability

* instantiate GradScaler

* only check if grads are finite with fp16

* implement fp16 training loop

* refactor UNet: norm, gelu, mixed precision

* refactor clip_tokenizer to enable versioning

* make fp16 attention closer to torch

* remove comparisons to torch fp16 attention

* add globvars.py for reference

* confirm closeness of fp16 unet forward to mlperf

* test norm closeness to torch with precast

* remeasure e2e with master attention

* more detailed softmax upcast comparison to torch

* parameterize softmax upcast in attention and unet

* use fp32 weights with autocast to fp16

* cleanup

* add data/checkpoint download script

* debug kernel timeout on AMD

* fix finite grads check; start multigpu

* pass numpy arrays from dataloader

* include text encoder in jit train step

* use int32 for tokens instead of int64

* prevent multi bug in reshape within clip

* corealize more, del refs before

* add more logging and wandb

* use erf gelu in clip encoder

* minor changes to train step and logging

* save checkpoints for eval or resuming

* add eval-only logic to training script

* multigpu eval

* remove PARALLEL=0

* cleanup

* pad eval batches of size < EVAL_BS

* workaround silent multigpu bug in jit

* cleanup

* tokenize captions

* verify correctness of multigpu eval

* cleanup

* verify correctness of grads in train step

* verify correctness of training (20 steps)

* don't shard in the training jit

* training settings

* minor cleanup

* overfit train w/ eval on 6 samples

* offload to enable combined train and eval

* download to raid; use local rclone

* misc changes for mi300x / logging

* refactor eval for larger BS, verify correctness

* cleanup

* ckpt resuming and remove eval cats

* eval BEAM config on mi300x and red

* resume eval after crash

* confirm eval correctness (one iteration, 6 samples)

* verify eval correctness at full scale

* cleanup correctness testing

* training correctness (20 steps, BS=248 uniform)

* cleanup

* remove eval cache at end of run

* switch f16 for bf16, del grad scaler

* confirm bf16 training correctness

* timestamps, new jits

* merge jits in training

* realize loss/lr on CPU

* training correctness

* post-bf16 train/eval

* implement grad_acc with timing/logging

* beam offline; debug gradacc; use float32

* fix gradacc in jit, correctness test

* prepare f32 BS=512 gradacc=4 run

* workaround jit problem in diffusion eval

* scale lr by BS

* revert gradacc, prepare bf16 BS=336 lr*=BS train

* make checkpointing faster

* resume bf16 BS=336 base_lr=1.25e-7 run

* jit ckpt at beginning

* don't alloc more gpu mem in ckpt

* cleanup

* move script to mi300x dir

* cleanup

* cleanup unneeded files

* revert beam search to master

* minor changes

* fix regression: realize before assign in eval

* cleanup mlperf SD data/ckpt downloads

* workaround BEAM failure

* workaround bug in Tensor.stack

* minor changes

* revert gradscaler

* cleanup

* cleanup/validate dataloader

* ensure checksum of laion data

* simplify config

* load training state to jitted bufs

* simplify lr scheduler

* simplify train script

* cleanup comments

* refactor stable diffusion/unet init

* more refactoring of stable diffusion init

* fix import errors in tests

* refactor: separate train/eval

* fix import errors

* eval checkpoints in reverse chron. order

* save/load cycle in sd init

* refactor and verify eval

* verify training correctness

* prepare repro train run

* cleanup

* integrate beam retry, train, eval

* simplify wandb

* kill orphaned processes

* better logging

* train to 10 ckpts instead of 7

* remove optimizer/scheduler checkpointing/resume

* cleanup

* BEAM=2 7 ckpts

* add test to compare with torch softmax in amp

* cleanup

* stop eval early if checkpoint converged

* add test for lr scheduler

* add proper test method

* add test for training

* use venv name that is ignored by .gitignore

* linting

* add simple f32 softmax fxn

* revert change to scaled_dot_product_attention

* refactor gelu_erf init

* simplify mixed precision in unet

* add norm autocasting to fp32

* rm extra test

* test eval with NULL backend

* fix venv name

* simplify norm autocast

* use temp dir for training test

* actually add eval test

* remove parallel env variable from tests

* update clip with tests

* reorg init functions

* use np for testing

* remove unused var

* factor out GPUS

* add sd model init tests

* more unet tests

* match master

* rerun CI due to linux (remote) hang

* explain UNET_CKPTDIR

* rerun CI due to linux (remote) timeout

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-05 07:56:05 -04:00
George Hotz
a976ace404 minor improvements to rewrite (#12454)
* minor improvements to rewrite

* need that continue

* faster
2025-10-05 18:09:32 +08:00
qazal
4b60121498 fix bmnist torch with RANGEIFY=1 (#12442)
* fix bmnist torch with RANGEIFY=1

* alt

* test and comment

* this was always wrong

* simple failing test for rangeify

* simple upat to match the old behavior
2025-10-05 12:34:27 +03:00
George Hotz
b5f31d7505 earlier seen children (#12451) 2025-10-05 15:55:13 +08:00
qazal
865d5796f8 add a test for untested Tensor.assign behavior (#12448)
* add a test for untested Tensor.assign behavior

* better
2025-10-04 12:44:56 +03:00
Sieds Lykles
e74be4a140 UOp.factor and add chain sorting (#12413)
* add ordering

* fix some tests

* fix more tests

* shorten comment

* update test

* add rule and test

* add rule and test

* remove check

* use fold_divmod_congruence instead of simplify

* adjust tests

* shorten line

* new algo

* add test

* add function to un-nest the div

* add UOp.factor

* test UOp.factor

* uop_given_valid tries to factor simplex expression

* shorten line

* symbolic_flat is back

* change that back

* fix those new tests

* new rule for ordering

* factor multiple factors

* no symbolic_flat

* symbolic_flat to there

* move that back

* fix imports

* merge correctly

* linter happy

* add rule

* add a test

* cleanup

* revert that for now

* UOp.factor returns self instead of None

* try all_candidates

* remove or_else

* post index symbolic

* add test

* make this closer to the original

* increase mac hlb_cifar min step time

* add some ordering tests

* cleanup

* increase pytest timeout time

* check dtype
2025-10-04 06:05:38 +02:00
Sieds Lykles
394dc24110 post index symbolic (#12446)
* post index symbolic

* add test
2025-10-03 23:23:03 +02:00
chenyu
9f2b69b870 enable a few tests for PTX test_dtype (#12445) 2025-10-03 08:56:30 -04:00
George Hotz
0b534f71c2 recursive substitute should be O(n) (#12444)
* recursive substitute

* even faster

* make that a single rewrite
2025-10-03 18:29:59 +08:00
chenyu
b087663c35 RANGEIFY test_bert uses more RAM somehow (#12443) 2025-10-03 04:38:53 -04:00
chenyu
940a8d5ba9 default IGNORE_OOB=1 (#12441)
* default IGNORE_OOB=1

z3 can get very slow with RANGEIFY; also update some kernel counts to their current values

* add to test
2025-10-03 04:16:19 -04:00
George Hotz
d290e77a5b pend substitutes for speed (#12440) 2025-10-03 15:49:19 +08:00
nimlgen
23d310bcc1 ptx: handle i8/u8 casts correctly (#12439)
* ptx: handle casts correctly

* notsetp
2025-10-03 15:34:15 +08:00
hooved
1e8945a28c Training loop for Stable Diffusion mlperf (#12315)
* add diff

* fix edit error

* match master

* point reference to specific commit

* simplify wandb logging

* remove lr test, dehardcode device

* increase stack size limit
2025-10-03 02:45:38 -04:00
George Hotz
c7849ac593 fix test lil model (#12437)
* fix test lil model

* 4 not 3
2025-10-03 02:28:37 -04:00
chenyu
0f82d92b9d use float for softmax in llm.py (#12438)
fixed numerical issue in `CPU=1 RANGEIFY=1 python3 -m tinygrad.apps.llm`
2025-10-03 02:27:56 -04:00
George Hotz
4c63f7e786 skip copies of reshaped buffers (#12430)
* skip copies of reshaped buffers

* always run NOOP

* comment

* comment
2025-10-03 13:05:47 +08:00
Sieds Lykles
0047bcc535 undo loaded comparison swap (#12436)
* add rule

* add a test
2025-10-03 06:57:29 +02:00
chenyu
f203d8b221 update RANGEIFY kernel count and test_masked_select (#12435) 2025-10-03 00:41:34 -04:00
wozeparrot
a6dd5a224b skip webgpu tests (#12433) 2025-10-02 21:31:07 -07:00
chenyu
bf99de7b1e update a few more tests for RANGEIFY (#12434) 2025-10-03 00:16:58 -04:00
George Hotz
9cd365c12e little changes from double gemm (#12429)
* little changes from double gemm

* split pm_group_for_reduce

* pm_add_buffers_local

* Revert "pm_add_buffers_local"

This reverts commit 4d30a91db2.
2025-10-03 10:31:51 +08:00
Sieds Lykles
16a65b4fd0 fix test_symbolic_gcd_div hang (#12427) 2025-10-03 04:21:16 +02:00