* update test_clone_doesnt_dedup to use base
* new_flat_buffer passes
* fix test_reorder_expand
* remove the view stuff
* remove that test, we don't want this view const behavior
* test_setitem_becomes_subbuffer is good
* remove skipping cast in simplify_valid [pr]
Unsupported statements are already handled in uop_given_valid; the test failed because (100 % x) was somehow getting simplified (see the arithmetic check below).
* better test
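A quick plain-Python check of why that fold is invalid (illustrative only, not the UOp/simplify_valid API): a valid that merely bounds x does not pin 100 % x to a single constant.

```python
# plain-Python illustration, assuming the valid only bounds x (e.g. 0 < x < 100):
# 100 % x still takes many different values, so it must not fold to a constant.
values = {100 % x for x in range(1, 100)}
assert len(values) > 1   # e.g. 100 % 3 == 1 while 100 % 7 == 2
```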
* entrypoint for sd mlperf train development
* match sd-v2 mlperf reference unet
* implement dataloader from mlperf ref
* update dataloader reference
* implement LambdaLR scheduler from mlperf ref (see the warmup/LR sketch after this list)
* match tokenizer from mlperf reference
* sample latent
* add noise to latent
* complete training epoch
* run full training step
* jit training loop
* replicate mlperf reference losses over 11 train steps
* save tinygrad loss checkpoints properly
* match out.2.bias.grad to reference
* match weights to ref after 1 step
* compare out.2.bias to ref over three train steps
* implement attn_mask; cleanup closeness testing
* correct mse loss
* update dev_run / dependencies
* setup validation config/checkpointing
* implement validation sampling
* test closeness of eval denoise step to mlperf ref
* test closeness of decoder to mlperf ref
* confirm inception matches mlperf ref
* resize w/ bicubic interpolation, test closeness
* confirm closeness of clip preprocess to mlperf ref
* confirm clip score matches mlperf ref
* confirm fid/clip scores match mlperf ref
* cleanup
* cleanup
* zero-init some unet params as in mlperf reference
* revert jit change
* uncomment dependencies
* move to tinybox red
* implement GradScaler from torch but jittable (see the loss-scaling sketch after this list)
* simplify lr_scheduler, ensure jittability
* instantiate GradScaler
* only check if grads are finite with fp16
* implement fp16 training loop
* refactor UNet: norm, gelu, mixed precision
* refactor clip_tokenizer to enable versioning
* make fp16 attention closer to torch
* remove comparisons to torch fp16 attention
* add globvars.py for reference
* confirm closeness of fp16 unet forward to mlperf
* test norm closeness to torch with precast
* remeasure e2e with master attention
* more detailed softmax upcast comparison to torch
* parameterize softmax upcast in attention and unet
* use fp32 weights with autocast to fp16
* cleanup
* add data/checkpoint download script
* debug kernel timeout on AMD
* fix finite grads check; start multigpu
* pass numpy arrays from dataloader
* include text encoder in jit train step
* use int32 for tokens instead of int64
* prevent multi-device bug in reshape within clip
* corealize more, del refs before
* add more logging and wandb
* use erf gelu in clip encoder
* minor changes to train step and logging
* save checkpoints for eval or resuming
* add eval-only logic to training script
* multigpu eval
* remove PARALLEL=0
* cleanup
* pad eval batches of size < EVAL_BS
* work around silent multigpu bug in jit
* cleanup
* tokenize captions
* verify correctness of multigpu eval
* cleanup
* verify correctness of grads in train step
* verify correctness of training (20 steps)
* don't shard in the training jit
* training settings
* minor cleanup
* overfit train w/ eval on 6 samples
* offload to enable combined train and eval
* download to raid; use local rclone
* misc changes for mi300x / logging
* refactor eval for larger BS, verify correctness
* cleanup
* ckpt resuming and remove eval cats
* eval BEAM config on mi300x and red
* resume eval after crash
* confirm eval correctness (one iteration, 6 samples)
* verify eval correctness at full scale
* cleanup correctness testing
* training correctness (20 steps, BS=248 uniform)
* cleanup
* remove eval cache at end of run
* switch from f16 to bf16, del grad scaler
* confirm bf16 training correctness
* timestamps, new jits
* merge jits in training
* realize loss/lr on CPU
* training correctness
* post-bf16 train/eval
* implement grad_acc with timing/logging
* beam offline; debug gradacc; use float32
* fix gradacc in jit, correctness test
* prepare f32 BS=512 gradacc=4 run
* work around jit problem in diffusion eval
* scale lr by BS
* revert gradacc, prepare bf16 BS=336 lr*=BS train
* make checkpointing faster
* resume bf16 BS=336 base_lr=1.25e-7 run
* jit ckpt at beginning
* don't alloc more gpu mem in ckpt
* cleanup
* move script to mi300x dir
* cleanup
* cleanup unneeded files
* revert beam search to master
* minor changes
* fix regression: realize before assign in eval
* cleanup mlperf SD data/ckpt downloads
* work around BEAM failure
* work around bug in Tensor.stack
* minor changes
* revert gradscaler
* cleanup
* cleanup/validate dataloader
* ensure checksum of laion data
* simplify config
* load training state to jitted bufs
* simplify lr scheduler
* simplify train script
* cleanup comments
* refactor stable diffusion/unet init
* more refactoring of stable diffusion init
* fix import errors in tests
* refactor: separate train/eval
* fix import errors
* eval checkpoints in reverse chronological order
* save/load cycle in sd init
* refactor and verify eval
* verify training correctness
* prepare repro train run
* cleanup
* integrate beam retry, train, eval
* simplify wandb
* kill orphaned processes
* better logging
* train to 10 ckpts instead of 7
* remove optimizer/scheduler checkpointing/resume
* cleanup
* BEAM=2 7 ckpts
* add test to compare with torch softmax in amp
* cleanup
* stop eval early if checkpoint converged
* add test for lr scheduler
* add proper test method
* add test for training
* use venv name that is ignored by .gitignore
* linting
* add simple f32 softmax fxn (see the softmax sketch after this list)
* revert change to scaled_dot_product_attention
* refactor gelu_erf init
* simplify mixed precision in unet
* add norm autocasting to fp32
* rm extra test
* test eval with NULL backend
* fix venv name
* simplify norm autocast
* use temp dir for training test
* actually add eval test
* remove parallel env variable from tests
* update clip with tests
* reorg init functions
* use np for testing
* remove unused var
* factor out GPUS
* add sd model init tests
* more unet tests
* match master
* rerun CI due to linux (remote) hang
* explain UNET_CKPTDIR
* rerun CI due to linux (remote) timeout
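Hedged sketch for the LambdaLR/warmup and "scale lr by BS" entries above: a plain-Python schedule with linear warmup and the base LR scaled by the global batch size. BASE_LR and BS echo the bf16 BS=336 base_lr=1.25e-7 run noted above; WARMUP_STEPS is an assumed placeholder, not the script's actual constant.

```python
BASE_LR, BS, WARMUP_STEPS = 1.25e-7, 336, 1000   # WARMUP_STEPS is illustrative

def lr_at(step: int) -> float:
  peak = BASE_LR * BS                              # linear LR scaling with batch size
  warmup = min(step, WARMUP_STEPS) / WARMUP_STEPS  # ramp 0 -> 1, then hold at peak
  return peak * warmup

assert lr_at(0) == 0.0
assert lr_at(WARMUP_STEPS) == lr_at(10 * WARMUP_STEPS) == BASE_LR * BS
```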
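Hedged sketch for the "implement GradScaler from torch but jittable" / "only check if grads are finite with fp16" entries (the scaler was later dropped when switching to bf16). Framework-free and with illustrative constants; not the script's actual class.

```python
class SimpleGradScaler:
  """Torch-style dynamic loss scaling, reduced to its control flow."""
  def __init__(self, init_scale=2.0**16, growth=2.0, backoff=0.5, growth_interval=2000):
    self.scale, self.growth, self.backoff = init_scale, growth, backoff
    self.growth_interval, self.good_steps = growth_interval, 0

  def scale_loss(self, loss: float) -> float:
    return loss * self.scale                 # backward then yields scaled grads

  def update(self, grads_finite: bool) -> bool:
    # returns True if the (unscaled) optimizer step should be taken
    if not grads_finite:                     # fp16 overflow: skip step, back off scale
      self.scale *= self.backoff
      self.good_steps = 0
      return False
    self.good_steps += 1
    if self.good_steps >= self.growth_interval:   # stable long enough: grow the scale
      self.scale *= self.growth
      self.good_steps = 0
    return True
```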
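Hedged sketch for the softmax-upcast / "add simple f32 softmax fxn" entries: a NumPy illustration of running softmax in float32 on fp16 attention logits and casting back, not the repo's actual function.

```python
import numpy as np

def softmax_f32(x: np.ndarray) -> np.ndarray:
  # upcast, stabilize with max-subtraction, exponentiate and normalize in fp32, cast back
  x32 = x.astype(np.float32)
  x32 = x32 - x32.max(axis=-1, keepdims=True)
  e = np.exp(x32)
  return (e / e.sum(axis=-1, keepdims=True)).astype(x.dtype)

scores = np.random.randn(2, 8, 77, 77).astype(np.float16)  # attention-like fp16 logits
probs = softmax_f32(scores)
assert probs.dtype == np.float16
assert np.allclose(probs.astype(np.float32).sum(axis=-1), 1.0, atol=1e-2)
```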
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* fix bmnist torch with RANGEIFY=1
* alt
* test and comment
* this was always wrong
* simple failing test for rangeify
* simple upat to match the old behavior
* add ordering
* fix some tests
* fix more tests
* shorten comment
* update test
* add rule and test
* add rule and test
* remove check
* use fold_divmod_congruence instead of simplify (see the divmod arithmetic note at the end of this list)
* adjust tests
* shorten line
* new algo
* add test
* add function to un-nest the div
* add UOp.factor
* test UOp.factor
* uop_given_valid tries to factor simple expressions
* shorten line
* symbolic_flat is back
* change that back
* fix those new tests
* new rule for ordering
* factor multiple factors
* no symbolic_flat
* move symbolic_flat there
* move that back
* fix imports
* merge correctly
* linter happy
* add rule
* add a test
* cleanup
* revert that for now
* UOp.factor returns self instead of None
* try all_candidates
* remove or_else
* post index symbolic
* add test
* make this closer to the original
* increase mac hlb_cifar min step time
* add some ordering tests
* cleanup
* increase pytest timeout time
* check dtype
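A note on the plain integer identities behind the fold_divmod_congruence / div-un-nesting entries above (arithmetic only, not the UOp rewrite rules themselves): multiples of k drop out of % and distribute over floor division.

```python
import itertools

# for integer q, r and k > 0: (q*k + r) // k == q + r // k  and  (q*k + r) % k == r % k
for q, r, k in itertools.product(range(-6, 7), range(-6, 7), range(1, 7)):
  assert (q * k + r) // k == q + r // k   # un-nest the div
  assert (q * k + r) % k == r % k         # congruence mod k
```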