Commit Graph

11106 Commits

Author SHA1 Message Date
chenyu
da1f46ff3f remove RANGEIFY specific test jobs (#12507) 2025-10-08 04:12:04 -04:00
George Hotz
1e567a5cf8 make RANGEIFY=1 the default (#12161)
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: Sieds Lykles <93992551+S-Lykles@users.noreply.github.com>
Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-10-08 03:46:09 -04:00
nimlgen
9e7103647d amd: rename cmd_id to sqtt_next_cmd_id (#12503)
* amd: rename cmd_id to sqtt_next_cmd_id

* and fix a typo
2025-10-08 15:16:19 +08:00
nimlgen
4a756a37d8 amd: support rocm7 (#12502)
* amd: support rocm7

* mock
2025-10-08 14:30:39 +08:00
qazal
60b6dca5ba update some tests instead of expect_rangeify_fails (#12500)
* update test_clone_doesnt_dedup to use base

* new_flat_buffer passes

* fix test_reorder_expand

* remove the view stuff

* remove that test, we don't want this view const behavior

* test_setitem_becomes_subbuffer is good
2025-10-08 07:42:31 +03:00
qazal
84597ed53c early assert for device mismatched asts in rangeify (#12499)
* early assert for device mismatched asts in rangeify

* alt also passes
2025-10-08 07:19:36 +03:00
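
A minimal sketch of the kind of early check this commit describes, assuming a hypothetical helper that receives the buffers an AST references; tinygrad's actual rangeify code is not reproduced here:

```python
# hypothetical illustration: fail fast if an AST mixes devices, instead of
# letting the mismatch surface later during codegen
def assert_single_device(buffers) -> None:
  devices = {b.device for b in buffers}  # .device is an assumed attribute
  assert len(devices) <= 1, f"device mismatched ast: {sorted(devices)}"
```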
qazal
2e19354c1c viz: reorder timeline graphs (#12498)
* viz: reorder timeline graphs

* update test_viz with the new order
2025-10-08 07:10:23 +03:00
George Hotz
d06226b575 fix SPEC and all_tensors iterator (#12496) 2025-10-07 23:18:17 -04:00
qazal
a7cb80bfab use recursive_property in UOp device (#12477)
* simple failing test with RecursionError

* switch to @recursive_property

* merge 2

* diff
2025-10-08 06:15:05 +03:00
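
Plain recursion over a deep UOp graph can blow Python's recursion limit, which is what `@recursive_property` avoids (see also #12465 further down). A generic sketch of the idea, replacing the call stack with an explicit one; this is an assumption about the mechanism, not tinygrad's implementation:

```python
# compute a per-node value bottom-up without Python recursion;
# srcs(n) and combine(n, child_vals) are hypothetical stand-ins
def compute_iterative(root, srcs, combine):
  cache, stack = {}, [root]
  while stack:
    n = stack[-1]
    if id(n) in cache: stack.pop(); continue  # duplicate entry, already done
    pending = [s for s in srcs(n) if id(s) not in cache]
    if pending: stack.extend(pending)
    else:
      cache[id(n)] = combine(n, [cache[id(s)] for s in srcs(n)])
      stack.pop()
  return cache[id(root)]
```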
George Hotz
a6d59a0b45 backward_slice to get srcs recursively (#12494)
* change name to backward_slice

* faster check

* clean up comments and names

* comment
2025-10-08 10:31:42 +08:00
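
In DAG terms, a backward slice of a node is the node plus everything reachable through its sources. A generic traversal sketch, assuming each node keeps its inputs in a `.src` tuple as UOps do; this is not the PR's code:

```python
# collect every node the root transitively depends on, iteratively
def backward_slice(root) -> list:
  seen, order, stack = set(), [], [root]
  while stack:
    n = stack.pop()
    if id(n) in seen: continue
    seen.add(id(n)); order.append(n)
    stack.extend(n.src)  # .src: the node's source operands
  return order
```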
chenyu
eb3bc277b3 remove ASSERT_MIN_STEP_TIME in external_benchmark_openpilot (#12495)
should be added for compile3, and compile3 only
2025-10-07 22:13:42 -04:00
qazal
239f9a3029 update viz to not use children [pr] (#12493) 2025-10-08 04:35:01 +03:00
Sieds Lykles
b465c17b56 Revert "UOp.factor and add chain sorting (#12413)" (#12492)
This reverts commit e74be4a140.
2025-10-08 03:20:23 +02:00
George Hotz
945cc46475 delete children tracking from uop (#12491)
* delete children tracking from uop

* uop children no longer exists

* no tracked children

* that test is flaky too
2025-10-08 09:04:14 +08:00
nimlgen
648e5bb223 hcq: do not raise when fini (#12487)
* hcq: do not raise when fini

* Revert "hcq: do not raise when fini"

This reverts commit 44af5f7d05.

* this way

* runtime is fine

* nn
2025-10-07 23:27:03 +08:00
George Hotz
a2345787b9 parents is faster than sparents (#12490) 2025-10-07 21:31:50 +08:00
George Hotz
12c4963489 add more rangeify pm tests (#12488) 2025-10-07 05:45:38 -04:00
George Hotz
403fdfcfd4 check spec in test, cleanup vectorize render (#12484) 2025-10-07 17:05:50 +08:00
qazal
22674798df assert correctness in test_permuted_assignment [pr] (#12483) 2025-10-07 11:42:22 +03:00
George Hotz
75ce11593c test_reshape_match should match (#12479) 2025-10-07 16:07:21 +08:00
chenyu
fe774a4319 more skip WINO on benchmark (#12482) 2025-10-07 03:43:51 -04:00
chenyu
8ad5f9e74f skip slow benchmarks (#12481)
* skip slow benchmarks

padded tc is already slow; the rest are slow with rangeify (still correct when run locally)

* relax more
2025-10-07 03:28:56 -04:00
George Hotz
ea7672931f fix test_matmul_relu_cat (#12478) 2025-10-07 02:32:23 -04:00
George Hotz
514d2a0774 merge tagless reshapes (#12474)
* merge tagless reshapes

* cleanup
2025-10-07 13:57:58 +08:00
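
The identity behind the merge, illustrated with numpy: two back-to-back reshapes collapse into one because only the final shape matters. The "tagless" condition is tinygrad-specific bookkeeping this sketch does not model:

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
# reshape(reshape(x, s1), s2) == reshape(x, s2)
assert (x.reshape(6, 4).reshape(24) == x.reshape(24)).all()
```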
chenyu
7b48f3cc45 failed test case repro for openpilot model (#12475)
* failed test case repro for openpilot model

* assertEqual
2025-10-07 13:46:43 +08:00
chenyu
a5484b767e remove skipping cast in simplify_valid [pr] (#12472)
* remove skipping cast in simplify_valid [pr]

unsupported statements are handled in uop_given_valid already. the test failed because (100%x) somehow got simplified

* better test
2025-10-07 00:10:04 -04:00
George Hotz
b4509fba31 thundermittens (#12471)
* thundermittens

* give device a type
2025-10-07 11:47:39 +08:00
George Hotz
0f25b4b289 move frontend dir to nn [pr] (#12470) 2025-10-07 10:42:22 +08:00
qazal
f664bcc8bd use recursive_property in UOp tracing (#12469)
* test

* simple passing
2025-10-06 21:10:52 +03:00
qazal
1af05dae77 fix rangeify in compile4.py (#12467)
* fix rangeify in compile4.py

* fix type_verify
2025-10-06 13:37:46 +03:00
qazal
76e8a3250c rangeify: late zero folding (#12464)
* rangeify: late zero folding

* early

* not kernels

* none

* multi

* linter

* mstack is sink comment

* more comment
2025-10-06 12:52:33 +03:00
George Hotz
0c015a24fe use recursive_property to prevent RecursionError (#12465)
* use recursive_property to prevent RecursionError

* not slower

* fix tests

* faster

* simpler
2025-10-06 15:59:18 +08:00
chenyu
a1881b0c17 update test_chicken (#12466)
logits are close, the differences are just numerical
2025-10-06 03:58:44 -04:00
qazal
1b1978b9c0 early copy fixup (#12463)
* simple failing test

* early copy fixup
2025-10-06 06:38:29 +03:00
chenyu
c1e85f699c multi test case for sharded ring allreduce (#12462)
* multi test case for sharded ring allreduce

triggers `children not making progress` with RANGEIFY

* expect_rangeify_fails
2025-10-05 23:18:24 -04:00
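
For reference, a pure-Python simulation of the ring allreduce the test shards over: a reduce-scatter pass followed by an allgather pass around the ring. This sketches the collective itself, not tinygrad's multi implementation:

```python
def ring_allreduce(bufs: list[list[float]]) -> list[list[float]]:
  n, sz = len(bufs), len(bufs[0]) // len(bufs)  # each buffer split into n chunks
  chunk = lambda c: slice(c * sz, (c + 1) * sz)
  # reduce-scatter: after n-1 steps, device d holds the full sum of one chunk
  for step in range(n - 1):
    sends = [(d, (d - step) % n) for d in range(n)]
    data = [list(bufs[d][chunk(c)]) for d, c in sends]  # snapshot before mutating
    for (d, c), vals in zip(sends, data):
      dst = (d + 1) % n
      bufs[dst][chunk(c)] = [a + b for a, b in zip(bufs[dst][chunk(c)], vals)]
  # allgather: circulate the reduced chunks until every device has them all
  for step in range(n - 1):
    sends = [(d, (d + 1 - step) % n) for d in range(n)]
    data = [list(bufs[d][chunk(c)]) for d, c in sends]
    for (d, c), vals in zip(sends, data):
      bufs[(d + 1) % n][chunk(c)] = vals
  return bufs

# every device ends with the elementwise sum
assert ring_allreduce([[1.0, 2.0], [10.0, 20.0]]) == [[11.0, 22.0], [11.0, 22.0]]
```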
chenyu
1823a5043f don't check MAX_BUFFER_SIZE on NULL (#12461) 2025-10-05 22:09:29 -04:00
George Hotz
46e8ea15c1 split pm_substitute_recurse (#12460) 2025-10-05 21:35:50 -04:00
nimlgen
1216fff781 remote: raise runtimeerror in checkz (#12453) 2025-10-05 21:22:53 +08:00
qazal
6ad9a688ed add failing test after "pend substitutes for speed" (#12457)
* add failing substitute test

* expect_rangeify_fails
2025-10-05 16:10:04 +03:00
chenyu
74b04f7dca test beautiful_mnist_multigpu (#12455)
* test beautiful_mnist_multigpu

another example that fails with RANGEIFY

* now i remember

* MAX_BUFFER_SIZE=0
2025-10-05 08:45:01 -04:00
hooved
69857d0ab0 Stable Diffusion mlperf training (#11304)
* entrypoint for sd mlperf train development

* match sd-v2 mlperf reference unet

* implement dataloader from mlperf ref

* update dataloader reference

* implement LambdaLR scheduler from mlperf ref

* match tokenizer from mlperf reference

* sample latent

* add noise to latent

* complete training epoch

* run full training step

* jit training loop

* replicate mlperf ref. losses over 11 train steps

* save tinygrad loss checkpoints properly

* match out.2.bias.grad to reference

* match weights to ref after 1 step

* compare out.2.bias to ref over three train steps

* implement attn_mask; cleanup closeness testing

* correct mse loss

* update dev_run / dependencies

* setup validation config/checkpointing

* implement validation sampling

* test closeness of eval denoise step to mlperf ref

* test closeness of decoder to mlperf ref

* confirm inception matches mlperf ref

* resize w/ bicubic interpolation, test closeness

* confirm closeness of clip preprocess to mlperf ref

* confirm clip score matches mlperf ref

* confirm fid/clip scores match mlperf ref

* cleanup

* cleanup

* zero-init some unet params as in mlperf reference

* revert jit change

* uncomment dependencies

* move to tinybox red

* implement GradScaler from torch but jittable

* simplify lr_scheduler, ensure jittability

* instantiate GradScaler

* only check if grads are finite with fp16

* implement fp16 training loop

* refactor UNet: norm, gelu, mixed precision

* refactor clip_tokenizer to enable versioning

* make fp16 attention closer to torch

* remove comparisons to torch fp16 attention

* add globvars.py for reference

* confirm closeness of fp16 unet forward to mlperf

* test norm closeness to torch with precast

* remeasure e2e with master attention

* more detailed softmax upcast comparison to torch

* parameterize softmax upcast in attention and unet

* use fp32 weights with autocast to fp16

* cleanup

* add data/checkpoint download script

* debug kernel timeout on AMD

* fix finite grads check; start multigpu

* pass numpy arrays from dataloader

* include text encoder in jit train step

* use int32 for tokens instead of int64

* prevent multi bug in reshape within clip

* corealize more, del refs before

* add more logging and wandb

* use erf gelu in clip encoder

* minor changes to train step and logging

* save checkpoints for eval or resuming

* add eval-only logic to training script

* multigpu eval

* remove PARALLEL=0

* cleanup

* pad eval batches of size < EVAL_BS

* workaround silent multigpu bug in jit

* cleanup

* tokenize captions

* verify correctness of multigpu eval

* cleanup

* verify correctness of grads in train step

* verify correctness of training (20 steps)

* don't shard in the training jit

* training settings

* minor cleanup

* overfit train w/ eval on 6 samples

* offload to enable combined train and eval

* download to raid; use local rclone

* misc changes for mi300x / logging

* refactor eval for larger BS, verify correctness

* cleanup

* ckpt resuming and remove eval cats

* eval BEAM config on mi300x and red

* resume eval after crash

* confirm eval correctness (one iteration, 6 samples)

* verify eval correctness at full scale

* cleanup correctness testing

* training correctness (20 steps, BS=248 uniform)

* cleanup

* remove eval cache at end of run

* switch f16 for bf16, del grad scaler

* confirm bf16 training correctness

* timestamps, new jits

* merge jits in training

* realize loss/lr on CPU

* training correctness

* post-bf16 train/eval

* implement grad_acc with timing/logging

* beam offline; debug gradacc; use float32

* fix gradacc in jit, correctness test

* prepare f32 BS=512 gradacc=4 run

* workaround jit problem in diffusion eval

* scale lr by BS

* revert gradacc, prepare bf16 BS=336 lr*=BS train

* make checkpointing faster

* resume bf16 BS=336 base_lr=1.25e-7 run

* jit ckpt at beginning

* don't alloc more gpu mem in ckpt

* cleanup

* move script to mi300x dir

* cleanup

* cleanup unneeded files

* revert beam search to master

* minor changes

* fix regression: realize before assign in eval

* cleanup mlperf SD data/ckpt downloads

* workaround BEAM failure

* workaround bug in Tensor.stack

* minor changes

* revert gradscaler

* cleanup

* cleanup/validate dataloader

* ensure checksum of laion data

* simplify config

* load training state to jitted bufs

* simplify lr scheduler

* simplify train script

* cleanup comments

* refactor stable diffusion/unet init

* more refactoring of stable diffusion init

* fix import errors in tests

* refactor: separate train/eval

* fix import errors

* eval checkpoints in reverse chron. order

* save/load cycle in sd init

* refactor and verify eval

* verify training correctness

* prepare repro train run

* cleanup

* integrate beam retry, train, eval

* simplify wandb

* kill orphaned processes

* better logging

* train to 10 ckpts instead of 7

* remove optimizer/scheduler checkpointing/resume

* cleanup

* BEAM=2 7 ckpts

* add test to compare with torch softmax in amp

* cleanup

* stop eval early if checkpoint converged

* add test for lr scheduler

* add proper test method

* add test for training

* use venv name that is ignored by .gitignore

* linting

* add simple f32 softmax fxn

* revert change to scaled_dot_product_attention

* refactor gelu_erf init

* simplify mixed precision in unet

* add norm autocasting to fp32

* rm extra test

* test eval with NULL backend

* fix venv name

* simplify norm autocast

* use temp dir for training test

* actually add eval test

* remove parallel env variable from tests

* update clip with tests

* reorg init functions

* use np for testing

* remove unused var

* factor out GPUS

* add sd model init tests

* more unet tests

* match master

* rerun CI due to linux (remote) hang

* explain UNET_CKPTDIR

* rerun CI due to linux (remote) timeout

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-05 07:56:05 -04:00
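
Several bullets above ("scale lr by BS", "resume bf16 BS=336 base_lr=1.25e-7 run", "simplify lr scheduler") describe the learning-rate schedule. A hypothetical sketch of a LambdaLR-style linear warmup with the base lr scaled by batch size; the warmup length here is an assumption, not taken from the PR:

```python
# hypothetical schedule: pure arithmetic, so it stays easy to jit
def lr_at(step: int, base_lr: float = 1.25e-7, bs: int = 336, warmup: int = 1000) -> float:
  return base_lr * bs * min(1.0, step / warmup)  # linear warmup, then hold

print(lr_at(0), lr_at(500), lr_at(5000))  # 0.0, half the peak lr, the peak lr
```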
George Hotz
a976ace404 minor improvements to rewrite (#12454)
* minor improvements to rewrite

* need that continue

* faster
2025-10-05 18:09:32 +08:00
qazal
4b60121498 fix bmnist torch with RANGEIFY=1 (#12442)
* fix bmnist torch with RANGEIFY=1

* alt

* test and comment

* this was always wrong

* simple failing test for rangeify

* simple upat to match the old behavior
2025-10-05 12:34:27 +03:00
George Hotz
b5f31d7505 earlier seen children (#12451) 2025-10-05 15:55:13 +08:00
qazal
865d5796f8 add a test for untested Tensor.assign behavior (#12448)
* add a test for untested Tensor.assign behavior

* better
2025-10-04 12:44:56 +03:00
Sieds Lykles
e74be4a140 UOp.factor and add chain sorting (#12413)
* add ordering

* fix some tests

* fix more tests

* shorten comment

* update test

* add rule and test

* add rule and test

* remove check

* use fold_divmod_congruence instead of simplify

* adjust tests

* shorten line

* new algo

* add test

* add function to un-nest the div

* add UOp.factor

* test UOp.factor

* uop_given_valid tries to factor simplex expression

* shorten line

* symbolic_flat is back

* change that back

* fix those new tests

* new rule for ordering

* factor multiple factors

* no symbolic_flat

* symbolic_flat to there

* move that back

* fix imports

* merge correctly

* linter happy

* add rule

* add a test

* cleanup

* revert that for now

* UOp.factor returns self instead of None

* try all_candidates

* remove or_else

* post index symbolic

* add test

* make this closer to the original

* increase mac hlb_cifar min step time

* add some ordering tests

* cleanup

* increase pytest timeout time

* check dtype
2025-10-04 06:05:38 +02:00
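
An illustration of the factoring this PR names (since reverted, see b465c17b56 above): pulling a common constant out of an addition chain, e.g. 4*a + 4*b + 8*c -> 4*(a + b + 2*c). A standalone sketch over integer coefficients, not the UOp code:

```python
from functools import reduce
from math import gcd

def factor_coeffs(coeffs: list[int]) -> tuple[int, list[int]]:
  g = reduce(gcd, coeffs)  # the largest constant dividing every term
  return g, [c // g for c in coeffs]

assert factor_coeffs([4, 4, 8]) == (4, [1, 1, 2])
```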
Sieds Lykles
394dc24110 post index symbolic (#12446)
* post index symbolic

* add test
2025-10-03 23:23:03 +02:00
chenyu
9f2b69b870 enable few tests for PTX test_dtype (#12445) 2025-10-03 08:56:30 -04:00
George Hotz
0b534f71c2 recursive substitute should be O(n) (#12444)
* recursive substitute

* even faster

* make that a single rewrite
2025-10-03 18:29:59 +08:00
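
Why memoization makes recursive substitution O(n) on a DAG: without a cache, shared subgraphs are revisited once per path and the walk can go exponential; with a cache keyed by node identity every node is rewritten exactly once. A generic sketch under those assumptions, not tinygrad's rewrite code:

```python
def substitute(node, mapping: dict, cache: dict | None = None):
  cache = {} if cache is None else cache
  if id(node) in cache: return cache[id(node)]  # each node handled once -> O(n)
  new = mapping.get(node, node)
  if new is node and node.src:
    new_src = tuple(substitute(s, mapping, cache) for s in node.src)
    if new_src != node.src: new = node.replace(src=new_src)  # .replace is an assumed copy helper
  cache[id(node)] = new
  return new
```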
chenyu
b087663c35 RANGEIFY test_bert uses more RAM somehow (#12443) 2025-10-03 04:38:53 -04:00