Commit Graph

1205 Commits

Author SHA1 Message Date
chenyu
4e5a9132e7 JIT_BATCH_SIZE=0 in compile3 (#13245)
fixed some enqueue time
2025-11-12 23:12:45 -05:00
chenyu
41e45c20ff minor stuff reading the printed code [pr] (#13177) 2025-11-09 00:58:51 -05:00
chenyu
834067d91f move onnx import in compile3 (#13172)
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
C T
0f9d7f650d whisper: fix oob, explicit dtype (#13144)
* fix dtype depending on numpy version

numpy v2 np.array returns int64 which Tensor passed through for the
first decode call, swallowing the <|notimestamps|> token and corrupting
the sequence

* fix whisper OOB

global limit on whisper's context length

* enforce whisper max_tokens_to_sample (match openai)

local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
chenyu
74db65cf72 update mlperf bert LOGMLPERF (#13065) 2025-11-02 15:26:37 -05:00
b1tg
45e2f916a3 add quantize fp8 in llama3 (#12893)
* add quantize fp8 in llama3

* don't truncate fp8 alu result

* cast to float32 before matmul

* --model weights/LLaMA-3/8B-SF-DPO/

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-27 10:22:57 -04:00
Harald Schäfer
587ccc0e5c compile3: make selftests opt-in (#12851) 2025-10-21 11:32:27 -07:00
wozeparrot
990e8b97ee feat: log openpilot 0.10.1 times (#12816) 2025-10-20 18:30:34 -07:00
Sieds Lykles
1e93d19ee3 stable diffusion --fakeweights (#12810) 2025-10-20 12:41:06 +02:00
Harald Schäfer
addc54b96c Simplify openpilot compile3.py (#12748)
* Simpler compile3

* tests

* remove default args

* onnx file is still fp16

* self-test FP16 too

* allow test disable

* absurd tolerance

* Just do latest

* Try simplest

* use later models

* kernel count not relevant if speed is good

* dead improts

* Revert "dead improts"

This reverts commit f68c2cd15d.

* Revert "kernel count not relevant if speed is good"

This reverts commit 0955ca4ee0.

* add back kernal count check on latest model
2025-10-18 10:12:22 -04:00
chenyu
285534ce64 delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES (#12744)
does nothing now
2025-10-16 14:11:33 -04:00
chenyu
f34f26bca0 fix gpt2 with benchmark (#12736)
`CPU=1 python3 examples/gpt2.py --benchmark 128` works now
2025-10-16 09:55:20 -04:00
George Hotz
af4479c169 faster stable diffusion load (#12725)
* faster stable diffusion load

* failing tests
2025-10-16 18:31:59 +08:00
George Hotz
612e3d6143 replace mop arg with vectorized index (#12695)
* replace mop arg with vectorized index

* tests passing

* better viz

* no compile4
2025-10-15 20:50:06 +08:00
chenyu
70dd297a05 BS=96 for bert (#12675)
96 trains fine now
2025-10-14 09:07:43 -04:00
chenyu
77b5e6774e fix bert training config (#12647)
FREE_INTERMEDIATE=0 REWRITE_STACK_LIMIT=500000
2025-10-13 15:03:47 -04:00
chenyu
0f776c6e46 examples/mlperf/training_submission_v6.0 (#12644)
copied from v5.1
2025-10-13 09:58:25 -04:00
nimlgen
658c566e22 vars in gated_read_image_count (#12486)
* vars in gated_read_image_count

* nc
2025-10-09 14:54:15 +08:00
George Hotz
6e6059dde0 clean up stable diffusion weight loading (#12452) 2025-10-09 11:13:11 +08:00
chenyu
be05028419 move ASSERT_MIN_STEP_TIME to compile3 (#12535)
threshold is current time +20%
2025-10-08 22:16:59 -04:00
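
Note on be05028419: the message says the threshold is the current time plus 20%. A minimal sketch of that kind of check follows. The function names, the number of runs, and the baseline value are hypothetical and chosen for illustration only; this is not the code in compile3.

```python
import time

def threshold_ms(baseline_ms: float, margin: float = 0.20) -> float:
    """Return the allowed limit: the baseline plus a relative margin (20% by default)."""
    return baseline_ms * (1.0 + margin)

def min_step_time_ms(step, runs: int = 5) -> float:
    """Time `step` several times and return the fastest run in milliseconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        times.append((time.perf_counter() - start) * 1e3)
    return min(times)

def check_min_step_time(min_ms: float, limit_ms: float) -> None:
    """Fail loudly when the fastest step is not under the limit."""
    assert min_ms < limit_ms, f"min step time {min_ms:.2f} ms is not under {limit_ms:.2f} ms"

if __name__ == "__main__":
    # Hypothetical baseline of 10 ms, so the limit becomes 12 ms.
    limit = threshold_ms(10.0)
    fastest = min_step_time_ms(lambda: sum(range(1000)))
    check_min_step_time(fastest, limit)
    print(f"fastest step {fastest:.3f} ms, limit {limit:.2f} ms")
```

Taking the minimum over several runs, rather than the mean, keeps one-off scheduling noise from tripping the assertion.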
chenyu
28edea5d67 delete FUSE_CONV_BW (#12527) 2025-10-08 10:41:38 -04:00
Rudeus
a65ec5c693 fix fromarray depreceation (#12512) 2025-10-08 09:13:26 -04:00
qazal
7e0b14243e delete grouper and kernelize (#12517)
* delete grouper and kernelize

* +sys.setrecursionlimit
2025-10-08 12:27:26 +03:00
chenyu
e701106a64 remove FUSE_ARANGE (#12511)
it was the default already
2025-10-08 04:54:07 -04:00
George Hotz
0f25b4b289 move frontend dir to nn [pr] (#12470) 2025-10-07 10:42:22 +08:00
qazal
1af05dae77 fix rangeify in compile4.py (#12467)
* fix rangeify in compile4.py

* fix type_verify
2025-10-06 13:37:46 +03:00
hooved
69857d0ab0 Stable Diffusion mlperf training (#11304)
* entrypoint for sd mlperf train development

* match sd-v2 mlperf reference unet

* implement dataloader from mlperf ref

* update dataloader reference

* implement LambdaLR scheduler from mlperf ref

* match tokenizer from mlperf reference

* sample latent

* add noise to latent

* complete training epoch

* run full training step

* jit training loop

* replicate mlperf ref. losses over 11 train steps

* save tinygrad loss checkpoints properly

* match out.2.bias.grad to reference

* match weights to ref after 1 step

* compare out.2.bias to ref over three train steps

* implement attn_mask; cleanup closeness testing

* correct mse loss

* update dev_run / dependencies

* setup validation config/checkpointing

* implement validation sampling

* test closeness of eval denoise step to mlperf ref

* test closeness of decoder to mlperf ref

* confirm inception matches mlperf ref

* resize w/ bicubic interpolation, test closeness

* confirm closeness of clip preprocess to mlperf ref

* confirm clip score matches mlperf ref

* confirm fid/clip scores match mlperf ref

* cleanup

* cleanup

* zero-init some unet params as in mlperf reference

* revert jit change

* uncomment dependencies

* move to tinybox red

* implement GradScaler from torch but jittable

* simplify lr_scheduler, ensure jittability

* instantiate GradScaler

* only check if grads are finite with fp16

* implement fp16 training loop

* refactor UNet: norm, gelu, mixed precision

* refactor clip_tokenizer to enable versioning

* make fp16 attention closer to torch

* remove comparisons to torch fp16 attention

* add globvars.py for reference

* confirm closeness of fp16 unet forward to mlperf

* test norm closeness to torch with precast

* remeasure e2e with master attention

* more detailed softmax upcast comparison to torch

* parameterize softmax upcast in attention and unet

* use fp32 weights with autocast to fp16

* cleanup

* add data/checkpoint download script

* debug kernel timeout on AMD

* fix finite grads check; start multigpu

* pass numpy arrays from dataloader

* include text encoder in jit train step

* use int32 for tokens instead of int64

* prevent multi bug in reshape within clip

* corealize more, del refs before

* add more logging and wandb

* use erf gelu in clip encoder

* minor changes to train step and logging

* save checkpoints for eval or resuming

* add eval-only logic to training script

* multigpu eval

* remove PARALLEL=0

* cleanup

* pad eval batches of size < EVAL_BS

* workaround silent multigpu bug in jit

* cleanup

* tokenize captions

* verify correctness of multigpu eval

* cleanup

* verify correctness of grads in train step

* verify correctness of training (20 steps)

* don't shard in the training jit

* training settings

* minor cleanup

* overfit train w/ eval on 6 samples

* offload to enable combined train and eval

* download to raid; use local rclone

* misc changes for mi300x / logging

* refactor eval for larger BS, verify correctness

* cleanup

* ckpt resuming and remove eval cats

* eval BEAM config on mi300x and red

* resume eval after crash

* confirm eval correctness (one iteration, 6 samples)

* verify eval correctness at full scale

* cleanup correctness testing

* training correctness (20 steps, BS=248 uniform)

* cleanup

* remove eval cache at end of run

* switch f16 for bf16, del grad scaler

* confirm bf16 training correctness

* timestamps, new jits

* merge jits in training

* realize loss/lr on CPU

* training correctness

* post-bf16 train/eval

* implement grad_acc with timing/logging

* beam offline; debug gradacc; use float32

* fix gradacc in jit, correctness test

* prepare f32 BS=512 gradacc=4 run

* workaround jit problem in diffusion eval

* scale lr by BS

* revert gradacc, prepare bf16 BS=336 lr*=BS train

* make checkpointing faster

* resume bf16 BS=336 base_lr=1.25e-7 run

* jit ckpt at beginning

* don't alloc more gpu mem in ckpt

* cleanup

* move script to mi300x dir

* cleanup

* cleanup unneeded files

* revert beam search to master

* minor changes

* fix regression: realize before assign in eval

* cleanup mlperf SD data/ckpt downloads

* workaround BEAM failure

* workaround bug in Tensor.stack

* minor changes

* revert gradscaler

* cleanup

* cleanup/validate dataloader

* ensure checksum of laion data

* simplify config

* load training state to jitted bufs

* simplify lr scheduler

* simplify train script

* cleanup comments

* refactor stable diffusion/unet init

* more refactoring of stable diffusion init

* fix import errors in tests

* refactor: separate train/eval

* fix import errors

* eval checkpoints in reverse chron. order

* save/load cycle in sd init

* refactor and verify eval

* verify training correctness

* prepare repro train run

* cleanup

* integrate beam retry, train, eval

* simplify wandb

* kill orphaned processes

* better logging

* train to 10 ckpts instead of 7

* remove optimizer/scheduler checkpointing/resume

* cleanup

* BEAM=2 7 ckpts

* add test to compare with torch softmax in amp

* cleanup

* stop eval early if checkpoint converged

* add test for lr scheduler

* add proper test method

* add test for training

* use venv name that is ignored by .gitignore

* linting

* add simple f32 softmax fxn

* revert change to scaled_dot_product_attention

* refactor gelu_erf init

* simplify mixed precision in unet

* add norm autocasting to fp32

* rm extra test

* test eval with NULL backend

* fix venv name

* simplify norm autocast

* use temp dir for training test

* actually add eval test

* remove parallel env variable from tests

* update clip with tests

* reorg init functions

* use np for testing

* remove unused var

* factor out GPUS

* add sd model init tests

* more unet tests

* match master

* rerun CI due to linux (remote) hang

* explain UNET_CKPTDIR

* rerun CI due to linux (remote) timeout

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-05 07:56:05 -04:00
hooved
1e8945a28c Training loop for Stable Diffusion mlperf (#12315)
* add diff

* fix edit error

* match master

* point reference to specific commit

* simplify wandb logging

* remove lr test, dehardcode device

* increase stack size limit
2025-10-03 02:45:38 -04:00
hooved
5d9035f5a6 Eval for Stable Diffusion mlperf (#12316)
* add diff

* rerun ci

* refactor beam workaround, add test

* fix conflict

* linting
2025-10-02 02:35:38 -04:00
hooved
0f804c9a83 Stable Diffusion model init for mlperf (#12314)
* include clip pr diff

* updated unet and sd init

* dehardcode default device

* revert beam hang workaround

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-02 02:28:41 -04:00
hooved
969a1b35ca LR scheduler for Stable Diffusion mlperf training (#12201)
* add lr scheduler for stable diffusion training

* add lr scheduler test

* rerun ci

* rerun CI

* use np for testing

* move test to CI path

* remove unneeded copy
2025-09-30 21:21:08 -04:00
George Hotz
7129419500 fix cifar training in RANGEIFY (#12355)
* fix cifar training in RANGEIFY

* even more wino fuse

* bugfix

* test to show issue
2025-09-30 15:59:19 +08:00
wozeparrot
2a0caa09c2 push copy to disk (#12348) 2025-09-29 21:55:05 -07:00
chenyu
3a480b858f use more getitem in gpt2 (#12343) 2025-09-29 23:08:03 -04:00
hooved
c2689c505e Clip model updates for Stable Diffusion mlperf training (#12313)
* stable diffusion mlperf clip changes

* add clip tests

* set gelu as attribute

* add more tests

* factor out GPUS

* rerun CI

* add imports to if blocks

* remove unneeded axis

* add clip tests to CI

* move clip tests

* add deps, disable max buf size
2025-09-29 21:50:14 -04:00
George Hotz
baf3b60cfb fix gpt2 on rangeify (#12335) 2025-09-29 19:16:44 +08:00
George Hotz
b899392f30 fix llm app with rangeify (#12334)
* fix llm app with rangeify

* add gpt2 contiguous also
2025-09-29 18:42:44 +08:00
chenyu
84d2d047ea Tensor.pad_to and Tensor.shrink_to (#12210)
most of the time i want this instead of spelling out the args

also add more input validation to shrink
2025-09-16 12:24:55 -04:00
hooved
3a9db08b49 download data and ckpts for sd train/eval (#12170) 2025-09-15 00:31:45 -04:00
Steven Shi
25b1bc8eff added top k sampling to examples/mamba (#12061) 2025-09-14 15:27:34 -04:00
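
Note on 25b1bc8eff: top-k sampling restricts the choice of the next token to the k highest-scoring candidates and samples among them in proportion to their softmax weights. The sketch below uses only the Python standard library; the function name, the parameters, and the example logits are assumptions for illustration, not the implementation in examples/mamba.

```python
import math
import random

def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample an index from the k largest logits, weighted by softmax."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only; subtract the max for numerical stability.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

if __name__ == "__main__":
    example_logits = [0.1, 2.5, -1.0, 1.7, 0.3]  # hypothetical scores
    print(top_k_sample(example_logits, k=2))
```

With k=1 this reduces to greedy decoding; larger k trades determinism for variety.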
chenyu
d2316ba91a don't validate output in sdxl with fakeweights (#12160)
NULL backend passed validation before because both desired and actual went through NULL backend
2025-09-13 21:47:51 -04:00
chenyu
0e266f376c ops_gpu -> ops_cl (#12103) 2025-09-10 15:15:48 -04:00
chenyu
0599e86186 replace hardcoded GPU in llama debug msg (#12102) 2025-09-10 13:56:40 -04:00
Sieds Lykles
5b73076e48 assert benchmark times (#12042)
* assert jitted times in openpilot

* better error

* better error

* add ASSERT_MIN_STEP_TIME to more models

* t is step_times

* update benchmark times

* update times
2025-09-09 23:40:02 +02:00
wozeparrot
d16cc6c012 feat: resume ckpt (#11970) 2025-09-02 15:47:48 -07:00
wozeparrot
7c21271a5f feat: end_lr envvar (#11953) 2025-09-01 14:53:07 -07:00
wozeparrot
7e68045fb2 feat: small llama3 training (#11829) 2025-08-31 13:41:47 -07:00
NoahKusaba
0838021753 remove np from beautiful_cifar (#10988)
* remove np from beautiful_cifar

* remove np from cifar

* rename variable and rename tensor.arrange to just tensor.randperm

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-29 19:34:16 -04:00
chenyu
e39b25cd36 upcast float exp to at least float32 (#11758)
* upcast float exp to at least float32

* unlucky seed
2025-08-22 20:16:34 -04:00
wozeparrot
b979162c5d llama3 eval train (#11706) 2025-08-20 19:56:35 -04:00