1232 Commits

Author SHA1 Message Date
Francis Lata
fac137779e remove flux1 seed image (#13843) 2025-12-27 00:45:11 -05:00
chenyu
da1cb6a9ec update llama dataloader (#13825)
separate creating the dataset from iterating over it, so eval data isn't recreated for each eval
2025-12-24 17:42:08 -05:00
chenyu
903753c60c llama wandb logging (#13822) 2025-12-24 10:24:59 -05:00
chenyu
27d899ce97 TRAIN=0 to only eval llama (#13804) 2025-12-22 11:55:46 -05:00
chenyu
39d962106f update llama logging (#13803)
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py

    1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used,  19644.30 GFLOPS
    2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used,  17039.35 GFLOPS
    3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
    4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
    5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
    6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
    7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
    8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
    9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
   10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00
George Hotz
45c459848d remove more stale stuff (#13765)
* remove more stale stuff

* remove disassemblers/adreno

* stale
2025-12-19 17:14:56 -04:00
George Hotz
df6cde8a00 cleanup stale examples/extra (#13764)
* cleanup stale files

* examples

* move those back

* old

* delete more
2025-12-19 16:27:37 -04:00
chenyu
7cd7593c5d add script to train bert on mi350x (#13743)
adapted from mi300 config
2025-12-17 16:54:04 -05:00
chenyu
e428fbfab6 verify dtype of llama model params (#13719) 2025-12-16 12:32:02 -05:00
chenyu
6cad622f59 don't FREE_INTERMEDIATE in bert (#13684)
hangs green hcq consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
fcaed1e1dd don't use empty in bert fake data (#13661)
somehow jit does not count empty as input
2025-12-12 15:59:50 -05:00
chenyu
01e9ad0d52 clean up bert next_data (#13650)
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
chenyu
5034c6fb37 reenable FREE_INTERMEDIATE for bert (#13639)
* reenable FREE_INTERMEDIATE for bert

* comment
2025-12-10 12:08:09 -05:00
chenyu
016a59cafa remove contiguous and use where in EmbeddingBert (#13632) 2025-12-09 15:49:21 -05:00
chenyu
2471b49e45 minor bert / llama change from grad acc branch (#13622)
* minor bert / llama change from grad acc branch

* revert those
2025-12-08 16:04:14 -05:00
chenyu
b981b6f89e remove old llama grad_acc (#13611)
* remove old llama grad_acc

* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
chenyu
4562f217e1 more bert updates (#13597)
prep split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
chenyu
cb4c6324ef revert bert grad accumulation (#13596)
prep for the new split jit style
2025-12-05 17:30:08 -05:00
chenyu
89f9e1dcd5 add SGD to beautiful_mnist (#13571) 2025-12-04 12:17:29 -05:00
George Hotz
96d16675fe update examples/gradaccum_mnist.py to use the JIT 2025-12-03 16:11:42 -08:00
George Hotz
a4c4e48385 add LUNIQUE op (#13554) 2025-12-03 14:34:34 -08:00
wozeparrot
8713ae6de9 fix: dead sdv2 download link (#13521) 2025-12-01 22:50:53 -08:00
George Hotz
44104b0b7f mnist with grad acc + Adam on CPU (#13520)
* mnist with grad acc + Adam on CPU

* still broken, but closer

* works w/o jit

* this works without the jit
2025-12-01 18:27:32 -08:00
George Hotz
8e8fec408e fix n^2 _apply_map_to_tensors [pr] (#13443)
* clean up slow rules

* fix rule

* non n^2 toposort

* topovisit

* state dict profile_marker
2025-11-24 18:59:16 -08:00
George Hotz
cc5e6323ac stable diffusion profiling (#13441)
* stable diffusion profiling

Signed-off-by: George Hotz <geohot@gmail.com>

* profile_marker

* profile per step

* fix slow Context

* profile that

---------

Signed-off-by: George Hotz <geohot@gmail.com>
2025-11-24 15:25:45 -08:00
chenyu
646372490c move tiktoken import in llama3 (#13316)
only Tokenizer requires that
2025-11-17 14:09:37 -05:00
George Hotz
17aa3379e9 hotfix: improve self_tokenize 2025-11-13 00:18:57 -08:00
chenyu
4e5a9132e7 JIT_BATCH_SIZE=0 in compile3 (#13245)
fixed some enqueue time
2025-11-12 23:12:45 -05:00
chenyu
41e45c20ff minor stuff reading the printed code [pr] (#13177) 2025-11-09 00:58:51 -05:00
chenyu
834067d91f move onnx import in compile3 (#13172)
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
C T
0f9d7f650d whisper: fix oob, explicit dtype (#13144)
* fix dtype depending on numpy version

numpy v2's np.array returns int64, which Tensor passed through for the
first decode call, swallowing the <|notimestamps|> token and corrupting
the sequence

* fix whisper OOB

global limit on whisper's context length

* enforce whisper max_tokens_to_sample (match openai)

local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
chenyu
74db65cf72 update mlperf bert LOGMLPERF (#13065) 2025-11-02 15:26:37 -05:00
b1tg
45e2f916a3 add quantize fp8 in llama3 (#12893)
* add quantize fp8 in llama3

* don't truncate fp8 alu result

* cast to float32 before matmul

* --model weights/LLaMA-3/8B-SF-DPO/

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-27 10:22:57 -04:00
Harald Schäfer
587ccc0e5c compile3: make selftests opt-in (#12851) 2025-10-21 11:32:27 -07:00
wozeparrot
990e8b97ee feat: log openpilot 0.10.1 times (#12816) 2025-10-20 18:30:34 -07:00
Sieds Lykles
1e93d19ee3 stable diffusion --fakeweights (#12810) 2025-10-20 12:41:06 +02:00
Harald Schäfer
addc54b96c Simplify openpilot compile3.py (#12748)
* Simpler compile3

* tests

* remove default args

* onnx file is still fp16

* self-test FP16 too

* allow test disable

* absurd tolerance

* Just do latest

* Try simplest

* use later models

* kernel count not relevant if speed is good

* dead imports

* Revert "dead imports"

This reverts commit f68c2cd15d.

* Revert "kernel count not relevant if speed is good"

This reverts commit 0955ca4ee0.

* add back kernel count check on latest model
2025-10-18 10:12:22 -04:00
chenyu
285534ce64 delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES (#12744)
does nothing now
2025-10-16 14:11:33 -04:00
chenyu
f34f26bca0 fix gpt2 with benchmark (#12736)
`CPU=1 python3 examples/gpt2.py --benchmark 128` works now
2025-10-16 09:55:20 -04:00
George Hotz
af4479c169 faster stable diffusion load (#12725)
* faster stable diffusion load

* failing tests
2025-10-16 18:31:59 +08:00
George Hotz
612e3d6143 replace mop arg with vectorized index (#12695)
* replace mop arg with vectorized index

* tests passing

* better viz

* no compile4
2025-10-15 20:50:06 +08:00
chenyu
70dd297a05 BS=96 for bert (#12675)
96 trains fine now
2025-10-14 09:07:43 -04:00
chenyu
77b5e6774e fix bert training config (#12647)
FREE_INTERMEDIATE=0 REWRITE_STACK_LIMIT=500000
2025-10-13 15:03:47 -04:00
chenyu
0f776c6e46 examples/mlperf/training_submission_v6.0 (#12644)
copied from v5.1
2025-10-13 09:58:25 -04:00
nimlgen
658c566e22 vars in gated_read_image_count (#12486)
* vars in gated_read_image_count

* nc
2025-10-09 14:54:15 +08:00
George Hotz
6e6059dde0 clean up stable diffusion weight loading (#12452) 2025-10-09 11:13:11 +08:00
chenyu
be05028419 move ASSERT_MIN_STEP_TIME to compile3 (#12535)
threshold is the current time + 20%
2025-10-08 22:16:59 -04:00
chenyu
28edea5d67 delete FUSE_CONV_BW (#12527) 2025-10-08 10:41:38 -04:00
Rudeus
a65ec5c693 fix fromarray deprecation (#12512) 2025-10-08 09:13:26 -04:00
qazal
7e0b14243e delete grouper and kernelize (#12517)
* delete grouper and kernelize

* +sys.setrecursionlimit
2025-10-08 12:27:26 +03:00