wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch ( #14296 )
2026-01-22 23:07:54 -08:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge ( #14286 )
...
* don't place consts early
* add anthropic challenge
* with ref
* do we still have to devectorize bools?
* tests pass
* just WHERE
* fine, revert that
* fine, revert
* only index
* z3 validator doesn't support vectorized
* Revert "z3 validator doesn't support vectorized"
This reverts commit 1b7930ecb3.
* z3 not for vec
* no spec
* VLIWRenderer
* loop unrolling
* better comments
* cleanups
* skip cast
* renderer
* cleanups
* prints
* no hack
* hacks
* bump to 11
* reg warning
* lil clean
* cleaner renderer
2026-01-23 10:48:39 +09:00
wozeparrot
c1d14ea832
llama8b train fixes ( #14264 )
2026-01-20 20:34:47 -08:00
wozeparrot
ba90e1b52e
feat: script to run llama8b training ( #14239 )
2026-01-20 12:44:06 -08:00
C T
26f8b12e01
Whisper audio helpers (mel filters in tinygrad) ( #13478 )
...
* add whisper audio helpers for stft/mel/resample
* cleanup
* add whisper stft test
* make only stft test explicitly depend on librosa
* extract sinc_window_kernel
* dehardcode device
* use same device argument
* simplify
* type annotate
* ruff format audio_helpers.py
* ruff format test_whisper.py
* add WHISPER_NEW_STFT
* rename
* undo ruff format changes
* use new stft and mel for whisper
* remove stft test that depends on librosa
* remove whitespace
* add Tensor.log10 with test\test_ops.py::TestOps::test_log10
* use Tensor.log10
* fix lint
* future: remove unused STFT class
* future: remove resample code since it isn't used (yet)
* match openai with pad_mode="reflect"
* pad_to
* future: cut resample leftovers
* cleanup
* add mel tests
* future: cut stft
* future: cut non-mel prep_audio changes
* reduce diff
* move audio_helpers.py to examples
* reduce whitespace
* fix imports
* reduce whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-20 10:50:02 -05:00
wozeparrot
a879b54234
tk: fa jit fix ( #14170 )
2026-01-16 16:38:45 -08:00
b1tg
0fbc551622
train bert with fp8 ( #13874 )
...
* fp8 train
* clean
* lint
* test fix from #13439
* skip first/last layer
* rm __init__, restore unroll <=32 check
* tests
* clean test, remove unused
* multi-gpu test, clean quantize_to_fp8
* remove bert contiguous
* run script
* test: better check
* run script search
* add seed in bert data shuffle
* move script to mi350x folder
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
b1tg
241f0402b4
add seed in bert data shuffle ( #14054 )
2026-01-07 10:02:05 -05:00
chenyu
87f4bc5446
update variable names around jit [pr] ( #14049 )
...
lbs, st_vars_dtype_device and rawbuffers no more
2026-01-06 22:32:41 -05:00
Francis Lata
fac137779e
remove flux1 seed image ( #13843 )
2025-12-27 00:45:11 -05:00
chenyu
da1cb6a9ec
update llama dataloader ( #13825 )
...
separate creating the dataset from iterating over it, so eval data is not recreated for each eval
2025-12-24 17:42:08 -05:00
chenyu
903753c60c
llama wandb logging ( #13822 )
2025-12-24 10:24:59 -05:00
chenyu
27d899ce97
TRAIN=0 to only eval llama ( #13804 )
2025-12-22 11:55:46 -05:00
chenyu
39d962106f
update llama logging ( #13803 )
...
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS
2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS
3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00
George Hotz
45c459848d
remove more stale stuff ( #13765 )
...
* remove more stale stuff
* remove disassemblers/adreno
* stale
2025-12-19 17:14:56 -04:00
George Hotz
df6cde8a00
cleanup stale examples/extra ( #13764 )
...
* cleanup stale files
* examples
* move those back
* old
* delete more
2025-12-19 16:27:37 -04:00
chenyu
7cd7593c5d
add script to train bert on mi350x ( #13743 )
...
adapted from mi300 config
2025-12-17 16:54:04 -05:00
chenyu
e428fbfab6
verify dtype of llama model params ( #13719 )
2025-12-16 12:32:02 -05:00
chenyu
6cad622f59
don't FREE_INTERMEDIATE in bert ( #13684 )
...
hangs green hcq consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
fcaed1e1dd
don't use empty in bert fake data ( #13661 )
...
somehow jit does not count empty as input
2025-12-12 15:59:50 -05:00
chenyu
01e9ad0d52
clean up bert next_data ( #13650 )
...
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
chenyu
5034c6fb37
reenable FREE_INTERMEDIATE for bert ( #13639 )
...
* reenable FREE_INTERMEDIATE for bert
* comment
2025-12-10 12:08:09 -05:00
chenyu
016a59cafa
remove contiguous and use where in EmbeddingBert ( #13632 )
2025-12-09 15:49:21 -05:00
chenyu
2471b49e45
minor bert / llama change from grad acc branch ( #13622 )
...
* minor bert / llama change from grad acc branch
* revert those
2025-12-08 16:04:14 -05:00
chenyu
b981b6f89e
remove old llama grad_acc ( #13611 )
...
* remove old llama grad_acc
* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
chenyu
4562f217e1
more bert updates ( #13597 )
...
prep split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
chenyu
cb4c6324ef
revert bert grad accumulation ( #13596 )
...
prep for the new split jit style
2025-12-05 17:30:08 -05:00
chenyu
89f9e1dcd5
add SGD to beautiful_mnist ( #13571 )
2025-12-04 12:17:29 -05:00
George Hotz
96d16675fe
update examples/gradaccum_mnist.py to use the JIT
2025-12-03 16:11:42 -08:00
George Hotz
a4c4e48385
add LUNIQUE op ( #13554 )
2025-12-03 14:34:34 -08:00
wozeparrot
8713ae6de9
fix: dead sdv2 download link ( #13521 )
2025-12-01 22:50:53 -08:00
George Hotz
44104b0b7f
mnist with grad acc + Adam on CPU ( #13520 )
...
* mnist with grad acc + Adam on CPU
* still broken, but closer
* works w/o jit
* this works without the jit
2025-12-01 18:27:32 -08:00
George Hotz
8e8fec408e
fix n^2 _apply_map_to_tensors [pr] ( #13443 )
...
* clean up slow rules
* fix rule
* non n^2 toposort
* topovisit
* state dict profile_marker
2025-11-24 18:59:16 -08:00
George Hotz
cc5e6323ac
stable diffusion profiling ( #13441 )
...
* stable diffusion profiling
Signed-off-by: George Hotz <geohot@gmail.com>
* profile_marker
* profile per step
* fix slow Context
* profile that
---------
Signed-off-by: George Hotz <geohot@gmail.com>
2025-11-24 15:25:45 -08:00
chenyu
646372490c
move tiktoken import in llama3 ( #13316 )
...
only Tokenizer requires that
2025-11-17 14:09:37 -05:00
George Hotz
17aa3379e9
hotfix: improve self_tokenize
2025-11-13 00:18:57 -08:00
chenyu
4e5a9132e7
JIT_BATCH_SIZE=0 in compile3 ( #13245 )
...
fixed some enqueue time
2025-11-12 23:12:45 -05:00
chenyu
41e45c20ff
minor stuff reading the printed code [pr] ( #13177 )
2025-11-09 00:58:51 -05:00
chenyu
834067d91f
move onnx import in compile3 ( #13172 )
...
only used in test_vs_onnx
2025-11-08 09:44:34 -08:00
C T
0f9d7f650d
whisper: fix oob, explicit dtype ( #13144 )
...
* fix dtype depending on numpy version
numpy v2 np.array returns int64 which Tensor passed through for the
first decode call, swallowing the <|notimestamps|> token and corrupting
the sequence
* fix whisper OOB
global limit on whisper's context length
* enforce whisper max_tokens_to_sample (match openai)
local limit on max tokens decoded
2025-11-07 12:55:01 -05:00
chenyu
74db65cf72
update mlperf bert LOGMLPERF ( #13065 )
2025-11-02 15:26:37 -05:00
b1tg
45e2f916a3
add quantize fp8 in llama3 ( #12893 )
...
* add quantize fp8 in llama3
* don't truncate fp8 alu result
* cast to float32 before matmul
* --model weights/LLaMA-3/8B-SF-DPO/
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-10-27 10:22:57 -04:00
Harald Schäfer
587ccc0e5c
compile3: make selftests opt-in ( #12851 )
2025-10-21 11:32:27 -07:00
wozeparrot
990e8b97ee
feat: log openpilot 0.10.1 times ( #12816 )
2025-10-20 18:30:34 -07:00
Sieds Lykles
1e93d19ee3
stable diffusion --fakeweights ( #12810 )
2025-10-20 12:41:06 +02:00
Harald Schäfer
addc54b96c
Simplify openpilot compile3.py ( #12748 )
...
* Simpler compile3
* tests
* remove default args
* onnx file is still fp16
* self-test FP16 too
* allow test disable
* absurd tolerance
* Just do latest
* Try simplest
* use later models
* kernel count not relevant if speed is good
* dead improts
* Revert "dead improts"
This reverts commit f68c2cd15d.
* Revert "kernel count not relevant if speed is good"
This reverts commit 0955ca4ee0.
* add back kernel count check on latest model
2025-10-18 10:12:22 -04:00
chenyu
285534ce64
delete DONT_REALIZE_EXPAND and DONT_GROUP_REDUCES ( #12744 )
...
does nothing now
2025-10-16 14:11:33 -04:00
chenyu
f34f26bca0
fix gpt2 with benchmark ( #12736 )
...
`CPU=1 python3 examples/gpt2.py --benchmark 128` works now
2025-10-16 09:55:20 -04:00
George Hotz
af4479c169
faster stable diffusion load ( #12725 )
...
* faster stable diffusion load
* failing tests
2025-10-16 18:31:59 +08:00
George Hotz
612e3d6143
replace mop arg with vectorized index ( #12695 )
...
* replace mop arg with vectorized index
* tests passing
* better viz
* no compile4
2025-10-15 20:50:06 +08:00