chenyu
d57d24c7d4
Buffer.as_buffer -> Buffer.as_memoryview [pr] ( #14535 )
...
it casts to memoryview. also inline the as_typed_buffer checks to Tensor._data
2026-02-04 11:31:11 -05:00
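A minimal sketch of what "casts to memoryview" means here, using only the Python standard library (buffer contents and dtype are illustrative, not tinygrad's internals):
```
import struct
raw = bytearray(struct.pack("4f", 1.0, 2.0, 3.0, 4.0))  # 16 bytes backing 4 float32s
mv = memoryview(raw).cast("f")                           # zero-copy typed view over the same memory
assert mv[2] == 3.0
```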
George Hotz
d59e6e7a37
move more tests to test/null, split some existing ones ( #14512 )
...
* move more tests to test/null, split some existing ones
* null work
* null work
* move more
* fixes
* move PIL
* PIL in CLIP
* don't move that
2026-02-03 20:20:20 +08:00
George Hotz
dd2de4f838
rename all DEFINE_GLOBAL to PARAM ( #14511 )
2026-02-03 15:09:38 +08:00
wozeparrot
bbcd3d67a3
fa: faster ( #14453 )
2026-02-02 21:34:17 -08:00
qazal
616e9c1483
CDNA assembly gemm in tensor.py with flag ( #14310 )
...
* work
* work
* the assembly
* remove the old one
* remove ws bufs, assert splitk
* notes cleanup
* work
* gemm args
* gemm in mixins would be nice
* add gemm gradient
* print counters
* the realize is for DEBUG=2 aesthetics
* dedup
* rewrite to python dsl, no list copies
* leave that
* add B, M, N, K to gemm name
* it's M0 not NULL
* fp16 support
* test cleanup + more gemms
* work from viz
* more work
* gemm batch_size
* xccg path work
* tiny comments on the label naming
* s_waitcnt
2026-01-31 22:34:14 +09:00
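For reference, the B, M, N, K in the gemm names follow the usual batched-matmul convention; a sketch with plain tinygrad tensors (the shapes are made up, not the kernel's actual tiles):
```
from tinygrad import Tensor, dtypes

B, M, N, K = 4, 1024, 1024, 512            # batch and gemm dimensions, illustrative values
a = Tensor.rand(B, M, K, dtype=dtypes.half)
b = Tensor.rand(B, K, N, dtype=dtypes.half)
c = (a @ b).realize()                      # C[B, M, N] = A[B, M, K] @ B[B, K, N]
assert c.shape == (B, M, N)
```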
George Hotz
c9a3ddb341
benchmark llama walltime script ( #14454 )
...
* benchmark llama walltime script
* adj layers
2026-01-31 10:21:54 +08:00
George Hotz
f5346d6a1a
fix USE_ATOMICS for non float dtypes and make it the default ( #14444 )
...
* embedded multistep test
* complex test
* with jit
* fix dtypes and reenable USE_ATOMICS
* that test didn't catch anything
2026-01-31 09:44:16 +08:00
George Hotz
ee2c78709d
mlperf/llama: disable USE_ATOMICS for now
2026-01-31 00:42:08 +08:00
George Hotz
838cd078bc
use atomics for embedding backward ( #14400 )
...
* embedding is slow
* failing
* float is fine
* null
* it fails
* simplify embedding with broadcasting
* ATOMIC_ADD incoming
* min change
* simpler test
* better test
* fix test
* real test
* simpler
* cleanups
* types and names
* _zero_kernel
* grad multi
* hack
* none
* multi unshard
* more for call
* don't tag in call
* good
* call_multi
* call_multi wow claude is useless
* embedding backward multi test
* test passes
* fix as_param
* shape_to_shape_arg
* add clip
* before cast
* fix spec=2, use atomics
2026-01-30 18:10:59 +08:00
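Why atomics help here: the embedding backward is a scatter-add, and indices that repeat in a batch must accumulate into the same weight row. A minimal numpy sketch of that accumulation (illustrative only, not the kernel tinygrad emits):
```
import numpy as np

vocab, dim = 8, 4
idx = np.array([1, 3, 1, 5])                  # token 1 appears twice -> same row updated twice
grad_out = np.ones((len(idx), dim), dtype=np.float32)
grad_weight = np.zeros((vocab, dim), dtype=np.float32)
np.add.at(grad_weight, idx, grad_out)         # unbuffered scatter-add, the role ATOMIC_ADD plays on-device
assert grad_weight[1, 0] == 2.0               # both occurrences accumulated
```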
George Hotz
793afbd473
simplify nn.Embedding, support AFTER in CUSTOM_KERNEL ( #14419 )
2026-01-29 17:22:13 +08:00
wozeparrot
4845e42135
llama3 gradacc fixes ( #14414 )
2026-01-28 19:12:39 -08:00
nimlgen
aec1ae0de1
llama: set manual_seed ( #14409 )
2026-01-28 14:40:00 -08:00
George Hotz
0c6b3f50aa
add marker to llama training ( #14401 )
2026-01-28 22:44:28 +08:00
Jakob Sachs
2b7c00d3d2
fix sd-example dtype for CLIP embeddings ( #14397 )
2026-01-28 09:07:19 -05:00
qazal
5bffa17f82
llama train: better NULL=1 EMULATE=AMD_CDNA4 dev experience ( #14395 )
...
* beam opens devices
* switch to hip renderer
* amd: true?
* llvm true is for test_autogen
2026-01-28 17:31:22 +09:00
wozeparrot
e496547720
llama3 gradacc ( #14291 )
2026-01-27 19:48:10 -08:00
chenyu
db010a31be
IGNORE_OOB -> CHECK_OOB [pr] ( #14374 )
...
flip the meaning
2026-01-27 12:20:59 -05:00
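The flip in practice: instead of setting a flag to skip the out-of-bounds check, you set one to request it. A hedged sketch using tinygrad's getenv helper (the default shown is an assumption, not necessarily what the code uses):
```
from tinygrad.helpers import getenv

def run_oob_check(): pass   # hypothetical stand-in for the real validation

# before: check ran unless explicitly ignored
#   if not getenv("IGNORE_OOB"): run_oob_check()
# after: check runs when requested (default assumed here, may differ in-tree)
if getenv("CHECK_OOB", 1): run_oob_check()
```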
wozeparrot
a987a4abc3
feat: llama8b dev_beam.sh ( #14358 )
2026-01-26 14:51:23 -08:00
nimlgen
e152f1b0f5
llama: use ALL2ALL ( #14353 )
2026-01-26 22:01:53 +03:00
George Hotz
11ce1e847d
llama train: null device support
2026-01-26 08:53:05 +08:00
wozeparrot
963c59ebdb
fix: pull fixes from gradacc branch ( #14296 )
2026-01-22 23:07:54 -08:00
George Hotz
52b989c6c8
don't place consts early + fixes from anthropic challenge ( #14286 )
...
* don't place consts early
* add anthropic challenge
* with ref
* do we still have to devectorize bools?
* tests pass
* just WHERE
* fine, revert that
* fine, revert
* only index
* z3 validator doesn't support vectorized
* Revert "z3 validator doesn't support vectorized"
This reverts commit 1b7930ecb3.
* z3 not for vec
* no spec
* VLIWRenderer
* loop unrolling
* better comments
* cleanups
* skip cast
* renderer
* cleanups
* prints
* no hack
* hacks
* bump to 11
* reg warning
* lil clean
* cleaner renderer
2026-01-23 10:48:39 +09:00
wozeparrot
c1d14ea832
llama8b train fixes ( #14264 )
2026-01-20 20:34:47 -08:00
wozeparrot
ba90e1b52e
feat: script to run llama8b training ( #14239 )
2026-01-20 12:44:06 -08:00
C T
26f8b12e01
Whisper audio helpers (mel filters in tinygrad) ( #13478 )
...
* add whisper audio helpers for stft/mel/resample
* cleanup
* add whisper stft test
* make only stft test explicitly depend on librosa
* extract sinc_window_kernel
* dehardcode device
* use same device argument
* simplify
* type annotate
* ruff format audio_helpers.py
* ruff format test_whisper.py
* add WHISPER_NEW_STFT
* rename
* undo ruff format changes
* use new stft and mel for whisper
* remove stft test that depends on librosa
* remove whitespace
* add Tensor.log10 with test\test_ops.py::TestOps::test_log10
* use Tensor.log10
* fix lint
* future: remove unused STFT class
* future: remove resample code since it isn't used (yet)
* match openai with pad_mode="reflect"
* pad_to
* future: cut resample leftovers
* cleanup
* add mel tests
* future: cut stft
* future: cut non-mel prep_audio changes
* reduce diff
* move audio_helpers.py to examples
* reduce whitespace
* fix imports
* reduce whitespace
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-20 10:50:02 -05:00
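A rough sketch of the log-mel step these helpers implement, using the Tensor.log10 added in this PR (the filterbank and magnitudes below are random placeholders; the real helpers build them from the STFT and the mel filter construction):
```
from tinygrad import Tensor

n_mels, n_freqs, n_frames = 80, 201, 100
mel_filters = Tensor.rand(n_mels, n_freqs)     # placeholder for the real mel filterbank
magnitudes = Tensor.rand(n_freqs, n_frames)    # placeholder for |STFT|^2 of the audio
mel_spec = mel_filters @ magnitudes
log_spec = mel_spec.maximum(1e-10).log10()     # clamp before log10, roughly as the OpenAI reference does
print(log_spec.shape)                          # (80, 100)
```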
wozeparrot
a879b54234
tk: fa jit fix ( #14170 )
2026-01-16 16:38:45 -08:00
b1tg
0fbc551622
train bert with fp8 ( #13874 )
...
* fp8 train
* clean
* lint
* test fix from #13439
* skip first/last layer
* rm __init__, restore unroll <=32 check
* tests
* clean test, remove unused
* multi-gpu test, clean quantize_to_fp8
* remove bert contiguous
* run script
* test: better check
* run script search
* add seed in bert data shuffle
* move script to mi350x folder
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2026-01-09 09:21:59 -05:00
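A simplified sketch of the per-tensor quantize-to-fp8 idea (scale to the format's max, then cast). The dtype name and the scaling policy here are assumptions about the general technique, not the exact quantize_to_fp8 from the PR:
```
from tinygrad import Tensor, dtypes

FP8_E4M3_MAX = 448.0                       # largest finite value of the e4m3 format

def quantize_fp8_sketch(x: Tensor):
  scale = x.abs().max() / FP8_E4M3_MAX     # per-tensor scale; per-channel is also common
  xq = (x / scale).cast(dtypes.fp8e4m3)    # assumes an fp8e4m3 dtype is exposed; the name may differ
  return xq, scale                         # dequantize later as xq.cast(dtypes.float32) * scale
```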
b1tg
241f0402b4
add seed in bert data shuffle ( #14054 )
2026-01-07 10:02:05 -05:00
chenyu
87f4bc5446
update variable names around jit [pr] ( #14049 )
...
lbs, st_vars_dtype_device and rawbuffers no more
2026-01-06 22:32:41 -05:00
Francis Lata
fac137779e
remove flux1 seed image ( #13843 )
2025-12-27 00:45:11 -05:00
chenyu
da1cb6a9ec
update llama dataloader ( #13825 )
...
separate creating the dataset from iterating over the dataset to not create eval data for each eval
2025-12-24 17:42:08 -05:00
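The refactor pattern described above as a generic sketch: build the dataset once, then hand out fresh iterators for each eval pass instead of rebuilding the data (names and contents are illustrative, not the actual dataloader API):
```
def get_eval_dataset(basedir):
  # hypothetical loader: the real code reads tokenized samples from basedir
  return list(range(32))

def iterate(dataset, bs):
  # cheap to call once per eval; the dataset itself is never rebuilt
  for i in range(0, len(dataset), bs):
    yield dataset[i:i+bs]

dataset = get_eval_dataset("/raid/datasets/c4-8b")   # built once
for step in range(2):                                # two evals, one dataset
  for batch in iterate(dataset, bs=8):
    pass
```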
chenyu
903753c60c
llama wandb logging ( #13822 )
2025-12-24 10:24:59 -05:00
chenyu
27d899ce97
TRAIN=0 to only eval llama ( #13804 )
2025-12-22 11:55:46 -05:00
chenyu
39d962106f
update llama logging ( #13803 )
...
```
REWRITE_STACK_LIMIT=1000000 SMALL=1 BASEDIR=/raid/datasets/c4-8b SAMPLES=1000 BS=8 DP=8 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=8B SEQLEN=1024 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
1 93.44 s run, 11.8750 loss, 0.000000000001 LR, 642.43 GB used, 19644.30 GFLOPS
2 101.78 s run, 11.8750 loss, 0.000000000001 LR, 1454.57 GB used, 17039.35 GFLOPS
3 7.34 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 236258.78 GFLOPS
4 4.32 s run, 11.8750 loss, 0.000000000002 LR, 1454.57 GB used, 401488.40 GFLOPS
5 4.36 s run, 11.9375 loss, 0.000000000003 LR, 1454.57 GB used, 398116.13 GFLOPS
6 4.32 s run, 11.8750 loss, 0.000000000003 LR, 1454.57 GB used, 401878.60 GFLOPS
7 4.34 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 399822.57 GFLOPS
8 4.35 s run, 11.8750 loss, 0.000000000004 LR, 1454.57 GB used, 398512.24 GFLOPS
9 4.36 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 397832.61 GFLOPS
10 4.40 s run, 11.8750 loss, 0.000000000005 LR, 1454.57 GB used, 394520.83 GFLOPS
```
2025-12-22 11:28:29 -05:00
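For reference, a format string that would produce lines like the ones above (field names are assumptions; the GFLOPS column is just the step's FLOPS divided by walltime):
```
def log_step(i, dt, loss, lr, mem_bytes, flops):
  # dt in seconds, mem_bytes in bytes, flops = total ops executed in the step
  print(f"{i} {dt:.2f} s run, {loss:.4f} loss, {lr:.12f} LR, "
        f"{mem_bytes/1e9:.2f} GB used, {flops/dt/1e9:.2f} GFLOPS")

log_step(3, 7.34, 11.875, 2e-12, 1454.57e9, 7.34 * 236258.78e9)
```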
George Hotz
45c459848d
remove more stale stuff ( #13765 )
...
* remove more stale stuff
* remove disassemblers/adreno
* stale
2025-12-19 17:14:56 -04:00
George Hotz
df6cde8a00
cleanup stale examples/extra ( #13764 )
...
* cleanup stale files
* examples
* move those back
* old
* delete more
2025-12-19 16:27:37 -04:00
chenyu
7cd7593c5d
add script to train bert on mi350x ( #13743 )
...
adapted from mi300 config
2025-12-17 16:54:04 -05:00
chenyu
e428fbfab6
verify dtype of llama model params ( #13719 )
2025-12-16 12:32:02 -05:00
chenyu
6cad622f59
don't FREE_INTERMEDIATE in bert ( #13684 )
...
hangs green hcq consistently after an hour of training
2025-12-14 14:27:42 -05:00
chenyu
fcaed1e1dd
don't use empty in bert fake data ( #13661 )
...
somehow jit does not count empty as input
2025-12-12 15:59:50 -05:00
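A small sketch of the change implied above: fake batches get materialized values instead of Tensor.empty, so the JIT sees a real input buffer (dtype and shapes are illustrative):
```
from tinygrad import Tensor, dtypes

BS, SEQ = 8, 512
# before (per the note above, not counted as a JIT input):
#   x = Tensor.empty(BS, SEQ, dtype=dtypes.int32)
# after: a materialized buffer the JIT can treat as an input
x = Tensor.zeros(BS, SEQ, dtype=dtypes.int32).contiguous().realize()
```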
chenyu
01e9ad0d52
clean up bert next_data ( #13650 )
...
train iter was designed to never stop for both real and fake data
2025-12-11 22:56:28 -05:00
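The never-stopping iterator mentioned above is the standard infinite-dataloader pattern; a minimal generic sketch (not the actual next_data code):
```
import itertools

def infinite_batches(dataset):
  # cycle forever so the training loop, not the iterator, decides when to stop
  for batch in itertools.cycle(dataset):
    yield batch

it = infinite_batches([{"x": 0}, {"x": 1}])
batches = [next(it) for _ in range(5)]   # keeps going past the end of the dataset
```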
chenyu
5034c6fb37
reenable FREE_INTERMEDIATE for bert ( #13639 )
...
* reenable FREE_INTERMEDIATE for bert
* comment
2025-12-10 12:08:09 -05:00
chenyu
016a59cafa
remove contiguous and use where in EmbeddingBert ( #13632 )
2025-12-09 15:49:21 -05:00
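The where-based embedding trick referenced here: broadcast the indices against an arange over the vocab, turn the match into 0/1 with where, and matmul against the weight. A hedged sketch, not the exact EmbeddingBert code:
```
from tinygrad import Tensor, dtypes

vocab, dim = 16, 8
weight = Tensor.rand(vocab, dim)
idx = Tensor([[1, 3, 3]], dtype=dtypes.int32)              # (batch, seqlen)
arange = Tensor.arange(vocab).reshape(1, 1, vocab)
one_hot = (idx.unsqueeze(-1) == arange).where(1.0, 0.0)    # (batch, seqlen, vocab)
out = one_hot @ weight                                     # (batch, seqlen, dim), gather-free
```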
chenyu
2471b49e45
minor bert / llama change from grad acc branch ( #13622 )
...
* minor bert / llama change from grad acc branch
* revert those
2025-12-08 16:04:14 -05:00
chenyu
b981b6f89e
remove old llama grad_acc ( #13611 )
...
* remove old llama grad_acc
* GRADIENT_ACC_STEPS=1
2025-12-07 13:03:47 -05:00
chenyu
4562f217e1
more bert updates ( #13597 )
...
prep split jit
also lower BS to 72
2025-12-06 08:32:43 -05:00
chenyu
cb4c6324ef
revert bert grad accumulation ( #13596 )
...
prep for the new split jit style
2025-12-05 17:30:08 -05:00
chenyu
89f9e1dcd5
add SGD to beautiful_mnist ( #13571 )
2025-12-04 12:17:29 -05:00
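For reference, swapping SGD into a beautiful_mnist-style training step looks roughly like this (the model and hyperparameters are placeholders, not the example's values):
```
from tinygrad import Tensor, nn
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

model = nn.Linear(784, 10)                        # stand-in for the example's conv net
opt = SGD(get_parameters(model), lr=0.1, momentum=0.9)

with Tensor.train():
  x, y = Tensor.rand(32, 784), Tensor.randint(32, high=10)
  opt.zero_grad()
  loss = model(x).sparse_categorical_crossentropy(y).backward()
  opt.step()
```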
George Hotz
96d16675fe
update examples/gradaccum_mnist.py to use the JIT
2025-12-03 16:11:42 -08:00
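One way to express gradient accumulation under the JIT is to average the micro-batch losses inside a single jitted step so one backward covers all of them; a hedged sketch of that shape (the actual example file may structure it differently):
```
from tinygrad import Tensor, TinyJit, nn
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

model = nn.Linear(784, 10)
opt = SGD(get_parameters(model), lr=0.1)
ACC_STEPS = 4

@TinyJit
def train_step(x: Tensor, y: Tensor) -> Tensor:
  # x/y hold ACC_STEPS micro-batches; averaging the losses matches accumulating grads
  opt.zero_grad()
  loss = Tensor.stack(*[model(x[i]).sparse_categorical_crossentropy(y[i]) for i in range(ACC_STEPS)]).mean()
  loss.backward()
  opt.step()
  return loss.realize()

with Tensor.train():
  x = Tensor.rand(ACC_STEPS, 32, 784)
  y = Tensor.randint(ACC_STEPS, 32, high=10)
  print(train_step(x, y).item())
```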
George Hotz
a4c4e48385
add LUNIQUE op ( #13554 )
2025-12-03 14:34:34 -08:00