wozeparrot
7e68045fb2
feat: small llama3 training ( #11829 )
2025-08-31 13:41:47 -07:00
NoahKusaba
0838021753
remove np from beautiful_cifar ( #10988 )
...
* remove np from beautiful_cifar
* remove np from cifar
* rename variable and rename tensor.arange to just tensor.randperm
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-08-29 19:34:16 -04:00
chenyu
e39b25cd36
upcast float exp to at least float32 ( #11758 )
...
* upcast float exp to at least float32
* unlucky seed
2025-08-22 20:16:34 -04:00
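A minimal sketch of the idea, assuming tinygrad Tensor ops (the helper name is illustrative, not code from the PR): do the exp in at least float32 and cast back.

```python
from tinygrad import Tensor, dtypes

def stable_exp(x: Tensor) -> Tensor:
  # illustrative: compute exp in at least float32 to avoid half-precision
  # overflow/rounding, then cast back to the original dtype
  if x.dtype in (dtypes.float16, dtypes.bfloat16):
    return x.cast(dtypes.float32).exp().cast(x.dtype)
  return x.exp()
```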
wozeparrot
b979162c5d
llama3 eval train ( #11706 )
2025-08-20 19:56:35 -04:00
chenyu
dbd3b67657
clamp GRAD_CLIP_NORM in llama ( #11761 )
2025-08-20 19:55:50 -04:00
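A hedged sketch of the gradient-norm clipping this knob relates to (not the repo's implementation; names are illustrative):

```python
from tinygrad import Tensor

def clip_grad_norm_(params: list[Tensor], max_norm: float) -> Tensor:
  # illustrative: compute the global gradient norm and scale every grad down
  # so the total norm never exceeds max_norm
  global_norm = Tensor.stack(*[p.grad.square().sum() for p in params]).sum().sqrt()
  scale = (max_norm / (global_norm + 1e-6)).minimum(1.0)
  for p in params: p.grad = p.grad * scale
  return global_norm
```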
George Hotz
9366a23eb0
test backward in test_tiny ( #11697 )
...
* test backward in test_tiny
* empty
2025-08-16 20:29:39 -07:00
chenyu
e9d0027591
llama MP realize weight after shard ( #11672 )
...
* llama MP realize weight after shard
prevents memory spike on device 0
* empty weight for FAKEDATA
2025-08-14 16:17:46 -04:00
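The ordering is the point: sharding before realizing means the full weight never materializes on a single device. A minimal sketch, assuming a multi-GPU box (device count and shape are placeholders):

```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(8))

# realize only after sharding, so device 0 never holds the whole tensor
w = Tensor.empty(8192, 8192).shard(GPUS, axis=0).realize()
```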
kevvz
e2873a3a41
[bounty] Muon optim ( #11414 )
...
* newton schulz
* add muon + move newton schulz to tensor
* compact newton schulz
* better tests
* cleanup
* add comments for muon
* cleanup
* add export with tests
* match muon optim with test optim
* cleanup
* unused import
* correct comment
* whitespace
* move export
* muon test fix
* match reference impl + tests
* remove export by moving muon device
* add credit
* cleanup
* remove print
* spacing
* spacing
* comma
* cleanup
* removal
* fix tests + optim momentum
* consistent is not/ not
* more consistency
* fix test
* cleanup
* fix the nones
* remove comment
* cast
* comment
* comment
* muon teeny test
* muon flag beautiful mnist
* set steps
* steps as hyperparam
* match default test steps
* name
* large cleanup
* dont care about steps
* nesterov false default
* match each other impl
* steps
* switch nest
* swap defaults
* update docstring
* add no nesterov test
* ban fuse_optim
* prints
* classical momentum
* alternative condition
* recon
* pre + post wd
* false default
* detach
* signature changes
* context
* swap order
* big cleanup
* 0 step instead
* parity
* remove fuse
* remove fused
* better paper
* assert message
* correct shape check + eps
* multidim
* add eps
* cleanup
* correct assert message
* lint
* better tests
* naming
* ns_steps,ns_params
* update docstring
* docstring
* match sgd and muon together
* sandwich
* add back fused
* parity
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2025-08-13 14:27:55 -04:00
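At the core of Muon is a Newton-Schulz iteration that approximately orthogonalizes the 2D momentum matrix before it is applied as an update. A hedged sketch of that iteration (coefficients follow the reference Muon implementation; this is not tinygrad's exact code):

```python
from tinygrad import Tensor

def newton_schulz(G: Tensor, steps: int = 5, eps: float = 1e-7) -> Tensor:
  # approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration
  a, b, c = 3.4445, -4.7750, 2.0315
  X = G / (G.square().sum().sqrt() + eps)   # normalize so the iteration converges
  transposed = G.shape[0] > G.shape[1]
  if transposed: X = X.T
  for _ in range(steps):
    A = X @ X.T
    X = a * X + (b * A + c * (A @ A)) @ X
  return X.T if transposed else X
```

The optimizer then applies the orthogonalized matrix in place of the raw momentum; non-2D parameters are typically handled by a standard optimizer instead.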
Sardor
ca7a641442
fix bugs at examples/yolov3.py ( #11614 )
...
* Update load_weight. Give valid model url
* Fix bug in iou function
2025-08-11 21:14:47 -04:00
chenyu
630edcffd8
remove .float calls in olmoe ( #11610 )
...
still matches torch
2025-08-10 20:33:22 -04:00
chenyu
ef17af85c6
remove .float call in llama logit ( #11598 )
...
* remove .float call in llama logit
* bfloat item
2025-08-10 00:02:18 -04:00
chenyu
7338ffead0
small beautiful_mnist update ( #11596 )
...
gather is fast now. There's a conv/bw kernel that only gets fast with BEAM, but the whole thing now runs in < 5 seconds regardless
2025-08-09 19:51:14 -04:00
chenyu
45baec1aab
model parallel llama ( #11588 )
...
MP=8 GRADIENT_ACC_STEPS=3 BS=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=70B SEQLEN=512 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py
2025-08-09 16:54:27 -04:00
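A sketch of what MP=8 means here in tinygrad terms (shapes and axis choices are illustrative, not the exact llama3 script): large matmul weights are sharded across the GPUs along one axis, while small tensors like norm weights are replicated.

```python
from tinygrad import Tensor, Device

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(8))

w_qkv  = Tensor.empty(4096, 3*4096).shard(GPUS, axis=1)  # column-parallel projection
w_out  = Tensor.empty(4096, 4096).shard(GPUS, axis=0)    # row-parallel projection
norm_w = Tensor.empty(4096).shard(GPUS, axis=None)       # replicated on every GPU
```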
chenyu
702e38dc19
remove FUSE_ARANGE_UINT ( #11567 )
...
also add IGNORE_OOB=1 to bert runs. Lowered BS on tinybox to 90 since 96 OOMed during eval without reset
2025-08-07 16:49:06 -04:00
wozeparrot
7ae4335127
feat: generate blend index ( #11566 )
2025-08-07 14:20:28 -04:00
George Hotz
21570545d3
move view pushing to codegen, try 2 ( #11534 )
...
* move view pushing to codegen, try 2
* fix up some linearizer tests
* fix test search
* fix test schedule
* delete that test
* fix test arange
* fix a few tests
* update tests
* push views
* ebs cleanup
* fix local/reg
* test and lint
* fix more tests
* test cleanups
* skipped that one
2025-08-06 15:58:38 -07:00
wozeparrot
2d5bdc939d
faster llama3 dataloader ( #11540 )
2025-08-06 18:25:57 -04:00
chenyu
f7965f85aa
Revert "feat: faster index building ( #11462 )" ( #11478 )
...
This reverts commit 3a4deb08d2.
2025-08-02 12:50:48 -04:00
wozeparrot
3a4deb08d2
feat: faster index building ( #11462 )
...
* feat: faster index building
* feat: correct training samples
2025-08-02 11:50:18 -04:00
chenyu
9e8e6b45ab
grad acc train llama ( #11467 )
...
* grad acc train llama
* log step time
2025-08-01 15:54:50 -04:00
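A hedged sketch of gradient accumulation (model, optimizer, and batch source are placeholders): run GRADIENT_ACC_STEPS micro-batches, letting .backward() accumulate into .grad, then take a single optimizer step.

```python
from tinygrad import Tensor

def train_step(model, opt, get_batch, acc_steps: int = 3) -> Tensor:
  with Tensor.train():
    opt.zero_grad()
    for _ in range(acc_steps):
      x, y = get_batch()                                              # one micro-batch
      loss = model(x).sparse_categorical_crossentropy(y) / acc_steps  # scale so the sum is a mean
      loss.backward()                                                 # grads accumulate across micro-batches
    opt.step()                                                        # one weight update
  return loss.realize()
```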
chenyu
7ad7329257
data parallel train llama ( #11466 )
2025-08-01 12:13:51 -04:00
George Hotz
8ff03806e8
add llama layers ( #11460 )
...
* add llama layers
* add contig bw for speed
2025-07-31 16:28:04 -07:00
wozeparrot
6252f7770e
feat: fake data ( #11447 )
2025-07-30 17:18:20 -07:00
chenyu
e300451f3a
update llama3 ( #11446 )
...
`LR=1e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 FUSE_ARANGE=1 JITBEAM=2 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=512 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py` trained to 7
2025-07-30 19:34:21 -04:00
wozeparrot
5fb975351a
feat: flag for training on val ( #11441 )
2025-07-30 14:29:45 -07:00
wozeparrot
825b6a2505
feat: llama3 dataloader ( #11340 )
2025-07-30 13:27:55 -07:00
George Hotz
842184a1ab
rename kernelize to schedule, try 2 ( #11305 )
2025-07-21 11:18:36 -07:00
nimlgen
cc3c1e4c14
hcq: move cpu to hcq ( #11262 )
...
* hcq: move cpu to hcq
* import time
* upd
* fix
* windows support
* hm
* cleaner
* fix timer
* fix timing
* std is ns
* skip profiler
* mypy
* cleaner
* cleanups
* after merge
* default is back
2025-07-21 15:10:38 +03:00
chenyu
85ddd72038
simpler grouptop in hcopt ( #11219 )
...
* simpler grouptop in hcopt
keep only the perf-relevant conditions; the rest is handled by try/except
* update openpilot read image count
2025-07-13 16:06:09 -04:00
chenyu
a0438012af
remove Kernel.get_program [pr] ( #11203 )
2025-07-12 20:50:29 -04:00
geohotstan
5ce278b245
OnnxRunner file as input ( #10789 )
...
* file path as input and have parse be in OnnxRunner.__init__
* modelproto_to_onnxrunner -> modelproto_to_runner
* whoops, fix import
* oh flakiness again, is it because it's getting gc-ed?
* small changes
* CI flaky so just move compile4 fix in
* copy typing of onnx_load
* actually can just import onnx_load instead of onnx.load
* fix external_benchmark_openpilot
* fix onnx_runner test to use onnx_helper
* rerun CI
* try run_modelproto
* spam CI a few times
* revert run_modelproto since that's flaky also
* no external onnx_load usage except onnx.py
* cursor tab complete is evil. Snuck a darn sorted in. But does order change result? Why?
* model_benchmark 193s -> 80s, add OnnxRunner.to()...
* minimize diff and clean up
* device can be None, weird but eh
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-07-12 14:27:46 -04:00
chenyu
b072be0e2d
hotfix whisper main script ( #11184 )
2025-07-11 12:34:00 -04:00
Nino Risteski
bc15e98f5c
clean up unused imports in examples and update CI linting ( #11024 )
...
* clean up unused imports in examples
* enable unused import checking in examples
* lint
* ignore F541 and F841 - focus on unused imports only
* clean up
* restore tinygrad.frontend.torch for TINY_BACKEND
* tiny change
2025-06-30 08:21:27 -07:00
chenyu
c14c9a8eff
llama3 grad clip ( #11003 )
2025-06-27 19:14:12 -04:00
chenyu
f2548afeb5
bert grad clipping start with const 0 ( #11008 )
...
saved the init kernels
2025-06-27 18:02:23 -04:00
chenyu
6ab5a5cb6c
llama3 mlperf train ( #10983 )
...
work in progress. Now it can overfit small examples and VRAM roughly matches
2025-06-26 20:24:27 -04:00
geohotstan
50936b4a18
ONNX real float16 ( #10694 )
...
* squash commits
* temp fix for const tensor
* actually realizing float16 can only happen in raw_data
* .float -> cast(float) to rerun CI
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-06-26 14:05:12 -04:00
chenyu
8751d47985
CosineAnnealingLRWithWarmup ( #10981 )
2025-06-25 17:45:21 -04:00
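A sketch of the schedule shape (plain Python, not the scheduler class itself): linear warmup to the base LR, then cosine decay.

```python
import math

def lr_at(step: int, base_lr: float, warmup_steps: int, decay_steps: int, end_lr: float = 0.0) -> float:
  # linear warmup for warmup_steps, then cosine anneal to end_lr by decay_steps
  if step < warmup_steps:
    return base_lr * (step + 1) / warmup_steps
  t = min((step - warmup_steps) / max(decay_steps - warmup_steps, 1), 1.0)
  return end_lr + 0.5 * (base_lr - end_lr) * (1 + math.cos(math.pi * t))
```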
chenyu
efad567ebd
ruff check whole examples/mlperf/ ( #10979 )
2025-06-25 12:57:48 -04:00
Alexey Zaytsev
230ad3a460
[bounty] Don't use numpy inside hlb_cifar10 training loop ( #10777 )
...
* Don't use numpy inside hlb_cifar10 training loop
* Lint it
* jit it
* Drop the last half-batch
* Use gather for random_crop and reuse perms
* Wrap train_cifar in FUSE_ARANGE context
* No need to pass FUSE_ARANGE=1 to hlb_cifar10.py
* Add cutmix to jittable augmentations
* Remove .contiguous() from fetch_batches
* Fix indexing boundary
---------
Co-authored-by: Irwin1138 <irwin1139@gmail.com>
2025-06-23 17:24:56 -07:00
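A minimal sketch of the numpy-free batching this enables (function and shapes are illustrative): the permutation, indexing, and batch slicing all stay on-device as Tensor ops, with FUSE_ARANGE letting the index gathers fuse.

```python
from tinygrad import Tensor
from tinygrad.helpers import Context

def fetch_batches(X: Tensor, Y: Tensor, bs: int):
  perm = Tensor.randperm(X.shape[0])         # on-device shuffle, no np.random
  for i in range(X.shape[0] // bs):          # drop the last half-batch
    idx = perm[i*bs:(i+1)*bs]
    with Context(FUSE_ARANGE=1):             # let the index gather fuse into one kernel
      xb, yb = X[idx].realize(), Y[idx].realize()
    yield xb, yb
```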
chenyu
3699d1d3ba
hotfix llama3 temperature is float ( #10938 )
2025-06-23 15:20:56 -04:00
chenyu
0480139def
log_perplexity metrics ( #10912 )
2025-06-21 10:44:47 -04:00
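A sketch of the metric (assuming flattened next-token logits and integer targets; not necessarily the repo's exact helper): log-perplexity is the mean token negative log-likelihood, and perplexity is its exp.

```python
from tinygrad import Tensor

def log_perplexity(logits: Tensor, targets: Tensor) -> Tensor:
  # logits: (batch*seqlen, vocab), targets: (batch*seqlen,) integer token ids
  return logits.sparse_categorical_crossentropy(targets)  # mean NLL == log-perplexity

# perplexity itself would be log_perplexity(logits, targets).exp()
```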
George Hotz
b41e0563a3
move stuff to kernelize folder ( #10902 )
...
* move stuff to kernelize folder
* oops, forgot that
2025-06-20 16:10:20 -07:00
George Hotz
92678e59ee
move kernel to opt ( #10899 )
2025-06-20 15:22:28 -07:00
chenyu
62a540066e
remove DEBUG=2 in mi300x bert setup ( #10886 )
...
seems fine now, not sure what the issue was
2025-06-19 13:28:53 -04:00
Nino Risteski
5a56710ff4
small fix replacing download_file with fetch ( #10877 )
...
* imported a missing os and replaced download_file with fetch from tg helpers
* use fetch directly
* Remove if not os.path.isfile
2025-06-19 12:12:09 -04:00
chenyu
8d721a4ead
add 405B params to llama3.py ( #10884 )
...
tested with `python examples/llama3.py --model /raid/weights/llama31_405b/ --size 405B --shard 8 --benchmark` on tinyamd2
2025-06-19 11:45:37 -04:00
chenyu
f377cc19cd
use AM for bert ( #10882 )
...
have trained 3 runs and all seem fine
2025-06-19 09:48:54 -04:00
chenyu
b70c7d3631
bert grad accumulation ( #10863 )
...
* bert grad accumulation
* realize grad
2025-06-18 12:17:07 -04:00
George Hotz
cba6e15937
split grouper and kernelize [pr] ( #10854 )
2025-06-17 17:54:20 -07:00