chenyu
1bcb58479d
resnet setup power cap red box gpu to 350W ( #4484 )
...
1%-2% faster
2024-05-08 23:32:41 -04:00
chenyu
0ed755bcf5
resnet use EVAL_BS=192 ( #4482 )
...
* resnet use EVAL_BS=192
also lower green run BEAM_MIN_PROGRESS from 10 to 5
* BEAM_MIN_PROGRESS 5 is too close to setup limit
2024-05-08 22:29:27 -04:00
chenyu
1f6bf9d2f7
real diskcache_clear in model_train resnet ( #4445 )
...
clear cache if INITMLPERF is set, or running run_and_time. dev_beam and dev_run do not clear cache
2024-05-08 19:06:09 -04:00
chenyu
1b4645bea6
hotfix resnet move init_start to start of the script ( #4481 )
2024-05-08 19:03:52 -04:00
wozeparrot
a347ae94d6
feat: remove wandb ( #4480 )
2024-05-08 15:31:16 -07:00
chenyu
db7e15c46f
hotfix resnet only log epoch start with RUNMLPERF ( #4477 )
2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d
mlperf logging, truncate dir in logs and log seed ( #4475 )
2024-05-08 12:54:02 -04:00
chenyu
b62a65b617
redo faster sparse_categorical_crossentropy ( #4461 )
...
update LR and DECAY in resnet default that help convergence too
2024-05-08 11:21:43 -04:00
George Hotz
17faae091b
optimizer shouldn't be run without training ( #4460 )
...
* optimizer shouldn't be run without training
* set training in relevant tests
* fix multitensor
* that too
2024-05-06 15:34:12 -07:00
George Hotz
f4e49a7c1a
resnet 50 opt: correct loop + LARS ( #4449 )
...
* correct loop + LARS
* ops
2024-05-06 08:01:26 -07:00
George Hotz
fc995d4446
add backward to handcode_resnet50_opt
2024-05-06 06:42:26 -07:00
wozeparrot
603d3a351b
feat: allow keeping multiple cookies ( #4440 )
2024-05-05 19:26:48 -07:00
Francis Lam
709410071c
mlperf/resnet: updated BEAM params to increase performance ( #4443 )
2024-05-05 21:49:46 -04:00
chenyu
3b30756cbb
update mlperf submission system ( #4435 )
...
more required fields.
2024-05-05 13:19:07 -04:00
David Hou
c0a048c044
batchnorm d(var)/d(mean) = 0 ( #4430 )
...
* d(var)/d(mean) = 0
* drop the number in test_schedule!
2024-05-05 00:25:45 -04:00
qazal
fa17dcaf07
Fix llm.c/export.py ( #4423 )
...
* fix headers
* add CI
* add stdio
* merge clang tests
* revert llm.c
* revert ci
* Revert "revert llm.c"
This reverts commit 5fd17e3c8b .
2024-05-04 19:37:10 +03:00
George Hotz
cb7289f9c9
remove clang program header ( #4422 )
...
* remove clang program header
* proper max
* bools are numbers
* fix compile enet
2024-05-04 08:38:01 -07:00
chenyu
473ecb978a
remove SPLIT_REDUCEOP=1 from resnet scripts ( #4404 )
...
SPLIT_REDUCEOP=1 is default
2024-05-03 12:36:23 -04:00
David Hou
b767d59684
resnet trainer: keep old cookie around until next step has been queued ( #4401 )
...
* keep old cookie around until next step has been queued (-10ms 6gpu)
* also for eval
* drop cookie before data_get?
* Revert "drop cookie before data_get?"
This reverts commit b01e6aa2b2 .
* Revert "Revert "drop cookie before data_get?""
This reverts commit 23464e73d4 .
2024-05-03 12:15:21 -04:00
chenyu
2c3b7f8e70
pad resnet training data with training data mean ( #4369 )
...
update model_train resnet to pad training
2024-05-02 20:26:15 -04:00
Francis Lam
3cf8291f2f
mlperf/resnet: update beam params to increase time and quality ( #4396 )
...
* mlperf/resnet: update beam params to increase time and quality
* revert upcast 8 in search space and add rocm setup function
* refactor to independent setup.sh script
2024-05-02 20:14:46 -04:00
chenyu
ab01a9433d
resnet eval 4n+3 if epoch < 33 ( #4391 )
...
the rule is as thoroughly as 4n+k and we can stop the clock as soon as eval hits target. this can save 24 evals or 12 minutes
2024-05-02 16:52:07 -04:00
chenyu
7492e5d3e7
resnet correct log name for red ( #4390 )
2024-05-02 10:58:55 -04:00
chenyu
bf31837e6d
resnet correct steps_in_val_epoch in logging ( #4389 )
...
also added random seed from system in scripts
2024-05-02 10:51:36 -04:00
ym555
3113785604
Llama 3 Models ( #4339 )
...
* Full Impl
* fix test
* Fix inference loop
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com >
2024-05-02 06:06:07 -07:00
chenyu
22376e53b7
resnet mlperf logging ( #4361 )
...
* resnet mlperf logging
* cropping too much?
2024-05-02 00:00:04 -04:00
chenyu
ad116dc5c6
fill in mlperf system description ( #4381 )
...
it did not ask too many details. will put software versions later with tinygrad commit.
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_red.json training 4.0.0
INFO - System description checker passed for tinybox red
```
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_green.json training 4.
0.0
INFO - System description checker passed for tinybox green
```
2024-05-01 16:47:45 -04:00
chenyu
9358b62073
rename resnet script to dev_beam.sh and dev_run.sh ( #4379 )
...
final run_and_time needs to be one script for both. rename the old scripts
2024-05-01 14:41:35 -04:00
chenyu
6628e13a5f
pad resnet eval data in model_train ( #4374 )
...
asserted if eval sample count is different from total eval file count.
2024-05-01 14:33:42 -04:00
chenyu
826cccd54d
fix mean underflow for half tensor ( #4377 )
...
* fix mean underflow for half tensor
divide only the reduce factor. added unit test and non-nan assertion in resnet training. also added a failed test cast for symbolic shape var
* skip for python backend
2024-05-01 13:38:57 -04:00
George Hotz
b683d0f496
hotfix: 100% accuracy is wrong
2024-05-01 08:07:18 -07:00
chenyu
683b7c605a
pad first batch of imagenet dataloader and update eval ( #4368 )
...
* pad first batch of imagenet dataloader and update eval
* pad zero instead of empty for training
2024-05-01 00:21:52 -04:00
Francis Lam
16838eae08
mlperf/resnet: update tinybox_red parameters to new best values ( #4364 )
...
about 27 minutes to setup and 345ms/110TF steps
2024-04-30 18:08:12 -04:00
Francis Lam
0d33c54d99
kernel: change PADTO check to allow up to 4x padding ( #4354 )
...
* kernel: change PADTO check to allow up to 4x padding
also optionally remove PADTO from the search action space with
BEAM_PADTO=0.
* fix test_linearizer test_tensor_cores_padded tests
* update resnet runs to use SPLIT_REDUCEOP=1
* fix up search TC axis and amt checking
* fix up the dimensions of the TC tests
2024-04-30 15:29:34 -04:00
Elias Wahl
babe87a8ae
BERT: Checkpoint loading tests ( #4359 )
...
* Move checkpoint init to helpers. Add test
* linters
* Move the steps outside of the main train loop
* Move data_get
* data_get belongs to helpers
2024-04-30 14:43:41 -04:00
Elias Wahl
71ff68b445
dropout after eval step ( #4351 )
2024-04-29 15:47:21 -04:00
Elias Wahl
27613dd881
MLPerf BERT: Main training loop ( #4288 )
...
* BERT language modeling head + trunc normal initializers
* add train loop + helpers
* shuffle in dataloaders + slight changes in main loop
* beam change
* Minor changes
* random.shuffle
* HParam update
* Use deque for dataloader
* wandb bert project name
* half fixes
* BENCHMARK + remove epoch
* cast + print()
---------
Co-authored-by: chenyu <chenyu@fastmail.com >
2024-04-29 14:35:27 -04:00
Francis Lata
bb849a57d1
[MLPerf] UNet3D dataloader ( #4343 )
...
* add support for train/val datasets for kits19
* split dataset into train and val sets
* add tests for kits19 dataloader
* add MLPerf dataset tests to CI
* update unet3d model_eval script
* fix linting
* add nibabel
* fix how mock dataset gets created
* update ref implementation with permalink and no edits
* clean up test and update rand_flip implementation
* cleanups
2024-04-28 22:34:18 -04:00
Arnav Mehta
f3de17912f
added the download if not present missing function ( #4318 )
2024-04-28 16:31:08 +08:00
chenyu
ec65aea32f
resnet stop the script once hit target ( #4303 )
...
* resnet stop the script once hit target
* comment
2024-04-25 23:54:56 -04:00
George Hotz
1e37c4a7a1
minor llm.c improvements
2024-04-26 11:15:31 +08:00
chenyu
f9a7badace
use LR=7 for resnet with BS=1536 ( #4299 )
...
had 3 runs after lr float32, seems quite stable and converges at epoch 34 and 35
2024-04-25 15:23:10 -04:00
chenyu
c11bad766d
prepare mlperf submission ( #4270 )
...
* prepare mlperf submission
* 28min compile and 3h53m
* red 30 minute compile and 56 TFLOPS
2024-04-24 13:19:31 -04:00
chenyu
c1fbacb182
resnet benchmarks use DEFAULT_FLOAT=HALF ( #4285 )
...
also update LR default to scaled based on 1536 (the BS we are submitting)
2024-04-24 12:10:57 -04:00
George Hotz
ad28fdecb1
si.inputs+outputs -> bufs ( #4279 )
2024-04-24 15:12:34 +08:00
chenyu
8401de9922
resnet benchmark return early in eval ( #4278 )
...
only do few eval steps to compile, and skip second epoch when doing beam + benchmark. save 2 minutes
2024-04-24 00:55:01 -04:00
chenyu
6637ecc5fe
use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 ( #4269 )
...
we want to have different BEAM values for resnet train and eval. global JITBEAM cannot do this. added the flag to change beam behavior at cnt=0 (so it default behaves the same with or without TinyJit), and for cnt=1 it uses existing BEAM.value.
Also updated the context var BEAM in resnet to be outside of TinyJit. saves about 3 minutes compile time
2024-04-23 18:59:43 -04:00
Elias Wahl
3a48773f1a
BERT dataloader ( #4252 )
...
* add dataloader
* comment
2024-04-23 13:44:49 -04:00
chenyu
37f8be6450
resnet print epoch ops and mem in benchmark ( #4244 )
...
* resnet print epoch ops and mem in benchmark
also added a flag to optionally disable reset jitted steps
* real per epoch stats
2024-04-21 18:32:31 -04:00
chenyu
30fc1ad415
remove TODO: remove explicit dtypes after broadcast fix in stable_diffusion ( #4241 )
...
this is done
2024-04-21 00:31:24 -04:00