Francis Lata
0345577032
UNet3D dataloader shared memory fix ( #5465 )
...
* create separate SharedMemory between inputs and labels
* update path check for shared mem
* clean up unit test for dataset
2024-07-13 20:26:00 -04:00
Elias Wahl
73bddc44f6
Fix fake dataloader ( #5326 )
2024-07-08 09:07:44 -04:00
chenyu
43c3f73fbc
handcode_bert_opt.py ( #5295 )
...
similar to handcode_resnet50_opt.py, one file to check bert kernels without dataset.
2024-07-05 11:01:20 -04:00
Elias Wahl
e267f3161d
Add MLLogger ( #5125 )
...
* add MLPerf logger
* eval steps
* start with step 1
* compliance for 3.1.0 and 4.0.0
* more compliance
* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
Elias Wahl
f31ef11537
Better default hparams for large BS ( #5030 )
...
* better default hparams for large BS
* bf16 too
* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl
7bfa9101c0
Float in scaled dot product attention ( #4985 )
...
* Monkeypatch scaled-dot-product-attention
* Use dot instead of matmul
* new api
* imports
* least_upper_dtype
2024-06-18 08:16:41 -04:00
Elias Wahl
d2e3c391e8
Residual in MLM loss + Change default steps ( #4935 )
...
* Residual in mlm loss
* Reduce default steps to 160K * 24
* oops
* comment
2024-06-12 16:09:18 -04:00
Nik
085c0bbf6b
add mlperf train subset of openimages ( #4841 )
2024-06-05 10:10:11 -04:00
Elias Wahl
e576aca044
Disable dropout ( #4837 )
2024-06-04 18:57:26 -04:00
Elias Wahl
bb248a0dd1
Optional half matmul ( #4835 )
...
* half linear
* move weight cast back
* oops
* matmul dtype var
* todo comment
2024-06-04 17:53:41 -04:00
Elias Wahl
04e237328b
Refactor to class style ( #4804 )
2024-06-04 14:08:31 -07:00
Francis Lata
707099487a
Multiprocessing UNet3D dataloader ( #4801 )
...
* testing dataloader
* matching dataloader implementation for unet3d
* remove comments
* clean up dataloader
* add cookie and cleanup
* use shm_path when creating SharedMemory
* add support for testing resnet and unet3d dataloaders
* update dataset test to return preprocesed data directory in prep for dataloader testing
* pass preprocessed dataset directory properly
* update loader function for dataloader
* add shuffling on indices
* update shm name
* more cleanup for unet3d dataloader
* remove changes to tests
---------
Co-authored-by: chenyu <chenyu@fastmail.com >
2024-06-02 11:30:47 -04:00
Elias Wahl
c4b0acf095
Global norm + small changes ( #4749 )
...
* norm
* no empty
* default loss scaler in float
2024-05-27 18:35:27 -04:00
Elias Wahl
acc0039cfc
Resume fix + scheduler for non weight decay params ( #4679 )
...
* move ckpt dir
* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
Elias Wahl
993091adfa
loss scaler + nan fixes ( #4661 )
2024-05-20 17:08:35 -04:00
chenyu
bed70b130c
mlperf bert getenv-able EVAL_STEP_FREQ ( #4534 )
2024-05-11 14:36:56 -04:00
chenyu
04a4980a51
touchup bert script ( #4531 )
...
small adjustments, remove duplicated training setting and stop the script once target is hit
2024-05-11 13:02:02 -04:00
chenyu
b00b6b16f0
fix TRAIN_BEAM and Tensor.training for mlperf bert ( #4525 )
...
also hard coded bert model config instead of looking up a file
2024-05-11 00:18:36 -04:00
chenyu
b399d98e41
fix resnet eval ( #4507 )
2024-05-10 00:49:00 -04:00
wozeparrot
a602dc67d3
feat: more mlperf fixes ( #4505 )
2024-05-09 20:50:20 -07:00
chenyu
0e8aa0e288
use fake data in beam searching resnet ( #4504 )
2024-05-09 23:43:50 -04:00
wozeparrot
29daea4e60
fix: core count and os ( #4503 )
2024-05-09 19:55:07 -07:00
chenyu
ef93e41a15
resnet mlperf systems add tinygrad commit and python / runtime versions ( #4494 )
2024-05-09 16:04:15 -04:00
chenyu
b5afdfbc5b
first draft resnet mlperf readme ( #4493 )
...
* start readme
* something
2024-05-09 15:51:44 -04:00
chenyu
047c7f3e5b
polish resnet mlperf logging ( #4490 )
...
don't include save final check point time in run time, and some cosmetic order changes
2024-05-09 13:04:24 -04:00
chenyu
d78e159aa3
resnet logging move RUN_START to start of the script ( #4488 )
2024-05-09 12:32:32 -04:00
chenyu
1bcb58479d
resnet setup power cap red box gpu to 350W ( #4484 )
...
1%-2% faster
2024-05-08 23:32:41 -04:00
chenyu
0ed755bcf5
resnet use EVAL_BS=192 ( #4482 )
...
* resnet use EVAL_BS=192
also lower green run BEAM_MIN_PROGRESS from 10 to 5
* BEAM_MIN_PROGRESS 5 is too close to setup limit
2024-05-08 22:29:27 -04:00
chenyu
1f6bf9d2f7
real diskcache_clear in model_train resnet ( #4445 )
...
clear cache if INITMLPERF is set, or running run_and_time. dev_beam and dev_run do not clear cache
2024-05-08 19:06:09 -04:00
chenyu
1b4645bea6
hotfix resnet move init_start to start of the script ( #4481 )
2024-05-08 19:03:52 -04:00
wozeparrot
a347ae94d6
feat: remove wandb ( #4480 )
2024-05-08 15:31:16 -07:00
chenyu
db7e15c46f
hotfix resnet only log epoch start with RUNMLPERF ( #4477 )
2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d
mlperf logging, truncate dir in logs and log seed ( #4475 )
2024-05-08 12:54:02 -04:00
chenyu
b62a65b617
redo faster sparse_categorical_crossentropy ( #4461 )
...
update LR and DECAY in resnet default that help convergence too
2024-05-08 11:21:43 -04:00
wozeparrot
603d3a351b
feat: allow keeping multiple cookies ( #4440 )
2024-05-05 19:26:48 -07:00
Francis Lam
709410071c
mlperf/resnet: updated BEAM params to increase performance ( #4443 )
2024-05-05 21:49:46 -04:00
chenyu
3b30756cbb
update mlperf submission system ( #4435 )
...
more required fields.
2024-05-05 13:19:07 -04:00
chenyu
473ecb978a
remove SPLIT_REDUCEOP=1 from resnet scripts ( #4404 )
...
SPLIT_REDUCEOP=1 is default
2024-05-03 12:36:23 -04:00
David Hou
b767d59684
resnet trainer: keep old cookie around until next step has been queued ( #4401 )
...
* keep old cookie around until next step has been queued (-10ms 6gpu)
* also for eval
* drop cookie before data_get?
* Revert "drop cookie before data_get?"
This reverts commit b01e6aa2b2 .
* Revert "Revert "drop cookie before data_get?""
This reverts commit 23464e73d4 .
2024-05-03 12:15:21 -04:00
chenyu
2c3b7f8e70
pad resnet training data with training data mean ( #4369 )
...
update model_train resnet to pad training
2024-05-02 20:26:15 -04:00
Francis Lam
3cf8291f2f
mlperf/resnet: update beam params to increase time and quality ( #4396 )
...
* mlperf/resnet: update beam params to increase time and quality
* revert upcast 8 in search space and add rocm setup function
* refactor to independent setup.sh script
2024-05-02 20:14:46 -04:00
chenyu
ab01a9433d
resnet eval 4n+3 if epoch < 33 ( #4391 )
...
the rule is as thoroughly as 4n+k and we can stop the clock as soon as eval hits target. this can save 24 evals or 12 minutes
2024-05-02 16:52:07 -04:00
chenyu
7492e5d3e7
resnet correct log name for red ( #4390 )
2024-05-02 10:58:55 -04:00
chenyu
bf31837e6d
resnet correct steps_in_val_epoch in logging ( #4389 )
...
also added random seed from system in scripts
2024-05-02 10:51:36 -04:00
chenyu
22376e53b7
resnet mlperf logging ( #4361 )
...
* resnet mlperf logging
* cropping too much?
2024-05-02 00:00:04 -04:00
chenyu
ad116dc5c6
fill in mlperf system description ( #4381 )
...
it did not ask too many details. will put software versions later with tinygrad commit.
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_red.json training 4.0.0
INFO - System description checker passed for tinybox red
```
```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_green.json training 4.
0.0
INFO - System description checker passed for tinybox green
```
2024-05-01 16:47:45 -04:00
chenyu
9358b62073
rename resnet script to dev_beam.sh and dev_run.sh ( #4379 )
...
final run_and_time needs to be one script for both. rename the old scripts
2024-05-01 14:41:35 -04:00
chenyu
6628e13a5f
pad resnet eval data in model_train ( #4374 )
...
asserted if eval sample count is different from total eval file count.
2024-05-01 14:33:42 -04:00
chenyu
826cccd54d
fix mean underflow for half tensor ( #4377 )
...
* fix mean underflow for half tensor
divide only the reduce factor. added unit test and non-nan assertion in resnet training. also added a failed test cast for symbolic shape var
* skip for python backend
2024-05-01 13:38:57 -04:00
chenyu
683b7c605a
pad first batch of imagenet dataloader and update eval ( #4368 )
...
* pad first batch of imagenet dataloader and update eval
* pad zero instead of empty for training
2024-05-01 00:21:52 -04:00