Commit Graph

116 Commits

Francis Lata
0345577032 UNet3D dataloader shared memory fix (#5465)
* create separate SharedMemory between inputs and labels

* update path check for shared mem

* clean up unit test for dataset
2024-07-13 20:26:00 -04:00
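
For context on this fix, a minimal sketch of the pattern it describes: give the inputs and the labels their own multiprocessing SharedMemory blocks instead of packing both into one region. Shapes, dtypes, and names below are illustrative, not the actual UNet3D dataloader values.

```python
from multiprocessing import shared_memory
import numpy as np

# illustrative shapes/dtypes; the real UNet3D dataloader uses its own
X_SHAPE, Y_SHAPE = (1, 1, 128, 128, 128), (1, 1, 128, 128, 128)

# one SharedMemory block per array, so inputs and labels never alias each other
shm_x = shared_memory.SharedMemory(create=True, size=int(np.prod(X_SHAPE)) * 2)  # float16 inputs
shm_y = shared_memory.SharedMemory(create=True, size=int(np.prod(Y_SHAPE)))      # uint8 labels

X = np.ndarray(X_SHAPE, dtype=np.float16, buffer=shm_x.buf)
Y = np.ndarray(Y_SHAPE, dtype=np.uint8, buffer=shm_y.buf)

# worker processes attach by name (shared_memory.SharedMemory(name=shm_x.name))
# and fill X/Y in place; the parent reads them out without extra copies

shm_x.close(); shm_x.unlink()
shm_y.close(); shm_y.unlink()
```
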
Elias Wahl
73bddc44f6 Fix fake dataloader (#5326) 2024-07-08 09:07:44 -04:00
chenyu
43c3f73fbc handcode_bert_opt.py (#5295)
similar to handcode_resnet50_opt.py, one file to check bert kernels without dataset.
2024-07-05 11:01:20 -04:00
Elias Wahl
e267f3161d Add MLLogger (#5125)
* add MLPerf logger

* eval steps

* start with step 1

* compliance for 3.1.0 and 4.0.0

* more compliance

* assert, comment and contiguous
2024-06-26 12:23:56 -04:00
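
The logger added here comes from the mlperf_logging package. A hedged sketch of how its mllog API is typically wired up; the keys and values shown are examples, not the exact events this commit emits.

```python
from mlperf_logging import mllog
from mlperf_logging.mllog import constants

mllog.config(filename="bert.log")      # compliance lines go to this file
mllogger = mllog.get_mllogger()

# one-time submission metadata
mllogger.event(key=constants.SUBMISSION_BENCHMARK, value="bert")
mllogger.event(key=constants.SUBMISSION_DIVISION, value="closed")

# timed regions: init, then the run itself, with eval events in between
mllogger.start(key=constants.INIT_START)
# ... build model and dataloaders ...
mllogger.end(key=constants.INIT_STOP)
mllogger.start(key=constants.RUN_START)
# ... training loop ...
mllogger.event(key=constants.EVAL_ACCURACY, value=0.72, metadata={"epoch_num": 1})
mllogger.end(key=constants.RUN_STOP, metadata={"status": "success"})
```
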
Elias Wahl
f31ef11537 Better default hparams for large BS (#5030)
* better default hparams for large BS

* bf16 too

* use tuple
2024-06-18 11:13:06 -04:00
Elias Wahl
7bfa9101c0 Float in scaled dot product attention (#4985)
* Monkeypatch scaled-dot-product-attention

* Use dot instead of matmul

* new api

* imports

* least_upper_dtype
2024-06-18 08:16:41 -04:00
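
A hedged sketch of the monkeypatch idea: run scaled dot product attention in float32 and cast the result back to the original dtype. The actual change also switched matmul to dot and picked the compute dtype with least_upper_dtype; this shows only the general shape of it.

```python
from tinygrad import Tensor

_orig_sdpa = Tensor.scaled_dot_product_attention

def sdpa_in_float(self, key, value, *args, **kwargs):
  # do the softmax/matmul chain in float32, then cast back to the query dtype
  out = _orig_sdpa(self.float(), key.float(), value.float(), *args, **kwargs)
  return out.cast(self.dtype)

Tensor.scaled_dot_product_attention = sdpa_in_float
```
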
Elias Wahl
d2e3c391e8 Residual in MLM loss + Change default steps (#4935)
* Residual in mlm loss

* Reduce default steps to 160K * 24

* oops

* comment
2024-06-12 16:09:18 -04:00
Nik
085c0bbf6b add mlperf train subset of openimages (#4841) 2024-06-05 10:10:11 -04:00
Elias Wahl
e576aca044 Disable dropout (#4837) 2024-06-04 18:57:26 -04:00
Elias Wahl
bb248a0dd1 Optional half matmul (#4835)
* half linear

* move weight cast back

* oops

* matmul dtype var

* todo comment
2024-06-04 17:53:41 -04:00
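
A rough sketch of an env-var-gated half matmul in the spirit of this change; the variable name and placement are illustrative, not the commit's actual knob.

```python
from tinygrad import Tensor, dtypes
from tinygrad.helpers import getenv

HALF_LINEAR = getenv("HALF_LINEAR", 0)   # hypothetical knob for this sketch

def linear(x, weight, bias=None):
  if HALF_LINEAR:
    # cast operands to half for the matmul, then return to the input dtype
    out = x.cast(dtypes.half).linear(weight.cast(dtypes.half),
                                     bias.cast(dtypes.half) if bias is not None else None)
    return out.cast(x.dtype)
  return x.linear(weight, bias)
```
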
Elias Wahl
04e237328b Refactor to class style (#4804) 2024-06-04 14:08:31 -07:00
Francis Lata
707099487a Multiprocessing UNet3D dataloader (#4801)
* testing dataloader

* matching dataloader implementation for unet3d

* remove comments

* clean up dataloader

* add cookie and cleanup

* use shm_path when creating SharedMemory

* add support for testing resnet and unet3d dataloaders

* update dataset test to return preprocessed data directory in prep for dataloader testing

* pass preprocessed dataset directory properly

* update loader function for dataloader

* add shuffling on indices

* update shm name

* more cleanup for unet3d dataloader

* remove changes to tests

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-06-02 11:30:47 -04:00
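
As a rough outline of the pattern this dataloader follows: worker processes fill a shared-memory buffer, indices (not data) get shuffled, and each yielded sample comes with a "cookie" that the consumer holds until it is done with it. Everything below is illustrative, not the actual implementation.

```python
import random
from multiprocessing import Process, Queue, shared_memory
import numpy as np

SHAPE = (16, 3, 224, 224)   # illustrative buffer capacity and sample shape

def worker(q_in, q_out, shm_name):
  shm = shared_memory.SharedMemory(name=shm_name)
  buf = np.ndarray(SHAPE, dtype=np.float32, buffer=shm.buf)
  while (idx := q_in.get()) is not None:
    buf[idx % SHAPE[0]] = idx          # stand-in for real loading + preprocessing
    q_out.put(idx)
  shm.close()

def dataloader(num_samples, num_workers=4):
  shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)) * 4)
  data = np.ndarray(SHAPE, dtype=np.float32, buffer=shm.buf)
  q_in, q_out = Queue(), Queue()
  procs = [Process(target=worker, args=(q_in, q_out, shm.name)) for _ in range(num_workers)]
  for p in procs: p.start()
  indices = list(range(num_samples))
  random.shuffle(indices)              # shuffle on indices, not on the data itself
  for i in indices: q_in.put(i)
  for _ in range(num_samples):
    idx = q_out.get()
    # yield a view plus a "cookie" (here just idx); in the real loader, dropping
    # the cookie is what signals that the shared slot can be reused
    yield data[idx % SHAPE[0]], idx
  for _ in procs: q_in.put(None)
  for p in procs: p.join()
  shm.close(); shm.unlink()
```
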
Elias Wahl
c4b0acf095 Global norm + small changes (#4749)
* norm

* no empty

* default loss scaler in float
2024-05-27 18:35:27 -04:00
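
A minimal sketch of the global-norm part, assuming it means rescaling gradients by one norm computed across all of them (keeping the loss scaler in float is a separate dtype choice):

```python
from tinygrad import Tensor

def clip_by_global_norm(grads, max_norm=1.0):
  # one norm across every gradient tensor, then a single shared scale factor
  global_norm = sum(g.float().square().sum() for g in grads).sqrt()
  scale = max_norm / global_norm.maximum(max_norm)   # 1.0 whenever norm <= max_norm
  return [g * scale for g in grads]

grads = [Tensor.rand(10, 10), Tensor.rand(10)]
clipped = clip_by_global_norm(grads)
```
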
Elias Wahl
acc0039cfc Resume fix + scheduler for non weight decay params (#4679)
* move ckpt dir

* fix resume. Add scheduler group
2024-05-21 19:38:13 -04:00
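
A rough sketch of the "scheduler for non weight decay params" idea: split parameters into decay and no-decay groups, give each its own optimizer, and attach a scheduler to each. The group-selection rule and optimizer kwargs below are assumptions and may not match the bert script exactly.

```python
from tinygrad.nn.optim import LAMB, OptimizerGroup
from tinygrad.nn.state import get_state_dict

def build_optimizer(model, lr=1e-3, wd=0.01):
  params = get_state_dict(model)
  # bias and norm parameters conventionally get no weight decay
  no_decay_names = [n for n in params if "bias" in n or "norm" in n.lower()]
  decay    = [p for n, p in params.items() if n not in no_decay_names]
  no_decay = [p for n, p in params.items() if n in no_decay_names]
  opt_decay    = LAMB(decay, lr=lr, weight_decay=wd)
  opt_no_decay = LAMB(no_decay, lr=lr, weight_decay=0.0)
  # in the training script each optimizer also gets its own LR scheduler,
  # so the no-decay group follows the same warmup/decay curve
  return OptimizerGroup(opt_decay, opt_no_decay)
```
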
Elias Wahl
993091adfa loss scaler + nan fixes (#4661) 2024-05-20 17:08:35 -04:00
chenyu
bed70b130c mlperf bert getenv-able EVAL_STEP_FREQ (#4534) 2024-05-11 14:36:56 -04:00
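
Making a knob getenv-able just means reading it through tinygrad's getenv helper; a one-line sketch (the default shown is arbitrary):

```python
from tinygrad.helpers import getenv

# override from the shell, e.g. EVAL_STEP_FREQ=1000 python3 examples/mlperf/model_train.py
EVAL_STEP_FREQ = getenv("EVAL_STEP_FREQ", 1000)
```
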
chenyu
04a4980a51 touchup bert script (#4531)
small adjustments: remove a duplicated training setting and stop the script once the target is hit
2024-05-11 13:02:02 -04:00
chenyu
b00b6b16f0 fix TRAIN_BEAM and Tensor.training for mlperf bert (#4525)
also hard-coded the bert model config instead of looking it up from a file
2024-05-11 00:18:36 -04:00
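
For reference, Tensor.training is a global flag in tinygrad that training-only ops such as dropout key off; the fix is about setting it correctly around TRAIN_BEAM. A tiny illustration of the flag itself:

```python
from tinygrad import Tensor

x = Tensor.rand(4, 8)
Tensor.training = True       # dropout only does anything while this is set
y_train = x.dropout(0.5)
Tensor.training = False      # eval (and BEAM warm-up) paths run with it cleared
y_eval = x.dropout(0.5)      # no-op here
```
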
chenyu
b399d98e41 fix resnet eval (#4507) 2024-05-10 00:49:00 -04:00
wozeparrot
a602dc67d3 feat: more mlperf fixes (#4505) 2024-05-09 20:50:20 -07:00
chenyu
0e8aa0e288 use fake data in beam searching resnet (#4504) 2024-05-09 23:43:50 -04:00
wozeparrot
29daea4e60 fix: core count and os (#4503) 2024-05-09 19:55:07 -07:00
chenyu
ef93e41a15 resnet mlperf systems add tinygrad commit and python / runtime versions (#4494) 2024-05-09 16:04:15 -04:00
chenyu
b5afdfbc5b first draft resnet mlperf readme (#4493)
* start readme

* something
2024-05-09 15:51:44 -04:00
chenyu
047c7f3e5b polish resnet mlperf logging (#4490)
don't include the final checkpoint save time in the run time, plus some cosmetic ordering changes
2024-05-09 13:04:24 -04:00
chenyu
d78e159aa3 resnet logging move RUN_START to start of the script (#4488) 2024-05-09 12:32:32 -04:00
chenyu
1bcb58479d resnet setup power cap red box gpu to 350W (#4484)
1%-2% faster
2024-05-08 23:32:41 -04:00
chenyu
0ed755bcf5 resnet use EVAL_BS=192 (#4482)
* resnet use EVAL_BS=192

also lower green run BEAM_MIN_PROGRESS from 10 to 5

* BEAM_MIN_PROGRESS 5 is too close to setup limit
2024-05-08 22:29:27 -04:00
chenyu
1f6bf9d2f7 real diskcache_clear in model_train resnet (#4445)
clear the cache if INITMLPERF is set or when running run_and_time; dev_beam and dev_run do not clear the cache
2024-05-08 19:06:09 -04:00
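
The cache in question is tinygrad's on-disk compile/BEAM cache; a hedged sketch of gating the clear the way the commit message describes:

```python
from tinygrad.helpers import getenv, diskcache_clear

if getenv("INITMLPERF"):
  # compliance runs (run_and_time) start from a cold cache;
  # dev_beam / dev_run skip this so iteration stays fast
  diskcache_clear()
```
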
chenyu
1b4645bea6 hotfix resnet move init_start to start of the script (#4481) 2024-05-08 19:03:52 -04:00
wozeparrot
a347ae94d6 feat: remove wandb (#4480) 2024-05-08 15:31:16 -07:00
chenyu
db7e15c46f hotfix resnet only log epoch start with RUNMLPERF (#4477) 2024-05-08 15:14:41 -04:00
chenyu
062c6dd65d mlperf logging, truncate dir in logs and log seed (#4475) 2024-05-08 12:54:02 -04:00
chenyu
b62a65b617 redo faster sparse_categorical_crossentropy (#4461)
also update the default LR and DECAY for resnet, which helps convergence
2024-05-08 11:21:43 -04:00
wozeparrot
603d3a351b feat: allow keeping multiple cookies (#4440) 2024-05-05 19:26:48 -07:00
Francis Lam
709410071c mlperf/resnet: updated BEAM params to increase performance (#4443) 2024-05-05 21:49:46 -04:00
chenyu
3b30756cbb update mlperf submission system (#4435)
more required fields.
2024-05-05 13:19:07 -04:00
chenyu
473ecb978a remove SPLIT_REDUCEOP=1 from resnet scripts (#4404)
SPLIT_REDUCEOP=1 is the default
2024-05-03 12:36:23 -04:00
David Hou
b767d59684 resnet trainer: keep old cookie around until next step has been queued (#4401)
* keep old cookie around until next step has been queued (-10ms 6gpu)

* also for eval

* drop cookie before data_get?

* Revert "drop cookie before data_get?"

This reverts commit b01e6aa2b2.

* Revert "Revert "drop cookie before data_get?""

This reverts commit 23464e73d4.
2024-05-03 12:15:21 -04:00
chenyu
2c3b7f8e70 pad resnet training data with training data mean (#4369)
update model_train resnet to pad the training data
2024-05-02 20:26:15 -04:00
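
A small illustration of the idea: pad a short batch with the training-set channel means instead of zeros, so the padded samples normalize to roughly zero. The mean values here are the common ImageNet ones, used only as an example.

```python
import numpy as np

IMAGENET_MEAN = np.array([123.68, 116.78, 103.94], dtype=np.float32)  # example values

def pad_batch(x, bs):
  # x: (n, 3, H, W) with n < bs; fill the missing samples with the channel means
  pad_shape = (bs - x.shape[0], *x.shape[1:])
  pad = np.broadcast_to(IMAGENET_MEAN.reshape(1, 3, 1, 1), pad_shape).astype(x.dtype)
  return np.concatenate([x, pad], axis=0)

batch = pad_batch(np.zeros((5, 3, 224, 224), dtype=np.float32), 8)   # -> (8, 3, 224, 224)
```
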
Francis Lam
3cf8291f2f mlperf/resnet: update beam params to increase time and quality (#4396)
* mlperf/resnet: update beam params to increase time and quality

* revert upcast 8 in search space and add rocm setup function

* refactor to independent setup.sh script
2024-05-02 20:14:46 -04:00
chenyu
ab01a9433d resnet eval 4n+3 if epoch < 33 (#4391)
the rule only requires evaluating as often as every 4n+k epochs, and we can stop the clock as soon as an eval hits the target. this can save 24 evals, or about 12 minutes
2024-05-02 16:52:07 -04:00
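
Read that way, the eval gate is roughly the following, with 33 being where the target is expected to start hitting; treat this as an interpretation of the commit title, not the script verbatim.

```python
def should_eval(epoch):
  # before epoch 33, only the 4n+3 epochs are evaluated, which the rule allows;
  # from 33 on, evaluate every epoch so the clock stops as soon as the target is hit
  return epoch % 4 == 3 or epoch >= 33
```
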
chenyu
7492e5d3e7 resnet correct log name for red (#4390) 2024-05-02 10:58:55 -04:00
chenyu
bf31837e6d resnet correct steps_in_val_epoch in logging (#4389)
also added random seed from system in scripts
2024-05-02 10:51:36 -04:00
chenyu
22376e53b7 resnet mlperf logging (#4361)
* resnet mlperf logging

* cropping too much?
2024-05-02 00:00:04 -04:00
chenyu
ad116dc5c6 fill in mlperf system description (#4381)
it does not ask for many details. software versions will be added later along with the tinygrad commit.

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_red.json training 4.0.0
INFO -   System description checker passed for tinybox red
```

```
python3 -m mlperf_logging.system_desc_checker examples/mlperf/training_submission_v4.0/tinycorp/systems/tinybox_green.json training 4.0.0
INFO -   System description checker passed for tinybox green
```
2024-05-01 16:47:45 -04:00
chenyu
9358b62073 rename resnet script to dev_beam.sh and dev_run.sh (#4379)
final run_and_time needs to be one script for both. rename the old scripts
2024-05-01 14:41:35 -04:00
chenyu
6628e13a5f pad resnet eval data in model_train (#4374)
assert if the eval sample count differs from the total eval file count.
2024-05-01 14:33:42 -04:00
chenyu
826cccd54d fix mean underflow for half tensor (#4377)
* fix mean underflow for half tensor

divide only by the reduce factor. added a unit test and a non-NaN assertion in resnet training. also added a failing test case for symbolic shape vars

* skip for python backend
2024-05-01 13:38:57 -04:00
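
A rough numpy illustration of the underflow itself: if the mean's scale factor is formed as one tiny reciprocal in half precision, it flushes to zero (float16's smallest subnormal is about 6e-8), which is the kind of failure that dividing only by the reduce factor in a wider dtype avoids.

```python
import numpy as np

n = 32 * 1024 * 1024                 # elements being averaged in a large half tensor
print(np.float16(1.0 / n))           # 0.0 -- 1/33554432 ~ 3e-8 underflows in float16
print(np.float32(1.0 / n))           # ~2.98e-08, fine when the factor stays in a wider dtype
```
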
chenyu
683b7c605a pad first batch of imagenet dataloader and update eval (#4368)
* pad first batch of imagenet dataloader and update eval

* pad zero instead of empty for training
2024-05-01 00:21:52 -04:00