Commit Graph

283 Commits

Author SHA1 Message Date
Francis Lata
27ec792c19 check for CKPT when target metric is reached before saving 2025-03-02 00:41:08 -08:00
Francis Lata
3ac4ae5870 hotfix: log metric and move target metric check outside of CKPT 2025-03-01 04:31:00 -08:00
Francis Lata
974309862d update dataloader seed 2025-02-28 21:41:30 +00:00
Francis Lata
6a62ece474 minor cleanups 2025-02-28 15:43:11 +00:00
Francis Lata
074e9f742b more typing fixes 2025-02-28 15:42:11 +00:00
Francis Lata
e9d1af26b2 undo more changes 2025-02-28 15:11:17 +00:00
Francis Lata
47edcdb834 undo changes 2025-02-28 15:08:55 +00:00
Francis Lata
bdf442717c update seeding on dataloader and the start of training script 2025-02-28 14:58:28 +00:00
Francis Lata
87bfa77f4a some typing cleanups 2025-02-28 14:47:29 +00:00
Francis Lata
7cb226d757 Revert "Revert "add nan check during training""
This reverts commit b7b2943197.
2025-02-26 15:43:20 +00:00
Francis Lata
e006ae24ea Merge branch 'master' into retinanet_mlperf 2025-02-26 07:31:32 +00:00
George Hotz
3f4eb9006a test for device mismatch [pr] (#9250)
* test for device mismatch [pr]

* fix bert
2025-02-26 13:06:33 +08:00
chenyu
979e84f30e RESET_STEP in bert setup and beam (#9248)
running dev_beam might OOM without it but runs fine in a real run.
2025-02-25 19:15:10 -05:00
Francis Lata
b7b2943197 Revert "add nan check during training"
This reverts commit ddf1f0d5dd.
2025-02-25 21:43:28 +00:00
chenyu
6610ad58ab hotfix bert no shard with only one device (#9243)
`LLVM=1 BERT_SIZE="tiny" DEFAULT_FLOAT=HALF BENCHMARK=5 MODEL="bert" python3 examples/mlperf/model_train.py` runs for me with this. it should not fail with a single-device shard though
2025-02-25 09:05:11 -05:00
Francis Lata
ddf1f0d5dd add nan check during training 2025-02-25 10:53:31 +00:00
Francis Lata
8737020d75 add JIT reset support 2025-02-25 10:52:26 +00:00
Francis Lata
30d5daa121 Merge branch 'master' into retinanet_mlperf 2025-02-25 10:32:34 +00:00
chenyu
8c7be428e5 update bert BS to 78 (#9236)
fits 78 now. about 215 TFLOPS on green
2025-02-24 22:47:35 -05:00
Francis Lata
2c3417dfce Merge branch 'master' into retinanet_mlperf 2025-02-23 21:23:28 +00:00
Francis Lata
60c13c2932 update loss calculation for regression head and some cleanups 2025-02-23 21:22:33 +00:00
chenyu
2e7c2780a9 CLANG -> CPU (#9189) 2025-02-20 18:03:09 -05:00
Francis Lata
7dba815c47 fix train script 2025-02-19 20:43:02 +00:00
Francis Lata
fc36f09b1e no need to return loaded keys for resnet 2025-02-19 20:35:03 +00:00
chenyu
3b37cc898b add bert tiny config (#9177)
set with BERT_SIZE=tiny. easier to study embedding and fusion
2025-02-19 14:57:03 -05:00
Francis Lata
41378e74a6 model init, hyperparam, and data preprocessing updates 2025-02-19 18:47:06 +00:00
chenyu
975c318dbc bert use int32 for input ids (#9173)
original data was int32 for these. float might have caused precision issues
2025-02-19 08:17:27 -05:00
chenyu
ff05bff221 put bert data shard inside jit (#9160)
python time 45ms -> 9ms, it was spending time to schedule the shard

also init bert data on CLANG since it's from numpy, so we don't create the tensor on the default device and then shard onto GPUS
2025-02-18 10:36:54 -05:00
chenyu
5dc1257ce0 clean up bert fake data iterator [pr] (#9145)
reuse the same get_data_bert path in setup and real run
2025-02-17 20:03:38 -05:00
chenyu
81597ddd96 increase lr for bert (#9098)
had one run that converged better https://wandb.ai/chenyuxyz/MLPerf-BERT/runs/u66tv2hh/overview
2025-02-14 19:10:35 -05:00
Francis Lata
cfa1c2d50e hyperparameter adjustments and cleanups 2025-02-14 17:53:06 +00:00
chenyu
b58e7b1898 zero out the weight in bert init run (#9076)
`DEFAULT_FLOAT=HALF BENCHMARK=10 BS=66 EVAL_BS=6 GPUS=6 MODEL=bert python3 examples/mlperf/model_train.py` no longer OOMs. I think the buffer of randomly initialized weights caused the OOM.
2025-02-14 08:40:41 -05:00
Francis Lata
caf9b2baa2 Merge branch 'master' into retinanet_mlperf 2025-02-14 06:28:37 +00:00
chenyu
9e91898941 bert eval at the end of training (#9070)
always eval at the last epoch
2025-02-13 16:29:44 -05:00
Francis Lata
5f26692068 remove frozen layers from optimizer's params 2025-02-13 06:36:13 +00:00
Francis Lata
ff301f0be9 minor cleanups 2025-02-12 16:03:38 +00:00
Francis Lata
f61b10450e Merge branch 'master' into retinanet_mlperf 2025-02-12 15:47:05 +00:00
chenyu
7b5ac2c15e free_intermediates in bert (#9040)
also re-enable dropout and update EVAL_BS
2025-02-12 10:00:39 -05:00
Francis Lata
37aab697b8 adjust LR to be the ratio of the batch size 2025-02-07 19:46:54 +00:00
Francis Lata
041481f910 Merge branch 'master' into retinanet_mlperf 2025-02-07 15:28:29 +00:00
chenyu
a092b6395d Tuple -> tuple, List -> list [pr] (#8936) 2025-02-06 14:21:19 -05:00
Francis Lata
83a2b84f55 add validation loop to training script 2025-02-03 19:54:22 +00:00
Francis Lata
f02cce0049 remove unnecessary targets from validation dataloader 2025-02-03 19:15:30 +00:00
Francis Lata
932cf4b7f2 fix img_ids repeating its values 2025-02-02 19:21:46 +00:00
Francis Lata
17ae62d741 cleanup boxes and labels in dataloader 2025-02-02 18:51:14 +00:00
Francis Lata
594d7126d8 return validation targets in dataloader 2025-02-02 06:50:21 -08:00
Francis Lata
811893a3bd cleanup train and validation dataloader 2025-01-31 16:59:37 -08:00
Francis Lata
6d70035c22 put back parallel testing and remove img_ids Tensor from dataloader 2025-01-31 16:13:02 -08:00
Francis Lata
9938a1aabc remove optional disk tensors in dataloader 2025-01-31 09:07:39 -08:00
Francis Lata
80fa9dd731 fix issue with realized on dataloader 2025-01-31 08:31:25 -08:00