tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-09 15:08:02 -05:00

Author	SHA1	Message	Date
chenyu	2471b49e45	minor bert / llama change from grad acc branch (#13622 ) * minor bert / llama change from grad acc branch * revert those	2025-12-08 16:04:14 -05:00
chenyu	b981b6f89e	remove old llama grad_acc (#13611 ) * remove old llama grad_acc * GRADIENT_ACC_STEPS=1	2025-12-07 13:03:47 -05:00
chenyu	4562f217e1	more bert updates (#13597 ) prep split jit also lower BS to 72	2025-12-06 08:32:43 -05:00
chenyu	cb4c6324ef	revert bert grad accumulation (#13596 ) prep for the new split jit style	2025-12-05 17:30:08 -05:00
chenyu	77b5e6774e	fix bert training config (#12647 ) FREE_INTERMEDIATE=0 REWRITE_STACK_LIMIT=500000	2025-10-13 15:03:47 -04:00
chenyu	28edea5d67	delete FUSE_CONV_BW (#12527 )	2025-10-08 10:41:38 -04:00
chenyu	e701106a64	remove FUSE_ARANGE (#12511 ) it was the default already	2025-10-08 04:54:07 -04:00
hooved	1e8945a28c	Training loop for Stable Diffusion mlperf (#12315 ) * add diff * fix edit error * match master * point reference to specific commit * simplify wandb logging * remove lr test, dehardcode device * increase stack size limit	2025-10-03 02:45:38 -04:00
Sieds Lykles	5b73076e48	assert benchmark times (#12042 ) * assert jitted times in openpilot * better error * better error * add ASSERT_MIN_STEP_TIME to more models * t is step_times * update benchmark times * update times	2025-09-09 23:40:02 +02:00
wozeparrot	d16cc6c012	feat: resume ckpt (#11970 )	2025-09-02 15:47:48 -07:00
wozeparrot	7c21271a5f	feat: end_lr envvar (#11953 )	2025-09-01 14:53:07 -07:00
wozeparrot	7e68045fb2	feat: small llama3 training (#11829 )	2025-08-31 13:41:47 -07:00
wozeparrot	b979162c5d	llama3 eval train (#11706 )	2025-08-20 19:56:35 -04:00
chenyu	dbd3b67657	clamp GRAD_CLIP_NORM in llama (#11761 )	2025-08-20 19:55:50 -04:00
chenyu	e9d0027591	llama MP realize weight after shard (#11672 ) * llama MP realize weight after shard prevents memory spike on device 0 * empty weight for FAKEDATA	2025-08-14 16:17:46 -04:00
chenyu	ef17af85c6	remove .float call in llama logit (#11598 ) * remove .float call in llama logit * bfloat item	2025-08-10 00:02:18 -04:00
chenyu	45baec1aab	model parallel llama (#11588 ) MP=8 GRADIENT_ACC_STEPS=3 BS=1 DEFAULT_FLOAT=bfloat16 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=70B SEQLEN=512 PYTHONPATH=. MODEL=llama3 python3 examples/mlperf/model_train.py	2025-08-09 16:54:27 -04:00
wozeparrot	2d5bdc939d	faster llama3 dataloader (#11540 )	2025-08-06 18:25:57 -04:00
chenyu	f7965f85aa	Revert "feat: faster index building (#11462 )" (#11478 ) This reverts commit `3a4deb08d2`.	2025-08-02 12:50:48 -04:00
wozeparrot	3a4deb08d2	feat: faster index building (#11462 ) * feat: faster index building * feat: correct training samples	2025-08-02 11:50:18 -04:00
chenyu	9e8e6b45ab	grad acc train llama (#11467 ) * grad acc train llama * log step time	2025-08-01 15:54:50 -04:00
chenyu	7ad7329257	data parallel train llama (#11466 )	2025-08-01 12:13:51 -04:00
George Hotz	8ff03806e8	add llama layers (#11460 ) * add llama layers * add contig bw for speed	2025-07-31 16:28:04 -07:00
wozeparrot	6252f7770e	feat: fake data (#11447 )	2025-07-30 17:18:20 -07:00
chenyu	e300451f3a	update llama3 (#11446 ) `LR=1e-4 TRAIN_ON_VAL=1 DEFAULT_FLOAT=bfloat16 FUSE_ARANGE=1 JITBEAM=2 OPTIM_DTYPE=bfloat16 LLAMA3_SIZE=1B WARMUP_STEPS=36 DECAY_STEPS=360 SEQLEN=512 PYTHONPATH=. AMD=1 AMD_LLVM=0 MODEL=llama3 python3 examples/mlperf/model_train.py` trained to 7	2025-07-30 19:34:21 -04:00
wozeparrot	5fb975351a	feat: flag for training on val (#11441 )	2025-07-30 14:29:45 -07:00
wozeparrot	825b6a2505	feat: llama3 dataloader (#11340 )	2025-07-30 13:27:55 -07:00
chenyu	c14c9a8eff	llama3 grad clip (#11003 )	2025-06-27 19:14:12 -04:00
chenyu	f2548afeb5	bert grad clipping start with const 0 (#11008 ) saved the init kernels	2025-06-27 18:02:23 -04:00
chenyu	6ab5a5cb6c	llama3 mlperf train (#10983 ) work in progress. now it can overfit small examples and vram roughly matches	2025-06-26 20:24:27 -04:00
chenyu	b70c7d3631	bert grad accumulation (#10863 ) * bert grad accumulation * realize grad	2025-06-18 12:17:07 -04:00
chenyu	075a74cf25	add global_batch_size to mlperf bert (#10852 ) global_batch_size = grad_acc_steps * batch_size. no-op change to prep grad acc for bert	2025-06-17 17:54:15 -04:00
chenyu	81e296d7b8	remove Tensor.test() in retinanet (#10770 ) test was removed	2025-06-10 22:14:57 -04:00
George Hotz	b3b43a82c4	remove Tensor.no_grad, it's meaningless now [pr] (#10556 )	2025-05-28 22:20:02 -07:00
chenyu	dc6309242d	WallTimeEvent for mlperf ci (#10506 )	2025-05-24 10:56:03 -04:00
chenyu	485e80da69	run_and_time for resnet ci (#10405 )	2025-05-18 23:39:57 -04:00
wozeparrot	1ed04f993b	move benchmark stat tracking to influxdb (#10185 )	2025-05-15 16:14:56 -07:00
chenyu	610ee79b22	cherry pick mlperf5.0 branch to master (#10089 )	2025-04-28 15:36:56 -04:00
chenyu	74c6cf8be3	lint mlperf model_train (#10038 )	2025-04-24 16:19:44 -04:00
chenyu	a25abf55e3	retinanet only call postprocess_detections with RUNMLPERF (#10017 ) during setup only need to compile `_eval_step().numpy()`	2025-04-23 20:45:38 -04:00
chenyu	a3f938dbee	remove retinanet INITMLPERF from beam script (#10011 ) it only controls logging, loading real data or not is solely controlled by RUNMLPERF	2025-04-23 14:32:54 -04:00
Francis Lata	5542aeb0e4	RetinaNet MLPerf flag updates (#10009 ) * add RUNMLPERF and update INITMLPERF usage * update scripts to use RUNMLPERF	2025-04-23 13:00:34 -04:00
George Hotz	de0504276b	pop 0 is slow [pr] (#10007 )	2025-04-23 17:00:59 +01:00
chenyu	c39128133c	retinanet green scripts (#9996 ) also removed realize in data_get and used empty for fake data. slightly bigger lr. https://wandb.ai/chenyuxyz/MLPerf-RetinaNet/runs/8skid0e8?nw=nwuserchenyuxyz	2025-04-23 08:28:03 -04:00
chenyu	fb89d9a584	retinanet eval combine output on GPUS[0] (#9966 ) eval 35 sec -> 20 sec. it was spending 13 seconds assembling output tensor on CPU backend. GPUS[0] seems to have enough memory, otherwise we can lower EVAL_BS	2025-04-22 07:43:51 -04:00
chenyu	5294c32279	dev scripts for retinanet (#9968 ) also BASE_DIR -> BASEDIR for consistency, and move wandb up a bit for more accurate timing	2025-04-21 17:54:56 -04:00
Francis Lata	defa1e77f6	get the proper dataset count (#9962 )	2025-04-21 12:11:37 -04:00
Francis Lata	d7e247f329	RetinaNet INITMLPERF support (#9950 ) * fixes to make fake data work * fix eval beam * fix merge issue	2025-04-21 10:32:05 -04:00
Francis Lata	ea4cb2c715	small cleanups (#9947 )	2025-04-20 20:33:20 -04:00
chenyu	e8024c8281	faster bert global_norm (#9901 ) tinyamd 2% faster. also updated beam params that's 2-3% faster. update mlperf doc and steps too	2025-04-15 18:24:44 -04:00

1 2 3 4

152 Commits