Commit Graph

10633 Commits

Author · SHA1 · Message · Date
Timmy
664b563c91 Add insert_before to Linearizer Functions (#4320)
* adding insert_before to linearizer functions

* uop insert_before test case

* formatting

* more formatting

* more formatting

* syntax

* removing self.cast

* addressing err

* removing noqa s
2024-04-28 11:38:36 -04:00
qazal
3372bea322 reduce children fusion tests (#4321)
* base tests

* real-world tests
2024-04-28 11:14:02 -04:00
Arnav Mehta
f3de17912f added the download if not present missing function (#4318) 2024-04-28 16:31:08 +08:00
geohotstan
bc36940c28 fix (#4319) 2024-04-28 16:29:04 +08:00
nimlgen
8d1649d8c2 raise error when too many resources requested in nv (#4324) 2024-04-27 23:48:51 +03:00
qazal
c6c12ba94a save schedule graph pre validation (#4317) 2024-04-27 12:06:15 +03:00
Victor Ziliang Peng
40264c7d1e Update index.md (#4315) 2024-04-27 15:12:44 +08:00
chenyu
24a6342950 add mem/s to external_benchmark_resnet (#4309) 2024-04-26 20:07:17 -04:00
Francis Lam
1f2642c73b kernel: fix calculation of smem size to ignore UNROLL (#4308)
* kernel: fix calculation of smem size to ignore UNROLL

* simplify prod array
2024-04-26 14:34:56 -04:00
Szymon Ożóg
de832d26c6 disable bfloat16 from ptx tests (#4305) 2024-04-26 01:20:10 -04:00
chenyu
ec65aea32f resnet stop the script once hit target (#4303)
* resnet stop the script once hit target

* comment
2024-04-25 23:54:56 -04:00
chenyu
1891ebb655 make ring allreduce chunks a multiple of 2^n if possible (#4302)
in resnet, instead of chunking as [43691, 43691, 43691, 43691, 43690, 43690], chunk as [43712, 43712, 43680, 43680, 43680, 43680] so each chunk is a multiple of 32 and the kernels can use a local size of 32.

more than 2X faster for the applicable kernels, and about 1% faster overall for resnet
2024-04-25 23:45:28 -04:00
George Hotz
1e37c4a7a1 minor llm.c improvements 2024-04-26 11:15:31 +08:00
chenyu
3ec4b745d6 JIT=2 for mac cifar benchmark (#4300)
also double BS for resnet training benchmark to match submission target
2024-04-25 18:33:40 -04:00
David Hou
c2dbe2a78b new split reduce heuristic try 2 (#4294)
* new split reduce heuristic

* update comment

* rename

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-25 18:14:15 -04:00
Szymon Ożóg
f1ebcffb87 Ptx beam fix (#4296)
* Fix beam search for PTX

* fix ptr arm test
2024-04-25 15:39:39 -04:00
chenyu
f9a7badace use LR=7 for resnet with BS=1536 (#4299)
had 3 runs after the float32 lr change; seems quite stable, converging at epochs 34 and 35
2024-04-25 15:23:10 -04:00
qazal
9a47ed0705 test crossing diamond assigns (#4298) 2024-04-25 21:52:05 +03:00
chenyu
5ae252ae83 use at least float32 for optim.lr (#4297)
* use at least float32 for optim.lr

when doing mixed precision training (float32 weights, default_float=half), still use float32 to store lr.
it would have been upcast later in the actual weight update anyway, but precision would already have been lost.
this improved resnet convergence significantly

* undo type annotation
2024-04-25 14:42:28 -04:00
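A minimal illustration of the precision point in #4297 (not tinygrad code; the learning-rate value is an arbitrary example): float16 has a 10-bit mantissa, so storing a small lr in half rounds it before the weight update ever sees it, even though the update itself upcasts.

```python
import numpy as np

lr32 = np.float32(0.0008533)     # lr kept in float32
lr16 = np.float16(lr32)          # what storing lr in half would keep
rel_err = abs(float(lr16) - float(lr32)) / float(lr32)
print(rel_err)                   # nonzero: precision is lost at storage time
```

The relative error is bounded by half an fp16 ulp (2**-11 for normal values), which is small per step but systematic across an entire training run.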
David Hou
6f792b727b More improvements for resnet layer bench (#4272)
* fix first layer size, new schedule stuff

* estimates

* get different conv layers

* \r for estimated times

* E501

* space after comma
2024-04-25 12:40:49 -04:00
David Hou
ac9464f47a allow specify number of beam workers (#4292) 2024-04-25 10:44:43 -04:00
qazal
74a1be88f5 test reduce graph permutations (#4291) 2024-04-25 11:34:44 +03:00
George Hotz
0f0627bc60 add mnist tutorial 2024-04-25 16:08:32 +08:00
chenyu
d31e220cbf add mlperf-logging to setup.py mlperf (#4289) 2024-04-24 23:34:34 -04:00
nimlgen
6b8a85939d fix lds size for amd (#4287) 2024-04-24 22:54:42 +03:00
chenyu
c11bad766d prepare mlperf submission (#4270)
* prepare mlperf submission

* 28min compile and 3h53m

* red 30 minute compile and 56 TFLOPS
2024-04-24 13:19:31 -04:00
Szymon Ożóg
c606a0ba6f Docs link fix (#4286)
* Update quickstart.md

* Update README.md

* Update quickstart.md

* Update README.md
2024-04-24 12:54:43 -04:00
chenyu
c1fbacb182 resnet benchmarks use DEFAULT_FLOAT=HALF (#4285)
also update the LR default to scale based on BS=1536 (the batch size we are submitting)
2024-04-24 12:10:57 -04:00
Szymon Ożóg
002a14088e Ptx store gate cast to bool (#4284)
* Cast gate to bool

* Update

* Add PTX fuzzing to benchmark
2024-04-24 11:43:44 -04:00
George Hotz
dbe3e1d548 or true fixes ci (#4283)
* or true fixes ci

* all with two pipes
2024-04-24 20:48:26 +08:00
qazal
53853e6d08 save the schedule graph in SAVE_SCHEDULE (#4248)
* save the schedule graph with assigns

* extend graph
2024-04-24 12:08:51 +03:00
George Hotz
acb32e1766 hotfix: PM4 supports timing 2024-04-24 08:38:59 +00:00
George Hotz
ad28fdecb1 si.inputs+outputs -> bufs (#4279) 2024-04-24 15:12:34 +08:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing beam + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
George Hotz
38f97aa0fe rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
George Hotz
60e3aa5cb1 more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
chenyu
6637ecc5fe use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want different BEAM values for resnet train and eval; the global JITBEAM cannot do this. Added the flag to change beam behavior at cnt=0 (so by default it behaves the same with or without TinyJit); for cnt=1 it uses the existing BEAM.value.

Also moved the BEAM context var in resnet outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
nimlgen
f3b4dff7c9 KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
geohotstan
17328ded7d setitem no return value (#4266)
* no ret value and just force contiguous

* ok revert contiguous stuff

* actually do force it contiguous

* revert again lol

* add simple regression test

* add assert for MLB

* guess we're contiguous everything from now on

* lol ugly af empty return...

* don't change order cuz i don't get disk
2024-04-23 16:28:14 -04:00
Elias Wahl
3a48773f1a BERT dataloader (#4252)
* add dataloader

* comment
2024-04-23 13:44:49 -04:00
Elias Wahl
69341144ba Wikipedia preprocessing script (#4229)
* Preprocessing script

* short seq prob

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
chenyu
759b4f41c3 few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Szymon Ożóg
6c25f1abf7 Optimize ptx loops (#4263)
* Optimize PTX loops

* Update assembly.py
2024-04-23 12:20:14 +04:00
George Hotz
967638f0d5 update docs, remove corealize (#4264)
* update docs, remove corealize

* handle 0 line count

* tensor schedule
2024-04-23 12:05:29 +04:00
George Hotz
9b7efa72ea hotfix: skip 0 line count files in sz.py 2024-04-23 11:56:03 +04:00
George Hotz
acf4ba5c9f method cache respects beam option (#4261)
* method cache respects beam option

* cleanup get_runner
2024-04-23 09:00:41 +04:00
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272 WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not need

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* not need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
Francis Lam
3f6c7ca8bf test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
a90de3b574 search: add additional 7 factors to the action space (#4256)
also bump the DB version after the padded TC merge
2024-04-22 19:14:23 -04:00