Commit Graph

4201 Commits

George Hotz
ad28fdecb1 si.inputs+outputs -> bufs (#4279) 2024-04-24 15:12:34 +08:00
chenyu
8401de9922 resnet benchmark return early in eval (#4278)
only do a few eval steps to compile, and skip the second epoch when doing BEAM + benchmark. saves 2 minutes
2024-04-24 00:55:01 -04:00
George Hotz
38f97aa0fe rename rawbufs to bufs in ExecItem (#4274) 2024-04-24 11:27:27 +08:00
George Hotz
60e3aa5cb1 more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
chenyu
6637ecc5fe use IGNORE_JIT_FIRST_BEAM to not BEAM in jit cnt=0 (#4269)
we want different BEAM values for resnet train and eval; the global JITBEAM cannot do this. Added the flag to change BEAM behavior at cnt=0 (so by default it behaves the same with or without TinyJit), while cnt=1 keeps using the existing BEAM.value.

Also updated the context var BEAM in resnet to be outside of TinyJit. saves about 3 minutes of compile time
2024-04-23 18:59:43 -04:00
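A minimal sketch of the cnt=0/cnt=1 dispatch rule this commit describes; the function and its names are illustrative, not tinygrad's actual API:

```python
# illustrative model of the new rule: at JIT cnt=0 (the compile invocation),
# IGNORE_JIT_FIRST_BEAM suppresses BEAM; from cnt=1 on, BEAM.value applies
def beam_for(cnt: int, beam_value: int, ignore_jit_first_beam: bool) -> int:
    if cnt == 0 and ignore_jit_first_beam:
        return 0          # no BEAM search on the first (compile) invocation
    return beam_value     # existing BEAM.value otherwise

assert beam_for(0, 4, True) == 0
assert beam_for(1, 4, True) == 4
```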
nimlgen
f3b4dff7c9 KFDProgram -> AMDProgram (#4268) 2024-04-24 00:29:50 +03:00
geohotstan
17328ded7d setitem no return value (#4266)
* no ret value and just force contiguous

* ok revert contiguous stuff

* actually do force it contiguous

* revert again lol

* add simple regression test

* add assert for MLB

* guess we're making everything contiguous from now on

* lol ugly af empty return...

* don't change the order because I don't understand disk
2024-04-23 16:28:14 -04:00
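For context on the "empty return" bullet: Python itself discards whatever `__setitem__` returns, so an assignment statement can never hand back a value. A tiny standalone illustration:

```python
class Buf:
    def __init__(self): self.data = [0] * 4
    def __setitem__(self, i, v):
        self.data[i] = v
        # no return value: `b[i] = v` is a statement, and Python ignores
        # anything __setitem__ would return anyway

b = Buf()
b[1] = 7
assert b.data == [0, 7, 0, 0]
```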
Elias Wahl
3a48773f1a BERT dataloader (#4252)
* add dataloader

* comment
2024-04-23 13:44:49 -04:00
Elias Wahl
69341144ba Wikipedia preprocessing script (#4229)
* Preprocessing script

* short sequence probability

* comments + env vars

* Add preprocessing reference. Add test

* lint fix + add eval test support

* whitespaces

* point to commit

* comment

* rename

* better comments
2024-04-23 10:28:01 -04:00
chenyu
759b4f41c3 few more KFD -> AMD (#4262)
benchmark gemm and default_parallel
2024-04-23 10:15:37 -04:00
Szymon Ożóg
6c25f1abf7 Optimize ptx loops (#4263)
* Optimize PTX loops

* Update assembly.py
2024-04-23 12:20:14 +04:00
George Hotz
967638f0d5 update docs, remove corealize (#4264)
* update docs, remove corealize

* handle 0 line count

* tensor schedule
2024-04-23 12:05:29 +04:00
George Hotz
9b7efa72ea hotfix: skip 0 line count files in sz.py 2024-04-23 11:56:03 +04:00
George Hotz
acf4ba5c9f method cache respects beam option (#4261)
* method cache respects beam option

* cleanup get_runner
2024-04-23 09:00:41 +04:00
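A hypothetical sketch of the fix's shape: make the BEAM setting part of the cache key, so a kernel compiled without search is not reused when BEAM is on (`compile_fn` is a stand-in, not tinygrad's function):

```python
method_cache: dict = {}

def get_runner(device: str, ast, beam: int, compile_fn):
    # compile_fn stands in for the real compiler entry point
    key = (device, ast, beam)            # BEAM setting is now part of the key
    if key not in method_cache:
        method_cache[key] = compile_fn(device, ast, beam)
    return method_cache[key]
```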
George Hotz
9a95781d51 renamed (#4260) 2024-04-23 09:00:28 +04:00
George Hotz
2ae4f45272 WIP PM4 Support (#4110)
* pm4 kernel launch works

* disable USE_THREAD_DIMENSIONS

* add kernel code

* work on real pm4

* pm4 signal

* same

* gate pm4

* hcq tests pass

* ops passes

* pm4 is closer

* pm4 debug (#4165)

* start debug tests passing

* prg

* smth

* hdp flush

* cleaner 1

* do not need this

* logs not needed

* small things

* linter

* remove AQL

* test hcq

* fix tests

* it's subtracting, it shouldn't be -1

* pm4 changes (#4251)

* don't need this anymore

* sdma signal with non atomic

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-04-23 08:31:27 +04:00
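For background, PM4 is the packet format AMD's command processor consumes. A sketch of a type-3 packet header following the public PM4 layout (packet type in bits 31:30, dword count minus one in 29:16, IT opcode in 15:8); tinygrad's actual helpers may differ:

```python
def pm4_type3_header(opcode: int, ndwords: int) -> int:
    # type-3 packet: 2-bit type, 14-bit (count - 1), 8-bit IT opcode
    return (3 << 30) | (((ndwords - 1) & 0x3FFF) << 16) | ((opcode & 0xFF) << 8)

PACKET3_NOP = 0x10   # opcode value as in the public AMD PM4 headers
hdr = pm4_type3_header(PACKET3_NOP, 1)
```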
Francis Lam
3f6c7ca8bf test: fix test_tensor_core_padded on CUDA and add to benchmarks (#4258)
* test: fix test_tensor_core_padded on CUDA and add to benchmarks

* fix linter

* run both tests in one call
2024-04-22 23:22:11 -04:00
Francis Lam
a90de3b574 search: add additional 7 factors to the action space (#4256)
also bump the DB version after the padded TC merge
2024-04-22 19:14:23 -04:00
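For context, the "action space" is the fixed set of kernel-optimization candidates BEAM search may try; adding factors extends that set. A hypothetical sketch (axis counts, names, and values below are illustrative):

```python
from itertools import product

AXES = range(5)              # illustrative axis count
AMOUNTS = [2, 3, 4, 8, 16]   # adding factors means appending to this list
actions = [("UPCAST", ax, amt) for ax, amt in product(AXES, AMOUNTS)]
```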
chenyu
de2b1fb468 update adding_new_accelerators doc (#4255)
mlops -> function, and removed some old ops
2024-04-22 18:50:19 -04:00
Francis Lam
bbb0ad4800 wmma: widen TC usage in search by using PADTO on TC axes when possible (#4216)
* wmma: widen TC usage in search by using PADTO on TC axes when possible

* test: start tests for the new padding TC behavior

* search: upgrade padded TC search to TC_OPT >= 2

* test: add behavior and correctness test for padded TC

added optional argument to apply_tensor_core to set TC_OPT level

* linearizer: add tests for the PADTO behavior and docs
2024-04-22 16:50:31 -04:00
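The core idea of padding a TC axis, as a sketch: round the axis up to the next multiple of the tensor-core tile dimension so the WMMA shape always divides it (pure illustration, not the linearizer's code):

```python
def pad_to(dim: int, tc_dim: int) -> int:
    # round dim up to the next multiple of tc_dim
    return ((dim + tc_dim - 1) // tc_dim) * tc_dim

assert pad_to(100, 16) == 112   # a 100-wide axis padded for a 16-wide tile
```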
George Hotz
9e53d6cffa hotfix: 8000 lines 2024-04-22 20:58:16 +04:00
nimlgen
e6227bdb15 nv driver (#4044)
* start

* fix err 93

* gpu

* ioctl mappings

* alloc like cuda

* semaphores

* wait for semaphores value

* start ops_nv

* very simple kernels work

* init several gpus

* qmd dumper

* dirty, but most kernels work

* always run all test_ops

* progress, more tests, stable

* test_ops passes, gpt2 works

but with a big FIFO, wrap-around of the FIFO doesn't work; I think it's something coherency related

* need better sync

* fix sync

* alloc2

* all tests pass!

* cleanup 1

* cleanup

* multigpu, simple transfer

* fix sync

* correct init

* nv_gpu autogen + sync bug fix

* clean extra/nv_gpu_driver

* p2p

* clean up

* remove old gen

* small fixes

* cleanup

* cleanup 2

* small fixes

* bigger queue size

* cleanups

* wait

* fixed signals for devs

* fix hang + parallel beam

* small fixes

* detect when local memory is big in kernel

* correct assert

* small fixes

* correct TLS size estimate

* one va space

* less lines

* shorter

* save 2 lines

* save some lines

* remove type ignores

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-22 19:50:20 +04:00
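The pattern behind this PR, sketched: a userspace driver talks to the NVIDIA kernel module directly via ioctls on the control device instead of going through CUDA. The request number and parameter struct below are placeholders, not the real interface:

```python
import ctypes, fcntl, os

NV_REQUEST = 0xC0104600                 # placeholder ioctl request number

fd = os.open("/dev/nvidiactl", os.O_RDWR)
arg = ctypes.create_string_buffer(16)   # placeholder parameter struct
fcntl.ioctl(fd, NV_REQUEST, arg)        # kernel module fills in the result
os.close(fd)
```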
qazal
77a3780005 assert reduce recompute (#4250) 2024-04-22 16:12:39 +03:00
qazal
a9bc7c1c49 unify assign tests (#4247) 2024-04-22 11:01:15 +03:00
chenyu
37f8be6450 resnet print epoch ops and mem in benchmark (#4244)
* resnet print epoch ops and mem in benchmark

also added a flag to optionally disable resetting jitted steps

* real per epoch stats
2024-04-21 18:32:31 -04:00
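A sketch of the per-epoch accounting, assuming tinygrad's `GlobalCounters`; the step function and loop are stand-ins:

```python
from tinygrad.helpers import GlobalCounters

def run_epoch(train_step, steps_per_epoch: int) -> None:
    # train_step stands in for the jitted training step
    epoch_ops = epoch_mem = 0
    for _ in range(steps_per_epoch):
        GlobalCounters.reset()
        train_step()
        epoch_ops += GlobalCounters.global_ops
        epoch_mem += GlobalCounters.global_mem
    print(f"epoch: {epoch_ops / 1e9:.0f} GOPS, {epoch_mem / 1e9:.2f} GB")
```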
Micah Zoltu
7bc862767c Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
wozeparrot
4c99d49c4d some docstrings (#4201)
* feat: create and data access docstrings

* fix: linter

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-21 16:34:08 +04:00
chenyu
30fc1ad415 remove TODO: remove explicit dtypes after broadcast fix in stable_diffusion (#4241)
this is done
2024-04-21 00:31:24 -04:00
chenyu
a1940ced77 remove the assign hack in whisper (#4240)
no longer needed, the commented test case was removed too
2024-04-20 23:56:44 -04:00
chenyu
3f126c7664 fix examples vits / conversation.py (#4239)
it was passing a const numpy array into Tensor.arange
2024-04-20 23:29:12 -04:00
chenyu
31c9d9a228 fix test_linearizer tc opt tests for bf16 (#4237)
bf16 TC needs a larger rtol
2024-04-20 11:51:50 -04:00
chenyu
f1d9d0a151 cleanup external_test_opt (#4234)
no more OPT=2 or OPT=3; check the exact number of kernels; enabled tests now that fusion works
2024-04-20 04:00:08 -04:00
David Hou
dc4b1af09c more realistic edge behavior for resnet benchmark (#4231)
* more realistic edge behavior for resnet benchmark

* schedule_step

* realize all parameters ahead of time

* don't save setup and misc schedules
2024-04-19 20:07:46 -04:00
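"realize all parameters ahead of time", as a minimal sketch using tinygrad's `nn.state.get_parameters`:

```python
from tinygrad.nn.state import get_parameters

def realize_params(model) -> None:
    # force every weight to materialize before the timed region starts,
    # so setup work doesn't leak into the benchmark numbers
    for p in get_parameters(model):
        p.realize()
```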
David Hou
f6eea03749 SAVE_SCHEDULE as contextvar (#4230) 2024-04-19 18:51:57 -04:00
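The ContextVar pattern this uses, assuming tinygrad.helpers (the definition is shown here for illustration; the real one lives in the library):

```python
from tinygrad.helpers import Context, ContextVar

SAVE_SCHEDULE = ContextVar("SAVE_SCHEDULE", 0)  # also settable via env var

with Context(SAVE_SCHEDULE=1):
    pass  # schedules built in this scope are kept for later inspection
```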
qazal
2094b3b327 graph ScheduleItems (#4224)
* graph schedules

* add logging

* inplace

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-19 16:17:11 +04:00
George Hotz
cd88afc98b datasets isn't a feature + filter docstrings (#4228)
* datasets isn't a feature

* filter docstrings in sz
2024-04-19 16:16:10 +04:00
George Hotz
b9570d6100 clean up update stats (#4226)
* WIP: clean up update stats

* line savings now

* fix graphs

* fix tests

* tighter prints

* remove extra jit=false

* debug=2 means wait

* that won't update stats

* still wait
2024-04-19 15:41:30 +04:00
qazal
1c87e5dbf6 fuzz schedule context vars (#4223)
* fuzz schedule context vars

* fuzz unique toposorts

* merge ground truth with the rest

* Revert "merge ground truth with the rest"

This reverts commit 1f3463bb57.

* readability

* can override
2024-04-19 13:16:25 +03:00
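"fuzz unique toposorts", sketched: run the same schedule DAG in several valid topological orders and check the outputs agree. A randomized Kahn's algorithm over a dependency dict (illustrative, not the fuzzer's code):

```python
import random

def random_toposort(deps: dict[str, set[str]]) -> list[str]:
    # deps maps node -> set of nodes it depends on
    indeg = {n: len(d) for n, d in deps.items()}
    ready = [n for n, d in indeg.items() if d == 0]
    order = []
    while ready:
        n = ready.pop(random.randrange(len(ready)))   # random valid choice
        order.append(n)
        for m, d in deps.items():
            if n in d:
                indeg[m] -= 1
                if indeg[m] == 0:
                    ready.append(m)
    return order

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
assert random_toposort(deps)[0] == "a" and random_toposort(deps)[-1] == "d"
```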
George Hotz
d99b512084 llm.c timing (#4219)
* add timing info

* fix malloc

* 8s with beam
2024-04-19 12:43:21 +04:00
qazal
43841a32b7 Merge pull request #4222 from Qazalin/fuzz-multi0
Tunable multi output fusion
2024-04-19 08:07:45 +03:00
qazal
b2fe3884fc Merge branch 'master' into fuzz-multi0 2024-04-19 07:56:26 +03:00
qazal
abb10c83cd tunable multi output fusion 2024-04-19 07:44:31 +03:00
chenyu
a1133beb80 KFD GEMM (#4221)
added to benchmark CI and fixed duplicated filenames between cuda and ptx
2024-04-19 00:43:18 -04:00
chenyu
3f3af0fb85 test_linearizer_failures 29 passes now (#4215)
TC + PADTO fixed
2024-04-18 19:49:23 -04:00
Elias Wahl
2ecd61e3e2 monkey patching (#4214) 2024-04-18 19:20:52 -04:00
Francis Lam
126826afc8 linearizer: refactor to define accs with potentially TC-modified idxs (#4211) 2024-04-18 15:31:06 -04:00
George Hotz
39b60a25f0 more llm c work (#4207)
* more llm c work

* print nicely

* fake load pretrained

* select warmups

* output c code
2024-04-18 22:20:44 +04:00
chenyu
f7416916df update resnet hparams based on BS=1632 RCP (#4210)
https://github.com/mlcommons/logging/blob/master/mlperf_logging/rcp_checker/training_4.0.0/rcps_resnet.json
2024-04-18 12:01:46 -04:00
George Hotz
fa57c3e7ce continue llm.c (#4190)
* continue llm.c

* export more

* progress on llm.c

* simpler optim, names work
2024-04-18 10:57:54 +04:00
geohotstan
269a58d5fa tolist to return multidimensional list (#4192)
* lol does this work

* some more changes

* a tiny note

* rename a variable

* add test for data const and add TODO comment

* make type correct
2024-04-18 07:43:10 +04:00
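The new behavior, as a minimal check (assuming a standard tinygrad import):

```python
from tinygrad import Tensor

t = Tensor([[1, 2], [3, 4]])
assert t.tolist() == [[1, 2], [3, 4]]   # nested lists, matching numpy's tolist
```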