tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-21 04:47:56 -05:00

Author	SHA1	Message	Date
George Hotz	f7291f6ca3	fixes big KOPT, breaks opencl (#505) * fixes big KOPT, breaks opencl * fix optimizer * KernelCache * oops, broke batchnorm * hack to fix it * fix llvm, less hacky gpu * disable the cache * cache just breaks things	2023-02-05 10:46:17 -08:00
George Hotz	cd97b036cc	A Triton backend for tinygrad (#470) * triton can add * print stuff from triton * write out file * ops triton working * reduce ops * sort of works * Triton bugfixes & implementation of remaining ops (#490) * padding * support pow, max, relu, gt0 * allocate return buffer * Fix reduce * Add tests for power op * Fix triton illegal memory accesses and memory leak (#512) * Fix mypy issue * Add triton to setup.py * Replace torch with pycuda * Use one cuda stream for data transfer and kernels * Remove triton submodule * Fix memory leak by using weakrefs for caching * Fix memory access by adding valid as mask for load * Fix invalid kernel launches by flattening the grid (#515) --------- Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>	2023-02-01 11:53:57 -08:00
Jacky Lee	799b3f185a	Refactor getenv into helpers (#508) * Refactor getenv into helpers * Remove unused os * Fix default value * Fix more defaults for CI * Fix bracket * Revert changes to openpilot/compile.py * Use getenv from helpers when possible	2023-01-31 15:09:09 -08:00
George Hotz	60ccddb58b	reenable SWAP	2023-01-30 17:32:02 -08:00
George Hotz	aea55eb196	found failing upcast	2023-01-30 16:12:56 -08:00
George Hotz	b67f997864	tests pass w/o float4	2023-01-30 15:40:49 -08:00
George Hotz	c6f570a2e6	improve progress bar	2023-01-30 14:50:28 -08:00
George Hotz	7118602c97	goat progress bar	2023-01-30 14:37:26 -08:00
George Hotz	cccfea4b25	factor out KOPT code	2023-01-30 13:13:55 -08:00
George Hotz	de2c419fd4	make_pair and first attempt at hlb_cifar10	2023-01-30 11:07:23 -08:00
AllentDan	7b6b1f32b1	[Fix] fix typo: test_mnist -> datasets (#492) * test_mnist -> datasets * fix mnist_gan	2023-01-29 21:30:47 -08:00
George Hotz	2db272c7f7	Kernel Optimizer (#489) * kernel optimizer * 10x faster, but wrong. not good deal * move test -> extra * print x speedup * clcache * fix clcache + DEBUG * GFLOPS estimate * i==3	2023-01-29 17:15:00 -08:00
George Hotz	bb0cdc2442	111.51x speedup for reduce	2023-01-29 03:06:00 -08:00
George Hotz	45c0aa6e2d	search with SHIFT, REDUCE	2023-01-29 02:42:20 -08:00
George Hotz	87879cf4b6	improve search more	2023-01-29 02:08:57 -08:00
George Hotz	f6bbd43cb8	improve search	2023-01-29 01:33:47 -08:00
George Hotz	ebdec2b72f	fix optimizer	2023-01-29 00:23:06 -08:00
George Hotz	a9cabce791	oops, broke mem estimates	2023-01-28 20:21:31 -08:00
George Hotz	a500e79bd1	don't OPTWG on OS X, it's way slower	2023-01-28 20:02:33 -08:00
George Hotz	b0df4d99a0	os x profiling: this ratio is exact i believe	2023-01-28 19:02:51 -08:00
George Hotz	ae810eb558	minor cleanups	2023-01-28 08:59:15 -08:00
George Hotz	6d5e1a8029	GEMM kernel search	2023-01-27 10:08:57 -08:00
Comma Device	f08e740957	factor out hand coded opt	2023-01-26 14:54:06 -06:00
George Hotz	5e8a36a18b	real op kernel	2023-01-26 09:51:32 -08:00
George Hotz	e0600f537a	op kernel in kernel search	2023-01-26 09:47:01 -08:00
George Hotz	aafc29484a	cleanups	2023-01-25 12:37:10 -08:00
George Hotz	919e943867	decent search	2023-01-25 12:20:53 -08:00
George Hotz	7f3da91f8b	kernel_search	2023-01-25 12:05:09 -08:00
George Hotz	e37424424f	first little attempt at search	2023-01-25 11:49:29 -08:00
Comma Device	9e2af0a972	too far with the OPTWG	2023-01-24 13:14:59 -06:00
Comma Device	3590848b93	a little more local workgroup options	2023-01-24 12:50:27 -06:00
Comma Device	4b74752c42	fix hotspots by improving the workgroup optimizer	2023-01-24 12:46:28 -06:00
George Hotz	fd760a390a	fix incremental time	2023-01-24 10:19:04 -08:00
George Hotz	a949de873b	reduce 2.0 (#469) * reduce 2.0 * works * hacks * DEBUG=3 for shapes * fix types * 0s weren't being folded * cleaner * last_reduce is no longer needed * comments and cleanup	2023-01-23 15:11:13 -08:00
George Hotz	f1196984e6	harmless to intertwine the math and the stores	2023-01-21 09:31:56 -08:00
George Hotz	708215d06b	Typing (#468) * we typing * types look good in theory * most tests pass * gpu tests pass * TEST_AST * delete comments * i must have written that bug so many times * bugfix * don't merge the small ones * add f to constants * commits from reduce * don't GCD the mod nodes * broken and a hack IMAGE=3 * group for reduce * fix linter + mypy * move out test ast * insource TENSOR_TYPE_TO_NP_TYPE * does this fix it? * move imports out	2023-01-21 09:09:22 -08:00
George Hotz	0881d504c1	move shapetracker (#466) * move shapetracker * shapetracker test * move ast * move a few things * fix print kernel * fix test * symbolic fixups	2023-01-19 09:56:31 -08:00
George Hotz	9245f4650a	indexer changes for master	2023-01-18 18:02:02 -08:00
George Hotz	49c6e6d472	Latest attempt to add image (#462) * add image * load + store + boring stuff: * image tests pass * thneed print GFLOPS * op conv test * more debugging * hack for multiview image * shapetracker creates less views * disable image tests * working better * ugh, lkey not key * print in DEBUG, and allow views * works * simple padding conv2d * use index for image * that was bad code * debug print * fix types * less lines * save lines	2023-01-12 17:36:30 -08:00
George Hotz	281b0db773	three from image	2023-01-12 12:26:58 -08:00
George Hotz	9ff6c532eb	Prereqs for IMAGE=1 (#461) * contig * move ast, debug prog * add Token * cleanup reduce * exec_ast	2023-01-11 20:18:42 -08:00
George Hotz	fff1f046b0	Simple version of the new GPU backend (#458) * newgpu * more to delete * hmm, tests pass with constant folding * fix lint/type * fix constant folding * comment and rerun tests * lazy touchups * fix graph_batchnorm test * smaller transformer to fix OOM * Revert "smaller transformer to fix OOM" This reverts commit `a44ef8edc2`. * no func cache * introspect * touchups * CLASTKernel * ugh, it was lru_cache * codegen * spacing * old gpu still in opencl * typing fix	2023-01-10 19:16:02 -08:00
George Hotz	fad7cba590	move batchnorm to Tensor	2023-01-09 18:00:16 -08:00
George Hotz	4885fce56e	shapetracker from newgpu (#456) * shapetracker from newgpu * touchup ops * test * testst * thneed deletes unused inputs * test * bugfix	2023-01-09 12:40:01 -08:00
George Hotz	b8c94a67c9	Simple chonker (#431) * chonker will make llvm fast * work * better speed tests, we will make them fast * with the cache add is the same speed * relu and neg are fast * fix sum speed * maximum maxnum? * hack for gemm opt * gemm very slow * zeros like * test_permute * shapetracker returns self * fix shapetracker factorization * err, int strides * permutes are faster now in tinygrad than pytorch * support -1 in expand * gemm unrolled * improve final test case * WIP GEMM * why isn't GEMM fast? * revert cache dim * ffp contract works on clang, not llvm? * ignore llvm ir * this makes fma work at least, but no faster * USE_4x4 * 63 GFLOPS * 87 GFLOPS * that wasn't matmul, 44 GFLOPS now * 82 GFLOPS permuted * this permute too * a little speed for the convs * 45 GFLOPS * speed tests pass again * clean up prints * fix FMA WHAT A WASTE OF TIME * colors * moar fair * GPU * useless on chonker * cleanups * improve factorized shapetracker * better threshold * label conv * work * ops test pass again * hot load the index * run the last view, no need to create * ZeroView needs a repr for the key to work * fix segfault on out of bounds * one more test * start amx, and llvm.initialize_native_asmparser * amx works * nice AMX class * nicer AMX class * refactor get_idxs * amx working * is slower... * useless flip * cache * SZ_X * AMX_SZ_X/Y work alone * Contiguous mlop * test gemm packed * PREPARE in packed * use_amx factor * prefetch isn't faster * loop * same 3ms * 2.24 ms * allow double on store in TG * amx reduce is the same speed as non amx reduce * include memory bandwidth * clean up shapetracker * flip returns stride * prepare for upstream * Update ops_llvm.py (#426) * permutes are yellow and green now * faster conv * llvm cleanups * Show optimised IR under debug 4 (#428) * ASTKernel class * Make tinygrad work with older python version (#427) * Make tinygrad work with older python version * Use partialmethod instead of partial * smiple chonker is chonking * remove junk from test speed vs torch * fix linker and types * AMX is only here now * add LLVM tests, it's a valid backend now * oops, run llvm test * contiguous_op * fix loadops compare * dedup reduceops Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>	2022-11-10 23:17:09 -08:00
George Hotz	2cc1d970c6	updates from the chonker branch	2022-11-07 21:12:08 -08:00
George Hotz	d878065ece	Gemm (#416) * gemm * off by factor of 5 * 50 GFLOPS * works * 91 gflops * working at 50G * works * iy * 150 GFLOPS * 150 GFLOPS * N=2048 is still fast * threading soon * multithread * pinning * throttling is sad * Align matrices to cacheline width (#361) Co-authored-by: cloud <Cloud11665@gmail.com>	2022-11-06 10:07:28 -08:00
George Hotz	6a8fb53304	move ops.py into lazy.py (#402) * move ops.py into lazy.py * fix graph and linter * ugh, didn't add	2022-10-25 13:58:03 -07:00
George Hotz	8e22d5ee67	replace networkx with defaultdict	2022-10-20 19:36:43 -07:00
George Hotz	63f9c55156	really dumb bug	2022-10-20 17:07:47 -07:00

1242 Commits