* fixes big KOPT, breaks opencl
* fix optimizer
* KernelCache
* oops, broke batchnorm
* hack to fix it
* fix llvm, less hacky gpu
* disable the cache
* cache just breaks things
* triton can add
* print stuff from triton
* write out file
* ops in triton working
* reduce ops
* sort of works
* Triton bugfixes & implementation of remaining ops (#490)
* padding
* support pow, max, relu, gt0
* allocate return buffer
* Fix reduce
* Add tests for power op
* Fix triton illegal memory accesses and memory leak (#512)
* Fix mypy issue
* Add triton to setup.py
* Replace torch with pycuda
* Use one cuda stream for data transfer and kernels
* Remove triton submodule
* Fix memory leak by using weakrefs for caching
* Fix memory access by adding valid as mask for load
* Fix invalid kernel launches by flattening the grid (#515)
---------
Co-authored-by: Martin Loretz <20306567+martinloretzzz@users.noreply.github.com>
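
A minimal sketch of the pattern behind the last two fixes in #512/#515, in the style of the standard Triton tutorial add kernel (names like `add_kernel` and `BLOCK` are illustrative, not tinygrad's actual kernel): out-of-range lanes are masked on `tl.load`/`tl.store` so partial blocks can't touch invalid memory, and the launch grid is a flat 1-D `(num_blocks,)` tuple.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # flattened 1-D grid: one program per block
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices this program covers
    mask = offs < n_elements                  # "valid" mask: off-the-end lanes do nothing
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

# launched over a flat grid rather than a multi-dimensional one:
# add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
```

The leak fix above is the host-side complement: cache entries are held through weak references, so the cache itself never keeps GPU allocations alive.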
* Refactor getenv into helpers
* Remove unused os
* Fix default value
* Fix more defaults for CI
* Fix bracket
* Revert changes to openpilot/compile.py
* Use getenv from helpers when possible
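
The refactor above centralizes environment lookups in helpers; a plausible shape for such a helper (the `getenv` name comes from the commits, the body is an assumption) casts the variable to the type of its default, so numeric flags like `DEBUG=4` come back as ints:

```python
import os

def getenv(key: str, default=0):
    # cast to the default's type: getenv("DEBUG", 0) returns an int
    return type(default)(os.getenv(key, default))

DEBUG = getenv("DEBUG", 0)  # example usage
```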
* bringing back reshape and permute
* done with E701
* 4x4 works in generic way
* max and sum not vectorizing...
* special case single float
* support comparing to MPS
* improve matmul speed, consider generic principles
* GlobalCounter
* fix op tracking
* faster
* comment that out for now
* err, it needs that
* fix minor issues
* fix global_mem
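
A sketch of what a global op/memory counter can look like (class and field names are assumptions based on the `GlobalCounter`/`global_mem` commits): backends bump the tallies once per kernel launch, and the speed tests read and reset them.

```python
class GlobalCounters:
    # process-wide tallies, reset between speed tests
    global_ops: int = 0   # FLOPs issued by launched kernels
    global_mem: int = 0   # bytes moved by launched kernels

    @staticmethod
    def reset():
        GlobalCounters.global_ops, GlobalCounters.global_mem = 0, 0
```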
* chonker will make llvm fast
* work
* better speed tests, we will make them fast
* with the cache add is the same speed
* relu and neg are fast
* fix sum speed
* maximum maxnum?
* hack for gemm opt
* gemm very slow
* zeros_like
* test_permute
* shapetracker returns self
* fix shapetracker factorization
* err, int strides
* permutes are faster now in tinygrad than pytorch
* support -1 in expand
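
The shapetracker commits above rely on the standard strided-view trick: a permute only reorders strides and an expand broadcasts by setting a stride to 0, so neither moves any data. A minimal sketch, with `-1` in expand resolved from the old shape (function names hypothetical):

```python
def permute(shape, strides, order):
    # reordering strides reindexes the same buffer; no copy needed
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)

def expand(shape, strides, new_shape):
    # -1 keeps the old size; broadcast dims (1 -> n) get stride 0
    out_shape = tuple(s if n == -1 else n for s, n in zip(shape, new_shape))
    out_strides = tuple(st if s == n else 0
                        for s, n, st in zip(shape, out_shape, strides))
    return out_shape, out_strides

# permute((2, 3), (3, 1), (1, 0))  -> ((3, 2), (1, 3))
# expand((1, 3), (3, 1), (4, -1))  -> ((4, 3), (0, 1))
```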
* gemm unrolled
* improve final test case
* WIP GEMM
* why isn't GEMM fast?
* revert cache dim
* ffp contract works on clang, not llvm?
* ignore llvm ir
* this makes fma work at least, but no faster
* USE_4x4
* 63 GFLOPS
* 87 GFLOPS
* that wasn't matmul, 44 GFLOPS now
* 82 GFLOPS permuted
* this permute too
* a little speed for the convs
* 45 GFLOPS
* speed tests pass again
* clean up prints
* fix FMA WHAT A WASTE OF TIME
* colors
* moar fair
* GPU
* useless on chonker
* cleanups
* improve factorized shapetracker
* better threshold
* label conv
* work
* ops test pass again
* hot load the index
* run the last view, no need to create
* ZeroView needs a repr for the key to work
* fix segfault on out of bounds
* one more test
* start amx, and llvm.initialize_native_asmparser
* amx works
* nice AMX class
* nicer AMX class
* refactor get_idxs
* amx working
* is slower...
* useless flip
* cache
* SZ_X
* AMX_SZ_X/Y work alone
* Contiguous mlop
* test gemm packed
* PREPARE in packed
* use_amx factor
* prefetch isn't faster
* loop
* same 3ms
* 2.24 ms
* allow double on store in TG
* amx reduce is the same speed as non amx reduce
* include memory bandwidth
* clean up shapetracker
* flip returns stride
* prepare for upstream
* Update ops_llvm.py (#426)
* permutes are yellow and green now
* faster conv
* llvm cleanups
* Show optimised IR under debug 4 (#428)
* ASTKernel class
* Make tinygrad work with older python version (#427)
* Make tinygrad work with older python version
* Use partialmethod instead of partial
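
The `partialmethod` commit touches a real descriptor-protocol detail: `functools.partial` objects don't implement `__get__`, so they never receive `self` when attached to a class, while `partialmethod` does. A sketch of the pattern (the `Tensor.relu` registration is illustrative, not tinygrad's exact code):

```python
from functools import partialmethod

class Tensor:
    def _elementwise(self, op_name):
        print(f"apply {op_name} to {self!r}")

# partialmethod is a descriptor, so t.relu() correctly binds self;
# functools.partial(Tensor._elementwise, "relu") would not
Tensor.relu = partialmethod(Tensor._elementwise, "relu")

Tensor().relu()  # -> apply relu to <Tensor object at 0x...>
```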
* simple chonker is chonking
* remove junk from test speed vs torch
* fix linker and types
* AMX is only here now
* add LLVM tests, it's a valid backend now
* oops, run llvm test
* contiguous_op
* fix loadops compare
* dedup reduceops
Co-authored-by: calledit <1573053+calledit@users.noreply.github.com>
* working exec ast
* exec_ast is staticmethod
* GenericExecAST
* fold that sometimes
* ExplicitExecAST
* exec_ast for GPU
* gpu working
* get_lazyop_shape
* now gpubuffer is ExplicitExecAST
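
The exec_ast commits describe interpreting a lazy op AST by walking it recursively; a minimal self-contained sketch of that idea (the `LazyOp` fields mirror the concept, not tinygrad's exact classes):

```python
from dataclasses import dataclass
from typing import Any, Tuple

@dataclass(frozen=True)
class LazyOp:
    op: str                 # e.g. "ADD", "RELU"
    src: Tuple[Any, ...]    # child LazyOps or concrete buffers

class GenericExecAST:
    OPS = {"ADD": lambda a, b: a + b, "RELU": lambda a: max(a, 0.0)}

    @staticmethod
    def exec_ast(ast):
        # evaluate children first; anything that isn't a LazyOp is already a buffer
        srcs = [GenericExecAST.exec_ast(s) if isinstance(s, LazyOp) else s
                for s in ast.src]
        return GenericExecAST.OPS[ast.op](*srcs)

# GenericExecAST.exec_ast(LazyOp("RELU", (LazyOp("ADD", (2.0, -5.0)),)))  # -> 0.0
```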
* dedup
* add a type
* RESHAPE in opencl code
* fix linter
* that too for linter
* cleanups
* remove dead code
* GenericShape is less lines
* add ALLOWED_KERNEL_COUNT to tests
* fix mypy
* that's gotta be recursive
* fix opencl shape processing
* remove unneeded lambda
* in progress
* big conv test works
* that's unneeded
* fix opencl with reduce
* rewrite contiguous_view_constant_fold
* clean up mids in loop code
* subidx
* print cl kernel before run
* no reduce, no loop
* Revert "no reduce, no loop"
This reverts commit 92777e40e9.
* Fix OpenCL Metal texture issues
Tile CL images when needed to fit within Metal's 16384 maximum image size;
this gets SD on an M1 Pro to ~4.8s/iteration with OPENCL=1 FLOAT16=1.
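
A sketch of the tiling arithmetic implied above, assuming Metal's 16384-pixel maximum texture dimension (helper name and signature are hypothetical):

```python
import math

MAX_DIM = 16384  # Metal's maximum image width/height

def tile_rects(width: int, height: int):
    """Yield (x, y, w, h) tiles that each fit under Metal's size limit."""
    for ty in range(math.ceil(height / MAX_DIM)):
        for tx in range(math.ceil(width / MAX_DIM)):
            x, y = tx * MAX_DIM, ty * MAX_DIM
            yield x, y, min(MAX_DIM, width - x), min(MAX_DIM, height - y)
```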
* Minor cleanup
* Fix mish in CI, or no-op?
* Is mish being framed?
* It would help if any of this reproduced locally
* ???
* OPT is reverted; use original mish
* Cleanup post-review
* Fix some shape usage
* Tiler tests, shouldn't oom or overflow either
* Can't CL if there's no CL?
* Run tiler tests even if GPU=1
* relu6 segfault binary chop; revert test
* relu6 segfault binary chop; revert accel
* relu6 segfault binary chop; revert . (???)
* end relu6 segfault binary chop; repo's haunted