tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-13 17:08:11 -05:00

Author	SHA1	Message	Date
George Hotz	27701ef823	add locals support to rangeify (#11826)	2025-08-24 14:03:12 -07:00
qazal	793ace530e	update amd_uop_matmul.py import (#11581) Using this for testing SQTT	2025-08-08 17:07:35 +03:00
George Hotz	82be8abfd2	move opt under codegen (#11569)	2025-08-07 14:19:17 -07:00
George Hotz	4f26a9ad32	check elements_per_thread in tensorcore [pr] (#11435)	2025-07-30 11:55:48 -07:00
George Hotz	1bef2d80c1	unrolls are all in the same scope (#11429) * unrolls are all in the same scope * fix that import	2025-07-29 16:55:37 -07:00
George Hotz	03909f2772	permute locals for HL uop matmul (#11412) * permute locals for HL uop matmul * parens fix that * permutes * 20 TFLOPS	2025-07-29 08:19:59 -07:00
George Hotz	735ad5f10d	kernel4 and 5 in uops (#11411) * move simplify views to merge views * add amd kernel 4 * Revert "move simplify views to merge views" This reverts commit `1e07dff384`. * k4 in python * kernel4 written in uops * k5 support * cleanups	2025-07-28 19:35:48 -07:00
George Hotz	fddc645668	HL=2 top matmul (#11406) * HL=2 top matmul * top colored	2025-07-28 12:32:38 -07:00
George Hotz	dfeee63d30	uop matmul work (#11388) * uop matmul work * works with locals	2025-07-26 21:23:55 -07:00
George Hotz	2c70eaf18c	fix load / barrier (#11386) * fix load / barrier * cleanups * fix CI	2025-07-26 10:27:37 -07:00
George Hotz	466ab5a3f2	store/load not pass through index (#11381) * noop * fix noop * store cat is NOOP * store dtype is void * stores aren't passed through anymore * meh, skip those for ptx * correct ptx skip * hl runs	2025-07-25 21:01:47 -07:00
George Hotz	490a93902c	define reg doesn't have init anymore (#11365) * define reg doesn't have init anymore * remove that * no special logic for dr * fix amd uop matmul	2025-07-24 19:15:49 -07:00
George Hotz	0602b22086	kernel spec (#11359) * kernel spec * ops.VIEW * work	2025-07-24 12:45:38 -07:00
George Hotz	b0dc97d1f7	write out kernel 3 in uops (#11352) * write out kernel 3 in uops * matmul is correct * gemm passes spec * bugfix to match speed * cleanups	2025-07-23 17:32:38 -07:00
George Hotz	108aac8af4	use AddrSpace instead of local (#11314) * use AddrSpace instead of local * addrspace in test	2025-07-21 14:00:06 -07:00
George Hotz	842184a1ab	rename kernelize to schedule, try 2 (#11305)	2025-07-21 11:18:36 -07:00
chenyu	a0438012af	remove Kernel.get_program [pr] (#11203)	2025-07-12 20:50:29 -04:00
George Hotz	d67c8e7b42	local metal on metal in uop syntax (#11185) * local metal on metal in uop syntax * TODO: just put the axis_info in the kernelinfo * local * amd_matmul works @ 28 TFLOPS * clean up matmul * kernel8 works * remove that * locals * axistype innovation * work * cleanup * kernel3 regs * cleanup kernel3 * work * why is it broken * no beam * reenable * permutes	2025-07-12 16:31:19 -07:00
chenyu	6283d50224	DEPRECATED_linearize -> to_program [pr] (#11198)	2025-07-12 13:46:20 -04:00
George Hotz	2893feb9f6	cleanups for kernel.py (#11143) * cleanups for kernel.py * fixups	2025-07-08 18:10:25 -07:00
George Hotz	856759c79c	add halide example (#10980) * add halide example * upd halide gemm * partial works * touchups	2025-06-26 16:14:57 -07:00
George Hotz	92678e59ee	move kernel to opt (#10899)	2025-06-20 15:22:28 -07:00
Sidharth N. Babu	ef14dfb277	compile fixes (#10442)	2025-06-06 18:38:37 -04:00
George Hotz	411392dfb7	move files into uop dir (#10399) * move files into uop dir [pr] * tinygrad.uop is a thing * fix uop docs, no pr * fix viz	2025-05-18 11:38:28 -07:00
chenyu	720f20865b	remove required_optimizations (#9848)	2025-04-19 16:51:16 -04:00
chenyu	f5256e0020	Kernel.apply_opts [pr] (#9917) * Kernel.apply_opts [pr] updated all `for opt in`. also updated a few test_liinearizer tests to not implcitly depend on hand_coded_optimization * not you yet	2025-04-17 08:00:56 -04:00
chenyu	8c6299bced	move hand_coded_optimizations to heuristic.py [pr] (#9844) * move hand_coded_optimizations to heuristic.py [pr] also folded all long lines * make a copy and rename self -> k * fix test	2025-04-10 23:40:16 -04:00
chenyu	c5db5b83b9	add SHOULD_USE_TC=1 check to simple_matmul (#9802) * add SHOULD_USE_TC=1 check to simple_matmul also zero centered the random input and update atol for tf32 * ATOL=2e-2 for HALF	2025-04-09 02:24:42 -04:00
George Hotz	78caf55154	Revert "FP8 support on NVIDIA (#8631)" This reverts commit `2c8e4ea865`.	2025-04-09 12:27:41 +08:00
George Hotz	14928fecff	Revert "fix TF32 tensor core dropped in tc_sm89 (#9798)" This reverts commit `7c9a96824f`.	2025-04-09 12:27:39 +08:00
chenyu	7c9a96824f	fix TF32 tensor core dropped in tc_sm89 (#9798) also add `SHOULD_USE_TC=1` to verify TC is applied in simple_matmul	2025-04-08 23:20:50 -04:00
pkotzbach	2c8e4ea865	FP8 support on NVIDIA (#8631 ) * squashed fp8 commits * tensorcore start * minor changes * pre-commit * pylint * Delete fp8mul.cu * clean * small bugfix * fix test_dtype * fix test_dtype_alu * add EMULATE_CUDA_SM89 * fix ci * fix test_linearizer * fix test_linearizer * fix swizzle * add debug to simple_matmul * fixed swizzle * python emulator * refactor python emulator * setup fix * numpy setup * ml_dtypes only in emulate_cuda_sm89 * fix pylint * fix tests * fix mypy * fix mypy * fix ruff * done python emulator * add acc type * tests * mypy * clean code * add cuda tensor core tests to CI * minor fix * clean test_dtype.py * clean cstyle.py * clean test_ops.py * fix test * fix test * whitespaces * pylint * pylint * amd? * amd? * amd * reduce lines * mockgpu remove * fix * ruff * ruff * fix mypy * ruff * test only for cuda * fixed formatting * small fixes * small fix * least_upper_dtype if fp8s not supported * log and reciprocal are supported for fp8s * ops python fixes * dtypes.fp8s use * e4m3 + e5m2 result dtype test * truncate linter fix --------- Co-authored-by: pkotzbach <pawkotz@gmail.com> Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com>	2025-04-08 21:54:04 -04:00
Francis Lam	1e5d9ad8f7	extra/gemm/max_matmul: start of custom kernels for GEMM (#6926) * extra/gemm/max_matmul: start of custom kernels for GEMM * add an unoptimized FP16/FP16 MMA example * add slow 3-stage fp16 acc example * add correct 3-stage pipeline with unswizzled/flat smem input (slow) * add acc fp16 example with 3 stages and swizzle (no bank conflicts) * add max version of NV fp16_fp16_fp16 * fix up comments and removed unused code in max variations * add start of no_xor example * fix to account for UOps to Ops	2025-03-19 15:04:57 +08:00
chenyu	0e591baf43	redo simple_matmul change (#9450) numpy does not support bfloat16	2025-03-14 17:53:52 -04:00
chenyu	b0f63d3c04	Revert "`simple_matmul.py` uses np to generate random (#9438)" (#9449) This reverts commit `14018050c1`.	2025-03-14 17:14:22 -04:00
Ignacio Sica	14018050c1	`simple_matmul.py` uses np to generate random (#9438) * np generates randoms * hotfix: use generator for int dtype * float32 as default dtype for float generator * use np.float32 instead of stirng * add dtype= to integers generator * change import _to_np_dtype source	2025-03-14 17:36:50 -03:00
chenyu	01e8b60911	acc_dtype -> dtype (#9402) matched numpy and torch	2025-03-10 16:05:30 -04:00
George Hotz	a73d8717f3	fast amd gemm (#9318) * 50 TFLOP AMD gemm * add lds tiling * register tiling * flip locals * work * comment * remove those	2025-03-03 12:01:14 +08:00
chenyu	2e7c2780a9	CLANG -> CPU (#9189)	2025-02-20 18:03:09 -05:00
George Hotz	a3c78d47b3	speed docs + upgrades [pr] (#8964) * add some docs about speed [pr] * better torch gemm * enable locals on llvm/clang * disable locals for beam speed on LLVM/CLANG * 0x20 alignment in llvm allows ymm use	2025-02-08 17:28:52 +08:00
ignaciosica	b49a04145e	fix for int plus minor cleanup (#8650)	2025-01-17 22:30:39 -05:00
qazal	866dfa1f23	create_schedule([x.lazydata]) -> x.schedule() in tests (#8449)	2024-12-31 03:15:52 +08:00
George Hotz	c5d458ce02	BufferSpec and ProgramSpec [pr] (#7814) * BufferSpec and ProgramSpec [pr] * delete preallocate, it's unused * Revert "delete preallocate, it's unused" This reverts commit `dcfcfaccde`.	2024-11-21 12:18:05 +08:00
George Hotz	bc977fec53	dname -> device [pr] (#7804) * dname -> device [pr] * a few more * only one left	2024-11-20 17:57:14 +08:00
George Hotz	d71fe7faa5	rename allocator methods to not conflict [pr] (#7788) * rename allocator methods to not conflict [pr] * forgot those * transfer + offset	2024-11-20 00:10:29 +08:00
George Hotz	3169cb386d	remove graph [pr] (#7085)	2024-10-16 11:40:07 +08:00
nimlgen	81a4a9623c	add qcom dsp runtime (#6112 ) * calling qualcomm dsp from python * include so files * add include file * adsprpc.py * running with adsprpc * work * 32-bit support in elf * compilation works * ion * msm_ion * working DSP backend * getting 500 MFLOPS on matmul * beam works with timing * move to autogen * disasm * progress * simple tests pass * qcom_dsp * more dsp autogen * progress * some progress * works w/o lib * checkpoint * no lib * ugh, better * cleaner, but with lib. test good, but with the hack * remove autogens * small * push * simpler * revert this * run_3 * simpler * android * handle * run it * why? * run2 * to gen * cc * cleaner * elf * part of autogen * comemnt * no lib * autohen * linter * bug reproducer * cleaner * this repro is almost empty and doesn't work!!!! * with this test_ops passes, no crashes anymore * cleaner * linter * renames * shorter * remoev contextlib * ugh * myoy * cleaner * cleaner * remove import * conn * import * revert this * remove heavy .so * shorter alloc * not tue anymore --------- Co-authored-by: Comma Device <device@comma.ai> Co-authored-by: George Hotz <geohot@gmail.com> Co-authored-by: George Hotz <george@comma.ai>	2024-09-13 21:01:33 +03:00
Francis Lam	7376b67e36	extra/gemm/triton_nv_matmul: fix Program arguments (#6212) remove op_estimate	2024-08-20 14:05:38 -07:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
George Hotz	bc55c8a30e	pmatmul example + GB/s bugfix [run_process_replay] (#5974) * pmatmul example + bugfix * improve pmatmul * Update real_pmatmul.py	2024-08-07 22:32:11 -07:00

133 Commits