Commit Graph

6949 Commits

Author SHA1 Message Date
qazal
ceda43ce75 always swizzle load st in wmma [pr] (#7908) 2024-11-26 20:00:58 +08:00
George Hotz
4e5bf9dc7a test assignment in jit (#7906)
* test assignment in jit

* don't waste lines

* skip broken test in webgpu
2024-11-26 17:37:00 +08:00
mesozoic-egg
0cd1cc29dc PTX simplify: use a dict matcher for prefix [pr] (#7890)
* use a dict matcher for prefix

* simplify tuple unpack

* simplify tuple unpack

* debug pr

* Revert "debug pr"

This reverts commit 3aa9f77517.

* define_acc boolean case

* remove commented lines

* wip

* no need for .scalar in define_acc

* indentation

* linter fix

* add keys to matcher from GroupOps directly

* put dtype in tuple directly

* cast, line too long fix

* check ptrdtype with isinstance

* dtype is always ptr for define_global

wip

* blank commit to trigger CI

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-26 17:32:48 +08:00
Ahmed Harmouche
10618aba98 Bring back WebGPU (#7063)
* Start from andredaprato:webgpu-clean

* Fix infs

* inf wgsl function is not needed

* Emulated ulong for threefry, more tests passing

* Randomness tests passing

* Update model export to support new changes in webgpu; efficientnet export works again

* Simplify shift emulation in wgsl

* Delete test file

* Fix bigger than u32 u32 literal

* Why was skip copies added here?

* Python3.12 for webgpu tests

* Fix model export syntax error

* Get test ops passing with some skips

* Fix lint

* Much simpler shift

* Run more tests

* Timestamp queries are not supported in CI, so skip search tests

* All fancy indexing passing

* r is ctx

* Run more dtype tests by using is_dtype_supported

* Cleanup ulong shift rendering

* UPat -> Pat, UOps -> Ops

* Pat -> UPat

* Refactor render_ushift if-else

* Pattern to avoid ulong mul

* Remove vals_dtype

* is_nan trick + rewrite, test_isnan passing

* Rewrite a * select(1, nan, gate) -> select(a, nan, gate)

* No arg, just op

* Support char, uchar, short, ushort

* Run test_index_mnis now that we have uint8

* Fix pylint

* Save 3 lines by using base Compiler

* No more long emulation

* Remove fixup_binops

* No more external_local_bufx wgsl specific cstyle modif, use base extra_pm

* Simpler, faster copyin/out

* Skip some new tests that use long

* Fix typo

* copyout touchup

* Save lines by using render_cast

* WebGL is not supported in core, delete it from is_dtype_supported

* More narrow test skips for some unary tests

* TernaryOps, UnaryOps -> Ops

* TinyGrad supports WebGPU

* StableDiffusion demo: f16tof32 gpu is a lib, update UI

* Packed load/store, no more scale_size, no core tinygrad changes

* Rename copyin, copyout

* Device -> dev

* Fix lint

* Pattern matcher rule for packed load/store

* Refactor

* Shorter packed load/store

* this should fix lint

* Fix mypy

* SD compile script working

* New SD webgpu UI

* New default prompt

* New SD weights

* Fix title when webgpu not available

* Run symbolic tests, simplify is_nan, use round_up

* Show step time on UI

* Bump minimum wgpu version to v0.19

* Fix latent

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-26 12:26:40 +08:00
chenyu
ff3f2a9c1a Revert "move attention upcast (#7830)" (#7903)
This reverts commit c07daf40e7.
2024-11-25 18:59:51 -05:00
chenyu
04bee97d2a hotfix ctypes.c_ulong(size) for metal _alloc (#7902)
fix `Tensor.ones(1000, 1000, 1000).contiguous().realize()` on METAL
2024-11-25 18:25:33 -05:00
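The hotfix above passes the allocation size through `ctypes.c_ulong`. The failure mode it guards against: `Tensor.ones(1000, 1000, 1000)` in float32 is a 4 GB buffer, and a too-narrow signed 32-bit C type silently wraps that to a negative number. A minimal sketch of the wraparound (illustrative only, not tinygrad's actual `_alloc` code):

```python
import ctypes

size = 1000 * 1000 * 1000 * 4  # 4e9 bytes: a 1000^3 float32 buffer

# ctypes integer types mask out-of-range values rather than raising,
# so a signed 32-bit type wraps 4e9 to a negative number:
assert ctypes.c_int32(size).value == size - 2**32  # -294967296

# c_ulong is unsigned (at least 32 bits on every platform), so 4e9 survives:
assert ctypes.c_ulong(size).value == size
```

This is why wrapping the size explicitly matters when handing byte counts to a C allocation API.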
chenyu
631dc98b52 validate llama quantize output (#7901)
mac benchmark already runs quantize, this adds output validation
2024-11-25 16:46:23 -05:00
qazal
e8777cb8db assert view on uops without shape [pr] (#7898)
* assert view on uops without shape [pr]

* lint
2024-11-25 20:43:50 +08:00
chenyu
a49ca0c2ff clean up fully_flatten [pr] (#7885)
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-25 06:53:18 -05:00
qazal
e823de3828 viz with bottom_up=True (#7894)
* add failing test

* single pass it

* linter
2024-11-25 17:56:48 +08:00
qazal
2ca41d6a44 ops metadata map try 2, early fuse [pr] (#7893)
* make this return early

* delete that

* ops metadata map try 2, early fuse [pr]
2024-11-25 17:08:38 +08:00
qazal
9295c86ddc delete base op cast [pr] (#7891) 2024-11-25 16:38:32 +08:00
qazal
26784c45c6 delete cast arg 2 [pr] (#7881) 2024-11-25 16:15:57 +08:00
George Hotz
9d0038bccb small changes from block linearizer [pr] (#7888)
* small changes from block linearizer [pr]

* fix test_gc
2024-11-25 15:27:04 +08:00
mesozoic-egg
9e958f2b10 Ptx simplify [pr] (#7877)
* simplify render_kernel

* cvar in const

* Revert "simplify render_kernel"

This reverts commit 1c8817bea2.

* CMPNE src match

* src match in cast

* cvar in define_acc

* simplify render_store

* simplify render_kernel

* whitespace

* render_kernel fix fstring

* render newline

* do not embed newline in Ops.WHERE render

* WHERE op fix

* missed a comma

* whitespace

---------

Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-25 15:01:47 +08:00
nib9888
e9c681c839 fix missing final rewrite in viz (#7883) 2024-11-25 14:13:33 +08:00
Sieds Lykles
a49a7c4784 Improved mod folding (#7887)
* Remove unnecessary if statement

In all paths where something_changed was set to True, remainder is
appended so the list can't be empty

* Working version of improved mod folding

* Fix offset calculation

Passing fuzz_symbolic.py to 130_000 so far
Added an extra test

* Cleaner offset calculation
2024-11-24 22:21:34 -05:00
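The mod-folding idea behind the commit above: for a symbolic expression `sum(c_i * x_i) + k` taken mod `n`, any term whose coefficient is a multiple of `n` contributes nothing, remaining coefficients reduce mod `n`, and the constant offset folds to `k % n` (valid when the variables are non-negative integers). A minimal sketch with a hypothetical helper, not tinygrad's actual symbolic code:

```python
def fold_mod(terms: dict, const: int, n: int):
    """Fold (sum(c * x for x, c in terms.items()) + const) % n symbolically.

    Terms with coefficients divisible by n vanish mod n, other
    coefficients reduce mod n, and the constant reduces to const % n
    (assuming non-negative integer variables).
    """
    kept = {x: c % n for x, c in terms.items() if c % n != 0}
    return kept, const % n

# (4*a + 3*b + 10) % 4 folds to (3*b + 2) % 4
folded = fold_mod({"a": 4, "b": 3}, 10, 4)  # ({"b": 3}, 2)
```

Fuzzing (as in `fuzz_symbolic.py` mentioned above) is the natural way to gain confidence that such rewrites preserve the value for all variable assignments.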
leopf
5d92efb121 [BUGFIX] Tensor([]).data() (#7884)
* added test, fix

* fix only for (0,) shape

* Revert "fix only for (0,) shape"

* test_data_empty_multi_dim
2024-11-24 16:42:57 -05:00
chenyu
ac57d82a13 test_tiny on real NV/CUDA/AMD/HIP (#7886)
simple tests that run on real CUDA and HIP
2024-11-24 16:34:54 -05:00
qazal
06a28d83f5 delete extra dtype check in uop const [pr] (#7880) 2024-11-25 00:06:52 +08:00
chenyu
31337b49e3 cleanup Embedding call [pr] (#7869)
reshape on self.weight is a no-op, and there is no need to special-case numel 0.
2024-11-24 07:32:26 -05:00
geohotstan
ad9df26fba add test for inconsistent behavior in float to int casting (#7870)
* found teeny bug

* no healthcheck

* change function name
2024-11-24 07:31:34 -05:00
qazal
6b8a657085 cleanup group_realizes [pr] (#7878) 2024-11-24 18:16:46 +08:00
qazal
5aee78a0a6 fix uop swizzle on BUFFER, new tests (#7875)
* fix uop swizzle on BUFFER, new tests

* can have view of view
2024-11-24 17:11:09 +08:00
George Hotz
5d28a202b5 make tinychat local (#7871) 2024-11-24 14:45:48 +08:00
chenyu
22d5def113 download llama3 70B (#7868)
use "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF".
```
PYTHONPATH=. JITBEAM=2 python3 examples/llama3.py --download_model --size 70B --quantize int8 --benchmark
```

on M4 Max, 40 sec to load the model and
```
enqueue in 165.15 ms
total 328.54 ms, 3.04 tok/s, 247.46 GB/s, param 221.20 GB/s

enqueue in   5.31 ms
total 168.48 ms, 5.94 tok/s, 482.54 GB/s, param 431.34 GB/s

enqueue in   5.32 ms
total 168.77 ms, 5.93 tok/s, 481.71 GB/s, param 430.60 GB/s

enqueue in   5.69 ms
total 169.51 ms, 5.90 tok/s, 479.61 GB/s, param 428.72 GB/s

enqueue in   5.41 ms
total 168.60 ms, 5.93 tok/s, 482.20 GB/s, param 431.04 GB/s

enqueue in   5.18 ms
total 168.98 ms, 5.92 tok/s, 481.12 GB/s, param 430.08 GB/s

enqueue in   5.43 ms
total 168.82 ms, 5.92 tok/s, 481.59 GB/s, param 430.49 GB/s

enqueue in   5.27 ms
total 168.94 ms, 5.92 tok/s, 481.23 GB/s, param 430.17 GB/s
```
2024-11-23 12:18:31 -05:00
qazal
6a8be3ca1e don't change lazy state in schedule [pr] (#7867) 2024-11-24 00:18:50 +08:00
JaSpa99
28e83e662e least controversial (#7863) 2024-11-23 21:23:30 +08:00
George Hotz
8c3d3181dd bottom up rewrite fixes substitute [pr] (#7862)
* single pass rewrite fixes substitute [pr]

* caching for single_pass_rewrite

* allow multiple rewrites

* a simple test

* bottom_up_rewrite is fully flexible
2024-11-23 20:53:37 +08:00
mesozoic-egg
54d8f75d0c vectorized define_acc does not seem to get used (#7858)
Co-authored-by: Mesozoic Egg <mesozoic.egg@proton.mail>
2024-11-23 19:46:34 +08:00
qazal
40be9177ba move swizzle upats to ops, prereq for swizzle tc [pr] (#7861) 2024-11-23 18:34:45 +08:00
qazal
27a6cd7822 cleanup swizzle upats [pr] (#7860)
* cleanup swizzle upats [pr]

* match the rest
2024-11-23 15:19:06 +08:00
qazal
5b2c03e865 defer realize folding to kernel splitting [pr] (#7849)
* defer realize folding to schedule breaking [pr]

* this is init

* p2

* need to lookup edges

* refactor image cast folding [pr]

* Ops.LOAD diff

* image works

* refactor can_pad

* fix fold_img_cast
2024-11-23 14:29:14 +08:00
George Hotz
144e9f00df viz is local, new test, and new quantize [pr] (#7859)
* viz is local, new test, and new quantize [pr]

* fix mime types

* remove font

* after index
2024-11-23 14:27:10 +08:00
qazal
d43613e113 refactor image cast folding [pr] (#7852)
* refactor image cast folding [pr]

* Ops.LOAD diff
2024-11-23 13:59:21 +08:00
chenyu
c07daf40e7 move attention upcast (#7830)
still upcasts before softmax, but is faster because the intermediate buffer can be stored in half (as long as qk is within half range).
2024-11-22 17:10:51 -05:00
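Why upcasting before softmax matters: `exp` overflows half precision almost immediately (half max is 65504, and exp(12) already exceeds it), so softmax is computed at higher precision, typically with the row max subtracted first so every exponent is non-positive. A minimal float sketch of the stable form (illustrative, not tinygrad's attention code):

```python
import math

def softmax(xs):
    # subtract the max so every exponent is <= 0; exp then stays in (0, 1]
    # and cannot overflow, regardless of how large the logits are
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
```

The commit's observation is that the *storage* of the intermediate qk buffer can still be half, as long as the qk values themselves fit in half range.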
chenyu
5c5b1b994c less flaky benchmarks (#7855)
JIT=2 for metal cifar with HALF, and lower tflops for nv test_gemm_4096. failures in https://github.com/tinygrad/tinygrad/actions/runs/11980239535/job/33404098428?pr=7830
2024-11-22 16:39:39 -05:00
chenyu
3b26e51fce Tensor.cummax (#7854)
generalized the existing cumsum to take Ops.MAX in addition to Ops.ADD
2024-11-22 15:55:02 -05:00
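The generalization above is the classic scan: cumsum and cummax are the same cumulative reduction with a different binary op. A minimal sketch of the idea in plain Python (not tinygrad's implementation):

```python
from itertools import accumulate
import operator

def scan(xs, op):
    # one cumulative reduction covers both: add gives cumsum, max gives cummax
    return list(accumulate(xs, op))

xs = [3, 1, 4, 1, 5]
cumsum = scan(xs, operator.add)  # [3, 4, 8, 9, 14]
cummax = scan(xs, max)           # [3, 3, 4, 4, 5]
```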
ignaciosica
fb10ea563e typedef bf16 amd (#7850) 2024-11-22 14:29:01 -05:00
chenyu
a352a6938f simplify group_for_reduces in get_index [pr] (#7851)
what was that
2024-11-22 11:53:21 -05:00
chenyu
af5d77f684 move sint_to_uop from view.py to ops.py [pr] (#7848)
both sint and uop are in ops.py
2024-11-22 11:15:02 -05:00
chenyu
f6d1201c48 variable_to_uop -> sint_to_uop [pr] (#7847)
and added a type annotation to it
2024-11-22 10:54:59 -05:00
chenyu
40d7535eeb clean up DTYPES_DICT [pr] (#7845) 2024-11-22 10:01:34 -05:00
chenyu
4453ab51e1 use ceildiv in View.stride [pr] (#7844) 2024-11-22 08:38:05 -05:00
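For context on the commit above: ceildiv is the standard ceiling-division helper, commonly written in Python without floats by floor-dividing a negation (tinygrad defines its own helper; this exact form is an assumption):

```python
def ceildiv(num: int, amt: int) -> int:
    # ceiling division without floats: negate, floor-divide, negate again
    return -(num // -amt)

ceildiv(10, 3)  # 4
ceildiv(9, 3)   # 3
```

Avoiding `math.ceil(num / amt)` sidesteps float rounding error on large operands, which matters for shape/stride arithmetic.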
qazal
9828277c03 view doesn't have buffer, fix the tests [pr] (#7841)
* view doesn't have buffer, fix the tests [pr]

* need assigns
2024-11-22 20:41:55 +08:00
qazal
7e8777eee9 faster assign scheduling [pr] (#7839)
* baseline 87 ms

* 86 ms, only PRELOAD assigns

* refactor to assign_adjacents

* ops_folding
2024-11-22 19:23:59 +08:00
chenyu
6229d87f45 simpler reshape symbolic shape check [pr] (#7837) 2024-11-21 22:53:57 -05:00
George Hotz
1d6d842887 move DSP to extra (room for webgpu) [pr] (#7836) 2024-11-22 11:32:57 +08:00
chenyu
8ff6cba9f0 simpler swizzle_r new_axis [pr] (#7835)
the new axes are the ones permuted to the end
2024-11-21 22:26:41 -05:00
George Hotz
6fc7013463 put all DSP in dsp file [pr] (#7833) 2024-11-22 11:22:59 +08:00