tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-05 12:15:05 -05:00

Author	SHA1	Message	Date
chenyu	b47f6cebb2	LinearizerOptions -> CompilerOptions (#3978 )	2024-03-28 17:50:23 -04:00
qazal	2bfb1d3e39	dynamic assign idx (#3975 )	2024-03-28 13:59:32 -07:00
George Hotz	2cfcb5623a	hotfix: d was removed from buffer	2024-03-28 13:39:02 -07:00
George Hotz	42b9d999ea	Buffer isn't always allocated (#3974 ) * buffer alloc * allocate * missing allocates * last one	2024-03-28 13:33:47 -07:00
George Hotz	9c03fe3e5d	hotfix: ShapeTracker no longer has import cycle	2024-03-28 10:34:23 -07:00
chenyu	bfcaa2f70e	assert `__setitem__` if used other than disk (#3972 ) * assert `__setitem__` if used other than disk * that is not implemented	2024-03-28 12:16:38 -04:00
David Hou	4b95350c41	fp16 resnet (without expand backwards sum in float, doesn't work) (#3816 ) * fp16 resnet * cast running mean and var back to default float * extra cast * check symbolic no overflow * add linearizer failure * loss scaler after grad contig * oops * i think this works * don't loss scale fp32 * remove overflow test case * remove symbolic bounds check * loss scaler should be float * temporarily disable padto cuz bug shruggie * make running stats in batchnorm float32? * calculate lars stuff in fp32? * oops * remove most changes * move loss scaler out of optimizer * no more FP16 var * oops --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-28 01:25:37 -04:00
George Hotz	607b4a7d70	remove buffer read, save lines (#3969 )	2024-03-27 22:02:47 -07:00
chenyu	80116be9a5	for loop to generate hip math functions for different floats (#3967 ) * for loop to generate hip math functions for different floats * slightly nicer	2024-03-27 23:24:29 -04:00
qazal	03d129baa8	inputs -> membufs (#3964 )	2024-03-27 17:34:39 -07:00
Francis Lam	16a1d43f6f	llama: prevent device initialization outside of __main__ (#3966 ) * llama: prevent device initialization outside of __main__ causes HSA resources leakages in child compile processes * llama: fix loading with multiple devices	2024-03-27 19:19:38 -04:00
Francis Lam	7c5729a3bd	wmma: refactor to remove wmma_func and create TC funcs as needed (#3945 ) * wmma: refactor to remove wmma_func and create TC funcs as needed * test_linearizer: disable bf16 CUDA during emulation testing * cstyle: clean up creation of CUDA vec dtypes * extra/gemm: add option to accumulate to bfloat16 * cleanups * benchmark: add CUDA bfloat16 matmul * more cleanups	2024-03-27 16:43:09 -04:00
chenyu	88b24df40a	touchup remove `float()` in cstyle render_const for float64 (#3962 )	2024-03-27 16:08:28 -04:00
qazal	27af37f2ad	misc: remove unused env vars (#3963 ) * remove unused env vars * delete CPU	2024-03-27 16:08:15 -04:00
George Hotz	60639cccac	hotfix: RuntimeError for assign	2024-03-27 11:18:48 -07:00
qazal	9fb573d73c	DAG cycle asserts (#3955 ) * assert cycles * these are cycle errors * flip to positive	2024-03-27 11:09:59 -07:00
geohotstan	bd3a7d068c	correct device for validation test in model benchmark CI (#3960 ) * fix tests * add clang back for only metal * change the name to reflect CLANG being ran * add back cuda	2024-03-27 13:40:06 -04:00
George Hotz	eec2b00edc	change kernel name if it's multioutput (#3958 )	2024-03-27 08:42:57 -07:00
George Hotz	d1c957a471	copy back to clang (#3951 ) * copy back to clang * force the copy for CLANG device	2024-03-27 08:13:01 -07:00
P4ssenger	332c82893a	Remove redundant check on device (#3957 ) * call self.nbytes * device is canonicalized, therefore, it cannot be None	2024-03-27 07:54:33 -07:00
chenyu	6c7df1445b	enforce UOps.CONST arg has python type based on dtype (#3952 ) added an assert in uops, remove the cast in renderer	2024-03-27 01:41:38 -04:00
George Hotz	91f3326c0b	hotfix: increase recursion limit	2024-03-26 21:26:54 -07:00
George Hotz	68ca4d4276	split to schedule.py (#3949 ) * split to schedule.py * split	2024-03-26 21:02:46 -07:00
George Hotz	da07f31fd4	hotfix: remove bf16 test entirely	2024-03-26 20:50:27 -07:00
George Hotz	0d5845fb5b	hotfix: jit is flaky on mac	2024-03-26 20:44:05 -07:00
George Hotz	150ea2eb76	create engine folder and move code (#3948 ) * retry * older tf * that	2024-03-26 20:38:03 -07:00
George Hotz	629cbc5587	only abstractions 2 (#3947 )	2024-03-26 20:02:18 -07:00
chenyu	77589bc7a5	rename Scalar to ConstType and cast_scalar to as_const (#3946 ) prereq cleanup to make const arg same python type as dtype	2024-03-26 22:39:58 -04:00
uuuvn	d6d902afe9	wtf (#3944 )	2024-03-26 17:49:28 -07:00
Francis Lam	5530b0cbed	fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942 ) * fuzz_linearizer: reduce debug verbosity and make easier for CI usage * rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset) * skip simple ASTs (easier to use with LOGOPS output) * don't fuzz a previously seen AST * add options to allow non-zero --expected-failures * clean up naming and use set	2024-03-26 16:25:24 -04:00
chenyu	8df6587c41	hotfix 97.3 for beautiful_mnist (#3941 )	2024-03-26 15:02:53 -04:00
chenyu	b1e3817e18	correctly handle Tensor.rand when default_float = bf16 (#3940 ) always casting to float32 makes default half to be slow	2024-03-26 14:56:16 -04:00
chenyu	f6ff76be21	check only upcast int amount in upcasted_axis (#3938 ) fixed typing and fixed #3932	2024-03-26 12:54:57 -04:00
nimlgen	e2d6f76723	_alloc and _free with options (#3934 ) * _alloc has options * linter * fix hsa	2024-03-26 09:11:41 -07:00
nimlgen	739f47eb0f	check on cuEventSynchronize (#3933 )	2024-03-26 16:14:38 +03:00
George Hotz	778d17fbd3	intel matmul (#3830 ) * almost right * intel xmx	2024-03-25 22:37:20 -07:00
chenyu	ef537672bf	bf16 support in metal (#3929 ) it runs if device gpu supports bfloat. updated ci benchmark too	2024-03-25 23:17:36 -04:00
chenyu	72d617a37d	opencl on OSX does not support fp16 extension (#3931 ) running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.	2024-03-25 19:50:17 -04:00
Arseny Kapoulkine	cb6e7b57a6	examples: Fix parameter bandwidth accounting for quantized LLama (#3930 ) Instead of assuming every parameter is 2 bytes, just add up tensor sizes in bytes	2024-03-25 18:41:05 -04:00
chenyu	4ecd5789ab	#include <tgmath.h> in ops_clang (#3927 ) * different clang sqrt/log2/exp2/sin function based on dtype fixed softmax_argmax issue in #3552 for clang. * tgmath.h * revert those	2024-03-25 17:48:57 -04:00
Arseny Kapoulkine	514c43201d	Fix issues with pointer provenance in load/store through ALU (#3916 ) * Track pointer provenance in load/store through ALU Previously load/store could be incorrectly rendered into ld.global/st.global when the input was an ALU op that performed an address computation with DEFINE_LOCAL on one of the arguments. * Simplify the load provenance workaround The issue is that we can render the same code twice, and on the second run the opstream is already modified so that vin[0] isn't a DEFINE_, which overwrites initially correct .shared with .global. Add a couple tests for basic local use * Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL	2024-03-25 14:41:05 -07:00
chenyu	d651835ef5	verify beautiful_mnist.py eval acc and put into benchmark ci (#3926 ) * verify beautiful_mnist and put in ci * 97.5 for eval verification	2024-03-25 16:47:49 -04:00
chenyu	dc508022a9	clean up clang src header (#3925 ) don't need to define int64 and uchar	2024-03-25 15:18:35 -04:00
uuuvn	2080325e8d	output_buffer isn't used anymore (#3919 )	2024-03-25 16:03:56 +03:00
nimlgen	f2a9ea4ea9	lru allocator for copyin host buffers (#3918 ) * lru allocator for copyin host buffers * linter happy	2024-03-25 15:57:18 +03:00
George Hotz	e0e234bf94	hotfix, str compare version for cuda	2024-03-24 20:35:24 -07:00
Arseny Kapoulkine	715850aef9	Fix sm89 PTX=1 compilation (#3915 ) * Fix sm89 PTX=1 compilation The minimum PTX version that supports sm89 is 7.8 (same version also supports sm90); without this ptxas fails when running tinygrad with PTX=1 on RTX 4090. * Use int(arch[3:]) for forward compat with SM10.0 if that happens	2024-03-24 20:32:29 -07:00
chenyu	83f39a8ceb	env var to change default float (#3902 ) * env var to change default float to fp16 or bf16 looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights. working on default bf16 too. ``` RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined __bf16 cast0 = (nv_bfloat16)(val0); ``` remove that in cifar * DEFAULT_FLOAT * default of default * unit test * don't check default * tests work on linux	2024-03-24 20:33:57 -04:00
George Hotz	03899a74bb	increase atol on reset train	2024-03-24 15:17:31 -07:00
qazal	d8fafca13a	assign regression (#3907 ) * infra * track mutations * assign levels * add seen back * add test * infra 2.0 * add assign targets * dont need levels * delete * Update test_assign.py --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-24 15:12:31 -07:00

10633 Commits