Commit Graph

3957 Commits

Author SHA1 Message Date
George Hotz
629cbc5587 only abstractions 2 (#3947) 2024-03-26 20:02:18 -07:00
chenyu
77589bc7a5 rename Scalar to ConstType and cast_scalar to as_const (#3946)
prereq cleanup to make the const arg the same Python type as its dtype
2024-03-26 22:39:58 -04:00
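
A minimal sketch of the idea in #3946 above: coerce a const argument to the Python type that matches its dtype. The helper below is hypothetical and standard-library only, not the tinygrad code.

```
from typing import Union

ConstType = Union[float, int, bool]  # the sort of alias the rename above refers to

def as_const(val: ConstType, is_int: bool = False, is_bool: bool = False) -> ConstType:
  # coerce the const arg to the Python type matching its dtype
  if is_bool: return bool(val)
  return int(val) if is_int else float(val)

assert as_const(3.0, is_int=True) == 3 and isinstance(as_const(3.0, is_int=True), int)
```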
uuuvn
d6d902afe9 wtf (#3944) 2024-03-26 17:49:28 -07:00
Francis Lam
5530b0cbed fuzz_linearizer: reduce debug verbosity and make easier for CI usage (#3942)
* fuzz_linearizer: reduce debug verbosity and make easier for CI usage

* rename FUZZ_BEAM to FUZZ_ALL_ACTIONS (not choosing a subset)
* skip simple ASTs (easier to use with LOGOPS output)
* don't fuzz a previously seen AST
* add options to allow non-zero --expected-failures

* clean up naming and use set
2024-03-26 16:25:24 -04:00
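
A hypothetical sketch of the CLI shape described in the bullets of #3942: the --expected-failures flag and the "don't fuzz a previously seen AST" set come from the commit, everything else (function names, the stubbed fuzz step) is illustrative.

```
import argparse

def fuzz_one(ast) -> int: return 0          # stand-in: returns 1 on failure

def run(asts, argv=None) -> int:
  parser = argparse.ArgumentParser()
  parser.add_argument("--expected-failures", type=int, default=0,
                      help="number of failures that still counts as a passing run")
  args = parser.parse_args(argv)
  seen, failures = set(), 0
  for ast in asts:
    if ast in seen: continue                # don't fuzz a previously seen AST
    seen.add(ast)
    failures += fuzz_one(ast)
  return 0 if failures <= args.expected_failures else 1

print(run(["ast_a", "ast_a", "ast_b"], argv=["--expected-failures", "1"]))
```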
chenyu
8df6587c41 hotfix 97.3 for beautiful_mnist (#3941) 2024-03-26 15:02:53 -04:00
chenyu
b1e3817e18 correctly handle Tensor.rand when default_float = bf16 (#3940)
always casting to float32 makes the default half dtype slow
2024-03-26 14:56:16 -04:00
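
A small usage check related to #3940, assuming a working tinygrad install; nothing is realized here, it only confirms the requested dtype is kept.

```
from tinygrad import Tensor, dtypes

x = Tensor.rand(4, 4, dtype=dtypes.bfloat16)  # should stay in the requested dtype, no float32 detour
print(x.dtype)
```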
chenyu
f6ff76be21 check only upcast int amount in upcasted_axis (#3938)
fixed typing and fixed #3932
2024-03-26 12:54:57 -04:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
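
A simplified sketch of the options plumbing in #3934; the class and field names below are assumptions, not the tinygrad allocator API.

```
from dataclasses import dataclass

@dataclass(frozen=True)
class BufferOptions:
  host: bool = False      # e.g. pinned host memory
  uncached: bool = False

class ExampleAllocator:
  def alloc(self, size: int, options: BufferOptions = BufferOptions()):
    return self._alloc(size, options)
  def free(self, buf, size: int, options: BufferOptions = BufferOptions()):
    self._free(buf, options)
  # backends override these two and honor the options they care about
  def _alloc(self, size, options): return bytearray(size)
  def _free(self, buf, options): pass

buf = ExampleAllocator().alloc(16, BufferOptions(host=True))
```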
nimlgen
739f47eb0f check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
George Hotz
778d17fbd3 intel matmul (#3830)
* almost right

* intel xmx
2024-03-25 22:37:20 -07:00
chenyu
ef537672bf bf16 support in metal (#3929)
it runs if the device GPU supports bfloat. updated the CI benchmark too
2024-03-25 23:17:36 -04:00
chenyu
72d617a37d opencl on OSX does not support fp16 extension (#3931)
running `GPU=1 python -m pytest -rA test/test_dtype.py::TestHalfDtype::test_casts_from` on mac would fail.
2024-03-25 19:50:17 -04:00
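
One way to check the extension #3931 is about, sketched with pyopencl (an assumption; tinygrad's GPU backend uses its own bindings):

```
import pyopencl as cl

for plat in cl.get_platforms():
  for dev in plat.get_devices():
    # half kernels need this extension; on macOS OpenCL it is typically absent
    print(dev.name, "cl_khr_fp16" in dev.extensions)
```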
Arseny Kapoulkine
cb6e7b57a6 examples: Fix parameter bandwidth accounting for quantized LLama (#3930)
Instead of assuming every parameter is 2 bytes, just add up tensor sizes
in bytes
2024-03-25 18:41:05 -04:00
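
A sketch of the accounting change in #3930: sum each tensor's actual bytes instead of assuming 2 bytes per parameter. The helper is hypothetical and uses numpy for illustration.

```
import numpy as np

def model_bytes(params) -> int:
  # sum the actual storage of every tensor instead of assuming 2 bytes per element
  return sum(t.nbytes for t in params.values())

params = {"w_fp16": np.zeros((4, 4), dtype=np.float16), "w_int8": np.zeros((4, 4), dtype=np.int8)}
print(model_bytes(params))  # 32 + 16 = 48 bytes, not 2 * 32
```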
chenyu
4ecd5789ab #include <tgmath.h> in ops_clang (#3927)
* different clang sqrt/log2/exp2/sin function based on dtype

fixed softmax_argmax issue in #3552 for clang.

* tgmath.h

* revert those
2024-03-25 17:48:57 -04:00
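
For context on #3927: `<tgmath.h>` provides type-generic math macros, so the renderer does not need to pick a dtype-specific libm function. The table below only illustrates what that avoids; it is not tinygrad's renderer.

```
# without <tgmath.h>, something like this per-dtype table would be needed;
# with it, plain sqrt/sin/exp2/log2 dispatch on the argument type in C
LIBM_BY_DTYPE = {
  "float":  {"sqrt": "sqrtf", "sin": "sinf", "exp2": "exp2f", "log2": "log2f"},
  "double": {"sqrt": "sqrt",  "sin": "sin",  "exp2": "exp2",  "log2": "log2"},
}
print(LIBM_BY_DTYPE["float"]["sqrt"])  # -> sqrtf
```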
Arseny Kapoulkine
514c43201d Fix issues with pointer provenance in load/store through ALU (#3916)
* Track pointer provenance in load/store through ALU

Previously load/store could be incorrectly rendered into
ld.global/st.global when the input was an ALU op that performed an
address computation with DEFINE_LOCAL on one of the arguments.

* Simplify the load provenance workaround

The issue is that we can render the same code twice, and on the second
run the opstream is already modified so that vin[0] isn't a DEFINE_*,
which overwrites the initially correct .shared with .global.

* Add a couple tests for basic local use

* Skip local tests on LLVM since it doesn't implement DEFINE_LOCAL
2024-03-25 14:41:05 -07:00
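
A conceptual sketch of the provenance idea in #3916: walk the address computation back to its DEFINE_* op and pick `.shared` vs `.global` from that. Op names mirror tinygrad UOps, but the structure is simplified and not the PTX renderer.

```
from dataclasses import dataclass, field

@dataclass
class UOp:
  op: str                  # e.g. "DEFINE_GLOBAL", "DEFINE_LOCAL", "ALU", "CONST"
  vin: tuple = field(default_factory=tuple)

def mem_space(addr: UOp) -> str:
  u = addr
  while u.op == "ALU" and u.vin:                       # walk back through the address arithmetic
    u = next((v for v in u.vin if v.op.startswith("DEFINE_")), u.vin[0])
  return ".shared" if u.op == "DEFINE_LOCAL" else ".global"

local = UOp("DEFINE_LOCAL")
addr = UOp("ALU", (local, UOp("CONST")))
print(mem_space(addr))  # .shared, not .global
```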
chenyu
d651835ef5 verify beautiful_mnist.py eval acc and put into benchmark ci (#3926)
* verify beautiful_mnist and put in ci

* 97.5 for eval verification
2024-03-25 16:47:49 -04:00
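
A sketch of the CI gate idea in #3926; the env var name and helper are hypothetical, the real check lives in examples/beautiful_mnist.py.

```
import os

def check_eval_acc(acc_pct: float):
  # hypothetical env var: the gate only fires when a target is set
  target = float(os.getenv("TARGET_EVAL_ACC_PCT", "0"))
  assert acc_pct >= target, f"eval accuracy {acc_pct:.2f}% is below target {target:.2f}%"

check_eval_acc(97.6)  # passes unless TARGET_EVAL_ACC_PCT is set above 97.6
```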
chenyu
dc508022a9 clean up clang src header (#3925)
don't need to define int64 and uchar
2024-03-25 15:18:35 -04:00
uuuvn
2080325e8d output_buffer isn't used anymore (#3919) 2024-03-25 16:03:56 +03:00
nimlgen
f2a9ea4ea9 lru allocator for copyin host buffers (#3918)
* lru allocator for copyin host buffers

* linter happy
2024-03-25 15:57:18 +03:00
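
A simplified sketch of the reuse idea in #3918: keep freed host staging buffers in per-size free lists so repeat copyins of the same size skip allocation. Not tinygrad's LRUAllocator.

```
from collections import defaultdict

class HostBufferCache:
  def __init__(self): self.free_bufs = defaultdict(list)          # size -> idle buffers
  def alloc(self, size: int) -> bytearray:
    return self.free_bufs[size].pop() if self.free_bufs[size] else bytearray(size)
  def release(self, buf: bytearray):
    self.free_bufs[len(buf)].append(buf)                          # keep for the next copyin of this size

cache = HostBufferCache()
a = cache.alloc(1024); cache.release(a)
assert cache.alloc(1024) is a                                     # reused, no fresh allocation
```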
George Hotz
e0e234bf94 hotfix, str compare version for cuda 2024-03-24 20:35:24 -07:00
Arseny Kapoulkine
715850aef9 Fix sm89 PTX=1 compilation (#3915)
* Fix sm89 PTX=1 compilation

The minimum PTX version that supports sm89 is 7.8 (same version also
supports sm90); without this ptxas fails when running tinygrad with
PTX=1 on RTX 4090.

* Use int(arch[3:]) for forward compat with SM10.0 if that happens
2024-03-24 20:32:29 -07:00
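
A sketch of the version selection in #3915. Only the sm_89 -> PTX ISA 7.8 pairing and the `int(arch[3:])` trick come from the commit; the fallback version is an assumption.

```
def ptx_version(arch: str) -> str:
  num = int(arch[3:])                    # "sm_89" -> 89; also works if a 3-digit arch ever shows up
  return "7.8" if num >= 89 else "7.5"   # fallback version is an assumption

print(ptx_version("sm_89"))              # 7.8, the minimum ISA that ptxas accepts for sm_89
```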
chenyu
83f39a8ceb env var to change default float (#3902)
* env var to change default float to fp16 or bf16

looking for standard names for these. we have FLOAT16 that does something to IMAGE and HALF to convert weights.

working on default bf16 too.
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
    __bf16 cast0 = (nv_bfloat16)(val0);
```

remove that in cifar

* DEFAULT_FLOAT

* default of default

* unit test

* don't check default

* tests work on linux
2024-03-24 20:33:57 -04:00
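
A standalone sketch of the env-var plumbing in #3902. `DEFAULT_FLOAT` is the variable the commit adds; the accepted names and the helper below are illustrative.

```
import os

SUPPORTED = {"FLOAT32": "float32", "HALF": "float16", "BFLOAT16": "bfloat16"}  # names are illustrative

def default_float() -> str:
  name = os.getenv("DEFAULT_FLOAT", "FLOAT32").upper()
  assert name in SUPPORTED, f"unsupported default float {name}"
  return SUPPORTED[name]

print(default_float())  # e.g. DEFAULT_FLOAT=HALF python script.py -> float16
```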
George Hotz
03899a74bb increase atol on reset train 2024-03-24 15:17:31 -07:00
qazal
d8fafca13a assign regression (#3907)
* infra

* track mutations

* assign levels

* add seen back

* add test

* infra 2.0

* add assign targets

* dont need levels

* delete

* Update test_assign.py

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-24 15:12:31 -07:00
Szymon Ożóg
2d0bfdf01c ptx cleanup (#3893) 2024-03-24 14:54:45 -07:00
chenyu
2e39f57594 move lines around in ops_python wmma (#3911) 2024-03-24 17:14:26 -04:00
Patrick Tsai
e27129a798 Fix linearizer failure 26 test (#3906)
* Adjust adds between WHERE and PHI

* Not much better

* undo recursive change

* hm

* iterate over where, not factored op

* oo

* consts only for loop

* UNdo var name change

* update

---------

Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
2024-03-24 16:34:13 -04:00
chenyu
10673d1447 tiny search cleanup (#3910)
* tiny search cleanup

removed some `assert isinstance(dev, Compiled)` and lines

* remove import
2024-03-24 14:20:55 -04:00
wozeparrot
9a9cac58f9 add lars to nn (#3750)
* feat: add lars

* feat: don't remove this comment

* clean: smaller diff

* clean: shorter line

* feat: remove mlperf lars, switch resnet

* fix: fully remove mlperf lars

* clean: comment

* feat: contiguous

* feat: no weight decay on skip params

* feat: optimizergroup

* feat: classic momentum

* fix: pylint

* clean: move comment

* fix: correct algo

* feat: lrschedulergroup

* feat: skip list tests

* feat: :| forgot that params are a thing

* feat: remove skip_list params from main params

* feat: set moment

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-24 11:43:12 -04:00
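
A minimal numpy sketch of a LARS step with classic momentum and a skip list that bypasses weight decay, as described in the bullets of #3750. Illustrative only, not the tinygrad nn optimizer.

```
import numpy as np

def lars_step(w, g, m, lr=0.1, momentum=0.9, wd=1e-4, trust=0.001, skip=False):
  if skip:                                        # skip-list params: no weight decay, no trust ratio
    local_lr, g_eff = lr, g
  else:
    w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
    ratio = trust * w_norm / (g_norm + wd * w_norm + 1e-12) if w_norm > 0 and g_norm > 0 else 1.0
    local_lr, g_eff = lr * ratio, g + wd * w
  m[:] = momentum * m + local_lr * g_eff          # classic momentum buffer
  return w - m

w, g, m = np.ones(4), np.full(4, 0.5), np.zeros(4)
print(lars_step(w, g, m))
```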
chenyu
8c8b57fd5f cleanup ops python (#3908)
i just want to merge lars!
2024-03-24 11:36:31 -04:00
chenyu
2c69888654 include negative float in test_dtype (#3884)
* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow
2024-03-24 02:39:15 -04:00
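
An illustration of the "pack can overflow" note in #3884: packing a value that does not fit in float16 raises.

```
import struct

try:
  struct.pack("e", 1e30)   # 1e30 does not fit in a float16
except OverflowError as e:
  print("OverflowError:", e)
```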
chenyu
e22d78b3d2 training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls in dataset preprocessing, which convert to float.

* simpler bf16 functions

* bf16 cifar works for HSA too, just very slow

* simpler bf16 functions, we love cuda
2024-03-24 01:37:47 -04:00
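
A sketch of the kind of helper the "simpler bf16 functions" in #3905 refers to: a float32 <-> bfloat16 round trip by bit manipulation (truncation, not rounding). Shown with numpy; not the CUDA/HSA code in the commit.

```
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
  # keep the top 16 bits of the float32 representation (truncation)
  return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
  # put the bf16 bits back into the high half of a float32
  return (b.astype(np.uint32) << 16).astype(np.uint32).view(np.float32)

x = np.array([1.5, -2.25, 3.14159], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # close to x, at bf16 precision
```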
Francis Lam
0145366323 wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it was incorrectly aliasing 16 into the size-8 upcast
on the store alias. now it splits it properly into 8 and the
remaining 2 into the correct local stride
2024-03-23 21:17:42 -04:00
sekstini
7c3632fd1e add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
chenyu
a2b2597fc2 replace dtype.name str with render_dtype (#3903)
fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides
2024-03-23 19:25:48 -04:00
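
A toy illustration of why #3903 renders through a method instead of `dtype.name`: backends can override names such as bf16, which has no portable C spelling. The renderer class below is hypothetical; the CUDA spelling follows the error message quoted in #3902 above.

```
class ToyRenderer:
  # per-backend overrides; bf16 needs one because there is no generic C name for it
  type_map = {"bfloat16": "nv_bfloat16", "half": "half", "float": "float"}
  def render_dtype(self, dtype_name: str) -> str:
    return self.type_map.get(dtype_name, dtype_name)   # fall back to the generic name

print(ToyRenderer().render_dtype("bfloat16"))  # nv_bfloat16, instead of relying on dtype.name
```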
chenyu
24d004a89b hotfix check ckpts before writing achieved model (#3901)
this killed tinybox green run
2024-03-23 17:16:38 -04:00
chenyu
4d566f12b1 touchup einsum (#3900)
don't need rhs_letters
2024-03-23 16:46:39 -04:00
Alejandro F Queiruga
556dcfb8f2 Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-23 15:48:19 -04:00
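
A quick reference check of the behavior #3895 fixes: the result axes must follow the output subscripts' order. numpy is used here as the reference implementation.

```
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)
out = np.einsum("ij,jk->ki", a, b)   # note the transposed output spec
assert out.shape == (4, 2)           # axes ordered as "ki", not in contraction order
```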
nimlgen
4e18dd78d3 faster program start in llvm (#3897) 2024-03-23 15:20:15 +03:00
George Hotz
46a3501cec nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
chenyu
18e0cef14d cheap less lines in ptx (#3890)
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz
f0c4e06ffd fix cuda sync (#3888) 2024-03-22 19:02:30 -07:00
chenyu
2d3ce53348 touchup test_dtype.test_gradient_dtype (#3887)
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79 initialize Tensor grad same type as self (#3613)
* initialize Tensor grad same type as self

* also test different default float

* check dtype + try/finally

* don't test_gradient_dtype if f16 is not supported

* fix bad merge

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
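
A smoke-test sketch of the property #3613 enforces (assumes a working tinygrad install on a device with fp16 support):

```
from tinygrad import Tensor, dtypes

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.float16, requires_grad=True)
x.sum().backward()
assert x.grad.dtype == x.dtype   # grad should match the tensor's dtype, not the default float
```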
Francis Lam
8db7a6bbcc debug: add optional detailed BEAM_LOG logging (#3883)
* debug: add optional detailed BEAM_LOG logging

show uop count, compile and run times for each candidate in search

also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts

* fix linter
2024-03-22 19:23:31 -04:00
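
A bare-bones sketch of the per-candidate timing that BEAM_LOG in #3883 prints; the structure and names are assumptions, not tinygrad's search code.

```
import time

def time_candidate(compile_fn, run_fn):
  t0 = time.perf_counter(); prog = compile_fn(); t1 = time.perf_counter()
  run_fn(prog);             t2 = time.perf_counter()
  return {"compile_ms": (t1 - t0) * 1e3, "run_ms": (t2 - t1) * 1e3}

print(time_candidate(lambda: object(), lambda prog: None))
```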
chenyu
f7f67e0cc5 simple fix llama shard with quantize (#3882)
copy scale to all devices for now. naive sharding does not work because scale needs an expand to really save memory.

70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.

`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`

13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
chenyu
ee502c8055 fixup to_movement_ops and add back to CI (#3881) 2024-03-22 18:14:49 -04:00
nimlgen
16e31f7f0d init multidevice cuda graph (#3858)
* init multidevice cuda graph

* cuda just works!

* clean

* linter happier

* liners happy

* update transfer inputs

* do not change free

* useless check for cuda

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3 hotfix: hip bfloat formatting 2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47 fix assign (#3878)
* fix assign

* remove terrible optimizer hack

* oops, not realized assigns
2024-03-22 11:48:48 -07:00