* Fix sm89 PTX=1 compilation
The minimum PTX version that supports sm89 is 7.8 (the same version also
supports sm90); without it, ptxas fails when running tinygrad with
PTX=1 on an RTX 4090.
* Use int(arch[3:]) for forward compat with SM10.0 if that happens
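A minimal sketch of the version selection described above; the helper name and the fallback version are illustrative assumptions, not the actual change:

```python
# illustrative helper: pick the PTX ISA version to emit for a given SM arch string
def ptx_version_for(arch: str) -> str:
  cc = int(arch[3:])  # "sm_89" -> 89; int() keeps this working for a future "sm_100"
  # PTX ISA 7.8 is the first version that knows about sm_89 (and sm_90);
  # the 7.5 fallback for older archs is an assumption for this sketch
  return "7.8" if cc >= 89 else "7.5"

assert ptx_version_for("sm_89") == "7.8"  # RTX 4090
assert ptx_version_for("sm_86") == "7.5"
```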
* env var to change default float to fp16 or bf16
Looking for standard names for these; we already have FLOAT16, which does something to IMAGE, and HALF, which converts the weights.
Also working on a bf16 default (see the env-var sketch below); it currently fails to compile:
```
RuntimeError: compile failed: <null>(6): error: identifier "__bf16" is undefined
__bf16 cast0 = (nv_bfloat16)(val0);
```
remove that in cifar
* DEFAULT_FLOAT
* default of default
* unit test
* don't check default
* tests work on linux
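A rough sketch of how a DEFAULT_FLOAT env var could pick the default dtype; it assumes tinygrad exposes dtypes.default_float and the listed float dtypes, and the exact lookup and validation in the real change may differ:

```python
import os
from tinygrad import dtypes  # assumes dtypes exposes float32/half/bfloat16 and default_float

# map the env var value to a dtype; anything outside this table is rejected with a KeyError
_FLOATS = {"FLOAT32": dtypes.float32, "HALF": dtypes.half, "BFLOAT16": dtypes.bfloat16}
dtypes.default_float = _FLOATS[os.getenv("DEFAULT_FLOAT", "FLOAT32")]
```

With something like this, running training with e.g. DEFAULT_FLOAT=HALF would make newly created float tensors half by default.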
* infra
* track mutations
* assign levels
* add seen back
* add test
* infra 2.0
* add assign targets
* dont need levels
* delete
* Update test_assign.py
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* Adjust adds between WHERE and PHI
* Not much better
* undo recursive change
* hm
* iterate over where, not factored op
* oo
* consts only for loop
* Undo var name change
* update
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
* training cifar with BF16 on CUDA
memory usage is between float and half because the numpy calls in dataset preprocessing convert the data to float.
* simpler bf16 functions
* bf16 cifar works for HSA too, just very slow
* simpler bf16 functions, we love cuda
Previously it was incorrectly aliasing 16 into the size-8 upcast on the
store alias; now it splits it properly into 8 and puts the remaining 2
into the correct local stride.
* Fix permutation of result indices in einsum (quick check below).
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
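A quick check of the behavior the einsum fix above is about, assuming Tensor.einsum is available: with output subscripts "ki" the result axes must come back permuted, i.e. equal to (a @ b).T.

```python
import numpy as np
from tinygrad import Tensor

a = Tensor(np.random.rand(2, 3).astype(np.float32))
b = Tensor(np.random.rand(3, 4).astype(np.float32))
# "->ki" asks for the transposed result, so this must match (a @ b).T
out = Tensor.einsum("ij,jk->ki", a, b).numpy()
np.testing.assert_allclose(out, (a.numpy() @ b.numpy()).T, rtol=1e-5)
```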
* initialize Tensor grad with the same dtype as self (minimal check below)
* also test different default float
* check dtype + try/finally
* don't test_gradient_dtype if f16 is not supported
* fix bad merge
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
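A minimal check of the behavior described above (the gradient is created with the tensor's own dtype), assuming a backend with float16 support:

```python
from tinygrad import Tensor, dtypes

x = Tensor([1.0, 2.0, 3.0], dtype=dtypes.half, requires_grad=True)
x.sum().backward()
assert x.grad.dtype == dtypes.half  # grad has the same dtype as x, not the default float
```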
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
copy scale to all devices for now (sketch below); naive sharding does not work because scale needs an expand to really save memory.
70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.
`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`
13B on 6 GPUs uses 47 GB vs. 34 GB quantized
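A hedged sketch of the scale handling described above, assuming Tensor.shard where axis=None copies the tensor to every device; the shapes, dtypes, and device list are illustrative:

```python
from tinygrad import Tensor, dtypes

GPUS = tuple(f"HSA:{i}" for i in range(6))                        # illustrative device list
w_int8 = Tensor.ones(4096, 4096, dtype=dtypes.int8).contiguous()  # stand-in for a quantized weight
scale = Tensor.ones(4096)                                         # per-channel dequant scale

w_int8 = w_int8.shard(GPUS, axis=0)   # the int8 weight is actually split across devices
scale = scale.shard(GPUS, axis=None)  # axis=None just copies scale to all devices for now
```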
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* diverse test values in test_dtype DATA based on dtype (sketch below)
* eh fix typo
* that too?
* PTX does not support i8 and s8
* skip that
* unused line
* put the hack back
* remove that
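A hedged sketch of what picking test DATA per dtype could look like; the helper and the exact values are illustrative, not the test's actual contents:

```python
from tinygrad import dtypes

def data_for(dtype):
  # keep values inside the dtype's representable range so casts stay exact
  if dtype == dtypes.bool: return [True, False]
  if dtypes.is_unsigned(dtype): return [0, 1, 2, 127]
  if dtypes.is_int(dtype): return [-128, -1, 0, 1, 127]
  return [-3.5, -1.0, 0.0, 0.5, 3.5]  # floats, exactly representable in half/bfloat16
```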
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if op does not support operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>