tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-05 12:15:05 -05:00

Author	SHA1	Message	Date
Szymon Ożóg	2d0bfdf01c	ptx cleanup (#3893 )	2024-03-24 14:54:45 -07:00
chenyu	2e39f57594	move lines around in ops_python wmma (#3911 )	2024-03-24 17:14:26 -04:00
Patrick Tsai	e27129a798	Fix linearizer failure 26 test (#3906 ) * Adjust adds between WHERE and PHI * Not much better * undo recursive change * hm * iterate over where, not factored op * oo * consts only for loop * UNdo var name change * update --------- Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>	2024-03-24 16:34:13 -04:00
chenyu	10673d1447	tiny search cleanup (#3910 ) * tiny search cleanup removed some `assert isinstance(dev, Compiled)` and lines * remove import	2024-03-24 14:20:55 -04:00
wozeparrot	9a9cac58f9	add lars to nn (#3750 ) * feat: add lars * feat: don't remove this comment * clean: smaller diff * clean: shorter line * feat: remove mlperf lars, switch resnet * fix: fully remove mlperf lars * clean: comment * feat: contiguous * feat: no weight decay on skip params * feat: optimizergroup * feat: classic momentum * fix: pylint * clean: move comment * fix: correct algo * feat: lrschedulergroup * feat: skip list tests * feat: :\| forgot that params are a thing * feat: remove skip_list params from main params * feat: set moment --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-24 11:43:12 -04:00
chenyu	8c8b57fd5f	cleanup ops python (#3908 ) i just want to merge lars!	2024-03-24 11:36:31 -04:00
chenyu	2c69888654	include negative float in test_dtype (#3884 ) * include negative float in test_dtype * that is ub * too annoying * pack can overflow	2024-03-24 02:39:15 -04:00
chenyu	e22d78b3d2	training cifar with BF16 on CUDA (#3905 ) * training cifar with BF16 on CUDA memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float. * simpler bf16 functions * bf16 cifar works for HSA too just very slow * simpler bf16 functions, we love cuda	2024-03-24 01:37:47 -04:00
Francis Lam	0145366323	wmma: fix the AMD TC threads to split the first 16 threads (#3904 ) previously it was incorrectly aliasing 16 into the size 8 upcast on the store alias. now it splits it properly into 8 and the remaining 2 into the correct local stride	2024-03-23 21:17:42 -04:00
sekstini	7c3632fd1e	add --minimal flag to nvrtc (#3899 )	2024-03-23 16:38:31 -07:00
chenyu	a2b2597fc2	replace dtype.name str with render_dtype (#3903 ) fixed some bf16 cast issue since it does not have `.name`. also more robust if there are lang specific type override	2024-03-23 19:25:48 -04:00
chenyu	24d004a89b	hotfix check ckpts before writing achieved model (#3901 ) this killed tinybox green run	2024-03-23 17:16:38 -04:00
chenyu	4d566f12b1	touchup einsum (#3900 ) don't need rhs_letters	2024-03-23 16:46:39 -04:00
Alejandro F Queiruga	556dcfb8f2	Fix the result permutation in einsum (#3895 ) * Fix permutation of result indices in einsum. * Delete stray line used for breaking tests * Fix linter error by renaming twice-used variable --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-23 15:48:19 -04:00
nimlgen	4e18dd78d3	faster program start in llvm (#3897 )	2024-03-23 15:20:15 +03:00
George Hotz	46a3501cec	nv ioctl sniffer (#3892 ) * nv ioctl sniffer * unused import * Update __init__.py * that work * that fix it	2024-03-23 00:29:30 -07:00
chenyu	18e0cef14d	cheap less lines in ptx (#3890 ) enought to merge lars	2024-03-23 01:12:31 -04:00
George Hotz	f0c4e06ffd	fix cuda sync (#3888 )	2024-03-22 19:02:30 -07:00
chenyu	2d3ce53348	touchup test_dtype.test_gradient_dtype (#3887 ) add back bad merge from #3613 and add float.double and float.bfloat16 to test	2024-03-22 20:56:45 -04:00
David Hou	fc11808a79	initialize Tensor grad same type as self (#3613 ) * initialize Tensor grad same type as self * also test different default float * check dtype + try/finally * don't test_gradient_dtype if f16 is not supported * fix bad merge --------- Co-authored-by: chenyu <chenyu@fastmail.com>	2024-03-22 20:33:18 -04:00
Francis Lam	8db7a6bbcc	debug: add optional detailed BEAM_LOG logging (#3883 ) * debug: add optional detailed BEAM_LOG logging show uop count, compile and run times for each candidate in search also add --timing to verify_kernel.py to make it easier to explore hand-crafted applied opts * fix linter	2024-03-22 19:23:31 -04:00
chenyu	f7f67e0cc5	simple fix llama shard with quantize (#3882 ) copy scale on all device for now. naive sharding does not work because scale needs expand to really save memory. 70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES. `python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize` 13B on 6 gpus uses 47 GB v.s. 34 GB quantized	2024-03-22 18:15:37 -04:00
chenyu	ee502c8055	fixup to_movement_ops and add back to CI (#3881 )	2024-03-22 18:14:49 -04:00
nimlgen	16e31f7f0d	init multidevice cuda graph (#3858 ) * init multidevice cuda graph * cuda just works! * clean * linter happier * liners happy * update transfer inputs * do not change free * useless check for cuda --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-22 13:49:48 -07:00
George Hotz	0c197b9cf3	hotfix: hip bfloat formatting	2024-03-22 11:52:05 -07:00
George Hotz	54dc48aa47	fix assign (#3878 ) * fix assign * remove terrible optimizer hack * oops, not realized assigns	2024-03-22 11:48:48 -07:00
Francis Lam	5587594a00	fuzz_linearizer: add --ast and --file params to read kernels (#3877 ) also fix up ast_str_to_str to support the new tuple of LazyOps	2024-03-22 14:27:40 -04:00
chenyu	c5467e5bd6	diverse test value in test_dtype DATA based on dtype (#3864 ) * diverse test value in test_dtype DATA based on dtype * eh fix typo * that too? * PTX does not support i8 and s8 * skip that * unused line * pus the hack back * remove that	2024-03-22 14:22:06 -04:00
George Hotz	86ee36e697	preschedule all (#3875 )	2024-03-22 11:20:06 -07:00
Szymon Ożóg	d8c3f1894a	Use UOpGraph in test (#3876 )	2024-03-22 14:12:38 -04:00
chenyu	1c51d586ea	replace raise Exception with specific errors (#3874 )	2024-03-22 12:32:21 -04:00
nimlgen	8ef5490ec8	cuda tranfer + async copyin (#3873 )	2024-03-22 09:01:37 -07:00
Szymon Ożóg	624bc89910	PTX - implement float 4, ptr arithmetics and other speed improvements (#3775 ) * ptx float4 implementation * remove from cache when trimming uops * Gate for float4 * Linting fix * disable test reasonable time for ptx * import getenv * Update uops.py * linter * Add div test for half * upcast if op does not support operation * fix offset * Run only if dtype supported * zero out registers when accessing by pred + cleanup * Remove trailing whitespace * revert * spacing fix * move cache clearing outside loop * did this suddenly start working? * unused import removed * Remove cast * Use pattern matching * linting --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-03-22 08:54:02 -07:00
George Hotz	f4055439dc	don't include hip common (#3851 ) * don't install hip common * only that * Revert "only that" This reverts commit `85f22015d9`. * less * needed * sep comgr * header file * 6.0.2 * update hsa * hsakmt * Revert "hsakmt" This reverts commit `d3a118078e`.	2024-03-22 08:50:50 -07:00
qazal	4a27ce6ec9	tiny version of amd_hip_bfloat16 (#3868 ) * add src_dtype * add maker * add bfloat16 * simpler	2024-03-22 08:37:30 -07:00
chenyu	82ce60e172	use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark (#3870 ) smaller first batch saves about 0.05 ms per token. 1.75ms / tok on local 3090	2024-03-22 00:40:06 -04:00
qazal	fe6ceff15f	proposal: multioutput JIT spec (#3856 ) * corealize JIT * requirements	2024-03-21 21:28:30 -07:00
Francis Lam	a26090d404	search: change to use "spawn" and limit the number of tasks per child (#3862 ) also clean up some examples to use __main__ and not initialize resources outside of main	2024-03-21 21:23:36 -07:00
chenyu	dca69df197	hot fix use DEBUG >= 3 for allreduce message (#3869 )	2024-03-21 23:40:44 -04:00
uuuvn	6729f20aab	Ring allreduce try 2 (#3852 ) * Ring allreduce v3 * Configurable size, number of gpus and jit in benchmark * ScheduleBarrier v0 * GB/s that make sense * ScheduleBarrier v0.1 * Fallback on 2 GPUs * ScheduleBarrier v0.2 * ScheduleBarrier v0.3 * ScheduleBarrier v0.3.1 * ScheduleBarrier v0.3.2 * Replace ScheduleBarrier with automatic optimization * unused import * fix comment * typing * better fallback * python 3.8 * RING=2 and use ContextVar * DEBUG >= 2 and change name * linter * type --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: chenyu <chenyu@fastmail.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-03-21 19:17:51 -04:00
Francis Lam	3c0478bfab	fuzz_linearizer: add additional DEBUG info for comparison errors (#3866 )	2024-03-21 18:58:10 -04:00
chenyu	bc482729d0	lower hlb_cifar acc to 93.3 (#3865 ) ran 30 runs and the lowest i see is 93.35. lowered to 93.3 for now. maybe reenable ema later if it reduces variance	2024-03-21 17:58:53 -04:00
chenyu	e50b7abe4f	diversed buf inputs based on dtype in fuzz_linearizer (#3863 )	2024-03-21 16:23:11 -04:00
chenyu	c40f78499f	reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861 )	2024-03-21 14:23:37 -04:00
chenyu	30fa03243e	reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures (#3861 )	2024-03-21 14:12:27 -04:00
chenyu	33dd99acf4	remove helper_add_store from test_linearizer_failures (#3860 )	2024-03-21 12:53:31 -04:00
chenyu	6bf0b82267	alloc new output in fuzz_linearizer between baseline and real one (#3859 ) if the kernel is an assign `a += 1`, the rawbufs[0] is updated twice and gives false compare_error	2024-03-21 11:36:05 -04:00
nimlgen	b78352b423	do not create structs every call in CUDAProgram (#3855 ) * do not create structs in cuda * fix graph * linter * do not exec twice * fix graph	2024-03-21 17:51:40 +03:00
nimlgen	e5745c1a0d	fix nan on multigpus cuda (#3854 )	2024-03-21 15:21:55 +03:00
Anurag Lamsal	4e0819e40b	fixing the benchmark not printing in handcode resnet50 opt example (#3850 )	2024-03-21 00:55:31 -04:00

... 133 134 135 136 137 ...

10633 Commits