chenyu
18e0cef14d
cheap: fewer lines in ptx ( #3890 )
...
enough to merge lars
2024-03-23 01:12:31 -04:00
George Hotz
f0c4e06ffd
fix cuda sync ( #3888 )
2024-03-22 19:02:30 -07:00
chenyu
2d3ce53348
touchup test_dtype.test_gradient_dtype ( #3887 )
...
add back bad merge from #3613 and add float.double and float.bfloat16 to test
2024-03-22 20:56:45 -04:00
David Hou
fc11808a79
initialize Tensor grad same type as self ( #3613 )
...
* initialize Tensor grad same type as self
* also test different default float
* check dtype + try/finally
* don't test_gradient_dtype if f16 is not supported
* fix bad merge
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-03-22 20:33:18 -04:00
Francis Lam
8db7a6bbcc
debug: add optional detailed BEAM_LOG logging ( #3883 )
...
* debug: add optional detailed BEAM_LOG logging
show uop count, compile and run times for each candidate in search
also add --timing to verify_kernel.py to make it easier to explore
hand-crafted applied opts
* fix linter
2024-03-22 19:23:31 -04:00
chenyu
f7f67e0cc5
simple fix llama shard with quantize ( #3882 )
...
copy scale on all devices for now. naive sharding does not work because the scale needs to expand to really save memory.
70B does not work due to HSA_STATUS_ERROR_OUT_OF_RESOURCES.
`python3 examples/llama.py --gen 2 --size 13B --shard 6 --prompt "Hello." --count 10 --temperature 0 --timing --quantize`
13B on 6 GPUs uses 47 GB vs. 34 GB quantized
2024-03-22 18:15:37 -04:00
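The interesting bit in the commit above is "copy scale on all devices": the large int8 weight is sharded, while the small per-channel scale is replicated. A minimal numpy sketch of that idea (the function names below are made up for illustration and are not tinygrad's API):

```python
import numpy as np

def quantize_int8(w):
  # per-output-channel scale so int8 values span each row's range
  scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
  return np.round(w / scale).astype(np.int8), scale.astype(np.float32)

def shard_with_copied_scale(w_q, scale, n_devices):
  # shard the heavy int8 weight along the output dimension...
  shards = np.array_split(w_q, n_devices, axis=0)
  # ...but copy the tiny scale tensor to every "device" (cheap next to the weight)
  return [(s, scale.copy()) for s in shards]

def dequant_matmul(shard, scale_copy, x, row_offset):
  rows = shard.shape[0]
  return (shard.astype(np.float32) * scale_copy[row_offset:row_offset + rows]) @ x

w, x = np.random.randn(8, 4).astype(np.float32), np.random.randn(4).astype(np.float32)
w_q, scale = quantize_int8(w)
pieces = shard_with_copied_scale(w_q, scale, n_devices=2)
offsets = np.cumsum([0] + [p[0].shape[0] for p in pieces])[:-1]
y = np.concatenate([dequant_matmul(s, sc, x, o) for (s, sc), o in zip(pieces, offsets)])
assert np.allclose(y, w @ x, atol=0.5)   # matches up to quantization error
```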
chenyu
ee502c8055
fixup to_movement_ops and add back to CI ( #3881 )
2024-03-22 18:14:49 -04:00
nimlgen
16e31f7f0d
init multidevice cuda graph ( #3858 )
...
* init multidevice cuda graph
* cuda just works!
* clean
* linter happier
* linters happy
* update transfer inputs
* do not change free
* useless check for cuda
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 13:49:48 -07:00
George Hotz
0c197b9cf3
hotfix: hip bfloat formatting
2024-03-22 11:52:05 -07:00
George Hotz
54dc48aa47
fix assign ( #3878 )
...
* fix assign
* remove terrible optimizer hack
* oops, not realized assigns
2024-03-22 11:48:48 -07:00
Francis Lam
5587594a00
fuzz_linearizer: add --ast and --file params to read kernels ( #3877 )
...
also fix up ast_str_to_str to support the new tuple of LazyOps
2024-03-22 14:27:40 -04:00
chenyu
c5467e5bd6
diverse test value in test_dtype DATA based on dtype ( #3864 )
...
* diverse test value in test_dtype DATA based on dtype
* eh fix typo
* that too?
* PTX does not support i8 and s8
* skip that
* unused line
* put the hack back
* remove that
2024-03-22 14:22:06 -04:00
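A rough sketch of what "test value based on dtype" can look like (the helper and ranges below are invented for illustration, not the actual test_dtype code): unsigned ints get non-negative values, signed ints get values on both sides of zero within range, and floats get negatives and fractions.

```python
import numpy as np

def test_data_for(dtype, n=10, seed=0):
  rng = np.random.default_rng(seed)
  if np.issubdtype(dtype, np.unsignedinteger):
    hi = min(np.iinfo(dtype).max, 255)
    return rng.integers(0, hi + 1, size=n, dtype=dtype)
  if np.issubdtype(dtype, np.integer):
    info = np.iinfo(dtype)
    return rng.integers(max(info.min, -128), min(info.max, 127) + 1, size=n, dtype=dtype)
  # floats: negatives, fractions and zero all show up
  return rng.uniform(-10, 10, size=n).astype(dtype)

for dt in (np.uint8, np.int8, np.int32, np.float16, np.float32):
  print(np.dtype(dt).name, test_data_for(dt))
```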
George Hotz
86ee36e697
preschedule all ( #3875 )
2024-03-22 11:20:06 -07:00
Szymon Ożóg
d8c3f1894a
Use UOpGraph in test ( #3876 )
2024-03-22 14:12:38 -04:00
chenyu
1c51d586ea
replace raise Exception with specific errors ( #3874 )
2024-03-22 12:32:21 -04:00
nimlgen
8ef5490ec8
cuda transfer + async copyin ( #3873 )
2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910
PTX - implement float4, ptr arithmetic and other speed improvements ( #3775 )
...
* ptx float4 implementation
* remove from cache when trimming uops
* Gate for float4
* Linting fix
* disable test reasonable time for ptx
* import getenv
* Update uops.py
* linter
* Add div test for half
* upcast if op does not support operation
* fix offset
* Run only if dtype supported
* zero out registers when accessing by pred + cleanup
* Remove trailing whitespace
* revert
* spacing fix
* move cache clearing outside loop
* did this suddenly start working?
* unused import removed
* Remove cast
* Use pattern matching
* linting
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
George Hotz
f4055439dc
don't include hip common ( #3851 )
...
* don't install hip common
* only that
* Revert "only that"
This reverts commit 85f22015d9.
* less
* needed
* sep comgr
* header file
* 6.0.2
* update hsa
* hsakmt
* Revert "hsakmt"
This reverts commit d3a118078e.
2024-03-22 08:50:50 -07:00
qazal
4a27ce6ec9
tiny version of amd_hip_bfloat16 ( #3868 )
...
* add src_dtype
* add maker
* add bfloat16
* simpler
2024-03-22 08:37:30 -07:00
chenyu
82ce60e172
use JIT_BATCH_SIZE=4 for GPT2 3090 benchmark ( #3870 )
...
smaller first batch saves about 0.05 ms per token; 1.75 ms/tok on a local 3090
2024-03-22 00:40:06 -04:00
qazal
fe6ceff15f
proposal: multioutput JIT spec ( #3856 )
...
* corealize JIT
* requirements
2024-03-21 21:28:30 -07:00
Francis Lam
a26090d404
search: change to use "spawn" and limit the number of tasks per child ( #3862 )
...
also clean up some examples to use __main__ and not initialize
resources outside of main
2024-03-21 21:23:36 -07:00
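A minimal sketch of the multiprocessing pattern the commit above describes (the worker function and numbers are stand-ins, not the actual search code): use the "spawn" start method, cap tasks per worker, and keep setup under the __main__ guard so spawned children don't re-run it.

```python
import multiprocessing as mp

def compile_and_time(candidate: int) -> float:
  # stand-in for compiling and benchmarking one kernel candidate
  return candidate * 0.001

if __name__ == "__main__":
  ctx = mp.get_context("spawn")                               # fresh interpreter per worker
  with ctx.Pool(processes=4, maxtasksperchild=16) as pool:    # recycle workers periodically
    times = pool.map(compile_and_time, range(64))
  print(min(times))
```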
chenyu
dca69df197
hot fix use DEBUG >= 3 for allreduce message ( #3869 )
2024-03-21 23:40:44 -04:00
uuuvn
6729f20aab
Ring allreduce try 2 ( #3852 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
* RING=2 and use ContextVar
* DEBUG >= 2 and change name
* linter
* type
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-21 19:17:51 -04:00
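For reference, ring allreduce itself is a reduce-scatter followed by an allgather around a ring of devices. Below is a self-contained toy that simulates it over numpy chunks; it is a sketch of the algorithm only, not tinygrad's multitensor implementation.

```python
import numpy as np

def ring_allreduce(tensors):
  n = len(tensors)
  chunks = [np.array_split(t.copy(), n) for t in tensors]  # each "device" splits its copy into n chunks
  # reduce-scatter: after n-1 steps, device i holds the full sum of chunk (i+1) % n
  for step in range(n - 1):
    for i in range(n):
      c = (i - step) % n
      chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
  # allgather: circulate the reduced chunks for another n-1 steps
  for step in range(n - 1):
    for i in range(n):
      c = (i + 1 - step) % n
      chunks[(i + 1) % n][c] = chunks[i][c]
  return [np.concatenate(ch) for ch in chunks]

devices = [np.arange(8, dtype=np.float32) + i for i in range(4)]
out = ring_allreduce(devices)
assert all(np.allclose(o, sum(devices)) for o in out)
print(out[0])
```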
Francis Lam
3c0478bfab
fuzz_linearizer: add additional DEBUG info for comparison errors ( #3866 )
2024-03-21 18:58:10 -04:00
chenyu
bc482729d0
lower hlb_cifar acc to 93.3 ( #3865 )
...
ran 30 runs and the lowest I see is 93.35; lowered to 93.3 for now.
maybe re-enable EMA later if it reduces variance
2024-03-21 17:58:53 -04:00
chenyu
e50b7abe4f
diversed buf inputs based on dtype in fuzz_linearizer ( #3863 )
2024-03-21 16:23:11 -04:00
chenyu
c40f78499f
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:23:37 -04:00
chenyu
30fa03243e
reuse fuzz_linearizer.compare_linearizer in test_linearizer_failures ( #3861 )
2024-03-21 14:12:27 -04:00
chenyu
33dd99acf4
remove helper_add_store from test_linearizer_failures ( #3860 )
2024-03-21 12:53:31 -04:00
chenyu
6bf0b82267
alloc new output in fuzz_linearizer between baseline and real one ( #3859 )
...
if the kernel is an assign like `a += 1`, rawbufs[0] is updated twice and gives a false compare error
2024-03-21 11:36:05 -04:00
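A toy illustration of the bug being fixed, in plain numpy rather than the fuzzer itself: if the baseline run and the candidate run share the same output buffer, an assign kernel applies its update twice, so the comparison fails even when both runs were correct.

```python
import numpy as np

def assign_kernel(buf):
  buf += 1            # the kernel's output buffer is also its input

a = np.zeros(4)
assign_kernel(a)                      # baseline run: a == 1
baseline = a.copy()
assign_kernel(a)                      # second run reuses the same buffer: a == 2
print(np.array_equal(baseline, a))    # False -> spurious "compare error"

# the commit's fix: give the second run a fresh copy of the original output
b = np.zeros(4)
assign_kernel(b)
print(np.array_equal(baseline, b))    # True
```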
nimlgen
b78352b423
do not create structs every call in CUDAProgram ( #3855 )
...
* do not create structs in cuda
* fix graph
* linter
* do not exec twice
* fix graph
2024-03-21 17:51:40 +03:00
nimlgen
e5745c1a0d
fix nan on multigpus cuda ( #3854 )
2024-03-21 15:21:55 +03:00
Anurag Lamsal
4e0819e40b
fixing the benchmark not printing in handcode resnet50 opt example ( #3850 )
2024-03-21 00:55:31 -04:00
nimlgen
85691c8e20
fix hsa sync issue ( #3847 )
...
* fix hsa sync issue
* linter
2024-03-21 04:00:30 +03:00
chenyu
f271cd682b
use _resolve_dim in argmax ( #3846 )
...
also added a comment on the behavior when there are multiple maxima, and more tests
2024-03-20 20:17:30 -04:00
chenyu
5c4cf62d2c
fix View.pad arg type ( #3845 )
...
close #3779
2024-03-20 19:36:02 -04:00
Francis Lam
6d5dec2fef
log optimized kernels and a script to compare with non-optimized ones ( #3829 )
...
* search: add BEAM_VERIFY option to validate search results
refactor fuzz_linearizer comparison to allow it to be used for
BEAM_VERIFY in device.py (see the sketch after this entry)
* search: fix to verify the beam_search result and not the fastest
* search: fix typing and clean up
* device: remove imports from test and add LOGKERN options
LOGKERN output can be used with test/external/verify_kernel.py
to validate correctness
* fix example in verify_kernel.py
* cleanup fixes
* fix to use f-strings
2024-03-20 19:22:08 -04:00
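A rough sketch of what verifying a beam-search result amounts to (the helper below is invented for illustration; the real logic lives in the search and fuzz_linearizer code): run the unoptimized and the optimized kernel on the same inputs, and only trust the faster one if the outputs match.

```python
import numpy as np

def verify(run_baseline, run_optimized, inputs, atol=1e-4):
  expected = run_baseline(*inputs)
  got = run_optimized(*inputs)
  return np.allclose(expected, got, atol=atol)

# toy stand-ins for a kernel before and after applied opts
def baseline(x, y): return x @ y
def optimized(x, y): return (x.reshape(4, 2, 8) @ y).reshape(8, 8)  # same math, different shape handling

x, y = np.random.randn(8, 8), np.random.randn(8, 8)
print(verify(baseline, optimized, (x, y)))   # True only if the optimization preserved the result
```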
chenyu
9d1d08fbb0
show llama bandwidth with timing ( #3844 )
2024-03-20 17:19:15 -04:00
chenyu
7ff47e45a1
cifar TARGET_EVAL_ACC_PCT=93.5 ( #3843 )
2024-03-20 16:56:51 -04:00
qazal
92c5067439
conceptual small refactor ( #3842 )
2024-03-20 16:46:14 -04:00
chenyu
519336cfea
factor out partial in SumNode div int ( #3841 )
...
* factor out partial in SumNode div int
* div not rem
* space
2024-03-20 16:34:33 -04:00
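The simplification boils down to ordinary integer arithmetic: terms whose coefficients are divisible by the divisor can be pulled out of a floor division, leaving only the remainder part inside. A tiny numeric check of one such case (SumNode itself is a tinygrad symbolic class; this only demonstrates the arithmetic):

```python
# (4*x + 8*y + 3) // 4 == x + 2*y for any integers x, y, since 4*x + 8*y is
# divisible by 4 and the leftover constant 3 stays within [0, 4)
for x in range(-5, 6):
  for y in range(-5, 6):
    assert (4*x + 8*y + 3) // 4 == x + 2*y
print("ok")
```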
George Hotz
8cb5215885
Revert "Ring allreduce in multitensor ( #3000 )" ( #3840 )
...
This reverts commit c5bf9e4c96.
2024-03-20 11:41:49 -07:00
uuuvn
c5bf9e4c96
Ring allreduce in multitensor ( #3000 )
...
* Ring allreduce v3
* Configurable size, number of gpus and jit in benchmark
* ScheduleBarrier v0
* GB/s that make sense
* ScheduleBarrier v0.1
* Fallback on 2 GPUs
* ScheduleBarrier v0.2
* ScheduleBarrier v0.3
* ScheduleBarrier v0.3.1
* ScheduleBarrier v0.3.2
* Replace ScheduleBarrier with automatic optimization
* unused import
* fix comment
* typing
* better fallback
* python 3.8
---------
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-03-20 11:20:01 -07:00
chenyu
455f7bea9b
test example from half resnet where idx has a value outside of int32 ( #3838 )
...
* test example from half resnet where idx has a value outside of int32
* ruff
2024-03-20 13:44:20 -04:00
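For context on why an index can fall outside int32: with a big enough buffer, the flattened element index exceeds 2**31 - 1 and wraps if the index math stays in 32 bits. The shapes below are hypothetical, not the actual resnet kernel.

```python
import numpy as np

batch, channels, h, w = 256, 2048, 72, 72        # hypothetical large activation
last_idx = batch * channels * h * w - 1          # largest flat index into the buffer
print(last_idx > 2**31 - 1)                      # True: does not fit in int32

idx64 = np.int64(last_idx)
print(idx64, "->", idx64.astype(np.int32))       # the 32-bit cast wraps to a negative offset
```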
chenyu
727de5ba1e
llama 7B on 3090 benchmark ( #3837 )
...
* llama 7B on 3090 benchmark
* symlink llama
2024-03-20 12:48:22 -04:00
qazal
9452994201
add a better error message for resnet training ( #3836 )
...
* add a better error message
* assert
* use FileNotFoundError
2024-03-20 09:22:15 -07:00
chenyu
47b9cc2dfe
use float32 for rand buffer in test_beam_search and test in metal ( #3831 )
2024-03-19 23:22:58 -04:00
chenyu
d17900bc45
use int32 instead of default_int in simplify_phi_loops ( #3828 )
...
* use int32 instead of default_int in simplify_phi_loops
indices are in int32 now and are separated from the buffer dtype. fix #3823
* return early if not supported
* it's not that
* why is it failing for RHIP
2024-03-19 17:49:58 -04:00
nimlgen
2d54e4d747
clean up hsa driver ( #3818 )
...
* clean up driver
* remove returns
2024-03-20 00:17:41 +03:00