* tensor cores
* Merge from master
* faster program start in llvm (#3897)
* Fix the result permutation in einsum (#3895)
* Fix permutation of result indices in einsum.
* Delete stray line used for breaking tests
* Fix linter error by renaming twice-used variable
---------
Co-authored-by: chenyu <chenyu@fastmail.com>
* touchup einsum (#3900)
don't need rhs_letters
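As a quick illustration of the einsum permutation fix above, here is a minimal sketch (assuming `Tensor.einsum` with numpy as the reference) where the output subscripts are a permutation of the natural result order:

```python
# minimal sketch: output subscripts "ki" permute the natural "ik" result order,
# which is the kind of case the permutation fix covers; numpy is the reference
from tinygrad import Tensor
import numpy as np

a, b = Tensor.rand(2, 3), Tensor.rand(3, 4)
out = Tensor.einsum("ij,jk->ki", a, b)              # shape (4, 2), i.e. (a @ b).T
ref = np.einsum("ij,jk->ki", a.numpy(), b.numpy())
assert np.allclose(out.numpy(), ref, atol=1e-5)
```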
* hotfix check ckpts before writing achieved model (#3901)
this killed the tinybox green run
* replace dtype.name str with render_dtype (#3903)
fixed a bf16 cast issue, since it does not have `.name`.
also more robust if there are language-specific type overrides
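A hypothetical sketch of the idea (not this PR's actual `render_dtype`): keep the dtype-to-source-string mapping in the renderer instead of relying on `dtype.name`, so a backend can override language-specific spellings such as bf16. The mapping contents here are assumptions.

```python
# hypothetical sketch, not the PR's implementation: the renderer owns the
# dtype -> source-string mapping, so backend-specific names override dtype.name
from tinygrad.dtype import dtypes, DType

CUDA_TYPE_MAP: dict[DType, str] = {dtypes.bfloat16: "__nv_bfloat16", dtypes.half: "half"}  # assumed names

def render_dtype(dt: DType) -> str:
  # fall back to the generic name only when no backend override exists
  return CUDA_TYPE_MAP.get(dt, dt.name)
```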
* add --minimal flag to nvrtc (#3899)
* wmma: fix the AMD TC threads to split the first 16 threads (#3904)
previously it incorrectly aliased all 16 threads into the size-8 upcast
on the store alias. now it splits them properly into 8, with the
remaining 2 mapped to the correct local stride
* training cifar with BF16 on CUDA (#3905)
* training cifar with BF16 on CUDA
memory usage is between float and half because numpy calls in dataset preprocessing convert to float (see the sketch below).
* simpler bf16 functions
* bf16 cifar works for HSA too, just very slowly
* simpler bf16 functions, we love cuda
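A minimal sketch of the memory note above, assuming a numpy-preprocessed batch: numpy has no bfloat16, so data arrives as float32 and is cast on-device, which is why peak memory lands between float and half.

```python
# numpy preprocessing yields float32; the cast to bfloat16 happens on-device
from tinygrad import Tensor, dtypes
import numpy as np

batch = np.random.rand(64, 3, 32, 32).astype(np.float32)   # preprocessing output
x = Tensor(batch).cast(dtypes.bfloat16)                     # cast for BF16 training
print(x.dtype)                                              # dtypes.bfloat16
```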
* include negative float in test_dtype (#3884)
* include negative float in test_dtype
* that is UB
* too annoying
* pack can overflow
* add to benchmark
* change var name to satisfy mypy
* spacing
* Update to new TensorCore format
* Spacing
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
* mulacc
* Move more stuff to pattern matcher
* disable callables in the == check
* disable function passing in pattern matcher
* Add set of dtypes pattern matching + refactor mulacc pattern
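A hypothetical standalone sketch of the mulacc idea (not tinygrad's PatternMatcher API): match an ADD whose operand is a MUL, restrict the match to an assumed set of dtypes, and rewrite the pair into a single MULACC. Op names, the dtype set, and the UOp shape are all assumptions.

```python
# hypothetical standalone sketch; op names, dtype set, and UOp layout are assumptions
from dataclasses import dataclass

@dataclass(frozen=True)
class UOp:
  op: str
  dtype: str
  src: tuple = ()

FLOATS = {"float32", "half", "bfloat16"}  # assumed dtype set for the match

def fold_mulacc(u: UOp) -> UOp:
  # ADD(MUL(a, b), c) -> MULACC(a, b, c) when the dtypes line up
  if u.op == "ADD" and u.dtype in FLOATS and len(u.src) == 2:
    for mul, other in ((u.src[0], u.src[1]), (u.src[1], u.src[0])):
      if isinstance(mul, UOp) and mul.op == "MUL" and mul.dtype == u.dtype:
        return UOp("MULACC", u.dtype, (*mul.src, other))
  return u
```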
* HWCopyQueue in KFD
* hw compute queue
* test
* move test
* more tests
* fix wait
* fix multimap
* mes crash
* tests pass but slow
* stuff is working
* one more test
* uops const fold rules to prevent tautological compare warnings
`bool < false` is false, `true < bool` is false, `a == a` is true, `a != a` is false
* not true for NaN
* and NaN does not work with LLVM
* full truth table test
* revert a==a
* comments and indents
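A quick standalone illustration of why the `a == a` fold had to be reverted: IEEE-754 comparisons involving NaN are all false except `!=`, while the boolean ordering bounds still fold safely.

```python
# NaN breaks a == a -> true; the boolean ordering folds remain safe
nan = float("nan")
assert (nan == nan) is False and (nan != nan) is True and (nan < nan) is False
# nothing compares below False, and True compares below nothing
assert all((b < False) is False and (True < b) is False for b in (False, True))
```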
* docs: add warning message for conda users when using METAL
* fix: conda metal warning too long. disabled line length check
* docs: changed conda METAL warning to include DISABLE_COMPILER_CACHE=1
* fix(metal): now detecting invalid library magic
* format: removed noqa E501
* fix(metal): conda error line len
* fix: typo
---------
Co-authored-by: Léo Paillé <leo.paille@enseirb-matmeca.fr>
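A minimal sketch of the workaround the warning points conda users to, assuming the `DISABLE_COMPILER_CACHE` environment variable is set before tinygrad is imported:

```python
# disable the compiler cache so a stale/invalid cached Metal library is never loaded
import os
os.environ["DISABLE_COMPILER_CACHE"] = "1"

from tinygrad import Tensor
print((Tensor.ones(4) + 1).numpy())   # should run on METAL without the cache
```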
* Embedding is in one kernel
* embedding is one kernel
* rm extra line
* newline
* bert test counts state vars?
* add a test?
* move items around
---------
Co-authored-by: Patrick Tsai <patosai@users.noreply.github.com>
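A minimal usage sketch for the Embedding change above (the parameter values are arbitrary): a lookup over a batch of token ids, which after this change is expected to realize as a single kernel.

```python
from tinygrad import Tensor
from tinygrad.nn import Embedding

emb = Embedding(vocab_size=1000, embed_size=64)
ids = Tensor([[1, 2, 3], [4, 5, 6]])   # (batch, sequence) token indices
out = emb(ids)
print(out.shape)                        # (2, 3, 64)
```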
* don't call contiguous for unpadded const into multi tensor
fixed multi const folding for sharded const.
still WIP; need to be careful that this does not break the multi-device cache somewhere
* ehh need a memory test for that
* simple sharded memory test
* fix _to_const_val and const folding around it
is_unrealized_contiguous_const is too strict and almost never hit if the const is expanded.
it suffices to check that there is no pad (see the sketch below)
* that test is folded
* test_const_folding
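A minimal sketch of the sharded-const case, assuming two virtual devices of the default backend: an unpadded const sharded across devices can fold into later ops without being forced contiguous.

```python
from tinygrad import Tensor, Device

devices = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))
x = Tensor.rand(4, 4).realize().shard(devices, axis=0)
ones = Tensor.ones(4, 4).shard(devices, axis=0)    # sharded const, no pad
print((x * ones).realize().numpy().shape)          # (4, 4); the multiply by 1 can fold away
```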
* kfd driver wip
* cleanups
* kfd almost ready to ring doorbell
* ding dong?
* issues with signals
* something
* works
* ops kfd
* add amd_signal_t
* works...sometimes
* program runs
* _gpu_alloc cleanup
* cleanups
* work
* header + enable profiling (#3959)
* header + enable profiling
* just cleaner
* measure
* only local time domain
* remove old comments
* fix with master
* elf parsing (#3965)
* elf parsing
* fix kernels with private
* not used
* clean up
* clean up 2
* add flags
* kfd sdma (#3970)
* working sdma
* remove driver, shorter
* all commands we might need
* svm
* kfd remove hardcoded values (#4007)
* remove hardcoded values
* match above line
* 7k lines + revert hsa
* update that from origin
* fix sdma reg gen
* not the updated SDMA
* compiler_opts
* don't require kfd_ioctl
* get ioctls from python
* get ioctls from python
* remove build_sdma_command
* merge into 64-bit fields
* shorter
* fix property spelling and off by one
---------
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
make it read nicer and clean up some movement methods and math simplifications.
the 790M, 1.4B, and 2.8B models do not really run.
sampling is not implemented.
the JIT is incorrect.
there is some dead code, wrong code paths, and leftover code copied from torch.