Commit Graph

10490 Commits

George Hotz
10dbf90b2c hotfix: test speed 2024-04-09 13:20:39 -07:00
George Hotz
ae849d12d7 numpy device + pickle it (#4120) 2024-04-09 13:19:30 -07:00
chenyu
1ef9c50fd7 Update ssa input order and annotate types in cstyle and assembly (#4117)
variable prefix is never optional (removed the default "t") and UOp can be optional (added the default None).
2024-04-09 13:10:29 -04:00
geohotstan
15f2f39658 conceptually simpler fancy index (#3335)
* init

* add failed case

* fix: temp comment out MULACC cast

* is this right?

* add test case

* oops, forgot to get rid of temp test

* WOOOOOO TOOK OUT 2 TRANSPOSES IN GATHER YAY

* cleaner

* comment cleanup

* update docs

* resolve conflict

* oops

* SUPA FAST

* comment out a test

* del some print statements

* use new broadcast stuff

* more clean up

* move try except

* skip fancy indexing for python backend test_ops
2024-04-09 11:18:04 -04:00
David González Martínez
980124a605 add lerp operation to tensor (#4102)
* feat: add lerp operation to tensor

* fix

* style: fit in one line

* tests: test backward for lerp
2024-04-08 17:03:27 -07:00
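The lerp operation added above is conventionally defined as `start + (end - start) * weight` (weight 0 returns start, weight 1 returns end). A minimal sketch in plain Python, not tinygrad's actual implementation:

```python
def lerp(start: float, end: float, weight: float) -> float:
    # Linear interpolation: weight=0 -> start, weight=1 -> end.
    return start + (end - start) * weight

print(lerp(0.0, 10.0, 0.3))  # 3.0
```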
Francis Lam
46850a0269 search: add a BEAM_COMPARE env to optionally not compare to hc/tc (#4107)
* search: add a BEAM_COMPARE env to optionally not compare to hc/tc

setting BEAM_COMPARE=0 will prevent additional memory allocation
needed to do the timing tests assuming the BEAM result is in
the diskcache.

* change to optionally use Buffer.allocate
2024-04-08 18:54:01 -04:00
qazal
c390828f61 refactor outbufs (#4112) 2024-04-08 14:54:10 -07:00
andresgit
7fd12aba85 graph remove input buffer references (#4100)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-08 16:49:16 -04:00
chenyu
078d841479 add SPLIT_REDUCEOP to disable reduce split (#4115)
verify with `SPLIT_REDUCEOP=0 BIG=2 MPS=1 python3 -m pytest -rA test/test_speed_v_torch.py -k sum`. 10X slower on mac
2024-04-08 16:31:08 -04:00
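Reduce splitting (the optimization that SPLIT_REDUCEOP toggles above) breaks one long reduction into partial reductions followed by a reduction of the partials. In NumPy terms, the shape of the optimization looks roughly like this (an illustrative sketch, not tinygrad's code):

```python
import numpy as np

x = np.arange(1 << 12, dtype=np.float32)
# Split one length-4096 sum into 64 partial sums of 64 elements each,
# then reduce the partials.
partial = x.reshape(64, 64).sum(axis=1)
print(float(partial.sum()) == float(x.sum()))  # True
```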
qazal
eea42d864f account for all outputs (#4113) 2024-04-08 10:04:19 -07:00
chenyu
dbd39ab78a setitem support setting python const (#4111) 2024-04-08 11:37:50 -04:00
chenyu
f8dc82a8a7 use single tensor for llama kv cache (#4108)
similar to optimization in gpt2
2024-04-08 00:38:32 -04:00
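One way to read the single-tensor kv cache change above: instead of maintaining two separate tensors for keys and values, preallocate one buffer with a leading axis of 2 and assign into slices. A hypothetical NumPy sketch of that layout (the actual change lives in tinygrad's llama example):

```python
import numpy as np

max_ctx, n_heads, head_dim = 8, 2, 4
# One buffer holds both keys (index 0) and values (index 1).
kv_cache = np.zeros((2, max_ctx, n_heads, head_dim), dtype=np.float32)

def update_cache(pos: int, k: np.ndarray, v: np.ndarray) -> None:
    # Write the new key/value for this position into the shared buffer.
    kv_cache[0, pos] = k
    kv_cache[1, pos] = v

update_cache(0, np.ones((n_heads, head_dim)), np.full((n_heads, head_dim), 2.0))
```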
chenyu
92c0675ccf setitem initial support (#4093)
* wip setitem

it's an eager assign to output shapetracker view

* cleanups and tests

* more cleanups
2024-04-07 20:35:22 -04:00
geohotstan
183708b3fd broadcast expand to match torch (#4085)
* initial version

* heh gimme grrrreen

* version 2

* clean ups

* some test confusion

* fix onnx

* rename to _broadcast_tensors

* improved errors and test

* fixed?

* some test fixup

* version 3 lol

* comments

* cleaner

* add failure test for expand to 0 test

* 1 more assertRaises test

* make err msg better

* also rewrite the expand onnx op? :s
2024-04-07 16:23:13 -04:00
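Torch-style broadcasting, which the commit above aligns tinygrad's expand with, right-aligns shapes and requires each dimension pair to be equal or contain a 1. A small illustrative helper (hypothetical, not tinygrad's `_broadcast_tensors`):

```python
def broadcast_shape(*shapes: tuple) -> tuple:
    ndim = max(len(s) for s in shapes)
    # Right-align shapes by padding with leading 1s.
    padded = [(1,) * (ndim - len(s)) + tuple(s) for s in shapes]
    out = []
    for dims in zip(*padded):
        sizes = set(dims) - {1}
        if len(sizes) > 1: raise ValueError(f"cannot broadcast {shapes}")
        out.append(sizes.pop() if sizes else 1)
    return tuple(out)

print(broadcast_shape((3, 1), (1, 4)))     # (3, 4)
print(broadcast_shape((2, 3, 4), (3, 1)))  # (2, 3, 4)
```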
uuuvn
2b81d9b334 Fix broken test (#4104) 2024-04-07 12:02:12 -04:00
chenyu
9a95d87366 metal CI run llama with 4 shards (#4103)
this can catch multi tensor issues on mac.
2024-04-07 11:04:08 -04:00
George Hotz
444d2a7487 hotfix: fix SDMA read_pointer_address in KFD 2024-04-07 13:13:15 +00:00
uuuvn
bb7567b365 Fix metal (#4101) 2024-04-07 05:21:19 -07:00
chenyu
bdbcac67f1 assign jit test case with other tensor as input (#4098)
hmm it works
2024-04-06 14:41:14 -04:00
George Hotz
e4a1858471 revert command queue (#4097) 2024-04-06 08:58:18 -07:00
George Hotz
97c402d69e use imagenet spawn (#4096) 2024-04-06 08:34:10 -07:00
George Hotz
fffd9b05f5 mock mnist data for imagenet trainer (#4095)
* mock mnist data for imagenet

* move print and test

* needed to reshape
2024-04-06 08:08:40 -07:00
George Hotz
8739d33fe9 kfd: disable copy_from_fd while debugging (#4091)
* kfd: disable copy_from_fd while debugging

* increase timeout to a minute
2024-04-05 18:02:58 -07:00
George Hotz
93824e59eb support MOCKDATA=1 for resnet (#4090)
* mockdata for resnet

* fix eval, revert hsa
2024-04-05 17:19:18 -07:00
George Hotz
164329a8ea address kfd feedback (#4087)
* address kfd feedback

* signals cleanup

* signals cleanup

* handle 2 doorbell pages correctly

* signal reset cleanup

* signals cleanup

* more GTT

* cleanups

* minor cleanups
2024-04-05 15:24:41 -07:00
geohotstan
dafa42e864 clean up (#4081)
Co-authored-by: chenyu <chenyu@fastmail.com>
2024-04-05 11:57:44 -04:00
Akshit Talwar
750ecf8fef replace slice by pad/shrink in _pool (#4082) 2024-04-05 11:47:22 -04:00
George Hotz
a337922c44 more work on kfd (#4079)
* more work on kfd

* fix multitensor test on kfd

* stuff
2024-04-05 08:36:36 -07:00
chenyu
e7ff5102cf failed test in test_pattern_matcher (#4080)
something about the PTX rewrite is incorrect: it produces duplicated rewritten uops
2024-04-05 02:53:50 -04:00
chenyu
a023a1ed87 update github action to actions/cache@v4 (#4077)
get rid of warning `Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/cache@v3.`
2024-04-04 22:24:26 -04:00
George Hotz
28ec6c67be hotfix: hlb_cifar KFD works 2024-04-05 02:19:14 +00:00
chenyu
1de9778949 import Buffer and BufferOption from tinygrad.buffer (#4076) 2024-04-04 22:12:23 -04:00
chenyu
9e0ebf8979 remove dtype from FlopCounter (#4075)
the annoying thing about removing FlopCounter entirely is that for devices that do not support local, the matmul index ALU count is huge.
we can remove the dtype first.

sneak in updating `ruff` command to `ruff check`
2024-04-04 21:23:28 -04:00
George Hotz
3de855ea50 don't use SVM memory in KFD (#4072)
* don't use SVM memory in KFD

* copy from fd

* cleanups

* transfer

* hacks

* ops_hsa

* tighter API
2024-04-04 17:33:21 -07:00
chenyu
5e6e6c9a67 use ConstType in various const function type hint (#4074) 2024-04-04 20:32:07 -04:00
chenyu
c1cffed1df add LazyOp.dtype (#4073)
an inferred cached_property.
removed all cases that use get_lazyop_info just to get the dtype of an op.
prereq to remove InterpretedFlopCounter
2024-04-04 17:38:19 -04:00
chenyu
f836d6a03f is_unrealized_unpadded_const -> is_unrealized_unmasked_const (#4071)
realized #3580 was doing the same thing. unmasked is more accurate
2024-04-04 14:25:17 -04:00
Szymon Ożóg
82b7b9655f test for dtype set (#4069) 2024-04-04 11:24:33 -04:00
geohotstan
1a1dd1c1a7 add and enable tests for indexing const folding (#4068)
* enable test in test_indexing

* added tests

* rename stuff

* del a test case cuz it's loadops.copy
2024-04-04 10:46:28 -04:00
Szymon Ożóg
ba118abfec improved caching for pointer arithmetic in ptx (#3922)
* improved caching for pointer arithmetic

* Add test for pointer arithmetic caching

* Refactor test
2024-04-04 07:33:48 -07:00
Szymon Ożóg
68fe3527f1 Tensor core ptx (#3894)
* tensor cores

* Merge from master

* faster program start in llvm (#3897)

* Fix the result permutation in einsum (#3895)

* Fix permutation of result indices in einsum.

* Delete stray line used for breaking tests

* Fix linter error by renaming twice-used variable

---------

Co-authored-by: chenyu <chenyu@fastmail.com>

* touchup einsum (#3900)

don't need rhs_letters

* hotfix check ckpts before writing achieved model (#3901)

this killed tinybox green run

* replace dtype.name str with render_dtype (#3903)

fixed some bf16 cast issues since bf16 does not have `.name`.
also more robust if there are lang-specific type overrides

* add --minimal flag to nvrtc (#3899)

* wmma: fix the AMD TC threads to split the first 16 threads (#3904)

previously it was incorrectly aliasing 16 into the size 8 upcast
on the store alias.  now it splits it properly into 8 and the
remaining 2 into the correct local stride

* training cifar with BF16 on CUDA (#3905)

* training cifar with BF16 on CUDA

memory usage is between float and half due to numpy calls on dataset preprocessing, which converts into float.

* simpler bf16 functions

* bf16 cifar works for HSA too just very slow

* simpler bf16 functions, we love cuda

* include negative float in test_dtype (#3884)

* include negative float in test_dtype

* that is ub

* too annoying

* pack can overflow

* add to benchmark

* change var name to satisfy mypy

* spacing

* Update to new TensorCore format

* Spacing

---------

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Alejandro F Queiruga <33233447+afqueiruga@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: sekstini <127142660+sekstini@users.noreply.github.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-04-04 07:32:31 -07:00
Szymon Ożóg
92378fb5b6 Ptx mulacc (#3937)
* mulacc

* Move more stuff to pattern matcher

* disable callable from the == check

* disable function passing in pattern matcher

* Add set of dtypes pattern matching + refactor mulacc pattern
2024-04-04 00:15:25 -07:00
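The mulacc rewrite above is the classic fused multiply-accumulate pattern: an ADD fed by a MUL is folded into a single op. A toy sketch of this kind of bottom-up rewrite on a tuple-based expression tree (assumed structure for illustration, not tinygrad's PatternMatcher):

```python
def rewrite_mulacc(expr):
    # expr is either a leaf (str) or a tuple (op, *operands).
    if isinstance(expr, str): return expr
    op, *args = expr
    args = [rewrite_mulacc(a) for a in args]
    # ADD(MUL(a, b), c) -> MULACC(a, b, c)
    if op == "ADD" and isinstance(args[0], tuple) and args[0][0] == "MUL":
        return ("MULACC", args[0][1], args[0][2], args[1])
    return (op, *args)

print(rewrite_mulacc(("ADD", ("MUL", "a", "b"), "c")))  # ('MULACC', 'a', 'b', 'c')
```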
George Hotz
3e72d745ea hotfix: make KFD timings right 2024-04-04 05:55:29 +00:00
George Hotz
58d162315c Continuing KFD work (#4065)
* cleanups

* fix kernargs ptr

* mypy passes
2024-04-03 22:48:13 -07:00
chenyu
d219aba962 prepend CLANG_PROGRAM_HEADER in ClangCompiler.render instead of compile (#4063)
src header should be part of the rendered output, and DEBUG=4 includes the header this way
2024-04-03 23:17:56 -04:00
George Hotz
7181ffd630 HWCopyQueue in KFD (#4042)
* HWCopyQueue in KFD

* hw compute queue

* test

* move test

* more tests

* fix wait

* fix multimap

* mes crash

* tests pass but slow

* stuff is working

* one more test
2024-04-03 20:14:24 -07:00
chenyu
e3c0ac9fbf remove old envvar "OPT" (#4060) 2024-04-03 14:55:21 -04:00
chenyu
406cb5fd90 const fold ReduceOps (#4059) 2024-04-03 14:39:28 -04:00
chenyu
fe03725b21 const fold cast unrealized_unpadded_const (#4047)
* const fold unrealized_unpadded_const

changed the underlying arg directly

* CAST_BEFORE_VIEW folds some

* fix const index in getitem
2024-04-03 12:31:24 -04:00
Szymon Ożóg
e5a9bff899 Add pattern matcher tests, move uop transforms from assembly to pattern matcher (#4056)
2024-04-03 09:06:43 -07:00