Commit Graph

124 Commits

Author SHA1 Message Date
nimlgen
b4c49ae3fa remove cudacpu in favour of mockgpu (#5225)
* remove cudacpu in favour of mockgpu

* remove unused import

* not used as well
2024-06-29 11:05:16 +03:00
nimlgen
ee02dcb98e nv supports PTX=1 (#5222)
* nv supports PTX=1

* not needed

* split nv compiler into nvrtc autogen

* remove to_c_array

* test

* Revert "test"

This reverts commit f0b56f308b.
2024-06-29 10:46:29 +03:00
chenyu
a8e9307e0b pylint runtime/ and shape/ (#5044)
as pointed out by #4877, we need to add `__init__.py` to trigger pylint. fixed some errors, except in ops_python (will do in a separate pr, it has a lot of errors) and the sub-folders in runtime
2024-06-18 19:48:18 -04:00
Roelof van Dijk
0eebb8e998 fix: _free should not return (#4880) 2024-06-08 14:45:06 +02:00
Roelof van Dijk
1785a70e77 fix: else-return on runtime (#4881)
* fix: add init file

* fix: no else-return

* fix: remove file again
2024-06-08 14:44:24 +02:00
Szymon Ożóg
f7201b6852 Remove deprecated code (#4724) 2024-05-25 03:02:12 -04:00
chenyu
286b4dbdf2 compile raise CompileError and skip only RuntimeError in multiprocess beam (#4646)
* compile raise CompileError and skip only RuntimeError in multiprocess beam

renderer error with multiprocess should not be skipped by beam

* use `==` for dtype to dtype comparison

* that needs to be is

* typo
2024-05-19 00:25:25 -04:00
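
A rough sketch of the error-handling split this commit describes, with hypothetical `compile_fn`/`run_fn` stand-ins (not tinygrad's actual search code): a CompileError from the renderer should surface as a real bug, while a runtime failure just scores the candidate as unusable.

```python
class CompileError(Exception): pass

def time_candidate(compile_fn, run_fn, src: str) -> float:
  lib = compile_fn(src)      # CompileError propagates: a broken render is a bug
  try:
    return run_fn(lib)       # a valid kernel may still fail at runtime
  except RuntimeError:
    return float("inf")      # score as worst-possible and keep searching
```
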
George Hotz
347a3acb37 add renderer class (#4524)
* add renderer class

* tests pass

* fix pylint

* fix tensor cores
2024-05-10 21:40:02 -07:00
George Hotz
d438d5698d bring buffer back to device (#4517) 2024-05-10 11:22:31 -07:00
George Hotz
4eef1ee9bf move renderer into options (#4514)
* move renderer into options

* fix tests

* renders are functions
2024-05-10 10:01:51 -07:00
George Hotz
89e119bc58 move Allocator to buffer.py (#4502)
* move Allocator to buffer.py

* move those to realize

* memory file

* cleanup
2024-05-09 19:45:56 -07:00
George Hotz
9fc4465557 subbuffer support (#4397)
* subbuffer support

* diskbuffer offset

* cuda subbuffer works

* use subbuffer

* more subbuffer tests

* consecutive

* cast

* consec

* offset

* view is a better name

* offset is in nbytes

* fix view + memory planner

* delete unused DiskRunner

* reverse order

* no subbuffers on unrealized consts

* only enabled for disk

* don't reverse memory

* view supported devices

* pickle buffer view

* ring jit

* support extra view inputs in jit

* fix JIT=2 issue

* test copy jit

* p2p isn't an option anymore

* fix dep tracking issue

* fix mypy

* fix pickle

* from_nv is contents now
2024-05-03 18:05:57 -07:00
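
The core idea of subbuffers, reduced to a minimal Python sketch (real tinygrad Buffers also carry device and dtype; the `view`/`offset` naming follows the commit messages above, with offset in bytes): a view shares its base buffer's memory at a byte offset instead of owning an allocation.

```python
class Buffer:
  def __init__(self, size: int, base=None, offset: int = 0):
    self.size, self.base, self.offset = size, base, offset  # offset in nbytes
    self._mem = bytearray(size) if base is None else None   # views own nothing
  def view(self, size: int, offset: int) -> "Buffer":
    assert offset + size <= self.size
    return Buffer(size, base=self, offset=offset)
  def as_memoryview(self) -> memoryview:
    if self.base is None: return memoryview(self._mem)
    return self.base.as_memoryview()[self.offset:self.offset + self.size]

base = Buffer(16)
v = base.view(4, offset=8)
v.as_memoryview()[0] = 0xFF
assert base.as_memoryview()[8] == 0xFF  # writes through to the base buffer
```
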
George Hotz
60e3aa5cb1 more docs (#4271)
* more work on docs

* CompilerOptions is dataclass
2024-04-24 10:52:42 +08:00
Micah Zoltu
7bc862767c Improves error message when CUDA module fails to load. (#4243) 2024-04-21 11:10:14 -04:00
nimlgen
5a57b48134 cuda p2p enable when available (#4153) 2024-04-12 16:21:54 +03:00
George Hotz
af5984df43 cudagraph memcpy through host (#4137) 2024-04-10 13:17:17 -07:00
chenyu
1de9778949 import Buffer and BufferOption from tinygrad.buffer (#4076) 2024-04-04 22:12:23 -04:00
chenyu
b47f6cebb2 LinearizerOptions -> CompilerOptions (#3978) 2024-03-28 17:50:23 -04:00
nimlgen
e2d6f76723 _alloc and _free with options (#3934)
* _alloc has options

* linter

* fix hsa
2024-03-26 09:11:41 -07:00
nimlgen
739f47eb0f check on cuEventSynchronize (#3933) 2024-03-26 16:14:38 +03:00
nimlgen
f2a9ea4ea9 lru allocator for copyin host buffers (#3918)
* lru allocator for copyin host buffers

* linter happy
2024-03-25 15:57:18 +03:00
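
A minimal sketch of a free-list allocator for copyin host buffers, assuming hypothetical `raw_alloc`/`raw_free` backends (the LRU eviction ordering is elided): freed buffers are kept in per-size lists and reused, so repeated copyins avoid reallocating host memory.

```python
from collections import defaultdict

class HostBufferCache:
  def __init__(self, raw_alloc, raw_free):
    self.raw_alloc, self.raw_free = raw_alloc, raw_free
    self.cache = defaultdict(list)  # size -> reusable freed buffers
  def alloc(self, size: int):
    return self.cache[size].pop() if self.cache[size] else self.raw_alloc(size)
  def free(self, buf, size: int):
    self.cache[size].append(buf)    # keep it around for the next copyin
  def free_cache(self):             # drop everything under memory pressure
    for bufs in self.cache.values():
      for buf in bufs: self.raw_free(buf)
    self.cache.clear()
```
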
George Hotz
e0e234bf94 hotfix, str compare version for cuda 2024-03-24 20:35:24 -07:00
Arseny Kapoulkine
715850aef9 Fix sm89 PTX=1 compilation (#3915)
* Fix sm89 PTX=1 compilation

The minimum PTX version that supports sm89 is 7.8 (the same version also
supports sm90); without this, ptxas fails when running tinygrad with
PTX=1 on an RTX 4090.

* Use int(arch[3:]) for forward compat with SM10.0 if that happens
2024-03-24 20:32:29 -07:00
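
A sketch of the version selection this commit describes (the 7.5 fallback is an assumption for illustration; the commit only fixes the sm89+ case): PTX ISA 7.8 is the first version that knows sm_89/sm_90, and parsing the numeric suffix keeps the check valid if an SM10.0 ever ships.

```python
def min_ptx_version(arch: str) -> str:
  # arch looks like "sm_89"; int(arch[3:]) keeps working for a future "sm_100"
  return "7.8" if int(arch[3:]) >= 89 else "7.5"  # 7.5 baseline is assumed

assert min_ptx_version("sm_89") == "7.8"   # required for ptxas on RTX 4090
assert min_ptx_version("sm_100") == "7.8"  # forward compat per the commit
```
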
sekstini
7c3632fd1e add --minimal flag to nvrtc (#3899) 2024-03-23 16:38:31 -07:00
George Hotz
46a3501cec nv ioctl sniffer (#3892)
* nv ioctl sniffer

* unused import

* Update __init__.py

* that work

* that fix it
2024-03-23 00:29:30 -07:00
chenyu
1c51d586ea replace raise Exception with specific errors (#3874) 2024-03-22 12:32:21 -04:00
nimlgen
8ef5490ec8 cuda transfer + async copyin (#3873) 2024-03-22 09:01:37 -07:00
Szymon Ożóg
624bc89910 PTX - implement float4, ptr arithmetic and other speed improvements (#3775)
* ptx float4 implementation

* remove from cache when trimming uops

* Gate for float4

* Linting fix

* disable test reasonable time for ptx

* import getenv

* Update uops.py

* linter

* Add div test for half

* upcast if op does not support operation

* fix offset

* Run only if dtype supported

* zero out registers when accessing by pred + cleanup

* Remove trailing whitespace

* revert

* spacing fix

* move cache clearing outside loop

* did this suddenly start working?

* unused import removed

* Remove cast

* Use pattern matching

* linting

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-03-22 08:54:02 -07:00
nimlgen
b78352b423 do not create structs every call in CUDAProgram (#3855)
* do not create structs in cuda

* fix graph

* linter

* do not exec twice

* fix graph
2024-03-21 17:51:40 +03:00
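
What "do not create structs every call" amounts to, as a hedged ctypes sketch (the real CUDAProgram's argument layout differs): define the argument Structure type once at construction, and only instantiate it per launch.

```python
import ctypes

def make_args_struct(n_bufs: int, n_vals: int):
  fields = [(f"buf{i}", ctypes.c_void_p) for i in range(n_bufs)]
  fields += [(f"val{i}", ctypes.c_int) for i in range(n_vals)]
  return type("KernelArgs", (ctypes.Structure,), {"_fields_": fields})

class ProgramSketch:
  def __init__(self, n_bufs: int, n_vals: int = 0):
    self.args_t = make_args_struct(n_bufs, n_vals)  # built once, not per call
  def __call__(self, *bufs: int, vals: tuple = ()):
    args = self.args_t(*bufs, *vals)  # cheap instantiation on the hot path
    # a real launch would pass ctypes.byref(args) to cuLaunchKernel here
    return args
```
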
Francis Lam
b6e2495fdd kernel: limit shared memory usage when adding opts (#3705)
* kernel: limit shared memory usage when adding opts

* search: remove unnecessary limit on search space

apply_opt will do the more correct check
2024-03-12 17:06:21 -04:00
George Hotz
ac02e7347d ptx timing vs cuda timing (#3659) 2024-03-08 10:17:49 -08:00
George Hotz
6e50582e62 working to improve ptx (#3647)
* working to improve ptx

* fix compile fail
2024-03-07 12:39:31 -08:00
George Hotz
81baf3eed3 bring ptx back (#3623)
* bring ptx back

* ptx back

* fix define var

* fix a few bugs

* bugfixes

* fixes

* fix llvm bug

* fix test bug
2024-03-06 13:34:21 -08:00
Francis Lam
e17f1821a7 wmma: add CUDA tensor core and fix test_speed_v_torch failure (#3544) 2024-03-01 17:51:02 -08:00
nimlgen
94b7ac7a29 no cuda compile helper (#3512) 2024-02-28 01:50:10 +01:00
George Hotz
7698781389 Revert "wmma: add CUDA tensor core (#3464)" (#3474)
This reverts commit e9cef13f0b.
2024-02-22 11:58:16 +01:00
Francis Lam
e9cef13f0b wmma: add CUDA tensor core (#3464) 2024-02-22 11:57:08 +01:00
George Hotz
3c728d1082 compiler support (#3260)
* compiler support

* revert that

* fix tests
2024-01-26 23:36:40 -08:00
George Hotz
03a6bc59c1 move autogen to runtime/autogen (#3254) 2024-01-26 12:44:19 -08:00
George Hotz
a3869ffd46 move gpuctypes in tree (#3253)
* move gpuctypes in tree

* fix mypy

* regex exclude

* autogen sh

* mypy exclude

* does that fix it

* fix mypy

* add hip confirm

* verify all autogens

* build clang2py

* opencl headers

* gpu on 22.04
2024-01-26 12:25:03 -08:00
George Hotz
cb372b053f add device speed test (#3244) 2024-01-25 12:01:22 -08:00
nimlgen
3205fd8481 fix cuda device var rewrite (#3233) 2024-01-24 16:57:49 -05:00
George Hotz
83d614295e reduce lines (#3230) 2024-01-24 10:35:59 -08:00
George Hotz
23b084e70a add device name to device, all are constructed (#3221) 2024-01-23 20:34:56 -08:00
nimlgen
81ae4ea179 compile cache for several devices (#3148)
* compile cache for several devices

* ops_gpu uses hash to not care about sql

* hip rdna test with device

* linter happy

* no device passed where possible

* arch is optional to compile_{hip|cuda}
2024-01-16 11:45:26 -08:00
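
The gist of a multi-device compile cache, sketched with a dict standing in for the on-disk cache: keying on both a device/arch string and a hash of the source keeps binaries from different backends from colliding.

```python
import hashlib

_cache: dict[str, bytes] = {}  # stand-in for the real diskcache

def cached_compile(device_key: str, src: str, compile_fn) -> bytes:
  key = hashlib.sha256(f"{device_key}:{src}".encode()).hexdigest()
  if key not in _cache:
    _cache[key] = compile_fn(src)  # only compile on a cache miss
  return _cache[key]
```
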
George Hotz
120c8b1841 update llvm api + add cache key (#3140)
* update llvm api + add cache key

* use_xcode is a different function

* types
2024-01-15 17:25:32 -08:00
George Hotz
ca0beeef38 Christopherm99 ptx (#3139)
* get basic ptx impl working

* test ops passing

* mypy

* dont hardcode target

* more walrus

* ptx in ci

* bool cast and f16 load/store

* weird numpy bug and f16 cast tolerance

* cast half to bool

* fix 1 byte load/store

* disable half for ptx

* fix args and enable xid

* fix non-ptr args

* allow bitcast

* mypy

* cleanups

* midcast use allclose

* add xor

* Revert "disable half for ptx"

This reverts commit 73391c05fd.

* enable float16

* mypy

* no more crashing in ci

* fix ci

* minor cleanups

* use new fn for ptx compiler

* no diskcache in ptx compile

* use rn instead of rz

* save some lines

* new DEFINE_GLOBAL syntax

* line length

* new llvm

* cmpeq

* minor fix

* cast in mulacc

* update test_recursive_add to check line count

* mypy

* remove llvmir.py

* fix bool const

* wip

* cleanups

* working

* llvm in separate pr

* cleanups

* more cleanups

* fix ci

* use in_features directly in nn.Linear.__init__ bound check (#3050)

* use in_features directly in nn.Linear.__init__ bound check

get rid of the unnecessary check of isinstance int

* that is always int

* long lines

* Device._buffers -> Device._devices (#3052)

backend devices used to be called buffers

* make Embedding device aware for multigpu (#3051)

* make Embedding device aware for multigpu

* split line instead of igore because that's cheating

* add test incomplete

* add test complete

* remove comment

* fix white space

* remove nn.Embedding

* remove unused reciprocal (#3053)

* remove unused reciprocal

* comment

* unit tests for Device.canonicalize (#3055)

* add multigpu test for RMSNorm (#3056)

* need all gather

* add two multigpu test scenarios for RMSNorm

* No extra vars call (#3054)

* remove unused reciprocal

* comment

* remove unneeded call to vars

* free speedup

* explicit lazybuffer caching (#3058)

* hotfix: remove useless slow assert from ShapeTracker

* Speed tweaks (#3059)

* base doesn't have to be a function

* no double fetch

* pop, don't check

* make the gc happy

* avoid hasattr

* cache canonicalize

* remove assert, faster base

* don't redefine that every time

* fix gpt2 attention with start_pos = 0 (#3061)

* fix gpt2 attention with start_pos size 1

test cases taken from ll_transformer branch

* fix interpreted

* Tensor.cat with 0 shape tensors (#3062)

* Tensor.cat with 0 shape tensors

supports both 0 in the cat axis (for a subset of inputs) and 0 in a non-cat axis (then all inputs need to be 0)

* no shp

* test scaled dot product attention (#3063)

* add test

* add initial test for scaled dot product attention

* test pass for scaled dot product attention

* cached size (#3060)

* cached size

* simplify simplify

* 0 doesn't have base

* fix test

* cleaner cache

* hmm, metal is flaky on this...might be real(ish) but useless as test

* short circuit reshape/expand properly

* better reshape bypass

* hotfix: use is for enum compare

* hotfix: use is for enum compare, a few more

* speedtweaks3: apply shouldn't use the tensor constructor (#3065)

* speedtweaks3: apply shouldn't use the tensor constructor

* replace 0 size with CONST, not 0 in shape

* update gh actions (#3033)

* update checkout actions

* update upload artifact

* update setup python

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* unbind view or shapetracker also returns var_val (#3067)

* unbind view or shapetracker also returns var_val

4% faster for llama compile time

* one line less

* unbound_views

* hotfix: examples/transformer.py

* jit autorealizes output (#3069)

* early gate the graph (#3070)

* simpler idxs_to_idx (#3071)

* filter_strides -> canonicalize_strides (#3072)

* fix onehot and jit in examples/transformer (#3073)

trained to 0.999 in < 6 seconds on M1 Max consistently

* better test demonstration (#3077)

* a better test demonstration

* fix white space

* Tensor.expand resolves the new_shape before shortcut return (#3078)

similar to how reshape is done. also updated shrink shortcut criteria to read similar to pad

* minor cleanups of lazy.py (#3080)

* wmma: clean up device specific tensor core code (#3081)

* mem_estimate is always int, not symbolic (#3083)

* mem_estimate is always int, not symbolic

op_estimate can be symbolic, but mem_estimate is always int, thus we don't need to sym_infer it.
fixed some long lines too. update_stats is a very big function

* operator does not need underscores

* cat works (#3086)

* hotfix disable flaky mac runner wino cifar (#3087)

* remove the third merging state in view._merge_dims (#3085)

no logic depends on state == 0 or state == 2

* minor cleanup of View.reshape (#3088)

* minor cleanup of View.reshape

removed some redundant logic

* new_strides

* revert that

* use BEAM=2 instead of BEAM=4 in cuda ci gpt2 (#3089)

BEAM=2 is faster and takes less search time. investigating why BEAM2+BEAM4 is slower than BEAM2 alone

* use device from LinearizerOptions in kernel search (#3090)

* use device from LinearizerOptions in kernel search

removed all Device.DEFAULT in search.py

* pass device string for parallel pickle

* device for interpreted backends in LinearizerOptions

* update jit type annotation post lazy rewrite (#3091)

* add multigpu support for llama attention (#3064)

* add llama attention test for multigpu

* test fails

* kv cache trying to shrink on sharded axis

* mask None works for scale dot product

* kv cache seems to be working but scale dot product breaks

* scaled dot product works, but the last linear layer failed

* running into the reshape case where it could be wrong for multigpu

* making sure it was the reshape

* adding contiguous doesn't solve

* need to shard more properly

* remove reshape test

* minor adjustment to scale dot product attention test

* weights are sharded wrong

* continue fix new weight sharding

* clean up

* fix attention when start_pos is 0

* remove print

* add TODOs for the best multigpu interface

* bugfix do not reset shapetracker of 0 size lazybuffer (#3096)

it might be coming from an expand, and resetting results in an incorrect stride. caught by interpreted backend

* One hot in tensor.py (#3093)

* onehot in Tensor.py

* one_hot tests

* works for all shapes, not just 1

* pylint

* not a static method

* moved around, num_classes mandatory

* pylint

* pylint

* space & moving

* formatting

* moved tests

* fix broadcasted logic if there's 0 in shapes (#3097)

* fix broadcasted logic if there's 0 in shapes

should always expand into 0, not the other way around. fixed matmul with 0 in input shapes.
forward only for now; backward is more involved and would need to change the 0-size shortcuts

* fix tests

* replace with tensor op (#3099)

* fix gpt2 with empty prompt (#3100)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* Revert "fix gpt2 with empty prompt" (#3101)

* fix gpt2 with empty prompt take 2 (#3102)

logits would be empty so need to replace that with ones before sampling, also cannot reshape with -1 when there's 0 in other axes

* wmma: enable METAL half tensor cores and clean up cstyle (#3095)

* wmma: enable METAL half tensor cores and clean up cstyle

* revert simple_matmul rand changes and break line in tensor

* added metal fp16->fp32 tensor core

* add half @ half to mac benchmark (#3103)

* flag to profile mixtral - 1.7 tok/s now (#3104)

* update NumNode.__hash__ to be hash(self.b) (#3105)

with this, `a:=NumNode(x) == b` implies `hash(a) == hash(b)` (see the sketch after this entry)

* catch runtime error in search._time_program (#3106)

return inf if search encountered runtime errors.

* no exceptions in __del__ when module creation fails in hip/cuda (#3107)

* failed test case due to cast resets shapetracker (#3109)

cast implicitly resets shapetracker and makes it contiguous (for disk tensor), which fails for Interpreted backend if inputs contain non-contiguous st.

* cleanup ops_disk type annotation and redundant str cast (#3110)

* minor cleanup of test_disk_tensor (#3112)

* add Tensor.var (#3114)

also updated MeanVarianceNormalization and made test_ops test tensors of var and std smaller

* move sample inside jit for beautiful_mnist (#3115)

also removed .realize() for jit functions since jit does it automatically now. a little more beautiful

* minor cleanups of onnx_ops (#3116)

* fix conversation: llama generates token not prob now (#3120)

* add device options for tests in multigpu (#3121)

* make DType a dataclass (#3111)

* remove np from DType

* convert to dataclass

* remove dunder hash, eq, ne overrides from ImageDType

* is dataclass required for PtrDType?

* fix GPU tests

* reduce lines

* revert changes to np

* minor cleanup

* hotfix: ptrdtype compare was broken

* move fromcpu out of lazy.py (#3122)

* move fromcpu out of lazy.py

* fix abstractions2

* remove numpy from device (#3123)

* remove numpy from device

* fix tests

* np item

* cleanups

* simplify with as_buffer

* no toCPU

* tinygradic

* cast to scalar

* remove numpy from ops_torch (#3124)

updated mnist test to cast label to int8 and avoid hacking around the cast issue of torch uint8

* Fix backward fn for `<` and `==` (#3037)

* fix no grad fn for < and ==

* remove 2 line breaks

* Remove deprecated autograd variable

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>

* separate try except blocks in onnx2torch in model benchmark (#3126)

exceptions can be raised from either model conversion or an individual backend failure. openpilot on torch mps works, but does not work with torch cpu.
separate the exception blocks so that the benchmark can include torch mps for openpilot.

* update env_vars.md (#3127)

mostly removed deprecated ones. not clear how to maintain this especially for extra/examples

* update test_ptr_ne (#3130)

* remove np from metal graph (#3129)

* dtype fmt (#3132)

* dtype fmt

* three ways to access

* fix off-by-one error in st_equal (#3131)

* fix off by one error

* whitespace

* no numpy (#3134)

* fast resnet eval (#3135)

* fast resnet eval

* fix HIP multidevice graph

* neater expression for devices

* lines

* add decorator test

* remove LLVMOPT

* move ptx

* Update ops_cuda.py

---------

Co-authored-by: Christopher Milan <chrismilan@ucla.edu>
Co-authored-by: chenyu <chenyu@fastmail.com>
Co-authored-by: Yixiang Gao <yixiangg310573@gmail.com>
Co-authored-by: jxdv <virgoj@protonmail.com>
Co-authored-by: Francis Lam <flam@alum.mit.edu>
Co-authored-by: SnakeOnex <sheeproman@gmail.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
Co-authored-by: Jyotirmaya Mahanta <jyotirmaya.mahanta@gmail.com>
Co-authored-by: Guy Leroy <g.m.leroy@outlook.com>
Co-authored-by: Paul Gustafson <paul.gustafson@theambrusgroup.com>
2024-01-15 16:44:20 -08:00
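
One item from the merged list above deserves a tiny illustration: the NumNode.__hash__ update (#3105). If `__eq__` can return True against a plain int, hashing the wrapped value preserves the hash/eq contract so mixed dict/set lookups behave. A stripped-down sketch:

```python
class NumNode:
  def __init__(self, b: int): self.b = b
  def __eq__(self, other):
    return self.b == (other.b if isinstance(other, NumNode) else other)
  def __hash__(self): return hash(self.b)  # consistent with __eq__ above

a = NumNode(3)
assert a == 3 and hash(a) == hash(3)  # the invariant from the commit message
assert len({a, 3}) == 1               # so sets/dicts dedupe across the types
```
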
nimlgen
cf1d0a6704 no exceptions in __del__ when module creation fails in hip/cuda (#3107) 2024-01-13 12:03:55 -05:00
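
The pattern behind this fix, as a generic sketch (the loader functions are stand-ins, not the real driver API): if `__init__` raises before an attribute is set, `__del__` must check for it rather than raise a second time.

```python
def load_module(src: str):
  if not src: raise RuntimeError("module creation failed")  # e.g. compile error
  return object()

def unload_module(module): pass  # stand-in for the driver's unload call

class ProgramSketch:
  def __init__(self, src: str):
    self.module = load_module(src)  # if this raises, self.module is never set
  def __del__(self):
    if hasattr(self, "module"):     # guard: only unload what was created
      unload_module(self.module)
```
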
chenyu
0fe6904351 use device from LinearizerOptions in kernel search (#3090)
* use device from LinearizerOptions in kernel search

removed all Device.DEFAULT in search.py

* pass device string for parallel pickle

* device for interpreted backends in LinearizerOptions
2024-01-11 14:46:03 -05:00
George Hotz
c81ce9643d move globalcounters to ops (#2960)
* move globalcounters to ops

* missed a few

* sick of that failing
2024-01-01 14:21:02 -08:00