Commit Graph

5780 Commits

Author SHA1 Message Date
nimlgen
bf645d62b3 qcom docs (#6338) 2024-09-02 20:42:20 +03:00
nimlgen
d22b46a2ac qcom in benchmarks (#6337) 2024-09-02 19:59:11 +03:00
Vyacheslav Pachkov
4c33192a8b add qcom runtime (#5213)
* qcom: driver init

* autogen stubs for msm_kgsl; also fix up ioctls to show numbers instead of _IOW macros
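
(For reference: _IOW just packs direction, argument size, ioctl type, and number into one integer, so emitting plain numbers loses nothing. A sketch of the Linux asm-generic encoding:)

    # Linux ioctl encoding: bits 31-30 direction, 29-16 arg size, 15-8 type, 7-0 number
    _IOC_WRITE = 1

    def _IOW(ioc_type: int, nr: int, size: int) -> int:
        return (_IOC_WRITE << 30) | (size << 16) | (ioc_type << 8) | nr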

* autogen: add adreno commands and registers

* ops_qcom: QcomAllocator + signals

* fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom

* qcom: we do not really need all these constants; input/output is enough

* qcom: perfctr for CS (do not really need all the rest)

* qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max

* qcom: explicitly set instruction len based on the shader size

* ops_qcom: Program init

extracts the shader from the OpenCL binary
sets input/output buffers
allocates stack
sets cs mode
runs shader
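
(A toy sketch of the extraction step. The header layout below is made up for illustration; the real Qualcomm OpenCL binary parsing lives in ops_qcom:)

    import struct

    # Toy illustration only: pull a blob out of a container via (offset, size)
    # fields in a little-endian header. The actual header format is ops_qcom's.
    def extract_shader(lib: bytes, header_off: int = 0) -> bytes:
        off, size = struct.unpack_from("<II", lib, header_off)
        return lib[off:off + size]

    demo = struct.pack("<II", 8, 4) + b"\xde\xad\xbe\xef"
    assert extract_shader(demo) == b"\xde\xad\xbe\xef"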

* use data64_le from helpers
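
(data64_le splits a 64-bit value into two 32-bit words, low dword first, for packing into command buffers; tinygrad's helper is essentially:)

    def data64_le(data: int) -> tuple[int, int]:
        # low 32 bits first, then high 32: little-endian dword order for packets
        return (data & 0xFFFFFFFF, data >> 32)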

* ops_qcom: use fill_kernargs for filling i/o buffers

* ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset

* new signals & fix exec

* add QCOM to the list of supported devices

* correct QcomComputeQueue._wait using CP_WAIT_REG_MEM

* fix exec, synchronize before copyout

* correct setting num_units for ST_SHADER

* fix GPU hangs on signals with CP_MEM_WRITE; it is uncached memory anyway

* extract offsets to kernel arguments from opencl binary

* extract constant values and offsets from the OpenCL binary

* handle KGSL_MEMFLAGS_USE_CPU_MAP correctly

* align kernel name to 4 bytes when skipping kernel opencl struct
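
(the standard power-of-two round-up, for reference:)

    def align_up(x: int, a: int = 4) -> int:
        # round x up to the next multiple of a (a must be a power of two)
        return (x + a - 1) & ~(a - 1)

    assert align_up(5) == 8 and align_up(8) == 8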

* skip to consts directly using an offset from opencl binary header

* fix alloc

* get halfreg and fullreg from opencl bin

* set unmultiplied global sizes as kernel group in HLSQ_CS_NDRANGE

* parse prg offset from the OpenCL binary

* save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG

* support for vals in _fill_kernargs

* support 16-bit constants

* use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts

this helps avoid crashing when executing big kernels

    /* Don't time out if the context has disabled it */
    if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE)
        return;
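
(the quoted check is from the kgsl kernel driver. A hedged sketch of requesting the flag at context creation; constants copied from msm_kgsl.h, but double-check the ioctl encoding against your header:)

    import ctypes, fcntl

    class kgsl_drawctxt_create(ctypes.Structure):
        _fields_ = [("flags", ctypes.c_uint32), ("drawctxt_id", ctypes.c_uint32)]

    KGSL_CONTEXT_NO_FAULT_TOLERANCE = 0x00000200  # from msm_kgsl.h
    IOCTL_KGSL_DRAWCTXT_CREATE = ((3 << 30) | (ctypes.sizeof(kgsl_drawctxt_create) << 16)
                                  | (0x09 << 8) | 0x13)  # _IOWR(0x09, 0x13, ...)

    def create_ctx(fd: int, flags: int = KGSL_CONTEXT_NO_FAULT_TOLERANCE) -> int:
        req = kgsl_drawctxt_create(flags=flags)
        fcntl.ioctl(fd, IOCTL_KGSL_DRAWCTXT_CREATE, req)
        return req.drawctxt_id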

* minor changes of _exec

* QCOMRenderer

* disable HCQGraph for demo. TODO: support HCQ update api

* support HCQ

- remove copy queue
- add updates
- add strides for buffs and vars for QCOM

* bufs_stride

* clean ups

* linter

* call super().__init__(value) in QcomSignal

* disable=unused-import

* mypy

* type ignore when queue is on the device

* fix

* query gpu_id. Will be useful for selecting commands, e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7
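
(e.g. something along these lines; opcodes are passed in, and the generation cutoff is an assumption, not taken from the driver:)

    # Illustrative only: dispatch the packet choice on the queried gpu_id.
    def event_write_op(gpu_id: int, cp_event_write: int, cp_event_write7: int) -> int:
        return cp_event_write7 if gpu_id >= 700 else cp_event_write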

* working timestamps

* free context after device is done

* move gpu stack to the device

* reserve some space with lib_gpu for gpu to write to

this fixes test_interpolate_bilinear

* exclude tests that fail with GPU=1 on Qualcomm

* lint

* unmap mem in _gpu_free

* context priority and preemption policy

* remove old qcom

* pass size to self.device.allocator.free

* skip tests only on qcom

* use kgsl and adreno defines instead of numeric vals

* use allocator for allocating lib_gpu

* update to QcomArgsState from master

* intermediate commit while conquering images

* enable image tests on qcom

* fix shader disasm size, dump textures stuff

* working images

* allow signals to be 0

* set branchstack from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* set shared memory size from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* update images in QcomArgsState & less loc for images

* set stack sizes from OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* stack allocation based on OpenCL binary

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* better autogen for kgsl and adreno. no more bitshifts

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* cleanup commit for parse cl lib

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* don't forget the actual generated files

* refactor + less loc

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* device.py back

* lint

* ruff

* timestamp divisor

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* fix tex fmt & round global size

Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>

* dtypes

* 19.2MHz
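
(19.2 MHz is the always-on counter rate, hence the timestamp divisor above:)

    def ticks_to_us(ticks: int) -> float:
        # always-on counter runs at 19.2 MHz -> 19.2 ticks per microsecond
        return ticks / 19.2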

* -1 loc in _update_exec

* remove noqa

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2024-09-02 19:35:47 +03:00
nimlgen
8e2a3fc165 raise line count to 9300 for qcom (#6336) 2024-09-02 18:57:57 +03:00
George Hotz
e6ae332a26 hotfix: FIX_METAL_ICB isn't needed on M3 2024-08-31 11:50:02 -07:00
George Hotz
406ec8240e hotfix: lin_fail_41 passes on my M3 Max 2024-08-31 11:46:46 -07:00
Roelof van Dijk
ad4b3b457f bump limit for test_llama_embedding_opt (#6332) 2024-08-31 10:03:43 -04:00
George Hotz
72939901fc hotfix: ebs print kernel names 2024-08-29 21:20:36 -07:00
George Hotz
365babe391 precompute early_reject [run_process_replay] (#6327)
* precompute early_reject [run_process_replay]

* features for ebs

* fix ocelot cache
2024-08-29 18:26:24 -07:00
George Hotz
385904526f remove more rules [run_process_replay] (#6326)
* remove more rules [run_process_replay]

* disable invalid test

* ptx needs that str
2024-08-29 16:27:10 -07:00
George Hotz
23081c4580 folding rules cleanup [run_process_replay] (#6325)
* folding rules cleanup [run_process_replay]

* tighten combine
2024-08-29 15:15:44 -07:00
George Hotz
56cd25e43f dearg consts [run_process_replay] (#6324) 2024-08-29 15:00:22 -07:00
nimlgen
9b616cb33e HCQArgsState lifetime docs (#6323) 2024-08-30 00:31:49 +03:00
Roelof van Dijk
56b7fadc2f perf: skip type verify with -O (#6319) 2024-08-29 13:47:51 -07:00
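(How the -O skip works, in miniature: python3 -O sets __debug__ to False and strips assert statements. A sketch, not the exact tinygrad guard:)

    def type_verify(uops):  # stand-in for the real, expensive check
        print("verifying", len(uops), "uops")

    if __debug__:  # False under `python3 -O`, so the check disappears entirely
        type_verify([])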
qazal
7a08b881ed st_fixup explicit UOp init [run_process_replay] (#6320) 2024-08-29 23:21:10 +03:00
qazal
539654fbe1 graph_rewrite complexity tests [run_process_replay] (#6317) 2024-08-29 22:39:08 +03:00
qazal
07942ef361 Proposal: Better UOps.SWIZZLE (#6309)
* better UOps.SWIZZLE

* test_swizzle_rewrite

* add it to docs

* show a diff

* a lil more verbose

* two teeny notes

* hotfix: sink
2024-08-29 15:39:48 +03:00
qazal
8c50ef8b7c start uop docs (#6291)
* start uop docs

* only need show_labels

* sink comes first

* hotfix: invalid

* touchups

* 2 space indent works

* limit some buffer uops

* better BARRIER doc, Op -> UOp when it makes sense.

* make KernelInfo optional

* more work

relative links don't work

* this can be local in multi reduce+pads

* add UOps.SHAPETRACKER details

* UOps.CONST both types

* nit: local buffer isn't device Buffer, habit

* nit2: dtype -> DType
2024-08-29 15:22:39 +03:00
qazal
dd4e5f1c8d process replay rewrite (#6284)
* process replay rewrite

p2

* start some unittests + exceptions and exits

* shebang

* remove extra kernel init
2024-08-29 15:08:27 +03:00
pedro
7de4eac8f7 add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308)
* add `nearest` mode to interpolate

matching pytorch `nearest`, which is known to be buggy

+ relevant TestOps

* add `nearest-exact` mode to interpolate

matching pytorch `nearest-exact`

+ relevant TestOps

* fix uint8 bilinear interpolation

by matching custom torch implementation

* implement uint8 lerp with torch interpolation trick

without converting it to float
2024-08-28 21:59:51 -07:00
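(A fixed-point sketch of that lerp trick: weights in Q8, intermediate math in int32, no float tensor data. The exact rounding torch uses may differ:)

    import numpy as np

    def lerp_u8(a: np.ndarray, b: np.ndarray, w: float) -> np.ndarray:
        wi = int(w * 256)                 # weight in Q8 fixed point
        d = b.astype(np.int32) - a        # signed difference, no overflow
        return (a + ((d * wi + 128) >> 8)).astype(np.uint8)

    x, y = np.uint8([0, 100, 200]), np.uint8([255, 110, 100])
    print(lerp_u8(x, y, 0.5))  # midpoints, computed without float tensors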
George Hotz
638b4843da fix for metal ICB issue on M1/M2 [run_process_replay] (#6313)
* this is a working fix

* better comment

* repro
2024-08-28 21:31:14 -07:00
wozeparrot
cb61cfce24 feat: example and extra tweaks (#6310) 2024-08-28 19:26:11 -07:00
wozeparrot
ea5b7910b7 AMD support gfx103x (#5926) 2024-08-28 14:17:08 -07:00
gswangg
94a72d44d2 update CI tests in extra with UOp AST (#6290) 2024-08-28 22:26:50 +03:00
Tobias Fischer
3517aa89d9 sdxl batched inference fixes (#6293) 2024-08-28 07:44:58 -04:00
Roelof van Dijk
85591bd1ae no need for functools here (#6303) 2024-08-28 01:19:57 -07:00
nimlgen
b1e5343133 nv better error msg for p2p failure (#6301)
* nv better error msg for p2p failure

* linter

* from

* mypy
2024-08-28 01:40:45 +03:00
nimlgen
ac303146ca nv ensure qmd addr is less than 40 bits (#6288) 2024-08-27 20:47:38 +03:00
George Hotz
5ed6c6ef3e hotfix: 220V 15A -> 220V 20A 2024-08-27 10:20:43 -07:00
qazal
ec34d9ee36 start benchmarking ast graph rewrite (#6297)
* ast_rewrite to ctx var

* add external_benchmark_ast

* refactor to asts

* track lazybuffers

* more work

* record checkpoint

* cleanup
2024-08-27 18:18:44 +03:00
qazal
552fbd5527 update llm.c with UOp ast [run_process_replay] (#6296) 2024-08-27 15:04:54 +03:00
Tobias Fischer
211bfb6d8a fixed batched clip computation (#6292) 2024-08-26 20:48:15 -04:00
ignaciosica
3918f6eea0 refactor amd render_kernel (#6223)
* refactor amd render_kernel

* fix spacing

* add half alias back

* use itemsize * 8 instead of fixed values

* reverting because it broke, as 32 was no longer the default

* remove comment

* remove nested tuples

* hotfix: prefix.append

* hotfix2: is not None

* more diff cleanups

* hotfix 4: spacing changes must not be in the same diff

* revert wmma dtype rendering

---------

Co-authored-by: qazal <qazal.software@gmail.com>
2024-08-27 00:28:36 +08:00
ignaciosica
3132449086 refactor _make_{cuda/clang}_dtype into render_vector_prefix (#6287) 2024-08-26 09:14:44 -07:00
Max-We
ab2714423b Add einsum tests (#6286)
Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>
2024-08-26 09:09:25 -07:00
chenyu
b76f0c875e lazy const fold idiv 1 (#6285) 2024-08-26 10:29:59 -04:00
chenyu
af7c04ff57 Tensor.__floordiv__ (#6283)
support Tensor.__floordiv__ and friends
2024-08-26 09:43:40 -04:00
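(Usage once this lands; the reflected form is inferred from "and friends" and is an assumption:)

    from tinygrad import Tensor

    t = Tensor([7, 8, 9])
    print((t // 2).tolist())   # [3, 4, 4]
    print((20 // t).tolist())  # reflected form, assuming __rfloordiv__ is among the "friends"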
qazal
d2f8eeed2e make [compare_schedule] the default [run_process_replay] (#6273)
* make [compare_schedule] the default

* capture ctx

* logging

* set capture to false
2024-08-26 21:40:03 +08:00
qazal
067aeaeb2f single arange fusion with graph rewrite (#6160) 2024-08-26 18:18:16 +08:00
qazal
b4381e9777 uop output_st is Optional [run_process_replay] (#6282) 2024-08-26 17:58:55 +08:00
qazal
1c0456af89 add UOps.SWIZZLE (#6271)
* add UOps.SWIZZLE

* flip swizzle init

* generic st_fixup
2024-08-26 16:08:51 +08:00
CaltropHungerton
002f60b4c3 fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192)
* fix wmma flop counting on intel, add count tests

* half

* add half gemm

* Update test.yml

* one test

* Update test_uops_stats.py

* Update test_uops_stats.py

* Update test_uops_stats.py

* smaller matrix, use unittest skipUnless decorator
2024-08-25 18:37:05 -07:00
Tobias Fischer
331b0f5477 new clip gather (#6277) 2024-08-25 19:27:24 -04:00
qazal
f0cc8ca5f2 generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278) 2024-08-25 11:02:17 +03:00
qazal
70015bd89c move permute_reduces to uop movementops [run_process_replay] (#6272) 2024-08-25 10:25:51 +03:00
chenyu
b86907c6c7 UOp.const(x.dtype, y) -> x.const(y) [run_process_replay] (#6276) 2024-08-24 21:39:50 -04:00
chenyu
00282afa41 identity element of binary ops (#6275)
helper for the value a reduce acc is initialized to (0 for ADD, 1 for MUL, and -inf for MAX)
2024-08-24 18:10:19 -04:00
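(A tiny sketch of such a helper; names are illustrative, not the actual tinygrad API:)

    import math

    def identity_element(op: str) -> float:
        # the accumulator init value: combining it with x leaves x unchanged
        return {"ADD": 0, "MUL": 1, "MAX": -math.inf}[op]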
qazal
ee245b48a9 refactor reduceop swizzling (prep for UOps.SWIZZLE) [compare_schedule] (#6269) 2024-08-24 18:17:19 +03:00
gswangg
3cf507ae7f remove extra.ops and LazyOp support from Kernel (#6267)
* remove extra.ops and BufferOps

* remove extra.ops and LazyOp support in Kernel
2024-08-24 16:44:38 +03:00
qazal
ccb05d8baa fixup neg tests [run_process_replay] (#6268) 2024-08-24 16:35:43 +03:00