tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-01-31 01:38:20 -05:00

Author	SHA1	Message	Date
qazal	f4e83b30b4	cleanup process_replay/* namings [run_process_replay] (#6429 )	2024-09-09 16:59:04 +08:00
George Hotz	8186e4e7d6	add UOps.VALID (#6387 ) * uops valid * broke full_shape * fixup that st (hardcoded asts still red) * fixup DEFINE_VAR debug more debug * start moving stuff to ast_const * move test_linearizer * move test_linearizer_failures to ast_const * fixup test_schedule * small diff change * regenerate dataset * fixup test_multitensor * regen dataset try 2 --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-09-09 16:58:43 +08:00
qazal	c5bae55ec8	new generate_dataset.sh (#6423 ) * new generate_dataset.sh * keep those there * test: rm expected failures * rename to extract	2024-09-09 15:13:07 +08:00
geohotstan	65da03e186	remove _slice [run_process_replay] (#6395 ) * try * pass * clean up * done * I'm becoming dumber * clean up 2 * remove useless max * useless but make computer brrr [run_process_replay] * try process replay * try again * 1 less line, just use pad2d	2024-09-08 09:12:39 +08:00
George Hotz	8f6d0485e7	hotfix: resnet to obj.device	2024-09-06 13:06:02 +08:00
George Hotz	9d72119a0c	minor resnet cleanups (#6382 ) * minor resnet cleanups * that should have been long * jit * meh	2024-09-06 12:50:21 +08:00
George Hotz	86d34daac9	UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383 )	2024-09-06 12:38:35 +08:00
George Hotz	72be31cb56	remove mla [run_process_replay] (#6357 ) * remove mla * other bad uses of const	2024-09-05 10:37:46 +08:00
Vyacheslav Pachkov	4c33192a8b	add qcom runtime (#5213 ) * qcom: driver init * autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros * autogen: add adreno commands and registers * ops_qcom: QcomAllocator + signals * fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom * qcom: we do not really need all these constants input/output is enough * qcom: perfctr for CS (do not really need all the rest) * qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max * qcom: explicitly set instruction len based on the shader size * ops_qcom: Program init extracts shader from open cl binary sets input/output buffers allocates stack sets cs mode runs shader * use data64_le from helpers * ops_qcom: use fill_kernargs for filling i/o buffers * ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset * new signals & fix exec * add QCOM to the list of supported devices * correct QcomComputeQueue._wait using CP_WAIT_REG_MEM * fix exec, synchronize before copyout * correct setting num_units for ST_SHADER * fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway * extract offsets to kernel arguments from opencl binary * extract constants values and offsets from opencl binary * handle KGSL_MEMFLAGS_USE_CPU_MAP correctly * align kernel name to 4 bytes when skipping kernel opencl struct * skip to consts directly using an offset from opencl binary header * fix alloc * get halfreg and fullreg from opencl bin * set unmultipled global sizes as kernel group in HLSQ_CS_NDRANGE * parse prg offset from open cl binary * save loc with HLSQ_CS_CNTL. set this with HLSQ_CONTROL_2_REG * support for vals in _fill_kernargs * support 16-bit constants * use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts this helps to not fall down when executing big kernels /* Don't time out if the context has disabled it */ if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE) return; minor changes of _exec * QCOMRenderer * disable HCQGraph for demo. TOOD: support HCQ update api * support HCQ - remove copy queue - add updates - add strides for buffs and vars for QCOM * bufs_stride * clean ups * linter * call super().__init__(value) in QcomSignal * disable=unused-import * mypy * type ignore when queue is on the device * fix * query gpu_id. Will be useful for selecting commands e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7 * working timestamps * free context after device is done * move gpu stack to the device * reserve some space with lib_gpu for gpu to write to this fixes test_interpolate_bilinear * exclude tests that fails with GPU=1 on qualcomm * lint * unmap mem in _gpu_free * ctxt priority and preemtion policy * remove old qcom * pass size to self.device.allocator.free * skip tests only on qcom * use kgsl and adreno defines instead of numeric vals * use allocator for allocating lib_gpu * update to QcomArgsState from master * intermediate commit while conquering images * enable image tests on qcom * fix shader disasm size, dump textures stuff * working images * allow signals to be 0 * set branchstack from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * set shared memory size from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * update images in QcomArgsState & less loc for images * set stack sizes from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * stack allocation based on OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * better autogen for kgsl and adreno. no more bitshifts Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * cleanup commit for parse cl lib Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * dont forget actual generated files * refactor + less loc Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * device.py back * lint * ruff * timestamp divisor Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * fix tex fmt & round global size Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * dtypes * 19.2MHz * -1 loc in _update_exec * remove noqa --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-09-02 19:35:47 +03:00
wozeparrot	cb61cfce24	feat: example and extra tweaks (#6310 )	2024-08-28 19:26:11 -07:00
gswangg	94a72d44d2	update CI tests in extra with UOp AST (#6290 )	2024-08-28 22:26:50 +03:00
Tobias Fischer	3517aa89d9	sdxl batched inference fixes (#6293 )	2024-08-28 07:44:58 -04:00
Tobias Fischer	211bfb6d8a	fixed batched clip computation (#6292 )	2024-08-26 20:48:15 -04:00
Tobias Fischer	331b0f5477	new clip gather (#6277 )	2024-08-25 19:27:24 -04:00
qazal	bcb2f1caa3	init REDUCE_AXIS with BinaryOps (#6256 ) * REDUCE_AXIS arg with BinaryOps * more work in kernel.py fixup sops.gz * fix TestGraphRewriteEfficiency	2024-08-24 11:28:41 +03:00
qazal	0d4887e9df	use UOps.WMMA everywhere (#6255 ) * add UOps.WMMA_AXIS * delete ReduceOps.WMMA from ops	2024-08-23 15:03:26 -04:00
chenyu	590c0922b6	Tensor.prod (#6250 ) * Tensor.prod a new reduce op! * onnx ReduceProd	2024-08-23 10:06:32 -04:00
chenyu	e745e16441	remove UnaryOps.NEG (#6238 ) * Remove UnaryOps.NEG generated new dataset with ``` time JIT=2 PYTHONPATH=. ./extra/optimization/generate_dataset.sh gzip /tmp/sops mv /tmp/sops.gz extra/datasets/ ``` * fix that	2024-08-22 14:21:39 -04:00
Francis Lam	7376b67e36	extra/gemm/triton_nv_matmul: fix Program arguments (#6212 ) remove op_estimate	2024-08-20 14:05:38 -07:00
Francis Lata	8fd8b970b0	update URL to eval cases from recent MLPerf file movements (#6201 )	2024-08-20 08:43:13 -04:00
chenyu	9db2d0d5c6	fix some type error in onnx [run_process_replay] (#6153 )	2024-08-17 19:54:20 -04:00
chenyu	7c9c8ce22f	use TensorProto enum in onnx dtype mapping [run_process_replay] (#6151 )	2024-08-17 17:58:40 -04:00
George Hotz	9bc81c6db4	UOps.SHAPETRACKER (#6129 ) * UOps.SHAPETRACKER [run_process_replay] * no process replay	2024-08-16 23:26:34 -07:00
George Hotz	89c7989659	no shapetracker in ops [run_process_replay] (#6117 )	2024-08-16 17:23:27 -07:00
George Hotz	74ee9febec	remove iter from uopgraph (#6110 ) * remove iter from uopgraph * linearize returns uops * fix tests * linearize in linearize * tests fix * touchup * test failures	2024-08-16 15:58:29 -07:00
qazal	28c75bf2a6	merge uops with ops (#6111 ) Co-authored-by: chenyu <chenyu@fastmail.com>	2024-08-16 18:17:57 -04:00
qazal	c23d44c779	AST is UOp (#6030 ) * most of the work from the uops2 branch * schedule * realize * kernel * lowerer * search * green * merge uops with ops * Revert "merge uops with ops" This reverts commit `1408a59f12`. * fix benchmark * remove extra dedup	2024-08-16 22:09:00 +03:00
CaltropHungerton	38fb1e14a2	Intel XMX Tensor Core Support (#5622 ) * fixed xmx demo * i think i'm invoking the DPAS but it's slow * compiler build arg to stop register spilling, indicated where to fix flop counter * don't mind this * do NOT mind me * do not mind me * do not view * i will add bf16 later * in process of figuring out tc fields * we figured out the fields!!! * added check for cl device vendor, added seperate IntelRenderer * remove tc thread_local_aliases * cleaning debris before draft pr * edits for linter * deduping and checking device extensions * i will find more line reductions in other places * before merge upstream * double grf size in compiler to fix register spilling (bandaid), device checking changes * tc python emulation * fixed emulation * tests for emulated intel tensor core * TC=0, 1 working on upstream, fixed perf * test * debris * check for specialized cl device when we canonicalize device * bf16 support, tc=3 test added * address tests * revert half2 loads on intel tc, cleanup * linter * fold_expanded revert * lint, whitespace fix * cuda bf16 (only one with bf16) is skipped in test tensor cores, so i will skip for intel bf16 too * make line shorter, no need for noqa E501 * removed device intel * fix python emulation --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-08-16 09:19:21 -07:00
nimlgen	7ab531aede	autogen cleanup (#6064 ) * start autogen cleanup * nvgpu * better? * better * amd part * gpu regen * fix mockgpu amd * nv * amd fix linter * remove import * ugh * nv on master * amd on master	2024-08-14 20:20:35 +03:00
wozeparrot	059cf2a90d	feat: autogen from kernel register offset headers (#6056 )	2024-08-12 14:08:35 -07:00
wozeparrot	dc2617bffd	feat: use more correct reg for local dims (#6048 )	2024-08-12 11:15:37 -07:00
chenyu	e6c7c3e499	update pylint path to check indent/space for all (#6022 ) also fixed many errors. it was not checking nested dirs. exclude autogen for now. can we use ruff for this?	2024-08-10 14:41:09 -04:00
wozeparrot	d269bc95fa	faster tinychat (#5993 )	2024-08-08 19:16:26 -07:00
George Hotz	bc55c8a30e	pmatmul example + GB/s bugfix [run_process_replay] (#5974 ) * pmatmul example + bugfix * improve pmatmul * Update real_pmatmul.py	2024-08-07 22:32:11 -07:00
George Hotz	bf8ec23b00	hotfix: contiguous on precompute_freqs_cis	2024-08-07 14:40:56 -07:00
wozeparrot	5808e8a30f	mockgpu remu changes (#5925 )	2024-08-05 19:26:58 -07:00
wozeparrot	6740a0a6a0	hip_ioctl changes (#5917 )	2024-08-05 11:58:38 -07:00
chenyu	996ff0c135	pow(2) -> square in RMSNorm [run_process_replay] (#5901 ) reads nicer in metadata	2024-08-04 14:21:31 -04:00
Elias Wahl	4a114756f6	New BERT dataloader (#5881 ) * One file == One topic * update test * new dataloader * update train script * get index is faster	2024-08-02 15:12:23 -04:00
nimlgen	34168a64e3	optimize nv profiler (#5856 ) * nv profiler fix * cleanup hcq a bit * fixes * fix * typo * all signals put timestamp * a bit cleaner * merge fields * type * import * tiny fix	2024-08-01 23:57:45 +03:00
Vyacheslav Pachkov	610e454132	fix opencl_ioctl on comma (#5814 ) - remove unused code - add CP_REG_TO_MEM opcode - fixed parse_cmd_buf for more than 1 command object by correcting an offset - fixed memory mappings for cases when memory was allocated with KGSL_MEMFLAGS_USE_CPU_MAP. KGSL_MEMFLAGS_USE_CPU_MAP: If set on call and return, the returned GPU address will be 0. Calling mmap() will set the GPU address. So there are no IOCTL_KGSL_GPUOBJ_INFO ioctls for that type of memory and it resulted to crash right after get_mem.	2024-07-30 20:44:06 -07:00
David Hou	9a485f36e4	shard kvcache (#5830 )	2024-07-30 20:29:54 -07:00
George Hotz	4e89d45513	hotfix: put contiguous back in llama	2024-07-30 18:43:48 -07:00
George Hotz	21c5e8e1b7	extreme llama speed, 57.34 tok/s (#5827 ) * extreme llama speed * mergable	2024-07-30 18:32:09 -07:00
George Hotz	e6879035a0	work to make GEMV fast (#5824 ) * work to make GEMV fast * half8 cast * align struct * fix amd * float8 is a later problem	2024-07-30 17:41:40 -07:00
Francis Lata	ce61be16f1	clean up how preprocessed folder is defined (#5813 )	2024-07-30 12:35:26 -04:00
chenyu	471b188d79	fix mypy errors in latest mypy (#5794 ) * fix mypy errors in latest mypy mypy has stricter partial and api arg checks now * PYTHONPATH="."	2024-07-29 14:53:30 -04:00
nimlgen	ea27ec4cd0	nv switch classlist_v2 to classlist (#5763 ) * nv switch classlist_v2 to classlist * support in mockgpu * fix mockgpu	2024-07-28 20:24:42 +03:00
chenyu	3686b6726a	move GraphException to jit.py (#5744 ) same place where GraphRunner is defined	2024-07-26 19:01:12 -04:00
George Hotz	489a5b99a5	hotfix: triton_nv_matmul touchups	2024-07-24 23:24:29 +00:00

780 Commits