tinygrad

mirror of https://github.com/tinygrad/tinygrad.git synced 2026-02-18 10:31:41 -05:00

Author	SHA1	Message	Date
chenyu	9ed2b8b818	fix DEFINE_VAR setup in test_uop_graph [run_process_replay] (#6392) making sure arg always have 3 items	2024-09-06 05:32:12 -04:00
George Hotz	282af21b95	hotfix: DEBUG_EXPAND -1 and NOOPT in benchmark schedule	2024-09-06 17:22:30 +08:00
chenyu	9a9fea7b8c	move DEFINE_VAR min/max from src to arg (#6388) new arg is (Variable, min as CONST, max as CONST)	2024-09-06 05:01:02 -04:00
qazal	f1bd2a5519	fix BUFFER_UOPS sts in verify_ast [run_process_replay] (#6389)	2024-09-06 16:59:22 +08:00
chenyu	cc05016fa8	move test_pattern_matcher to test/unit (#6386)	2024-09-06 03:22:43 -04:00
George Hotz	86d34daac9	UOps.PHI -> UOps.ASSIGN [run_process_replay] (#6383)	2024-09-06 12:38:35 +08:00
chenyu	002303c145	fix output of truncate_fp16 (#6381) make sure the non-inf path returns the truncated value	2024-09-05 22:55:43 -04:00
George Hotz	c88329244b	create rewrite.py [run_process_replay] (#6379) * create rewrite.py [run_process_replay] * fix tests * not in rewrite or ops * skip flaky test	2024-09-06 10:51:01 +08:00
George Hotz	66e7e51c79	Revert beam failure (#6376) * Revert "late gate creation for STORE [run_process_replay] (#6373)" This reverts commit `c26744de9f`. * Revert "gated store rewrite to UOps.IF (#5976)" This reverts commit `48061e8400`.	2024-09-06 09:36:44 +08:00
ignaciosica	c15506fc35	[WIP] amx support as TC (#5693 ) * almost working with relu, even hackable... but acc size is wrong, fix needed * upcast based on threads, change thread size to 4x4 * revert wrongfully commented assert * fix tc load indexing * modify for size 8 * fix bug for size 8 * Revert "fix bug for size 8" This reverts commit `cdb3f5df85`. * Revert "modify for size 8" This reverts commit `3ef0904bd9`. * good kernel with changes in lowerer * revert "good kernel with changes in lowerer" This reverts commit `975e2b5a4e`. * good kernel for relu! * refactor lowerer changes * add amx context var to helper * clean up amx flag * improve lowerer changes readability * improve check for amx * revert lowerer if * add float4 type rendering for clang * add amx definitions * enable indexing for clang if amx * working amx example, wrong because of dims * almost works for float 16, need to spot using double load in amx * cleaner render_kernel * revert chages in simple_matmul and delete env * add new var upcast_offset to get_optimized_ast * change axis for axes * invert if in rendering phi * fix some bugs * fix linearizer tests * fix vec/get pat for amx * remove clang tc if amx is disabled * add ops_python support * refactor into one complementary function in ops_python * add job for EMUALTE_AMX * improve checking for AMX in UPCAST and TC extra ops * fix lint issue * commit before refactor into autocontained AMX * start refactor by removing special rendering for AMX * all ready for amx handcoded kernel * working poc, most straightforward amx support * avoid local opts for tc if amx * fix merge bugs * skip test for clang * skip tc hand-coded opts if amx * remove hardcoded ops_python values * remove hardcoded sizes for amx kernel * fix ops_python bug where dim was hard-coded * change contract for vectorize * working without changes in lowerer * revert changes in gep rendering * fix ops_python * modify comment * skip test if clang for different type accumulation * move 
rename and bug for seperate pr * fix wrong path for test * addmm not implemented in torch for cpu * change struct for vector; equally slow but cleaner * revert modified test * simply wmma rendering * minor change * noqa:501 * add length 16 for AMX * fix vectorized half issue * fix error * remove comment * change set for dedup * split test of tensor_core_extra_ops so that cases that dont require locals run for AMX * add amx reference * load acc into amx registers * fix dtype rendering and remove noqa * moved tests change into another pr * add real AMX job for CI and fix bug * fix ops_python bug * fix test class * remove real AMX tests and fix uops_stats test * remove wrong test * acc folding * hotfix: bug * fix float4 tests for amx * hack for fixing flops counting * hotfix: mypy * add flop counts test for amx * improve test_float4_multidim_amx * improve test_float4_multidim_amx * improve test_float4_multidim_unaligned_load_amx * nits tests --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>	2024-09-06 09:01:10 +08:00
qazal	c26744de9f	late gate creation for STORE [run_process_replay] (#6373)	2024-09-06 03:32:19 +08:00
Ian Paul	48061e8400	gated store rewrite to UOps.IF (#5976 ) * Core change to gate stores in IFs * Updates to cstyle renderer to handle IFs around STOREs * Make uops asserts happy * Add tests and fix newly broken tests * make ruff happy * make mypy happy * Simplify renderer to have all gated stores use IF * Revert some changes * Make test_where_fold happy * Revert unnecessary handling of ifs rendering. Was included before when changes weren't fully built out * Rewrite graph to have IFs be dependent on RANGEs if STORE is already dependent on RANGE * Re-change broken test * Make ifs be grouped together * get non-merged IFs working. ALl tests pass except grouping related ifs together * Fix tests by making the IF UOp dependent on the correct node of the STORE UOp * Changes to uopgraph * Simplify graph rewrite logic * Changes to get test_padto_where_multireduce working * Simplify uops.store renderer * Make test_padto_where_multireduce pass but now other tests fail * Clean up uopgraph from scrach work * Ignore sudo IF srcs when rendering * Attempt to fix llvm tests * rm comment * reduce lines * Add line to make mypy happy :( * llvmir fix pt 1 * Mods after rebasing to master * Fix llvmir * Fix ptx tests * Fix other ptx tests * Move changes from uops.py to ops.py * rm uops.py * Fix TestGateStoreRewrite tests * Get multireduce tests working * reset to remote branch * Fix linearizer tests * uop_graph test patch * Add comment to create_gate * hotfix: uncomment those tests * Attempt to fix ptx tests by including whitespace inside if block * Patch from remote tinybox. 
Tests passing here * Min changes to get some ptx tests passsing * Changes after rebase * Exclude ifs and endifs from ptx * IF conditional branching within ptx * Save lines on delete_redundant_gates * Simplify merge_gates * rm noqa * Remove unnecessary checks when merging gates * Fix ops error msg * Smarter check for if/endif in llvmir * simplify delete redundant gates to only have 2 returns * spacing * Smarter check at beginning of merge_gates * patches from comments * Remove need for merge_gates * include proper srcs in IF from the get-go * test expand ifs dumb will result in 4 ifs, not 1 now * Make tests happy * Fix uops stats * rm merge_gates method. Will add back in separate PR * Spacing * cleaner error msg * Fix uops rendering when expanding. test_failure_43 * patch tests * undo changes in delete_redundant_gates * process replay attempt * re-intro deletion of redundant gates * fix addition of gates when they get nested in stores and loads * patch tests * smarter init of IF srcs when adding gate to STORE * make ruff happy * Resp to comment * include all src[2]'s srcs in IF for gated store * add reference of the storing value to the gate's src * minor patch after rebasing * change ptx renderer --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-09-06 01:05:30 +08:00
nimlgen	a1a15b54c9	qcom cache flush (#6367) * qcom cache flush * bench * linter * move	2024-09-05 13:23:39 +03:00
chenyu	62f9f273f7	increase test_profile_multidev_transfer threshold (#6370) flaky, bumpped to 16000 for CI	2024-09-05 05:49:32 -04:00
George Hotz	e882294c02	uops touchups [run_process_replay] (#6368) * uops touchups [run_process_replay] * those are classmethods * oops, kwargs * no kwargs there	2024-09-05 17:22:32 +08:00
George Hotz	a28ed7ba4d	math trait [run_process_replay] (#6364) * math trait [run_process_replay] * const -> const_like * Revert "const -> const_like" This reverts commit `85727c83d3`. * add MathTrait to LazyBuffer * clean up function * fixup the rest of function * fix custom function * mlb math trait * fix that test	2024-09-05 16:19:17 +08:00
George Hotz	4a51c28ee7	switch const to const_like [run_process_replay] (#6356) * const like * no more _const * missed one * mypy ops.py * file missing * const_like * fix image and test uop graph [run_process_replay] * fix ptx	2024-09-05 13:57:54 +08:00
George Hotz	0d6922edb4	faster local tests. copy torch permuted to defautl device [run_process_replay] (#6363)	2024-09-05 13:57:20 +08:00
chenyu	6fd24561d1	distribute MUL const into ADD for int (#6361) pre-req for real_stride	2024-09-05 01:36:57 -04:00
qazal	e7f6b654ad	cleanup uop eq asserts for swizzle [run_process_replay] (#6362) * cleanup uop eq asserts for swizzle [run_process_replay] * more stuff	2024-09-05 13:36:36 +08:00
Oleg Rybalko	64f1384f5b	Einsum ellipsis support (#6333) * working ellipsis expansion * refactor * fix commas in output * add capital letters * refactor	2024-09-05 10:08:55 +08:00
nimlgen	326a77336e	qcom remove some tests skips (#6353)	2024-09-04 15:38:18 +03:00
qazal	99018a4aa1	minor schedule differ utils [run_process_replay] (#6348) * minor schedule differ utils [run_process_replay] * rm	2024-09-04 03:41:38 +08:00
nimlgen	3adb76894d	validate image=2 float16=1 openpilot benchmark (#6346) * validate image=2 float=16 openpilot * linter * linter2	2024-09-03 20:13:40 +03:00
qazal	2f00bf0c78	conv bw in one kernel with graph_rewrite (#6330) * double reduce merger * add test_fold_conv_relu_backward_ast_rewrite * a correctness test to iterate on * merge axes the other way around * better	2024-09-03 03:53:53 +08:00
Vyacheslav Pachkov	4c33192a8b	add qcom runtime (#5213 ) * qcom: driver init * autogen stubs for msm_kgsl also fixup ioctls to show numbers instead of _IOW macros * autogen: add adreno commands and registers * ops_qcom: QcomAllocator + signals * fix EDEADLK in hwqueue, init timestamps, use opencl compiler for qcom * qcom: we do not really need all these constants input/output is enough * qcom: perfctr for CS (do not really need all the rest) * qcom: HALFREGFOOTPRINT and FULLREGFOOTPRINT are set to be around max * qcom: explicitly set instruction len based on the shader size * ops_qcom: Program init extracts shader from open cl binary sets input/output buffers allocates stack sets cs mode runs shader * use data64_le from helpers * ops_qcom: use fill_kernargs for filling i/o buffers * ops_qcom: add QcomCopyQueue just for api & set kernargs_args_offset * new signals & fix exec * add QCOM to the list of supported devices * correct QcomComputeQueue._wait using CP_WAIT_REG_MEM * fix exec, synchronize before copyout * correct setting num_units for ST_SHADER * fix gpu hangs on sigs with CP_MEM_WRITE, it is uncached mem anyway * extract offsets to kernel arguments from opencl binary * extract constants values and offsets from opencl binary * handle KGSL_MEMFLAGS_USE_CPU_MAP correctly * align kernel name to 4 bytes when skipping kernel opencl struct * skip to consts directly using an offset from opencl binary header * fix alloc * get halfreg and fullreg from opencl bin * set unmultipled global sizes as kernel group in HLSQ_CS_NDRANGE * parse prg offset from open cl binary * save loc with HLSQ_CS_CNTL. 
set this with HLSQ_CONTROL_2_REG * support for vals in _fill_kernargs * support 16-bit constants * use KGSL_CONTEXT_NO_FAULT_TOLERANCE for contexts this helps to not fall down when executing big kernels /* Don't time out if the context has disabled it / if (drawobj->context->flags & KGSL_CONTEXT_NO_FAULT_TOLERANCE) return; minor changes of _exec * QCOMRenderer * disable HCQGraph for demo. TOOD: support HCQ update api * support HCQ - remove copy queue - add updates - add strides for buffs and vars for QCOM * bufs_stride * clean ups * linter * call super().__init__(value) in QcomSignal * disable=unused-import * mypy * type ignore when queue is on the device * fix * query gpu_id. Will be useful for selecting commands e.g. CP_EVENT_WRITE vs CP_EVENT_WRITE7 * working timestamps * free context after device is done * move gpu stack to the device * reserve some space with lib_gpu for gpu to write to this fixes test_interpolate_bilinear * exclude tests that fails with GPU=1 on qualcomm * lint * unmap mem in _gpu_free * ctxt priority and preemtion policy * remove old qcom * pass size to self.device.allocator.free * skip tests only on qcom * use kgsl and adreno defines instead of numeric vals * use allocator for allocating lib_gpu * update to QcomArgsState from master * intermediate commit while conquering images * enable image tests on qcom * fix shader disasm size, dump textures stuff * working images * allow signals to be 0 * set branchstack from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * set shared memory size from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * update images in QcomArgsState & less loc for images * set stack sizes from OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * stack allocation based on OpenCL binary Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * better autogen for kgsl and adreno. 
no more bitshifts Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * cleanup commit for parse cl lib Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * dont forget actual generated files * refactor + less loc Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * device.py back * lint * ruff * timestamp divisor Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * fix tex fmt & round global size Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com> * dtypes * 19.2MHz * -1 loc in _update_exec * remove noqa --------- Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com> Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>	2024-09-02 19:35:47 +03:00
George Hotz	406ec8240e	hotfix: lin_fail_41 passes on my M3 Max	2024-08-31 11:46:46 -07:00
Roelof van Dijk	ad4b3b457f	bump limit for test_llama_embedding_opt (#6332)	2024-08-31 10:03:43 -04:00
George Hotz	72939901fc	hotfix: ebs print kernel names	2024-08-29 21:20:36 -07:00
George Hotz	365babe391	precompute early_reject [run_process_replay] (#6327) * precompute early_reject [run_process_replay] * features for ebs * fix ocelot cache	2024-08-29 18:26:24 -07:00
George Hotz	385904526f	remove more rules [run_process_replay] (#6326) * remove more rules [run_process_replay] * disable invalid test * ptx needs that str	2024-08-29 16:27:10 -07:00
qazal	539654fbe1	graph_rewrite complexity tests [run_process_replay] (#6317)	2024-08-29 22:39:08 +03:00
qazal	07942ef361	Proposal: Better UOps.SWIZZLE (#6309) * better UOps.SWIZZLE * test_swizzle_rewrite * add it to docs * show a diff * a lil more verbose * two teeny notes * hotfix: sink	2024-08-29 15:39:48 +03:00
qazal	dd4e5f1c8d	process replay rewrite (#6284) * process replay rewrite p2 * start some unittests + exceptions and exits * shebang * remove extra kernel init	2024-08-29 15:08:27 +03:00
pedro	7de4eac8f7	add support and tests for nearest modes in interpolate, adapt uint8 bilinear to torch implementation (#6308) * add `nearest` mode to interpolate matching pytorch `nearest` which is knowingly buggy + relevant TestsOps * add `nearest-exact` mode to interpolate matching pytorch `nearest-exact` + relevant TestOps * fix uint8 bilinear interpolation by matching custom torch implementation * implement uint8 lerp with torch interpolation trick without converting it to float	2024-08-28 21:59:51 -07:00
qazal	ec34d9ee36	start benchmarking ast graph rewrite (#6297) * ast_rewrite to ctx var * add external_benchmark_ast * refactor to asts * track lazybuffers * more work * record checkpoint * cleanup	2024-08-27 18:18:44 +03:00
Max-We	ab2714423b	Add einsum tests (#6286) Co-authored-by: Maximilian Weichart <maximilian.weichart@icloud.com>	2024-08-26 09:09:25 -07:00
chenyu	b76f0c875e	lazy const fold idiv 1 (#6285)	2024-08-26 10:29:59 -04:00
chenyu	af7c04ff57	Tensor.__floordiv__ (#6283) support Tensor.__floordiv__ and friends	2024-08-26 09:43:40 -04:00
qazal	d2f8eeed2e	make [compare_schedule] the default [run_process_replay] (#6273) * make [compare_schedule] the default * capture ctx * logging * set capture to false	2024-08-26 21:40:03 +08:00
CaltropHungerton	002f60b4c3	fix intel wmma flop counting, add flop counting tests for different tensor cores (#6192) * fix wmma flop counting on intel, add count tests * half * add half gemm * Update test.yml * one test * Update test_uops_stats.py * Update test_uops_stats.py * Update test_uops_stats.py * smaller matrix, use unittest skipUnless decorator	2024-08-25 18:37:05 -07:00
qazal	f0cc8ca5f2	generic st_fixup in scheduler graph rewrite [compare_schedule] (#6278)	2024-08-25 11:02:17 +03:00
gswangg	3cf507ae7f	remove extra.ops and LazyOp support from Kernel (#6267) * remove extra.ops and BufferOps * remove extra.ops and LazyOp support in Kernel	2024-08-24 16:44:38 +03:00
qazal	ccb05d8baa	fixup neg tests [run_process_replay] (#6268)	2024-08-24 16:35:43 +03:00
gswangg	ea76b93814	migrate test_linearizer_dumb.py to UOp AST (#6241) * add imports and update test_unmerged_ifs to UOp AST * test_max_simplify_and_cancel * test_expander_new_srcs * test_llama_embedding * test_unaligns_idxs * test_unrolled_float4_align * test_upcasted_stores_out_of_order * remove LazyOp * remove extra/ops and replace ReduceOps.SUM with BinaryOps.ADD	2024-08-24 16:27:29 +03:00
gswangg	e44653e25a	migrate test_linearizer_failures.py to UOp AST (#6240) * add imports and update test_failure_1 to UOp AST * update test_failure_2 with UOp AST * update test_failure_3 * test_failure_5 * test_failure_6 * test_failure_7 * test_failure_8 * test_failure_9 * test_failure_10 * test_failure_11 * test_failure_12 * test_failure_12_multireduce * uncomment skip and migrate test_failure_13 * test_failure_14 * test_failure_15 * test_failure_16 * test_failure_17 * test_failure_18 * test_failure_19 * test_failure_20 * test_failure_21 * test_failure_22 * test_failure_23 * test_failure_24 * test_failure_25 * test_failure_26 * test_failure_27 * test_failure_28 * test_failure_29 * test_failure_30 * test_failure_31 * test_failure_32 * test_failure_33 * test_failure_34 * test_failure_36 * test_failure_37 * test_failure_38 * test_update_39 * test_failure_40 * test_failure_41 * test_failure_42 * test_failure_43 * test_failure_44 * test_failure_45 * test_failure_46 * test_failure_47 * test_failure_48 * test_failure_49 * test_failure_50 * remove LazyOp * reskip test_failure_22 * remove extra/ops * replace ReduceOps with BinaryOps * fixup that import --------- Co-authored-by: qazal <qazal.software@gmail.com>	2024-08-24 16:26:58 +03:00
gswangg	1dc6040877	migrate test_search.py to UOp AST (#6245) * add imports and update test_kernel_count with UOp AST * test_filter_global_buffer * remove LazyOp * remove extra.ops and ReduceOps --------- Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>	2024-08-24 16:13:53 +03:00
qazal	ae23540d6e	refresh process replay schedule ref in reset.py (#6265)	2024-08-24 16:12:51 +03:00
gswangg	7be5eede71	migrate test_linearizer_overflows.py to UOp AST (#6244) * add imports, remove ConstBuffer, and update test_overflow_1 with UOp AST * test_overflow_2 * test_overflow_3 * test_overflow_4 * test_overflow_5 * test_overflow_6 * test_overflow_7 * TestLinearizerOverflowAlt::test_overflow_1 * TestLinearizerOverflowAlt::test_overflow_2 * remove LazyOp * remove extra.ops * remove ReduceOps	2024-08-24 16:10:29 +03:00
chenyu	943ab97d24	fix Tensor.prod for multitensor (#6264)	2024-08-24 08:52:24 -04:00

2555 Commits