github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Thomas Raoux	ca8f110617	[BACKEND] Pipeliner refactoring (#2565 ) Refactor the pipeliner pass to make it more generic. The main change is that the pipeliner is now split into two pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR based on the modulo schedule. The advantage of separating the two pieces is that it allows us to create different schedules without having to change the expander, and it allows for more complex schedules. For now, the schedule generated for the matmul case roughly matches the schedule picked by the previous pipeliner in order to avoid changes. This also creates a different sequence of insert/extract slice for the alloc. We should probably change shared alloc to use memory semantics.	2023-11-02 09:56:39 -07:00
Thomas Raoux	218492cd65	[BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583 )	2023-11-02 05:32:16 +00:00
Dongdong Li	d0098da7b1	[BACKEND] Add error reporting to report non-kernel-argument (#2552 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-11-01 20:22:10 -04:00
Zahi Moudallal	3650213218	[OPTIMIZER] Thread local reduction optimization (#2542 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-31 16:13:36 -07:00
Goran Flegar	601b95cdbb	[DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570 ) Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>	2023-10-31 12:06:25 -07:00
daemondzh	96cf8f979a	[OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545 ) Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>	2023-10-31 02:28:27 +00:00
Chris Jones	2398b82f18	[FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-30 19:18:27 -07:00
Keren Zhou	70fca00b67	[BACKEND] Fix `device_print` without arguments (#2566 )	2023-10-30 20:04:44 -04:00
Dongdong Li	0469d5fccd	[OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-10-26 16:35:36 +00:00
zhu jianjiang	cfae7e2a25	[BACKEND] Fix matmul downcast path (#2528 ) for https://github.com/openai/triton/issues/2523, add regression test --------- Co-authored-by: Jokeren <robinho364@gmail.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-26 09:43:49 -04:00
Hongtao Yu	2323adb387	[BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514 ) Addressing https://github.com/openai/triton/issues/2011 Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-25 15:20:07 -04:00
Adnan Akhundov	7d55968fee	[BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512 ) ### Summary When Triton GPU IR is lowered into LLVM IR, we can make use of the constancy information about the result of the elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: the constancy is checked and, if possible, some of the values in LLVM IR are reused multiple times instead of computing equal values separately. The change is beneficial for the PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s). This makes the computation resulting in those sub-indices redundant. Consequently, under the necessary constancy conditions, the redundant indexing arithmetic can be deduplicated. We observe up to 29% decrease in the latency of some of our jagged tensor kernels.	2023-10-25 11:25:29 -04:00
Justin Lebar	e70e11e834	[BACKEND] Improve printf. (#2532 ) Previously, we printed all of a GPU thread's values in a single printf() call, and this, plus the user-specified prefix, was all we printed. This caused a few problems. - nvptx printf can only handle 32 arguments; if you pass more than that, it prints garbage. So if a thread had more than 32 values, you couldn't print them, issue #2486. - The order of the values within the Triton program (GPU thread block) is an implementation detail -- it depends on the layout the compiler assigns to a tensor. So this also prevented you from interpreting the printed output. To address this, we now print the Triton pid and multi-dimensional Tensor index for each value. And each value gets its own line to avoid passing too many args to printf. Example output: ``` pid (0, 1, 2) idx (36, 127) x: 42 ``` If you want to observe all the values in a tensor in order, you can grep and then sort the output. We also make a UX enhancement to print: The printed label always ends with ": "; you don't have to add it yourself. Fixes #2486.	2023-10-25 08:47:55 +00:00
Justin Lebar	2217bd2f5c	[BACKEND] Delete dead vprintf and vprintf_array functions (#2531 ) Delete dead vprintf and vprintf_array functions. These were introduced in `88498d104a` and appear to have been dead at the time of introduction.	2023-10-25 00:22:53 +00:00
Philippe Tillet	3f2b7263e8	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2541 ) Reverts openai/triton#2525	2023-10-24 10:23:19 -07:00
Philippe Tillet	8f467f1ea9	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) (#2525 ) Reverts openai/triton#2497	2023-10-23 21:50:58 -07:00
Thomas Raoux	5e6071254c	[BACKEND] Use our internal slice implementation to avoid combinatorial explosion (#2535 )	2023-10-24 03:06:34 +00:00
Thomas Raoux	cba7abd682	[BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526 ) Now that the bug related to attribute is fixed in MLIR we can use arith ops for cmp and select ops.	2023-10-23 19:35:46 -07:00
Zahi Moudallal	b0c166b9e3	[BACKEND] Fixing bug in elementwise conversion (#2517 )	2023-10-20 09:11:15 -07:00
Keren Zhou	be1de890e1	[BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516 ) Also add missing return statements	2023-10-19 11:53:09 -07:00
Zahi Moudallal	a980ec50f1	[BACKEND] Fixing f8e5m2 to bf16 conversion on A100 (#2508 )	2023-10-18 17:23:39 +01:00
Thomas Raoux	e36d1665ca	[BACKEND] Fix unsupported view op created during optimizations (#2510 ) When propagating layout we were generating a view op with mismatching total number of element per threads. Lowering such op would require exchanging data across threads. This change prevents the optimizer from generating such cases. This may require further optimizations in the future.	2023-10-18 16:37:13 +01:00
Mehdi Amini	721897fcc4	upgrade llvm to `b1115f8c` (NFC) (#2403 ) Co-authored-by: Thomas Raoux <thomas.raoux@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-16 16:38:49 -07:00
Zahi Moudallal	726bdb984f	[FRONTEND][BACKEND] Fix constexpr assignment; revert #2430 (#2496 ) Without this change, a constexpr assignment (i.e. `A = B & C`, where `B` and `C` are both constexpr) is getting assigned to a triton tensor, which becomes an issue when `A` is used as the condition of an If statement. Note: I had to add `not isinstance(node.value, ast.Constant)` to the condition because if we are assigning `x = 0` then the assigned value is also a constexpr, but in this case we do want to assign a triton tensor to `x` so that we can do `x.to(tl.int64)` for example, which cannot be done on a constexpr. --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-16 12:35:19 -07:00
Stewart Hall	29828fe491	[FRONTEND] add option to disable fp mul/add fusion (#2495 ) By default, ptxas will enable fusion of mul/add to fma instructions. The backend was also being configured unconditionally to enable this on conversion from LLVM IR to PTX. This commit adds an option which can be used to disable the FP fusion behavior in both locations.	2023-10-14 12:23:30 -07:00
Philippe Tillet	3b6ec763d5	Revert "[BACKEND] Disable BreakPhiStruct pass (#2458 )" (#2498 ) This reverts commit `b1bc9b20a0`.	2023-10-14 10:40:49 -07:00
Philippe Tillet	8db4fac3b0	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2497 ) Reverts openai/triton#2485	2023-10-13 23:32:59 -07:00
Weixing Zhang	76858bd917	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) In the current implementation, warpsPerCTA is always set to [numWarps, 1] for the 2 tt.dot fusion scenario. However, this is not optimal when tt.dot lacks enough parallelism on the row dimension but has it on the column dimension.	2023-10-12 22:25:42 -07:00
Thomas Raoux	a777e1d8db	[OPTIMIZER] Propagate mma layout when the transitive use has dot_operand encoding (#2482 )	2023-10-12 23:57:40 +00:00
Thomas Raoux	cda298fae7	[Pipeliner] Allocate less shared memory when possible (#2466 ) The pipeliner was overallocating shared memory for the inputs for current schedule. This reduces the shared memory usage to only what is needed. Note that improving membar analysis could allow taking advantage of allocating extra buffers to remove barriers.	2023-10-12 12:10:06 -07:00
Thomas Raoux	6f46c93b9e	[BACKEND] Add back dot.wait when generating async_dot (#2478 ) Based on discussion this is needed to make sure there is no race condition when reading shared memory.	2023-10-10 21:45:28 -07:00
Zahi Moudallal	4749072fbd	[BACKEND] Allow reduce with sliced 3D layout as input (#2480 )	2023-10-10 15:19:11 -07:00
Beal Wang	5812d970a8	[HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402 ) This is just for Warp Specialization kernels on Hopper. Replace DivOp and RemOp with SelectOp and AndOp/XorOp.	2023-10-09 14:42:06 +08:00
Tori Baker	ab4549310b	[OPTIMIZER] erase ops after use in iterator (#2455 ) This seems to have worked fine in opt mode (although it may be producing undefined behavior), but in debug mode on a newer version of llvm, it segfaults without this PR as the iterators get invalidated. This is also consistent with other places it is done in this file.	2023-10-06 18:02:56 -07:00
Thomas Raoux	b1bc9b20a0	[BACKEND] Disable BreakPhiStruct pass (#2458 ) This is causing functional failures in pytorch workload. Disabling it until I figure out the problem.	2023-10-06 17:59:53 -07:00
Thomas Raoux	a7061e19b2	[BACKEND] Fix multiple bugs in WGMMA (#2457 ) Fix dependencies in wgmma_wait op to prevent the scheduler from moving it past the uses of wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses, otherwise LLVM is free to re-order those. This allows us to remove a workaround to prevent the re-ordering. We can also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma; we should calculate the same descriptor for the whole warpgroup. Added a workaround for a bug that was exposed by different timing due to those changes. We shouldn't insert operations between the loop and async_wait or we may have race conditions.	2023-10-06 17:59:28 -07:00
Zahi Moudallal	be19cf3103	[BACKEND] Enable reduce with 3D tensors and added tests (#2460 )	2023-10-06 15:08:22 -07:00
Thomas Raoux	38f184b7cf	[BACKEND] Use native fp8 convert ops when possible (#2448 ) On Hopper we can use native fp8 conversion ops that are significantly more efficient. Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to 630 TFlops (the kernel is highly latency bound but this is a good proxy for epilogue performance)	2023-10-05 18:28:58 +00:00
Philippe Tillet	7bc6b99132	[DOCS] Fix FP8E4M3B15 docs (#2451 )	2023-10-05 10:59:45 -07:00
Philippe Tillet	e9d9ddd86d	[OPTIMIZER] More V100 conversion removal tweaks (#2446 )	2023-10-04 16:31:20 -07:00
Philippe Tillet	a7ff4eddae	[OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 (#2445 )	2023-10-04 14:56:12 -07:00
Thomas Raoux	5a0170a27c	[BACKEND] Minor removing of unnecessary code and cleanup (#2443 )	2023-10-04 12:14:08 -07:00
Christian Sigg	5458014282	[BACKEND] Lower to PTX with `trap-unreachable` (#2429 ) We've seen cases where the entire kernel is poisoned due to division-by-zero, resulting in a single `unreachable` instruction at the LLIR level. Emit this instruction as `trap` (instead of dropping it) so that the kernel doesn't run successfully without writing any outputs.	2023-10-03 21:05:10 -07:00
Thomas Raoux	c656a139d3	[BACKEND] Fix for FP8 QK inputs in flash attention forward pass (#2435 )	2023-10-03 21:02:13 -07:00
Zahi Moudallal	0d84a7d70c	[BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438 )	2023-10-03 20:59:53 -07:00
Bin Fan	6b860e7a74	[Backend] fix a bug in lowering ExtractSliceOp from TritonGPU to LLVM (#2436 )	2023-10-03 21:52:07 -04:00
Thomas Raoux	020f43d5a3	[NFC] Minor clean ups found during LLVM upgrade (#2433 ) Pull some of the changes required for LLVM upgrade to make the upgrade simpler.	2023-10-03 08:22:46 -07:00
apgoucher	cd38642ec5	Fix denormal handling in fp8e5 --> bf16 conversion PTX (#2430 )	2023-10-02 17:26:30 +01:00
Keren Zhou	ac9fa68d18	[BACKEND] Fine-tune SharedMemoryObject definition and fix related problems (#2428 )	2023-10-01 21:43:05 -07:00
Tori Baker	97e35b677b	[BACKEND] fix division by 0 pathway (#2412 ) It was possible for multiDimWarpId[1] to be 0, which then gets translated into a `urem 0, 0` and results in an unreachable when going through LLVM, an empty kernel, and NaNs. This PR uses ceiling to clamp the result to be >=1. chsigg is working on a fix to lower the unreachable in LLVM to a trap (https://github.com/llvm/llvm-project/pull/67478).	2023-09-30 10:53:43 -07:00
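
The message for e70e11e834 (#2532) suggests grepping and then sorting the printed output to see a tensor's values in order. A minimal sketch of that step is below. It assumes every line has exactly the shape of the one example in that message (`pid (0, 1, 2) idx (36, 127) x: 42`); real output may differ between versions. The names `LINE_RE`, `parse_line`, and `sort_device_prints` are hypothetical helpers, not part of Triton.

```python
import re

# Assumed line shape, copied from the example in the commit message:
#   pid (0, 1, 2) idx (36, 127) x: 42
LINE_RE = re.compile(r"pid \(([^)]*)\) idx \(([^)]*)\) (.*?): (.*)")


def _to_ints(text):
    # "0, 1, 2" -> (0, 1, 2); empty parts are skipped.
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_line(line):
    """Return (pid, idx, label, value) or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return (_to_ints(match.group(1)), _to_ints(match.group(2)),
            match.group(3), match.group(4))


def sort_device_prints(lines, label):
    """Keep lines printed with `label` and order them by (pid, idx)."""
    records = [r for r in map(parse_line, lines) if r is not None]
    records = [r for r in records if r[2] == label]
    return sorted(records, key=lambda r: (r[0], r[1]))


if __name__ == "__main__":
    captured = [
        "pid (0, 1, 2) idx (36, 127) x: 42",
        "pid (0, 0, 0) idx (1, 0) x: 7",
        "unrelated log line",
        "pid (0, 0, 0) idx (0, 5) x: 3",
    ]
    for pid, idx, name, value in sort_device_prints(captured, "x"):
        print(pid, idx, name, value)
```

Parsing the indices into integer tuples makes the order numeric, so `idx (10, 0)` sorts after `idx (9, 0)`, which a plain text sort would not guarantee.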


585 Commits