github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Thomas Raoux	cba7abd682	[BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526 ) Now that the bug related to attribute is fixed in MLIR we can use arith ops for cmp and select ops.	2023-10-23 19:35:46 -07:00
Zahi Moudallal	b0c166b9e3	[BACKEND] Fixing bug in elementwise conversion (#2517 )	2023-10-20 09:11:15 -07:00
Thomas Raoux	e36d1665ca	[BACKEND] Fix unsupported view op created during optimizations (#2510 ) When propagating layout we were generating a view op with mismatching total number of element per threads. Lowering such op would require exchanging data across threads. This change prevents the optimizer from generating such cases. This may require further optimizations in the future.	2023-10-18 16:37:13 +01:00
Mehdi Amini	721897fcc4	upgrade llvm to `b1115f8c` (NFC) (#2403 ) Co-authored-by: Thomas Raoux <thomas.raoux@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-16 16:38:49 -07:00
Thomas Raoux	cda298fae7	[Pipeliner] Allocate less shared memory when possible (#2466 ) The pipeliner was overallocating shared memory for the inputs for current schedule. This reduces the shared memory usage to only what is needed. Note that improving membar analysis could allow taking advantage of allocating extra buffers to remove barriers.	2023-10-12 12:10:06 -07:00
Thomas Raoux	6f46c93b9e	[BACKEND] Add back dot.wait when generating async_dot (#2478 ) Based on discussion this is needed to make sure there is no race condition when reading shared memory.	2023-10-10 21:45:28 -07:00
Zahi Moudallal	4749072fbd	[BACKEND] Allow reduce with sliced 3D layout as input (#2480 )	2023-10-10 15:19:11 -07:00
Beal Wang	5812d970a8	[HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402 ) This is just for Warp Specialization kernels on Hopper. Replace DivOp and RemOp with SelectOp and AndOp/XorOp.	2023-10-09 14:42:06 +08:00
Thomas Raoux	a7061e19b2	[BACKEND] Fix multiple bugs in WGMMA (#2457) Fix dependencies in wgmma_wait op to prevent the scheduler from moving it past the uses of wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses, otherwise LLVM is free to re-order those. This allows us to remove a workaround to prevent the re-ordering. We can also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma: we should calculate the same descriptor for the whole warpgroup. Added a workaround for a bug that was exposed by different timing due to those changes. We shouldn't insert operations between the loop and async_wait or we may have race conditions.	2023-10-06 17:59:28 -07:00
Hongtao Yu	eed4559df2	[TOOLS] Enable per-pass IR printing in triton-translate (#2449 ) Enabling per-pass IR printing such as `--mlir-print-ir-after-all`	2023-10-05 13:23:46 -07:00
Thomas Raoux	38f184b7cf	[BACKEND] Use native fp8 convert ops when possible (#2448 ) On Hopper we can use native fp8 conversion ops that are significantly more efficient. Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to 630 TFlops (the kernel is highly latency bound but this is a good proxy for epilogue performance)	2023-10-05 18:28:58 +00:00
Zahi Moudallal	0d84a7d70c	[BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438 )	2023-10-03 20:59:53 -07:00
Tori Baker	97e35b677b	[BACKEND] fix division by 0 pathway (#2412 ) It was possible for multiDimWarpId[1] to be 0 which then gets translated into a `urem 0, 0` and results in an unreachable when going through llvm, an empty kernel, and nans. This PR uses ceiling to clamp the result to be >=1. chsigg is working on a fix to lower the unreachable in llvm to a trap (https://github.com/llvm/llvm-project/pull/67478).	2023-09-30 10:53:43 -07:00
Thomas Raoux	90bef57acf	[BACKEND] turn on MMA V3 by default on Hopper (#2414 )	2023-09-28 22:45:28 -07:00
Thomas Raoux	721bdebee1	[OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411 ) Support having chain of mma with mixed size. Serialize the different block calculation in backward attention to workaround problem with ptxas and wgmma.	2023-09-28 10:29:08 -07:00
Yuheng XIE	1e093fbfff	[OPTIMIZER] Calculate a proper divisibility for ExpandDims (#2397) Previously ExpandDims always inserted 1 as the new divisibility, which made writing (x * stride)[:, None] far slower than (x[:, None] * stride). A better divisibility can be obtained by computing the GCD of the old dims. Now the two expressions above are equally fast. E.g. the conv inductor in pytorch may be faster. --------- Co-authored-by: Yuheng XIE <thinelephant@gmail.com>	2023-09-27 23:10:01 -07:00
Tori Baker	bf3171f5c7	Lit test to check for illegal st.shared.b1 llvmir (#2387 )	2023-09-26 17:12:32 +00:00
Thomas Raoux	6bc1d9e1be	[BACKEND] Support MMA V3 with register operand (#2375 ) MMA V3 support taking operand A from register. This helps for chained matmul operations like in attention. Add an optimization to use this mode when it helps and add the lowering for it.	2023-09-25 10:43:54 -07:00
Thomas Raoux	a4dbdefe3b	[BACKEND] Use shuffle intrinsics instead of inline asm (#2378 ) This will ensure we get the proper "convergent" semantic for those instructions	2023-09-23 11:50:37 -07:00
Thomas Raoux	840e7e7b53	[BACKEND] Improve decision of MMA dimension on H100 (#2373 ) When there is a chain of mma ops we want to pick the same shape to avoid conversions. This improves the detection going through for loops. This fixes a crash in tutorial bw attention. We might want to change this logic and convert the format to allow more efficient MMA at some point.	2023-09-22 15:21:56 -07:00
Thomas Raoux	9cab885dff	[BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343 ) Also add a test for MMA v3 reduction.	2023-09-20 14:05:12 -07:00
Dongdong Li	e5eda098b3	[TESTS] fix flash attention (#2086 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-09-20 14:23:46 +08:00
Thomas Raoux	3a848e2729	[BACKEND] Relax patterns to move sink broadcast and hoist convert (#2331) Improve patterns that sink broadcast to reduce the arithmetic density and also hoist convert on top of expand_dims to do less work. This addresses comments in https://github.com/openai/triton/pull/2274	2023-09-18 15:08:19 -07:00
Thomas Raoux	31b0c52142	[FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300) Change the dot to allow taking an initial accumulator and add a flag that will allow the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default, which allows accumulating with lower precision. This only affects Hopper fp8 dot.	2023-09-15 18:42:54 -07:00
Thomas Raoux	cf7f8c5ea4	[BACKEND] Optimization to sink broadcast ops (#2274 ) Try to move broadcast ops after arithmetic and convert ops in order to reduce the amount of work needed.	2023-09-13 10:04:36 -07:00
Thomas Raoux	d3956a21f3	[BACKEND] Add LLVM pre-processing pass to break struct types (#2285) Add infrastructure to be able to add and test custom LLVM passes in the backend. This will allow us to apply some low-level optimizations and cleanup on LLVM IR. Add a first pass that breaks up phi of struct created by lowering to LLVM. Those can often pessimise the optimizer as they would block optimizations going through phi nodes.	2023-09-13 10:03:29 -07:00
Zahi Moudallal	a47f1f5c28	[BACKEND] Unify slow/fast reduce codegen (#2220 )	2023-09-12 08:46:19 -07:00
Beal Wang	7aae350fab	[OPTIMIZER] async launch dots for hopper warp-specialized kernel (#2251 )	2023-09-07 13:37:07 +08:00
David Berard	13189bfe60	Coalesce pass - group values with same shape/order but different element type (#2199 ) Motivation: We have a kernel that loads multiple types of tensors - some int32 and some float16. The coalescing pass assigns `perThread = 8` for the float16 tensors and `perThread = 4` for the int32 tensors, resulting in unnecessary layout conversions that result in bad performance. Instead, we should just set `perThread = 8` for both of these loads. Details: One of the first steps in calculating the new encoding is to find the group of upstream/downstream tensors with the "same type", in order to find the maximal sizePerThread required in this group. This PR changes the logic so that tensors can be grouped as long as they have the same shape and same optimal ordering, even if they have different encoding or dtype. Next, the logic to compute `perThread` is updated to account for the change above; since dtype can now be different within a single "group", the `perThread` computation now considers different elemNumBits/elemNumBytes for each value in the group.	2023-09-01 17:08:56 -07:00
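The grouping described in the commit above can be illustrated with a minimal sketch. It is not the coalescing pass itself: the function names are hypothetical, and the 128-bit access width is an assumption chosen only because it reproduces the `perThread = 8` (float16) and `perThread = 4` (int32) figures quoted in the message.

```python
from typing import Iterable

# Assumption: 128 bits is the widest per-thread access. It is picked because
# it matches the 8 x float16 and 4 x int32 figures in the commit message.
MAX_VECTOR_BITS = 128


def per_thread_for(elem_bits: int) -> int:
    """Elements one thread would handle for a single element width."""
    return max(1, MAX_VECTOR_BITS // elem_bits)


def per_thread_for_group(elem_bits_in_group: Iterable[int]) -> int:
    """Pick one perThread for a whole group of values with mixed dtypes.

    Taking the maximum means every load in the group shares a layout,
    so no layout conversion is needed between them.
    """
    return max(per_thread_for(bits) for bits in elem_bits_in_group)


if __name__ == "__main__":
    print(per_thread_for(16))                # float16 alone
    print(per_thread_for(32))                # int32 alone
    print(per_thread_for_group([16, 32]))    # both grouped together
```

Under these assumptions the mixed group resolves to the float16 value, which is the outcome the commit message asks for.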
Zahi Moudallal	acbf716889	[BACKEND] Refactoring NVGPUToLLVMPass (#2158 )	2023-09-01 23:40:31 +00:00
Philippe Tillet	9b8c48f25d	[BACKEND] More minor backend fixups (#2223 ) * Fix bug in V100 convert layout * Do not push elementwise between convert and dot for V100	2023-08-31 22:02:56 -07:00
Thomas	fff97d864a	[BACKEND] Add support for propagating convert op through while loops (#2213 ) Support forward propagation through while loops.	2023-08-31 09:26:04 -07:00
Thomas	3175ee4ce7	[BACKEND] Handle more cases of folding convert into reduce op (#2209 ) Handle cases of reduce with multiple operand returning scalars	2023-08-30 11:04:38 -07:00
Thomas	2ff88c1368	[BACKEND] Extend hoisting of convert op above ext ops (#2206 ) Handle more cases of hoisting convert above ext op. If there are multiple ext op in the slice but only one requires inserting a convert we can still apply the optimization.	2023-08-29 17:36:34 -07:00
Thomas	17d633a64e	[BACKEND] Fix crash when propagating layout and slice axis doesn't ma… (#2205 ) …tch reduce	2023-08-29 17:11:03 +00:00
Thomas	d4644d6cb3	[BACKEND] Refactor RemoveLayoutConversion pass (#2181) Significant changes to the pass logic. Move away from greedy rewrites and use more global analysis instead. The pass is now broken down into 2 main phases. The first is forward propagation of layout starting from ops that we don't want to change. Propagate to all the nodes. If there is a single layout needed for the op then we can rewrite the op; if there are multiple layouts required based on dependency we need a tie break. The second phase is backward propagation that gets a backward slice of operations starting from the convert and, if all the operations in the slice can be rematerialized, rewrites the slice. This backward phase now supports going through loop arguments. This will allow more complex logic in the future to add a cost model to decide which convert to leave and which to fold	2023-08-28 19:05:16 -07:00
peterbell10	fa03b92109	[OPTIMIZER] Add folder for MakeRangeOp (#2187 ) This folds `tl.arange(x, x + 1)` into a constant. This shows up for example when autotuning and one of the block sizes gets set to 1. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-26 16:44:13 +00:00
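As a rough illustration of the fold described in the commit above, the sketch below models a range op in plain Python. `fold_make_range` is a hypothetical helper, not the MLIR folder from the PR; it only shows the condition under which a range collapses to a constant.

```python
from typing import List, Optional


def fold_make_range(start: int, end: int) -> Optional[List[int]]:
    """Return the constant a range folds to, or None if it cannot fold.

    A half-open range [start, end) with exactly one element is fully
    known at compile time: its only value is `start`.
    """
    if end - start == 1:
        return [start]
    return None


if __name__ == "__main__":
    print(fold_make_range(5, 6))   # single-element range, foldable
    print(fold_make_range(0, 4))   # multi-element range, left alone
```

This corresponds to the autotuning case mentioned in the message, where a block size of 1 produces a one-element range.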
Thomas	3116933ccd	[BACKEND] Don't do dead code elimination on volatile load (#2165 )	2023-08-23 14:59:18 -07:00
Zahi Moudallal	23dd11d471	[BACKEND] Solidify f8e4m3 (#2105 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-18 19:12:09 -07:00
Thomas	bf351b9ba2	[FRONTEND][BACKEND] Add support for elementwise inline assembly (#2136) Add a new operation to be able to implement packed inline assembly for elementwise operations. This way inline assembly can be used to control elementwise operations. It also allows packing elements to be able to manually vectorize operations.	2023-08-18 12:57:52 -07:00
Whitney Tsang	100cabd0e4	[FRONTEND] use enum instead of bool to select target (#2118 ) Before this PR, the determination of `TritonGPUToLLVMIRPass` to generate NVVM-compatible LLVM or ROCDL-compatible LLVM is controlled by a boolean `isROCM`. This method is hard to scale. This PR changes it to use an enum instead, where new target can be added easily when needed. --------- Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com> Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-17 18:37:09 -07:00
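The design point in the commit above (a boolean cannot grow past two targets, an enum can) can be sketched in a few lines. The `Target` enum and `dialect_name` function are hypothetical names for illustration only; the PR's actual enum lives in the C++ pass and is not reproduced here.

```python
from enum import Enum


class Target(Enum):
    """Hypothetical target selector replacing an `isROCM` boolean."""
    NVVM = "nvvm"
    ROCDL = "rocdl"
    # A new backend is one more member here, with no signature changes.


def dialect_name(target: Target) -> str:
    """Describe the kind of LLVM IR generated for a target."""
    if target is Target.NVVM:
        return "NVVM-compatible LLVM"
    if target is Target.ROCDL:
        return "ROCDL-compatible LLVM"
    raise ValueError(f"unsupported target: {target}")


if __name__ == "__main__":
    for t in Target:
        print(t.name, "->", dialect_name(t))
```

With a boolean, adding a third target would mean changing every call site; with the enum, only the dispatch function needs a new branch.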
Zahi Moudallal	4d373aa103	[BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047 )	2023-08-10 15:52:37 -07:00
Goran Flegar	29bfdb6eef	[BACKEND] Fix crash in reductions on i1 (#1996 ) `getScratchSizeInBytes` was assuming that the size of all types in bits is a multiple of 8. If it is not, it would return 0. This caused a bug for boolean (i1) type, where the reduction lowering would attempt to use shared memory, which was not assigned to the op. Fix this issue by setting the number of bytes per element to `ceil(bits / 8)`.	2023-08-09 10:28:05 -07:00
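The arithmetic behind the fix above is small enough to show directly. This is a sketch of the rounding only, with hypothetical function names; it is not the body of `getScratchSizeInBytes`.

```python
def bytes_per_element_truncating(bits: int) -> int:
    """Old behaviour as described: whole bytes only, so sub-byte types give 0."""
    return bits // 8


def bytes_per_element(bits: int) -> int:
    """Fixed behaviour: ceil(bits / 8), computed with integer arithmetic."""
    return (bits + 7) // 8


if __name__ == "__main__":
    for bits in (1, 8, 16, 32):
        print(bits, bytes_per_element_truncating(bits), bytes_per_element(bits))
```

For an i1 element the truncating version yields 0 bytes, which is why no shared memory was assigned to the reduction, while the ceiling version yields 1 byte.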
allatit23	a58e6ef2b7	[HOPPER][WS] support tt.reduce as dependent op in guard pass (#2067 )	2023-08-09 19:31:38 +08:00
Beal Wang	de47bba07d	[OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057 ) Co-authored-by: Biao Wang <biaow@nvidia.com>	2023-08-09 16:42:01 +08:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Philippe Tillet	52c146f66b	[OPTIMIZER][BACKEND] significantly cleaner handling of mixed-precision kernels (#1949) We currently have a very janky approach to optimizing mixed-precision matmul workloads, where some layout combinations (e.g., NT matmul) are explicitly pattern-matched to take a more optimized codepath. An attempt at unifying all the codepaths to codegen cp.async failed, due to bugs in SharedToDotOperandMMAv2.cpp. This PR fixes said bugs, adds some assertions for SharedToDotOperandMMAv2 modes that aren't well supported, and greatly simplifies our handling of element-wise operations between load and conversions to DotOperand.	2023-07-28 10:29:42 -07:00
Philippe Tillet	07c346b948	[OPTIMIZER] Falls back to using RemSI in the pipeline pass for now (#1972 ) This is strange. Using RemUI should be strictly better, but it can cause up to 20% performance regression in some cases. I am reverting to RemSI pending investigation	2023-07-19 22:06:51 -07:00
David Berard	9c422e260b	[OPTIMIZER] AxisInfoVisitor for LoadOp constancy calculation (#1968 ) If you call `result = load(x, mask)` where `x` and `mask` have some constancy properties, then you can infer some constancy properties for `result`.	2023-07-19 17:40:46 -07:00
nccx	15ab48d407	[TESTS] remove unnecessary lit command from combine.mlir (#1961 ) The only difference between the two RUNs is `FileCheck`, which should be needed.	2023-07-19 11:14:56 -07:00

140 Commits