Commit Graph

140 Commits

Author SHA1 Message Date
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith
ops for cmp and select.
2023-10-23 19:35:46 -07:00
Zahi Moudallal
b0c166b9e3 [BACKEND] Fixing bug in elementwise conversion (#2517) 2023-10-20 09:11:15 -07:00
Thomas Raoux
e36d1665ca [BACKEND] Fix unsupported view op created during optimizations (#2510)
When propagating layout we were generating a view op with a mismatching
total number of elements per thread. Lowering such an op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimizations in the future.
2023-10-18 16:37:13 +01:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Thomas Raoux
cda298fae7 [Pipeliner] Allocate less shared memory when possible (#2466)
The pipeliner was overallocating shared memory for the inputs of the
current schedule. This change reduces shared memory usage to only
what is needed.
Note that improving membar analysis could allow taking advantage of
allocating extra buffers to remove barriers.
2023-10-12 12:10:06 -07:00
Thomas Raoux
6f46c93b9e [BACKEND] Add back dot.wait when generating async_dot (#2478)
Based on discussion this is needed to make sure there is no race
condition when reading shared memory.
2023-10-10 21:45:28 -07:00
Zahi Moudallal
4749072fbd [BACKEND] Allow reduce with sliced 3D layout as input (#2480) 2023-10-10 15:19:11 -07:00
Beal Wang
5812d970a8 [HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402)
This is just for Warp Specialization kernels on Hopper. Replace DivOp
and RemOp with SelectOp and AndOp/XorOp.
2023-10-09 14:42:06 +08:00
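The substitution above can be illustrated with a small sketch (hypothetical helper names; the actual pass works on MLIR ops, not Python integers): for a power-of-two period, a remainder is a bitwise AND, and a 0/1 phase flag can be advanced with XOR instead of div/rem.

```python
def rem_pow2(x: int, n: int) -> int:
    # x % n for power-of-two n, computed with AndOp-style masking.
    return x & (n - 1)

def toggle(phase: int) -> int:
    # Flips 0 <-> 1, replacing (phase + 1) % 2 with an XorOp-style flip.
    return phase ^ 1
```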
Thomas Raoux
a7061e19b2 [BACKEND] Fix multiple bugs in WGMMA (#2457)
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.

Also fix the descriptor calculation for wgmma; the same descriptor
should be computed for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
2023-10-06 17:59:28 -07:00
Hongtao Yu
eed4559df2 [TOOLS] Enable per-pass IR printing in triton-translate (#2449)
Enabling per-pass IR printing such as `--mlir-print-ir-after-all`
2023-10-05 13:23:46 -07:00
Thomas Raoux
38f184b7cf [BACKEND] Use native fp8 convert ops when possible (#2448)
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.

Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
2023-10-05 18:28:58 +00:00
Zahi Moudallal
0d84a7d70c [BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438) 2023-10-03 20:59:53 -07:00
Tori Baker
97e35b677b [BACKEND] fix division by 0 pathway (#2412)
It was possible for multiDimWarpId[1] to be 0, which then got translated
into a `urem 0, 0` and resulted in an `unreachable` when going through
LLVM, an empty kernel, and NaNs. This PR uses ceiling division to clamp
the result to be >= 1.

chsigg is working on a fix to lower the unreachable in llvm to a trap
(https://github.com/llvm/llvm-project/pull/67478).
2023-09-30 10:53:43 -07:00
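The clamp described above can be shown with plain ceiling division (a sketch of the idea, not the pass code):

```python
def ceil_div(a: int, b: int) -> int:
    # Ceiling division for positive b: rounds up instead of down,
    # so a nonzero numerator can never produce 0 (a bad divisor for
    # a later `urem`).
    return -(-a // b)

# Floor division: 1 // 4 == 0; ceiling division: ceil_div(1, 4) == 1.
```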
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Thomas Raoux
721bdebee1 [OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411)
Support chains of mma ops with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
2023-09-28 10:29:08 -07:00
Yuheng XIE
1e093fbfff [OPTIMIZER] Calculate a proper divisibility for ExpandDims (#2397)
Previously ExpandDims always inserted 1 as the new divisibility, which
made writing (x * stride)[:, None] far slower than (x[:, None] *
stride). A better divisibility can be obtained by computing the GCD of
the old dims. Now the two snippets above are equally fast; e.g. the conv
Inductor kernels in PyTorch may get faster.

---------

Co-authored-by: Yuheng XIE <thinelephant@gmail.com>
2023-09-27 23:10:01 -07:00
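The GCD-based divisibility described above can be sketched as follows (a hypothetical helper, not the actual AxisInfo code):

```python
from math import gcd

def expand_dims_divisibility(old_divisibility: list) -> int:
    # Instead of assuming divisibility 1 for the new axis, take the GCD
    # of the divisibility along the existing dims, so known alignment
    # (e.g. a multiple-of-16 stride) survives the expand_dims.
    d = 0
    for v in old_divisibility:
        d = gcd(d, v)
    return max(d, 1)
```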
Tori Baker
bf3171f5c7 Lit test to check for illegal st.shared.b1 llvmir (#2387) 2023-09-26 17:12:32 +00:00
Thomas Raoux
6bc1d9e1be [BACKEND] Support MMA V3 with register operand (#2375)
MMA V3 supports taking operand A from registers. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps, and add the lowering
for it.
2023-09-25 10:43:54 -07:00
Thomas Raoux
a4dbdefe3b [BACKEND] Use shuffle intrinsics instead of inline asm (#2378)
This will ensure we get the proper "convergent" semantics for those
instructions.
2023-09-23 11:50:37 -07:00
Thomas Raoux
840e7e7b53 [BACKEND] Improve decision of MMA dimension on H100 (#2373)
When there is a chain of mma ops we want to pick the same shape to avoid
conversions. This improves the detection going through for loops.
This fixes a crash in tutorial bw attention.

We might want to change this logic and convert the format to allow more
efficient MMA at some point.
2023-09-22 15:21:56 -07:00
Thomas Raoux
9cab885dff [BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343)
Also add a test for MMA v3 reduction.
2023-09-20 14:05:12 -07:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Thomas Raoux
3a848e2729 [BACKEND] Relax patterns to move sink broadcast and hoist convert (#2331)
Improve patterns that sink broadcast ops to reduce arithmetic density,
and also hoist convert above expand_dims to do less work.

This addresses comments in https://github.com/openai/triton/pull/2274
2023-09-18 15:08:19 -07:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating in
lower precision.
This only affects the Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Thomas Raoux
cf7f8c5ea4 [BACKEND] Optimization to sink broadcast ops (#2274)
Try to move broadcast ops after arithmetic and convert ops in order to
reduce the amount of work needed.
2023-09-13 10:04:36 -07:00
Thomas Raoux
d3956a21f3 [BACKEND] Add LLVM pre-processing pass to break struct types (#2285)
Add infrastructure to be able to add and test custom LLVM passes in the
backend. This will allow us to apply some low-level optimizations and
cleanup on LLVM IR.
Add a first pass that breaks up phis of struct type created by lowering
to LLVM. Those can often pessimize the optimizer, as they block
optimizations going through phi nodes.
2023-09-13 10:03:29 -07:00
Zahi Moudallal
a47f1f5c28 [BACKEND] Unify slow/fast reduce codegen (#2220) 2023-09-12 08:46:19 -07:00
Beal Wang
7aae350fab [OPTIMIZER] async launch dots for hopper warp-specialized kernel (#2251) 2023-09-07 13:37:07 +08:00
David Berard
13189bfe60 Coalesce pass - group values with same shape/order but different element type (#2199)
**Motivation**: We have a kernel that loads multiple types of tensors -
some int32 and some float16. The coalescing pass assigns `perThread = 8`
for the float16 tensors and `perThread = 4` for the int32 tensors,
resulting in unnecessary layout conversions that hurt performance.
Instead, we should just set `perThread = 8` for both of
these loads.

**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and same optimal ordering, even if they have different encoding or
dtype.

Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
2023-09-01 17:08:56 -07:00
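The grouping idea above can be sketched with a small sample (hypothetical helper and `vector_bytes` parameter; the real pass derives these from AxisInfo and layouts):

```python
def group_per_thread(elem_bits: list, vector_bytes: int = 16) -> int:
    # Each dtype alone would get vector_bytes / elem_bytes elements per
    # thread; grouping values with the same shape/order takes the
    # maximum across members, so f16 and i32 loads in one group both
    # use the wider per-thread count.
    return max(vector_bytes // (bits // 8) for bits in elem_bits)

# f16 alone -> 8, i32 alone -> 4; grouped together -> 8 for both loads.
```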
Zahi Moudallal
acbf716889 [BACKEND] Refactoring NVGPUToLLVMPass (#2158) 2023-09-01 23:40:31 +00:00
Philippe Tillet
9b8c48f25d [BACKEND] More minor backend fixups (#2223)
* Fix bug in V100 convert layout
* Do not push elementwise between convert and dot for V100
2023-08-31 22:02:56 -07:00
Thomas
fff97d864a [BACKEND] Add support for propagating convert op through while loops (#2213)
Support forward propagation through while loops.
2023-08-31 09:26:04 -07:00
Thomas
3175ee4ce7 [BACKEND] Handle more cases of folding convert into reduce op (#2209)
Handle cases of reduce with multiple operands returning scalars
2023-08-30 11:04:38 -07:00
Thomas
2ff88c1368 [BACKEND] Extend hoisting of convert op above ext ops (#2206)
Handle more cases of hoisting convert above ext ops. If there are
multiple ext ops in the slice but only one requires inserting a convert,
we can still apply the optimization.
2023-08-29 17:36:34 -07:00
Thomas
17d633a64e [BACKEND] Fix crash when propagating layout and slice axis doesn't match reduce (#2205)
2023-08-29 17:11:03 +00:00
Thomas
d4644d6cb3 [BACKEND] Refactor RemoveLayoutConversion pass (#2181)
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into
two main phases. The first is forward propagation of layouts, starting
from ops that we don't want to change and propagating to all the nodes.
If only a single layout is needed for an op we can rewrite it; if
multiple layouts are required based on dependencies we need a tie-break.
The second phase is backward propagation, which takes a backward slice
of operations starting from the convert and, if all the operations in
the slice can be rematerialized, rewrites the slice. This backward phase
now supports going through loop arguments.

This will allow more complex logic in the future, such as adding a cost
model to decide which converts to leave and which to fold.
2023-08-28 19:05:16 -07:00
peterbell10
fa03b92109 [OPTIMIZER] Add folder for MakeRangeOp (#2187)
This folds `tl.arange(x, x + 1)` into a constant. This shows up for
example when autotuning and one of the block sizes gets set to 1.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-26 16:44:13 +00:00
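The folder described above can be sketched like this (a hypothetical Python analogue; the real folder is an MLIR `fold` hook on MakeRangeOp):

```python
def fold_make_range(start: int, end: int):
    # tl.arange(x, x + 1) is a single-element range, i.e. just the
    # constant `x`, so it can be folded away.
    if end == start + 1:
        return ("constant", start)
    return None  # more than one element: keep the make_range op
```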
Thomas
3116933ccd [BACKEND] Don't do dead code elimination on volatile load (#2165) 2023-08-23 14:59:18 -07:00
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Thomas
bf351b9ba2 [FRONTEND][BACKEND] Add support for elementwise inline assembly (#2136)
Add a new operation to implement packed inline assembly for
elementwise operations. This way inline assembly can be used to control
elementwise codegen, and elements can be packed to manually vectorize
operations.
2023-08-18 12:57:52 -07:00
Whitney Tsang
100cabd0e4 [FRONTEND] use enum instead of bool to select target (#2118)
Before this PR, whether `TritonGPUToLLVMIRPass` generates
NVVM-compatible or ROCDL-compatible LLVM is controlled by a boolean
`isROCM`, which is hard to scale.
This PR changes it to use an enum instead, so new targets can be added
easily when needed.

---------

Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-17 18:37:09 -07:00
Zahi Moudallal
4d373aa103 [BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047) 2023-08-10 15:52:37 -07:00
Goran Flegar
29bfdb6eef [BACKEND] Fix crash in reductions on i1 (#1996)
`getScratchSizeInBytes` was assuming that the size in bits of every type
is a multiple of 8, and would return 0 when it is not. This caused a bug
for the boolean (i1) type, where the reduction lowering would attempt to
use shared memory that was not assigned to the op.

Fix this issue by setting the number of bytes per element to
`ceil(bits / 8)`.
2023-08-09 10:28:05 -07:00
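The fix amounts to one line of integer arithmetic (a sketch of the formula, not the C++ code):

```python
def bytes_per_elem(bits: int) -> int:
    # ceil(bits / 8) without floating point: i1 now maps to 1 byte
    # instead of 0, so the reduction's shared-memory scratch size
    # is non-zero.
    return (bits + 7) // 8
```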
allatit23
a58e6ef2b7 [HOPPER][WS] support tt.reduce as dependent op in guard pass (#2067) 2023-08-09 19:31:38 +08:00
Beal Wang
de47bba07d [OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057)
Co-authored-by: Biao Wang <biaow@nvidia.com>
2023-08-09 16:42:01 +08:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. They are recommended for trial once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Philippe Tillet
52c146f66b [OPTIMIZER][BACKEND] significantly cleaner handling of mixed-precision kernels (#1949)
We currently have a very janky approach to optimizing mixed-precision
matmul workloads, where some layout combinations (e.g., NT matmul) were
explicitly pattern-matched to take a more optimized codepath. An attempt
at unifying all the codepaths to codegen cp.async failed, due to bugs in
SharedToDotOperandMMAv2.cpp.

This PR fixes said bugs, adds some assertions for SharedToDotOperandMMAv2
modes that aren't well supported, and greatly simplifies our handling of
element-wise operations between loads and conversions to DotOperand.
2023-07-28 10:29:42 -07:00
Philippe Tillet
07c346b948 [OPTIMIZER] Falls back to using RemSI in the pipeline pass for now (#1972)
This is strange. Using RemUI should be strictly better, but it can cause
up to a 20% performance regression in some cases. I am reverting to
RemSI pending investigation.
2023-07-19 22:06:51 -07:00
David Berard
9c422e260b [OPTIMIZER] AxisInfoVisitor for LoadOp constancy calculation (#1968)
If you call `result = load(x, mask)` where `x` and `mask` have some
constancy properties, then you can infer some constancy properties for
`result`.
2023-07-19 17:40:46 -07:00
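The constancy inference above can be sketched as follows (a hypothetical helper; the real visitor operates on AxisInfo lattice values per dimension):

```python
from math import gcd

def load_constancy(ptr_constancy: int, mask_constancy: int) -> int:
    # If the pointer is constant over runs of ptr_constancy contiguous
    # elements and the mask over runs of mask_constancy, the loaded
    # result is constant over runs of their GCD.
    return gcd(ptr_constancy, mask_constancy)
```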
nccx
15ab48d407 [TESTS] remove unnecessary lit command from combine.mlir (#1961)
The only difference between the two RUN lines is `FileCheck`, which
should be needed.
2023-07-19 11:14:56 -07:00