When propagating layout we were generating a view op with mismatching
total number of element per threads. Lowering such op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimizations in the future.
The pipeliner was overallocating shared memory for the inputs
for current schedule. This reduces the shared memory usage to only
what is needed.
Note that improving membar analysis could allow taking advantage of
allocating extra buffers to remove barriers.
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.
Also fix the descritpor calcuation for wgmma, we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.
Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
It was possible for multiDimWarpId[1] to be 0 which then gets translated
into a `urem 0, 0` and results in an unreachable when going through
llvm, an empty kernel, and nans. This PR uses ceiling to clamp the
result to be >=1.
chsigg is working on a fix to lower the unreachable in llvm to a trap
(https://github.com/llvm/llvm-project/pull/67478).
Support having chain of mma with mixed size.
Serialize the different block calculation in backward attention to
workaround problem with ptxas and wgmma.
Previously ExpandDims always inserts 1 as the new divisibility, which
makes writing (x * stride)[:, None] far more slower than (x[:, None] *
stride). A better divisibility can be afforded by computing the GCD of
the old dims. Now the two code above are equally fast. E.g. the conv
inductor in pytorch may be faster.
---------
Co-authored-by: Yuheng XIE <thinelephant@gmail.com>
MMA V3 support taking operand A from register. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps and add the lowering
for it.
When there is a chain of mma ops we want to pick the same shape to avoid
conversions. This improves the detection going through for loops.
This fixes a crash in tutorial bw attention.
We might want to change this logic and convert the format to allow more
efficient MMA at some point.
Improve patterns that sync broadcast to reduce the arithmetic density
and also hoist convert on top of expand_dims to do less work.
This address comments in https://github.com/openai/triton/pull/2274
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default which allows accumualting with
lower precision.
This only affect Hopper fp8 dot.
Add infrastructure to be able to add and test custom LLVM passes in the
backend. This will allow use to apply some low level optimizations and
cleanup on LLVM IR.
Add a first pass that breaks up phi of struct created by lowering to
LLVM. Those can often pessimise the optimizer as it would block
optimizations going through phi nodes.
**Motivation**: We have a kernel that loads multiple types of tensors -
some int32 and some float16. The coalescing pass assigns `perThread = 8`
for the float16 tensors and `perThread = 4` for the int32 tensors,
resulting in unnecessary layout conversions that result in bad
performance. Instead, we should just set `perThread = 8` for both of
these loads.
**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and same optimal ordering, even if they have different encoding or
dtype.
Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
Handle more cases of hoisting convert above ext op. If there are
multiple ext op in the slice but only one requires inserting a convert
we can still apply the optimization.
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now bocken down into 2
main phases. First forward propagation of layout starting from ops that
we don't want to change. Propagate to all the nodes. If there is a
single layout needed for the op then we can rewrite the op, if there are
multiple layout required based on dependency we need a tie break.
The second phase is backward propgation that gets a backward slice of
operations starting from the convert and if all the operations in the
slice can be rematerialized rewrite the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future to add a cost model to
decide which convert to leave and which to fold
This folds `tl.arange(x, x + 1)` into a constant. This shows up for
example when autotuning and one of the block sizes gets set to 1.
Co-authored-by: Philippe Tillet <phil@openai.com>
Add a new operation to be able to implement packed inline assembly for
elementwise operations. This way inline assembly can be used to control
elementwise operations. It also allows to pack elements to be able to
manually vectorize operations.
Before this PR, the determination of `TritonGPUToLLVMIRPass` to generate
NVVM-compatible LLVM or ROCDL-compatible LLVM is controlled by a boolean
`isROCM`. This method is hard to scale.
This PR changes it to use an enum instead, where new target can be added
easily when needed.
---------
Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
`getScratchSizeInBytes` was assuming that the size of all types in bits
is
a multiple of 8. If it is not, it would return 0. This caused a bug for
boolean
(i1) type, where the reduction lowering would attempt to use shared
memory,
which was not assigned to the op.
Fix this issue by setting the number of bytes per element to `ceil(bits
/ 8)`.
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
we currently have a very janky approach to optimizing mixed-precision
matmul workloads, where some layout combinations (e.g., NT matmul) were
explicitly pattern-matched to take a more optimized codepath. Attempt at
unifying all the codepaths to codegen cp.async failed, due to bugs in
SharedToDotOperandMMAv2.cpp.
This PR fixes said bugs, add some assertions for SharedToDotOperandMMAv2
modes that aren't well supported, and greatly simplify our handling of
element-wise operations between load and conversions to DotOperand.
This is strange. Using RemUI should be strictly better, but it can cause
up to 20% performance regression in some cases. I am reverting to RemSI
pending investigation