This seems to have worked fine in opt mode (although it may have been
producing undefined behavior), but in debug mode on a newer version of
LLVM it segfaults without this PR, as the iterators get invalidated.
This is also consistent with how it is done elsewhere in this file.
Fix the dependencies of the wgmma_wait op to prevent the scheduler from
moving it past the uses of the wgmma accumulator. We need to explicitly
represent the dependency between the wait and the accumulator uses,
otherwise LLVM is free to reorder them.
This allows us to remove a workaround that prevented the reordering. We
can also remove the wait op added in the loop during pipelining.
Also fix the descriptor calculation for wgmma; we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait, or we may have race conditions.
Support having a chain of mma ops with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
* [Alloc] Enhanced allocation for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* fix for allocation test
* added test
* fixed mfma_enc printer
* fixed test
MMA V3: support taking operand A from registers. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps and add the lowering
for it.
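As an illustration, here is a chained matmul in the style of the attention tutorial, where the first dot's result feeds the second dot as operand A. This is a hedged sketch, not code from the PR; it assumes fp16 inputs and BLOCK >= 16:

```python
import triton
import triton.language as tl

@triton.jit
def chained_dot(q_ptr, k_ptr, v_ptr, o_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    q = tl.load(q_ptr + idx)
    k = tl.load(k_ptr + idx)
    v = tl.load(v_ptr + idx)
    p = tl.dot(q, k)                 # first mma; result lives in registers
    o = tl.dot(p.to(tl.float16), v)  # second mma takes operand A from registers
    tl.store(o_ptr + idx, o)
```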
When there is a chain of mma ops, we want to pick the same shape for all
of them to avoid conversions. This improves the detection to look
through for loops.
This fixes a crash in the tutorial's backward attention.
We might want to change this logic and convert the format to allow more
efficient MMA at some point.
llvm/llvm-project#66754 extends the `LoopLikeOpInterface`: the signature
of `getLoopBody` has changed. `ForOp::getRegion` can be used instead.
This change works with and without llvm/llvm-project#66754.
Improve the patterns that sink broadcast to reduce arithmetic density,
and also hoist convert above expand_dims to do less work.
This addresses comments in https://github.com/openai/triton/pull/2274
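Schematically, sinking the broadcast performs the arithmetic once per source element instead of once per broadcast element. A plain NumPy illustration of the idea (not the pass itself):

```python
import numpy as np

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
# Before: broadcast first, then add -> 8 * 4 = 32 additions.
before = np.broadcast_to(x, (8, 4)) + np.broadcast_to(y, (8, 4))
# After: add first, then broadcast -> only 4 additions.
after = np.broadcast_to(x + y, (8, 4))
assert (before == after).all()
```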
Change the dot op to allow taking an initial accumulator, and add a
flag that allows the compiler to accumulate in a lower precision than
the output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
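A sketch of what this looks like from the language side, assuming the `acc` and `max_num_imprecise_acc` parameters of `tl.dot` (parameter names as I understand them; check the current docs):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    rm = tl.arange(0, BLOCK_M)
    rn = tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        ks = k + rk
        a = tl.load(a_ptr + rm[:, None] * K + ks[None, :])
        b = tl.load(b_ptr + ks[:, None] * BLOCK_N + rn[None, :])
        # Feed the running accumulator into the dot; on Hopper fp8 the flag
        # bounds how many partial products may be kept below output precision.
        acc = tl.dot(a, b, acc=acc, max_num_imprecise_acc=32)
    tl.store(c_ptr + rm[:, None] * BLOCK_N + rn[None, :], acc)
```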
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and the shared memory
store for the ramp-up stage
- Pipelined Loop: Assemble the loop body minus the last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* updated comment
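For illustration, the shape of the transformation in plain Python. The helpers `load`, `compute`, and the `shared` variable are stand-ins for the real loads, non-load body, and shared-memory staging, not the pass's actual IR manipulation:

```python
def load(x):             # stands in for a global-memory load
    return x

def compute(acc, tile):  # stands in for the non-load loop body
    return acc + tile

def pipelined_sum(tiles):
    # Prologue: hoist the first load and shared-memory store (ramp-up).
    shared = load(tiles[0])
    acc = 0
    # Pipelined loop: the body minus the last iteration.
    for i in range(len(tiles) - 1):
        nxt = load(tiles[i + 1])    # prefetch next tile into regs
        acc = compute(acc, shared)  # compute from the previous tile
        shared = nxt                # store next tile into shared mem
    # Epilogue: peeled non-load body for the last iteration.
    return compute(acc, shared)

assert pipelined_sum([1, 2, 3, 4]) == 10
```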
**Motivation**: We have a kernel that loads multiple types of tensors,
some int32 and some float16. The coalescing pass assigns `perThread = 8`
for the float16 tensors and `perThread = 4` for the int32 tensors,
resulting in unnecessary layout conversions that hurt performance.
Instead, we should just set `perThread = 8` for both of these loads.
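A minimal kernel of the kind described (hypothetical, for illustration): both loads share the same shape and order, so after this change both get `perThread = 8`.

```python
import triton
import triton.language as tl

@triton.jit
def mixed_dtype_loads(f16_ptr, i32_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    x = tl.load(f16_ptr + offs)  # float16: coalescer picked perThread = 8
    y = tl.load(i32_ptr + offs)  # int32: previously forced to perThread = 4
    tl.store(out_ptr + offs, x.to(tl.float32) + y.to(tl.float32))
```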
**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and the same optimal ordering, even if they have different
encodings or dtypes.
Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
Handle more cases of hoisting convert above ext ops. If there are
multiple ext ops in the slice but only one requires inserting a convert,
we can still apply the optimization.
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into 2
main phases. The first is forward propagation of layouts, starting from
ops whose layout we don't want to change and propagating to all the
nodes. If a single layout is needed for an op, we can rewrite the op;
if multiple layouts are required based on the dependencies, we need a
tie-break.
The second phase is backward propagation: it takes a backward slice of
operations starting from a convert and, if all the operations in the
slice can be rematerialized, rewrites the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future, such as adding a cost
model to decide which converts to leave and which to fold.
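A conceptual sketch of the forward phase over a toy def-use graph. This is plain Python: `users`, `anchors`, and the lexicographic tie-break are illustrative stand-ins, not the pass's API or its eventual cost model:

```python
def forward_propagate(users, anchors):
    # users: op -> list of consumer ops; anchors: op -> layout we won't change.
    layouts = {op: {l} for op, l in anchors.items()}
    worklist = list(anchors)
    while worklist:
        op = worklist.pop()
        for u in users.get(op, []):
            before = set(layouts.get(u, set()))
            layouts.setdefault(u, set()).update(layouts[op])
            if layouts[u] != before:  # layout set grew: revisit consumers
                worklist.append(u)
    # One candidate layout: rewrite the op directly; several candidates:
    # tie-break (here lexicographic, as a stand-in for a real cost model).
    return {op: min(ls) for op, ls in layouts.items()}

# Two anchors feeding a shared op: 'c' sees both layouts and gets tie-broken.
print(forward_propagate({'a': ['c'], 'b': ['c']},
                        {'a': 'blocked', 'b': 'mma'}))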
This folds `tl.arange(x, x + 1)` into a constant. This shows up, for
example, when autotuning and one of the block sizes gets set to 1.
Co-authored-by: Philippe Tillet <phil@openai.com>
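For example (an illustrative kernel, not from the PR): with `BLOCK = 1`, as an autotuner may pick, the `tl.arange(0, BLOCK)` below becomes a single-element range that now folds to a constant.

```python
import triton
import triton.language as tl

@triton.jit
def copy_one(x_ptr, y_ptr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # With BLOCK = 1 this is tl.arange(0, 1), which folds to the constant 0.
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    tl.store(y_ptr + offs, tl.load(x_ptr + offs))
```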
* Enable usage of block pointer semantics for AMD GPUs
This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass, which rewrites block pointer loads/stores
to legacy loads/stores.
* Update FA fwd in tutorial to use the block pointers
* Use compute capability 90 for AMD GPUs in python/triton/compiler/compiler.py
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
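For reference, a sketch of the block-pointer form that rewrite_tensor_pointer_pass lowers to legacy loads/stores (an illustrative copy kernel; shapes and strides are hypothetical):

```python
import triton
import triton.language as tl

@triton.jit
def block_ptr_copy(x_ptr, y_ptr, M, N,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # A block pointer describing a BLOCK_M x BLOCK_N tile of a row-major
    # M x N tensor; the pass rewrites the load/store into legacy form.
    x_block = tl.make_block_ptr(base=x_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0),
                                block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(x_block, boundary_check=(0, 1))
    y_block = tl.make_block_ptr(base=y_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0),
                                block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(y_block, tile, boundary_check=(0, 1))
```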