The call to `coalesceOp` is deleting the op it is processing and
replacing it with a new one. We can't `dyn_cast` the `curr` pointer
because it is dangling at this point.
Co-authored-by: Philippe Tillet <phil@openai.com>
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
- A size 1 tensor is considered not expensive since all threads will
load the same
- the targeEncoding may expose more vectorization opportunities (more
elements per thread on the first dim)
**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
Previous https://github.com/openai/triton/pull/1113 forgot to consider
that a node may have multiple parents, visiting the instruction before
any parent violates the semantic of topological sort.
The fixed implementation exhaustively add all operations into a
candidate subgraph and move an operation to the "ready" queue once all
of its operands have been visited.
- **temporarily commenting assertion in `MemBar.cpp`. We need to fix
this! but for now the following patches will unblock a number of
users.**
- Fixed frontend codegen issue for If / For / While. Emit an error when
replaced values' type mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
This fix enables the support on sm_90 (otherwise it will crash).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring
processor)
could be ignored and should be eliminated with the update of llvm nvptx
backend.
This PR adds UpdateMmaForVolta pass to help update the MMA encoding for
Volta.
Some context is told in https://github.com/openai/triton/pull/1014
# Changes
1. Moving the related MMAv1 patterns from GPUCombine pass to
UpdateMmaForVolta pass,
2. Updating both the versionMinor and warpsPerCTA fields for Volta MMA
encodings since they could only be determined after the GPUCombine Pass,
3. Moving the FixupLoop pattern from the Combine.cpp to new
Utility.h/.cpp files
4. Adding an ID field(takes 5 bits to store an integer) to versionMinor
to help assigning a unique ID(on Volta) for each MMA encodings, the
reason is as below
- Currently, there is a cyclic dependency between {DotOperand, Slice}
with MMA layouts, we use a map to help cluster all the DotOperand,
Slice, and MMA layout instances into the same group for further updating
in bulk
- When there are multiple DotOps in a module with the same MMA(literally
equivalent), it is possible to get the wrong groups
- an ID field is used to help to identify the MMA from different DotOps,
thus getting all the MMA, DotOperand, and Slice layout instances in the
right groups
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
Continue the work https://github.com/openai/triton/pull/990
# Background
The `versionMinor` in MmaEncodingAttr holds some states of DotOp's
operands in Volta, while such operands will be modified by some
patterns, making the states out-of-date.
This PR helps to correct the states.
# Implementation
It adds three new patterns:
1. `CollectMmaToUpdateForVolta` helps to collect and build a map holding
the MmaEncodingAttr instances with wrong states and create new correct
ones for them,
2. `UpdateMMAVersionMinorForVolta` helps to replace the Ops generating
the wrong MmaEncodingAttr instances with new correct ones, currently it
supports the following Ops
a. `convert_layout[X -> mma]`
b. `arith.constant SplatAttr : !tensor<mma>`
c. `dot ... : !tensor<mma>`
# Limitation
This PR chooses the mapping way to bypass the IR walk complexity from
the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might
be multiple DotOp holding different DotOprand(IsMMAv1Row) that have the
same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need an ID
field to MmaEncodingAttr.