One long-standing issue in the backend has been the apparent complexity
of the tensor core codegen. This complexity mostly stems from the
existence of the DotOpHelpers` utilities, which have become over time a
catch-all for all things related to MmaEncoding and DotOperandEncoding.
The purpose of this PR is to decouple what should be decoupled, as a
first step towards cleaning our tensor core codegen. Other, more more
local PRs will follow.
- fixed `CompilationError._format_message` fails when `error_message` is
a `constexpr`
- factored out `_is_constexpr()` checks and `_unwrap_if_constexpr()`
idioms
- Added `UnsupportedLanguageConstruct` exception, replaced some python
builtin exceptions raised in such cases.
- Some hardening in `.visit_If()`
- cleaner exception handling in `build_triton_ir()`
https://github.com/openai/triton/issues/1328
Match the convert_layout operation in SimplifyReduceCvt
(convert_layout->reduce). This way we don't miss higher priority rewrite
patterns like RematerializeBackward and SimplifyConversion. We also need
to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit
= 3, and RematerializeForward's benefit = 2.
Differentiate between immediate and non-immediate block arguments.
If we have a load that immediately depends on a block argument in the
current iteration, it is an immediate dependency. Otherwise, it is a
non-immediate dependency, which means the load depends on a block
argument in the previous iterations.
For example:
```
scf.for (%arg0, %arg1, %arg2) {
%0 = load %arg0 <--- immediate dep, this address is initialized at numStages-2
%1 = load %arg1
%2 = add %1, %arg2
%3 = load %2 <--- non-immediate dep, %arg1 must be an update-to-date value
}
```
The above code pattern is commonly seen in cases where we have indirect
memory accesses using a lookup table, such as PyTorch's `bsr_dense_bmm`.
This PR improves `bsr_dense_bmm` for about ~20% on the unit test cases.
Contents of this change to `CodeGenerator`:
- addressed mutable default value in constructor (GitHub #1353)
- structured and faster name lookup (replaces `.get_value`)
- added informative error messages in some places
- tidy mechanism for "static" (compile time) functions replaces inline
`if ... elif ...` chain in `.visit_Call`
- more robust `static_assert` and `static_print`
- more informative `CompilationError` display (saves scrolling up
through long tracebacks)
- dedicated `CompileTimeAssertionFailure` exception for `static_assert`
can be specially treated upstream by `Autotuner` to skip configurations
that violate constraints (as for `OutOfResources`)
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR;
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
This addition allows explanatory messages upon assertion failures:
```python
@triton.jit
def my_single_block_kernel(
matrix_extent: tl.constexpr,
block_size: tl.constexpr, # must be >= extent (single block)
matrix: Tensor,
...
):
tl.static_assert(matrix_extent <= block_size,
f"`matrix_extent` should not be more than the block size ({block_size}), but is {matrix_extent}")
```
Yielding, when called incorrectly:
```
AssertionError: `matrix_extent` should not be more than the block size (32), but is 57
```
Before this PR, loops' induction variables' (IV) alignment info is lost.
For example:
```
for n in range(0, K, BLOCK):
x = base + n
^-- Triton doesn't know n is always a multiple of BLOCK
```
This PR fixes this.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR applies a minor patch that removes unnecessary masks in
`_dsd_kernel()`.
### Details
`offs_bn` is defined as follows and not updated after that.
```py
offs_bn = pid_m * TILE_N + tl.arange(0, TILE_N)
offs_bn = tl.max_contiguous(tl.multiple_of(offs_bn % DS0, TILE_N), TILE_N)
```
Because `offs_bn = offs_bn % DS0`, this mask is always `True`.
```py
b = tl.load(pb, mask=offs_bn[None, :] < DS0)
```
This PR removes this mask (as well as explicit `mask=True`).
Fixed `JITFunction.__init__` to mark args as constexpr only when the
annotation is actually `tl.constexpr`, rather than treating any
annotated arg as constexpr.
When the user set the `LLVM_SYSPATH` to use custom build llvm, it will
throw the error because there is no version.txt under the custom build
one.
This PR skips the version check If the `LLVM_SYSPATH` is set.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This is cherry-picked from #1305
If you call a `JITFunction` twice in the same kernel, first with `int32`
then with `uint32`, the second call will treat the unsigned value as
signed. This passes through MLIR without error because MLIR uses the
same types for both, but different operation calls will be generated so
you may silently get the wrong result.
I observed that when compiling with gcc8, stdc++fs linker flag isn't
passed to enough targets. I couldn't figure out the correct target
to add the linker flag to, so I'm just mashing it everywhere.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
- Significant simplification of the optimizer pipeline. Right mma
version is now set directly after the coalescing pass. DotOperand layout
no longer hold a state to `isRow` argument, and instead query it from
their parent
- Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to
TritonGPUAttrDefs. All MMAv1 state is now queried from attributes.
- logic for getELemsPerThread is no longer duplicated in TypeConverter
`_triton.runtime.num_sm`, `_triton.runtime.clock_rate`,
`_triton.runtime.cc` seem no longer exist.
use the corresponding methods from `get_max_tensorcore_tflops` in the
same file.