On my machine, running `pip install cmake` outside a virtualenv fails and tells me to use apt instead, which doesn't quite work for some reason. Maybe this is obvious to Python people, but it seems worth mentioning, especially since `.venv` is already in the gitignore.
Change the dot to allow taking an initial accumulator, and add a flag that lets the compiler accumulate in a lower precision than the output type.
On Hopper this flag is on by default, which allows accumulating at lower precision.
This only affects the Hopper fp8 dot.
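The effect of a lower-precision accumulator can be sketched with a small NumPy model (this is an illustrative simulation, not the actual Triton API; `dot_accumulate` and its signature are hypothetical):

```python
import numpy as np

def dot_accumulate(a, b, acc, acc_dtype):
    # Toy model: partial sums are rounded to `acc_dtype` after every step,
    # mimicking an accumulator kept in lower precision than the output.
    acc = acc.astype(acc_dtype)
    for k in range(a.shape[1]):
        partial = np.outer(a[:, k].astype(np.float32), b[k, :].astype(np.float32))
        acc = (acc.astype(np.float32) + partial).astype(acc_dtype)
    return acc.astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 256)).astype(np.float16)
b = rng.standard_normal((256, 4)).astype(np.float16)
init = np.zeros((4, 4), dtype=np.float32)

full = dot_accumulate(a, b, init, np.float32)  # accumulate in fp32
fast = dot_accumulate(a, b, init, np.float16)  # accumulate in fp16
print(np.max(np.abs(full - fast)))  # the fp16 accumulator loses precision
```

The lower-precision path trades this accuracy loss for speed, which is why it is gated behind a flag.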
A low-tech but very useful way to override kernels on the fly. This can be used for debugging functionality or performance problems: it lets users dump IR, modify it, and feed it back into the JIT compiler.
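A minimal sketch of the mechanism, assuming a dump directory keyed by kernel name (the `load_ir` helper and the `.ttir` file naming are hypothetical, not the real hook):

```python
import os
import tempfile

def load_ir(dump_dir, kernel_name, compiled_ir):
    # Hypothetical override hook: if the user has dropped an edited IR
    # file into the dump directory, feed that back to the JIT compiler
    # instead of the freshly compiled IR.
    override = os.path.join(dump_dir, f"{kernel_name}.ttir")
    if os.path.exists(override):
        with open(override) as f:
            return f.read()
    return compiled_ir

with tempfile.TemporaryDirectory() as d:
    print(load_ir(d, "add_kernel", "original-ir"))  # no override: original-ir
    with open(os.path.join(d, "add_kernel.ttir"), "w") as f:
        f.write("patched-ir")  # hand-edited IR dropped in by the user
    print(load_ir(d, "add_kernel", "original-ir"))  # override wins: patched-ir
```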
This is needed for forward compatibility with MLIR, which now has "inherent" and "discardable" attributes (https://mlir.llvm.org/OpenMeetings/2023-02-09-Properties.pdf); the ExternElementwiseOp attrs do not propagate with the current `addNamedAttrs` implementation.
Add infrastructure to add and test custom LLVM passes in the backend. This will allow us to apply some low-level optimizations and cleanup on LLVM IR.
Add a first pass that breaks up phis of structs created by lowering to LLVM. These can pessimize the optimizer, since they block optimizations from going through phi nodes.
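A toy model of what the pass does, with structs as tuples and a phi as a map from predecessor block to incoming value (all names here are illustrative, not the actual LLVM pass code):

```python
def split_struct_phi(incoming):
    # `incoming` maps predecessor_block -> (field0, field1, ...) for one
    # phi whose value is a struct. Return one scalar phi per struct field,
    # each mapping predecessor -> scalar incoming value; the struct is
    # then rebuilt from the scalar phis at the merge point.
    num_fields = len(next(iter(incoming.values())))
    return [{pred: fields[i] for pred, fields in incoming.items()}
            for i in range(num_fields)]

phi = {"bb0": ("a0", "b0"), "bb1": ("a1", "b1")}
print(split_struct_phi(phi))
# [{'bb0': 'a0', 'bb1': 'a1'}, {'bb0': 'b0', 'bb1': 'b1'}]
```

After the split, standard optimizations can work on each scalar phi independently instead of being blocked by the aggregate.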
…rf on problems that need few blocks.
Constrain the number of launched blocks to exactly what a persistent warp-specialized kernel needs. This is useful when problems need very few blocks.
e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve ~16% speedup.
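The idea can be sketched as follows (a simplified model; the helper name and the one-block-per-SM assumption are ours, not the actual implementation):

```python
def launch_grid(M, N, num_sms, block_m, block_n):
    # For a persistent kernel, launch at most one block per SM, but never
    # more blocks than there are output tiles to compute.
    def cdiv(a, b):
        return (a + b - 1) // b
    num_tiles = cdiv(M, block_m) * cdiv(N, block_n)
    return min(num_sms, num_tiles)

# An 800x800 output with 128x128 tiles has 7*7 = 49 tiles, so even on a
# GPU with 108 SMs only 49 blocks are launched.
print(launch_grid(800, 800, 108, 128, 128))  # -> 49
```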
This fixes a few bugs related to scalar tensors:
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
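For reference, these fixes align scalar handling with the analogous 0-d semantics in NumPy:

```python
import numpy as np

scalar = np.full([], 3.0, dtype=np.float32)  # a 0-d "block" is allowed
print(scalar.shape)                  # () - a true 0-d shape, not (1,)
print(scalar[None].shape)            # (1,) - indexing with None adds a dim
print(scalar[None, None].shape)      # (1, 1)
print(np.zeros_like(scalar).shape)   # () - still a scalar, not 1-d
```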
This PR fixes a few very minor compilation issues found in internal deployment at Meta. It may look like nit-picking, but we'd really appreciate it if these could be addressed in OSS Triton (to reduce our internal differences from OSS), and we believe the changes are reasonable in general. Neither performance nor functionality is affected by this PR.
1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).
2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.
cc: @yuguo68
Users can pass custom arguments to benchmarks. For example, a user can pass `dtype`, which will be used to create tensors in a benchmark.
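A rough sketch of the mechanism with a toy runner (the runner, the benchmark function, and its default argument are all hypothetical, standing in for Triton's benchmarking utility):

```python
import numpy as np

def run_benchmark(bench_fn, sizes, **kwargs):
    # Hypothetical runner: extra keyword arguments (e.g. dtype) are
    # forwarded unchanged to the benchmark function for each size.
    return {size: bench_fn(size, **kwargs) for size in sizes}

def vector_add_bytes(size, dtype=np.float32):
    # Toy "benchmark" that just reports bytes moved for the given dtype.
    x = np.zeros(size, dtype=dtype)
    return 3 * x.nbytes  # read x, read y, write z

results = run_benchmark(vector_add_bytes, [1024, 2048], dtype=np.float16)
print(results)  # the forwarded dtype controls tensor creation
```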
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would result in test
failure due to too strict atol/rtol.
To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.
See: https://github.com/pytorch/pytorch/issues/61844
Replace: https://github.com/openai/triton/pull/2242
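The gist of the change, sketched in NumPy (the tolerance table below is illustrative only; the real per-dtype defaults are chosen internally by `torch.testing.assert_close`):

```python
import numpy as np

# Illustrative per-dtype (rtol, atol) pairs; looser for low-precision
# dtypes. These values are assumptions, not torch's actual defaults.
TOLERANCES = {
    np.dtype(np.float16): (1e-3, 1e-5),
    np.dtype(np.float32): (1e-6, 1e-5),
}

def assert_close(actual, expected):
    # Pick tolerances from the dtype instead of hand-choosing them per test.
    rtol, atol = TOLERANCES[actual.dtype]
    assert np.allclose(actual, expected, rtol=rtol, atol=atol)

x = np.ones(8, dtype=np.float16)
assert_close(x, x + np.float16(3e-4))  # within fp16 tolerance
```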
.. instead of an option.
This partially addresses https://github.com/openai/triton/issues/2265 so that printing a pass pipeline in textual form no longer crashes.
It is not a proper fix for the underlying issue that pass results should be stored in the IR rather than in a pointer argument.
Previously, if specializations of "1" or "constexpr" were mixed with unspecialized arguments in arbitrary order, incorrect arguments could be passed to the kernel. This was because the length of the signature did not indicate the maximum index of the regular arguments.
https://github.com/openai/triton/issues/2229
@shunting314 @amjames
More specifically for cases like:
```
kernel(
    b: tl.tensor,
    a: tl.constexpr,
    c: tl.int = 1,
    d,
    e: tl.constexpr,
    ...
)
```
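The fix can be illustrated with a small sketch of how the full positional argument list has to be reconstructed (the helper name and the dict-based encoding are ours, for illustration only):

```python
def assemble_args(num_params, specialized, runtime_values):
    # `specialized` maps parameter index -> compile-time value (constexpr
    # arguments, or values specialized to 1); `runtime_values` holds the
    # remaining arguments in parameter order. The signature alone is too
    # short to reveal the highest regular-argument index, so the total
    # parameter count must be tracked explicitly.
    it = iter(runtime_values)
    return [specialized[i] if i in specialized else next(it)
            for i in range(num_params)]

# Parameters (b, a, c, d, e) with a and e specialized at compile time:
full = assemble_args(5, {1: 16, 4: 64}, ["buf_b", 1, "buf_d"])
print(full)  # ['buf_b', 16, 1, 'buf_d', 64]
```

Interleaving specialized and regular arguments in any order now slots every value back into its original position.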
The current setup.py could not clean the build directory, because the default build directory was changed in `CMakeBuild`. This PR updates the clean step to remove the correct build directory.
**Motivation**: We have a kernel that loads multiple types of tensors - some int32 and some float16. The coalescing pass assigns `perThread = 8` for the float16 tensors and `perThread = 4` for the int32 tensors, resulting in unnecessary layout conversions and bad performance. Instead, we should set `perThread = 8` for both of these loads.
**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and same optimal ordering, even if they have different encoding or
dtype.
Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
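The updated computation can be sketched as follows (a simplified model assuming a 128-bit widest vectorized access; the helper is illustrative, not the actual pass code):

```python
def group_per_thread(bit_widths, vector_bits=128):
    # For a group of values that share shape and optimal ordering but
    # differ in dtype, pick the elements-per-thread that lets the
    # narrowest element type still reach the full vector width.
    return max(vector_bits // bits for bits in bit_widths)

# A group mixing float16 (16-bit) and int32 (32-bit) loads:
print(group_per_thread([16, 32]))  # -> 8, instead of 4 for the int32 loads
```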