Hi,
I'm adding some features to
`triton.runtime.jit.JITFunction._make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.
In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Adding a utility method that generates a temporary file to create
the launcher when debugging a kernel in the main module~~
Using a closure to hold the launcher's body (see the sketch below)
Because these features might be useful to others, I have opened this
pull request.
~~Tests are yet to be added; if this submission is accepted, I will
add them later.~~
Since this change is a refactor, no new test was added.
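Below is a minimal, self-contained sketch of the closure idea; `LauncherHost`, `bind`, and `dispatch` are hypothetical stand-ins for the real launcher machinery, not Triton's actual code:

```python
class LauncherHost:
    """Toy stand-in for JITFunction; `bind` and `dispatch` are hypothetical."""

    def bind(self, *args, **kwargs):
        return {"args": args, **kwargs}

    def dispatch(self, grid, **bound):
        print("launch", grid, bound)

    def make_launcher(self):
        # The launcher is a real nested function (a closure over `self`)
        # instead of exec'd source, so editors can inspect it and debuggers
        # can step into it.
        def launcher(*args, grid, **kwargs):
            bound = self.bind(*args, **kwargs)
            return self.dispatch(grid=grid, **bound)

        return launcher


launcher = LauncherHost().make_launcher()
launcher(1, 2, grid=(8,))
```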
Replace a single
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32 instruction
that is used on Ampere with 4 x
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 instructions
for Turing
Extracted the Turing-int8, Turing-fp16, and Ampere paths into separate functions.
Somehow I messed up my previous PR, so I'm just opening a new one.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This fixes a few bugs I've encountered (a repro sketch follows the list):
- `atomic_add` with int64/uint64 -> `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
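For reference, a hypothetical repro sketch (not from the PR) exercising these cases with the standard `tl.atomic_add`/`tl.atomic_max` signatures:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def atomics_repro(add64_ptr, max32_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # int64 atomic add: previously rejected by ptxas ('atom' operand error)
    tl.atomic_add(add64_ptr + offs, 1, mask=mask)
    # float32 atomic max: previously returned the old value as int32
    tl.atomic_max(max32_ptr + offs, 1.0, mask=mask)


add64 = torch.zeros(256, dtype=torch.int64, device="cuda")
max32 = torch.zeros(256, dtype=torch.float32, device="cuda")
atomics_repro[(1,)](add64, max32, add64.numel(), BLOCK=256)
```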
This was regressed by #2185 because we didn't realise the CUDA_CHECK macro
could make Python calls (similar to what led to #2225). I think the
PyErr_Occurred check was removed in that PR because error handling was
missing before the call to _launch, so it looked like it was just in
the wrong place.
It looks like there are also potentially a couple of places in cuda.c that
can return with the error indicator set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled, etc., but those are all
pre-existing and not affected by recent changes.
1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw
number of warps, to avoid communication among warps that handle the same
piece of data.
2. When there's a single warp on the axis, use warp intrinsics for
communication and skip shared memory.
A follow-up PR is needed for code cleanup.
Change dot to allow taking an initial accumulator, and add a flag that
allows the compiler to accumulate in a lower precision than the output
type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
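A minimal sketch of threading an accumulator through `tl.dot` inside a K-loop (boundary masks omitted; the low-precision flag itself is not shown since its exact name isn't given here):

```python
import triton
import triton.language as tl


@triton.jit
def matmul_acc(a_ptr, b_ptr, c_ptr, M, N, K,
               stride_am, stride_ak, stride_bk, stride_bn,
               stride_cm, stride_cn,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
               BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    # The running accumulator is threaded through tl.dot instead of being
    # added to the result afterwards.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc = tl.dot(a, b, acc)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```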
A low-tech but very useful way to override kernels on the fly. This can be
used for debugging functionality or performance problems: it lets users
dump, modify, and feed IR back into the JIT compiler.
…rf on problems that need few blocks.
Constrain the number of launched blocks to exactly what the persistent
warp-specialized kernel needs. This is useful when problems need very
few blocks.
e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve ~16% speedup.
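A sketch of the host-side idea with the example sizes above (hypothetical code, not from the PR): a persistent kernel never needs more blocks than min(#tiles, #SMs), so small problems launch far fewer blocks.

```python
import torch
import triton

M, N = 800, 800
BLOCK_M, BLOCK_N = 128, 128

# Number of output tiles the kernel has to cover.
num_tiles = triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N)
# A persistent kernel keeps its blocks resident, so it never needs more
# blocks than there are SMs; small problems need even fewer.
num_sms = torch.cuda.get_device_properties(0).multi_processor_count
grid = (min(num_tiles, num_sms),)
```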
This fixes a few bugs related to scalar tensors (a repro sketch follows the list):
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Relatedly, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
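A hypothetical in-kernel sketch of the cases above, assuming the standard `tl.full`, `[None]` indexing, and `tl.zeros_like` APIs:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def scalar_cases(out_ptr):
    x = tl.full([], 3.0, tl.float32)  # 0-d scalar tensor (used to raise TypeError)
    row = x[None]                     # shape [1] (used to raise TypeError)
    mat = x[None, None]               # shape [1, 1] (used to raise AttributeError)
    y = tl.zeros_like(x)              # should also be 0-d, not shape [1]
    tl.store(out_ptr, x + y)


out = torch.zeros(1, device="cuda", dtype=torch.float32)
scalar_cases[(1,)](out)
```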
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It may look like nit-picking, but it would be really
appreciated if these could be addressed in OSS Triton (to reduce our
internal differences from OSS), and we believe these changes are
reasonable in general. Neither performance nor functionality is affected
by this PR.
1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).
2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.
cc: @yuguo68
Users can pass custom arguments to benchmarks. For example, a user can pass
`dtype`, which will be used to create tensors in a benchmark.
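A sketch of the general idea using `triton.testing.Benchmark`'s `args` dict (the exact mechanism in this PR may differ): the `dtype` entry is forwarded to the benchmark function and used to build its tensors.

```python
import torch
import triton
import triton.testing


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],
        x_vals=[2**i for i in range(12, 16)],
        line_arg="provider",
        line_vals=["torch"],
        line_names=["torch"],
        plot_name="copy",
        args={"dtype": torch.float16},  # custom argument passed to the benchmark
    )
)
def bench_copy(N, provider, dtype):
    x = torch.empty(N, device="cuda", dtype=dtype)
    return triton.testing.do_bench(lambda: x.clone())


bench_copy.run(print_data=True)
```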
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would fail due to
overly strict atol/rtol.
To avoid having to choose strictness ourselves, and to get better
dtype-based defaults, use the non-deprecated torch testing util.
See: https://github.com/pytorch/pytorch/issues/61844
Replace: https://github.com/openai/triton/pull/2242
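For reference, a minimal sketch of the non-deprecated utility, `torch.testing.assert_close`, which picks atol/rtol defaults based on dtype:

```python
import torch

a = torch.randn(128, 128, dtype=torch.float16)
b = a.clone()

# Tolerances default based on dtype (fp16 gets looser defaults than fp32),
# so tests don't have to hand-pick atol/rtol.
torch.testing.assert_close(b, a)

# Explicit overrides are still possible when needed.
torch.testing.assert_close(b, a, atol=1e-2, rtol=1e-2)
```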
Previously, if there were any specializations of "1" or "constexpr"
mixed with unspecialized arguments in arbitrary order, we could
encounter errors from passing incorrect arguments. This was because
the length of the signature did not reflect the maximum index of the
regular arguments.
https://github.com/openai/triton/issues/2229
@shunting314 @amjames
More specifically, for cases like:
```
kernel(
    b: tl.tensor,     # regular argument
    a: tl.constexpr,  # constexpr argument
    c: tl.int32 = 1,  # argument specialized to the value 1
    d,                # regular argument
    e: tl.constexpr,  # constexpr argument
    ...
)
```
The current setup.py could not clean the build directory because the
default build directory has been changed in `CMakeBuild`. This PR cleans
the build directory accordingly.
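A hypothetical sketch of the idea (not the PR's exact code): give setup.py a `clean` command that removes the directory `CMakeBuild` writes to; the `build` path here is an assumption.

```python
import os
import shutil
from setuptools import Command


class CleanCMakeBuild(Command):
    """Remove the directory CMakeBuild writes to (assumed to be ./build)."""

    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        build_dir = os.path.abspath("build")  # assumed CMakeBuild output dir
        if os.path.isdir(build_dir):
            shutil.rmtree(build_dir)


# In setup(): cmdclass={"clean": CleanCMakeBuild}
```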