Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would result in test
failure due to too strict atol/rtol.
To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.
See: https://github.com/pytorch/pytorch/issues/61844
Replace: https://github.com/openai/triton/pull/2242
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR makes the following change to AOT kernel
- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of constexpr set and
meta-parameters is referred to an "algo". Within an algo client can
still give different hints about integer arguments.
- Add a API int ${kernle_name}_get_num_algos() that returns the total
number of algos.
- Add a algo_id to allow client to the generated kernel to select the
algo
- Remove gX, gY and gZ from the kernel parameter list. This is because
the launch grid is usually different with different algos, and the
client should not need to care about how to compute the launch grid for
each algo. Instead, we ask the client to pass the expression of
computing gX, gY and gZ for compile.py (when AOT kernels are generated).
The expression can only use kernel parameter or const values.
- We also change the testing flow. Now we first build the kernels into a
shared library libkernel.so, then the client test.c code is built and
link with libkernel.so. This is closer to a typical AOT kernel usage
flow.
For warp specialized persistent kernel, the instruction sequence for
Warp Groups are
```
// warp group 0
for wave in 0..num_waves:
idx = wave * num_inner_loop_steps;
for k_tile_idx in 0..num_k_tiles:
mbarrier.wait EB[idx];
W0;
mbarrier.arrive FB[idx];
idx++;
```
```
// warp group 1
for wave in 0..num_waves:
idx = wave * num_inner_loop_steps;
for k_tile_idx in 0..num_k_tiles:
mbarrier.wait FB[idx];
R0;
mbarrier.arrive EB[idx];
idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if GEMM K is small than K-TileShape, then the num_inner_loop_steps
of persistent kernel is 0. The buffer id and mbarrier id will always be
0 in this case. And it may form W0 -> W1 -> R0 -> R1 order, which is
contradicts with the atomicity --
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
`if _unwrap_if_constexpr(cond)` then enters `node.body` is wrong when
cond is a tensor since we cannot statically evaluate a dynamic tensor's
value.
The right way to solve the problem is probably:
1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
Add a new operation to be able to implement packed inline assembly for
elementwise operations. This way inline assembly can be used to control
elementwise operations. It also allows to pack elements to be able to
manually vectorize operations.
Make sure that other threads within CTA do not operate on mbarrier until
it is initialized by thread 0.
Co-authored-by: Philippe Tillet <phil@openai.com>
Also fixes a bug exposed in convertLayout lowering for float16. We
shouldn't be using cvt.pack.sat.u16.s32 to pack 16bits values as this
needs to take a 32bits register. Also this prevented optimization at
llvm ir level.
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
In the current link.py, it produces the launcher code as below:
```python
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0))
return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
// ...
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0))
return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
}
```
Note that, when the input does not match any of the if branches, it will
do nothing, and the compiler should make it return 0 as a default
behavior, which equals to `CUDA_SUCCESS`, this doesn't match the
expectation.
This PR adds a `return CUDA_VALUE_ERROR;` to the tail of launchers, and
it produces code like:
```c++
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1))
return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk);
return CUDA_ERROR_INVALID_VALUE;
}
```
And it requires users to check the result in their application, which I
think should match the initial AOT ideas.
we currently have a very janky approach to optimizing mixed-precision
matmul workloads, where some layout combinations (e.g., NT matmul) were
explicitly pattern-matched to take a more optimized codepath. Attempt at
unifying all the codepaths to codegen cp.async failed, due to bugs in
SharedToDotOperandMMAv2.cpp.
This PR fixes said bugs, add some assertions for SharedToDotOperandMMAv2
modes that aren't well supported, and greatly simplify our handling of
element-wise operations between load and conversions to DotOperand.
- Change test_aot.py to actually use equal_to_1 hint
- In the client function, equal_to_1 parameters are not specialized,
because AOT clients may not know the details of Triton argument
specialization, they still want to use the same parameter list as they
write the Triton kernel. The generated kernels has specialized argument
list, the generated dispatcher code will make sure the correct arguments
from the original full argument list are passed.
- Fixed a bug in _match_suffix in link.py. Previously it assumes each
parameter has a suffix of either ‘d’ or ‘c’, but in fact sometimes a
parameter doesn’t have a suffix, like 0d1d2d34c56c78c
Fixes the case where setting default values for arguments in a kernel
function signature results in a generated kernel wrapper function
without these default values.
For example:
```
@triton.jit
def kernel(x, y, z=3):
...
...
kernel[grid](x,y)
```
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR addresses the following issues encountered when using AOT
kernels in our project:
1. When different signatures are set for the same Triton kernel, it can
result in C functions with the same name. This is problematic because C
does not support function overloading.
2. Currently, the AOT kernel always compiles with `num_warps=1`, as
indicated
[here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83).
However, the generated function includes a `numWarps` argument, which
can cause errors when the specified value does not match.
To resolve these issues, this PR does the following modifications:
1. Adds an 8-char hash key as a suffix to the generated function's
signature. This ensures that different function names are generated in C
when the argument dtype or constexpr value or even hint differs since we
hope these kernels could be used in one C/C++ library.
2. Introduces a new flag called `num-warps` that allows manual
specification of the `numWarps` value for AOT. This change hardcodes the
specified value into the generated kernel.c and removes the `numWarps`
argument from the generated function.