* [Dot] [MFMA] Support FP16 output of MFMA dot
This PR adds cast of output tensor to requested data type.
* add tests
* fix test for FMA implementation
* loose fp16xfp16->fp16 tolerance
* enable FMA fallback for unsupported sizes of dot operation
* rework granularity check
* add constant modifier to granularity
In the current link.py, it produces the launcher code as below:
```python
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0))
return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
// ...
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0))
return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
}
```
Note that, when the input does not match any of the if branches, it will
do nothing, and the compiler should make it return 0 as a default
behavior, which equals to `CUDA_SUCCESS`, this doesn't match the
expectation.
This PR adds a `return CUDA_VALUE_ERROR;` to the tail of launchers, and
it produces code like:
```c++
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1))
return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk);
return CUDA_ERROR_INVALID_VALUE;
}
```
And it requires users to check the result in their application, which I
think should match the initial AOT ideas.
we currently have a very janky approach to optimizing mixed-precision
matmul workloads, where some layout combinations (e.g., NT matmul) were
explicitly pattern-matched to take a more optimized codepath. Attempt at
unifying all the codepaths to codegen cp.async failed, due to bugs in
SharedToDotOperandMMAv2.cpp.
This PR fixes said bugs, add some assertions for SharedToDotOperandMMAv2
modes that aren't well supported, and greatly simplify our handling of
element-wise operations between load and conversions to DotOperand.
- Change test_aot.py to actually use equal_to_1 hint
- In the client function, equal_to_1 parameters are not specialized,
because AOT clients may not know the details of Triton argument
specialization, they still want to use the same parameter list as they
write the Triton kernel. The generated kernels has specialized argument
list, the generated dispatcher code will make sure the correct arguments
from the original full argument list are passed.
- Fixed a bug in _match_suffix in link.py. Previously it assumes each
parameter has a suffix of either ‘d’ or ‘c’, but in fact sometimes a
parameter doesn’t have a suffix, like 0d1d2d34c56c78c
* Fix the issue with masked load
Cherry-picked from a0b60eb187
* Remove tests in test_gemm that use too much LDS
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Fixes the case where setting default values for arguments in a kernel
function signature results in a generated kernel wrapper function
without these default values.
For example:
```
@triton.jit
def kernel(x, y, z=3):
...
...
kernel[grid](x,y)
```
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR addresses the following issues encountered when using AOT
kernels in our project:
1. When different signatures are set for the same Triton kernel, it can
result in C functions with the same name. This is problematic because C
does not support function overloading.
2. Currently, the AOT kernel always compiles with `num_warps=1`, as
indicated
[here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83).
However, the generated function includes a `numWarps` argument, which
can cause errors when the specified value does not match.
To resolve these issues, this PR does the following modifications:
1. Adds an 8-char hash key as a suffix to the generated function's
signature. This ensures that different function names are generated in C
when the argument dtype or constexpr value or even hint differs since we
hope these kernels could be used in one C/C++ library.
2. Introduces a new flag called `num-warps` that allows manual
specification of the `numWarps` value for AOT. This change hardcodes the
specified value into the generated kernel.c and removes the `numWarps`
argument from the generated function.
* [WIP][FA OPTIMIZATION] Optimize chain dot
This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.
* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers
* Move operand swapping in ttgir -> llir lowering phase
* Refactor emitMfmaOffsetForCTA function to be more readable
* Fix accidental change in 06-fused-attention.py
* Address review comments
* Fix rebase errors
The code generated by LLVM ends up using 15 SASS instructions, while the
inline PTX added here only uses 8. It might be possible to reduce this
down to 6 if NVIDIA optimizes ptxas to use the byte selector in I2F for
all bytes (right now, we still have some bit manipulation code generated
for 2 out of 4 bytes).
This change improves the performance of mixed precision matmul kernel
with M=N=K=4096, where one operand is casted from s8 to bf16 from 140
TFlop/s to 165 TFlop/s on A100-40GB.
Also refactors the ElementwiseOpConversionBase template to support
vectorized operations, reducing the boilerplate needed for existing, and
this new vectorized cast; and extends the casting test to process more
than one element (so vectorized casts can be properly tested).
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.
note: reduceop code has become quite messy and will require some cleanup
Adding new tests across the board for float32, bfloat16, non-powers-of-2
shapes (to test masks), and tests on sequence parallel for atomics. This
also adds the sequence parallel features from
https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.
I am not sure about the best way to grab the baseline benchmarking
numbers. I have access to V100s and A100s, but I saw on the tests it
mentions " # A100 in the CI server is slow-ish for some reason.
# On some other servers, we are getting about 90% peak for 8kx8x8k
float16". Current plan is to run CI here and use those numbers for
baseline, then match against my GPUs as a sanity check.
---------
Co-authored-by: Phil Tillet <phil@openai.com>
Calling `tl.full` with an unsigned dtype currently fails with the error:
```
AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute
'get_uint8'")
```
This PR defines those functions rather than changing the calls to the
signed versions so that we can use an unsigned argument type in C++ and
avoid overflow for large uint64 values.
Fix calculation of unique number of threads within a warp. We need to
consider the number of elements per thread in the calculation. Also
change the layout test to integer sum in order to catch bugs with unique
data as max reduction may hide those kind of problems.
Run most of the pytest in parallel, this allows to speed up CI from
36min to 10min for A100 and 22min to 6min for H100. Some tests still
need to run serially like runtime tests.
We've already updated the mapping between name and tensor before
visiting each compound statement in the while op. As a result, any
overwritten name gets up-to-date values updated in the while loop. And
any unchanged livein names hold the original tensors.
This relax the restriction in the scan lowering to support layout where
we scan along a dimension which isn't the fastest moving one. This is
done by relaxing how we accesses elements during scanning and allow
elements to be strided.
Since the type expected for mma encoding is i32 when lowering f16 splat
we need to pack f16 constants into a i32 value. This allows re-enabling
the constant matmul unit test.
Implement associative_scan in the front end and implement lowering to
LLVM for blocked layout where the scan happens on the fastest moving
dimension. This will later be generalized to support more layout.
[To squash] Configurable warp size in test_core_amd.py::test_convert2d
Note: test_core_amd.py::test_convert2d unit tests have been changed
because some of the old layouts exceed the shared memory limit (64KiB)
clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.