* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
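As a hypothetical illustration (not the PR's tests), a minimal Triton kernel whose `tl.dot` on bfloat16 inputs would exercise this MFMA path on MI100; the kernel and shapes are illustrative:
```python
import torch
import triton
import triton.language as tl

@triton.jit
def bf16_dot_kernel(a_ptr, b_ptr, c_ptr,
                    M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    # A single program computes the whole MxN tile (fine for tiny shapes).
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])  # bf16
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])  # bf16
    acc = tl.dot(a, b)  # accumulated in fp32 by the MFMA instruction
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)

a = torch.randn(32, 32, device="cuda", dtype=torch.bfloat16)
b = torch.randn(32, 32, device="cuda", dtype=torch.bfloat16)
c = torch.empty(32, 32, device="cuda", dtype=torch.float32)
bf16_dot_kernel[(1,)](a, b, c, M=32, N=32, K=32)
```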
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and the shared memory store
for the ramp-up stage
- Pipelined Loop: Assemble the loop body minus the last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* updated comment
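For reference, a plain-Python schematic of the schedule described above; the names and callback structure are illustrative, the actual pass operates on MLIR:
```python
# Schematic of: for k in range(n): tile = load(k); acc = compute(acc, tile)
def pipelined(load, compute, num_iters, acc=0):
    # Prologue: hoisted load + shared memory store for the ramp-up stage.
    shared = load(0)
    # Pipelined loop: the loop body minus the last iteration.
    for k in range(num_iters - 1):
        regs = load(k + 1)          # prefetch next tile from global into regs
        acc = compute(acc, shared)  # non-load loop body
        shared = regs               # store next tile into shared mem
    # Epilogue: peeled non-load loop body for the last iteration.
    return compute(acc, shared)

assert pipelined(lambda k: k, lambda acc, t: acc + t, 4) == 0 + 1 + 2 + 3
```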
* refine the gemm tuning scripts to reduce the tuning space and get better perf numbers
* added code to support tuning in full tuning space
* add a function to get best tuning config
* refine the matmul tutorial example to print out best tuning config for each input
* added even_k to gemm kernel heuristic for better performance
* address review comments
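As a hedged sketch of the heuristics involved (the function and config-key names below are hypothetical, not the script's actual identifiers):
```python
# Prune the tuning space: keep only configs whose BLOCK_SIZE_K * SPLIT_K
# divides K evenly (even_k), so the K loop needs no boundary masking.
def prune_configs(configs, M, N, K):
    pruned = []
    for cfg in configs:
        block_k = cfg.kwargs["BLOCK_SIZE_K"]
        split_k = cfg.kwargs.get("SPLIT_K", 1)
        if K % (block_k * split_k) == 0:  # even_k heuristic
            pruned.append(cfg)
    return pruned or configs  # fall back to the full space if nothing survives
```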
This PR changes the printf ttgir -> llvm conversion:
an unknown location is assigned to the global constant holding the format string.
This fixes a problem in the test_subprocess.py tests,
which failed during construction of a file location for the format string constants.
* Add fwd and bwd v2
Changes are largely from upstream.
* Split bwd kernel in dq and dk+dv
Only adds the split kernels. They are not enabled yet.
* Pull scalar multiplies out of the loop
* Enable split kernel for bwd pass
* Put back P_SEQ=128 in fwd test
Not used for bwd test
* Address review comments
* Address comments
Conditionally set causal/split kernel to False for bwd.
* Add block pointer semantics to bwd pass
This significantly increases perf for bwd, similar to fwd.
* Enable usage of block pointer semantics for AMD gpus
This commit enables usage of block pointer semantics by enabling
the rewrite_tensor_pointer_pass, which rewrites block pointer loads/stores
to legacy loads/stores (both forms are sketched below).
* Update FA fwd in tutorial to use the block pointers
* use 90 compute capability for amd gpus in python/triton/compiler/compiler.py
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
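For illustration, a sketch of the two addressing styles involved (the shapes, strides, and helper name are assumptions): `tl.make_block_ptr` carries shape, strides, and offsets with the pointer, and the pass lowers such loads to the legacy offset-plus-mask form:
```python
import triton
import triton.language as tl

@triton.jit
def load_tile(base, M, N, stride_m, stride_n,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Block pointer form: shape/strides/offsets are carried by the pointer.
    block_ptr = tl.make_block_ptr(base=base, shape=(M, N),
                                  strides=(stride_m, stride_n),
                                  offsets=(0, 0),
                                  block_shape=(BLOCK_M, BLOCK_N),
                                  order=(1, 0))
    tile_a = tl.load(block_ptr, boundary_check=(0, 1))

    # Legacy form the pass rewrites it to: explicit offsets plus a mask.
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    ptrs = base + offs_m[:, None] * stride_m + offs_n[None, :] * stride_n
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tile_b = tl.load(ptrs, mask=mask)
    return tile_a + tile_b
```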
This PR replaces expensive operations with simpler ones:
mul and div are replaced with select and compare.
This is a minor change: it decreases the number of required registers
by one when dot operand loading is a bottleneck.
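A plain-Python illustration of this kind of strength reduction, assuming the index is known to lie in [0, 2*K) (the range and constant are assumptions for the example, not the exact rewrite):
```python
K = 16
for i in range(2 * K):
    quotient = 1 if i >= K else 0       # compare + select replaces i // K
    remainder = i - K if i >= K else i  # compare + select replaces i % K
    assert quotient == i // K and remainder == i % K
```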
* simple changes to the matmul scripts to get good performance. The specific reason for the performance boost needs further investigation and is tracked
* fix review comments
* change num_warps in the autotuning config for hip to work around an error, and change the rtol so the correctness check passes
0-byte shared mem buffers no longer materialize empty allocation buffers;
previously, these could lead to unnecessary barriers.
note: reduceop code has become quite messy and will require some cleanup
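An illustrative sketch of the idea (this is not Triton's actual allocation code): zero-sized buffers are skipped outright, so they never produce an allocation that barrier insertion would have to guard.
```python
def assign_offsets(buffer_sizes):
    """Map each buffer to a shared-memory offset, skipping empty ones."""
    offsets, cursor = {}, 0
    for buf, size in buffer_sizes.items():
        if size == 0:
            continue  # nothing materializes, so no barrier is needed for it
        offsets[buf] = cursor
        cursor += size
    return offsets
```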
This PR:
- enables test_dot_mfma_vector_load for the fast path in the mfma dot op pipeline
- fixes kernel execution for mfma-enabled GPUs
- disables mfma layout conversion tests on architectures that cannot run them
* [MFMA] [Dot] Support vector loads in normal path
This PR adds generation of vector loads in the normal path of
MFMA dot operand loading.
This requires the shared layout to keep the elements loaded by
one lane contiguous (see the sketch after this list).
* remove redundant refactoring
* fix tests
* extend test with transposed A/B tensors
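A rough sketch of the constraint, with a hypothetical helper (not the PR's code): the vector width a lane can use is capped by how many of its elements sit contiguously in shared memory.
```python
def vector_load_width(elems_per_lane, shared_contiguity, max_vec=8):
    # Largest power-of-two width within the lane's contiguous run.
    width = 1
    while width * 2 <= min(elems_per_lane, shared_contiguity, max_vec):
        width *= 2
    return width

# e.g. 4 contiguous fp16 elements per lane -> a single 64-bit load
assert vector_load_width(elems_per_lane=4, shared_contiguity=4) == 4
```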
* [Dot] [MFMA] Support FP16 output of MFMA dot
This PR adds a cast of the output tensor to the requested data type (see the sketch after this list).
* add tests
* fix test for FMA implementation
* loosen fp16xfp16->fp16 tolerance
* enable FMA fallback for unsupported sizes of dot operation
* rework granularity check
* add constant modifier to granularity
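A minimal sketch of the cast in question (the kernel and shapes are illustrative): MFMA accumulates in fp32, so an fp16 output is produced by casting the accumulator at the end.
```python
import triton
import triton.language as tl

@triton.jit
def dot_fp16_out(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])  # fp16 in
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])  # fp16 in
    acc = tl.dot(a, b)         # MFMA accumulates in fp32
    out = acc.to(tl.float16)   # cast to the requested output type
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], out)
```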
Enabled the backward pass in the fused attention tutorial.
The tolerance when comparing to the naive implementation
had to be changed. The block size is forced to be 64x64
due to the 64 KiB LDS; the default is a block size of 128
for the A100's larger SMEM. This creates differences in the
order of computation and results in a larger gap between
the naive and FA implementations.
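Rough back-of-the-envelope arithmetic behind the block-size choice (tile counts and dtype are assumptions for illustration):
```python
def tile_bytes(bm, bn, elt=2):  # elt=2 bytes for fp16
    return bm * bn * elt

assert tile_bytes(128, 128) == 32 * 1024  # one 128x128 fp16 tile: 32 KiB
assert tile_bytes(64, 64) == 8 * 1024     # one 64x64 fp16 tile:   8 KiB
# A couple of 128x128 tiles already saturate MI100's 64 KiB LDS,
# while several 64x64 tiles fit with room to spare.
```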
In the current link.py, it produces the launcher code as below:
```c++
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0))
return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
// ...
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0))
return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
}
```
Note that, when the input does not match any of the if branches, the
function falls through without returning; the compiler makes it return 0
by default, which equals `CUDA_SUCCESS`. This doesn't match the
expectation.
This PR adds a `return CUDA_ERROR_INVALID_VALUE;` to the tail of launchers, and
it produces code like:
```c++
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){
if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1))
return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk);
return CUDA_ERROR_INVALID_VALUE;
}
```
This requires users to check the result in their application, which I
think matches the initial AOT design.