* Stabilize load vectorization
* fix test failures
* Share one mask check when decomposing a load
* Revert "fix test failures"
This reverts commit 75a461ae3ea4fdd5105dc73675582368eda80bc6.
* Emit vectorized loads
* Fix test failures due to using vectorized load
* Select mfma dimensions and instruction from static table
* Extend mfmaLayout to include version and instrShape
* Simplify generateMFMAOp by searching the mfma instruction in the table
* Fix getNonKDim() and non_k_dim
* Break instrShape into MDim and NDim
* Remove unnecessary xor computations for k-major swizzled tensors
* Support mfma16 and mfma4 in the fast path
* Choose warpsPerCTA according to nonKDim
* Set maxPhase=4 for mfma4
* Fix tests
For now, we do not disable swizzling for k-major tensors
* Remove fastPathComputeOffsetsTy1
* Enable k-major + disabled swizzling in the normal path
This PR:
- simplifies data types generated by `shared->mfma dot op` layout conversions; data is no longer packed into int32 or int64
- reduces code duplication between fast/normal path
- reduces code duplication between operand A and operand B
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
This PR enables 4x4 tile size in MFMA based dot operations.
Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements. This is a limitation of other layouts that appear during result processing (e.g. a blocked layout cannot handle tensors smaller than the wave size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)
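The output-size restriction above can be sketched as a small check. This is only an illustration: `WAVE_SIZE` and `has_enough_outputs` are hypothetical names, and the check covers only the "at least 64 output elements" rule, not any other constraint the compiler may apply.

```python
# Assumption: the "at least 64 output elements" rule corresponds to a wave size of 64.
WAVE_SIZE = 64


def has_enough_outputs(m: int, n: int) -> bool:
    """Return True if an (m x k) x (k x n) dot produces at least WAVE_SIZE outputs."""
    return m * n >= WAVE_SIZE


# Shapes listed as supported in the description.
print(has_enough_outputs(4, 16), has_enough_outputs(16, 4), has_enough_outputs(8, 8))
# Shapes listed as not supported in the description.
print(has_enough_outputs(4, 4), has_enough_outputs(4, 8))
```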
This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
Inline assembly does not take the surrounding instructions into account,
and in general cannot avoid data hazards.
Replacing inline asm with intrinsics solves this problem.
For example, the following code behaved incorrectly in one of the mfma dot tests.
Code generated with inline assembly:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
Correct code generated with intrinsics:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
s_nop 4
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
* use hardware instruction for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
Patch based on @donproc findings and suggested optimization.
Emitting multiple wait ops may confuse ptxas and cause it to fall back to
a conservative mode.
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
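A software conversion from an 8-bit float to a wider type amounts to unpacking sign, exponent, and mantissa fields and rescaling. The sketch below is illustrative only and is not the conversion code from this change. It assumes the type names encode the exponent width and bias (`float8e4b8`: 4 exponent bits, bias 8; `float8e5b16`: 5 exponent bits, bias 16), and it ignores special encodings such as NaN. `decode_fp8` is a hypothetical helper.

```python
def decode_fp8(byte: int, exp_bits: int, bias: int) -> float:
    """Decode one 8-bit float (1 sign bit, exp_bits exponent bits, rest mantissa).

    Assumption: special encodings (NaN and similar) are not handled here.
    """
    man_bits = 7 - exp_bits
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading one, fixed minimum exponent.
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


print(decode_fp8(0x40, exp_bits=4, bias=8))   # assumed float8e4b8 field layout
print(decode_fp8(0x40, exp_bits=5, bias=16))  # assumed float8e5b16 field layout
```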
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces: one that calculates
a modulo schedule and creates async ops based on the IR, and an expander
that generates the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedules without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantics.
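The split between computing a schedule and expanding it can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the pass itself: a schedule is just a mapping from a hypothetical op name to a stage number, and the expander places iteration `i` of an op at step `i + stage`, which yields the prologue, steady state, and epilogue.

```python
def expand(schedule: dict, num_iters: int) -> list:
    """Expand a stage assignment into a list of pipelined steps.

    schedule maps op name -> stage; iteration i of an op runs at step i + stage.
    """
    num_stages = max(schedule.values()) + 1
    steps = []
    for t in range(num_iters + num_stages - 1):
        step = []
        for op, stage in schedule.items():
            i = t - stage
            if 0 <= i < num_iters:
                step.append(f"{op}[{i}]")
        steps.append(step)
    return steps


# Hypothetical schedule: loads run two stages ahead of the dot that consumes them.
schedule = {"async_load": 0, "dot": 2}
for t, step in enumerate(expand(schedule, num_iters=4)):
    print(t, step)
```

Changing only the `schedule` mapping produces a different pipelined sequence without touching `expand`, which mirrors the motivation given above.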
* [MFMA] FP8 and BF8 support
This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the ttg to llvm conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comments fixes
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
* rebase onto improve_fwd_fa
* Fixed a leftover from rebase
* rebase onto improve_fa_fwd
* Reduce tuning space
* Disable bwd with D=128
* Add test for d=128
* Fix an issue with get_best_config when there is only one config
* Added better configs for d=128
* Fix typos
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
### Summary
When Triton GPU IR is lowered into LLVM IR, we can make use of the
constancy information about the result of the elementwise ops to
deduplicate otherwise redundant computation. That is the contribution of
this PR: the constancy is checked and, if possible, some of the values
in LLVM IR are reused multiple times instead of computing equal values
separately.
The change is beneficial for the PyTorch 2 / TorchInductor-generated
Triton code, as the leftmost sub-indices extracted from the flat index
by div / mod operations can be equal, given a sufficiently large 2^n
factor in the rightmost dimension(s). This makes the
computation resulting in those sub-indices redundant. Consequently,
under the necessary constancy conditions, the redundant indexing
arithmetic can be deduplicated. We observe up to a 29% decrease in the
latency of some of our jagged tensor kernels.
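A toy model of the idea, outside of any compiler: when a flat index is split with div / mod and the rightmost dimension is a power of two, the leftmost sub-index is constant over aligned runs of elements, so it can be computed once per run and reused. `dedup_outer` and the `constancy` parameter are hypothetical names used only for this sketch, and the runs are assumed to be aligned to `constancy`.

```python
def dedup_outer(flat_indices, inner, constancy):
    """Compute flat // inner once per aligned run of `constancy` elements.

    Returns the per-element results and the number of divisions performed.
    """
    out = []
    divisions = 0
    for start in range(0, len(flat_indices), constancy):
        run = flat_indices[start:start + constancy]
        value = run[0] // inner  # computed once, reused for the whole run
        divisions += 1
        out.extend([value] * len(run))
    return out, divisions


flat = list(range(32))
outer, divisions = dedup_outer(flat, inner=8, constancy=8)
print(outer == [f // 8 for f in flat], divisions)
```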
[BACKEND] Improve printf.
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.
This caused a few problems.
- nvptx printf can only handle 32 arguments; if you pass more than
that, it prints garbage. So if a thread had more than 32 values, you
couldn't print them (issue #2486).
- The order of the values within the Triton program (GPU thread block)
is an implementation detail -- it depends on the layout the compiler
assigns to a tensor. So this also prevented you from interpreting
the printed output.
To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value. And each value gets its own line to avoid
passing too many args to printf.
Example output:
```
pid (0, 1, 2) idx (36, 127) x: 42
```
If you want to observe all the values in a tensor in order, you can grep
and then sort the output.
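As an alternative to grep and sort, a small script can parse the lines and sort them numerically. This sketch assumes the exact line format shown in the example output above. `sort_printed` is a hypothetical helper, not part of Triton.

```python
import re

# Matches lines such as: pid (0, 1, 2) idx (36, 127) x: 42
LINE_RE = re.compile(
    r"pid \((?P<pid>[^)]*)\) idx \((?P<idx>[^)]*)\) (?P<label>.*): (?P<value>.*)"
)


def sort_printed(lines, label):
    """Return (pid, idx, value) rows for `label`, sorted by pid and then index."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("label") == label:
            pid = tuple(int(p) for p in m.group("pid").split(","))
            idx = tuple(int(i) for i in m.group("idx").split(","))
            rows.append((pid, idx, m.group("value")))
    return sorted(rows)


lines = [
    "pid (0, 0, 0) idx (10, 2) x: 7",
    "pid (0, 0, 0) idx (2, 5) x: 3",
]
for row in sort_printed(lines, "x"):
    print(row)
```

Parsing the indices as integers avoids the lexicographic ordering a plain text sort would give, where index 10 sorts before index 2.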
We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.
Fixes #2486.