-Introduce WmmaEncodingAttr for WMMA output
-Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
-Provide a flag tho check if wmma instructions are supported by target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
This PR enables 4x4 tile size in MFMA based dot operations.
Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize).
For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
Following dots are not supporter: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)
This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
* use hardware instruction for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
* [Tutorial] Fix post IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
fix more conflits
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attension.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H in the key of autotuner
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
This allow pipelining when a load is used by multiple dot in a loop.
Relax the condition to pipeline dot operands for mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However this
expose a performance problem due to the wmma pipelining as ptxas will
now fall back to serial wgmma. A follow up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF
I've add an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!
To make this PR, I ran my modified yapf over the repository, then looked
over the full diff. Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want. I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.
Overall, trailing `#` was sufficient to get formatting similar to our
current code. I didn't have to disable yapf anywhere.
---------
Co-authored-by: Phil Tillet <phil@openai.com>