* Stablize load vectorization
* fix test failures
* Shared one mask check when decomposing a load
* Revert "fix test failures"
This reverts commit 75a461ae3ea4fdd5105dc73675582368eda80bc6.
* Emit vectorized loads
* Fix test failures due to using vectorized load
* Select mfma dimensions and instruction from static table
* Extend mfmaLayout to include version and instrShape
* Simplify generateMFMAOp by searching the mfma instruction in the table
* Fix getNonKDim() and non_k_dim
* Break instrShape into MDim and NDim
* First commit
* Implement DotSlicing pass.
* small fixes
* Support chained dot in DotSlicingPass (second GEMM in FA)
* Add lit test for FA dot slicing
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
* Remove unnecessary xor computations for k-major swizzled tensors
* Support mfma16 and mfma4 in the fast path
* Choose warpsPerCTA according to nonKDim
* Set maxPhase=4 for mfma4
* Fix tests
For now, we do not disable swizzling for k-major tensors
* Remove fastPathComputeOffsetsTy1
* Enable k-major + disabled swizzling in the normal path
-Introduce WmmaEncodingAttr for WMMA output
-Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
-Provide a flag tho check if wmma instructions are supported by target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
This PR:
- simplifies data types generated by `shared->mfma dot op` layout conversions. Do not pack data types in int32 or int64
- reduce code duplication between fast/normal path
- reduce code duplication between operand A and operand B
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
This PR enables 4x4 tile size in MFMA based dot operations.
Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize).
For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
Following dots are not supporter: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)
This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
Inline assembly does not take into account instructions around,
and in general can not avoid data hazards.
Replacing inline asm with intrinsics solves this problem.
This particular code behaved incorrectly in one of mfma dot tests:
Code generated with help of inline assembly:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
Correct code generated with intrinsics:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
s_nop 4
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
* use hardware instruction for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
fix more conflits
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attension.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H in the key of autotuner
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Patch based on @donproc findings and suggested optimization.
Emitting multiple wait op may confuse ptxas and cause it to fallback to
a conservative mode.
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
This allow pipelining when a load is used by multiple dot in a loop.
Relax the condition to pipeline dot operands for mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However this
expose a performance problem due to the wmma pipelining as ptxas will
now fall back to serial wgmma. A follow up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
This was a workaround for a bug exposed in test_core when generating
ext_load in NVPTX. The backend bug seems fixed in latest LLVM upgrade so
removing the workaround.
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces one calculating
a modulo schedule and create async ops based on the IR and an expander
that will generate the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedule without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for matmul case matches rougly the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
* [RemoveLayoutConversions] Fix reduce failed infer type error
This PR fixes layout propagation algorithm in RemoveLayoutConversions pass.
In some cases during rewriteSlice process, reduce operation with multiple outputs
rewrites only one output layout, which breaks assumption that both outputs should have same layout.
This change is a minimal part of https://github.com/openai/triton/pull/2331 change and
small lit test for regression testing.
* fix combine test
* Fix issue with incorrect inference layout of make_range output result