- Introduce WmmaEncodingAttr for WMMA output
- Introduce the BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
- Provide a flag to check whether WMMA instructions are supported by the target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
This PR:
- simplifies the data types generated by `shared->mfma dot op` layout conversions: values are no longer packed into int32 or int64
- reduces code duplication between the fast and normal paths
- reduces code duplication between operand A and operand B
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in the MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (e.g. the blocked layout cannot handle tensors smaller than the wavefront size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8).
This is a first version of the dot using mfma 4x4 instructions; it still involves redundancy and reductions.
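A minimal sketch of a tile shape that would exercise this path, assuming the relaxed small-tile shape checks described above; the kernel and block sizes are illustrative only.

```python
import triton
import triton.language as tl

@triton.jit
def small_tile_dot(a_ptr, b_ptr, c_ptr,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
    # With BLOCK_M=16, BLOCK_K=64, BLOCK_N=4 this is the (16x64) x (64x4) -> (16x4)
    # case listed above: the result keeps 64 output elements, so the surrounding
    # blocked layouts can still handle it.
    c = tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], c)
```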
fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
Patch based on @donproc's findings and suggested optimization.
Emitting multiple wait ops may confuse ptxas and cause it to fall back to
a conservative mode.
* add two fp8 data types, `tl.float8e4b8` and `tl.float8e5b16`, to Triton.
* add software type conversion between `tl.float8e4b8`/`tl.float8e5b16` and `fp16` (see the sketch below).
* change flash attention to support fp8 for q/k.
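A hedged sketch of how the new types might be used from the Python side; the bitcast-then-convert pattern mirrors how other fp8 types are exercised in tests, and the details here are illustrative rather than the exact code added by this PR.

```python
import triton
import triton.language as tl

@triton.jit
def fp8e4b8_to_fp16(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    # Load raw bytes, reinterpret them as the AMD fp8 format, then use the
    # software conversion to widen to fp16.
    raw = tl.load(x_ptr + offs, mask=mask)
    x_fp8 = raw.to(tl.float8e4b8, bitcast=True)
    tl.store(y_ptr + offs, x_fp8.to(tl.float16), mask=mask)
```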
Refactor the pipeliner pass to make it more generic. The main
change is that the pipeliner is now broken into two pieces: one that calculates
a modulo schedule and creates async ops based on the IR, and an expander
that generates the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it allows us to
create different schedules without having to change the expander, and it
allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice ops for the
alloc. We should probably change shared allocs to use memory semantics.
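A purely conceptual Python sketch of the scheduler/expander split; the names and data structures are hypothetical and do not correspond to the actual C++ pass.

```python
def compute_schedule(ops, num_stages=2):
    """Hypothetical modulo schedule: loads are hoisted into stage 0,
    compute ops run in the last stage."""
    return {op: 0 if op.startswith("load") else num_stages - 1 for op in ops}

def expand(ops, schedule, num_iters):
    """Expand the schedule into (logical_iteration, op) pairs.
    Early ticks form the prologue, late ticks the epilogue, and the
    middle ticks are the steady-state pipelined loop body."""
    num_stages = max(schedule.values()) + 1
    emitted = []
    for t in range(num_iters + num_stages - 1):
        for op in ops:
            it = t - schedule[op]
            if 0 <= it < num_iters:
                emitted.append((it, op))
    return emitted

# Loads for iteration i+1 are issued while the dot of iteration i executes.
ops = ["load_a", "load_b", "dot"]
print(expand(ops, compute_schedule(ops), num_iters=4))
```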
This is a combination of 4 commits.
Works as StandAlone and Backend
Works as StandAlone and Backend
This is a combination of 13 commits.
Works StandAlone and as Backend
This is a combination of 7 commits.
backend set default dir with flag
move bitcode to backend dir
copy backend
save
empty test works in backend mode
enable backend mode when copying to upstream
clean up
fix failure
minimize diff
add skip function
fix bug with corrupted dwarf exp
match num_warps
fix multi threaded test issue
move bitcode file out of lib
move backend to python/triton/third_party/hip
move libhsa
backend works again
restart ci
clean upstream location first before copy
match scripts
fix new error
memoize backend stuff
fix bug
* this PR adds a third-party backend for Triton that works on AMD GPUs
* it exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
When propagating layouts we were generating a view op with a mismatching
total number of elements per thread. Lowering such an op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimization in the future.
By default, ptxas will fuse mul/add pairs into fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option that can be
used to disable FP fusion in both locations.
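Assuming the new option is exposed as a per-launch kwarg (the name `enable_fp_fusion` is an assumption), usage might look like the following sketch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def muladd(x_ptr, y_ptr, z_ptr, out_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    z = tl.load(z_ptr + offs, mask=mask)
    # By default ptxas may contract x * y + z into a single fma.
    tl.store(out_ptr + offs, x * y + z, mask=mask)

x, y, z = (torch.randn(4096, device="cuda") for _ in range(3))
out = torch.empty_like(x)
grid = (triton.cdiv(4096, 1024),)
# Hypothetical kwarg: keep the separate mul and add (no fma contraction).
muladd[grid](x, y, z, out, 4096, BLOCK=1024, enable_fp_fusion=False)
```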
Fix the dependencies of the wgmma_wait op to prevent the scheduler from moving
it past the uses of the wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses; otherwise LLVM
is free to re-order them.
This allows us to remove a workaround that prevented the re-ordering. We can
also remove the wait op added in the loop during pipelining.
Also fix the descriptor calculation for wgmma: we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by the different timing caused by
these changes. We shouldn't insert operations between the loop and the
async_wait, or we may get race conditions.
Support chains of mma ops with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
* [Alloc] Enhanced for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* fix for allocation test
* added test
* fixed mfma_enc printer
* fixed test
MMA V3 supports taking operand A from registers. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps, and add the lowering
for it.
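A sketch of the chained-matmul pattern this targets, in the spirit of the attention tutorial; shapes, dtypes, and names are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def chained_dot(q_ptr, k_ptr, v_ptr, o_ptr,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_D: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_d = tl.arange(0, BLOCK_D)
    q = tl.load(q_ptr + offs_m[:, None] * BLOCK_D + offs_d[None, :])
    k = tl.load(k_ptr + offs_d[:, None] * BLOCK_N + offs_n[None, :])
    v = tl.load(v_ptr + offs_n[:, None] * BLOCK_D + offs_d[None, :])
    # The first dot's result stays in registers; with MMA V3 taking operand A
    # from registers, the second dot can consume it without a round trip
    # through shared memory.
    p = tl.dot(q, k)
    o = tl.dot(p.to(tl.float16), v)
    tl.store(o_ptr + offs_m[:, None] * BLOCK_D + offs_d[None, :], o)
```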
When there is a chain of mma ops we want to pick the same shape to avoid
conversions. This improves the detection when the chain goes through for loops.
This fixes a crash in the backward-attention tutorial.
We might want to change this logic and convert the format to allow more
efficient MMA at some point.
1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw
number of warps, to avoid communication among warps that handle the same
piece of data.
2. When there is a single warp on the axis, use warp intrinsics for
communication and skip shared memory (a user-level reduction that hits this path is sketched below).
A follow-up PR is needed for code cleanup.
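A reduction kernel of the shape below would exercise this path; whether the warp-shuffle lowering is taken depends on the launch configuration, and the block sizes and `num_warps` here are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def row_sum(x_ptr, out_ptr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    x = tl.load(x_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :])
    # Reduction along axis 1; when a single warp owns the reduced axis,
    # the lowering can use warp intrinsics and skip shared memory.
    s = tl.sum(x, axis=1)
    tl.store(out_ptr + offs_m, s)

# Illustrative launch: row_sum[(1,)](x, out, BLOCK_M=32, BLOCK_N=64, num_warps=1)
```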
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
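A sketch of the user-facing pattern, assuming the initial accumulator is passed to `tl.dot` via an `acc` argument; the exact spelling of the lower-precision-accumulation flag is not shown since it is not given above.

```python
import triton
import triton.language as tl

@triton.jit
def kloop_dot(a_ptr, b_ptr, c_ptr, K,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * K + k + offs_k[None, :])
        b = tl.load(b_ptr + (k + offs_k[:, None]) * BLOCK_N + offs_n[None, :])
        # Feeding the running accumulator into the dot lets the compiler keep
        # accumulating in place; on Hopper fp8, accumulation may happen in
        # lower precision unless the new flag disables that.
        acc = tl.dot(a, b, acc=acc)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], acc)
```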
.. instead of an option.
This partially addresses https://github.com/openai/triton/issues/2265 so that
printing a pass pipeline in textual form no longer crashes.
It is not a proper solution for the fact that pass results should be
stored in the IR rather than in a pointer argument.
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100 (a minimal bf16 dot is sketched below).
* fix tests, fix the mfma encoding comment, fix the switch between mfma versions
* replace kDim from the mfma layout with kWidth from the dotOp layout
* rebase fix
* fix the mfma-to-dot-op shortcut for bfloat16
* fix review comments
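A minimal bf16 dot that this lowering targets on MI100; the kernel is a sketch and the block sizes are illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def bf16_dot(a_ptr, b_ptr, c_ptr,
             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    # a_ptr/b_ptr point at bfloat16 data; on MI100 the dot can map onto
    # mfma_f32_32x32x4bf16 and accumulates in fp32.
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])
    c = tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], c)
```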