This PR:
- simplifies the data types generated by `shared -> mfma dot op` layout conversions: values are no longer packed into int32 or int64
- reduces code duplication between the fast and normal paths
- reduces code duplication between operand A and operand B
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
This PR enables the 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts appearing during result processing (e.g. the blocked layout cannot handle tensors smaller than the wavefront size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), or (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8).
This is a first version of dot lowering using mfma 4x4 instructions; it still involves redundant computation and reductions.
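For illustration, a kernel using one of the supported shapes could look like the following minimal sketch (pointer names, row-major strides, and the relaxed minimum-size checks on `tl.dot` are assumptions of this example):
```python
import triton
import triton.language as tl

@triton.jit
def small_dot_kernel(a_ptr, b_ptr, c_ptr):
    # (16 x 64) x (64 x 4) -> (16 x 4): 64 output elements, the minimum allowed
    offs_m = tl.arange(0, 16)
    offs_n = tl.arange(0, 4)
    offs_k = tl.arange(0, 64)
    a = tl.load(a_ptr + offs_m[:, None] * 64 + offs_k[None, :])  # row-major A tile
    b = tl.load(b_ptr + offs_k[:, None] * 4 + offs_n[None, :])   # row-major B tile
    c = tl.dot(a, b)  # lowered via mfma 4x4 instructions in this fork
    tl.store(c_ptr + offs_m[:, None] * 4 + offs_n[None, :], c)
```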
Inline assembly is not scheduled against the surrounding instructions,
so in general it cannot avoid data hazards.
Replacing the inline asm with intrinsics solves this problem.
This particular code behaved incorrectly in one of the mfma dot tests.
Code generated with the help of inline assembly:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ; reads v4 before the mfma has written it (data hazard)
```
Correct code generated with intrinsics:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
s_nop 4 ; wait for the mfma result in v[4:7]
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
This PR adds:
- a verbose tuning mode that prints the standard output of compilation and tuning calls
- collection of information about failed compilations
- printing of the correctness check result as a word
- splitting of dimensions in generated scripts with "-"
- a gpu_ids option to select particular GPUs
* use hardware instructions for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
* [Tutorial] Fix post IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper-specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H to the autotuner key
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
* add two fp8 data types, `tl.float8e4b8` and `tl.float8e5b16`, to Triton
* add software type conversion between `tl.float8e4b8`/`tl.float8e5b16` and `fp16`
* change flash attention to support fp8 in q/k
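As a minimal, illustrative sketch of the new types (kernel and pointer names are hypothetical; conversions go through the usual `.to(...)` cast):
```python
import triton
import triton.language as tl

@triton.jit
def fp8_roundtrip_kernel(x_ptr, y_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)        # fp16 input
    x_fp8 = x.to(tl.float8e4b8)      # software fp16 -> fp8 conversion
    y = x_fp8.to(tl.float16)         # and back to fp16
    tl.store(y_ptr + offs, y)
```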
* [RemoveLayoutConversions] Fix reduce failed infer type error
This PR fixes the layout propagation algorithm in the RemoveLayoutConversions pass.
In some cases during the rewriteSlice process, a reduce operation with multiple outputs
had only one output layout rewritten, which breaks the assumption that all outputs share the same layout.
This change is a minimal part of https://github.com/openai/triton/pull/2331 plus a
small lit test for regression testing.
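For context, a reduce with multiple outputs is what `tl.reduce` produces when given a tuple of tensors, e.g. an argmax that returns both the value and its index (an illustrative sketch, not the failing case from the lit test):
```python
import triton
import triton.language as tl

@triton.jit
def argmax_combine(v0, i0, v1, i1):
    # keep the larger value together with its index
    keep_first = v0 > v1
    return tl.where(keep_first, v0, v1), tl.where(keep_first, i0, i1)

@triton.jit
def argmax_kernel(x_ptr, val_ptr, idx_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    # a reduce op with two results: both must end up with the same layout
    val, idx = tl.reduce((x, offs), 0, argmax_combine)
    tl.store(val_ptr, val)
    tl.store(idx_ptr, idx)
```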
* fix combine test
* Fix issue with incorrect layout inference for the make_range output result
This is a combination of 4 commits.
Works as StandAlone and Backend
Works as StandAlone and Backend
This is a combination of 13 commits.
Works StandAlone and as Backend
This is a combination of 7 commits.
backend set default dir with flag
move bitcode to backend dir
copy backend
save
empty test work in backendmode
enable backend mode when copying to upstream
clean up
fix failure
minimize diff
add skip function
fix bug with corrupted dwarf exp
match num_warps
fix multi threaded test issue
move bitcode file out of lib
move backend to python/triton/third_party/hip
move libhsa
backend works again
restart ci
clean upstream location first before copy
match scripts
fix new error
memoize backend stuff
fix bug
* this PR adds a third-party backend for Triton that works on AMD GPUs
* this exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This is a combination of 9 commits.
Empty Kernel Works rebase
minimize diff: add libs
move to backend dir
match python
add includes
move everything to backend dir
match include and lib
create a backend build mode
simplify backend
* [MFMA] FP8 and BF8 support
This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the TritonGPU-to-LLVM conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comment fixes
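As an illustrative sketch of what this enables (assuming fp8 corresponds to `tl.float8e4b8` here, and that the inputs arrive as int8 buffers that are bitcast inside the kernel):
```python
import triton
import triton.language as tl

@triton.jit
def fp8_dot_kernel(a_ptr, b_ptr, c_ptr,
                   M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    # reinterpret the raw int8 bytes as fp8 operands
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :]).to(tl.float8e4b8, bitcast=True)
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :]).to(tl.float8e4b8, bitcast=True)
    c = tl.dot(a, b)  # may lower to a float8 mfma with fp32 accumulation
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```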
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>