Vinayak Gokhale
c2766bbd5f
Merge changes from upstream FA bwd kernel ( #444 )
...
* Add optimized FA bwd from upstream
* Add autotuning
* Change loads and stores to use block ptrs
* Cleanup
2024-01-05 15:12:05 -06:00
oplavsic
bcea3051af
Add support for MFMA layout to view_slice op ( #442 )
...
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-03 12:13:36 -06:00
oplavsic
6a520566a3
Add view_slice ttgir instruction ( #427 )
...
* Add view_slice op in ttgir
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2024-01-02 15:40:11 -06:00
Alexander Efimov
98589ac013
[MFMA] Remove CTA related code from layout ( #429 )
...
This PR removes the CTALayout attribute from the MFMA layout because it is NV-specific.
2023-12-27 18:01:28 +01:00
Jack Taylor
1e2fd0dd1a
Update hip_backend to use libhsa-runtime for arch info ( #411 )
...
Brings in path changes for PyTorch Triton wheels.
Co-authored-by: jayfurmanek <Jason.Furmanek@amd.com>
2023-12-21 15:40:57 +00:00
Vinayak Gokhale
0248bdb29d
Minor edits to HBM bandwidth measurement kernel ( #434 )
...
* Change units to GiB/s from GB/s
* Run both with and w/o bounds check
2023-12-21 06:14:31 -06:00
jayfurmanek
16281f02f4
[ROCM] drop GIL for launch, and set value=false upon pointer error ( #426 )
2023-12-19 08:34:51 -06:00
Vinayak Gokhale
422d7096ce
Add kernel to check HBM BW ( #431 )
...
Add kernel to check HBM BW performance
2023-12-18 21:25:21 -06:00
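As a rough illustration of what such a bandwidth measurement looks like (a minimal sketch, not the actual kernel from #431; measure_bandwidth and copy_kernel are illustrative names), a Triton copy kernel can time a read+write stream and report GiB/s (2^30 bytes/s), the unit the follow-up #434 switches to:

import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program copies one BLOCK_SIZE chunk; the mask handles the tail.
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, x, mask=mask)

def measure_bandwidth(n_bytes=1 << 30, block=1024):
    n = n_bytes // 2                                   # fp16 elements
    src = torch.randn(n, device="cuda", dtype=torch.float16)
    dst = torch.empty_like(src)
    grid = (triton.cdiv(n, block),)
    ms = triton.testing.do_bench(lambda: copy_kernel[grid](src, dst, n, BLOCK_SIZE=block))
    moved = 2 * n * src.element_size()                 # one read + one write per element
    print(f"{moved / (ms * 1e-3) / 2**30:.1f} GiB/s")  # GiB/s = 2**30 bytes per second
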
joviliast
af15da2f84
Support WMMA layout in TritonAMDGPUAccelerateMatmulPass
...
- Introduce WmmaEncodingAttr for WMMA output
- Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
- Provide a flag to check if wmma instructions are supported by the target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
Vinayak Gokhale
b7a412d82a
Add support for ALiBi-style attention bias ( #417 )
...
Add support for matrix and vector bias to FA.
2023-12-15 16:28:37 -06:00
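A rough sketch of the bias idea outside the kernel, assuming the common distance-proportional form of ALiBi (the actual FA change accepts a precomputed matrix or per-row vector bias and adds it to the qk tile before the softmax; alibi_bias is an illustrative helper, not part of the repo):

import torch

def alibi_bias(n_queries, n_keys, slope):
    # ALiBi-style bias: a linear penalty proportional to the query/key distance.
    q_pos = torch.arange(n_queries)[:, None]
    k_pos = torch.arange(n_keys)[None, :]
    return -slope * (q_pos - k_pos).abs().to(torch.float32)

scores = torch.randn(8, 8)                     # qk^T for one head, illustrative
scores = scores + alibi_bias(8, 8, slope=0.5)  # bias applied before softmax
probs = torch.softmax(scores, dim=-1)
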
jayfurmanek
29847e9bb1
Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117
...
Ifu 231117
2023-12-15 09:09:40 -06:00
Shucai Xiao
521f425fbf
add bitcode for gfx941 and gfx942 ( #403 )
...
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
2023-12-14 08:19:23 -06:00
Michael Melesse
26c3f99073
ROCM IFU: disable test_reduce_layouts
2023-12-13 17:12:39 -06:00
Michael Melesse
7a1f54645e
ROCM IFU: remove old tests
2023-12-13 15:30:55 +00:00
Michael Melesse
c7b62d5ec5
ROCM IFU: remove test_ext_elemwise
2023-12-12 22:58:56 -06:00
Michael Melesse
6efc013e46
ROCM IFU: fix AtomicCASOpConversion segfault
2023-12-12 17:40:31 -06:00
jayfurmanek
a42ac260aa
Merge branch 'triton-mlir' into ifu-231117
2023-12-12 14:24:11 -06:00
Jason Furmanek
160dfe838e
ROCM IFU: Fix print and assert
2023-12-12 19:30:01 +00:00
Alexander Efimov
605a90c58e
[MFMA] Support tile size 4x4 version 1 ( #413 )
...
This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (i.e. the blocked layout cannot handle tensors smaller than the wave size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4) and (4x64) x (64x8) -> (4x8).
This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
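To make the output-size constraint concrete, a tiny sketch of the rule described in the commit message (the function name is illustrative, not from the repo, and it checks only the output-element rule; other shape requirements such as K tiling also apply):

def mfma_4x4_output_ok(m, n, wave_size=64):
    # The result of the dot must have at least wave_size (64) elements,
    # because the blocked layouts used while processing the result cannot
    # hold smaller tensors.
    return m * n >= wave_size

assert mfma_4x4_output_ok(4, 16)      # (4x64) x (64x16) -> (4x16): supported
assert mfma_4x4_output_ok(8, 8)       # (8x64) x (64x8)  -> (8x8):  supported
assert not mfma_4x4_output_ok(4, 4)   # (4x128) x (128x4) -> (4x4): not supported
assert not mfma_4x4_output_ok(4, 8)   # (4x64) x (64x8)  -> (4x8):  not supported
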
Michael Melesse
64a0924381
ROCM IFU: remove ref to test_elementwise
2023-12-07 13:31:59 -06:00
Vinayak Gokhale
1d6b919897
Bugfix: Wrong boundary condition on qk GEMM
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
f6969f4bb3
Correct that loop lo is a multiple of BLOCK_N
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
0ef865508c
Update description
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57
Remove slicing for out in save for bwd
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569
Mask vs pad for non-power-of-2 sequence lengths
...
Padding results in extra memory allocation, which is slower.
Masking results in better performance.
2023-12-04 10:11:41 -06:00
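The masking approach looks roughly like this in Triton (a generic sketch, not the FA kernel itself; masked_row_sum is an illustrative name): the block size stays a power of 2, and out-of-range elements are suppressed with a load mask rather than by padding the input tensor.

import triton
import triton.language as tl

@triton.jit
def masked_row_sum(x_ptr, out_ptr, seqlen, BLOCK_N: tl.constexpr):
    # BLOCK_N is a power of 2 >= seqlen; out-of-range lanes are masked off
    # instead of padding the input, so no extra allocation is needed.
    offs = tl.arange(0, BLOCK_N)
    mask = offs < seqlen
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr, tl.sum(x, axis=0))
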
Vinayak Gokhale
d5028079b7
Add FA support for non pow2 seqlen
2023-12-04 10:11:41 -06:00
Jason Furmanek
64f559771f
ROCM IFU: Fix test_core_amd.py::test_reduce_layouts
2023-11-28 04:02:48 +00:00
Jason Furmanek
f5f6b3c0a3
ROCM IFU: Add get_version_key for ROCM backend
2023-11-28 00:11:44 +00:00
Jason Furmanek
71547e4fdb
ROCM IFU: Fixes for kwargs
2023-11-28 00:09:46 +00:00
jayfurmanek
99aa1f4f75
Merge branch 'triton-mlir' into ifu-231117
2023-11-27 07:44:04 -06:00
Jason Furmanek
968a35fbf0
Fix merge conflict error
2023-11-27 13:33:23 +00:00
Shucai Xiao
d9219e0eba
use hw for fp8 type conversion ( #386 )
...
* use hardware instruction for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-11-24 10:26:40 -06:00
Jason Furmanek
a08dafe7fe
Initial commit to resolve merge conflicts
2023-11-20 22:41:03 +00:00
Jason Furmanek
5c87f363e4
Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
...
Conflicts:
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Dialect/TritonGPU/IR/Dialect.cpp
python/setup.py
python/test/unit/language/assert_helper.py
python/test/unit/operators/test_flash_attention.py
python/test/unit/runtime/test_subproc.py
python/triton/compiler/compiler.py
python/triton/language/semantic.py
python/triton/runtime/autotuner.py
python/triton/runtime/jit.py
python/tutorials/03-matrix-multiplication.py
python/tutorials/05-layer-norm.py
python/tutorials/06-fused-attention.py
python/tutorials/11-grouped-gemm.py
test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Alexander Efimov
dfb76540b4
[Tutorial] Fix post IFU issues with FA ( #398 )
...
* [Tutorial] Fix post IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-17 01:28:49 +00:00
Alexander Efimov
096def0c9b
[Test] Disable mma layout for amd hardware ( #384 )
...
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-17 01:28:49 +00:00
Lixun Zhang
181bdbd410
Benchmark FA on 2 GCDs ( #393 )
2023-11-17 01:28:49 +00:00
Jason Furmanek
44b155f41b
ROCM IFU: Resolve merge conflicts in tutorial 06
...
Resolve merge conflicts in tutorial 06 - 2
2023-11-17 01:28:40 +00:00
Jason Furmanek
484852876e
Resolve merge conflicts; AMD adjustments for new LLVM version
2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267
Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
...
Conflicts:
bin/triton-translate.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
python/triton/compiler/compiler.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Lixun Zhang
1af893d8a2
[FRONTEND] Add input dtypes to autotuning key ( #2534 ) ( #374 )
...
* [FRONTEND] Add input dtypes to autotuning key (#2534 )
* Fix conflict in 06-fused-attention
* Fix get_best_config in FA-transV.py
* Fix leftover get_best_config()
---------
Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
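User code does not have to change for this; a sketch of what the upstream change affects (the kernel and config values below are illustrative): the autotuner's cache key previously depended only on the values of the arguments listed in `key`, and with input dtypes folded into the key the same kernel is re-tuned when called with, e.g., fp16 vs fp32 inputs.

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # input dtypes now also become part of the cache key
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    tl.store(out_ptr + offs, alpha * tl.load(x_ptr + offs, mask=mask), mask=mask)
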
Jason Furmanek
85216ea5c5
ROCM IFU: Resolve conflicts in FA tutorial
2023-11-07 04:29:45 +00:00
Alexander Efimov
aefc94bd25
ROCM IFU: fix test_dot_mfma_vector_load test
...
fix for previous commit
2023-11-07 04:29:45 +00:00
Jason Furmanek
39e8901d7a
ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
...
fix merge error
fix dot
fix make_range
additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
c3132eeda8
ROCM IFU: Third-party fixes: preload ROCDL
...
Additional 3rd-party fix
Remove redundant mfma_supported defines
2023-11-07 03:29:16 +00:00
Jason Furmanek
3a6dc5ad8d
resolve some merge conflicts
...
fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f
Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
...
Conflicts:
.gitignore
.gitmodules
README.md
bin/triton-translate.cpp
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
include/triton/Target/AMDGCN/AMDGCNTranslation.h
include/triton/Target/HSACO/HSACOTranslation.h
lib/Analysis/Allocation.cpp
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/Utility.cpp
lib/Conversion/TritonGPUToLLVM/Utility.h
lib/Dialect/TritonGPU/IR/Dialect.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Target/HSACO/CMakeLists.txt
lib/Target/HSACO/HSACOTranslation.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/test/unit/language/test_core.py
python/test/unit/operators/test_flash_attention.py
python/triton/compiler/compiler.py
python/triton/compiler/make_launcher.py
python/triton/language/semantic.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
python/tutorials/11-grouped-gemm.py
test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211
Add OptimizeEpilogue pass. ( #346 )
...
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H in the key of autotuner
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Weixing Zhang
34b89a1173
[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output ( #2581 )
...
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
for the 2-tt.dot fusion scenario. But this is not optimal for cases where
tt.dot doesn't have enough parallelism on the row dimension but does on the
column dimension.
2023-11-03 16:40:03 -04:00
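The upstream commit implements this in the C++ optimizer; a toy Python sketch of the idea (names, tile sizes, and the exact splitting rule are illustrative, not the actual heuristic): instead of always returning [numWarps, 1], split the warps between rows and columns according to how much parallelism the MMA output shape offers on each dimension.

def pick_warps_per_cta(num_warps, m, n, tile_m=16, tile_n=16):
    # Repeatedly give a factor of 2 to whichever dimension still has more
    # tiles left to parallelize over.
    warps = [1, 1]
    remaining = num_warps
    while remaining > 1:
        if m / (warps[0] * tile_m) >= n / (warps[1] * tile_n):
            warps[0] *= 2
        else:
            warps[1] *= 2
        remaining //= 2
    return warps

print(pick_warps_per_cta(8, m=16, n=1024))  # -> [1, 8]: parallelism is on columns
print(pick_warps_per_cta(8, m=1024, n=64))  # -> [8, 1]: parallelism is on rows
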
Thomas Raoux
6ac9d51ff0
[OPTIMIZATION] Enable pipelining for bwd flash attention ( #2590 )
...
This allows pipelining when a load is used by multiple dots in a loop.
Relax the condition to pipeline dot operands for the mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However, this
exposes a performance problem due to the wmma pipelining, as ptxas will
now fall back to serial wgmma. A follow-up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF
2023-11-03 11:46:51 -07:00