Commit Graph

273 Commits

Author SHA1 Message Date
Ognjen
3a12d9d269 fix 2024-01-31 14:13:36 +00:00
Ognjen
171a67e837 Add scheduling pass 2024-01-25 18:07:45 +00:00
Lixun Zhang
02a2f24dd5 [Backend] Refactor mfma selection (#441)
* Select mfma dimensions and instruction from static table

* Extend mfmaLayout to include version and instrShape

* Simplify generateMFMAOp by searching the mfma instruction in the table

* Fix getNonKDim() and non_k_dim

* Break instrShape into MDim and NDim
2024-01-16 21:05:35 -06:00
oplavsic
760ac8441a Dot slicing pass (#440)
* First commit

* Implement DotSlicing pass.

* small fixes

* Support chained dot in DotSlicingPass (second GEMM in FA)

* Add lit test for FA dot slicing

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-16 14:25:10 -06:00
Lixun Zhang
2e217c5a5c [Backend] Refactor sharedToDotOperandMFMA lowering (#439)
* Remove unnecessary xor computations for k-major swizzled tensors

* Support mfma16 and mfma4 in the fast path

* Choose warpsPerCTA according to nonKDim

* Set maxPhase=4 for mfma4

* Fix tests

For now, we do not disable swizzling for k-major tensors

* Remove fastPathComputeOffsetsTy1

* Enable k-major + disabled swizzling in the normal path
2024-01-12 12:50:18 -06:00
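A minimal sketch of the xor-based swizzling referenced in the commit above, assuming Triton's usual (perPhase, maxPhase, vec) shared-layout parameters; the real lowering lives in the C++ backend and also covers the disabled-swizzling k-major case:

```python
# Hedged sketch: the formula mirrors Triton's shared-layout swizzling
# convention; this PR sets maxPhase=4 for mfma4.
def swizzled_col(row: int, col: int, per_phase: int, max_phase: int, vec: int) -> int:
    phase = (row // per_phase) % max_phase  # rows per_phase apart share a phase
    group = (col // vec) ^ phase            # xor the vector index with the phase
    return group * vec + col % vec          # keep the position within the vector

print(swizzled_col(row=5, col=8, per_phase=1, max_phase=4, vec=4))  # -> 12
```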
Ilya V
2e01bf08e9 [HotFix] Fix dot op for RDNA3 architecture (#451)
Disabled BlockedToWMMA layout transformation until WMMA is supported completely

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2024-01-10 08:58:57 -06:00
oplavsic
6a520566a3 Add view_slice ttgir instruction (#427)
* Add view_slice op in ttgir

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2024-01-02 15:40:11 -06:00
Alexander Efimov
98589ac013 [MFMA] Remove CTA related code from layout (#429)
This PR removes the CTALayout attribute from the MFMA layout because it is NV-specific.
2023-12-27 18:01:28 +01:00
joviliast
5c182aa73a Disable WMMA dot transformation
Enabled only in lit tests.
Revert after WMMA is completely enabled.

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
joviliast
af15da2f84 Support WMMA layout in TritonAMDGPUAccelerateMatmulPass
- Introduce WmmaEncodingAttr for WMMA output
- Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
- Provide a flag to check if wmma instructions are supported by the target

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
jayfurmanek
29847e9bb1 Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117
Ifu 231117
2023-12-15 09:09:40 -06:00
Alexander Efimov
f2afd65e8c [MFMA] Refactor dot pipeline to reduce code duplication (#400)
This PR:
- simplifies data types generated by `shared->mfma dot op` layout conversions; data is no longer packed into int32 or int64
- reduces code duplication between the fast and normal paths
- reduces code duplication between operand A and operand B

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-12-13 22:33:02 +01:00
jayfurmanek
a42ac260aa Merge branch 'triton-mlir' into ifu-231117 2023-12-12 14:24:11 -06:00
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables a 4x4 tile size in MFMA-based dot operations.

The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (e.g., the blocked layout cannot handle tensors smaller than the wavesize).

For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4) and (4x64) x (64x8) -> (4x8).

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
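A hedged sketch of the shape rule stated in the commit above; the 64-output-element limit comes from the commit message, while the divisibility checks are illustrative assumptions:

```python
WAVESIZE = 64  # wavefront size on CDNA GPUs

def mfma4x4_dot_supported(m: int, n: int, k: int) -> bool:
    # Tiled as (4x64) x (64x4) -> (4x4); divisibility is an assumption here.
    tiles_ok = m % 4 == 0 and n % 4 == 0 and k % 64 == 0
    # Layouts used during result processing need >= wavesize output elements.
    return tiles_ok and m * n >= WAVESIZE

assert mfma4x4_dot_supported(4, 16, 64)      # (4x64) x (64x16) -> (4x16)
assert mfma4x4_dot_supported(8, 8, 64)       # (8x64) x (64x8)  -> (8x8)
assert not mfma4x4_dot_supported(4, 4, 128)  # only 16 output elements
assert not mfma4x4_dot_supported(4, 8, 64)   # only 32 output elements
```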
Jason Furmanek
a08dafe7fe Initial commit to resolve merge conflicts 2023-11-20 22:41:03 +00:00
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflicts

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attention.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
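A hedged sketch of the autotuner tweaks listed in the commit above ("Add Z and H in the key of autotuner"); the kernel body and config values are illustrative stand-ins, not the actual flash-attention kernel:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        # illustrative configs in the style of the flash-attention repo
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64}, num_stages=1, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64}, num_stages=1, num_warps=4),
    ],
    key=['Z', 'H', 'N_CTX'],  # Z and H join the autotuning key per this commit
)
@triton.jit
def _attn_fwd_stub(Q, Out, Z, H, N_CTX, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    offs = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    q = tl.load(Q + offs, mask=offs < N_CTX)
    tl.store(Out + offs, q, mask=offs < N_CTX)
```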
Thomas Raoux
cb3d79a185 [BACKEND] Prevent emitting multiple dot_wait after pipelined loop (#2598)
Patch based on @donproc's findings and suggested optimization.

Emitting multiple wait ops may confuse ptxas and cause it to fall back to
a conservative mode.
2023-11-03 14:29:50 -07:00
Weixing Zhang
34b89a1173 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581)
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
in the two-tt.dot fusion scenario. This is not optimal when a tt.dot has
too little parallelism on the row dimension but enough on the column
dimension.
2023-11-03 16:40:03 -04:00
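A hedged sketch of the heuristic the commit above describes: split warps between rows and columns based on the MMA output shape instead of always using [numWarps, 1]. The tile sizes are illustrative assumptions:

```python
def pick_warps_per_cta(m: int, n: int, num_warps: int,
                       tile_m: int = 16, tile_n: int = 8) -> list[int]:
    warps = [1, 1]
    remaining = num_warps
    while remaining > 1:
        # give the next factor of 2 to the dimension with more remaining parallelism
        if m / (warps[0] * tile_m) >= n / (warps[1] * tile_n):
            warps[0] *= 2
        else:
            warps[1] *= 2
        remaining //= 2
    return warps

print(pick_warps_per_cta(m=16, n=256, num_warps=4))  # column-heavy shape -> [1, 4]
```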
Thomas Raoux
6ac9d51ff0 [OPTIMIZATION] Enable pipelining for bwd flash attention (#2590)
This allows pipelining when a load is used by multiple dots in a loop.

Relax the condition to pipeline dot operands for the mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However, this
exposes a performance problem due to the wgmma pipelining, as ptxas will
now fall back to serial wgmma. A follow-up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF.
2023-11-03 11:46:51 -07:00
Thomas Raoux
dced22c4b7 [BACKEND] Remove workaround for NVPTX bug after LLVM upgrade (#2585)
This was a workaround for a bug exposed in test_core when generating
ext_load in NVPTX. The backend bug appears fixed in the latest LLVM
upgrade, so the workaround is removed.
2023-11-02 17:31:58 +00:00
Thomas Raoux
ca8f110617 [BACKEND] Pipeliner refactoring (#2565)
Refactor the pipeliner pass to make it more generic. The main change is
that the pipeliner is now broken into two pieces: one that computes a
modulo schedule and creates async ops based on the IR, and an expander
that generates the pipelined IR from the modulo schedule.
Separating the two pieces lets us create different schedules without
changing the expander, and it allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.

This also creates a different sequence of insert/extract slices for the
alloc. We should probably change shared alloc to use memory semantics.
2023-11-02 09:56:39 -07:00
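A hedged sketch of the schedule/expander split described above, with ops reduced to plain labels; stage s of the schedule runs iteration i - s, yielding the prologue, steady state, and epilogue of the pipelined loop:

```python
def expand(schedule: list[tuple[str, int]], num_iters: int) -> list[tuple[str, int]]:
    num_stages = max(stage for _, stage in schedule) + 1
    program = []
    for i in range(num_iters + num_stages - 1):
        for op, stage in schedule:
            iteration = i - stage              # stage s lags by s iterations
            if 0 <= iteration < num_iters:
                program.append((op, iteration))
    return program

# two-stage schedule: async loads run one iteration ahead of the dot
for op, it in expand([("load", 0), ("dot", 1)], num_iters=3):
    print(op, it)   # load 0 | load 1, dot 0 | load 2, dot 1 | dot 2
```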
Dongdong Li
d0098da7b1 [BACKEND] Add error reporting to report non-kernel-argument (#2552)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-11-01 20:22:10 -04:00
Alexander Efimov
74c5fd46ee [RemoveLayoutConversions] Fix reduce failed infer type error (#377)
* [RemoveLayoutConversions] Fix reduce failed infer type error

This PR fixes the layout propagation algorithm in the RemoveLayoutConversions pass.
In some cases during the rewriteSlice process, a reduce operation with multiple outputs
had only one output layout rewritten, which breaks the assumption that all outputs share the same layout.

This change is a minimal part of https://github.com/openai/triton/pull/2331, plus a
small lit test for regression testing.

* fix combine test

* Fix incorrect layout inference for the make_range output result
2023-11-01 13:31:13 -05:00
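A hedged sketch of the invariant the fix above restores, with layouts as plain strings: when layout propagation rewrites a multi-output reduce, every result must receive the new layout, not only the one reachable from the slice:

```python
def rewrite_multi_result_op(results: list[str], new_layout: str,
                            layout_of: dict[str, str]) -> dict[str, str]:
    for r in results:                 # rewrite all results together
        layout_of[r] = new_layout
    assert len({layout_of[r] for r in results}) == 1  # outputs stay consistent
    return layout_of

print(rewrite_multi_result_op(["max", "argmax"], "blocked", {}))
```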
Alexander Efimov
d62a3ffdbe [RemoveLayoutConversions] Remove PatternSharedInfo structure (#378)
This structure is no longer used after the massive refactoring
of the RemoveLayoutConversions pass in the September IFU.
2023-11-01 12:57:35 -05:00
Zahi Moudallal
3650213218 [OPTIMIZER] Thread local reduction optimization (#2542)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-31 16:13:36 -07:00
daemondzh
96cf8f979a [OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545)
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
2023-10-31 02:28:27 +00:00
Dongdong Li
0469d5fccd [OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-10-26 16:35:36 +00:00
zhu jianjiang
cfae7e2a25 [BACKEND] Fix matmul downcast path (#2528)
For https://github.com/openai/triton/issues/2523, add a regression test

---------

Co-authored-by: Jokeren <robinho364@gmail.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-26 09:43:49 -04:00
Shucai Xiao
2729ae6c6f use different int8 mfma instructions on different GPUs. (#368)
* add support for choosing different int8 instructions per GPU

* rename an instruction

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-10-25 19:12:21 -05:00
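A hedged sketch of the per-GPU selection the commit above describes; the arch checks and instruction mnemonics are illustrative assumptions, not the exact table used by the backend:

```python
def select_int8_mfma(arch: str) -> str:
    if arch in ("gfx940", "gfx941", "gfx942"):  # MI300-class
        return "mfma_i32_32x32x16_i8"
    return "mfma_i32_32x32x8i8"                 # MI200-class and older

print(select_int8_mfma("gfx90a"))
```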
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the ttg-to-llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
Philippe Tillet
3f2b7263e8 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2541)
Reverts openai/triton#2525
2023-10-24 10:23:19 -07:00
Philippe Tillet
8f467f1ea9 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485) (#2525)
Reverts openai/triton#2497
2023-10-23 21:50:58 -07:00
Thomas Raoux
5e6071254c [BACKEND] Use our internal slice implementation to avoid combinatorial explosion (#2535)
2023-10-24 03:06:34 +00:00
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith
ops for cmp and select.
2023-10-23 19:35:46 -07:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Thomas Raoux
e36d1665ca [BACKEND] Fix unsupported view op created during optimizations (#2510)
When propagating layouts we were generating a view op with a mismatching
total number of elements per thread. Lowering such an op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimizations in the future.
2023-10-18 16:37:13 +01:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces the matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
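A hedged sketch of how the flag might be used, assuming it is exposed as a per-kernel compile option as in later ROCm Triton releases; the kernel is a stand-in:

```python
import triton
import triton.language as tl

@triton.jit
def small_dot(A, B, C, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(A + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(B + offs[:, None] * BLOCK + offs[None, :])
    tl.store(C + offs[:, None] * BLOCK + offs[None, :], tl.dot(a, b))

# Hypothetical launch: matrix_instr_nonkdim=16 requests MFMA 16, 32 requests MFMA 32.
# small_dot[(1,)](a, b, c, BLOCK=32, matrix_instr_nonkdim=16)
```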
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Philippe Tillet
8db4fac3b0 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2497)
Reverts openai/triton#2485
2023-10-13 23:32:59 -07:00
Weixing Zhang
76858bd917 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
in the two-tt.dot fusion scenario. This is not optimal when a tt.dot has
too little parallelism on the row dimension but enough on the column
dimension.
2023-10-12 22:25:42 -07:00
Thomas Raoux
a777e1d8db [OPTIMIZER] Propagate mma layout when the transitive use has dot_operand encoding (#2482) 2023-10-12 23:57:40 +00:00
Thomas Raoux
cda298fae7 [Pipeliner] Allocate less shared memory when possible (#2466)
The pipeliner was overallocating shared memory for the inputs of the
current schedule. This reduces shared memory usage to only what is needed.
Note that improving the membar analysis could allow taking advantage of
allocating extra buffers to remove barriers.
2023-10-12 12:10:06 -07:00
Thomas Raoux
6f46c93b9e [BACKEND] Add back dot.wait when generating async_dot (#2478)
Based on discussion, this is needed to make sure there is no race
condition when reading shared memory.
2023-10-10 21:45:28 -07:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFMA] Support mfma with M/N size 16

This PR adds code emission of MFMA instructions with size 16.

* add control over the mfma type with the MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
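A hedged sketch of the env-var control the commit above adds; the variable must be set before the kernel is compiled, and everything beyond the name MFMA_TYPE is an assumption:

```python
import os

os.environ["MFMA_TYPE"] = "16"  # request the 16x16 MFMA variant
# ...then import triton and compile/launch kernels as usual.
```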