Commit Graph

93 Commits

Author SHA1 Message Date
Lixun Zhang
02a2f24dd5 [Backend] Refactor mfma selection (#441)
* Select mfma dimensions and instruction from static table

* Extend mfmaLayout to include version and instrShape

* Simplify generateMFMAOp by searching the mfma instruction in the table

* Fix getNonKDim() and non_k_dim

* Break instrShape into MDim and NDim
2024-01-16 21:05:35 -06:00
Shucai Xiao
a7bb38ea79 enable layout conversion from mfma to dot_op for mfma16. (#453)
* enable the layout conversion from mfma layout to dot_operand layout for mfma16

* backup changes
2024-01-10 22:31:59 -06:00
joviliast
af15da2f84 Support WMMA layout in TritonAMDGPUAccelerateMatmulPass
-Introduce WmmaEncodingAttr for WMMA output
-Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
-Provide a flag tho check if wmma instructions are supported by target

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
jayfurmanek
a42ac260aa Merge branch 'triton-mlir' into ifu-231117 2023-12-12 14:24:11 -06:00
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables 4x4 tile size in MFMA based dot operations.

Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize).

For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
Following dots are not supporter: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflits

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attension.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support of fp8 and bf8 in AccelerateMatmul pass and
Introduces generation of float8 mfma instructions in ttg to llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Zahi Moudallal
4749072fbd [BACKEND] Allow reduce with sliced 3D layout as input (#2480) 2023-10-10 15:19:11 -07:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFAM] Support mfma with NM size 16

This PR code emitting of MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
Zahi Moudallal
be19cf3103 [BACKEND] Enable reduce with 3D tensors and added tests (#2460) 2023-10-06 15:08:22 -07:00
Thomas Raoux
5a0170a27c [BACKEND] Minor removing of unnecessary code and cleanup (#2443) 2023-10-04 12:14:08 -07:00
Thomas Raoux
c656a139d3 [BACKEND] Fix for FP8 QK inputs in flash attention forward pass (#2435) 2023-10-03 21:02:13 -07:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Thomas Raoux
721bdebee1 [OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411)
Support having chain of mma with mixed size.
Serialize the different block calculation in backward attention to
workaround problem with ptxas and wgmma.
2023-09-28 10:29:08 -07:00
Thomas Raoux
6bc1d9e1be [BACKEND] Support MMA V3 with register operand (#2375)
MMA V3 support taking operand A from register. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps and add the lowering
for it.
2023-09-25 10:43:54 -07:00
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics "from elements per instruction" to
"elements per thread per instruction" along k axis.
2023-09-25 10:56:44 -05:00
Keren Zhou
ed5a53057d [BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345)
https://github.com/openai/triton/issues/2298
2023-09-20 12:25:52 -04:00
Keren Zhou
307b5caa49 [BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330)
1. On the axis, using `getAxisNumWarpsWithUniqueData` instead of getting
the raw number of warps to avoid communication among warps that handle
the same piece of data.
2. When there's a single warp on the axis, using warp Intrinsics for
communication and skip shared memory.

Need a follow up PR for code clean up.
2023-09-18 17:45:05 -04:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default which allows accumualting with
lower precision.
This only affect Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Zahi Moudallal
a47f1f5c28 [BACKEND] Unify slow/fast reduce codegen (#2220) 2023-09-12 08:46:19 -07:00
Lixun Zhang
5972eafbc9 Use ceil to align with upstream 2023-09-12 10:16:44 -05:00
Lixun Zhang
ea397b49aa Fix the issue when CTA coverage is larger than the tile 2023-09-12 10:16:44 -05:00
Philippe Tillet
bf4f9375a7 [FRONTEND] allow mixed precision FP8 matmul on pre-H100 hardware (#2281) 2023-09-11 20:54:29 -07:00
Lixun Zhang
28d4c3bdb4 [BACKEND] Make sure getAxisBlockStride does not return 0 (#2273)
This can happen when the CTA shape is larger than the tensor shape along
the non-axis dim during scanOp lowering.
2023-09-11 11:02:56 -07:00
Alexander Efimov
6691de65db [MFMA] Support BFloat16 on MI100 (#295)
* [MFMA] Support BFloat16 on MI100

This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100.

* fix tests, fix mfma encoding comment, fix switch between mfma versions.

* replace kDim from mfma layout with kWidth from dotOp layout

* rebase fix

* fix mfma to dot op shortcut for bfloat16

* fix review comments
2023-09-08 15:08:34 -05:00
Jason Furmanek
df5c263a19 Fix merge conflicts 2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Thomas
d4644d6cb3 [BACKEND] Refactor RemoveLayoutConversion pass (#2181)
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now bocken down into 2
main phases. First forward propagation of layout starting from ops that
we don't want to change. Propagate to all the nodes. If there is a
single layout needed for the op then we can rewrite the op, if there are
multiple layout required based on dependency we need a tie break.
The second phase is backward propgation that gets a backward slice of
operations starting from the convert and if all the operations in the
slice can be rematerialized rewrite the slice. This backward phase now
supports going through loop arguments.

This will allow more complex logic in the future to add a cost model to
decide which convert to leave and which to fold
2023-08-28 19:05:16 -07:00
jayfurmanek
fa429316d4 Merge pull request #268 from ROCmSoftwarePlatform/improve_reduce_for_fa
[CHERRY-PICKED FROM UPSTREAM][BACKEND] no longer uses shared mem or barriers for single-warp reductions (openai#1915)
2023-08-21 13:29:11 -05:00
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Alexander Efimov
01b0108c94 [MFMA] [FA] Keep bf16 results of FA dot operations in registers (#298)
This PR enables optimization for keeping bf16 values in registers between dot operations.
2023-08-18 07:33:00 -05:00
Philippe Tillet
4215086931 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-08-15 11:51:20 +00:00
Goran Flegar
29bfdb6eef [BACKEND] Fix crash in reductions on i1 (#1996)
`getScratchSizeInBytes` was assuming that the size of all types in bits
is
a multiple of 8. If it is not, it would return 0. This caused a bug for
boolean
(i1) type, where the reduction lowering would attempt to use shared
memory,
which was not assigned to the op.

Fix this issue by setting the number of bytes per element to `ceil(bits
/ 8)`.
2023-08-09 10:28:05 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
86f8b64ae0 [Dot] [MFMA] [FMA] Update Dot implementation to support upstream tests (#260)
* [Dot] [MFMA] Support FP16 output of MFMA dot

This PR adds cast of output tensor to requested data type.

* add tests

* fix test for FMA implementation

* loose fp16xfp16->fp16 tolerance

* enable FMA fallback for unsupported sizes of dot operation

* rework granularity check

* add constant modifier to granularity
2023-08-03 13:47:18 -05:00
Philippe Tillet
1db3bdc52e [BACKEND] avoid code duplication for fully warp-synchronous reductions (#1978) 2023-07-21 16:06:00 -07:00
oplavsic
d6e51fd221 [FA OPTIMIZATION] Keep results of FA dot operations in registers (#247)
* [WIP][FA OPTIMIZATION] Optimize chain dot

This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.

* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers

* Move operand swapping in ttgir -> llir lowering phase

* Refactor emitMfmaOffsetForCTA function to be more readable

* Fix accidental change in 06-fused-attention.py

* Address review comments

* Fix rebase errors
2023-07-12 15:25:55 -05:00
Philippe Tillet
8fe5524c75 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-07-11 00:23:26 -07:00
Thomas
bd900e0a6f [BACKEND] Fix reductions when number of unique element is smaller than layout (#1913)
Fix calculation of unique number of threads within a warp. We need to
consider the number of elements per thread in the calculation. Also
change the layout test to integer sum in order to catch bugs with unique
data as max reduction may hide those kind of problems.
2023-07-07 19:48:13 -07:00
Keren Zhou
4255ef0e9e [BACKEND] Determine fast reduce based on the parent layout (#1892) 2023-07-05 11:52:22 -07:00
Thomas
2e3182bab7 [BACKEND] Support scan on dimensions other that fastest moving one (#1863)
This relax the restriction in the scan lowering to support layout where
we scan along a dimension which isn't the fastest moving one. This is
done by relaxing how we accesses elements during scanning and allow
elements to be strided.
2023-06-30 12:40:48 -07:00
Jason Furmanek
2b38ab4b6c Merge remote-tracking branch 'oai/main' into ifu230620
Conflicts:
	include/triton/Conversion/TritonToTritonGPU/Passes.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/test/unit/language/assert_helper.py
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-29 21:47:27 +00:00