Commit Graph

355 Commits

Author SHA1 Message Date
Keren Zhou
6a9316e69a [BACKEND] Clean up SCF -> CF conversion (#1234) 2023-02-22 23:49:47 +00:00
Philippe Tillet
0ec277efc5 [OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232)
This shouldn't actually change the behavior of Triton -- only clean things up.
2023-02-22 13:54:55 -08:00
Keren Zhou
123c687ed9 [BACKEND] Rewrite Membar to fit the CF dialect (#1213) 2023-02-19 14:54:33 -08:00
Rohit Santhanam
50e6329d01 Fix LIT tests. 2023-02-18 09:32:42 +00:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Alexander Efimov
fce08c2ebb [Test][LIT] Fix tritongpu_to_hsaco.mlir test
This PR fixes the tritongpu_to_hsaco.mlir test and a minor issue with the tritongpu_to_amdgcn test.
2023-02-16 20:15:52 +01:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Rohit Santhanam
112abb966f Fix test/Conversion/tritongpu_to_llvm.mlir for AMDGPU. 2023-02-11 15:16:50 +00:00
Rohit Santhanam
a2416e0901 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023 2023-02-11 14:48:19 +00:00
rsanthanam-amd
fb62b22364 Merge pull request #103 from binarman/unittest_fix/Conversion/tritongpu_to_llvm.mlir
[Test][Lit] Tritongpu to llvm
2023-02-09 22:57:08 -06:00
Philippe Tillet
2aba985daa [OPTIMIZER] Improved layout simplifications heuristics (#1168) 2023-02-09 20:17:25 -08:00
Alexander Efimov
a58c3ad317 update checks for Conversion/tritongpu_to_llvm.mlir test 2023-02-10 00:05:23 +01:00
Keren Zhou
c61c8a123f [BACKEND] Disallow the CombineSelectMaskedLoad pattern if conditions of select and broadcast are different (#1170) 2023-02-09 18:03:22 -05:00
Alexander Efimov
ba603b8319 review fix: CHECK to PTX 2023-02-09 01:07:57 +01:00
Alexander Efimov
d56258a4c6 review fix: roll back basic_insert_slice_async_v1_multictas 2023-02-09 01:01:52 +01:00
Alexander Efimov
89ca4cb4ba review fix: add COUNT checks 2023-02-09 01:01:28 +01:00
Alexander Efimov
a43f192889 code review fix 2023-02-08 19:17:00 +01:00
Keren Zhou
681d04cf2b [BACKEND] Fix axisInfo analysis for div ops (#1157) 2023-02-07 02:25:23 +00:00
Alexander Efimov
7fb8147d4f [Test] Fix load_store test
This PR fixes the lit-based unit test Conversion/AMDGPU/load_store.mlir.
2023-02-06 20:36:39 +01:00
Keren Zhou
546f2377ae [BACKEND] Get the right operand and result types in forward rematerialization passes (#1152) 2023-02-04 16:34:35 -08:00
Keren Zhou
bde52f9db2 [BACKEND] Fix alignment calculation (#1149)
`getDivisibility` indicates whether the address, in bytes, is divisible by a certain number, so we should convert `#aligned bytes` to `#aligned elements`.
2023-02-03 17:20:23 -08:00
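To make the byte-to-element conversion above concrete, here is a small hedged Python sketch; the function name and the fallback for non-divisible cases are assumptions for illustration, not the actual AxisInfo code.

```python
def divisibility_in_elements(divisibility_in_bytes: int, element_bytes: int) -> int:
    """Turn a divisibility guarantee on a byte address into one on element
    indices, as the commit describes (illustrative sketch only)."""
    if divisibility_in_bytes % element_bytes != 0:
        # The byte alignment does not cover a whole element, so the only
        # safe element-level guarantee is 1 (assumed fallback).
        return 1
    return divisibility_in_bytes // element_bytes

# A pointer aligned to 16 bytes holding 2-byte f16 elements is aligned to 8 elements.
assert divisibility_in_elements(16, 2) == 8
# The same 16-byte alignment with 4-byte f32 elements gives 4-element alignment.
assert divisibility_in_elements(16, 4) == 4
```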
Rohit Santhanam
8cb6ab5b1a Merge remote-tracking branch 'upstream/main' into triton_mlir_IFU_02022023 2023-02-02 22:54:53 +00:00
Alexander Efimov
af58dd90f8 [Test][Lit] Tritongpu to llvm
This PR fixes the lit-based unit test Conversion/tritongpu_to_llvm.mlir for the GCN target.
2023-02-02 23:31:34 +01:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases (see the sketch after this entry):
  - A size-1 tensor is not considered expensive since all threads load the same element.
  - The targetEncoding may expose more vectorization opportunities (more elements per thread on the first dim).

**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
2023-02-01 18:21:15 -08:00
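As a rough illustration of the two rematerialization cases listed in this commit, the following Python sketch restates the heuristic; the function name and inputs are hypothetical, not the pass's actual interface.

```python
def should_rematerialize_across_load(tensor_numel: int,
                                     elems_per_thread_src: int,
                                     elems_per_thread_dst: int) -> bool:
    """Hypothetical restatement of the commit's heuristic:
    - a size-1 tensor is cheap to rematerialize (all threads load the same element);
    - otherwise, only rematerialize if the target encoding gives each thread
      more contiguous elements, exposing wider vectorized accesses."""
    if tensor_numel == 1:
        return True
    return elems_per_thread_dst > elems_per_thread_src
```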
Keren Zhou
1ec39fdf99 [BACKEND] Refactored the MoveConvertOutOfIf conversion to handle scf.if correctly (#1114)
Also removed duplicate code for `simulateBackwardRematerialization`.
2023-02-01 08:49:19 -08:00
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
The previous PR https://github.com/openai/triton/pull/1113 did not account for a node having multiple parents; visiting an instruction before all of its parents violates the semantics of a topological sort.

The fixed implementation exhaustively adds all operations to a candidate subgraph and moves an operation to the "ready" queue once all of its operands have been visited.
2023-01-31 23:41:20 -08:00
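The fix amounts to a readiness-driven topological sort: an operation becomes "ready" only after every operand produced inside the candidate subgraph has been visited, so a node with multiple parents is never emitted early. A minimal Python sketch of that idea (not the actual C++ implementation) looks like this:

```python
from collections import deque

def topo_sort(ops, operands_of):
    """ops: hashable op ids; operands_of(op) -> op ids it depends on.
    Returns the ops ordered so each op appears after all of its operands
    that are themselves in `ops` (Kahn-style readiness queue)."""
    in_graph = set(ops)
    pending = {op: {d for d in operands_of(op) if d in in_graph} for op in ops}
    ready = deque(op for op, deps in pending.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for other, deps in pending.items():
            if op in deps:
                deps.discard(op)
                if not deps:
                    ready.append(other)
    assert len(order) == len(ops), "dependency cycle"
    return order
```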
Philippe Tillet
c4b9d699d2 [FRONTEND][BACKEND] Fixed many bugs (#1122)
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to fix this! But for now the following patches will unblock a number of users.**
- Fixed a frontend codegen issue for If / For / While: emit an error when replaced values' types mismatch.
- Added a "top level" codepath for if statements, which allows users to write patterns to exit early from kernels (e.g., `if cond1: if cond2: return else: ...`); a minimal kernel sketch follows this entry. Added associated codegen in TritonToTritonGPUPass.
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed an issue in random.py: bitcast some values to uint where needed.
- Added support for `Not`
- Fixed nondeterministic compilation issue
2023-01-30 23:22:36 -08:00
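To illustrate the kind of early-exit pattern the "top level" if codepath enables, here is a minimal Triton kernel sketch; it is not taken from the PR or its tests, just an example of returning early from a kernel.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK
    # Top-level if with an early return: programs past the end do no work.
    if block_start >= n_elements:
        return
    offsets = block_start + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, x, mask=mask)
```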
Rohit Santhanam
2d0ee0fa0f Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023 2023-01-24 03:59:17 +00:00
Keren Zhou
b5d32896b1 [BACKEND] Verify the same operand and result element type for convert_layout (#1081)
And a hotfix for incorrect convert_layout construction in the GPU
combine pass.
2023-01-22 16:59:24 +00:00
Yan Chunwei
88498d104a [BACKEND] DotOp enable ld.v4 in MMAv1 (#1020)
The existing convert-distributed-to-distributed layout logic processes one MMA block at a time, which requires every MMA block to share exactly the same fixed pattern (such as the one described in the [NV PTX
doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)).

For MMAv1, things are different: the MMA block has varying patterns for different shapes and data layouts, as shown below.

![MMAv1 block patterns for different shapes and data layouts](https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png)

This requires all the cell coordinates in DotOp output to be computed.
2023-01-19 09:42:33 -08:00
Philippe Tillet
408d1d7e87 [OPTIMIZER] Improved flash attention forward pass performance (#1075)
- Fixed typo in instruction reordering pass
- Minor additional optimizations for shared memory allocator
- Optimized flash attention tutorial forward pass kernel
2023-01-19 06:46:01 +00:00
Philippe Tillet
660f2e8cce [OPTIMIZER] pipeline and prefetch pass now use a more ptxas-friendly schedule (#1065) 2023-01-17 15:21:19 -08:00
Keren Zhou
3f47e9aa0e [BACKEND] Fix unrealized conversion for fp32 dot (#1051) 2023-01-17 21:55:44 +00:00
rsanthanam-amd
56dfa695d2 Merge pull request #79 from dfukalov/dfukalov/work
[Test] Enable triton-translate build.
2023-01-17 06:01:26 -06:00
Rohit Santhanam
ce8adb92bd Merge remote-tracking branch 'upstream/master' into triton-mlir-IFU-01142023 2023-01-14 19:19:58 +00:00
Yan Chunwei
86003c83dd [Optimizer] Add UpdateMmaForVolta Pass (#1048)
This PR adds an UpdateMmaForVolta pass that updates the MMA encoding for Volta.
Some context is given in https://github.com/openai/triton/pull/1014

# Changes

1. Move the related MMAv1 patterns from the GPUCombine pass to the UpdateMmaForVolta pass.
2. Update both the versionMinor and warpsPerCTA fields for Volta MMA encodings, since they can only be determined after the GPUCombine pass.
3. Move the FixupLoop pattern from Combine.cpp to new Utility.h/.cpp files.
4. Add an ID field (5 bits storing an integer) to versionMinor to help assign a unique ID (on Volta) to each MMA encoding; see the sketch after this entry. The reasons are as follows:
- Currently there is a cyclic dependency between {DotOperand, Slice} and MMA layouts, so we use a map to cluster all the DotOperand, Slice, and MMA layout instances into the same group for updating in bulk.
- When there are multiple DotOps in a module with the same (literally equivalent) MMA, it is possible to get the wrong groups.
- An ID field is used to distinguish the MMAs of different DotOps, so that all the MMA, DotOperand, and Slice layout instances end up in the right groups.
2023-01-14 11:54:19 +08:00
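Point 4 above reserves 5 bits of versionMinor for a per-DotOp ID. As a hedged illustration of that kind of bit packing (the actual bit positions and remaining layout of `MmaEncodingAttr.versionMinor` are assumptions here), one could write:

```python
ID_BITS = 5  # the commit says the ID takes 5 bits; its position is assumed here

def pack_version_minor(base_minor: int, mma_id: int) -> int:
    """Pack a small per-DotOp ID into the low bits of versionMinor (illustrative)."""
    assert 0 <= mma_id < (1 << ID_BITS), "ID must fit in 5 bits"
    return (base_minor << ID_BITS) | mma_id

def unpack_version_minor(packed: int):
    """Recover (base_minor, mma_id) from the packed value."""
    return packed >> ID_BITS, packed & ((1 << ID_BITS) - 1)

assert unpack_version_minor(pack_version_minor(1, 7)) == (1, 7)
```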
Philippe Tillet
259f4c5f7d [OPTIMIZER] Added new optimization passes (#1055)
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
2023-01-13 13:15:53 -08:00
Daniil Fukalov
ce23494cbf [Test] Enable triton-translate build.
And use it in tests instead of the Python module triton.tools.aot.
2023-01-13 20:36:16 +01:00
Philippe Tillet
589a18959e [BACKEND] Make swizzled shared memory pointers compatible with non-blocked distributed layout (#1053)
Notes:
  * Cleaned up implementation
  * Added comments
  * Re-using code between ConvertDistributedToShared and ConvertInsertSliceAsyncOp
2023-01-13 09:14:23 -08:00
Daniil Fukalov
4d4a91ec5f [ROCM] Fix incorrect llvm.bitcast creation.
Before the patch, tt.load and tt.store generated invalid bitcasts like `%f = llvm.bitcast %i : i32 to f16` (the source and destination types of a bitcast must have the same bitwidth).
2023-01-12 14:28:17 +01:00
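The invariant being restored is that `llvm.bitcast` only reinterprets bits and therefore requires equal bitwidths on both sides. A tiny Python check makes the point; the type table is illustrative, not the converter's actual code.

```python
BITWIDTH = {"i1": 1, "i8": 8, "i16": 16, "i32": 32, "i64": 64,
            "f16": 16, "f32": 32, "f64": 64}

def is_legal_bitcast(src_ty: str, dst_ty: str) -> bool:
    # A bitcast reinterprets the same bits, so the widths must match exactly.
    return BITWIDTH[src_ty] == BITWIDTH[dst_ty]

assert not is_legal_bitcast("i32", "f16")  # the invalid pattern emitted before the fix
assert is_legal_bitcast("i16", "f16")      # equal widths: a legal reinterpretation
```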
Keren Zhou
e638cb8060 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-04 15:13:09 +00:00
goostavz
50d4589e9c [Backend] Add value cache in emitting indices calculation and some refinement (#1018)
1. Add an explicit value cache when emitting index calculations (a small sketch follows this entry).
2. Move the index-calculation emission logic into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost incurred by templates. Refer to the discussion in this thread by @LyricZhao:
https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969
2023-01-04 15:11:53 +00:00
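The "explicit value cache" in point 1 is essentially memoization keyed on whatever uniquely determines the emitted indices. A hypothetical Python sketch of that pattern (the key choice and names are assumptions, not the actual C++ code) is:

```python
class IndexCache:
    """Memoize index computations so equivalent requests reuse an earlier
    result instead of re-emitting the same calculation (illustrative sketch)."""

    def __init__(self):
        self._cache = {}

    def get_or_emit(self, layout, shape, emit_fn):
        key = (layout, tuple(shape))
        if key not in self._cache:
            self._cache[key] = emit_fn(layout, shape)
        return self._cache[key]

# Usage: the (potentially expensive) emit_fn runs once per (layout, shape) pair.
cache = IndexCache()
idx_a = cache.get_or_emit("blocked", [128, 64], lambda l, s: f"indices({l}, {s})")
idx_b = cache.get_or_emit("blocked", [128, 64], lambda l, s: f"indices({l}, {s})")
assert idx_a is idx_b
```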
Yan Chunwei
176fa4b4d2 [OPTIMIZER] Update the versionMinor in MMA layout for volta (#1014)
Continue the work https://github.com/openai/triton/pull/990

# Background
The `versionMinor` in MmaEncodingAttr holds some state of DotOp's operands on Volta, but those operands may be modified by some patterns, leaving that state out of date.

This PR corrects the stale state.

# Implementation
It adds three new patterns:

1. `CollectMmaToUpdateForVolta` collects the MmaEncodingAttr instances with stale state and builds a map from them to newly created correct ones.
2. `UpdateMMAVersionMinorForVolta` replaces the ops that produce the stale MmaEncodingAttr instances with new correct ones; currently it supports the following ops:
    a. `convert_layout[X -> mma]`
    b. `arith.constant SplatAttr : !tensor<mma>`
    c. `dot ... : !tensor<mma>`

# Limitation
This PR chooses the mapping approach to bypass the IR-walk complexity caused by the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(isMMAv1Row) values that share the same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.
2023-01-04 15:11:20 +00:00
Rohit Santhanam
7b7ddb7a59 Merge branch 'triton-mlir-IFU' into merge_IFU_to_triton_mlir 2023-01-03 23:37:11 +00:00
Keren Zhou
678b9f53a2 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-03 15:11:54 -08:00
goostavz
1d3029faf8 [Backend] Add value cache in emitting indices calculation and some refinement (#1018)
1. Add an explicit value cache when emitting index calculations.
2. Move the index-calculation emission logic into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost incurred by templates. Refer to the discussion in this thread by @LyricZhao:
https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969
2022-12-29 11:19:59 -08:00
Yan Chunwei
2ba74d2729 [OPTIMIZER] Update the versionMinor in MMA layout for volta (#1014)
Continue the work https://github.com/openai/triton/pull/990

# Background
The `versionMinor` in MmaEncodingAttr holds some state of DotOp's operands on Volta, but those operands may be modified by some patterns, leaving that state out of date.

This PR corrects the stale state.

# Implementation
It adds three new patterns:

1. `CollectMmaToUpdateForVolta` collects the MmaEncodingAttr instances with stale state and builds a map from them to newly created correct ones.
2. `UpdateMMAVersionMinorForVolta` replaces the ops that produce the stale MmaEncodingAttr instances with new correct ones; currently it supports the following ops:
    a. `convert_layout[X -> mma]`
    b. `arith.constant SplatAttr : !tensor<mma>`
    c. `dot ... : !tensor<mma>`

# Limitation
This PR chooses the mapping approach to bypass the IR-walk complexity caused by the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(isMMAv1Row) values that share the same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.
2022-12-28 12:24:01 +08:00
Michael Melesse
41578a63d2 Merge remote-tracking branch 'upstream/triton-mlir' into triton-mlir-IFU 2022-12-21 12:53:03 -06:00
Philippe Tillet
20100a7254 Merge triton-mlir branch - Complete rewrite of the backend from scratch (#1004)
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00