Commit Graph

273 Commits

Author SHA1 Message Date
Philippe Tillet
3cfa474d97 [OPTIMIZER] Remove dead code (#1160) 2023-02-06 23:02:58 -08:00
Philippe Tillet
1b31ad997f [OPTIMIZER] Improved memory coalescing heuristics (#1159) 2023-02-06 22:32:28 -08:00
Keren Zhou
546f2377ae [BACKEND] Get the right operand and result types in forward rematerialization passes (#1152) 2023-02-04 16:34:35 -08:00
Mehdi Amini
ce6d74e0b6 [BACKEND] Fix crash in test/TritonGPU/coalesce.mlir (#1148)
The call to `coalesceOp` is deleting the op it is processing and
replacing it with a new one. We can't `dyn_cast` the `curr` pointer
because it is dangling at this point.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-02-04 02:40:27 +00:00
Keren Zhou
bde52f9db2 [BACKEND] Fix alignment calculation (#1149)
`getDivisibility` represents if the address in bytes is divisible by a
certain number, so we should convert `#aligned bytes` to `#aligned
elements`.
2023-02-03 17:20:23 -08:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
- A size 1 tensor is considered not expensive since all threads will
load the same
- the targeEncoding may expose more vectorization opportunities (more
elements per thread on the first dim)

**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
2023-02-01 18:21:15 -08:00
Keren Zhou
1ec39fdf99 [BACKEND] Refactored the MoveConvertOutOfIf conversion to handle scf.if correctly (#1114)
Also removed duplicate code for `simulateBackwardRematerialization`.
2023-02-01 08:49:19 -08:00
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
Previous https://github.com/openai/triton/pull/1113 forgot to consider
that a node may have multiple parents, visiting the instruction before
any parent violates the semantic of topological sort.

The fixed implementation exhaustively add all operations into a
candidate subgraph and move an operation to the "ready" queue once all
of its operands have been visited.
2023-01-31 23:41:20 -08:00
Philippe Tillet
8fea1fb478 [FRONTEND] Adding static range (#1130)
Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a
custom implementation (#1113)"
2023-01-31 18:04:19 -08:00
Philippe Tillet
c4b9d699d2 [FRONTEND][BACKEND] Fixed many bugs (#1122)
- **temporarily commenting assertion in `MemBar.cpp`. We need to fix
this! but for now the following patches will unblock a number of
users.**
- Fixed frontend codegen issue for If / For / While. Emit an error when
replaced values' type mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
2023-01-30 23:22:36 -08:00
goostavz
3e8d83b7cc Minor fix to support sm_90 (#1125)
This fix enables the support on sm_90 (otherwise it will crash).

Logs like 
> 'sm_90' is not a recognized processor for this target (ignoring
processor)

could be ignored and should be eliminated with the update of llvm nvptx
backend.
2023-01-31 14:08:02 +08:00
Keren Zhou
bc8a26d56f [BACKEND] Replace mlir::topologicalSort with a custom implementation (#1113)
`multiRootTopologicalSort` is faster than `mlir::topologicalSort`
because it prunes nodes that have been visited before.
2023-01-29 18:57:21 -08:00
Da Yan
394f2e6991 [OPTIMIZER] improved prefetch width (#1106)
Before this PR: 16
After this PR: 16 (fp16/bf16), 32(int8/fp8), 8 (tf32)
The new prefetch width works better with i8/f8/tf32 tensor cores.
2023-01-27 17:41:49 +00:00
Keren Zhou
b5d32896b1 [BACKEND] Verify the same operand and result element type for convert_layout (#1081)
And a hotfix for incorrect convert_layout construction in the GPU
combine pass.
2023-01-22 16:59:24 +00:00
Yan Chunwei
88498d104a [BACKEND] DotOp enable ld.v4 in MMAv1 (#1020)
The existing convert distributed to distributed layouts logic is based
on processing each MMA-block, this requires each MMA-block to share
exactly the same fixed pattern(such as the one described in the [NV PTX
doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)).

While for MMAv1, things are different, the MMA-block has variant
patterns for different shapes and data layouts as below

<img width="200" alt="image"
src="https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png">

This requires all the cell coordinates in DotOp output to be computed.
2023-01-19 09:42:33 -08:00
Philippe Tillet
408d1d7e87 [OPTIMIZER] Improved flash attention forward pass performance (#1075)
- Fixed typo in instruction reordering pass
- Minor additional optimizations for shared memory allocator
- Optimized flash attention tutorial forward pass kernel
2023-01-19 06:46:01 +00:00
Philippe Tillet
660f2e8cce [OPTIMIZER] pipeline and prefetch pass now use a more ptxas-friendly schedule (#1065) 2023-01-17 15:21:19 -08:00
Keren Zhou
3f47e9aa0e [BACKEND] Fix unrealized conversion for fp32 dot (#1051) 2023-01-17 21:55:44 +00:00
Yan Chunwei
86003c83dd [Optimizer] Add UpdateMmaForVolta Pass (#1048)
This PR adds UpdateMmaForVolta pass to help update the MMA encoding for
Volta.
Some context is told in https://github.com/openai/triton/pull/1014

# Changes

1. Moving the related MMAv1 patterns from GPUCombine pass to
UpdateMmaForVolta pass,
2. Updating both the versionMinor and warpsPerCTA fields for Volta MMA
encodings since they could only be determined after the GPUCombine Pass,
3. Moving the FixupLoop pattern from the Combine.cpp to new
Utility.h/.cpp files
4. Adding an ID field(takes 5 bits to store an integer) to versionMinor
to help assigning a unique ID(on Volta) for each MMA encodings, the
reason is as below
- Currently, there is a cyclic dependency between {DotOperand, Slice}
with MMA layouts, we use a map to help cluster all the DotOperand,
Slice, and MMA layout instances into the same group for further updating
in bulk
- When there are multiple DotOps in a module with the same MMA(literally
equivalent), it is possible to get the wrong groups
- an ID field is used to help to identify the MMA from different DotOps,
thus getting all the MMA, DotOperand, and Slice layout instances in the
right groups
2023-01-14 11:54:19 +08:00
Philippe Tillet
259f4c5f7d [OPTIMIZER] Added new optimization passes (#1055)
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
2023-01-13 13:15:53 -08:00
goostavz
0e8590f1c9 [BACKEND] Add generic support of convert_layout from distributed to shared (#1025) 2022-12-30 11:29:58 -08:00
Yan Chunwei
2ba74d2729 [OPTIMIZER] Update the versionMinor in MMA layout for volta (#1014)
Continue the work https://github.com/openai/triton/pull/990

# Background
The `versionMinor` in MmaEncodingAttr holds some states of DotOp's
operands in Volta, while such operands will be modified by some
patterns, making the states out-of-date.

This PR helps to correct the states.

# Implementation
It adds three new patterns:

1. `CollectMmaToUpdateForVolta` helps to collect and build a map holding
the MmaEncodingAttr instances with wrong states and create new correct
ones for them,
2. `UpdateMMAVersionMinorForVolta` helps to replace the Ops generating
the wrong MmaEncodingAttr instances with new correct ones, currently it
supports the following Ops
    a. `convert_layout[X -> mma]`
    b. `arith.constant SplatAttr : !tensor<mma>`
    c. `dot ... : !tensor<mma>`

# Limitation
This PR chooses the mapping way to bypass the IR walk complexity from
the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might
be multiple DotOp holding different DotOprand(IsMMAv1Row) that have the
same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need an ID
field to MmaEncodingAttr.
2022-12-28 12:24:01 +08:00
Philippe Tillet
20100a7254 Merge triton-mlir branch - Complete rewrite of the backend from scratch (#1004)
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00