Commit Graph

25 Commits

Author SHA1 Message Date
Rohit Santhanam
cd9ae1cd36 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023 2023-02-23 21:41:54 +00:00
Keren Zhou
123c687ed9 [BACKEND] Rewrite Membar to fit the CF dialect (#1213) 2023-02-19 14:54:33 -08:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
rsanthanam-amd
44f69bea81 Merge pull request #113 from ROCmSoftwarePlatform/triton-mlir-IFU-02112023
Triton mlir ifu 02112023
2023-02-13 09:26:10 -06:00
rsanthanam-amd
ec387d5bf4 Merge pull request #109 from dfukalov/dfukalov/work-3
[ROCM] Enable part of tl.dot operations.
2023-02-12 13:50:20 -06:00
Daniil Fukalov
a6596fc634 [ROCM] Enable part of tl.dot operations.
The change enables the fall-through FMA path for ROCm. It works for
the float32 type, and not for all tensor sizes. The change also
switches off reporting MMA and async-op support, to avoid generating
NV inline asm.
2023-02-12 17:25:48 +01:00
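A minimal sketch of the kind of kernel the fall-through path above covers: a single-block float32 `tl.dot`, which can lower to FMA instructions instead of MMA. The kernel, names, and pointer arithmetic are illustrative, not taken from the PR.

```python
import triton
import triton.language as tl

@triton.jit
def fma_dot_kernel(a_ptr, b_ptr, c_ptr,
                   M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    # Row-major tiles that fit in one block; shapes must satisfy
    # tl.dot's minimum size requirements.
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
    # float32 operands: with MMA support unreported, this falls
    # through to FMA codegen.
    c = tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```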
Philippe Tillet
0cbe368fe5 [OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169) 2023-02-09 18:43:33 +00:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case (see
the sketch after this entry).
- Add bit-shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is not considered expensive, since all threads
    load the same value.
  - The targetEncoding may expose more vectorization opportunities
    (more elements per thread on the first dim).

**_res2next_** benchmark GPU kernel time comparison on A100:
- Average kernel sum: Triton 16838630 ns vs Triton-MLIR 17105166 ns, a
**1.016x slowdown**.
- Total kernel sum: Triton 6511735460 ns vs Triton-MLIR 6512370620 ns
(~1.0001x, effectively parity).
2023-02-01 18:21:15 -08:00
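The per-op structure can be pictured as small transfer functions over a lattice of facts. The rules below are a toy illustration in Python under simplified assumptions, not the exact rules of the C++ pass.

```python
from dataclasses import dataclass
from math import gcd

@dataclass(frozen=True)
class AxisInfo:
    divisibility: int  # a known divisor of every element along the axis
    contiguity: int    # length of runs of consecutive values on the axis

def visit_add(lhs: AxisInfo, rhs: AxisInfo) -> AxisInfo:
    # Adding a value divisible by d to a run of d consecutive elements
    # preserves the run; divisibility combines by gcd.
    contiguity = max(gcd(lhs.contiguity, rhs.divisibility),
                     gcd(lhs.divisibility, rhs.contiguity))
    return AxisInfo(gcd(lhs.divisibility, rhs.divisibility), contiguity)

def visit_mul_const(x: AxisInfo, c: int) -> AxisInfo:
    # Scaling by c multiplies divisibility but breaks contiguity unless
    # c == 1: consecutive inputs land stride-c apart.
    return AxisInfo(x.divisibility * c, x.contiguity if c == 1 else 1)
```

Roughly, the vectorization width of a load is then bounded by the pointer's contiguity, which is what lets a different targetEncoding expose more elements per thread.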
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
The previous PR, https://github.com/openai/triton/pull/1113, did not
account for a node having multiple parents; visiting an instruction
before all of its parents violates the semantics of a topological sort.

The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
2023-01-31 23:41:20 -08:00
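A minimal Python sketch of the fixed scheme, assuming the graph is a DAG given as op-to-operand lists (the actual pass works on MLIR operations):

```python
from collections import defaultdict, deque

def toposort(ops, operands):
    # operands[op] lists the distinct ops producing op's inputs.
    pending = {op: len(operands[op]) for op in ops}
    users = defaultdict(list)
    for op in ops:
        for d in operands[op]:
            users[d].append(op)
    # An op is "ready" only once all of its operands have been visited.
    ready = deque(op for op, n in pending.items() if n == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for u in users[op]:
            pending[u] -= 1
            if pending[u] == 0:
                ready.append(u)
    return order
```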
Philippe Tillet
8fea1fb478 [FRONTEND] Adding static range (#1130)
Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a
custom implementation (#1113)"
2023-01-31 18:04:19 -08:00
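A sketch of how `tl.static_range` is typically used: the bounds must be compile-time constants, and the loop is unrolled during compilation. The kernel itself is illustrative.

```python
import triton
import triton.language as tl

@triton.jit
def sum_rows(x_ptr, out_ptr, BLOCK: tl.constexpr, N_ROWS: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    acc = tl.zeros([BLOCK], dtype=tl.float32)
    # tl.static_range unrolls at compile time, emitting each iteration
    # as straight-line IR; N_ROWS must be a constexpr.
    for i in tl.static_range(0, N_ROWS):
        acc += tl.load(x_ptr + i * BLOCK + offs)
    tl.store(out_ptr + offs, acc)
```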
Keren Zhou
bc8a26d56f [BACKEND] Replace mlir::topologicalSort with a custom implementation (#1113)
`multiRootTopologicalSort` is faster than `mlir::topologicalSort`
because it prunes nodes that have been visited before.
2023-01-29 18:57:21 -08:00
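The pruning idea can be sketched as a post-order DFS from every root with one shared visited set, so subgraphs reachable from several roots are walked only once. This is a simplified illustration of the idea, not the (later reverted) implementation itself:

```python
def multi_root_toposort(roots, operands):
    # operands(op) yields the ops producing op's inputs.
    visited, order = set(), []
    def dfs(op):
        if op in visited:   # prune: already expanded via another root
            return
        visited.add(op)
        for d in operands(op):
            dfs(d)
        order.append(op)    # post-order: operands precede users
    for r in roots:
        dfs(r)
    return order
```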
Philippe Tillet
408d1d7e87 [OPTIMIZER] Improved flash attention forward pass performance (#1075)
- Fixed typo in instruction reordering pass
- Minor additional optimizations for shared memory allocator
- Optimized flash attention tutorial forward pass kernel
2023-01-19 06:46:01 +00:00
Philippe Tillet
20100a7254 Merge triton-mlir branch - Complete rewrite of the backend from scratch (#1004)
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00
Keren Zhou
50a5128448 [Triton-MLIR][BACKEND] Support bfloat16 and clean up some test code (#998) 2022-12-20 22:26:51 -08:00
Philippe Tillet
9f27468377 [TESTS][FRONTEND][BACKEND] Merge master and triton-mlir tests (#979)
Also fix a bunch of bugs in float32 / tf32

Co-authored-by: Jokeren <kerenzhou@openai.com>
2022-12-15 19:28:50 -08:00
Philippe Tillet
8edfe813a5 [FRONTEND][BACKEND] Added trans instruction; made flash attention bwd pass work (#943) 2022-12-03 09:58:24 -08:00
Qingyi Liu
661be523c0 [Triton-MLIR][BACKEND] Minor fixes of shared memory in ReduceOpConversion (#924) 2022-11-29 11:50:31 +08:00
Qingyi Liu
9d31998a9d [Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918) 2022-11-27 22:59:27 -08:00
Keren Zhou
35c9ec1103 [Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917) 2022-11-26 12:30:38 -08:00
Keren Zhou
153aecb339 [Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

It is unclear whether V100 performance can match the master branch this
way. The performance might improve if instructions were arranged in the
following form:

```
%0 = load
%1 = load 
%2 = load 
...
insert_slice %0
insert_slice %1
insert_slice %2
```

Tested on A100 with this decomposition enabled manually.
Tests on V100 haven't been integrated yet; we can divide them into two
phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs, in `test_backend.py`.
2. End-to-end GEMM tests on V100.
2022-11-24 14:05:54 -08:00
Keren Zhou
fdd59900f7 [Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837)
## Features

- Allow taking a block-shaped slice of a tensor, as long as each
dimension is contiguous (unit stride).
- Fix some problems in `insert_slice_async`'s semantic.
- More general verification for ops that return shared layout encoding.

## Known Limitations

- `insert_slice_async` still uses the old semantics. Another PR may
follow to support semantics similar to `tensor.extract_slice`.
- No encoding verification for `tensor.extract_slice`.
- 3d tensor ops are broken.
- Strided accesses are not allowed.
- May cause a slight performance slowdown, since strides are passed as
values rather than constants (e.g., int).
It would be difficult to pass strides as attributes in the presence of
control flow; a block argument may need to accept tensors with
different strides.
2022-11-06 22:59:03 -08:00
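A NumPy analogy of what the "each dimension contiguous (unit stride)" restriction above permits: offsets and sizes are free, but every stride is fixed at one. This is an illustration of the slicing semantics, not code from the PR.

```python
import numpy as np

x = np.arange(64, dtype=np.float32).reshape(8, 8)
offsets, sizes = (2, 4), (3, 4)
# Unit-stride slice in every dimension, like tensor.extract_slice with
# all strides == 1; strided accesses (step > 1) are not allowed yet.
block = x[offsets[0]:offsets[0] + sizes[0],
          offsets[1]:offsets[1] + sizes[1]]
assert block.shape == sizes
```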
Ian Bearman
f2106d0aa2 [BUILD] Fix Warnings and Enable Warnings as Errors (#794) 2022-10-28 12:36:09 -07:00
Keren Zhou
02ebf24d35 Analyze shared memory alias (#81)
The purpose of this PR is to analyze shared memory aliases so that we
can fix memory allocation bugs and save memory allocations in Triton
code involving complex control flow.

Changes to the memory barrier (Membar) and allocation analyses are on
the way.

Co-authored-by: Philippe Tillet <phil@openai.com>
2022-08-29 10:43:20 -07:00
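A toy sketch of the kind of alias bookkeeping involved, assuming values arrive in topological order and op kinds are reduced to hypothetical alloc/view/select categories for illustration:

```python
def shared_alias_sets(ops):
    # Map each value to the set of shared-memory buffers it may point
    # into; ops are (result, kind, operands) triples in topological order.
    alias = {}
    for result, kind, operands in ops:
        if kind == "alloc":              # fresh shared-memory buffer
            alias[result] = {result}
        elif kind == "view":             # e.g. a slice: aliases its operand
            alias[result] = set(alias[operands[0]])
        elif kind == "select":           # control flow: may alias either side
            alias[result] = alias[operands[0]] | alias[operands[1]]
    return alias

# Values aliasing the same buffer must share one allocation lifetime.
ops = [("a", "alloc", ()), ("v", "view", ("a",)),
       ("b", "alloc", ()), ("s", "select", ("v", "b"))]
assert shared_alias_sets(ops)["s"] == {"a", "b"}
```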