Commit Graph

487 Commits

Author SHA1 Message Date
Rohit Santhanam
ce8adb92bd Merge remote-tracking branch 'upstream/master' into triton-mlir-IFU-01142023 2023-01-14 19:19:58 +00:00
Yan Chunwei
86003c83dd [Optimizer] Add UpdateMmaForVolta Pass (#1048)
This PR adds UpdateMmaForVolta pass to help update the MMA encoding for
Volta.
Some context is given in https://github.com/openai/triton/pull/1014

# Changes

1. Moving the related MMAv1 patterns from GPUCombine pass to
UpdateMmaForVolta pass,
2. Updating both the versionMinor and warpsPerCTA fields for Volta MMA
encodings since they could only be determined after the GPUCombine Pass,
3. Moving the FixupLoop pattern from the Combine.cpp to new
Utility.h/.cpp files
4. Adding an ID field (5 bits storing an integer) to versionMinor to
assign a unique ID (on Volta) to each MMA encoding, for the following
reasons:
- Currently, there is a cyclic dependency between {DotOperand, Slice}
and MMA layouts, so we use a map to cluster all the DotOperand,
Slice, and MMA layout instances into the same group for updating
in bulk
- When a module contains multiple DotOps with the same (literally
equivalent) MMA, the grouping can come out wrong
- The ID field identifies which DotOp each MMA comes from, so that all
the MMA, DotOperand, and Slice layout instances land in the
right groups
2023-01-14 11:54:19 +08:00
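To illustrate item 4 of commit 86003c83dd above, here is a minimal sketch of packing a 5-bit ID into an integer field alongside other bits. Only the 5-bit width comes from the commit message. The names `pack_version_minor` and `unpack_version_minor`, the 4-bit flag width, and the bit order are assumptions made for illustration and may not match Triton's actual layout.

```python
from typing import Tuple

ID_BITS = 5        # width stated in the commit message
FLAG_BITS = 4      # hypothetical width of the remaining versionMinor bits
ID_MASK = (1 << ID_BITS) - 1
FLAG_MASK = (1 << FLAG_BITS) - 1


def pack_version_minor(flags: int, mma_id: int) -> int:
    """Pack the flag bits and a per-DotOp ID into one integer."""
    if not 0 <= flags <= FLAG_MASK:
        raise ValueError("flags out of range")
    if not 0 <= mma_id <= ID_MASK:
        raise ValueError("ID does not fit in 5 bits")
    # Hypothetical layout: the ID sits directly above the flag bits.
    return (mma_id << FLAG_BITS) | flags


def unpack_version_minor(version_minor: int) -> Tuple[int, int]:
    """Return (flags, mma_id) recovered from a packed value."""
    flags = version_minor & FLAG_MASK
    mma_id = (version_minor >> FLAG_BITS) & ID_MASK
    return flags, mma_id


# Two DotOps with identical flags still produce distinct encodings.
a = pack_version_minor(flags=0b0101, mma_id=0)
b = pack_version_minor(flags=0b0101, mma_id=1)
```

Because the ID is part of the packed value, two otherwise identical encodings compare unequal, which is what lets a map keyed on the encoding keep their groups apart.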
Philippe Tillet
259f4c5f7d [OPTIMIZER] Added new optimization passes (#1055)
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
2023-01-13 13:15:53 -08:00
Keren Zhou
733301ff31 [Backend] Rewrite code for linking external library to expose more inlining opportunities (#1037)
- Also make it cleaner.
- And mark the code that needs to be fixed in `semantic.py`.
2023-01-08 13:44:29 -08:00
Keren Zhou
e638cb8060 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-04 15:13:09 +00:00
goostavz
396e08f4de [BACKEND] Add generic support of convert_layout from distributed to shared (#1025) 2023-01-04 15:12:52 +00:00
Rohit Santhanam
7b7ddb7a59 Merge branch 'triton-mlir-IFU' into merge_IFU_to_triton_mlir 2023-01-03 23:37:11 +00:00
Keren Zhou
678b9f53a2 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-03 15:11:54 -08:00
goostavz
0e8590f1c9 [BACKEND] Add generic support of convert_layout from distributed to shared (#1025) 2022-12-30 11:29:58 -08:00
Michael Melesse
edd0df94dc compiles 2022-12-21 13:48:56 -06:00
Michael Melesse
41578a63d2 Merge remote-tracking branch 'upstream/triton-mlir' into triton-mlir-IFU 2022-12-21 12:53:03 -06:00
Philippe Tillet
20100a7254 Merge triton-mlir branch - Complete rewrite of the backend from scratch (#1004)
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00
Keren Zhou
50a5128448 [Triton-MLIR][BACKEND] Support bfloat16 and clean up some test code (#998) 2022-12-20 22:26:51 -08:00
Philippe Tillet
899bb0a0e7 [FORMAT] Run clang-format, autopep8 and isort (#1000) 2022-12-20 17:47:34 -08:00
Philippe Tillet
d438be01bd [TRITON-MLIR][BACKEND] New optimization patterns to speed-up layer norm (#991) 2022-12-18 23:50:20 -08:00
Chenggang Zhao
4e95e939a6 [Triton-MLIR][BACKEND] Refactor TritonGPUToLLVM into several files (#988)
Refactor the backend into multiple smaller files.
2022-12-18 14:54:38 +08:00
Philippe Tillet
9f27468377 [TESTS][FRONTEND][BACKEND] Merge master and triton-mlir tests (#979)
Also fix a bunch of bugs in float32 / tf32

Co-authored-by: Jokeren <kerenzhou@openai.com>
2022-12-15 19:28:50 -08:00
Yan Chunwei
42b5234e27 [Triton-MLIR][BACKEND] Decompose Mma version to versionMajor and versionMinor (#985) 2022-12-15 17:14:07 +08:00
Philippe Tillet
52accd4c2b [BACKEND] Add isRow attribute for DotOp tensors whose parent is mmav1 (#970)
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
2022-12-11 19:01:57 -08:00
Daniil Fukalov
e68f17bd1f [Triton-MLIR][NFC] Fix line endings. 2022-12-08 22:45:59 +01:00
Keren Zhou
18e683d9bb [Triton-MLIR][BACKEND] Pass compute capability from the frontend and code cleanup (#961) 2022-12-07 15:03:46 -08:00
Philippe Tillet
b2b793dfb5 [FRONTEND][BACKEND] Fixes for cat / reshape / addptr (#959)
Most notably, this PR:
- changes the traits (and assembly format) of addptr so it can handle offsets that have arbitrary integer width.
- adds support for `cat`
2022-12-06 23:29:50 -08:00
Philippe Tillet
532e10cf87 [FRONTEND][BACKEND] Clean-up transpositions (#953) 2022-12-06 09:32:13 -08:00
Yan Chunwei
e419781978 [Triton-MLIR][BACKEND] Make mmav1 works on basic cases (#944)
TODO:

- Add more cases
- Currently, we just set vec to 4 to make the basic cases pass

Issue:

- the vec in shared layout is different compared to master branch
- when vec=1, it hits a CUDA misalignment error; this does not work on
the master branch either
- when setting vec to the value identical to master branch, the MMA
works
2022-12-06 10:57:08 +08:00
goostavz
e057c65cf0 [BACKEND] Porting the legacy heuristic rule in assigning shared layout for A/B of MMAv1 (#948) 2022-12-05 11:30:23 -08:00
Philippe Tillet
8edfe813a5 [FRONTEND][BACKEND] Added trans instruction; made flash attention bwd pass work (#943) 2022-12-03 09:58:24 -08:00
goostavz
4d64589b22 [Triton-MLIR][Backend] Fix the definition of MmaEncodingAttr v1, and the output sequence of DotConversion in MMAv1 (#941) 2022-12-03 21:12:48 +08:00
Yang Hau
8650b4d1cb [DRIVER] Fix typos (#939) 2022-12-02 11:13:46 -08:00
Keren Zhou
c280ebda1b [Triton-MLIR][BACKEND] Fix the membar pass to add missing barriers caused by scf.for (#933)
1. Add missing barriers and revert the previous temporary solution
2. Extract the `run` method from membar analysis because the membar
analysis should have two phases: construction, which doesn't
modify any IR, and modification, which adds barrier IRs. This should
make the use of membar clearer.
2022-12-01 11:54:18 -08:00
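The two-phase split described in commit c280ebda1b above (a construction phase that does not modify the IR, then a modification phase that adds barriers) can be sketched on a toy op list. This is not Triton's membar implementation: the `Op` tuples, `find_barrier_points`, and `insert_barriers` are hypothetical, only read-after-write hazards are tracked, and loops such as `scf.for` are not modeled.

```python
from typing import List, Set, Tuple

Op = Tuple[str, str]  # (kind, buffer); kind is "write", "read", or "barrier"


def find_barrier_points(ops: List[Op]) -> List[int]:
    """Phase 1 (construction): pure analysis, the op list is not modified."""
    pending: Set[str] = set()  # buffers written since the last barrier
    points: List[int] = []
    for i, (kind, buf) in enumerate(ops):
        if kind == "barrier":
            pending.clear()
        elif kind == "write":
            pending.add(buf)
        elif kind == "read" and buf in pending:
            points.append(i)   # a barrier is needed before op i
            pending.clear()    # that barrier synchronizes every pending write
    return points


def insert_barriers(ops: List[Op], points: List[int]) -> List[Op]:
    """Phase 2 (modification): build a new op list with barriers added."""
    wanted = set(points)
    out: List[Op] = []
    for i, op in enumerate(ops):
        if i in wanted:
            out.append(("barrier", ""))
        out.append(op)
    return out


ops = [("write", "A"), ("read", "A"), ("write", "B"), ("barrier", ""), ("read", "B")]
points = find_barrier_points(ops)
new_ops = insert_barriers(ops, points)
```

Keeping the analysis free of mutation means its result can be inspected or tested on its own before any barrier is inserted.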
Keren Zhou
7d90a07d0b [Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929)
1. Improve pipeline's comment
2. Decompose insert_slice_async when load vector size is not supported
3. Add a test that could fail our gemm code

Copy my comments here:

There's a knob that may cause performance regression when decomposition
has been performed. We should remove this knob once we have a thorough
analysis of async wait. Currently, we decompose `insert_slice_async`
into `load` and `insert_slice` without knowing which `async_wait` is
responsible for the `insert_slice_async`. To guarantee correctness, we
blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed.

There are two options to improve this:
1. We can perform a dataflow analysis to find the `async_wait` that is
responsible for the `insert_slice_async` in the backend.
2. We can modify the pipeline to perform the decomposition before the
`async_wait` is inserted. However, it is also risky because we don't
know the correct vectorized shape yet in the pipeline pass. Making the
pipeline pass aware of the vectorization could introduce additional
dependencies on the AxisInfoAnalysis and the Coalesce analysis.
2022-11-30 10:07:34 -08:00
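The conservative rule described in commit 7d90a07d0b above can be sketched on a toy op list: once any `insert_slice_async` is decomposed, every `async_wait` is made to wait for all outstanding async ops. The dictionary-based ops, the `decompose` helper, the `SUPPORTED_BYTES` set, and the use of `num = 0` to mean "wait for everything" are assumptions for illustration, not Triton's actual data structures.

```python
from typing import Dict, List

SUPPORTED_BYTES = {4, 8, 16}  # hypothetical set of supported async copy sizes


def decompose(ops: List[Dict]) -> List[Dict]:
    """Split unsupported insert_slice_async ops into load + insert_slice."""
    out: List[Dict] = []
    decomposed_any = False
    for op in ops:
        if op["name"] == "insert_slice_async" and op["bytes"] not in SUPPORTED_BYTES:
            out.append({"name": "load", "bytes": op["bytes"]})
            out.append({"name": "insert_slice"})
            decomposed_any = True
        else:
            out.append(dict(op))  # copy so the input list stays untouched
    if decomposed_any:
        # We do not know which async_wait guards the decomposed op, so
        # conservatively make every async_wait wait for all async ops.
        for op in out:
            if op["name"] == "async_wait":
                op["num"] = 0
    return out


ops = [
    {"name": "insert_slice_async", "bytes": 16},
    {"name": "insert_slice_async", "bytes": 2},
    {"name": "async_wait", "num": 2},
]
result = decompose(ops)
```

The blanket update to `async_wait` is what can cost performance, which is why the commit message proposes a dataflow analysis or an earlier decomposition as alternatives.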
goostavz
4e6a8209ed [Triton-MLIR] Two fixes on allocation and backend related with MMA v1 (#930) 2022-11-30 09:27:26 +00:00
dfukalov
88c178aec5 [Triton-MLIR][HSACO] Addressed comments.
Moved new functionality from HSACO into AMDGCN target, removed
duplicating HSACO target.
2022-11-30 00:01:51 +01:00
Rohit Santhanam
7ef7beb096 Implement triton::LoadOp and triton::StoreOp using LLVM load and store
ops.

This avoids using GCNBuilder which has the following issues:

1. Performance problems with emitting too many barriers
2. Impossibility of supporting int8 and uint8 because the minimum AMDGPU
register width is 16 bit.

More unit tests were included in test_core_amd.py.
2022-11-29 21:30:20 +00:00
rsanthanam-amd
31c512ee31 Merge pull request #33 from ROCmSoftwarePlatform/triton-mlir-divop
add initial div op impl
2022-11-29 14:38:36 -06:00
dfukalov
834c46ab44 [Triton-MLIR][HSACO] Added HSACO generation.
The change is based on Rohit's commit 9bf807f310

Used the `AMDGCN_HSACO_DUMP_PATH=<dumpfile>` notation to save a copy of the
temporary generated HSACO file so that a test can run llvm-readobj on that
dump and check its contents.
2022-11-28 23:34:56 +01:00
Rohit Santhanam
5ab51f5551 Duplicate the LLVM objects for AMDGCN assembly generation and AMDGCN
object file creation to avoid adding multiple passes to the same
TargetMachine.
2022-11-28 20:03:34 +00:00
Michael Melesse
723f4e9f7f add divop 2022-11-28 13:20:52 -06:00
Qingyi Liu
9d31998a9d [Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918) 2022-11-27 22:59:27 -08:00
Keren Zhou
35c9ec1103 [Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917) 2022-11-26 12:30:38 -08:00
donproc
f63be0e9b5 [TRITON-MLIR][BACKEND]support atomic_cas (#914)
1. support atomic_cas
2. add xchg support in atomic_rmw

Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-11-25 12:02:08 +08:00
Keren Zhou
153aecb339 [Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908)
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.

Not sure whether V100 performance can match the master branch this way.
The performance might improve if instructions are arranged in
the following form:

```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```

Tested on A100 with this decomposition manually enabled.
Tests on V100 haven't been integrated yet. We can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs in `test_backend.py`.
2. End to end gemm tests on V100.
2022-11-24 14:05:54 -08:00
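The instruction arrangement suggested in commit 153aecb339 above (all loads first, then all `insert_slice` ops) amounts to a stable partition of the op list. The sketch below uses hypothetical `(name, index)` tuples and a hypothetical `hoist_loads` helper, and assumes that no load depends on the result of an `insert_slice`, so moving loads earlier is safe.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # (op name, index of the value it produces or consumes)


def hoist_loads(ops: List[Op]) -> List[Op]:
    """Move every load ahead of the other ops, preserving relative order."""
    loads = [op for op in ops if op[0] == "load"]
    others = [op for op in ops if op[0] != "load"]
    return loads + others


interleaved = [
    ("load", 0), ("insert_slice", 0),
    ("load", 1), ("insert_slice", 1),
    ("load", 2), ("insert_slice", 2),
]
grouped = hoist_loads(interleaved)
```

Grouping the loads gives the hardware a chance to have several memory requests in flight before the first `insert_slice` needs its data.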
B1tway
288c5070db Add GCNBuilder 2022-11-24 12:25:33 +00:00
Keren Zhou
2e33352419 [Triton-MLIR] Fix side effects (#906)
Try to add proper side effects for triton operations.

The CSE pass could fail, hang, or output incorrect IRs for unknown
reasons, if side effects are not defined properly.

For instance, suppose we have two shared memory tensors:

```
%a = triton_gpu.alloc_tensor shape0, share_encoding0
%b = triton_gpu.alloc_tensor shape0, share_encoding0
```

The CSE pass will treat `%a` and `%b` as the same value and
eliminate one of them, resulting in mysterious outcomes.
2022-11-22 23:29:18 -08:00
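The failure mode described in commit 2e33352419 above can be reproduced with a toy common-subexpression elimination pass. This is not MLIR's CSE: the `Op` tuples, the `cse` function, and the `side_effecting` set are hypothetical stand-ins for ops and their declared side effects.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, str, Tuple[str, ...]]  # (result, op name, operands)


def cse(ops: List[Op], side_effecting: Set[str]) -> List[Op]:
    """Merge ops with identical name and operands unless they have side effects."""
    seen: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    rename: Dict[str, str] = {}
    out: List[Op] = []
    for result, name, operands in ops:
        # Rewrite operands that refer to an already eliminated value.
        operands = tuple(rename.get(o, o) for o in operands)
        key = (name, operands)
        if name not in side_effecting and key in seen:
            rename[result] = seen[key]  # reuse the earlier, identical op
            continue
        if name not in side_effecting:
            seen[key] = result
        out.append((result, name, operands))
    return out


ops = [
    ("%a", "alloc_tensor", ("shape0", "share_encoding0")),
    ("%b", "alloc_tensor", ("shape0", "share_encoding0")),
    ("%c", "use", ("%a", "%b")),
]
wrong = cse(ops, side_effecting=set())             # alloc treated as pure
right = cse(ops, side_effecting={"alloc_tensor"})  # alloc marked as effectful
```

With the allocation treated as pure, `%b` disappears and `%c` ends up using `%a` twice, so two distinct buffers silently become one. Marking the allocation as side-effecting keeps both.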
ben-zhang-609
07786dc932 [Triton-MLIR] Add compute capability (#902)
add compute capability from python frontend to backend.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2022-11-22 11:08:23 -08:00
Rohit Santhanam
520d9e8835 Fixes to support 16-bit data types. 2022-11-22 01:17:31 +00:00
Rohit Santhanam
ffb895f4a2 Added more GCNBuilder commits plus some additional fixes.
With this commit, 108 subtests from test_bin_op in test_core.py are passing.

Here is what doesn't work from test_bin_op:

- int8 datatype
- int16 datatype
- uint8 datatype
- uint16 datatype
- float16 datatype
- bfloat16 datatype
- division (/) operator
2022-11-21 22:28:39 +00:00
Philippe Tillet
23f71daa27 [OPTIMIZER] Fixed up order of shared layouts (#881) 2022-11-21 06:25:02 +01:00
Jun Yang
8a5647782d [Triton-MLIR][Testing]Fix tests warning, with small code clean-up (#894)
1. Code clean-up to remove superfluous #includes.
2. Fix two Python test warnings: one relates to ["#"
formats](https://jira.mongodb.org/browse/PYTHON-2343), the other relates
to regular expression string usage.
2022-11-19 14:33:59 +00:00
Rohit Santhanam
cac4ee78ef Add GCNBuilder. 2022-11-19 13:59:12 +00:00
Rohit Santhanam
b9e7634356 Merge commit 'e517b58d59ba96357d042d8fa5819a690d00d749' into IFU_upstream_commit_e517b58d59ba96357d042d8fa5819a690d00d749 2022-11-16 12:57:19 +00:00