Commit Graph

273 Commits

Author SHA1 Message Date
Philippe Tillet
640f3c3921 [OPTIMIZER] Tweaked layout removal conversion heuristics (#1501)
Loads are now considered cheap to rematerialize when there are more
threads than elements in the tensor.
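The heuristic described above can be sketched in plain Python; the function name and thread arithmetic are illustrative, not Triton's actual code:

```python
def load_is_cheap_to_rematerialize(tensor_shape, num_warps, threads_per_warp=32):
    """Illustrative sketch of the heuristic described above: a load is
    considered cheap to rematerialize when the kernel has more threads
    than the tensor has elements, since each thread then touches at
    most one element and re-issuing the load duplicates little work."""
    num_elements = 1
    for dim in tensor_shape:
        num_elements *= dim
    return num_warps * threads_per_warp > num_elements
```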
2023-04-10 15:19:08 -07:00
Philippe Tillet
adc760dac1 [OPTIMIZER] enable loop pipelining using pointer increments from vector look-up tables (#1490) 2023-04-10 08:59:42 -07:00
long.chen
f7ad8ae022 [Refine] remove const ref of mlir::Attribute (#1486)
https://mlir.llvm.org/docs/DefiningDialects/AttributesAndTypes/

https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#f16-for-in-parameters-pass-cheaply-copied-types-by-value-and-others-by-reference-to-const
```
The C++ Attribute and Type classes in MLIR (like Ops, and many other things) are value-typed. 
This means that instances of Attribute or Type are passed around by-value, 
as opposed to by-pointer or by-reference. 
The Attribute and Type classes act as wrappers around internal storage objects that are uniqued within an instance of an MLIRContext.
```
2023-04-08 10:38:59 -07:00
Rahul Batra
a27b388df5 Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023 2023-04-06 16:18:31 -05:00
Christian Sigg
01a93185a1 [BACKEND][OPTIMIZER] Switch from llvm::Optional to std::optional. (#1416) 2023-04-04 09:06:28 -07:00
Keren Zhou
28ea484dab [BACKEND] Clean up type inference functions (#1451)
And remove a duplicate function definition.
2023-03-30 23:07:32 -07:00
Keren Zhou
ee593fca0b [BACKEND] Fix int8 dot (#1435) 2023-03-28 20:18:17 -07:00
Keren Zhou
3342cc1c0c [OPTIMIZER] Do not create yield if yieldValues is empty (#1437)
https://github.com/openai/triton/issues/1432
2023-03-28 19:33:52 -07:00
Chenggang Zhao
72b071253e [FRONTEND] Support block pointer semantics (#1392)
This PR introduces a new semantics: **block pointer**, which makes it
easier and faster for users to load a block from a parent tensor.

Below is a detailed API change by an example:
```python
# Make a block pointer, which points to a block in the parent shape
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D column-major matrix 
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))

# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))

# Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")

# Store through a block pointer; the pointer and the value tensor must have the same shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```
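For intuition, the block-pointer semantics above can be modeled in plain Python. The names below (`BlockPtr`, `make_block_ptr`, `advance`, `load`) mirror the Triton API for readability, but this is an illustrative mock of how offsets accumulate and how out-of-bounds elements are padded, not Triton's implementation:

```python
class BlockPtr:
    def __init__(self, base, shape, strides, offsets, block_shape):
        self.base = base                  # flat buffer (a Python list)
        self.shape = shape                # parent tensor shape
        self.strides = strides            # parent tensor strides
        self.offsets = tuple(offsets)     # block offsets in the parent
        self.block_shape = block_shape    # shape of the loaded block

def make_block_ptr(base, shape, strides, offsets, block_shape, order=None):
    return BlockPtr(base, shape, strides, offsets, block_shape)

def advance(base, offsets):
    # tl.advance adds per-dimension offsets; striding information is
    # already captured in the block pointer.
    new_offsets = tuple(o + d for o, d in zip(base.offsets, offsets))
    return BlockPtr(base.base, base.shape, base.strides, new_offsets,
                    base.block_shape)

def load(ptr, padding=0.0):
    # Elements outside the parent tensor are replaced by the padding
    # value, mimicking boundary_check=(0, 1) with padding="zero".
    rows, cols = ptr.block_shape
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            gi, gj = ptr.offsets[0] + i, ptr.offsets[1] + j
            if 0 <= gi < ptr.shape[0] and 0 <= gj < ptr.shape[1]:
                row.append(ptr.base[gi * ptr.strides[0] + gj * ptr.strides[1]])
            else:
                row.append(padding)
        out.append(row)
    return out

# A 4x4 row-major parent tensor stored as a flat buffer.
buf = list(range(16))
ptr = make_block_ptr(base=buf, shape=(4, 4), strides=(4, 1),
                     offsets=(0, 0), block_shape=(2, 2), order=(1, 0))
```

Loading `ptr` yields the top-left 2x2 block; advancing by `(2, 2)` and loading again yields the bottom-right block, and advancing past the parent shape yields only padding.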

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-27 16:46:49 -07:00
Xuehai Pan
5b36cb48ad [CI][TEST] update pre-commit hooks and use pre-commit for style tests in CI (#1409)
Ref issue:

- #1408

Changes:

- Add `.editorconfig`
- Add `pre-commit-hooks`:

    ```yaml
    - repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v4.4.0
      hooks:
        - id: check-symlinks
        - id: destroyed-symlinks
        - id: trailing-whitespace
        - id: end-of-file-fixer
        - id: check-yaml
        - id: check-toml
        - id: check-ast
        - id: check-added-large-files
        - id: check-merge-conflict
        - id: check-executables-have-shebangs
        - id: check-shebang-scripts-are-executable
        - id: detect-private-key
        - id: debug-statements
    ```
- Add `flake8` to `pre-commit` config and add `.flake8` file
- Use `pre-commit` for style tests in CI
- Run `pre-commit` and fix existing violations:
    - fix trailing spaces
    - fix end-of-files
    - fix mod file mode with `chmod -x`
    - run `autopep8` on existing code
    - fix `flake8` violations
2023-03-25 14:52:16 -07:00
Keren Zhou
2ba77a9212 [OPTIMIZER] Fix a typo in SimplifyReduceCvt (#1385) 2023-03-21 22:45:58 -07:00
Philippe Tillet
c34ceca741 [BACKEND] Remove DotOpHelpers (i.e., decouple ConvertLayoutOpToLLVM and DotOpToLLVM) (#1383)
One long-standing issue in the backend has been the apparent complexity
of the tensor core codegen. This complexity mostly stems from the
existence of the `DotOpHelpers` utilities, which have become over time a
catch-all for all things related to MmaEncoding and DotOperandEncoding.

The purpose of this PR is to decouple what should be decoupled, as a
first step towards cleaning up our tensor core codegen. Other, more
local PRs will follow.
2023-03-21 15:24:28 -07:00
Keren Zhou
e281bd9fe9 [OPTIMIZER] Ensure the conversion of blockArgument is placed at the beginning of the block (#1379)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-20 21:19:26 -04:00
Keren Zhou
23fc647a3e [OPTIMIZER] Fix optimizer hanging caused by SimplifyReduceCvt (#1377)
https://github.com/openai/triton/issues/1328

Match the convert_layout operation in SimplifyReduceCvt
(convert_layout->reduce). This way we don't miss higher priority rewrite
patterns like RematerializeBackward and SimplifyConversion. We also need
to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit
= 3, and RematerializeForward's benefit = 2.
2023-03-20 16:20:19 -07:00
Philippe Tillet
29d01ba5f3 [OPTIMIZER] We shouldn't try to rematerialize view/cat forward since output layout can't be deduced automatically (#1378) 2023-03-20 14:26:50 -07:00
Keren Zhou
78d5900467 [OPTIMIZER] Improve pipeline to handle general indirect access to matrices (#1291)
Differentiate between immediate and non-immediate block arguments. 
If we have a load that immediately depends on a block argument in the
current iteration, it is an immediate dependency. Otherwise, it is a
non-immediate dependency, which means the load depends on a block
argument in the previous iterations.

For example:
```
scf.for (%arg0, %arg1, %arg2) {
%0 = load %arg0  <--- immediate dep, this address is initialized at numStages-2
%1 = load %arg1
%2 = add %1, %arg2
%3 = load %2  <--- non-immediate dep, %arg1 must be an up-to-date value
}
```

The above code pattern is commonly seen in cases where we have indirect
memory accesses using a lookup table, such as PyTorch's `bsr_dense_bmm`.
This PR improves `bsr_dense_bmm` by about 20% on the unit test cases.
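The immediate / non-immediate distinction can be sketched as a small use-def walk in Python; the IR encoding below is invented for illustration and is not how the pass represents things:

```python
def classify_load_dep(addr, defs, block_args):
    """Illustrative sketch of the dependency classification described
    above. `addr` is the load's address value, `defs` maps each value
    to the operands of its defining op, and `block_args` is the set of
    loop block arguments. A load whose address *is* a block argument
    has an immediate dependency; one whose address is merely computed
    from block arguments has a non-immediate dependency."""
    if addr in block_args:
        return "immediate"
    seen, work = set(), [addr]
    while work:
        value = work.pop()
        if value in seen:
            continue
        seen.add(value)
        for operand in defs.get(value, ()):
            if operand in block_args:
                return "non-immediate"
            work.append(operand)
    return "loop-invariant"
```

On the loop body from the commit message, `%0 = load %arg0` is classified as immediate, while `%3 = load %2` (where `%2 = add %1, %arg2` and `%1 = load %arg1`) is non-immediate.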
2023-03-20 14:39:47 -04:00
Philippe Tillet
fe9dc4b58e [OPTIMIZER] Restored ViewOp/CatOp passthrough in simulateBackwardRematerialization (#1376) 2023-03-20 11:02:54 -07:00
Rohit Santhanam
a3f18dba5b Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023 2023-03-20 12:51:52 +00:00
Fei Hu
6366c5a254 [FRONTEND][BACKEND] Add support for FP16 output for tl.dot (#1258)
---------

Co-authored-by: Fei Hu <fhu@microsoft.com>
2023-03-19 19:52:14 -07:00
Philippe Tillet
28e05c9799 [OPTIMIZER] Canonicalize convert_layout(cat: #layout1) -> #layout2 as cat: #layout2 (#1369)
We can do that because `cat` reorders elements anyway
2023-03-19 14:16:55 -07:00
Rohit Santhanam
a84b4883e6 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023 2023-03-19 13:46:50 +00:00
Philippe Tillet
02caa8a652 [OPTIMIZER] Better handling of control flow in Triton -> TritonGPU conversion (#1367) 2023-03-18 23:00:19 -07:00
peterbell10
c9740f0870 [OPTIMIZER] Add canonicalize/fold for ExpandDimsOp, ViewOp and BroadcastOp (#1354)
These eliminate no-op reshapes, and simplify some combinations of view
ops into a single view. e.g. viewing a splat becomes a single splat.
2023-03-16 21:13:58 -07:00
Rohit Santhanam
77dcb667b5 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03162023 2023-03-16 13:21:15 +00:00
Philippe Tillet
6539395337 [OPTIMIZER] CatOp is now marked as not having invertible layout (#1332) 2023-03-13 15:42:48 -07:00
Rohit Santhanam
6ff54b495c Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03122023 2023-03-13 18:09:12 +00:00
Philippe Tillet
9b7c65a3a9 [BACKEND][OPTIMIZER] Refactor MMAv1 codegen (#1322)
- Significant simplification of the optimizer pipeline. The right MMA
version is now set directly after the coalescing pass. DotOperand
layouts no longer hold an `isRow` state, and instead query it from
their parent
- Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to
TritonGPUAttrDefs. All MMAv1 state is now queried from attributes.
- Logic for `getElemsPerThread` is no longer duplicated in TypeConverter
2023-03-12 19:54:38 -07:00
Philippe Tillet
3fe3adbcde [FRONTEND][BACKEND] Add support for float8e5m2 type (#1314) 2023-03-10 19:14:47 -08:00
Keren Zhou
087a0e8d7f [BACKEND] Disable most rematerialization through load/store ops (#1309) 2023-03-10 00:57:23 -05:00
Da Yan
6249f5d923 [OPTIMIZER] checking loads' masks' alignment info in the pipeline pass (#1289)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-03-06 18:37:26 -08:00
Philippe Tillet
3db55c5f94 [OPTIMIZER][BACKEND] Some backend and optimization passes clean-up (#1284)
* Cleaned up pipeline pass. Now works when there are element-wise ops
between the load and the dot
* Made `splat` compatible with variables that have DotOperandLayout
* Moved rematerialization utils to a separate Transforms/Utility.cpp file.
2023-03-06 17:17:59 -08:00
Keren Zhou
73d55eb59c [OPTIMIZER] Check if the definition op is NULL in GPU combine passes (#1288) 2023-03-06 16:19:13 -08:00
B1tway
693cf89777 Changed warpSize = 32 to warpSize = 64 2023-03-06 18:33:35 +00:00
Keren Zhou
d54745538b [BACKEND][CI] Disable most backward rematerialization through load/store (#1260) 2023-03-02 09:45:50 -08:00
Philippe Tillet
fa0fbc937f [FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261)
* Frontend:
  - `int` kernel arguments are always signed
  - Loop induction variable is now determined by integer promotion on
lb/ub/step
* Optimizer:
  - Added new ExtractSliceOp that enforces 32-bit offsets
* Backend:
  - Use 64-bit indices when lowering functions and control flow
  - Removed `idx_val` macro and replaced it with `i32_val`
  - Cleaned up comments
  - Added new ArithToIndex pass to make sure operations on indices are
done with the `index` dialect, which gets converted to LLVM separately
using a 64-bit target
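The induction-variable rule from the frontend bullet can be sketched as follows; this is a hypothetical helper for intuition, not code from Triton's frontend:

```python
def induction_var_type(lb_bits, ub_bits, step_bits):
    """Illustrative sketch of the rule above: the loop induction
    variable takes the widest integer type among the lower bound,
    upper bound, and step, and is never narrower than 32 bits."""
    return f"i{max(32, lb_bits, ub_bits, step_bits)}"
```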
2023-03-01 23:09:48 -08:00
Rohit Santhanam
cd9ae1cd36 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023 2023-02-23 21:41:54 +00:00
Philippe Tillet
0ec277efc5 [OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232)
This shouldn't actually change the behavior of Triton -- only clean things up.
2023-02-22 13:54:55 -08:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Philippe Tillet
e3941f9d09 [OPTIMIZER][BACKEND] Cleaned up Volta codegen (#1185) 2023-02-14 22:39:35 -08:00
Philippe Tillet
8bca84ce3d [OPTIMIZER] Bugfix in Combine.cpp ; Added trans support in Pipeline.cpp (#1174) 2023-02-14 13:36:44 -08:00
Keren Zhou
6413c7b9de [BACKEND] Calculate correct warp ids for small matrices (#1180)
Fixes https://github.com/openai/triton/issues/1162

Adds tests for 16x16x16
2023-02-14 05:28:03 +00:00
rsanthanam-amd
44f69bea81 Merge pull request #113 from ROCmSoftwarePlatform/triton-mlir-IFU-02112023
Triton mlir ifu 02112023
2023-02-13 09:26:10 -06:00
rsanthanam-amd
ec387d5bf4 Merge pull request #109 from dfukalov/dfukalov/work-3
[ROCM] Enable part of tl.dot operations.
2023-02-12 13:50:20 -06:00
Daniil Fukalov
a6596fc634 [ROCM] Enable part of tl.dot operations.
The change enables the fall-through FMA path for ROCm. It works for
the float32 type, and not for all tensor sizes. The change switches
off reporting MMA and async-op support to avoid NVIDIA inline asm
generation.
2023-02-12 17:25:48 +01:00
Philippe Tillet
3fa8a5a864 [OPTIMIZER] Fixed load/store rematerialization (#1177) 2023-02-11 01:21:10 -08:00
Philippe Tillet
2aba985daa [OPTIMIZER] Improved layout simplifications heuristics (#1168) 2023-02-09 20:17:25 -08:00
Keren Zhou
c61c8a123f [BACKEND] Disallow the CombineSelectMaskedLoad pattern if conditions of select and broadcast are different (#1170) 2023-02-09 18:03:22 -05:00
Philippe Tillet
0cbe368fe5 [OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169) 2023-02-09 18:43:33 +00:00