Commit Graph

328 Commits

Author SHA1 Message Date
zahimoud
fd34b20fba [BACKEND] Fixed bug in reduce; add tests 2023-04-11 18:09:18 -07:00
Philippe Tillet
e0d6f5f4f5 [BUILD] updated LLVM binaries (#1504)
Co-authored-by: Christian Sigg <csigg@google.com>
2023-04-11 00:14:00 -07:00
Keren Zhou
6d0ed41307 [BACKEND] Replace Func Dialect with custom triton ops (func, call, return) (#1502)
MLIR currently only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.


https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3

We could revert this once they've designed a better inliner interface.

Inlining attributes will be implemented in the next PR since this PR is
already huge.
2023-04-10 21:08:40 -07:00
Philippe Tillet
640f3c3921 [OPTIMIZER] Tweaked layout removal conversion heuristics (#1501)
Loads are now considered cheap to rematerialize when there are more
threads than elements in the tensor.
2023-04-10 15:19:08 -07:00
Keren Zhou
032509384a [ANALYSIS] Fine-tune comments for shared memory allocation (#1492)
Also add a new test to check multiple-color cases, which had never been
tested before.
2023-04-10 09:00:36 -07:00
Philippe Tillet
adc760dac1 [OPTIMIZER] enable loop pipelining using pointer increments from vector look-up tables (#1490) 2023-04-10 08:59:42 -07:00
Philippe Tillet
b86425a28e [TEST] made lut_bmm pipeline test more concise and specific (#1488) 2023-04-08 19:17:35 -07:00
long.chen
f7ad8ae022 [Refine] remove const ref of mlir::Attribute (#1486)
https://mlir.llvm.org/docs/DefiningDialects/AttributesAndTypes/

https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#f16-for-in-parameters-pass-cheaply-copied-types-by-value-and-others-by-reference-to-const
```
The C++ Attribute and Type classes in MLIR (like Ops, and many other things) are value-typed. 
This means that instances of Attribute or Type are passed around by-value, 
as opposed to by-pointer or by-reference. 
The Attribute and Type classes act as wrappers around internal storage objects that are uniqued within an instance of an MLIRContext.
```
2023-04-08 10:38:59 -07:00
Philippe Tillet
47e73aadda [BACKEND] Revert inline PTX for conversions supported by LLVM (#1474)
No longer needed now that we initialize all registers. The motivation
for reverting this workaround now that we can is that it introduced
performance regressions.
2023-04-04 17:52:26 -07:00
Christian Sigg
01a93185a1 [BACKEND][OPTIMIZER] Switch from llvm::Optional to std::optional. (#1416) 2023-04-04 09:06:28 -07:00
Philippe Tillet
053af4e9f8 [FRONTEND] Refactor file hierarchy (#1464)
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is much better than before.

This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
2023-04-02 12:07:08 -07:00
Keren Zhou
0855cacdd8 [BACKEND] Fix small matmul dot (#1463)
https://github.com/openai/triton/issues/1449

In theory, we might be able to support even 8x8 dot if we also wrap
around `cOff`.
2023-04-02 02:05:05 +00:00
Keren Zhou
801bb9d3b5 [ANALYSIS] Fix divisibility calculation for addptr (#1453) 2023-03-31 17:57:31 -07:00
Keren Zhou
28ea484dab [BACKEND] Clean up type inference functions (#1451)
And remove duplicate function definition.
2023-03-30 23:07:32 -07:00
Keren Zhou
43eed392df [BACKEND] Fix tl.exp for fp16 (#1440)
https://github.com/openai/triton/issues/1438
https://github.com/openai/triton/issues/1360
2023-03-29 16:34:23 -07:00
Keren Zhou
ee593fca0b [BACKEND] Fix int8 dot (#1435) 2023-03-28 20:18:17 -07:00
Keren Zhou
3342cc1c0c [OPTIMIZER] Do not create yield if yieldValues is empty (#1437)
https://github.com/openai/triton/issues/1432
2023-03-28 19:33:52 -07:00
Keren Zhou
adc4d25276 [BACKEND] A general interface for initializing destination operands in load/store operations (#1427) 2023-03-27 22:13:01 -07:00
Chenggang Zhao
72b071253e [FRONTEND] Support block pointer semantics (#1392)
This PR introduces a new semantics: **block pointer**, which makes it
easier and faster for users to load a block from a parent tensor.

Below is a detailed walkthrough of the API changes by example:
```python
# Make a block pointer, which points to a block in the parent shape
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D column-major matrix 
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))

# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))

# Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")

# Store by a block pointer, in which the pointer and the value tensor should have the same shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```
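To make these semantics concrete without a GPU, below is a minimal NumPy sketch of a 2D block pointer with `advance` and a zero-padded, boundary-checked `load`. The `BlockPtr` class and helper signatures are invented for illustration; they model the behavior described above, not Triton's implementation.

```python
import numpy as np

class BlockPtr:
    """Toy 2D block pointer over a row-major NumPy parent tensor."""
    def __init__(self, base, block_shape, offsets):
        self.base = base                  # parent tensor
        self.block_shape = block_shape    # shape of the block
        self.offsets = tuple(offsets)     # current block offsets

def make_block_ptr(base, block_shape, offsets=(0, 0)):
    return BlockPtr(base, block_shape, offsets)

def advance(ptr, offsets):
    # Striding information lives in the parent array, so advancing
    # only updates the element offsets.
    new = [o + d for o, d in zip(ptr.offsets, offsets)]
    return BlockPtr(ptr.base, ptr.block_shape, new)

def load(ptr, padding=0.0):
    # Boundary check on both dims: out-of-bounds elements become `padding`
    # (the "zero" padding policy).
    bm, bn = ptr.block_shape
    om, on = ptr.offsets
    M, N = ptr.base.shape
    out = np.full((bm, bn), padding, dtype=ptr.base.dtype)
    i0, i1 = max(om, 0), min(om + bm, M)
    j0, j1 = max(on, 0), min(on + bn, N)
    if i0 < i1 and j0 < j1:
        out[i0 - om:i1 - om, j0 - on:j1 - on] = ptr.base[i0:i1, j0:j1]
    return out
```

For instance, advancing a 2x2 block to offsets (3, 0) of a 4x4 parent loads one valid row and one row of zero padding.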

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-27 16:46:49 -07:00
Xuehai Pan
5b36cb48ad [CI][TEST] update pre-commit hooks and use pre-commit for style tests in CI (#1409)
Ref issue:

- #1408

Changes:

- Add `.editorconfig`
- Add `pre-commit-hooks`:

    ```yaml
    - repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v4.4.0
      hooks:
        - id: check-symlinks
        - id: destroyed-symlinks
        - id: trailing-whitespace
        - id: end-of-file-fixer
        - id: check-yaml
        - id: check-toml
        - id: check-ast
        - id: check-added-large-files
        - id: check-merge-conflict
        - id: check-executables-have-shebangs
        - id: check-shebang-scripts-are-executable
        - id: detect-private-key
        - id: debug-statements
    ```
- Add `flake8` to `pre-commit` config and add `.flake8` file
- Use `pre-commit` for style tests in CI
- Run `pre-commit` and fix existing violations:
    - fix trailing spaces
    - fix end-of-files
    - fix mod file mode with `chmod -x`
    - run `autopep8` on existing code
    - fix `flake8` violations
2023-03-25 14:52:16 -07:00
peterbell10
6063fccd0b [FRONTEND][BACKEND] Lower tl.abs to math::Abs{I,F}Op (#1401)
This generates identical PTX for floating point, but for integer types
the resulting PTX is much better. For example `tl.abs` for int16
currently generates

```ptx
  cvt.s32.s16 %r1, %rs2;
  neg.s16     %rs4, %rs2;
  setp.lt.s32 %p4, %r1, 0;
  selp.b16    %rs3, %rs4, %rs2, %p4;
```

After, it becomes a single `abs.s16` instruction.

This also improves LLVM's ability to optimize floats; e.g. `abs(t) *
abs(t)` is now optimized to `t * t`, which didn't happen before.
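A single hardware `abs` amounts to the classic branch-free mask trick. Here is a rough Python model of it (the helper name is invented; note that real 16-bit hardware wraps abs(-32768) back to -32768, which Python's unbounded ints do not):

```python
def abs_s16(x):
    # x: Python int in [-32768, 32767]. Python's >> on negative ints is
    # an arithmetic shift, matching the hardware behavior.
    mask = x >> 15            # -1 (all ones) if x < 0, else 0
    return (x ^ mask) - mask  # flip the bits and add 1 when negative
```

This computes the same result as the four-instruction compare/select sequence above, in one branch-free expression.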

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-03-24 21:58:24 -07:00
Michael Melesse
a9c87245b4 [ROCM] Enable ROCM Backend #1: Empty Kernel (#1312)
This PR is a first in a series of PRs to import the changes that we have
made to enable ROCM on [our
fork](https://github.com/ROCmSoftwarePlatform/triton) of triton.

The PR contains the major changes to the Python frontend and enough
changes to the C++ backend to allow compiling and running the empty
kernel. We use the ROCm CI added a few weeks ago to verify things.

---------

Co-authored-by: Ronan Keryell <ronan@keryell.fr>
2023-03-24 17:18:27 -07:00
Keren Zhou
b7762bee2c [TEST] Cleanup SCF dialect in tests (#1402) 2023-03-24 09:21:40 -07:00
Philippe Tillet
fc7c0b0e43 [FRONTEND] Removed torch dependency and cleaned up testing (#1394)
`assert triton.testing.allclose` -> `torch.testing.assert_allclose`
`triton.testing.assert_almost_equal` -> `torch.testing.assert_allclose`
2023-03-23 22:37:21 -07:00
xndcn
ff1d0377e0 [BACKEND] Fix wrong conversion from float8e5m2 <> bfloat16 (#1391)
The exponent compensation should be 0x3800 (112) instead of 0x3000 (96).
Also add a mantissa bit for the float16 conversion so it rounds to the
nearest float8e5m2.
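As a sanity check of the bias arithmetic: e5m2 has exponent bias 15 and bfloat16 has bias 127, so rebasing adds 127 - 15 = 112 in the exponent field, i.e. 112 << 7 == 0x3800 in the bit pattern. A minimal Python sketch for normal values (the helper name is made up; subnormals, inf, and NaN need extra handling that is omitted here):

```python
def e5m2_to_bf16_bits(b):
    """Convert a normal fp8e5m2 byte (s eeeee mm) to bfloat16 bits (s eeeeeeee mmmmmmm)."""
    sign = (b >> 7) & 0x1
    exp  = (b >> 2) & 0x1F   # 5-bit exponent, bias 15
    man  = b & 0x3           # 2-bit mantissa
    # Rebias 15 -> 127: add 112, which is exactly the 0x3800 compensation.
    return (sign << 15) | ((exp + 112) << 7) | (man << 5)
```

For example, the e5m2 encoding of 1.0 (0x3C) maps to the bfloat16 encoding of 1.0 (0x3F80).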

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-24 04:42:08 +00:00
Keren Zhou
c9f47d9094 [BACKEND] Init values before load to avoid ptxas issues (#1396) 2023-03-23 17:24:03 -07:00
Keren Zhou
2ba77a9212 [OPTIMIZER] Fix a typo in SimplifyReduceCvt (#1385) 2023-03-21 22:45:58 -07:00
xndcn
65d8d802d5 [BACKEND] Fix wrong conversion from float8e4m3 <> bfloat16 (#1384)
The exponent compensation should be 0x3c00 (120) instead of 0x3800 (112).
2023-03-21 18:58:13 -07:00
Philippe Tillet
c34ceca741 [BACKEND] Remove DotOpHelpers (i.e., decouple ConvertLayoutOpToLLVM and DotOpToLLVM) (#1383)
One long-standing issue in the backend has been the apparent complexity
of the tensor core codegen. This complexity mostly stems from the
existence of the `DotOpHelpers` utilities, which have become over time a
catch-all for all things related to MmaEncoding and DotOperandEncoding.

The purpose of this PR is to decouple what should be decoupled, as a
first step towards cleaning up our tensor core codegen. Other, more
local PRs will follow.
2023-03-21 15:24:28 -07:00
xndcn
84ffefc368 [BACKEND] Fix wrong conversion from float8e4m3 <> float16 (#1375)
After offset shifting, the exponent compensation should not be forgotten.
Also add back some comments from `legacy_backend`.
2023-03-20 21:45:25 -07:00
Keren Zhou
e281bd9fe9 [OPTIMIZER] Ensure the conversion of blockArgument is placed at the beginning of the block (#1379)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-20 21:19:26 -04:00
Keren Zhou
23fc647a3e [OPTIMIZER] Fix optimizer hanging caused by SimplifyReduceCvt (#1377)
https://github.com/openai/triton/issues/1328

Match the convert_layout operation in SimplifyReduceCvt
(convert_layout->reduce). This way we don't miss higher priority rewrite
patterns like RematerializeBackward and SimplifyConversion. We also need
to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit
= 3, and RematerializeForward's benefit = 2.
2023-03-20 16:20:19 -07:00
Philippe Tillet
29d01ba5f3 [OPTIMIZER] We shouldn't try to rematerialize view/cat forward since output layout can't be deduced automatically (#1378) 2023-03-20 14:26:50 -07:00
Keren Zhou
78d5900467 [OPTIMIZER] Improve pipeline to handle general indirect access to matrices (#1291)
Differentiate between immediate and non-immediate block arguments. 
If we have a load that immediately depends on a block argument in the
current iteration, it is an immediate dependency. Otherwise, it is a
non-immediate dependency, which means the load depends on a block
argument in the previous iterations.

For example:
```
scf.for (%arg0, %arg1, %arg2) {
  %0 = load %arg0  <--- immediate dep: this address is initialized at numStages-2
  %1 = load %arg1
  %2 = add %1, %arg2
  %3 = load %2  <--- non-immediate dep: %arg1 must be an up-to-date value
}
```

The above code pattern is commonly seen in cases where we have indirect
memory accesses using a lookup table, such as PyTorch's `bsr_dense_bmm`.
This PR improves `bsr_dense_bmm` by about 20% on the unit test cases.
2023-03-20 14:39:47 -04:00
Philippe Tillet
fe9dc4b58e [OPTIMIZER] Restored ViewOp/CatOp passthrough in simulateBackwardRematerialization (#1376) 2023-03-20 11:02:54 -07:00
Philippe Tillet
b4decbe155 [BACKEND] Now using call_once to initialize LLVM target (#1373) 2023-03-19 21:23:39 -07:00
Fei Hu
6366c5a254 [FRONTEND][BACKEND] Add support for FP16 output for tl.dot (#1258)
---------

Co-authored-by: Fei Hu <fhu@microsoft.com>
2023-03-19 19:52:14 -07:00
Philippe Tillet
e4b2d1bc3d [FRONTEND][BACKEND] no longer using indices for loops (#1370) 2023-03-19 14:57:50 -07:00
Philippe Tillet
28e05c9799 [OPTIMIZER] Canonicalize convert_layout(cat: #layout1) -> #layout2 as cat: #layout2 (#1369)
We can do that because `cat` reorders elements anyway.
2023-03-19 14:16:55 -07:00
Philippe Tillet
39139258c8 [FRONTEND][BACKEND] tl.mathlib -> tl.math; internally reverted to mathlib -> libdevice (#1368) 2023-03-19 02:14:57 -07:00
rsanthanam-amd
c575911a01 [FRONTEND] Change libdevice to mathlib and fix abs (#1361)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-03-19 01:34:16 -07:00
Philippe Tillet
02caa8a652 [OPTIMIZER] Better handling of control flow in Triton -> TritonGPU conversion (#1367) 2023-03-18 23:00:19 -07:00
peterbell10
c9740f0870 [OPTIMIZER] Add canonicalize/fold for ExpandDimsOp, ViewOp and BroadcastOp (#1354)
These eliminate no-op reshapes and simplify some combinations of view
ops into a single view; e.g., viewing a splat becomes a single splat.
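The splat fold can be illustrated in plain Python with NumPy (a toy helper, not the actual MLIR folder):

```python
import numpy as np

def fold_view_of_splat(value, new_shape):
    # view(splat(v)) -> splat(v) with the view's target shape:
    # reshaping a constant-filled tensor never changes its contents,
    # so the view op can be folded away entirely.
    return np.full(new_shape, value)
```

The fold is sound because `np.full((4, 8), 3.0).reshape(2, 16)` and `fold_view_of_splat(3.0, (2, 16))` are element-for-element identical.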
2023-03-16 21:13:58 -07:00
Berke Kocaoğlu
ba91f39dbf [DOC] Fix syntax errors, typos, formatting; increase consistency (#1357)
This PR:
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
2023-03-16 15:32:02 -07:00
Da Yan
9d5505d043 [OPTIMIZER] Infer the alignment info of loops' induction variables (#1350)
Before this PR, the alignment info of loops' induction variables (IVs)
was lost. For example:
```
for n in range(0, K, BLOCK):
    x = base + n
               ^--  Triton doesn't know n is always a multiple of BLOCK
```

This PR fixes this.
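The propagation rule for addition can be sketched as a gcd lattice (a toy helper, not Triton's actual AxisInfo API):

```python
from math import gcd

def align_add(a, b):
    # If x is divisible by a and y is divisible by b,
    # then (x + y) is divisible by gcd(a, b).
    return gcd(a, b)
```

With `base` divisible by 16 and the IV `n` stepping by BLOCK = 32 (hence divisible by 32), `x = base + n` keeps divisibility gcd(16, 32) = 16, which is what enables vectorized memory access inside the loop.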

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-16 00:39:08 -07:00
Philippe Tillet
082828af47 [OPTIMIZER] Fixed up divisibility analysis in div operation (#1341) 2023-03-14 18:17:05 -07:00
Keren Zhou
da0b0bfde6 [BACKEND] Still run llvm-opt but set optLevel to 0 to avoid the abs(float) bug (#1339)
https://github.com/openai/triton/issues/1337
2023-03-14 12:38:57 -07:00
Philippe Tillet
6a8634e2a7 [BACKEND] No longer running LLVM-IR optimizations after codegen. (#1338)
This triggered some outrageous bugs. See #1337.
2023-03-13 22:50:15 -07:00
Philippe Tillet
6539395337 [OPTIMIZER] CatOp is now marked as not having invertible layout (#1332) 2023-03-13 15:42:48 -07:00
Philippe Tillet
9b7c65a3a9 [BACKEND][OPTIMIZER] Refactor MMAv1 codegen (#1322)
- Significant simplification of the optimizer pipeline. The right MMA
version is now set directly after the coalescing pass. The DotOperand
layout no longer holds an `isRow` state, and instead queries it from
its parent.
- Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to
TritonGPUAttrDefs. All MMAv1 state is now queried from attributes.
- The logic for `getElemsPerThread` is no longer duplicated in the
TypeConverter.
2023-03-12 19:54:38 -07:00