Commit Graph

492 Commits

Author SHA1 Message Date
Christian Sigg
01a93185a1 [BACKEND][OPTIMIZER] Switch from llvm::Optional to std::optional. (#1416) 2023-04-04 09:06:28 -07:00
Rahul Batra
30f51f3b50 get Arch Info using HSA
This is a combination of 5 commits.

look up triple and warpsize with HSA

This is a combination of 6 commits.

add scripts

create basic stub

Add HSA

This is a combination of 3 commits.

add hsa

move hsa file

add hsa include and lib

functional name string

simplify gfx look up

return warpsize

clean up unnecessary imports

remove scripts

use tuple

remove prints
2023-04-03 13:58:02 -05:00
Philippe Tillet
053af4e9f8 [FRONTEND] Refactor file hierarchy (#1464)
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is much better than before.

This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
2023-04-02 12:07:08 -07:00
Keren Zhou
28ea484dab [BACKEND] Clean up type inference functions (#1451)
And remove duplicate function definition.
2023-03-30 23:07:32 -07:00
Michael Melesse
5293288e77 [ROCM] Enable ROCM Backend #1.5: Address Remaining Comments from #1312 (#1434)
This PR addresses the remaining issues from #1312. It does the following:
* LLVM string join
* Adds a comment to the `GCNBuilder` class

---------

Co-authored-by: Rahul Batra <rahbatra@amd.com>
2023-03-28 17:23:57 -07:00
Keren Zhou
adc4d25276 [BACKEND] A general interface for initializing destination operands in load/store operations (#1427) 2023-03-27 22:13:01 -07:00
Chenggang Zhao
72b071253e [FRONTEND] Support block pointer semantics (#1392)
This PR introduces a new semantics: **block pointer**, which makes it
easier and faster for users to load a block from a parent tensor.

Below is a detailed API change by an example:
```
# Make a block pointer, which points to a block in the parent shape
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D column-major matrix 
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))

# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))

# Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")

# Store by a block pointer, in which the pointer and the value tensor should have the same shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-27 16:46:49 -07:00
Xuehai Pan
5b36cb48ad [CI][TEST] update pre-commit hooks and use pre-commit for style tests in CI (#1409)
Ref issue:

- #1408

Changes:

- Add `.editorconfig`
- Add `pre-commit-hooks`:

    ```yaml
    - repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v4.4.0
      hooks:
        - id: check-symlinks
        - id: destroyed-symlinks
        - id: trailing-whitespace
        - id: end-of-file-fixer
        - id: check-yaml
        - id: check-toml
        - id: check-ast
        - id: check-added-large-files
        - id: check-merge-conflict
        - id: check-executables-have-shebangs
        - id: check-shebang-scripts-are-executable
        - id: detect-private-key
        - id: debug-statements
    ```
- Add `flake8` to `pre-commit` config and add `.flake8` file
- Use `pre-commit` for style tests in CI
- Run `pre-commit` and fix existing violations:
    - fix trailing spaces
    - fix end-of-files
    - fix mod file mode with `chmod -x`
    - run `autopep8` on existing code
    - fix `flake8` violations
2023-03-25 14:52:16 -07:00
Michael Melesse
a9c87245b4 [ROCM] Enable ROCM Backend #1: Empty Kernel (#1312)
This PR is a first in a series of PRs to import the changes that we have
made to enable ROCM on [our
fork](https://github.com/ROCmSoftwarePlatform/triton) of triton.

The PR contains the major changes to the Python frontend and enough
changes to the C++ backend to allow compilation and running of the empty
kernel. We use the ROCm CI added a few weeks ago to verify things.

---------

Co-authored-by: Ronan Keryell <ronan@keryell.fr>
2023-03-24 17:18:27 -07:00
Philippe Tillet
c34ceca741 [BACKEND] Remove DotOpHelpers (i.e., decouple ConvertLayoutOpToLLVM and DotOpToLLVM) (#1383)
One long-standing issue in the backend has been the apparent complexity
of the tensor core codegen. This complexity mostly stems from the
existence of the `DotOpHelpers` utilities, which have become over time a
catch-all for all things related to MmaEncoding and DotOperandEncoding.

The purpose of this PR is to decouple what should be decoupled, as a
first step towards cleaning up our tensor core codegen. Other, more
local PRs will follow.
2023-03-21 15:24:28 -07:00
Rohit Santhanam
a3f18dba5b Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023 2023-03-20 12:51:52 +00:00
Philippe Tillet
e4b2d1bc3d [FRONTEND][BACKEND] no longer using indices for loops (#1370) 2023-03-19 14:57:50 -07:00
Rohit Santhanam
a84b4883e6 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023 2023-03-19 13:46:50 +00:00
peterbell10
c9740f0870 [OPTIMIZER] Add canonicalize/fold for ExpandDimsOp, ViewOp and BroadcastOp (#1354)
These eliminate no-op reshapes and simplify some combinations of view
ops into a single view; e.g., viewing a splat becomes a single splat.
2023-03-16 21:13:58 -07:00
Berke Kocaoğlu
ba91f39dbf [DOC] Fix syntax errors, typos, formatting; increase consistency (#1357)
This PR:
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to `k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
2023-03-16 15:32:02 -07:00
Rohit Santhanam
77dcb667b5 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03162023 2023-03-16 13:21:15 +00:00
Philippe Tillet
082828af47 [OPTIMIZER] Fixed up divisibility analysis in div operation (#1341) 2023-03-14 18:17:05 -07:00
Rohit Santhanam
6ff54b495c Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03122023 2023-03-13 18:09:12 +00:00
Philippe Tillet
9b7c65a3a9 [BACKEND][OPTIMIZER] Refactor MMAv1 codegen (#1322)
- Significant simplification of the optimizer pipeline. The right MMA
version is now set directly after the coalescing pass. The DotOperand
layout no longer holds an `isRow` state, and instead queries it from
its parent.
- Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to
TritonGPUAttrDefs. All MMAv1 state is now queried from attributes.
- The logic for getElemsPerThread is no longer duplicated in TypeConverter.
2023-03-12 19:54:38 -07:00
Christian Sigg
64fc0e23ce [BACKEND] Fix triton-convert-arith-to-index. (#1310)
The dialect of created ops needs to be part of dependent dialects.
2023-03-12 19:43:41 -07:00
Philippe Tillet
3fe3adbcde [FRONTEND][BACKEND] Add support for float8e5m2 type (#1314) 2023-03-10 19:14:47 -08:00
B1tway
b5dc18d7c9 Added missing #ifdef and fixed code style 2023-03-07 11:32:52 +00:00
Philippe Tillet
3db55c5f94 [OPTIMIZER][BACKEND] Some backend and optimization passes clean-up (#1284)
* Cleaned up the pipeline pass. It now works when there are element-wise ops
between the load and the dot
* Made `splat` compatible with variables that have DotOperandLayout
* Moved rematerialization utils to a separate Transforms/Utility.cpp file.
2023-03-06 17:17:59 -08:00
B1tway
693cf89777 Changed warpSize = 32 to warpSize = 64 2023-03-06 18:33:35 +00:00
Keren Zhou
d376020f90 [FRONTEND][BACKEND] Implement tl.device_assert and rename tl.printf to tl.device_print (#1143)
Note that `tl.device_print` and `print` accept different arguments than
the normal Python `print`: the first argument must be a string, followed
by the variables to print.

Device side:

- `tl.device_print`
- `tl.device_assert`
- `print`
- `assert`

Compilation time:

- `tl.static_assert`
- `tl.static_print`

Usage example:

1.
```Python
tl.device_assert(x == 0, "x != 0")
```

Output:

```Python
...
python/test/unit/language/assert_helper.py:18: kernel: block: [0,0,0], thread: [33,0,0] Assertion `x != 0` failed.
...
```

2.
```Python
tl.device_print("hello ", x)
```

Output:

```Python
...
hello 1
...
```

The environment variable `TRITON_DEBUG` sets the default debugging flag; if it's true, `tl.device_assert` or `assert` will be skipped.
2023-03-04 08:08:29 -08:00
Philippe Tillet
fa0fbc937f [FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261)
* Frontend:
  - `int` kernel arguments are always signed
  - The loop induction variable is now determined by integer promotion on
lb/ub/step
* Optimizer:
  - Added a new ExtractSliceOp that enforces 32-bit offsets
* Backend:
  - Use 64-bit indices when lowering functions and control flow
  - Removed the `idx_val` macro and replaced it with `i32_val`
  - Cleaned up comments
  - Added a new ArithToIndex pass to make sure operations on indices are
done with the `index` dialect, which is converted to LLVM separately
using a 64-bit target
2023-03-01 23:09:48 -08:00
Rohit Santhanam
cd9ae1cd36 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023 2023-02-23 21:41:54 +00:00
Philippe Tillet
0ec277efc5 [OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232)
This shouldn't actually change the behavior of Triton -- only clean things up.
2023-02-22 13:54:55 -08:00
Keren Zhou
123c687ed9 [BACKEND] Rewrite Membar to fit the CF dialect (#1213) 2023-02-19 14:54:33 -08:00
Philippe Tillet
c1194bd237 [OPTIMIZER] Refined side-effect traits (#1216) 2023-02-19 01:21:19 -08:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Philippe Tillet
8bca84ce3d [OPTIMIZER] Bugfix in Combine.cpp ; Added trans support in Pipeline.cpp (#1174) 2023-02-14 13:36:44 -08:00
rsanthanam-amd
2ec42ea37b Merge pull request #117 from ROCmSoftwarePlatform/fix_sramecc_xnack_warnings_navi21
Fix warning on some amdgpu arch (i.e., navi21)
2023-02-14 07:37:42 -06:00
Chao Chen
c0a8c72343 update function to get full arch details and compile with arch details instead of hardcoding them 2023-02-14 12:59:26 +00:00
Rohit Santhanam
a2416e0901 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023 2023-02-11 14:48:19 +00:00
Philippe Tillet
2aba985daa [OPTIMIZER] Improved layout simplifications heuristics (#1168) 2023-02-09 20:17:25 -08:00
Philippe Tillet
0cbe368fe5 [OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169) 2023-02-09 18:43:33 +00:00
Yu Guo
474ed978b9 [BUILD] Fixed typo in CMake type tablegen (#1124) 2023-02-03 18:46:11 -08:00
Keren Zhou
bde52f9db2 [BACKEND] Fix alignment calculation (#1149)
`getDivisibility` represents whether the address in bytes is divisible by a
certain number, so we should convert `#aligned bytes` to `#aligned
elements`.
2023-02-03 17:20:23 -08:00
Rohit Santhanam
8cb6ab5b1a Merge remote-tracking branch 'upstream/main' into triton_mlir_IFU_02022023 2023-02-02 22:54:53 +00:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is considered not expensive since all threads will
load the same element
  - The `targetEncoding` may expose more vectorization opportunities (more
elements per thread on the first dim)

**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
2023-02-01 18:21:15 -08:00
Keren Zhou
71c6d56b9b [BACKEND] Hotfix to remove SameOperandsAndResultEncoding for the trans operation (#1136)
The order of the input encoding is permuted to form the output encoding.

For example:

```
#A_SHARED = #triton_gpu.shared<{vec = 2, perPhase = 2, maxPhase = 4, order = [1, 0]}>
#A_SHARED_T = #triton_gpu.shared<{vec = 2, perPhase = 2, maxPhase = 4, order = [0, 1]}>
%b = tt.trans %tensor : (tensor<16x32xf16, #A_SHARED>) -> tensor<32x16xf16, #A_SHARED_T>
```
2023-02-01 10:03:38 -08:00
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
The previous PR (https://github.com/openai/triton/pull/1113) forgot to consider
that a node may have multiple parents; visiting an instruction before all of
its parents violates the semantics of topological sort.

The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
2023-01-31 23:41:20 -08:00
Philippe Tillet
8fea1fb478 [FRONTEND] Adding static range (#1130)
Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a
custom implementation (#1113)"
2023-01-31 18:04:19 -08:00
Michael Melesse
a9f955f862 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-2023-30-1 2023-01-30 14:04:01 -06:00
Keren Zhou
bc8a26d56f [BACKEND] Replace mlir::topologicalSort with a custom implementation (#1113)
`multiRootTopologicalSort` is faster than `mlir::topologicalSort`
because it prunes nodes that have been visited before.
2023-01-29 18:57:21 -08:00
Keren Zhou
5bcf60a5c0 [BACKEND] Refactored the code to no longer include static functions in header files. (#1109) 2023-01-28 14:58:28 -08:00
Keren Zhou
34ac01e597 [BACKEND] Change DenseMap to std::map in Allocation (#1092)
Since we need to take the address/reference of each entry, it is not
appropriate to use DenseMap here, because rehashing a DenseMap will
reallocate buckets and change the addresses of entries. Unlike DenseMap,
associative containers in the STL do not invalidate references to existing
entries when other entries are inserted or deleted.

https://stackoverflow.com/questions/61771352/will-it-change-the-address-of-a-existed-keys-value-when-inserting-new-keys
2023-01-25 05:18:20 -08:00