Commit Graph

368 Commits

Author SHA1 Message Date
zahimoud
73b124155b [FRONTEND] Added typehints support to speedup triton kernel launch (#1431)
One possible optimization for kernel launch overhead: by adding type hints to the
kernel definition, we avoid having to run `hasattr` and `isinstance` for each
argument. Also added a regression unit test to make sure we keep the launch
overhead within an expected range.
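A plain-Python sketch of the idea behind this change (hypothetical names and logic, not Triton's actual launcher code: it assumes a launcher that pre-resolves argument handling from the hints):

```python
# Hypothetical sketch (not Triton's actual launcher): without annotations, the
# launcher must inspect every argument on every call; with annotations, the
# per-argument handling can be resolved once at definition time.
import inspect

def make_arg_converters(fn):
    """Pick one converter per parameter, chosen once from the type hints."""
    converters = []
    for p in inspect.signature(fn).parameters.values():
        if p.annotation is int:
            converters.append(int)            # resolved ahead of time
        elif p.annotation is float:
            converters.append(float)
        else:
            converters.append(lambda v: v)    # slow path: handled per call
    return converters

def kernel(x: int, y: float, z):
    return x, y, z

CONVERTERS = make_arg_converters(kernel)      # computed once, not per launch

def launch(*args):
    # No hasattr/isinstance checks in the hot path.
    return kernel(*(c(a) for c, a in zip(CONVERTERS, args)))
```

The point is that the branching on argument types moves out of the per-launch hot path and into a one-time setup step.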
2023-03-28 22:37:34 -07:00
Keren Zhou
ee593fca0b [BACKEND] Fix int8 dot (#1435) 2023-03-28 20:18:17 -07:00
Keren Zhou
adc4d25276 [BACKEND] A general interface for initializing destination operands in load/store operations (#1427) 2023-03-27 22:13:01 -07:00
Chenggang Zhao
72b071253e [FRONTEND] Support block pointer semantics (#1392)
This PR introduces new semantics: **block pointer**, which lets users load a
block from a parent tensor more easily and efficiently.

Below is a detailed API change by an example:
```python
# Make a block pointer, which points to a block in the parent tensor
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `block_shape`: the shape of the block
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D column-major matrix
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n),
                              offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))

# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))

# Load from a block pointer; the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")

# Store through a block pointer; the pointer and the value tensor must have the same shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-27 16:46:49 -07:00
Philippe Tillet
46672772b4 [FORMAT] autopep8 now uses max-line-length=88 (#1410) 2023-03-25 15:46:50 -07:00
Xuehai Pan
5b36cb48ad [CI][TEST] update pre-commit hooks and use pre-commit for style tests in CI (#1409)
Ref issue:

- #1408

Changes:

- Add `.editorconfig`
- Add `pre-commit-hooks`:

    ```yaml
    - repo: https://github.com/pre-commit/pre-commit-hooks
      rev: v4.4.0
      hooks:
        - id: check-symlinks
        - id: destroyed-symlinks
        - id: trailing-whitespace
        - id: end-of-file-fixer
        - id: check-yaml
        - id: check-toml
        - id: check-ast
        - id: check-added-large-files
        - id: check-merge-conflict
        - id: check-executables-have-shebangs
        - id: check-shebang-scripts-are-executable
        - id: detect-private-key
        - id: debug-statements
    ```
- Add `flake8` to `pre-commit` config and add `.flake8` file
- Use `pre-commit` for style tests in CI
- Run `pre-commit` and fix existing violations:
    - fix trailing spaces
    - fix end-of-files
    - fix mod file mode with `chmod -x`
    - run `autopep8` on existing code
    - fix `flake8` violations
2023-03-25 14:52:16 -07:00
peterbell10
6063fccd0b [FRONTEND][BACKEND] Lower tl.abs to math::Abs{I,F}Op (#1401)
This generates identical PTX for floating point, but for integer types
the resulting PTX is much better. For example `tl.abs` for int16
currently generates

```ptx
  cvt.s32.s16 %r1, %rs2;
  neg.s16     %rs4, %rs2;
  setp.lt.s32 %p4, %r1, 0;
  selp.b16    %rs3, %rs4, %rs2, %p4;
```

After, it becomes a single `abs.s16` instruction.

This also improves LLVM's ability to optimize floats, e.g. `abs(t) * abs(t)` is
now optimized to `t * t`, which didn't happen before.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-03-24 21:58:24 -07:00
Philippe Tillet
fc7c0b0e43 [FRONTEND] Removed torch dependency and cleaned up testing (#1394)
`assert triton.testing.allclose` -> `torch.testing.assert_allclose`
`triton.testing.assert_almost_equal` -> `torch.testing.assert_allclose`
2023-03-23 22:37:21 -07:00
xndcn
ff1d0377e0 [BACKEND] Fix wrong conversion from float8e5m2 <> bfloat16 (#1391)
The exponent compensation should be 0x3800 (112) instead of 0x3000 (96).
Also add a mantissa bit for float16 conversion to round to the nearest
float8e5m2.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-24 04:42:08 +00:00
Natalia Gimelshein
3239c93a93 [TEST] add a test for inductor normalization pattern (#1390) 2023-03-23 00:29:28 +00:00
xndcn
65d8d802d5 [BACKEND] Fix wrong conversion from float8e4m3 <> bfloat16 (#1384)
The exponent compensation should be 0x3c00 (120) instead of 0x3800 (112).
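The corrected constants in this and the related float8 fixes follow from the exponent-bias difference between the formats; a quick sanity check (assuming the standard biases: 127 for bfloat16, 15 for float8e5m2, 7 for float8e4m3, and bfloat16's 7 mantissa bits):

```python
# Exponent biases: bfloat16 has 8 exponent bits (bias 127),
# float8e5m2 has 5 (bias 15), float8e4m3 has 4 (bias 7).
BF16_BIAS = 127
E5M2_BIAS = 15
E4M3_BIAS = 7

# bfloat16 has 7 mantissa bits, so when re-biasing the exponent the bias
# difference lands at bit 7 of the 16-bit pattern.
MANTISSA_BITS_BF16 = 7

comp_e5m2 = (BF16_BIAS - E5M2_BIAS) << MANTISSA_BITS_BF16   # 112 << 7
comp_e4m3 = (BF16_BIAS - E4M3_BIAS) << MANTISSA_BITS_BF16   # 120 << 7

print(hex(comp_e5m2))  # 0x3800
print(hex(comp_e4m3))  # 0x3c00
```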
2023-03-21 18:58:13 -07:00
Keren Zhou
c1dd6df9ce [FRONTEND] Fix negative induction variable (#1382) 2023-03-21 08:38:16 -07:00
xndcn
84ffefc368 [BACKEND] Fix wrong conversion from float8e4m3 <> float16 (#1375)
After offset shifting, the exponent compensation should not be forgotten.
Also add back some comments from `legacy_backend`.
2023-03-20 21:45:25 -07:00
Phil Tillet
e650d3708b [FRONTEND] dot now uses tl.float32 by default for out_dtype. 2023-03-19 21:58:46 -07:00
Philippe Tillet
b4decbe155 [BACKEND] Now using call_once to initialize LLVM target (#1373) 2023-03-19 21:23:39 -07:00
Fei Hu
6366c5a254 [FRONTEND][BACKEND] Add support for FP16 output for tl.dot (#1258)
---------

Co-authored-by: Fei Hu <fhu@microsoft.com>
2023-03-19 19:52:14 -07:00
Philippe Tillet
39139258c8 [FRONTEND][BACKEND] tl.mathlib -> tl.math; internally reverted to mathlib -> libdevice (#1368) 2023-03-19 02:14:57 -07:00
rsanthanam-amd
c575911a01 [FRONTEND] Change libdevice to mathlib and fix abs (#1361)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-03-19 01:34:16 -07:00
Horace He
1d2871d0d1 [RUNTIME] Fix memory leak in (#1358)
Fixes a bug that causes Triton to leak 32 bytes on every kernel
invocation.

Also solves https://github.com/pytorch/pytorch/issues/96937
2023-03-16 17:52:06 -07:00
Berke Kocaoğlu
ba91f39dbf [DOC] Fix syntax errors, typos, formatting; increase consistency (#1357)
This PR:
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
2023-03-16 15:32:02 -07:00
Philippe Tillet
56b23f433d [TEST] Temporarily disable test_dot mode that fails because of ptxas/nvptx (#1344) 2023-03-15 01:17:48 -07:00
peterbell10
01b177afe7 [FRONTEND] Mangle signed and unsigned integer types differently (#1340)
This is cherry-picked from #1305

If you call a `JITFunction` twice in the same kernel, first with `int32`
then with `uint32`, the second call will treat the unsigned value as
signed. This passes through MLIR without error because MLIR uses the
same types for both, but different operation calls will be generated so
you may silently get the wrong result.
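A hypothetical sketch of the failure mode (not Triton's actual mangling scheme): if the specialization cache keys signed and unsigned types identically, the second call silently reuses the wrong compiled kernel.

```python
# Hypothetical cache sketch: keying specializations in a way that conflates
# int32 and uint32 makes the second call reuse the wrong kernel.
compiled = {}

def mangle_bad(dtype):
    return dtype.replace("u", "")      # "uint32" -> "int32": collision!

def mangle_good(dtype):
    return dtype                       # signed and unsigned stay distinct

def get_kernel(dtype, mangle):
    key = mangle(dtype)
    if key not in compiled:
        compiled[key] = f"kernel_for_{dtype}"   # "compile" on first use
    return compiled[key]

# With the bad mangling, the uint32 call silently gets the int32 kernel;
# with distinct mangling, each type compiles its own specialization.
```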
2023-03-14 22:29:18 -07:00
Philippe Tillet
6a8634e2a7 [BACKEND] No longer running LLVM-IR optimizations after codegen. (#1338)
This triggered some outrageous bugs. See #1337.
2023-03-13 22:50:15 -07:00
Philippe Tillet
3fe3adbcde [FRONTEND][BACKEND] Add support for float8e5m2 type (#1314) 2023-03-10 19:14:47 -08:00
Keren Zhou
8b25c30d39 [BACKEND] Fix bfloat16 flash attention (#1306)
See https://github.com/openai/triton/issues/1245 for more detailed
information

---------

Co-authored-by: giorgio-arena <arena.cpp@gmail.com>
2023-03-09 21:14:52 -08:00
Da Yan
902c61affb [BACKEND] Add arith::SelectOp => LLVM::SelectOp conversion (#1307) 2023-03-09 09:35:30 -08:00
Keren Zhou
78b311f6e2 [FRONTEND] Fix cast when both src_ty and dst_ty are of block_type (#1301)
Commonly used in atomic_rmw ops
2023-03-08 09:25:00 -08:00
Keren Zhou
4731f300d3 [BACKEND] Mask out wrapped threads in store ops (#1283) 2023-03-06 14:50:20 -08:00
Keren Zhou
d376020f90 [FRONTEND][BACKEND] Implement tl.device_assert and rename tl.printf to tl.device_print (#1143)
Note that `tl.device_print` and `print` accept different arguments than the
normal `print`: the first argument must be a string, followed by the variables
to print.

Device side:

- `tl.device_print`
- `tl.device_assert`
- `print`
- `assert`

Compilation time:

- `tl.static_assert`
- `tl.static_print`

Usage example:

1.
```Python
tl.device_assert(x == 0, "x != 0")
```

Output:

```
...
python/test/unit/language/assert_helper.py:18: kernel: block: [0,0,0], thread: [33,0,0] Assertion `x != 0` failed.
...
```

2.
```Python
tl.device_print("hello ", x)
```

Output:

```
...
hello 1
...
```

The environment variable `TRITON_DEBUG` sets the default debugging flag; unless it is set to true, `tl.device_assert` and the plain `assert` are skipped.
2023-03-04 08:08:29 -08:00
Keren Zhou
65e5a3bc24 [FRONTEND] Improve tl.full to accept both static and dynamic values (#1269) 2023-03-02 12:19:54 -08:00
Philippe Tillet
fa0fbc937f [FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261)
* Frontend:
  - `int` kernel arguments are always signed
  - The loop induction variable is now determined by integer promotion on lb/ub/step
* Optimizer:
  - Added a new ExtractSliceOp that enforces 32-bit offsets
* Backend:
  - Use 64-bit indices when lowering functions and control flow
  - Removed the `idx_val` macro and replaced it with `i32_val`
  - Cleaned up comments
  - Added a new ArithToIndex pass to make sure operations on indices are done with the `index` dialect, which gets converted to LLVM separately using a 64-bit target
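The integer-promotion rule in the frontend change can be sketched as picking the widest integer type among the loop bounds and step (a hypothetical sketch, not the actual MLIR implementation):

```python
# Hypothetical sketch of the promotion rule: the loop induction variable
# takes the widest integer type among lb, ub, and step.
WIDTHS = {"i1": 1, "i8": 8, "i16": 16, "i32": 32, "i64": 64}

def promote_induction_type(lb_ty, ub_ty, step_ty):
    """Return the widest of the three integer types."""
    return max((lb_ty, ub_ty, step_ty), key=WIDTHS.get)
```

So a loop only pays for 64-bit indices when one of its bounds or its step actually requires them.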
2023-03-01 23:09:48 -08:00
Keren Zhou
90fcb38c7b [BACKEND] Overwrite NVPTX converters for fp16<->fp32 and int16<->int32 to avoid ptxas problems (#1267) 2023-03-01 18:26:06 -08:00
Da Yan
cb7b315a17 [OPTIMIZER] Copying named attributes when converting from Triton to TritonGPU (#1265) 2023-03-01 12:31:46 -08:00
Da Yan
0eead250c1 [FRONTEND] add missing tensor/constexpr ops (#1249) 2023-02-24 18:45:22 +00:00
Philippe Tillet
ba0198326e [TESTS] make performance regression testing less strict (#1231) 2023-02-21 22:22:02 -08:00
Philippe Tillet
174f121c1c [TESTS] Added attention regression tests (#1227) 2023-02-21 20:22:36 -08:00
Philippe Tillet
307dde9cb5 [CI] revived regression tests (#1225) 2023-02-21 16:33:03 -08:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Philippe Tillet
9c330a411c [FRONTEND] fixed pinned memory exception behavior (#1197)
No longer raises an exception when the pointer is on "cpu" but is still
accessible from within kernels (e.g., pinned memory).
2023-02-15 17:40:45 -08:00
Philippe Tillet
e3941f9d09 [OPTIMIZER][BACKEND] Cleaned up Volta codegen (#1185) 2023-02-14 22:39:35 -08:00
Keren Zhou
6413c7b9de [BACKEND] Calculate correct warp ids for small matrices (#1180)
Fixing https://github.com/openai/triton/issues/1162

Add tests 16x16x16
2023-02-14 05:28:03 +00:00
Daniil Fukalov
3af678d097 [TEST] Fix typo. (#1164)
The line is a duplicate of line 1097; it seems like a typo.
2023-02-09 08:26:21 -08:00
fdrocha
972b761390 [FRONTEND] For __rshift__ operator, use arithmetic right shift if dtype is a signed int. (#1153) 2023-02-06 10:26:17 +00:00
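The signed/unsigned distinction above matters for negative values; a plain-Python illustration simulating 32-bit lanes (independent of Triton's actual codegen):

```python
# Simulate 32-bit right shifts in plain Python to show the difference.
def arithmetic_rshift32(x, n):
    return x >> n                     # Python's >> is already arithmetic

def logical_rshift32(x, n):
    return (x & 0xFFFFFFFF) >> n      # reinterpret as unsigned, then shift

print(arithmetic_rshift32(-8, 1))  # -4 (sign bit is replicated)
print(logical_rshift32(-8, 1))     # 2147483644 (zeros shifted in)
```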
Philippe Tillet
8a4ca2c61a [CI][TEST][FRONTEND] Various small fixes (#1150)
- cancels CI runs in progress when a PR is updated
- atomics tests now use small int values that can be represented exactly
- replaced some old-style formatting with f-strings
2023-02-03 18:12:34 -08:00
Chenggang Zhao
f86843f815 Change libdevice.bc Path in Core Tests (#1141)
Only test `libdevice.bc` shipped with triton
2023-02-02 20:01:12 -08:00
Philippe Tillet
ccd17d6bf9 [TESTS] Added test for flash-attention (#1138) 2023-02-01 11:26:29 -08:00
Philippe Tillet
c4b9d699d2 [FRONTEND][BACKEND] Fixed many bugs (#1122)
- **temporarily commenting assertion in `MemBar.cpp`. We need to fix
this! but for now the following patches will unblock a number of
users.**
- Fixed a frontend codegen issue for If / For / While. Emit an error when the
replaced values' types mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
2023-01-30 23:22:36 -08:00
Yan Chunwei
94b419c327 [FRONTEND] some tiny fix (#1120) 2023-01-30 19:39:38 -08:00
Nishant Sikarwar
e5dbe35cc1 [FRONTEND] removed unnecessary comprehension (#1085) 2023-01-30 19:42:14 +00:00
Keren Zhou
c59fb4acca [FRONTEND] Fix libdevice elementwise compute for constexpr (#1082) 2023-01-22 07:11:44 +00:00