Commit Graph

791 Commits

Author SHA1 Message Date
Philippe Tillet
7c7b769e37 [SETUP] Fixed dependencies (#1389) 2023-03-22 16:15:35 -07:00
Keren Zhou
2ba77a9212 [OPTIMIZER] Fix a typo in SimplifyReduceCvt (#1385) 2023-03-21 22:45:58 -07:00
xndcn
65d8d802d5 [BACKEND] Fix wrong conversion from float8e4m3 <> bfloat16 (#1384)
the exponent compensation should be 0x3c00 (120) instead of 0x3800 (112)
2023-03-21 18:58:13 -07:00
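The arithmetic behind this fix can be checked directly. This is a hedged sketch of the bias math only (the constant names are illustrative, not triton's): float8e4m3 has exponent bias 7 and bfloat16 has bias 127, and in the bfloat16 bit layout (1 sign, 8 exponent, 7 mantissa bits) the exponent field starts at bit 7.

```python
# Bias difference between bfloat16 (bias 127) and float8e4m3 (bias 7).
FP8E4M3_BIAS = 7
BF16_BIAS = 127
BF16_MANTISSA_BITS = 7  # exponent field starts above the 7 mantissa bits

# Shift the bias difference into the bfloat16 exponent field position:
# 120 << 7 = 0x3C00, matching the corrected constant in the commit.
compensation = (BF16_BIAS - FP8E4M3_BIAS) << BF16_MANTISSA_BITS
print(hex(compensation))  # 0x3c00
```

The previously used 0x3800 corresponds to 112 << 7, i.e. a bias difference of 112 rather than the correct 120.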
Phil Tillet
08f705d193 [DOCS] Better contributing guidelines 2023-03-21 16:21:43 -07:00
Philippe Tillet
c34ceca741 [BACKEND] Remove DotOpHelpers (i.e., decouple ConvertLayoutOpToLLVM and DotOpToLLVM) (#1383)
One long-standing issue in the backend has been the apparent complexity
of the tensor core codegen. This complexity mostly stems from the
existence of the `DotOpHelpers` utilities, which have over time become a
catch-all for all things related to MmaEncoding and DotOperandEncoding.

The purpose of this PR is to decouple what should be decoupled, as a
first step towards cleaning up our tensor core codegen. Other, more
local PRs will follow.
2023-03-21 15:24:28 -07:00
mcskatkat
9ae78d21f1 [FRONTEND] CompilationError._format_message issue + tidying (#1362)
- fixed `CompilationError._format_message` failing when `error_message` is
a `constexpr`
- factored out `_is_constexpr()` checks and `_unwrap_if_constexpr()`
idioms
- added `UnsupportedLanguageConstruct` exception, replacing some Python
builtin exceptions raised in such cases
- Some hardening in `.visit_If()`
- cleaner exception handling in `build_triton_ir()`
2023-03-21 19:52:18 +00:00
Keren Zhou
c1dd6df9ce [FRONTEND] Fix negative induction variable (#1382) 2023-03-21 08:38:16 -07:00
xndcn
84ffefc368 [BACKEND] Fix wrong conversion from float8e4m3 <> float16 (#1375)
after offset shifting, the exponent compensation must not be forgotten
also add back some comments from `legacy_backend`
2023-03-20 21:45:25 -07:00
Keren Zhou
e281bd9fe9 [OPTIMIZER] Ensure the conversion of blockArgument is placed at the beginning of the block (#1379)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-20 21:19:26 -04:00
Keren Zhou
23fc647a3e [OPTIMIZER] Fix optimizer hang caused by SimplifyReduceCvt (#1377)
https://github.com/openai/triton/issues/1328

Match the convert_layout operation in SimplifyReduceCvt
(convert_layout->reduce). This way we don't miss higher priority rewrite
patterns like RematerializeBackward and SimplifyConversion. We also need
to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit
= 3, and RematerializeForward's benefit = 2.
2023-03-20 16:20:19 -07:00
Philippe Tillet
29d01ba5f3 [OPTIMIZER] We shouldn't try to rematerialize view/cat forward since output layout can't be deduced automatically (#1378) 2023-03-20 14:26:50 -07:00
Keren Zhou
78d5900467 [OPTIMIZER] Improve pipeline to handle general indirect access to matrices (#1291)
Differentiate between immediate and non-immediate block arguments. 
If we have a load that immediately depends on a block argument in the
current iteration, it is an immediate dependency. Otherwise, it is a
non-immediate dependency, which means the load depends on a block
argument in the previous iterations.

For example:
```
scf.for (%arg0, %arg1, %arg2) {
%0 = load %arg0  <--- immediate dep, this address is initialized at numStages-2
%1 = load %arg1
%2 = add %1, %arg2
%3 = load %2  <--- non-immediate dep, %arg1 must be an up-to-date value
}
```

The above code pattern is commonly seen in cases where we have indirect
memory accesses using a lookup table, such as PyTorch's `bsr_dense_bmm`.
This PR improves `bsr_dense_bmm` by about 20% on the unit test cases.
2023-03-20 14:39:47 -04:00
Philippe Tillet
fe9dc4b58e [OPTIMIZER] Restored ViewOp/CatOp passthrough in simulateBackwardRematerialization (#1376) 2023-03-20 11:02:54 -07:00
Phil Tillet
e650d3708b [FRONTEND] dot now uses tl.float32 by default for out_dtype. 2023-03-19 21:58:46 -07:00
Philippe Tillet
b4decbe155 [BACKEND] Now using call_once to initialize LLVM target (#1373) 2023-03-19 21:23:39 -07:00
Fei Hu
6366c5a254 [FRONTEND][BACKEND] Add support for FP16 output for tl.dot (#1258)
---------

Co-authored-by: Fei Hu <fhu@microsoft.com>
2023-03-19 19:52:14 -07:00
Philippe Tillet
e4b2d1bc3d [FRONTEND][BACKEND] no longer using indices for loops (#1370) 2023-03-19 14:57:50 -07:00
Philippe Tillet
28e05c9799 [OPTIMIZER] Canonicalize convert_layout(cat: #layout1) -> #layout2 as cat: #layout2 (#1369)
We can do that because `cat` reorders elements anyways
2023-03-19 14:16:55 -07:00
Philippe Tillet
39139258c8 [FRONTEND][BACKEND] tl.mathlib -> tl.math; internally reverted to mathlib -> libdevice (#1368) 2023-03-19 02:14:57 -07:00
rsanthanam-amd
c575911a01 [FRONTEND] Change libdevice to mathlib and fix abs (#1361)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-03-19 01:34:16 -07:00
Philippe Tillet
02caa8a652 [OPTIMIZER] Better handling of control flow in Triton -> TritonGPU conversion (#1367) 2023-03-18 23:00:19 -07:00
Philippe Tillet
2f035c0611 [FRONTEND] Fix contains_return_op when analyzing functions in another module (#1365) 2023-03-18 15:02:45 -07:00
Edward Z. Yang
6d61a5ca23 [FRONTEND] Don't use HOME envvar to get HOME (#1364)
Fixes https://github.com/pytorch/pytorch/issues/97076
2023-03-18 10:39:58 -07:00
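For context on why relying on the `HOME` environment variable is fragile: `HOME` may be unset or point somewhere unwritable, whereas the standard-library helpers fall back to the platform's user database. A minimal sketch of the robust lookup (this illustrates the general pattern, not triton's exact replacement code):

```python
from pathlib import Path

# Path.home() uses os.path.expanduser("~"), which on POSIX falls back to
# the pwd database when HOME is unset, instead of raising KeyError the way
# a direct os.environ["HOME"] lookup would.
home = Path.home()
print(home)
```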
peterbell10
c9740f0870 [OPTIMIZER] Add canonicalize/fold for ExpandDimsOp, ViewOp and BroadcastOp (#1354)
These eliminate no-op reshapes and simplify some combinations of view
ops into a single view; e.g., viewing a splat becomes a single splat.
2023-03-16 21:13:58 -07:00
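The "view of a splat becomes a splat" fold can be illustrated outside the compiler. This is a conceptual pure-Python analogue (the `splat`/`view` helpers are illustrative, not triton IR): reshaping a constant-filled tensor yields exactly the tensor you would get by splatting the constant into the target shape directly, so the view op can be folded away.

```python
def splat(shape, value):
    # A "splat": every element holds the same constant.
    n = 1
    for d in shape:
        n *= d
    return {"shape": shape, "data": [value] * n}

def view(t, new_shape):
    # A view/reshape keeps the flat data and only changes the shape.
    return {"shape": new_shape, "data": t["data"]}

# view(splat(...)) is indistinguishable from a direct splat of the new shape,
# so the canonicalizer can replace the pair with a single splat.
assert view(splat((4, 8), 3.0), (2, 16)) == splat((2, 16), 3.0)
```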
Horace He
1d2871d0d1 [RUNTIME] Fix memory leak in (#1358)
Fixes a bug that causes Triton to leak 32 bytes on every kernel
invocation.

Also solves https://github.com/pytorch/pytorch/issues/96937
2023-03-16 17:52:06 -07:00
mcskatkat
611a2dc9bf [FRONTEND] CodeGenerator: enhanced (#1355)
Contents of this change to `CodeGenerator`:
- addressed mutable default value in constructor (GitHub #1353)
- structured and faster name lookup (replaces `.get_value`)
- added informative error messages in some places
- tidy mechanism for "static" (compile time) functions replaces inline
`if ... elif ...` chain in `.visit_Call`
- more robust `static_assert` and `static_print`
- more informative `CompilationError` display (saves scrolling up
through long tracebacks)
- dedicated `CompileTimeAssertionFailure` exception for `static_assert`
can be specially treated upstream by `Autotuner` to skip configurations
that violate constraints (as for `OutOfResources`)

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-16 17:00:43 -07:00
Berke Kocaoğlu
ba91f39dbf [DOC] Fix syntax errors, typos, formatting; increase consistency (#1357)
This PR:
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
2023-03-16 15:32:02 -07:00
Phil Tillet
d00bc5af67 [README] Now saying we won't accept PRs that fix simple typos in our documentation 2023-03-16 12:43:47 -07:00
mcskatkat
53e8e04d6e [FRONTEND] fix constexpr by annotation (#1352)
Fixed unjustified `TypeError` raised when arg is (strangely) annotated
with a non-type
2023-03-16 11:10:19 -07:00
mcskatkat
f5d22d5995 [FRONTEND] support f-strings in compiler with constexpr conversion (#1349)
This addition allows explanatory messages upon assertion failures:

```python
@triton.jit
def my_single_block_kernel(
    matrix_extent: tl.constexpr,
    block_size: tl.constexpr,      # must be >= extent (single block)
    matrix: Tensor,
    ...
):
    tl.static_assert(matrix_extent <= block_size, 
                     f"`matrix_extent` should not be more than the block size ({block_size}), but is {matrix_extent}")
```

Yielding, when called incorrectly:
```
AssertionError: `matrix_extent` should not be more than the block size (32), but is 57
```
2023-03-16 08:02:10 +00:00
Da Yan
9d5505d043 [OPTIMIZER] Infer the alignment info of loops' induction variables (#1350)
Before this PR, the alignment info of loops' induction variables (IVs) was lost.
For example:
```
for n in range(0, K, BLOCK):
     x = base + n
                       ^--  Triton doesn't know n is always a multiple of BLOCK
```

This PR fixes this.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-16 00:39:08 -07:00
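The invariant the analysis now tracks can be checked concretely: every value a `range(0, K, BLOCK)` induction variable takes is a multiple of `BLOCK`, which is what lets Triton propagate divisibility through `base + n`.

```python
# Every induction-variable value in a step-BLOCK range is a multiple of BLOCK.
K, BLOCK = 1024, 64
assert all(n % BLOCK == 0 for n in range(0, K, BLOCK))
```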
Shintaro Iwasaki
4b774ee4d0 [OPS/BLOCKSPARSE] remove unnecessary mask (#1351)
This PR applies a minor patch that removes unnecessary masks in
`_dsd_kernel()`.

### Details

`offs_bn` is defined as follows and is not updated after that.
```py
offs_bn = pid_m * TILE_N + tl.arange(0, TILE_N)
offs_bn = tl.max_contiguous(tl.multiple_of(offs_bn % DS0, TILE_N), TILE_N)
```

Because `offs_bn = offs_bn % DS0`, this mask is always `True`.
```py
b = tl.load(pb, mask=offs_bn[None, :] < DS0)
```
This PR removes this mask (as well as explicit `mask=True`).
2023-03-15 19:06:38 -07:00
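The reason the mask is provably `True`: for any non-negative `a` and positive `m`, `a % m` lies in `[0, m)`, so `(a % m) < m` always holds. A quick property check of that fact:

```python
import random

# Exhaustively sampling the modulo property behind the removed mask:
# a % m is always strictly less than m for non-negative a and positive m.
for _ in range(1000):
    a = random.randrange(0, 10**6)
    m = random.randrange(1, 10**4)
    assert (a % m) < m
```

Since `offs_bn` has already been reduced modulo `DS0`, the mask `offs_bn[None, :] < DS0` can never be `False`.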
mcskatkat
c175473bbf [FRONTEND] In JITFunction: infer constexpr arg only if annotated as such (#1345)
Fixed `JITFunction.__init__` to mark args as constexpr only when the
annotation is actually `tl.constexpr`, rather than treating any
annotated arg as constexpr.
2023-03-15 16:39:45 -07:00
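The corrected check can be sketched with `inspect`. This is a hedged illustration of the distinction only — `constexpr` below is a stand-in class, and `is_constexpr_arg` is a hypothetical helper, not triton's actual `JITFunction` code:

```python
import inspect

class constexpr:  # stand-in for tl.constexpr
    pass

def is_constexpr_arg(fn, name):
    ann = inspect.signature(fn).parameters[name].annotation
    # Fixed behavior: only a literal constexpr annotation counts.
    # The buggy behavior amounted to `ann is not inspect.Parameter.empty`,
    # i.e. treating ANY annotated arg as constexpr.
    return ann is constexpr

def kernel(n: constexpr, x: int):
    pass

print(is_constexpr_arg(kernel, "n"), is_constexpr_arg(kernel, "x"))  # True False
```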
Stonepia
109b5e2729 [BUILD] Fix the build bug when user use system package of llvm by setting LLVM_SYSPATH (#1336)
When the user sets `LLVM_SYSPATH` to use a custom LLVM build, the build
throws an error because there is no version.txt under the custom build.

This PR skips the version check if `LLVM_SYSPATH` is set.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-15 13:28:19 -07:00
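A minimal sketch of the build-time logic the commit describes, assuming a hypothetical `check_llvm_version` helper (the function name, paths, and check body are illustrative, not the actual setup.py code):

```python
import os

def check_llvm_version(llvm_dir):
    # A user-provided LLVM (via LLVM_SYSPATH) ships no version.txt,
    # so the pinned-version check is skipped entirely.
    if os.environ.get("LLVM_SYSPATH"):
        return
    with open(os.path.join(llvm_dir, "version.txt")) as f:
        ...  # verify the pinned version for the bundled LLVM
```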
Philippe Tillet
56b23f433d [TEST] Temporarily disable test_dot mode that fails because of ptxas/nvptx (#1344) 2023-03-15 01:17:48 -07:00
peterbell10
01b177afe7 [FRONTEND] Mangle signed and unsigned integer types differently (#1340)
This is cherry-picked from #1305

If you call a `JITFunction` twice in the same kernel, first with `int32`
then with `uint32`, the second call will treat the unsigned value as
signed. This passes through MLIR without error because MLIR uses the
same types for both, but different operation calls will be generated so
you may silently get the wrong result.
2023-03-14 22:29:18 -07:00
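Why distinct mangling matters can be shown with a toy scheme. This is an illustrative sketch, not triton's actual mangler: if signed and unsigned 32-bit integers map to the same suffix, the second specialization silently reuses the first's cached kernel.

```python
def mangle(name, arg_types):
    # Toy name mangling: the specialization key is the joined type suffixes.
    return name + "__" + "_".join(arg_types)

# Bug: uint32 mangled with the same suffix as int32 collides in the cache.
buggy_signed   = mangle("kernel", ["i32"])
buggy_unsigned = mangle("kernel", ["i32"])  # uint32 wrongly reuses "i32"
assert buggy_signed == buggy_unsigned       # cache collision

# Fix: a distinct suffix per signedness yields distinct specializations.
fixed_signed   = mangle("kernel", ["i32"])
fixed_unsigned = mangle("kernel", ["u32"])
assert fixed_signed != fixed_unsigned
```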
Philippe Tillet
ad81447ad0 [FRONTEND] Marking int1 (bool) type as unsigned (#1343) 2023-03-14 22:05:13 -07:00
Philippe Tillet
082828af47 [OPTIMIZER] Fixed up divisibility analysis in div operation (#1341) 2023-03-14 18:17:05 -07:00
Keren Zhou
da0b0bfde6 [BACKEND] Still run llvm-opt but set optLevel to 0 to avoid the abs(float) bug (#1339)
https://github.com/openai/triton/issues/1337
2023-03-14 12:38:57 -07:00
Philippe Tillet
6a8634e2a7 [BACKEND] No longer running LLVM-IR optimizations after codegen. (#1338)
This triggered some outrageous bugs. See #1337.
2023-03-13 22:50:15 -07:00
Philippe Tillet
dde34904d0 [TESTING] triton.testing.allclose now uses torch.allclose (#1333) 2023-03-13 17:48:32 -07:00
Philippe Tillet
6539395337 [OPTIMIZER] CatOp is now marked as not having invertible layout (#1332) 2023-03-13 15:42:48 -07:00
Edward Z. Yang
01b8cfe9ff [BUILD] Mash stdc++fs into more targets (#1329)
I observed that when compiling with gcc8, stdc++fs linker flag isn't
passed to enough targets.  I couldn't figure out the correct target
to add the linker flag to, so I'm just mashing it everywhere.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
2023-03-13 15:02:53 -07:00
Nikita Shulga
663074460d [VERSION] Update triton/__init__.py (#1327)
Followup after
c7581c9a91
2023-03-13 10:38:38 -07:00
Philippe Tillet
9b7c65a3a9 [BACKEND][OPTIMIZER] Refactor MMAv1 codegen (#1322)
- Significant simplification of the optimizer pipeline. The right MMA
version is now set directly after the coalescing pass. DotOperand layouts
no longer hold state for the `isRow` argument, and instead query it from
their parent.
- Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to
TritonGPUAttrDefs. All MMAv1 state is now queried from attributes.
- Logic for getElemsPerThread is no longer duplicated in TypeConverter.
2023-03-12 19:54:38 -07:00
Christian Sigg
64fc0e23ce [BACKEND] Fix triton-convert-arith-to-index. (#1310)
The dialect of created ops needs to be part of dependent dialects.
2023-03-12 19:43:41 -07:00
Yu Guo
ef55ccfed0 [TESTING] fix get_max_simd_tflops (#1318)
`_triton.runtime.num_sm`, `_triton.runtime.clock_rate`, and
`_triton.runtime.cc` no longer seem to exist.

Use the corresponding methods from `get_max_tensorcore_tflops` in the
same file.
2023-03-11 10:07:25 -08:00
Philippe Tillet
5a786cf778 [FRONTEND] Fixed contains_return_op behavior (#1317) 2023-03-10 23:58:28 -08:00
Philippe Tillet
3fe3adbcde [FRONTEND][BACKEND] Add support for float8e5m2 type (#1314) 2023-03-10 19:14:47 -08:00
Luo Yihang
9626c8e944 [DOC] Fix typos in comments (#1311)
Fixed several typos in `python/triton/runtime/autotuner.py`
2023-03-10 09:33:24 -08:00