Commit Graph

507 Commits

Author SHA1 Message Date
Zahi Moudallal
2dcbf4783e [BACKEND] Use getOrder for mma layout warps order instead of the hardcoded col-major order (#1825) 2023-06-27 10:56:09 -07:00
Philippe Tillet
d4c941177e [FRONTEND][BACKEND] improved fp8 specs (#1841)
clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.
2023-06-26 16:19:03 -07:00
Wang Weihan
a3c39d8fbe [TEST] Add device parameter for ut (#1817)
Triton has supported different codegen backends for different devices,
so enabling the unit test cases to support different devices also makes
sense. Otherwise, the third-party backend might have to intrusively
change the Triton test cases.
2023-06-25 15:38:59 +08:00
Thomas
3d1cd89b54 [BACKEND] Add store cache modifiers (#1826)
Plumb through store cache modifiers.
2023-06-23 09:29:10 -07:00
Zahi Moudallal
6ad8cd52e7 [CI] Added IR reference-check github workflow (#1755) 2023-06-22 18:00:40 -07:00
Zahi Moudallal
ca4f242c9b [TEST] Added matmul config for testing (#1758) 2023-06-22 13:31:37 -07:00
Philippe Tillet
0d6cd0307a [FRONTEND] add tie_break_left option to arg-reductions (#1813) 2023-06-21 19:35:52 -07:00
Philippe Tillet
4c0e3d907e [TOOLS] improved ahead-of-time compiler (#1805)
This is a revival of @gaxler initial ahead-of-time compiler proposal.
Code was simplified and some constraints were relaxed (i.e., we now
execute the entire file provided vs just the kernel AST) to promote
maintainability. A basic unit test was added, though it does not test
specialization right now.

co-authored by: Gregory Axler, thexler <g.axler@gmail.com>
2023-06-21 01:02:58 -07:00
Keren Zhou
1851c8ca99 [FRONTEND] Fix binary compare op on constexprs (#1801)
Example:

```
if static_a == 0 and static_b == 1:
    ...
```

The return value of `static_a == 0` should be `constexpr(True)` but not
`True`, otherwise the bool object (True/False) doesn't have the
`logical_and` method.
2023-06-18 20:27:56 -07:00
oplavsic
64d7b521cf [MFMA] Enabled fused attention forward pass. (#226)
* [MFMA] Activated Fused Attention Forward Pass

Patch contains following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
   inputs converted to block layouts, outputs returned to MFMA layout

* Use simple module walk instead of pattern rewritter.

* Remove pattern rewritter header.

* Enable basic reduce algorithm for MFMA layout

* Add TODO comment for fused attention backward pass

* Fix bug in fast codegen algorithm for reduce op

* Fix input type bug

* Increase block size to 128 since out of memory issue is not seen on MI210

* Fix block_size error

* Add mfma support in DecomposeDotOperand pattern.
2023-06-16 15:39:08 -05:00
Christopher Hesse
981e98a213 [FRONTEND] update assert_helper.py (#1789) 2023-06-15 16:24:30 -07:00
Philippe Tillet
9a2580de13 [CI] Added H100 node (#1779) 2023-06-15 14:21:47 -07:00
Philippe Tillet
b24dc19741 [FRONTEND] cleaned up symbol names (#1782) 2023-06-14 18:55:32 -07:00
Zahi Moudallal
ac15d00ef4 [TEST] Added f8xf16 tests (#1771) 2023-06-12 16:14:17 -07:00
Wang Weihan
b27a91a113 [FRONTEND] Enable triton to support register thirdparty backend at runtime (#1643)
This PR intends to provide a mechanism to support a third-party backend
at runtime to generate the backend-specific code.

The mechanism provided a common class to abstract the third-party
backend logic and two essential functions to register and get the
third-party backend at runtime.

- `BaseBackend`: A common class to abstract the third-party backend
logic
- `register_backend`: Register a third-party backend with a given device
type
- `get_backend`: Get the third-party backend with a given device type

Generally, a third-party backend must inherit from `BaseBackend` and
implement all the member functions according to the backend
characteristics. As long as the backend implementation is ready, the
third-party backend can invoke `register_backend` to register it under a
given device. During the kernel compilation and execution, the mechanism
will get the registered backend to generate the kernel and launcher code
for a given device.

This PR added a dummy backend to simulate a third-party backend and
demonstrate the usage.

-
[test_device_backend.py](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1):
To define a third-party backend and register the backend
-
[ExtensionBackend](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R123):
Inherit from the `BaseBackend` and implement some specific logic like
[filter out some compile
stages](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R129-R135)
- [Register the `ExtensionBackend` for
`CPU`](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R279)
  
-
[extension_backend.c](https://github.com/openai/triton/pull/1643/files#diff-169c1d08b3a0a7b343cfa3258fbc32b47e0f6c46305a112652fa1bdaaec89d29):
To provide the utility function to load kernel binary and get the
backend properties.
2023-06-09 09:09:59 -07:00
jayfurmanek
29f93b147b Merge pull request #229 from ROCmSoftwarePlatform/ifu230601
IFU 230601
2023-06-09 07:55:32 -05:00
Keren Zhou
4fbadf6f6f [BACKEND] Fix tl.cat when the number of threads > the size of a tensor (#1751)
`tl.cat(tensor<64>, tensor<64>) -> tensor(128)`, because it concatenates
elements into a single thread, if number of threads is 128, each thread
should own at least 2 elements.
With this PR, we also disable remat of the cat op in some cases.
2023-06-07 15:42:38 -07:00
Aleksandr Efimov
0a12031c75 [Triton] Fix MFMA dot operand loading
This PR fixes computation of indexes of MFMA dot operands and gives variables more informative names.
2023-06-07 21:30:52 +02:00
Philippe Tillet
c52a91231a [FRONTEND][BACKEND] Add acquire/release semantics for atomics (#1739) 2023-06-05 19:09:13 -07:00
Jason Furmanek
0497f95982 [ROCM] Fix assert helper 2023-06-05 21:42:44 +00:00
Philippe Tillet
6c1992cb38 [FRONTEND] min/max now accept return_indices argument (#1731)
Not just syntactic sugar for successive max + argmax but also avoids
computing the max twice
2023-06-02 22:01:02 -07:00
jayfurmanek
153ed472b8 Merge branch 'triton-mlir' into ifu230601 2023-06-01 16:18:25 -05:00
Daniil Fukalov
6be1dce41c [ROCM] Fix transposed operands processing in dot operation with MFMA. (#227)
- Applied to `loadA()` the same fix as 2c88ed6aab9ace22ccde1f0e443a1579727ee501.
- Minor cleanup of `mfmaLayout.getWarpsPerCTA()` usage.

Partially fixes ROCmSoftwarePlatform/frameworks-internal#4545
2023-06-01 23:08:58 +02:00
Jason Furmanek
56c55e7451 Initial commit to resolve merge conflicts 2023-06-01 20:58:37 +00:00
Jason Furmanek
28d9754b2a Merge remote-tracking branch 'oai/main' into ifu230601
Conflicts:
	python/test/unit/language/assert_helper.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-01 20:53:33 +00:00
Keren Zhou
1e171bf270 [BACKEND] Pipeline pass rewrite part 1: functionality fixes (#1716)
Support the following three cases:
1. Operands of `load` depend on induction variables before `load`s.
2. Mixed use of induction variables and offset to update the `ptr`.
3. Cross iteration (>1) dependency values.
2023-06-01 12:07:43 -07:00
Mehdi Amini
440fd1bf20 [TESTS] Increase the paylog of the globaltimer kernel to reduce chances of fakiness (#1726)
If the kernel is too small, on a very fast GPU we may get 0 because the
resolution of the timer seems too coarse.

Fixes #1725
2023-06-01 02:53:07 -07:00
Mehdi Amini
b0c893cdc5 [FRONTEND][BACKEND] Hardened get_program_id axis by making it an enum attribute (#1721)
Also catch out-of-bounds indices at constructio and throw a proper error
in the frontend.
Finally, let's make the IR a bit prettier:

  %0 = tt.get_program_id {axis = 0 : i32} : i32

becomes:

  %0 = tt.get_program_id x : i32

Fixes #1718
2023-05-31 22:49:46 -07:00
Mehdi Amini
19c65d6007 [FRONTEND] fix checks for valid slice and avoid hitting an obscure exception. (#1720)
When comparing to the expected slides, using the `==` operator will
dispatch to the component of the slice. If the user writes `a[10:20]`
these are `triton.constexpr` instances, and the `__eq__` operator which
is implemented as: `return constexpr(self.value == other.value)`. At
this point the access to `.value` on the provided `None` yields an
exception that isn't very friendly to the user.

I am not sure if the implementation of `constexpr` should be hardened
instead?

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-05-31 16:37:19 +00:00
Andrey Shukshov
fee5950893 [MFMA] Implementation of MFMA DotOp pipeline (#180)
* [MFMA] Implementation of MFMA DotOp pipeline

* Added MFMA test_dot unit tests

* Added missing ifdefs

* Update offline tests

* Removing duplicate parts

* fix build after rebase

* remove redundant stuff

* simplify MMAv3.cpp

* move reps function into operand attr description,
remove coreMatrixType type from layout conversion,
refactored type conversion

* remove duplication of mfma intruction shape computation

* move all MFMA instruction shape details into layout attribute

* fix formatting

* reenable matmul acceleration

* fix dot operator type conversion

* add offline test for dotop

* add missing ifdef wrappers

* run clang format on changes

* review and rebase fix

* add switch for MFMA instructions

* change check precision for float16 test

* disable redundant check for allowTF32

* - skip unsupported block size in matmul autotuner
- support transposed inputs of dot

* reenable matmul acceleration

* Add first part to FMA for dot operation on HW without MFMA support.

* Fix offline tests.

* Fix lit tests

* refactor mmav3 to mfma

* fix rebase issues

* fix detection of mfma support and wrong assert

* remove unnecessary macros

* Add documentation for MFMA layout.

* fix line size computation for B argument

* Fix getElemsPerThread() and getSizePerThread() functions for MFMA layout.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: dfukalov <1671137+dfukalov@users.noreply.github.com>
Co-authored-by: weihan13 <weihan13@amd.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
2023-05-30 16:10:28 -05:00
Philippe Tillet
4e2f57add5 [FRONTEND] Added default axis=None for reduction, which reduces across all the axes. (#1712) 2023-05-28 16:13:21 -07:00
Philippe Tillet
420e4acecc [TEST] Added flash attention tests for D_HEAD in {16, 32, 128}. (#1709) 2023-05-27 22:48:22 -07:00
Keren Zhou
0341953466 [FRONTEND] Correct the debug syntax (#1705)
- If `TRITON_DEBUG=True`, all triton functions will be compiled in the
debug mode.
- Otherwise, a triton function `f`'s debug flag is either `True`,
`False` or `None` (default).
    - If `True`, `f` is compiled in the debug mode.
    - If `False`, `f` is compiled in the normal mode.
- If `None`, `f` is compiled based on its caller's debug flag. The root
(kernel) function's debug flag can also be set through the `compile`
function.

cc @ngimel , @Chillee
2023-05-24 23:24:29 -07:00
Philippe Tillet
b5ba639bae [FRONTEND] fixed issue for fp64 literals and added tests (#1698)
fixes #1686
2023-05-20 18:36:28 -07:00
Keren Zhou
fb30d84069 [FRONTEND] Refactor contains_return_op into an independent AST (#1694)
https://github.com/openai/triton/issues/1690
2023-05-20 11:18:40 -07:00
Zahi Moudallal
34817ecc95 [BACKEND] Added support to convert shared to distributed layouts (#1682) 2023-05-17 17:20:29 -07:00
Jason Furmanek
78c60742fc IFU 230517 Resolve merge conflicts 2023-05-17 17:36:44 +00:00
Jason Furmanek
4c4e42e524 Merge remote-tracking branch 'openai/main' into IFU-230517
Conflicts:
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/test/unit/language/assert_helper.py
	python/triton/third_party/cuda/bin/ptxas
	test/Conversion/tritongpu_to_llvm.mlir

 It looks like you may be committing a merge.
 If this is not correct, please remove the file
	.git/MERGE_HEAD
 and try again.
2023-05-17 15:03:42 +00:00
Keren Zhou
3baab48eaf [FRONTEND] Differentiate between bool and int in the frontend (#1678)
`bool` is a subclass of `int`, so `isinstance(bool_var, int) == True`,
and a `bool` constant will be converted to an `int` constant.

In triton specifically, if a bool var is treated as an integer, it
prevents us using the `logical_and` operator which requires both
operands have the same bit length.

> Cannot bitcast data-type of size 32 to data-type of size 1

By differentiating int and bool, it allows us to make the syntax more
close to native python. We can now use `if bool_var and condition` to
check the truthiness, and `if bool_var is True` to check identity.
2023-05-16 18:24:16 +00:00
Daniil Fukalov
7acc1cb707 [ROCM] Implement device_assert functionality. (#207)
Triton firstly prints assert message into stderr stream with the same
(refactored) helper function as `device_print` and then ends the thread
execution.

Note: s_endpgm instruction is used, since s_trap (generated from LLVM::Trap or LLVM::DebugTrap) has some issues on different HW.

Also got back fix in `python/triton/compiler/compiler.py` lost after one
of IFU.
2023-05-15 16:16:14 +02:00
Keren Zhou
674f9bf7a6 [FRONTEND] Better error messages for noinline functions (#1657)
```
at 10:18:def val_multiplier_noinline(val, i):
    return val * i

           ^
Function val_multiplier_noinline is marked noinline, but was called with non-scalar argument val:fp32[constexpr[128]]
```
2023-05-11 12:46:25 -07:00
Benjamin Chetioui
115964b780 [TESTS] Add regression test for issue #1601. (#1611)
Following up on #1603, I am adding a new file meant to contain
functional regression tests to the repository.
Let me know if another folder would be a more appropriate place for
these tests.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-05-10 23:30:36 -07:00
Natalia Gimelshein
0daee68d71 [FRONTEND] Don't call set_device in tl.dot (#1646)
This breaks multiprocess compilation
2023-05-10 20:39:27 -04:00
Zahi Moudallal
fb40bf1954 [TEST] Fixed and re-enabled reduce test (#1644)
Re-enabled reduce test after fixing the %cst stride in the ttgir, and
modifying the sweep parameters to make sure the shape per CTA to be less
than or equal to the tensor shape.
2023-05-10 15:15:11 -07:00
Mario Lezcano Casado
6b1af5fe37 [FRONTEND] Add support for scalar conditions in device_assert (#1641)
This sometimes happens in TorchInductor. See
https://github.com/pytorch/pytorch/pull/100880.
More generally, it's useful to be able to write `tl.device_assert(False,
msg)`.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-05-09 23:05:00 -07:00
Keren Zhou
b19b274d93 [FRONTEND] Fix return op related control flow issues (#1637)
- Case 1: Return after static control flow is taken. Peel off
instructions after the first `return` for each basic block.

```python
if static_condition:
    tl.store(...)
    return
return
```

- Case 2: Return exists in both `if` and `else` branches of an inlined
`JITFunction` function

```python
def foo():
    if dynamic_condition:
        return a
    else:
        return b
```

- Case 3: Return exists in a `JITFunction` from another module

```python
import module
if cond:
    a = module.func()
```

- Case 4: A chain of calls through undefined local variables

```python
import module
if cond:
    a = x
    a = a.to(tl.int32).to(tl.int32)
```

- Case 5: Call a function `func` without returning variables. `func` is
recognized as an `Expr` first instead of a `Call`.

```python
if cond:
    foo()
else:
    bar()
```

- Case 6: Call a `noinline` function. We don't need to check if the
function contains any return op.
2023-05-09 12:51:14 -04:00
Michaël Benesty
858a2f0a5e [FRONTEND] Added interpreter mode (#1573)
Simple mechanism to run Triton kernels on PyTorch for debugging purpose
(upstream from Kernl).

Todo:
- random grid iteration
- support of atomic ops
- more unit tests
- cover new APIs?
2023-05-08 14:28:20 -07:00
Zahi Moudallal
125d9d1cc7 [TEST] Added convert layout test from/to sliced blocked/mma (#1620) 2023-05-06 00:20:52 +00:00
Keren Zhou
fd381e2336 [BACKEND] Allow noinline functions to return multiple values of primitive types (#1623)
Fix https://github.com/openai/triton/issues/1621
2023-05-05 19:25:58 +00:00
Zahi Moudallal
e2ae2c6c48 [BACKEND] Modified store op thread masking (#1605) 2023-05-04 17:15:05 -07:00