Commit Graph

417 Commits

Author SHA1 Message Date
Jason Furmanek
12005a82f2 Initial commit to resolve merge conflicts 2023-06-30 19:53:53 +00:00
Thomas
2e3182bab7 [BACKEND] Support scan on dimensions other than the fastest moving one (#1863)
This relaxes the restriction in the scan lowering to support layouts where
we scan along a dimension which isn't the fastest moving one. This is
done by relaxing how we access elements during scanning, allowing
elements to be strided.
2023-06-30 12:40:48 -07:00
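The strided-access idea above can be sketched in plain Python (a hypothetical model of the indexing, not the actual LLVM lowering):

```python
def strided_cumsum(data, size, stride, base=0):
    """In-place inclusive scan over `size` elements of a flat buffer,
    starting at `base` and stepping by `stride` - models scanning a
    dimension that is not the fastest moving (contiguous) one."""
    acc = 0
    for i in range(size):
        idx = base + i * stride  # strided, not contiguous, element access
        acc += data[idx]
        data[idx] = acc
    return data

# Scan the first "column" of a flattened 2x4 row-major matrix
# (stride 4 = row length), touching elements 1 and 5.
m = [1, 2, 3, 4,
     5, 6, 7, 8]
strided_cumsum(m, size=2, stride=4, base=0)
```

With `stride=1` this degenerates to an ordinary scan over the fastest moving dimension.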
Oleg Shyshkov
66ed53d19d [FRONTEND] Support mixed-precision inputs in triton.ops.matmul. (#1754)
Support only combinations of float32 with float16 or bfloat16 for now.
Shouldn't change anything for cases when input types match.

That's a follow-up to the comment in my other PR:
https://github.com/openai/triton/pull/1746#issuecomment-1579630016.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-06-30 09:22:27 -07:00
Thomas
7a8a2da8ef [BACKEND] Enable lowering of f16 constant matmul (#1870)
Since the type expected for mma encoding is i32, when lowering an f16 splat
we need to pack f16 constants into an i32 value. This allows re-enabling
the constant matmul unit test.
2023-06-30 07:00:25 -04:00
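The packing described above can be illustrated in pure Python using the `struct` module's half-precision `'e'` format (an illustrative sketch, not the compiler's code):

```python
import struct

def pack_two_f16(lo: float, hi: float) -> int:
    """Pack two half-precision floats into one 32-bit integer:
    `lo` occupies the low 16 bits, `hi` the high 16 bits."""
    return int.from_bytes(struct.pack("<ee", lo, hi), "little")

# f16 1.0 has bit pattern 0x3C00, so a splat of 1.0 packs to 0x3C003C00.
packed = pack_two_f16(1.0, 1.0)
```

A splat constant just repeats the same 16-bit pattern in both halves of the i32.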
Philippe Tillet
f77015967d Revert "[FRONTEND][BACKEND] improved fp8 specs (#1841)" (#1865)
This reverts commit d4c941177e.
2023-06-29 21:07:01 -04:00
Jason Furmanek
2b38ab4b6c Merge remote-tracking branch 'oai/main' into ifu230620
Conflicts:
	include/triton/Conversion/TritonToTritonGPU/Passes.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/test/unit/language/assert_helper.py
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-29 21:47:27 +00:00
Thomas
3be060849a [FEATURE] Add associative_scan support (#1858)
Implement associative_scan in the frontend and implement lowering to
LLVM for the blocked layout where the scan happens on the fastest moving
dimension. This will later be generalized to support more layouts.
2023-06-29 14:37:51 -07:00
Xinya Zhang
75b86da598 Add configurable wavefront size support for Navi/MI.
[To squash] Configurable warp size in test_core_amd.py::test_convert2d

Note: test_core_amd.py::test_convert2d unit tests have been changed
because some of the old layouts exceed the shared memory limit (64KiB)
2023-06-28 22:25:14 -05:00
Thomas
e5d7411a69 [BACKEND] Add .wt store cache modifier (#1831) 2023-06-28 17:40:30 +00:00
Keren Zhou
d2de3f37f0 [BACKEND] Reduce code cleanup and bug fix for the fast path (#1816)
https://github.com/openai/triton/issues/1715
2023-06-27 17:27:24 -07:00
Zahi Moudallal
2dcbf4783e [BACKEND] Use getOrder for mma layout warps order instead of the hardcoded col-major order (#1825) 2023-06-27 10:56:09 -07:00
Philippe Tillet
d4c941177e [FRONTEND][BACKEND] improved fp8 specs (#1841)
Clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.
2023-06-26 16:19:03 -07:00
Wang Weihan
a3c39d8fbe [TEST] Add device parameter for ut (#1817)
Triton has supported different codegen backends for different devices,
so enabling the unit test cases to support different devices also makes
sense. Otherwise, the third-party backend might have to intrusively
change the Triton test cases.
2023-06-25 15:38:59 +08:00
Thomas
3d1cd89b54 [BACKEND] Add store cache modifiers (#1826)
Plumb through store cache modifiers.
2023-06-23 09:29:10 -07:00
Zahi Moudallal
6ad8cd52e7 [CI] Added IR reference-check github workflow (#1755) 2023-06-22 18:00:40 -07:00
Zahi Moudallal
ca4f242c9b [TEST] Added matmul config for testing (#1758) 2023-06-22 13:31:37 -07:00
Philippe Tillet
0d6cd0307a [FRONTEND] add tie_break_left option to arg-reductions (#1813) 2023-06-21 19:35:52 -07:00
Philippe Tillet
4c0e3d907e [TOOLS] improved ahead-of-time compiler (#1805)
This is a revival of @gaxler's initial ahead-of-time compiler proposal.
Code was simplified and some constraints were relaxed (i.e., we now
execute the entire file provided vs just the kernel AST) to promote
maintainability. A basic unit test was added, though it does not test
specialization right now.

co-authored by: Gregory Axler, thexler <g.axler@gmail.com>
2023-06-21 01:02:58 -07:00
Keren Zhou
1851c8ca99 [FRONTEND] Fix binary compare op on constexprs (#1801)
Example:

```
if static_a == 0 and static_b == 1:
    ...
```

The return value of `static_a == 0` should be `constexpr(True)` rather than
`True`; otherwise the bool object (True/False) doesn't have the
`logical_and` method.
2023-06-18 20:27:56 -07:00
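The fix above can be sketched with a minimal stand-in for triton's `constexpr` wrapper (method bodies here are illustrative, simplified from the real frontend class):

```python
class constexpr:
    """Minimal sketch of a compile-time constant wrapper."""
    def __init__(self, value):
        self.value = value.value if isinstance(value, constexpr) else value

    def __eq__(self, other):
        # Key point of the fix: comparison returns a constexpr,
        # not a plain Python bool.
        other = other.value if isinstance(other, constexpr) else other
        return constexpr(self.value == other)

    def logical_and(self, other):
        return constexpr(bool(self.value) and bool(other.value))

static_a, static_b = constexpr(0), constexpr(1)
# `static_a == 0` stays a constexpr, so chaining logical_and works.
cond = (static_a == 0).logical_and(static_b == 1)
```

Had `__eq__` returned a bare `True`, the subsequent `.logical_and(...)` call would raise `AttributeError`.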
oplavsic
64d7b521cf [MFMA] Enabled fused attention forward pass. (#226)
* [MFMA] Activated Fused Attention Forward Pass

Patch contains following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
   inputs converted to block layouts, outputs returned to MFMA layout

* Use a simple module walk instead of the pattern rewriter.

* Remove the pattern rewriter header.

* Enable basic reduce algorithm for MFMA layout

* Add TODO comment for fused attention backward pass

* Fix bug in fast codegen algorithm for reduce op

* Fix input type bug

* Increase block size to 128 since out of memory issue is not seen on MI210

* Fix block_size error

* Add mfma support in DecomposeDotOperand pattern.
2023-06-16 15:39:08 -05:00
Christopher Hesse
981e98a213 [FRONTEND] update assert_helper.py (#1789) 2023-06-15 16:24:30 -07:00
Philippe Tillet
9a2580de13 [CI] Added H100 node (#1779) 2023-06-15 14:21:47 -07:00
Philippe Tillet
b24dc19741 [FRONTEND] cleaned up symbol names (#1782) 2023-06-14 18:55:32 -07:00
Zahi Moudallal
ac15d00ef4 [TEST] Added f8xf16 tests (#1771) 2023-06-12 16:14:17 -07:00
Wang Weihan
b27a91a113 [FRONTEND] Enable triton to support register thirdparty backend at runtime (#1643)
This PR intends to provide a mechanism to support a third-party backend
at runtime to generate the backend-specific code.

The mechanism provides a common class to abstract the third-party
backend logic and two essential functions to register and get the
third-party backend at runtime.

- `BaseBackend`: A common class to abstract the third-party backend
logic
- `register_backend`: Register a third-party backend with a given device
type
- `get_backend`: Get the third-party backend with a given device type

Generally, a third-party backend must inherit from `BaseBackend` and
implement all the member functions according to the backend
characteristics. As long as the backend implementation is ready, the
third-party backend can invoke `register_backend` to register it under a
given device. During the kernel compilation and execution, the mechanism
will get the registered backend to generate the kernel and launcher code
for a given device.

This PR added a dummy backend to simulate a third-party backend and
demonstrate the usage.

-
[test_device_backend.py](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1):
To define a third-party backend and register the backend
-
[ExtensionBackend](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R123):
Inherit from the `BaseBackend` and implement some specific logic like
[filter out some compile
stages](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R129-R135)
- [Register the `ExtensionBackend` for
`CPU`](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R279)
  
-
[extension_backend.c](https://github.com/openai/triton/pull/1643/files#diff-169c1d08b3a0a7b343cfa3258fbc32b47e0f6c46305a112652fa1bdaaec89d29):
To provide the utility function to load kernel binary and get the
backend properties.
2023-06-09 09:09:59 -07:00
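The registration mechanism described above can be sketched as a small registry; the names `BaseBackend`, `register_backend`, `get_backend`, and `ExtensionBackend` come from the PR, but the bodies below are illustrative only:

```python
class BaseBackend:
    """Abstract interface a third-party backend implements (sketch)."""
    def __init__(self, device_type):
        self.device_type = device_type

    def compile(self, src):
        raise NotImplementedError

# Registry mapping a device type to its backend class.
_backends = {}

def register_backend(device_type, backend_cls):
    _backends[device_type] = backend_cls

def get_backend(device_type):
    cls = _backends.get(device_type)
    return cls(device_type) if cls is not None else None

class ExtensionBackend(BaseBackend):
    """Dummy third-party backend, like the one the PR adds for tests."""
    def compile(self, src):
        return f"compiled[{src}]"

register_backend("cpu", ExtensionBackend)
backend = get_backend("cpu")
```

During compilation, the runtime would look up the registered backend for the kernel's device type and delegate code generation to it.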
jayfurmanek
29f93b147b Merge pull request #229 from ROCmSoftwarePlatform/ifu230601
IFU 230601
2023-06-09 07:55:32 -05:00
Keren Zhou
4fbadf6f6f [BACKEND] Fix tl.cat when the number of threads > the size of a tensor (#1751)
`tl.cat(tensor<64>, tensor<64>) -> tensor<128>`: because it concatenates
elements into a single thread, if the number of threads is 128, each thread
should own at least 2 elements.
With this PR, we also disable remat of the cat op in some cases.
2023-06-07 15:42:38 -07:00
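The counting argument above can be checked with a short sketch (my own reading of the constraint, not code from the PR):

```python
def min_elems_per_thread(in_size_a, in_size_b, num_threads):
    """Lower bound on elements per thread for a cat result.
    The output (size a+b) needs at least ceil((a+b)/num_threads)
    elements per thread, and concatenation itself places elements
    from both inputs into one thread, so the bound is at least 2."""
    out_size = in_size_a + in_size_b
    per_thread = -(-out_size // num_threads)  # ceiling division
    return max(per_thread, 2)

# 64 + 64 elements over 128 threads: naive layout gives 1 per thread,
# but cat requires 2.
bound = min_elems_per_thread(64, 64, 128)
```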
Aleksandr Efimov
0a12031c75 [Triton] Fix MFMA dot operand loading
This PR fixes computation of indexes of MFMA dot operands and gives variables more informative names.
2023-06-07 21:30:52 +02:00
Philippe Tillet
c52a91231a [FRONTEND][BACKEND] Add acquire/release semantics for atomics (#1739) 2023-06-05 19:09:13 -07:00
Jason Furmanek
0497f95982 [ROCM] Fix assert helper 2023-06-05 21:42:44 +00:00
Philippe Tillet
6c1992cb38 [FRONTEND] min/max now accept return_indices argument (#1731)
Not just syntactic sugar for a successive max + argmax; it also avoids
computing the max twice.
2023-06-02 22:01:02 -07:00
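The "avoids computing the max twice" point amounts to fusing max and argmax into one pass, which can be sketched in plain Python (an illustrative model, not triton's reduction codegen):

```python
def max_with_index(values):
    """Return (max, argmax) in a single pass over the data, rather
    than one pass for the max and a second scan for its index."""
    best_val, best_idx = values[0], 0
    for i, v in enumerate(values[1:], start=1):
        if v > best_val:  # keeps the first occurrence on ties
            best_val, best_idx = v, i
    return best_val, best_idx
```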
jayfurmanek
153ed472b8 Merge branch 'triton-mlir' into ifu230601 2023-06-01 16:18:25 -05:00
Daniil Fukalov
6be1dce41c [ROCM] Fix transposed operands processing in dot operation with MFMA. (#227)
- Applied to `loadA()` the same fix as 2c88ed6aab9ace22ccde1f0e443a1579727ee501.
- Minor cleanup of `mfmaLayout.getWarpsPerCTA()` usage.

Partially fixes ROCmSoftwarePlatform/frameworks-internal#4545
2023-06-01 23:08:58 +02:00
Jason Furmanek
56c55e7451 Initial commit to resolve merge conflicts 2023-06-01 20:58:37 +00:00
Jason Furmanek
28d9754b2a Merge remote-tracking branch 'oai/main' into ifu230601
Conflicts:
	python/test/unit/language/assert_helper.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-01 20:53:33 +00:00
Keren Zhou
1e171bf270 [BACKEND] Pipeline pass rewrite part 1: functionality fixes (#1716)
Support the following three cases:
1. Operands of `load` depend on induction variables before `load`s.
2. Mixed use of induction variables and offset to update the `ptr`.
3. Cross iteration (>1) dependency values.
2023-06-01 12:07:43 -07:00
Mehdi Amini
440fd1bf20 [TESTS] Increase the payload of the globaltimer kernel to reduce chances of flakiness (#1726)
If the kernel is too small, on a very fast GPU we may get 0 because the
resolution of the timer is too coarse.

Fixes #1725
2023-06-01 02:53:07 -07:00
Mehdi Amini
b0c893cdc5 [FRONTEND][BACKEND] Hardened get_program_id axis by making it an enum attribute (#1721)
Also catch out-of-bounds indices at construction and throw a proper error
in the frontend.
Finally, let's make the IR a bit prettier:

  %0 = tt.get_program_id {axis = 0 : i32} : i32

becomes:

  %0 = tt.get_program_id x : i32

Fixes #1718
2023-05-31 22:49:46 -07:00
Mehdi Amini
19c65d6007 [FRONTEND] fix checks for valid slice and avoid hitting an obscure exception. (#1720)
When comparing to the expected slices, using the `==` operator
dispatches to the components of the slice. If the user writes `a[10:20]`,
these are `triton.constexpr` instances, whose `__eq__` operator
is implemented as `return constexpr(self.value == other.value)`. At
this point the access to `.value` on the provided `None` yields an
exception that isn't very friendly to the user.

I am not sure if the implementation of `constexpr` should be hardened
instead?

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-05-31 16:37:19 +00:00
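The failure mode and the safer check can be sketched as follows (the `constexpr` stand-in and `is_full_slice` helper are hypothetical illustrations, not the PR's actual code):

```python
class constexpr:
    """Minimal stand-in for triton.language.constexpr."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Crashes when `other` is a plain None: None has no .value.
        return constexpr(self.value == other.value)

def is_full_slice(sl):
    """Validate slice bounds with `is None` instead of `==`, so a
    plain None never reaches constexpr.__eq__."""
    return sl.start is None and sl.stop is None and sl.step is None

full = is_full_slice(slice(None, None, None))      # a[:] style slice
bounded = is_full_slice(slice(constexpr(10), constexpr(20)))  # a[10:20]
```

Identity checks against `None` short-circuit before any `__eq__` dispatch, which is what makes the error message friendly again.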
Andrey Shukshov
fee5950893 [MFMA] Implementation of MFMA DotOp pipeline (#180)
* [MFMA] Implementation of MFMA DotOp pipeline

* Added MFMA test_dot unit tests

* Added missing ifdefs

* Update offline tests

* Removing duplicate parts

* fix build after rebase

* remove redundant stuff

* simplify MMAv3.cpp

* move reps function into operand attr description,
remove coreMatrixType type from layout conversion,
refactored type conversion

* remove duplication of MFMA instruction shape computation

* move all MFMA instruction shape details into layout attribute

* fix formatting

* reenable matmul acceleration

* fix dot operator type conversion

* add offline test for dotop

* add missing ifdef wrappers

* run clang format on changes

* review and rebase fix

* add switch for MFMA instructions

* change check precision for float16 test

* disable redundant check for allowTF32

* - skip unsupported block size in matmul autotuner
- support transposed inputs of dot

* reenable matmul acceleration

* Add first part to FMA for dot operation on HW without MFMA support.

* Fix offline tests.

* Fix lit tests

* refactor mmav3 to mfma

* fix rebase issues

* fix detection of mfma support and wrong assert

* remove unnecessary macros

* Add documentation for MFMA layout.

* fix line size computation for B argument

* Fix getElemsPerThread() and getSizePerThread() functions for MFMA layout.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: dfukalov <1671137+dfukalov@users.noreply.github.com>
Co-authored-by: weihan13 <weihan13@amd.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
2023-05-30 16:10:28 -05:00
Philippe Tillet
4e2f57add5 [FRONTEND] Added default axis=None for reduction, which reduces across all the axes. (#1712) 2023-05-28 16:13:21 -07:00
Philippe Tillet
420e4acecc [TEST] Added flash attention tests for D_HEAD in {16, 32, 128}. (#1709) 2023-05-27 22:48:22 -07:00
Keren Zhou
0341953466 [FRONTEND] Correct the debug syntax (#1705)
- If `TRITON_DEBUG=True`, all triton functions will be compiled in the
debug mode.
- Otherwise, a triton function `f`'s debug flag is either `True`,
`False` or `None` (default).
    - If `True`, `f` is compiled in the debug mode.
    - If `False`, `f` is compiled in the normal mode.
- If `None`, `f` is compiled based on its caller's debug flag. The root
(kernel) function's debug flag can also be set through the `compile`
function.

cc @ngimel , @Chillee
2023-05-24 23:24:29 -07:00
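The resolution rules listed above can be condensed into one function (a sketch of my reading of the commit message; `effective_debug` is a hypothetical helper, not triton's API):

```python
import os

def effective_debug(fn_debug, caller_debug, env=os.environ):
    """Resolve a function's debug mode:
    - TRITON_DEBUG set to a truthy value forces debug for everything;
    - otherwise the function's own True/False flag wins;
    - None inherits the caller's flag (False at the root by default)."""
    if env.get("TRITON_DEBUG", "").lower() in ("1", "true"):
        return True
    if fn_debug is not None:
        return fn_debug
    return bool(caller_debug)
```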
Philippe Tillet
b5ba639bae [FRONTEND] fixed issue for fp64 literals and added tests (#1698)
fixes #1686
2023-05-20 18:36:28 -07:00
Keren Zhou
fb30d84069 [FRONTEND] Refactor contains_return_op into an independent AST (#1694)
https://github.com/openai/triton/issues/1690
2023-05-20 11:18:40 -07:00
Zahi Moudallal
34817ecc95 [BACKEND] Added support to convert shared to distributed layouts (#1682) 2023-05-17 17:20:29 -07:00
Jason Furmanek
78c60742fc IFU 230517 Resolve merge conflicts 2023-05-17 17:36:44 +00:00
Jason Furmanek
4c4e42e524 Merge remote-tracking branch 'openai/main' into IFU-230517
Conflicts:
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/test/unit/language/assert_helper.py
	python/triton/third_party/cuda/bin/ptxas
	test/Conversion/tritongpu_to_llvm.mlir

2023-05-17 15:03:42 +00:00
Keren Zhou
3baab48eaf [FRONTEND] Differentiate between bool and int in the frontend (#1678)
`bool` is a subclass of `int`, so `isinstance(bool_var, int) == True`,
and a `bool` constant will be converted to an `int` constant.

In triton specifically, if a bool var is treated as an integer, it
prevents us from using the `logical_and` operator, which requires both
operands to have the same bit length.

> Cannot bitcast data-type of size 32 to data-type of size 1

By differentiating int and bool, the syntax becomes closer to native
Python. We can now use `if bool_var and condition` to
check truthiness, and `if bool_var is True` to check identity.
2023-05-16 18:24:16 +00:00
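The subclass pitfall above is easy to demonstrate: since `bool` is a subclass of `int`, any type dispatch must test `bool` first (the `type_name` helper and type strings are illustrative, not triton's actual mapping):

```python
def type_name(value):
    """Classify a Python constant; bool must be checked before int,
    because isinstance(True, int) is True."""
    if isinstance(value, bool):
        return "int1"   # a 1-bit boolean type
    if isinstance(value, int):
        return "int32"
    return "other"
```

With the checks in the opposite order, `True` would be classified as a 32-bit integer, triggering the bitcast error quoted above.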
Daniil Fukalov
7acc1cb707 [ROCM] Implement device_assert functionality. (#207)
Triton first prints the assert message to the stderr stream with the same
(refactored) helper function as `device_print` and then ends the thread
execution.

Note: the s_endpgm instruction is used, since s_trap (generated from LLVM::Trap or LLVM::DebugTrap) has some issues on different HW.

Also restored a fix in `python/triton/compiler/compiler.py` lost after one
of the IFUs.
2023-05-15 16:16:14 +02:00