Commit Graph

174 Commits

Author SHA1 Message Date
allatit23
6d98a0899f [HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063) 2023-08-09 15:45:44 +08:00
allatit23
6dee55c912 [HOPPER][WS] fix TMA store hang in ws mode (#2056) 2023-08-08 19:53:52 +08:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
86f8b64ae0 [Dot] [MFMA] [FMA] Update Dot implementation to support upstream tests (#260)
* [Dot] [MFMA] Support FP16 output of MFMA dot

This PR adds cast of output tensor to requested data type.

* add tests

* fix test for FMA implementation

* loose fp16xfp16->fp16 tolerance

* enable FMA fallback for unsupported sizes of dot operation

* rework granularity check

* add constant modifier to granularity
2023-08-03 13:47:18 -05:00
Philippe Tillet
1db3bdc52e [BACKEND] avoid code duplication for fully warp-synchronous reductions (#1978) 2023-07-21 16:06:00 -07:00
David Berard
9c422e260b [OPTIMIZER] AxisInfoVisitor for LoadOp constancy calculation (#1968)
If you call `result = load(x, mask)` where `x` and `mask` have some
constancy properties, then you can infer some constancy properties for
`result`.
2023-07-19 17:40:46 -07:00
oplavsic
d6e51fd221 [FA OPTIMIZATION] Keep results of FA dot operations in registers (#247)
* [WIP][FA OPTIMIZATION] Optimize chain dot

This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.

* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers

* Move operand swapping in ttgir -> llir lowering phase

* Refactor emitMfmaOffsetForCTA function to be more readable

* Fix accidental change in 06-fused-attention.py

* Address review comments

* Fix rebase errors
2023-07-12 15:25:55 -05:00
Keren Zhou
fd89aa1d2b [ANALYSIS] Fix allocation algorithm (#1929) 2023-07-12 12:20:36 -07:00
Philippe Tillet
8fe5524c75 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-07-11 00:23:26 -07:00
Thomas
bd900e0a6f [BACKEND] Fix reductions when number of unique element is smaller than layout (#1913)
Fix calculation of unique number of threads within a warp. We need to
consider the number of elements per thread in the calculation. Also
change the layout test to integer sum in order to catch bugs with unique
data as max reduction may hide those kind of problems.
2023-07-07 19:48:13 -07:00
Keren Zhou
4255ef0e9e [BACKEND] Determine fast reduce based on the parent layout (#1892) 2023-07-05 11:52:22 -07:00
Thomas
2e3182bab7 [BACKEND] Support scan on dimensions other that fastest moving one (#1863)
This relax the restriction in the scan lowering to support layout where
we scan along a dimension which isn't the fastest moving one. This is
done by relaxing how we accesses elements during scanning and allow
elements to be strided.
2023-06-30 12:40:48 -07:00
Jason Furmanek
2b38ab4b6c Merge remote-tracking branch 'oai/main' into ifu230620
Conflicts:
	include/triton/Conversion/TritonToTritonGPU/Passes.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/test/unit/language/assert_helper.py
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-29 21:47:27 +00:00
Thomas
3be060849a [FEATURE] Add associative_scan support (#1858)
Implement associative_scan in the front end and implement lowering to
LLVM for blocked layout where the scan happens on the fastest moving
dimension. This will later be generalized to support more layout.
2023-06-29 14:37:51 -07:00
chengjunlu
e209749835 [BACKEND] Add a configurable parameter for the number of threads per warp (#1719)
Add a configurable parameter for the number of threads per warp for
other GPU. Like: Intel GPU.

Make it default to be 32 not change code logic on the CUDA/AMD GPU.

Note: The Intel GPU GenX ISA is explicit SIMD and can support variant
number of threads lane per HW execution unit.
2023-06-28 22:25:14 -05:00
Keren Zhou
d2de3f37f0 [BACKEND] Reduce code cleanup and bug fix for the fast path (#1816)
https://github.com/openai/triton/issues/1715
2023-06-27 17:27:24 -07:00
oplavsic
64d7b521cf [MFMA] Enabled fused attention forward pass. (#226)
* [MFMA] Activated Fused Attention Forward Pass

Patch contains following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
   inputs converted to block layouts, outputs returned to MFMA layout

* Use simple module walk instead of pattern rewritter.

* Remove pattern rewritter header.

* Enable basic reduce algorithm for MFMA layout

* Add TODO comment for fused attention backward pass

* Fix bug in fast codegen algorithm for reduce op

* Fix input type bug

* Increase block size to 128 since out of memory issue is not seen on MI210

* Fix block_size error

* Add mfma support in DecomposeDotOperand pattern.
2023-06-16 15:39:08 -05:00
Keren Zhou
58a8e8a914 [BACKEND] Clean up code (#1768)
- Remove unused header files.
- Get numThreads/numWarps from the triton module.
- Move transforms/utility.h to the include directory.
2023-06-12 17:40:33 -07:00
Zahi Moudallal
fdf2bd5801 [BACKEND] Commenting out reduce fast path shared mem optimization for… (#1761)
… MMA layout that ended up being a cause for failure
2023-06-08 14:39:38 -07:00
chengjunlu
45ba9af6ed [BACKEND] Add a configurable parameter for the number of threads per warp (#1719)
Add a configurable parameter for the number of threads per warp for
other GPU. Like: Intel GPU.

Make it default to be 32 not change code logic on the CUDA/AMD GPU.

Note: The Intel GPU GenX ISA is explicit SIMD and can support variant
number of threads lane per HW execution unit.
2023-06-02 16:55:06 -07:00
Jason Furmanek
28d9754b2a Merge remote-tracking branch 'oai/main' into ifu230601
Conflicts:
	python/test/unit/language/assert_helper.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-06-01 20:53:33 +00:00
Andrey Shukshov
fee5950893 [MFMA] Implementation of MFMA DotOp pipeline (#180)
* [MFMA] Implementation of MFMA DotOp pipeline

* Added MFMA test_dot unit tests

* Added missing ifdefs

* Update offline tests

* Removing duplicate parts

* fix build after rebase

* remove redundant stuff

* simplify MMAv3.cpp

* move reps function into operand attr description,
remove coreMatrixType type from layout conversion,
refactored type conversion

* remove duplication of mfma intruction shape computation

* move all MFMA instruction shape details into layout attribute

* fix formatting

* reenable matmul acceleration

* fix dot operator type conversion

* add offline test for dotop

* add missing ifdef wrappers

* run clang format on changes

* review and rebase fix

* add switch for MFMA instructions

* change check precision for float16 test

* disable redundant check for allowTF32

* - skip unsupported block size in matmul autotuner
- support transposed inputs of dot

* reenable matmul acceleration

* Add first part to FMA for dot operation on HW without MFMA support.

* Fix offline tests.

* Fix lit tests

* refactor mmav3 to mfma

* fix rebase issues

* fix detection of mfma support and wrong assert

* remove unnecessary macros

* Add documentation for MFMA layout.

* fix line size computation for B argument

* Fix getElemsPerThread() and getSizePerThread() functions for MFMA layout.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
Co-authored-by: dfukalov <1671137+dfukalov@users.noreply.github.com>
Co-authored-by: weihan13 <weihan13@amd.com>
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
2023-05-30 16:10:28 -05:00
Mehdi Amini
83245259a6 [OPTIMIZER][BACKEND] switch the TritonGPU dialect to use MLIR Properties (NFC) (#1696)
Also try to switch APIs access to the new upstream APIs that separate
explicitly the access to "discardable" and "inherent" attributes (the
latter being stored in properties now).

Generic accessors like `getAttr()` `setAttr()` `setAttrs()` are much
more expensive and to be avoided.
2023-05-20 01:36:48 +00:00
Jason Furmanek
4c4e42e524 Merge remote-tracking branch 'openai/main' into IFU-230517
Conflicts:
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/test/unit/language/assert_helper.py
	python/triton/third_party/cuda/bin/ptxas
	test/Conversion/tritongpu_to_llvm.mlir

 It looks like you may be committing a merge.
 If this is not correct, please remove the file
	.git/MERGE_HEAD
 and try again.
2023-05-17 15:03:42 +00:00
Christian Sigg
177b46b9ef [BUILD] minor fixes (#1676)
Remove unused variables, fix member initializer list order.
2023-05-16 09:25:18 -07:00
Ingo Müller
47af6ba702 [BACKEND] Move isSharedEncoding to TritonGPUIR. (#1655)
This breaks a cyclic dependency between TritonAnalysis and TritonGPUIR
(see #1649).
2023-05-12 20:50:21 -04:00
Ingo Müller
b2a757d000 [BUILD] Add missing CMake link-time dependencies. (#1654) 2023-05-11 19:17:44 -07:00
Zahi Moudallal
fb40bf1954 [TEST] Fixed and re-enabled reduce test (#1644)
Re-enabled reduce test after fixing the %cst stride in the ttgir, and
modifying the sweep parameters to make sure the shape per CTA to be less
than or equal to the tensor shape.
2023-05-10 15:15:11 -07:00
long.chen
7d20a86865 [BACKEND] fix typo in Membar class about WAR description and refine some code (#1629)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-05-06 14:20:23 +00:00
Zahi Moudallal
3449a9d40d Zahi/slice reduce rebased (#1594)
[BACKEND] Enable slice layout support for reduce op
2023-05-01 18:00:23 -07:00
Keren Zhou
ee864048b3 [FRONTEND][BACKEND] Add the noinline annotation for triton.jit (#1568)
# Introducing the `noinline` Parameter for Triton JIT Decorator

We're excited to introduce a new parameter, `noinline`, that can be
added to the `jit` decorator in Triton. This parameter allows developers
to specify that a particular Triton function should not be inlined into
its callers. In this post, we'll dive into the syntax, purpose, and
implementation details of this new feature.

## Syntax

To use the `noinline` parameter, simply add `noinline=True` to the `jit`
decorator for the function that you don't want to be inlined. Here's an
example:

```python
@triton.jit(noinline=True)
def device_fn(x, y, Z):
    z = x + y
    tl.store(Z, z)

def test_noinline():
    @triton.jit
    def kernel(X, Y, Z):
        x = tl.load(X)
        y = tl.load(Y)
        device_fn(x, y, Z)
```

In this example, the `device_fn` function is decorated with
`@triton.jit(noinline=True)`, indicating that it should not be inlined
into its caller, `kernel`.

## Purpose

The `noinline` parameter serves several key purposes:

- Reducing code size: By preventing inlining, we can reduce the size of
the compiled code.
- Facilitating debugging: Keeping functions separate can make it easier
to debug the code.
- Avoiding common subexpression elimination (CSE) in certain cases: CSE
can sometimes be avoided by using the `noinline` parameter to reduce
register pressure.
- Enabling dynamic linking: This parameter makes it possible to
dynamically link Triton functions.

## Implementation

The implementation of the `noinline` parameter involves significant
changes to three analysis modules in Triton: *Allocation*, *Membar*, and
*AxisInfo*. Prior to this update, these modules assumed that all Triton
functions had been inlined into the root kernel function. With the
introduction of non-inlined functions, we've had to rework these
assumptions and make corresponding changes to the analyses.

### Call Graph and Limitations

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234663904-12864247-3412-4405-987b-6991cdf053bb.png"
alt="figure 1" width="200" height="auto">
</div>

To address the changes, we build a call graph and perform all the
analyses on the call graph instead of a single function. The call graph
is constructed by traversing the call edges and storing them in an edge
map. Roots are extracted by checking nodes with no incoming edges.

The call graph has certain limitations:

- It does not support recursive function calls, although this could be
implemented in the future.
- It does not support dynamic function calls, where the function name is
unknown at compilation time.

### Allocation

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665110-bf6a2660-06fb-4648-85dc-16429439e72d.png"
alt="figure 2" width="400" height="auto">
</div>

In Triton, shared memory allocation is achieved through two operations:
`triton_gpu.convert_layout` and `triton_gpu.alloc_tensor`. The
`convert_layout` operation allocates an internal tensor, which we refer
to as a *scratch* buffer, while the `alloc_tensor` operation returns an
allocated tensor and is thus known as an *explicit* buffer.

To accommodate the introduction of function calls, we are introducing a
third type of buffer called a *virtual* buffer. Similar to scratch
buffers, virtual buffers are allocated internally within the scope of a
function call, and the buffers allocated by the called functions remain
invisible to subsequent operations in the calling function. However,
virtual buffers are distinct from scratch buffers in that the call
operation itself does not allocate memory—instead, it specifies the
total amount of memory required by all the child functions being called.
The actual allocation of buffers is performed by individual operations
within these child functions. For example, when invoking edge e1, no
memory is allocated, but the total amount of memory needed by function B
is reserved. Notably, the amount of shared memory used by function B
remains fixed across its call sites due to the consideration of dynamic
control flows within each function.

An additional challenge to address is the calculation of shared memory
offsets for functions within a call graph. While we can assume a shared
memory offset starting at 0 for a single root function, this is not the
case with a call graph, where we must determine each function's starting
offset based on the call path. Although each function has a fixed memory
consumption, the starting offset may vary. For instance, in Figure 2,
the starting offset of function C through edges e1->e2 differs from that
through edges e2->e4. To handle this, we accumulate the starting offset
at each call site and pass it as an argument to the called function.
Additionally, we amend both the function declaration and call sites by
appending an offset variable.

### Membar

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665157-844dd66f-5028-4ef3-bca2-4ca74b8f969d.png"
alt="figure 3" width="300" height="auto">
</div>

The membar pass is dependent on the allocation analysis. Once the offset
and size of each buffer are known, we conduct a post-order traversal of
the call graph and analyze each function on an individual basis. Unlike
previous analyses, we now return buffers that remain unsynchronized at
the end of functions, allowing the calling function to perform
synchronization in cases of overlap.

### AxisInfo

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665183-790a11ac-0ba1-47e1-98b1-e356220405a3.png"
alt="figure 4" width="400" height="auto">
</div>

The AxisInfo analysis operates differently from both membar and
allocation, as it traverses the call graph in topological order. This is
necessary because function arguments may contain axis information that
will be utilized by callee functions. As we do not implement
optimizations like function cloning, each function has a single code
base, and the axis information for an argument is determined as a
conservative result of all axis information passed by the calling
functions.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-28 14:59:04 -07:00
Zahi Moudallal
65fb36e34e [BACKEND] Updated slice layout semantics, updated vectorization logic used for load/store ops. (#1587) 2023-04-28 13:40:01 -07:00
Michael Melesse
2784b804d9 Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023 2023-04-26 12:04:21 -05:00
peterbell10
e152183570 [FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305)
Fixes #1285

This changes `tt.reduce` to replace `redOp` by a region containing
arbitrary code. For example, `tl.sum` is now lowered as:
```mlir
%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>
```
Support for index reductions at the MLIR level are also dropped in favor
of simultaneous reductions over multiple tensors. Which generalizes the
code without loss of performance. So for example `argmin` gets lowered
as:
```mlir
  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %14 = arith.cmpf olt, %arg4, %arg6 : f32
    %15 = arith.cmpf ogt, %arg4, %arg6 : f32
    %16 = arith.cmpi slt, %arg5, %arg7 : i32
    %17 = arith.select %16, %arg5, %arg7 : i32
    %18 = arith.select %15, %arg7, %17 : i32
    %19 = arith.select %14, %arg5, %18 : i32
    %20 = arith.cmpf olt, %arg4, %arg6 : f32
    %21 = arith.select %20, %arg4, %arg6 : f32
    tt.reduce.return %21, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
```
2023-04-13 01:37:39 +00:00
zahimoud
fd34b20fba [BACKEND] Fixed bug in reduce; add tests 2023-04-11 18:09:18 -07:00
Keren Zhou
6d0ed41307 [BACKEND] Replace Func Dialect with custom triton ops (func, call, return) (#1502)
MLIR current only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.


https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3

Could revert it back once they've designed a better inliner interface.

Inlining attributes will be implemented in the next PR since this PR is
already huge.
2023-04-10 21:08:40 -07:00
Keren Zhou
032509384a [ANALYSIS] Fine-tune comments for shared memory allocation (#1492)
And add a new test to check multiple color cases which have never be
tested before
2023-04-10 09:00:36 -07:00
Philippe Tillet
b86425a28e [TEST] made lut_bmm pipeline test more concise and specific (#1488) 2023-04-08 19:17:35 -07:00
long.chen
f7ad8ae022 [Refine] remove const ref of mlir::Attribute (#1486)
https://mlir.llvm.org/docs/DefiningDialects/AttributesAndTypes/

https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#f16-for-in-parameters-pass-cheaply-copied-types-by-value-and-others-by-reference-to-const
```
The C++ Attribute and Type classes in MLIR (like Ops, and many other things) are value-typed. 
This means that instances of Attribute or Type are passed around by-value, 
as opposed to by-pointer or by-reference. 
The Attribute and Type classes act as wrappers around internal storage objects that are uniqued within an instance of an MLIRContext.
```
2023-04-08 10:38:59 -07:00
Rahul Batra
a27b388df5 Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023 2023-04-06 16:18:31 -05:00
Keren Zhou
801bb9d3b5 [ANALYSIS] Fix divisibility calculation for addptr (#1453) 2023-03-31 17:57:31 -07:00
Keren Zhou
28ea484dab [BACKEND] Clean up type inference functions (#1451)
And remove duplicate function definition.
2023-03-30 23:07:32 -07:00
Keren Zhou
b7762bee2c [TEST] Cleanup SCF dialect in tests (#1402) 2023-03-24 09:21:40 -07:00
Rohit Santhanam
a84b4883e6 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023 2023-03-19 13:46:50 +00:00
Berke Kocaoğlu
ba91f39dbf [DOC] Fix syntax errors, typos, formatting; increase consistency (#1357)
This PR;
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
2023-03-16 15:32:02 -07:00
Rohit Santhanam
77dcb667b5 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03162023 2023-03-16 13:21:15 +00:00
Da Yan
9d5505d043 [OPTIMIZER] Infer the alignment info of loops' induction variables (#1350)
Before this PR, loops' induction variables' (IV) alignment info is lost.
For example:
```
for n in range(0, K, BLOCK):
     x = base + n
                       ^--  Triton doesn't know n is always a multiple of BLOCK
```

This PR fixes this.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-16 00:39:08 -07:00
Philippe Tillet
082828af47 [OPTIMIZER] Fixed up divisibility analysis in div operation (#1341) 2023-03-14 18:17:05 -07:00
Rohit Santhanam
6ff54b495c Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03122023 2023-03-13 18:09:12 +00:00
B1tway
693cf89777 Changed warpSize = 32 to warpSize = 64 2023-03-06 18:33:35 +00:00