Commit Graph

135 Commits

Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
Justin Lebar
df08301e76 Reformat Python code with yapf. (#2589)
I've added an option to yapf to do what we want for long lines; see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!

To make this PR, I ran my modified yapf over the repository, then looked
over the full diff.  Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want.  I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.

Overall, trailing `#` was sufficient to get formatting similar to our
current code.  I didn't have to disable yapf anywhere.
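
For a concrete picture, here is a hypothetical kernel declaration (the
names are made up) showing the trailing-`#` trick described above; the
mechanism is simply that yapf will not join a line that ends in a
comment:

```python
import triton
import triton.language as tl


# The trailing `#` comments keep yapf from re-wrapping the parameter
# list, so each group of parameters stays on its own line.
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr,  #
                  M, N, K,  #
                  BLOCK_M: tl.constexpr,  #
                  BLOCK_N: tl.constexpr):
    pass
```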

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-11-02 20:44:17 -07:00
Justin Lebar
29a9245559 [BUILD] use clang+lld in CI builds. (#2564)
Use clang+lld in CI builds.

This is significantly faster.
2023-10-30 19:19:27 -07:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Phil Tillet
07baf3a102 [CI] move llvm-build.yml to the top of workflow directory hierarchy 2023-10-26 02:08:56 -07:00
Keren Zhou
bc72294507 [CI] Reenable torchinductor workflow (#2527) 2023-10-25 23:44:02 -07:00
Philippe Tillet
31c76ddd05 [CI] revert recent changes (#2543) 2023-10-24 17:00:31 -07:00
Phil Tillet
5181d62b1b [CI] renamed third party test workflow 2023-10-24 12:12:52 -07:00
Phil Tillet
96b04493f1 [CI] move workflow around 2023-10-24 04:01:53 -07:00
Phil Tillet
c65d2c2ed6 [CI] run wheels job on CPU worker 2023-10-23 20:04:37 -07:00
Philippe Tillet
05dc28be0e [CI] refactor workflows (#2504)
no longer run third-party tests on every PR
2023-10-17 00:27:17 -07:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Philippe Tillet
98039658d4 [CI] disable pypy wheel (continued) (#2424)
there's a typo in the previous commit
2023-09-30 00:38:06 -07:00
Philippe Tillet
c4f3afc020 [CI] disable pypy wheel (#2423)
emitting warnings from C++ code requires "#include pybind11/exec.h",
which is not compatible with pypy. I think using the Python interpreter
from C++ is a bad idea in general... but we probably don't care much
about pypy wheels anyway
2023-09-29 23:48:08 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
ian Bearman
215b2e77a1 Add Shared Middle Layer to Triton via Plug-In (#2374)
This PR leverages the plug-in system to add a shared middle-layer to
Triton.

Currently the middle layer is not complete but has enough functionality
to demonstrate how it can work. The general idea is that Triton IR is
lowered into MLIR core dialects so that it can be shared across Triton
targets and so that back-ends can be shared with other languages.

The basic intended architecture looks like this:

[Triton IR] -> [Middle Layer] -> [HW specific IR]

The middle layer uses MLIR's Linalg and Tensor Dialects for operations on
Triton block values. Operations on Triton pointers use the Memref
Dialect.

## Usage
To include the shared middle layer in your Triton build, run `export
TRITON_CODEGEN_TRITON_SHARED=1` before invoking your build. Once it is
part of the build, it can be leveraged in two ways:

### Stand-Alone
The middle layer can be used as a stand-alone component to convert the
Triton dialect to the middle-layer dialects.

Stand-alone example:
```
triton-shared-opt --triton-to-linalg %file
```

### Backend Component
The middle layer can also be used as a component in a Triton back-end by
adding the CMake targets it produces and its header files to that
back-end. An example back-end will be published at a later date.

## Implementation details

Even though a valid Triton program can perform loads and stores at
arbitrary memory locations, the prototype only supports lowering
programs that have structured memory access patterns.

### Analyses

As part of the conversion process, there are three important analyses:

1. Pointer analysis:
+ This analysis is responsible for extracting structured memory access
patterns from a `triton` program during loads and stores; it walks the IR
and visits relevant instructions to build strided memory accesses in the
`memref` dialect. The analysis is still in its early stage and does not
support all scenarios. (A small example kernel follows this list.)

2. Use analysis:
+ After "Pointer analysis", instructions that are part of memory address
calculation will no longer be necessary in a triton program because
their semantics have now been captured by `memref` operations
representing strided memory accesses. To aid with removing these
instructions safely, we perform `Use analysis` to mark which
instructions are used *only* in address calculation (called `MetaUse`)
or used in *both* address calculation and data manipulation (called
`MixedUse`) operations. Those that are `MixedUse` are cloned and have
their users adjusted accordingly with the goal of separating out the
`MetaUse` ops so that they can be safely deleted.

3. Mask analysis:
+ This analysis is responsible for handling masked loads and stores.
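
To make the three analyses concrete, here is a small hypothetical
kernel (not taken from this PR) whose accesses have exactly the
structured, masked form described above:

```python
import triton
import triton.language as tl


@triton.jit
def strided_copy(src_ptr, dst_ptr, n, STRIDE: tl.constexpr, BLOCK: tl.constexpr):
    # Pointer analysis: the offsets follow an affine pattern (i * STRIDE),
    # so the load/store below can be summarized as strided memref accesses.
    offs = tl.arange(0, BLOCK) * STRIDE
    # Use analysis: `offs` is used only for address calculation (MetaUse),
    # so it can be deleted once the accesses are captured as memref ops.
    # Mask analysis: the comparison yields the mask guarding both accesses.
    mask = offs < n
    x = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, x, mask=mask)
```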

### Conversion strategy

We introduce the `TritonToLinalg` pass that converts the `triton`
dialect to the `linalg` dialect on *tensors*. This means the resulting
IR is fully compatible with `linalg` tiling and fusion transformation
passes. As mentioned in the `Pointer analysis` description, we do,
however, have to deal with memref instructions at the load and store
boundaries and must convert them to tensors using
`bufferization.to_tensor`. Here's a simple example of what the IR looks
like:

```mlir
tt.func @kernel(%afloat : !tt.ptr<bf16>, %res : !tt.ptr<bf16>) {
  %0 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
  %1 = tt.splat %afloat : (!tt.ptr<bf16>) -> tensor<128x!tt.ptr<bf16>>
  %2 = tt.addptr %1, %0 : tensor<128x!tt.ptr<bf16>>, tensor<128xi32>
  %afm = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128xbf16>
  %3 = "tt.reduce"(%afm) ({
  ^bb0(%arg5: bf16, %arg6: bf16):
    %21 = arith.addf %arg5, %arg6 : bf16
    tt.reduce.return %21 : bf16
  }) {axis = 0 : i32} : (tensor<128xbf16>) -> bf16
  tt.store %res, %3 : bf16
  tt.return
}
```

after conversion:

```mlir
func.func @kernel(%arg0: memref<*xbf16>, %arg1: memref<*xbf16>, %arg2: i32, %arg3: i32, %arg4: i32) {
    %cst = arith.constant 0.000000e+00 : f32
    %reinterpret_cast = memref.reinterpret_cast %arg0 to offset: [0], sizes: [128], strides: [1] :
        memref<*xbf16> to memref<128xbf16, strided<[1]>>
    %alloc = memref.alloc() : memref<128xbf16>
    memref.copy %reinterpret_cast, %alloc : memref<128xbf16, strided<[1]>> to memref<128xbf16>
    %0 = bufferization.to_tensor %alloc restrict writable : memref<128xbf16>
    %1 = bufferization.alloc_tensor() : tensor<f32>
    %inserted = tensor.insert %cst into %1[] : tensor<f32>
    %reduced = linalg.reduce ins(%0 : tensor<128xbf16>) outs(%inserted : tensor<f32>) dimensions = [0]
      (%in: bf16, %init: f32) {
        %3 = arith.extf %in : bf16 to f32
        %4 = arith.addf %3, %init : f32
        linalg.yield %4 : f32
      }
    %extracted = tensor.extract %reduced[] : tensor<f32>
    %2 = arith.truncf %extracted : f32 to bf16
    %reinterpret_cast_0 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [1], strides: [1] :
        memref<*xbf16> to memref<1xbf16, strided<[1]>>
    affine.store %2, %reinterpret_cast_0[0] : memref<1xbf16, strided<[1]>>
    return
}
```

Important details to note:

+ `tt.load` (together with all of its related address calculation
instructions such as `tt.addptr` and `tt.splat`) is lowered to a
combination of `memref.reinterpret_cast`, `memref.alloc`, and
`memref.copy`. After the initialization of the local buffer, we convert
the memref back to a tensor using `bufferization.to_tensor`; this op is
automatically removed during bufferization.

+ `tt.store` lowers to a combination of `memref.reinterpret_cast` and
either `affine.store` or `memref.tensor_store`:

```
%reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [...] memref<*xf32> to memref<1024xf32>
%extracted_slice = tensor.extract_slice %15[0] [%21] [1] : tensor<1024xf32> to tensor<?xf32>
%subview = memref.subview %reinterpret_cast[0] [%21] [1] : memref<1024xf32> to memref<?xf32>
memref.tensor_store %extracted_slice, %subview : memref<?xf32>
```

+ element-wise `arith` and `math` operators are converted to their
corresponding `linalg.generic` versions.
+ `tt.dot` becomes `linalg.matmul`.
+ `tt.reduce` becomes `linalg.reduce`; known limitation: only `addf`
and `maxf` reductions are supported in the reduction body for now.
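
As a concrete (hypothetical) illustration of these mappings, every
operation in the kernel below falls into one of the categories above:

```python
import triton
import triton.language as tl


@triton.jit
def mapped_ops(a_ptr, b_ptr, out_ptr, BLOCK: tl.constexpr):
    # Launch with BLOCK >= 16 so that tl.dot is legal.
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    a = tl.load(a_ptr + idx)
    b = tl.load(b_ptr + idx)
    c = tl.dot(a, b)       # tt.dot       -> linalg.matmul
    c = c * 2.0            # element-wise -> linalg.generic
    s = tl.sum(c, axis=0)  # tt.reduce    -> linalg.reduce (addf body)
    tl.store(out_ptr + offs, s)
```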

### Testing

The prototype was tested on the following Triton kernel examples:

1. vector addition
2. fused softmax
3. matrix multiplication
4. layer normalization
5. fused attention

In addition to testing on the tutorial kernels, I have also added many
lit tests covering various scenarios.

## Recognition

The work here represents contributions from me as well as many of my
colleagues at Microsoft. I especially want to call out @nhat-nguyen and
@haishanzzz, who were major contributors to this work.
2023-09-22 15:29:31 -07:00
Philippe Tillet
c71ec14f31 [TEST] only test 4 configs without TF32 (#2370) 2023-09-21 21:23:19 -07:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Lixun Zhang
ff6fd952ac Install ninja in pre-build 2023-09-18 15:30:06 -05:00
Justin Lebar
2a3746bac5 [BUILD] use ninja (#2318) 2023-09-18 15:30:06 -05:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintaining.
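
As a usage sketch, assuming the `TRITON_INTERPRET` environment variable
that later Triton releases use to select this mode (it must be set
before `triton` is imported):

```python
import os

os.environ["TRITON_INTERPRET"] = "1"  # assumption: selects interpreter mode

import torch
import triton
import triton.language as tl


@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x + 1, mask=mask)


x = torch.zeros(8, device="cuda")
add_one[(1,)](x, x.numel(), BLOCK=8)  # runs through the interpreter
```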
2023-09-17 14:58:50 -07:00
Justin Lebar
073aa16379 [BUILD] use ninja (#2318) 2023-09-17 02:08:04 -07:00
Philippe Tillet
c98671cf7c Revert "Update integration-tests.yml" (#2323)
reverts #2310, as recent changes to Triton-IR have broken third-party backends
2023-09-17 01:16:00 -07:00
Michael Melesse
78a0b5dc2a [CI] update integration-tests.yml (#2310) 2023-09-15 18:38:15 -07:00
Zahi Moudallal
3dec616c7c [CI] Fix submodule issue (#2253)
...
2023-09-06 20:21:29 +00:00
jon-chuang
36859aebff [DOCS] Add MLIR Autogenerated Docs to Sphinx Docs (#2234)
Partially fixes: https://github.com/openai/triton/issues/2226

Here are some example renderings:
![Screenshot from 2023-09-04
18-39-20](https://github.com/openai/triton/assets/9093549/e9c4af04-aeae-4021-a8db-6a4a82b59ae7)
![Screenshot from 2023-09-04
18-39-30](https://github.com/openai/triton/assets/9093549/410391b8-e07e-4bed-909c-8ce5484072d1)
![Screenshot from 2023-09-04
18-39-41](https://github.com/openai/triton/assets/9093549/f1eaef95-66c1-4506-a153-c6069e2b5072)
2023-09-06 08:17:12 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Michael Melesse
c6d33dcebf [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-31 14:02:00 -07:00
Zahi Moudallal
bffb76e847 [DOCS] Fixing docs (#2221) 2023-08-31 11:47:57 -07:00
Zahi Moudallal
f922ecbc29 [DOCS] Fixing dependency (#2219) 2023-08-31 17:51:04 +00:00
Zahi Moudallal
cf31f4ddb2 [DOCS] Fixing docs by using sphinx-build instead of multiversion (#2217) 2023-08-31 16:51:14 +00:00
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
Zahi Moudallal
3c8f959f91 [CI] Adding workflow_run (#2120) 2023-08-18 23:58:41 -07:00
Zahi Moudallal
1faf93e6fb [CI] Fix PR comment (#2131) 2023-08-18 09:16:18 -07:00
Zahi Moudallal
b33f97a682 [CI] Fix bug in Compare Artifacts workflow (#2128)
Forgot to remove this line
2023-08-17 18:06:36 -07:00
Zahi Moudallal
6f654cfbbf [CI] Testing PR comment from another workflow (#2127) 2023-08-17 17:34:59 -07:00
Zahi Moudallal
3fa6d51bc9 [CI] Adding new github workflow for testing (#2121) 2023-08-17 15:32:38 -07:00
Zahi Moudallal
557b2d4b34 [CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111) 2023-08-16 20:34:40 -07:00
Zahi Moudallal
0312ed3473 [CI] Update kernels names (#2093)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-14 19:41:41 -07:00
Philippe Tillet
facc1dcbac [TESTS] better matmul unit testing (#2098) 2023-08-13 17:54:32 -07:00
Zahi Moudallal
c309f7e57a [CI] Skip PR if we paginate 30 times without finding a run_id (#2092) 2023-08-11 15:31:46 -07:00
Philippe Tillet
3ec05fb023 [CI] H100 tests always use ENABLE_TMA=1 ENABLE_MMA_V3=1 (#2051) 2023-08-07 19:32:55 -07:00
Philippe Tillet
54f1ac950e [CI] disable AMD CI (#2045) 2023-08-07 12:03:26 -07:00
Philippe Tillet
223c2d32a2 [CI] disable XPU tests (not compiling) (#2044)
cc @EikanWang . I'm disabling this for now since it broke with the H100
merge, but please feel free to fix the compilation errors and submit a
PR.
2023-08-07 11:56:16 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. They are recommended for trial once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
5bdf71313c [CI] Reduce required disk space (#270) 2023-07-22 09:04:42 +02:00
Philippe Tillet
1db3bdc52e [BACKEND] avoid code duplication for fully warp-synchronous reductions (#1978) 2023-07-21 16:06:00 -07:00
Phil Tillet
c7757fae71 [GITHUB] tweak CODEOWNERS 2023-07-17 00:41:11 -07:00
Keren Zhou
571c92f2a8 [CI] Fix CI kernel compare (#1931)
With this PR, we find the latest merged PR that successfully passed
"Integration Tests".
2023-07-12 10:06:34 -07:00