Rohit Santhanam
cd9ae1cd36
Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023
2023-02-23 21:41:54 +00:00
Keren Zhou
6a9316e69a
[BACKEND] Clean up SCF -> CF conversion ( #1234 )
2023-02-22 23:49:47 +00:00
Rohit Santhanam
841784d1e3
Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head
2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773
Rebase to LLVM-head. ( #1200 )
...
Rebase to 37b7a60cd7
2023-02-17 13:16:11 -08:00
Christian Sigg
fc7a8e3581
Rebase Triton to LLVM-15. ( #1070 )
...
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Chao Chen
c0a8c72343
update function to get full arch details and compile with them instead of hardcoded values
2023-02-14 12:59:26 +00:00
Rohit Santhanam
2d0ee0fa0f
Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023
2023-01-24 03:59:17 +00:00
Goran Flegar
afd02626ea
[BUILD] Fix build issues of triton-translate tool ( #1068 )
2023-01-17 09:03:29 -08:00
Daniil Fukalov
ce23494cbf
[Test] Enable triton-translate build.
...
And use it in tests instead of the `triton.tools.aot` Python module.
2023-01-13 20:36:16 +01:00
Philippe Tillet
20100a7254
Merge triton-mlir branch - Complete rewrite of the backend from scratch ( #1004 )
...
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00
Chenggang Zhao
4e95e939a6
[Triton-MLIR][BACKEND] Refactor TritonGPUToLLVM into several files ( #988 )
...
Refactor the backend into multiple smaller files.
2022-12-18 14:54:38 +08:00
Keren Zhou
153aecb339
[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 ( #908 )
...
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.
It is not clear whether V100 performance can match the master branch this way.
Performance might improve if the instructions are arranged in
the following form:
```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
Tested on A100 with this decomposition enabled manually.
Tests on V100 haven't been integrated yet; they can be divided into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs in `test_backend.py`.
2. End-to-end gemm tests on V100 (a rough sketch of such a test follows).
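For reference, a phase-2 end-to-end check could look roughly like the sketch below. It is not part of this PR; the kernel, tile sizes, and test harness are illustrative assumptions built on the standard `triton.language` API (`tl.dot`, `tl.load`, `tl.store`). On pre-sm80 GPUs, the loads inside the loop are the ones the decomposed `load + insert_slice` path would serve.
```py
import torch
import triton
import triton.language as tl


@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        # on < sm80 these tile loads would go through the decomposed
        # load + insert_slice path instead of insert_slice_async
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)


def test_gemm(M=128, N=128, K=64):
    a = torch.randn((M, K), device="cuda", dtype=torch.float16)
    b = torch.randn((K, N), device="cuda", dtype=torch.float16)
    c = torch.empty((M, N), device="cuda", dtype=torch.float32)
    grid = (triton.cdiv(M, 128), triton.cdiv(N, 128))
    gemm_kernel[grid](a, b, c, M, N, K,
                      a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                      c.stride(0), c.stride(1),
                      BLOCK_M=128, BLOCK_N=128, BLOCK_K=64)
    # compare against a PyTorch reference in fp32
    torch.testing.assert_close(c, a.float() @ b.float(), atol=1e-2, rtol=1e-2)
```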
2022-11-24 14:05:54 -08:00
Philippe Tillet
1e91ed30d0
[RUNTIME] Major code cleanup ( #711 )
...
This PR does the following:
- CUDA utilities (e.g., cuGetInfo) won't be compiled as part of libtriton.so anymore.
- driver/llvm.cc is refactored to split it between PTX codegen and Python.
- By extension this also deprecates include/external, so Triton won't have to live with a copy of some CUDA/HIP headers anymore.
- `triton-translate` becomes a `triton.tools.aot` Python utility that re-uses functions from the triton.compile sub-module (a rough usage sketch follows).
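As a rough illustration (not from this PR), the compile path can be driven from Python in the same way as the `triton.compile` call shown in the libdevice example further down this log; the trivial kernel and the `"ptx"` output-stage name below are assumptions.
```py
import triton
import triton.language as tl


@triton.jit
def copy_kernel(src_ptr, dst_ptr, BLOCK_SIZE: tl.constexpr):
    # load a contiguous block and store it back out
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(src_ptr + offsets)
    tl.store(dst_ptr + offsets, x)


if __name__ == "__main__":
    # same calling convention as the libdevice example below;
    # output="ptx" (the stage name) is an assumption here
    signature = "*fp32,*fp32"
    constants = {"BLOCK_SIZE": 1024}
    ptx = triton.compile(copy_kernel, signature, device=0,
                         constants=constants, output="ptx")
    print(ptx)
```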
2022-09-26 16:38:06 -07:00
goostavz
15bfd0cb79
[BACKEND] Support of ConvertLayoutOp from blocked to blocked and SliceLayout with blocked parent ( #658 )
2022-09-17 14:58:42 -07:00
Shintaro Iwasaki
3c635449e5
[Triton] Support math and libdevice ops ( #91 )
...
This PR adds basic math ops using `MathDialect` and `libdevice` ops using `extern_elementwise`. This is needed to compile some tutorial code (e.g., `softmax`). The PR implements only the interface down to TritonGPU-MLIR (i.e., from the frontend to TritonGPU), not the full path to PTX.
- Currently lowered only as far as TritonGPU; it cannot be lowered to PTX yet.
- No special optimizations (e.g., constant folding) are applied.
- LLVM 14.x does not define folders for many math ops, but 15.x seems to increase its coverage: https://github.com/llvm/llvm-project/blob/llvmorg-15.0.0-rc3/mlir/include/mlir/Dialect/Math/IR/MathOps.td
- No constant folding, etc., for `libdevice` ops.
```py
import triton
import triton.language as tl
import sys


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    BLOCK_SIZE: tl.constexpr,
):
    offsets = tl.arange(0, BLOCK_SIZE)
    x = tl.load(x_ptr + offsets)
    x = tl.sin(x)
    output = tl.libdevice.sin(x)
    output = tl.libdevice.fdiv_rn(output, output)
    output = tl.libdevice.fmaf_rd(output, output, output)
    tl.store(y_ptr + offsets, output)


if __name__ == "__main__" and len(sys.argv) >= 2:
    signature = "*fp32,*fp32"
    constants = {'BLOCK_SIZE': 1024}
    output = triton.compile(add_kernel, signature, device=0, constants=constants, output="ttgir")
    print(output)
```
->
```llvm
#blocked = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
module attributes {"triton_gpu.num-warps" = 4 : i32} {
  func @add_kernel__Pfp32_Pfp32__2c1024(%arg0: !tt.ptr<f32>, %arg1: !tt.ptr<f32>) {
    %0 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked>
    %1 = tt.splat %arg0 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %2 = tt.getelementptr %1, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    %3 = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<1024xf32, #blocked>
    %4 = math.sin %3 : tensor<1024xf32, #blocked>
    %5 = tt.ext_elemwise %4 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_sinf"} : tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %6 = tt.ext_elemwise %5, %5 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fdiv_rn"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %7 = tt.ext_elemwise %6, %6, %6 {libname = "libdevice", libpath = "/home/siwasaki/triton/python/triton/language/libdevice.10.bc", symbol = "__nv_fmaf_rd"} : tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked>, tensor<1024xf32, #blocked> -> tensor<1024xf32, #blocked>
    %8 = tt.splat %arg1 : (!tt.ptr<f32>) -> tensor<1024x!tt.ptr<f32>, #blocked>
    %9 = tt.getelementptr %8, %0 : tensor<1024x!tt.ptr<f32>, #blocked>
    tt.store %9, %7 : tensor<1024xf32, #blocked>
    return
  }
}
```
2022-09-01 16:34:27 -07:00
Keren Zhou
328b87aec6
Keren/tensor slice insert alloc ( #94 )
...
This branch defines three new triton_gpu operations to partially solve #87. Below is an overview:
```
%tensor = triton_gpu.alloc_tensor : tensor<2x16x16xf16, #A>
%b = triton_gpu.insert_slice_async %a_ptr, %tensor, %offset {axis = 0 : i32, cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<16x16x!tt.ptr<f16>, #AL> -> tensor<2x16x16xf16, #A>
%c = triton_gpu.extract_slice %b, %offset {axis = 0 : i32} : tensor<2x16x16xf16, #A> -> tensor<16x16xf16, #A>
```
We plan to fully replace `copy_async` with `insert_slice_async`. **This hasn't been done yet.**
2022-09-01 12:37:17 -07:00
Keren Zhou
02ebf24d35
Analyze shared memory alias ( #81 )
...
The purpose of this PR is to analyze shared memory aliases so that we can
fix memory allocation bugs and save memory allocations in Triton code
involving complex control flow.
Changes to the memory barrier and allocation passes are on the way.
Co-authored-by: Philippe Tillet <phil@openai.com>
2022-08-29 10:43:20 -07:00
Keren Zhou
e0bedeb44c
[BACKEND] Keren/shared memory barrier ( #59 )
2022-08-18 12:32:57 -07:00
goostavz
fc58250a06
[BACKEND] Add backend support of arith::AddIOp, arith::AddFOp, GetProgramIdOp & GEPOp and bugfix for SplatOp, StoreOp, FuncOp ( #60 )
...
Add backend support of arith::AddIOp, arith::AddFOp, GetProgramIdOp, GEPOp and bugfix for SplatOp, StoreOp, FuncOp
Co-authored-by: gzhu <gzhu@nvidia.com>
2022-08-18 20:46:45 +08:00
Yan Chunwei
b1673caaf6
[FRONTEND] Expose end-to-end compile to python frontend ( #58 )
2022-08-17 10:42:48 -07:00
Yan Chunwei
920723cf3d
[BACKEND] add triton-translate to translate mlir to llvmir or PTX code ( #37 )
2022-08-07 22:34:36 -07:00
Keren Zhou
a7b49b3227
[BACKEND] Memory allocation ( #33 )
2022-08-04 11:22:49 -07:00
Yan Chunwei
b988bae813
Init TritonGPU to LLVM dialect conversion ( #32 )
...
* add toLLVM pass
* update num-warps setting in mlir
2022-08-04 10:15:45 +08:00
Philippe Tillet
6d62d88d4f
[CI] run clang-format ( #24 )
2022-07-26 17:25:03 -07:00
Philippe Tillet
a633d2b403
[Analysis] Added Axis Info Analysis ( #8 )
2022-07-19 13:38:48 -07:00
Phil Tillet
65237f6117
[PACKAGING] Added FileCheck
2022-07-07 16:53:19 -07:00
Yan Da
560e29229b
register conversion in triton-opt
2022-06-07 19:33:51 +08:00
Yan Da
366dddc3bc
update mma encoding & triton-opt
2022-06-06 21:03:58 +08:00
Yan Da
d5eca56cf3
more TritonGPU unit tests
2022-06-05 14:25:09 +08:00
Yan Da
55cf9a0a97
Add triton's opt
2022-06-04 22:10:00 +08:00