Commit Graph

1256 Commits

Author SHA1 Message Date
danny.jang
bb47f894f7 [FRONTEND] improve error message for shape mismatch (#2031)
Improve error messaging for block shape and value shape mismatch.
2023-08-08 01:13:16 -07:00
Philippe Tillet
658747feff [FRONTEND] remove ptxas from git (#2055) 2023-08-08 01:11:32 -07:00
allatit23
11cf334730 [hopper][ws] use per-agent thread idx by default (#2054)
Co-authored-by: Allen Zhao <allzhao@nvidia.com>
2023-08-08 15:28:10 +08:00
goostavz
b525880d8b [Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts in hopper tests (#2041)
Co-authored-by: goostavz <gzhu@nvidia.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
2023-08-08 06:30:04 +00:00
Bin Fan
a76ecd74e7 add num_stages parameter to aot compile.py (#2000)
This allows the AOT client to tune the number of stages for the
generated kernel. set the default number to 3 to match the triton
compiler.
2023-08-08 06:04:57 +00:00
BoxiangW
f21a053ee6 [TUTORIALS] support flash attention 2 with KV's sequence length longer than Q's (#2033)
Implemented this situation with and without causal mask.
My implementation with causal mask looks like:
111000
111100
111110
Where only the right upper triangle part will be masked.
I added `P_SEQ` for the notation of extra sequence length for KV.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:57:44 -07:00
ben-zhang-609
31e79aa384 [TESTS] remove get_proper_err, get_variant_golden (#2039)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:52:55 -07:00
Alex Collins
4ed8381fdb Linux arm64 support (#2003)
We are interested in having python wheels for triton built for Linux
arm64 platforms, such as NVIDIA's Grace CPU.

This change is fairly simple, however:
- It requires a linux arm64 build of LLVM to be available (see MR here:
https://github.com/ptillet/triton-llvm-releases/pull/15)
- For now my changes use the LLVM build hosted here:
https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17
- The Triton release process will need to be updated to include arm64
wheels. Is this something you have time to work on @ptillet? It would be
difficult for me to update this part without more access permissions.

With these changes, I managed to build a set of python wheels and have
hosted them here for us to use in the meantime:
https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
2023-08-08 12:39:41 +08:00
Qingyi Liu
341f5b61be [BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040)
Make sure that other threads within CTA do not operate on mbarrier until
it is initialized by thread 0.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 20:11:00 -07:00
danny.jang
6a1ac65043 [FRONTEND] improve error message for type mismatch (#2038) 2023-08-07 19:34:09 -07:00
Keren Zhou
30a331e628 [FRONTEND] Support jit functions without arguments (#2043)
Issue https://github.com/openai/triton/issues/1973

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 19:05:56 -07:00
Thomas
98523bcc48 [BACKEND] Support MMA V3 with float16 accumulator (#2049)
Also fixes a bug exposed in convertLayout lowering for float16. We
shouldn't be using cvt.pack.sat.u16.s32 to pack 16bits values as this
needs to take a 32bits register. Also this prevented optimization at
llvm ir level.
2023-08-07 15:55:44 -07:00
Phil Tillet
521cfae44d [CI] disabled float32 perf regression tests 2023-08-07 12:43:16 -07:00
jayfurmanek
32d7c6d646 Fix runtime/test_subproc.py for hip devices (#284)
* Fix runtime/test_subproc.py for hip devices

* address review comments
2023-08-07 10:30:36 -05:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
7158ec286a [MFMA] [Dot] Support vector loads in normal path (#275)
* [MFMA] [Dot] Support vector loads in normal path

This PR adds generation of vector loads in normal path of
MFMA dot operand loading.
This requires shared layout to have contiguous elements
which should be loaded by one lane.

* remove redundant refactoring

* fix tests

* extend test with transposed A/B tensors
2023-08-03 14:57:39 -05:00
Alexander Efimov
86f8b64ae0 [Dot] [MFMA] [FMA] Update Dot implementation to support upstream tests (#260)
* [Dot] [MFMA] Support FP16 output of MFMA dot

This PR adds cast of output tensor to requested data type.

* add tests

* fix test for FMA implementation

* loose fp16xfp16->fp16 tolerance

* enable FMA fallback for unsupported sizes of dot operation

* rework granularity check

* add constant modifier to granularity
2023-08-03 13:47:18 -05:00
Vinayak Gokhale
f1063bb33c Enable backward pass in FA tutorial test (#282)
Enabled the backward pass in the fused attention tutorial.
The tolerance when comparing to the naive implementation
had to be changed. The block size is forced to be 64x64
due to the 64 KiB LDS. Default is block 128 for A100's
larger SMEM. This creates differences in order of computation
and reuslts in a larger gap between the naive and FA
implementations.
2023-08-03 10:12:46 -05:00
Shucai Xiao
31cfda8f0e enable more gemm tests corresponding to PR#273 (#279) 2023-08-02 16:45:31 -05:00
Alexander Efimov
a318bc5777 Disable pipelining (#276)
This PR sets default value of pipeline stages to 1 for amd.
Add explicit num stages in test_gemm test
2023-08-01 14:54:17 -05:00
Shucai Xiao
82db24cde7 add gfx940/941 to archs that support mfma (#277) 2023-08-01 10:57:16 -05:00
Yan Chunwei
89b0b79d75 [FRONTEND] fix the silent return issue in AOT launcher (#2013)
In the current link.py, it produces the launcher code as below:

```python
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){
  if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0))
    return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
  // ...
  if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0))
    return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk);
}
```
Note that, when the input does not match any of the if branches, it will
do nothing, and the compiler should make it return 0 as a default
behavior, which equals to `CUDA_SUCCESS`, this doesn't match the
expectation.

This PR adds a `return CUDA_VALUE_ERROR;` to the tail of launchers, and
it produces code like:

```c++
CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){
  if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1))
    return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk);

  return CUDA_ERROR_INVALID_VALUE;
}
```

And it requires users to check the result in their application, which I
think should match the initial AOT ideas.
2023-07-31 09:59:28 -07:00
Philippe Tillet
52c146f66b [OPTIMIZER][BACKEND] significantly cleaner handling of mixed-precision kernels (#1949)
we currently have a very janky approach to optimizing mixed-precision
matmul workloads, where some layout combinations (e.g., NT matmul) were
explicitly pattern-matched to take a more optimized codepath. Attempt at
unifying all the codepaths to codegen cp.async failed, due to bugs in
SharedToDotOperandMMAv2.cpp.

This PR fixes said bugs, add some assertions for SharedToDotOperandMMAv2
modes that aren't well supported, and greatly simplify our handling of
element-wise operations between load and conversions to DotOperand.
2023-07-28 10:29:42 -07:00
Bin Fan
2689f4a3b0 [TOOLS][AOT] some issues in equal_to_1 hint (#1998)
- Change test_aot.py to actually use equal_to_1 hint

- In the client function, equal_to_1 parameters are not specialized,
because AOT clients may not know the details of Triton argument
specialization, they still want to use the same parameter list as they
write the Triton kernel. The generated kernels has specialized argument
list, the generated dispatcher code will make sure the correct arguments
from the original full argument list are passed.

- Fixed a bug in _match_suffix in link.py. Previously it assumes each
parameter has a suffix of either ‘d’ or ‘c’, but in fact sometimes a
parameter doesn’t have a suffix, like 0d1d2d34c56c78c
2023-07-27 16:07:49 -07:00
Alexander Efimov
0073bb98f4 [MFMA] Introduce dot operand loading fast path (#269)
* [MFMA] Introduce dot operand loading fast path

This PR introduces fast path for code generation of MFMA dot operand
loading from LDS.

Fast path is used when operand is not swizzled and is not slice of some
bigger LDS object(it is not a slice of a tensor).
This is a case for current FA and GEMM kernels compiled with
num_stages=1, i.e. software pipelining is disabled.

* cleanup swizzle info
2023-07-27 13:46:50 -05:00
Lixun Zhang
2fbffe2784 Fix masked load (#262)
* Fix the issue with masked load

Cherry-picked from a0b60eb187

* Remove tests in test_gemm that use too much LDS

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-07-26 10:45:21 -05:00
Shantanu
4f1b2ea8d7 [FRONTEND] fix error with -> None return annotation (#1987)
None is not a type, so you get:
```
    self.constexprs = [self.arg_names.index(name) for name, ty in self.__annotations__.items() if 'constexpr' in ty]
E   TypeError: argument of type 'NoneType' is not iterable
```

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-25 18:49:45 -07:00
Phil Tillet
db695c093f [TUTORIALS] fix format 2023-07-25 18:16:39 -07:00
Shucai Xiao
1cccf14f62 workaround for num_programs() to get correct value (#271)
* workaround for num_programs to return correct value for issue #259.
2023-07-25 15:46:19 -05:00
janEbert
62a8afa403 [TUTORIALS] Support FlashAttention-2 reference (#1984)
Uses FlashAttention-2 if available, otherwise acts as before (if
FlashAttention-1 is available, that is used, otherwise the
FlashAttention reference benchmark is not run).

I decided to keep the same name for the imported function, but feel free
to make me change that.
2023-07-24 13:54:01 -07:00
Shucai Xiao
356595f3dc revert PR#245 and #241 for gemm performance regression (#265)
* Revert "[MFMA][Dot] Swizzle mfma dot operands  (#245)"

This reverts commit 4d0deef45f.

* revert the swizzle and slice PRs for gemm performance regression

* fix build error after reverting the PR#245(commit 4d0deef45f) Swizzle mamf dot operands
2023-07-24 14:09:46 -05:00
Izzy Putterman
de6f053c0f [TRITON][OPS] add Flash Attention v2 to Ops (#1970)
I also dropped the do_scaled as it is no longer needed (no scaling done
to the do in v2).

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-23 14:07:15 -07:00
youkaichao
c9ab44888e [FRONTEND] improve the process of finding libcuda.so and the error message (#1981)
`triton` uses `whereis` command to find `libcuda.so`, which is intended
to find binary, source, and manual page files. When `libcuda.so` is not
properly setup, the `whereis` command ends up with
`/usr/share/man/man7/libcuda.7`, which is not the place to look for.

This PR uses `ldconfig -p` to reliably find `libcuda.so`.

In my case, I find that I have a `libcuda.so.1` file, but it is not
linked to `libcuda.so`. Therefore `ld` cannot find the library to link.
After creating the linking, I was able to run `triton` successfully.

Therefore, I improve the code by first invoking `ldconfig -p`, and
checking `libcuda.so` strings first. These might be possible library to
link against. If the literal `libcuda.so` file is not found, then I
raise an error and tells the user that a possible fix is to create a
symlink file.
2023-07-23 10:31:07 -07:00
Philippe Tillet
66eda76e45 [FRONTEND][BACKEND] no longer serialize float8e4b15 (#1979)
We had a number of complains that the previous packed format was
error-prone and may not yet be worth the 2 SASS instruction saved per 4
conversions
2023-07-21 22:44:55 -07:00
Phil Tillet
cfce82d715 [TUTORIALS] Flash Attention tutorial now properly tries fwd, bwd, causal, non-causal 2023-07-19 21:56:29 -07:00
Philippe Tillet
3452615d79 [BUILD] Reverted ptxas change and fixed bug in cache key computation (#1971) 2023-07-19 20:58:24 -07:00
Philippe Tillet
28a61484bc [FRONTEND] more leniency when converting to/from fp8e4b15 (#1969) 2023-07-19 18:26:21 -07:00
Philippe Tillet
68124676c9 [FRONTEND][BACKEND] Fix trans for float8e4b15 (#1964)
float8e4b15 is a packed type; it is incompatible with most of our layout conversions. For now, we just convert to float16.
2023-07-19 11:30:39 -07:00
nccx
cd61f99fb5 [DOCS] remove empty README (#1963) 2023-07-19 10:51:38 -07:00
Philippe Tillet
c46a842b6f [TUTORIAL] more attention cleanup (#1958) 2023-07-18 12:36:15 -07:00
Shucai Xiao
1c86e3238a remove multiple archtictures to isa head and adding gemm tuning scripts (#261)
* Remove adding multiple architectures to isa head

* Add mask for gpu memory load in scripts for tuning gemm 'script/amd/gemm/matmul.py'

* Move the scripts to a better place 'scripts/amd/gemm/'
2023-07-18 14:21:16 -05:00
Philippe Tillet
9e3e10c5ed [OPTIMIZER][TUTORIAL] flash attention v2 (#1952) 2023-07-17 12:23:02 -07:00
David Berard
7202c6cff0 [FRONTEND] expose tl.max_constancy hint (#1951)
Similar to `tl.multiple_of` and `tl.max_contiguous`, `tl.max_constancy`
will expose a compiler hint indicating that all the values are equal in
a block of a certain size.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-17 18:30:25 +00:00
Christian Sigg
80c6e39716 [BACKEND] Fix enable_debug implementation. (#1876)
Print before every pass and after failures if MLIR_ENABLE_DUMP is set.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-07-16 21:50:30 -04:00
Keren Zhou
bcfd990a88 [TESTS] Fix autopep8 error (#1948) 2023-07-16 16:55:12 -07:00
Mehdi Amini
51fc42a568 [FRONTEND] fix AST IR generation for while loop nested inside other SCF (#1947)
The process of visiting twice the body of the while didn't restore
properly the insertion point, and was leaking the dummy block.
2023-07-15 10:17:29 -07:00
Philippe Tillet
8207eabd7b [FRONTEND][OPTIMIZER] small perf improvements (#1945) 2023-07-14 15:11:36 -07:00
Alex Collins
80163a9c1e [FRONTEND] Add support for default args in kernel wrappers (#1943)
Fixes the case where setting default values for arguments in a kernel
function signature results in a generated kernel wrapper function
without these default values.

For example:
```
@triton.jit
def kernel(x, y, z=3):
    ...

...
kernel[grid](x,y)
```

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-14 21:32:47 +00:00
Yan Chunwei
d0c35b3b7d Hot fix for AOT (#1939)
This PR addresses the following issues encountered when using AOT
kernels in our project:

1. When different signatures are set for the same Triton kernel, it can
result in C functions with the same name. This is problematic because C
does not support function overloading.

2. Currently, the AOT kernel always compiles with `num_warps=1`, as
indicated
[here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83).
However, the generated function includes a `numWarps` argument, which
can cause errors when the specified value does not match.

To resolve these issues, this PR does the following modifications:

1. Adds an 8-char hash key as a suffix to the generated function's
signature. This ensures that different function names are generated in C
when the argument dtype or constexpr value or even hint differs since we
hope these kernels could be used in one C/C++ library.

2. Introduces a new flag called `num-warps` that allows manual
specification of the `numWarps` value for AOT. This change hardcodes the
specified value into the generated kernel.c and removes the `numWarps`
argument from the generated function.
2023-07-14 09:16:43 +08:00
oplavsic
d6e51fd221 [FA OPTIMIZATION] Keep results of FA dot operations in registers (#247)
* [WIP][FA OPTIMIZATION] Optimize chain dot

This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.

* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers

* Move operand swapping in ttgir -> llir lowering phase

* Refactor emitMfmaOffsetForCTA function to be more readable

* Fix accidental change in 06-fused-attention.py

* Address review comments

* Fix rebase errors
2023-07-12 15:25:55 -05:00