748 Commits

Author SHA1 Message Date
Philippe Tillet
3452615d79 [BUILD] Reverted ptxas change and fixed bug in cache key computation (#1971) 2023-07-19 20:58:24 -07:00
Philippe Tillet
28a61484bc [FRONTEND] more leniency when converting to/from fp8e4b15 (#1969) 2023-07-19 18:26:21 -07:00
Philippe Tillet
68124676c9 [FRONTEND][BACKEND] Fix trans for float8e4b15 (#1964)
float8e4b15 is a packed type; it is incompatible with most of our layout conversions. For now, we just convert to float16.
2023-07-19 11:30:39 -07:00
nccx
cd61f99fb5 [DOCS] remove empty README (#1963) 2023-07-19 10:51:38 -07:00
Philippe Tillet
c46a842b6f [TUTORIAL] more attention cleanup (#1958) 2023-07-18 12:36:15 -07:00
Philippe Tillet
9e3e10c5ed [OPTIMIZER][TUTORIAL] flash attention v2 (#1952) 2023-07-17 12:23:02 -07:00
David Berard
7202c6cff0 [FRONTEND] expose tl.max_constancy hint (#1951)
Similar to `tl.multiple_of` and `tl.max_contiguous`, `tl.max_constancy`
will expose a compiler hint indicating that all the values are equal in
a block of a certain size.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-17 18:30:25 +00:00
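The property that the `tl.max_constancy` hint from #1951 asserts can be illustrated in plain Python. This is only a sketch of the idea, assuming "constancy" means that every aligned block of the given size holds one repeated value; `is_constant_in_blocks` is a hypothetical helper and not part of Triton.

```python
def is_constant_in_blocks(values, block):
    """Return True if every aligned block of `block` values holds one repeated value.

    Hypothetical helper for illustration only; it is not a Triton API.
    """
    if block <= 0 or len(values) % block != 0:
        return False
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        if any(v != chunk[0] for v in chunk):
            return False
    return True


offsets = [5, 5, 5, 5, 7, 7, 7, 7]
constant_in_fours = is_constant_in_blocks(offsets, 4)   # each block of 4 is uniform
constant_in_eights = is_constant_in_blocks(offsets, 8)  # the block of 8 mixes 5 and 7
```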
Christian Sigg
80c6e39716 [BACKEND] Fix enable_debug implementation. (#1876)
Print before every pass and after failures if MLIR_ENABLE_DUMP is set.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-07-16 21:50:30 -04:00
Keren Zhou
bcfd990a88 [TESTS] Fix autopep8 error (#1948) 2023-07-16 16:55:12 -07:00
Mehdi Amini
51fc42a568 [FRONTEND] fix AST IR generation for while loop nested inside other SCF (#1947)
Visiting the body of the while loop twice did not properly restore the
insertion point and leaked the dummy block.
2023-07-15 10:17:29 -07:00
Philippe Tillet
8207eabd7b [FRONTEND][OPTIMIZER] small perf improvements (#1945) 2023-07-14 15:11:36 -07:00
Alex Collins
80163a9c1e [FRONTEND] Add support for default args in kernel wrappers (#1943)
Fixes the case where setting default values for arguments in a kernel
function signature results in a generated kernel wrapper function
without these default values.

For example:
```
@triton.jit
def kernel(x, y, z=3):
    ...

...
kernel[grid](x,y)
```

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-14 21:32:47 +00:00
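The default-argument behaviour fixed in #1943 can be sketched in plain Python. This is not Triton's wrapper generator; `make_wrapper` is a hypothetical helper that uses `inspect.signature` to show how a generated wrapper can carry over the defaults declared on the wrapped function.

```python
import inspect


def make_wrapper(fn):
    """Build a launcher that fills in defaults from fn's signature (hypothetical)."""
    sig = inspect.signature(fn)

    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        # Without this step, omitted arguments such as z would be missing.
        bound.apply_defaults()
        return fn(*bound.args, **bound.kwargs)

    return wrapper


def kernel(x, y, z=3):
    # Stand-in for a kernel body; it just combines its arguments.
    return x + y + z


launch = make_wrapper(kernel)
```

Calling `launch(1, 2)` uses the default `z=3`, while `launch(1, 2, z=10)` overrides it.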
Yan Chunwei
d0c35b3b7d Hot fix for AOT (#1939)
This PR addresses the following issues encountered when using AOT
kernels in our project:

1. When different signatures are set for the same Triton kernel, it can
result in C functions with the same name. This is problematic because C
does not support function overloading.

2. Currently, the AOT kernel always compiles with `num_warps=1`, as
indicated
[here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83).
However, the generated function includes a `numWarps` argument, which
can cause errors when the specified value does not match.

To resolve these issues, this PR makes the following changes:

1. Adds an 8-char hash key as a suffix to the generated function's
signature. This ensures that different function names are generated in C
when the argument dtype, constexpr value, or even hint differs, since we
hope these kernels can be used together in one C/C++ library.

2. Introduces a new flag called `num-warps` that allows manual
specification of the `numWarps` value for AOT. This change hardcodes the
specified value into the generated kernel.c and removes the `numWarps`
argument from the generated function.
2023-07-14 09:16:43 +08:00
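The name-suffix idea from #1939 can be sketched as follows. The commit only states that an 8-char hash key is appended; the choice of SHA-256, the key layout, and the helper name `suffixed_name` are assumptions made for this illustration.

```python
import hashlib


def suffixed_name(kernel_name, signature, constants=(), hints=()):
    """Return kernel_name plus an 8-char hash of its specialization (hypothetical).

    The hash function and key format are assumptions, not the AOT compiler's.
    """
    key = "|".join([
        ",".join(signature),
        ",".join(str(c) for c in constants),
        ",".join(str(h) for h in hints),
    ])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]
    return f"{kernel_name}_{digest}"


# Two specializations of the same kernel get distinct C-compatible names.
name_fp32 = suffixed_name("add_kernel", ["*fp32", "*fp32", "i32"], constants=[64])
name_fp16 = suffixed_name("add_kernel", ["*fp16", "*fp16", "i32"], constants=[64])
```

Because the suffix is derived only from the specialization, regenerating the same kernel with the same signature yields the same symbol name.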
Keren Zhou
571c92f2a8 [CI] Fix CI kernel compare (#1931)
With this PR, we find the latest merged PR that successfully passed
"Integration Tests".
2023-07-12 10:06:34 -07:00
Izzy Putterman
c615ce944c [FRONTEND] use local bindings in triton.cc (#1932)
Another follow-up to the relative imports change, this time dealing with the
bindings.
2023-07-12 02:19:48 +00:00
Keren Zhou
4795820014 [TESTS] Fix unmatched test names (#1933) 2023-07-11 19:08:28 -07:00
Stonepia
d50e32fab7 [FRONTEND] fix the hard code builder.arch that could block third_party tests (#1859)
For CUDA devices, the `builder.arch` is an int.
For third_party devices, it may not be an int, so this line would raise a TypeError. For example:

```
TypeError: '<' not supported between instances of 'dict' and 'int'
```

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-07-11 19:06:35 -07:00
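The failure mode described in #1859 is easy to reproduce in plain Python. The guard shown here is one possible way to avoid it and is not necessarily what the PR does; both function names and the threshold `80` are hypothetical.

```python
def unguarded_check(arch, min_arch=80):
    # Assumes arch is always an int, as it is for CUDA devices.
    return arch < min_arch


def guarded_check(arch, min_arch=80):
    # Only compare when arch is an int; other backends may pass a dict.
    return isinstance(arch, int) and arch < min_arch


third_party_arch = {"name": "some_device"}  # made-up non-int arch value

try:
    unguarded_check(third_party_arch)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```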
Philippe Tillet
bf5acf46e2 [OPS] improved pointer arithmetic in attention (#1926)
This provides an additional 3-4% speed-up in non-causal attention, which
now tops out at 155 TFLOPS.
2023-07-11 12:04:00 -07:00
Daniyal khan
b70d07aafe [BUILD][DOCS] updated setup.py and documentation (#1930) 2023-07-11 11:46:28 -07:00
Phil Tillet
041f1144e8 [DOCS] fixed flash_attn causal argument in tutorial 2023-07-11 09:28:20 -07:00
Goran Flegar
bbc1ad16d8 [BACKEND] Vectorize s8 to bf16 casts (#1879)
The code generated by LLVM ends up using 15 SASS instructions, while the
inline PTX added here only uses 8. It might be possible to reduce this
down to 6 if NVIDIA optimizes ptxas to use the byte selector in I2F for
all bytes (right now, we still have some bit manipulation code generated
for 2 out of 4 bytes).

This change improves the performance of a mixed precision matmul kernel
with M=N=K=4096, where one operand is cast from s8 to bf16, from 140
TFlop/s to 165 TFlop/s on A100-40GB.

Also refactors the ElementwiseOpConversionBase template to support
vectorized operations, reducing the boilerplate needed for existing, and
this new vectorized cast; and extends the casting test to process more
than one element (so vectorized casts can be properly tested).
2023-07-11 09:24:05 -07:00
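As a scalar reference for what an s8 to bf16 cast computes (not the vectorized inline PTX from #1879), the conversion can be written at the bit level. This sketch relies on bf16 being the upper 16 bits of an IEEE float32 and on every s8 value being exactly representable in bf16; the helper names are hypothetical.

```python
import struct


def s8_to_bf16_bits(value):
    """Return the 16-bit bf16 pattern for a signed 8-bit integer."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in s8")
    f32_bits = struct.unpack("<I", struct.pack("<f", float(value)))[0]
    # The low 16 bits are zero for every s8 value, so truncation is exact.
    return f32_bits >> 16


def bf16_bits_to_float(bits):
    """Expand a bf16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


one_bits = s8_to_bf16_bits(1)         # 0x3F80
minus_one_bits = s8_to_bf16_bits(-1)  # 0xBF80
```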
Philippe Tillet
8fe5524c75 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-07-11 00:23:26 -07:00
Philippe Tillet
7e3ebbc4c8 [TESTING] now using cuda graphs for perf regression tests (#1925) 2023-07-10 22:49:25 -07:00
danny.jang
4a20d5010b [FRONTEND] Fix a inspection warning (#1914)
"Expected type 'SupportsIndex', got 'constexpr' instead" is no longer
reported.
2023-07-10 21:30:59 -07:00
Izzy Putterman
d39d78fa08 [OPS] Add more perf-tests, new features to FA (#1849)
Adding new tests across the board for float32, bfloat16, non-powers-of-2
shapes (to test masks), and tests on sequence parallel for atomics. This
also adds the sequence parallel features from
https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.
I am not sure about the best way to grab the baseline benchmarking
numbers. I have access to V100s and A100s, but I saw on the tests it
mentions " # A100 in the CI server is slow-ish for some reason.
# On some other servers, we are getting about 90% peak for 8kx8x8k
float16". Current plan is to run CI here and use those numbers for
baseline, then match against my GPUs as a sanity check.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-07-10 18:52:59 -07:00
peterbell10
e3d9478d31 [OPTIMIZER] Add pass to move broadcasts after elementwise operations (#1811)
This adds a pass that tries to reduce the shape of tensor arguments to
element-wise operations by moving splat and broadcast operations later
in the graph. So, for example say we have:

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```

Today this results in duplicate `sin` calls:
```
    %27 = llvm.fmul %26, %3  : f32
    %28 = llvm.call @__nv_sinf(%27) : (f32) -> f32
    %29 = llvm.call @__nv_sinf(%27) : (f32) -> f32
```

The duplicate `llvm.fmul` calls are eliminated via CSE, but `llvm.call`
doesn't get CSE'd because it might be impure.

After this change, the sin is done on a scalar value in the triton IR
and splatted at the very end, so no duplicate calculation happens within
a thread.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-10 11:44:38 -07:00
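The effect of the pass in #1811 can be mimicked in plain Python by counting how often `sin` is evaluated. This is only an analogy for the reordering and not the MLIR pass itself; `CountingSin` and both functions are hypothetical.

```python
import math

DEG_TO_RAD = 0.017453292519943295


class CountingSin:
    """math.sin wrapper that records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return math.sin(value)


def sin_after_broadcast(scalar, block, sin):
    # Broadcast first, then apply the elementwise ops to every element.
    splat = [scalar] * block
    return [sin(v * DEG_TO_RAD) for v in splat]


def broadcast_after_sin(scalar, block, sin):
    # Apply the elementwise ops to the scalar, then splat at the very end.
    result = sin(scalar * DEG_TO_RAD)
    return [result] * block


before, after = CountingSin(), CountingSin()
out_before = sin_after_broadcast(90.0, 4, before)
out_after = broadcast_after_sin(90.0, 4, after)
```

Both orderings produce the same block of values, but the second evaluates `sin` once instead of once per element.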
peterbell10
ef947dac31 [FRONTEND] Fix tl.full with unsigned dtypes (#1919)
Calling `tl.full` with an unsigned dtype currently fails with the error:
```
AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute
'get_uint8'")
```

This PR defines those functions rather than changing the calls to the
signed versions so that we can use an unsigned argument type in C++ and
avoid overflow for large uint64 values.
2023-07-10 09:36:22 -07:00
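The overflow concern mentioned in #1919 can be checked with Python's `struct` module, which enforces the same 64-bit ranges as C's signed and unsigned integer types. The helper names are hypothetical; the sketch only shows why a large uint64 constant cannot be routed through a signed 64-bit argument.

```python
import struct


def fits_int64(value):
    """True if value can be stored in a signed 64-bit integer."""
    try:
        struct.pack("<q", value)
        return True
    except (struct.error, OverflowError):
        return False


def fits_uint64(value):
    """True if value can be stored in an unsigned 64-bit integer."""
    try:
        struct.pack("<Q", value)
        return True
    except (struct.error, OverflowError):
        return False


large = 2**64 - 1  # largest uint64 value
```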
Philippe Tillet
5a722b5f74 [OPS][TESTS] Added float8 support in triton.ops.matmul (#1918)
this also adds rather extensive testing for mixed precision mode,
including `float8e4b15 x float8e5` and `float8e5 x float16`
2023-07-10 09:31:12 -07:00
Philippe Tillet
dadf7a9a50 [TUTORIAL] Faster flash attention; added non-causal (#1917) 2023-07-09 13:38:06 -07:00
Thomas
bd900e0a6f [BACKEND] Fix reductions when number of unique element is smaller than layout (#1913)
Fix the calculation of the number of unique threads within a warp. We need to
consider the number of elements per thread in the calculation. Also
change the layout test to an integer sum in order to catch bugs with unique
data, as a max reduction may hide that kind of problem.
2023-07-07 19:48:13 -07:00
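Why #1913 switches the test to an integer sum can be seen with a small example: a reduction that wrongly visits duplicated elements still returns the correct maximum, but not the correct sum. The "buggy" input below is a made-up stand-in for double-counted data.

```python
unique_values = [3, 1, 4, 2]
# Simulate a reduction that mistakenly includes every element twice.
double_counted = unique_values + unique_values

max_hides_bug = max(double_counted) == max(unique_values)
sum_reveals_bug = sum(double_counted) != sum(unique_values)
```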
Natalia Gimelshein
778ed64a66 [BACKEND] make sure we always bind to primary context in loadBinary (#1912) 2023-07-07 14:28:03 -07:00
Bert Maher
38d767ea93 [FRONTEND] fix memory leak caused by retaining args to autotuned kernel (#1911) 2023-07-07 20:58:29 +00:00
Keren Zhou
cc5a7ed52f [FRONTEND][BACKEND] Materialize line info for triton kernels (#1902)
`export TRITON_DISABLE_LINE_INFO=1` to disable the feature.
2023-07-07 16:03:44 -04:00
Philippe Tillet
61e17db4f7 [TESTS] smaller problem sizes in matmul tests (#1908) 2023-07-06 14:36:02 -07:00
Philippe Tillet
6d1285e1ae [FRONTEND][BACKEND] improved fp8 specs (#1906)
This un-reverts commit
d4c941177e.
2023-07-06 13:03:53 -07:00
Thomas
787cdff0cd [TESTS] Enable parallel pytest in CI for CUDA (#1905)
Run most of the pytest suite in parallel. This speeds up CI from
36min to 10min for A100 and from 22min to 6min for H100. Some tests, such as
the runtime tests, still need to run serially.
2023-07-06 11:40:33 -07:00
Thomas
ae0ee5248f [FRONTEND] Add cumprod scan op (#1894)
Add and test cumprod. This also allows testing a case of accumulation
where 0 is not the identity element.
Also add documentation for scan functions.
2023-07-05 10:09:06 -07:00
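The point from #1894 about the identity element can be shown with a plain Python scan. This is a reference computation, not Triton's implementation; `scan_with_init` is a hypothetical helper.

```python
import operator
from itertools import accumulate


def cumprod(values):
    """Inclusive cumulative product."""
    return list(accumulate(values, operator.mul))


def scan_with_init(values, combine, init):
    """Inclusive scan that starts from an explicit initial accumulator."""
    out = []
    acc = init
    for v in values:
        acc = combine(acc, v)
        out.append(acc)
    return out


data = [1, 2, 3, 4]
seeded_with_zero = scan_with_init(data, operator.mul, 0)  # wrong identity
seeded_with_one = scan_with_init(data, operator.mul, 1)   # correct identity
```

Seeding the accumulator with 0 works for a cumulative sum but collapses a cumulative product to zeros.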
Philippe Tillet
d57dcd9994 [FRONTEND] pattern-match ExpandDims + Mul + Reduce into DotOp (#1889)
transforms e.g.:
```
x1 = tl.expand_dims(x0, axis=2)
y1 = tl.expand_dims(y0, axis=0)
z = tl.sum(x1 * y1, axis=1)
```

into
```
z = tl.dot(x0, y0) 
```

uses allowTF32 = True by default.
2023-07-05 00:31:21 -07:00
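The equivalence that #1889 pattern-matches can be verified on nested lists in plain Python. This is a reference check of the math only, using integers so the comparison is exact; the function names are hypothetical.

```python
def expand_mul_reduce(x, y):
    """Compute sum(expand_dims(x, 2) * expand_dims(y, 0), axis=1) on nested lists."""
    m, k, n = len(x), len(y), len(y[0])
    # prod[i][j][l] = x[i][j] * y[j][l], the broadcast elementwise product.
    prod = [[[x[i][j] * y[j][l] for l in range(n)] for j in range(k)]
            for i in range(m)]
    # Reduce over the middle axis (axis=1).
    return [[sum(prod[i][j][l] for j in range(k)) for l in range(n)]
            for i in range(m)]


def dot(x, y):
    """Plain matrix product."""
    m, k, n = len(x), len(y), len(y[0])
    return [[sum(x[i][j] * y[j][l] for j in range(k)) for l in range(n)]
            for i in range(m)]


x0 = [[1, 2], [3, 4]]
y0 = [[5, 6], [7, 8]]
z_pattern = expand_mul_reduce(x0, y0)
z_dot = dot(x0, y0)
```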
Yongjik Kim
387cdc8fe9 [FRONTEND] improve error message. (#1887) 2023-07-05 00:19:28 -07:00
Christopher Hesse
cc93356c4b [DOCS] update print docs (#1884) 2023-07-05 00:19:12 -07:00
Keren Zhou
16de0d6f55 [BACKEND] Recover the correct axis in the parent layout for reduce ops (#1885)
https://github.com/openai/triton/issues/1883
2023-07-05 00:17:00 -07:00
Keren Zhou
aa55d3b5bf [FRONTEND] Remove unnecessary replace in while op's after block (#1886)
We've already updated the mapping between name and tensor before
visiting each compound statement in the while op. As a result, any
overwritten name gets its up-to-date value in the while loop, and
any unchanged live-in names hold the original tensors.
2023-07-05 00:16:38 -07:00
Ying Hang Eng
55eb32dff9 Fix tl.device_assert compilation error (#1875)
As mentioned in #1769, we set file name, function name to 'unknown' and
lineno to 0 if frame is None
2023-06-30 22:19:25 +00:00
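The fallback described in #1875 can be sketched with Python's `inspect` module, whose `currentframe()` may return `None` on some interpreters. `describe_frame` and `caller` are hypothetical names; this is not the code from the PR.

```python
import inspect


def describe_frame(frame):
    """Return (file name, function name, line number) with safe fallbacks."""
    if frame is None:
        # No frame information available: use placeholder values.
        return "unknown", "unknown", 0
    return frame.f_code.co_filename, frame.f_code.co_name, frame.f_lineno


def caller():
    # On CPython, currentframe() returns the frame of this function.
    return describe_frame(inspect.currentframe())


fallback_info = describe_frame(None)
```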
Thomas
2e3182bab7 [BACKEND] Support scan on dimensions other that fastest moving one (#1863)
This relaxes the restriction in the scan lowering to support layouts where
we scan along a dimension which isn't the fastest moving one. This is
done by relaxing how we access elements during scanning and allowing
elements to be strided.
2023-06-30 12:40:48 -07:00
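The strided access that #1863 describes can be illustrated on a flat row-major buffer: scanning along axis 0 means consecutive scan elements are `ncols` apart rather than adjacent. This is a plain Python sketch with hypothetical helper names, not the LLVM lowering.

```python
import operator


def scan_strided(buf, start, count, stride, combine):
    """Inclusive scan over buf[start], buf[start + stride], ... (count elements)."""
    out = list(buf)
    for i in range(1, count):
        idx = start + i * stride
        out[idx] = combine(out[idx - stride], out[idx])
    return out


def scan_axis0(buf, nrows, ncols, combine):
    """Scan each column of a row-major nrows x ncols buffer."""
    out = list(buf)
    for col in range(ncols):
        out = scan_strided(out, col, nrows, ncols, combine)
    return out


flat = [1, 2, 3,
        4, 5, 6]  # a 2 x 3 matrix stored row-major
column_sums = scan_axis0(flat, 2, 3, operator.add)
```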
Oleg Shyshkov
66ed53d19d [FRONTEND] Support mixed-precision inputs in triton.ops.matmul. (#1754)
Support only combinations of float32 with float16 or bfloat16 for now.
Shouldn't change anything for cases when input types match.

That's a follow-up to the comment in my other PR:
https://github.com/openai/triton/pull/1746#issuecomment-1579630016.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-06-30 09:22:27 -07:00
Thomas
7a8a2da8ef [BACKEND] Enable lowering of f16 constant matmul (#1870)
Since the type expected for the mma encoding is i32, when lowering an f16
splat we need to pack f16 constants into an i32 value. This allows re-enabling
the constant matmul unit test.
2023-06-30 07:00:25 -04:00
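The packing step mentioned in #1870 can be sketched with Python's `struct` module, which supports IEEE half precision through the `e` format. Placing the first value in the low 16 bits is an assumption made here for illustration, and the helper names are hypothetical.

```python
import struct


def f16_bits(value):
    """Return the 16-bit IEEE half-precision pattern of value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def pack_f16_pair(low, high):
    """Pack two f16 values into one 32-bit pattern (low half first, by assumption)."""
    return (f16_bits(high) << 16) | f16_bits(low)


def splat_f16_as_i32(value):
    """Pack the same f16 constant into both halves of a 32-bit word."""
    return pack_f16_pair(value, value)


packed_ones = splat_f16_as_i32(1.0)  # 0x3C003C00
```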
Philippe Tillet
f77015967d Revert "[FRONTEND][BACKEND] improved fp8 specs (#1841)" (#1865)
This reverts commit d4c941177e.
2023-06-29 21:07:01 -04:00
Thomas
3be060849a [FEATURE] Add associative_scan support (#1858)
Implement associative_scan in the front end and implement lowering to
LLVM for blocked layout where the scan happens on the fastest moving
dimension. This will later be generalized to support more layouts.
2023-06-29 14:37:51 -07:00
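What an associative scan along the fastest moving dimension computes can be written as a plain Python reference. It takes the fastest moving dimension to be the innermost list; `associative_scan_rows` is a hypothetical helper and not the front-end API added in #1858.

```python
import operator


def associative_scan_rows(rows, combine):
    """Inclusive scan of each row with an associative combine function."""
    result = []
    for row in rows:
        scanned = []
        acc = None
        for i, value in enumerate(row):
            # The first element starts the accumulator; no identity is needed.
            acc = value if i == 0 else combine(acc, value)
            scanned.append(acc)
        result.append(scanned)
    return result


running_sum = associative_scan_rows([[1, 2, 3], [4, 5, 6]], operator.add)
running_max = associative_scan_rows([[3, 1, 2]], max)
```

Starting from the first element rather than a fixed seed lets the same routine serve any associative operator.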
Izzy Putterman
9961b5c7aa [TESTING] allow user to adjust warmup and repetition time for autotuning (#1850)
Adds an option to adjust warmup and repetition time for autotuning. It
should default to old values and have no effect on current kernels.
This is useful for bigger kernels where runtime might be a sizable
fraction of 100ms, leading to less warmup and more variance during
benchmarking.
2023-06-28 11:04:43 -07:00
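The motivation in #1850 can be made concrete with a small calculation of how a time budget turns into iteration counts. The formula and the 25 ms / 100 ms defaults are assumptions for illustration (the commit itself only mentions 100ms), and `plan_iterations` is a hypothetical helper rather than Triton's benchmarking code.

```python
def plan_iterations(estimate_ms, warmup_ms=25, rep_ms=100):
    """Turn time budgets into (warmup runs, timed runs), at least one of each."""
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat


# A kernel taking about 50 ms gets very few runs under the assumed defaults.
default_plan = plan_iterations(50.0)
# Larger budgets restore a reasonable number of warmup and timed runs.
tuned_plan = plan_iterations(50.0, warmup_ms=500, rep_ms=2000)
```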
Thomas
e5d7411a69 [BACKEND] Add .wt store cache modifier (#1831) 2023-06-28 17:40:30 +00:00