Commit Graph

1291 Commits

Justin Lebar
073aa16379 [BUILD] use ninja (#2318) 2023-09-17 02:08:04 -07:00
Philippe Tillet
c98671cf7c Revert "Update integration-tests.yml" (#2323)
reverts #2310 as recent changes to Triton-IR have broken third-party backends
2023-09-17 01:16:00 -07:00
Justin Lebar
0015611c17 [DOCS] Add build instrs for running in a virtualenv. (#2320)
On my machine, when I try to `pip install cmake` outside a virtualenv,
it gets mad at me and tells me to use apt.  Which doesn't quite work for
some reason.  Anyway maybe this is simple to Python people, but perhaps
worth mentioning.  Especially because we have `.venv` in gitignore
already.
2023-09-17 00:14:33 -07:00
Justin Lebar
41584c71a6 Add cuobjdump and nvdisasm to gitignore. (#2319)
Otherwise, these files show up in `git status` under
python/triton/third_party/cuda/bin/.
2023-09-17 01:53:46 +00:00
Thomas Raoux
bb949d1141 [BACKEND] Move struct optimization down the LLVM pipeline (#2312)
Move the optimization to remove phi of struct later in the optimization
pipeline to avoid interfering with CFG optimization.
2023-09-16 12:28:53 -07:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Michael Melesse
78a0b5dc2a [CI] update integration-tests.yml (#2310) 2023-09-15 18:38:15 -07:00
Zahi Moudallal
db5c793f82 [FRONTEND] Add sass to asm dict with lazy evaluation (#2309) 2023-09-15 15:31:43 -07:00
kshama-msft
ac1c216110 [DOCS] update README.md (#2311)
Triton conf registration closed.
2023-09-15 15:07:38 -07:00
Thomas Raoux
976aabdeb2 [BUILD] Fix few dependencies and layering issues to make lld work (#2307)
This fixes a few problems that were preventing me from using the lld linker.
2023-09-15 17:00:58 +00:00
Keren Zhou
08c1658957 [FRONTEND] Accommodate new triton IR format (#2294)
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.
2023-09-14 09:03:23 -07:00
Zahi Moudallal
36087a108f [FRONTEND] Added SASS to asm dict (#2280) 2023-09-13 21:21:01 +00:00
Zahi Moudallal
a301502d25 [BACKEND] Fixing assert in shared encoding swizzling addresses calculation (#2292) 2023-09-13 12:58:42 -07:00
Bin Fan
38a2ecdccf [OPTIMIZER] Fix Shared layout in OptimizeDotOperands pass to generate correct swizzling code (#2180)
fix bug #1937

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-13 12:52:09 -07:00
Khushi Agrawal
c61d772eee [DOCS] add missing docs (#2154) 2023-09-13 19:30:40 +00:00
Thomas Raoux
b63e8f87fc [FRONTEND] Override prototype (#2214)
Low-tech but very useful way to override kernels on the fly. This can be
used to debug functionality or performance problems: it lets users dump,
modify, and feed IR back into the JIT compiler.
2023-09-13 10:05:47 -07:00
Thomas Raoux
cf7f8c5ea4 [BACKEND] Optimization to sink broadcast ops (#2274)
Try to move broadcast ops after arithmetic and convert ops in order to
reduce the amount of work needed.
2023-09-13 10:04:36 -07:00
Sergey Kozub
896ee611e0 [NFC] Create explicit conversion pattern for ExternElementwiseOp in TT->TTGPU pass (#2284)
This is needed for forward-compatibility with MLIR that now has
"inherent" and "discardable" attributes
(https://mlir.llvm.org/OpenMeetings/2023-02-09-Properties.pdf) and the
ExternElementwiseOp attrs do not propagate with the current
`addNamedAttrs` implementation.
2023-09-13 10:04:01 -07:00
Thomas Raoux
d3956a21f3 [BACKEND] Add LLVM pre-processing pass to break struct types (#2285)
Add infrastructure to be able to add and test custom LLVM passes in the
backend. This will allow us to apply some low-level optimizations and
cleanup on LLVM IR.
Add a first pass that breaks up phis of struct created by lowering to
LLVM. Those can often pessimize the optimizer, as they block
optimizations from going through phi nodes.
2023-09-13 10:03:29 -07:00
Zahi Moudallal
e95e1f12eb [BACKEND] Convert layout illegal mem access fix (#2287) 2023-09-13 10:02:25 -07:00
Thomas Raoux
994f7e4460 [BACKEND] Remove dependency between NVGPU and TritonNvidiaGPU (#2282) 2023-09-12 11:02:20 -07:00
Ying Zhang
37f12497b0 [FRONTEND] Add PyTorch fp8 dtypes to Triton (#2279)
Add PyTorch fp8 dtypes
(8025b193a9/torchgen/api/types/types.py (L50-L51))
to Triton.
2023-09-12 08:57:01 -07:00
Zahi Moudallal
a47f1f5c28 [BACKEND] Unify slow/fast reduce codegen (#2220) 2023-09-12 08:46:19 -07:00
jsh-20
fc5d7e6e7c [FRONTEND] Improve grid calculation for persistent kernels to hoist perf on problems that need few blocks (#2283)
Constrain the number of launched blocks to exactly what a persistent
warp-specialized kernel needs. This is useful when problems need very
few blocks.
e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve ~16% speedup.
2023-09-12 09:14:47 +00:00
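The grid sizing described in this change can be sketched in plain Python (hypothetical helper names; the actual launch logic lives in the Triton runtime):

```python
# Sketch of the grid calculation: launch no more persistent blocks
# than the problem has output tiles.
def cdiv(a: int, b: int) -> int:
    return (a + b - 1) // b

def persistent_grid(M: int, N: int, BLOCK_M: int, BLOCK_N: int, num_sms: int) -> int:
    num_tiles = cdiv(M, BLOCK_M) * cdiv(N, BLOCK_N)
    # Launching num_sms blocks unconditionally wastes blocks on small
    # problems; capping at num_tiles avoids the idle ones.
    return min(num_sms, num_tiles)

# M=N=800 with 128x128 blocks needs only 7 * 7 = 49 tiles.
print(persistent_grid(800, 800, 128, 128, 132))  # -> 49
```

For the 800x800x60000 example above, only 49 blocks are launched instead of one per SM.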
peterbell10
ab9da3b2b8 [FRONTEND] Fix expand_dims and tl.full to handle scalar tensors (#2275)
This fixes a few bugs related to scalar tensors:
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
2023-09-11 20:59:13 -07:00
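The intended shape semantics can be sketched in plain Python (this is not the Triton implementation, just the rule the fix enforces):

```python
def expand_dims_shape(shape, axis):
    """Insert an axis of size 1, as `tensor[None]` / expand_dims should."""
    shape = list(shape)
    shape.insert(axis, 1)
    return shape

scalar_shape = []                          # a scalar tensor is 0-d
print(expand_dims_shape(scalar_shape, 0))  # scalar[None]       -> [1]
print(expand_dims_shape([1], 0))           # scalar[None, None] -> [1, 1]
```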
Philippe Tillet
bf4f9375a7 [FRONTEND] allow mixed precision FP8 matmul on pre-H100 hardware (#2281) 2023-09-11 20:54:29 -07:00
Lixun Zhang
a5e483652b [NFC] Remove hard-coded warpSize=32 in scanOp lowering (#2272)
- To make the development on AMD GPUs a little easier
- Also changed `laneId` to `laneIdAxis` in some helper functions in
scanOp lowering
2023-09-11 19:35:18 -07:00
Shintaro Iwasaki
8da27c1c95 [Build] Fix very minor compilation problems (#2277)
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It looks like nit-picking, but it'd be really
appreciated if it could be addressed in OSS Triton (to reduce
differences from OSS), and we believe these changes are not bad in
general. Neither performance nor functionality is affected by this PR.

1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).

2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.

cc: @yuguo68
2023-09-11 19:28:31 -07:00
Thomas Raoux
a9db6b94b9 Remove wrong dependency between TritonGPU and NVGPU dialect (#2276) 2023-09-11 16:30:13 -07:00
danny.jang
ec4a968d44 [TESTS] Enhance benchmark flexibility (#2239)
Users can pass custom arguments to benchmarks. For example, a user can
pass `dtype`, which will be used to create tensors in a benchmark.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-09-11 15:31:30 -04:00
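A minimal sketch of the idea (hypothetical names, not the actual `triton.testing` API): the runner forwards user-supplied keyword arguments such as `dtype` to the benchmarked function.

```python
def run_benchmark(bench_fn, x_vals, **user_kwargs):
    # Forward any user-supplied kwargs (e.g. dtype) to each benchmark run.
    return {x: bench_fn(x, **user_kwargs) for x in x_vals}

def matmul_bench(n, dtype="float16"):
    # A real benchmark would build tensors of `dtype` and time a matmul;
    # here we just record the configuration.
    return f"n={n}, dtype={dtype}"

print(run_benchmark(matmul_bench, [128, 256], dtype="float32"))
```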
jon-chuang
5231d57c71 [TESTS] replace deprecated torch.testing.assert_allclose (#2250)
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would result in test
failure due to too strict atol/rtol.

To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.

See: https://github.com/pytorch/pytorch/issues/61844

Replace: https://github.com/openai/triton/pull/2242
2023-09-11 15:31:17 -04:00
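The replacement looks like this (requires PyTorch; `torch.testing.assert_close` picks rtol/atol defaults from the dtype, unlike the deprecated `assert_allclose`):

```python
import torch

a = torch.tensor([1.0, 2.0], dtype=torch.float16)
b = a + 1e-4  # perturbation far below float16 resolution

# Deprecated: torch.testing.assert_close replaces assert_allclose and
# chooses default tolerances appropriate for float16.
torch.testing.assert_close(a, b)
print("ok")
```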
Lixun Zhang
28d4c3bdb4 [BACKEND] Make sure getAxisBlockStride does not return 0 (#2273)
This can happen when the CTA shape is larger than the tensor shape along
the non-axis dim during scanOp lowering.
2023-09-11 11:02:56 -07:00
Christian Sigg
f6828e1a6f [Backend] Make ConvertTritonGPUToLLVMPass's tmaMetadata a member (#2271)
.. instead of an option.

This partially addresses https://github.com/openai/triton/issues/2265 to
no longer crash when printing a pass pipeline in textual form.

It is not a proper solution for the fact that pass results should be
stored in the IR and not in a pointer argument.
2023-09-11 07:16:54 -07:00
Philippe Tillet
3747843143 [OPTIMIZER] improvements to layout conversion removal (#2268)
* Improved heuristics for RemoveLayoutConversion; 
* add LayoutConversionOp canonicalizer for ViewOps
2023-09-09 22:06:27 -07:00
Keren Zhou
10f59d8ce0 [RUNTIME] Get the correct end idx for regular arguments of GPU kernels (#2262)
Previously, if there were any specializations of "1" or "constexpr"
mixed with unspecialized arguments in arbitrary order, we might have
encountered errors due to passing incorrect arguments. This was because
the length of the signature did not indicate the maximum index of
regular arguments.

https://github.com/openai/triton/issues/2229

@shunting314 @amjames 

More specifically for cases like:

```
kernel(
    b: tl.tensor,     # regular argument
    a: tl.constexpr,  # specialized, not part of the launch signature
    c: tl.int32 = 1,  # may be specialized when the value is 1
    d,                # regular argument
    e: tl.constexpr,
    ...
)
```
2023-09-07 23:31:07 -07:00
Thomas
52aa663dcb [BACKEND] Remove dead code (#2263) 2023-09-07 20:40:48 -07:00
Izzy Putterman
7d01c1852a Revert unintentional change (#2257)
This change seems to have been unintentionally reverted in the hopper
PR:
38d767ea93

Adding it back.
2023-09-07 10:48:12 -07:00
Beal Wang
7aae350fab [OPTIMIZER] async launch dots for hopper warp-specialized kernel (#2251) 2023-09-07 13:37:07 +08:00
Zahi Moudallal
f21b36c8c5 [CLEANUP] Delete binaries that went in by mistake (#2256) 2023-09-06 20:42:42 +00:00
Zahi Moudallal
3dec616c7c [CI] Fix submodule issue (#2253)
...
2023-09-06 20:21:29 +00:00
jon-chuang
36859aebff [DOCS] Add MLIR Autogenerated Docs to Sphinx Docs (#2234)
Partially fixes: https://github.com/openai/triton/issues/2226

Here are some example renderings:
![Screenshot from 2023-09-04
18-39-20](https://github.com/openai/triton/assets/9093549/e9c4af04-aeae-4021-a8db-6a4a82b59ae7)
![Screenshot from 2023-09-04
18-39-30](https://github.com/openai/triton/assets/9093549/410391b8-e07e-4bed-909c-8ce5484072d1)
![Screenshot from 2023-09-04
18-39-41](https://github.com/openai/triton/assets/9093549/f1eaef95-66c1-4506-a153-c6069e2b5072)
2023-09-06 08:17:12 +00:00
Wang Weihan
92e2c32283 [BACKEND] Rebased Intel GPU Backend to be compatible with latest Triton version (#2245) 2023-09-06 08:03:31 +00:00
Thomas
60643f2a2d [BACKEND][NFC] Simplify coalescing pass (#2230)
Remove unnecessary templates and simplify the mapping of operations to
encoding.
2023-09-06 00:52:06 -07:00
Wang Weihan
e721911705 [FRONTEND] clean build directory when executing python setup.py clean (#2238)
The current setup.py could not clean the build directory because the
default build directory has been changed in `CMakeBuild`. This PR cleans
the build directory accordingly.
2023-09-04 21:31:38 -07:00
jon-chuang
99f8f912aa [OPS] Remove unnecessary perf bug workaround (#2240)
This bug previously existed; I verified it in a previous nightly
release of Triton (20230714).

However, according to new benchmarks, this bug no longer exists on
Triton main. See:
https://github.com/google/jax/pull/17328#issuecomment-1705010065
2023-09-04 21:30:54 -07:00
Keren Zhou
9e9fbe01f0 [FRONTEND] Fix specialization on triton integer types (#2236)
https://github.com/openai/triton/issues/2231
2023-09-03 23:57:08 -07:00
ivanyinwz
a539836876 Fix predicate for store tiled op (#2215)
The predicate of the Store Tiled op was not set, which caused a large
perf drop due to duplicated memory traffic in the epilogue.
2023-09-04 02:27:00 +00:00
David Berard
13189bfe60 Coalesce pass - group values with same shape/order but different element type (#2199)
**Motivation**: We have a kernel that loads multiple types of tensors -
some int32 and some float16. The coalescing pass assigns `perThread = 8`
for the float16 tensors and `perThread = 4` for the int32 tensors,
resulting in unnecessary layout conversions that hurt performance.
Instead, we should just set `perThread = 8` for both of
these loads.

**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and same optimal ordering, even if they have different encoding or
dtype.

Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
2023-09-01 17:08:56 -07:00
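The `perThread` computation can be sketched as follows (a simplification, assuming each thread issues at most a 128-bit vectorized access; the real pass also accounts for contiguity and layout):

```python
def per_thread_elems(elem_num_bits: int, max_access_bits: int = 128) -> int:
    # Elements one thread can move in a single vectorized access.
    return max_access_bits // elem_num_bits

def group_per_thread(elem_bits_in_group):
    # After grouping tensors by shape/order, size the whole group by the
    # widest requirement, so fp16 and int32 loads share one layout.
    return max(per_thread_elems(bits) for bits in elem_bits_in_group)

print(per_thread_elems(16))        # float16 -> 8
print(per_thread_elems(32))        # int32   -> 4
print(group_per_thread([16, 32]))  # grouped -> 8, matching the fp16 loads
```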
Zahi Moudallal
acbf716889 [BACKEND] Refactoring NVGPUToLLVMPass (#2158) 2023-09-01 23:40:31 +00:00
Shantanu
a4df60e20a [FRONTEND] Fix GIL handling in error conditions (#2225)
The use of the opaque GIL state APIs should mean that the
PyErr_SetString is now safe, regardless of whether the caller has the
GIL or not.
2023-09-01 13:30:42 -07:00