Commit Graph

881 Commits

Author SHA1 Message Date
Philippe Tillet
eea0718445 [TESTING] better cudagraph-based benchmarking (#2394) 2023-09-25 21:41:26 -07:00
ben-zhang-609
d040b58547 [HOPPER] fix ref check failure of flash attention with mma v3 (#2384) 2023-09-25 11:29:49 -07:00
Keren Zhou
57fc6d1f13 [BACKEND] shfl ptx insts should have side effects (#2376)
Otherwise, LLVM passes could generate a very strange CFG structure and
yield incorrect results.

https://github.com/openai/triton/issues/2361
2023-09-23 10:05:20 -07:00
edimetia3d
cb83b42ed6 [FRONTEND] using closure to create jit launcher (#2289)
Hi,

I'm adding some features to
`triton.runtime.jit.JITFunction._make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.

In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Adding a utility method that generates a temporary file to create
a launcher when debugging a kernel in the main module.~~
Using a closure to hold the launcher's body.

Because this feature might be useful to others, I have opened this
pull request.

~~Tests are yet to be added; if this submission is accepted, I will
add them later.~~
Since this change is a refactor, no new test was added.
2023-09-22 17:01:54 -07:00
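A minimal sketch of the idea behind this refactor (hypothetical names, not Triton's actual launcher code): a launcher built with `exec` on generated source is opaque to editors and debuggers, while a closure is an ordinary nested function that tools can step into.

```python
# Hypothetical illustration only; names do not match Triton's internals.

def make_launcher_via_exec(arg_names):
    # Old approach: exec() a generated source string. Debuggers cannot
    # step into this code and editors cannot inspect it.
    src = f"def launcher({', '.join(arg_names)}):\n    return ({', '.join(arg_names)},)"
    scope = {}
    exec(src, scope)
    return scope["launcher"]

def make_launcher_via_closure(run, bound_state):
    # New approach: the launcher body is a real nested function that
    # closes over what it needs, so breakpoints and stepping just work.
    def launcher(*args, **kwargs):
        return run(bound_state, *args, **kwargs)
    return launcher
```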
Bin Fan
1724604bd9 [DOCS] Add a tutorial example of grouped gemm (#2326) 2023-09-22 11:16:35 -07:00
q.yao
413b18eb73 [FRONTEND] fix core.dtype.__repr__ (#2372)
`function_type` does not have a `name` field, which leads to an error
when debugging with gdb.
2023-09-22 08:34:20 -07:00
Zahi Moudallal
293b7fd592 [TESTING] cleanup (#2293)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-22 05:37:14 +00:00
Philippe Tillet
c71ec14f31 [TEST] only test 4 configs without TF32 (#2370) 2023-09-21 21:23:19 -07:00
Alexander Zinoviev
d543eb1a36 [BACKEND] implement dot for INT8 on Turing (#2364)
Replace the single
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32 instruction
used on Ampere with 4 x
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 instructions
on Turing.

Extracted the Turing int8, Turing fp16, and Ampere paths into separate
functions.

Somehow I messed up my previous PR, so I'm opening a new one.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-21 16:40:53 -07:00
Philippe Tillet
32c9d2bb8f [FRONTEND] improved error messages (#2363)
This is a combination of #1774 and #2006, which I cannot edit and
which fail the CI pre-commit hook.
2023-09-21 15:05:57 -07:00
Thomas Raoux
e36c99b588 [BACKEND] Handle scans of non-commutative functions (#2362)
Make sure we accumulate in the right order for scans so that
non-commutative operations are handled correctly.
2023-09-21 12:00:41 -07:00
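As an illustration of why order matters, here is a hedged sketch (kernel name and shapes are assumptions, not from the PR) using a combine function that is associative but not commutative; with the fixed accumulation order the scan is an identity, while a reversed order would propagate earlier elements instead.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def take_right(a, b):
    # Associative: (a . b) . c == a . (b . c) == c.
    # Not commutative: a . b == b, but b . a == a.
    return b

@triton.jit
def scan_kernel(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # With the correct accumulation order this scan returns x unchanged;
    # a wrong order would broadcast earlier elements instead.
    y = tl.associative_scan(x, 0, take_right)
    tl.store(y_ptr + offs, y)

x = torch.arange(64, dtype=torch.float32, device="cuda")
y = torch.empty_like(x)
scan_kernel[(1,)](x, y, BLOCK=64)
assert torch.equal(y, x)
```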
peterbell10
8094f46632 [FRONTEND][BACKEND] Fix various atomic_rmw bugs (#2355)
This fixes a few bugs I've encountered:
- `atomic_add` with int64/uint64 fails with `Operation .add requires
.u32 or .s32 or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 fails with `ValueError('Cannot bitcast
data-type of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
2023-09-21 03:31:20 +00:00
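A hedged sketch exercising the three fixed paths (kernel and buffer names are illustrative assumptions):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def atomics_kernel(i64_ptr, f32_ptr, f64_ptr):
    # int64 atomic_add: previously rejected by ptxas for 'atom'.
    tl.atomic_add(i64_ptr, 1)
    # float32 atomic_max: previously returned the old value as int32.
    old = tl.atomic_max(f32_ptr, 2.0)
    # float64 atomic_min: previously hit the 64->32 bitcast ValueError.
    tl.atomic_min(f64_ptr, -1.0)

i64 = torch.zeros(1, dtype=torch.int64, device="cuda")
f32 = torch.zeros(1, dtype=torch.float32, device="cuda")
f64 = torch.zeros(1, dtype=torch.float64, device="cuda")
atomics_kernel[(1,)](i64, f32, f64)
```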
ben-zhang-609
bcaf14755a [HOPPER] enable flash attention with tma (#2336) 2023-09-20 14:06:56 -07:00
Thomas Raoux
9cab885dff [BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343)
Also add a test for MMA v3 reduction.
2023-09-20 14:05:12 -07:00
Keren Zhou
ed5a53057d [BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345)
https://github.com/openai/triton/issues/2298
2023-09-20 12:25:52 -04:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Shantanu
8e75e392ae [FRONTEND] Fix Python error handling in launch (#2334)
This was regressed by #2185 because we didn't realise the CUDA_CHECK
macro could make Python calls (similar to what led to #2225). I think
the PyErr_Occurred check got removed in that PR because there was
missing error handling before the call to _launch, so it looked like it
was just in the wrong place.

It looks like there are also potentially a couple of places in cuda.c
that can return with an error set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled, etc., but those are all
pre-existing and not affected by recent changes.
2023-09-19 00:12:00 -07:00
Philippe Tillet
73dae775df [DOCS] improved fused attention tutorial (bwd pass) (#2332) 2023-09-18 15:07:41 -07:00
Keren Zhou
307b5caa49 [BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330)
1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw
number of warps, to avoid communication among warps that handle the
same piece of data.
2. When there's a single warp on the axis, use warp intrinsics for
communication and skip shared memory.

A follow-up PR is needed for code cleanup.
2023-09-18 17:45:05 -04:00
Philippe Tillet
894fa9e943 [RUNTIME][INTERPRETER] now also override __str__ method for tensors (#2325) 2023-09-17 16:49:30 -07:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed code path and that the Triton core team is committed to
maintaining.
2023-09-17 14:58:50 -07:00
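For context, a minimal usage sketch (assuming the interpreter is enabled via the `TRITON_INTERPRET=1` environment variable, as in the usual workflow):

```python
# Run as: TRITON_INTERPRET=1 python this_script.py
# In interpreter mode the kernel executes as ordinary Python, so
# print() and pdb breakpoints work inside the kernel body.
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs))

x = torch.arange(16, dtype=torch.float32)
y = torch.empty_like(x)
copy_kernel[(1,)](x, y, BLOCK=16)
```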
Myeonghwan Ahn
2b066000aa [FRONTEND] fix matmul int8 overflow issue (#2297)
Previously, if matmul inputs were int8, the output was also int8.
This commit fixes the overflow problem by producing int32 output.
#2296
2023-09-17 16:41:02 +00:00
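A hedged sketch of the fixed behavior (tile addressing simplified; names are assumptions): int8 inputs accumulate into an int32 result instead of overflowing an int8 output.

```python
import triton
import triton.language as tl

@triton.jit
def int8_dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # Square BLOCK x BLOCK tiles; BLOCK must be >= 16 for tl.dot.
    rm = tl.arange(0, BLOCK)[:, None]
    rn = tl.arange(0, BLOCK)[None, :]
    a = tl.load(a_ptr + rm * BLOCK + rn)  # int8 tile
    b = tl.load(b_ptr + rm * BLOCK + rn)  # int8 tile
    # int8 x int8 accumulates in int32; store to an int32 buffer so the
    # result is not truncated back to int8.
    c = tl.dot(a, b)
    tl.store(c_ptr + rm * BLOCK + rn, c)
```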
Stonepia
68e1bd162c [FRONTEND] fix xpu stages logic (#2305) 2023-09-17 09:19:14 -07:00
jon-chuang
4f2d995fad [FRONTEND] Explicitly forbid dot(.., out_dtype=bfloat16) (#2308)
Fixes: https://github.com/openai/triton/issues/2302
2023-09-17 09:15:06 +00:00
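A hedged sketch of the sanctioned pattern (names assumed): accumulate in float32 and down-cast, instead of asking `tl.dot` for a bfloat16 output.

```python
import triton
import triton.language as tl

@triton.jit
def bf16_out_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    rm = tl.arange(0, BLOCK)[:, None]
    rn = tl.arange(0, BLOCK)[None, :]
    a = tl.load(a_ptr + rm * BLOCK + rn)
    b = tl.load(b_ptr + rm * BLOCK + rn)
    # tl.dot(..., out_dtype=tl.bfloat16) is now rejected; accumulate in
    # float32 and cast the result down instead.
    acc = tl.dot(a, b, out_dtype=tl.float32)
    tl.store(c_ptr + rm * BLOCK + rn, acc.to(tl.bfloat16))
```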
Justin Lebar
073aa16379 [BUILD] use ninja (#2318) 2023-09-17 02:08:04 -07:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating in
lower precision.
This only affects Hopper fp8 dots.
2023-09-15 18:42:54 -07:00
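A hedged sketch of the new accumulator-as-argument form (assuming the keyword landed as `acc`; the lower-precision flag itself is deliberately not named here):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_k_loop(a_ptr, b_ptr, c_ptr, K, BLOCK: tl.constexpr):
    # One (BLOCK, K) x (K, BLOCK) tile product; grid/masking omitted.
    rm = tl.arange(0, BLOCK)[:, None]
    rn = tl.arange(0, BLOCK)[None, :]
    rk = tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        a = tl.load(a_ptr + rm * K + (k + rk)[None, :])
        b = tl.load(b_ptr + (k + rk)[:, None] * BLOCK + rn)
        # The running accumulator is passed straight into tl.dot, which
        # is what lets the compiler keep it in lower precision for
        # Hopper fp8 dots when the flag permits.
        acc = tl.dot(a, b, acc=acc)
    tl.store(c_ptr + rm * BLOCK + rn, acc)
```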
Zahi Moudallal
db5c793f82 [FRONTEND] Add sass to asm dict with lazy evaluation (#2309) 2023-09-15 15:31:43 -07:00
Keren Zhou
08c1658957 [FRONTEND] Accommodate new triton IR format (#2294)
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attributes, though they are not used yet.
2023-09-14 09:03:23 -07:00
Zahi Moudallal
36087a108f [FRONTEND] Added SASS to asm dict (#2280) 2023-09-13 21:21:01 +00:00
Khushi Agrawal
c61d772eee [DOCS] add missing docs (#2154) 2023-09-13 19:30:40 +00:00
Thomas Raoux
b63e8f87fc [FRONTEND] Override prototype (#2214)
A low-tech but very useful way to override kernels on the fly. This
can be used to debug functionality or performance problems: it lets
users dump, modify, and feed IR back into the JIT compiler.
2023-09-13 10:05:47 -07:00
Zahi Moudallal
e95e1f12eb [BACKEND] Convert layout illegal mem access fix (#2287) 2023-09-13 10:02:25 -07:00
Thomas Raoux
994f7e4460 [BACKEND] Remove dependency between NVGPU and TritonNvidiaGPU (#2282) 2023-09-12 11:02:20 -07:00
Ying Zhang
37f12497b0 [FRONTEND] Add PyTorch fp8 dtypes to Triton (#2279)
Add PyTorch fp8 dtypes
(8025b193a9/torchgen/api/types/types.py (L50-L51))
to Triton.
2023-09-12 08:57:01 -07:00
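A hedged usage sketch (assuming the straightforward mapping of `torch.float8_e4m3fn` / `torch.float8_e5m2` onto Triton's fp8 element types):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fp8_copy_kernel(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    tl.store(y_ptr + offs, tl.load(x_ptr + offs))

# PyTorch fp8 tensors can now be passed directly into a kernel; the
# JIT maps their dtypes onto Triton fp8 types.
x = torch.randn(128, device="cuda").to(torch.float8_e4m3fn)
y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
fp8_copy_kernel[(1,)](x, y, BLOCK=128)
```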
Zahi Moudallal
a47f1f5c28 [BACKEND] Unify slow/fast reduce codegen (#2220) 2023-09-12 08:46:19 -07:00
jsh-20
fc5d7e6e7c [FRONTEND] Improve grid calculation for persistent kernels to hoist perf on problems that need few blocks (#2283)
Constrain the number of launched blocks to exactly what a persistent
warp-specialized kernel needs. This is useful when problems need very
few blocks, e.g. MxNxK=800x800x60000, f16_f16_f32, block size
128x128x64, non-split-K. Experiments show it can achieve a ~16%
speedup.
2023-09-12 09:14:47 +00:00
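A hedged sketch of the clamp (helper name and SM query are assumptions): launch at most as many persistent blocks as there are output tiles.

```python
import torch
import triton

def persistent_grid(M, N, BLOCK_M, BLOCK_N, num_sms):
    # Never launch more blocks than there are output tiles, even when
    # that is far fewer than the number of SMs.
    num_tiles = triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N)
    return (min(num_sms, num_tiles),)

num_sms = torch.cuda.get_device_properties(0).multi_processor_count
print(persistent_grid(800, 800, 128, 128, num_sms))  # 49 tiles -> at most 49 blocks
```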
peterbell10
ab9da3b2b8 [FRONTEND] Fix expand_dims and tl.full to handle scalar tensors (#2275)
This fixes a few bugs related to scalar tensors:
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
2023-09-11 20:59:13 -07:00
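A hedged sketch of the now-working scalar-tensor cases (kernel name assumed):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scalar_kernel(out_ptr):
    s = tl.full([], 42.0, tl.float32)  # 0-d scalar tensor now allowed
    z = tl.zeros_like(s)               # stays 0-d instead of becoming 1-d
    v = s[None]                        # scalar[None] -> shape [1]
    m = s[None, None]                  # scalar[None, None] -> shape [1, 1]
    tl.store(out_ptr + tl.arange(0, 1), v + z)

out = torch.empty(1, device="cuda")
scalar_kernel[(1,)](out)
```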
Philippe Tillet
bf4f9375a7 [FRONTEND] allow mixed precision FP8 matmul on pre-H100 hardware (#2281) 2023-09-11 20:54:29 -07:00
Shintaro Iwasaki
8da27c1c95 [Build] Fix very minor compilation problems (#2277)
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It looks like nit-picking, but it'd be really
appreciated if it could be addressed in OSS Triton (to reduce
differences from OSS), and we believe these changes are not bad in
general. Neither performance nor functionality is affected by this PR.

1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).

2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.

cc: @yuguo68
2023-09-11 19:28:31 -07:00
Thomas Raoux
a9db6b94b9 Remove wrong dependency between TritonGPU and NVGPU dialects (#2276) 2023-09-11 16:30:13 -07:00
danny.jang
ec4a968d44 [TESTS] Enhance benchmark flexibility (#2239)
Users can pass custom arguments to benchmarks. For example, a user can
pass `dtype`, which will be used to create tensors in a benchmark.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-09-11 15:31:30 -04:00
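A hedged sketch of the new flexibility (benchmark body is illustrative): entries in `args` are forwarded as keyword arguments to the benchmark function.

```python
import torch
import triton
import triton.testing

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"], x_vals=[2**i for i in range(12, 18)],
        line_arg="provider", line_vals=["torch"], line_names=["torch"],
        plot_name="add", ylabel="ms",
        args={"dtype": torch.float16},  # custom argument forwarded below
    ))
def bench_add(N, provider, dtype):
    x = torch.randn(N, device="cuda", dtype=dtype)
    y = torch.randn(N, device="cuda", dtype=dtype)
    return triton.testing.do_bench(lambda: x + y)

# bench_add.run(print_data=True)
```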
jon-chuang
5231d57c71 [TESTS] replace deprecated torch.testing.assert_allclose (#2250)
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would fail due to
overly strict atol/rtol.

To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.

See: https://github.com/pytorch/pytorch/issues/61844

Replace: https://github.com/openai/triton/pull/2242
2023-09-11 15:31:17 -04:00
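For reference, a small sketch of the replacement call: `torch.testing.assert_close` derives default tolerances from the dtype, so low-precision results get appropriately looser bounds.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float16)
b = (a.float() * 1.0005).to(torch.float16)  # ~5e-4 relative perturbation

# assert_close picks rtol/atol from the dtype (looser for float16),
# unlike the deprecated assert_allclose, whose one-size-fits-all
# tolerances were too strict for low-precision matmul outputs.
torch.testing.assert_close(b, a)
```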
Lixun Zhang
28d4c3bdb4 [BACKEND] Make sure getAxisBlockStride does not return 0 (#2273)
This can happen when the CTA shape is larger than the tensor shape along
the non-axis dim during scanOp lowering.
2023-09-11 11:02:56 -07:00
Keren Zhou
10f59d8ce0 [RUNTIME] Get the correct end idx for regular arguments of GPU kernels (#2262)
Previously, if there were any specializations of "1" or "constexpr"
mixed with unspecialized arguments in arbitrary order, we might have
encountered errors due to passing incorrect arguments. This was because
the length of the signature did not indicate the maximum index of
regular arguments.

https://github.com/openai/triton/issues/2229

@shunting314 @amjames 

More specifically for cases like:

```
kernel(
    b: tl.tensor,     # regular argument
    a: tl.constexpr,  # constexpr argument
    c: tl.int = 1,    # argument specialized when its value is 1
    d,                # regular argument
    e: tl.constexpr,  # constexpr argument
    ...
)
```
2023-09-07 23:31:07 -07:00
Izzy Putterman
7d01c1852a Revert unintentional change (#2257)
This change seems to have been unintentionally reverted in the Hopper
PR:
38d767ea93

Adding it back.
2023-09-07 10:48:12 -07:00
Zahi Moudallal
f21b36c8c5 [CLEANUP] Delete binaries that went in by mistake (#2256) 2023-09-06 20:42:42 +00:00
jon-chuang
36859aebff [DOCS] Add MLIR Autogenerated Docs to Sphinx Docs (#2234)
Partially fixes: https://github.com/openai/triton/issues/2226

Here are some example renderings:
![Screenshot from 2023-09-04 18-39-20](https://github.com/openai/triton/assets/9093549/e9c4af04-aeae-4021-a8db-6a4a82b59ae7)
![Screenshot from 2023-09-04 18-39-30](https://github.com/openai/triton/assets/9093549/410391b8-e07e-4bed-909c-8ce5484072d1)
![Screenshot from 2023-09-04 18-39-41](https://github.com/openai/triton/assets/9093549/f1eaef95-66c1-4506-a153-c6069e2b5072)
2023-09-06 08:17:12 +00:00
Wang Weihan
e721911705 [FRONTEND] clean build directory when executing python setup.py clean (#2238)
Currently, setup.py cannot clean the build directory because the
default build directory has been changed in `CMakeBuild`. This PR
cleans the build directory accordingly.
2023-09-04 21:31:38 -07:00
jon-chuang
99f8f912aa [OPS] Remove unnecessary perf bug workaround (#2240)
This bug previously existed, and I verified it in a previous nightly
release of Triton (20230714).

However, according to new benchmarks, this bug no longer exists on
Triton main. See:
https://github.com/google/jax/pull/17328#issuecomment-1705010065
2023-09-04 21:30:54 -07:00
Keren Zhou
9e9fbe01f0 [FRONTEND] Fix specialization on triton integer types (#2236)
https://github.com/openai/triton/issues/2231
2023-09-03 23:57:08 -07:00