Commit Graph

881 Commits

Author SHA1 Message Date
Shantanu
a4df60e20a [FRONTEND] Fix GIL handling in error conditions (#2225)
The use of the opaque GIL state APIs should mean that PyErr_SetString is
now safe, regardless of whether the caller holds the GIL or not.
2023-09-01 13:30:42 -07:00
Ethan Pronovost
1367f3a6d2 [FRONTEND/OPS] Swap stride_vn and stride_vk in flash attention (#2208)
I'm not sure if this was a typo or if I'm missing something. To me, code
like
```
(offs_n[:, None] * stride_vk + offs_k[None, :] * stride_vn)
```
seems off.
In case this is a typo, I made this PR to correct it; it should have no
functional changes.
If this is not a typo, would you mind explaining the reasoning behind
these variable names?
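The concern can be checked numerically with NumPy. This is a sketch with hypothetical shapes, not the actual kernel; `stride_vn` and `stride_vk` here name the element strides along the n and k axes of a stand-in V tensor:

```python
import numpy as np

# Stand-in for the V block; shapes and names mirror the snippet above
# but nothing here is taken from the real kernel.
V = np.arange(16, dtype=np.float32).reshape(4, 4)
stride_vn = V.strides[0] // V.itemsize  # stride along the n (row) axis: 4
stride_vk = V.strides[1] // V.itemsize  # stride along the k (col) axis: 1

offs_n = np.arange(4)
offs_k = np.arange(4)

# Row offsets scale by the row stride, column offsets by the column stride:
idx = offs_n[:, None] * stride_vn + offs_k[None, :] * stride_vk
assert np.array_equal(V.flat[idx], V)

# With the stride factors swapped, as in the snippet above, the same
# expression walks the transposed layout instead:
idx_swapped = offs_n[:, None] * stride_vk + offs_k[None, :] * stride_vn
assert np.array_equal(V.flat[idx_swapped], V.T)
```

So the swapped form reads V as if transposed, which is why the naming looks off unless the strides themselves were defined swapped upstream.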
2023-08-31 23:19:40 -07:00
Michael Melesse
c6d33dcebf [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes much of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* some unit tests are skipped for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-31 14:02:00 -07:00
Philippe Tillet
ec51552fff [BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212) 2023-08-30 14:06:31 -07:00
jon-chuang
9af76e7d5a [RUNTIME] Fix cache dir (#2196)
---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-29 21:07:16 -04:00
goostavz
1465b573e8 [TESTS][HOPPER] Prune hopper tests to speedup CI (#2193)
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-27 20:45:23 -07:00
Philippe Tillet
5f448b2f08 [FRONTEND] remove dead libhopper_helpers.bc file (#2190) 2023-08-26 12:17:17 -07:00
Greg Brockman
a9b8c8c37d [FRONTEND] drop GIL for launch, and set value=false upon pointer error (#2185) 2023-08-26 17:07:57 +00:00
Keren Zhou
6e4932cda8 [BACKEND] Fix fma mixed-precision (#2184)
and expose the allow_tf32 argument to the matmul op

@shunting314
2023-08-26 09:49:58 -07:00
Greg Brockman
ab3e8b0dad [FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188) 2023-08-26 09:19:34 -07:00
Mohammed Anany
ebfe0ffb29 [FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-25 10:28:23 -07:00
Ethan Pronovost
56fee37a0d [FRONTEND] Fix benchmark plotting (#2177) 2023-08-24 20:34:04 -07:00
Greg Brockman
64d8df4c69 [FRONTEND] handle errors from launch_enter_hook (#2178) 2023-08-24 20:32:01 -07:00
Shantanu
7083dae4f2 [FRONTEND] drop the GIL around more CUDA ops (#2173) 2023-08-24 20:31:38 -07:00
Zahi Moudallal
120ce0a5bf [DOCS] Fixing docs (#2175) 2023-08-24 15:58:59 -07:00
chengjunlu
6cb67185f8 [FRONTEND] Use proper default num_warps and num_stages based on the device backend in JITFunction (#2130)
The default values used by JITFunction for num_warps and num_stages are
coupled to the Nvidia GPU architecture. We should use proper default
values based on the device backend that the kernel is compiled for.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-08-24 21:58:18 +08:00
Bin Fan
dad83f9dcb [TOOLS] Add support for autotuning AOT kernel (#2123)
This PR makes the following change to AOT kernel

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of a constexpr set and
meta-parameters is referred to as an "algo". Within an algo, the client
can still give different hints about integer arguments.
- Add an API `int ${kernel_name}_get_num_algos()` that returns the total
number of algos.
- Add an algo_id parameter to the generated kernel so the client can
select the algo.
- Remove gX, gY and gZ from the kernel parameter list. The launch grid
usually differs between algos, and the client should not need to care
about how to compute the launch grid for each algo. Instead, we ask the
client to pass the expressions that compute gX, gY and gZ to compile.py
(when AOT kernels are generated). The expressions can only use kernel
parameters or constant values.
- We also change the testing flow. Now we first build the kernels into a
shared library libkernel.so, then the client test.c code is built and
linked against libkernel.so. This is closer to a typical AOT kernel usage
flow.
2023-08-23 09:38:29 -07:00
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
Keren Zhou
6a65c894fe [TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153) 2023-08-22 18:02:26 -07:00
Keren Zhou
5fa1fa1b27 [FRONTEND] Emit warning if the result of tl.advance is unused (#2155)
https://github.com/openai/triton/issues/2138
2023-08-22 18:02:00 -07:00
danny.jang
c4a9006340 [FRONTEND] fix a typo (#2152) 2023-08-22 13:02:31 -04:00
ivanyinwz
ec801ce18e [BACKEND] Optimize performance for f16 epilogue with TMA store (#2135)
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
2023-08-21 12:44:11 -07:00
Philippe Tillet
ea8416164f [FRONTEND] name mangling fixup (#2148) 2023-08-21 12:11:52 -07:00
Beal Wang
7e5cd95bf2 [OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146)
For a warp specialized persistent kernel, the instruction sequence for
the warp groups is
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this forms a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K dimension is smaller than the K tile shape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will always be 0 in this case, which may form the order
W0 -> W1 -> R0 -> R1. This contradicts atomicity:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
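The index computation from the pseudocode can be replayed in plain Python (the parameter values below are illustrative, not taken from any real kernel configuration):

```python
def mbarrier_ids(num_waves, num_inner_loop_steps, num_k_tiles):
    """Replay the idx sequence from the pseudocode above."""
    ids = []
    for wave in range(num_waves):
        idx = wave * num_inner_loop_steps
        for _ in range(num_k_tiles):
            ids.append(idx)
            idx += 1
    return ids

# Normal case: each wave starts at a fresh barrier id.
assert mbarrier_ids(2, 2, 2) == [0, 1, 2, 3]

# Degenerate case (K smaller than the K tile shape): num_inner_loop_steps
# is 0, so every wave restarts at id 0 and all waves race on one barrier.
assert mbarrier_ids(2, 0, 1) == [0, 0]
```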
2023-08-21 14:46:57 +08:00
Thomas
54ca7fcb35 [FRONTEND] Use inline asm for global timer and smid functions (#2143)
Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on a bc file.
2023-08-20 22:56:37 -07:00
Keren Zhou
584d5c263f [FRONTEND] Disable IfExp on dynamic conditions (#2100)
Entering `node.body` when `_unwrap_if_constexpr(cond)` is truthy is wrong
when cond is a tensor, since we cannot statically evaluate a dynamic
tensor's value.

The right way to solve the problem is probably:

1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
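A minimal sketch of the underlying problem (the helper names below are hypothetical stand-ins, not the actual frontend code):

```python
class Tensor:
    """Stand-in for a Triton tensor: its value is unknown until runtime."""

def unwrap_if_constexpr(v):
    # Return the Python value for compile-time constants, None otherwise.
    return v if isinstance(v, (bool, int)) else None

def lower_ifexp(cond, body, orelse):
    c = unwrap_if_constexpr(cond)
    if c is None:
        # A dynamic tensor condition cannot pick a branch at compile time,
        # so refusing is safer than silently taking `body`; handling it
        # properly requires lowering both branches, as outlined above.
        raise TypeError("IfExp on a dynamic condition is not supported")
    return body if c else orelse

assert lower_ifexp(True, "a", "b") == "a"
assert lower_ifexp(0, "a", "b") == "b"
```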
2023-08-20 12:58:10 -07:00
Alexander Zinoviev
a7b40a10f9 [TESTS] Fix tl.dot test on sm75 (#2140)
Disable tf32 if run on sm75 and below
Fix the pattern match to compare the generated ptx against if run on
sm75
2023-08-19 22:21:18 -07:00
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Thomas
23ef2615d2 [BACKEND] Merge TT_ElementwisePureExtern and TT_ElementwiseImpureExtern (#2137)
Use getEffect instead to tell passes whether the op has side effects or
not. This doesn't change functionality otherwise.
2023-08-18 20:56:10 +00:00
Thomas
bf351b9ba2 [FRONTEND][BACKEND] Add support for elementwise inline assembly (#2136)
Add a new operation that makes it possible to implement packed inline
assembly for elementwise operations. This way inline assembly can be used
to control how elementwise operations are lowered, and elements can be
packed to manually vectorize operations.
2023-08-18 12:57:52 -07:00
Whitney Tsang
100cabd0e4 [FRONTEND] use enum instead of bool to select target (#2118)
Before this PR, whether `TritonGPUToLLVMIRPass` generates NVVM-compatible
or ROCDL-compatible LLVM is controlled by a boolean `isROCM`, which is
hard to scale.
This PR changes it to use an enum instead, so that new targets can be
added easily when needed.
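A small Python illustration of the design point (the names below are made up for illustration; the real pass is C++):

```python
from enum import Enum, auto

class Target(Enum):
    NVVM = auto()
    ROCDL = auto()
    # Adding a new backend is one more member here, instead of a second
    # boolean threaded through every call site.

def dialect_for(target: Target) -> str:
    # Dispatch on the enum; a KeyError flags an unhandled target instead
    # of a boolean silently falling through to the wrong branch.
    return {Target.NVVM: "nvvm", Target.ROCDL: "rocdl"}[target]

assert dialect_for(Target.ROCDL) == "rocdl"
```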

---------

Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-17 18:37:09 -07:00
youkaichao
871ec2ad37 [FRONTEND] add informative error msg to help find libcuda (#2019)
When using a Kaggle GPU (https://www.kaggle.com/), I found that
`ldconfig -p` does not show libcuda.so; `ldconfig` must be run (with
sudo) to refresh the cache before libcuda.so can be found.

Therefore, I added this informative message to help users find
libcuda.so.
2023-08-17 19:32:00 -04:00
Thomas
387fc890a5 [FRONTEND][BACKEND] Add a performance test for reductions (#2125)
Also stop promoting integer types, as it doesn't give better perf; this
will allow more vectorization opportunities in the future.
2023-08-17 16:30:33 -07:00
YouJiacheng
0970a297b2 [TUTORIALS] allow BLOCK(bwd) != BLOCK_M(fwd) in flash attention (#2020) 2023-08-17 18:31:53 -04:00
Keren Zhou
2d513dbf50 [FRONTEND] Fix addptr code generation (#2122)
`offset + ptr` and `ptr + offset` both work now
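A toy Python model of what commutative pointer addition looks like (the real change is in the Triton frontend's codegen; this class is purely illustrative):

```python
class Ptr:
    """Toy pointer supporting an integer offset on either side of `+`."""
    def __init__(self, addr):
        self.addr = addr
    def __add__(self, offset):        # ptr + offset
        return Ptr(self.addr + offset)
    __radd__ = __add__                # offset + ptr delegates to the same code

p = Ptr(0x1000)
assert (p + 8).addr == 0x1008
assert (8 + p).addr == 0x1008         # the reversed order works too
```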
2023-08-17 04:22:08 +00:00
Zahi Moudallal
557b2d4b34 [CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111) 2023-08-16 20:34:40 -07:00
darkbuck
a3df6068b4 [BACKEND] Minor fixes found when building triton with LLVM 17/main branches (#2089)
- These minor fixes are not specific to interface changes from the LLVM
main or official llvm-17 branches and can be applied on the triton main
branch.
- https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch
has extra changes to build against the LLVM main branch, enabling me to
work on other backends on the main branch only. That's a hobby effort
and just FYR.
2023-08-16 01:18:06 +00:00
Zahi Moudallal
0312ed3473 [CI] Update kernels names (#2093)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-14 19:41:41 -07:00
jsh-20
9055af1a5d Update test_user_defined_persistent_warp_specialized_gemm for num-CTA > 1 (#2101)
- Remove auto-tuning for
test_user_defined_persistent_warp_specialized_gemm.
- Remove unnecessary perf evaluation parts.
- Add test cases with num-CTA > 1 for
test_user_defined_persistent_warp_specialized_gemm.
2023-08-14 08:51:35 +00:00
Philippe Tillet
facc1dcbac [TESTS] better matmul unit testing (#2098) 2023-08-13 17:54:32 -07:00
Izzy Putterman
fc667d1f8f [FRONTEND] fix new absolute imports (#2072)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-13 14:23:36 +00:00
danny.jang
b878c8826f [DOCS] fix generating document failures (#2096) 2023-08-13 06:53:34 -07:00
danny.jang
cd114c0bd5 [DOCS] Fix typos (#2097) 2023-08-13 06:52:43 -07:00
Thomas
98372f46d3 [FRONTEND] Remove extra calls to _get_config causing runtime overhead (#2094) 2023-08-13 06:51:26 -07:00
Zahi Moudallal
a01c116f76 [FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090) 2023-08-11 17:49:52 -07:00
Keren Zhou
382e8fb1fa [RUNTIME] Make apis compatible with cuda 11 drivers (#2081)
https://github.com/openai/triton/issues/2042
2023-08-11 17:46:56 -07:00
Thomas
421ce18988 [FRONTEND] Refactor min/max to unify tl.maximum and tl.math.max (#2091)
`tl.maximum` used to generate a cmp/sel even for floating-point types.
Always using the max op gives better code quality and avoids behavior
that differs from `tl.math.max`
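One observable way a cmp/sel lowering can differ from a single max op is NaN handling: any comparison with NaN is false, so the select's result depends on argument order. A pure-Python sketch (none of this is Triton code):

```python
import math

def cmp_sel_max(x, y):
    # cmp/sel lowering: when x > y is False (including any comparison
    # involving NaN), the select falls through to y.
    return x if x > y else y

nan = float("nan")
assert cmp_sel_max(nan, 1.0) == 1.0        # NaN > 1.0 is False -> y = 1.0
assert math.isnan(cmp_sel_max(1.0, nan))   # 1.0 > NaN is False -> y = NaN
```

A dedicated max op treats its operands symmetrically, so unifying on it removes this order dependence as well as the extra compare instruction.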
2023-08-11 17:46:20 -07:00
Keren Zhou
5162871c6c [TUTORIAL] flash attention d128 improvement (#2074)
`ptxas` is able to automatically generate a call instruction to "call"
the loop body so that instructions are better scheduled.
2023-08-12 00:31:48 +00:00
Zahi Moudallal
b62b6d6a71 [FRONTEND] Remove cache key from metadata (#2082) 2023-08-11 08:58:00 -07:00
Zahi Moudallal
4d373aa103 [BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047) 2023-08-10 15:52:37 -07:00