Commit Graph

602 Commits

Author SHA1 Message Date
jeromeku
37cd3d5339 [FIX][AOT Compiler] Fix No Specialization Edge Case (#2584)
The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).

Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.

Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
2023-11-02 10:02:41 -07:00
Vedant Roy
702cde0d6f [FRONTEND] Implement ternary operator for dynamic values (#2560) 2023-11-01 20:21:32 -04:00
Chenggang Zhao
e7fdfd76fb [FRONTEND] Add value restoration for autotuner (#2549)
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-31 21:37:44 -04:00
Zahi Moudallal
3650213218 [OPTIMIZER] Thread local reduction optimization (#2542)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-31 16:13:36 -07:00
Justin Lebar
258399c114 Enable ruff linter instead of flake8 (#2574)
[FRONTEND] Enable ruff linter instead of flake8.
    
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names!  One
of these function bodies was a duplicate, so I deleted it.  The other
function body was not a duplicate, so I gave it a new name.

AIUI all of these errors should have been picked up by flake8.  I'm
confused why it wasn't working.  Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
2023-10-31 21:28:24 +00:00
Zahi Moudallal
943330790a [FRONTEND] add do_not_specialize property back to JITFunction (#2573) 2023-10-31 12:02:45 -07:00
Chris Jones
2398b82f18 [FRONTEND][BACKEND] dd memory synchronization scope parameter to atomic ops. (#2562)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-30 19:18:27 -07:00
Keren Zhou
70fca00b67 [BACKEND] Fix device_print without arguments (#2566) 2023-10-30 20:04:44 -04:00
Keren Zhou
492886fcde [FRONTEND] Add reverse eq and ne (#2563) 2023-10-30 16:56:43 -04:00
Justin Lebar
12f906287f [FRONTEND] Refactor jit.py. (#2556)
[FRONTEND] Refactor jit.py.

The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.

The main changes here are:

 - Get rid of the eval'ed code in make_launcher.  We can do everything
   using bind().
 - Add KernelParam and KernelArg classes, letting us get rid of the
   parallel arrays/dicts indexed by parameter index.
 - Get rid of duplicated kernel launch code in the cache-hit/cache-miss
   branches.
2023-10-30 13:14:51 -07:00
Justin Lebar
f88b01f558 Apply ruff pre-commit to python/triton/runtime. (#2558)
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.

The motivation to switch away from autopep8 is that I can't get it to
wrap long lines, even with -aaa.  This seems to be a known problem,
https://github.com/hhatto/autopep8/issues/497.

See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
2023-10-30 11:06:44 -07:00
Someone
cde42e6221 [BUILD] make cuda tools vendoring optional (#2546) 2023-10-26 23:16:41 -07:00
runseny
4c816c2f59 [OPS] enable flash_attention_v2 TMA (#2544) 2023-10-25 23:31:17 -07:00
Justin Lebar
e70e11e834 [BACKEND] Improve printf. (#2532)
[BACKEND] Improve printf.

Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.

This caused a few problems.

 - nvptx printf can only handle 32 arguments; if you pass more than
   that, it prints garbage.  So if a thread had more than 32 values, you
   couldn't print them, issue #2486.

 - The order of the values within the Triton program (GPU thread block)
   is an implementation detail -- it depends on the layout the compiler
   assigns to a tensor.  So this also prevented you from interpreting
   the printed output.

To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value.  And each value gets its own line to avoid
passing too many args to printf.

Example output:

    ```
    pid (0, 1, 2) idx (36, 127) x: 42
    ```

If you want to observe all the values in a tensor in order, you can grep
and then sort the output.

We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.

Fixes #2486.
2023-10-25 08:47:55 +00:00
Sam Shleifer
12da43084b [TESTING] add diff column, option to return df in benchmark (#2469) 2023-10-24 05:17:00 +00:00
Adnan Akhundov
50add54334 [FRONTEND] Add input dtypes to autotuning key (#2534) 2023-10-24 03:29:30 +00:00
runseny
dc9e3063d7 [HOPPER] Move to tl.make_block_ptr in flash_attention backward scripts (#2395) 2023-10-20 11:06:48 +08:00
Justin Lebar
30186f401e Fix segfault in assertion test. (#2520)
<git-pr-chain>

#### Commits in this PR
1. Fix segfault in assertion test.
    
The issue here is that we were not checking the return values of the
CUDA API
calls we were making. We call one function and then use the data it
returns as
input to another call. Obviously this doesn't work if the first call
returns
    an error and doesn't actually return meaningful data.
    
I don't know why this was passing in CI, but it failed consistently for
me.

#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉 #2520 👈 **YOU ARE HERE**


</git-pr-chain>
2023-10-19 13:42:38 -07:00
Horace He
a4f373938c [RUNTIME] Filter out paths that don't exist in json group cache (#2511)
There's no guarantee that `/tmp/triton/*/*.json` existing means
that the corresponding `/tmp/triton/*/*.cubin` file also exists because the tmp directory doesn't guarantee file stability.
2023-10-18 16:44:34 -04:00
ian Bearman
768fc1fcd9 [FRONTEND] change hash to not require ptxas (#2476)
I noticed that Triton is using the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
this. Moving the version calculation to the back-end makes sense to me
from an architectural standpoint, so that's my approach here. I'm not as
confident in the implementation, so please if folks have any feedback
let me know.
2023-10-17 10:28:51 -07:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496)
Without this change, a constexpr assignment (ie. `A = B & C`, where `B`
and `C` are both constexpr) is getting assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an If
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
Keren Zhou
f81d9d876f [FRONTEND] Fix math for constant values (#2472)
https://github.com/openai/triton/issues/2470
2023-10-12 12:11:42 -07:00
Beal Wang
5812d970a8 [HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402)
This is just for Warp Specialization kernels on Hopper. Replace DivOp
and RemOp with SelectOp and AndOp/XorOp.
2023-10-09 14:42:06 +08:00
Philippe Tillet
424e67e727 [FRONTEND] improved while loop error messages (#2463) 2023-10-06 18:37:52 -07:00
Keren Zhou
a42d517021 [FRONTEND] Better warning on nested jit functions (#2453) 2023-10-06 14:22:51 -07:00
Justin Lebar
71a8544ce7 Improve docs for atomic and load/store operations. (#2437)
- Move atomic_cas and atomic_xchg to "atomic ops" section of
   documentation.
 - Don't talk about the `cmp` operand for operations which don't have
   it.
 - Document the `sem` operand.
 - :code:`foo` and ``foo`` don't work inside a :type: annotation,
   apparently.  (They are rendered literally, instead of being treated
   as a formatting command.)  Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
2023-10-04 04:17:42 +00:00
Philippe Tillet
a0025cfc44 [FRONTEND] add missing implicit constexpr conversion in dot (#2427) 2023-10-01 16:07:50 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Simon Boehm
b25edc139e [FRONTEND] fix out_path parsing in AOT compiler (#2409)
`out_path.with_suffix` (penultimate line) fails if out_path is string.
2023-09-27 22:15:17 -07:00
Philippe Tillet
7432fff4be [FRONTEND] add limited introspection capabilities in tl.extra.cuda ; rename arch into target (#2385) 2023-09-25 23:58:25 -07:00
Philippe Tillet
eea0718445 [TESTING] better cudagraph-based benchmarking (#2394) 2023-09-25 21:41:26 -07:00
ben-zhang-609
d040b58547 [HOPPER] fix ref check failure of flash attention with mma v3 (#2384) 2023-09-25 11:29:49 -07:00
edimetia3d
cb83b42ed6 [FRONTEND] using closure to create jit launcher (#2289)
Hi,

I'm adding some features to
`triton.runtime.jit.JITFunction_make_launcher` and found it is hard to
debug it:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into these inlined codes.

In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Add a utility method that generates a temporary file to create a
launcher when debugging kernel in main module~~
Using a closure to hold the launcher's body

Because this features might be good to others, I have initiated this
Pull Request.

~~Tests are yet to be added; if this submission might be accepted, I
will add it later.~~
Since this change is a refactor, no new test was added.
2023-09-22 17:01:54 -07:00
q.yao
413b18eb73 [FROJTEND] fix core.dtype.__repr__ (#2372)
`function_type` does not have a `name` field, which leads to an error
when debugging with gdb.
2023-09-22 08:34:20 -07:00
Zahi Moudallal
293b7fd592 [TESTING] cleanup (#2293)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-22 05:37:14 +00:00
Philippe Tillet
32c9d2bb8f [FRONTEND] improved error messages (#2363)
this is a combination of #1774 and #2006, which I cannot edit but fail
CI pre-commit hook
2023-09-21 15:05:57 -07:00
peterbell10
8094f46632 [FRONTEND][BACKEND] Fix various atomic_rmw bugs (#2355)
This fixes a few bugs I've encountered
- `atomic_add` with int64/uint64 `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
2023-09-21 03:31:20 +00:00
ben-zhang-609
bcaf14755a [HOPPER] enable flash attention with tma (#2336) 2023-09-20 14:06:56 -07:00
Shantanu
8e75e392ae [FRONTEND] Fix Python error handling in launch (#2334)
This was regressed by #2185 because we didn't realise CUDA_CHECK macro
could do Python calls (similar to what led to #2225). I think the
PyErr_Occurred got removed in that PR because there was missing error
handling before the call to _launch, so it looked like it was just in
the wrong place.

It looks like there are also potentially a couple places in cuda.c that
can return with error set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled etc, but those are all
pre-existing and not affected by recent changes.
2023-09-19 00:12:00 -07:00
Philippe Tillet
894fa9e943 [RUNTIME][INTERPRETER] now also override __str__ method for tensors (#2325) 2023-09-17 16:49:30 -07:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintain
2023-09-17 14:58:50 -07:00
Myeonghwan Ahn
2b066000aa [FRONTEND] fix matmul int8 overflow issue (#2297)
Previously on matmul, if inputs are int8, output was also int8.
This commit fixes the overflow problem with int32 output.
#2296
2023-09-17 16:41:02 +00:00
Stonepia
68e1bd162c [FRONTEND] fix xpu stages logic (#2305) 2023-09-17 09:19:14 -07:00
jon-chuang
4f2d995fad [FRONTEND] Explicitly forbid dot(.., out_dtype=bfloat16) (#2308)
Fixes: https://github.com/openai/triton/issues/2302
2023-09-17 09:15:06 +00:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default which allows accumualting with
lower precision.
This only affect Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Zahi Moudallal
db5c793f82 [FRONTEND] Add sass to asm dict with lazy evaluation (#2309) 2023-09-15 15:31:43 -07:00
Keren Zhou
08c1658957 [FRONTEND] Accommodate new triton IR format (#2294)
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.
2023-09-14 09:03:23 -07:00
Zahi Moudallal
36087a108f [FRONTEND] Added SASS to asm dict (#2280) 2023-09-13 21:21:01 +00:00
Khushi Agrawal
c61d772eee [DOCS] add missing docs (#2154) 2023-09-13 19:30:40 +00:00