The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).
Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.
Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces one calculating
a modulo schedule and create async ops based on the IR and an expander
that will generate the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedule without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for matmul case matches rougly the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
[FRONTEND] Enable ruff linter instead of flake8.
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names! One
of these function bodies was a duplicate, so I deleted it. The other
function body was not a duplicate, so I gave it a new name.
AIUI all of these errors should have been picked up by flake8. I'm
confused why it wasn't working. Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(means line breaks can now occur before a binary operator).
The change surprisingly didn't take any effect nor required any style
changes in `triton` main `pre-commit` stage. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see `Check pre-commit` stage) picked this up correctly and complained
about formatting issues. I'm not entirely sure what could be the cause
for such difference, but if we also disable `W503` in `pyproject.toml`
then the rule is picked up correctly.
[FRONTEND] Refactor jit.py.
The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.
The main changes here are:
- Get rid of the eval'ed code in make_launcher. We can do everything
using bind().
- Add KernelParam and KernelArg classes, letting us get rid of the
parallel arrays/dicts indexed by parameter index.
- Get rid of duplicated kernel launch code in the cache-hit/cache-miss
branches.
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.
The motivation to switch away from autopep8 is that I can't get it to
wrap long lines, even with -aaa. This seems to be a known problem,
https://github.com/hhatto/autopep8/issues/497.
See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
It seems that by default, flake8 warns on both "linebreak occurred
before
binary operator" (W503) and "linebreak occurred *after* binary operator"
(W504). You...kind of have to pick one of these. :)
According to the docs, W503 is deprecated, so we disable that one.
https://www.flake8rules.com/rules/W503.html
Adding `Introduction to Triton-Shared.pptx` as presented at the Oct. 25
Triton community meeting.
---------
Co-authored-by: Phil Tillet <phil@openai.com>
### Summary
When Triton GPU IR is lowered into LLVM IR, we can make use of the
constancy information about the result of the elementwise ops to
deduplicate otherwise redundant computation. That is the contribution of
this PR: the constancy is checked and, if possible, some of the values
in LLVM IR are reused multiple times instead of computing equal values
separately.
The change is beneficial for the PyTorch 2 / TorchInductor-generated
Triton code, as the leftmost sub-indices extracted from the flat index
by div / mod operations can be equal, given sufficiently large 2^n
factor in the rightmost rightmost dimension(s). This makes the
computation resulting in those sub-indices redundant. Consequently,
under the necessary constancy conditions, the redundant indexing
arithmetics can be deduplicated. We observe up to 29% decrease in the
latency of some of our jagged tensor kernels
[BACKEND] Improve printf.
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.
This caused a few problems.
- nvptx printf can only handle 32 arguments; if you pass more than
that, it prints garbage. So if a thread had more than 32 values, you
couldn't print them, issue #2486.
- The order of the values within the Triton program (GPU thread block)
is an implementation detail -- it depends on the layout the compiler
assigns to a tensor. So this also prevented you from interpreting
the printed output.
To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value. And each value gets its own line to avoid
passing too many args to printf.
Example output:
```
pid (0, 1, 2) idx (36, 127) x: 42
```
If you want to observe all the values in a tensor in order, you can grep
and then sort the output.
We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.
Fixes#2486.
Note that asan doesn't work with programs that use the GPU, so this is
only useful for running tools like triton-opt.
I was not able to get msan working. libstdc++'s std::string
implementation seems to use uninitialized memory in a way that seems
safe but triggers an msan error. I tried and gave up on switching to
libc++ and teaching msan to ignore this error.
<git-pr-chain>
#### Commits in this PR
1. Fix segfault in assertion test.
The issue here is that we were not checking the return values of the
CUDA API
calls we were making. We call one function and then use the data it
returns as
input to another call. Obviously this doesn't work if the first call
returns
an error and doesn't actually return meaningful data.
I don't know why this was passing in CI, but it failed consistently for
me.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2520👈 **YOU ARE HERE**
</git-pr-chain>
<git-pr-chain>
#### Commits in this PR
1. Make kernel_static_print test work when called twice.
This test is checking that a message is printed when the kernel is
compiled.
But the test had nothing to force the kernel to be compiled every time
you ran
the test. So after you ran it once, the test would fail every time until
you
cleared the cache.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2518👈 **YOU ARE HERE**
1. #2520
</git-pr-chain>