This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(meaning line breaks may now occur before a binary operator).
Surprisingly, the change had no effect and required no style changes in
the `pre-commit` stage of `triton` main. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see the `Check pre-commit` stage) did pick it up and complained
about formatting issues. I'm not entirely sure what causes the
difference, but if we also disable `W503` in `pyproject.toml`,
the rule is picked up correctly.
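For reference, a hedged sketch of what that `pyproject.toml` entry could look
like, assuming `autopep8` is the tool that reads it; the exact table and key
depend on which linter actually consumes the file:

```toml
# Hypothetical snippet; adjust the table name to whichever tool reads pyproject.toml.
[tool.autopep8]
ignore = "W503"  # allow line breaks before binary operators
```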
[FRONTEND] Refactor jit.py.
The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.
The main changes here are:
- Get rid of the eval'ed code in make_launcher. We can do everything
  using bind() (see the sketch after this list).
- Add KernelParam and KernelArg classes, letting us get rid of the
parallel arrays/dicts indexed by parameter index.
- Get rid of duplicated kernel launch code in the cache-hit/cache-miss
branches.
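Here is a minimal sketch of the bind()-based approach; the class and helper
names below are hypothetical placeholders, not necessarily the ones in jit.py:

```python
import inspect

class KernelParam:
    """Hypothetical per-parameter record replacing the parallel arrays/dicts."""
    def __init__(self, num, param):
        self.num = num
        self.name = param.name
        self.is_constexpr = "constexpr" in str(param.annotation)

class KernelArg:
    """Hypothetical pairing of a runtime value with its KernelParam."""
    def __init__(self, value, param):
        self.value = value
        self.param = param

def bind_args(fn, *args, **kwargs):
    # inspect.Signature.bind() handles positional/keyword matching and
    # defaults, so no launcher source has to be generated and eval'ed.
    sig = inspect.signature(fn)
    bound = sig.bind(*args, **kwargs)
    bound.apply_defaults()
    params = [KernelParam(i, p) for i, p in enumerate(sig.parameters.values())]
    return [KernelArg(v, p) for v, p in zip(bound.arguments.values(), params)]
```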
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.
The motivation for switching away from autopep8 is that I can't get it
to wrap long lines, even with `-aaa`. This seems to be a known problem:
https://github.com/hhatto/autopep8/issues/497.
See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
[BACKEND] Improve printf.
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.
This caused a few problems.
- nvptx printf can only handle 32 arguments; if you pass more than
  that, it prints garbage. So if a thread had more than 32 values, you
  couldn't print them (issue #2486).
- The order of the values within the Triton program (GPU thread block)
is an implementation detail -- it depends on the layout the compiler
assigns to a tensor. So this also prevented you from interpreting
the printed output.
To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value. And each value gets its own line to avoid
passing too many args to printf.
Example output:
```
pid (0, 1, 2) idx (36, 127) x: 42
```
If you want to observe all the values in a tensor in order, you can grep
and then sort the output.
We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.
Fixes #2486.
Note that asan doesn't work with programs that use the GPU, so this is
only useful for running tools like triton-opt.
I was not able to get msan working. libstdc++'s std::string
implementation appears to use uninitialized memory in a way that looks
safe but triggers an msan error. I tried and gave up on switching to
libc++ and on teaching msan to ignore this error.
<git-pr-chain>
#### Commits in this PR
1. Fix segfault in assertion test.
The issue here is that we were not checking the return values of the CUDA API
calls we were making. We call one function and then use the data it returns as
input to another call. Obviously this doesn't work if the first call returns
an error and doesn't actually return meaningful data.

I don't know why this was passing in CI, but it failed consistently for me.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2520👈 **YOU ARE HERE**
</git-pr-chain>
<git-pr-chain>
#### Commits in this PR
1. Make kernel_static_print test work when called twice.
This test is checking that a message is printed when the kernel is compiled.
But the test had nothing to force the kernel to be compiled every time you ran
the test. So after you ran it once, the test would fail every time until you
cleared the cache.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2518👈 **YOU ARE HERE**
1. #2520
</git-pr-chain>
There's no guarantee that `/tmp/triton/*/*.json` existing means that the
corresponding `/tmp/triton/*/*.cubin` file also exists, because the tmp
directory doesn't guarantee file stability.
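A hedged sketch of the check this implies (the path stem and helper name are
hypothetical, not the actual cache-manager code):

```python
import os

def cache_entry_usable(stem: str) -> bool:
    # Trust a cached entry only if both the metadata and the compiled binary
    # survived /tmp cleanup; either file may have been removed independently.
    return os.path.exists(stem + ".json") and os.path.exists(stem + ".cubin")

# e.g. cache_entry_usable("/tmp/triton/<hash>/kernel")
```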
I noticed that Triton is using the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt to fix that.
Moving the version calculation to the back-end makes sense to me
from an architectural standpoint, so that's my approach here. I'm not as
confident in the implementation, so please let me know if folks have any
feedback.
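As a rough illustration of the architectural idea only; the class and method
names below are hypothetical and not the actual backend interface:

```python
TRITON_VERSION = "2.1.0"  # placeholder

def ptxas_version() -> str:
    return "12.3"  # placeholder; a real backend would query the ptxas binary

class CUDABackend:
    def version_key(self) -> str:
        # Only the CUDA backend folds the ptxas version into its cache key.
        return f"triton-{TRITON_VERSION}-ptxas-{ptxas_version()}"

class OtherBackend:
    def version_key(self) -> str:
        # A non-CUDA target hashes only the toolchain it actually uses.
        return f"triton-{TRITON_VERSION}"
```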
Without this change, the result of a constexpr assignment (i.e. `A = B & C`,
where `B` and `C` are both constexpr) gets materialized as a triton tensor,
which becomes an issue when `A` is used as the condition of an `if`
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.
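A sketch of the kind of kernel this affects (kernel and argument names are
illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def kernel(X, B: tl.constexpr, C: tl.constexpr):
    A = B & C               # with this change, A stays a constexpr...
    if A:                   # ...so it can drive a compile-time `if`
        x = 0               # a bare literal is still materialized as a tensor,
        x = x.to(tl.int64)  # so tensor methods like .to() keep working
        tl.store(X, x)      # X is assumed to point to int64 storage
```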
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
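A hedged usage sketch; I'm assuming the option is exposed as a per-launch
`enable_fp_fusion` flag, so the exact name and plumbing should be taken from
the commit itself:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fma_kernel(X, Y, Z, Out, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x, y, z = tl.load(X + offs), tl.load(Y + offs), tl.load(Z + offs)
    tl.store(Out + offs, x * y + z)  # mul+add, a candidate for fma contraction

x, y, z = (torch.randn(1024, device="cuda") for _ in range(3))
out = torch.empty_like(x)
# Hypothetical spelling of the new option; it should disable FP contraction
# both in the LLVM-to-PTX lowering and in ptxas.
fma_kernel[(1,)](x, y, z, out, BLOCK=1024, enable_fp_fusion=False)
```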
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
in the two-tt.dot fusion scenario. But this is not optimal for cases where
a tt.dot doesn't have enough parallelism along the row dimension yet does
along the column dimension.
- Move atomic_cas and atomic_xchg to "atomic ops" section of
documentation.
- Don't talk about the `cmp` operand for operations which don't have
it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation,
apparently. (They are rendered literally, instead of being treated
as a formatting command.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
Clipping float8e4b15 to ±1.875 is a bad idea because those values are
represented as 0x7f and 0xff, which are ±NaN on H100 for float8e4nv.
We lose two values, but this makes compatibility with float8e4nv much
less painful (it will just be a matter of adjusting the bias).
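A hedged decoding of the bit patterns involved (my own reading, assuming a
1-bit sign / 4-bit exponent / 3-bit mantissa layout; not text from the commit):

```python
# 0x7f = 0b0_1111_111 (sign 0, exponent 0b1111, mantissa 0b111)
# float8e4b15 (exponent bias 15): (1 + 7/8) * 2**(15 - 15) = 1.875, the old clip bound
# float8e4nv  (e4m3,  bias  7):   the all-ones pattern is the NaN encoding on H100
print((1 + 7 / 8) * 2 ** (15 - 15))  # 1.875
```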
Support chains of mma with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.