fix more conflits
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attension.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
I've add an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!
To make this PR, I ran my modified yapf over the repository, then looked
over the full diff. Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want. I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.
Overall, trailing `#` was sufficient to get formatting similar to our
current code. I didn't have to disable yapf anywhere.
---------
Co-authored-by: Phil Tillet <phil@openai.com>
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).
Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.
Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces one calculating
a modulo schedule and create async ops based on the IR and an expander
that will generate the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedule without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for matmul case matches rougly the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
[FRONTEND] Enable ruff linter instead of flake8.
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names! One
of these function bodies was a duplicate, so I deleted it. The other
function body was not a duplicate, so I gave it a new name.
AIUI all of these errors should have been picked up by flake8. I'm
confused why it wasn't working. Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(means line breaks can now occur before a binary operator).
The change surprisingly didn't take any effect nor required any style
changes in `triton` main `pre-commit` stage. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see `Check pre-commit` stage) picked this up correctly and complained
about formatting issues. I'm not entirely sure what could be the cause
for such difference, but if we also disable `W503` in `pyproject.toml`
then the rule is picked up correctly.
This is a combination of 4 commits.
Works as StandAlone and Backend
Works as StandAlone and Backend
This is a combination of 13 commits.
Works StandAlone and as Backend
This is a combination of 7 commits.
backend set default dir with flag
move bitcode to backend dir
copy backend
save
empty test work in backendmode
enable backend mode when copying to upstream
clean up
fix failure
minimize diff
add skip function
fix bug with corrupted dwarf exp
match num_wraps
fix multi threaded test issue
move bitcode file out of lib
move backend to python/triton/third_party/hip
move libhsa
backend works again
restart ci
clean upstream location first before copy
match scripts
fix new error
memoize backend stuff
fix bug
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
* [MFMA] FP8 and BF8 support
This PR adds support of fp8 and bf8 in AccelerateMatmul pass and
Introduces generation of float8 mfma instructions in ttg to llvm conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comments fixes
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
[BACKEND] Improve printf.
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.
This caused a few problems.
- nvptx printf can only handle 32 arguments; if you pass more than
that, it prints garbage. So if a thread had more than 32 values, you
couldn't print them, issue #2486.
- The order of the values within the Triton program (GPU thread block)
is an implementation detail -- it depends on the layout the compiler
assigns to a tensor. So this also prevented you from interpreting
the printed output.
To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value. And each value gets its own line to avoid
passing too many args to printf.
Example output:
```
pid (0, 1, 2) idx (36, 127) x: 42
```
If you want to observe all the values in a tensor in order, you can grep
and then sort the output.
We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.
Fixes#2486.
<git-pr-chain>
#### Commits in this PR
1. Fix segfault in assertion test.
The issue here is that we were not checking the return values of the
CUDA API
calls we were making. We call one function and then use the data it
returns as
input to another call. Obviously this doesn't work if the first call
returns
an error and doesn't actually return meaningful data.
I don't know why this was passing in CI, but it failed consistently for
me.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2520👈 **YOU ARE HERE**
</git-pr-chain>
<git-pr-chain>
#### Commits in this PR
1. Make kernel_static_print test work when called twice.
This test is checking that a message is printed when the kernel is
compiled.
But the test had nothing to force the kernel to be compiled every time
you ran
the test. So after you ran it once, the test would fail every time until
you
cleared the cache.
#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉#2518👈 **YOU ARE HERE**
1. #2520
</git-pr-chain>
I noticed that Triton is using the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
this. Moving the version calculation to the back-end makes sense to me
from an architectural standpoint, so that's my approach here. I'm not as
confident in the implementation, so please if folks have any feedback
let me know.
Without this change, a constexpr assignment (ie. `A = B & C`, where `B`
and `C` are both constexpr) is getting assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an If
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.