* [Tutorial] Fix post IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
* [Tutorial] Fix post IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
fix more conflits
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attension.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H in the key of autotuner
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Patch based on @donproc findings and suggested optimization.
Emitting multiple wait op may confuse ptxas and cause it to fallback to
a conservative mode.
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
This allow pipelining when a load is used by multiple dot in a loop.
Relax the condition to pipeline dot operands for mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However this
expose a performance problem due to the wmma pipelining as ptxas will
now fall back to serial wgmma. A follow up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF
I've add an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!
To make this PR, I ran my modified yapf over the repository, then looked
over the full diff. Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want. I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.
Overall, trailing `#` was sufficient to get formatting similar to our
current code. I didn't have to disable yapf anywhere.
---------
Co-authored-by: Phil Tillet <phil@openai.com>
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
This was a workaround for a bug exposed in test_core when generating
ext_load in NVPTX. The backend bug seems fixed in latest LLVM upgrade so
removing the workaround.
The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).
Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.
Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces one calculating
a modulo schedule and create async ops based on the IR and an expander
that will generate the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedule without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for matmul case matches rougly the
schedule picked by the previous pipeliner in order to avoid changes.
This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
* [RemoveLayoutConversions] Fix reduce failed infer type error
This PR fixes layout propagation algorithm in RemoveLayoutConversions pass.
In some cases during rewriteSlice process, reduce operation with multiple outputs
rewrites only one output layout, which breaks assumption that both outputs should have same layout.
This change is a minimal part of https://github.com/openai/triton/pull/2331 change and
small lit test for regression testing.
* fix combine test
* Fix issue with incorrect inference layout of make_range output result
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
[FRONTEND] Enable ruff linter instead of flake8.
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names! One
of these function bodies was a duplicate, so I deleted it. The other
function body was not a duplicate, so I gave it a new name.
AIUI all of these errors should have been picked up by flake8. I'm
confused why it wasn't working. Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(means line breaks can now occur before a binary operator).
The change surprisingly didn't take any effect nor required any style
changes in `triton` main `pre-commit` stage. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see `Check pre-commit` stage) picked this up correctly and complained
about formatting issues. I'm not entirely sure what could be the cause
for such difference, but if we also disable `W503` in `pyproject.toml`
then the rule is picked up correctly.