fix more conflits
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attension.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
* [RemoveLayoutConversions] Fix reduce failed infer type error
This PR fixes layout propagation algorithm in RemoveLayoutConversions pass.
In some cases during rewriteSlice process, reduce operation with multiple outputs
rewrites only one output layout, which breaks assumption that both outputs should have same layout.
This change is a minimal part of https://github.com/openai/triton/pull/2331 change and
small lit test for regression testing.
* fix combine test
* Fix issue with incorrect inference layout of make_range output result
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.
Also fix the descritpor calcuation for wgmma, we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
Support having chain of mma with mixed size.
Serialize the different block calculation in backward attention to
workaround problem with ptxas and wgmma.
Improve patterns that sync broadcast to reduce the arithmetic density
and also hoist convert on top of expand_dims to do less work.
This address comments in https://github.com/openai/triton/pull/2274
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default which allows accumualting with
lower precision.
This only affect Hopper fp8 dot.
Handle more cases of hoisting convert above ext op. If there are
multiple ext op in the slice but only one requires inserting a convert
we can still apply the optimization.
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now bocken down into 2
main phases. First forward propagation of layout starting from ops that
we don't want to change. Propagate to all the nodes. If there is a
single layout needed for the op then we can rewrite the op, if there are
multiple layout required based on dependency we need a tie break.
The second phase is backward propgation that gets a backward slice of
operations starting from the convert and if all the operations in the
slice can be rematerialized rewrite the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future to add a cost model to
decide which convert to leave and which to fold
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
* [MFMA] Activated Fused Attention Forward Pass
Patch contains following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
inputs converted to block layouts, outputs returned to MFMA layout
* Use simple module walk instead of pattern rewritter.
* Remove pattern rewritter header.
* Enable basic reduce algorithm for MFMA layout
* Add TODO comment for fused attention backward pass
* Fix bug in fast codegen algorithm for reduce op
* Fix input type bug
* Increase block size to 128 since out of memory issue is not seen on MI210
* Fix block_size error
* Add mfma support in DecomposeDotOperand pattern.
https://reviews.llvm.org/D152814
The API of `RewriterBase::replaceOp` was extended and a fix is needed so
that the called function is not ambiguous.
This change is compatible with both the old and the new API.
Co-authored-by: Philippe Tillet <phil@openai.com>
`tl.cat(tensor<64>, tensor<64>) -> tensor(128)`, because it concatenates
elements into a single thread, if number of threads is 128, each thread
should own at least 2 elements.
With this PR, we also disable remat of the cat op in some cases.
https://github.com/openai/triton/issues/1328
Match the convert_layout operation in SimplifyReduceCvt
(convert_layout->reduce). This way we don't miss higher priority rewrite
patterns like RematerializeBackward and SimplifyConversion. We also need
to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit
= 3, and RematerializeForward's benefit = 2.