[MFMA] Introduce dot operand loading fast path (#269)

* [MFMA] Introduce dot operand loading fast path

  This PR introduces a fast path for code generation of MFMA dot operand loading from LDS. The fast path is used when the operand is not swizzled and is not a slice of a larger LDS object (i.e. it is not a slice of a tensor). This is the case for the current FA and GEMM kernels compiled with num_stages=1, i.e. with software pipelining disabled.

* cleanup swizzle info
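
The fast-path condition described in the commit message can be sketched as a small predicate. This is only an illustration under stated assumptions: `SharedOperand`, `max_phase`, `is_subview`, and `can_use_fast_path` are hypothetical names, not identifiers from this commit, and treating `max_phase == 1` as "not swizzled" is an assumption about how the shared layout encodes swizzling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedOperand:
    """Hypothetical description of a dot operand stored in LDS."""
    max_phase: int    # assumption: 1 means the layout applies no swizzling
    is_subview: bool  # True if the operand is a slice of a larger LDS object


def can_use_fast_path(op: SharedOperand) -> bool:
    """Return True when the operand is neither swizzled nor a slice."""
    not_swizzled = op.max_phase == 1
    return not_swizzled and not op.is_subview


# Whole, unswizzled operand: eligible for the fast path.
whole = SharedOperand(max_phase=1, is_subview=False)
# Slice of a larger LDS object: must use the general path.
sliced = SharedOperand(max_phase=1, is_subview=True)
# Swizzled operand: must use the general path.
swizzled = SharedOperand(max_phase=4, is_subview=False)

for name, op in [("whole", whole), ("sliced", sliced), ("swizzled", swizzled)]:
    print(name, can_use_fast_path(op))
```

Both conditions must hold at once; an operand that fails either one would go through the general loading path instead.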
2026-04-05 03:01:17 -04:00 · 2023-07-27 20:46:50 +02:00
parent 2fbffe2784
commit 0073bb98f4
3 changed files with 390 additions and 77 deletions
--- a/python/triton/compiler/compiler.py
+++ b/python/triton/compiler/compiler.py
@@ -85,9 +85,7 @@ def optimize_ttgir(mod, num_stages, arch):
        pm.add_tritongpu_accelerate_matmul_pass(80)
    pm.add_tritongpu_remove_layout_conversions_pass()
    pm.add_tritongpu_optimize_dot_operands_pass()
-    # TODO enable this pass for AMD GPU when it is ready
-    if not is_hip():
-        pm.add_tritongpu_pipeline_pass(num_stages)
+    pm.add_tritongpu_pipeline_pass(num_stages)
    pm.add_tritongpu_prefetch_pass()
    pm.add_tritongpu_optimize_dot_operands_pass()
    pm.add_tritongpu_remove_layout_conversions_pass()