mirror of
https://github.com/ROCm/ROCm.git
synced 2026-02-21 03:00:39 -05:00
This allows pipelining when a load is used by multiple dots in a loop. Relax the condition for pipelining dot operands in the mma v3 case. This improves performance of the bwd pass from 260TF to 275TF. However, it exposes a performance problem due to the wgmma pipelining, as ptxas will now fall back to serial wgmma. A follow-up PR will fix a bug in how we emit wgmma_wait during pipelining and will bring performance to 335TF.
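
A toy sketch of the kind of condition being relaxed, not the compiler's actual code. The `Op` class and both predicate functions are hypothetical names invented for illustration, and the exact rules are an assumption: the stricter rule accepts a load only when it has a single dot user, while the relaxed rule accepts a load whose users are all dots.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical stand-in for an IR operation inside a loop body."""
    kind: str                      # e.g. "load", "dot", "add"
    name: str
    users: list = field(default_factory=list)


def can_pipeline_strict(load: Op) -> bool:
    # Assumed stricter rule: exactly one user, and that user is a dot.
    return len(load.users) == 1 and load.users[0].kind == "dot"


def can_pipeline_relaxed(load: Op) -> bool:
    # Assumed relaxed rule: at least one user, and every user is a dot.
    return bool(load.users) and all(u.kind == "dot" for u in load.users)


# One load feeding two dots in the same loop, as in a backward pass.
dot_a = Op("dot", "dot_a")
dot_b = Op("dot", "dot_b")
shared_load = Op("load", "shared_load", users=[dot_a, dot_b])

# A load that also feeds a non-dot op stays unpipelined under both rules.
mixed_load = Op("load", "mixed_load", users=[dot_a, Op("add", "add_c")])
```

Under these assumed rules, `shared_load` is rejected by the strict check and accepted by the relaxed one, which is the situation the first sentence of the message describes.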