ROCm/python at 6ac9d51ff00803281253c7299c48fe0bb11dde33 - ROCm

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Thomas Raoux 6ac9d51ff0 [OPTIMIZATION] Enable pipelining for bwd flash attention (#2590)

This allows pipelining when a load is used by multiple dots in a loop.
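
To make the loop shape concrete, here is a minimal plain-Python sketch of a
loop in which one loaded tile feeds two dots, first written serially and then
with the next load issued ahead of the current dots. It only illustrates the
idea of software pipelining; it is not the compiler pass from this commit and
not GPU code. All names (`dot`, `serial_loop`, `pipelined_loop`, `tiles`, `p`,
`q`) are hypothetical and chosen for illustration.

```python
def dot(a, b):
    """Plain-Python matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def accumulate(acc, d):
    """Add d into acc elementwise; acc is None before the first iteration."""
    if acc is None:
        return d
    return [[x + y for x, y in zip(ra, rd)] for ra, rd in zip(acc, d)]


def serial_loop(tiles, p, q):
    """Each iteration loads one tile, then uses it in two dots."""
    acc1 = acc2 = None
    for i in range(len(tiles)):
        k = tiles[i]                      # the single load
        acc1 = accumulate(acc1, dot(p, k))  # first dot using the load
        acc2 = accumulate(acc2, dot(q, k))  # second dot using the same load
    return acc1, acc2


def pipelined_loop(tiles, p, q):
    """Same computation, with the load for iteration i+1 issued before
    the dots of iteration i (a one-stage software pipeline)."""
    acc1 = acc2 = None
    if not tiles:
        return acc1, acc2
    current = tiles[0]                    # prologue: first load
    for i in range(len(tiles)):
        # Issue the next load early so it could overlap with the dots below.
        nxt = tiles[i + 1] if i + 1 < len(tiles) else None
        acc1 = accumulate(acc1, dot(p, current))
        acc2 = accumulate(acc2, dot(q, current))
        current = nxt
    return acc1, acc2
```

Both functions compute the same accumulators; the pipelined form only changes
when each load is issued relative to the dots that consume it.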

Relax the condition to pipeline dot operands for the mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However, this
exposes a performance problem due to the wmma pipelining, as ptxas will
now fall back to serial wgmma. A follow-up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF.

2023-11-03 11:46:51 -07:00