Tensor V is transposed in the way that seqlen/N_CTX dimension becomes the fastest changing (a.k.a. leading or least strided) dimension. This script produces better performance than tutorials/06-fused-attention.py since it has better LDS access efficiency for tensor V. Note that in the future, we'll improve the LDS access efficiency for non-transposed tensor V, i.e. head dimension is the fastest changing dimension.
Only fwd kernel is benchmarked.

`06-fused-attention-fwd-transV.py`

This script is used to produce the best performance for fwd kernel. It is a copy of 06-fused-attention-transV.py with the following changes:

All bwd kernels are removed.
Storing m at the end of the fwd kernel is removed.
Autotuner is removed. All parameters for D=64 ad D=128 are pre-tuned on MI250X and hard coded.

Note that this script is also used to benchmark FA performance with 2 GCDs. Check the 2GCD benchmark script for more details.