AMD Perf Kernels
This directory contains customized/tuned/experimental kernels on AMD MI series GPUs.
06-fused-attention-transV.py
This script is a copy of tutorials/06-fused-attention.py with the following
two changes:
- Tensor V is transposed so that the seqlen/N_CTX dimension becomes the
fastest changing (a.k.a. leading or least strided) dimension. This script
produces better performance than tutorials/06-fused-attention.py since it has
better LDS access efficiency for tensor V (see the layout sketch after this
list). Note that in the future, we'll improve the LDS access efficiency for
non-transposed tensor V, i.e. when the head dimension is the fastest changing
dimension.
- Only the fwd kernel is benchmarked.
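As a rough illustration of the layout change, here is a minimal PyTorch sketch. This is not code from the script; the shape convention (Z, H, N_CTX, D_HEAD) and the sizes are assumptions based on the standard fused-attention tutorial.

```python
import torch

# Hypothetical sizes; shape convention (Z, H, N_CTX, D_HEAD) follows the
# standard fused-attention tutorial.
Z, H, N_CTX, D_HEAD = 4, 48, 1024, 64
v = torch.randn(Z, H, N_CTX, D_HEAD, device='cuda', dtype=torch.float16)

# Default layout: D_HEAD is the fastest changing dimension (stride 1).
print(v.stride())    # (H * N_CTX * D_HEAD, N_CTX * D_HEAD, D_HEAD, 1)

# Transposed layout: materialize V so that N_CTX has stride 1, then view it
# back in the original (Z, H, N_CTX, D_HEAD) order. The kernel reads the same
# logical tensor, but seqlen is now the least strided dimension.
v_t = v.transpose(2, 3).contiguous().transpose(2, 3)
print(v_t.stride())  # (H * N_CTX * D_HEAD, N_CTX * D_HEAD, 1, N_CTX)
```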
06-fused-attention-fwd-transV.py
This script is used to produce the best performance for the fwd kernel.
It is a copy of 06-fused-attention-transV.py with the following
changes:
- All bwd kernels are removed.
- Storing m at the end of the fwd kernel is removed.
- Autotuner is removed. All parameters for D=64 and D=128 are pre-tuned on
MI250X and hard coded (a sketch of this pattern follows the list).
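Hard coding the configuration typically means replacing the @triton.autotune decorator with a small lookup in the launcher. Below is a minimal sketch of that pattern; the kernel is a placeholder, and the block sizes and dictionary are illustrative assumptions, not the actual MI250X-tuned values.

```python
import torch
import triton
import triton.language as tl

# Hypothetical pre-tuned launch parameters, keyed by head dimension D.
# These numbers are placeholders, not the real tuned values.
PRE_TUNED = {
    64:  dict(BLOCK_M=128, BLOCK_N=64,  num_warps=4, num_stages=1),
    128: dict(BLOCK_M=128, BLOCK_N=128, num_warps=8, num_stages=1),
}

@triton.jit
def _fwd_stub(X, Y, n_elements, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Placeholder body standing in for the real fused-attention fwd kernel.
    pid = tl.program_id(0)
    offs = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    mask = offs < n_elements
    tl.store(Y + offs, tl.load(X + offs, mask=mask), mask=mask)

def launch_fwd(x, y, d_head):
    # No autotuner: pick the pre-tuned config for this head dimension.
    cfg = PRE_TUNED[d_head]  # KeyError for untuned head dims is deliberate
    grid = (triton.cdiv(x.numel(), cfg['BLOCK_M']),)
    _fwd_stub[grid](x, y, x.numel(),
                    BLOCK_M=cfg['BLOCK_M'], BLOCK_N=cfg['BLOCK_N'],
                    num_warps=cfg['num_warps'], num_stages=cfg['num_stages'])
```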
Note that this script is also used to benchmark FA performance with 2 GCDs. Check the 2GCD benchmark script for more details.
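For reference, one common way to drive both GCDs of an MI250X is to pin one benchmark process per GCD via HIP_VISIBLE_DEVICES. This is only a hedged sketch of that pattern, not the actual 2GCD benchmark script:

```python
import os
import subprocess

# Launch one benchmark process per GCD; each GCD of an MI250X shows up as a
# separate device, and HIP_VISIBLE_DEVICES pins a process to one of them.
procs = []
for gcd in (0, 1):
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(gcd))
    procs.append(subprocess.Popen(
        ['python', '06-fused-attention-fwd-transV.py'], env=env))

# Wait for both runs; per-GCD results can be aggregated afterwards.
for p in procs:
    p.wait()
```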