AMD Perf Kernels
This directory contains customized/tuned/experimental kernels on AMD MI series GPUs.
06-fused-attention-transV.py
This script is a copy of tutorials/06-fused-attention.py with the following
two changes:
- Tensor V is transposed so that the seqlen/N_CTX dimension becomes the
fastest changing (a.k.a. leading or least strided) dimension. This script
produces better performance than tutorials/06-fused-attention.py since it has
better LDS access efficiency for tensor V (see the layout sketch after this
list). Note that in the future, we'll improve the LDS access efficiency for
non-transposed tensor V, i.e. when the head dimension is the fastest changing
dimension.
- Only the fwd kernel is benchmarked.
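As a rough illustration of the layout change, here is a minimal PyTorch sketch. This is not code from the script; the shape convention (Z, H, N_CTX, D_HEAD) and the sizes are assumptions based on the standard fused-attention tutorial.

```python
import torch

# Hypothetical sizes; shape convention (Z, H, N_CTX, D_HEAD) follows the
# standard fused-attention tutorial.
Z, H, N_CTX, D_HEAD = 4, 48, 1024, 64
v = torch.randn(Z, H, N_CTX, D_HEAD, device='cuda', dtype=torch.float16)

# Default layout: D_HEAD is the fastest changing dimension (stride 1).
print(v.stride())    # (H * N_CTX * D_HEAD, N_CTX * D_HEAD, D_HEAD, 1)

# Transposed layout: materialize V so that N_CTX has stride 1, then view it
# back in the original (Z, H, N_CTX, D_HEAD) order. The kernel reads the same
# logical tensor, but seqlen is now the least strided dimension.
v_t = v.transpose(2, 3).contiguous().transpose(2, 3)
print(v_t.stride())  # (H * N_CTX * D_HEAD, N_CTX * D_HEAD, 1, N_CTX)
```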
06-fused-attention-fwd-transV.py
This script is used to produce the best performance for the fwd kernel.
It is a copy of 06-fused-attention-transV.py with the following
changes:
- All bwd kernels are removed.
- Storing m at the end of the fwd kernel is removed.
- Autotuner is removed. All parameters for D=64 and D=128 are pre-tuned on
MI250X and hard coded (a sketch of this pattern follows the list).
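Hard coding the configuration typically means replacing the @triton.autotune decorator with a small lookup in the launcher. Below is a minimal sketch of that pattern; the kernel is a placeholder, and the block sizes and dictionary are illustrative assumptions, not the actual MI250X-tuned values.

```python
import torch
import triton
import triton.language as tl

# Hypothetical pre-tuned launch parameters, keyed by head dimension D.
# These numbers are placeholders, not the real tuned values.
PRE_TUNED = {
    64:  dict(BLOCK_M=128, BLOCK_N=64,  num_warps=4, num_stages=1),
    128: dict(BLOCK_M=128, BLOCK_N=128, num_warps=8, num_stages=1),
}

@triton.jit
def _fwd_stub(X, Y, n_elements, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Placeholder body standing in for the real fused-attention fwd kernel.
    pid = tl.program_id(0)
    offs = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    mask = offs < n_elements
    tl.store(Y + offs, tl.load(X + offs, mask=mask), mask=mask)

def launch_fwd(x, y, d_head):
    # No autotuner: pick the pre-tuned config for this head dimension.
    cfg = PRE_TUNED[d_head]  # KeyError for untuned head dims is deliberate
    grid = (triton.cdiv(x.numel(), cfg['BLOCK_M']),)
    _fwd_stub[grid](x, y, x.numel(),
                    BLOCK_M=cfg['BLOCK_M'], BLOCK_N=cfg['BLOCK_N'],
                    num_warps=cfg['num_warps'], num_stages=cfg['num_stages'])
```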
Note that this script is also used to benchmark FA performance with 2 GCDs. Check the 2GCD benchmark script for more details.
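For reference, one common way to drive both GCDs of an MI250X is to pin one benchmark process per GCD via HIP_VISIBLE_DEVICES. This is only a hedged sketch of that pattern, not the actual 2GCD benchmark script:

```python
import os
import subprocess

# Launch one benchmark process per GCD; each GCD of an MI250X shows up as a
# separate device, and HIP_VISIBLE_DEVICES pins a process to one of them.
procs = []
for gcd in (0, 1):
    env = dict(os.environ, HIP_VISIBLE_DEVICES=str(gcd))
    procs.append(subprocess.Popen(
        ['python', '06-fused-attention-fwd-transV.py'], env=env))

# Wait for both runs; per-GCD results can be aggregated afterwards.
for p in procs:
    p.wait()
```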