This adds new tests across the board for float32, bfloat16, non-power-of-2 shapes (to exercise the masking paths), and for sequence parallelism with atomics. It also ports the sequence-parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.

I am not sure about the best way to grab baseline benchmarking numbers. I have access to V100s and A100s, but the existing tests note: "# A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". The current plan is to run CI here and use those numbers as the baseline, then match against my GPUs as a sanity check.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
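For context on why non-power-of-2 shapes exercise the masks: Triton blocks operate on power-of-2 extents, so an odd sequence length gets padded up to the next power of 2 and the tail lanes are masked off. A minimal sketch of that padding/masking logic (the helper names `next_power_of_2` and `tail_mask` are illustrative, not from the codebase):

```python
import numpy as np

def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n (mirrors what block-size
    # selection does before launching a Triton kernel).
    return 1 << (n - 1).bit_length()

def tail_mask(seqlen: int):
    # Pad the sequence length up to a power-of-2 block and
    # build a boolean mask marking the valid (unpadded) lanes,
    # analogous to `offs < seqlen` masks in masked loads/stores.
    block = next_power_of_2(seqlen)
    offs = np.arange(block)
    mask = offs < seqlen
    return block, mask

# A non-power-of-2 length like 37 pads to 64 with 27 masked lanes.
block, mask = tail_mask(37)
```

Tests over shapes like these catch bugs where a kernel reads or accumulates across the padded tail instead of respecting the mask.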