This adds new tests across the board for float32, bfloat16, non-power-of-2 shapes (to exercise the masking paths), and for sequence parallelism with atomics. It also ports the sequence-parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.

I am not sure about the best way to grab baseline benchmarking numbers. I have access to V100s and A100s, but the existing tests note: "# A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". The current plan is to run CI here and use those numbers as the baseline, then match against my GPUs as a sanity check.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
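For context on why non-power-of-2 shapes exercise the masks: Triton blocks operate on power-of-2 extents, so an odd sequence length gets padded up to the next power of 2 and the tail lanes are masked off. A minimal sketch of that padding/masking logic (the helper names `next_power_of_2` and `tail_mask` are illustrative, not from the codebase):

```python
import numpy as np

def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n (mirrors what block-size
    # selection does before launching a Triton kernel).
    return 1 << (n - 1).bit_length()

def tail_mask(seqlen: int):
    # Pad the sequence length up to a power-of-2 block and
    # build a boolean mask marking the valid (unpadded) lanes,
    # analogous to `offs < seqlen` masks in masked loads/stores.
    block = next_power_of_2(seqlen)
    offs = np.arange(block)
    mask = offs < seqlen
    return block, mask

# A non-power-of-2 length like 37 pads to 64 with 27 masked lanes.
block, mask = tail_mask(37)
```

Tests over shapes like these catch bugs where a kernel reads or accumulates across the padded tail instead of respecting the mask.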