Commit Graph

20 Commits

Author SHA1 Message Date
Ognjen
3a12d9d269 fix 2024-01-31 14:13:36 +00:00
Ognjen
171a67e837 Add scheduling pass 2024-01-25 18:07:45 +00:00
Vinayak Gokhale
f239abfc7e Revert "Add autotuning for FA (#459)" (#467)
This reverts commit 1fec965c06.

This change used a pre_hook to edit a kernel arg. However,
changes made inside the pre_hook are not visible to the
kernel in all cases.
2024-01-16 23:02:47 -06:00
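A minimal sketch of the reverted pattern, with illustrative config and argument names ('PRE_LOAD_V' is hypothetical, not necessarily what the original change edited): the pre_hook mutates the host-side argument dict, and since the autotuner can rebuild or re-merge that dict before launch, the edit is not guaranteed to reach the compiled kernel.

```python
import triton

# Hedged sketch of the reverted approach; argument names are illustrative.
configs = [
    triton.Config(
        {'BLOCK_M': 128, 'BLOCK_N': 64},
        num_warps=4,
        # The hook edits the host-side argument dict; the autotuner may
        # merge a fresh dict before launch, so this edit can be lost.
        pre_hook=lambda nargs: nargs.update({'PRE_LOAD_V': False}),
    ),
]
```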
Lixun Zhang
a819e48435 Refine test_correctness (#463)
- Check correctness of what is benchmarked
- Add capability to check col_a and col_b
  - But only check col_a=False, col_b=True for now
- Only benchmark col_a=False, col_b=True for now
- Remove the in='int8', out='int8' case because its error is too large
2024-01-16 11:15:54 -06:00
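A hedged sketch of such a check, with `matmul` standing in as a hypothetical name for the repo's kernel wrapper: correctness is verified on the same layout that is benchmarked, here col_a=False, col_b=True (b stored column-major).

```python
import torch

a = torch.randn(512, 512, device='cuda', dtype=torch.float16)    # row-major
b = torch.randn(512, 512, device='cuda', dtype=torch.float16).T  # col-major view
ref = torch.matmul(a, b)
out = matmul(a, b)  # hypothetical entry point for the benchmarked kernel
torch.testing.assert_close(out, ref, atol=1e-2, rtol=1e-2)
```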
Lixun Zhang
e231c41467 [TUTORIAL] Enable all types in gemm tutorial (#456)
* Enable all types in gemm tutorial

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2024-01-15 14:38:31 -06:00
Vinayak Gokhale
1fec965c06 Add autotuning for FA (#459) 2024-01-12 17:15:12 -06:00
Vinayak Gokhale
0248bdb29d Minor edits to HBM bandwidth measurement kernel (#434)
* Change units from GB/s to GiB/s

* Run both with and without bounds checking
2023-12-21 06:14:31 -06:00
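For reference, the unit change only rescales the reported number: GB/s divides by 10^9 bytes, GiB/s by 2^30, so GiB/s figures come out about 6.9% lower. A quick illustration with made-up numbers:

```python
bytes_moved = 2 * (1 << 30)   # e.g. one 1 GiB read plus one 1 GiB write
elapsed_s = 0.002
gb_s  = bytes_moved / elapsed_s / 1e9    # ~1073.7 GB/s
gib_s = bytes_moved / elapsed_s / 2**30  # 1000.0 GiB/s, same transfer
```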
Vinayak Gokhale
422d7096ce Add kernel to check HBM BW (#431)
Add a kernel to measure HBM bandwidth.
2023-12-18 21:25:21 -06:00
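A minimal sketch of a bandwidth-check kernel in the same spirit (not the repo's kernel): stream a large tensor through a copy and derive GiB/s from bytes moved over elapsed time.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements  # bounds check; droppable if n is a multiple of BLOCK
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

def hbm_bw_gib_s(n=1 << 28):
    src = torch.randn(n, device='cuda', dtype=torch.float32)
    dst = torch.empty_like(src)
    grid = (triton.cdiv(n, 1024),)
    ms = triton.testing.do_bench(lambda: copy_kernel[grid](src, dst, n, BLOCK=1024))
    # One read plus one write per element, reported in GiB/s.
    return 2 * src.numel() * src.element_size() / (ms * 1e-3) / 2**30
```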
Vinayak Gokhale
b7a412d82a Add support for ALiBi-style attention bias (#417)
Add support for matrix and vector bias to FA.
2023-12-15 16:28:37 -06:00
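A hedged sketch of what ALiBi-style matrix bias means for the attention scores (shapes and slope schedule are illustrative): a per-head slope scales the key/query distance and is added to q·kᵀ before the softmax; a vector bias would instead add one value per key position.

```python
import torch

def alibi_matrix_bias(n_heads: int, seqlen: int) -> torch.Tensor:
    # Geometric per-head slopes, following the ALiBi paper's schedule.
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)          # (H,)
    dist = (torch.arange(seqlen)[None, :] - torch.arange(seqlen)[:, None]).abs()
    return -slopes[:, None, None] * dist[None, :, :]                        # (H, N, N)

scores = torch.randn(4, 128, 128)  # q @ k.T per head
probs = torch.softmax(scores + alibi_matrix_bias(4, 128), dim=-1)
```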
Vinayak Gokhale
1d6b919897 Bugfix: Wrong boundary condition on qk GEMM 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
f6969f4bb3 Correct loop lo to be a multiple of BLOCK_N 2023-12-04 10:11:41 -06:00
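The fix amounts to snapping the inner loop's lower bound to a block boundary (rounding down, in this sketch; values are illustrative):

```python
BLOCK_M, BLOCK_N, start_m = 64, 128, 3
lo = (start_m * BLOCK_M) // BLOCK_N * BLOCK_N  # 192 -> 128, a multiple of BLOCK_N
```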
Vinayak Gokhale
0ef865508c Update description 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57 Remove slicing of out when saving for bwd 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569 Mask vs pad for non-power-of-2 sequence lengths
Padding requires an extra memory allocation, which is slower;
masking gives better performance.
2023-12-04 10:11:41 -06:00
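A minimal Triton sketch of the masking approach (illustrative, not the repo's kernel): tl.arange needs a power-of-2 extent, so the block is sized to the next power of 2 and out-of-range lanes are masked at load/store time instead of padding the tensors in global memory.

```python
import triton
import triton.language as tl

@triton.jit
def masked_copy(src_ptr, dst_ptr, seqlen, BLOCK: tl.constexpr):
    # BLOCK is the next power of 2 >= seqlen; no padded allocation needed.
    offs = tl.arange(0, BLOCK)
    mask = offs < seqlen
    x = tl.load(src_ptr + offs, mask=mask, other=0.0)
    tl.store(dst_ptr + offs, x, mask=mask)
```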
Vinayak Gokhale
d5028079b7 Add FA support for non-pow2 seqlen 2023-12-04 10:11:41 -06:00
Alexander Efimov
5b06b168aa [Tutorial] Fix post IFU issues with FA (#398)
* [Tutorial] Fix post IFU issues with FA

* Remove redundant kernels in 06-fused-attention.py

* Added README for scripts in perf-kernels dir

* Fix bwd kernel

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-14 10:46:45 -06:00
Lixun Zhang
d4eda83b33 Benchmark FA on 2 GCDs (#393) 2023-11-08 12:42:54 -06:00
Lixun Zhang
1af893d8a2 [FRONTEND] Add input dtypes to autotuning key (#2534) (#374)
* [FRONTEND] Add input dtypes to autotuning key (#2534)

* Fix conflict in 06-fused-attention

* Fix get_best_config in FA-transV.py

* Fix leftover get_best_config()

---------

Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
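A hedged sketch of why the key matters (kernel and names are illustrative): the autotuner caches the best config per key value, and folding input dtypes into that key keeps, say, fp16 and fp32 inputs from sharing one cached choice.

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({'BLOCK': 1024}, num_warps=4),
             triton.Config({'BLOCK': 2048}, num_warps=8)],
    key=['n_elements'],  # with this change, input dtypes also enter the cache key
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs, alpha * tl.load(x_ptr + offs, mask=mask), mask=mask)
```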
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper-specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Lixun Zhang
a66270165a Move fa-transV to the new perf-kernels dir (#387) 2023-11-03 00:09:48 -05:00