Lixun Zhang
a819e48435
Refine test_correctness (#463)
- Check correctness of what is benchmarked
- Add capability to check col_a and col_b
- But only check col_a=False, col_b=True for now
- Only benchmark col_a=False, col_b=True for now
- Remove in='int8', out='int8' because its error was too large
2024-01-16 11:15:54 -06:00
Lixun Zhang
e231c41467
[TUTORIAL] Enable all types in gemm tutorial (#456)
* Enable all types in gemm tutorial
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2024-01-15 14:38:31 -06:00
Vinayak Gokhale
1fec965c06
Add autotuning for FA (#459)
2024-01-12 17:15:12 -06:00
Vinayak Gokhale
0248bdb29d
Minor edits to HBM bandwidth measurement kernel (#434)
* Change units to GiB/s from GB/s
* Run both with and without the bounds check
2023-12-21 06:14:31 -06:00
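The units change above swaps decimal gigabytes for binary gibibytes. A minimal sketch of the two conversions (the function names are illustrative, not taken from the kernel):

```python
def gb_per_s(nbytes, seconds):
    # Decimal gigabytes: 1 GB = 10**9 bytes.
    return nbytes / seconds / 1e9

def gib_per_s(nbytes, seconds):
    # Binary gibibytes: 1 GiB = 2**30 bytes, about 7.4% larger than 1 GB,
    # so the same measurement reads lower in GiB/s than in GB/s.
    return nbytes / seconds / 2**30
```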
Vinayak Gokhale
422d7096ce
Add kernel to check HBM BW (#431)
Add kernel to check HBM BW performance
2023-12-18 21:25:21 -06:00
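The idea behind a bandwidth-check kernel is simple: move a known number of bytes, time it, divide. A host-memory NumPy stand-in for that measurement (not the GPU kernel itself, which streams from HBM):

```python
import time
import numpy as np

def measure_copy_bandwidth(nbytes=1 << 26, repeats=5):
    """Time a buffer copy and report achieved bandwidth in GiB/s."""
    src = np.ones(nbytes, dtype=np.uint8)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)
        best = min(best, time.perf_counter() - t0)
    # A copy reads and writes each byte once: 2 * nbytes moved in total.
    return 2 * nbytes / best / 2**30
```

Taking the best of several repeats filters out one-off timer and cache noise.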
Vinayak Gokhale
b7a412d82a
Add support for ALiBi-style attention bias (#417)
Add support for matrix and vector bias to FA.
2023-12-15 16:28:37 -06:00
Vinayak Gokhale
1d6b919897
Bugfix: wrong boundary condition in the qk GEMM
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
f6969f4bb3
Correct loop lo to be a multiple of BLOCK_N
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
0ef865508c
Update description
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57
Remove slicing of out when saving for bwd
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569
Mask vs. pad for non-power-of-2 sequence lengths
Padding requires an extra memory allocation, which is slower;
masking yields better performance.
2023-12-04 10:11:41 -06:00
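The trade-off in that commit can be sketched on the host: padding allocates and copies a rounded-up buffer, while masking just clamps the last block to the true length. A NumPy illustration (Triton's actual masked tl.load works per-element on the GPU; the function names here are illustrative):

```python
import numpy as np

def scores_padded(q, k, block_n=16):
    # Pad the key sequence up to a multiple of BLOCK_N: costs an extra
    # allocation plus a copy before the matmul.
    n = k.shape[0]
    n_pad = -(-n // block_n) * block_n  # ceil(n / block_n) * block_n
    k_pad = np.zeros((n_pad, k.shape[1]), dtype=k.dtype)
    k_pad[:n] = k
    return (q @ k_pad.T)[:, :n]

def scores_masked(q, k, block_n=16):
    # Walk the keys in BLOCK_N chunks and clamp the final block to the true
    # length -- the analogue of a masked load, with no extra allocation.
    n = k.shape[0]
    out = np.empty((q.shape[0], n), dtype=np.result_type(q, k))
    for lo in range(0, n, block_n):
        hi = min(lo + block_n, n)
        out[:, lo:hi] = q @ k[lo:hi].T
    return out
```

Both paths produce identical scores; only the masked one avoids the padded allocation.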
Vinayak Gokhale
d5028079b7
Add FA support for non-pow2 seqlen
2023-12-04 10:11:41 -06:00
Alexander Efimov
5b06b168aa
[Tutorial] Fix post-IFU issues with FA (#398)
* [Tutorial] Fix post-IFU issues with FA
* Remove redundant kernels in 06-fused-attention.py
* Added README for scripts in perf-kernels dir
* Fix bwd kernel
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-14 10:46:45 -06:00
Lixun Zhang
d4eda83b33
Benchmark FA on 2 GCDs (#393)
2023-11-08 12:42:54 -06:00
Lixun Zhang
1af893d8a2
[FRONTEND] Add input dtypes to autotuning key (#2534) (#374)
* [FRONTEND] Add input dtypes to autotuning key (#2534)
* Fix conflict in 06-fused-attention
* Fix get_best_config in FA-transV.py
* Fix leftover get_best_config()
---------
Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
oplavsic
c65f1e6211
Add OptimizeEpilogue pass (#346)
* optimize_epilogue
* Add config
* Remove licenses
* Comment out Hopper-specific parameters when printing out configs
* Add benchmark parameters from flash-attention repo
* Add Z and H in the key of autotuner
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Lixun Zhang
a66270165a
Move fa-transV to the new perf-kernels dir (#387)
2023-11-03 00:09:48 -05:00