Implemented this situation with and without a causal mask.
With the causal mask, my implementation looks like:
111000
111100
111110
Only the upper-right triangular part is masked.
I added `P_SEQ` to denote the extra sequence length of KV.
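The mask pattern above can be reproduced with a small pure-Python sketch. The helper `build_mask` and its `offset` parameter are hypothetical names used only for illustration; how `offset` relates to `P_SEQ` in the actual kernel is an assumption and is not confirmed by this message.

```python
def build_mask(n_q, n_kv, offset):
    """Return one string per query row: '1' = attended, '0' = masked.

    `offset` is a hypothetical parameter: key j stays visible to query i
    when j <= i + offset. Everything to the upper right of that diagonal
    is masked.
    """
    rows = []
    for i in range(n_q):
        rows.append("".join("1" if j <= i + offset else "0" for j in range(n_kv)))
    return rows


# 3 query rows against 6 key columns, as in the pattern shown above.
for row in build_mask(n_q=3, n_kv=6, offset=2):
    print(row)
```

With `offset=0` and equal sequence lengths this reduces to the usual lower-triangular causal mask.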
Co-authored-by: Philippe Tillet <phil@openai.com>
Initial code merge of Nvidia Hopper feature support. Please be
aware that the merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. We recommend trying them once version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
Enabled the backward pass in the fused attention tutorial.
The tolerance for the comparison against the naive implementation
had to be changed. The block size is forced to 64x64
because of the 64 KiB LDS. The default is block 128, which suits A100's
larger SMEM. The smaller block changes the order of computation
and results in a larger gap between the naive and FA
implementations.
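To illustrate why the block size affects the numerical result, here is a minimal pure-Python sketch of a block-wise softmax-weighted sum with a running maximum, compared against a one-pass version. This is not the tutorial kernel; the function names, the scalar "values", and the tolerances are assumptions chosen for illustration.

```python
import math
import random


def naive_attention_row(scores, values):
    """Softmax-weighted average computed over the whole row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def blocked_attention_row(scores, values, block):
    """Same quantity, accumulated block by block with a running max."""
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales what was accumulated so far
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l


random.seed(0)
scores = [random.uniform(-4, 4) for _ in range(256)]
values = [random.uniform(-1, 1) for _ in range(256)]

ref = naive_attention_row(scores, values)
out64 = blocked_attention_row(scores, values, block=64)
out128 = blocked_attention_row(scores, values, block=128)

# The results agree only up to rounding, so compare with a tolerance.
assert math.isclose(out64, ref, rel_tol=1e-9, abs_tol=1e-12)
assert math.isclose(out128, ref, rel_tol=1e-9, abs_tol=1e-12)
```

The sketch runs in double precision, so any gap is tiny; in the lower precision used on the GPU the same reordering would be expected to produce a larger one.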
Uses FlashAttention-2 if available, otherwise acts as before (if
FlashAttention-1 is available, that is used, otherwise the
FlashAttention reference benchmark is not run).
I decided to keep the same name for the imported function, but feel free
to ask me to change that.
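The fallback described above could look roughly like the following sketch. The import paths and the names `FLASH_VER` and `HAS_FLASH` are assumptions for illustration and may differ from the actual change or from the installed `flash-attn` version.

```python
# Assumed import paths; adjust them to the installed flash-attn package.
try:
    # Assumption: FlashAttention-2 exposes a packed-QKV entry point here.
    from flash_attn.flash_attn_interface import (
        flash_attn_qkvpacked_func as flash_attn_func,
    )
    FLASH_VER = 2
except ImportError:
    try:
        # Assumption: FlashAttention-1 exposes `flash_attn_func` here.
        from flash_attn.flash_attn_interface import flash_attn_func
        FLASH_VER = 1
    except ImportError:
        flash_attn_func = None
        FLASH_VER = None

# The reference benchmark would only run when some version was found.
HAS_FLASH = FLASH_VER is not None
print("FlashAttention reference version:", FLASH_VER)
```

Importing both variants under the same name keeps the call sites unchanged, which matches the decision to keep the imported function's name.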
* [WIP][FA OPTIMIZATION] Optimize chain dot
This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.
* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers
* Move operand swapping in ttgir -> llir lowering phase
* Refactor emitMfmaOffsetForCTA function to be more readable
* Fix accidental change in 06-fused-attention.py
* Address review comments
* Fix rebase errors
Adding new tests across the board for float32, bfloat16, non-powers-of-2
shapes (to test masks), and tests on sequence parallel for atomics. This
also adds the sequence parallel features from
https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.
I am not sure about the best way to grab the baseline benchmarking
numbers. I have access to V100s and A100s, but I saw that the tests
mention "A100 in the CI server is slow-ish for some reason. On some
other servers, we are getting about 90% peak for 8kx8x8k float16".
The current plan is to run CI here and use those numbers as the
baseline, then match against my GPUs as a sanity check.
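As a sketch of why non-power-of-2 shapes exercise the masks mentioned above: a block-wise loop over a length that is not a multiple of the block size must mask the trailing lanes of the last block. The helper `blocked_sum` is hypothetical and only mirrors the idea in plain Python.

```python
def blocked_sum(xs, block):
    """Sum `xs` in fixed-size blocks, masking lanes past the end."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, block):
        offs = [start + k for k in range(block)]
        mask = [o < n for o in offs]
        # Masked-out lanes contribute the neutral element 0.0.
        lanes = [xs[o] if ok else 0.0 for o, ok in zip(offs, mask)]
        total += sum(lanes)
    return total


# Lengths on and off the block boundary, similar in spirit to the new tests.
for n in (64, 100, 127, 129):
    data = [float(i) for i in range(n)]
    assert blocked_sum(data, block=64) == sum(data)
```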
---------
Co-authored-by: Phil Tillet <phil@openai.com>
* [MFMA] Activated Fused Attention Forward Pass
The patch contains the following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
inputs converted to block layouts, outputs returned to MFMA layout
* Use simple module walk instead of pattern rewriter.
* Remove pattern rewriter header.
* Enable basic reduce algorithm for MFMA layout
* Add TODO comment for fused attention backward pass
* Fix bug in fast codegen algorithm for reduce op
* Fix input type bug
* Increase block size to 128 since the out-of-memory issue is not seen on MI210
* Fix block_size error
* Add mfma support in DecomposeDotOperand pattern.
This PR contains:
- Several fixes for the matrix multiplication (M and N dimensions may
have out-of-bound access)
- A type check for block-based store
- The tutorial for block pointers
- Fix some formats
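The out-of-bound issue on the M and N dimensions can be sketched in plain Python as a tiled matrix multiplication whose last tiles hang over the edge of the output. The function `tiled_matmul` and its block sizes are hypothetical; the real fix lives in the Triton kernels, not in code like this.

```python
def tiled_matmul(a, b, block_m=4, block_n=4):
    """Multiply a (M x K) by b (K x N), one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            for i in range(m0, m0 + block_m):
                for j in range(n0, n0 + block_n):
                    # Boundary check: skip lanes outside the M x N output.
                    if i < m and j < n:
                        c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


# M = 5 and N = 6 are not multiples of the 4 x 4 tile.
a = [[i + p for p in range(3)] for i in range(5)]
b = [[p * j + 1 for j in range(6)] for p in range(3)]
expected = [[sum(a[i][p] * b[p][j] for p in range(3)) for j in range(6)]
            for i in range(5)]
assert tiled_matmul(a, b) == expected
```

Without the `i < m and j < n` check, the index expressions for the last tiles would reach past the matrix bounds.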
While merging `triton-mlir`, it seems that the libdevice tutorial was
missed. This PR adds it back and updates it to use the current
`tl.math` interface.
I also found a bug in `test_core.py`: the `extern_libs` argument should still
pass `libdevice`, otherwise the test I added fails. The legacy code didn't
fail because `lib_path` is `None` and ignored.
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR:
- Fixes syntax errors, e.g. `.type values: dict[str,
Callable[[list[Any]], Any]]` becomes `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting, e.g. `k ++` becomes `k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
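For reference, the corrected `:type` field sits in a Sphinx-style docstring roughly as in the sketch below. The function `evaluate` is a made-up example and is not taken from the code base.

```python
def evaluate(values):
    """Apply every callable in ``values`` to a fixed sample list.

    :param values: mapping from a name to a function of a list
    :type values: dict[str, Callable[[list[Any]], Any]]
    :return: mapping from each name to its function's result
    :rtype: dict[str, Any]
    """
    sample = [1, 2, 3]
    return {name: fn(sample) for name, fn in values.items()}


print(evaluate({"total": sum, "largest": max}))
```

A field written as `.type values:` is not recognized as a field by the documentation tooling, which is why the leading colon matters.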
* Cleaned up pipeline pass. Now works when there are element-wise ops
between the load and the dot
* Made `splat` compatible with variables that have DotOperandLayout
* Moves rematerialization utils to separate Transforms/Utility.cpp file.
* Frontend:
- `int` kernel arguments are always signed
- Loop induction variable is now determined by integer promotion on
lb/ub/step
* Optimizer:
- Added new ExtractSliceOp that enforces 32-bit offsets
* Backend:
- Use 64-bit indices when lowering functions and control flow
- Removed `idx_val` macro and replaced it with `i32_val`
- Cleaned up comments
- Added new ArithToIndex pass to make sure operations on indices are
done with the `index` dialect, which gets converted to LLVM separately
using a 64-bit target
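The commit does not spell out its motivation, but a common reason for distinguishing 32-bit offsets from 64-bit indices is that large linear offsets wrap around in 32 bits. The sketch below only illustrates that wrap-around; the helper names and the example sizes are assumptions.

```python
import ctypes


def as_int32(x):
    """Reinterpret a Python int as a wrapped signed 32-bit value."""
    return ctypes.c_int32(x).value


def as_int64(x):
    """Reinterpret a Python int as a signed 64-bit value."""
    return ctypes.c_int64(x).value


row, stride = 70_000, 40_000
offset = row * stride  # 2_800_000_000 elements into the buffer

print(as_int32(offset))  # no longer fits in 32 bits and wraps to a negative value
print(as_int64(offset))  # preserved in 64 bits
```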
- Add text description and equations for the tutorial.
- Improve the code readability by changing variable names to align them
with the equation. The actual code logic is not changed.
This is a follow-up of #510. Let me know if a preview HTML is helpful
for the review; I can add a link to that too.
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
- Unifying several interfaces with different types to a single one, e.g.
`fsub_ru` and `dsub_ru` -> `sub_ru`;
- Minor bug fix: `fast_pow` was incorrectly classified under the `pow`
interface, whose arguments are the same as those of `powf`;
- Explicit interfaces for casting functions, e.g. splitting
`ll2float_ru` into `ll2float_ru` and `ull2float_ru`;
- Removing interfaces that are not in NVIDIA's official documents, e.g.
`fmaf_ieee_rn`, which is easily confused with `fmaf_rn`.
Note that this PR for the master branch is different from #829, which is
for the MLIR branch.
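The unified and explicit interfaces listed above can be pictured as a lookup keyed on the interface name and the argument types. The table and the `resolve` helper below are a hypothetical sketch, not the PR's implementation; the `__nv_`-prefixed names are assumed to be the corresponding libdevice symbols.

```python
# Hypothetical dispatch table: (interface, argument dtypes) -> libdevice symbol.
LIBDEVICE_SYMBOLS = {
    ("sub_ru", ("fp32", "fp32")): "__nv_fsub_ru",
    ("sub_ru", ("fp64", "fp64")): "__nv_dsub_ru",
    ("ll2float_ru", ("int64",)): "__nv_ll2float_ru",
    ("ull2float_ru", ("uint64",)): "__nv_ull2float_ru",
}


def resolve(name, *arg_dtypes):
    """Pick the symbol for an interface from its argument dtypes."""
    try:
        return LIBDEVICE_SYMBOLS[(name, tuple(arg_dtypes))]
    except KeyError:
        raise TypeError(f"no {name} overload for {arg_dtypes}") from None


# One interface name covers both precisions.
print(resolve("sub_ru", "fp32", "fp32"))
print(resolve("sub_ru", "fp64", "fp64"))
```

Keeping the signed and unsigned casts as separate entries means the caller states the signedness explicitly instead of relying on the argument type alone.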