Commit Graph

139 Commits

Author SHA1 Message Date
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
Lixun Zhang
821e75a2b0 Improve FA fwd kernel with causal=True (#356)
* Attempt to absorb upstream's changes to improve causal=True

* Add autotuner

* Optimize for AMD MI250

- add pre_load_v as a tuning parameter
- do not define N_CTX as constexpr
- perform the second dot before sum
- move qk_scale out of the inner loop
- add more configs in the autotuner

Note that the bwd kernel is disabled for now. With autotuning enabled,
grid becomes a function, so ctx.grid[0] no longer works.

* Enable bwd kernel
2023-10-12 12:34:27 -05:00
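The note in the commit above about `ctx.grid[0]` can be illustrated in plain Python. This is a minimal sketch, not the tutorial's code: the sizes, the `BLOCK_M` key, and the helper `first_grid_dim` are assumptions made up for illustration. It only shows why indexing stops working once the grid is a callable that receives the selected meta-parameters, and one way to resolve it first.

```python
def cdiv(a, b):
    """Ceiling division, as used when sizing launch grids."""
    return -(-a // b)


# A fixed launch grid is a plain tuple, so indexing it works.
static_grid = (cdiv(4096, 128), 8, 1)


# With autotuning, the block size is not known until a config is chosen,
# so the grid is written as a function of the meta-parameters.
# "BLOCK_M" is an illustrative key, not taken from the commit.
def autotuned_grid(meta):
    return (cdiv(4096, meta["BLOCK_M"]), 8, 1)


def first_grid_dim(grid, meta):
    """Hypothetical helper: return grid[0] for a tuple or a callable grid."""
    resolved = grid(meta) if callable(grid) else grid
    return resolved[0]


# Indexing the callable directly fails, which is the problem the note describes.
try:
    autotuned_grid[0]
    indexable = True
except TypeError:
    indexable = False
```

Resolving the callable with the chosen config before indexing, as `first_grid_dim` does, is one possible workaround; the commit message does not say how the bwd kernel was re-enabled.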
Shucai Xiao
99fa2e4237 add tutorial group gemm example (#343)
* [DOCS] Add a tutorial example of grouped gemm (#2326)
Co-authored-by: Bin Fan <binf@nvidia.com>
2023-10-11 15:13:17 -05:00
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
oplavsic
6a173eab8a Remove redundant fp32->fp16 conversion in FA (#349) 2023-10-04 14:10:07 -05:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0

* updated configs
2023-09-29 20:20:55 -05:00
Ognjen Plavsic
a8574be74d [FA] Fix bug from IFU merge
This commit reverts num_warps back to 4 for
FA forward pass with d=128.
2023-09-21 10:48:41 -05:00
Alexander Efimov
b25557ad5e [FA] Upstream FA qk initialization (#328)
This PR replaces qk matrix initialization with
upstream version
2023-09-14 00:34:14 -05:00
Shucai Xiao
fb3f2d6feb refine gemm tuning scripts (#309)
* refine the gemm tuning scripts to reduce the tuning space and get better perf numbers

* added code to support tuning in full tuning space

* add a function to get best tuning config

* refine the matmul tutorial example to print out best tuning config for each input

* added even_k to gemm kernel heuristic for better performance

* address review comments
2023-09-07 08:09:11 -05:00
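The `even_k` heuristic mentioned in the commit above is not spelled out in the message. A common reading, assumed here, is a flag that is true when K is a multiple of the K-block size, so that loads along K need no boundary mask. The sketch below uses hypothetical argument names (`K`, `BLOCK_SIZE_K`) and plain Python rather than the kernel itself.

```python
def even_k(args):
    """Assumed heuristic: True when K splits into whole K-blocks."""
    return args["K"] % args["BLOCK_SIZE_K"] == 0


def k_tile_bounds(K, block_k):
    """Return (start, valid_length) for each tile along the K dimension."""
    return [(start, min(block_k, K - start)) for start in range(0, K, block_k)]


def needs_mask(K, block_k):
    """A tile needs a boundary mask only if it is shorter than block_k."""
    return any(length < block_k for _, length in k_tile_bounds(K, block_k))
```

Under this assumption, `even_k` being true is equivalent to `needs_mask` being false, which is what would let a kernel take the unmasked load path.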
Jason Furmanek
df5c263a19 Fix merge conflicts 2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Vinayak Gokhale
9cdf3a58c3 Enable split kernel in bwd pass (#303)
* Add fwd and bwd v2

Changes are largely from upstream.

* Split bwd kernel in dq and dk+dv

Only adds the split kernels. They are not enabled yet.

* Pull scalar multiplies out of the loop

* Enable split kernel for bwd pass

* Put back P_SEQ=128 in fwd test

Not used for bwd test

* Address review comments

* Address comments

Conditionally set causal/ splitkernel to False for bwd.

* Add block pointer semantics to bwd pass

This significantly increases perf for bwd, similar to fwd.
2023-08-29 13:51:29 -05:00
Zahi Moudallal
120ce0a5bf [DOCS] Fixing docs (#2175) 2023-08-24 15:58:59 -07:00
jayfurmanek
ff7e707f87 Enable usage of block pointer semantics for AMD gpus (#301)
* Enable usage of block pointer semantics for AMD gpus

This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass that rewrites block pointer loads/stores
to legacy loads/stores.

* Update FA fwd in tutorial to use the block pointers

* use 90 compute capability for amd gpus in python/triton/compiler/compiler.py

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2023-08-24 13:05:12 -05:00
Keren Zhou
6a65c894fe [TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153) 2023-08-22 18:02:26 -07:00
ivanyinwz
ec801ce18e [BACKEND] Optimize performance for f16 epilogue with TMA store (#2135)
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
2023-08-21 12:44:11 -07:00
Shucai Xiao
f7cf2c032b Changes of the tutorial matmul scripts to get good performance (#297)
* simple changes to the matmul scripts to get good performance. The specific reason for the performance boost needs further investigation and is tracked

* fix review comments

* change the num_warps in the autotuning config for hip to work around an error and change the rtol so the correctness check passes
2023-08-17 13:24:49 -05:00
danny.jang
b878c8826f [DOCS] fix generating document failures (#2096) 2023-08-13 06:53:34 -07:00
Keren Zhou
5162871c6c [TUTORIAL] flash attention d128 improvement (#2074)
`ptxas` is able to automatically generate a call instruction to "call"
the loop body so that instructions are better scheduled.
2023-08-12 00:31:48 +00:00
Zahi Moudallal
4d373aa103 [BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047) 2023-08-10 15:52:37 -07:00
Alexander Efimov
ff95dddf18 [tutorial] Amd specific tuning configs (#287)
This PR adds AMD-specific tuning configs for the matmul tutorial with num_stages == 1.
2023-08-08 20:11:30 +02:00
BoxiangW
f21a053ee6 [TUTORIALS] support flash attention 2 with KV's sequence length longer than Q's (#2033)
Implemented this situation with and without causal mask.
My implementation with causal mask looks like:
111000
111100
111110
Where only the upper-right triangle part will be masked.
I added `P_SEQ` for the notation of extra sequence length for KV.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:57:44 -07:00
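The mask pattern drawn in the commit above can be reproduced with a few lines of plain Python. This is a sketch under assumptions: `offset` is a hypothetical parameter giving how many extra key positions each query row may see, and the message does not state how `P_SEQ` maps onto it. The 3x6 pattern shown corresponds to `offset=2`.

```python
def causal_mask(n_q, n_kv, offset):
    """Row i may attend to key j when j <= i + offset (1 = keep, 0 = masked)."""
    return [[1 if j <= i + offset else 0 for j in range(n_kv)]
            for i in range(n_q)]


def render(mask):
    """Format each mask row as a string of 0/1 digits."""
    return ["".join(str(bit) for bit in row) for row in mask]


# Rectangular case: 3 query rows, 6 key columns.
rectangular = render(causal_mask(3, 6, 2))

# Ordinary square causal mask: no extra key positions.
square = render(causal_mask(3, 3, 0))
```

With `offset=0` and equal lengths this reduces to the usual lower-triangular causal mask, so the rectangular case is the same rule shifted to the right.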
ben-zhang-609
31e79aa384 [TESTS] remove get_proper_err, get_variant_golden (#2039)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:52:55 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and the trouble-shooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. It is recommended for a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Vinayak Gokhale
f1063bb33c Enable backward pass in FA tutorial test (#282)
Enabled the backward pass in the fused attention tutorial.
The tolerance when comparing to the naive implementation
had to be changed. The block size is forced to be 64x64
due to the 64 KiB LDS. Default is block 128 for A100's
larger SMEM. This creates differences in order of computation
and results in a larger gap between the naive and FA
implementations.
2023-08-03 10:12:46 -05:00
Phil Tillet
db695c093f [TUTORIALS] fix format 2023-07-25 18:16:39 -07:00
janEbert
62a8afa403 [TUTORIALS] Support FlashAttention-2 reference (#1984)
Uses FlashAttention-2 if available, otherwise acts as before (if
FlashAttention-1 is available, that is used, otherwise the
FlashAttention reference benchmark is not run).

I decided to keep the same name for the imported function, but feel free
to make me change that.
2023-07-24 13:54:01 -07:00
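The fallback order described in the commit above (use FlashAttention-2 if present, else FlashAttention-1, else skip the reference benchmark) is an instance of a "first importable candidate" pattern. The sketch below is generic: `first_available` is a hypothetical helper, and the module and attribute names in the example call are standard-library stand-ins, not the real FlashAttention import paths.

```python
import importlib


def first_available(candidates):
    """Return the first attribute that imports cleanly, or None.

    candidates is a list of (module_name, attribute_name) pairs,
    tried in order of preference.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # This candidate is not installed; try the next one.
        func = getattr(module, attr_name, None)
        if func is not None:
            return func
    return None


# Stand-in names only: the first module does not exist, the second does.
reference_fn = first_available([
    ("preferred_backend_placeholder", "attention"),
    ("math", "sqrt"),
])

# When nothing is importable the result is None and the caller can
# skip the reference benchmark.
missing = first_available([("preferred_backend_placeholder", "attention")])
```

Binding the result to a single name, whichever candidate won, matches the commit author's choice to keep the same name for the imported function.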
Izzy Putterman
de6f053c0f [TRITON][OPS] add Flash Attention v2 to Ops (#1970)
I also dropped `do_scaled`, as it is no longer needed (no scaling is done
to `do` in v2).

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-23 14:07:15 -07:00
Phil Tillet
cfce82d715 [TUTORIALS] Flash Attention tutorial now properly tries fwd, bwd, causal, non-causal 2023-07-19 21:56:29 -07:00
Philippe Tillet
c46a842b6f [TUTORIAL] more attention cleanup (#1958) 2023-07-18 12:36:15 -07:00
Philippe Tillet
9e3e10c5ed [OPTIMIZER][TUTORIAL] flash attention v2 (#1952) 2023-07-17 12:23:02 -07:00
Philippe Tillet
8207eabd7b [FRONTEND][OPTIMIZER] small perf improvements (#1945) 2023-07-14 15:11:36 -07:00
oplavsic
d6e51fd221 [FA OPTIMIZATION] Keep results of FA dot operations in registers (#247)
* [WIP][FA OPTIMIZATION] Optimize chain dot

This commit optimizes chain dot operation by keeping
results of the first dot operation in registers.

* [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers

* Move operand swapping in ttgir -> llir lowering phase

* Refactor emitMfmaOffsetForCTA function to be more readable

* Fix accidental change in 06-fused-attention.py

* Address review comments

* Fix rebase errors
2023-07-12 15:25:55 -05:00
Philippe Tillet
bf5acf46e2 [OPS] improved pointer arithmetic in attention (#1926)
This provides an additional 3-4% speed-up in non-causal attention, which
now tops out at 155 TFLOPS.
2023-07-11 12:04:00 -07:00
Phil Tillet
041f1144e8 [DOCS] fixed flash_attn causal argument in tutorial 2023-07-11 09:28:20 -07:00
Izzy Putterman
d39d78fa08 [OPS] Add more perf-tests, new features to FA (#1849)
Adding new tests across the board for float32, bfloat16, non-powers-of-2
shapes (to test masks), and tests on sequence parallel for atomics. This
also adds the sequence parallel features from
https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.
I am not sure about the best way to grab the baseline benchmarking
numbers. I have access to V100s and A100s, but I saw on the tests it
mentions " # A100 in the CI server is slow-ish for some reason.
# On some other servers, we are getting about 90% peak for 8kx8x8k
float16". Current plan is to run CI here and use those numbers for
baseline, then match against my GPUs as a sanity check.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-07-10 18:52:59 -07:00
Philippe Tillet
dadf7a9a50 [TUTORIAL] Faster flash attention; added non-causal (#1917) 2023-07-09 13:38:06 -07:00
oplavsic
64d7b521cf [MFMA] Enabled fused attention forward pass. (#226)
* [MFMA] Activated Fused Attention Forward Pass

The patch contains the following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
   inputs converted to block layouts, outputs returned to MFMA layout

* Use simple module walk instead of pattern rewriter.

* Remove pattern rewriter header.

* Enable basic reduce algorithm for MFMA layout

* Add TODO comment for fused attention backward pass

* Fix bug in fast codegen algorithm for reduce op

* Fix input type bug

* Increase block size to 128 since out of memory issue is not seen on MI210

* Fix block_size error

* Add mfma support in DecomposeDotOperand pattern.
2023-06-16 15:39:08 -05:00
Michael Melesse
2784b804d9 Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023 2023-04-26 12:04:21 -05:00
Michaël Benesty
7d2a4d95c2 [DOCS] fixed num warps / stages in matmul (#1561) 2023-04-21 12:57:26 -07:00
Chenggang Zhao
c9311ef361 [TUTORIALS] Fix rendering issues in the block pointer tutorial (#1530)
Found some rendering issues here:
https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html,
sorry for not checking carefully in the last PR.
2023-04-15 14:27:14 -07:00
Chenggang Zhao
c624778e73 [TUTORIALS] Add tutorial for block pointers (#1519)
This PR contains:
- Several fixes for the matrix multiplication (M and N dimensions may
have out-of-bound access)
- A type check for block-based store
- The tutorial for block pointers
- Fix some formats
2023-04-14 00:40:41 -07:00
Keren Zhou
fdf1c1f2a1 [DOCS] Fix documentation workflow (#1520)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-04-13 13:49:36 -07:00
Philippe Tillet
02e3c18f04 [TESTING] clean up testing.do_bench (#1513) 2023-04-11 20:05:58 -07:00
Rahul Batra
a27b388df5 Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023 2023-04-06 16:18:31 -05:00
Kern Handa
2c0417da96 [DOCS] fixed typo triton.testing.allclose -> torch.allclose in MatMul tutorial (#1460) 2023-03-31 17:06:46 -07:00
Philippe Tillet
123afdf423 [DOCS] fixed typo assert_almost_equal -> assert_allclose in tutorials (#1456) 2023-03-31 11:27:18 -07:00
Chenggang Zhao
1bead327fd [TUTORIALS] Add the missing tutorial: libdevice functions (#1430)
While merging `triton-mlir`, it seems that the libdevice tutorial was
missed. This PR adds it back and modifies it with current interface
`tl.math`.

Also found a bug in `test_core.py`: the `extern_libs` argument should still
pass `libdevice`, or it will fail on my added test. Legacy code didn't
fail because `lib_path` is None and ignored.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-29 19:00:17 -07:00