github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Lixun Zhang	821e75a2b0	Improve FA fwd kernel with causal=True (#356 ) * Attempt to absorb upstream's changes to improve causal=True * Add autotuner * Optimize for AMD MI250 - add pre_load_v as a tuning parameter - do not define N_CTX as constexpr - perform the second dot before sum - remove qk_scale out of the inner loop - add more configs in the autotuner Note that bwd kernel is disabled for now. This is because we enabled autotuning and grid becomes a function. So ctx.grid[0] no longer works. * Enable bwd kernel	2023-10-12 12:34:27 -05:00
Shucai Xiao	99fa2e4237	add tutorial group gemm example (#343 ) * [DOCS] Add a tutorial example of grouped gemm (#2326) Co-authored-by: Bin Fan <binf@nvidia.com>	2023-10-11 15:13:17 -05:00
oplavsic	e801638b40	Add waves_per_eu as kernel parameter (#319 ) * Add waves_per_eu as kernel parameter * Fix failing tests * Add default value for waves_per_eu for ttgir_to_llir function * Remove aot.py	2023-10-06 12:08:34 -05:00
oplavsic	6a173eab8a	Remove redundant fp32->fp16 conversion in FA (#349 )	2023-10-04 14:10:07 -05:00
Jason Furmanek	e5d7bb4fae	Initial commit to resolve merge conflicts rename tl.float8e4 to tl.float8e4nv to align with upstream ROCM IFU: Fix python arch issues ROCM IFU: Fix kernel launcher ROCM IFU: Fix merge conflicts fix debug build Set correct threadsPerCTA	2023-10-03 04:04:26 +00:00
Jason Furmanek	74fd8e9754	Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2 Conflicts: .gitignore bin/triton-translate.cpp include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir test/TritonGPU/coalesce.mlir unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt	2023-10-02 18:01:04 +00:00
SJW	287b0adcc2	[Stream] Fixed bug in stream-pipeline for FA (#345 ) * [Stream] Fixed bug in stream-pipeline for FA * updated gemm tutorial for num_stages=0 * * updated configs	2023-09-29 20:20:55 -05:00
Ognjen Plavsic	a8574be74d	[FA] Fix bug from IFU merge This commit reverts num_warps back to 4 for FA forward pass with d=128.	2023-09-21 10:48:41 -05:00
Alexander Efimov	b25557ad5e	[FA] Upstream FA qk initialization (#328 ) This PR replaces qk matrix initialization with upstream version	2023-09-14 00:34:14 -05:00
Shucai Xiao	fb3f2d6feb	refine gemm tuning scripts (#309 ) * refine the gemm tuning scripts to reduce tuning space and better perf numbers * added code to support tuning in full tuning space * add a function to get best tuning config * refine the matmul tutorial example to print out best tuning config for each input * added even_k to gemm kernel heuristic for better performance * address review comments	2023-09-07 08:09:11 -05:00
Jason Furmanek	df5c263a19	Fix merge conflicts	2023-09-01 04:01:32 +00:00
Jason Furmanek	3eaeb89d18	Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2 Conflicts: include/triton/Dialect/Triton/Transforms/Passes.h include/triton/Dialect/TritonGPU/IR/Dialect.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/triton/compiler/compiler.py python/triton/ops/flash_attention.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/triton/tools/aot.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir test/Target/tritongpu_to_llvmir.mlir test/Target/tritongpu_to_llvmir_noinline.mlir	2023-09-01 03:25:33 +00:00
Vinayak Gokhale	9cdf3a58c3	Enable split kernel in bwd pass (#303 ) * Add fwd and bwd v2 Changes are largely from upstream. * Split bwd kernel in dq and dk+dv Only adds the split kernels. They are not enabled yet. * Pull scalar multiplies out of the loop * Enable split kernel for bwd pass * Put back P_SEQ=128 in fwd test Not used for bwd test * Address review comments * Address comments Conditionally set causal/ splitkernel to False for bwd. * Add block pointer semantics to bwd pass This significantly increases perf for bwd, similar to fwd.	2023-08-29 13:51:29 -05:00
Zahi Moudallal	120ce0a5bf	[DOCS] Fixing docs (#2175 )	2023-08-24 15:58:59 -07:00
jayfurmanek	ff7e707f87	Enable usage of block pointer semantics for AMD gpus (#301 ) * Enable usage of block pointer semantics for AMD gpus This commit enables usage of block pointer semantics by enabling rewrite_tensor_pointer_pass that rewrites block pointer loads/stores to legacy loads/stores. * Update FA fwd in tutorial to use the block pointers * use 90 compute capability for amd gpus in python/triton/compiler/compiler.py Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> --------- Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com> Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com> Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>	2023-08-24 13:05:12 -05:00
Keren Zhou	6a65c894fe	[TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153 )	2023-08-22 18:02:26 -07:00
ivanyinwz	ec801ce18e	[BACKEND] Optimize performance for f16 epilogue with TMA store (#2135 ) 1. Optimize the conversion and packing for 2xf32 -> 2xf16. 2. Split TMA store block into multiple slices of size 64x64. 3. Distribute the TMA store to all the warps. 4. Fix some naming issue.	2023-08-21 12:44:11 -07:00
Shucai Xiao	f7cf2c032b	Changes of the tutorial matmul scripts to get good performance (#297 ) * simple changes of the matmul scripts to get good performance. Specification reason for the performance boost needs further investigation and are tracked * fix review comments * change the num_warps in the autotuning config for hip to workaround an error and change the rtol so correctness check passed	2023-08-17 13:24:49 -05:00
danny.jang	b878c8826f	[DOCS] fix generating document failures (#2096 )	2023-08-13 06:53:34 -07:00
Keren Zhou	5162871c6c	[TUTORIAL] flash attention d128 improvement (#2074 ) `ptxas` is able to automatically generate a call instruction to "call" the loop body so that instructions are better scheduled.	2023-08-12 00:31:48 +00:00
Zahi Moudallal	4d373aa103	[BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047 )	2023-08-10 15:52:37 -07:00
Alexander Efimov	ff95dddf18	[tutorial] Amd specific tuning configs (#287 ) This pr adds amd specific tuning configs for matmul tutorial with num_stages == 1.	2023-08-08 20:11:30 +02:00
BoxiangW	f21a053ee6	[TUTORIALS] support flash attention 2 with KV's sequence length longer than Q's (#2033 ) Implemented this situation with and without causal mask. My implementation with causal mask looks like: 111000 111100 111110 Where only the right upper triangle part will be masked. I added `P_SEQ` for the notation of extra sequence length for KV. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:57:44 -07:00
ben-zhang-609	31e79aa384	[TESTS] remove get_proper_err, get_variant_golden (#2039 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:52:55 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Vinayak Gokhale	f1063bb33c	Enable backward pass in FA tutorial test (#282 ) Enabled the backward pass in the fused attention tutorial. The tolerance when comparing to the naive implementation had to be changed. The block size is forced to be 64x64 due to the 64 KiB LDS. Default is block 128 for A100's larger SMEM. This creates differences in order of computation and results in a larger gap between the naive and FA implementations.	2023-08-03 10:12:46 -05:00
Phil Tillet	db695c093f	[TUTORIALS] fix format	2023-07-25 18:16:39 -07:00
janEbert	62a8afa403	[TUTORIALS] Support FlashAttention-2 reference (#1984 ) Uses FlashAttention-2 if available, otherwise acts as before (if FlashAttention-1 is available, that is used, otherwise the FlashAttention reference benchmark is not run). I decided to keep the same name for the imported function, but feel free to make me change that.	2023-07-24 13:54:01 -07:00
Izzy Putterman	de6f053c0f	[TRITON][OPS] add Flash Attention v2 to Ops (#1970 ) I also dropped the do_scaled as it is no longer needed (no scaling done to the do in v2). --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-07-23 14:07:15 -07:00
Phil Tillet	cfce82d715	[TUTORIALS] Flash Attention tutorial now properly tries fwd, bwd, causal, non-causal	2023-07-19 21:56:29 -07:00
Philippe Tillet	c46a842b6f	[TUTORIAL] more attention cleanup (#1958 )	2023-07-18 12:36:15 -07:00
Philippe Tillet	9e3e10c5ed	[OPTIMIZER][TUTORIAL] flash attention v2 (#1952 )	2023-07-17 12:23:02 -07:00
Philippe Tillet	8207eabd7b	[FRONTEND][OPTIMIZER] small perf improvements (#1945 )	2023-07-14 15:11:36 -07:00
oplavsic	d6e51fd221	[FA OPTIMIZATION] Keep results of FA dot operations in registers (#247 ) * [WIP][FA OPTIMIZATION] Optimize chain dot This commit optimizes chain dot operation by keeping results of the first dot operation in registers. * [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers * Move operand swapping in ttgir -> llir lowering phase * Refactor emitMfmaOffsetForCTA function to be more readable * Fix accidental change in 06-fused-attention.py * Address review comments * Fix rebase errors	2023-07-12 15:25:55 -05:00
Philippe Tillet	bf5acf46e2	[OPS] improved pointer arithmetic in attention (#1926 ) this provides an additional 3-4% speed-up in non-causal attention, which now tops at 155TFLOPS	2023-07-11 12:04:00 -07:00
Phil Tillet	041f1144e8	[DOCS] fixed flash_attn causal argument in tutorial	2023-07-11 09:28:20 -07:00
Izzy Putterman	d39d78fa08	[OPS] Add more perf-tests, new features to FA (#1849 ) Adding new tests across the board for float32, bfloat16, non-powers-of-2 shapes (to test masks), and tests on sequence parallel for atomics. This also adds the sequence parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py. I am not sure about the best way to grab the baseline benchmarking numbers. I have access to V100s and A100s, but I saw on the tests it mentions " # A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". Current plan is to run CI here and use those numbers for baseline, then match against my GPUs as a sanity check. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-07-10 18:52:59 -07:00
Philippe Tillet	dadf7a9a50	[TUTORIAL] Faster flash attention; added non-causal (#1917 )	2023-07-09 13:38:06 -07:00
oplavsic	64d7b521cf	[MFMA] Enabled fused attention forward pass. (#226 ) * [MFMA] Activated Fused Attention Forward Pass Patch contains following changes: 1) make_range operator now works with MFMA layout. 2) Reduce operation is forced to run in block layout: inputs converted to block layouts, outputs returned to MFMA layout * Use simple module walk instead of pattern rewriter. * Remove pattern rewriter header. * Enable basic reduce algorithm for MFMA layout * Add TODO comment for fused attention backward pass * Fix bug in fast codegen algorithm for reduce op * Fix input type bug * Increase block size to 128 since out of memory issue is not seen on MI210 * Fix block_size error * Add mfma support in DecomposeDotOperand pattern.	2023-06-16 15:39:08 -05:00
Michael Melesse	2784b804d9	Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023	2023-04-26 12:04:21 -05:00
Michaël Benesty	7d2a4d95c2	[DOCS] fixed num warps / stages in matmul (#1561 )	2023-04-21 12:57:26 -07:00
Chenggang Zhao	c9311ef361	[TUTORIALS] Fix rendering issues in the block pointer tutorial (#1530 ) Found some rendering issues here: https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html, sorry for not checking carefully in the last PR.	2023-04-15 14:27:14 -07:00
Chenggang Zhao	c624778e73	[TUTORIALS] Add tutorial for block pointers (#1519 ) This PR contains: - Several fixes for the matrix multiplication (M and N dimensions may have out-of-bound access) - A type check for block-based store - The tutorial for block pointers - Fix some formats	2023-04-14 00:40:41 -07:00
Keren Zhou	fdf1c1f2a1	[DOCS] Fix documentation workflow (#1520 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-04-13 13:49:36 -07:00
Philippe Tillet	02e3c18f04	[TESTING] clean up `testing.do_bench` (#1513 )	2023-04-11 20:05:58 -07:00
Rahul Batra	a27b388df5	Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023	2023-04-06 16:18:31 -05:00
Kern Handa	2c0417da96	[DOCS] fixed typo `triton.testing.allclose` -> `torch.allclose` in MatMul tutorial (#1460 )	2023-03-31 17:06:46 -07:00
Philippe Tillet	123afdf423	[DOCS] fixed typo `assert_almost_equal` -> `assert_allclose` in tutorials (#1456 )	2023-03-31 11:27:18 -07:00
Chenggang Zhao	1bead327fd	[TUTORIALS] Add the missing tutorial: libdevice functions (#1430 ) While merging `triton-mlir`, it seems that the libdevice tutorial was missed. This PR adds it back and modifies it with current interface `tl.math`. Also found a bug in `test_core.py`, `extern_libs` arguments should still pass `libdevice`. Or it will fail on my added test. Legacy code didn't fail because `lib_path` is none and ignored. --------- Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Philippe Tillet <phil@openai.com>	2023-03-29 19:00:17 -07:00

139 Commits