github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
BoxiangW	f21a053ee6	[TUTORIALS] support flash attention 2 with KV's sequence length longer than Q's (#2033 ) Implemented this situation with and without causal mask. My implementation with causal mask looks like: 111000 111100 111110 Where only the right upper triangle part will be masked. I added `P_SEQ` for the notation of extra sequence length for KV. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:57:44 -07:00
ben-zhang-609	31e79aa384	[TESTS] remove get_proper_err, get_variant_golden (#2039 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:52:55 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Vinayak Gokhale	f1063bb33c	Enable backward pass in FA tutorial test (#282 ) Enabled the backward pass in the fused attention tutorial. The tolerance when comparing to the naive implementation had to be changed. The block size is forced to be 64x64 due to the 64 KiB LDS. Default is block 128 for A100's larger SMEM. This creates differences in the order of computation and results in a larger gap between the naive and FA implementations.	2023-08-03 10:12:46 -05:00
Phil Tillet	db695c093f	[TUTORIALS] fix format	2023-07-25 18:16:39 -07:00
janEbert	62a8afa403	[TUTORIALS] Support FlashAttention-2 reference (#1984 ) Uses FlashAttention-2 if available, otherwise acts as before (if FlashAttention-1 is available, that is used, otherwise the FlashAttention reference benchmark is not run). I decided to keep the same name for the imported function, but feel free to make me change that.	2023-07-24 13:54:01 -07:00
Izzy Putterman	de6f053c0f	[TRITON][OPS] add Flash Attention v2 to Ops (#1970 ) I also dropped the do_scaled as it is no longer needed (no scaling done to the do in v2). --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-07-23 14:07:15 -07:00
Phil Tillet	cfce82d715	[TUTORIALS] Flash Attention tutorial now properly tries fwd, bwd, causal, non-causal	2023-07-19 21:56:29 -07:00
Philippe Tillet	c46a842b6f	[TUTORIAL] more attention cleanup (#1958 )	2023-07-18 12:36:15 -07:00
Philippe Tillet	9e3e10c5ed	[OPTIMIZER][TUTORIAL] flash attention v2 (#1952 )	2023-07-17 12:23:02 -07:00
Philippe Tillet	8207eabd7b	[FRONTEND][OPTIMIZER] small perf improvements (#1945 )	2023-07-14 15:11:36 -07:00
oplavsic	d6e51fd221	[FA OPTIMIZATION] Keep results of FA dot operations in registers (#247 ) * [WIP][FA OPTIMIZATION] Optimize chain dot This commit optimizes chain dot operation by keeping results of the first dot operation in registers. * [FA OPTIMIZATION] Enable lowering pipeline for keeping result of chain dot in registers * Move operand swapping in ttgir -> llir lowering phase * Refactor emitMfmaOffsetForCTA function to be more readable * Fix accidental change in 06-fused-attention.py * Address review comments * Fix rebase errors	2023-07-12 15:25:55 -05:00
Philippe Tillet	bf5acf46e2	[OPS] improved pointer arithmetic in attention (#1926 ) this provides an additional 3-4% speed-up in non-causal attention, which now tops at 155TFLOPS	2023-07-11 12:04:00 -07:00
Phil Tillet	041f1144e8	[DOCS] fixed flash_attn causal argument in tutorial	2023-07-11 09:28:20 -07:00
Izzy Putterman	d39d78fa08	[OPS] Add more perf-tests, new features to FA (#1849 ) Adding new tests across the board for float32, bfloat16, non-powers-of-2 shapes (to test masks), and tests on sequence parallel for atomics. This also adds the sequence parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py. I am not sure about the best way to grab the baseline benchmarking numbers. I have access to V100s and A100s, but I saw on the tests it mentions " # A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". Current plan is to run CI here and use those numbers for baseline, then match against my GPUs as a sanity check. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-07-10 18:52:59 -07:00
Philippe Tillet	dadf7a9a50	[TUTORIAL] Faster flash attention; added non-causal (#1917 )	2023-07-09 13:38:06 -07:00
oplavsic	64d7b521cf	[MFMA] Enabled fused attention forward pass. (#226 ) * [MFMA] Activated Fused Attention Forward Pass Patch contains following changes: 1) make_range operator now works with MFMA layout. 2) Reduce operation is forced to run in block layout: inputs converted to block layouts, outputs returned to MFMA layout * Use simple module walk instead of pattern rewriter. * Remove pattern rewriter header. * Enable basic reduce algorithm for MFMA layout * Add TODO comment for fused attention backward pass * Fix bug in fast codegen algorithm for reduce op * Fix input type bug * Increase block size to 128 since out of memory issue is not seen on MI210 * Fix block_size error * Add mfma support in DecomposeDotOperand pattern.	2023-06-16 15:39:08 -05:00
Michael Melesse	2784b804d9	Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023	2023-04-26 12:04:21 -05:00
Michaël Benesty	7d2a4d95c2	[DOCS] fixed num warps / stages in matmul (#1561 )	2023-04-21 12:57:26 -07:00
Chenggang Zhao	c9311ef361	[TUTORIALS] Fix rendering issues in the block pointer tutorial (#1530 ) Found some rendering issues here: https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html, sorry for not checking carefully in the last PR.	2023-04-15 14:27:14 -07:00
Chenggang Zhao	c624778e73	[TUTORIALS] Add tutorial for block pointers (#1519 ) This PR contains: - Several fixes for the matrix multiplication (M and N dimensions may have out-of-bound access) - A type check for block-based store - The tutorial for block pointers - Fix some formats	2023-04-14 00:40:41 -07:00
Keren Zhou	fdf1c1f2a1	[DOCS] Fix documentation workflow (#1520 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-04-13 13:49:36 -07:00
Philippe Tillet	02e3c18f04	[TESTING] clean up `testing.do_bench` (#1513 )	2023-04-11 20:05:58 -07:00
Rahul Batra	a27b388df5	Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023	2023-04-06 16:18:31 -05:00
Kern Handa	2c0417da96	[DOCS] fixed typo `triton.testing.allclose` -> `torch.allclose` in MatMul tutorial (#1460 )	2023-03-31 17:06:46 -07:00
Philippe Tillet	123afdf423	[DOCS] fixed typo `assert_almost_equal` -> `assert_allclose` in tutorials (#1456 )	2023-03-31 11:27:18 -07:00
Chenggang Zhao	1bead327fd	[TUTORIALS] Add the missing tutorial: libdevice functions (#1430 ) While merging `triton-mlir`, it seems that the libdevice tutorial was missed. This PR adds it back and modifies it with current interface `tl.math`. Also found a bug in `test_core.py`, `extern_libs` arguments should still pass `libdevice`. Or it will fail on my added test. Legacy code didn't fail because `lib_path` is none and ignored. --------- Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Philippe Tillet <phil@openai.com>	2023-03-29 19:00:17 -07:00
Rohit Santhanam	de3f543624	Fix layernorm plot name.	2023-03-28 05:36:51 +00:00
Rohit Santhanam	32986cce34	Print all softmax tutorial benchmark rows.	2023-03-28 04:45:15 +00:00
Philippe Tillet	46672772b4	[FORMAT] autopep8 now uses max-line-length=88 (#1410 )	2023-03-25 15:46:50 -07:00
Xuehai Pan	5b36cb48ad	[CI][TEST] update `pre-commit` hooks and use `pre-commit` for style tests in CI (#1409 ) Ref issue: - #1408 Changes: - Add `.editorconfig` - Add `pre-commit-hooks`: ```yaml - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.4.0 hooks: - id: check-symlinks - id: destroyed-symlinks - id: trailing-whitespace - id: end-of-file-fixer - id: check-yaml - id: check-toml - id: check-ast - id: check-added-large-files - id: check-merge-conflict - id: check-executables-have-shebangs - id: check-shebang-scripts-are-executable - id: detect-private-key - id: debug-statements ``` - Add `flake8` to `pre-commit` config and add `.flake8` file - Use `pre-commit` for style tests in CI - Run `pre-commit` and fix existing violations: - fix trailing spaces - fix end-of-files - fix mod file mode with `chmod -x` - run `autopep8` on existing code - fix `flake8` violations	2023-03-25 14:52:16 -07:00
Rohit Santhanam	a84b4883e6	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023	2023-03-19 13:46:50 +00:00
Berke Kocaoğlu	ba91f39dbf	[DOC] Fix syntax errors, typos, formatting; increase consistency (#1357 ) This PR; - Fixes syntax errors like `.type values: dict[str, Callable[[list[Any]], Any]]` to `:type values: dict[str, Callable[[list[Any]], Any]]`, - Fixes typos, - Fixes formatting like `k ++` to ` k++`, - Increases consistency (e.g. by transforming the minority `cd dir/` to the majority `cd dir`).	2023-03-16 15:32:02 -07:00
Rohit Santhanam	6b35506291	Some warp64 related fixes. Reduce the num_warps for layernorm and softmax. Fix some lit unit tests.	2023-03-07 16:15:04 +00:00
Philippe Tillet	3db55c5f94	[OPTIMIZER][BACKEND] Some backend and optimization passes clean-up (#1284 ) * Cleaned up pipeline pass. Now works when there are element-wise ops between the load and the dot * Made `splat` compatible with variables that have DotOperandLayout * Moves rematerialization utils to separate Transforms/Utility.cpp file.	2023-03-06 17:17:59 -08:00
Philippe Tillet	fa0fbc937f	[FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261 ) * Frontend: - `int` kernel arguments are always signed - Loop induction variable is now determined by integer promotion on lb/ub/step * Optimizer: - Added new ExtractSliceOp that enforces 32-bit offsets * Backend: - Use 64-bit indices when lowering functions and control flow - Removed `idx_val` macro and replaced it with `i32_val` - Cleaned up comments - Added new ArithToIndex pass to make sure operations on indices are done with the `index` dialect, that gets converted to LLVM separately using a 64-bit target	2023-03-01 23:09:48 -08:00
Rohit Santhanam	cd9ae1cd36	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023	2023-02-23 21:41:54 +00:00
Philippe Tillet	0ec277efc5	[OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232 ) This shouldn't actually change the behavior of Triton -- only clean things up.	2023-02-22 13:54:55 -08:00
Rohit Santhanam	841784d1e3	Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head	2023-02-18 09:25:20 +00:00
Horace He	f21e76affe	[TUTORIALS] changed for loop to iterate by 1 in matmuls (#1198 ) For the new MLIR backend, this appears to increase matmul perf significantly in many cases.	2023-02-16 03:44:42 +00:00
Philippe Tillet	8bca84ce3d	[OPTIMIZER] Bugfix in Combine.cpp ; Added `trans` support in Pipeline.cpp (#1174 )	2023-02-14 13:36:44 -08:00
Yen-Chen Lin	1ea08be168	[TUTORIALS] Add description for 05-layer-norm.py (#1178 ) - Add text description and equations for the tutorial. - Improve the code readability by changing variable names to align them with the equation. The actual code logic is not changed. This is a follow-up of #510. Let me know if a preview HTML is helpful for the review, I can add a link to that too.	2023-02-13 08:47:35 +00:00
Rohit Santhanam	bafbe654a2	Change layernorm tutorial unit test and benchmark to run forward pass.	2023-02-09 18:59:32 +00:00
Rohit Santhanam	2d0ee0fa0f	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023	2023-01-24 03:59:17 +00:00
Rohit Santhanam	0e5ebc37a5	Enable layernorm backward pass.	2023-01-23 14:11:09 +00:00
Philippe Tillet	408d1d7e87	[OPTIMIZER] Improved flash attention forward pass performance (#1075 ) - Fixed typo in instruction reordering pass - Minor additional optimizations for shared memory allocator - Optimized flash attention tutorial forward pass kernel	2023-01-19 06:46:01 +00:00
Rohit Santhanam	78ee875298	Add AMDGPU specific tweaks for performance and functionality. Avoid layernorm backward pass due to lack of full support for atomics.	2023-01-16 18:13:11 +00:00
Philippe Tillet	259f4c5f7d	[OPTIMIZER] Added new optimization passes (#1055 ) This PR adds a couple of optimization passes that should substantially improve the performance of Triton on fused attention kernels: - DecomposeConversionsPass: This decomposes some instructions of the form `convert_layout` into - ReorderInstructions: this reorders instructions in a way that is more amenable to good code generation from `ptxas`.	2023-01-13 13:15:53 -08:00
Philippe Tillet	20100a7254	Merge `triton-mlir` branch - Complete rewrite of the backend from scratch (#1004 ) This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag. Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Yan Chunwei <yanchunwei@outlook.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com> Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com> Co-authored-by: Yan Da <dyanab@connect.ust.hk> Co-authored-by: Jun Yang <yangjunpro@gmail.com> Co-authored-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Jason Ansel <jansel@jansel.net> Co-authored-by: Qingyi Liu <qingyil@nvidia.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com> Co-authored-by: Chenggang Zhao <lyricz@yeah.net> Co-authored-by: ben-zhang-609 <benzh609@gmail.com> Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-21 01:30:50 -08:00
Chenggang Zhao	f16138d447	[Frontend] Interface fixes for libdevice (#830 ) - Unifying several interfaces with different types to a single one, e.g. `fsub_ru` and `dsub_ru` -> `sub_ru`; - Minor bug fix: `fast_pow` is incorrectly classified into the `pow` interface, of which arguments are the same as `powf`; - Explicit interfaces for casting functions, e.g. decoupling `ll2float_ru` to `ll2float_ru` and `ull2float_ru`; - Removing interfaces that are not in NVIDIA's official documents, e.g. `fmaf_ieee_rn`, which is confusing together with `fmaf_rn`. Note that this PR for the master branch is different from #829, which is for the MLIR branch.	2022-11-01 10:51:58 -07:00


166 Commits
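
Note on commit f21a053ee6: its message describes a causal mask for the case where the KV sequence is longer than the Q sequence, drawn as the rows 111000, 111100, 111110. The following is a minimal plain-Python sketch that reproduces that picture; it is not the tutorial's kernel code. The function name `causal_mask_rows` and the parameter `offset` are hypothetical, and the log does not spell out how the commit's `P_SEQ` maps onto this diagonal shift, so the value `offset=2` is chosen only because it matches the rows shown in the message.

```python
def causal_mask_rows(n_q, n_kv, offset):
    """Build a shifted causal mask as a list of strings.

    '1' marks a key position the query may attend to, '0' a masked one.
    `offset` (hypothetical name) shifts the diagonal to the right, so
    query row i sees key columns 0 .. i + offset inclusive.
    """
    rows = []
    for i in range(n_q):
        row = "".join("1" if j <= i + offset else "0" for j in range(n_kv))
        rows.append(row)
    return rows


# Three queries against six keys, diagonal shifted by two columns:
# this reproduces the pattern quoted in the commit message.
rows = causal_mask_rows(n_q=3, n_kv=6, offset=2)
print("\n".join(rows))

# With equal lengths and no shift, the usual lower-triangular mask appears.
square = causal_mask_rows(n_q=3, n_kv=3, offset=0)
print("\n".join(square))
```

In both cases only the upper-right triangle is masked, which is the property the commit message calls out.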