Commit Graph

355 Commits

Author SHA1 Message Date
Keren Zhou
6a9316e69a [BACKEND] Clean up SCF -> CF conversion (#1234) 2023-02-22 23:49:47 +00:00
Philippe Tillet
0ec277efc5 [OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232)
This shouldn't actually change the behavior of Triton -- only clean things up.
2023-02-22 13:54:55 -08:00
Keren Zhou
123c687ed9 [BACKEND] Rewrite Membar to fit the CF dialect (#1213) 2023-02-19 14:54:33 -08:00
Rohit Santhanam
50e6329d01 Fix LIT tests. 2023-02-18 09:32:42 +00:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Alexander Efimov
fce08c2ebb [Test][LIT] Fix tritongpu_to_hsaco.mlir test
This PR fixes the tritongpu_to_hsaco.mlir test and a minor issue with the tritongpu_to_amdgcn test.
2023-02-16 20:15:52 +01:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Rohit Santhanam
112abb966f Fix test/Conversion/tritongpu_to_llvm.mlir for AMDGPU. 2023-02-11 15:16:50 +00:00
Rohit Santhanam
a2416e0901 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023 2023-02-11 14:48:19 +00:00
rsanthanam-amd
fb62b22364 Merge pull request #103 from binarman/unittest_fix/Conversion/tritongpu_to_llvm.mlir
[Test][Lit] Tritongpu to llvm
2023-02-09 22:57:08 -06:00
Philippe Tillet
2aba985daa [OPTIMIZER] Improved layout simplifications heuristics (#1168) 2023-02-09 20:17:25 -08:00
Alexander Efimov
a58c3ad317 update checks for Conversion/tritongpu_to_llvm.mlir test 2023-02-10 00:05:23 +01:00
Keren Zhou
c61c8a123f [BACKEND] Disallow the CombineSelectMaskedLoad pattern if conditions of select and broadcast are different (#1170) 2023-02-09 18:03:22 -05:00
Alexander Efimov
ba603b8319 review fix: CHECK to PTX 2023-02-09 01:07:57 +01:00
Alexander Efimov
d56258a4c6 review fix: roll back basic_insert_slice_async_v1_multictas 2023-02-09 01:01:52 +01:00
Alexander Efimov
89ca4cb4ba review fix: add COUNT checks 2023-02-09 01:01:28 +01:00
Alexander Efimov
a43f192889 code review fix 2023-02-08 19:17:00 +01:00
Keren Zhou
681d04cf2b [BACKEND] Fix axisInfo analysis for div ops (#1157) 2023-02-07 02:25:23 +00:00
Alexander Efimov
7fb8147d4f [Test] Fix load_store test
This PR fixes the lit-based unit test Conversion/AMDGPU/load_store.mlir.
2023-02-06 20:36:39 +01:00
Keren Zhou
546f2377ae [BACKEND] Get the right operand and result types in forward rematerialization passes (#1152) 2023-02-04 16:34:35 -08:00
Keren Zhou
bde52f9db2 [BACKEND] Fix alignment calculation (#1149)
`getDivisibility` indicates whether the address, in bytes, is divisible by a certain number, so we should convert `#aligned bytes` to `#aligned elements`.
2023-02-03 17:20:23 -08:00
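To make the byte-to-element conversion above concrete, here is a small hedged Python sketch; the function name and the fallback for non-divisible cases are assumptions for illustration, not the actual AxisInfo code.

```python
def divisibility_in_elements(divisibility_in_bytes: int, element_bytes: int) -> int:
    """Turn a divisibility guarantee on a byte address into one on element
    indices, as the commit describes (illustrative sketch only)."""
    if divisibility_in_bytes % element_bytes != 0:
        # The byte alignment does not cover a whole element, so the only
        # safe element-level guarantee is 1 (assumed fallback).
        return 1
    return divisibility_in_bytes // element_bytes

# A pointer aligned to 16 bytes holding 2-byte f16 elements is aligned to 8 elements.
assert divisibility_in_elements(16, 2) == 8
# The same 16-byte alignment with 4-byte f32 elements gives 4-element alignment.
assert divisibility_in_elements(16, 4) == 4
```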
Rohit Santhanam
8cb6ab5b1a Merge remote-tracking branch 'upstream/main' into triton_mlir_IFU_02022023 2023-02-02 22:54:53 +00:00
Alexander Efimov
af58dd90f8 [Test][Lit] Tritongpu to llvm
This PR fixes the lit-based unit test Conversion/tritongpu_to_llvm.mlir for the GCN target.
2023-02-02 23:31:34 +01:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases (see the sketch after this entry):
  - A size-1 tensor is not considered expensive since all threads load the same element.
  - The targetEncoding may expose more vectorization opportunities (more elements per thread on the first dim).

**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
2023-02-01 18:21:15 -08:00
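As a rough illustration of the two rematerialization cases listed in this commit, the following Python sketch restates the heuristic; the function name and inputs are hypothetical, not the pass's actual interface.

```python
def should_rematerialize_across_load(tensor_numel: int,
                                     elems_per_thread_src: int,
                                     elems_per_thread_dst: int) -> bool:
    """Hypothetical restatement of the commit's heuristic:
    - a size-1 tensor is cheap to rematerialize (all threads load the same element);
    - otherwise, only rematerialize if the target encoding gives each thread
      more contiguous elements, exposing wider vectorized accesses."""
    if tensor_numel == 1:
        return True
    return elems_per_thread_dst > elems_per_thread_src
```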
Keren Zhou
1ec39fdf99 [BACKEND] Refactored the MoveConvertOutOfIf conversion to handle scf.if correctly (#1114)
Also removed duplicate code for `simulateBackwardRematerialization`.
2023-02-01 08:49:19 -08:00
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
The previous PR https://github.com/openai/triton/pull/1113 did not account for a node having multiple parents; visiting an instruction before all of its parents violates the semantics of a topological sort.

The fixed implementation exhaustively adds all operations to a candidate subgraph and moves an operation to the "ready" queue once all of its operands have been visited.
2023-01-31 23:41:20 -08:00
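The fix amounts to a readiness-driven topological sort: an operation becomes "ready" only after every operand produced inside the candidate subgraph has been visited, so a node with multiple parents is never emitted early. A minimal Python sketch of that idea (not the actual C++ implementation) looks like this:

```python
from collections import deque

def topo_sort(ops, operands_of):
    """ops: hashable op ids; operands_of(op) -> op ids it depends on.
    Returns the ops ordered so each op appears after all of its operands
    that are themselves in `ops` (Kahn-style readiness queue)."""
    in_graph = set(ops)
    pending = {op: {d for d in operands_of(op) if d in in_graph} for op in ops}
    ready = deque(op for op, deps in pending.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for other, deps in pending.items():
            if op in deps:
                deps.discard(op)
                if not deps:
                    ready.append(other)
    assert len(order) == len(ops), "dependency cycle"
    return order
```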
Philippe Tillet
c4b9d699d2 [FRONTEND][BACKEND] Fixed many bugs (#1122)
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to fix this! But for now the following patches will unblock a number of users.**
- Fixed a frontend codegen issue for If / For / While: emit an error when replaced values' types mismatch.
- Added a "top level" codepath for if statements, which allows users to write patterns to exit early from kernels (e.g., `if cond1: if cond2: return else: ...`); a minimal kernel sketch follows this entry. Added associated codegen in TritonToTritonGPUPass.
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed an issue in random.py: bitcast some values to uint where needed.
- Added support for `Not`
- Fixed nondeterministic compilation issue
2023-01-30 23:22:36 -08:00
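To illustrate the kind of early-exit pattern the "top level" if codepath enables, here is a minimal Triton kernel sketch; it is not taken from the PR or its tests, just an example of returning early from a kernel.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK
    # Top-level if with an early return: programs past the end do no work.
    if block_start >= n_elements:
        return
    offsets = block_start + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(src_ptr + offsets, mask=mask)
    tl.store(dst_ptr + offsets, x, mask=mask)
```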
Rohit Santhanam
2d0ee0fa0f Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023 2023-01-24 03:59:17 +00:00
Keren Zhou
b5d32896b1 [BACKEND] Verify the same operand and result element type for convert_layout (#1081)
And a hotfix for incorrect convert_layout construction in the GPU
combine pass.
2023-01-22 16:59:24 +00:00
Yan Chunwei
88498d104a [BACKEND] DotOp enable ld.v4 in MMAv1 (#1020)
The existing convert-distributed-to-distributed layout logic processes one MMA block at a time, which requires every MMA block to share exactly the same fixed pattern (such as the one described in the [NV PTX
doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)).

For MMAv1, things are different: the MMA block has varying patterns for different shapes and data layouts, as shown below.

![MMAv1 block patterns for different shapes and data layouts](https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png)

This requires all the cell coordinates in DotOp output to be computed.
2023-01-19 09:42:33 -08:00
Philippe Tillet
408d1d7e87 [OPTIMIZER] Improved flash attention forward pass performance (#1075)
- Fixed typo in instruction reordering pass
- Minor additional optimizations for shared memory allocator
- Optimized flash attention tutorial forward pass kernel
2023-01-19 06:46:01 +00:00
Philippe Tillet
660f2e8cce [OPTIMIZER] pipeline and prefetch pass now use a more ptxas-friendly schedule (#1065) 2023-01-17 15:21:19 -08:00
Keren Zhou
3f47e9aa0e [BACKEND] Fix unrealized conversion for fp32 dot (#1051) 2023-01-17 21:55:44 +00:00
rsanthanam-amd
56dfa695d2 Merge pull request #79 from dfukalov/dfukalov/work
[Test] Enable triton-translate build.
2023-01-17 06:01:26 -06:00
Rohit Santhanam
ce8adb92bd Merge remote-tracking branch 'upstream/master' into triton-mlir-IFU-01142023 2023-01-14 19:19:58 +00:00
Yan Chunwei
86003c83dd [Optimizer] Add UpdateMmaForVolta Pass (#1048)
This PR adds an UpdateMmaForVolta pass that updates the MMA encoding for Volta.
Some context is given in https://github.com/openai/triton/pull/1014

# Changes

1. Move the related MMAv1 patterns from the GPUCombine pass to the UpdateMmaForVolta pass.
2. Update both the versionMinor and warpsPerCTA fields for Volta MMA encodings, since they can only be determined after the GPUCombine pass.
3. Move the FixupLoop pattern from Combine.cpp to new Utility.h/.cpp files.
4. Add an ID field (5 bits storing an integer) to versionMinor to help assign a unique ID (on Volta) to each MMA encoding; see the sketch after this entry. The reasons are as follows:
- Currently there is a cyclic dependency between {DotOperand, Slice} and MMA layouts, so we use a map to cluster all the DotOperand, Slice, and MMA layout instances into the same group for updating in bulk.
- When there are multiple DotOps in a module with the same (literally equivalent) MMA, it is possible to get the wrong groups.
- An ID field is used to distinguish the MMAs of different DotOps, so that all the MMA, DotOperand, and Slice layout instances end up in the right groups.
2023-01-14 11:54:19 +08:00
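Point 4 above reserves 5 bits of versionMinor for a per-DotOp ID. As a hedged illustration of that kind of bit packing (the actual bit positions and remaining layout of `MmaEncodingAttr.versionMinor` are assumptions here), one could write:

```python
ID_BITS = 5  # the commit says the ID takes 5 bits; its position is assumed here

def pack_version_minor(base_minor: int, mma_id: int) -> int:
    """Pack a small per-DotOp ID into the low bits of versionMinor (illustrative)."""
    assert 0 <= mma_id < (1 << ID_BITS), "ID must fit in 5 bits"
    return (base_minor << ID_BITS) | mma_id

def unpack_version_minor(packed: int):
    """Recover (base_minor, mma_id) from the packed value."""
    return packed >> ID_BITS, packed & ((1 << ID_BITS) - 1)

assert unpack_version_minor(pack_version_minor(1, 7)) == (1, 7)
```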
Philippe Tillet
259f4c5f7d [OPTIMIZER] Added new optimization passes (#1055)
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some instructions of the
form `convert_layout` into
- ReorderInstructions: this reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
2023-01-13 13:15:53 -08:00
Daniil Fukalov
ce23494cbf [Test] Enable triton-translate build.
And use it in tests instead of the Python module triton.tools.aot.
2023-01-13 20:36:16 +01:00
Philippe Tillet
589a18959e [BACKEND] Make swizzled shared memory pointers compatible with non-blocked distributed layout (#1053)
Notes:
  * Cleaned up implementation
  * Added comments
  * Re-using code between ConvertDistributedToShared and ConvertInsertSliceAsyncOp
2023-01-13 09:14:23 -08:00
Daniil Fukalov
4d4a91ec5f [ROCM] Fix incorrect llvm.bitcast creation.
Before the patch, tt.load and tt.store generated invalid bitcasts like `%f = llvm.bitcast %i : i32 to f16` (the source and destination types of a bitcast must have the same bitwidth).
2023-01-12 14:28:17 +01:00
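The invariant being restored is that `llvm.bitcast` only reinterprets bits and therefore requires equal bitwidths on both sides. A tiny Python check makes the point; the type table is illustrative, not the converter's actual code.

```python
BITWIDTH = {"i1": 1, "i8": 8, "i16": 16, "i32": 32, "i64": 64,
            "f16": 16, "f32": 32, "f64": 64}

def is_legal_bitcast(src_ty: str, dst_ty: str) -> bool:
    # A bitcast reinterprets the same bits, so the widths must match exactly.
    return BITWIDTH[src_ty] == BITWIDTH[dst_ty]

assert not is_legal_bitcast("i32", "f16")  # the invalid pattern emitted before the fix
assert is_legal_bitcast("i16", "f16")      # equal widths: a legal reinterpretation
```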
Keren Zhou
e638cb8060 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-04 15:13:09 +00:00
goostavz
50d4589e9c [Backend] Add value cache in emitting indices calculation and some refinement (#1018)
1. Add an explicit value cache when emitting index calculations (a small sketch follows this entry).
2. Move the index-calculation emission logic into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost incurred by templates. Refer to the discussion in this thread by @LyricZhao:
https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969
2023-01-04 15:11:53 +00:00
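The "explicit value cache" in point 1 is essentially memoization keyed on whatever uniquely determines the emitted indices. A hypothetical Python sketch of that pattern (the key choice and names are assumptions, not the actual C++ code) is:

```python
class IndexCache:
    """Memoize index computations so equivalent requests reuse an earlier
    result instead of re-emitting the same calculation (illustrative sketch)."""

    def __init__(self):
        self._cache = {}

    def get_or_emit(self, layout, shape, emit_fn):
        key = (layout, tuple(shape))
        if key not in self._cache:
            self._cache[key] = emit_fn(layout, shape)
        return self._cache[key]

# Usage: the (potentially expensive) emit_fn runs once per (layout, shape) pair.
cache = IndexCache()
idx_a = cache.get_or_emit("blocked", [128, 64], lambda l, s: f"indices({l}, {s})")
idx_b = cache.get_or_emit("blocked", [128, 64], lambda l, s: f"indices({l}, {s})")
assert idx_a is idx_b
```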
Yan Chunwei
176fa4b4d2 [OPTIMIZER] Update the versionMinor in MMA layout for volta (#1014)
Continue the work https://github.com/openai/triton/pull/990

# Background
The `versionMinor` in MmaEncodingAttr holds some state of DotOp's operands on Volta, but those operands may be modified by some patterns, leaving that state out of date.

This PR corrects the stale state.

# Implementation
It adds three new patterns:

1. `CollectMmaToUpdateForVolta` collects the MmaEncodingAttr instances with stale state and builds a map from them to newly created correct ones.
2. `UpdateMMAVersionMinorForVolta` replaces the ops that produce the stale MmaEncodingAttr instances with new correct ones; currently it supports the following ops:
    a. `convert_layout[X -> mma]`
    b. `arith.constant SplatAttr : !tensor<mma>`
    c. `dot ... : !tensor<mma>`

# Limitation
This PR chooses the mapping approach to bypass the IR-walk complexity caused by the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(isMMAv1Row) values that share the same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.
2023-01-04 15:11:20 +00:00
Rohit Santhanam
7b7ddb7a59 Merge branch 'triton-mlir-IFU' into merge_IFU_to_triton_mlir 2023-01-03 23:37:11 +00:00
Keren Zhou
678b9f53a2 [Backend] Use post-order traversal for liveness numbering (#1027)
Also add tests for `tt.trans`.
2023-01-03 15:11:54 -08:00
goostavz
1d3029faf8 [Backend] Add value cache in emitting indices calculation and some refinement (#1018)
1. Add an explicit value cache when emitting index calculations.
2. Move the index-calculation emission logic into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost incurred by templates. Refer to the discussion in this thread by @LyricZhao:
https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969
2022-12-29 11:19:59 -08:00
Yan Chunwei
2ba74d2729 [OPTIMIZER] Update the versionMinor in MMA layout for volta (#1014)
Continue the work https://github.com/openai/triton/pull/990

# Background
The `versionMinor` in MmaEncodingAttr holds some state of DotOp's operands on Volta, but those operands may be modified by some patterns, leaving that state out of date.

This PR corrects the stale state.

# Implementation
It adds three new patterns:

1. `CollectMmaToUpdateForVolta` collects the MmaEncodingAttr instances with stale state and builds a map from them to newly created correct ones.
2. `UpdateMMAVersionMinorForVolta` replaces the ops that produce the stale MmaEncodingAttr instances with new correct ones; currently it supports the following ops:
    a. `convert_layout[X -> mma]`
    b. `arith.constant SplatAttr : !tensor<mma>`
    c. `dot ... : !tensor<mma>`

# Limitation
This PR chooses the mapping approach to bypass the IR-walk complexity caused by the circular dependency between dot_operand[parent] and mma.
We use the MmaEncodingAttr instance as the mapping key, but there might be multiple DotOps holding different DotOperand(isMMAv1Row) values that share the same wrong MmaEncodingAttr instance.
To make each DotOp's (wrong) MmaEncodingAttr unique, we might need to add an ID field to MmaEncodingAttr.
2022-12-28 12:24:01 +08:00
Michael Melesse
41578a63d2 Merge remote-tracking branch 'upstream/triton-mlir' into triton-mlir-IFU 2022-12-21 12:53:03 -06:00
Philippe Tillet
20100a7254 Merge triton-mlir branch - Complete rewrite of the backend from scratch (#1004)
This PR merges the `triton-mlir` branch, in which we have been quietly
rewriting the Triton backend from scratch to increase maintainability,
stability and ultimately performance. Changes to the runtime are
minimal, and this new version aims to remain backward-compatible with
the previous commit. The legacy backend is now officially deprecated,
but can still be accessed via the `legacy-backend` tag.

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Yan Da <dyanab@connect.ust.hk>
Co-authored-by: Jun Yang <yangjunpro@gmail.com>
Co-authored-by: Ian Bearman <ianb@microsoft.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Co-authored-by: Qingyi Liu <qingyil@nvidia.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
Co-authored-by: Chenggang Zhao <lyricz@yeah.net>
Co-authored-by: ben-zhang-609 <benzh609@gmail.com>
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2022-12-21 01:30:50 -08:00