This PR adds an UpdateMmaForVolta pass to help update the MMA encoding for
Volta.
Some context is given in https://github.com/openai/triton/pull/1014
# Changes
1. Moving the related MMAv1 patterns from the GPUCombine pass to the new
UpdateMmaForVolta pass.
2. Updating both the versionMinor and warpsPerCTA fields for Volta MMA
encodings, since they can only be determined after the GPUCombine pass.
3. Moving the FixupLoop pattern from Combine.cpp to new Utility.h/.cpp files.
4. Adding an ID field (a 5-bit integer) to versionMinor to help assign a
unique ID (on Volta) to each MMA encoding. The reasoning is as follows:
- Currently, there is a cyclic dependency between {DotOperand, Slice} and
MMA layouts; we use a map to cluster all the DotOperand, Slice, and MMA
layout instances into the same group for further updating in bulk.
- When a module contains multiple DotOps with the same (literally
equivalent) MMA encoding, it is possible to build the wrong groups.
- An ID field is used to distinguish the MMA encodings coming from
different DotOps, so that all the MMA, DotOperand, and Slice layout
instances land in the right groups; see the packing sketch below.
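A minimal sketch of how a 5-bit ID could be packed into versionMinor; the bit offset and helper names here are illustrative assumptions, not the actual Triton encoding:
```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: the low bits keep the real minor version and the next
// 5 bits hold the per-DotOp ID. kIdShift is a placeholder value.
constexpr unsigned kIdBits = 5;
constexpr unsigned kIdShift = 8;

uint32_t packVersionMinor(uint32_t minor, uint32_t id) {
  assert(id < (1u << kIdBits) && "ID must fit in 5 bits");
  assert(minor < (1u << kIdShift) && "minor must not overlap the ID field");
  return minor | (id << kIdShift);
}

uint32_t getMinor(uint32_t versionMinor) {
  return versionMinor & ((1u << kIdShift) - 1);
}

uint32_t getId(uint32_t versionMinor) {
  return (versionMinor >> kIdShift) & ((1u << kIdBits) - 1);
}
```
With this scheme, two structurally identical MMA encodings created for different DotOps compare unequal, which keeps their layout groups separate.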
This PR adds a couple of optimization passes that should substantially
improve the performance of Triton on fused attention kernels:
- DecomposeConversionsPass: This decomposes some `convert_layout`
instructions into cheaper equivalent sequences of conversions.
- ReorderInstructions: This reorders instructions in a way that is more
amenable to good code generation from `ptxas`.
Most notably, this PR:
- changes the traits (and assembly format) of `addptr` so it can handle offsets of arbitrary integer width.
- adds support for `cat`
TODO:
- Add more cases
- Currently, we just set vec to 4 to make the basic cases pass
Issue:
- The vec in the shared layout is different from that on the master branch:
- when vec=1, we hit a CUDA misalignment error; this doesn't work on the
master branch either.
- when vec is set to the same value as on the master branch, the MMA
works.
1. Add missing barriers and revert the previous temporary solution
2. Extract the `run` method from the membar analysis, because the
analysis should have two phases: construction, which doesn't modify any
IR, and modification, which inserts the barrier ops. Hopefully this makes
the use of membar clearer; a sketch of the split is below.
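A minimal sketch of that two-phase interface, with hypothetical member names rather than the real membar code:
```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"

// Illustrative two-phase interface; names are assumptions, not Triton's.
class MembarAnalysis {
public:
  // Phase 1 (construction): walk the IR and record, without modifying
  // anything, the points that need a barrier.
  void analyze(mlir::Operation *root);

  // Phase 2 (modification): insert barrier ops at the recorded points.
  void insertBarriers(mlir::OpBuilder &builder);

private:
  // Operations that need a barrier inserted before them.
  llvm::SmallVector<mlir::Operation *, 8> barrierPoints;
};
```
Separating the phases lets callers run the analysis on its own (e.g., for inspection) without mutating the module.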
1. Improve the pipeline pass's comments.
2. Decompose `insert_slice_async` when the load vector size is not supported.
3. Add a test that makes our current gemm code fail.
Copy my comments here:
There's a knob that may cause a performance regression when the
decomposition has been performed. We should remove this knob once we have
a thorough analysis of async waits. Currently, we decompose
`insert_slice_async` into `load` and `insert_slice` without knowing which
`async_wait` is responsible for the `insert_slice_async`. To guarantee
correctness, we blindly set the `async_wait` to wait for all async ops if
any `insert_slice_async` has been decomposed; a sketch of this
conservative fallback follows the list below.
There are two options to improve this:
1. We can perform a dataflow analysis to find the `async_wait` that is
responsible for the `insert_slice_async` in the backend.
2. We can modify the pipeline to perform the decomposition before the
`async_wait` is inserted. However, this is also risky because we don't
know the correct vectorized shape yet in the pipeline pass. Making the
pipeline pass aware of the vectorization could introduce additional
dependencies on the AxisInfoAnalysis and the Coalesce analysis.
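A minimal sketch of the conservative fallback described above; `isVectorSizeSupported` and `rewriteToLoadPlusInsertSlice` are hypothetical helpers, not the real backend code:
```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"

using namespace mlir;

// Hypothetical helpers standing in for the real rewrite logic.
bool isVectorSizeSupported(triton::gpu::InsertSliceAsyncOp op);
void rewriteToLoadPlusInsertSlice(triton::gpu::InsertSliceAsyncOp op);

void decomposeInsertSliceAsync(ModuleOp mod) {
  bool decomposedAny = false;
  mod.walk([&](triton::gpu::InsertSliceAsyncOp op) {
    if (!isVectorSizeSupported(op)) {
      rewriteToLoadPlusInsertSlice(op);
      decomposedAny = true;
    }
  });
  // Without knowing which async_wait guarded a decomposed op, blindly
  // make every async_wait wait for all outstanding async ops (num = 0).
  if (decomposedAny)
    mod.walk([](triton::gpu::AsyncWaitOp wait) {
      OpBuilder b(wait);
      wait->setAttr("num", b.getI32IntegerAttr(0));
    });
}
```
Option 1 above would replace the blanket `num = 0` with the precise wait count recovered by the dataflow analysis.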
This avoids using GCNBuilder, which has the following issues:
1. Performance problems from emitting too many barriers.
2. The impossibility of supporting int8 and uint8, because the minimum
AMDGPU register width is 16 bits.
More unit tests were included in test_core_amd.py.
The change is based on Rohit's commit 9bf807f310.
Used the `AMDGCN_HSACO_DUMP_PATH=<dumpfile>` setting to save a copy of
the temporarily generated HSACO file so that the test can run
llvm-readobj on that dump and check its contents.
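A minimal sketch of what such a dump hook could look like on the C++ side; only the environment-variable name comes from the description, and the function itself is illustrative:
```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// If AMDGCN_HSACO_DUMP_PATH is set, write a copy of the compiled HSACO
// binary there so external tools (e.g., llvm-readobj) can inspect it.
void maybeDumpHsaco(const std::string &hsacoBytes) {
  if (const char *path = std::getenv("AMDGCN_HSACO_DUMP_PATH")) {
    std::ofstream out(path, std::ios::binary);
    out.write(hsacoBytes.data(),
              static_cast<std::streamsize>(hsacoBytes.size()));
  }
}
```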
`insert_slice_async` is decomposed into `load + insert_slice` in the
backend.
Not sure whether V100 performance can match the master branch this way,
though. Maybe the performance can be improved if instructions are
arranged in the following form:
```
%0 = load
%1 = load
%2 = load
...
insert_slice %0
insert_slice %1
insert_slice %2
```
Tested on A100 when manually enabling this decomposition.
Tests on V100 haven't been integrated yet; we can divide the tests into
two phases:
1. Test only load, insert_slice, and insert_slice_async, given TritonGPU
IRs, in `test_backend.py`.
2. End-to-end gemm tests on V100.
Try to add proper side effects for Triton operations.
If side effects are not defined properly, the CSE pass can fail, hang, or
output incorrect IR for unknown reasons.
For instance, suppose we have two shared memory tensors:
```
%a = triton_gpu.alloc_tensor shape0, share_encoding0
%b = triton_gpu.alloc_tensor shape0, share_encoding0
```
Without proper side effects, the CSE pass considers `%a` and `%b` to be
the same thing and eliminates one of them, resulting in mysterious
outcomes; a sketch of why declared effects prevent this is below.
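A minimal sketch of the mechanism, assuming `triton_gpu.alloc_tensor` is given an Allocate memory effect in its op definition; the helper below is illustrative (MLIR's CSE only merges operations it can prove side-effect free):
```cpp
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

// CSE treats two textually identical ops as interchangeable only when
// they are free of memory effects. Once alloc_tensor declares an
// Allocate effect, this returns false for %a and %b above, so they are
// no longer merged into one allocation.
bool cseCouldMerge(mlir::Operation *op) {
  return mlir::isMemoryEffectFree(op);
}
```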
With this commit, 108 subtests from test_bin_op in test_core.py are passing.
Here is what doesn't work in test_bin_op:
- int8 datatype
- int16 datatype
- uint8 datatype
- uint16 datatype
- float16 datatype
- bfloat16 datatype
- division (/) operator
1. Code clean-up to remove superfluous #includes.
2. Fix two Python test warnings: one relates to ["#"
formats](https://jira.mongodb.org/browse/PYTHON-2343), the other to
regular expression string usage.