github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-27 03:01:52 -04:00

Author	SHA1	Message	Date
Alexander Efimov	c3f24143d2	fix basic_load atomic_add_f32 tests	2023-02-23 12:03:56 +01:00
Alexander Efimov	33667cd106	Specialize checks for async slice tests: - basic_insert_slice_async_v4 - basic_insert_slice_async_v1 - basic_insert_slice_async_v1_multictas	2023-02-23 12:03:56 +01:00
Alexander Efimov	9ad7fec871	[Test][LIT] Fix Conversion/tritongpu_to_llvm.mlir crash This PR disables the following subtests, because they are PTX specific: - basic_async_wait - convert_dot - matmul_kernel_dot_operand_layout - matmul884_kernel_dot_operand_layout - matmul_tf32dot	2023-02-23 12:03:56 +01:00
Rohit Santhanam	50e6329d01	Fix LIT tests.	2023-02-18 09:32:42 +00:00
Rohit Santhanam	841784d1e3	Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head	2023-02-18 09:25:20 +00:00
Christian Sigg	9ef4b5d773	Rebase to LLVM-head. (#1200 ) Rebase to `37b7a60cd7`	2023-02-17 13:16:11 -08:00
Christian Sigg	fc7a8e3581	Rebase Triton to LLVM-15. (#1070 ) This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are mechanical, except for the analysis framework changes.	2023-02-16 06:40:53 -08:00
Rohit Santhanam	112abb966f	Fix test/Conversion/tritongpu_to_llvm.mlir for AMDGPU.	2023-02-11 15:16:50 +00:00
Rohit Santhanam	a2416e0901	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023	2023-02-11 14:48:19 +00:00
rsanthanam-amd	fb62b22364	Merge pull request #103 from binarman/unittest_fix/Conversion/tritongpu_to_llvm.mlir [Test][Lit] Tritongpu to llvm	2023-02-09 22:57:08 -06:00
Alexander Efimov	a58c3ad317	update checks for Conversion/tritongpu_to_llvm.mlir test	2023-02-10 00:05:23 +01:00
Alexander Efimov	ba603b8319	review fix: CHECK to PTX	2023-02-09 01:07:57 +01:00
Alexander Efimov	d56258a4c6	review fix: roll back basic_insert_slice_async_v1_multictas	2023-02-09 01:01:52 +01:00
Alexander Efimov	89ca4cb4ba	review fix: add COUNT checks	2023-02-09 01:01:28 +01:00
Alexander Efimov	a43f192889	code review fix	2023-02-08 19:17:00 +01:00
Alexander Efimov	7fb8147d4f	[Test] Fix load_store test This PR fixes Conversion/AMDGPU/load_store.mlir lit based unit test.	2023-02-06 20:36:39 +01:00
Keren Zhou	bde52f9db2	[BACKEND] Fix alignment calculation (#1149 ) `getDivisibility` represents if the address in bytes is divisible by a certain number, so we should convert `#aligned bytes` to `#aligned elements`.	2023-02-03 17:20:23 -08:00
Alexander Efimov	af58dd90f8	[Test][Lit] Tritongpu to llvm This PR fixes the lit-based unit test Conversion/tritongpu_to_llvm.mlir for the GCN target.	2023-02-02 23:31:34 +01:00
Rohit Santhanam	2d0ee0fa0f	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023	2023-01-24 03:59:17 +00:00
Yan Chunwei	88498d104a	[BACKEND] DotOp enable ld.v4 in MMAv1 (#1020 ) The existing logic for converting distributed to distributed layouts processes each MMA-block, which requires every MMA-block to share exactly the same fixed pattern (such as the one described in the [NV PTX doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)). For MMAv1 things are different: the MMA-block has varying patterns for different shapes and data layouts, as shown below <img width="200" alt="image" src="https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png"> This requires all the cell coordinates in DotOp output to be computed.	2023-01-19 09:42:33 -08:00
Philippe Tillet	660f2e8cce	[OPTIMIZER] pipeline and prefetch pass now use a more ptxas-friendly schedule (#1065 )	2023-01-17 15:21:19 -08:00
Keren Zhou	3f47e9aa0e	[BACKEND] Fix unrealized conversion for fp32 dot (#1051 )	2023-01-17 21:55:44 +00:00
Rohit Santhanam	ce8adb92bd	Merge remote-tracking branch 'upstream/master' into triton-mlir-IFU-01142023	2023-01-14 19:19:58 +00:00
Philippe Tillet	589a18959e	[BACKEND] Make swizzled shared memory pointers compatible with non-blocked distributed layout (#1053 ) Notes: * Cleaned up implementation * Added comments * Re-using code between ConvertDistributedToShared and ConvertInsertSliceAsyncOp	2023-01-13 09:14:23 -08:00
Daniil Fukalov	4d4a91ec5f	[ROCM] Fix incorrect llvm.bitcast creation. Before the patch, tt.load and tt.store generated incorrect bitcasts like `%f = llvm.bitcast %i : i32 to f16` (the source and destination types of a bitcast should have the same bitwidth).	2023-01-12 14:28:17 +01:00
goostavz	50d4589e9c	[Backend] Add value cache in emitting indices calculation and some refinement (#1018 ) 1, add explicit value cache in emitting indices calculation; 2, move the indices calculation emitting logics into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost by templates. Refer to the discussion in this thread by @LyricZhao : https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969	2023-01-04 15:11:53 +00:00
Rohit Santhanam	7b7ddb7a59	Merge branch 'triton-mlir-IFU' into merge_IFU_to_triton_mlir	2023-01-03 23:37:11 +00:00
goostavz	1d3029faf8	[Backend] Add value cache in emitting indices calculation and some refinement (#1018 ) 1, add explicit value cache in emitting indices calculation; 2, move the indices calculation emitting logics into ConvertTritonGPUOpToLLVMPatternBase to avoid the redundant build cost by templates. Refer to the discussion in this thread by @LyricZhao : https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969	2022-12-29 11:19:59 -08:00
Michael Melesse	41578a63d2	Merge remote-tracking branch 'upstream/triton-mlir' into triton-mlir-IFU	2022-12-21 12:53:03 -06:00
Philippe Tillet	20100a7254	Merge `triton-mlir` branch - Complete rewrite of the backend from scratch (#1004 ) This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag. Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Yan Chunwei <yanchunwei@outlook.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com> Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com> Co-authored-by: Yan Da <dyanab@connect.ust.hk> Co-authored-by: Jun Yang <yangjunpro@gmail.com> Co-authored-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Jason Ansel <jansel@jansel.net> Co-authored-by: Qingyi Liu <qingyil@nvidia.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com> Co-authored-by: Chenggang Zhao <lyricz@yeah.net> Co-authored-by: ben-zhang-609 <benzh609@gmail.com> Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-21 01:30:50 -08:00
Yan Chunwei	42b5234e27	[Triton-MLIR][BACKEND] Decompose Mma version to versionMajor and versionMinor (#985 )	2022-12-15 17:14:07 +08:00
Philippe Tillet	52accd4c2b	[BACKEND] Add isRow attribute for DotOp tensors whose parent is mmav1 (#970 ) Co-authored-by: Yan Chunwei <yanchunwei@outlook.com>	2022-12-11 19:01:57 -08:00
Philippe Tillet	b2b793dfb5	[FRONTEND][BACKEND] Fixes for cat / reshape / addptr (#959 ) Most notably, this PR: - changes the traits (and assembly format) of addptr so it can handle offsets that have arbitrary integer width. - adds support for `cat`	2022-12-06 23:29:50 -08:00
Keren Zhou	f2fcaeabf3	[BACKEND] Support dot op when the output is mma encoding and allowtf32 is true (#937 )	2022-12-03 19:14:12 +00:00
dfukalov	f7bea48b93	[Triton-MLIR][ROCM] Updated tests for Ld/St support.	2022-11-30 19:55:39 +01:00
Keren Zhou	7d90a07d0b	[Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929 ) 1. Improve pipeline's comment 2. Decompose insert_slice_async when load vector size is not supported 3. Add a test that could fail our gemm code Copy my comments here: There's a knob that may cause performance regression when decomposition has been performed. We should remove this knob once we have thorough analysis on async wait. Currently, we decompose `insert_slice_async` into `load` and `insert_slice` without knowing which `async_wait` is responsible for the `insert_slice_async`. To guarantee correctness, we blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed. There are two options to improve this: 1. We can perform a dataflow analysis to find the `async_wait` that is responsible for the `insert_slice_async` in the backend. 2. We can modify the pipeline to perform the decomposition before the `async_wait` is inserted. However, it is also risky because we don't know the correct vectorized shape yet in the pipeline pass. Making the pipeline pass aware of the vectorization could introduce additional dependencies on the AxisInfoAnalysis and the Coalesce analysis.	2022-11-30 10:07:34 -08:00
goostavz	4e6a8209ed	[Triton-MLIR] Two fixes on allocation and backend related with MMA v1 (#930 )	2022-11-30 09:27:26 +00:00
Philippe Tillet	9bb54402b3	[FRONTEND][BACKEND] Small fixes to multiple_of, num_programs, axisinfo; enable block-sparse tests (#927 )	2022-11-29 20:00:34 +01:00
goostavz	0c1d4d764e	[Triton-MLIR][Backend] support MMA v1 in ConvertLayout (#922 ) The e2e verification of mma v1 is not done yet. Get this merged in advance just to prevent more conflicts.	2022-11-28 08:10:30 +00:00
Keren Zhou	35c9ec1103	[Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917 )	2022-11-26 12:30:38 -08:00
Keren Zhou	2afebcd79b	[Triton-MLIR][Backend] Remove unnecessary barriers (#901 ) Cross operation barriers are taken care of by the Membar pass. Explicit barriers are only required if there's any synchronization necessary within each operation.	2022-11-22 10:03:29 -08:00
Keren Zhou	6c5f646f4e	[WIP][Triton-MLIR] Prefetch pass fixup (#873 ) A (potential) problem by directly adopting `tensor.extract_slice`. Long story short, `tensor.extract_slice` is not aware of swizzling. Consider the following shared memory tensor and its first three slices, where each slice includes two tiles (the loading unit of LDGSTS) of elements. Currently, the tiles haven't been swizzled yet, so slicing seems to work. <img width="1219" alt="image" src="https://user-images.githubusercontent.com/2306281/201833023-a7950705-2d50-4c0a-8527-7505261c3a3c.png"> However, now consider the following figure, which is the layout after applying swizzling on the first figure. <img width="1244" alt="image" src="https://user-images.githubusercontent.com/2306281/201834824-7daae360-f5bc-4e6b-a921-20be3f294b78.png"> Note that on phase 2, all tiles have been swizzled out of their original slices. This implies that if we use the tile index after slicing, we can no longer locate the correct tiles. For example, T3 was in slice 1 but got swapped to slice 0 after swizzling. Here's a more detailed explanation. In the current `triton-mlir` branch, we only compute the relative offset of each tile. So T3's index in Slice 1 is 1, and it will be swizzled using 1 and phase id. Whereas the correct index of T3 should be 3, which is the relative offset to the beginning of the shared memory tensor being swizzled, and T3 should be swizzled using 3 and phase id. This PR proposes a hacky solution for this problem. We restore the "correct" offset of each tile by assuming that slicing on a specific dim only happens at most once on the output of insert_slice_async. I admit it's risky and fragile. The other possible solution is adopting cutlass' swizzling logic that limits the indices being swizzled in a "bounding box" that matches what the mma instruction executes. For example, in the following tensor layout, each 4x4 submatrix is a minimum swizzling unit, and the entire tensor represents the tensor layout of operand A in `mma.16816`. <img width="565" alt="image" src="https://user-images.githubusercontent.com/2306281/201836879-4ca7824b-530c-4a06-a3d5-1e74a2de1b42.png"> Co-authored-by: Phil Tillet <phil@openai.com>	2022-11-19 19:57:16 -08:00
goostavz	9ea6135eb5	[Triton-MLIR][Backend] Some cleanup in getMultiDimIndex/getLinearIndex (#880 )	2022-11-18 01:19:21 +00:00
Rohit Santhanam	b9e7634356	Merge commit 'e517b58d59ba96357d042d8fa5819a690d00d749' into IFU_upstream_commit_e517b58d59ba96357d042d8fa5819a690d00d749	2022-11-16 12:57:19 +00:00
Rohit Santhanam	ad7f53ab15	Merge commit 'e517b58d59ba96357d042d8fa5819a690d00d749' into triton-mlir-rohit-test	2022-11-16 01:15:06 +00:00
goostavz	37f5846280	[Triton-MLIR][Backend] Minor fix for allocation and backend in handling tt.ptr tensors (#878 )	2022-11-15 10:08:07 +00:00
Yan Chunwei	a22ff39017	[Triton-MLIR][BACKEND] Refine/add codegen for get_program_id and get_num_programs Op (#877 )	2022-11-15 15:45:24 +08:00
Yan Chunwei	1eedaf7bec	[Triton-MLIR][BACKEND] adapt DotOp layout for FMADot (#872 )	2022-11-14 16:56:30 +08:00
Chenggang Zhao	57fd1864a7	[Triton-MLIR] Support FP8 (#864 ) Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 15:53:06 +08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712 ) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00

188 Commits