github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Philippe Tillet	fa0fbc937f	[FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261 ) * Frontend: - `int` kernel arguments are always signed - Loop induction variable is now determined by integer promotion on lb/ub/step * Optimizer: - Added new ExtractSliceOp that enforces 32-bit offsets * Backend: - Use 64-bit indices when lowering functions and control flow - Removed `idx_val` macro and replaced it with `i32_val` - Cleaned up comments - Added new ArithToIndex pass to make sure operations on indices are done with the `index` dialect, that gets converted to LLVM separately using a 64-bit target	2023-03-01 23:09:48 -08:00
Philippe Tillet	f8c92c3d17	[ANALYSIS] initializing operand axisinfo state (if necessary) before visiting operation (#1259 ) Feels very wrong, and probably not the right way to do this. But otherwise `scf.if` doesn't get initialized since the merge to llvm-head. Suggestions are welcome 😅	2023-02-27 19:32:19 -08:00
Rohit Santhanam	cd9ae1cd36	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023	2023-02-23 21:41:54 +00:00
Keren Zhou	123c687ed9	[BACKEND] Rewrite Membar to fit the CF dialect (#1213 )	2023-02-19 14:54:33 -08:00
Rohit Santhanam	841784d1e3	Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head	2023-02-18 09:25:20 +00:00
Christian Sigg	9ef4b5d773	Rebase to LLVM-head. (#1200 ) Rebase to `37b7a60cd7`	2023-02-17 13:16:11 -08:00
Christian Sigg	fc7a8e3581	Rebase Triton to LLVM-15. (#1070 ) This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are mechanical, except for the analysis framework changes.	2023-02-16 06:40:53 -08:00
rsanthanam-amd	44f69bea81	Merge pull request #113 from ROCmSoftwarePlatform/triton-mlir-IFU-02112023 Triton mlir ifu 02112023	2023-02-13 09:26:10 -06:00
rsanthanam-amd	ec387d5bf4	Merge pull request #109 from dfukalov/dfukalov/work-3 [ROCM] Enable part of tl.dot operations.	2023-02-12 13:50:20 -06:00
Daniil Fukalov	a6596fc634	[ROCM] Enable part of tl.dot operations. The change enables the fall-through FMA path for ROCM. It works for the float32 type and not for all tensor sizes. The change switches off reporting MMA and async ops support to avoid NV asm inline generation.	2023-02-12 17:25:48 +01:00
Philippe Tillet	2aba985daa	[OPTIMIZER] Improved layout simplifications heuristics (#1168 )	2023-02-09 20:17:25 -08:00
Philippe Tillet	0cbe368fe5	[OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169 )	2023-02-09 18:43:33 +00:00
Keren Zhou	681d04cf2b	[BACKEND] Fix axisInfo analysis for div ops (#1157 )	2023-02-07 02:25:23 +00:00
Keren Zhou	bde52f9db2	[BACKEND] Fix alignment calculation (#1149 ) `getDivisibility` represents if the address in bytes is divisible by a certain number, so we should convert `#aligned bytes` to `#aligned elements`.	2023-02-03 17:20:23 -08:00
Keren Zhou	82befe32ad	[BACKEND] Improve torch inductor performance (#1108 ) - Rewrite the AxisInfo analysis to handle each op case by case. - Add bit shift, min max, div/rem, and select ops to AxisInfo. - Rematerialize across load/store ops in the following two cases: - A size 1 tensor is considered not expensive since all threads will load the same value - the targetEncoding may expose more vectorization opportunities (more elements per thread on the first dim) _res2next_ benchmark GPU Kernel time comparison on A100. - Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns. 1.016x slowdown. - Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.	2023-02-01 18:21:15 -08:00
Keren Zhou	5dd8ce3745	[BACKEND] Fix topological sort and add new test cases (#1132 ) Previous https://github.com/openai/triton/pull/1113 forgot to consider that a node may have multiple parents; visiting the instruction before any parent violates the semantics of topological sort. The fixed implementation exhaustively adds all operations into a candidate subgraph and moves an operation to the "ready" queue once all of its operands have been visited.	2023-01-31 23:41:20 -08:00
Philippe Tillet	8fea1fb478	[FRONTEND] Adding static range (#1130 ) Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113)"	2023-01-31 18:04:19 -08:00
Philippe Tillet	c4b9d699d2	[FRONTEND][BACKEND] Fixed many bugs (#1122 ) - temporarily commenting assertion in `MemBar.cpp`. We need to fix this! But for now the following patches will unblock a number of users. - Fixed frontend codegen issue for If / For / While. Emit an error when the replaced values' types mismatch. - Added "top level" codepath for if statements, which allows users to write patterns to exit early from kernels (e.g., `if cond1: if cond2: return else: ...`). Added associated codegen in TritonToTritonGPUPass - Added basic control flow tests - Pipeline pass is no longer activated when memory accesses can't be vectorized - Added missing magic methods to `constexpr` - Fixed issue in random.py: bitcast some values to uint when they need to be. - Added support for `Not` - Fixed nondeterministic compilation issue	2023-01-30 23:22:36 -08:00
Keren Zhou	bc8a26d56f	[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113 ) `multiRootTopologicalSort` is faster than `mlir::topologicalSort` because it prunes nodes that have been visited before.	2023-01-29 18:57:21 -08:00
Yan Chunwei	88498d104a	[BACKEND] DotOp enable ld.v4 in MMAv1 (#1020 ) The existing convert distributed to distributed layouts logic is based on processing each MMA-block; this requires each MMA-block to share exactly the same fixed pattern (such as the one described in the [NV PTX doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)). For MMAv1, things are different: the MMA-block has varying patterns for different shapes and data layouts (image: https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png). This requires all the cell coordinates in DotOp output to be computed.	2023-01-19 09:42:33 -08:00
Philippe Tillet	408d1d7e87	[OPTIMIZER] Improved flash attention forward pass performance (#1075 ) - Fixed typo in instruction reordering pass - Minor additional optimizations for shared memory allocator - Optimized flash attention tutorial forward pass kernel	2023-01-19 06:46:01 +00:00
Goran Flegar	e2923afc71	[BUILD] Add dependency to TritonTableGen in TritonAnalysis (#1067 ) The `TritonAnalysis` target depends on `TritonTableGen` through including `Triton/IR/Dialect.h`, which itself includes `Triton/IR/Dialect.h.inc` generated by `TritonTableGen`. Without it, the build might fail (seems to be happening inconsistently, due to multithreaded builds)	2023-01-17 13:21:16 -08:00
Keren Zhou	e638cb8060	[Backend] Use post-order traversal for liveness numbering (#1027 ) Also add tests for `tt.trans`.	2023-01-04 15:13:09 +00:00
Keren Zhou	678b9f53a2	[Backend] Use post-order traversal for liveness numbering (#1027 ) Also add tests for `tt.trans`.	2023-01-03 15:11:54 -08:00
Philippe Tillet	20100a7254	Merge `triton-mlir` branch - Complete rewrite of the backend from scratch (#1004 ) This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag. Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Yan Chunwei <yanchunwei@outlook.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com> Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com> Co-authored-by: Yan Da <dyanab@connect.ust.hk> Co-authored-by: Jun Yang <yangjunpro@gmail.com> Co-authored-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Jason Ansel <jansel@jansel.net> Co-authored-by: Qingyi Liu <qingyil@nvidia.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com> Co-authored-by: Chenggang Zhao <lyricz@yeah.net> Co-authored-by: ben-zhang-609 <benzh609@gmail.com> Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-21 01:30:50 -08:00
Keren Zhou	50a5128448	[Triton-MLIR][BACKEND] Support bfloat16 and clean up some test code (#998 )	2022-12-20 22:26:51 -08:00
Philippe Tillet	9f27468377	[TESTS][FRONTEND][BACKEND] Merge `master` and `triton-mlir` tests (#979 ) Also fix a bunch of bugs in float32 / tf32 Co-authored-by: Jokeren <kerenzhou@openai.com>	2022-12-15 19:28:50 -08:00
Keren Zhou	f2fcaeabf3	[BACKEND] Support dot op when the output is mma encoding and allowtf32 is true (#937 )	2022-12-03 19:14:12 +00:00
Philippe Tillet	8edfe813a5	[FRONTEND][BACKEND] Added `trans` instruction; made flash attention bwd pass work (#943 )	2022-12-03 09:58:24 -08:00
Keren Zhou	c280ebda1b	[Triton-MLIR][BACKEND] Fix the membar pass to add missing barriers caused by scf.for (#933 ) 1. Add missing barriers and revert the previous temporary solution 2. Extract the `run` method from membar analysis because the membar analysis should have two phases, including construction, which doesn't modify any IR, and modification, which adds barrier IRs. Hope this could make the use of membar clear.	2022-12-01 11:54:18 -08:00
Keren Zhou	7d90a07d0b	[Triton-MLIR][BACKEND] Refactor decompose insert_slice_async (#929 ) 1. Improve pipeline's comment 2. Decompose insert_slice_async when load vector size is not supported 3. Add a test that could fail our gemm code Copy my comments here: There's a knob that may cause performance regression when decomposition has been performed. We should remove this knob once we have thorough analysis on async wait. Currently, we decompose `insert_slice_async` into `load` and `insert_slice` without knowing which `async_wait` is responsible for the `insert_slice_async`. To guarantee correctness, we blindly set the `async_wait` to wait for all async ops if any `insert_slice_async` has been decomposed. There are two options to improve this: 1. We can perform a dataflow analysis to find the `async_wait` that is responsible for the `insert_slice_async` in the backend. 2. We can modify the pipeline to perform the decomposition before the `async_wait` is inserted. However, it is also risky because we don't know the correct vectorized shape yet in the pipeline pass. Making the pipeline pass aware of the vectorization could introduce additional dependencies on the AxisInfoAnalysis and the Coalesce analysis.	2022-11-30 10:07:34 -08:00
goostavz	4e6a8209ed	[Triton-MLIR] Two fixes on allocation and backend related with MMA v1 (#930 )	2022-11-30 09:27:26 +00:00
Philippe Tillet	9bb54402b3	[FRONTEND][BACKEND] Small fixes to multiple_of, num_programs, axisinfo; enable block-sparse tests (#927 )	2022-11-29 20:00:34 +01:00
Qingyi Liu	661be523c0	[Triton-MLIR][BACKEND] Minor fixes of shared memory in ReduceOpConversion (#924 )	2022-11-29 11:50:31 +08:00
Qingyi Liu	9d31998a9d	[Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918 )	2022-11-27 22:59:27 -08:00
Keren Zhou	35c9ec1103	[Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917 )	2022-11-26 12:30:38 -08:00
donproc	f63be0e9b5	[TRITON-MLIR][BACKEND] Support atomic_cas (#914 ) 1. Support atomic_cas 2. Add xchg support in atomic_rmw Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-25 12:02:08 +08:00
Keren Zhou	153aecb339	[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908 ) `insert_slice_async` is decomposed into `load + insert_slice` in the backend. Not sure if V100 perf can match the master branch though in this way. Maybe the performance can be improved if instructions are arranged in the following form: ``` %0 = load %1 = load %2 = load ... insert_slice %0 insert_slice %1 insert_slice %2 ``` Tested on A100 when manually enabling this decomposition. Tests on V100 haven't been integrated yet, we can divide the tests into two phases: 1. Test only load, insert_slice, and insert_slice_async, given TritonGPU IRs in `test_backend.py`. 2. End to end gemm tests on V100.	2022-11-24 14:05:54 -08:00
donproc	8925c2cd11	[TRITON-MLIR][BACKEND] AtomicRMWOp supports scalar (#903 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-11-23 07:59:09 +00:00
Keren Zhou	2afebcd79b	[Triton-MLIR][Backend] Remove unnecessary barriers (#901 ) Cross operation barriers are taken care of by the Membar pass. Explicit barriers are only required if there's any synchronization necessary within each operation.	2022-11-22 10:03:29 -08:00
goostavz	37f5846280	[Triton-MLIR][Backend] Minor fix for allocation and backend in handling tt.ptr tensors (#878 )	2022-11-15 10:08:07 +00:00
Qingyi Liu	4c4159c6fa	[Triton-MLIR] Add ex2.approx implementation for ExpOp and fix smem allocation for ReduceOpConversion (#875 )	2022-11-15 01:27:32 +00:00
Chenggang Zhao	516a241234	[Triton-MLIR] Fix some typos (#874 ) Fix some typos	2022-11-13 18:15:53 -08:00
Philippe Tillet	2aa538ec2e	[BACKEND] Added support for mma layouts in reductions (#863 ) Validated hackily by manually modifying the reduction .ttgir in my local cache. There will be a follow-up PR adding some better testing infrastructure to test out conversions and reductions on arbitrary layouts.	2022-11-10 09:58:07 -08:00
Da Yan	4946167241	[Triton-MLIR] `tt.dot` operands now must have DotOperand layout; also added prefetch pass prototype (#712 ) Co-authored-by: Jokeren <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com> Co-authored-by: Superjomn <yanchunwei@outlook.com>	2022-11-10 05:57:27 +00:00
goostavz	080b4addf8	[Triton-MLIR][Backend] Fix the order in linear/delinear and a few bugs in reduce conversion (#851 ) 1. Fix the order in linearize/delinearize, which fixes the order error in emitIndices; 2. fix the selection of the fast implementation in reduce codegen; 3. fix the redundant barrier in reduce codegen; 4. fix the index mapping of the second round of warp_shuffle in the shuffle version of reduce codegen. Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2022-11-08 10:10:09 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837 ) ## Features - Allow taking a block of tensor slice, as long as each dimension is contiguous (unit stride). - Fix some problems in `insert_slice_async`'s semantics. - More general verification for ops that return shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantics. May submit another PR later to support semantics similar to `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3d tensor ops are broken. - Strided accesses are not allowed. - May cause a little performance slowdown since we are passing strides as values but not constants (e.g., int). It would be difficult to pass strides as attributes when we have control flows. A block argument is possible to accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Philippe Tillet	91a9773b38	[OPTIMIZER] Minor bugfixes that affected matmul codegen performance (#834 )	2022-11-02 22:58:09 -07:00
Philippe Tillet	12d60cb4a3	[BACKEND] Added support for 1D conversion blocked -> slice (#831 )	2022-11-01 13:19:58 -07:00
Ian Bearman	f2106d0aa2	[BUILD] Fix Warnings and Enable Warnings as Errors (#794 )	2022-10-28 12:36:09 -07:00
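
The "ready" queue approach described in commit 5dd8ce3745 can be sketched as follows. This is only an illustration in plain Python, not the Triton implementation: the function name `topological_sort` and the dictionary-based graph representation are assumptions made for the example, and every operand is assumed to appear as a key in the graph.

```python
from collections import deque


def topological_sort(operands):
    """Order operations so that each one follows all of its operands.

    `operands` is a hypothetical representation: it maps every operation
    to the list of operations it uses. Every operand must be a key too.
    """
    # Deduplicate operands so an operation used twice is counted once.
    unique = {op: list(dict.fromkeys(deps)) for op, deps in operands.items()}
    # Number of operands of each operation that have not been visited yet.
    pending = {op: len(deps) for op, deps in unique.items()}
    # Reverse edges: which operations consume a given operation.
    users = {op: [] for op in unique}
    for op, deps in unique.items():
        for dep in deps:
            users[dep].append(op)

    # An operation is "ready" only once all of its operands were visited,
    # so a node with several parents is never emitted too early.
    ready = deque(op for op, count in pending.items() if count == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for user in users[op]:
            pending[user] -= 1
            if pending[user] == 0:
                ready.append(user)

    if len(order) != len(unique):
        raise ValueError("graph contains a cycle")
    return order
```

In a diamond-shaped graph where `d` uses both `b` and `c`, `d` is emitted only after both of them, which is the multiple-parent case the commit message refers to.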


174 Commits