github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Rohit Santhanam	cd9ae1cd36	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023	2023-02-23 21:41:54 +00:00
Keren Zhou	123c687ed9	[BACKEND] Rewrite Membar to fit the CF dialect (#1213 )	2023-02-19 14:54:33 -08:00
Rohit Santhanam	841784d1e3	Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head	2023-02-18 09:25:20 +00:00
Christian Sigg	9ef4b5d773	Rebase to LLVM-head. (#1200 ) Rebase to `37b7a60cd7`	2023-02-17 13:16:11 -08:00
Christian Sigg	fc7a8e3581	Rebase Triton to LLVM-15. (#1070 ) This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are mechanical, except for the analysis framework changes.	2023-02-16 06:40:53 -08:00
rsanthanam-amd	44f69bea81	Merge pull request #113 from ROCmSoftwarePlatform/triton-mlir-IFU-02112023 Triton mlir ifu 02112023	2023-02-13 09:26:10 -06:00
rsanthanam-amd	ec387d5bf4	Merge pull request #109 from dfukalov/dfukalov/work-3 [ROCM] Enable part of tl.dot operations.	2023-02-12 13:50:20 -06:00
Daniil Fukalov	a6596fc634	[ROCM] Enable part of tl.dot operations. The change enables fall-through FMA path for the ROCM. It works for the float32 type and not all the tensors sizes. The change switches off reporting MMA and async ops support to avoid NV asm inline generation.	2023-02-12 17:25:48 +01:00
Philippe Tillet	0cbe368fe5	[OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169 )	2023-02-09 18:43:33 +00:00
Keren Zhou	82befe32ad	[BACKEND] Improve torch inductor performance (#1108 ) - Rewrite the AxisInfo analysis to handle each op case by case. - Add bit shift, min max, div/rem, and select ops to AxisInfo. - Rematerialize across load/store ops in the following two cases: - A size 1 tensor is considered not expensive since all threads will load the same - the targeEncoding may expose more vectorization opportunities (more elements per thread on the first dim) _res2next_ benchmark GPU Kernel time comparison on A100. - Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns. 1.016x slowdown. - Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.	2023-02-01 18:21:15 -08:00
Keren Zhou	5dd8ce3745	[BACKEND] Fix topological sort and add new test cases (#1132 ) Previous https://github.com/openai/triton/pull/1113 forgot to consider that a node may have multiple parents, visiting the instruction before any parent violates the semantic of topological sort. The fixed implementation exhaustively add all operations into a candidate subgraph and move an operation to the "ready" queue once all of its operands have been visited.	2023-01-31 23:41:20 -08:00
Philippe Tillet	8fea1fb478	[FRONTEND] Adding static range (#1130 ) Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113)"	2023-01-31 18:04:19 -08:00
Keren Zhou	bc8a26d56f	[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113 ) `multiRootTopologicalSort` is faster than `mlir::topologicalSort` because it prunes nodes that have been visited before.	2023-01-29 18:57:21 -08:00
Philippe Tillet	408d1d7e87	[OPTIMIZER] Improved flash attention forward pass performance (#1075 ) - Fixed typo in instruction reordering pass - Minor additional optimizations for shared memory allocator - Optimized flash attention tutorial forward pass kernel	2023-01-19 06:46:01 +00:00
Philippe Tillet	20100a7254	Merge `triton-mlir` branch - Complete rewrite of the backend from scratch (#1004 ) This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag. Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Yan Chunwei <yanchunwei@outlook.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com> Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com> Co-authored-by: Yan Da <dyanab@connect.ust.hk> Co-authored-by: Jun Yang <yangjunpro@gmail.com> Co-authored-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Jason Ansel <jansel@jansel.net> Co-authored-by: Qingyi Liu <qingyil@nvidia.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com> Co-authored-by: Chenggang Zhao <lyricz@yeah.net> Co-authored-by: ben-zhang-609 <benzh609@gmail.com> Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-21 01:30:50 -08:00
Keren Zhou	50a5128448	[Triton-MLIR][BACKEND] Support bfloat16 and clean up some test code (#998 )	2022-12-20 22:26:51 -08:00
Philippe Tillet	9f27468377	[TESTS][FRONTEND][BACKEND] Merge `master` and `triton-mlir` tests (#979 ) Also fix a bunch of bugs in float32 / tf32 Co-authored-by: Jokeren <kerenzhou@openai.com>	2022-12-15 19:28:50 -08:00
Philippe Tillet	8edfe813a5	[FRONTEND][BACKEND] Added `trans` instruction; made flash attention bwd pass work (#943 )	2022-12-03 09:58:24 -08:00
Qingyi Liu	661be523c0	[Triton-MLIR][BACKEND] Minor fixes of shared memory in ReduceOpConversion (#924 )	2022-11-29 11:50:31 +08:00
Qingyi Liu	9d31998a9d	[Triton-MLIR][BACKEND] Add argmin / argmax implementation for ReduceOp (#918 )	2022-11-27 22:59:27 -08:00
Keren Zhou	35c9ec1103	[Triton-MLIR][Backend] Fix number of warps and threads per warp when matrices are small (#917 )	2022-11-26 12:30:38 -08:00
Keren Zhou	153aecb339	[Triton-MLIR][BACKEND] insert_slice_async on GPUs < sm80 (#908 ) `insert_slice_async` is decomposed into `load + insert_slice` in the backend. Not sure if V100 perf can match the master branch though in this way. Maybe the performance can be improved if instructions are arranged in the following form: ``` %0 = load %1 = load %2 = load ... insert_slice %0 insert_slice %1 insert_slice %2 ``` Tested on A100 when manually enabling this decomposition. Tests on V100 haven't been integrated yet, we can divide the tests into two phases: 1. Test only load, insert_slice, and insert_slice_async, given TritonGPU IRs in `test_backend.py`. 2. End to end gemm tests on V100.	2022-11-24 14:05:54 -08:00
Keren Zhou	fdd59900f7	[Triton-MLIR] Replace triton.extract_slice with tensor.extract_slice and support more general tensor slicing (#837 ) ## Features - Allow taking a block of tensor slice, as long as each dimension is contiguous (unit stride). - Fix some problems in `insert_slice_async`'s semantic. - More general verification for ops that return shared layout encoding. ## Known Limitations - `insert_slice_async` still uses the old semantic. May submit another PR later to support similar semantic like `tensor.extract_slice`. - No encoding verification for `tensor.extract_slice`. - 3d tensor ops are broken. - Strided accesses are not allowed. - May cause a little performance slowdown since we are passing strides as values but not constants (e.g., int). It would be difficult to pass strides as attributes when we have control flows. A block argument is possible to accept tensors with different strides.	2022-11-06 22:59:03 -08:00
Ian Bearman	f2106d0aa2	[BUILD] Fix Warnings and Enable Warnings as Errors (#794 )	2022-10-28 12:36:09 -07:00
Keren Zhou	02ebf24d35	Analyze shared memory alias (#81 ) The purpose of this PR is analyzing shared memory aliases so that we can fix memory allocation bugs and save memory allocations in triton code involving complex control flows. Changes to memory bar and allocation are on the way. Co-authored-by: Philippe Tillet <phil@openai.com>	2022-08-29 10:43:20 -07:00

25 Commits