Commit Graph

135 Commits

Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
Justin Lebar
df08301e76 Reformat Python code with yapf. (#2589)
I've added an option to yapf to do what we want for long lines; see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!

To make this PR, I ran my modified yapf over the repository, then looked
over the full diff.  Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want.  I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.

Overall, trailing `#` was sufficient to get formatting similar to our
current code.  I didn't have to disable yapf anywhere.
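
For a concrete picture, here is a hypothetical kernel declaration (the
names are made up) showing the trailing-`#` trick described above; the
mechanism is simply that yapf will not join a line that ends in a
comment:

```python
import triton
import triton.language as tl


# The trailing `#` comments keep yapf from re-wrapping the parameter
# list, so each group of parameters stays on its own line.
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr,  #
                  M, N, K,  #
                  BLOCK_M: tl.constexpr,  #
                  BLOCK_N: tl.constexpr):
    pass
```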

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-11-02 20:44:17 -07:00
Justin Lebar
29a9245559 [BUILD] use clang+lld in CI builds. (#2564)
Use clang+lld in CI builds.

This is significantly faster.
2023-10-30 19:19:27 -07:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Phil Tillet
07baf3a102 [CI] move llvm-build.yml to the top of workflow directory hierarchy 2023-10-26 02:08:56 -07:00
Keren Zhou
bc72294507 [CI] Reenable torchinductor workflow (#2527) 2023-10-25 23:44:02 -07:00
Philippe Tillet
31c76ddd05 [CI] revert recent changes (#2543) 2023-10-24 17:00:31 -07:00
Phil Tillet
5181d62b1b [CI] renamed third party test workflow 2023-10-24 12:12:52 -07:00
Phil Tillet
96b04493f1 [CI] move workflow around 2023-10-24 04:01:53 -07:00
Phil Tillet
c65d2c2ed6 [CI] run wheels job on CPU worker 2023-10-23 20:04:37 -07:00
Philippe Tillet
05dc28be0e [CI] refactor workflows (#2504)
no longer run third-party tests on every PR
2023-10-17 00:27:17 -07:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Philippe Tillet
98039658d4 [CI] disable pypy wheel (continued) (#2424)
there's a typo in the previous commit
2023-09-30 00:38:06 -07:00
Philippe Tillet
c4f3afc020 [CI] disable pypy wheel (#2423)
emitting warnings from C++ code requires "#include pybind11/exec.h",
which is not compatible with pypy. I think using the Python interpreter
from C++ is a bad idea in general... but we probably don't care much
about pypy wheels anyway
2023-09-29 23:48:08 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
ian Bearman
215b2e77a1 Add Shared Middle Layer to Triton via Plug-In (#2374)
This PR leverages the plug-in system to add a shared middle-layer to
Triton.

Currently the middle layer is not complete but has enough functionality
to demonstrate how it can work. The general idea is that Triton IR is
lowered into MLIR core dialects so that it can be shared across Triton
targets and so that back-ends can be shared with other languages.

The basic intended architecture looks like this:

[Triton IR] -> [Middle Layer] -> [HW specific IR]

The middle layer uses MLIR's Linalg and Tensor Dialects for operations on
Triton block values. Operations on Triton pointers use the Memref
Dialect.

## Usage
To include the shared middle layer in your Triton build, run `export
TRITON_CODEGEN_TRITON_SHARED=1` before invoking your build. Once it is
part of the build, it can be leveraged in two ways:

### Stand-Alone
The middle layer can be used as a stand-alone component to convert the
Triton dialect to the middle-layer dialects.

Stand-alone example:
```
triton-shared-opt --triton-to-linalg %file
```

### Backend Component
The middle layer can also be used as a component in a Triton back-end by
adding the CMake targets it produces and its header files to that
back-end. An example back-end will be published at a later date.

## Implementation details

Even though a valid Triton program can perform loads and stores at
arbitrary memory locations, the prototype only supports lowering
programs that have structured memory access patterns.

### Analyses

As part of the conversion process, there are three important analyses:

1. Pointer analysis:
+ This analysis is responsible for extracting structured memory access
patterns from a `triton` program during loads and stores; it walks the IR
and visits relevant instructions to build strided memory accesses in the
`memref` dialect. The analysis is still in its early stage and does not
support all scenarios. (A small example kernel follows this list.)

2. Use analysis:
+ After "Pointer analysis", instructions that are part of memory address
calculation will no longer be necessary in a triton program because
their semantics have now been captured by `memref` operations
representing strided memory accesses. To aid with removing these
instructions safely, we perform `Use analysis` to mark which
instructions are used *only* in address calculation (called `MetaUse`)
or used in *both* address calculation and data manipulation (called
`MixedUse`) operations. Those that are `MixedUse` are cloned and have
their users adjusted accordingly with the goal of separating out the
`MetaUse` ops so that they can be safely deleted.

3. Mask analysis:
+ This analysis is responsible for handling masked loads and stores.
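
To make the three analyses concrete, here is a small hypothetical
kernel (not taken from this PR) whose accesses have exactly the
structured, masked form described above:

```python
import triton
import triton.language as tl


@triton.jit
def strided_copy(src_ptr, dst_ptr, n, STRIDE: tl.constexpr, BLOCK: tl.constexpr):
    # Pointer analysis: the offsets follow an affine pattern (i * STRIDE),
    # so the load/store below can be summarized as strided memref accesses.
    offs = tl.arange(0, BLOCK) * STRIDE
    # Use analysis: `offs` is used only for address calculation (MetaUse),
    # so it can be deleted once the accesses are captured as memref ops.
    # Mask analysis: the comparison yields the mask guarding both accesses.
    mask = offs < n
    x = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, x, mask=mask)
```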

### Conversion strategy

We introduce the `TritonToLinalg` pass that converts the `triton`
dialect to the `linalg` dialect on *tensors*. This means the resulting
IR is fully compatible with `linalg` tiling and fusion transformation
passes. As mentioned in the `Pointer analysis` description, we do,
however, have to deal with memref instructions at the load and store
boundaries and must convert them to tensors using
`bufferization.to_tensor`. Here's a simple example of what the IR looks
like:

```mlir
tt.func @kernel(%afloat : !tt.ptr<bf16>, %res : !tt.ptr<bf16>) {
  %0 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>
  %1 = tt.splat %afloat : (!tt.ptr<bf16>) -> tensor<128x!tt.ptr<bf16>>
  %2 = tt.addptr %1, %0 : tensor<128x!tt.ptr<bf16>>, tensor<128xi32>
  %afm = tt.load %2 {cache = 1 : i32, evict = 1 : i32, isVolatile = false} : tensor<128xbf16>
  %3 = "tt.reduce"(%afm) ({
  ^bb0(%arg5: bf16, %arg6: bf16):
    %21 = arith.addf %arg5, %arg6 : bf16
    tt.reduce.return %21 : bf16
  }) {axis = 0 : i32} : (tensor<128xbf16>) -> bf16
  tt.store %res, %3 : bf16
  tt.return
}
```

after conversion:

```mlir
func.func @kernel(%arg0: memref<*xbf16>, %arg1: memref<*xbf16>, %arg2: i32, %arg3: i32, %arg4: i32) {
    %cst = arith.constant 0.000000e+00 : f32
    %reinterpret_cast = memref.reinterpret_cast %arg0 to offset: [0], sizes: [128], strides: [1] :
        memref<*xbf16> to memref<128xbf16, strided<[1]>>
    %alloc = memref.alloc() : memref<128xbf16>
    memref.copy %reinterpret_cast, %alloc : memref<128xbf16, strided<[1]>> to memref<128xbf16>
    %0 = bufferization.to_tensor %alloc restrict writable : memref<128xbf16>
    %1 = bufferization.alloc_tensor() : tensor<f32>
    %inserted = tensor.insert %cst into %1[] : tensor<f32>
    %reduced = linalg.reduce ins(%0 : tensor<128xbf16>) outs(%inserted : tensor<f32>) dimensions = [0]
      (%in: bf16, %init: f32) {
        %3 = arith.extf %in : bf16 to f32
        %4 = arith.addf %3, %init : f32
        linalg.yield %4 : f32
      }
    %extracted = tensor.extract %reduced[] : tensor<f32>
    %2 = arith.truncf %extracted : f32 to bf16
    %reinterpret_cast_0 = memref.reinterpret_cast %arg1 to offset: [0], sizes: [1], strides: [1] :
        memref<*xbf16> to memref<1xbf16, strided<[1]>>
    affine.store %2, %reinterpret_cast_0[0] : memref<1xbf16, strided<[1]>>
    return
}
```

Important details to note:

+ `tt.load` (together with all of its related address calculation
instructions such as `tt.addptr` and `tt.splat`) is lowered to a
combination of `memref.reinterpret_cast`, `memref.alloc`, and
`memref.copy`. After the initialization of the local buffer, we convert
the memref back to a tensor using `bufferization.to_tensor`; this op is
automatically removed during bufferization.

+ `tt.store` lowers to a combination of `memref.reinterpret_cast` and
either `affine.store` or `memref.tensor_store`:

```
%reinterpret_cast = memref.reinterpret_cast %arg2 to offset: [...] memref<*xf32> to memref<1024xf32>
%extracted_slice = tensor.extract_slice %15[0] [%21] [1] : tensor<1024xf32> to tensor<?xf32>
%subview = memref.subview %reinterpret_cast[0] [%21] [1] : memref<1024xf32> to memref<?xf32>
memref.tensor_store %extracted_slice, %subview : memref<?xf32>
```

+ element-wise `arith` and `math` operators are converted to their
corresponding `linalg.generic` versions.
+ `tt.dot` becomes `linalg.matmul`.
+ `tt.reduce` becomes `linalg.reduce`; known limitation: only `addf`
and `maxf` reductions are supported in the reduction body for now.
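
As a concrete (hypothetical) illustration of these mappings, every
operation in the kernel below falls into one of the categories above:

```python
import triton
import triton.language as tl


@triton.jit
def mapped_ops(a_ptr, b_ptr, out_ptr, BLOCK: tl.constexpr):
    # Launch with BLOCK >= 16 so that tl.dot is legal.
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    a = tl.load(a_ptr + idx)
    b = tl.load(b_ptr + idx)
    c = tl.dot(a, b)       # tt.dot       -> linalg.matmul
    c = c * 2.0            # element-wise -> linalg.generic
    s = tl.sum(c, axis=0)  # tt.reduce    -> linalg.reduce (addf body)
    tl.store(out_ptr + offs, s)
```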

### Testing

The prototype was tested on the following Triton kernel examples:

1. vector addition
2. fused softmax
3. matrix multiplication
4. layer normalization
5. fused attention

In addition to testing on the tutorial kernels, I have also added many
lit tests covering various scenarios.

## Recognition

The work here represents contributions from me as well as many of my
colleagues at Microsoft. I especially want to call out @nhat-nguyen and
@haishanzzz, who were major contributors to this work.
2023-09-22 15:29:31 -07:00
Philippe Tillet
c71ec14f31 [TEST] only test 4 configs without TF32 (#2370) 2023-09-21 21:23:19 -07:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Lixun Zhang
ff6fd952ac Install ninja in pre-build 2023-09-18 15:30:06 -05:00
Justin Lebar
2a3746bac5 [BUILD] use ninja (#2318) 2023-09-18 15:30:06 -05:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintaining.
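
As a usage sketch, assuming the `TRITON_INTERPRET` environment variable
that later Triton releases use to select this mode (it must be set
before `triton` is imported):

```python
import os

os.environ["TRITON_INTERPRET"] = "1"  # assumption: selects interpreter mode

import torch
import triton
import triton.language as tl


@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x + 1, mask=mask)


x = torch.zeros(8, device="cuda")
add_one[(1,)](x, x.numel(), BLOCK=8)  # runs through the interpreter
```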
2023-09-17 14:58:50 -07:00
Justin Lebar
073aa16379 [BUILD] use ninja (#2318) 2023-09-17 02:08:04 -07:00
Philippe Tillet
c98671cf7c Revert "Update integration-tests.yml" (#2323)
reverts #2310, as recent changes to Triton-IR have broken third-party backends
2023-09-17 01:16:00 -07:00
Michael Melesse
78a0b5dc2a [CI] update integration-tests.yml (#2310) 2023-09-15 18:38:15 -07:00
Zahi Moudallal
3dec616c7c [CI] Fix submodule issue (#2253)
...
2023-09-06 20:21:29 +00:00
jon-chuang
36859aebff [DOCS] Add MLIR Autogenerated Docs to Sphinx Docs (#2234)
Partially fixes: https://github.com/openai/triton/issues/2226

Here are some example renderings:
![Screenshot from 2023-09-04
18-39-20](https://github.com/openai/triton/assets/9093549/e9c4af04-aeae-4021-a8db-6a4a82b59ae7)
![Screenshot from 2023-09-04
18-39-30](https://github.com/openai/triton/assets/9093549/410391b8-e07e-4bed-909c-8ce5484072d1)
![Screenshot from 2023-09-04
18-39-41](https://github.com/openai/triton/assets/9093549/f1eaef95-66c1-4506-a153-c6069e2b5072)
2023-09-06 08:17:12 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Michael Melesse
c6d33dcebf [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-31 14:02:00 -07:00
Zahi Moudallal
bffb76e847 [DOCS] Fixing docs (#2221) 2023-08-31 11:47:57 -07:00
Zahi Moudallal
f922ecbc29 [DOCS] Fixing dependency (#2219) 2023-08-31 17:51:04 +00:00
Zahi Moudallal
cf31f4ddb2 [DOCS] Fixing docs by using sphinx-build instead of multiversion (#2217) 2023-08-31 16:51:14 +00:00
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
Zahi Moudallal
3c8f959f91 [CI] Adding workflow_run (#2120) 2023-08-18 23:58:41 -07:00
Zahi Moudallal
1faf93e6fb [CI] Fix PR comment (#2131) 2023-08-18 09:16:18 -07:00
Zahi Moudallal
b33f97a682 [CI] Fix bug in Compare Artifacts workflow (#2128)
Forgot to remove this line
2023-08-17 18:06:36 -07:00
Zahi Moudallal
6f654cfbbf [CI] Testing PR comment from another workflow (#2127) 2023-08-17 17:34:59 -07:00
Zahi Moudallal
3fa6d51bc9 [CI] Adding new github workflow for testing (#2121) 2023-08-17 15:32:38 -07:00
Zahi Moudallal
557b2d4b34 [CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111) 2023-08-16 20:34:40 -07:00
Zahi Moudallal
0312ed3473 [CI] Update kernels names (#2093)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-14 19:41:41 -07:00
Philippe Tillet
facc1dcbac [TESTS] better matmul unit testing (#2098) 2023-08-13 17:54:32 -07:00
Zahi Moudallal
c309f7e57a [CI] Skip PR if we paginate 30 times without finding a run_id (#2092) 2023-08-11 15:31:46 -07:00
Philippe Tillet
3ec05fb023 [CI] H100 tests always use ENABLE_TMA=1 ENABLE_MMA_V3=1 (#2051) 2023-08-07 19:32:55 -07:00
Philippe Tillet
54f1ac950e [CI] disable AMD CI (#2045) 2023-08-07 12:03:26 -07:00
Philippe Tillet
223c2d32a2 [CI] disable XPU tests (not compiling) (#2044)
cc @EikanWang . I'm disabling this for now since it broke with the H100
merge, but please feel free to fix the compilation errors and submit a
PR.
2023-08-07 11:56:16 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. They are recommended for trial once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
5bdf71313c [CI] Reduce required disk space (#270) 2023-07-22 09:04:42 +02:00
Philippe Tillet
1db3bdc52e [BACKEND] avoid code duplication for fully warp-synchronous reductions (#1978) 2023-07-21 16:06:00 -07:00
Phil Tillet
c7757fae71 [GITHUB] tweak CODEOWNERS 2023-07-17 00:41:11 -07:00
Keren Zhou
571c92f2a8 [CI] Fix CI kernel compare (#1931)
With this PR, we find the latest merged PR that successfully passed
"Integration Tests".
2023-07-12 10:06:34 -07:00