github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
chengjunlu	e209749835	[BACKEND] Add a configurable parameter for the number of threads per warp (#1719 ) Add a configurable parameter for the number of threads per warp for other GPUs, such as Intel GPUs. It defaults to 32, so the code logic on CUDA/AMD GPUs is unchanged. Note: The Intel GPU GenX ISA is explicit SIMD and can support a varying number of thread lanes per HW execution unit.	2023-06-28 22:25:14 -05:00
Thomas	e5d7411a69	[BACKEND] Add .wt store cache modifier (#1831 )	2023-06-28 17:40:30 +00:00
Keren Zhou	d2de3f37f0	[BACKEND] Reduce code cleanup and bug fix for the fast path (#1816 ) https://github.com/openai/triton/issues/1715	2023-06-27 17:27:24 -07:00
Philippe Tillet	d4c941177e	[FRONTEND][BACKEND] improved fp8 specs (#1841 ) clearly differentiate between standard fp8e4 (which we'll stop supporting on SM <= 89 because conversions are too expensive if we want to handle the single NaN and clipping properly) and a software-optimized fp8e4b15 format.	2023-06-26 16:19:03 -07:00
Thomas	3d1cd89b54	[BACKEND] Add store cache modifiers (#1826 ) Plumb through store cache modifiers.	2023-06-23 09:29:10 -07:00
Thomas	2eb7bc4b4c	[OPTIMIZER] Separate out kWidth layout optimization from pipelining pass (#1823 ) Since the kWidth optimization was happening during software pipelining, it was skipped in case pipelining wasn't applied. This also improves separation of concerns.	2023-06-23 08:39:08 -07:00
Thomas	7b30e24328	[TEST] Add verifier to DotOp (#1824 ) Add verifier to enforce that kWidth attributes match between A and B operands.	2023-06-22 23:09:52 +00:00
Keren Zhou	58a8e8a914	[BACKEND] Clean up code (#1768 ) - Remove unused header files. - Get numThreads/numWarps from the triton module. - Move transforms/utility.h to the include directory.	2023-06-12 17:40:33 -07:00
Keren Zhou	4fbadf6f6f	[BACKEND] Fix `tl.cat` when the number of threads > the size of a tensor (#1751 ) `tl.cat(tensor<64>, tensor<64>) -> tensor(128)`, because it concatenates elements into a single thread, if number of threads is 128, each thread should own at least 2 elements. With this PR, we also disable remat of the cat op in some cases.	2023-06-07 15:42:38 -07:00
Philippe Tillet	c52a91231a	[FRONTEND][BACKEND] Add acquire/release semantics for atomics (#1739 )	2023-06-05 19:09:13 -07:00
Sarthak Bhatt	2a276f105c	[MISC] Small typo fix (#1734 )	2023-06-03 17:37:06 -07:00
chengjunlu	45ba9af6ed	[BACKEND] Add a configurable parameter for the number of threads per warp (#1719 ) Add a configurable parameter for the number of threads per warp for other GPUs, such as Intel GPUs. It defaults to 32, so the code logic on CUDA/AMD GPUs is unchanged. Note: The Intel GPU GenX ISA is explicit SIMD and can support a varying number of thread lanes per HW execution unit.	2023-06-02 16:55:06 -07:00
Jason Furmanek	28d9754b2a	Merge remote-tracking branch 'oai/main' into ifu230601 Conflicts: python/test/unit/language/assert_helper.py test/Conversion/tritongpu_to_llvm.mlir	2023-06-01 20:53:33 +00:00
Mehdi Amini	b0c893cdc5	[FRONTEND][BACKEND] Hardened get_program_id axis by making it an enum attribute (#1721 ) Also catch out-of-bounds indices at construction and throw a proper error in the frontend. Finally, let's make the IR a bit prettier: %0 = tt.get_program_id {axis = 0 : i32} : i32 becomes: %0 = tt.get_program_id x : i32 Fixes #1718	2023-05-31 22:49:46 -07:00
Andrey Shukshov	fee5950893	[MFMA] Implementation of MFMA DotOp pipeline (#180 ) * [MFMA] Implementation of MFMA DotOp pipeline * Added MFMA test_dot unit tests * Added missing ifdefs * Update offline tests * Removing duplicate parts * fix build after rebase * remove redundant stuff * simplify MMAv3.cpp * move reps function into operand attr description, remove coreMatrixType type from layout conversion, refactored type conversion * remove duplication of mfma instruction shape computation * move all MFMA instruction shape details into layout attribute * fix formatting * reenable matmul acceleration * fix dot operator type conversion * add offline test for dotop * add missing ifdef wrappers * run clang format on changes * review and rebase fix * add switch for MFMA instructions * change check precision for float16 test * disable redundant check for allowTF32 * - skip unsupported block size in matmul autotuner - support transposed inputs of dot * reenable matmul acceleration * Add first part to FMA for dot operation on HW without MFMA support. * Fix offline tests. * Fix lit tests * refactor mmav3 to mfma * fix rebase issues * fix detection of mfma support and wrong assert * remove unnecessary macros * Add documentation for MFMA layout. * fix line size computation for B argument * Fix getElemsPerThread() and getSizePerThread() functions for MFMA layout. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> Co-authored-by: dfukalov <1671137+dfukalov@users.noreply.github.com> Co-authored-by: weihan13 <weihan13@amd.com> Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>	2023-05-30 16:10:28 -05:00
Philippe Tillet	a2433f3135	[OPTIMIZER][BACKEND] Fixed failure mode for mixed precision matmul kernels (#1695 )	2023-05-20 20:29:45 -07:00
Mehdi Amini	83245259a6	[OPTIMIZER][BACKEND] switch the TritonGPU dialect to use MLIR Properties (NFC) (#1696 ) Also try to switch APIs access to the new upstream APIs that separate explicitly the access to "discardable" and "inherent" attributes (the latter being stored in properties now). Generic accessors like `getAttr()` `setAttr()` `setAttrs()` are much more expensive and to be avoided.	2023-05-20 01:36:48 +00:00
Mehdi Amini	9b072318bb	[FRONTEND] Fix tt.print printer format (#1685 ) Ideally the MLIR generator should detect the ambiguity and error out. Fixes #1683	2023-05-17 13:17:38 -04:00
Mehdi Amini	422ee4dc08	[FRONTEND] switch Triton dialect to use MLIR properties (#1684 ) This does not change the compiler behavior, it is purely an internal change: it'll store inherent attributes (the ones defined in ODS) within the operations themselves instead of in a DictionaryAttr in the MLIRContext. The use of generic accessors like getAttribute("axis") should be avoided and direct access to the properties is preferred, like reduceOp.getProperties().axis Also `getAttrs()` or `getAttrDictionary()` are going to be deprecated and users should move to `getDiscardableAttrDictionary()` instead.	2023-05-17 16:18:44 +00:00
Jason Furmanek	4c4e42e524	Merge remote-tracking branch 'openai/main' into IFU-230517 Conflicts: lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/test/unit/language/assert_helper.py python/triton/third_party/cuda/bin/ptxas test/Conversion/tritongpu_to_llvm.mlir It looks like you may be committing a merge. If this is not correct, please remove the file .git/MERGE_HEAD and try again.	2023-05-17 15:03:42 +00:00
Christian Sigg	e5ae37faa4	[BUILD] Add deduction guide for `Interval` (#1680 ) This avoids `ctad-maybe-unsupported` warning.	2023-05-16 13:40:21 -07:00
Ingo Müller	0c4de8ab72	[DEPENDENCIES] Update LLVM to 17.0.0 (c5dede880d17) and port changes. (#1668 ) This depends on a [pending LLVM release](https://github.com/ptillet/triton-llvm-releases/pull/10). * Implement setCalleeFromCallable in CallOp. * Cast type to ShapedType for various getters. * Improve TritonDialect::materializeConstant due to breaking change in constructor of arith::ConstantOp. * Add OpaqueProperties argument in inferReturnTypes. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-05-15 21:51:14 -07:00
Ingo Müller	47af6ba702	[BACKEND] Move isSharedEncoding to TritonGPUIR. (#1655 ) This breaks a cyclic dependency between TritonAnalysis and TritonGPUIR (see #1649).	2023-05-12 20:50:21 -04:00
Zahi Moudallal	fb40bf1954	[TEST] Fixed and re-enabled reduce test (#1644 ) Re-enabled reduce test after fixing the %cst stride in the ttgir, and modifying the sweep parameters to make sure the shape per CTA is less than or equal to the tensor shape.	2023-05-10 15:15:11 -07:00
long.chen	7d20a86865	[BACKEND] fix typo in Membar class about WAR description and refine some code (#1629 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-05-06 14:20:23 +00:00
Zahi Moudallal	3449a9d40d	Zahi/slice reduce rebased (#1594 ) [BACKEND] Enable slice layout support for reduce op	2023-05-01 18:00:23 -07:00
Christian Sigg	26d80f026d	Merge branch `llvm-head` (#1600 )	2023-05-01 21:29:03 +00:00
Keren Zhou	ee864048b3	[FRONTEND][BACKEND] Add the `noinline` annotation for `triton.jit` (#1568 )

# Introducing the `noinline` Parameter for Triton JIT Decorator

We're excited to introduce a new parameter, `noinline`, that can be added to the `jit` decorator in Triton. This parameter allows developers to specify that a particular Triton function should not be inlined into its callers. In this post, we'll dive into the syntax, purpose, and implementation details of this new feature.

## Syntax

To use the `noinline` parameter, simply add `noinline=True` to the `jit` decorator for the function that you don't want to be inlined. Here's an example:

```python
@triton.jit(noinline=True)
def device_fn(x, y, Z):
    z = x + y
    tl.store(Z, z)

def test_noinline():
    @triton.jit
    def kernel(X, Y, Z):
        x = tl.load(X)
        y = tl.load(Y)
        device_fn(x, y, Z)
```

In this example, the `device_fn` function is decorated with `@triton.jit(noinline=True)`, indicating that it should not be inlined into its caller, `kernel`.

## Purpose

The `noinline` parameter serves several key purposes:

- Reducing code size: By preventing inlining, we can reduce the size of the compiled code.
- Facilitating debugging: Keeping functions separate can make it easier to debug the code.
- Avoiding common subexpression elimination (CSE) in certain cases: CSE can sometimes be avoided by using the `noinline` parameter to reduce register pressure.
- Enabling dynamic linking: This parameter makes it possible to dynamically link Triton functions.

## Implementation

The implementation of the `noinline` parameter involves significant changes to three analysis modules in Triton: Allocation, Membar, and AxisInfo. Prior to this update, these modules assumed that all Triton functions had been inlined into the root kernel function. With the introduction of non-inlined functions, we've had to rework these assumptions and make corresponding changes to the analyses.

### Call Graph and Limitations

<div style="text-align: center;">
<img src="https://user-images.githubusercontent.com/2306281/234663904-12864247-3412-4405-987b-6991cdf053bb.png" alt="figure 1" width="200" height="auto">
</div>

To address the changes, we build a call graph and perform all the analyses on the call graph instead of a single function. The call graph is constructed by traversing the call edges and storing them in an edge map. Roots are extracted by checking nodes with no incoming edges.

The call graph has certain limitations:

- It does not support recursive function calls, although this could be implemented in the future.
- It does not support dynamic function calls, where the function name is unknown at compilation time.

### Allocation

<div style="text-align: center;">
<img src="https://user-images.githubusercontent.com/2306281/234665110-bf6a2660-06fb-4648-85dc-16429439e72d.png" alt="figure 2" width="400" height="auto">
</div>

In Triton, shared memory allocation is achieved through two operations: `triton_gpu.convert_layout` and `triton_gpu.alloc_tensor`. The `convert_layout` operation allocates an internal tensor, which we refer to as a scratch buffer, while the `alloc_tensor` operation returns an allocated tensor and is thus known as an explicit buffer.

To accommodate the introduction of function calls, we are introducing a third type of buffer called a virtual buffer. Similar to scratch buffers, virtual buffers are allocated internally within the scope of a function call, and the buffers allocated by the called functions remain invisible to subsequent operations in the calling function. However, virtual buffers are distinct from scratch buffers in that the call operation itself does not allocate memory—instead, it specifies the total amount of memory required by all the child functions being called. The actual allocation of buffers is performed by individual operations within these child functions. For example, when invoking edge e1, no memory is allocated, but the total amount of memory needed by function B is reserved. Notably, the amount of shared memory used by function B remains fixed across its call sites due to the consideration of dynamic control flows within each function.

An additional challenge to address is the calculation of shared memory offsets for functions within a call graph. While we can assume a shared memory offset starting at 0 for a single root function, this is not the case with a call graph, where we must determine each function's starting offset based on the call path. Although each function has a fixed memory consumption, the starting offset may vary. For instance, in Figure 2, the starting offset of function C through edges e1->e2 differs from that through edges e2->e4. To handle this, we accumulate the starting offset at each call site and pass it as an argument to the called function. Additionally, we amend both the function declaration and call sites by appending an offset variable.

### Membar

<div style="text-align: center;">
<img src="https://user-images.githubusercontent.com/2306281/234665157-844dd66f-5028-4ef3-bca2-4ca74b8f969d.png" alt="figure 3" width="300" height="auto">
</div>

The membar pass is dependent on the allocation analysis. Once the offset and size of each buffer are known, we conduct a post-order traversal of the call graph and analyze each function on an individual basis. Unlike previous analyses, we now return buffers that remain unsynchronized at the end of functions, allowing the calling function to perform synchronization in cases of overlap.

### AxisInfo

<div style="text-align: center;">
<img src="https://user-images.githubusercontent.com/2306281/234665183-790a11ac-0ba1-47e1-98b1-e356220405a3.png" alt="figure 4" width="400" height="auto">
</div>

The AxisInfo analysis operates differently from both membar and allocation, as it traverses the call graph in topological order. This is necessary because function arguments may contain axis information that will be utilized by callee functions. As we do not implement optimizations like function cloning, each function has a single code base, and the axis information for an argument is determined as a conservative result of all axis information passed by the calling functions.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>	2023-04-28 14:59:04 -07:00
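
The call-graph handling described in the commit message above can be illustrated with a small sketch. This is not Triton's implementation (which lives in its C++ analysis code); the helper names `build_call_graph` and `post_order` and the example functions `kernel`, `B`, and `C` are hypothetical. It only shows the three ideas the message names: an edge map built from call edges, roots found as nodes with no incoming edges, and the two traversal orders (post-order for allocation/membar, topological order for AxisInfo).

```python
from collections import defaultdict


def build_call_graph(edges):
    """Store (caller, callee) edges in an edge map and extract the roots."""
    graph = defaultdict(list)   # caller -> list of callees
    nodes = set()
    has_incoming = set()
    for caller, callee in edges:
        graph[caller].append(callee)
        nodes.update((caller, callee))
        has_incoming.add(callee)
    # Roots are the functions that nobody calls.
    roots = sorted(nodes - has_incoming)
    return graph, roots


def post_order(graph, roots):
    """Visit callees before their callers (assumes no recursion)."""
    order, visited = [], set()

    def visit(fn):
        if fn in visited:
            return
        visited.add(fn)
        for callee in graph.get(fn, []):
            visit(callee)
        order.append(fn)

    for root in roots:
        visit(root)
    return order


# Hypothetical call graph: kernel calls B and C, and B also calls C.
edges = [("kernel", "B"), ("kernel", "C"), ("B", "C")]
graph, roots = build_call_graph(edges)

callees_first = post_order(graph, roots)        # order for a bottom-up analysis
callers_first = list(reversed(callees_first))   # a topological order
print(roots, callees_first, callers_first)
```

Reversing a depth-first post-order yields a valid topological order only for an acyclic graph, which matches the stated limitation that recursive calls are not supported.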
Zahi Moudallal	65fb36e34e	[BACKEND] Updated slice layout semantics, updated vectorization logic used for load/store ops. (#1587 )	2023-04-28 13:40:01 -07:00
Philippe Tillet	8f47bdcc92	[OPTIMIZER] Added kWidth attribute to DotOperandEncoding (#1584 ) This is a prerequisite for efficient mixed-precision matmul	2023-04-26 23:03:18 -07:00
Michael Melesse	2784b804d9	Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023	2023-04-26 12:04:21 -05:00
Philippe Tillet	192f889e4f	[BACKEND] refactor non-ldmatrix lds codepath for SharedToDotOperandMMAv2 (#1557 )	2023-04-20 18:04:50 -07:00
zahimoud	8f7424221f	[BACKEND] Decomposed getElemsPerThread to return a vector of the per-dim elements per thread (#1549 ) This is a prerequisite for updating the semantics of SliceEncodingAttr.	2023-04-19 16:52:16 -07:00
Michael Melesse	705d47d0dd	fix lit test issues This is a combination of 6 commits. install lit fix lit test fix lit test fix aot lit issues fix final lit tests add lit tests	2023-04-17 11:46:37 -05:00
Philippe Tillet	e5c7d2a83c	[FRONTEND] cleaned up language; added frontend function for `globaltimer` special register (#1525 )	2023-04-14 15:29:27 -07:00
Keren Zhou	fdf1c1f2a1	[DOCS] Fix documentation workflow (#1520 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-04-13 13:49:36 -07:00
Michael Melesse	3603483fc0	clean up previous platform functions	2023-04-13 13:20:08 -05:00
peterbell10	e152183570	[FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305 ) Fixes #1285 This changes `tt.reduce` to replace `redOp` by a region containing arbitrary code. For example, `tl.sum` is now lowered as: ```mlir %res = "tt.reduce"(%arg0) ({ ^bb0(%arg1: f32, %arg2: f32): %add = arith.addf %arg1, %arg2 : f32 tt.reduce.return %add : f32 }) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32> ``` Support for index reductions at the MLIR level is also dropped in favor of simultaneous reductions over multiple tensors, which generalizes the code without loss of performance. So for example `argmin` gets lowered as: ```mlir %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32> %9:2 = "tt.reduce"(%6, %8) ({ ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32): %14 = arith.cmpf olt, %arg4, %arg6 : f32 %15 = arith.cmpf ogt, %arg4, %arg6 : f32 %16 = arith.cmpi slt, %arg5, %arg7 : i32 %17 = arith.select %16, %arg5, %arg7 : i32 %18 = arith.select %15, %arg7, %17 : i32 %19 = arith.select %14, %arg5, %18 : i32 %20 = arith.cmpf olt, %arg4, %arg6 : f32 %21 = arith.select %20, %arg4, %arg6 : f32 tt.reduce.return %21, %19 : f32, i32 }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>) ```	2023-04-13 01:37:39 +00:00
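
The idea of expressing `argmin` as a simultaneous reduction over a value tensor and an index tensor can be sketched in plain Python. This is an illustration only, not Triton code: the function names `argmin_combine` and `argmin` are hypothetical, and `functools.reduce` stands in for `tt.reduce`. The combine function mirrors the select chain in the MLIR region above: take the index of the smaller value, and on a tie take the smaller index.

```python
from functools import reduce


def argmin_combine(lhs, rhs):
    """Combine two (value, index) pairs the way the reduce region does."""
    a, i = lhs
    b, j = rhs
    lt = a < b                        # arith.cmpf olt
    gt = a > b                        # arith.cmpf ogt
    tie_index = i if i < j else j     # cmpi slt + select: smaller index on ties
    index = j if gt else tie_index    # right-hand side wins if strictly smaller
    index = i if lt else index        # left-hand side wins if strictly smaller
    value = a if lt else b            # minimum value
    return value, index


def argmin(values):
    """Reduce over (value, index) pairs, like reducing two tensors at once."""
    return reduce(argmin_combine, zip(values, range(len(values))))


min_value, min_index = argmin([3.0, 1.0, 2.0, 1.0])
print(min_value, min_index)
```

Because the combine function carries both the running minimum and its index, no dedicated index-reduction operation is needed, which is the generalization the commit message describes.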
zahimoud	fd34b20fba	[BACKEND] Fixed bug in reduce; add tests	2023-04-11 18:09:18 -07:00
Keren Zhou	6d0ed41307	[BACKEND] Replace Func Dialect with custom triton ops (func, call, return) (#1502 ) MLIR currently only supports a custom inlining interface per dialect, so we cannot change the inlining decision of `func.func`. https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3 Could revert it back once they've designed a better inliner interface. Inlining attributes will be implemented in the next PR since this PR is already huge.	2023-04-10 21:08:40 -07:00
Philippe Tillet	b86425a28e	[TEST] made `lut_bmm` pipeline test more concise and specific (#1488 )	2023-04-08 19:17:35 -07:00
long.chen	f7ad8ae022	[Refine] remove const ref of mlir::Attribute (#1486 ) https://mlir.llvm.org/docs/DefiningDialects/AttributesAndTypes/ https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#f16-for-in-parameters-pass-cheaply-copied-types-by-value-and-others-by-reference-to-const ``` The C++ Attribute and Type classes in MLIR (like Ops, and many other things) are value-typed. This means that instances of Attribute or Type are passed around by-value, as opposed to by-pointer or by-reference. The Attribute and Type classes act as wrappers around internal storage objects that are uniqued within an instance of an MLIRContext. ```	2023-04-08 10:38:59 -07:00
Rahul Batra	a27b388df5	Merge remote-tracking branch 'upstream/main' into IFU_04-06-2023	2023-04-06 16:18:31 -05:00
Rahul Batra	ee0cf02429	clean up	2023-04-04 14:53:25 -05:00
Rahul Batra	4da67705a9	fix issue	2023-04-04 14:48:48 -05:00
Christian Sigg	01a93185a1	[BACKEND][OPTIMIZER] Switch from llvm::Optional to std::optional. (#1416 )	2023-04-04 09:06:28 -07:00
Rahul Batra	30f51f3b50	get Arch Info using HSA This is a combination of 5 commits. look up triple and warpsize with HSA This is a combination of 6 commits. add scripts create basic stub Add HSA This is a combination of 3 commits. add hsa move has file add hsa include and lib functional name string simplify gfx look up return warpsize clean up unnecessary imports remove scripts use tuple remove prints	2023-04-03 13:58:02 -05:00
Philippe Tillet	053af4e9f8	[FRONTEND] Refactor file hierarchy (#1464 ) The purpose of this PR is to remove some circular dependencies and separate concerns better in the frontend. It's still not perfect -- `triton.compile` still includes a few runtime architecture-specific components, but at least much better than before. This PR still assumes that AMD only supports empty kernels right now. Other PRs will follow to make the frontend support multiple devices in a more modular way.	2023-04-02 12:07:08 -07:00
Keren Zhou	28ea484dab	[BACKEND] Clean up type inference functions (#1451 ) And remove duplicate function definition.	2023-03-30 23:07:32 -07:00
Michael Melesse	5293288e77	[ROCM] Enable ROCM Backend #1.5: Address Remaining Comments from #1312 (#1434 ) This PR addresses the remaining issues from #1312. It does the following * LLVM String Join * adds comment to GCNBuilder Class --------- Co-authored-by: Rahul Batra <rahbatra@amd.com>	2023-03-28 17:23:57 -07:00


487 Commits