github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
goostavz	f1512bded1	Initial code merge of Hopper support (#2036) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and troubleshooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. We recommend trying them once version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Keren Zhou	ee864048b3	[FRONTEND][BACKEND] Add the `noinline` annotation for `triton.jit` (#1568)

# Introducing the `noinline` Parameter for Triton JIT Decorator

We're excited to introduce a new parameter, `noinline`, that can be added to the `jit` decorator in Triton. This parameter allows developers to specify that a particular Triton function should not be inlined into its callers. In this post, we'll dive into the syntax, purpose, and implementation details of this new feature.

## Syntax

To use the `noinline` parameter, simply add `noinline=True` to the `jit` decorator for the function that you don't want to be inlined. Here's an example:

```python
import triton
import triton.language as tl


# noinline=True keeps device_fn as a separate function instead of inlining it.
@triton.jit(noinline=True)
def device_fn(x, y, Z):
    z = x + y
    tl.store(Z, z)


def test_noinline():
    # An ordinary jitted kernel that reaches device_fn through a real call.
    @triton.jit
    def kernel(X, Y, Z):
        x = tl.load(X)
        y = tl.load(Y)
        device_fn(x, y, Z)
```

In this example, the `device_fn` function is decorated with `@triton.jit(noinline=True)`, indicating that it should not be inlined into its caller, `kernel`.

## Purpose

The `noinline` parameter serves several key purposes:

- Reducing code size: By preventing inlining, we can reduce the size of the compiled code.
- Facilitating debugging: Keeping functions separate can make it easier to debug the code.
- Avoiding common subexpression elimination (CSE) in certain cases: CSE can sometimes be avoided by using the `noinline` parameter to reduce register pressure.
- Enabling dynamic linking: This parameter makes it possible to dynamically link Triton functions.

## Implementation

The implementation of the `noinline` parameter involves significant changes to three analysis modules in Triton: Allocation, Membar, and AxisInfo. Prior to this update, these modules assumed that all Triton functions had been inlined into the root kernel function. With the introduction of non-inlined functions, we've had to rework these assumptions and make corresponding changes to the analyses.
### Call Graph and Limitations

<div style="text-align: center;"> <img src="https://user-images.githubusercontent.com/2306281/234663904-12864247-3412-4405-987b-6991cdf053bb.png" alt="figure 1" width="200" height="auto"> </div>

To address the changes, we build a call graph and perform all the analyses on the call graph instead of a single function. The call graph is constructed by traversing the call edges and storing them in an edge map. Roots are extracted by checking nodes with no incoming edges.

The call graph has certain limitations:

- It does not support recursive function calls, although this could be implemented in the future.
- It does not support dynamic function calls, where the function name is unknown at compilation time.

### Allocation

<div style="text-align: center;"> <img src="https://user-images.githubusercontent.com/2306281/234665110-bf6a2660-06fb-4648-85dc-16429439e72d.png" alt="figure 2" width="400" height="auto"> </div>

In Triton, shared memory allocation is achieved through two operations: `triton_gpu.convert_layout` and `triton_gpu.alloc_tensor`. The `convert_layout` operation allocates an internal tensor, which we refer to as a scratch buffer, while the `alloc_tensor` operation returns an allocated tensor and is thus known as an explicit buffer.

To accommodate the introduction of function calls, we are introducing a third type of buffer called a virtual buffer. Similar to scratch buffers, virtual buffers are allocated internally within the scope of a function call, and the buffers allocated by the called functions remain invisible to subsequent operations in the calling function. However, virtual buffers are distinct from scratch buffers in that the call operation itself does not allocate memory; instead, it specifies the total amount of memory required by all the child functions being called. The actual allocation of buffers is performed by individual operations within these child functions.
For example, when invoking edge e1, no memory is allocated, but the total amount of memory needed by function B is reserved. Notably, the amount of shared memory used by function B remains fixed across its call sites due to the consideration of dynamic control flows within each function.

An additional challenge to address is the calculation of shared memory offsets for functions within a call graph. While we can assume a shared memory offset starting at 0 for a single root function, this is not the case with a call graph, where we must determine each function's starting offset based on the call path. Although each function has a fixed memory consumption, the starting offset may vary. For instance, in Figure 2, the starting offset of function C through edges e1->e2 differs from that through edges e2->e4. To handle this, we accumulate the starting offset at each call site and pass it as an argument to the called function. Additionally, we amend both the function declaration and call sites by appending an offset variable.

### Membar

<div style="text-align: center;"> <img src="https://user-images.githubusercontent.com/2306281/234665157-844dd66f-5028-4ef3-bca2-4ca74b8f969d.png" alt="figure 3" width="300" height="auto"> </div>

The membar pass is dependent on the allocation analysis. Once the offset and size of each buffer are known, we conduct a post-order traversal of the call graph and analyze each function on an individual basis. Unlike previous analyses, we now return buffers that remain unsynchronized at the end of functions, allowing the calling function to perform synchronization in cases of overlap.

### AxisInfo

<div style="text-align: center;"> <img src="https://user-images.githubusercontent.com/2306281/234665183-790a11ac-0ba1-47e1-98b1-e356220405a3.png" alt="figure 4" width="400" height="auto"> </div>

The AxisInfo analysis operates differently from both membar and allocation, as it traverses the call graph in topological order.
This is necessary because function arguments may contain axis information that will be utilized by callee functions. As we do not implement optimizations like function cloning, each function has a single code base, and the axis information for an argument is determined as a conservative result of all axis information passed by the calling functions.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>	2023-04-28 14:59:04 -07:00
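
The call-graph and offset bookkeeping described in the `noinline` commit above (ee864048b3) can be illustrated with a small Python sketch. It is a simplified model and not Triton's C++ implementation: the function names `build_call_graph`, `total_bytes`, and `start_offsets` are hypothetical, the three-function graph and byte sizes are made up (they do not reproduce the figures), and it assumes a function's footprint is its own buffers plus its largest callee, with each callee placed directly after the caller's own buffers.

```python
from collections import defaultdict


def build_call_graph(edges):
    """Store call edges in an edge map; roots are nodes with no incoming edge."""
    edge_map = defaultdict(list)
    nodes, has_incoming = set(), set()
    for caller, callee in edges:
        edge_map[caller].append(callee)
        nodes.update((caller, callee))
        has_incoming.add(callee)
    return edge_map, sorted(nodes - has_incoming)


def total_bytes(fn, edge_map, local_bytes, memo=None, stack=()):
    """Fixed footprint of fn: its own buffers plus the largest callee footprint.

    Assumption for this sketch: callees invoked from the same function may
    reuse the same region, so only the largest one counts.
    """
    if fn in stack:
        raise ValueError("recursive calls are not supported")
    memo = {} if memo is None else memo
    if fn not in memo:
        children = (
            total_bytes(c, edge_map, local_bytes, memo, stack + (fn,))
            for c in edge_map.get(fn, [])
        )
        memo[fn] = local_bytes[fn] + max(children, default=0)
    return memo[fn]


def start_offsets(root, edge_map, local_bytes):
    """Map each call path to a start offset, accumulated at every call site."""
    result = {}

    def visit(fn, path, base):
        result[path] = base
        for callee in edge_map.get(fn, []):
            # The callee's region starts after the caller's own buffers.
            visit(callee, path + (callee,), base + local_bytes[fn])

    visit(root, (root,), 0)
    return result


# Made-up example: A calls B and C, and B also calls C.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
local_bytes = {"A": 64, "B": 32, "C": 16}

edge_map, roots = build_call_graph(edges)
footprint = total_bytes("A", edge_map, local_bytes)  # also rejects recursion
offsets = start_offsets("A", edge_map, local_bytes)
```

In this toy graph, function C has the same footprint on both paths but starts at a different offset depending on whether it is reached through B or directly from A, which is why the offset has to be passed along as an argument at each call site.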
Philippe Tillet	0f82fac60e	[BACKEND] move struct packing/unpacking to type converter and give a more explicit name (#1281) This is the first of a series of PRs meant to clean up how the backend handles the codegen for dot operand layouts	2023-03-05 16:04:29 -08:00
Philippe Tillet	408d1d7e87	[OPTIMIZER] Improved flash attention forward pass performance (#1075) - Fixed typo in instruction reordering pass - Minor additional optimizations for shared memory allocator - Optimized flash attention tutorial forward pass kernel	2023-01-19 06:46:01 +00:00
Philippe Tillet	589a18959e	[BACKEND] Make swizzled shared memory pointers compatible with non-blocked distributed layout (#1053) Notes: * Cleaned up implementation * Added comments * Re-using code between ConvertDistributedToShared and ConvertInsertSliceAsyncOp	2023-01-13 09:14:23 -08:00
goostavz	0e8590f1c9	[BACKEND] Add generic support of convert_layout from distributed to shared (#1025)	2022-12-30 11:29:58 -08:00
goostavz	1d3029faf8	[Backend] Add value cache in emitting indices calculation and some refinement (#1018) 1. Add an explicit value cache when emitting index calculations; 2. move the index-calculation emission logic into ConvertTritonGPUOpToLLVMPatternBase to avoid redundant template build cost. Refer to the discussion in this thread by @LyricZhao: https://triton-lang.slack.com/archives/C042VBSQWNS/p1671336755922969	2022-12-29 11:19:59 -08:00
Philippe Tillet	20100a7254	Merge `triton-mlir` branch - Complete rewrite of the backend from scratch (#1004) This PR merges the `triton-mlir` branch, in which we have been quietly rewriting the Triton backend from scratch to increase maintainability, stability and ultimately performance. Changes to the runtime are minimal, and this new version aims to remain backward-compatible with the previous commit. The legacy backend is now officially deprecated, but can still be accessed via the `legacy-backend` tag. Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Yan Chunwei <yanchunwei@outlook.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com> Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com> Co-authored-by: Yan Da <dyanab@connect.ust.hk> Co-authored-by: Jun Yang <yangjunpro@gmail.com> Co-authored-by: Ian Bearman <ianb@microsoft.com> Co-authored-by: Jason Ansel <jansel@jansel.net> Co-authored-by: Qingyi Liu <qingyil@nvidia.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com> Co-authored-by: Chenggang Zhao <lyricz@yeah.net> Co-authored-by: ben-zhang-609 <benzh609@gmail.com> Co-authored-by: dongdongl <dongdongl@nvidia.com>	2022-12-21 01:30:50 -08:00

8 Commits