Commit Graph

26 Commits

Author SHA1 Message Date
Jason Furmanek
4c4e42e524 Merge remote-tracking branch 'openai/main' into IFU-230517
Conflicts:
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/test/unit/language/assert_helper.py
	python/triton/third_party/cuda/bin/ptxas
	test/Conversion/tritongpu_to_llvm.mlir

2023-05-17 15:03:42 +00:00
Michael Melesse
dbf6a638dd Merge pull request #210 from ROCmSoftwarePlatform/tt.load_issue
fix PyTorch 2.0 issues
2023-05-16 14:18:24 -04:00
Daniil Fukalov
7acc1cb707 [ROCM] Implement device_assert functionality. (#207)
Triton first prints the assert message to the stderr stream with the same
(refactored) helper function as `device_print`, and then ends the thread
execution.

Note: the s_endpgm instruction is used, since s_trap (generated from LLVM::Trap or LLVM::DebugTrap) has issues on some hardware.

Also restored a fix in `python/triton/compiler/compiler.py` that was lost
after one of the IFU merges.
2023-05-15 16:16:14 +02:00
Sophia Wisdom
9820899b38 [FRONTEND] Assert that for loop bounds must be ints (#1664) 2023-05-12 22:44:45 -07:00
Keren Zhou
674f9bf7a6 [FRONTEND] Better error messages for noinline functions (#1657)
```
at 10:18:def val_multiplier_noinline(val, i):
    return val * i

           ^
Function val_multiplier_noinline is marked noinline, but was called with non-scalar argument val:fp32[constexpr[128]]
```
2023-05-11 12:46:25 -07:00
Michael Melesse
8b55aa3203 bring back #ifdefs 2023-05-11 14:06:09 -05:00
Keren Zhou
147ec4384d [FRONTEND] Hotfix for contains_return_op (#1651)
`noinline` can be None, False, or True, so we have to check the callee
in the first two cases.
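The tri-state check can be sketched in plain Python (the `Fn` class and function names here are illustrative stand-ins, not Triton's internals):

```python
class Fn:
    """Toy stand-in for a JIT function with a tri-state noinline flag."""
    def __init__(self, name, noinline=None, has_return=False, callees=()):
        self.name = name
        self.noinline = noinline        # None, False, or True
        self.has_return = has_return
        self.callees = list(callees)

def contains_return_op(fn):
    # Recursively look for a return op in the call tree (illustrative).
    return fn.has_return or any(contains_return_op(c) for c in fn.callees)

def needs_return_check(fn):
    # `noinline` can be None, False, or True; only in the first two
    # cases (the call will be inlined) do we inspect the callee.
    return fn.noinline is not True and contains_return_op(fn)
```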
2023-05-10 15:14:53 -07:00
Keren Zhou
b19b274d93 [FRONTEND] Fix return op related control flow issues (#1637)
- Case 1: Return after static control flow is taken. Peel off
instructions after the first `return` for each basic block.

```python
if static_condition:
    tl.store(...)
    return
return
```

- Case 2: Return exists in both `if` and `else` branches of an inlined
`JITFunction`

```python
def foo():
    if dynamic_condition:
        return a
    else:
        return b
```

- Case 3: Return exists in a `JITFunction` from another module

```python
import module
if cond:
    a = module.func()
```

- Case 4: A chain of calls through undefined local variables

```python
import module
if cond:
    a = x
    a = a.to(tl.int32).to(tl.int32)
```

- Case 5: Call a function `func` without returning variables. `func` is
recognized as an `Expr` first instead of a `Call`.

```python
if cond:
    foo()
else:
    bar()
```

- Case 6: Call a `noinline` function. We don't need to check if the
function contains any return op.
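The peeling in Case 1 can be modeled on a plain Python AST (a simplified sketch; the real pass works inside Triton's code generator, not on raw `ast` nodes):

```python
import ast

def peel_after_return(body):
    # Keep statements up to and including the first top-level `return`;
    # drop everything after it (dead code), as in Case 1 above.
    out = []
    for stmt in body:
        out.append(stmt)
        if isinstance(stmt, ast.Return):
            break
    return out

# A tiny function with dead code after `return`.
tree = ast.parse("def f():\n    a = 1\n    return a\n    b = 2\n")
fn = tree.body[0]
fn.body = peel_after_return(fn.body)
```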
2023-05-09 12:51:14 -04:00
Michael Melesse
9cc141b12d assume ROCM device
This is a combination of 7 commits.

use pyt nightly with root

repro with pytorch unit test

hardcode isROCM to true

set is_cuda to False

ignore cc arg

clean up

match triton-mlir branch
2023-05-04 16:46:59 -05:00
Michaël Benesty
d196302cb0 [FRONTEND] make torch optional (#1604)
make torch optional to fix circular dependency issue
2023-05-02 21:56:25 -07:00
Keren Zhou
3aff0102a3 [FRONTEND] Fix calling local variables’ attribute functions in the if statement (#1597)
If `node.func` is an `ast.Attribute`, it won't cause an early return.
(Not sure if I interpret it correctly)

https://github.com/openai/triton/issues/1591
2023-04-30 15:41:16 -07:00
David MacLeod
4b072516e7 [FRONTEND] add architecture to hash to avoid invalid image on cubin load (#1593)
Closes https://github.com/openai/triton/issues/1556
https://github.com/openai/triton/issues/1512

The current hash used for caching the cubin does not include the
architecture. This leads to the following error when compiling against
one arch and running against another (with no code changes to trigger a
recompilation).
```
RuntimeError: Triton Error [CUDA]: device kernel image is invalid
```
I was not sure what unit tests would be appropriate here (if any).
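The fix can be illustrated with a toy cache key (a sketch; Triton's real key hashes more inputs, e.g. compiler version and options):

```python
import hashlib

def cache_key(src: str, arch: str) -> str:
    # Hashing only the source reproduces the bug above: two different
    # architectures map to the same cubin. Mixing in the target
    # architecture invalidates the cache when the arch changes.
    h = hashlib.sha256()
    h.update(src.encode())
    h.update(arch.encode())
    return h.hexdigest()
```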

Co-authored-by: davidma <davidma@speechmatics.com>
2023-04-29 19:32:10 +00:00
Keren Zhou
ee864048b3 [FRONTEND][BACKEND] Add the noinline annotation for triton.jit (#1568)
# Introducing the `noinline` Parameter for Triton JIT Decorator

We're excited to introduce a new parameter, `noinline`, that can be
added to the `jit` decorator in Triton. This parameter allows developers
to specify that a particular Triton function should not be inlined into
its callers. In this post, we'll dive into the syntax, purpose, and
implementation details of this new feature.

## Syntax

To use the `noinline` parameter, simply add `noinline=True` to the `jit`
decorator for the function that you don't want to be inlined. Here's an
example:

```python
@triton.jit(noinline=True)
def device_fn(x, y, Z):
    z = x + y
    tl.store(Z, z)

def test_noinline():
    @triton.jit
    def kernel(X, Y, Z):
        x = tl.load(X)
        y = tl.load(Y)
        device_fn(x, y, Z)
```

In this example, the `device_fn` function is decorated with
`@triton.jit(noinline=True)`, indicating that it should not be inlined
into its caller, `kernel`.

## Purpose

The `noinline` parameter serves several key purposes:

- Reducing code size: By preventing inlining, we can reduce the size of
the compiled code.
- Facilitating debugging: Keeping functions separate can make it easier
to debug the code.
- Avoiding common subexpression elimination (CSE) in certain cases: CSE
can sometimes be avoided by using the `noinline` parameter to reduce
register pressure.
- Enabling dynamic linking: This parameter makes it possible to
dynamically link Triton functions.

## Implementation

The implementation of the `noinline` parameter involves significant
changes to three analysis modules in Triton: *Allocation*, *Membar*, and
*AxisInfo*. Prior to this update, these modules assumed that all Triton
functions had been inlined into the root kernel function. With the
introduction of non-inlined functions, we've had to rework these
assumptions and make corresponding changes to the analyses.

### Call Graph and Limitations

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234663904-12864247-3412-4405-987b-6991cdf053bb.png"
alt="figure 1" width="200" height="auto">
</div>

To address the changes, we build a call graph and perform all the
analyses on the call graph instead of a single function. The call graph
is constructed by traversing the call edges and storing them in an edge
map. Roots are extracted by checking nodes with no incoming edges.

The call graph has certain limitations:

- It does not support recursive function calls, although this could be
implemented in the future.
- It does not support dynamic function calls, where the function name is
unknown at compilation time.
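The construction described above can be sketched in plain Python (illustrative only; the real implementation is in C++):

```python
from collections import defaultdict

def build_call_graph(calls):
    # Store call edges (caller -> callee) in an edge map; roots are the
    # nodes that never appear as a callee, i.e. have no incoming edges.
    edges = defaultdict(list)
    nodes, callees = set(), set()
    for caller, callee in calls:
        edges[caller].append(callee)
        nodes.update((caller, callee))
        callees.add(callee)
    return dict(edges), sorted(nodes - callees)
```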

### Allocation

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665110-bf6a2660-06fb-4648-85dc-16429439e72d.png"
alt="figure 2" width="400" height="auto">
</div>

In Triton, shared memory allocation is achieved through two operations:
`triton_gpu.convert_layout` and `triton_gpu.alloc_tensor`. The
`convert_layout` operation allocates an internal tensor, which we refer
to as a *scratch* buffer, while the `alloc_tensor` operation returns an
allocated tensor and is thus known as an *explicit* buffer.

To accommodate the introduction of function calls, we are introducing a
third type of buffer called a *virtual* buffer. Similar to scratch
buffers, virtual buffers are allocated internally within the scope of a
function call, and the buffers allocated by the called functions remain
invisible to subsequent operations in the calling function. However,
virtual buffers are distinct from scratch buffers in that the call
operation itself does not allocate memory—instead, it specifies the
total amount of memory required by all the child functions being called.
The actual allocation of buffers is performed by individual operations
within these child functions. For example, when invoking edge e1, no
memory is allocated, but the total amount of memory needed by function B
is reserved. Notably, the amount of shared memory used by function B
remains fixed across its call sites, since dynamic control flow within
each function is already taken into account.

An additional challenge to address is the calculation of shared memory
offsets for functions within a call graph. While we can assume a shared
memory offset starting at 0 for a single root function, this is not the
case with a call graph, where we must determine each function's starting
offset based on the call path. Although each function has a fixed memory
consumption, the starting offset may vary. For instance, in Figure 2,
the starting offset of function C through edges e1->e2 differs from that
through edges e2->e4. To handle this, we accumulate the starting offset
at each call site and pass it as an argument to the called function.
Additionally, we amend both the function declaration and call sites by
appending an offset variable.
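The offset accumulation can be modeled as follows (the names and the `sizes`/`callees` maps are illustrative):

```python
def assign_offsets(func, sizes, callees, base=0, out=None):
    # Record this function's starting offset, then pass the accumulated
    # offset down to each callee (appended as an extra argument in the
    # real pass). `sizes` maps each function to its fixed footprint;
    # a function reached via different call paths gets path-specific
    # offsets, as with function C in Figure 2.
    if out is None:
        out = []
    out.append((func, base))
    for callee in callees.get(func, []):
        assign_offsets(callee, sizes, callees, base + sizes[func], out)
    return out
```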

### Membar

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665157-844dd66f-5028-4ef3-bca2-4ca74b8f969d.png"
alt="figure 3" width="300" height="auto">
</div>

The membar pass is dependent on the allocation analysis. Once the offset
and size of each buffer are known, we conduct a post-order traversal of
the call graph and analyze each function on an individual basis. Unlike
previous analyses, we now return buffers that remain unsynchronized at
the end of functions, allowing the calling function to perform
synchronization in cases of overlap.

### AxisInfo

<div style="text-align: center;">
<img
src="https://user-images.githubusercontent.com/2306281/234665183-790a11ac-0ba1-47e1-98b1-e356220405a3.png"
alt="figure 4" width="400" height="auto">
</div>

The AxisInfo analysis operates differently from both membar and
allocation, as it traverses the call graph in topological order. This is
necessary because function arguments may contain axis information that
will be utilized by callee functions. Because we do not implement
optimizations like function cloning, each function has a single copy of
its code, and the axis information for an argument is the conservative
combination of the axis information passed by all calling functions.
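As a concrete instance of such a conservative result: if divisibility is the tracked axis info, the callee may only assume the gcd of what all call sites guarantee (a simplified model, not the actual AxisInfo lattice):

```python
from functools import reduce
from math import gcd

def join_divisibility(per_call_site):
    # Each caller may pass a pointer with a different alignment; the
    # callee must assume the weakest guarantee, i.e. the gcd of the
    # divisibilities seen across all call sites.
    return reduce(gcd, per_call_site)
```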

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-28 14:59:04 -07:00
Michael Melesse
2784b804d9 Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023 2023-04-26 12:04:21 -05:00
Zahi Moudallal
4963d67cd3 [FRONTEND] Use ttgir module num-warps instead of default value (#1576)
Use ttgir num-warps attribute instead of default value.
2023-04-25 08:22:49 -07:00
cctry
3e213dccb1 [FRONTEND] Make lru_cache compatible for Python 3.7 or older (#1552)
Change the usage of the LRU cache decorator from `@functools.lru_cache`
to `@functools.lru_cache()`.
The former raises `TypeError('Expected maxsize to be an integer or
None')` on Python 3.7 or older.
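For reference, the portable form looks like this (the decorated function is illustrative):

```python
import functools

# The called form `lru_cache()` works on every Python version; the bare
# form `@functools.lru_cache` is only accepted from Python 3.8 onward,
# where the decorator learned to take the function directly.
@functools.lru_cache()
def square(x):
    return x * x
```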
2023-04-20 16:14:32 -07:00
Keren Zhou
fef8150b65 [FRONTEND] Remove debug print in code_gen (#1550) 2023-04-19 17:13:01 -07:00
Alexander Efimov
fe612b1fc7 fix rebase issues 2023-04-18 18:16:59 +02:00
Bert Maher
bfd1f65ac7 [FRONTEND] cache path to ptxas (#1526)
When running python 3.8, I've found that process creation gets slower
over time (e.g. after creating a CUDA context, it can take 50-300ms per
subprocess.run), and we do one of these calls to `ptxas --version` for
every kernel, so a model with thousands of kernels can end up spending
substantial time just calling ptxas redundantly.
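The caching idea can be sketched with `functools.lru_cache` (illustrative; the real fix caches the resolved ptxas path inside Triton):

```python
import functools
import subprocess

@functools.lru_cache()
def tool_version(path):
    # Run `<tool> --version` once per path and cache the output, so a
    # model with thousands of kernels pays the subprocess cost only once.
    return subprocess.run([path, "--version"],
                          capture_output=True, text=True).stdout
```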

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-14 17:01:42 +00:00
Philippe Tillet
c0d86d3b04 [RUNTIME] refactor driver (#1515)
Improved separation between different backends
2023-04-12 23:50:44 -07:00
peterbell10
e152183570 [FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305)
Fixes #1285

This changes `tt.reduce` to replace `redOp` by a region containing
arbitrary code. For example, `tl.sum` is now lowered as:
```mlir
%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>
```
Support for index reductions at the MLIR level is also dropped in favor
of simultaneous reductions over multiple tensors, which generalizes the
code without loss of performance. So, for example, `argmin` gets lowered
as:
```mlir
  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %14 = arith.cmpf olt, %arg4, %arg6 : f32
    %15 = arith.cmpf ogt, %arg4, %arg6 : f32
    %16 = arith.cmpi slt, %arg5, %arg7 : i32
    %17 = arith.select %16, %arg5, %arg7 : i32
    %18 = arith.select %15, %arg7, %17 : i32
    %19 = arith.select %14, %arg5, %18 : i32
    %20 = arith.cmpf olt, %arg4, %arg6 : f32
    %21 = arith.select %20, %arg4, %arg6 : f32
    tt.reduce.return %21, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
```
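The same generalization can be modeled in plain Python: `argmin` becomes a simultaneous pairwise reduction over a value tensor and an index tensor (a model of the semantics, not Triton API):

```python
from functools import reduce

def combine(a, b):
    # Mirrors the region body above: compare values, keep the smaller
    # value and its index (ties keep the smaller index, as `slt` does).
    (va, ia), (vb, ib) = a, b
    return (va, ia) if va < vb or (va == vb and ia < ib) else (vb, ib)

def argmin(values):
    # A simultaneous reduction over two "tensors": values and indices.
    return reduce(combine, zip(values, range(len(values))))
```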
2023-04-13 01:37:39 +00:00
Rahul Batra
9a8e334859 fix bitcode path issue 2023-04-11 12:44:26 -05:00
Keren Zhou
6d0ed41307 [BACKEND] Replace Func Dialect with custom triton ops (func, call, return) (#1502)
MLIR currently only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.


https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3

Could revert it back once they've designed a better inliner interface.

Inlining attributes will be implemented in the next PR since this PR is
already huge.
2023-04-10 21:08:40 -07:00
mcskatkat
82ec1a89ea [FRONTEND] code_generator.py TODOs fixed & removed (#1484)
Handled TODOs that were waiting for the circular import issue to be
resolved
2023-04-07 22:05:46 -07:00
Ian O'Connell
bc0b007e4b [FRONTEND] Allow cache manager to be overridden, and tweak apis to easier work with remote caches (#1478)
The changes here come with a few separate bits:

- Allow replacing the cache manager with an ENV variable to make it
pluggable
- Make the `make_path` API private since it leaks some internal bits
of the cache and allows file access. Use a get operation instead.
- The `compile` operation produces several small files as part of a
single compile pipeline, which can perform poorly with remote caches.
Also, some operations like `_triton.get_shared_memory_size` only work
when everything (or nothing) is cached; they segfault when some key
files are missing. Grouping these files as a single entity avoids
that.
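The pluggable-manager idea might look like this (the environment-variable name and the `module:Class` format are assumptions of this sketch, not necessarily the merged API):

```python
import importlib
import os

class FileCacheManager:
    # Default on-disk manager (a stand-in for illustration).
    def get(self, key):
        return None

def make_cache_manager():
    # Pick the cache manager class from an environment variable so that
    # remote-cache implementations can be plugged in without code changes.
    spec = os.environ.get("TRITON_CACHE_MANAGER")
    if not spec:
        return FileCacheManager()
    module_name, _, class_name = spec.partition(":")
    return getattr(importlib.import_module(module_name), class_name)()
```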
2023-04-07 13:38:28 -07:00
Philippe Tillet
053af4e9f8 [FRONTEND] Refactor file hierarchy (#1464)
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.

This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
2023-04-02 12:07:08 -07:00