github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Keren Zhou	08c1658957	[FRONTEND] Accommodate new triton IR format (#2294 ) - Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`). - Support parsing function attribute, though not used yet.	2023-09-14 09:03:23 -07:00
Bin Fan	dad83f9dcb	[TOOLS] Add support for autotuning AOT kernel (#2123 ) This PR makes the following change to AOT kernel - Allow the client to generate AOT kernels with different sets of constexprs and meta-parameters. Each combination of constexpr set and meta-parameters is referred to an "algo". Within an algo client can still give different hints about integer arguments. - Add a API int ${kernle_name}_get_num_algos() that returns the total number of algos. - Add a algo_id to allow client to the generated kernel to select the algo - Remove gX, gY and gZ from the kernel parameter list. This is because the launch grid is usually different with different algos, and the client should not need to care about how to compute the launch grid for each algo. Instead, we ask the client to pass the expression of computing gX, gY and gZ for compile.py (when AOT kernels are generated). The expression can only use kernel parameter or const values. - We also change the testing flow. Now we first build the kernels into a shared library libkernel.so, then the client test.c code is built and link with libkernel.so. This is closer to a typical AOT kernel usage flow.	2023-08-23 09:38:29 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Yan Chunwei	89b0b79d75	[FRONTEND] fix the silent return issue in AOT launcher (#2013 ) In the current link.py, it produces the launcher code as below: ```python CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){ if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0)) return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk); // ... if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0)) return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk); } ``` Note that, when the input does not match any of the if branches, it will do nothing, and the compiler should make it return 0 as a default behavior, which equals to `CUDA_SUCCESS`, this doesn't match the expectation. This PR adds a `return CUDA_VALUE_ERROR;` to the tail of launchers, and it produces code like: ```c++ CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){ if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1)) return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk); return CUDA_ERROR_INVALID_VALUE; } ``` And it requires users to check the result in their application, which I think should match the initial AOT ideas.	2023-07-31 09:59:28 -07:00
Bin Fan	2689f4a3b0	[TOOLS][AOT] some issues in equal_to_1 hint (#1998 ) - Change test_aot.py to actually use equal_to_1 hint - In the client function, equal_to_1 parameters are not specialized, because AOT clients may not know the details of Triton argument specialization, they still want to use the same parameter list as they write the Triton kernel. The generated kernels has specialized argument list, the generated dispatcher code will make sure the correct arguments from the original full argument list are passed. - Fixed a bug in _match_suffix in link.py. Previously it assumes each parameter has a suffix of either ‘d’ or ‘c’, but in fact sometimes a parameter doesn’t have a suffix, like 0d1d2d34c56c78c	2023-07-27 16:07:49 -07:00
Yan Chunwei	d0c35b3b7d	Hot fix for AOT (#1939 ) This PR addresses the following issues encountered when using AOT kernels in our project: 1. When different signatures are set for the same Triton kernel, it can result in C functions with the same name. This is problematic because C does not support function overloading. 2. Currently, the AOT kernel always compiles with `num_warps=1`, as indicated [here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83). However, the generated function includes a `numWarps` argument, which can cause errors when the specified value does not match. To resolve these issues, this PR does the following modifications: 1. Adds an 8-char hash key as a suffix to the generated function's signature. This ensures that different function names are generated in C when the argument dtype or constexpr value or even hint differs since we hope these kernels could be used in one C/C++ library. 2. Introduces a new flag called `num-warps` that allows manual specification of the `numWarps` value for AOT. This change hardcodes the specified value into the generated kernel.c and removes the `numWarps` argument from the generated function.	2023-07-14 09:16:43 +08:00
Philippe Tillet	4c0e3d907e	[TOOLS] improved ahead-of-time compiler (#1805 ) This is a revival of @gaxler initial ahead-of-time compiler proposal. Code was simplified and some constraints were relaxed (i.e., we now execute the entire file provided vs just the kernel AST) to promote maintainability. A basic unit test was added, though it does not test specialization right now. co-authored by: Gregory Axler, thexler <g.axler@gmail.com>	2023-06-21 01:02:58 -07:00

7 Commits