github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Philippe Tillet	facc1dcbac	[TESTS] better matmul unit testing (#2098 )	2023-08-13 17:54:32 -07:00
Zahi Moudallal	a01c116f76	[FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090 )	2023-08-11 17:49:52 -07:00
Beal Wang	d1ce4c4950	[TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075 ) remove unnecessary skips. decompose UTs in persistent-warp-specialized-gemm into vintage and stylish	2023-08-10 06:57:04 +00:00
allatit23	8a610f7cf7	[HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066 )	2023-08-09 09:07:56 +00:00
Beal Wang	de47bba07d	[OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057 ) Co-authored-by: Biao Wang <biaow@nvidia.com>	2023-08-09 16:42:01 +08:00
allatit23	6d98a0899f	[HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063 )	2023-08-09 15:45:44 +08:00
allatit23	6dee55c912	[HOPPER][WS] fix TMA store hang in ws mode (#2056 )	2023-08-08 19:53:52 +08:00
ben-zhang-609	2a95d9bf0d	[Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-08 16:54:21 +08:00
allatit23	11cf334730	[hopper][ws] use per-agent thread idx by default (#2054 ) Co-authored-by: Allen Zhao <allzhao@nvidia.com>	2023-08-08 15:28:10 +08:00
goostavz	b525880d8b	[Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts in hopper tests (#2041 ) Co-authored-by: goostavz <gzhu@nvidia.com> Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>	2023-08-08 06:30:04 +00:00
ben-zhang-609	31e79aa384	[TESTS] remove get_proper_err, get_variant_golden (#2039 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:52:55 -07:00
Qingyi Liu	341f5b61be	[BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040 ) Make sure that other threads within CTA do not operate on mbarrier until it is initialized by thread 0. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 20:11:00 -07:00
Keren Zhou	30a331e628	[FRONTEND] Support jit functions without arguments (#2043 ) Issue https://github.com/openai/triton/issues/1973 Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 19:05:56 -07:00
Thomas	98523bcc48	[BACKEND] Support MMA V3 with float16 accumulator (#2049 ) Also fixes a bug exposed in convertLayout lowering for float16. We shouldn't be using cvt.pack.sat.u16.s32 to pack 16bits values as this needs to take a 32bits register. Also this prevented optimization at llvm ir level.	2023-08-07 15:55:44 -07:00
Phil Tillet	521cfae44d	[CI] disabled float32 perf regression tests	2023-08-07 12:43:16 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Yan Chunwei	89b0b79d75	[FRONTEND] fix the silent return issue in AOT launcher (#2013 ) In the current link.py, it produces the launcher code as below: ```python CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_am, int32_t stride_bk){ if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm % 16 == 0)) return matmul_fp16xfp16_16x16x16_688cc413_0d1d2d3d45d(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk); // ... if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0)) return matmul_fp16xfp16_16x16x16_7c0255bf_0d1d2d345(stream, gX, gY, gZ, C, A, B, stride_cm, stride_am, stride_bk); } ``` Note that, when the input does not match any of the if branches, it will do nothing, and the compiler should make it return 0 as a default behavior, which equals to `CUDA_SUCCESS`, this doesn't match the expectation. This PR adds a `return CUDA_VALUE_ERROR;` to the tail of launchers, and it produces code like: ```c++ CUresult matmul_fp16xfp16_16x16x16(CUstream stream, unsigned int gX, unsigned int gY, unsigned int gZ, CUdeviceptr C, CUdeviceptr A, CUdeviceptr B, int32_t stride_cm, int32_t stride_cn, int32_t stride_am, int32_t stride_ak, int32_t stride_bk, int32_t stride_bn){ if ((C % 16 == 0) && (A % 16 == 0) && (B % 16 == 0) && (stride_cm == 1) && (stride_cn == 1) && (stride_am == 1) && (stride_ak == 1) && (stride_bk % 16 == 0) && (stride_bn == 1)) return matmul_fp16xfp16_16x16x16_1f18a6da_0d1d2d3c4c5c6c7d8c(stream, gX, gY, gZ, C, A, B, stride_bk); return CUDA_ERROR_INVALID_VALUE; } ``` And it requires users to check the result in their application, which I think should match the initial AOT ideas.	2023-07-31 09:59:28 -07:00
Philippe Tillet	52c146f66b	[OPTIMIZER][BACKEND] significantly cleaner handling of mixed-precision kernels (#1949 ) we currently have a very janky approach to optimizing mixed-precision matmul workloads, where some layout combinations (e.g., NT matmul) were explicitly pattern-matched to take a more optimized codepath. Attempt at unifying all the codepaths to codegen cp.async failed, due to bugs in SharedToDotOperandMMAv2.cpp. This PR fixes said bugs, add some assertions for SharedToDotOperandMMAv2 modes that aren't well supported, and greatly simplify our handling of element-wise operations between load and conversions to DotOperand.	2023-07-28 10:29:42 -07:00
Bin Fan	2689f4a3b0	[TOOLS][AOT] some issues in equal_to_1 hint (#1998 ) - Change test_aot.py to actually use equal_to_1 hint - In the client function, equal_to_1 parameters are not specialized, because AOT clients may not know the details of Triton argument specialization, they still want to use the same parameter list as they write the Triton kernel. The generated kernels has specialized argument list, the generated dispatcher code will make sure the correct arguments from the original full argument list are passed. - Fixed a bug in _match_suffix in link.py. Previously it assumes each parameter has a suffix of either ‘d’ or ‘c’, but in fact sometimes a parameter doesn’t have a suffix, like 0d1d2d34c56c78c	2023-07-27 16:07:49 -07:00
Izzy Putterman	de6f053c0f	[TRITON][OPS] add Flash Attention v2 to Ops (#1970 ) I also dropped the do_scaled as it is no longer needed (no scaling done to the do in v2). --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-07-23 14:07:15 -07:00
Philippe Tillet	66eda76e45	[FRONTEND][BACKEND] no longer serialize float8e4b15 (#1979 ) We had a number of complains that the previous packed format was error-prone and may not yet be worth the 2 SASS instruction saved per 4 conversions	2023-07-21 22:44:55 -07:00
Philippe Tillet	28a61484bc	[FRONTEND] more leniency when converting to/from fp8e4b15 (#1969 )	2023-07-19 18:26:21 -07:00
Philippe Tillet	68124676c9	[FRONTEND][BACKEND] Fix trans for float8e4b15 (#1964 ) float8e4b15 is a packed type; it is incompatible with most of our layout conversions. For now, we just convert to float16.	2023-07-19 11:30:39 -07:00
Keren Zhou	bcfd990a88	[TESTS] Fix autopep8 error (#1948 )	2023-07-16 16:55:12 -07:00
Mehdi Amini	51fc42a568	[FRONTEND] fix AST IR generation for while loop nested inside other SCF (#1947 ) The process of visiting twice the body of the while didn't restore properly the insertion point, and was leaking the dummy block.	2023-07-15 10:17:29 -07:00
Alex Collins	80163a9c1e	[FRONTEND] Add support for default args in kernel wrappers (#1943 ) Fixes the case where setting default values for arguments in a kernel function signature results in a generated kernel wrapper function without these default values. For example: ``` @triton.jit def kernel(x, y, z=3): ... ... kernel[grid](x,y) ``` Co-authored-by: Philippe Tillet <phil@openai.com>	2023-07-14 21:32:47 +00:00
Yan Chunwei	d0c35b3b7d	Hot fix for AOT (#1939 ) This PR addresses the following issues encountered when using AOT kernels in our project: 1. When different signatures are set for the same Triton kernel, it can result in C functions with the same name. This is problematic because C does not support function overloading. 2. Currently, the AOT kernel always compiles with `num_warps=1`, as indicated [here](https://github.com/openai/triton/pull/1939/files#diff-293af646f671d3a895c453a8b175754e9d4ec4fc855bb939ffa4d6e9e91b07c6L83). However, the generated function includes a `numWarps` argument, which can cause errors when the specified value does not match. To resolve these issues, this PR does the following modifications: 1. Adds an 8-char hash key as a suffix to the generated function's signature. This ensures that different function names are generated in C when the argument dtype or constexpr value or even hint differs since we hope these kernels could be used in one C/C++ library. 2. Introduces a new flag called `num-warps` that allows manual specification of the `numWarps` value for AOT. This change hardcodes the specified value into the generated kernel.c and removes the `numWarps` argument from the generated function.	2023-07-14 09:16:43 +08:00
Keren Zhou	571c92f2a8	[CI] Fix CI kernel compare (#1931 ) With this PR, we find the latest merged PR that successfully passed "Integration Tests".	2023-07-12 10:06:34 -07:00
Keren Zhou	4795820014	[TESTS] Fix unmatched test names (#1933 )	2023-07-11 19:08:28 -07:00
Philippe Tillet	bf5acf46e2	[OPS] improved pointer arithmetic in attention (#1926 ) this provides an additional 3-4% speed-up in non-causal attention, which now tops at 155TFLOPS	2023-07-11 12:04:00 -07:00
Goran Flegar	bbc1ad16d8	[BACKEND] Vectorize s8 to bf16 casts (#1879 ) The code generated by LLVM ends up using 15 SASS instructions, while the inline PTX added here only uses 8. It might be possible to reduce this down to 6 if NVIDIA optimizes ptxas to use the byte selector in I2F for all bytes (right now, we still have some bit manipulation code generated for 2 out of 4 bytes). This change improves the performance of mixed precision matmul kernel with M=N=K=4096, where one operand is casted from s8 to bf16 from 140 TFlop/s to 165 TFlop/s on A100-40GB. Also refactors the ElementwiseOpConversionBase template to support vectorized operations, reducing the boilerplate needed for existing, and this new vectorized cast; and extends the casting test to process more than one element (so vectorized casts can be properly tested).	2023-07-11 09:24:05 -07:00
Philippe Tillet	8fe5524c75	[BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915 ) 0-bytes shared mem buffers don't materialize empty allocation buffers; this could lead to unnecessary barriers. note: reduceop code has become quite messy and will require some cleanup	2023-07-11 00:23:26 -07:00
Philippe Tillet	7e3ebbc4c8	[TESTING] now using cuda graphs for perf regression tests (#1925 )	2023-07-10 22:49:25 -07:00
Izzy Putterman	d39d78fa08	[OPS] Add more perf-tests, new features to FA (#1849 ) Adding new tests across the board for float32, bfloat16, non-powers-of-2 shapes (to test masks), and tests on sequence parallel for atomics. This also adds the sequence parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py. I am not sure about the best way to grab the baseline benchmarking numbers. I have access to V100s and A100s, but I saw on the tests it mentions " # A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". Current plan is to run CI here and use those numbers for baseline, then match against my GPUs as a sanity check. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-07-10 18:52:59 -07:00
peterbell10	ef947dac31	[FRONTEND] Fix tl.full with unsigned dtypes (#1919 ) Calling `tl.full` with an unsigned dtype currently fails with the error: ``` AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute 'get_uint8'") ``` This PR defines those functions rather than changing the calls to the signed versions so that we can use an unsigned argument type in C++ and avoid overflow for large uint64 values.	2023-07-10 09:36:22 -07:00
Philippe Tillet	5a722b5f74	[OPS][TESTS] Added float8 support in triton.ops.matmul (#1918 ) this also adds rather extensive testing for mixed precision mode, including `float8e4b15 x float8e5` and `float8e5 x float16`	2023-07-10 09:31:12 -07:00
Thomas	bd900e0a6f	[BACKEND] Fix reductions when number of unique element is smaller than layout (#1913 ) Fix calculation of unique number of threads within a warp. We need to consider the number of elements per thread in the calculation. Also change the layout test to integer sum in order to catch bugs with unique data as max reduction may hide those kind of problems.	2023-07-07 19:48:13 -07:00
Natalia Gimelshein	778ed64a66	[BACKEND] make sure we always bind to primary context in loadBinary (#1912 )	2023-07-07 14:28:03 -07:00
Keren Zhou	cc5a7ed52f	[FRONTEND][BACKEND] Materialize line info for triton kernels (#1902 ) `export TRITON_DISABLE_LINE_INFO=1` to disable the feature.	2023-07-07 16:03:44 -04:00
Philippe Tillet	61e17db4f7	[TESTS] smaller problem sizes in matmul tests (#1908 )	2023-07-06 14:36:02 -07:00
Philippe Tillet	6d1285e1ae	[FRONTEND][BACKEND] improved fp8 specs (#1906 ) This un-reverts commit `d4c941177e`.	2023-07-06 13:03:53 -07:00
Thomas	787cdff0cd	[TESTS] Enable parallel pytest in CI for CUDA (#1905 ) Run most of the pytest in parallel, this allows to speed up CI from 36min to 10min for A100 and 22min to 6min for H100. Some tests still need to run serially like runtime tests.	2023-07-06 11:40:33 -07:00
Thomas	ae0ee5248f	[FRONTEND] Add cumprod scan op (#1894 ) Add and test cumprod. This also allows testing a case of accumulation where 0 is not the identity element. Also add documention for scan functions.	2023-07-05 10:09:06 -07:00
Philippe Tillet	d57dcd9994	[FRONTEND] pattern-match ExpandDims + Mul + Reduce into DotOp (#1889 ) transforms e.g.: ``` x1 = tl.expand_dims(x0, axis=2) y1 = tl.expand_dims(y0, axis=0) z = tl.sum(x1 * y1, axis=1) ``` into ``` z = tl.dot(x0, y0) ``` uses allowTF32 = True by default.	2023-07-05 00:31:21 -07:00
Keren Zhou	16de0d6f55	[BACKEND] Recover the correct axis in the parent layout for reduce ops (#1885 ) https://github.com/openai/triton/issues/1883	2023-07-05 00:17:00 -07:00
Keren Zhou	aa55d3b5bf	[FRONTEND] Remove unnecessary replace in while op's after block (#1886 ) We've already updated the mapping between name and tensor before visiting each compound statement in the while op. As a result, any overwritten name gets up-to-date values updated in the while loop. And any unchanged livein names hold the original tensors.	2023-07-05 00:16:38 -07:00
Thomas	2e3182bab7	[BACKEND] Support scan on dimensions other that fastest moving one (#1863 ) This relax the restriction in the scan lowering to support layout where we scan along a dimension which isn't the fastest moving one. This is done by relaxing how we accesses elements during scanning and allow elements to be strided.	2023-06-30 12:40:48 -07:00
Oleg Shyshkov	66ed53d19d	[FRONTEND] Support mixed-precision inputs in triton.ops.matmul. (#1754 ) Support only combinations of float32 with float16 or bfloat16 for now. Shouldn't change anything for cases when input types match. That's a follow-up to the comment in my other PR: https://github.com/openai/triton/pull/1746#issuecomment-1579630016. --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-06-30 09:22:27 -07:00
Thomas	7a8a2da8ef	[BACKEND] Enable lowering of f16 constant matmul (#1870 ) Since the type expected for mma encoding is i32 when lowering f16 splat we need to pack f16 constants into a i32 value. This allows re-enabling the constant matmul unit test.	2023-06-30 07:00:25 -04:00
Philippe Tillet	f77015967d	Revert "[FRONTEND][BACKEND] improved fp8 specs (#1841 )" (#1865 ) This reverts commit `d4c941177e`.	2023-06-29 21:07:01 -04:00

1 2 3 4 5 ...

289 Commits