Commit Graph

935 Commits

Author SHA1 Message Date
Keren Zhou
571c92f2a8 [CI] Fix CI kernel compare (#1931)
With this PR, we find the latest merged PR that successfully passed
"Integration Tests".
2023-07-12 10:06:34 -07:00
Izzy Putterman
c615ce944c [FRONTEND] use local bindings in triton.cc (#1932)
Another follow-up to the relative imports, this time dealing with the
bindings.
2023-07-12 02:19:48 +00:00
Keren Zhou
4795820014 [TESTS] Fix unmatched test names (#1933) 2023-07-11 19:08:28 -07:00
Stonepia
d50e32fab7 [FRONTEND] fix the hard code builder.arch that could block third_party tests (#1859)
For CUDA devices, `builder.arch` is an int.
For third-party devices, this line would raise a TypeError. For example:

```
TypeError: '<' not supported between instances of 'dict' and 'int'
```

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-07-11 19:06:35 -07:00
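A minimal sketch of the failure mode described in the commit above: comparing a non-int `arch` (some third-party backends use a dict) against an int raises exactly this TypeError. The function and dict contents here are illustrative, not the repo's actual code.

```python
def check_arch(arch):
    # the old hard-coded comparison assumed an int CUDA arch
    return arch < 80

assert check_arch(75) is True  # fine: builder.arch is an int on CUDA

try:
    check_arch({"device": "xpu"})  # a third-party arch may not be an int
except TypeError as err:
    message = str(err)

print(message)  # '<' not supported between instances of 'dict' and 'int'
```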
Philippe Tillet
bf5acf46e2 [OPS] improved pointer arithmetic in attention (#1926)
this provides an additional 3-4% speed-up in non-causal attention, which
now tops out at 155 TFLOPS
2023-07-11 12:04:00 -07:00
Daniyal khan
b70d07aafe [BUILD][DOCS] updated setup.py and documentation (#1930) 2023-07-11 11:46:28 -07:00
Phil Tillet
041f1144e8 [DOCS] fixed flash_attn causal argument in tutorial 2023-07-11 09:28:20 -07:00
Goran Flegar
bbc1ad16d8 [BACKEND] Vectorize s8 to bf16 casts (#1879)
The code generated by LLVM ends up using 15 SASS instructions, while the
inline PTX added here only uses 8. It might be possible to reduce this
down to 6 if NVIDIA optimizes ptxas to use the byte selector in I2F for
all bytes (right now, we still have some bit manipulation code generated
for 2 out of 4 bytes).

This change improves the performance of mixed precision matmul kernel
with M=N=K=4096, where one operand is casted from s8 to bf16 from 140
TFlop/s to 165 TFlop/s on A100-40GB.

Also refactors the ElementwiseOpConversionBase template to support
vectorized operations, reducing the boilerplate needed for existing
casts and this new vectorized cast; and extends the casting test to
process more than one element (so vectorized casts can be properly
tested).
2023-07-11 09:24:05 -07:00
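The commit above concerns PTX-level codegen, but the s8 → bf16 cast itself can be sketched in numpy (which has no native bf16) by converting to f32 and keeping the upper 16 bits; all int8 values are exactly representable in bf16, so the round trip is lossless. This is an illustration of the conversion, not the commit's inline PTX.

```python
import numpy as np

x = np.array([-128, -1, 0, 127], dtype=np.int8)
f32 = x.astype(np.float32)
# bf16 is the top 16 bits of the f32 bit pattern
bf16_bits = (f32.view(np.uint32) >> 16).astype(np.uint16)
# reconstruct f32 from the bf16 bits to check the round trip
back = (bf16_bits.astype(np.uint32) << 16).view(np.float32)
assert back.tolist() == [-128.0, -1.0, 0.0, 127.0]
```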
Philippe Tillet
8fe5524c75 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-byte shared-mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-07-11 00:23:26 -07:00
Philippe Tillet
7e3ebbc4c8 [TESTING] now using cuda graphs for perf regression tests (#1925) 2023-07-10 22:49:25 -07:00
danny.jang
4a20d5010b [FRONTEND] Fix an inspection warning (#1914)
"Expected type 'SupportsIndex', got 'constexpr' instead" is no longer
reported.
2023-07-10 21:30:59 -07:00
Izzy Putterman
d39d78fa08 [OPS] Add more perf-tests, new features to FA (#1849)
Adding new tests across the board for float32, bfloat16, non-powers-of-2
shapes (to test masks), and tests on sequence parallel for atomics. This
also adds the sequence parallel features from
https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py.
I am not sure about the best way to grab the baseline benchmarking
numbers. I have access to V100s and A100s, but I saw on the tests it
mentions " # A100 in the CI server is slow-ish for some reason.
# On some other servers, we are getting about 90% peak for 8kx8x8k
float16". Current plan is to run CI here and use those numbers for
baseline, then match against my GPUs as a sanity check.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-07-10 18:52:59 -07:00
peterbell10
e3d9478d31 [OPTIMIZER] Add pass to move broadcasts after elementwise operations (#1811)
This adds a pass that tries to reduce the shape of tensor arguments to
element-wise operations by moving splat and broadcast operations later
in the graph. So, for example say we have:

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```

Today this results in duplicate `sin` calls:
```
    %27 = llvm.fmul %26, %3  : f32
    %28 = llvm.call @__nv_sinf(%27) : (f32) -> f32
    %29 = llvm.call @__nv_sinf(%27) : (f32) -> f32
```

The duplicate `llvm.fmul` calls are eliminated via CSE, but `llvm.call`
doesn't get CSE'd because it might be impure.

After this change, the sin is done on a scalar value in the triton IR
and splatted at the very end, so no duplicate calculation happens within
a thread.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-10 11:44:38 -07:00
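A numpy sketch of the algebraic identity the pass above exploits: applying an elementwise op before broadcasting gives the same result as applying it after, so the expensive call (`sin` here) can run once on the scalar and then be splatted.

```python
import numpy as np

scalar = np.float32(0.5)
# op after broadcast: conceptually one sin per element
eager = np.sin(np.broadcast_to(scalar, (8,)))
# op before broadcast: one sin, then a splat
lazy = np.broadcast_to(np.sin(scalar), (8,))
assert np.allclose(eager, lazy)
```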
peterbell10
ef947dac31 [FRONTEND] Fix tl.full with unsigned dtypes (#1919)
Calling `tl.full` with an unsigned dtype currently fails with the error:
```
AttributeError("'triton._C.libtriton.triton.ir.builder' object has no attribute
'get_uint8'")
```

This PR defines those functions rather than changing the calls to the
signed versions so that we can use an unsigned argument type in C++ and
avoid overflow for large uint64 values.
2023-07-10 09:36:22 -07:00
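A small sketch of why the commit above keeps the argument unsigned in C++: the maximum uint64 value does not fit in a signed 64-bit integer, so routing it through a signed builder would wrap it. Shown here with `struct` reinterpretation rather than Triton's bindings.

```python
import struct

big = 2**64 - 1                              # max uint64
packed = struct.pack("<Q", big)              # representable as unsigned
signed_view, = struct.unpack("<q", packed)   # reinterpret as signed int64
assert signed_view == -1                     # wraps: a signed type can't hold it
```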
Philippe Tillet
5a722b5f74 [OPS][TESTS] Added float8 support in triton.ops.matmul (#1918)
this also adds rather extensive testing for mixed precision mode,
including `float8e4b15 x float8e5` and `float8e5 x float16`
2023-07-10 09:31:12 -07:00
Philippe Tillet
dadf7a9a50 [TUTORIAL] Faster flash attention; added non-causal (#1917) 2023-07-09 13:38:06 -07:00
Thomas
bd900e0a6f [BACKEND] Fix reductions when the number of unique elements is smaller than the layout (#1913)
Fix the calculation of the number of unique threads within a warp: we
need to consider the number of elements per thread. Also change the
layout test to an integer sum in order to catch bugs with unique data,
as a max reduction may hide those kinds of problems.
2023-07-07 19:48:13 -07:00
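A toy illustration of the test-design point in the commit above: if a bug causes an element to be reduced twice, a max reduction gives the same answer and hides it, while an integer sum does not.

```python
import numpy as np

data = np.array([3, 1, 3, 2])                     # input to reduce
buggy = np.concatenate([data, data[:1]])          # simulate one element counted twice
assert buggy.max() == data.max()                  # max hides the duplication
assert buggy.sum() != data.sum()                  # integer sum exposes it
```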
Natalia Gimelshein
778ed64a66 [BACKEND] make sure we always bind to primary context in loadBinary (#1912) 2023-07-07 14:28:03 -07:00
Bert Maher
38d767ea93 [FRONTEND] fix memory leak caused by retaining args to autotuned kernel (#1911) 2023-07-07 20:58:29 +00:00
Keren Zhou
cc5a7ed52f [FRONTEND][BACKEND] Materialize line info for triton kernels (#1902)
`export TRITON_DISABLE_LINE_INFO=1` to disable the feature.
2023-07-07 16:03:44 -04:00
Philippe Tillet
61e17db4f7 [TESTS] smaller problem sizes in matmul tests (#1908) 2023-07-06 14:36:02 -07:00
Philippe Tillet
6d1285e1ae [FRONTEND][BACKEND] improved fp8 specs (#1906)
This un-reverts commit
d4c941177e.
2023-07-06 13:03:53 -07:00
Thomas
787cdff0cd [TESTS] Enable parallel pytest in CI for CUDA (#1905)
Run most of the pytest suite in parallel; this speeds up CI from
36 min to 10 min on A100 and from 22 min to 6 min on H100. Some tests,
such as the runtime tests, still need to run serially.
2023-07-06 11:40:33 -07:00
Thomas
ae0ee5248f [FRONTEND] Add cumprod scan op (#1894)
Add and test cumprod. This also allows testing a case of accumulation
where 0 is not the identity element.
Also add documentation for the scan functions.
2023-07-05 10:09:06 -07:00
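A numpy sketch of the "0 is not the identity" point above: for a product scan the identity is 1, and once a 0 appears every later prefix product collapses to 0, which exercises the accumulator differently from a sum.

```python
import numpy as np

x = np.array([2.0, 0.0, 3.0, 4.0], dtype=np.float32)
prefix = np.cumprod(x)
# after the 0, every prefix product stays 0
assert prefix.tolist() == [2.0, 0.0, 0.0, 0.0]
```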
Philippe Tillet
d57dcd9994 [FRONTEND] pattern-match ExpandDims + Mul + Reduce into DotOp (#1889)
transforms e.g.:
```
x1 = tl.expand_dims(x0, axis=2)
y1 = tl.expand_dims(y0, axis=0)
z = tl.sum(x1 * y1, axis=1)
```

into
```
z = tl.dot(x0, y0) 
```

uses allowTF32 = True by default.
2023-07-05 00:31:21 -07:00
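The identity behind the pattern-match above can be checked in numpy: broadcasting the two expanded operands, multiplying, and reducing over the shared axis is exactly a matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3)).astype(np.float32)
y0 = rng.standard_normal((3, 5)).astype(np.float32)

x1 = x0[:, :, None]               # expand_dims(x0, axis=2) -> (4, 3, 1)
y1 = y0[None, :, :]               # expand_dims(y0, axis=0) -> (1, 3, 5)
z_sum = (x1 * y1).sum(axis=1)     # broadcast-multiply then reduce
z_dot = x0 @ y0                   # the fused DotOp equivalent
assert np.allclose(z_sum, z_dot, atol=1e-5)
```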
Yongjik Kim
387cdc8fe9 [FRONTEND] improve error message. (#1887) 2023-07-05 00:19:28 -07:00
Christopher Hesse
cc93356c4b [DOCS] update print docs (#1884) 2023-07-05 00:19:12 -07:00
Keren Zhou
16de0d6f55 [BACKEND] Recover the correct axis in the parent layout for reduce ops (#1885)
https://github.com/openai/triton/issues/1883
2023-07-05 00:17:00 -07:00
Keren Zhou
aa55d3b5bf [FRONTEND] Remove unnecessary replace in while op's after block (#1886)
We've already updated the mapping between names and tensors before
visiting each compound statement in the while op. As a result, any
overwritten name gets an up-to-date value inside the while loop, and
any unchanged live-in names hold the original tensors.
2023-07-05 00:16:38 -07:00
Ying Hang Eng
55eb32dff9 Fix tl.device_assert compilation error (#1875)
As mentioned in #1769, we set the file name and function name to
'unknown' and lineno to 0 if the frame is None
2023-06-30 22:19:25 +00:00
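A minimal sketch of the fallback described above, assuming a helper of this shape (the name `debug_info` is illustrative): when no Python frame is available, substitute placeholder debug info instead of crashing.

```python
def debug_info(frame):
    # no frame available: fall back to placeholder values
    if frame is None:
        return ("unknown", "unknown", 0)
    return (frame.f_code.co_filename, frame.f_code.co_name, frame.f_lineno)

assert debug_info(None) == ("unknown", "unknown", 0)
```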
Thomas
2e3182bab7 [BACKEND] Support scan on dimensions other than the fastest moving one (#1863)
This relaxes the restriction in the scan lowering to support layouts
where we scan along a dimension that isn't the fastest-moving one. This
is done by relaxing how we access elements during scanning, allowing
the elements to be strided.
2023-06-30 12:40:48 -07:00
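In numpy terms, the generalization above is the difference between scanning along the contiguous axis and scanning along a strided one; the lowering now supports both.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)        # row-major: axis 1 is fastest-moving
fast = np.cumsum(a, axis=1)           # scan along the contiguous dimension
slow = np.cumsum(a, axis=0)           # scan along a strided dimension
assert fast.tolist() == [[0, 1, 3], [3, 7, 12]]
assert slow.tolist() == [[0, 1, 2], [3, 5, 7]]
```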
Oleg Shyshkov
66ed53d19d [FRONTEND] Support mixed-precision inputs in triton.ops.matmul. (#1754)
Support only combinations of float32 with float16 or bfloat16 for now.
Shouldn't change anything for cases where the input types match.

That's a follow-up to the comment in my other PR:
https://github.com/openai/triton/pull/1746#issuecomment-1579630016.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-06-30 09:22:27 -07:00
Thomas
7a8a2da8ef [BACKEND] Enable lowering of f16 constant matmul (#1870)
Since the type expected for the mma encoding is i32, when lowering an
f16 splat we need to pack the f16 constants into an i32 value. This
allows re-enabling the constant-matmul unit test.
2023-06-30 07:00:25 -04:00
Philippe Tillet
f77015967d Revert "[FRONTEND][BACKEND] improved fp8 specs (#1841)" (#1865)
This reverts commit d4c941177e.
2023-06-29 21:07:01 -04:00
Thomas
3be060849a [FEATURE] Add associative_scan support (#1858)
Implement associative_scan in the frontend and implement lowering to
LLVM for the blocked layout where the scan happens on the fastest-moving
dimension. This will later be generalized to support more layouts.
2023-06-29 14:37:51 -07:00
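An associative scan pairs a sequence with any associative combine function and produces all prefix results; `itertools.accumulate` sketches the semantics (not Triton's API) for two common combine functions.

```python
from itertools import accumulate

xs = [1, 2, 3, 4]
prefix_sums = list(accumulate(xs, lambda a, b: a + b))
prefix_prods = list(accumulate(xs, lambda a, b: a * b))
assert prefix_sums == [1, 3, 6, 10]
assert prefix_prods == [1, 2, 6, 24]
```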
Izzy Putterman
9961b5c7aa [TESTING] allow user to adjust warmup and repetition time for autotuning (#1850)
Adds an option to adjust the warmup and repetition time for autotuning.
It defaults to the old values and has no effect on current kernels.
This is useful for bigger kernels, where the runtime might be a sizable
fraction of the 100 ms budget, leading to less warmup and more variance
during benchmarking.
2023-06-28 11:04:43 -07:00
Thomas
e5d7411a69 [BACKEND] Add .wt store cache modifier (#1831) 2023-06-28 17:40:30 +00:00
Goran Flegar
938a6754b4 [BUILD] Export compile commands (#1854)
This can be used by IDEs to figure out how to correctly compile
individual sources and offer semantic code completion.
2023-06-28 14:11:59 +00:00
Keren Zhou
d2de3f37f0 [BACKEND] Reduce code cleanup and bug fix for the fast path (#1816)
https://github.com/openai/triton/issues/1715
2023-06-27 17:27:24 -07:00
Zahi Moudallal
2dcbf4783e [BACKEND] Use getOrder for mma layout warps order instead of the hardcoded col-major order (#1825) 2023-06-27 10:56:09 -07:00
Philippe Tillet
d4c941177e [FRONTEND][BACKEND] improved fp8 specs (#1841)
clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.
2023-06-26 16:19:03 -07:00
Izzy Putterman
3c400e7818 [FRONTEND] switch absolute imports to relative v2 (#1833) 2023-06-26 04:13:12 +00:00
Wang Weihan
a3c39d8fbe [TEST] Add device parameter for ut (#1817)
Triton now supports different codegen backends for different devices,
so enabling the unit test cases to support different devices also makes
sense. Otherwise, a third-party backend might have to intrusively
change the Triton test cases.
2023-06-25 15:38:59 +08:00
Thomas
3d1cd89b54 [BACKEND] Add store cache modifiers (#1826)
Plumb through store cache modifiers.
2023-06-23 09:29:10 -07:00
Zahi Moudallal
6ad8cd52e7 [CI] Added IR reference-check github workflow (#1755) 2023-06-22 18:00:40 -07:00
Wang Weihan
4d3a92f1b8 [BUILD] Make sure always build_ext first (#1819)
A third-party backend might install its Python package into the
`triton/third_party` package during the build process. But `build_py`
could be executed before `build_ext`; `build_py` would then only copy
the `packages` defined in `setup.py`, without the third-party packages,
since the third-party backend (built by `build_ext`) has not been built
yet. This PR therefore refines the build order to ensure `build_ext`
always runs before `build_py`.
2023-06-22 13:32:03 -07:00
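A common setuptools pattern for the ordering fix described above (a sketch, not the repo's actual `setup.py`): subclass `build_py` so it triggers `build_ext` first, then register the subclass via `cmdclass`.

```python
from setuptools.command.build_py import build_py

class BuildPyAfterExt(build_py):
    """Run build_ext before build_py so files that extensions
    (or third-party backends) drop into the tree get packaged."""

    def run(self):
        self.run_command("build_ext")  # build extensions first
        super().run()                  # then copy Python packages

# usage (in setup.py): setup(..., cmdclass={"build_py": BuildPyAfterExt})
```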
Zahi Moudallal
ca4f242c9b [TEST] Added matmul config for testing (#1758) 2023-06-22 13:31:37 -07:00
Goran Flegar
8d566e4196 [FRONTEND] Fix missing attribute access in DependenciesFinder (#1820)
It seems that patch #1773 introduced a bug, since the `lhs` object
doesn't necessarily have a `__name__` attribute.

I'm hitting this if I modify the matmul tutorial
(gflegar/triton@442b00f4d):

```
File "/home/gflegar/triton/python/triton/runtime/jit.py", line 74, in visit_Attribute
  if lhs is None or lhs.__name__ == "triton":
AttributeError: 'Tensor' object has no attribute '__name__'
```

I think the idea of that patch was to remove the need to import triton
by replacing `lhs is triton` with `lhs.__name__ == "triton"`. This
patch should have the same behavior as the original code, but without
failing if `lhs` doesn't have a `__name__` attribute.
2023-06-22 13:30:25 -07:00
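The usual shape of such a fix is to read `__name__` with `getattr` and a default, so objects without the attribute compare as non-matches instead of raising; a sketch (the helper name is illustrative):

```python
import types

class Tensor:          # stand-in for an object with no __name__
    pass

def refers_to_triton(lhs):
    # getattr with a default avoids AttributeError on e.g. Tensor
    return lhs is not None and getattr(lhs, "__name__", None) == "triton"

fake_triton = types.ModuleType("triton")   # any module named "triton"
assert refers_to_triton(fake_triton) is True
assert refers_to_triton(Tensor()) is False
assert refers_to_triton(None) is False
```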
Izzy Putterman
5686c51cdb [FRONTEND] allow pre-hook in autotuner configs to access config kwargs (#1814)
This is a quick change that allows a config's pre-hook to see the
values in the config itself. This is useful if we'd like to allocate an
intermediate tensor whose shape depends on the tile size.
2023-06-22 05:40:48 -07:00
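A sketch of the use case above, with a hypothetical hook signature and kwarg names (`BLOCK_M`, `scratch` are illustrative, not Triton's API): the pre-hook reads a tile size from its own config's kwargs to size a scratch buffer.

```python
def pre_hook(nargs, config_kwargs):
    # hypothetical: size an intermediate buffer from the config's tile size
    nargs["scratch"] = [0.0] * config_kwargs["BLOCK_M"]

config_kwargs = {"BLOCK_M": 4}   # what the autotuner config would carry
nargs = {}
pre_hook(nargs, config_kwargs)
assert len(nargs["scratch"]) == 4
```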
Philippe Tillet
0d6cd0307a [FRONTEND] add tie_break_left option to arg-reductions (#1813) 2023-06-21 19:35:52 -07:00