github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Izzy Putterman	5686c51cdb	[FRONTEND] allow pre-hook in autotuner configs to access config kwargs (#1814 ) This is a very quick change that allows the configs' pre-hooks to see the values in the config itself. This is useful if we'd like to allocate intermediate tensor and the shape depends on tile size.	2023-06-22 05:40:48 -07:00
Philippe Tillet	0d6cd0307a	[FRONTEND] add tie_break_left option to arg-reductions (#1813 )	2023-06-21 19:35:52 -07:00
Philippe Tillet	4c0e3d907e	[TOOLS] improved ahead-of-time compiler (#1805 ) This is a revival of @gaxler initial ahead-of-time compiler proposal. Code was simplified and some constraints were relaxed (i.e., we now execute the entire file provided vs just the kernel AST) to promote maintainability. A basic unit test was added, though it does not test specialization right now. co-authored by: Gregory Axler, thexler <g.axler@gmail.com>	2023-06-21 01:02:58 -07:00
Christopher Hesse	767dd5b5c1	[DOCS] update debug docs (#1790 ) Haven't tested the docs (they don't seem to compile on my machine when I use the makefile)	2023-06-20 20:45:17 -07:00
Thomas	4be1c94b1f	Enable ir dumping for ttir to ttgir phase. (#1804 ) ttir_to_ttgir was missing enable_debug to be able to dump IR.	2023-06-20 17:17:18 +00:00
ZhenLei Xu	326061efd4	Remove Duplicate Definition of 'asm' Variable in triton/compiler/compiler.py (#1803 )	2023-06-19 22:28:28 -07:00
Xinya Zhang	a14b1e2cc4	Add Navi3x bitcode files. (#239 ) Ideally Triton should be able to locate bitcode files under /opt/rocm/ (or $ROCM_HOME). Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-06-19 10:30:47 -05:00
Keren Zhou	1851c8ca99	[FRONTEND] Fix binary compare op on constexprs (#1801 ) Example: ``` if static_a == 0 and static_b == 1: ... ``` The return value of `static_a == 0` should be `constexpr(True)` but not `True`, otherwise the bool object (True/False) doesn't have the `logical_and` method.	2023-06-18 20:27:56 -07:00
oplavsic	64d7b521cf	[MFMA] Enabled fused attention forward pass. (#226 ) * [MFMA] Activated Fused Attention Forward Pass Patch contains following changes: 1) make_range operator now works with MFMA layout. 2) Reduce operation is forced to run in block layout: inputs converted to block layouts, outputs returned to MFMA layout * Use simple module walk instead of pattern rewriter. * Remove pattern rewriter header. * Enable basic reduce algorithm for MFMA layout * Add TODO comment for fused attention backward pass * Fix bug in fast codegen algorithm for reduce op * Fix input type bug * Increase block size to 128 since out of memory issue is not seen on MI210 * Fix block_size error * Add mfma support in DecomposeDotOperand pattern.	2023-06-16 15:39:08 -05:00
Christopher Hesse	981e98a213	[FRONTEND] update assert_helper.py (#1789 )	2023-06-15 16:24:30 -07:00
Philippe Tillet	9a2580de13	[CI] Added H100 node (#1779 )	2023-06-15 14:21:47 -07:00
Philippe Tillet	b24dc19741	[FRONTEND] cleaned up symbol names (#1782 )	2023-06-14 18:55:32 -07:00
Izzy Putterman	71e21f5797	[FRONTEND] switch absolute imports to relative imports in Triton (#1773 )	2023-06-14 23:59:24 +00:00
Philippe Tillet	754306b5d7	[FRONTEND] `TRITON_PTXAS_PATH` can now take in options (#1778 )	2023-06-14 14:03:03 -07:00
Zahi Moudallal	ac15d00ef4	[TEST] Added f8xf16 tests (#1771 )	2023-06-12 16:14:17 -07:00
Wang Weihan	b27a91a113	[FRONTEND] Enable triton to support register thirdparty backend at runtime (#1643 ) This PR intends to provide a mechanism to support a third-party backend at runtime to generate the backend-specific code. The mechanism provided a common class to abstract the third-party backend logic and two essential functions to register and get the third-party backend at runtime. - `BaseBackend`: A common class to abstract the third-party backend logic - `register_backend`: Register a third-party backend with a given device type - `get_backend`: Get the third-party backend with a given device type Generally, a third-party backend must inherit from `BaseBackend` and implement all the member functions according to the backend characteristics. As long as the backend implementation is ready, the third-party backend can invoke `register_backend` to register it under a given device. During the kernel compilation and execution, the mechanism will get the registered backend to generate the kernel and launcher code for a given device. This PR added a dummy backend to simulate a third-party backend and demonstrate the usage. 
- [test_device_backend.py](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1): To define a third-party backend and register the backend - [ExtensionBackend](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R123): Inherit from the `BaseBackend` and implement some specific logic like [filter out some compile stages](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R129-R135) - [Register the `ExtensionBackend` for `CPU`](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R279) - [extension_backend.c](https://github.com/openai/triton/pull/1643/files#diff-169c1d08b3a0a7b343cfa3258fbc32b47e0f6c46305a112652fa1bdaaec89d29): To provide the utility function to load kernel binary and get the backend properties.	2023-06-09 09:09:59 -07:00
jayfurmanek	29f93b147b	Merge pull request #229 from ROCmSoftwarePlatform/ifu230601 IFU 230601	2023-06-09 07:55:32 -05:00
jayfurmanek	49ef240047	get Arch Info using HSA (#233 )	2023-06-08 14:30:46 -05:00
Sophia Wisdom	2fb56dcf52	[FRONTEND] improve `tl.dot` error messages (#1757 ) This makes it easier to figure out what's going on when you get these error messages.	2023-06-07 19:39:16 -07:00
Keren Zhou	4fbadf6f6f	[BACKEND] Fix `tl.cat` when the number of threads > the size of a tensor (#1751 ) `tl.cat(tensor<64>, tensor<64>) -> tensor(128)`, because it concatenates elements into a single thread, if number of threads is 128, each thread should own at least 2 elements. With this PR, we also disable remat of the cat op in some cases.	2023-06-07 15:42:38 -07:00
Aleksandr Efimov	0a12031c75	[Triton] Fix MFMA dot operand loading This PR fixes computation of indexes of MFMA dot operands and gives variables more informative names.	2023-06-07 21:30:52 +02:00
Zahi Moudallal	1db570f45c	[FRONTEND] Added arch to the json (#1747 )	2023-06-06 17:06:03 -07:00
Edward Z. Yang	f294a18864	[FRONTEND] force quantile tensors to be float; prevents accidents (#1741 ) In particular, sometimes this was failing with: ``` RuntimeError: quantile() input tensor must be either float or double dtype ``` Fixes https://github.com/pytorch/pytorch/issues/103054 Signed-off-by: Edward Z. Yang <ezyang@meta.com> --------- Signed-off-by: Edward Z. Yang <ezyang@meta.com>	2023-06-05 20:55:40 -07:00
Keren Zhou	1eedef3596	[FRONTEND] Fix `next_power_of_2` for large integers (#1740 ) test case: `next_power_of_2(2**33 + 1)`	2023-06-05 20:29:05 -07:00
Philippe Tillet	c52a91231a	[FRONTEND][BACKEND] Add acquire/release semantics for atomics (#1739 )	2023-06-05 19:09:13 -07:00
Jason Furmanek	0497f95982	[ROCM] Fix assert helper	2023-06-05 21:42:44 +00:00
Philippe Tillet	9c8d7c18b3	[FRONTEND] simpler/faster argmax code (#1736 ) @peterbell10 @Jokeren I believe that the current argmax reduction code is overly complicated. Am I missing something here?	2023-06-04 21:30:55 -07:00
Philippe Tillet	6c1992cb38	[FRONTEND] min/max now accept `return_indices` argument (#1731 ) Not just syntactic sugar for successive max + argmax but also avoids computing the max twice	2023-06-02 22:01:02 -07:00
chengjunlu	45ba9af6ed	[BACKEND] Add a configurable parameter for the number of threads per warp (#1719 ) Add a configurable parameter for the number of threads per warp for other GPUs, such as Intel GPUs. It defaults to 32, so the code logic on CUDA/AMD GPUs does not change. Note: The Intel GPU GenX ISA is explicit SIMD and can support a variable number of thread lanes per HW execution unit.	2023-06-02 16:55:06 -07:00
Sharad Vikram	acf1ede5bf	[DEBUGGER] Use from __future__ import annotations to avoid importing torch from annotations (#1722 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-06-01 14:57:17 -07:00
jayfurmanek	153ed472b8	Merge branch 'triton-mlir' into ifu230601	2023-06-01 16:18:25 -05:00
Daniil Fukalov	6be1dce41c	[ROCM] Fix transposed operands processing in dot operation with MFMA. (#227 ) - Applied to `loadA()` the same fix as 2c88ed6aab9ace22ccde1f0e443a1579727ee501. - Minor cleanup of `mfmaLayout.getWarpsPerCTA()` usage. Partially fixes ROCmSoftwarePlatform/frameworks-internal#4545	2023-06-01 23:08:58 +02:00
Jason Furmanek	56c55e7451	Initial commit to resolve merge conflicts	2023-06-01 20:58:37 +00:00
Jason Furmanek	28d9754b2a	Merge remote-tracking branch 'oai/main' into ifu230601 Conflicts: python/test/unit/language/assert_helper.py test/Conversion/tritongpu_to_llvm.mlir	2023-06-01 20:53:33 +00:00
Keren Zhou	1e171bf270	[BACKEND] Pipeline pass rewrite part 1: functionality fixes (#1716 ) Support the following three cases: 1. Operands of `load` depend on induction variables before `load`s. 2. Mixed use of induction variables and offset to update the `ptr`. 3. Cross iteration (>1) dependency values.	2023-06-01 12:07:43 -07:00
Mehdi Amini	440fd1bf20	[TESTS] Increase the payload of the globaltimer kernel to reduce chances of flakiness (#1726 ) If the kernel is too small, on a very fast GPU we may get 0 because the resolution of the timer seems too coarse. Fixes #1725	2023-06-01 02:53:07 -07:00
Mehdi Amini	b0c893cdc5	[FRONTEND][BACKEND] Hardened get_program_id axis by making it an enum attribute (#1721 ) Also catch out-of-bounds indices at construction and throw a proper error in the frontend. Finally, let's make the IR a bit prettier: `%0 = tt.get_program_id {axis = 0 : i32} : i32` becomes: `%0 = tt.get_program_id x : i32` Fixes #1718	2023-05-31 22:49:46 -07:00
Mehdi Amini	19c65d6007	[FRONTEND] fix checks for valid slice and avoid hitting an obscure exception. (#1720 ) When comparing to the expected slices, using the `==` operator will dispatch to the component of the slice. If the user writes `a[10:20]` these are `triton.constexpr` instances, and the `__eq__` operator is implemented as: `return constexpr(self.value == other.value)`. At this point the access to `.value` on the provided `None` yields an exception that isn't very friendly to the user. I am not sure if the implementation of `constexpr` should be hardened instead? Co-authored-by: Philippe Tillet <phil@openai.com>	2023-05-31 16:37:19 +00:00
Eugene Zhulenev	327d362cca	[FRONTEND] add a method to serialize Triton module to MLIR bytecode (#1711 ) In addition to printing Triton module to string, add an option to print it in bytecode format (https://mlir.llvm.org/docs/BytecodeFormat/)	2023-05-31 08:49:11 -07:00
Andrey Shukshov	fee5950893	[MFMA] Implementation of MFMA DotOp pipeline (#180 ) * [MFMA] Implementation of MFMA DotOp pipeline * Added MFMA test_dot unit tests * Added missing ifdefs * Update offline tests * Removing duplicate parts * fix build after rebase * remove redundant stuff * simplify MMAv3.cpp * move reps function into operand attr description, remove coreMatrixType type from layout conversion, refactored type conversion * remove duplication of mfma instruction shape computation * move all MFMA instruction shape details into layout attribute * fix formatting * reenable matmul acceleration * fix dot operator type conversion * add offline test for dotop * add missing ifdef wrappers * run clang format on changes * review and rebase fix * add switch for MFMA instructions * change check precision for float16 test * disable redundant check for allowTF32 * - skip unsupported block size in matmul autotuner - support transposed inputs of dot * reenable matmul acceleration * Add first part to FMA for dot operation on HW without MFMA support. * Fix offline tests. * Fix lit tests * refactor mmav3 to mfma * fix rebase issues * fix detection of mfma support and wrong assert * remove unnecessary macros * Add documentation for MFMA layout. * fix line size computation for B argument * Fix getElemsPerThread() and getSizePerThread() functions for MFMA layout. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> Co-authored-by: dfukalov <1671137+dfukalov@users.noreply.github.com> Co-authored-by: weihan13 <weihan13@amd.com> Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>	2023-05-30 16:10:28 -05:00
Philippe Tillet	4e2f57add5	[FRONTEND] Added default axis=None for reduction, which reduces across all the axes. (#1712 )	2023-05-28 16:13:21 -07:00
Philippe Tillet	420e4acecc	[TEST] Added flash attention tests for D_HEAD in {16, 32, 128}. (#1709 )	2023-05-27 22:48:22 -07:00
Keren Zhou	f29838a3ea	[FRONTEND] Emit return error message in the frontend (#1708 )	2023-05-26 09:44:03 -07:00
Nelson Elhage	0274446b3a	[FRONTEND] speed up autotuning of small kernel invocations. (#1701 ) Right now, `do_bench` estimates the runtime of the kernel and then uses that to run it a number of times approximately equal to 100ms (by default). However, when actually running the kernel, it also issues a `zero_()` call to clear the L2 cache. For small kernels, the `zero_()` kernel can be slower than the actual kernel we're benchmarking, causing us to badly overshoot our target latency. This has the perverse effect that very small invocations may take much longer to autotune than larger ones. By way of concrete example, before this PR, I tested the wall-clock time for the first call to `triton.ops.matmul(A, B.T)` in a process, on two `(N, N)` matrices in float32. I found that a 4k x 4k x 4k matmul warmed up in about 2.5s, but a 64 x 64 x 64 matmul took over 5 seconds! This PR fixes this issue by including the same call to `zero_()` inside our measurement loop. With this change, I find that the 4kx4kx4k and 64x64x64 matmuls warm up in very similar amounts of time, both around 2.5s. I noticed this because we tend to run tests on very small models in CI just to test code paths without regard to numerics, and found those tests were perversely taking longer than "real" models in some cases. It seems plausible that a better solution would be a pragma to disable autotuning entirely for such tests, but I think this change is a clear improvement as-is.	2023-05-26 09:42:31 -07:00
Keren Zhou	0341953466	[FRONTEND] Correct the debug syntax (#1705 ) - If `TRITON_DEBUG=True`, all triton functions will be compiled in the debug mode. - Otherwise, a triton function `f`'s debug flag is either `True`, `False` or `None` (default). - If `True`, `f` is compiled in the debug mode. - If `False`, `f` is compiled in the normal mode. - If `None`, `f` is compiled based on its caller's debug flag. The root (kernel) function's debug flag can also be set through the `compile` function. cc @ngimel , @Chillee	2023-05-24 23:24:29 -07:00
Jason Furmanek	0c95e2cdaa	[ROCM] Fix is_hip() checks	2023-05-22 15:50:27 +00:00
Keren Zhou	74dbb2fc0a	[DOCS] Add missing ops and corresponding comments (#1699 )	2023-05-21 12:18:48 -07:00
Philippe Tillet	b5ba639bae	[FRONTEND] fixed issue for fp64 literals and added tests (#1698 ) fixes #1686	2023-05-20 18:36:28 -07:00
Keren Zhou	fb30d84069	[FRONTEND] Refactor contains_return_op into an independent AST (#1694 ) https://github.com/openai/triton/issues/1690	2023-05-20 11:18:40 -07:00
Zahi Moudallal	34817ecc95	[BACKEND] Added support to convert shared to distributed layouts (#1682 )	2023-05-17 17:20:29 -07:00
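
The debug-flag rules listed in commit 0341953466 (#1705) above can be read as a small decision function. The sketch below is illustrative only and is not Triton's code: `resolve_debug` and its parameter names are hypothetical, and the fallback to `False` for a root kernel whose flag is `None` is an assumption, since the commit message only says the root flag can be set through `compile`.

```python
from typing import Optional


def resolve_debug(env_debug: bool,
                  fn_debug: Optional[bool],
                  caller_debug: bool = False) -> bool:
    """Hypothetical helper mirroring the rules in commit 0341953466.

    env_debug    -- whether TRITON_DEBUG is set to a true value
    fn_debug     -- the function's own debug flag: True, False, or None
    caller_debug -- the resolved debug mode of the caller (assumed False
                    when there is no caller; this default is an assumption)
    """
    if env_debug:
        # TRITON_DEBUG=True forces debug mode for every function.
        return True
    if fn_debug is None:
        # No explicit flag: inherit the caller's debug mode.
        return caller_debug
    # An explicit True/False on the function wins over the caller.
    return fn_debug


# A callee with no flag follows a debug-mode caller.
print(resolve_debug(env_debug=False, fn_debug=None, caller_debug=True))
# An explicit False opts out even when the caller is in debug mode.
print(resolve_debug(env_debug=False, fn_debug=False, caller_debug=True))
```

Under this reading, the environment variable takes precedence over any per-function setting, and `None` is the only value that consults the caller.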


1250 Commits
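
Commit 1eedef3596 (#1740) in the log above gives `next_power_of_2(2**33 + 1)` as its test case. The sketch below shows one plausible way a power-of-two helper can go wrong for inputs wider than 32 bits, next to a width-independent version. Both functions are hypothetical illustrations and are not taken from Triton's source; the log does not say what the original bug was.

```python
def next_power_of_2_32bit_smear(n: int) -> int:
    """Bit-smearing variant whose shifts stop at 16 (assumed for illustration).

    It is only correct when n - 1 fits in 32 bits.
    """
    n -= 1
    n |= n >> 1
    n |= n >> 2
    n |= n >> 4
    n |= n >> 8
    n |= n >> 16
    return n + 1


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n for n >= 1, valid for any integer width."""
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 by that amount gives the first power of two >= n.
    return 1 << (n - 1).bit_length()


def is_power_of_2(v: int) -> bool:
    return v > 0 and (v & (v - 1)) == 0


big = 2**33 + 1
print(is_power_of_2(next_power_of_2_32bit_smear(big)))  # False
print(next_power_of_2(big) == 2**34)                    # True
```

Using `int.bit_length` sidesteps the fixed shift sequence, so the result does not depend on how many bits the input occupies.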