The function calculates the swizzled address to **store** (not load), so
we should use `outOrder` instead of `inOrder`. Current tests do not
cover this case, but at NVIDIA we have an `sm_90`-related case that
could trigger it. Already discussed in the Slack channel with @Jokeren.
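For context, shared-memory swizzling typically XORs the column index with a row-dependent phase to avoid bank conflicts; which tensor dimension plays the "row" versus "column" role is decided by the layout order, which is why store-side code must consult the store order. A minimal sketch (parameter names are hypothetical, not Triton's actual implementation):

```python
def swizzled_col(row, col, per_phase, max_phase, vec):
    """Swizzle the column index so consecutive rows hit different banks.

    Hypothetical sketch of an XOR-based swizzle; the row/col roles are
    taken from the layout order of the tensor being stored.
    """
    phase = (row // per_phase) % max_phase  # phase cycles across rows
    group = col // vec                      # vectorized column group
    return (group ^ phase) * vec + col % vec
```

Within each row the mapping permutes the columns, so no data is lost; only the bank assignment changes.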
Add the "agent" syncscope specification to prevent a large performance
loss on gfx90a.
Add LLVM function attributes to enable fp32 atomic adds on
architectures that support them.
While the incorrect version happens to work at the moment, it will
cause a validation error once we rebase Triton closer to LLVM head,
since validation of some LLVM Dialect ops has become stricter.
Specifically, if we remove the shared memory space attribute, a
subsequent bitcast tries to add it back, which is illegal.
The change enables a fall-through FMA path for ROCm. It works for the
float32 type, though not for all tensor sizes. The change switches off
reporting MMA and async-op support to avoid generating NVIDIA inline
asm.
The call to `coalesceOp` deletes the op it is processing and replaces
it with a new one. We can't `dyn_cast` the `curr` pointer because it
is dangling at that point.
Co-authored-by: Philippe Tillet <phil@openai.com>
- Dependent CUDA files (`ptxas`, `cuda.h`, `libdevice.10.bc`) are now packaged in
  `triton/third_party/cuda`. `ptxas` is downloaded from the conda repo at
  install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit-shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is considered not expensive, since all threads will
    load the same element.
  - The targetEncoding may expose more vectorization opportunities (more
    elements per thread along the first dimension).
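Per-op case handling in such an analysis typically boils down to small transfer rules, e.g. for divisibility. A sketch (rule names are illustrative, not Triton's actual API):

```python
from math import gcd

# Hypothetical divisibility transfer rules for an AxisInfo-style analysis:
# div(x) is the largest known power-of-two (or general) divisor of x.
def visit_add(div_a, div_b):
    # a + b is divisible by the common divisor of its operands
    return gcd(div_a, div_b)

def visit_mul(div_a, div_b):
    # divisors multiply: (k*m) * (k'*n) is divisible by k*k'
    return div_a * div_b

def visit_select(div_t, div_f):
    # either branch may be chosen, so only the common divisor survives
    return gcd(div_t, div_f)
```

The select rule is the conservative one: taking the gcd of both branches is always sound, even when the condition is unknown.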
**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
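The reported ratios follow directly from the numbers above:

```python
# Recompute the slowdown ratios from the res2next figures quoted above.
triton_avg, mlir_avg = 16838630, 17105166
avg_slowdown = mlir_avg / triton_avg          # ~1.016x
triton_total, mlir_total = 6511735460, 6512370620
total_slowdown = mlir_total / triton_total    # effectively parity
print(f"average: {avg_slowdown:.3f}x, total: {total_slowdown:.4f}x")
```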
The previous https://github.com/openai/triton/pull/1113 forgot to
consider that a node may have multiple parents; visiting the
instruction before all of its parents have been visited violates the
semantics of topological sort.
The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
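The fixed scheme can be sketched as a Kahn-style sort (illustrative, not the actual Triton code): an op becomes "ready" only after *all* of its operands are visited, so a node with multiple parents is never emitted early.

```python
from collections import deque

def topo_sort(ops, operands):
    """ops: list of op names; operands: op -> ops it depends on."""
    remaining = {op: set(operands.get(op, ())) for op in ops}
    ready = deque(op for op, deps in remaining.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        # an op moves to "ready" only once its last operand is visited
        for other, deps in remaining.items():
            if op in deps:
                deps.remove(op)
                if not deps:
                    ready.append(other)
    return order
```

On a diamond graph (`d` depends on both `b` and `c`), `d` is emitted only after both parents, which is exactly the case the previous implementation got wrong.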
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to
  fix this! But for now the following patches will unblock a number of
  users.**
- Fixed a frontend codegen issue for If / For / While. Emit an error
  when replaced values' types mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
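The random.py fix mentioned above amounts to reinterpreting a float's bits as an unsigned integer rather than converting its value; in plain Python the same reinterpretation can be sketched with `struct` (illustrative, not the Triton frontend code):

```python
import struct

def bitcast_f32_to_u32(x):
    """Reinterpret the raw bits of a float32 as a uint32 (a bitcast,
    not a value conversion)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]
```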
This fix enables support on sm_90 (otherwise it crashes).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring
processor)

can be ignored and should disappear once the LLVM NVPTX backend is
updated.
Use i32 as the storage type for <2xi16> and <4xi8>, as NVPTX inserts
extra integer instructions for vector int types.
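The storage trick is plain bit packing: two 16-bit lanes live in one 32-bit word. A sketch of the idea (illustrative; the actual codegen emits LLVM IR):

```python
def pack_2xi16(lo, hi):
    """Pack two 16-bit values into one 32-bit word, mirroring the idea
    of storing <2 x i16> as i32."""
    return (hi & 0xFFFF) << 16 | (lo & 0xFFFF)

def unpack_2xi16(word):
    """Recover the two 16-bit lanes from the packed word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF
```

Because the packed word is an ordinary i32, the backend moves it with scalar integer instructions instead of expanding vector int ops.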
Performance before this PR: (8192x8192x8192-TN input)
bf16: 222 TFLOPS
i8: 339 TOPS
After this PR:
bf16: 272 TFLOPS
i8: 548 TOPS
Fixed the GcnAsmFormatTest.basic and GcnAsmFormatTest.complexInstruction tests:
- Added the missing `off` operand
- Added a semicolon at the end of each instruction
- Removed the redundant comma between args and modifiers