github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Jason Furmanek	977d5aa267	Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108 Conflicts: bin/triton-translate.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/triton/compiler/compiler.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-08 18:51:23 +00:00
Jason Furmanek	39e8901d7a	ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp fix merge error fix dot fix make_range additional fix	2023-11-07 04:29:38 +00:00
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts fix more conflicts Resolve merge conflicts Some more build and conflict fixes Resolve conflicts for 06-fused-attention.py resolve merge conflicts for the tutorial group gemm example Fixes for some LIT tests resolve remaining conflicts in tests Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
Alexander Efimov	74c5fd46ee	[RemoveLayoutConversions] Fix reduce failed infer type error (#377 ) * [RemoveLayoutConversions] Fix reduce failed infer type error This PR fixes layout propagation algorithm in RemoveLayoutConversions pass. In some cases during rewriteSlice process, reduce operation with multiple outputs rewrites only one output layout, which breaks assumption that both outputs should have same layout. This change is a minimal part of https://github.com/openai/triton/pull/2331 change and small lit test for regression testing. * fix combine test * Fix issue with incorrect inference layout of make_range output result	2023-11-01 13:31:13 -05:00
Alexander Efimov	d62a3ffdbe	[RemoveLayoutConversions] Remove PatternSharedInfo structure (#378 ) This structure is not used anymore after massive refactoring of RemoveLayoutConversion pass in September IFU.	2023-11-01 12:57:35 -05:00
Thomas Raoux	a777e1d8db	[OPTIMIZER] Propagate mma layout when the transitive use has dot_operand encoding (#2482 )	2023-10-12 23:57:40 +00:00
Thomas Raoux	a7061e19b2	[BACKEND] Fix multiple bugs in WGMMA (#2457 ) Fix dependencies in wgmma_wait op to prevent the scheduler from moving it past the uses of wgmma accumulator. We need to explicitly represent the dependency between the wait and the accumulator uses otherwise LLVM is free to re-order those. This allows us to remove a workaround to prevent the re-ordering. We can also remove the wait op added in the loop during pipelining. Also fix the descriptor calculation for wgmma, we should calculate the same descriptor for the whole warpgroup. Added a workaround for a bug that was exposed by different timing due to those changes. We shouldn't insert operations between the loop and async_wait or we may have race conditions.	2023-10-06 17:59:28 -07:00
Philippe Tillet	e9d9ddd86d	[OPTIMIZER] More V100 conversion removal tweaks (#2446 )	2023-10-04 16:31:20 -07:00
Philippe Tillet	a7ff4eddae	[OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 (#2445 )	2023-10-04 14:56:12 -07:00
Thomas Raoux	5a0170a27c	[BACKEND] Minor removing of unnecessary code and cleanup (#2443 )	2023-10-04 12:14:08 -07:00
Shucai Xiao	8049891ff7	fix ifu gemm perf regression (#348 )	2023-10-04 08:45:18 -05:00
Aleksandr Efimov	336c4b5f3c	ROCM IFU: Fix LDS overflow issues in test_dot	2023-10-03 04:30:09 +00:00
Jason Furmanek	e5d7bb4fae	Initial commit to resolve merge conflicts rename tl.float8e4 to tl.float8e4nv to align with upstream ROCM IFU: Fix python arch issues ROCM IFU: Fix kernel launcher ROCM IFU: Fix merge conflicts fix debug build Set correct threadsPerCTA	2023-10-03 04:04:26 +00:00
Jason Furmanek	74fd8e9754	Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2 Conflicts: .gitignore bin/triton-translate.cpp include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir test/TritonGPU/coalesce.mlir unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt	2023-10-02 18:01:04 +00:00
Thomas Raoux	721bdebee1	[OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411 ) Support having chain of mma with mixed size. Serialize the different block calculation in backward attention to workaround problem with ptxas and wgmma.	2023-09-28 10:29:08 -07:00
Tori Baker	2d28b09319	Forward fixes to build on newer version of llvm (#2388 )	2023-09-26 08:08:13 -07:00
Thomas Raoux	3a848e2729	[BACKEND] Relax patterns to move sink broadcast and hoist convert (#2331 ) Improve patterns that sink broadcast to reduce the arithmetic density and also hoist convert on top of expand_dims to do less work. This addresses comments in https://github.com/openai/triton/pull/2274	2023-09-18 15:08:19 -07:00
Thomas Raoux	31b0c52142	[FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300 ) Change the dot to allow taking an initial accumulator and add a flag that will allow the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default which allows accumulating with lower precision. This only affects Hopper fp8 dot.	2023-09-15 18:42:54 -07:00
Thomas Raoux	cf7f8c5ea4	[BACKEND] Optimization to sink broadcast ops (#2274 ) Try to move broadcast ops after arithmetic and convert ops in order to reduce the amount of work needed.	2023-09-13 10:04:36 -07:00
Philippe Tillet	3747843143	[OPTIMIZER] improvements to layout conversion removal (#2268 ) * Improved heuristics for RemoveLayoutConversion; * add LayoutConversionOp canonicalizer for ViewOps	2023-09-09 22:06:27 -07:00
Jason Furmanek	df5c263a19	Fix merge conflicts	2023-09-01 04:01:32 +00:00
Jason Furmanek	3eaeb89d18	Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2 Conflicts: include/triton/Dialect/Triton/Transforms/Passes.h include/triton/Dialect/TritonGPU/IR/Dialect.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/triton/compiler/compiler.py python/triton/ops/flash_attention.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/triton/tools/aot.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir test/Target/tritongpu_to_llvmir.mlir test/Target/tritongpu_to_llvmir_noinline.mlir	2023-09-01 03:25:33 +00:00
Thomas	fff97d864a	[BACKEND] Add support for propagating convert op through while loops (#2213 ) Support forward propagation through while loops.	2023-08-31 09:26:04 -07:00
Thomas	3175ee4ce7	[BACKEND] Handle more cases of folding convert into reduce op (#2209 ) Handle cases of reduce with multiple operand returning scalars	2023-08-30 11:04:38 -07:00
Thomas	2ff88c1368	[BACKEND] Extend hoisting of convert op above ext ops (#2206 ) Handle more cases of hoisting convert above ext op. If there are multiple ext op in the slice but only one requires inserting a convert we can still apply the optimization.	2023-08-29 17:36:34 -07:00
Thomas	d4644d6cb3	[BACKEND] Refactor RemoveLayoutConversion pass (#2181 ) Significant changes to the pass logic. Move away from greedy rewrites and use more global analysis instead. The pass is now broken down into 2 main phases. First forward propagation of layout starting from ops that we don't want to change. Propagate to all the nodes. If there is a single layout needed for the op then we can rewrite the op, if there are multiple layouts required based on dependency we need a tie break. The second phase is backward propagation that gets a backward slice of operations starting from the convert and if all the operations in the slice can be rematerialized rewrite the slice. This backward phase now supports going through loop arguments. This will allow more complex logic in the future to add a cost model to decide which convert to leave and which to fold.	2023-08-28 19:05:16 -07:00
Thomas	c736ea8492	[BACKEND] Minor clean up and remove loop fixup as it is not needed anymore (#2116 )	2023-08-18 12:12:45 -07:00
oplavsic	398d2c7dd0	Fix FA v2 hanging issue when BLOCK_N=32 (#274 ) * Fix FA v2 hanging issue when BLOCK_N=32 * Fix broken tests	2023-08-10 12:52:09 -05:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Keren Zhou	cc5a7ed52f	[FRONTEND][BACKEND] Materialize line info for triton kernels (#1902 ) `export TRITON_DISABLE_LINE_INFO=1` to disable the feature.	2023-07-07 16:03:44 -04:00
Jason Furmanek	12005a82f2	Initial commit to resolve merge conflicts	2023-06-30 19:53:53 +00:00
Keren Zhou	71512f9177	[OPTIMIZER] Try to move conversions out of for loop before moving forward (#1862 )	2023-06-30 11:41:58 -04:00
Keren Zhou	9e8315bcf7	[OPTIMIZER] Use simulateBackwardRematerialization to check if conversions can be moved out of loop (#1861 ) Don't move out of loop if any other uses of the convert_layout operand introduce additional conversions	2023-06-30 07:01:36 -04:00
Keren Zhou	51cc3780fa	[OPTIMIZER] Fix push conversion forward for reduce ops (#1860 ) https://github.com/openai/triton/issues/1846	2023-06-29 22:00:27 -04:00
Jason Furmanek	2b38ab4b6c	Merge remote-tracking branch 'oai/main' into ifu230620 Conflicts: include/triton/Conversion/TritonToTritonGPU/Passes.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/test/unit/language/assert_helper.py python/triton/compiler/compiler.py python/triton/runtime/jit.py python/triton/tools/aot.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir	2023-06-29 21:47:27 +00:00
oplavsic	64d7b521cf	[MFMA] Enabled fused attention forward pass. (#226 ) * [MFMA] Activated Fused Attention Forward Pass Patch contains following changes: 1) make_range operator now works with MFMA layout. 2) Reduce operation is forced to run in block layout: inputs converted to block layouts, outputs returned to MFMA layout * Use simple module walk instead of pattern rewriter. * Remove pattern rewriter header. * Enable basic reduce algorithm for MFMA layout * Add TODO comment for fused attention backward pass * Fix bug in fast codegen algorithm for reduce op * Fix input type bug * Increase block size to 128 since out of memory issue is not seen on MI210 * Fix block_size error * Add mfma support in DecomposeDotOperand pattern.	2023-06-16 15:39:08 -05:00
Eugene Zhulenev	700b3cfd6e	[OPTIMIZER] integration for MLIR change D151520 (#1780 ) Fix for: https://reviews.llvm.org/D151520	2023-06-14 23:13:37 +00:00
Matthias Springer	31b1d6c057	[OPTIMIZER] integration fix for MLIR change D152814 (#1775 ) https://reviews.llvm.org/D152814 The API of `RewriterBase::replaceOp` was extended and a fix is needed so that the called function is not ambiguous. This change is compatible with both the old and the new API. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-06-14 15:35:58 -07:00
Keren Zhou	58a8e8a914	[BACKEND] Clean up code (#1768 ) - Remove unused header files. - Get numThreads/numWarps from the triton module. - Move transforms/utility.h to the include directory.	2023-06-12 17:40:33 -07:00
Keren Zhou	4fbadf6f6f	[BACKEND] Fix `tl.cat` when the number of threads > the size of a tensor (#1751 ) `tl.cat(tensor<64>, tensor<64>) -> tensor(128)`, because it concatenates elements into a single thread, if number of threads is 128, each thread should own at least 2 elements. With this PR, we also disable remat of the cat op in some cases.	2023-06-07 15:42:38 -07:00
Keren Zhou	fc6fed4292	[BACKEND] Simplify PushForward and fix MoveConvertOutOfLoop rematerialization (#1691 ) https://github.com/openai/triton/issues/1652	2023-05-19 17:40:46 -07:00
Ingo Müller	47af6ba702	[BACKEND] Move isSharedEncoding to TritonGPUIR. (#1655 ) This breaks a cyclic dependency between TritonAnalysis and TritonGPUIR (see #1649).	2023-05-12 20:50:21 -04:00
Philippe Tillet	19e7238d50	[OPTIMIZER] Clean-up Utility.cpp and fixed bug in RematerializeForward (#1608 ) ConvertLayoutOp can be folded in other ConvertLayoutOp	2023-05-02 23:17:52 -07:00
Philippe Tillet	8f47bdcc92	[OPTIMIZER] Added kWidth attribute to DotOperandEncoding (#1584 ) This is a prerequisite for efficient mixed-precision matmul	2023-04-26 23:03:18 -07:00
peterbell10	a3c3e5a3a1	[TESTS][OPTIMIZER] enable tests for argmin/max and fix some bugs (#1537 ) `argmin`/`argmax` is currently only tested in 1d and when we enable the tests for 2d it reveals a few bugs.	2023-04-17 18:47:31 -07:00
peterbell10	e152183570	[FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305 ) Fixes #1285 This changes `tt.reduce` to replace `redOp` by a region containing arbitrary code. For example, `tl.sum` is now lowered as: ```mlir %res = "tt.reduce"(%arg0) ({ ^bb0(%arg1: f32, %arg2: f32): %add = arith.addf %arg1, %arg2 : f32 tt.reduce.return %add : f32 }) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32> ``` Support for index reductions at the MLIR level is also dropped in favor of simultaneous reductions over multiple tensors, which generalizes the code without loss of performance. So for example `argmin` gets lowered as: ```mlir %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32> %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32> %9:2 = "tt.reduce"(%6, %8) ({ ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32): %14 = arith.cmpf olt, %arg4, %arg6 : f32 %15 = arith.cmpf ogt, %arg4, %arg6 : f32 %16 = arith.cmpi slt, %arg5, %arg7 : i32 %17 = arith.select %16, %arg5, %arg7 : i32 %18 = arith.select %15, %arg7, %17 : i32 %19 = arith.select %14, %arg5, %18 : i32 %20 = arith.cmpf olt, %arg4, %arg6 : f32 %21 = arith.select %20, %arg4, %arg6 : f32 tt.reduce.return %21, %19 : f32, i32 }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>) ```	2023-04-13 01:37:39 +00:00
Keren Zhou	2ba77a9212	[OPTIMIZER] Fix a typo in SimplifyReduceCvt (#1385 )	2023-03-21 22:45:58 -07:00
Keren Zhou	23fc647a3e	[OPTIMIZER] Fix optimizer hanging caused by SimplifyReduceCvt (#1377 ) https://github.com/openai/triton/issues/1328 Match the convert_layout operation in SimplifyReduceCvt (convert_layout->reduce). This way we don't miss higher priority rewrite patterns like RematerializeBackward and SimplifyConversion. We also need to set SimplifyConversion's benefit = 4, RematerializeBackward's benefit = 3, and RematerializeForward's benefit = 2.	2023-03-20 16:20:19 -07:00
Philippe Tillet	29d01ba5f3	[OPTIMIZER] We shouldn't try to rematerialize view/cat forward since output layout can't be deduced automatically (#1378 )	2023-03-20 14:26:50 -07:00


57 Commits
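
Note on commit e152183570: the `argmin` region quoted in that message combines two (value, index) pairs at a time. The sketch below is a plain-Python rendering of that combine step, written only to make the select chain easier to follow. The names `argmin_combine` and `argmin` are hypothetical and are not part of Triton; the real lowering operates on tensors along an axis, not on Python lists.

```python
from functools import reduce


def argmin_combine(lhs, rhs):
    """Combine two (value, index) pairs the way the quoted MLIR region does."""
    v1, i1 = lhs  # corresponds to %arg4, %arg5
    v2, i2 = rhs  # corresponds to %arg6, %arg7
    lt = v1 < v2                              # %14: arith.cmpf olt
    gt = v1 > v2                              # %15: arith.cmpf ogt
    tie_index = i1 if i1 < i2 else i2         # %16, %17: smaller index
    index = i1 if lt else (i2 if gt else tie_index)  # %18, %19
    value = v1 if lt else v2                  # %20, %21
    return value, index


def argmin(values):
    """Reduce a non-empty sequence to (min value, index of first minimum)."""
    pairs = zip(values, range(len(values)))
    return reduce(argmin_combine, pairs)


if __name__ == "__main__":
    print(argmin([3.0, 1.0, 2.0, 1.0]))
```

When two values compare equal, the smaller index is kept, which matches the `arith.cmpi slt` tie-break in the region.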