This seems to have worked fine in opt mode (although it may have been
producing undefined behavior), but in debug mode on a newer version of
LLVM it segfaults without this PR, as the iterators get invalidated.
This is also consistent with how it is done elsewhere in this file.
Fix the dependencies of the wgmma_wait op to prevent the scheduler from
moving it past the uses of the wgmma accumulator. We need to explicitly
represent the dependency between the wait and the accumulator uses,
otherwise LLVM is free to reorder them.
This allows us to remove a workaround that prevented the reordering. We
can also remove the wait op added in the loop during pipelining.
Also fix the descriptor calculation for wgmma; we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait, or we may have race conditions.
Support having a chain of mma ops with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
* [Alloc] Enhanced allocation for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* fix for allocation test
* added test
* fixed mfma_enc printer
* fixed test
MMA V3: support taking operand A from registers. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps and add the lowering
for it.
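As an illustration, here is a chained matmul in the style of the attention tutorial, where the first dot's result feeds the second dot as operand A. This is a hedged sketch, not code from the PR; it assumes fp16 inputs and BLOCK >= 16:

```python
import triton
import triton.language as tl

@triton.jit
def chained_dot(q_ptr, k_ptr, v_ptr, o_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    q = tl.load(q_ptr + idx)
    k = tl.load(k_ptr + idx)
    v = tl.load(v_ptr + idx)
    p = tl.dot(q, k)                 # first mma; result lives in registers
    o = tl.dot(p.to(tl.float16), v)  # second mma takes operand A from registers
    tl.store(o_ptr + idx, o)
```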
When there is a chain of mma ops, we want to pick the same shape for all
of them to avoid conversions. This improves the detection to look
through for loops.
This fixes a crash in the tutorial's backward attention.
We might want to change this logic and convert the format to allow more
efficient MMA at some point.
llvm/llvm-project#66754 extends the `LoopLikeOpInterface`: the signature
of `getLoopBody` has changed. `ForOp::getRegion` can be used instead.
This change works with and without llvm/llvm-project#66754.
Improve the patterns that sink broadcast to reduce arithmetic density,
and also hoist convert above expand_dims to do less work.
This addresses comments in https://github.com/openai/triton/pull/2274
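Schematically, sinking the broadcast performs the arithmetic once per source element instead of once per broadcast element. A plain NumPy illustration of the idea (not the pass itself):

```python
import numpy as np

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
# Before: broadcast first, then add -> 8 * 4 = 32 additions.
before = np.broadcast_to(x, (8, 4)) + np.broadcast_to(y, (8, 4))
# After: add first, then broadcast -> only 4 additions.
after = np.broadcast_to(x + y, (8, 4))
assert (before == after).all()
```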
Change the dot op to allow taking an initial accumulator, and add a
flag that allows the compiler to accumulate in a lower precision than
the output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
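A sketch of what this looks like from the language side, assuming the `acc` and `max_num_imprecise_acc` parameters of `tl.dot` (parameter names as I understand them; check the current docs):

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    rm = tl.arange(0, BLOCK_M)
    rn = tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        ks = k + rk
        a = tl.load(a_ptr + rm[:, None] * K + ks[None, :])
        b = tl.load(b_ptr + ks[:, None] * BLOCK_N + rn[None, :])
        # Feed the running accumulator into the dot; on Hopper fp8 the flag
        # bounds how many partial products may be kept below output precision.
        acc = tl.dot(a, b, acc=acc, max_num_imprecise_acc=32)
    tl.store(c_ptr + rm[:, None] * BLOCK_N + rn[None, :], acc)
```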
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and the shared memory
store for the ramp-up stage
- Pipelined Loop: Assemble the loop body minus the last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* updated comment
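For illustration, the shape of the transformation in plain Python. The helpers `load`, `compute`, and the `shared` variable are stand-ins for the real loads, non-load body, and shared-memory staging, not the pass's actual IR manipulation:

```python
def load(x):             # stands in for a global-memory load
    return x

def compute(acc, tile):  # stands in for the non-load loop body
    return acc + tile

def pipelined_sum(tiles):
    # Prologue: hoist the first load and shared-memory store (ramp-up).
    shared = load(tiles[0])
    acc = 0
    # Pipelined loop: the body minus the last iteration.
    for i in range(len(tiles) - 1):
        nxt = load(tiles[i + 1])    # prefetch next tile into regs
        acc = compute(acc, shared)  # compute from the previous tile
        shared = nxt                # store next tile into shared mem
    # Epilogue: peeled non-load body for the last iteration.
    return compute(acc, shared)

assert pipelined_sum([1, 2, 3, 4]) == 10
```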
**Motivation**: We have a kernel that loads multiple types of tensors,
some int32 and some float16. The coalescing pass assigns `perThread = 8`
for the float16 tensors and `perThread = 4` for the int32 tensors,
resulting in unnecessary layout conversions that hurt performance.
Instead, we should just set `perThread = 8` for both of these loads.
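A minimal kernel of the kind described (hypothetical, for illustration): both loads share the same shape and order, so after this change both get `perThread = 8`.

```python
import triton
import triton.language as tl

@triton.jit
def mixed_dtype_loads(f16_ptr, i32_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    x = tl.load(f16_ptr + offs)  # float16: coalescer picked perThread = 8
    y = tl.load(i32_ptr + offs)  # int32: previously forced to perThread = 4
    tl.store(out_ptr + offs, x.to(tl.float32) + y.to(tl.float32))
```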
**Details**:
One of the first steps in calculating the new encoding is to find the
group of upstream/downstream tensors with the "same type", in order to
find the maximal sizePerThread required in this group. This PR changes
the logic so that tensors can be grouped as long as they have the same
shape and the same optimal ordering, even if they have different
encodings or dtypes.
Next, the logic to compute `perThread` is updated to account for the
change above; since dtype can now be different within a single "group",
the `perThread` computation now considers different
elemNumBits/elemNumBytes for each value in the group.
Handle more cases of hoisting convert above ext ops. If there are
multiple ext ops in the slice but only one requires inserting a convert,
we can still apply the optimization.
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into 2
main phases. The first is forward propagation of layouts, starting from
ops whose layout we don't want to change and propagating to all the
nodes. If a single layout is needed for an op, we can rewrite the op;
if multiple layouts are required based on the dependencies, we need a
tie-break.
The second phase is backward propagation: it takes a backward slice of
operations starting from a convert and, if all the operations in the
slice can be rematerialized, rewrites the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future, such as adding a cost
model to decide which converts to leave and which to fold.
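A conceptual sketch of the forward phase over a toy def-use graph. This is plain Python: `users`, `anchors`, and the lexicographic tie-break are illustrative stand-ins, not the pass's API or its eventual cost model:

```python
def forward_propagate(users, anchors):
    # users: op -> list of consumer ops; anchors: op -> layout we won't change.
    layouts = {op: {l} for op, l in anchors.items()}
    worklist = list(anchors)
    while worklist:
        op = worklist.pop()
        for u in users.get(op, []):
            before = set(layouts.get(u, set()))
            layouts.setdefault(u, set()).update(layouts[op])
            if layouts[u] != before:  # layout set grew: revisit consumers
                worklist.append(u)
    # One candidate layout: rewrite the op directly; several candidates:
    # tie-break (here lexicographic, as a stand-in for a real cost model).
    return {op: min(ls) for op, ls in layouts.items()}

# Two anchors feeding a shared op: 'c' sees both layouts and gets tie-broken.
print(forward_propagate({'a': ['c'], 'b': ['c']},
                        {'a': 'blocked', 'b': 'mma'}))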
This folds `tl.arange(x, x + 1)` into a constant. This shows up, for
example, when autotuning and one of the block sizes gets set to 1.
Co-authored-by: Philippe Tillet <phil@openai.com>
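For example (an illustrative kernel, not from the PR): with `BLOCK = 1`, as an autotuner may pick, the `tl.arange(0, BLOCK)` below becomes a single-element range that now folds to a constant.

```python
import triton
import triton.language as tl

@triton.jit
def copy_one(x_ptr, y_ptr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    # With BLOCK = 1 this is tl.arange(0, 1), which folds to the constant 0.
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    tl.store(y_ptr + offs, tl.load(x_ptr + offs))
```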
* Enable usage of block pointer semantics for AMD GPUs
This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass, which rewrites block pointer loads/stores
to legacy loads/stores.
* Update FA fwd in tutorial to use the block pointers
* Use compute capability 90 for AMD GPUs in python/triton/compiler/compiler.py
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
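For reference, a sketch of the block-pointer form that rewrite_tensor_pointer_pass lowers to legacy loads/stores (an illustrative copy kernel; shapes and strides are hypothetical):

```python
import triton
import triton.language as tl

@triton.jit
def block_ptr_copy(x_ptr, y_ptr, M, N,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # A block pointer describing a BLOCK_M x BLOCK_N tile of a row-major
    # M x N tensor; the pass rewrites the load/store into legacy form.
    x_block = tl.make_block_ptr(base=x_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0),
                                block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(x_block, boundary_check=(0, 1))
    y_block = tl.make_block_ptr(base=y_ptr, shape=(M, N), strides=(N, 1),
                                offsets=(0, 0),
                                block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(y_block, tile, boundary_check=(0, 1))
```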