* Select mfma dimensions and instruction from static table
* Extend mfmaLayout to include version and instrShape
* Simplify generateMFMAOp by searching the mfma instruction in the table
* Fix getNonKDim() and non_k_dim
* Break instrShape into MDim and NDim
- Introduce WmmaEncodingAttr for WMMA output
- Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
- Provide a flag to check whether wmma instructions are supported by the target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported dot tile is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (i.e. the blocked layout cannot handle tensors smaller than the wavefront size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8).
This is a first version of the dot lowering using mfma 4x4 instructions, with redundancy and reductions.
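As an illustrative sketch of the shape constraint above (not the actual compiler check; the helper name, the K-multiple requirement, and the hard-coded wavefront size of 64 are assumptions of this example):

```cpp
#include <cassert>

// Hypothetical helper (not the real compiler code): returns true if an
// (M x K) x (K x N) dot could use the MFMA 4x4 path described above.
// Assumes a wavefront size of 64 and requires at least 64 output elements
// so that the surrounding layouts (e.g. blocked) can represent the result.
bool isMfma4x4DotSupported(int M, int N, int K, int waveSize = 64) {
  bool tileFits = (M % 4 == 0) && (N % 4 == 0) && (K % 64 == 0);
  bool enoughOutputElements = M * N >= waveSize;
  return tileFits && enoughOutputElements;
}

int main() {
  assert(isMfma4x4DotSupported(4, 16, 64));   // (4x64) x (64x16) -> (4x16)
  assert(isMfma4x4DotSupported(8, 8, 64));    // (8x64) x (64x8)  -> (8x8)
  assert(!isMfma4x4DotSupported(4, 4, 128));  // only 16 output elements
  assert(!isMfma4x4DotSupported(4, 8, 64));   // only 32 output elements
  return 0;
}
```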
Fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
Resolve merge conflicts for the tutorial group GEMM example
Fixes for some LIT tests
Resolve remaining conflicts in tests
Fix empty kernel
Set capability to 0
* [MFMA] FP8 and BF8 support
This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the TritonGPU (ttg) to LLVM conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comments fixes
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Support having a chain of MMAs with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
Previously ExpandDims always inserted 1 as the new divisibility, which
made writing (x * stride)[:, None] far slower than (x[:, None] * stride).
A better divisibility can be obtained by computing the GCD of the
divisibilities of the old dims. Now the two snippets above are equally
fast; e.g. the conv kernels generated by the PyTorch inductor may be faster.
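As a rough sketch of the idea only, not the actual AxisInfo implementation (the function and variable names here are invented for illustration): the divisibility of the dimension inserted by expand_dims can be taken as the GCD over the old dims rather than a constant 1.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative sketch: given the per-dimension divisibility of the input
// tensor, compute the divisibility to assign to the dimension inserted by
// expand_dims. Instead of a hard-coded 1, use the GCD across the old dims.
int64_t expandDimsDivisibility(const std::vector<int64_t> &oldDivisibility) {
  int64_t result = 0; // gcd(0, x) == x, so 0 is a neutral starting value
  for (int64_t d : oldDivisibility)
    result = std::gcd(result, d);
  return result == 0 ? 1 : result;
}

int main() {
  // If x * stride has divisibility 16 along its single dim, the new size-1
  // dim inherits divisibility 16 instead of 1, so (x * stride)[:, None]
  // keeps the alignment information that (x[:, None] * stride) already had.
  assert(expandDimsDivisibility({16}) == 16);
  assert(expandDimsDivisibility({16, 8}) == 8);
  return 0;
}
```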
---------
Co-authored-by: Yuheng XIE <thinelephant@gmail.com>
* [Alloc] Enhanced for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* fix for allocation test
* added test
* fixed mfma_enc printer
* fixed test
MMA V3 supports taking operand A from registers. This helps for chained
matmul operations like in attention.
Add an optimization to use this mode when it helps, and add the lowering
for it.
1. On the reduction axis, use `getAxisNumWarpsWithUniqueData` instead of
the raw number of warps to avoid communication among warps that handle
the same piece of data.
2. When there is a single warp on the axis, use warp intrinsics for
communication and skip shared memory.
A follow-up PR is needed for code cleanup.
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, which is available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into 2
main phases.
The first phase is forward propagation of layouts, starting from ops
whose layout we don't want to change and propagating to all the nodes.
If a single layout is needed for an op we can rewrite it; if multiple
layouts are required based on the dependencies, we need a tie break.
The second phase is backward propagation: it takes a backward slice of
operations starting from the convert and, if all the operations in the
slice can be rematerialized, rewrites the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future, such as a cost model
to decide which converts to leave and which to fold.
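As a toy illustration of the forward-propagation phase only (not the real pass; the ops, layout names, and data structures are made up for this sketch), candidate layouts flow from anchor ops to their users, and ops reached by more than one layout are flagged as needing a tie break:

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Toy model of forward layout propagation: each op collects the candidate
// layouts of its operands. One candidate -> rewrite in that layout;
// several candidates -> a tie break (e.g. a future cost model) is needed.
struct Op {
  std::string name;
  std::vector<int> operands;        // indices of producer ops
  std::set<std::string> candidates; // layouts propagated to this op
};

int main() {
  // Anchor ops whose layout we do not want to change seed the propagation.
  std::vector<Op> ops = {
      {"dot_mma", {}, {"#mma"}},          // anchor: keep MMA layout
      {"load_blocked", {}, {"#blocked"}}, // anchor: keep blocked layout
      {"elementwise", {0, 1}, {}},        // consumes both anchors
      {"store", {2}, {}},
  };

  // Forward propagation in topological order (operands precede users here).
  for (size_t i = 0; i < ops.size(); ++i)
    for (int operand : ops[i].operands)
      ops[i].candidates.insert(ops[operand].candidates.begin(),
                               ops[operand].candidates.end());

  for (const Op &op : ops) {
    if (op.candidates.size() == 1)
      std::printf("%s: rewrite in layout %s\n", op.name.c_str(),
                  op.candidates.begin()->c_str());
    else if (op.candidates.size() > 1)
      std::printf("%s: %zu candidate layouts, needs a tie break\n",
                  op.name.c_str(), op.candidates.size());
  }
  return 0;
}
```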
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
Don't materialize empty allocation buffers for 0-byte shared mem buffers; they could lead to unnecessary barriers.
Note: the ReduceOp code has become quite messy and will require some cleanup.
`getScratchSizeInBytes` assumed that the size in bits of every type is a
multiple of 8; if it is not, it would return 0. This caused a bug for the
boolean (i1) type, where the reduction lowering would attempt to use shared
memory that was never assigned to the op.
Fix this issue by setting the number of bytes per element to `ceil(bits / 8)`.
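A minimal sketch of the fix as described (the surrounding helper and test values are made up for this example; only the rounding expression reflects the change):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed per element for a type of `bits` bits, rounding up instead
// of truncating, so that sub-byte types such as i1 get 1 byte instead of 0.
uint64_t bytesPerElement(uint64_t bits) {
  return (bits + 7) / 8; // ceil(bits / 8)
}

int main() {
  assert(bytesPerElement(1) == 1);  // i1: previously truncated to 0
  assert(bytesPerElement(8) == 1);  // i8
  assert(bytesPerElement(16) == 2); // f16 / bf16
  assert(bytesPerElement(32) == 4); // f32 / i32
  return 0;
}
```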