github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Hongtao Yu	315528f349	Use full-vectorized load instructions for load vectorization (#445) * Stabilize load vectorization * fix test failures * Shared one mask check when decomposing a load * Revert "fix test failures" This reverts commit 75a461ae3ea4fdd5105dc73675582368eda80bc6. * Emit vectorized loads * Fix test failures due to using vectorized load	2024-01-18 13:34:05 -06:00
Shucai Xiao	2c7d850c2d	fixed warp size in lowering reduce op (#471 )	2024-01-18 09:38:41 -06:00
Ilya V	5da6276d89	Add shortcut for creation fp16/bfp16 (#468 ) Signed-off-by: joviliast <iveselov.nn@gmail.com>	2024-01-17 11:29:19 -06:00
Lixun Zhang	02a2f24dd5	[Backend] Refactor mfma selection (#441 ) * Select mfma dimensions and instruction from static table * Extend mfmaLayout to include version and instrShape * Simplify generateMFMAOp by searching the mfma instruction in the table * Fix getNonKDim() and non_k_dim * Break instrShape into MDim and NDim	2024-01-16 21:05:35 -06:00
Shucai Xiao	1223f6077a	support type conversion between fp8 formats and bf16/fp32 with HW instructions on MI300 (#414 ) * add type conversion between fp8 and bf16/fp32..	2024-01-15 17:14:49 -06:00
Lixun Zhang	2e217c5a5c	[Backend] Refactor sharedToDotOperandMFMA lowering (#439 ) * Remove unnecessary xor computations for k-major swizzled tensors * Support mfma16 and mfma4 in the fast path * Choose warpsPerCTA according to nonKDim * Set maxPhase=4 for mfma4 * Fix tests For now, we do not disable swizzling for k-major tensors * Remove fastPathComputeOffsetsTy1 * Enable k-major + disabled swizzling in the normal path	2024-01-12 12:50:18 -06:00
oplavsic	bcea3051af	Add support for MFMA layout to view_slice op (#442 ) Co-authored-by: Ognjen <oplavsic@luxoft.com>	2024-01-03 12:13:36 -06:00
oplavsic	6a520566a3	Add view_slice ttgir instruction (#427 ) * Add view_slice op in ttgir --------- Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com> Co-authored-by: Ognjen <oplavsic@luxoft.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2024-01-02 15:40:11 -06:00
Alexander Efimov	98589ac013	[MFMA] Remove CTA related code from layout (#429 ) This PR removes CTALayout attribute from MFMA layout, because it is NV specific.	2023-12-27 18:01:28 +01:00
jayfurmanek	29847e9bb1	Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117 Ifu 231117	2023-12-15 09:09:40 -06:00
Alexander Efimov	40e1dcaa53	[MFMA] Reenable removed CDNA3 int and fp8 support (#424) The MFMA4x4 PR accidentally removed support for `int8xint8 -> int32` and `fp8xfp8 -> fp32` dot on CDNA. This PR re-enables it.	2023-12-14 13:06:28 +01:00
Alexander Efimov	f2afd65e8c	[MFMA] Refactor dot pipeline to reduce code duplication (#400 ) This PR: - simplifies data types generated by `shared->mfma dot op` layout conversions. Do not pack data types in int32 or int64 - reduce code duplication between fast/normal path - reduce code duplication between operand A and operand B Co-authored-by: Shucai Xiao <shucai.xiao@amd.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-12-13 22:33:02 +01:00
Michael Melesse	6efc013e46	ROCM IFU: fix AtomicCASOpConversion segfault	2023-12-12 17:40:31 -06:00
jayfurmanek	a42ac260aa	Merge branch 'triton-mlir' into ifu-231117	2023-12-12 14:24:11 -06:00
Jason Furmanek	160dfe838e	ROCM IFU: Fix print and assert	2023-12-12 19:30:01 +00:00
Alexander Efimov	605a90c58e	[MFMA] Support tile size 4x4 version 1 (#413) This PR enables the 4x4 tile size in MFMA-based dot operations. The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout. However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (i.e. the blocked layout cannot handle tensors smaller than the wave size). For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), or (8x64) x (64x8) -> (8x8). The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8). This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.	2023-12-12 18:23:55 +01:00
Alexander Efimov	a944811b6d	Replace inline assembly in commonShflSync with intrinsics (#418 ) Inline assembly does not take into account instructions around, and in general can not avoid data hazards. Replacing inline asm with intrinsics solves this problem. This particular code behaved incorrectly in one of mfma dot tests: Code generated with help of inline assembly: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ``` Correct code generated with intrinsics: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 s_nop 4 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ```	2023-12-11 16:41:39 +01:00
jayfurmanek	99aa1f4f75	Merge branch 'triton-mlir' into ifu-231117	2023-11-27 07:44:04 -06:00
Shucai Xiao	d9219e0eba	use hw for fp8 type conversion (#386 ) * use hardware instruction for type conversion between fp8 and fp32 * move gpu_matrix_core_version from semantics.py to hip_backend.py --------- Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-11-24 10:26:40 -06:00
Jason Furmanek	4e86b25f1c	ROCM IFU: Fix PrintfHIP	2023-11-21 23:06:14 +00:00
Jason Furmanek	a08dafe7fe	Initial commit to resolve merge conflicts	2023-11-20 22:41:03 +00:00
Jason Furmanek	5c87f363e4	Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117 Conflicts: lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp python/setup.py python/test/unit/language/assert_helper.py python/test/unit/operators/test_flash_attention.py python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/language/semantic.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/tutorials/03-matrix-multiplication.py python/tutorials/05-layer-norm.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-17 20:42:12 +00:00
Ognjen	38fbb7e472	ROCM IFU: Enable slice layout for insertSliceAsync AMD path Fix basic_insert_slice_async_1d lit test Remove code added for debugging Return hopper test	2023-11-17 01:27:57 +00:00
Jason Furmanek	484852876e	Resolve merge conflicts; AMD adjustments for new LLVM version	2023-11-09 19:00:49 +00:00
Jason Furmanek	977d5aa267	Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108 Conflicts: bin/triton-translate.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/triton/compiler/compiler.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-08 18:51:23 +00:00
oplavsic	502525ff11	ROCM IFU: Fix ScanOp test by implementing idx ShflKind in commonShflSync (#391 ) * Fix ScanOp test * Remove commented-out code --------- Co-authored-by: Ognjen <oplavsic@luxoft.com> Fix typo in commonShflSync	2023-11-07 04:29:45 +00:00
Alexander Efimov	8bc417b9b7	do not emit nvidia inline asm	2023-11-07 04:29:44 +00:00
Jason Furmanek	c3132eeda8	ROCM IFU: Third-party fixes: preload ROCDL Additional 3rd-party fix Remove redundant mfma_supported defines	2023-11-07 03:29:16 +00:00
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts fix more conflicts Resolve merge conflicts Some more build and conflict fixes Resolve conflicts for 06-fused-attention.py resolve merge conflicts for the tutorial group gemm example Fixes for some LIT tests resolve remaining conflicts in tests Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
Thomas Raoux	cb3d79a185	[BACKEND] Prevent emitting multiple dot_wait after pipelined loop (#2598) Patch based on @donproc findings and suggested optimization. Emitting multiple wait ops may confuse ptxas and cause it to fall back to a conservative mode.	2023-11-03 14:29:50 -07:00
Shucai Xiao	79bebc4ffe	fp8 type support (#357 ) * add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton. * add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16` * change flashattention to support fp8 in q/k.	2023-11-02 15:51:23 -05:00
Thomas Raoux	ca8f110617	[BACKEND] Pipeliner refactoring (#2565) Refactor the pipeliner pass in order to make it more generic. The main change is that the pipeliner is now broken into 2 pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR based on the modulo schedule. The advantage of separating the two pieces is that it will allow us to create different schedules without having to change the expander, and it will allow for more complex schedules. For now the schedule generated for the matmul case roughly matches the schedule picked by the previous pipeliner in order to avoid changes. This also creates a different sequence of insert/extract slice for the alloc. We should probably change shared alloc to use memory semantic.	2023-11-02 09:56:39 -07:00
Thomas Raoux	218492cd65	[BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583 )	2023-11-02 05:32:16 +00:00
Dongdong Li	d0098da7b1	[BACKEND] Add error reporting to report non-kernel-argument (#2552 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-11-01 20:22:10 -04:00
Goran Flegar	601b95cdbb	[DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570 ) Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>	2023-10-31 12:06:25 -07:00
daemondzh	96cf8f979a	[OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545 ) Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>	2023-10-31 02:28:27 +00:00
Chris Jones	2398b82f18	[FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-30 19:18:27 -07:00
Keren Zhou	70fca00b67	[BACKEND] Fix `device_print` without arguments (#2566 )	2023-10-30 20:04:44 -04:00
zhu jianjiang	cfae7e2a25	[BACKEND] Fix matmul downcast path (#2528 ) for https://github.com/openai/triton/issues/2523 ,add regression test --------- Co-authored-by: Jokeren <robinho364@gmail.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-26 09:43:49 -04:00
Shucai Xiao	2729ae6c6f	use different int8 mfma instructions on different GPUs. (#368 ) * changes support to choose different int8 instructions * rename an instruction name Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-10-25 19:12:21 -05:00
Hongtao Yu	2323adb387	[BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514 ) Addressing https://github.com/openai/triton/issues/2011 Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-25 15:20:07 -04:00
Alexander Efimov	5a86b46bb1	[MFMA] FP8 and BF8 support (#355 ) * [MFMA] FP8 and BF8 support This PR adds support of fp8 and bf8 in AccelerateMatmul pass and Introduces generation of float8 mfma instructions in ttg to llvm conversion. * add tests * fix tests * review fix: fix variable naming and dot operand promotion. * review comments fixes --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-25 13:27:10 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Adnan Akhundov	7d55968fee	[BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512) ### Summary When Triton GPU IR is lowered into LLVM IR, we can make use of the constancy information about the result of the elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: the constancy is checked and, if possible, some of the values in LLVM IR are reused multiple times instead of computing equal values separately. The change is beneficial for the PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s). This makes the computation resulting in those sub-indices redundant. Consequently, under the necessary constancy conditions, the redundant indexing arithmetic can be deduplicated. We observe up to 29% decrease in the latency of some of our jagged tensor kernels.	2023-10-25 11:25:29 -04:00
Justin Lebar	e70e11e834	[BACKEND] Improve printf. (#2532 ) [BACKEND] Improve printf. Previously, we printed all of a GPU thread's values in a single printf() call, and this, plus the user-specified prefix, was all we printed. This caused a few problems. - nvptx printf can only handle 32 arguments; if you pass more than that, it prints garbage. So if a thread had more than 32 values, you couldn't print them, issue #2486. - The order of the values within the Triton program (GPU thread block) is an implementation detail -- it depends on the layout the compiler assigns to a tensor. So this also prevented you from interpreting the printed output. To address this, we now print the Triton pid and multi-dimensional Tensor index for each value. And each value gets its own line to avoid passing too many args to printf. Example output: ``` pid (0, 1, 2) idx (36, 127) x: 42 ``` If you want to observe all the values in a tensor in order, you can grep and then sort the output. We also make a UX enhancement to print: The printed label always ends with ": "; you don't have to add it yourself. Fixes #2486.	2023-10-25 08:47:55 +00:00
Justin Lebar	2217bd2f5c	[BACKEND] Delete dead vprintf and vprintf_array functions (#2531 ) Delete dead vprintf and vprintf_array functions. These were introduced in `88498d104a` and appear to have been dead at the time of introduction.	2023-10-25 00:22:53 +00:00
Thomas Raoux	cba7abd682	[BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526 ) Now that the bug related to attribute is fixed in MLIR we can use arith ops for cmp and select ops.	2023-10-23 19:35:46 -07:00
jayfurmanek	e74bdb1581	Always promote to int32 in commonShflSync (#369 )	2023-10-23 12:27:11 -05:00
Zahi Moudallal	b0c166b9e3	[BACKEND] Fixing bug in elementwise conversion (#2517 )	2023-10-20 09:11:15 -07:00
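
Note on commit 605a90c58e (MFMA 4x4 tiles): the output-size condition described in that message can be sketched as a small check. This is only an illustration of the "at least 64 output elements" rule as stated in the commit text; the function name and the constant are hypothetical, and the real compiler may apply further conditions that the message does not mention.

```python
# Hypothetical helper; not part of Triton. It encodes only the rule stated in
# the commit message: the dot result must have at least 64 output elements.
MIN_OUTPUT_ELEMENTS = 64  # taken from the commit message


def mfma4x4_output_large_enough(m: int, n: int) -> bool:
    """Return True if an (m x k) x (k x n) dot yields at least 64 outputs."""
    return m * n >= MIN_OUTPUT_ELEMENTS


# Shapes listed in the commit message, as (M, K, N).
supported = [(4, 64, 16), (16, 64, 4), (8, 64, 8)]
unsupported = [(4, 128, 4), (4, 64, 8)]

supported_results = [mfma4x4_output_large_enough(m, n) for m, _, n in supported]
unsupported_results = [mfma4x4_output_large_enough(m, n) for m, _, n in unsupported]
```

Each supported example produces exactly 64 output elements, while the two unsupported examples produce 16 and 32.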


513 Commits
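
Note on commit e70e11e834 (improved printf): that message suggests grepping and sorting the per-value output lines to read a tensor in order. A minimal sketch of that idea follows, assuming every line has the form shown in the commit's example (`pid (0, 1, 2) idx (36, 127) x: 42`). The function name and regular expression are illustrative and are not part of Triton.

```python
import re

# Matches the example format from the commit message:
#   pid (0, 1, 2) idx (36, 127) x: 42
LINE_RE = re.compile(r"pid \(([^)]*)\) idx \(([^)]*)\) (.*)")


def sort_printed_values(lines, label):
    """Keep lines printed for `label` and sort them by (pid, idx)."""
    rows = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or not match.group(3).startswith(label + ": "):
            continue  # unrelated output, or a different label
        pid = tuple(int(p) for p in match.group(1).split(","))
        idx = tuple(int(i) for i in match.group(2).split(","))
        rows.append((pid, idx, line.strip()))
    rows.sort(key=lambda row: (row[0], row[1]))  # numeric, not lexical, order
    return [row[2] for row in rows]


sample = [
    "pid (0, 0, 0) idx (10, 2) x: 7",
    "pid (0, 0, 0) idx (2, 5) x: 3",
    "some unrelated log line",
    "pid (0, 0, 0) idx (2, 5) y: 9",
]
ordered = sort_printed_values(sample, "x")
```

Parsing the indices as integers avoids the pitfall of a plain text sort, which would place `idx (10, 2)` before `idx (2, 5)`.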