github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
joviliast	47e801730c	Add lit tests for TritonAMDGPUAccelerateMatmulPass WMMA case Signed-off-by: joviliast <iveselov.nn@gmail.com>	2023-12-18 09:11:20 -06:00
joviliast	af15da2f84	Support WMMA layout in TritonAMDGPUAccelerateMatmulPass - Introduce WmmaEncodingAttr for WMMA output - Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass - Provide a flag to check if wmma instructions are supported by target Signed-off-by: joviliast <iveselov.nn@gmail.com>	2023-12-18 09:11:20 -06:00
Vinayak Gokhale	b7a412d82a	Add support for ALiBi-style attention bias (#417) Add support for matrix and vector bias to FA.	2023-12-15 16:28:37 -06:00
jayfurmanek	29847e9bb1	Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117 Ifu 231117	2023-12-15 09:09:40 -06:00
Shucai Xiao	521f425fbf	add bitcode for gfx941 and gfx942 (#403) Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>	2023-12-14 08:19:23 -06:00
Alexander Efimov	40e1dcaa53	[MFMA] Reenable removed CDNA3 int and fp8 support (#424) MFMA4x4 PR accidentally removed support of `int8xint8 -> int32` and `fp8xfp8 -> fp32` dot on CDNA. This PR re-enables it.	2023-12-14 13:06:28 +01:00
Michael Melesse	26c3f99073	ROCM IFU: disable test_reduce_layouts	2023-12-13 17:12:39 -06:00
Alexander Efimov	f2afd65e8c	[MFMA] Refactor dot pipeline to reduce code duplication (#400) This PR: - simplifies data types generated by `shared->mfma dot op` layout conversions. Do not pack data types in int32 or int64 - reduce code duplication between fast/normal path - reduce code duplication between operand A and operand B Co-authored-by: Shucai Xiao <shucai.xiao@amd.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-12-13 22:33:02 +01:00
Michael Melesse	7a1f54645e	ROCM IFU: remove old tests	2023-12-13 15:30:55 +00:00
Michael Melesse	c7b62d5ec5	ROCM IFU: remove test_ext_elemwise	2023-12-12 22:58:56 -06:00
Michael Melesse	6efc013e46	ROCM IFU: fix AtomicCASOpConversion segfault	2023-12-12 17:40:31 -06:00
jayfurmanek	a42ac260aa	Merge branch 'triton-mlir' into ifu-231117	2023-12-12 14:24:11 -06:00
Jason Furmanek	160dfe838e	ROCM IFU: Fix print and assert	2023-12-12 19:30:01 +00:00
Alexander Efimov	605a90c58e	[MFMA] Support tile size 4x4 version 1 (#413) This PR enables 4x4 tile size in MFMA based dot operations. Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout. However, the actual dot operation should have at least 64 output elements; this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize). For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8) The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8) This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.	2023-12-12 18:23:55 +01:00
Michael Melesse	50a6db3afd	ROCM IFU: Lit test fixes	2023-12-11 17:00:35 -06:00
Alexander Efimov	a944811b6d	Replace inline assembly in commonShflSync with intrinsics (#418) Inline assembly does not take surrounding instructions into account and, in general, can not avoid data hazards. Replacing inline asm with intrinsics solves this problem. This particular code behaved incorrectly in one of the mfma dot tests: Code generated with inline assembly: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ``` Correct code generated with intrinsics: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 s_nop 4 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ```	2023-12-11 16:41:39 +01:00
Alexander Efimov	2be6ec771e	[GEMM] [Tuning] Make tuning script more verbose (#420) This PR adds: - verbose tuning mode: printing std output of compilation and tuning calls - collecting information about failed compilations - print correctness check output with word - split dimensions in generated scripts with "-" - gpu_ids option to set particular gpus	2023-12-10 22:04:00 -06:00
Alexander Efimov	e19b5fd6bc	[GEMM] Add script to run one tuning config (#419) The script runs one given config for debug purposes.	2023-12-07 18:12:03 -06:00
Michael Melesse	64a0924381	ROCM IFU: remove ref to test_elementwise	2023-12-07 13:31:59 -06:00
Vinayak Gokhale	1d6b919897	Bugfix: Wrong boundary condition on qk GEMM	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	f6969f4bb3	Correct that loop lo is multiple BLOCK_N	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	0ef865508c	Update description	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	dc62569e57	Remove slicing for out in save for bwd	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	e0a4d97569	Mask vs pad for non power of 2 sequence lengths Padding results in memory allocation which is slower. Masking results in better performance.	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	d5028079b7	Add FA support for non pow2 seqlen	2023-12-04 10:11:41 -06:00
Lixun Zhang	670ae8054d	Add a cute tool to plot blocked, dotOperand, and mfma layout (#407) * Add commands to plot blocked, dotOperand, and mfma layout * Add commands to plot LDS layout and wmma instruction layout	2023-11-29 09:35:33 -06:00
Jason Furmanek	64f559771f	ROCM IFU: Fix test_core_amd.py::test_reduce_layouts	2023-11-28 04:02:48 +00:00
Jason Furmanek	f5f6b3c0a3	ROCM IFU: Add get_version_key for ROCM backend	2023-11-28 00:11:44 +00:00
Jason Furmanek	71547e4fdb	ROCM IFU: Fixes for kwargs	2023-11-28 00:09:46 +00:00
jayfurmanek	99aa1f4f75	Merge branch 'triton-mlir' into ifu-231117	2023-11-27 07:44:04 -06:00
Jason Furmanek	968a35fbf0	Fix merge conflict error	2023-11-27 13:33:23 +00:00
Shucai Xiao	d9219e0eba	use hw for fp8 type conversion (#386) * use hardware instruction for type conversion between fp8 and fp32 * move gpu_matrix_core_version from semantics.py to hip_backend.py --------- Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-11-24 10:26:40 -06:00
Jason Furmanek	4e86b25f1c	ROCM IFU: Fix PrintfHIP	2023-11-21 23:06:14 +00:00
Jason Furmanek	a08dafe7fe	Initial commit to resolve merge conflicts	2023-11-20 22:41:03 +00:00
Jason Furmanek	5c87f363e4	Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117 Conflicts: lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp python/setup.py python/test/unit/language/assert_helper.py python/test/unit/operators/test_flash_attention.py python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/language/semantic.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/tutorials/03-matrix-multiplication.py python/tutorials/05-layer-norm.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-17 20:42:12 +00:00
jayfurmanek	e1513b34e1	Merge pull request #395 from ROCmSoftwarePlatform/ifu-231108 Ifu 231108	2023-11-17 14:14:14 -06:00
jayfurmanek	bb56be9d16	Merge branch 'triton-mlir' into ifu-231108	2023-11-16 19:30:43 -06:00
Alexander Efimov	dfb76540b4	[Tutorial] Fix post IFU issues with FA (#398) * [Tutorial] Fix post IFU issues with FA * Remove redundant kernels in 06-fused-attention.py * Added README for scripts in perf-kernels dir * Fix bwd kernel --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-17 01:28:49 +00:00
Alexander Efimov	096def0c9b	[Test] Disable mma layout for amd hardware (#384) Disable mma layout testing by looking at is_hip instead of wave size. This fixes tests on Navi GPUs with wave size == 32.	2023-11-17 01:28:49 +00:00
Lixun Zhang	181bdbd410	Benchmark FA on 2 GCDs (#393)	2023-11-17 01:28:49 +00:00
Jason Furmanek	44b155f41b	ROCM IFU: Resolve merge conflicts in tutorial 06 Resolve merge conflicts in tutorial 06 - 2	2023-11-17 01:28:40 +00:00
Ognjen	9f3d6656a7	ROCM IFU: Fix reduce_slice lit test Skip tritongpu_to_llvm_hopper test as it is nvidia specific	2023-11-17 01:28:33 +00:00
Ognjen	38fbb7e472	ROCM IFU: Enable slice layout for insertSliceAsync AMD path Fix basic_insert_slice_async_1d lit test Remove code added for debugging Return hopper test	2023-11-17 01:27:57 +00:00
Alexander Efimov	5b06b168aa	[Tutorial] Fix post IFU issues with FA (#398) * [Tutorial] Fix post IFU issues with FA * Remove redundant kernels in 06-fused-attention.py * Added README for scripts in perf-kernels dir * Fix bwd kernel --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-14 10:46:45 -06:00
Alexander Efimov	9941ce7aa5	[Test] Disable mma layout for amd hardware (#384) Disable mma layout testing by looking at is_hip instead of wave size. This fixes tests on Navi GPUs with wave size == 32.	2023-11-13 17:52:35 +01:00
Jason Furmanek	484852876e	Resolve merge conflicts; AMD adjustments for new LLVM version	2023-11-09 19:00:49 +00:00
Jason Furmanek	977d5aa267	Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108 Conflicts: bin/triton-translate.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/triton/compiler/compiler.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-08 18:51:23 +00:00
Lixun Zhang	d4eda83b33	Benchmark FA on 2 GCDs (#393)	2023-11-08 12:42:54 -06:00
Lixun Zhang	1af893d8a2	[FRONTEND] Add input dtypes to autotuning key (#2534) (#374) * [FRONTEND] Add input dtypes to autotuning key (#2534) * Fix conflict in 06-fused-attention * Fix get_best_config in FA-transV.py * Fix leftover get_best_config() --------- Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>	2023-11-07 19:36:57 -06:00
jayfurmanek	3c1fe617c1	Merge pull request #382 from ROCmSoftwarePlatform/ifu231005-rebase Ifu231005	2023-11-06 22:55:25 -06:00
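
Note on commit 605a90c58e: its message states that a dot using the 4x4 MFMA tile must still produce at least 64 output elements. The size rule can be sketched in a few lines of Python. The function name and the wave_size default below are hypothetical, chosen for illustration only; this is not code from the repository, and it models only the output-size rule, not every condition the compiler may check.

```python
# Illustrative sketch only: models the "at least 64 output elements" rule
# quoted in the commit message. Names here are hypothetical.

WAVE_SIZE = 64  # assumption: the wave size referred to in the commit message


def dot_has_enough_outputs(m: int, n: int, wave_size: int = WAVE_SIZE) -> bool:
    """Return True if an (m x k) x (k x n) dot yields at least wave_size outputs."""
    return m * n >= wave_size


if __name__ == "__main__":
    # Output shapes (m, n) taken from the examples in the commit message.
    for m, n in [(4, 16), (16, 4), (8, 8), (4, 4), (4, 8)]:
        status = "ok" if dot_has_enough_outputs(m, n) else "too small"
        print(f"({m}x{n}): {status}")
```

Under this rule the three shapes the message lists as supported, (4x16), (16x4) and (8x8), each have exactly 64 outputs, while (4x4) and (4x8) have 16 and 32 and fall below the threshold.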

2356 Commits