github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Lixun Zhang	2e217c5a5c	[Backend] Refactor sharedToDotOperandMFMA lowering (#439 ) * Remove unnecessary xor computations for k-major swizzled tensors * Support mfma16 and mfma4 in the fast path * Choose warpsPerCTA according to nonKDim * Set maxPhase=4 for mfma4 * Fix tests For now, we do not disable swizzling for k-major tensors * Remove fastPathComputeOffsetsTy1 * Enable k-major + disabled swizzling in the normal path	2024-01-12 12:50:18 -06:00
Shucai Xiao	a7bb38ea79	enable layout conversion from mfma to dot_op for mfma16. (#453 ) * enable the layout conversion from mfma layout to dot_operand layout for mfma16 * backup changes	2024-01-10 22:31:59 -06:00
Lixun Zhang	b5ed97873c	Added a script to print occupancy info (#450 )	2024-01-10 14:01:13 -06:00
Ilya V	2e01bf08e9	[HotFix] Fix dot op for RDNA3 architecture (#451 ) Disabled BlockedToWMMA layout transformation until WMMA is supported completely Signed-off-by: joviliast <iveselov.nn@gmail.com>	2024-01-10 08:58:57 -06:00
Lixun Zhang	ce9dacec72	Skip BLOCK_SIZE that is too large compared to M/N (#449 )	2024-01-09 13:41:09 -06:00
Vinayak Gokhale	c2766bbd5f	Merge changes from upstream FA bwd kernel (#444 ) * Add optimized FA bwd from upstream * Add autotuning * Change loads and stores to use block ptrs * Cleanup	2024-01-05 15:12:05 -06:00
oplavsic	bcea3051af	Add support for MFMA layout to view_slice op (#442 ) Co-authored-by: Ognjen <oplavsic@luxoft.com>	2024-01-03 12:13:36 -06:00
oplavsic	6a520566a3	Add view_slice ttgir instruction (#427 ) * Add view_slice op in ttgir --------- Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com> Co-authored-by: Ognjen <oplavsic@luxoft.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2024-01-02 15:40:11 -06:00
Alexander Efimov	98589ac013	[MFMA] Remove CTA related code from layout (#429 ) This PR removes CTALayout attribute from MFMA layout, because it is NV specific.	2023-12-27 18:01:28 +01:00
Jack Taylor	1e2fd0dd1a	Update hip_backend to use libhsa-runtime for arch info, (#411 ) brings in path changes for pytorch triton wheels Co-authored-by: jayfurmanek <Jason.Furmanek@amd.com>	2023-12-21 15:40:57 +00:00
Vinayak Gokhale	0248bdb29d	Minor edits to HBM bandwidth measurement kernel (#434 ) * Change units to GiB/s from GB/s * Run both with and w/o bounds check	2023-12-21 06:14:31 -06:00
jayfurmanek	16281f02f4	[ROCM] drop GIL for launch, and set value=false upon pointer error (#426 )	2023-12-19 08:34:51 -06:00
Vinayak Gokhale	422d7096ce	Add kernel to check HBM BW (#431 ) Add kernel to check HBM BW performance	2023-12-18 21:25:21 -06:00
joviliast	5c182aa73a	Disable WMMA dot transformation Enabled only in lit test. Revert after complete enabling WMMA Signed-off-by: joviliast <iveselov.nn@gmail.com>	2023-12-18 09:11:20 -06:00
joviliast	47e801730c	Add lit tests for TritonAMDGPUAccelerateMatmulPass WMMA case Signed-off-by: joviliast <iveselov.nn@gmail.com>	2023-12-18 09:11:20 -06:00
joviliast	af15da2f84	Support WMMA layout in TritonAMDGPUAccelerateMatmulPass -Introduce WmmaEncodingAttr for WMMA output -Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass -Provide a flag to check if wmma instructions are supported by target Signed-off-by: joviliast <iveselov.nn@gmail.com>	2023-12-18 09:11:20 -06:00
Vinayak Gokhale	b7a412d82a	Add support for ALiBi-style attention bias (#417 ) Add support for matrix and vector bias to FA.	2023-12-15 16:28:37 -06:00
jayfurmanek	29847e9bb1	Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117 Ifu 231117	2023-12-15 09:09:40 -06:00
Shucai Xiao	521f425fbf	add bitcode for gfx941 and gfx942 (#403 ) Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>	2023-12-14 08:19:23 -06:00
Alexander Efimov	40e1dcaa53	[MFMA] Reenable removed CDNA3 int and fp8 support (#424 ) MFMA4x4 PR accidentally removed support of `int8xint8 -> int32` and `fp8xfp8 -> fp32` dot on CDNA. This PR reenables it.	2023-12-14 13:06:28 +01:00
Michael Melesse	26c3f99073	ROCM IFU: disable test_reduce_layouts	2023-12-13 17:12:39 -06:00
Alexander Efimov	f2afd65e8c	[MFMA] Refactor dot pipeline to reduce code duplication (#400 ) This PR: - simplifies data types generated by `shared->mfma dot op` layout conversions. Do not pack data types in int32 or int64 - reduce code duplication between fast/normal path - reduce code duplication between operand A and operand B Co-authored-by: Shucai Xiao <shucai.xiao@amd.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-12-13 22:33:02 +01:00
Michael Melesse	7a1f54645e	ROCM IFU: remove old tests	2023-12-13 15:30:55 +00:00
Michael Melesse	c7b62d5ec5	ROCM IFU: remove test_ext_elemwise	2023-12-12 22:58:56 -06:00
Michael Melesse	6efc013e46	ROCM IFU: fix AtomicCASOpConversion segfault	2023-12-12 17:40:31 -06:00
jayfurmanek	a42ac260aa	Merge branch 'triton-mlir' into ifu-231117	2023-12-12 14:24:11 -06:00
Jason Furmanek	160dfe838e	ROCM IFU: Fix print and assert	2023-12-12 19:30:01 +00:00
Alexander Efimov	605a90c58e	[MFMA] Support tile size 4x4 version 1 (#413 ) This PR enables 4x4 tile size in MFMA based dot operations. Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout. However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize). For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8) Following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8) This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.	2023-12-12 18:23:55 +01:00
Michael Melesse	50a6db3afd	ROCM IFU: Lit test fixes	2023-12-11 17:00:35 -06:00
Alexander Efimov	a944811b6d	Replace inline assembly in commonShflSync with intrinsics (#418 ) Inline assembly does not take into account instructions around, and in general can not avoid data hazards. Replacing inline asm with intrinsics solves this problem. This particular code behaved incorrectly in one of mfma dot tests: Code generated with help of inline assembly: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ``` Correct code generated with intrinsics: ``` v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0 s_nop 4 ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4) ```	2023-12-11 16:41:39 +01:00
Alexander Efimov	2be6ec771e	[GEMM] [Tuning] Make tuning script more verbose (#420 ) This PR adds: - verbose tuning mode: printing std output of compilation and tuning calls - collecting information about failed compilations - print correctness check output with word - split dimensions in generated scripts with "-" - gpu_ids option to set particular gpus	2023-12-10 22:04:00 -06:00
Alexander Efimov	e19b5fd6bc	[GEMM] Add script to run one tuning config (#419 ) The script runs one given config for debug purposes.	2023-12-07 18:12:03 -06:00
Michael Melesse	64a0924381	ROCM IFU: remove ref to test_elementwise	2023-12-07 13:31:59 -06:00
Vinayak Gokhale	1d6b919897	Bugfix: Wrong boundary condition on qk GEMM	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	f6969f4bb3	Correct that loop lo is multiple BLOCK_N	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	0ef865508c	Update description	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	dc62569e57	Remove slicing for out in save for bwd	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	e0a4d97569	Mask vs pad for non power of 2 sequence lengths Padding results in memory allocation which is slower. Masking results in better performance.	2023-12-04 10:11:41 -06:00
Vinayak Gokhale	d5028079b7	Add FA support for non pow2 seqlen	2023-12-04 10:11:41 -06:00
Lixun Zhang	670ae8054d	Add a cute tool to plot blocked, dotOperand, and mfma layout (#407 ) * Add commands to plot blocked, dotOperand, and mfma layout * Add commands to plot LDS layout and wmma instruction layout	2023-11-29 09:35:33 -06:00
Jason Furmanek	64f559771f	ROCM IFU: Fix test_core_amd.py::test_reduce_layouts	2023-11-28 04:02:48 +00:00
Jason Furmanek	f5f6b3c0a3	ROCM IFU: Add get_version_key for ROCM backend	2023-11-28 00:11:44 +00:00
Jason Furmanek	71547e4fdb	ROCM IFU: Fixes for kwargs	2023-11-28 00:09:46 +00:00
jayfurmanek	99aa1f4f75	Merge branch 'triton-mlir' into ifu-231117	2023-11-27 07:44:04 -06:00
Jason Furmanek	968a35fbf0	Fix merge conflict error	2023-11-27 13:33:23 +00:00
Shucai Xiao	d9219e0eba	use hw for fp8 type conversion (#386 ) * use hardware instruction for type conversion between fp8 and fp32 * move gpu_matrix_core_version from semantics.py to hip_backend.py --------- Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-11-24 10:26:40 -06:00
Jason Furmanek	4e86b25f1c	ROCM IFU: Fix PrintfHIP	2023-11-21 23:06:14 +00:00
Jason Furmanek	a08dafe7fe	Initial commit to resolve merge conflicts	2023-11-20 22:41:03 +00:00
Jason Furmanek	5c87f363e4	Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117 Conflicts: lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp python/setup.py python/test/unit/language/assert_helper.py python/test/unit/operators/test_flash_attention.py python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/language/semantic.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/tutorials/03-matrix-multiplication.py python/tutorials/05-layer-norm.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-17 20:42:12 +00:00
jayfurmanek	e1513b34e1	Merge pull request #395 from ROCmSoftwarePlatform/ifu-231108 Ifu 231108	2023-11-17 14:14:14 -06:00


2370 Commits