* Stabilize load vectorization
* fix test failures
* Share one mask check when decomposing a load
* Revert "fix test failures"
This reverts commit 75a461ae3ea4fdd5105dc73675582368eda80bc6.
* Emit vectorized loads
* Fix test failures due to using vectorized load
* Fix vecSize for fp8 and int8 on MI300
* Fix typo
* Update include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
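For reference, the vecSize fix above reduces to simple arithmetic. A minimal sketch, assuming global loads are capped at 128 bits; the cap and the function name are illustrative, not the actual pass code:
```python
# Sketch: derive the load vector size from the element bit width,
# assuming the widest global load is 128 bits (dwordx4).
def vec_size(elem_bits: int, max_load_bits: int = 128) -> int:
    return max_load_bits // elem_bits

assert vec_size(16) == 8   # fp16
assert vec_size(8) == 16   # fp8 / int8, the case fixed for MI300
```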
* Add gemm tuning script v3
* Introduce --jobs to control the number of files to generate
* Switch to the trans convention used by Tensile
* Rerun rocprof if it crashes
* Update README
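The rocprof rerun is a plain retry loop around the profiler invocation. A minimal sketch, where the retry budget and the example command line are assumptions rather than the script's actual values:
```python
import subprocess

# Sketch: rerun rocprof when it crashes. Command and retry budget
# are illustrative.
def run_rocprof(cmd: list[str], retries: int = 3) -> None:
    for attempt in range(retries):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        print(f"rocprof crashed (attempt {attempt + 1}), rerunning")
    raise RuntimeError("rocprof failed after all retries")

# run_rocprof(["rocprof", "-o", "results.csv", "python", "bench.py"])
```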
* Remove peak perf and efficiency
This reverts commit 1fec965c06.
This change used pre_hook to edit a kernel argument. However,
pre_hook does not always make changes made inside the hook
visible to the kernel.
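The hazard is easy to reproduce in plain Python, independent of Triton's hook machinery: a value captured before the hook runs is unaffected by rebinding inside the hook. A sketch with illustrative names:
```python
# Sketch of the hazard: a pre-hook that rebinds an entry in the args
# dict does not affect a value the launch path already captured.
def pre_hook(args: dict) -> None:
    args["BLOCK_M"] = 128          # rebinding here ...

def launch(**kwargs):
    block_m = kwargs["BLOCK_M"]    # ... value captured before the hook
    pre_hook(kwargs)
    return block_m                 # still the old value

assert launch(BLOCK_M=64) == 64
```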
* Select mfma dimensions and instruction from static table
* Extend mfmaLayout to include version and instrShape
* Simplify generateMFMAOp by searching the mfma instruction in the table
* Fix getNonKDim() and non_k_dim
* Break instrShape into MDim and NDim
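A minimal sketch of the table-driven selection, assuming a key of (version, MDim, NDim, element type). The entries name real MFMA instructions, but the structure and names here are illustrative:
```python
# Sketch: select an MFMA intrinsic from a static table keyed on the
# layout's version, M/N dims, and element type.
MFMA_TABLE = {
    (2, 32, 32, "f16"): "mfma_f32_32x32x8f16",
    (2, 16, 16, "f16"): "mfma_f32_16x16x16f16",
    (2, 4, 4, "f16"):   "mfma_f32_4x4x4f16",
}

def select_mfma(version: int, m: int, n: int, elem_ty: str) -> str:
    try:
        return MFMA_TABLE[(version, m, n, elem_ty)]
    except KeyError:
        raise ValueError(f"no MFMA instruction for {(version, m, n, elem_ty)}")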
* First commit
* Implement DotSlicing pass.
* Small fixes
* Support chained dot in DotSlicingPass (second GEMM in FA)
* Add lit test for FA dot slicing
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
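Conceptually, dot slicing relies on the fact that a GEMM decomposes into independent smaller dots along a non-K dimension. A numpy illustration of the invariant the pass preserves, not the MLIR pass itself:
```python
import numpy as np

# Conceptual illustration: slicing a dot along N. Each slice is an
# independent smaller dot; concatenating the results recovers A @ B.
A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 64).astype(np.float32)
slice_n = 16

slices = [A @ B[:, j:j + slice_n] for j in range(0, B.shape[1], slice_n)]
C = np.concatenate(slices, axis=1)
np.testing.assert_allclose(C, A @ B, rtol=1e-4)
```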
- Check correctness of what is benchmarked
- Add capability to check col_a and col_b
- But only check col_a=False, col_b=True for now
- Only benchmark col_a=False, col_b=True for now
- Remove in='int8', out='int8' because the error is too large
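The layout checks above amount to comparing against a reference matmul with the operands stored column- or row-major. A sketch using torch as the reference; `kernel` and the tolerances are assumptions:
```python
import torch

# Sketch: correctness check for one (col_a, col_b) combination.
# col_a / col_b say whether A / B are column-major in memory; the
# math is unchanged, so the reference is a plain matmul.
def make_operand(rows, cols, col_major: bool, dtype=torch.float16):
    t = torch.randn(rows, cols, dtype=dtype)
    # column-major storage with the same logical values
    return t.T.contiguous().T if col_major else t

def check(kernel, m, n, k, col_a: bool, col_b: bool) -> None:
    a = make_operand(m, k, col_a)   # col_a=False for now
    b = make_operand(k, n, col_b)   # col_b=True for now
    out = kernel(a, b)              # `kernel` stands in for the benchmarked matmul
    torch.testing.assert_close(out, torch.matmul(a, b), rtol=1e-2, atol=1e-2)
```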
* Remove unnecessary xor computations for k-major swizzled tensors
* Support mfma16 and mfma4 in the fast path
* Choose warpsPerCTA according to nonKDim
* Set maxPhase=4 for mfma4
* Fix tests
For now, we do not disable swizzling for k-major tensors
* Remove fastPathComputeOffsetsTy1
* Enable k-major + disabled swizzling in the normal path
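For context, these offsets implement the standard xor-based swizzle. A sketch of the pattern (vector grouping omitted), with the degenerate max_phase == 1 case showing how an xor can become a no-op; the actual elision condition for k-major tensors may differ:
```python
# Sketch: each row's columns are xor-ed with a phase so that column
# accesses spread across LDS banks. perPhase/maxPhase follow the
# shared layout's parameters.
def swizzled_col(row: int, col: int, per_phase: int, max_phase: int) -> int:
    phase = (row // per_phase) % max_phase
    return col ^ phase

# With max_phase == 1 the phase is always 0, so the xor is dead code.
assert swizzled_col(5, 3, per_phase=1, max_phase=1) == 3
```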
- Introduce WmmaEncodingAttr for WMMA output
- Introduce a BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
- Provide a flag to check whether WMMA instructions are supported by the target
Signed-off-by: joviliast <iveselov.nn@gmail.com>
This PR:
- simplifies the data types generated by `shared->mfma dot op` layout conversions: data is no longer packed into int32 or int64
- reduces code duplication between the fast and normal paths
- reduces code duplication between operand A and operand B
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
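A small numpy illustration of the data-type point: the packed and unpacked forms carry the same bytes, so skipping the int32/int64 packing removes conversions without changing any values:
```python
import numpy as np

# Illustration: instead of packing four fp16 values into one opaque
# int64 and unpacking later, keep them as a typed vector of elements.
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)
packed = vals.view(np.int64)        # old style: one opaque int64
unpacked = packed.view(np.float16)  # new style: elements stay typed
np.testing.assert_array_equal(vals, unpacked)
```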
This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of the other layouts that appear while processing the result (e.g., the blocked layout cannot handle tensors smaller than the wavefront size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4) and (4x64) x (64x8) -> (4x8).
This is a first version of the dot lowering using mfma 4x4 instructions; it still contains redundancy and extra reductions.
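The output-size constraint can be stated directly; a sketch grounded in the examples above (the real check in the pass may involve further conditions):
```python
# Sketch of the stated constraint: the 4x4 MFMA tile is usable only
# when the dot produces at least 64 output elements (a blocked-layout
# limitation: it cannot handle tensors smaller than a wave).
def mfma4_dot_supported(m: int, n: int) -> bool:
    return m * n >= 64

assert mfma4_dot_supported(4, 16)      # (4x64)  x (64x16) -> (4x16)
assert mfma4_dot_supported(8, 8)       # (8x64)  x (64x8)  -> (8x8)
assert not mfma4_dot_supported(4, 4)   # (4x128) x (128x4) -> (4x4)
assert not mfma4_dot_supported(4, 8)   # (4x64)  x (64x8)  -> (4x8)
```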
Inline assembly does not take the surrounding instructions into account and, in general, cannot avoid data hazards.
Replacing the inline asm with intrinsics solves this problem.
This particular code behaved incorrectly in one of the mfma dot tests.
Code generated with the help of inline assembly:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
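; hazard: ds_swizzle below reads v4 before the mfma result in v[4:7] is ready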
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
Correct code generated with intrinsics:
```
v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
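; wait states: the mfma result in v[4:7] must land before ds_swizzle reads v4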
s_nop 4
ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
This PR adds:
- a verbose tuning mode: print the standard output of compilation and tuning calls
- collecting information about failed compilations
- printing the correctness check output with a word
- splitting dimensions in the generated scripts with "-"
- a gpu_ids option to set particular GPUs
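The gpu_ids option simply partitions the tuning tasks across the listed devices. A minimal sketch of a round-robin assignment; per-GPU dispatch (e.g., one process per device with HIP_VISIBLE_DEVICES set) is elided, and all names are illustrative:
```python
# Sketch: round-robin assignment of tuning tasks to the GPUs named
# by the gpu_ids option.
def assign_tasks(tasks: list, gpu_ids: list[int]) -> dict[int, list]:
    buckets = {gid: [] for gid in gpu_ids}
    for i, task in enumerate(tasks):
        buckets[gpu_ids[i % len(gpu_ids)]].append(task)
    return buckets

assert assign_tasks(["t0", "t1", "t2"], [0, 1]) == {0: ["t0", "t2"], 1: ["t1"]}
```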