Commit Graph

2386 Commits

Author SHA1 Message Date
Ognjen Plavsic
cf3a387f4e cache push 2024-01-31 14:14:09 +00:00
Ognjen
3a12d9d269 fix 2024-01-31 14:13:36 +00:00
Ognjen
171a67e837 Add scheduling pass 2024-01-25 18:07:45 +00:00
Hongtao Yu
315528f349 Use full-vectorized load instructions for load vectorization (#445)
* Stabilize load vectorization

* fix test failures

* Shared one mask check when decomposing a load

* Revert "fix test failures"

This reverts commit 75a461ae3ea4fdd5105dc73675582368eda80bc6.

* Emit vectorized loads

* Fix test failures due to using vectorized load
2024-01-18 13:34:05 -06:00
Shucai Xiao
2c7d850c2d fixed warp size in lowering reduce op (#471) 2024-01-18 09:38:41 -06:00
Lixun Zhang
e7033218d6 Fix vecSize for fp8 and int8 on MI300 (#466)
* Fix vecSize for fp8 and int8 on MI300

* fix typo

* Update include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2024-01-17 22:50:31 -06:00
Ilya V
5da6276d89 Add shortcut for creation fp16/bfp16 (#468)
Signed-off-by: joviliast <iveselov.nn@gmail.com>
2024-01-17 11:29:19 -06:00
Lixun Zhang
8351f49fc7 [Tuning] Gemm tuning v3 (#457)
* Add gemm tuning script v3

* Introduce --jobs to control the number of files to generate

* Switch to trans convention used by Tensile

* Rerun rocprof if it crashes

* update README

* Remove peak perf and efficiency
2024-01-17 10:09:34 -06:00
Vinayak Gokhale
f239abfc7e Revert "Add autotuning for FA (#459)" (#467)
This reverts commit 1fec965c06.

This change used pre_hook to edit a kernel arg. However,
changes made inside a pre_hook are not visible to
the kernel in all cases.
2024-01-16 23:02:47 -06:00
Lixun Zhang
02a2f24dd5 [Backend] Refactor mfma selection (#441)
* Select mfma dimensions and instruction from static table

* Extend mfmaLayout to include version and instrShape

* Simplify generateMFMAOp by searching the mfma instruction in the table

* Fix getNonKDim() and non_k_dim

* Break instrShape into MDim and NDim
2024-01-16 21:05:35 -06:00
jayfurmanek
d2f8bc1740 remove git modules for tree sitter (#465) 2024-01-16 15:44:05 -06:00
oplavsic
760ac8441a Dot slicing pass (#440)
* First commit

* Implement DotSlicing pass.

* small fixes

* Support chained dot in DotSlicingPass (second GEMM in FA)

* Add lit test for FA dot slicing

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-16 14:25:10 -06:00
Lixun Zhang
a819e48435 Refine test_correctness (#463)
- Check correctness of what is benchmarked
- Add capability to check col_a and col_b
  - But only check col_a=False, col_b=True for now
- Only benchmark col_a=False, col_b=True for now
- Remove in='int8', out='int8' due to too large error
2024-01-16 11:15:54 -06:00
Shucai Xiao
1223f6077a support type conversion between fp8 formats and bf16/fp32 with HW instructions on MI300 (#414)
* add type conversion between fp8 and bf16/fp32..
2024-01-15 17:14:49 -06:00
Lixun Zhang
e231c41467 [TUTORIAL] Enable all types in gemm tutorial (#456)
* Enable all types in gemm tutorial

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2024-01-15 14:38:31 -06:00
Vinayak Gokhale
1fec965c06 Add autotuning for FA (#459) 2024-01-12 17:15:12 -06:00
Lixun Zhang
2e217c5a5c [Backend] Refactor sharedToDotOperandMFMA lowering (#439)
* Remove unnecessary xor computations for k-major swizzled tensors

* Support mfma16 and mfma4 in the fast path

* Choose warpsPerCTA according to nonKDim

* Set maxPhase=4 for mfma4

* Fix tests

For now, we do not disable swizzling for k-major tensors

* Remove fastPathComputeOffsetsTy1

* Enable k-major + disabled swizzling in the normal path
2024-01-12 12:50:18 -06:00
Shucai Xiao
a7bb38ea79 enable layout conversion from mfma to dot_op for mfma16. (#453)
* enable the layout conversion from mfma layout to dot_operand layout for mfma16

* backup changes
2024-01-10 22:31:59 -06:00
Lixun Zhang
b5ed97873c Added a script to print occupancy info (#450) 2024-01-10 14:01:13 -06:00
Ilya V
2e01bf08e9 [HotFix] Fix dot op for RDNA3 architecture (#451)
Disabled BlockedToWMMA layout transformation until WMMA is supported completely

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2024-01-10 08:58:57 -06:00
Lixun Zhang
ce9dacec72 Skip BLOCK_SIZE that is too large compared to M/N (#449) 2024-01-09 13:41:09 -06:00
Vinayak Gokhale
c2766bbd5f Merge changes from upstream FA bwd kernel (#444)
* Add optimized FA bwd from upstream

* Add autotuning

* Change loads and stores to use block ptrs

* Cleanup
2024-01-05 15:12:05 -06:00
oplavsic
bcea3051af Add support for MFMA layout to view_slice op (#442)
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-03 12:13:36 -06:00
oplavsic
6a520566a3 Add view_slice ttgir instruction (#427)
* Add view_slice op in ttgir

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2024-01-02 15:40:11 -06:00
Alexander Efimov
98589ac013 [MFMA] Remove CTA related code from layout (#429)
This PR removes CTALayout attribute from MFMA layout, because it is NV specific.
2023-12-27 18:01:28 +01:00
Jack Taylor
1e2fd0dd1a Update hip_backend to use libhsa-runtime for arch info, (#411)
brings in path changes for pytorch triton wheels

Co-authored-by: jayfurmanek <Jason.Furmanek@amd.com>
2023-12-21 15:40:57 +00:00
Vinayak Gokhale
0248bdb29d Minor edits to HBM bandwidth measurement kernel (#434)
* Change units to GiB/s from GB/s

* Run both with and w/o bounds check
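
The unit change is more than cosmetic, since 1 GiB is 2^30 bytes while 1 GB is 10^9 bytes, a difference of about 7%. A minimal sketch of the conversion follows; the function names are hypothetical and are not taken from the kernel in this commit.

```python
GB = 10**9    # decimal gigabyte, in bytes
GIB = 2**30   # binary gibibyte, in bytes


def bandwidth_gib_per_s(bytes_moved: int, seconds: float) -> float:
    """Bandwidth in GiB/s for a transfer of bytes_moved bytes (hypothetical helper)."""
    return bytes_moved / seconds / GIB


def gb_per_s_to_gib_per_s(gb_per_s: float) -> float:
    """Convert a GB/s figure to GiB/s."""
    return gb_per_s * GB / GIB
```

A figure reported in GiB/s is therefore always numerically smaller than the same bandwidth reported in GB/s.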
2023-12-21 06:14:31 -06:00
jayfurmanek
16281f02f4 [ROCM] drop GIL for launch, and set value=false upon pointer error (#426) 2023-12-19 08:34:51 -06:00
Vinayak Gokhale
422d7096ce Add kernel to check HBM BW (#431)
Add kernel to check HBM BW performance
2023-12-18 21:25:21 -06:00
joviliast
5c182aa73a Disable WMMA dot transformation
Enabled only in lit test.
Revert once WMMA is fully enabled

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
joviliast
47e801730c Add lit tests for TritonAMDGPUAccelerateMatmulPass WMMA case
Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
joviliast
af15da2f84 Support WMMA layout in TritonAMDGPUAccelerateMatmulPass
-Introduce WmmaEncodingAttr for WMMA output
-Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
-Provide a flag to check if wmma instructions are supported by target

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
Vinayak Gokhale
b7a412d82a Add support for ALiBi-style attention bias (#417)
Add support for matrix and vector bias to FA.
2023-12-15 16:28:37 -06:00
jayfurmanek
29847e9bb1 Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117
Ifu 231117
2023-12-15 09:09:40 -06:00
Shucai Xiao
521f425fbf add bitcode for gfx941 and gfx942 (#403)
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
2023-12-14 08:19:23 -06:00
Alexander Efimov
40e1dcaa53 [MFMA] Reenable removed CDNA3 int and fp8 support (#424)
The MFMA4x4 PR accidentally removed support for `int8xint8 -> int32` and `fp8xfp8 -> fp32` dot on CDNA.
This PR re-enables it.
2023-12-14 13:06:28 +01:00
Michael Melesse
26c3f99073 ROCM IFU: disable test_reduce_layouts 2023-12-13 17:12:39 -06:00
Alexander Efimov
f2afd65e8c [MFMA] Refactor dot pipeline to reduce code duplication (#400)
This PR:
- simplifies data types generated by `shared->mfma dot op` layout conversions: data is no longer packed into int32 or int64
- reduces code duplication between fast/normal path
- reduces code duplication between operand A and operand B

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-12-13 22:33:02 +01:00
Michael Melesse
7a1f54645e ROCM IFU: remove old tests 2023-12-13 15:30:55 +00:00
Michael Melesse
c7b62d5ec5 ROCM IFU: remove test_ext_elemwise 2023-12-12 22:58:56 -06:00
Michael Melesse
6efc013e46 ROCM IFU: fix AtomicCASOpConversion segfault 2023-12-12 17:40:31 -06:00
jayfurmanek
a42ac260aa Merge branch 'triton-mlir' into ifu-231117 2023-12-12 14:24:11 -06:00
Jason Furmanek
160dfe838e ROCM IFU: Fix print and assert 2023-12-12 19:30:01 +00:00
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables 4x4 tile size in MFMA based dot operations.

Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements. This is a limitation of other layouts that appear during result processing (e.g. the blocked layout cannot handle tensors smaller than the wave size).

For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4) and (4x64) x (64x8) -> (4x8).

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
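
The shape rule in this message can be sketched as a small check. This is only an illustration inferred from the examples listed here: the function name, the `WAVE_SIZE` constant, and the divisibility test are assumptions, not code from the repository.

```python
WAVE_SIZE = 64  # assumption: wave size implied by the 64-output-element minimum
TILE = 4        # edge of the MFMA 4x4 output tile


def mfma4_dot_supported(m: int, k: int, n: int) -> bool:
    """Hypothetical check for an (m x k) x (k x n) -> (m x n) dot."""
    # Assumption: the output must be coverable by whole 4x4 tiles.
    if m % TILE != 0 or n % TILE != 0:
        return False
    # Stated limitation: the result needs at least 64 output elements.
    # k is not constrained by this rule; it is kept only to mirror the notation.
    return m * n >= WAVE_SIZE
```

Under these assumptions, (4x128) x (128x4) is rejected because its 4x4 result has only 16 elements, not because of its K dimension.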
2023-12-12 18:23:55 +01:00
Michael Melesse
50a6db3afd ROCM IFU: Lit test fixes 2023-12-11 17:00:35 -06:00
Alexander Efimov
a944811b6d Replace inline assembly in commonShflSync with intrinsics (#418)
Inline assembly does not take the surrounding instructions into account
and in general cannot avoid data hazards.
Replacing inline asm with intrinsics solves this problem.
This particular code behaved incorrectly in one of the mfma dot tests:

Code generated with inline assembly:

```
  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```

Correct code generated with intrinsics:

```
  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  s_nop 4
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
2023-12-11 16:41:39 +01:00
Alexander Efimov
2be6ec771e [GEMM] [Tuning] Make tuning script more verbose (#420)
This PR adds:
- verbose tuning mode: printing std output of compilation and tuning calls
- collecting information about failed compilations
- print correctness check output with word
- split dimensions in generated scripts with "-"
- gpu_ids option to set particular gpus
2023-12-10 22:04:00 -06:00
Alexander Efimov
e19b5fd6bc [GEMM] Add script to run one tuning config (#419)
The script runs one given config for debug purposes.
2023-12-07 18:12:03 -06:00
Michael Melesse
64a0924381 ROCM IFU: remove ref to test_elementwise 2023-12-07 13:31:59 -06:00
Vinayak Gokhale
1d6b919897 Bugfix: Wrong boundary condition on qk GEMM 2023-12-04 10:11:41 -06:00