Replace inline assembly in commonShflSync with intrinsics (#418)

Inline assembly does not take the surrounding instructions into account
and, in general, cannot avoid data hazards.
Replacing the inline asm with intrinsics solves this problem.
The inline-asm version behaved incorrectly in one of the mfma dot tests.

Code generated with inline assembly:

```
  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```

Correct code generated with intrinsics:

```
  v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0
  s_nop 4
  ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)
```
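
The difference between the two listings can be illustrated with a toy model of the hazard: a pass that sees each instruction's registers can notice that `ds_swizzle_b32` reads `v4` right after the MFMA writes `v[4:7]`, and pad with a nop. This is only a sketch, not LLVM's hazard recognizer. The function names are hypothetical, and the `s_nop 4` padding is copied from the listing above rather than derived from an ISA manual.

```python
import re

# Assumption: padding copied from the "correct" listing in this commit message.
NOP_AFTER_MFMA = "s_nop 4"


def regs(operand):
    """Expand 'v4' or 'v[4:7]' into a set of VGPR indices; anything else is empty."""
    m = re.fullmatch(r"v\[(\d+):(\d+)\]", operand)
    if m:
        return set(range(int(m.group(1)), int(m.group(2)) + 1))
    m = re.fullmatch(r"v(\d+)", operand)
    return {int(m.group(1))} if m else set()


def parse(line):
    """Split an instruction into (opcode, destination regs, source regs)."""
    opcode, _, rest = line.strip().partition(" ")
    operands = [op.strip() for op in rest.split(",")]
    dst = regs(operands[0])
    src = set()
    for op in operands[1:]:
        src |= regs(op)
    return opcode, dst, src


def insert_mfma_nops(lines):
    """Insert a nop when an instruction reads what the preceding MFMA wrote."""
    out = []
    prev = None
    for line in lines:
        opcode, dst, src = parse(line)
        if prev is not None:
            prev_opcode, prev_dst, _ = prev
            if prev_opcode.startswith("v_mfma") and (prev_dst & src):
                out.append(NOP_AFTER_MFMA)
        out.append(line.strip())
        prev = (opcode, dst, src)
    return out


if __name__ == "__main__":
    asm = [
        "v_mfma_f32_4x4x4f16 v[4:7], v[4:5], v[6:7], 0",
        "ds_swizzle_b32 v3, v4, offset:swizzle(SWAP:4)",
    ]
    print("\n".join(insert_mfma_nops(asm)))
```

An opaque inline-asm string gives such a pass nothing to parse, which is the situation the first listing reflects; with an intrinsic, the register dependency is visible and the padding can be inserted.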
Alexander Efimov
2023-12-11 16:41:39 +01:00
committed by GitHub
parent 2be6ec771e
commit a944811b6d
2 changed files with 29 additions and 32 deletions

@@ -1961,11 +1961,11 @@ module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 :
 // PTX: nvvm.shfl.sync bfly
 // PTX: nvvm.barrier0
-// GCN-COUNT-4: ds_swizzle_b32
+// GCN-COUNT-4: rocdl.ds_swizzle %{{.*}} : (i32, i32) -> i32
 // GCN: llvm.store
 // GCN: rocdl.barrier
 // GCN: llvm.load
-// GCN-COUNT-2: ds_swizzle_b32
+// GCN-COUNT-2: rocdl.ds_swizzle %{{.*}} : (i32, i32) -> i32
 // GCN: llvm.store
 // GCN: rocdl.barrier
 // GCN: llvm.load