Commit Graph

1440 Commits

Author SHA1 Message Date
oplavsic
760ac8441a Dot slicing pass (#440)
* First commit

* Implement DotSlicing pass.

* small fixes

* Support chained dot in DotSlicingPass (second GEMM in FA)

* Add lit test for FA dot slicing

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-16 14:25:10 -06:00
Lixun Zhang
a819e48435 Refine test_correctness (#463)
- Check correctness of what is benchmarked
- Add capability to check col_a and col_b
  - But only check col_a=False, col_b=True for now
- Only benchmark col_a=False, col_b=True for now
- Remove in='int8', out='int8' due to too large error
2024-01-16 11:15:54 -06:00
Shucai Xiao
1223f6077a support type conversion between fp8 formats and bf16/fp32 with HW instructions on MI300 (#414)
* add type conversion between fp8 and bf16/fp32..
2024-01-15 17:14:49 -06:00
Lixun Zhang
e231c41467 [TUTORIAL] Enable all types in gemm tutorial (#456)
* Enable all types in gemm tutorial

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2024-01-15 14:38:31 -06:00
Vinayak Gokhale
1fec965c06 Add autotuning for FA (#459) 2024-01-12 17:15:12 -06:00
Lixun Zhang
2e217c5a5c [Backend] Refactor sharedToDotOperandMFMA lowering (#439)
* Remove unnecessary xor computations for k-major swizzled tensors

* Support mfma16 and mfma4 in the fast path

* Choose warpsPerCTA according to nonKDim

* Set maxPhase=4 for mfma4

* Fix tests

For now, we do not disable swizzling for k-major tensors

* Remove fastPathComputeOffsetsTy1

* Enable k-major + disabled swizzling in the normal path
2024-01-12 12:50:18 -06:00
Vinayak Gokhale
c2766bbd5f Merge changes from upstream FA bwd kernel (#444)
* Add optimized FA bwd from upstream

* Add autotuning

* Change loads and stores to use block ptrs

* Cleanup
2024-01-05 15:12:05 -06:00
oplavsic
bcea3051af Add support for MFMA layout to view_slice op (#442)
Co-authored-by: Ognjen <oplavsic@luxoft.com>
2024-01-03 12:13:36 -06:00
oplavsic
6a520566a3 Add view_slice ttgir instruction (#427)
* Add view_slice op in ttgir

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>
Co-authored-by: Ognjen <oplavsic@luxoft.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2024-01-02 15:40:11 -06:00
Alexander Efimov
98589ac013 [MFMA] Remove CTA related code from layout (#429)
This PR removes CTALayout attribute from MFMA layout, because it is NV specific.
2023-12-27 18:01:28 +01:00
Jack Taylor
1e2fd0dd1a Update hip_backend to use libhsa-runtime for arch info, (#411)
brings in path changes for pytorch triton wheels

Co-authored-by: jayfurmanek <Jason.Furmanek@amd.com>
2023-12-21 15:40:57 +00:00
Vinayak Gokhale
0248bdb29d Minor edits to HBM bandwidth measurement kernel (#434)
* Change units to GiB/s from GB/s

* Run both with and w/o bounds check
2023-12-21 06:14:31 -06:00
jayfurmanek
16281f02f4 [ROCM] drop GIL for launch, and set value=false upon pointer error (#426) 2023-12-19 08:34:51 -06:00
Vinayak Gokhale
422d7096ce Add kernel to check HBM BW (#431)
Add kernel to check HBM BW performance
2023-12-18 21:25:21 -06:00
joviliast
af15da2f84 Support WMMA layout in TritonAMDGPUAccelerateMatmulPass
-Introduce WmmaEncodingAttr for WMMA output
-Introduce BlockedToWMMA rewrite pattern in TritonAMDGPUAccelerateMatmulPass
-Provide a flag tho check if wmma instructions are supported by target

Signed-off-by: joviliast <iveselov.nn@gmail.com>
2023-12-18 09:11:20 -06:00
Vinayak Gokhale
b7a412d82a Add support for ALiBi-style attention bias (#417)
Add support for matrix and vector bias to FA.
2023-12-15 16:28:37 -06:00
jayfurmanek
29847e9bb1 Merge pull request #410 from ROCmSoftwarePlatform/ifu-231117
Ifu 231117
2023-12-15 09:09:40 -06:00
Shucai Xiao
521f425fbf add bitcode for gfx941 and gfx942 (#403)
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
2023-12-14 08:19:23 -06:00
Michael Melesse
26c3f99073 ROCM IFU: disable test_reduce_layouts 2023-12-13 17:12:39 -06:00
Michael Melesse
7a1f54645e ROCM IFU: remove old tests 2023-12-13 15:30:55 +00:00
Michael Melesse
c7b62d5ec5 ROCM IFU: remove test_ext_elemwise 2023-12-12 22:58:56 -06:00
Michael Melesse
6efc013e46 ROCM IFU: fix AtomicCASOpConversion segfault 2023-12-12 17:40:31 -06:00
jayfurmanek
a42ac260aa Merge branch 'triton-mlir' into ifu-231117 2023-12-12 14:24:11 -06:00
Jason Furmanek
160dfe838e ROCM IFU: Fix print and assert 2023-12-12 19:30:01 +00:00
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables 4x4 tile size in MFMA based dot operations.

Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize).

For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
Following dots are not supporter: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
Michael Melesse
64a0924381 ROCM IFU: remove ref to test_elementwise 2023-12-07 13:31:59 -06:00
Vinayak Gokhale
1d6b919897 Bugfix: Wrong boundary condition on qk GEMM 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
f6969f4bb3 Correct that loop lo is multiple BLOCK_N 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
0ef865508c Update description 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57 Remove slicing for out in save for bwd 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569 Mask vs pad for non power of 2 sequence lengths
Padding results in memory allocation which is slower.
Masking results in better performance.
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
d5028079b7 Add FA support for non pow2 seqlen 2023-12-04 10:11:41 -06:00
Jason Furmanek
64f559771f ROCM IFU: Fix test_core_amd.py::test_reduce_layouts 2023-11-28 04:02:48 +00:00
Jason Furmanek
f5f6b3c0a3 ROCM IFU: Add get_version_key for ROCM backend 2023-11-28 00:11:44 +00:00
Jason Furmanek
71547e4fdb ROCM IFU: Fixes for kwargs 2023-11-28 00:09:46 +00:00
jayfurmanek
99aa1f4f75 Merge branch 'triton-mlir' into ifu-231117 2023-11-27 07:44:04 -06:00
Jason Furmanek
968a35fbf0 Fix merge conflict error 2023-11-27 13:33:23 +00:00
Shucai Xiao
d9219e0eba use hw for fp8 type conversion (#386)
* use hardware instruction for type conversion between fp8 and fp32

* move gpu_matrix_core_version from semantics.py to hip_backend.py

---------

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-11-24 10:26:40 -06:00
Jason Furmanek
a08dafe7fe Initial commit to resolve merge conflicts 2023-11-20 22:41:03 +00:00
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Alexander Efimov
dfb76540b4 [Tutorial] Fix post IFU issues with FA (#398)
* [Tutorial] Fix post IFU issues with FA

* Remove redundant kernels in 06-fused-attention.py

* Added README for scripts in perf-kernels dir

* Fix bwd kernel

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-17 01:28:49 +00:00
Alexander Efimov
096def0c9b [Test] Disable mma layout for amd hardware (#384)
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-17 01:28:49 +00:00
Lixun Zhang
181bdbd410 Benchmark FA on 2 GCDs (#393) 2023-11-17 01:28:49 +00:00
Jason Furmanek
44b155f41b ROCM IFU: Resolve merge conflicts in tutorial 06
Resolve merge conflicts in tutorial 06 - 2
2023-11-17 01:28:40 +00:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Lixun Zhang
1af893d8a2 [FRONTEND] Add input dtypes to autotuning key (#2534) (#374)
* [FRONTEND] Add input dtypes to autotuning key (#2534)

* Fix conflict in 06-fused-attention

* Fix get_best_config in FA-transV.py

* Fix leftover get_best_config()

---------

Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
Jason Furmanek
85216ea5c5 ROCM IFU: Resoolve conflicts in FA tutorial 2023-11-07 04:29:45 +00:00
Alexander Efimov
aefc94bd25 ROCM IFU: fix test_dot_mfma_vector_load test
fix for previous commit
2023-11-07 04:29:45 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00