2072 Commits

Author SHA1 Message Date
Shucai Xiao
8547694665 set correct arch info for unit test (#370)
* set correct arch info for unit test

* address review comments
2023-10-25 13:06:45 -05:00
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
jayfurmanek
e74bdb1581 Always promote to int32 in commonShflSync (#369) 2023-10-23 12:27:11 -05:00
Lixun Zhang
f963c04034 Use the same heuristics for mfma type as PR#352 (#366) 2023-10-18 20:32:44 -05:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Alexander Efimov
4d539d7dae Add licenses to AMD related files (#351) 2023-10-16 15:18:01 -05:00
Lixun Zhang
1de859df32 [GEMM] [Tuning] Add waves_per_eu to gemm tuning (#362)
* Add waves_per_eu in the tuning space

* Do not allocate tensor on device during kernel compilation step

* Add breakdown elapsed time

* Parallelize the post-processing step

* Parallelize the profile step with --ngpus

* Better timing info printout
2023-10-16 13:50:03 -05:00
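Illustration for the waves_per_eu tuning commit above: a rough sketch of what adding a parameter to a tuning space means, enumerating candidate configs as a Cartesian product. Only the name waves_per_eu comes from the commit message. The other parameter names, every candidate value, and the helper gen_configs are assumptions made for illustration, not the tuning script's actual contents.

```python
from itertools import product

# Hypothetical candidate values; the real tuning script may use other ranges.
SPACE = {
    "BLOCK_SIZE_M": [32, 64],
    "BLOCK_SIZE_N": [32, 64],
    "num_warps": [4, 8],
    "waves_per_eu": [0, 1, 2],  # assumed candidates for the new parameter
}


def gen_configs(space):
    """Return one dict per combination of the candidate values in `space`."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = gen_configs(SPACE)
```

Each extra parameter multiplies the number of configs to compile and profile, which is presumably why the same commit also parallelizes the profiling and post-processing steps.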
Lixun Zhang
821e75a2b0 Improve FA fwd kernel with causal=True (#356)
* Attempt to absorb upstream's changes to improve causal=True

* Add autotuner

* Optimize for AMD MI250

- add pre_load_v as a tuning parameter
- do not define N_CTX as constexpr
- perform the second dot before sum
- move qk_scale out of the inner loop
- add more configs in the autotuner

Note that the bwd kernel is disabled for now: with autotuning enabled,
grid becomes a function, so ctx.grid[0] no longer works.

* Enable bwd kernel
2023-10-12 12:34:27 -05:00
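Illustration for the note in the commit above about grid becoming a function: a minimal plain-Python sketch of why indexing stops working once the launch grid is a callable. The sequence length 1024, the head count 8, and the helper first_dim are hypothetical and are not taken from the kernel.

```python
def cdiv(a, b):
    """Ceiling division, as used to count blocks along one dimension."""
    return (a + b - 1) // b


# Fixed launch grid: a plain tuple, so fixed_grid[0] is valid.
fixed_grid = (cdiv(1024, 128), 8, 1)


# Autotuned launch grid: a callable, evaluated only after the tuner has
# picked a block size, so tuned_grid[0] raises TypeError.
def tuned_grid(meta):
    return (cdiv(1024, meta["BLOCK_M"]), 8, 1)


def first_dim(grid, meta):
    """Return the first grid dimension for either kind of grid."""
    resolved = grid(meta) if callable(grid) else grid
    return resolved[0]
```

Any code that read the grid as a tuple has to resolve it with the chosen meta-parameters first, or compute the block count some other way.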
Jack Taylor
6f073a43f6 Remove old ROCM_LIBRARIES set (#360)
In the last PR I forgot to overwrite the initial setting of ROCM_LIBRARIES, which caused an error in the wheel building process.
2023-10-12 16:01:39 +01:00
Jack Taylor
5d44c60d17 enforce cc=None on PyTorch ROCm (#296)
* enforce cc=None on ROCm

* Comment

* Update approach to ignore integer cc values

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2023-10-12 10:17:26 +01:00
Shucai Xiao
99fa2e4237 add tutorial group gemm example (#343)
* [DOCS] Add a tutorial example of grouped gemm (#2326)
Co-authored-by: Bin Fan <binf@nvidia.com>
2023-10-11 15:13:17 -05:00
Jack Taylor
47563240f8 PyTorch triton branch synchronisation (#354)
* Restructure ROCM Library Search
Currently a handful of ROCm-dependent files are required for
triton to run: the linker (ld.lld), the include files, and multiple hip/hsa
shared objects.

This change provides three search locations for these files, always tried
in the same order:

1. third_party/rocm.  This location is within the python/triton directory
   and is carried over when triton is built.  If all necessary files
   are in this location there is no need to have ROCm installed on
   the system at all.

2. The $ROCM_PATH environment variable.  If this exists it will override
   all other locations to find the necessary ROCm files.

3. /opt/rocm.  The default location for ROCm installations.  Finding one
   here will notify triton that ROCm is installed in this environment.

To make third_party/rocm easier to populate, a new script,
scripts/amd/setup_rocm_libs.sh, has been added to the repo.  Executing this
script downloads all necessary ROCm files from their respective packages on
repo.radeon.com and installs them in third_party/rocm, allowing triton to run
without installing the full ROCm stack.  setup_rocm_libs.sh takes an environment
variable, ROCM_VERSION, if a user wishes to install a ROCm version other than
the default (currently 5.4.2).

When triton whls are built to support PyTorch, the third_party/rocm location
will be used, to stay in sync with PyTorch's approach of bringing along any
libraries needed and not requiring ROCm to be installed.

(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)

* Fix default rocm path

Running into `fatal error: hip/hip_runtime.h: No such file or directory` with the latest wheel due to an incorrect directory for ROCm libs.

(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)

* setup_rocm_libs.sh manylinux refactor

(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)

* Set setup_rocm_libs.sh to be executable

(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)

* Revert to using numbered so files to fix upstream

(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)

* Remove drm script

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-10-11 15:30:39 +01:00
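Illustration for the "Restructure ROCM Library Search" message above: a minimal sketch of an ordered three-location search. This is not triton's actual implementation. The function name find_rocm_root, its parameters, and the choice of include/hip/hip_runtime.h as the file that marks a usable location are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Assumed marker file; a real check would cover the linker and shared objects too.
REQUIRED = ("include/hip/hip_runtime.h",)


def find_rocm_root(package_dir, required=REQUIRED, environ=None, default_root="/opt/rocm"):
    """Return the first candidate directory that holds every required file."""
    environ = os.environ if environ is None else environ

    # 1. Files bundled inside the package directory.
    candidates = [Path(package_dir) / "third_party" / "rocm"]
    # 2. The location named by $ROCM_PATH, if it is set.
    if environ.get("ROCM_PATH"):
        candidates.append(Path(environ["ROCM_PATH"]))
    # 3. The default system-wide installation directory.
    candidates.append(Path(default_root))

    for root in candidates:
        if all((root / rel).exists() for rel in required):
            return root
    return None  # no usable ROCm files found
```

The sketch tries the locations in the order the message lists them; it does not attempt to model the statement that $ROCM_PATH overrides the other locations.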
Shucai Xiao
d6d1cf2859 add gfx942 to support matrix_core (#358) 2023-10-10 22:46:24 -05:00
Lixun Zhang
515525d068 [GEMM] Tuning script v2 (#350)
* [GEMM] Tuning script v2

* Extend tuning space to include BLOCK_SIZE = 256

Check LDS in a smarter way

* Added README

* Add git branch and commit to the default tuning result filename
2023-10-10 20:49:49 -05:00
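Illustration for the LDS check mentioned in the tuning script commit above: a sketch of how a config might be rejected before compilation by estimating its LDS (shared memory) footprint. The 64 KiB budget, the one-A-tile-plus-one-B-tile formula, the 2-byte element size, and the helper names are assumptions, not the script's actual check.

```python
LDS_LIMIT_BYTES = 64 * 1024  # assumed LDS budget available to one workgroup


def lds_bytes(block_m, block_n, block_k, elem_bytes=2):
    """Estimate LDS use: one A tile (M x K) plus one B tile (K x N)."""
    return (block_m * block_k + block_k * block_n) * elem_bytes


def fits_in_lds(cfg, elem_bytes=2, limit=LDS_LIMIT_BYTES):
    """Return True if the estimated footprint of `cfg` is within `limit`."""
    used = lds_bytes(cfg["BLOCK_SIZE_M"], cfg["BLOCK_SIZE_N"], cfg["BLOCK_SIZE_K"], elem_bytes)
    return used <= limit
```

A filter like this matters more once the space includes BLOCK_SIZE = 256, since large tiles quickly approach the budget.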
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFMA] Support mfma with NM size 16

This PR adds code emitting MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
jayfurmanek
be95edc63f Merge pull request #347 from ROCmSoftwarePlatform/ifu230908-2
Ifu230908 2
2023-10-05 12:21:50 -05:00
oplavsic
6a173eab8a Remove redundant fp32->fp16 conversion in FA (#349) 2023-10-04 14:10:07 -05:00
Shucai Xiao
8049891ff7 fix ifu gemm perf regression (#348) 2023-10-04 08:45:18 -05:00
Aleksandr Efimov
e6f75d05e3 fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite 2023-10-03 16:13:13 +00:00
Michael Melesse
31fe8aadc5 ROCM IFU: Fix minimize_alloc
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Aleksandr Efimov
88ce3b8985 ROCM IFU: Fix Conversion/AMDGPU/load_store.mlir lit test 2023-10-03 04:31:10 +00:00
Aleksandr Efimov
90a15e449e ROCM IFU: Fix tritongpu_to_llvm lit test 2023-10-03 04:31:03 +00:00
Michael Melesse
1caef34f8a ROCM IFU: Fix coalesce.mlir and stream-pipeline.mlir 2023-10-03 04:30:58 +00:00
Michael Melesse
9c7a215fed ROCM IFU: Fix triton_to_tritongpu.mlir 2023-10-03 04:30:50 +00:00
Shucai Xiao
334c9b5aed ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input 2023-10-03 04:30:44 +00:00
Lixun Zhang
a41f13adcd ROCM IFU: Extend input to 32-bit when necessary
Note: we'll need to check later whether we can use i8 for some
reduction operations
2023-10-03 04:30:37 +00:00
Jason Furmanek
92edee723b ROCM IFU: Fix getValueLivenessRange 2023-10-03 04:30:28 +00:00
Michael Melesse
28c571ea43 ROCM IFU: Fix test_if 2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce ROCM IFU: Fix layout formatting 2023-10-03 04:30:16 +00:00
Aleksandr Efimov
336c4b5f3c ROCM IFU: Fix LDS overflow issues in test_dot 2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090 ROCM IFU: Fix emitOffsetForMfmaLayout function 2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd ROCM IFU: Fix of dot operand type promotion
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c ROCM IFU: Add new CTALayout parameter to mfma layout 2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0

* updated configs
2023-09-29 20:20:55 -05:00
Lixun Zhang
8d99331c89 Combine split_k and non split_k kernels in GEMM tuning API (#344) 2023-09-28 12:37:22 -05:00
Shucai Xiao
6e82aa8dbc support gemm fp8/fp16 mixed input (#333)
* changes to support fp8/fp16 mixed inputs

* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
SJW
0a7b1c7c12 [MLIR] Fixed support for mixed data-types in stream-pipeline (#329)
* [MLIR] Fixed support for mixed data-types in stream-pipeline
* added test

* fixed test

* cleanup

* consolidated code

* fixed build error
2023-09-26 21:26:50 -05:00
Lixun Zhang
bdb35f12e0 [DOC] Add instructions for building with custom LLVM (#2344) (#338)
* Add instructions for building with custom LLVM (#2344)

Cherry picked from upstream

* Tweak install instructions for custom llvm build

openai#2381

---------

Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
2023-09-25 22:18:11 -05:00
SJW
4db99e0139 [Alloc] Enhanced SharedMem Allocation for mutually exclusive but aliased buffers (#337)
* [Alloc] Enhanced for mutually exclusive but aliased buffers

- Use disjoint alias analysis to minimize shared memory requirements

* fix for allocation test

* added test
* fixed mfma_enc printer

* fixed test
2023-09-25 20:09:33 -05:00
Aleksandr Efimov
7af5e42fbe review fix: fix semantics of chooseMfmaDimensions func 2023-09-25 10:56:44 -05:00
Alexander Efimov
5ac8c7afc1 change to the comment on kWidth parameter 2023-09-25 10:56:44 -05:00
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics from "elements per instruction" to
"elements per thread per instruction" along the k axis.
2023-09-25 10:56:44 -05:00
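Illustration for the kWidth change above: a small arithmetic sketch of the difference between counting elements per instruction and per thread per instruction along k. It assumes a 64-thread wavefront and that an operand tile of non_k_dim x k_dim elements is split evenly across those threads; the helper name and the example shapes are illustrative and are not taken from the PR.

```python
WAVE_SIZE = 64  # assumed number of threads cooperating on one MFMA instruction


def k_elems_per_thread(non_k_dim, k_dim, wave_size=WAVE_SIZE):
    """Elements of one operand tile held by each thread, assuming an even split."""
    total = non_k_dim * k_dim
    if total % wave_size:
        raise ValueError("operand tile does not divide evenly across the wave")
    return total // wave_size


# For an assumed 32x32x8 shape: 8 elements along k per instruction,
# but 32 * 8 / 64 = 4 elements per thread per instruction.
per_instruction = 8
per_thread = k_elems_per_thread(32, 8)
```

Under these assumptions the same instruction would be described by 8 in the old reading and by 4 in the new one.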
Shucai Xiao
10795d8fd3 Fixed a bug related to split_k and prune unnecessary tuning space (#332)
* refine tuning script by adding prune_configs; also fixed a bug in generating tuning configs

* fixed a bug in returning the empty config
2023-09-21 23:47:14 -05:00
Ognjen Plavsic
a8574be74d [FA] Fix bug from IFU merge
This commit reverts num_warps back to 4 for
FA forward pass with d=128.
2023-09-21 10:48:41 -05:00
Lixun Zhang
ff6fd952ac Install ninja in pre-build 2023-09-18 15:30:06 -05:00
Justin Lebar
2a3746bac5 [BUILD] use ninja (#2318) 2023-09-18 15:30:06 -05:00