Commit Graph

1023 Commits

Author SHA1 Message Date
Ognjen
38fbb7e472 ROCM IFU: Enable slice layout for insertSliceAsync AMD path
Fix basic_insert_slice_async_1d lit test

Remove code added for debugging

Return hopper test
2023-11-17 01:27:57 +00:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
oplavsic
502525ff11 ROCM IFU: Fix ScanOp test by implementing idx ShflKind in commonShflSync (#391)
* Fix ScanOp test

* Remove commented-out code

---------

Co-authored-by: Ognjen <oplavsic@luxoft.com>

Fix typo in commonShflSync
2023-11-07 04:29:45 +00:00
Alexander Efimov
8bc417b9b7 do not emit nvidia inline asm 2023-11-07 04:29:44 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
c3132eeda8 ROCM IFU: Third-party fixes: preload ROCDL
Additional 3rd-party fix

Remove redundant mfma_supported defines
2023-11-07 03:29:16 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflits

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attension.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
Alexander Efimov
74c5fd46ee [RemoveLayoutConversions] Fix reduce failed infer type error (#377)
* [RemoveLayoutConversions] Fix reduce failed infer type error

This PR fixes layout propagation algorithm in RemoveLayoutConversions pass.
In some cases during rewriteSlice process, reduce operation with multiple outputs
rewrites only one output layout, which breaks assumption that both outputs should have same layout.

This change is a minimal part of https://github.com/openai/triton/pull/2331 change and
small lit test for regression testing.

* fix combine test

* Fix issue with incorrect inference layout of make_range output result
2023-11-01 13:31:13 -05:00
Alexander Efimov
d62a3ffdbe [RemoveLayoutConversions] Remove PatternSharedInfo structure (#378)
This structure is not used anymore after massive refactoring
of RemoveLayoutConversion pass in September IFU.
2023-11-01 12:57:35 -05:00
Michael Melesse
1fd9b40f2f Works as StandAlone and Backend and also Perf is Good
This is a combination of 4 commits.

Works as StandAlone and Backend

Works as StandAlone and Backend

This is a combination of 13 commits.

Works StandAlone and as Backend

This is a combination of 7 commits.

backend set default dir with flag

move bitcode to backend dir

copy backend

save

empty test work in backendmode

enable backend mode when copying to upstream

clean up

fix failure

minimize diff

add skip function

fix bug with corrupted dwarf exp

match num_wraps

fix multi threaded test issue

move bitcode file out of lib

move backend to python/triton/third_party/hip

move libhsa

backend works again

restart ci

clean upstream location first before copy

match scripts

fix new error

memoize backend stuff

fix bug
2023-10-26 14:27:18 -05:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Shucai Xiao
2729ae6c6f use different int8 mfma instructions on different GPUs. (#368)
* changes support to choose different int8 instructions

* rename an instruction name

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-10-25 19:12:21 -05:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support of fp8 and bf8 in AccelerateMatmul pass and
Introduces generation of float8 mfma instructions in ttg to llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
jayfurmanek
e74bdb1581 Always promote to int32 in commonShflSync (#369) 2023-10-23 12:27:11 -05:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Alexander Efimov
4d539d7dae Add licenses to AMD related files (#351) 2023-10-16 15:18:01 -05:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496)
Without this change, a constexpr assignment (ie. `A = B & C`, where `B`
and `C` are both constexpr) is getting assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an If
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
Philippe Tillet
3b6ec763d5 Revert "[BACKEND] Disable BreakPhiStruct pass (#2458)" (#2498)
This reverts commit b1bc9b20a0.
2023-10-14 10:40:49 -07:00
Philippe Tillet
8db4fac3b0 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2497)
Reverts openai/triton#2485
2023-10-13 23:32:59 -07:00
Weixing Zhang
76858bd917 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
2023-10-12 22:25:42 -07:00
Thomas Raoux
a777e1d8db [OPTIMIZER] Propagate mma layout when the transitive use has dot_operand encoding (#2482) 2023-10-12 23:57:40 +00:00
Thomas Raoux
cda298fae7 [Pipeliner] Allocate less shared memory when possible (#2466)
The pipeliner was overallocating shared memory for the inputs
for current schedule. This reduces the shared memory usage to only
what is needed.
Note that improving membar analysis could allow taking advantage of
allocating extra buffers to remove barriers.
2023-10-12 12:10:06 -07:00
Jack Taylor
47563240f8 PyTorch triton branch synchronisation (#354)
* Restructure ROCM Library Search
Currently there are a handful of ROCM dependant files which are required for
triton to run.  The linker(ld.lld), the include files, and multiple hip/hsa
shared objects.

This change will provide three search areas to find these files.  All in
the same order.

1. third_party/rocm.  This location is within the python/triton directory
   and is carried over when triton is built.  IF all necessary files
   are in this location there will be no need to have ROCM installed at
   all on the system.

2. $ROCM_PATH environmental variable.  If this exists it will override
   all other locations to find ROCM necessary files

3. /opt/rocm.  The default location for ROCm installations.  Finding one
   here will notify triton that ROCM is installed in this environment

To ease with step 3.  A new script scripts/amd/setup_rocm_libs.sh
has been added to the repo.  Executing this script will cause all necessary
ROCM files to be downloaded from their respective packages on repo.radeon.com
and installed in third_party/rocm.  Allowing for triton to run without installing
the full ROCM stack.  setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user
wishes to install a ROCM version other than the default (currently 5.4.2)

When triton whls are built to support Pytorch, method 3 will be used to stay in
sync with PyTorch's approach of bringing along any libraries needed and not
requiring ROCM to be installed.

(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)

* Fix default rocm path

Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs

(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)

* setup_rocm_libs.sh manylinux refactor

(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)

* Set setup_rocm_libs.sh to be executable

(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)

* Revert to using numbered so files to fix upstream

(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)

* Remove drm script

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-10-11 15:30:39 +01:00
Thomas Raoux
6f46c93b9e [BACKEND] Add back dot.wait when generating async_dot (#2478)
Based on discussion this is needed to make sure there is no race
condition when reading shared memory.
2023-10-10 21:45:28 -07:00
Zahi Moudallal
4749072fbd [BACKEND] Allow reduce with sliced 3D layout as input (#2480) 2023-10-10 15:19:11 -07:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFAM] Support mfma with NM size 16

This PR code emitting of MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
Beal Wang
5812d970a8 [HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402)
This is just for Warp Specialization kernels on Hopper. Replace DivOp
and RemOp with SelectOp and AndOp/XorOp.
2023-10-09 14:42:06 +08:00
Tori Baker
ab4549310b [OPTIMIZER] erase ops after use in iterator (#2455)
This seems to have worked fine in opt mode (although it may be producing
undefined behavior), but in debug mode on a newer version of llvm, it
segfaults without this PR as the iterators get invalidated.

This is also consistent with other places it is done in this file.
2023-10-06 18:02:56 -07:00
Thomas Raoux
b1bc9b20a0 [BACKEND] Disable BreakPhiStruct pass (#2458)
This is causing functional failures in pytorch workload. Disabling it
until I figure out the problem.
2023-10-06 17:59:53 -07:00
Thomas Raoux
a7061e19b2 [BACKEND] Fix multiple bugs in WGMMA (#2457)
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.

Also fix the descritpor calcuation for wgmma, we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
2023-10-06 17:59:28 -07:00
Zahi Moudallal
be19cf3103 [BACKEND] Enable reduce with 3D tensors and added tests (#2460) 2023-10-06 15:08:22 -07:00
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
Thomas Raoux
38f184b7cf [BACKEND] Use native fp8 convert ops when possible (#2448)
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.

Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
2023-10-05 18:28:58 +00:00
Philippe Tillet
7bc6b99132 [DOCS] Fix FP8E4M3B15 docs (#2451) 2023-10-05 10:59:45 -07:00
Philippe Tillet
e9d9ddd86d [OPTIMIZER] More V100 conversion removal tweaks (#2446) 2023-10-04 16:31:20 -07:00
Philippe Tillet
a7ff4eddae [OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 (#2445) 2023-10-04 14:56:12 -07:00
Thomas Raoux
5a0170a27c [BACKEND] Minor removing of unnecessary code and cleanup (#2443) 2023-10-04 12:14:08 -07:00
Shucai Xiao
8049891ff7 fix ifu gemm perf regression (#348) 2023-10-04 08:45:18 -05:00
Christian Sigg
5458014282 [BACKEND] Lower to PTX with trap-unreachable (#2429)
We've seen cases where the entire kernel is poisoned due to
division-by-zero, resulting in a single `unreachable` instruction at the
LLIR level. Emit this instruction as `trap` (instead of dropping it) so
that the kernel doesn't run successfully without writing any outputs.
2023-10-03 21:05:10 -07:00
Thomas Raoux
c656a139d3 [BACKEND] Fix for FP8 QK inputs in flash attention forward pass (#2435) 2023-10-03 21:02:13 -07:00
Zahi Moudallal
0d84a7d70c [BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438) 2023-10-03 20:59:53 -07:00
Bin Fan
6b860e7a74 [Backend] fix a bug in lowering ExtractSliceOp from TritonGPU to LLVM (#2436) 2023-10-03 21:52:07 -04:00
Aleksandr Efimov
e6f75d05e3 fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite 2023-10-03 16:13:13 +00:00