Commit Graph

931 Commits

Author SHA1 Message Date
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
Alexander Efimov
74c5fd46ee [RemoveLayoutConversions] Fix reduce failed infer type error (#377)
* [RemoveLayoutConversions] Fix reduce failed infer type error

This PR fixes layout propagation algorithm in RemoveLayoutConversions pass.
In some cases during rewriteSlice process, reduce operation with multiple outputs
rewrites only one output layout, which breaks assumption that both outputs should have same layout.

This change is a minimal part of https://github.com/openai/triton/pull/2331 change and
small lit test for regression testing.

* fix combine test

* Fix issue with incorrect inference layout of make_range output result
2023-11-01 13:31:13 -05:00
Alexander Efimov
d62a3ffdbe [RemoveLayoutConversions] Remove PatternSharedInfo structure (#378)
This structure is not used anymore after massive refactoring
of RemoveLayoutConversion pass in September IFU.
2023-11-01 12:57:35 -05:00
Michael Melesse
1fd9b40f2f Works as StandAlone and Backend and also Perf is Good
This is a combination of 4 commits.

Works as StandAlone and Backend

Works as StandAlone and Backend

This is a combination of 13 commits.

Works StandAlone and as Backend

This is a combination of 7 commits.

backend set default dir with flag

move bitcode to backend dir

copy backend

save

empty test work in backendmode

enable backend mode when copying to upstream

clean up

fix failure

minimize diff

add skip function

fix bug with corrupted dwarf exp

match num_wraps

fix multi threaded test issue

move bitcode file out of lib

move backend to python/triton/third_party/hip

move libhsa

backend works again

restart ci

clean upstream location first before copy

match scripts

fix new error

memoize backend stuff

fix bug
2023-10-26 14:27:18 -05:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Shucai Xiao
2729ae6c6f use different int8 mfma instructions on different GPUs. (#368)
* changes support to choose different int8 instructions

* rename an instruction name

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-10-25 19:12:21 -05:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support of fp8 and bf8 in AccelerateMatmul pass and
Introduces generation of float8 mfma instructions in ttg to llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
jayfurmanek
e74bdb1581 Always promote to int32 in commonShflSync (#369) 2023-10-23 12:27:11 -05:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Alexander Efimov
4d539d7dae Add licenses to AMD related files (#351) 2023-10-16 15:18:01 -05:00
Jack Taylor
47563240f8 PyTorch triton branch synchronisation (#354)
* Restructure ROCM Library Search
Currently there are a handful of ROCM dependant files which are required for
triton to run.  The linker(ld.lld), the include files, and multiple hip/hsa
shared objects.

This change will provide three search areas to find these files.  All in
the same order.

1. third_party/rocm.  This location is within the python/triton directory
   and is carried over when triton is built.  IF all necessary files
   are in this location there will be no need to have ROCM installed at
   all on the system.

2. $ROCM_PATH environmental variable.  If this exists it will override
   all other locations to find ROCM necessary files

3. /opt/rocm.  The default location for ROCm installations.  Finding one
   here will notify triton that ROCM is installed in this environment

To ease with step 3.  A new script scripts/amd/setup_rocm_libs.sh
has been added to the repo.  Executing this script will cause all necessary
ROCM files to be downloaded from their respective packages on repo.radeon.com
and installed in third_party/rocm.  Allowing for triton to run without installing
the full ROCM stack.  setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user
wishes to install a ROCM version other than the default (currently 5.4.2)

When triton whls are built to support Pytorch, method 3 will be used to stay in
sync with PyTorch's approach of bringing along any libraries needed and not
requiring ROCM to be installed.

(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)

* Fix default rocm path

Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs

(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)

* setup_rocm_libs.sh manylinux refactor

(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)

* Set setup_rocm_libs.sh to be executable

(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)

* Revert to using numbered so files to fix upstream

(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)

* Remove drm script

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-10-11 15:30:39 +01:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFAM] Support mfma with NM size 16

This PR code emitting of MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
Shucai Xiao
8049891ff7 fix ifu gemm perf regression (#348) 2023-10-04 08:45:18 -05:00
Aleksandr Efimov
e6f75d05e3 fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite 2023-10-03 16:13:13 +00:00
Aleksandr Efimov
90a15e449e ROCM IFU: Fix tritongpu_to_llvm lit test 2023-10-03 04:31:03 +00:00
Jason Furmanek
92edee723b ROCM IFU: Fix getValueLivenessRange 2023-10-03 04:30:28 +00:00
Aleksandr Efimov
336c4b5f3c ROCM IFU: Fix LDS overflow issues in test_dot 2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090 ROCM IFU: Fix emitOffsetForMfmaLayout function 2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd ROCM IFU: Fix of dot operand type promotion
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c ROCM IFU: Add new CTALayout parameter to mfma layout 2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0

* * updated configs
2023-09-29 20:20:55 -05:00
Shucai Xiao
6e82aa8dbc support gemm fp8/fp16 mixed input (#333)
* changes to support fp8/fp16 mixed inputs

* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
SJW
0a7b1c7c12 [MLIR] Fixed support for mixed data-types in stream-pipeline (#329)
* [MLIR] Fixed support for mixed data-types in stream-pipeline
* added test

* * fixed test

* * cleanup

* * consolidated code

* * fixed build error
2023-09-26 21:26:50 -05:00
SJW
4db99e0139 [Alloc] Enhanced SharedMem Allocation for mutually exclusive but aliased buffers (#337)
* [Alloc] Enhanced for mutually exclusive but aliased buffers

- Use disjoint alias analysis to minimize shared memory requirements

* * fix for allocation test

* * added test
* fixed mfma_enc printer

* * fixed test
2023-09-25 20:09:33 -05:00
Aleksandr Efimov
7af5e42fbe review fix: fix semantics of chooseMfmaDimensions func 2023-09-25 10:56:44 -05:00
Alexander Efimov
5ac8c7afc1 change to the comment on kWidth parameter 2023-09-25 10:56:44 -05:00
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics "from elements per instruction" to
"elements per thread per instruction" along k axis.
2023-09-25 10:56:44 -05:00
Lixun Zhang
23465f3416 Add assert for shuffleType and default value for laneid 2023-09-12 10:16:44 -05:00
Lixun Zhang
5972eafbc9 Use ceil to align with upstream 2023-09-12 10:16:44 -05:00
Lixun Zhang
1c653dc438 Move shfl_up impl into commonShflSync 2023-09-12 10:16:44 -05:00
Lixun Zhang
74ea0c87de Generalize warpSize
- We have to change the API of shflUpSync to pass laneId to the rocm
implementation of shfl_up
- And we also distinguish laneIdAxis and laneId
2023-09-12 10:16:44 -05:00
Lixun Zhang
ea397b49aa Fix the issue when CTA coverage is larger than the tile 2023-09-12 10:16:44 -05:00
Lixun Zhang
ed20089bc8 Add shfl_up implementation for AMD backend
copied from f58b93693b/include/hc.hpp (L2879-L2885)
2023-09-12 10:16:44 -05:00
Alexander Efimov
6691de65db [MFMA] Support BFloat16 on MI100 (#295)
* [MFMA] Support BFloat16 on MI100

This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100.

* fix tests, fix mfma encoding comment, fix switch between mfma versions.

* replace kDim from mfma layout with kWidth from dotOp layout

* rebase fix

* fix mfma to dot op shortcut for bfloat16

* fix review comments
2023-09-08 15:08:34 -05:00
SJW
491eb9ddfe [MLIR] Added tritongpu-stream-pipeline pass (#305)
* [MLIR] Added tritongpu-stream-pipeline pass
     - Prologue: Hoist the pipelinable load operations and shared memory store
       for the ramp up stage
     - Pipelined Loop: Assemble the loop body minus last iteration
       - Prefetch next tile from global into regs (while computing from previous)
       - Non-load loop body
       - Store next tile into shared mem
     - Epilogue: Peeled non-load loop body for last iteration

* * updated comment
2023-09-07 15:24:59 -05:00
Wen Chen
076a04d5eb [ROCM] Optimized int8 to bf16 conversion by not reusing FpToFpOpConversion::convertFp32ToBf16.
Changed the lit test rules for vectorized int8 to bf16 conversion on
ROCm as ROCm has a different implementation.
2023-09-07 17:26:43 +00:00
Wen Chen
ffc230ebfe [ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm. 2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182 [ROCM] Added ROCm support for int8 to bfloat16 conversion. 2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72 [ROCM] Added ROCm support for the conversions of following data types:
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
Aleksandr Efimov
751edfb3b9 [BACKEND] Fix fma mixed-precision
This is partial cherry-pick of https://github.com/openai/triton/pull/2184

Dropped code unrealted to dot fix.
2023-09-05 21:16:50 +00:00
Aleksandr Efimov
591681d36e Revert "[Dot] Fix FMA fp16xfp16 dot (#315)"
This reverts commit 11752a6993.
2023-09-05 21:12:56 +00:00
Alexander Efimov
11752a6993 [Dot] Fix FMA fp16xfp16 dot (#315)
Disable reorder of FMA dot arguments for amd gpu.
2023-09-05 20:52:08 +00:00
Keren Zhou
c0f418bcdd [BACKEND] Fix BF16 dot operand type mismatch (#2162)
https://github.com/openai/triton/issues/2156
2023-09-05 20:46:31 +00:00
Aleksandr Efimov
2f7ead6f3b Fix subprocess tests for IFU
This PR changes printf ttgir -> llvm conversion,
unknown location is assigned to global constant holding format string.
This fixes problem in test_subprocess.py tests,
which failed during construction of file location for format string constants.
2023-09-05 20:46:04 +00:00
Corbin Robeck
007bea9994 Add bitcode writer to AMDGCN hsaco output 2023-09-01 04:02:29 +00:00