Shucai Xiao
79bebc4ffe
fp8 type support ( #357 )
...
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
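The names of the two new types encode their exponent width and bias: e4 with bias 8 and e5 with bias 16. A rough numeric sketch of decoding such bit patterns follows; special-value handling (NaN/inf) is simplified here and is an assumption, not taken from the commit.

```python
def decode_fp8(byte, exp_bits, bias):
    """Decode a generic sign/exponent/mantissa fp8 bit pattern to a float.
    Special values (NaN/inf) are ignored for simplicity."""
    man_bits = 7 - exp_bits
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# tl.float8e4b8: 4 exponent bits, bias 8; tl.float8e5b16: 5 exponent bits, bias 16
decode_e4b8 = lambda b: decode_fp8(b, 4, 8)
decode_e5b16 = lambda b: decode_fp8(b, 5, 16)
```

In both formats the bit pattern 0x40 decodes to 1.0, since the exponent field holds exactly the bias.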
Shucai Xiao
2729ae6c6f
use different int8 mfma instructions on different GPUs. ( #368 )
...
* add support for choosing different int8 instructions
* rename an instruction
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com >
2023-10-25 19:12:21 -05:00
Alexander Efimov
5a86b46bb1
[MFMA] FP8 and BF8 support ( #355 )
...
* [MFMA] FP8 and BF8 support
This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the ttg to llvm conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comments fixes
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com >
2023-10-25 13:27:10 -05:00
oplavsic
715a589ce3
[FA fwd D=128] Reduce LDS usage in epilogue ( #340 )
...
* rebase onto improve_fwd_fa
* Fixed a leftover from rebase
* rebase onto improve_fa_fwd
* Reduce tuning space
* Disable bwd with D=128
* Add test for d=128
* Fix an issue with get_best_config when there is only one config
* Added better configs for d=128
* Fix typos
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com >
2023-10-25 12:10:34 -05:00
jayfurmanek
e74bdb1581
Always promote to int32 in commonShflSync ( #369 )
2023-10-23 12:27:11 -05:00
Alexander Efimov
20f316b19a
[MFMA] Switch between MFMA types ( #352 )
...
This PR introduces the matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Alexander Efimov
4d539d7dae
Add licenses to AMD related files ( #351 )
2023-10-16 15:18:01 -05:00
Alexander Efimov
7e34c244c2
[Triton] Mfma16 support ( #251 )
...
* [MFMA] Support mfma with M/N size 16
This PR adds code emitting MFMA instructions with size 16.
* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
Aleksandr Efimov
e6f75d05e3
fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite
2023-10-03 16:13:13 +00:00
Aleksandr Efimov
90a15e449e
ROCM IFU: Fix tritongpu_to_llvm lit test
2023-10-03 04:31:03 +00:00
wenchenvincent
42a5bf9c7c
ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. ( #335 )
2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090
ROCM IFU: Fix emitOffsetForMfmaLayout function
2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd
ROCM IFU: Fix of dot operand type promotion
...
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Jason Furmanek
e5d7bb4fae
Initial commit to resolve merge conflicts
...
rename tl.float8e4 to tl.float8e4nv to align with upstream
ROCM IFU: Fix python arch issues
ROCM IFU: Fix kernel launcher
ROCM IFU: Fix merge conflicts
fix debug build
Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754
Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
...
Conflicts:
.gitignore
bin/triton-translate.cpp
include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Conversion/TritonGPUToLLVM/Utility.h
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/IR/Dialect.cpp
lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/test/unit/runtime/test_subproc.py
python/triton/compiler/compiler.py
python/triton/compiler/make_launcher.py
python/triton/language/semantic.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
test/Conversion/triton_to_tritongpu.mlir
test/Conversion/tritongpu_to_llvm.mlir
test/TritonGPU/coalesce.mlir
unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Shucai Xiao
6e82aa8dbc
support gemm fp8/fp16 mixed input ( #333 )
...
* changes to support fp8/fp16 mixed inputs
* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
Aleksandr Efimov
d80cd2d374
[MFMA] Change kWidth parameter semantics
...
This PR changes kWidth semantics from "elements per instruction" to
"elements per thread per instruction" along the k axis.
2023-09-25 10:56:44 -05:00
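A hypothetical numeric illustration of the semantics change above (the instruction shape and thread layout here are made-up example values, not taken from the PR):

```python
# Illustration of the kWidth semantics change. The K extent and the
# number of threads laid out along k are assumed example values.
def kwidth_old(k_per_instr):
    # old semantics: elements per instruction along k
    return k_per_instr

def kwidth_new(k_per_instr, threads_along_k):
    # new semantics: elements per thread per instruction along k
    return k_per_instr // threads_along_k
```

For an instruction covering 8 elements along k with 2 threads along k, the reported kWidth changes from 8 to 4.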
Lixun Zhang
23465f3416
Add assert for shuffleType and default value for laneid
2023-09-12 10:16:44 -05:00
Lixun Zhang
1c653dc438
Move shfl_up impl into commonShflSync
2023-09-12 10:16:44 -05:00
Lixun Zhang
74ea0c87de
Generalize warpSize
...
- We have to change the API of shflUpSync to pass laneId to the ROCm
implementation of shfl_up
- We also distinguish laneIdAxis from laneId
2023-09-12 10:16:44 -05:00
Lixun Zhang
ea397b49aa
Fix the issue when CTA coverage is larger than the tile
2023-09-12 10:16:44 -05:00
Lixun Zhang
ed20089bc8
Add shfl_up implementation for AMD backend
...
copied from f58b93693b/include/hc.hpp (L2879-L2885)
2023-09-12 10:16:44 -05:00
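For reference, the warp shfl_up semantics that the copied implementation provides on AMD hardware can be modeled in pure Python. This is a behavioral sketch of the shuffle semantics, not the HIP code itself:

```python
def shfl_up(values, delta, width):
    """Model of warp shfl_up: lane i reads the value from lane i - delta
    within its width-sized segment; lanes whose source index would cross
    the segment boundary keep their own value."""
    out = []
    for i, v in enumerate(values):
        lane = i % width  # lane index within the segment
        out.append(values[i - delta] if lane >= delta else v)
    return out
```

For example, shifting an 8-lane segment up by 2 leaves the first two lanes untouched and moves every other value two lanes higher.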
Alexander Efimov
6691de65db
[MFMA] Support BFloat16 on MI100 ( #295 )
...
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
2023-09-08 15:08:34 -05:00
Wen Chen
076a04d5eb
[ROCM] Optimized int8 to bf16 conversion by not reusing FpToFpOpConversion::convertFp32ToBf16.
...
Changed the lit test rules for vectorized int8 to bf16 conversion on
ROCm as ROCm has a different implementation.
2023-09-07 17:26:43 +00:00
Wen Chen
ffc230ebfe
[ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm.
2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182
[ROCM] Added ROCm support for int8 to bfloat16 conversion.
2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72
[ROCM] Added ROCm support for the conversions of following data types:
...
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
Aleksandr Efimov
751edfb3b9
[BACKEND] Fix fma mixed-precision
...
This is a partial cherry-pick of https://github.com/openai/triton/pull/2184
Dropped code unrelated to the dot fix.
2023-09-05 21:16:50 +00:00
Aleksandr Efimov
591681d36e
Revert "[Dot] Fix FMA fp16xfp16 dot ( #315 )"
...
This reverts commit 11752a6993.
2023-09-05 21:12:56 +00:00
Alexander Efimov
11752a6993
[Dot] Fix FMA fp16xfp16 dot ( #315 )
...
Disable reordering of FMA dot arguments for AMD GPUs.
2023-09-05 20:52:08 +00:00
Keren Zhou
c0f418bcdd
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-09-05 20:46:31 +00:00
Aleksandr Efimov
2f7ead6f3b
Fix subprocess tests for IFU
...
This PR changes the printf ttgir -> llvm conversion:
an unknown location is assigned to the global constant holding the format string.
This fixes a problem in the test_subprocess.py tests,
which failed while constructing a file location for format string constants.
2023-09-05 20:46:04 +00:00
Jason Furmanek
df5c263a19
Fix merge conflicts
2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18
Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
...
Conflicts:
include/triton/Dialect/Triton/Transforms/Passes.h
include/triton/Dialect/TritonGPU/IR/Dialect.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
lib/Analysis/Allocation.cpp
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/triton/compiler/compiler.py
python/triton/ops/flash_attention.py
python/triton/runtime/autotuner.py
python/triton/runtime/jit.py
python/triton/tools/aot.py
python/tutorials/06-fused-attention.py
test/Conversion/tritongpu_to_llvm.mlir
test/Target/tritongpu_to_llvmir.mlir
test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Philippe Tillet
36fc54b6f2
[BACKEND] cleaner V100 tensor core packing ( #2222 )
2023-08-31 14:00:51 -07:00
Philippe Tillet
c4b27d04e3
[BACKEND] Added float8e4b15 codegen for SM < 80 ( #2216 )
2023-08-30 21:57:49 -07:00
Philippe Tillet
ec51552fff
[BACKEND] Lift restriction for float8e4b15 to only support row-col layout ( #2212 )
2023-08-30 14:06:31 -07:00
Keren Zhou
6e4932cda8
[BACKEND] Fix fma mixed-precision ( #2184 )
...
and expose the allow_tf32 argument to the matmul op
@shunting314
2023-08-26 09:49:58 -07:00
Keren Zhou
f6cdcf1d77
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-08-24 20:32:33 -07:00
Zahi Moudallal
5282ed890d
[CI] Add back pre-commit to nvidia CI job ( #2159 )
2023-08-23 01:11:03 +00:00
ivanyinwz
ec801ce18e
[BACKEND] Optimize performance for f16 epilogue with TMA store ( #2135 )
...
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split the TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
2023-08-21 12:44:11 -07:00
jayfurmanek
fa429316d4
Merge pull request #268 from ROCmSoftwarePlatform/improve_reduce_for_fa
...
[CHERRY-PICKED FROM UPSTREAM][BACKEND] no longer uses shared mem or barriers for single-warp reductions (openai#1915)
2023-08-21 13:29:11 -05:00
Alexander Zinoviev
d5188fa230
[BACKEND] enable transpose for float16 on sm75 ( #2139 )
...
Replace the Turing lowering for the dot operation, which followed the
Volta version, with one that follows the Ampere version.
Update the code generator to produce two m16n8k8 MMAs for Turing instead
of the one m16n8k16 MMA we have for Ampere.
2023-08-18 22:20:17 -07:00
Zahi Moudallal
23dd11d471
[BACKEND] Solidify f8e4m3 ( #2105 )
...
Co-authored-by: Philippe Tillet <phil@openai.com >
2023-08-18 19:12:09 -07:00
Thomas
23ef2615d2
[BACKEND] Merge TT_ElementwisePureExtern and TT_ElementwiseImpureExtern ( #2137 )
...
Use getEffect instead to tell passes whether the op has side effects or
not. This doesn't change functionality otherwise.
2023-08-18 20:56:10 +00:00
Thomas
bf351b9ba2
[FRONTEND][BACKEND] Add support for elementwise inline assembly ( #2136 )
...
Add a new operation to implement packed inline assembly for
elementwise operations. This way inline assembly can be used to control
elementwise operations. It also allows packing elements in order to
manually vectorize operations.
2023-08-18 12:57:52 -07:00
Alexander Efimov
01b0108c94
[MFMA] [FA] Keep bf16 results of FA dot operations in registers ( #298 )
...
This PR enables an optimization that keeps bf16 values in registers between dot operations.
2023-08-18 07:33:00 -05:00
Alexander Efimov
9ab335196f
[MFMA] More optimal offset computation ( #286 )
...
This PR replaces expensive operations with simpler ones:
mul/div are replaced with select and compare.
This is a minor change; it decreases the number of required registers
by one when dot operand loading is a bottleneck.
2023-08-18 07:32:38 -05:00
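The commit body does not show the exact rewrite, but a common instance of this kind of strengthening (an illustrative assumption, not code from the PR) replaces a modulo by a single compare-and-select when the operand is known to be bounded by twice the modulus:

```python
def wrap_offset(x, n):
    """Equivalent to x % n, but valid only for 0 <= x < 2*n:
    one compare plus one select instead of an integer division."""
    return x - n if x >= n else x
```

On GPUs an integer divide or modulo expands to many instructions, while a compare-and-select is one or two, which is why such range-restricted rewrites save registers and cycles.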
Alexander Efimov
23979098c8
[MFMA] MI200 bfloat16 support ( #294 )
...
This PR enables bfloat16 support in the MFMA dot on MI200,
using the mfma_f32_32x32x8bf16_1k instruction.
2023-08-18 07:28:18 -05:00
Whitney Tsang
100cabd0e4
[FRONTEND] use enum instead of bool to select target ( #2118 )
...
Before this PR, whether `TritonGPUToLLVMIRPass` generates
NVVM-compatible LLVM or ROCDL-compatible LLVM was controlled by a boolean
`isROCM`. This approach is hard to scale.
This PR changes it to use an enum instead, so that new targets can be added
easily when needed.
---------
Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com >
Co-authored-by: Philippe Tillet <phil@openai.com >
2023-08-17 18:37:09 -07:00