Commit Graph

409 Commits

Author SHA1 Message Date
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables 4x4 tile size in MFMA based dot operations.

Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, actual dot operation should have at least 64 output elements, this is a limitation of other layouts appearing during result processing (i.e. blocked layout can not handle tensors smaller than wavesize).

For example, following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
Following dots are not supporter: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
Shucai Xiao
d9219e0eba use hw for fp8 type conversion (#386)
* use hardware instruction for type conversion between fp8 and fp32

* move gpu_matrix_core_version from semantics.py to hip_backend.py

---------

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-11-24 10:26:40 -06:00
Alexander Efimov
096def0c9b [Test] Disable mma layout for amd hardware (#384)
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-17 01:28:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Alexander Efimov
aefc94bd25 ROCM IFU: fix test_dot_mfma_vector_load test
fix for previous commit
2023-11-07 04:29:45 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflits

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attension.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
Michael Melesse
1fd9b40f2f Works as StandAlone and Backend and also Perf is Good
This is a combination of 4 commits.

Works as StandAlone and Backend

Works as StandAlone and Backend

This is a combination of 13 commits.

Works StandAlone and as Backend

This is a combination of 7 commits.

backend set default dir with flag

move bitcode to backend dir

copy backend

save

empty test work in backendmode

enable backend mode when copying to upstream

clean up

fix failure

minimize diff

add skip function

fix bug with corrupted dwarf exp

match num_wraps

fix multi threaded test issue

move bitcode file out of lib

move backend to python/triton/third_party/hip

move libhsa

backend works again

restart ci

clean upstream location first before copy

match scripts

fix new error

memoize backend stuff

fix bug
2023-10-26 14:27:18 -05:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support of fp8 and bf8 in AccelerateMatmul pass and
Introduces generation of float8 mfma instructions in ttg to llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
Shucai Xiao
8547694665 set correct arch info for unit test (#370)
* set correct arch info for unit test

* address review comments
2023-10-25 13:06:45 -05:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496)
Without this change, a constexpr assignment (ie. `A = B & C`, where `B`
and `C` are both constexpr) is getting assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an If
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
Keren Zhou
f81d9d876f [FRONTEND] Fix math for constant values (#2472)
https://github.com/openai/triton/issues/2470
2023-10-12 12:11:42 -07:00
Zahi Moudallal
be19cf3103 [BACKEND] Enable reduce with 3D tensors and added tests (#2460) 2023-10-06 15:08:22 -07:00
Michael Melesse
31fe8aadc5 ROCM IFU: Fix minimize_alloc
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Shucai Xiao
334c9b5aed ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input 2023-10-03 04:30:44 +00:00
Michael Melesse
28c571ea43 ROCM IFU: Fix test_if 2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce ROCM IFU: Fix layout formatting 2023-10-03 04:30:16 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Philippe Tillet
533efd0cac [FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422)
clipping float8e4b15 to +-1.875 is a bad idea because these are
represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv.
We lose two values but this will make compatibility with float8e4nv way
less painful. (it will just be a matter of adjusting the bias)
2023-09-29 23:33:28 -07:00
Hongtao Yu
e0edb70f78 [BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415) 2023-09-29 17:29:41 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Ying Zhang
78c28bf5f6 Support scalar fp8 conversions by packing (#2379)
Support fp8 scalar conversions by packing fp8 with undef values.

Also add simple unittests to cover this change.
2023-09-27 08:29:53 -07:00
Shucai Xiao
6e82aa8dbc support gemm fp8/fp16 mixed input (#333)
* changes to support fp8/fp16 mixed inputs

* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
Philippe Tillet
7432fff4be [FRONTEND] add limited introspection capabilities in tl.extra.cuda ; rename arch into target (#2385) 2023-09-25 23:58:25 -07:00
ben-zhang-609
d040b58547 [HOPPER] fix ref check failure of flash attention with mma v3 (#2384) 2023-09-25 11:29:49 -07:00
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics "from elements per instruction" to
"elements per thread per instruction" along k axis.
2023-09-25 10:56:44 -05:00
Keren Zhou
57fc6d1f13 [BACKEND] shfl ptx insts should have side effects (#2376)
Otherwise, llvm pass could generate very weird structure of CFG and
yield incorrect results.

https://github.com/openai/triton/issues/2361
2023-09-23 10:05:20 -07:00
Philippe Tillet
c71ec14f31 [TEST] only test 4 configs without TF32 (#2370) 2023-09-21 21:23:19 -07:00
Alexander Zinoviev
d543eb1a36 [BACKEND] implement dot for INT8 on Turing (#2364)
Replace a single
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32 instruction
that is used on Ampere with 4 x
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 instructions
for Turing

Extracted the Turing-int8, Turing-fp16 and Ampere to separate functions.

Somehow I messed up with my previous PR, so just open a new one.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-21 16:40:53 -07:00
Philippe Tillet
32c9d2bb8f [FRONTEND] improved error messages (#2363)
this is a combination of #1774 and #2006, which I cannot edit but fail
CI pre-commit hook
2023-09-21 15:05:57 -07:00
Thomas Raoux
e36c99b588 [BACKEND] Handle scan of function non commutative (#2362)
Make sure we accumulate in the right order for scans so that non
commutative operations are handled correctly.
2023-09-21 12:00:41 -07:00
peterbell10
8094f46632 [FRONTEND][BACKEND] Fix various atomic_rmw bugs (#2355)
This fixes a few bugs I've encountered
- `atomic_add` with int64/uint64 `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
2023-09-21 03:31:20 +00:00
ben-zhang-609
bcaf14755a [HOPPER] enable flash attention with tma (#2336) 2023-09-20 14:06:56 -07:00
Thomas Raoux
9cab885dff [BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343)
Also add a test for MMA v3 reduction.
2023-09-20 14:05:12 -07:00
Keren Zhou
ed5a53057d [BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345)
https://github.com/openai/triton/issues/2298
2023-09-20 12:25:52 -04:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Keren Zhou
307b5caa49 [BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330)
1. On the axis, using `getAxisNumWarpsWithUniqueData` instead of getting
the raw number of warps to avoid communication among warps that handle
the same piece of data.
2. When there's a single warp on the axis, using warp Intrinsics for
communication and skip shared memory.

Need a follow up PR for code clean up.
2023-09-18 17:45:05 -04:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintain
2023-09-17 14:58:50 -07:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default which allows accumualting with
lower precision.
This only affect Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Keren Zhou
08c1658957 [FRONTEND] Accommodate new triton IR format (#2294)
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.
2023-09-14 09:03:23 -07:00
Zahi Moudallal
36087a108f [FRONTEND] Added SASS to asm dict (#2280) 2023-09-13 21:21:01 +00:00
Zahi Moudallal
e95e1f12eb [BACKEND] Convert layout illegal mem access fix (#2287) 2023-09-13 10:02:25 -07:00
Thomas Raoux
994f7e4460 [BACKEND] Remove dependency between NVGPU and TritonNvidiaGPU (#2282) 2023-09-12 11:02:20 -07:00