Commit Graph

1050 Commits

Author SHA1 Message Date
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Ognjen
38fbb7e472 ROCM IFU: Enable slice layout for insertSliceAsync AMD path
Fix basic_insert_slice_async_1d lit test

Remove code added for debugging

Restore the Hopper test
2023-11-17 01:27:57 +00:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
oplavsic
502525ff11 ROCM IFU: Fix ScanOp test by implementing idx ShflKind in commonShflSync (#391)
* Fix ScanOp test

* Remove commented-out code

---------

Co-authored-by: Ognjen <oplavsic@luxoft.com>

Fix typo in commonShflSync
2023-11-07 04:29:45 +00:00
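For context on the entry above: ScanOp is what frontend scans such as `tl.cumsum` lower to, and its cross-lane step relies on the "idx" shuffle kind this fix implements on the AMD path. A minimal sketch, assuming `tl.cumsum` as the user-facing entry point (my reading of the commit, not stated in it):

```python
import triton
import triton.language as tl

@triton.jit
def scan_kernel(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # tl.cumsum lowers to ScanOp; the cross-lane exchange inside the scan is
    # where the "idx" shuffle kind (commonShflSync) comes into play.
    tl.store(y_ptr + offs, tl.cumsum(x, axis=0))
```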
Alexander Efimov
8bc417b9b7 do not emit nvidia inline asm 2023-11-07 04:29:44 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
c3132eeda8 ROCM IFU: Third-party fixes: preload ROCDL
Additional 3rd-party fix

Remove redundant mfma_supported defines
2023-11-07 03:29:16 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflicts

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attention.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Thomas Raoux
cb3d79a185 [BACKEND] Prevent emitting multiple dot_wait after pipelined loop (#2598)
Patch based on @donproc's findings and suggested optimization.

Emitting multiple wait ops may confuse ptxas and cause it to fall back to
a conservative mode.
2023-11-03 14:29:50 -07:00
Weixing Zhang
34b89a1173 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581)
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
for the two-tt.dot fusion scenario. But this is not optimal for cases where
tt.dot lacks parallelism on the row dimension yet has it on the column
dimension.
2023-11-03 16:40:03 -04:00
Thomas Raoux
6ac9d51ff0 [OPTIMIZATION] Enable pipelining for bwd flash attention (#2590)
This allows pipelining when a load is used by multiple dots in a loop.

It relaxes the condition for pipelining dot operands in the MMA v3 case,
improving performance for the bwd pass from 260 TF to 275 TF. However, this
exposes a performance problem due to the wgmma pipelining, as ptxas will now
fall back to serial wgmma. A follow-up PR will fix a bug in how we emit
wgmma_wait during pipelining and will bring performance to 335 TF.
2023-11-03 11:46:51 -07:00
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types, `tl.float8e4b8` and `tl.float8e5b16`, to Triton
* add software type conversion between `tl.float8e4b8`/`tl.float8e5b16` and `fp16`
* change flash attention to support fp8 in q/k (a hedged usage sketch follows this entry)
2023-11-02 15:51:23 -05:00
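As referenced above, a hedged sketch of how the two new types might be exercised from the Python frontend; the kernel and pointer dtypes are illustrative, and only the `tl.float8e4b8`/`tl.float8e5b16` names come from the entry:

```python
import triton
import triton.language as tl

@triton.jit
def downcast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)   # fp16 input
    # Software conversion fp16 -> fp8e4b8 (the bias-8 e4m3 variant).
    y = x.to(tl.float8e4b8)
    tl.store(y_ptr + offs, y, mask=mask)   # y_ptr assumed to be an fp8 buffer
```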
Thomas Raoux
dced22c4b7 [BACKEND] Remove workaround for NVPTX bug after LLVM upgrade (#2585)
This was a workaround for a bug exposed in test_core when generating
ext_load in NVPTX. The backend bug appears to be fixed by the latest LLVM
upgrade, so the workaround is removed.
2023-11-02 17:31:58 +00:00
Thomas Raoux
ca8f110617 [BACKEND] Pipeliner refactoring (#2565)
Refactor the pipeliner pass to make it more generic. The main change is that
the pipeliner is now broken into two pieces: one that calculates a modulo
schedule and creates async ops based on the IR, and an expander that
generates the pipelined IR from that schedule.
The advantage of separating the two pieces is that it lets us create
different schedules without having to change the expander, and it allows for
more complex schedules. For now, the schedule generated for the matmul case
roughly matches the schedule picked by the previous pipeliner, in order to
avoid changes. (A toy sketch of the two-phase split follows this entry.)

This also creates a different sequence of insert/extract slices for the
alloc. We should probably change shared alloc to use memory semantics.
2023-11-02 09:56:39 -07:00
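As referenced above, a toy, purely illustrative Python model of the two-phase split (the real pass is C++; every name here is invented):

```python
# Hypothetical model: phase 1 assigns each loop op a stage (the modulo
# schedule); phase 2 (the "expander") emits a prologue plus steady-state
# loop body from it, without knowing how the schedule was chosen.
def compute_modulo_schedule(ops, num_stages=2):
    # e.g. hoist loads to an earlier stage so they overlap with compute
    return {op: 0 if op.startswith("load") else num_stages - 1 for op in ops}

def expand(ops, schedule):
    prologue = [op for op in ops if schedule[op] == 0]
    body = list(ops)  # each op shifted by its stage in a real expander
    return prologue, body

ops = ["load_a", "load_b", "dot", "store"]
print(expand(ops, compute_modulo_schedule(ops)))
```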
Thomas Raoux
218492cd65 [BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583) 2023-11-02 05:32:16 +00:00
Dongdong Li
d0098da7b1 [BACKEND] Add error reporting for non-kernel arguments (#2552)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-11-01 20:22:10 -04:00
Alexander Efimov
74c5fd46ee [RemoveLayoutConversions] Fix reduce failed infer type error (#377)
* [RemoveLayoutConversions] Fix reduce failed infer type error

This PR fixes the layout propagation algorithm in the RemoveLayoutConversions
pass. In some cases during the rewriteSlice process, a reduce operation with
multiple outputs had only one output layout rewritten, which breaks the
assumption that both outputs should have the same layout.

This change is a minimal part of https://github.com/openai/triton/pull/2331
and adds a small lit test for regression testing.

* fix combine test

* Fix an issue with incorrect layout inference for the make_range output result
2023-11-01 13:31:13 -05:00
Alexander Efimov
d62a3ffdbe [RemoveLayoutConversions] Remove PatternSharedInfo structure (#378)
This structure is no longer used after the massive refactoring
of the RemoveLayoutConversions pass in the September IFU.
2023-11-01 12:57:35 -05:00
Zahi Moudallal
3650213218 [OPTIMIZER] Thread local reduction optimization (#2542)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-31 16:13:36 -07:00
Goran Flegar
601b95cdbb [DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570)
Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>
2023-10-31 12:06:25 -07:00
daemondzh
96cf8f979a [OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545)
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
2023-10-31 02:28:27 +00:00
Chris Jones
2398b82f18 [FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-30 19:18:27 -07:00
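A minimal sketch of what the new parameter might look like at the frontend, assuming a keyword such as `scope` with values like `"gpu"`/`"cta"`/`"sys"` (the exact spelling of the values is an assumption, not stated in the entry):

```python
import triton
import triton.language as tl

@triton.jit
def histogram_kernel(bins_ptr, x_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    b = tl.load(x_ptr + offs, mask=mask, other=0)  # bin index per element
    # `scope` (assumed spelling) bounds how widely the atomic is ordered:
    # e.g. "cta" for block-local counters, "gpu" for device-wide ones.
    tl.atomic_add(bins_ptr + b, 1, mask=mask, scope="gpu")
```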
Keren Zhou
70fca00b67 [BACKEND] Fix device_print without arguments (#2566) 2023-10-30 20:04:44 -04:00
Michael Melesse
1fd9b40f2f Works standalone and as a backend, and performance is good
This is a combination of 4 commits.

Works as StandAlone and Backend

Works as StandAlone and Backend

This is a combination of 13 commits.

Works StandAlone and as Backend

This is a combination of 7 commits.

backend set default dir with flag

move bitcode to backend dir

copy backend

save

empty test works in backend mode

enable backend mode when copying to upstream

clean up

fix failure

minimize diff

add skip function

fix bug with corrupted dwarf exp

match num_warps

fix multi threaded test issue

move bitcode file out of lib

move backend to python/triton/third_party/hip

move libhsa

backend works again

restart ci

clean upstream location first before copy

match scripts

fix new error

memoize backend stuff

fix bug
2023-10-26 14:27:18 -05:00
Dongdong Li
0469d5fccd [OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-10-26 16:35:36 +00:00
zhu jianjiang
cfae7e2a25 [BACKEND] Fix matmul downcast path (#2528)
For https://github.com/openai/triton/issues/2523; add a regression test.

---------

Co-authored-by: Jokeren <robinho364@gmail.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-26 09:43:49 -04:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this PR adds a third-party backend for Triton that works on AMD
* it exposes much of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Shucai Xiao
2729ae6c6f use different int8 mfma instructions on different GPUs. (#368)
* change to support choosing different int8 instructions

* rename an instruction

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-10-25 19:12:21 -05:00
Hongtao Yu
2323adb387 [BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514)
Addressing https://github.com/openai/triton/issues/2011

Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-25 15:20:07 -04:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 MFMA instructions in the TTG-to-LLVM conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
Adnan Akhundov
7d55968fee [BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512)
### Summary

When Triton GPU IR is lowered into LLVM IR, we can use constancy
information about the results of elementwise ops to deduplicate otherwise
redundant computation. That is the contribution of this PR: constancy is
checked and, where possible, some values in the LLVM IR are reused multiple
times instead of computing equal values separately.

The change benefits PyTorch 2 / TorchInductor-generated Triton code, where
the leftmost sub-indices extracted from a flat index by div / mod operations
can be equal, given a sufficiently large 2^n factor in the rightmost
dimension(s). This makes the computation producing those sub-indices
redundant. Consequently, under the necessary constancy conditions, the
redundant indexing arithmetic can be deduplicated. We observe up to a 29%
decrease in the latency of some of our jagged tensor kernels. (A hypothetical
kernel illustrating the pattern follows this entry.)
2023-10-25 11:25:29 -04:00
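As referenced above, a hypothetical Inductor-style kernel showing the pattern; when `BLOCK` divides `N` evenly and a block never straddles a row boundary, `row` is the same for every lane of the block, which is exactly the constancy the backend can now exploit:

```python
import triton
import triton.language as tl

@triton.jit
def copy_2d_kernel(x_ptr, y_ptr, N: tl.constexpr, BLOCK: tl.constexpr):
    idx = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)  # flat index
    row = idx // N  # constant across the block when N % BLOCK == 0
    col = idx % N   # with constancy known, `row` is computed once and reused
    tl.store(y_ptr + idx, tl.load(x_ptr + row * N + col))
```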
Justin Lebar
e70e11e834 [BACKEND] Improve printf. (#2532)

Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.

This caused a few problems.

 - nvptx printf can only handle 32 arguments; if you pass more than
   that, it prints garbage.  So if a thread had more than 32 values, you
   couldn't print them, issue #2486.

 - The order of the values within the Triton program (GPU thread block)
   is an implementation detail -- it depends on the layout the compiler
   assigns to a tensor.  So this also prevented you from interpreting
   the printed output.

To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value.  And each value gets its own line to avoid
passing too many args to printf.

Example output:

    ```
    pid (0, 1, 2) idx (36, 127) x: 42
    ```

If you want to observe all the values in a tensor in order, you can grep
and then sort the output.

We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.

Fixes #2486.
2023-10-25 08:47:55 +00:00
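A minimal usage sketch matching the description above (the output format is taken from the example in the entry; tensor contents are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def print_kernel(x_ptr, BLOCK: tl.constexpr):
    x = tl.load(x_ptr + tl.arange(0, BLOCK))
    # The label gets ": " appended automatically; one line per value,
    # each prefixed with the pid and tensor index, e.g.
    #   pid (0, 0, 0) idx (3) x: 3.000000
    tl.device_print("x", x)

x = torch.arange(4, device="cuda", dtype=torch.float32)
print_kernel[(1,)](x, BLOCK=4)
```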
Justin Lebar
2217bd2f5c [BACKEND] Delete dead vprintf and vprintf_array functions (#2531)

These were introduced in 88498d104a and
appear to have been dead at the time of introduction.
2023-10-25 00:22:53 +00:00
Philippe Tillet
3f2b7263e8 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2541)
Reverts openai/triton#2525
2023-10-24 10:23:19 -07:00
Philippe Tillet
8f467f1ea9 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485) (#2525)
Reverts openai/triton#2497
2023-10-23 21:50:58 -07:00
Thomas Raoux
5e6071254c [BACKEND] Use our internal slice implementation to avoid combinatorial explosion (#2535)
2023-10-24 03:06:34 +00:00
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith
ops for the cmp and select ops.
2023-10-23 19:35:46 -07:00
jayfurmanek
e74bdb1581 Always promote to int32 in commonShflSync (#369) 2023-10-23 12:27:11 -05:00
Zahi Moudallal
b0c166b9e3 [BACKEND] Fixing bug in elementwise conversion (#2517) 2023-10-20 09:11:15 -07:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Zahi Moudallal
a980ec50f1 [BACKEND] Fixing f8e5m2 to bf16 conversion on A100 (#2508) 2023-10-18 17:23:39 +01:00
Thomas Raoux
e36d1665ca [BACKEND] Fix unsupported view op created during optimizations (#2510)
When propagating layouts we were generating a view op with a mismatching
total number of elements per thread. Lowering such an op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimizations in the future.
2023-10-18 16:37:13 +01:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces the matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32 (a hedged usage sketch follows this entry).
2023-10-18 16:57:34 +02:00
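As referenced above, a hedged sketch of how the flag might be used, assuming this fork threads it through `triton.Config` like `num_warps` (the plumbing is an assumption; only the flag name and the 16/32 meaning come from the entry):

```python
import triton

# Hypothetical autotuner configs selecting the MFMA tile via the new flag;
# whether triton.Config accepts this kwarg is assumed, not confirmed here.
configs = [
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32},
                  num_warps=4, matrix_instr_nonkdim=16),  # prefer MFMA 16x16
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32},
                  num_warps=4, matrix_instr_nonkdim=32),  # prefer MFMA 32x32
]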
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Alexander Efimov
4d539d7dae Add licenses to AMD related files (#351) 2023-10-16 15:18:01 -05:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment; revert #2430 (#2496)
Without this change, a constexpr assignment (i.e. `A = B & C`, where `B`
and `C` are both constexpr) gets assigned to a Triton tensor, which becomes
an issue when `A` is used as the condition of an `if` statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0`, the assigned value is also a
constexpr, but in this case we do want to assign a Triton tensor to `x` so
that we can do `x.to(tl.int64)`, for example, which cannot be done on a
constexpr. (A minimal repro-style sketch follows this entry.)

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
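As referenced above, a minimal repro-style sketch of the situation described in this entry, using a hypothetical kernel; with the fix, `A` remains a constexpr and the `if` is resolved at compile time:

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr, B: tl.constexpr, C: tl.constexpr):
    A = B & C  # constexpr & constexpr stays a constexpr after the fix
    if A:      # compile-time branch, not a runtime tensor condition
        tl.store(out_ptr, 1.0)
    else:
        tl.store(out_ptr, 2.0)
```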