Commit Graph

1418 Commits

Author SHA1 Message Date
jayfurmanek
a42ac260aa Merge branch 'triton-mlir' into ifu-231117 2023-12-12 14:24:11 -06:00
Jason Furmanek
160dfe838e ROCM IFU: Fix print and assert 2023-12-12 19:30:01 +00:00
Alexander Efimov
605a90c58e [MFMA] Support tile size 4x4 version 1 (#413)
This PR enables 4x4 tile size in MFMA based dot operations.

Supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must have at least 64 output elements; this is a limitation of other layouts that appear during result processing (i.e., the blocked layout cannot handle tensors smaller than the wave size).

For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4) or (8x64) x (64x8) -> (8x8)
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8)

This is a first version of dot using mfma 4x4 instructions, with redundancy and reductions.
2023-12-12 18:23:55 +01:00
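The shape rule in commit 605a90c58e can be sketched as a small check. This is a minimal illustration that encodes only the "at least 64 output elements" condition and the examples listed in the message; the function name and the `MIN_OUTPUT_ELEMENTS` constant are hypothetical and are not taken from the Triton sources.

```python
# Assumption: 64 is the minimum output element count stated in the commit message.
MIN_OUTPUT_ELEMENTS = 64


def is_supported_mfma_4x4_dot(a_shape, b_shape):
    """Hypothetical check: return True if (M x K) x (K x N) has enough output elements."""
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError("inner dimensions of the dot operands must match")
    # The result tensor has M * N elements; the message requires at least 64.
    return m * n >= MIN_OUTPUT_ELEMENTS


# Shapes listed as supported in the commit message.
supported = [((4, 64), (64, 16)), ((16, 64), (64, 4)), ((8, 64), (64, 8))]
# Shapes listed as not supported in the commit message.
unsupported = [((4, 128), (128, 4)), ((4, 64), (64, 8))]
```

Other constraints (for example on K or on divisibility by the tile size) may apply in the real implementation; the message does not state them, so they are left out here.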
Michael Melesse
64a0924381 ROCM IFU: remove ref to test_elementwise 2023-12-07 13:31:59 -06:00
Vinayak Gokhale
1d6b919897 Bugfix: Wrong boundary condition on qk GEMM 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
f6969f4bb3 Correct that loop lo is a multiple of BLOCK_N 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
0ef865508c Update description 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57 Remove slicing for out in save for bwd 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569 Mask vs pad for non power of 2 sequence lengths
Padding results in memory allocation which is slower.
Masking results in better performance.
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
d5028079b7 Add FA support for non pow2 seqlen 2023-12-04 10:11:41 -06:00
Jason Furmanek
64f559771f ROCM IFU: Fix test_core_amd.py::test_reduce_layouts 2023-11-28 04:02:48 +00:00
Jason Furmanek
f5f6b3c0a3 ROCM IFU: Add get_version_key for ROCM backend 2023-11-28 00:11:44 +00:00
Jason Furmanek
71547e4fdb ROCM IFU: Fixes for kwargs 2023-11-28 00:09:46 +00:00
jayfurmanek
99aa1f4f75 Merge branch 'triton-mlir' into ifu-231117 2023-11-27 07:44:04 -06:00
Jason Furmanek
968a35fbf0 Fix merge conflict error 2023-11-27 13:33:23 +00:00
Shucai Xiao
d9219e0eba use hw for fp8 type conversion (#386)
* use hardware instruction for type conversion between fp8 and fp32

* move gpu_matrix_core_version from semantics.py to hip_backend.py

---------

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-11-24 10:26:40 -06:00
Jason Furmanek
a08dafe7fe Initial commit to resolve merge conflicts 2023-11-20 22:41:03 +00:00
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
Alexander Efimov
dfb76540b4 [Tutorial] Fix post IFU issues with FA (#398)
* [Tutorial] Fix post IFU issues with FA

* Remove redundant kernels in 06-fused-attention.py

* Added README for scripts in perf-kernels dir

* Fix bwd kernel

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-17 01:28:49 +00:00
Alexander Efimov
096def0c9b [Test] Disable mma layout for amd hardware (#384)
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-17 01:28:49 +00:00
Lixun Zhang
181bdbd410 Benchmark FA on 2 GCDs (#393) 2023-11-17 01:28:49 +00:00
Jason Furmanek
44b155f41b ROCM IFU: Resolve merge conflicts in tutorial 06
Resolve merge conflicts in tutorial 06 - 2
2023-11-17 01:28:40 +00:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Lixun Zhang
1af893d8a2 [FRONTEND] Add input dtypes to autotuning key (#2534) (#374)
* [FRONTEND] Add input dtypes to autotuning key (#2534)

* Fix conflict in 06-fused-attention

* Fix get_best_config in FA-transV.py

* Fix leftover get_best_config()

---------

Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
Jason Furmanek
85216ea5c5 ROCM IFU: Resolve conflicts in FA tutorial 2023-11-07 04:29:45 +00:00
Alexander Efimov
aefc94bd25 ROCM IFU: fix test_dot_mfma_vector_load test
fix for previous commit
2023-11-07 04:29:45 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
c3132eeda8 ROCM IFU: Third-party fixes: preload ROCDL
Additional 3rd-party fix

Remove redundant mfma_supported defines
2023-11-07 03:29:16 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflicts

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attention.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Weixing Zhang
34b89a1173 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581)
In the current implementation, warpsPerCTA is always set to [numWarps, 1]
for the scenario where two tt.dot ops are fused. This is not optimal when
tt.dot lacks parallelism along the row dimension but has enough along the
column dimension.
2023-11-03 16:40:03 -04:00
Thomas Raoux
6ac9d51ff0 [OPTIMIZATION] Enable pipelining for bwd flash attention (#2590)
This allows pipelining when a load is used by multiple dots in a loop.

Relax the condition to pipeline dot operands for the mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However, this
exposes a performance problem due to the wmma pipelining as ptxas will
now fall back to serial wgmma. A follow-up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF.
2023-11-03 11:46:51 -07:00
Shucai Xiao
cb02a0b346 remove unnecessary arch names (#388) 2023-11-03 12:04:02 -05:00
Lixun Zhang
a66270165a Move fa-transV to the new perf-kernels dir (#387) 2023-11-03 00:09:48 -05:00
Justin Lebar
df08301e76 Reformat Python code with yapf. (#2589)
I've added an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177.  We can now have a real Python
formatter, yay!

To make this PR, I ran my modified yapf over the repository, then looked
over the full diff.  Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want.  I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.

Overall, trailing `#` was sufficient to get formatting similar to our
current code.  I didn't have to disable yapf anywhere.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-11-02 20:44:17 -07:00
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
jeromeku
37cd3d5339 [FIX][AOT Compiler] Fix No Specialization Edge Case (#2584)
The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).

Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.

Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
2023-11-02 10:02:41 -07:00
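The edge case in commit 37cd3d5339 can be illustrated with a small generator of C dispatch code. This is only a sketch under assumptions: `emit_dispatch`, its argument format, and the generated call syntax are hypothetical and do not reproduce the actual code in `link.py`. It shows why joining an empty list of conditions yields `if ()` and how a simple emptiness check avoids it.

```python
def emit_dispatch(kernel_name, specializations):
    """Hypothetical generator: build C dispatch lines for each specialization.

    specializations is a list of (suffix, conditions) pairs, where conditions
    is a list of C boolean expressions (possibly empty).
    """
    lines = []
    for suffix, conditions in specializations:
        call = f"return {kernel_name}_{suffix}(args);"
        if conditions:
            # Normal case: guard the call with the specialization's hints.
            lines.append(f"if ({' && '.join(conditions)}) {call}")
        else:
            # Edge case: no hints, so emit an unconditional call instead of `if ()`.
            lines.append(call)
    return "\n".join(lines)


single = emit_dispatch("matmul", [("v0", [])])
guarded = emit_dispatch("matmul", [("v1", ["M % 16 == 0", "N % 16 == 0"])])
```

Without the `if conditions:` branch, the first call would produce `if () return matmul_v0(args);`, which is not valid C.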
Thomas Raoux
ca8f110617 [BACKEND] Pipeliner refactoring (#2565)
Refactor the pipeliner pass to make it more generic. The main
change is that the pipeliner is now split into two pieces: one that
calculates a modulo schedule and creates async ops based on the IR, and
an expander that generates the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it allows us to
create different schedules without having to change the expander, and it
allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.

This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
2023-11-02 09:56:39 -07:00
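The split described in commit ca8f110617 (a scheduler that assigns ops to stages, and a separate expander that unrolls the schedule) can be sketched in plain Python. This is a toy model, not the MLIR pass: `make_schedule`, `expand`, and the "loads go to stage 0" policy are assumptions made for illustration.

```python
def make_schedule(ops, num_stages):
    """Hypothetical scheduler: loads go to stage 0, everything else to the last stage."""
    return [(op, 0 if op.startswith("load") else num_stages - 1) for op in ops]


def expand(schedule, num_iters):
    """Hypothetical expander: turn (op, stage) pairs into per-step (op, iteration) lists."""
    num_stages = max(stage for _, stage in schedule) + 1
    steps = []
    # Prologue, steady state, and epilogue all fall out of the same loop.
    for t in range(num_iters + num_stages - 1):
        step = []
        for op, stage in schedule:
            iteration = t - stage  # an op in stage s runs iteration t - s at step t
            if 0 <= iteration < num_iters:
                step.append((op, iteration))
        steps.append(step)
    return steps


schedule = make_schedule(["load", "dot"], num_stages=2)
steps = expand(schedule, num_iters=3)
```

Because `expand` only consumes `(op, stage)` pairs, a different scheduling policy can be swapped in without touching it, which is the benefit the commit message attributes to the separation.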
Lixun Zhang
38f9136fc8 Add FA with transV (#385) 2023-11-02 08:52:33 -05:00
Vedant Roy
702cde0d6f [FRONTEND] Implement ternary operator for dynamic values (#2560) 2023-11-01 20:21:32 -04:00
Chenggang Zhao
e7fdfd76fb [FRONTEND] Add value restoration for autotuner (#2549)
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-31 21:37:44 -04:00
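The recovery mechanism proposed in commit e7fdfd76fb can be sketched without a GPU. The example below is a simplified model, assuming list buffers in place of tensors; `benchmark_configs`, `restore_indices`, and the cost-returning kernel are hypothetical names and do not describe the autotuner's real interface.

```python
import copy


def benchmark_configs(kernel, args, configs, restore_indices):
    """Hypothetical tuner loop: try each config, restoring in-place buffers after each run.

    kernel(args, config) may mutate args in place and returns a cost (lower is better).
    """
    # Snapshot the buffers that the kernel overwrites before any trial runs.
    snapshots = {i: copy.deepcopy(args[i]) for i in restore_indices}
    costs = {}
    for config in configs:
        costs[config] = kernel(args, config)
        for i, saved in snapshots.items():
            # Write the saved values back into the same buffer object.
            args[i][:] = saved
    return min(costs, key=costs.get)


def in_place_kernel(args, config):
    """Toy in-place kernel: adds `config` to every element and reports a fake cost."""
    buf = args[0]
    for j in range(len(buf)):
        buf[j] += config
    return abs(config - 2)


buffer = [1, 2, 3]
best = benchmark_configs(in_place_kernel, [buffer], [1, 2, 3], restore_indices=[0])
```

Zeroing the buffer would not help here, because the correct input is `[1, 2, 3]` rather than zeros; that is the gap the commit message describes.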
Zahi Moudallal
3650213218 [OPTIMIZER] Thread local reduction optimization (#2542)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-31 16:13:36 -07:00
Justin Lebar
258399c114 Enable ruff linter instead of flake8 (#2574)
[FRONTEND] Enable ruff linter instead of flake8.
    
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names!  One
of these function bodies was a duplicate, so I deleted it.  The other
function body was not a duplicate, so I gave it a new name.

AIUI all of these errors should have been picked up by flake8.  I'm
confused why it wasn't working.  Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
2023-10-31 21:28:24 +00:00
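The duplicate-function-name problem mentioned in commit 258399c114 is easy to miss because Python silently lets a later `def` replace an earlier one. A minimal detector, written with the standard `ast` module, might look like this; it is an illustration of the class of error, not how ruff or flake8 implement their checks.

```python
import ast
from collections import Counter


def duplicate_function_names(source):
    """Return the sorted names of top-level functions defined more than once."""
    tree = ast.parse(source)
    names = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
    return sorted(name for name, count in Counter(names).items() if count > 1)


example = '''
def test_a():
    assert 1 + 1 == 2

def test_b():
    assert True

def test_a():
    assert 2 * 2 == 4
'''

duplicates = duplicate_function_names(example)
```

In a test file this matters because only the last `test_a` would be collected, so the first test body would never run.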
Goran Flegar
601b95cdbb [DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570)
Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>
2023-10-31 12:06:25 -07:00
Zahi Moudallal
943330790a [FRONTEND] add do_not_specialize property back to JITFunction (#2573) 2023-10-31 12:02:45 -07:00
Nhat Nguyen
0cf3a67f04 [BUILD] Disable W503 in pyproject.toml (#2575)
This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(means line breaks can now occur before a binary operator).

Surprisingly, the change neither took effect nor required any style
changes in the `triton` main `pre-commit` stage. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see `Check pre-commit` stage) picked this up correctly and complained
about formatting issues. I'm not entirely sure what causes this
difference, but if we also disable `W503` in `pyproject.toml`
then the rule is picked up correctly.
2023-10-31 11:57:02 -07:00
Chris Jones
2398b82f18 [FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-30 19:18:27 -07:00
Keren Zhou
70fca00b67 [BACKEND] Fix device_print without arguments (#2566) 2023-10-30 20:04:44 -04:00