Commit Graph

2386 Commits

Author SHA1 Message Date
Vinayak Gokhale
f6969f4bb3 Correct that loop lo is multiple BLOCK_N 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
0ef865508c Update description 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
dc62569e57 Remove slicing for out in save for bwd 2023-12-04 10:11:41 -06:00
Vinayak Gokhale
e0a4d97569 Mask vs pad for non power of 2 sequence lengths
Padding results in memory allocation which is slower.
Masking results in better performance.
2023-12-04 10:11:41 -06:00
Vinayak Gokhale
d5028079b7 Add FA support for non pow2 seqlen 2023-12-04 10:11:41 -06:00
Lixun Zhang
670ae8054d Add a cute tool to plot blocked, dotOperand, and mfma layout (#407)
* Add commands to plot blocked, dotOperand, and mfma layout

* Add commands to plot LDS layout and wmma instruction layout
2023-11-29 09:35:33 -06:00
Jason Furmanek
64f559771f ROCM IFU: Fix test_core_amd.py::test_reduce_layouts 2023-11-28 04:02:48 +00:00
Jason Furmanek
f5f6b3c0a3 ROCM IFU: Add get_version_key for ROCM backend 2023-11-28 00:11:44 +00:00
Jason Furmanek
71547e4fdb ROCM IFU: Fixes for kwargs 2023-11-28 00:09:46 +00:00
jayfurmanek
99aa1f4f75 Merge branch 'triton-mlir' into ifu-231117 2023-11-27 07:44:04 -06:00
Jason Furmanek
968a35fbf0 Fix merge conflict error 2023-11-27 13:33:23 +00:00
Shucai Xiao
d9219e0eba use hw for fp8 type conversion (#386)
* use hardware instruction for type conversion between fp8 and fp32

* move gpu_matrix_core_version from semantics.py to hip_backend.py

---------

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-11-24 10:26:40 -06:00
Jason Furmanek
4e86b25f1c ROCM IFU: Fix PrintfHIP 2023-11-21 23:06:14 +00:00
Jason Furmanek
a08dafe7fe Initial commit to resolve merge conflicts 2023-11-20 22:41:03 +00:00
Jason Furmanek
5c87f363e4 Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117
Conflicts:
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	python/setup.py
	python/test/unit/language/assert_helper.py
	python/test/unit/operators/test_flash_attention.py
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/language/semantic.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/tutorials/03-matrix-multiplication.py
	python/tutorials/05-layer-norm.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-17 20:42:12 +00:00
jayfurmanek
e1513b34e1 Merge pull request #395 from ROCmSoftwarePlatform/ifu-231108
Ifu 231108
2023-11-17 14:14:14 -06:00
jayfurmanek
bb56be9d16 Merge branch 'triton-mlir' into ifu-231108 2023-11-16 19:30:43 -06:00
Alexander Efimov
dfb76540b4 [Tutorial] Fix post IFU issues with FA (#398)
* [Tutorial] Fix post IFU issues with FA

* Remove redundant kernels in 06-fused-attention.py

* Added README for scripts in perf-kernels dir

* Fix bwd kernel

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-17 01:28:49 +00:00
Alexander Efimov
096def0c9b [Test] Disable mma layout for amd hardware (#384)
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-17 01:28:49 +00:00
Lixun Zhang
181bdbd410 Benchmark FA on 2 GCDs (#393) 2023-11-17 01:28:49 +00:00
Jason Furmanek
44b155f41b ROCM IFU: Resolve merge conflicts in tutorial 06
Resolve merge conflicts in tutorial 06 - 2
2023-11-17 01:28:40 +00:00
Ognjen
9f3d6656a7 ROCM IFU: Fix reduce_slice lit test
Skip tritongpu_to_llvm_hopper test as it is nvidia specific
2023-11-17 01:28:33 +00:00
Ognjen
38fbb7e472 ROCM IFU: Enable slice layout for insertSliceAsync AMD path
Fix basic_insert_slice_async_1d lit test

Remove code added for debugging

Return hopper test
2023-11-17 01:27:57 +00:00
Alexander Efimov
5b06b168aa [Tutorial] Fix post IFU issues with FA (#398)
* [Tutorial] Fix post IFU issues with FA

* Remove redundant kernels in 06-fused-attention.py

* Added README for scripts in perf-kernels dir

* Fix bwd kernel

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-14 10:46:45 -06:00
Alexander Efimov
9941ce7aa5 [Test] Disable mma layout for amd hardware (#384)
Disable mma layout testing by looking at is_hip instead of wave size.
This fixes tests on Navi GPUs with wave size == 32.
2023-11-13 17:52:35 +01:00
Jason Furmanek
484852876e Resolve merge conflicts; AMD adjustments for new LLVM version 2023-11-09 19:00:49 +00:00
Jason Furmanek
977d5aa267 Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108
Conflicts:
	bin/triton-translate.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	python/triton/compiler/compiler.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-08 18:51:23 +00:00
Lixun Zhang
d4eda83b33 Benchmark FA on 2 GCDs (#393) 2023-11-08 12:42:54 -06:00
Lixun Zhang
1af893d8a2 [FRONTEND] Add input dtypes to autotuning key (#2534) (#374)
* [FRONTEND] Add input dtypes to autotuning key (#2534)

* Fix conflict in 06-fused-attention

* Fix get_best_config in FA-transV.py

* Fix leftover get_best_config()

---------

Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>
2023-11-07 19:36:57 -06:00
jayfurmanek
3c1fe617c1 Merge pull request #382 from ROCmSoftwarePlatform/ifu231005-rebase
Ifu231005
2023-11-06 22:55:25 -06:00
Jason Furmanek
85216ea5c5 ROCM IFU: Resoolve conflicts in FA tutorial 2023-11-07 04:29:45 +00:00
oplavsic
502525ff11 ROCM IFU: Fix ScanOp test by implementing idx ShflKind in commonShflSync (#391)
* Fix ScanOp test

* Remove commented-out code

---------

Co-authored-by: Ognjen <oplavsic@luxoft.com>

Fix typo in commonShflSync
2023-11-07 04:29:45 +00:00
Alexander Efimov
aefc94bd25 ROCM IFU: fix test_dot_mfma_vector_load test
fix for previous commit
2023-11-07 04:29:45 +00:00
Alexander Efimov
8bc417b9b7 do not emit nvidia inline asm 2023-11-07 04:29:44 +00:00
Jason Furmanek
39e8901d7a ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp
fix merge error

fix dot

fix make_range

additional fix
2023-11-07 04:29:38 +00:00
Jason Furmanek
c3132eeda8 ROCM IFU: Third-party fixes: preload ROCDL
Additional 3rd-party fix

Remove redundant mfma_supported defines
2023-11-07 03:29:16 +00:00
Jason Furmanek
3a6dc5ad8d resolve some merge conflicts
fix more conflits

Resolve merge conflicts

Some more build and conflict fixes

Resolve conflicts for 06-fused-attension.py

resolve merge conflicts for the tutorial group gemm example

Fixes for some LIT tests

resolve remaining conflicts in tests

Fix empty kernel

set capability 0
2023-11-06 23:13:10 +00:00
Jason Furmanek
33151a860f Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again
Conflicts:
	.gitignore
	.gitmodules
	README.md
	bin/triton-translate.cpp
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Target/AMDGCN/AMDGCNTranslation.h
	include/triton/Target/HSACO/HSACOTranslation.h
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/CMakeLists.txt
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/HSACO/CMakeLists.txt
	lib/Target/HSACO/HSACOTranslation.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/language/test_core.py
	python/test/unit/operators/test_flash_attention.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	python/tutorials/11-grouped-gemm.py
	test/Conversion/tritongpu_to_llvm.mlir
2023-11-06 23:10:10 +00:00
oplavsic
c65f1e6211 Add OptimizeEpilogue pass. (#346)
* optimize_epilogue

* Add config

* Remove licenses

* Comment out Hopper specific parameters when printing out configs

* Add benchmark parameters from flash-attention repo

* Add Z and H in the key of autotuner

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-11-03 16:46:24 -05:00
Thomas Raoux
cb3d79a185 [BACKEND] Prevent emitting multiple dot_wait after pipelinied loop (#2598)
Patch based on @donproc findings and suggested optimization.

Emitting multiple wait op may confuse ptxas and cause it to fallback to
a conservative mode.
2023-11-03 14:29:50 -07:00
Weixing Zhang
34b89a1173 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581)
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
2023-11-03 16:40:03 -04:00
Thomas Raoux
6ac9d51ff0 [OPTIMIZATION] Enable pipelining for bwd flash attention (#2590)
This allow pipelining when a load is used by multiple dot in a loop.

Relax the condition to pipeline dot operands for mma v3 case. This
improves performance for the bwd pass from 260TF to 275TF. However this
expose a performance problem due to the wmma pipelining as ptxas will
now fall back to serial wgmma. A follow up PR will fix a bug in how we
emit wgmma_wait during pipelining and will bring performance to 335TF
2023-11-03 11:46:51 -07:00
Shucai Xiao
cb02a0b346 remove unnecessary arch names (#388) 2023-11-03 12:04:02 -05:00
Lixun Zhang
a66270165a Move fa-transV to the new perf-kernels dir (#387) 2023-11-03 00:09:48 -05:00
Justin Lebar
df08301e76 Reformat Python code with yapf. (#2589)
I've add an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177.  We can now have a real Python
formatter, yay!

To make this PR, I ran my modified yapf over the repository, then looked
over the full diff.  Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want.  I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.

Overall, trailing `#` was sufficient to get formatting similar to our
current code.  I didn't have to disable yapf anywhere.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-11-02 20:44:17 -07:00
Shucai Xiao
79bebc4ffe fp8 type support (#357)
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton.
* add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16`
* change flashattention to support fp8 in q/k.
2023-11-02 15:51:23 -05:00
Thomas Raoux
dced22c4b7 [BACKEND] Remove workaround for NVPTX bug after LLVM upgrade (#2585)
This was a workaround for a bug exposed in test_core when generating
ext_load in NVPTX. The backend bug seems fixed in latest LLVM upgrade so
removing the workaround.
2023-11-02 17:31:58 +00:00
jeromeku
37cd3d5339 [FIX][AOT Compiler] Fix No Specialization Edge Case (#2584)
The [hints
dispatching](218492cd65/python/triton/tools/link.py (L161))
logic currently fails for the edge case where a single kernel with no
specializations is to be linked in the [AOT
compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py).

Since the dispatcher inserts a conditional branch for each
specialization case, this results in an `if ()` being inserted into the
`C` source, which clearly breaks downstream artifacts.

Fix:
- Added simple check for this edge case
- Added unit test that mirrors the existing
[`test_compile_link_matmul`](218492cd65/python/test/unit/tools/test_aot.py (L224))
test case save for the aforementioned condition.
2023-11-02 10:02:41 -07:00
Thomas Raoux
ca8f110617 [BACKEND] Pipeliner refactoring (#2565)
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into 2 pieces one calculating
a modulo schedule and create async ops based on the IR and an expander
that will generate the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it will allow us to
create different schedule without having to change the expander and it
will allow for more complex schedules.
For now the schedule generated for matmul case matches rougly the
schedule picked by the previous pipeliner in order to avoid changes.

This also creates a different sequence of insert/extract slice for the
alloc. We should probably change shared alloc to use memory semantic.
2023-11-02 09:56:39 -07:00
Lixun Zhang
38f9136fc8 Add FA with transV (#385) 2023-11-02 08:52:33 -05:00