455 Commits

Author SHA1 Message Date
Alexander Efimov
a06072f8ff Fix dangling gpu_has_mfma use (#325)
* Fix dangling gpu_has_mfma use

This PR replaces gpu_has_mfma use with gpu_matrix_core_version

* add basic test
2023-09-11 12:31:48 -05:00
Alexander Efimov
6691de65db [MFMA] Support BFloat16 on MI100 (#295)
* [MFMA] Support BFloat16 on MI100

This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100.

* fix tests, fix mfma encoding comment, fix switch between mfma versions.

* replace kDim from mfma layout with kWidth from dotOp layout

* rebase fix

* fix mfma to dot op shortcut for bfloat16

* fix review comments
2023-09-08 15:08:34 -05:00
Wen Chen
ffc230ebfe [ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm. 2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182 [ROCM] Added ROCm support for int8 to bfloat16 conversion. 2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72 [ROCM] Added ROCm support for the conversions of following data types:
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
Keren Zhou
9e9fbe01f0 [FRONTEND] Fix specialization on triton integer types (#2236)
https://github.com/openai/triton/issues/2231
2023-09-03 23:57:08 -07:00
Jason Furmanek
320b1029da Temporarily disable F8 tests on ROCm 2023-09-01 04:02:14 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Michael Melesse
c6d33dcebf [ROCM] Core Functionality for AMD (#1983)
* this pr adds a third party backend for triton that works on AMD 
* this exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-31 14:02:00 -07:00
Philippe Tillet
ec51552fff [BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212) 2023-08-30 14:06:31 -07:00
goostavz
1465b573e8 [TESTS][HOPPER] Prune hopper tests to speedup CI (#2193)
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-27 20:45:23 -07:00
Keren Zhou
6e4932cda8 [BACKEND] Fix fma mixed-precision (#2184)
and expose the allow_tf32 argument to the matmul op

@shunting314
2023-08-26 09:49:58 -07:00
Mohammed Anany
ebfe0ffb29 [FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-25 10:28:23 -07:00
jayfurmanek
ff7e707f87 Enable usage of block pointer semantics for AMD gpus (#301)
* Enable usage of block pointer semantics for AMD gpus

This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass that rewrites block pointer loads/stores
to legacy loads/stores.

* Update FA fwd in tutorial to use the block pointers

* use 90 compute capability for amd gpus in python/triton/compiler/compiler.py

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

---------

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2023-08-24 13:05:12 -05:00
Bin Fan
dad83f9dcb [TOOLS] Add support for autotuning AOT kernel (#2123)
This PR makes the following change to AOT kernel

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of constexpr set and
meta-parameters is referred to as an "algo". Within an algo, the client can
still give different hints about integer arguments.
- Add an API int ${kernel_name}_get_num_algos() that returns the total
number of algos.
- Add an algo_id so that the client of the generated kernel can select the
algo
- Remove gX, gY and gZ from the kernel parameter list. This is because
the launch grid is usually different with different algos, and the
client should not need to care about how to compute the launch grid for
each algo. Instead, we ask the client to pass the expressions for
computing gX, gY and gZ to compile.py (when AOT kernels are generated).
The expressions can only use kernel parameters or constant values.
- We also change the testing flow. Now we first build the kernels into a
shared library libkernel.so, then the client test.c code is built and
linked against libkernel.so. This is closer to a typical AOT kernel usage
flow.
2023-08-23 09:38:29 -07:00
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
jayfurmanek
fa429316d4 Merge pull request #268 from ROCmSoftwarePlatform/improve_reduce_for_fa
[CHERRY-PICKED FROM UPSTREAM][BACKEND] no longer uses shared mem or barriers for single-warp reductions (openai#1915)
2023-08-21 13:29:11 -05:00
Beal Wang
7e5cd95bf2 [OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146)
For a warp-specialized persistent kernel, the instruction sequences for the
warp groups are
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if GEMM K is smaller than K-TileShape, then the num_inner_loop_steps
of the persistent kernel is 0. The buffer id and mbarrier id will always be
0 in this case, and the order W0 -> W1 -> R0 -> R1 may form, which
contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
2023-08-21 14:46:57 +08:00
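The index arithmetic described in 7e5cd95bf2 can be reproduced with a small plain-Python sketch. It only mirrors the `idx` bookkeeping of the pseudocode in that message, not the actual kernel. The function name `buffer_indices` is hypothetical, and the failing case assumes that one k-tile is still processed per wave even though `num_inner_loop_steps` is 0, which the message implies but does not state.

```python
def buffer_indices(num_waves, num_inner_loop_steps, num_k_tiles):
    """Return the buffer/mbarrier index used at each k-tile step.

    Mirrors the loop in the commit message:
        idx = wave * num_inner_loop_steps; ...; idx++
    """
    indices = []
    for wave in range(num_waves):
        idx = wave * num_inner_loop_steps
        for _ in range(num_k_tiles):
            indices.append(idx)
            idx += 1
    return indices


# Regular case: every step gets its own index, so W/R pairs alternate.
print(buffer_indices(num_waves=2, num_inner_loop_steps=2, num_k_tiles=2))

# Assumed failing case: num_inner_loop_steps == 0 while one k-tile is
# still processed per wave, so every wave reuses index 0.
print(buffer_indices(num_waves=3, num_inner_loop_steps=0, num_k_tiles=1))
```

When every wave maps to the same index, consecutive waves share one buffer and one mbarrier, which is the situation in which the message says the W0 -> W1 -> R0 -> R1 order may form.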
Thomas
54ca7fcb35 [FRONTEND] Use inline asm for global timer and smid functions (#2143)
Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on a bc file.
2023-08-20 22:56:37 -07:00
Keren Zhou
584d5c263f [FRONTEND] Disable IfExp on dynamic conditions (#2100)
Entering `node.body` whenever `if _unwrap_if_constexpr(cond)` holds is wrong
when cond is a tensor, since we cannot statically evaluate a dynamic tensor's
value.

The right way to solve the problem is probably:

1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
2023-08-20 12:58:10 -07:00
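The restriction introduced by 584d5c263f can be illustrated with a minimal sketch using Python's standard `ast` module. This is not Triton's code generator: the `Dynamic` class and the `eval_if_exp` function are hypothetical stand-ins, and the sketch simply rejects a conditional expression whose condition is only known at run time instead of picking a branch.

```python
import ast


class Dynamic:
    """Hypothetical stand-in for a value only known at run time."""


def eval_if_exp(source, env):
    """Evaluate `a if cond else b`, refusing dynamic conditions."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.IfExp):
        raise TypeError("expected a conditional expression")
    # Evaluate only the condition first.
    cond = eval(compile(ast.Expression(body=node.test), "<cond>", "eval"), {}, env)
    if isinstance(cond, Dynamic):
        # A dynamic value cannot be evaluated statically, so neither
        # branch may be chosen at this point.
        raise ValueError("conditional expression on a dynamic condition")
    branch = node.body if cond else node.orelse
    return eval(compile(ast.Expression(body=branch), "<branch>", "eval"), {}, env)


print(eval_if_exp("a if flag else b", {"a": 1, "b": 2, "flag": True}))
```

Passing `Dynamic()` as `flag` raises `ValueError`, whereas treating the object as truthy would silently select the first branch, which is the kind of mistake the commit message describes.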
Alexander Zinoviev
a7b40a10f9 [TESTS] Fix tl.dot test on sm75 (#2140)
Disable tf32 when running on sm75 and below.
Fix the pattern used to match the generated PTX when running on
sm75.
2023-08-19 22:21:18 -07:00
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Thomas
bf351b9ba2 [FRONTEND][BACKEND] Add support for elementwise inline assembly (#2136)
Add a new operation that implements packed inline assembly for
elementwise operations. This way inline assembly can be used to control
elementwise operations. It also allows packing elements so that
operations can be vectorized manually.
2023-08-18 12:57:52 -07:00
Alexander Efimov
d86b19f7a3 [CI] [Dot] Reduced test suite (#302)
Use the upstream list of tests for the dot op on machines with no MFMA support.
This is needed to reduce time required for PR testing.
2023-08-18 07:47:14 -05:00
Alexander Efimov
23979098c8 [MFMA] MI200 bfloat16 support (#294)
This PR enables bfloat16 support in MFMA dot on MI200.
Used mfma_f32_32x32x8bf16_1k instruction.
2023-08-18 07:28:18 -05:00
Thomas
387fc890a5 [FRONTEND][BACKEND] Add a performance test for reductions (#2125)
Also stop promoting integer types, as it doesn't give better perf; this
will allow more vectorization opportunities in the future.
2023-08-17 16:30:33 -07:00
Keren Zhou
2d513dbf50 [FRONTEND] Fix addptr code generation (#2122)
`offset + ptr` and `ptr + offset` both work now
2023-08-17 04:22:08 +00:00
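As a loose plain-Python analogy for the behavior described in 2d513dbf50, the sketch below shows how an addition can be made to work with the pointer on either side. The `Ptr` class is hypothetical and says nothing about how Triton's frontend generates `addptr`; it only illustrates the operand-order symmetry.

```python
class Ptr:
    """Hypothetical pointer-like value holding an integer address."""

    def __init__(self, address):
        self.address = address

    def __add__(self, offset):
        # ptr + offset
        if not isinstance(offset, int):
            return NotImplemented
        return Ptr(self.address + offset)

    # offset + ptr: Python falls back to the right operand's __radd__
    # when int.__add__ does not know how to handle a Ptr.
    __radd__ = __add__


base = Ptr(100)
print((base + 4).address, (4 + base).address)
```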
Zahi Moudallal
557b2d4b34 [CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111) 2023-08-16 20:34:40 -07:00
Philippe Tillet
4215086931 [BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915)
0-bytes shared mem buffers don't materialize empty allocation buffers;
this could lead to unnecessary barriers.

note: reduceop code has become quite messy and will require some cleanup
2023-08-15 11:51:20 +00:00
Zahi Moudallal
0312ed3473 [CI] Update kernels names (#2093)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-14 19:41:41 -07:00
jsh-20
9055af1a5d Update test_user_defined_persistent_warp_specialized_gemm for num-CTA > 1 (#2101)
- remove auto-tune for
test_user_defined_persistent_warp_specialized_gemm.
- remove unnecessary perf evaluation parts.
- add test cases of num-CTA > 1 for
test_user_defined_persistent_warp_specialized_gemm.
2023-08-14 08:51:35 +00:00
Philippe Tillet
facc1dcbac [TESTS] better matmul unit testing (#2098) 2023-08-13 17:54:32 -07:00
Zahi Moudallal
a01c116f76 [FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090) 2023-08-11 17:49:52 -07:00
Beal Wang
d1ce4c4950 [TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075)
remove unnecessary skips. decompose UTs in
persistent-warp-specialized-gemm into vintage and stylish
2023-08-10 06:57:04 +00:00
allatit23
8a610f7cf7 [HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066) 2023-08-09 09:07:56 +00:00
Beal Wang
de47bba07d [OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057)
Co-authored-by: Biao Wang <biaow@nvidia.com>
2023-08-09 16:42:01 +08:00
allatit23
6d98a0899f [HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063) 2023-08-09 15:45:44 +08:00
Alexander Efimov
af05f01218 [Tests] Fix some tests in test_core_amd.py (#288)
This PR:
- enables test_dot_mfma_vector_load for the fast path in the mfma dot op pipeline
- fixes kernel execution for MFMA-enabled GPUs
- disables mfma layout conversion tests on architectures that cannot run these tests
2023-08-08 20:12:32 +02:00
allatit23
6dee55c912 [HOPPER][WS] fix TMA store hang in ws mode (#2056) 2023-08-08 19:53:52 +08:00
ben-zhang-609
2a95d9bf0d [Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-08 16:54:21 +08:00
allatit23
11cf334730 [hopper][ws] use per-agent thread idx by default (#2054)
Co-authored-by: Allen Zhao <allzhao@nvidia.com>
2023-08-08 15:28:10 +08:00
goostavz
b525880d8b [Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts in hopper tests (#2041)
Co-authored-by: goostavz <gzhu@nvidia.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
2023-08-08 06:30:04 +00:00
ben-zhang-609
31e79aa384 [TESTS] remove get_proper_err, get_variant_golden (#2039)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:52:55 -07:00
Qingyi Liu
341f5b61be [BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040)
Make sure that other threads within CTA do not operate on mbarrier until
it is initialized by thread 0.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 20:11:00 -07:00
Keren Zhou
30a331e628 [FRONTEND] Support jit functions without arguments (#2043)
Issue https://github.com/openai/triton/issues/1973

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 19:05:56 -07:00
Thomas
98523bcc48 [BACKEND] Support MMA V3 with float16 accumulator (#2049)
Also fixes a bug exposed in convertLayout lowering for float16. We
shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as this
needs to take a 32-bit register. This also prevented optimization at
the LLVM IR level.
2023-08-07 15:55:44 -07:00
Phil Tillet
521cfae44d [CI] disabled float32 perf regression tests 2023-08-07 12:43:16 -07:00
jayfurmanek
32d7c6d646 Fix runtime/test_subproc.py for hip devices (#284)
* Fix runtime/test_subproc.py for hip devices

* address review comments
2023-08-07 10:30:36 -05:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. We recommend trying them once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Alexander Efimov
7158ec286a [MFMA] [Dot] Support vector loads in normal path (#275)
* [MFMA] [Dot] Support vector loads in normal path

This PR adds generation of vector loads in the normal path of
MFMA dot operand loading.
This requires the shared layout to keep the elements
loaded by one lane contiguous.

* remove redundant refactoring

* fix tests

* extend test with transposed A/B tensors
2023-08-03 14:57:39 -05:00