26 Commits

Author SHA1 Message Date
Thomas Raoux
cb3d79a185 [BACKEND] Prevent emitting multiple dot_wait after pipelined loop (#2598)
Patch based on @donproc findings and suggested optimization.

Emitting multiple wait ops may confuse ptxas and cause it to fall back to
a conservative mode.
2023-11-03 14:29:50 -07:00
daemondzh
96cf8f979a [OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545)
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
2023-10-31 02:28:27 +00:00
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith
ops for the cmp and select ops.
2023-10-23 19:35:46 -07:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Thomas Raoux
6f46c93b9e [BACKEND] Add back dot.wait when generating async_dot (#2478)
Based on discussion this is needed to make sure there is no race
condition when reading shared memory.
2023-10-10 21:45:28 -07:00
Beal Wang
5812d970a8 [HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402)
This is just for Warp Specialization kernels on Hopper. Replace DivOp
and RemOp with SelectOp and AndOp/XorOp.
2023-10-09 14:42:06 +08:00
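A minimal sketch of the kind of rewrite #2402 describes, in Python rather than MLIR. It assumes the values being divided and remaindered are a loop counter used to pick a ring-buffer slot and a phase bit; the commit does not say which values are involved, and both function names are hypothetical.

```python
def slots_with_div_rem(num_steps, num_stages):
    """Reference version: one remainder and one division per iteration."""
    out = []
    for i in range(num_steps):
        slot = i % num_stages            # RemOp
        phase = (i // num_stages) % 2    # DivOp, then parity
        out.append((slot, phase))
    return out


def slots_with_select_xor(num_steps, num_stages):
    """Same sequence, carried across iterations with select/xor/and only."""
    out = []
    slot, phase = 0, 0
    for _ in range(num_steps):
        out.append((slot, phase))
        nxt = slot + 1
        wrapped = nxt == num_stages
        slot = 0 if wrapped else nxt           # SelectOp
        phase = (phase ^ int(wrapped)) & 1     # XorOp, AndOp
    return out
```

Both functions produce the same (slot, phase) pairs; the second keeps the state in loop-carried values instead of recomputing it from the counter.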
Tori Baker
ab4549310b [OPTIMIZER] erase ops after use in iterator (#2455)
This seems to have worked fine in opt mode (although it may be producing
undefined behavior), but in debug mode on a newer version of llvm, it
segfaults without this PR as the iterators get invalidated.

This is also consistent with other places it is done in this file.
2023-10-06 18:02:56 -07:00
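The hazard fixed in #2455 can be sketched in Python, under the assumption that the pass walks a list of ops and erases some of them. In C++ erasing during the walk invalidates the iterator; Python merely skips elements, which is enough to show why the erase is deferred. The function names are hypothetical.

```python
def erase_during_iteration(ops, dead):
    """Buggy pattern: mutate the list while iterating over it."""
    ops = list(ops)
    for op in ops:
        if op in dead:
            ops.remove(op)  # shifts later elements under the iterator
    return ops


def erase_after_iteration(ops, dead):
    """Deferred pattern: collect first, erase once the walk is done."""
    ops = list(ops)
    to_erase = [op for op in ops if op in dead]
    for op in to_erase:
        ops.remove(op)
    return ops
```

With `ops = ["a", "b", "c", "d"]` and `dead = {"b", "c"}`, the first version leaves `"c"` behind because the iterator steps over it after `"b"` is removed; the second removes both.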
Thomas Raoux
a7061e19b2 [BACKEND] Fix multiple bugs in WGMMA (#2457)
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.

Also fix the descriptor calculation for wgmma; we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
2023-10-06 17:59:28 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Tori Baker
2d28b09319 Forward fixes to build on newer version of llvm (#2388) 2023-09-26 08:08:13 -07:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Thomas Raoux
976aabdeb2 [BUILD] Fix a few dependencies and layering issues to make lld work (#2307)
This fixes a few problems that were preventing me from using the lld linker.
2023-09-15 17:00:58 +00:00
Beal Wang
7aae350fab [OPTIMIZER] async launch dots for hopper warp-specialized kernel (#2251) 2023-09-07 13:37:07 +08:00
ivanyinwz
ec801ce18e [BACKEND] Optimize performance for f16 epilogue with TMA store (#2135)
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
2023-08-21 12:44:11 -07:00
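Items 2 and 3 of #2135 can be pictured with a short Python sketch. It only models the bookkeeping: the block is cut into 64x64 slices and the slices are handed to warps. The round-robin assignment and both function names are assumptions for illustration; the commit does not state how slices are mapped to warps.

```python
def slice_block(block_m, block_n, slice_m=64, slice_n=64):
    """Return the (row, col) origin of every slice covering the block."""
    return [
        (m, n)
        for m in range(0, block_m, slice_m)
        for n in range(0, block_n, slice_n)
    ]


def assign_to_warps(slices, num_warps):
    """Hypothetical round-robin distribution of slices over warps."""
    plan = {warp: [] for warp in range(num_warps)}
    for i, origin in enumerate(slices):
        plan[i % num_warps].append(origin)
    return plan
```

For a 128x256 block this yields eight slices, so four warps would each issue two stores under this assumed mapping.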
Beal Wang
7e5cd95bf2 [OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146)
For a warp-specialized persistent kernel, the instruction sequences for
the warp groups are:
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
This forms a sequence of morally strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if GEMM K is smaller than the K-TileShape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will always be 0 in this case, and the operations may form
the order W0 -> W1 -> R0 -> R1, which contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
2023-08-21 14:46:57 +08:00
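The index arithmetic in the #2146 pseudocode can be replayed in Python to see why the mbarrier id collapses to 0. This is a sketch of the index computation only, not of the barrier semantics, and it assumes the inner loop still runs once (`num_k_tiles = 1`) when `num_inner_loop_steps` is 0; the function name is hypothetical.

```python
def barrier_indices(num_waves, num_inner_loop_steps, num_k_tiles):
    """List the idx values used for EB[idx] / FB[idx], in program order."""
    seen = []
    for wave in range(num_waves):
        idx = wave * num_inner_loop_steps
        for _ in range(num_k_tiles):
            seen.append(idx)
            idx += 1
    return seen
```

With `barrier_indices(3, 2, 2)` every step gets a fresh index (0 through 5), whereas `barrier_indices(3, 0, 1)` returns 0 for every wave, so all waves share one buffer and one mbarrier.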
Thomas
23ef2615d2 [BACKEND] Merge TT_ElementwisePureExtern and TT_ElementwiseImpureExtern (#2137)
Use getEffect instead to tell passes whether the op has side effects or
not. This doesn't change functionality otherwise.
2023-08-18 20:56:10 +00:00
Qingyi Liu
ecc5e8c558 [BACKEND] Fix several shape errors when numCTAs > 1 (#2107) 2023-08-16 02:14:10 +00:00
darkbuck
a3df6068b4 [BACKEND] Minor fixes found when building triton with LLVM 17/main branches (#2089)
- These minor fixes are not specific to interface changes from LLVM main
or the official llvm-17 branch and can be applied on the triton main branch.
- https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch
has extra changes to build against an LLVM main branch build, which lets me
work on other backends on the main branch only. That is a hobby effort
and is mentioned just for reference.
2023-08-16 01:18:06 +00:00
allatit23
a58e6ef2b7 [HOPPER][WS] support tt.reduce as dependent op in guard pass (#2067) 2023-08-09 19:31:38 +08:00
allatit23
8a610f7cf7 [HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066) 2023-08-09 09:07:56 +00:00
Beal Wang
de47bba07d [OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057)
Co-authored-by: Biao Wang <biaow@nvidia.com>
2023-08-09 16:42:01 +08:00
allatit23
6dee55c912 [HOPPER][WS] fix TMA store hang in ws mode (#2056) 2023-08-08 19:53:52 +08:00
Qingyi Liu
341f5b61be [BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040)
Make sure that other threads within the CTA do not operate on the mbarrier
until it is initialized by thread 0.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 20:11:00 -07:00
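The ordering that #2040 enforces can be mimicked on the host with Python threads. This is an analogy only: `threading.Barrier` stands in for the CTA-wide barrier, a dict stands in for the mbarrier, and all names are hypothetical.

```python
import threading


def run_cta(num_threads=4):
    """Thread 0 initializes shared state; nobody reads it before the barrier."""
    mbarrier = {"initialized": False}
    cta_barrier = threading.Barrier(num_threads)
    observed = [None] * num_threads

    def worker(tid):
        if tid == 0:
            mbarrier["initialized"] = True   # stand-in for AllocMBarrierOp init
        cta_barrier.wait()                   # stand-in for the added BarrierOp
        observed[tid] = mbarrier["initialized"]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed
```

Because every thread waits at the barrier, and thread 0 only reaches it after writing, each thread observes the initialized state. Without the `wait()` call, a thread other than 0 could read the flag before it is set.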
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of support for Nvidia Hopper features. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.)
and automatic warp-specialization are experimental for now and turned
off by default. We recommend trying them once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00