Commit Graph

20 Commits

Author SHA1 Message Date
Justin Lebar
df08301e76 Reformat Python code with yapf. (#2589)
I've added an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177.  We can now have a real Python
formatter, yay!

To make this PR, I ran my modified yapf over the repository, then looked
over the full diff.  Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want.  I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.

Overall, trailing `#` was sufficient to get formatting similar to our
current code.  I didn't have to disable yapf anywhere.
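
For reference, the trailing-`#` trick looks like this (a hypothetical kernel signature, not code from this diff); yapf won't move a token past an end-of-line comment, so a bare `#` pins the line break after it:

```
# Hypothetical signature for illustration; the bare trailing `#`
# keeps yapf from re-joining these lines.
def matmul_kernel(a_ptr, b_ptr, c_ptr,  #
                  M, N, K,  #
                  stride_am, stride_ak):
    ...
```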

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-11-02 20:44:17 -07:00
Thomas Raoux
ca8f110617 [BACKEND] Pipeliner refactoring (#2565)
Refactor the pipeliner pass to make it more generic. The main change
is that the pipeliner is now broken into two pieces: one that
calculates a modulo schedule and creates async ops based on the IR,
and an expander that generates the pipelined IR based on the modulo
schedule.
The advantage of separating the two pieces is that it lets us create
different schedules without having to change the expander, and it
allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches
the schedule picked by the previous pipeliner, in order to avoid changes.

This also creates a different sequence of insert/extract slice ops for
the alloc. We should probably change shared alloc to use memory semantics.
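
As a rough illustration of the split (plain Python, not the actual MLIR pass; op names and the stage assignment are assumptions): phase one assigns each op a stage, phase two mechanically expands that schedule into pipelined steps and never needs to change when the schedule does.

```
# Illustrative only: not the pass's real data structures.
def modulo_schedule(ops, num_stages):
    # e.g. issue loads in stage 0 and compute in the last stage
    return {op: (0 if op.startswith("load") else num_stages - 1) for op in ops}

def expand(ops, schedule, num_stages, num_iters):
    # The expander only reads the schedule; a different schedule
    # requires no changes here.
    steps = []
    for i in range(num_iters + num_stages - 1):
        steps.append([(op, i - schedule[op]) for op in ops
                      if 0 <= i - schedule[op] < num_iters])
    return steps

ops = ["load_a", "load_b", "dot"]
for step in expand(ops, modulo_schedule(ops, 2), 2, 3):
    print(step)  # loads for iteration i run one step ahead of dot
```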
2023-11-02 09:56:39 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
jsh-20
fc5d7e6e7c [FRONTEND] Improve grid calculation for persistent kernels to hoist perf on problems that need few blocks (#2283)

Constrain the number of launched blocks to exactly what a persistent
warp-specialized kernel needs. This is useful when a problem needs very
few blocks, e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve a ~16% speedup.
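
A minimal sketch of the idea (hypothetical helper, not the PR's code): cap the grid at the number of output tiles instead of always launching one block per SM.

```
def cdiv(a, b):
    return (a + b - 1) // b

def persistent_grid(M, N, BLOCK_M, BLOCK_N, num_sms):
    # Launch no more blocks than there are output tiles; extra blocks
    # in a persistent kernel would spin with no tile to work on.
    num_tiles = cdiv(M, BLOCK_M) * cdiv(N, BLOCK_N)
    return min(num_sms, num_tiles)

# For the example above: an 800x800 output with 128x128 tiles is only
# 7 * 7 = 49 tiles, far fewer than the SM count of a modern GPU.
print(persistent_grid(800, 800, 128, 128, 108))  # 49
```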
2023-09-12 09:14:47 +00:00
jon-chuang
5231d57c71 [TESTS] replace deprecated torch.testing.assert_allclose (#2250)
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would fail
due to overly strict atol/rtol.

To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.

See: https://github.com/pytorch/pytorch/issues/61844

Replace: https://github.com/openai/triton/pull/2242
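
The replacement is mechanical; a minimal example (tensor values are made up):

```
import torch

actual = torch.tensor([1.0, 2.0], dtype=torch.float16)
expected = actual * (1 + 1e-4)  # noise below fp16's default tolerance

# Deprecated, with one fixed default tolerance:
#   torch.testing.assert_allclose(actual, expected)
# Non-deprecated replacement; default rtol/atol are chosen per dtype:
torch.testing.assert_close(actual, expected)
```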
2023-09-11 15:31:17 -04:00
goostavz
1465b573e8 [TESTS][HOPPER] Prune hopper tests to speedup CI (#2193)
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-27 20:45:23 -07:00
Beal Wang
7e5cd95bf2 [OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146)
For a warp-specialized persistent kernel, the instruction sequences for
the two warp groups are
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
This would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K is smaller than the K tile shape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will then always be 0, and the kernel may form a
W0 -> W1 -> R0 -> R1 order, which contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
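
A minimal sketch reproducing the indexing from the pseudocode above (parameter values are made up) shows how the synchronization slots collapse in the degenerate case:

```
def buffer_ids(num_waves, num_inner_loop_steps, num_k_tiles):
    ids = []
    for wave in range(num_waves):
        idx = wave * num_inner_loop_steps
        for _ in range(num_k_tiles):
            ids.append(idx)
            idx += 1
    return ids

print(buffer_ids(2, 4, 4))  # [0..7]: each wave gets its own slots
print(buffer_ids(2, 0, 1))  # [0, 0]: every wave reuses slot 0, so waves race
```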
2023-08-21 14:46:57 +08:00
jsh-20
9055af1a5d Update test_user_defined_persistent_warp_specialized_gemm for num-CTA > 1 (#2101)
- Remove auto-tuning for
test_user_defined_persistent_warp_specialized_gemm.
- Remove unnecessary perf evaluation parts.
- Add test cases with num-CTA > 1 for
test_user_defined_persistent_warp_specialized_gemm.
2023-08-14 08:51:35 +00:00
Beal Wang
d1ce4c4950 [TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075)
Remove unnecessary skips. Decompose the UTs in
persistent-warp-specialized-gemm into vintage and stylish variants.
2023-08-10 06:57:04 +00:00
allatit23
8a610f7cf7 [HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066) 2023-08-09 09:07:56 +00:00
Beal Wang
de47bba07d [OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057)
Co-authored-by: Biao Wang <biaow@nvidia.com>
2023-08-09 16:42:01 +08:00
allatit23
6d98a0899f [HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063) 2023-08-09 15:45:44 +08:00
allatit23
6dee55c912 [HOPPER][WS] fix TMA store hang in ws mode (#2056) 2023-08-08 19:53:52 +08:00
ben-zhang-609
2a95d9bf0d [Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-08 16:54:21 +08:00
allatit23
11cf334730 [hopper][ws] use per-agent thread idx by default (#2054)
Co-authored-by: Allen Zhao <allzhao@nvidia.com>
2023-08-08 15:28:10 +08:00
goostavz
b525880d8b [Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts in hopper tests (#2041)
Co-authored-by: goostavz <gzhu@nvidia.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>
2023-08-08 06:30:04 +00:00
ben-zhang-609
31e79aa384 [TESTS] remove get_proper_err, get_variant_golden (#2039)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 22:52:55 -07:00
Qingyi Liu
341f5b61be [BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040)
Make sure that other threads within the CTA do not operate on the
mbarrier until it has been initialized by thread 0.
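
The same init-then-sync pattern, sketched with Python threading purely as an analogy (the real fix inserts a BarrierOp into the generated IR):

```
import threading

NUM_THREADS = 4
init_done = threading.Barrier(NUM_THREADS)  # plays the role of the BarrierOp
shared = {}                                 # plays the role of shared memory

def worker(tid):
    if tid == 0:
        shared["mbarrier"] = threading.Semaphore(0)  # "thread 0 initializes"
    init_done.wait()  # no thread touches the object before it exists
    assert "mbarrier" in shared

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
```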

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 20:11:00 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. They are recommended for a trial when version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00