Commit Graph

1359 Commits

Author SHA1 Message Date
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
Philippe Tillet
a0025cfc44 [FRONTEND] add missing implicit constexpr conversion in dot (#2427) 2023-10-01 16:07:50 -07:00
Philippe Tillet
533efd0cac [FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422)
Clipping float8e4b15 to +-1.875 is a bad idea because these values are
represented as 0x7f and 0xff, which are +-NaN on H100 for float8e4nv.
We lose two values, but this will make compatibility with float8e4nv far
less painful (it will just be a matter of adjusting the bias).
2023-09-29 23:33:28 -07:00
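The reasoning in 533efd0cac can be checked by decoding the bit patterns by hand. The sketch below assumes float8e4b15 is a 1-4-3 (sign, exponent, mantissa) layout with exponent bias 15 and no reserved encodings, and that float8e4nv treats "all exponent and mantissa bits set" as NaN. `decode_e4m3` and `is_nan_e4nv` are hypothetical helper names used for illustration, not Triton functions.

```python
def decode_e4m3(byte: int, bias: int) -> float:
    """Decode a 1-4-3 float (sign, exponent, mantissa) with the given bias.

    Assumption: every bit pattern is a finite value (no NaN/Inf encodings).
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        # Subnormal: no implicit leading one.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - bias)


def is_nan_e4nv(byte: int) -> bool:
    """Assumed float8e4nv rule: exponent and mantissa bits all set means NaN."""
    return (byte & 0x7F) == 0x7F


# Old clipping bounds: the largest-magnitude bit patterns.
old_max = decode_e4m3(0x7F, bias=15)
old_min = decode_e4m3(0xFF, bias=15)

# New clipping bounds: one mantissa step below the old ones.
new_max = decode_e4m3(0x7E, bias=15)
new_min = decode_e4m3(0xFE, bias=15)
```

Under these assumptions the old bounds decode to +-1.875 and land on the two bit patterns that `is_nan_e4nv` flags, while the new bounds decode to +-1.75 and do not.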
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0

* updated configs
2023-09-29 20:20:55 -05:00
Hongtao Yu
e0edb70f78 [BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415) 2023-09-29 17:29:41 -07:00
Keren Zhou
e284112818 Revert "[TUTORIALS] Remove unneeded quantiles parameter (#2408)" (#2419)
This reverts commit 99af23f6f4.

`quantiles` shouldn't be the problem. The documentation workflow failed
because of other issues.
2023-09-29 14:24:50 -07:00
Keren Zhou
f2f5f1d457 [TUTORIALS] Add missing docstrings (#2420)
Depend on https://github.com/openai/triton/pull/2419 to fix the
documentation workflow
2023-09-29 14:24:30 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
evelynmitchell
99af23f6f4 [TUTORIALS] Remove unneeded quantiles parameter (#2408)
The fix is to remove the quantiles parameter in both the triton and
torch calls for the benchmark.
2023-09-28 13:48:38 -04:00
Thomas Raoux
721bdebee1 [OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411)
Support chains of mma ops with mixed sizes.
Serialize the different block calculations in backward attention to
work around a problem with ptxas and wgmma.
2023-09-28 10:29:08 -07:00
Simon Boehm
b25edc139e [FRONTEND] fix out_path parsing in AOT compiler (#2409)
`out_path.with_suffix` (penultimate line) fails if out_path is a string.
2023-09-27 22:15:17 -07:00
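The failure described in b25edc139e comes down to `with_suffix` being a `pathlib.Path` method that plain strings do not have. A minimal sketch of the usual remedy follows; `normalize_out_path` is a hypothetical helper, not the code from the PR.

```python
from pathlib import Path


def normalize_out_path(out_path):
    """Accept either a str or a Path and always return a Path."""
    return Path(out_path)


# A plain string has no with_suffix method, which is the reported failure.
string_has_method = hasattr("kernel", "with_suffix")

# After normalization both input types behave the same way.
from_str = normalize_out_path("kernel").with_suffix(".h")
from_path = normalize_out_path(Path("kernel")).with_suffix(".h")
```

Wrapping an existing `Path` in `Path(...)` is harmless, so the helper can be applied unconditionally at the point where the argument is parsed.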
Justin Lebar
9bf9c20f30 [DOCS] update build instructions, and add testing instrs. (#2400)
- Note `wheel` as a build-time dependency.
- Add tips for getting a faster build.
- Add instructions for running tests.
- Add flag to build with ccache.

(Thanks to @ThomasRaoux for most of these instructions!)
2023-09-27 22:13:03 -07:00
Ying Zhang
78c28bf5f6 Support scalar fp8 conversions by packing (#2379)
Support fp8 scalar conversions by packing fp8 with undef values.

Also add simple unittests to cover this change.
2023-09-27 08:29:53 -07:00
Shucai Xiao
6e82aa8dbc support gemm fp8/fp16 mixed input (#333)
* changes to support fp8/fp16 mixed inputs

* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
Philippe Tillet
7432fff4be [FRONTEND] add limited introspection capabilities in tl.extra.cuda ; rename arch into target (#2385) 2023-09-25 23:58:25 -07:00
Philippe Tillet
eea0718445 [TESTING] better cudagraph-based benchmarking (#2394) 2023-09-25 21:41:26 -07:00
ben-zhang-609
d040b58547 [HOPPER] fix ref check failure of flash attention with mma v3 (#2384) 2023-09-25 11:29:49 -07:00
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics "from elements per instruction" to
"elements per thread per instruction" along k axis.
2023-09-25 10:56:44 -05:00
Keren Zhou
57fc6d1f13 [BACKEND] shfl ptx insts should have side effects (#2376)
Otherwise, LLVM passes could generate an unexpected CFG structure and
yield incorrect results.

https://github.com/openai/triton/issues/2361
2023-09-23 10:05:20 -07:00
edimetia3d
cb83b42ed6 [FRONTEND] using closure to create jit launcher (#2289)
Hi,

I'm adding some features to
`triton.runtime.jit.JITFunction_make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.

In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Add a utility method that generates a temporary file to create a
launcher when debugging kernel in main module~~
Using a closure to hold the launcher's body

Because this feature might be useful to others, I have opened this
Pull Request.

~~Tests are yet to be added; if this submission might be accepted, I
will add it later.~~
Since this change is a refactor, no new test was added.
2023-09-22 17:01:54 -07:00
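The debugging problem described in cb83b42ed6 is general to any function built with `exec` from a source string. The sketch below contrasts the two styles with deliberately trivial launchers; `make_launcher_exec` and `make_launcher_closure` are hypothetical names and do not reproduce Triton's launcher.

```python
def make_launcher_exec(arg_names):
    """Build a launcher by exec'ing generated source text."""
    params = ", ".join(arg_names)
    src = f"def launcher({params}):\n    return ({params},)\n"
    scope = {}
    exec(src, scope)  # the function body only exists as a string
    return scope["launcher"]


def make_launcher_closure(arg_names):
    """Build a launcher whose body is ordinary code in this file."""
    expected = len(arg_names)

    def launcher(*args):
        if len(args) != expected:
            raise TypeError(f"expected {expected} arguments, got {len(args)}")
        return tuple(args)

    return launcher


exec_launcher = make_launcher_exec(["x", "y"])
closure_launcher = make_launcher_closure(["x", "y"])

# The exec'd function reports a placeholder file name, so editors and
# debuggers have no source file to open for it.
exec_filename = exec_launcher.__code__.co_filename
```

Both launchers return the same result for the same arguments; the difference is that the closure's body lives in a real source file that a debugger can step through.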
Bin Fan
1724604bd9 [DOCS] Add a tutorial example of grouped gemm (#2326) 2023-09-22 11:16:35 -07:00
q.yao
413b18eb73 [FRONTEND] fix core.dtype.__repr__ (#2372)
`function_type` does not have a `name` field, which leads to an error
when debugging with gdb.
2023-09-22 08:34:20 -07:00
Zahi Moudallal
293b7fd592 [TESTING] cleanup (#2293)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-22 05:37:14 +00:00
Shucai Xiao
10795d8fd3 Fixed a bug related to split_k and prune unnecessary tuning space (#332)
* refine tuning script by adding prune_configs, also fixed a bug in generating tuning configs

* fixed a bug in returning the empty config
2023-09-21 23:47:14 -05:00
Philippe Tillet
c71ec14f31 [TEST] only test 4 configs without TF32 (#2370) 2023-09-21 21:23:19 -07:00
Alexander Zinoviev
d543eb1a36 [BACKEND] implement dot for INT8 on Turing (#2364)
Replace a single
mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32 instruction
that is used on Ampere with 4 x
mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 instructions
for Turing

Extracted the Turing-int8, Turing-fp16 and Ampere to separate functions.

Somehow I messed up with my previous PR, so just open a new one.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-09-21 16:40:53 -07:00
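The "4 x" in d543eb1a36 follows from the tile shapes in the instruction names. The sketch below only does the shape arithmetic, reading `m16n8k32` and `m8n8k16` as (M, N, K) tile sizes; it says nothing about register layouts or how the lowering is actually written. `split_tile` is a hypothetical helper.

```python
from itertools import product

AMPERE_TILE = (16, 8, 32)  # (M, N, K) from m16n8k32
TURING_TILE = (8, 8, 16)   # (M, N, K) from m8n8k16


def split_tile(big, small):
    """Return the (m, n, k) offsets of the small tiles that cover the big tile."""
    if any(b % s for b, s in zip(big, small)):
        raise ValueError("small tile must evenly divide the big tile")
    ranges = [range(0, b, s) for b, s in zip(big, small)]
    return list(product(*ranges))


offsets = split_tile(AMPERE_TILE, TURING_TILE)
# M splits in two (16 / 8), N stays whole (8 / 8), K splits in two (32 / 16),
# so the larger tile is covered by 2 * 1 * 2 = 4 smaller tiles.
```

The two sub-tiles that share an M offset cover different K ranges, so their partial products accumulate into the same output rows.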
Philippe Tillet
32c9d2bb8f [FRONTEND] improved error messages (#2363)
This is a combination of #1774 and #2006, which I cannot edit but which
fail the CI pre-commit hook.
2023-09-21 15:05:57 -07:00
Thomas Raoux
e36c99b588 [BACKEND] Handle scan of function non commutative (#2362)
Make sure we accumulate in the right order for scans so that non
commutative operations are handled correctly.
2023-09-21 12:00:41 -07:00
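Why operand order matters in e36c99b588 can be seen with any associative but non-commutative operation. The sketch below uses string concatenation and a step-doubling (Hillis-Steele style) scan; it is an illustration of the ordering requirement, not Triton's scan lowering, and `parallel_style_scan` is a hypothetical name.

```python
def parallel_style_scan(values, combine, earlier_on_left=True):
    """Inclusive scan using step doubling.

    At each step, element i is combined with the partial result `step`
    positions before it. For a non-commutative `combine`, the earlier
    partial result must be the left operand.
    """
    out = list(values)
    step = 1
    while step < len(out):
        prev = list(out)
        for i in range(step, len(out)):
            if earlier_on_left:
                out[i] = combine(prev[i - step], prev[i])
            else:
                out[i] = combine(prev[i], prev[i - step])
        step *= 2
    return out


def concat(left, right):
    return left + right


letters = ["a", "b", "c", "d"]
correct = parallel_style_scan(letters, concat, earlier_on_left=True)
swapped = parallel_style_scan(letters, concat, earlier_on_left=False)
```

With a commutative operation such as integer addition both orderings agree, which is why this kind of bug only shows up for non-commutative combine functions.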
Ognjen Plavsic
a8574be74d [FA] Fix bug from IFU merge
This commit reverts num_warps back to 4 for
FA forward pass with d=128.
2023-09-21 10:48:41 -05:00
peterbell10
8094f46632 [FRONTEND][BACKEND] Fix various atomic_rmw bugs (#2355)
This fixes a few bugs I've encountered
- `atomic_add` with int64/uint64 `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
2023-09-21 03:31:20 +00:00
ben-zhang-609
bcaf14755a [HOPPER] enable flash attention with tma (#2336) 2023-09-20 14:06:56 -07:00
Thomas Raoux
9cab885dff [BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343)
Also add a test for MMA v3 reduction.
2023-09-20 14:05:12 -07:00
Keren Zhou
ed5a53057d [BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345)
https://github.com/openai/triton/issues/2298
2023-09-20 12:25:52 -04:00
Dongdong Li
e5eda098b3 [TESTS] fix flash attention (#2086)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-09-20 14:23:46 +08:00
Shantanu
8e75e392ae [FRONTEND] Fix Python error handling in launch (#2334)
This was regressed by #2185 because we didn't realise CUDA_CHECK macro
could do Python calls (similar to what led to #2225). I think the
PyErr_Occurred got removed in that PR because there was missing error
handling before the call to _launch, so it looked like it was just in
the wrong place.

It looks like there are also potentially a couple places in cuda.c that
can return with error set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled etc, but those are all
pre-existing and not affected by recent changes.
2023-09-19 00:12:00 -07:00
Philippe Tillet
73dae775df [DOCS] improved fused attention tutorial (bwd pass) (#2332) 2023-09-18 15:07:41 -07:00
Keren Zhou
307b5caa49 [BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330)
1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of
the raw number of warps to avoid communication among warps that handle
the same piece of data.
2. When there's a single warp on the axis, use warp intrinsics for
communication and skip shared memory.

Need a follow up PR for code clean up.
2023-09-18 17:45:05 -04:00
Justin Lebar
2a3746bac5 [BUILD] use ninja (#2318) 2023-09-18 15:30:06 -05:00
Philippe Tillet
894fa9e943 [RUNTIME][INTERPRETER] now also override __str__ method for tensors (#2325) 2023-09-17 16:49:30 -07:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintaining.
2023-09-17 14:58:50 -07:00
Myeonghwan Ahn
2b066000aa [FRONTEND] fix matmul int8 overflow issue (#2297)
Previously in matmul, if the inputs were int8, the output was also int8.
This commit fixes the overflow problem by using an int32 output.
#2296
2023-09-17 16:41:02 +00:00
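The overflow addressed in 2b066000aa is easy to reproduce with plain integers. The sketch below assumes two's-complement wraparound when a result is stored as int8 (the commit does not say whether the old behavior wrapped or saturated); `wrap_int8` and `dot` are hypothetical helpers.

```python
def wrap_int8(x: int) -> int:
    """Reduce an integer to the int8 range [-128, 127] by wraparound (assumed)."""
    return (x + 128) % 256 - 128


def dot(a, b):
    """Exact integer dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Every input fits in int8, but the dot product does not.
a = [100, 100, 100, 100]
b = [100, 100, 100, 100]

exact = dot(a, b)                      # 4 * 100 * 100 = 40000
stored_as_int8 = wrap_int8(exact)      # what an int8 output would hold
fits_in_int32 = -2**31 <= exact < 2**31
```

Even a reduction over four elements exceeds the int8 range here, while an int32 result holds the exact value.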
Stonepia
68e1bd162c [FRONTEND] fix xpu stages logic (#2305) 2023-09-17 09:19:14 -07:00
jon-chuang
4f2d995fad [FRONTEND] Explicitly forbid dot(.., out_dtype=bfloat16) (#2308)
Fixes: https://github.com/openai/triton/issues/2302
2023-09-17 09:15:06 +00:00
Justin Lebar
073aa16379 [BUILD] use ninja (#2318) 2023-09-17 02:08:04 -07:00
Thomas Raoux
31b0c52142 [FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300)
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
2023-09-15 18:42:54 -07:00
Zahi Moudallal
db5c793f82 [FRONTEND] Add sass to asm dict with lazy evaluation (#2309) 2023-09-15 15:31:43 -07:00
Keren Zhou
08c1658957 [FRONTEND] Accommodate new triton IR format (#2294)
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.
2023-09-14 09:03:23 -07:00
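The pointer syntax mentioned in 08c1658957 can be illustrated with a small parser for the textual form. This is a sketch for the simple case shown in the commit message (`!tt.ptr<f32, 1>`), assuming the second field is an integer address space and is optional; `parse_tt_ptr` is a hypothetical helper and is not how the Triton frontend parses IR.

```python
import re

# Element type, then an optional integer address space after a comma.
_TT_PTR_RE = re.compile(r"!tt\.ptr<\s*([^,<>\s]+)\s*(?:,\s*(\d+)\s*)?>")


def parse_tt_ptr(text):
    """Split a simple !tt.ptr<...> type string into (element_type, address_space).

    address_space is None when the type string does not specify one.
    Nested element types such as tensors are deliberately not handled.
    """
    match = _TT_PTR_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a simple tt.ptr type: {text!r}")
    element_type, space = match.group(1), match.group(2)
    return element_type, (int(space) if space is not None else None)


with_space = parse_tt_ptr("!tt.ptr<f32, 1>")
without_space = parse_tt_ptr("!tt.ptr<f32>")
```

Returning `None` for a missing address space keeps the older form distinguishable from an explicit one.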
Alexander Efimov
b25557ad5e [FA] Upstream FA qk initialization (#328)
This PR replaces qk matrix initialization with
upstream version
2023-09-14 00:34:14 -05:00
Zahi Moudallal
36087a108f [FRONTEND] Added SASS to asm dict (#2280) 2023-09-13 21:21:01 +00:00
Khushi Agrawal
c61d772eee [DOCS] add missing docs (#2154) 2023-09-13 19:30:40 +00:00