Commit Graph

2386 Commits

Author SHA1 Message Date
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
Hongtao Yu
eed4559df2 [TOOLS] Enable per-pass IR printing in triton-translate (#2449)
Enable per-pass IR printing options such as `--mlir-print-ir-after-all`.
2023-10-05 13:23:46 -07:00
Thomas Raoux
38f184b7cf [BACKEND] Use native fp8 convert ops when possible (#2448)
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.

Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
2023-10-05 18:28:58 +00:00
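As a rough sanity check on the quoted numbers (not from the commit itself), an 8192x8192x512 matmul performs 2*M*N*K ≈ 6.87e10 FLOPs, so the two throughput figures correspond to kernel times of roughly 121 µs and 109 µs:

```python
# Hypothetical back-of-the-envelope check of the quoted TFlops figures.
M, N, K = 8192, 8192, 512
flops = 2 * M * N * K           # one multiply-accumulate counted as 2 FLOPs
t_before_us = flops / 567e12 * 1e6   # microseconds at 567 TFlops
t_after_us = flops / 630e12 * 1e6    # microseconds at 630 TFlops
print(round(t_before_us, 1), round(t_after_us, 1))
```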
Philippe Tillet
7bc6b99132 [DOCS] Fix FP8E4M3B15 docs (#2451) 2023-10-05 10:59:45 -07:00
jayfurmanek
be95edc63f Merge pull request #347 from ROCmSoftwarePlatform/ifu230908-2
Ifu230908 2
2023-10-05 12:21:50 -05:00
Philippe Tillet
e9d9ddd86d [OPTIMIZER] More V100 conversion removal tweaks (#2446) 2023-10-04 16:31:20 -07:00
Philippe Tillet
a7ff4eddae [OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 (#2445) 2023-10-04 14:56:12 -07:00
Thomas Raoux
24560b8152 Better tuning for H100 flash attention. (#2444)
Improves performance of fwd pass from 420 to 440 TF
2023-10-04 14:43:41 -07:00
Thomas Raoux
5a0170a27c [BACKEND] Minor removing of unnecessary code and cleanup (#2443) 2023-10-04 12:14:08 -07:00
oplavsic
6a173eab8a Remove redundant fp32->fp16 conversion in FA (#349) 2023-10-04 14:10:07 -05:00
Shucai Xiao
8049891ff7 fix ifu gemm perf regression (#348) 2023-10-04 08:45:18 -05:00
Justin Lebar
71a8544ce7 Improve docs for atomic and load/store operations. (#2437)
- Move atomic_cas and atomic_xchg to "atomic ops" section of documentation.
- Don't talk about the `cmp` operand for operations which don't have it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation, apparently. (They are rendered literally, instead of being treated as a formatting command.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
2023-10-04 04:17:42 +00:00
Christian Sigg
5458014282 [BACKEND] Lower to PTX with trap-unreachable (#2429)
We've seen cases where the entire kernel is poisoned due to
division-by-zero, resulting in a single `unreachable` instruction at the
LLIR level. Emit this instruction as `trap` (instead of dropping it) so
that the kernel doesn't run successfully without writing any outputs.
2023-10-03 21:05:10 -07:00
Thomas Raoux
c656a139d3 [BACKEND] Fix for FP8 QK inputs in flash attention forward pass (#2435) 2023-10-03 21:02:13 -07:00
Zahi Moudallal
0d84a7d70c [BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438) 2023-10-03 20:59:53 -07:00
Bin Fan
6b860e7a74 [Backend] fix a bug in lowering ExtractSliceOp from TritonGPU to LLVM (#2436) 2023-10-03 21:52:07 -04:00
Aleksandr Efimov
e6f75d05e3 fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite 2023-10-03 16:13:13 +00:00
Thomas Raoux
020f43d5a3 [NFC] Minor clean ups found during LLVM upgrade (#2433)
Pull some of the changes required for LLVM upgrade to make the upgrade
simpler.
2023-10-03 08:22:46 -07:00
Michael Melesse
31fe8aadc5 ROCM IFU: Fix minimize_alloc
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Aleksandr Efimov
88ce3b8985 ROCM IFU: Fix Conversion/AMDGPU/load_store.mlir lit test 2023-10-03 04:31:10 +00:00
Aleksandr Efimov
90a15e449e ROCM IFU: Fix tritongpu_to_llvm lit test 2023-10-03 04:31:03 +00:00
Michael Melesse
1caef34f8a ROCM IFU: Fix coalesce.mlir and stream-pipeline.mlir 2023-10-03 04:30:58 +00:00
Michael Melesse
9c7a215fed ROCM IFU: Fix triton_to_tritongpu.mlir 2023-10-03 04:30:50 +00:00
Shucai Xiao
334c9b5aed ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input 2023-10-03 04:30:44 +00:00
Lixun Zhang
a41f13adcd ROCM IFU: Extend input to 32-bit when necessary
Note: we'll need to check later whether we can use i8 for some
reduction operations
2023-10-03 04:30:37 +00:00
Jason Furmanek
92edee723b ROCM IFU: Fix getValueLivenessRange 2023-10-03 04:30:28 +00:00
Michael Melesse
28c571ea43 ROCM IFU: Fix test_if 2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce ROCM IFU: Fix layout formatting 2023-10-03 04:30:16 +00:00
Aleksandr Efimov
336c4b5f3c ROCM IFU: Fix LDS overflow issues in test_dot 2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090 ROCM IFU: Fix emitOffsetForMfmaLayout function 2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd ROCM IFU: Fix of dot operand type promotion
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c ROCM IFU: Add new CTALayout parameter to mfma layout 2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
apgoucher
cd38642ec5 Fix denormal handling in fp8e5 --> bf16 conversion PTX (#2430) 2023-10-02 17:26:30 +01:00
Keren Zhou
ac9fa68d18 [BACKEND] Fine-tune SharedMemoryObject definition and fix related problems (#2428) 2023-10-01 21:43:05 -07:00
Philippe Tillet
a0025cfc44 [FRONTEND] add missing implicit constexpr conversion in dot (#2427) 2023-10-01 16:07:50 -07:00
Tori Baker
97e35b677b [BACKEND] fix division by 0 pathway (#2412)
It was possible for multiDimWarpId[1] to be 0, which then gets translated
into a `urem 0, 0` and, after going through LLVM, results in an
unreachable instruction, an empty kernel, and NaNs. This PR uses ceiling
division to clamp the result to be >= 1.

chsigg is working on a fix to lower the unreachable in llvm to a trap
(https://github.com/llvm/llvm-project/pull/67478).
2023-09-30 10:53:43 -07:00
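The clamping idea described in the commit can be sketched in plain Python (hypothetical helper name; the actual fix lives in the TritonGPU-to-LLVM lowering):

```python
def ceil_div(a: int, b: int) -> int:
    # Ceiling division: rounds up instead of truncating toward zero.
    return (a + b - 1) // b

# Floor division can yield 0 (e.g. 3 over 4), which would later feed a
# `urem` with a zero divisor. Ceiling division keeps the result >= 1
# whenever a >= 1, avoiding the division-by-zero pathway.
assert 3 // 4 == 0
assert ceil_div(3, 4) == 1
assert ceil_div(8, 4) == 2
```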
Philippe Tillet
98039658d4 [CI] disable pypy wheel (continued) (#2424)
there's a typo in the previous commit
2023-09-30 00:38:06 -07:00
Philippe Tillet
c4f3afc020 [CI] disable pypy wheel (#2423)
Emitting warnings from C++ code requires "#include pybind11/exec.h",
which is not compatible with pypy. I think using the Python interpreter
from C++ is a bad idea in general... but we probably don't care much
about pypy wheels anyway
2023-09-29 23:48:08 -07:00
Philippe Tillet
533efd0cac [FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422)
clipping float8e4b15 to +-1.875 is a bad idea because these are
represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv.
We lose two values but this will make compatibility with float8e4nv way
less painful. (it will just be a matter of adjusting the bias)
2023-09-29 23:33:28 -07:00
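To see why 0x7f is the problem byte, here is a hedged sketch of decoding the float8e4b15 format (1 sign, 4 exponent, 3 mantissa bits, exponent bias 15; the subnormal branch is an assumption): the old clip value 1.875 is exactly 0x7f, while the new +-1.75 is 0x7e, leaving 0x7f/0xff free to mirror float8e4nv's NaN encodings on H100.

```python
def decode_fp8e4b15(byte: int) -> float:
    # 1 sign bit, 4 exponent bits (bias 15), 3 mantissa bits.
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                        # subnormal (assumed encoding)
        return sign * (man / 8.0) * 2.0 ** (1 - 15)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 15)

assert decode_fp8e4b15(0x7F) == 1.875   # old clip value; NaN pattern in float8e4nv
assert decode_fp8e4b15(0x7E) == 1.75    # new clip value
assert decode_fp8e4b15(0xFE) == -1.75
```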
Ying Zhang
ee013d8978 Fix PTX issues in bf16 / fp8_e4m3 conversion (#2421)
Fix bugs in https://github.com/openai/triton/pull/2415. cc @htyu 

Previously corresponding tests failed on H100 with latest torch version.
It passed CI because CI doesn't use latest torch, so the tests were
skipped.
2023-09-29 19:36:00 -07:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0
* updated configs
2023-09-29 20:20:55 -05:00
Hongtao Yu
e0edb70f78 [BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415) 2023-09-29 17:29:41 -07:00
Keren Zhou
e284112818 Revert "[TUTORIALS] Remove unneeded quantiles parameter (#2408)" (#2419)
This reverts commit 99af23f6f4.

`quantiles` shouldn't be the problem. The documentation workflow failed
because of other issues.
2023-09-29 14:24:50 -07:00
Keren Zhou
f2f5f1d457 [TUTORIALS] Add missing docstrings (#2420)
Depend on https://github.com/openai/triton/pull/2419 to fix the
documentation workflow
2023-09-29 14:24:30 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Thomas Raoux
d4fae90169 [BACKEND][NFC] Simplify conversion to TritonGPU (#2416)
Remove ad hoc patterns. This will help LLVM transition.
2023-09-28 13:59:15 -07:00
evelynmitchell
99af23f6f4 [TUTORIALS] Remove unneeded quantiles parameter (#2408)
The fix is to remove the quantiles parameter in both the triton and
torch calls for the benchmark.
2023-09-28 13:48:38 -04:00