oplavsic
e801638b40
Add waves_per_eu as kernel parameter ( #319 )
...
* Add waves_per_eu as kernel parameter
* Fix failing tests
* Add default value for waves_per_eu for ttgir_to_llir function
* Remove aot.py
2023-10-06 12:08:34 -05:00
Hongtao Yu
eed4559df2
[TOOLS] Enable per-pass IR printing in triton-translate ( #2449 )
...
Enabling per-pass IR printing such as `--mlir-print-ir-after-all`
2023-10-05 13:23:46 -07:00
Thomas Raoux
38f184b7cf
[BACKEND] Use native fp8 convert ops when possible ( #2448 )
...
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.
Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
2023-10-05 18:28:58 +00:00
Philippe Tillet
7bc6b99132
[DOCS] Fix FP8E4M3B15 docs ( #2451 )
2023-10-05 10:59:45 -07:00
jayfurmanek
be95edc63f
Merge pull request #347 from ROCmSoftwarePlatform/ifu230908-2
...
Ifu230908 2
2023-10-05 12:21:50 -05:00
Philippe Tillet
e9d9ddd86d
[OPTIMIZER] More V100 conversion removal tweaks ( #2446 )
2023-10-04 16:31:20 -07:00
Philippe Tillet
a7ff4eddae
[OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 ( #2445 )
2023-10-04 14:56:12 -07:00
Thomas Raoux
24560b8152
Better tuning for H100 flash attention. ( #2444 )
...
Improves performance of fwd pass from 420 to 440 TF
2023-10-04 14:43:41 -07:00
Thomas Raoux
5a0170a27c
[BACKEND] Minor removing of unnecessary code and cleanup ( #2443 )
2023-10-04 12:14:08 -07:00
oplavsic
6a173eab8a
Remove redundant fp32->fp16 conversion in FA ( #349 )
2023-10-04 14:10:07 -05:00
Shucai Xiao
8049891ff7
fix ifu gemm perf regression ( #348 )
2023-10-04 08:45:18 -05:00
Justin Lebar
71a8544ce7
Improve docs for atomic and load/store operations. ( #2437 )
...
- Move atomic_cas and atomic_xchg to "atomic ops" section of
documentation.
- Don't talk about the `cmp` operand for operations which don't have
it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation,
apparently. (They are rendered literally, instead of being treated
as a formatting command.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
2023-10-04 04:17:42 +00:00
Christian Sigg
5458014282
[BACKEND] Lower to PTX with trap-unreachable ( #2429 )
...
We've seen cases where the entire kernel is poisoned due to
division-by-zero, resulting in a single `unreachable` instruction at the
LLIR level. Emit this instruction as `trap` (instead of dropping it) so
that the kernel doesn't run successfully without writing any outputs.
2023-10-03 21:05:10 -07:00
Thomas Raoux
c656a139d3
[BACKEND] Fix for FP8 QK inputs in flash attention forward pass ( #2435 )
2023-10-03 21:02:13 -07:00
Zahi Moudallal
0d84a7d70c
[BACKEND] Adding support for slice layout in InsertSliceAsyncOp ( #2438 )
2023-10-03 20:59:53 -07:00
Bin Fan
6b860e7a74
[Backend] fix a bug in lowering ExtractSliceOp from TritonGPU to LLVM ( #2436 )
2023-10-03 21:52:07 -04:00
Aleksandr Efimov
e6f75d05e3
fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite
2023-10-03 16:13:13 +00:00
Thomas Raoux
020f43d5a3
[NFC] Minor clean ups found during LLVM upgrade ( #2433 )
...
Pull some of the changes required for LLVM upgrade to make the upgrade
simpler.
2023-10-03 08:22:46 -07:00
Michael Melesse
31fe8aadc5
ROCM IFU: Fix minimize_alloc
...
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Aleksandr Efimov
88ce3b8985
ROCM IFU: Fix Conversion/AMDGPU/load_store.mlir lit test
2023-10-03 04:31:10 +00:00
Aleksandr Efimov
90a15e449e
ROCM IFU: Fix tritongpu_to_llvm lit test
2023-10-03 04:31:03 +00:00
Michael Melesse
1caef34f8a
ROCM IFU: Fix coalesce.mlir and stream-pipeline.mlir
2023-10-03 04:30:58 +00:00
Michael Melesse
9c7a215fed
ROCM IFU: Fix triton_to_tritongpu.mlir
2023-10-03 04:30:50 +00:00
Shucai Xiao
334c9b5aed
ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input
2023-10-03 04:30:44 +00:00
Lixun Zhang
a41f13adcd
ROCM IFU: Extend input to 32-bit when necessary
...
Note: we'll need to check this later if we can use i8 for some
reduction operations
2023-10-03 04:30:37 +00:00
Jason Furmanek
92edee723b
ROCM IFU: Fix getValueLivenessRange
2023-10-03 04:30:28 +00:00
Michael Melesse
28c571ea43
ROCM IFU: Fix test_if
2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce
ROCM IFU: Fix layout formatting
2023-10-03 04:30:16 +00:00
Aleksandr Efimov
336c4b5f3c
ROCM IFU: Fix LDS overflow issues in test_dot
2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c
ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. ( #335 )
2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090
ROCM IFU: Fix emitOffsetForMfmaLayout function
2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd
ROCM IFU: Fix of dot operand type promotion
...
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c
ROCM IFU: Add new CTALayout parameter to mfma layout
2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae
Initial commit to resolve merge conflicts
...
rename tl.float8e4 to tl.float8e4nv to align with upstream
ROCM IFU: Fix python arch issues
ROCM IFU: Fix kernel launcher
ROCM IFU: Fix merge conflicts
fix debug build
Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754
Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
...
Conflicts:
.gitignore
bin/triton-translate.cpp
include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Conversion/TritonGPUToLLVM/Utility.h
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/IR/Dialect.cpp
lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/test/unit/runtime/test_subproc.py
python/triton/compiler/compiler.py
python/triton/compiler/make_launcher.py
python/triton/language/semantic.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
test/Conversion/triton_to_tritongpu.mlir
test/Conversion/tritongpu_to_llvm.mlir
test/TritonGPU/coalesce.mlir
unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
apgoucher
cd38642ec5
Fix denormal handling in fp8e5 --> bf16 conversion PTX ( #2430 )
2023-10-02 17:26:30 +01:00
Keren Zhou
ac9fa68d18
[BACKEND] Fine-tune SharedMemoryObject definition and fix related problems ( #2428 )
2023-10-01 21:43:05 -07:00
Philippe Tillet
a0025cfc44
[FRONTEND] add missing implicit constexpr conversion in dot ( #2427 )
2023-10-01 16:07:50 -07:00
Tori Baker
97e35b677b
[BACKEND] fix division by 0 pathway ( #2412 )
...
It was possible for multiDimWarpId[1] to be 0, which then gets translated
into a `urem 0, 0` and results in an unreachable when going through
LLVM, an empty kernel, and NaNs. This PR uses ceiling to clamp the
result to be >=1.
chsigg is working on a fix to lower the unreachable in LLVM to a trap
(https://github.com/llvm/llvm-project/pull/67478).
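A minimal sketch of why a ceiling keeps the value at >=1 where a floor does not. The names `extent` and `tile` are hypothetical stand-ins, not the quantities used in the actual lowering code:

```python
def floor_div(a: int, b: int) -> int:
    """Integer division rounding down; yields 0 whenever 0 <= a < b."""
    return a // b


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up; yields at least 1 for any a >= 1, b >= 1."""
    return -(-a // b)


# Hypothetical sizes (assumption for illustration only).
extent, tile = 16, 32

floored = floor_div(extent, tile)  # 0: using this as a modulus divides by zero
ceiled = ceil_div(extent, tile)    # 1: safe to use as a modulus

warp_id = 3
wrapped = warp_id % ceiled  # well defined, because ceiled >= 1
```

With the floored value, the equivalent `warp_id % floored` raises `ZeroDivisionError` in Python. The analogous `urem` by zero at the LLVM level is undefined, which matches the `unreachable` described above.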
2023-09-30 10:53:43 -07:00
Philippe Tillet
98039658d4
[CI] disable pypy wheel (continued) ( #2424 )
...
there's a typo in the previous commit
2023-09-30 00:38:06 -07:00
Philippe Tillet
c4f3afc020
[CI] disable pypy wheel ( #2423 )
...
emitting warnings from C++ code requires "#include pybind11/exec.h"
which is not compatible with pypy. I think using the python interpreter
from C++ is a bad idea in general... but we probably don't care much
about pypy wheels anyway
2023-09-29 23:48:08 -07:00
Philippe Tillet
533efd0cac
[FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 ( #2422 )
...
clipping float8e4b15 to +-1.875 is a bad idea because these are
represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv.
We lose two values but this will make compatibility with float8e4nv way
less painful. (it will just be a matter of adjusting the bias)
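The arithmetic behind this can be checked with a small decoder. This is a sketch that assumes a 1-4-3 sign/exponent/mantissa layout, an exponent bias of 15 with no special values for float8e4b15, and a bias of 7 with the all-ones pattern reserved for NaN for float8e4nv. The helper name `decode_fp8_e4m3` is hypothetical and not part of Triton:

```python
import math


def decode_fp8_e4m3(byte: int, bias: int, nan_on_all_ones: bool = False) -> float:
    """Decode one 1-4-3 (sign, exponent, mantissa) byte with the given bias."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    # Assumed float8e4nv behaviour: exponent and mantissa all ones encode NaN.
    if nan_on_all_ones and exp == 0xF and man == 0x7:
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


# float8e4b15 (assumed bias 15): the old clipping bounds are 0x7f and 0xff.
old_max = decode_fp8_e4m3(0x7F, bias=15)   # 1.875
old_min = decode_fp8_e4m3(0xFF, bias=15)   # -1.875
new_max = decode_fp8_e4m3(0x7E, bias=15)   # 1.75

# float8e4nv (assumed bias 7): the same 0x7f pattern is NaN.
nv_7f = decode_fp8_e4m3(0x7F, bias=7, nan_on_all_ones=True)
nv_7e = decode_fp8_e4m3(0x7E, bias=7, nan_on_all_ones=True)  # 448.0
```

Clipping at +-1.75 keeps every float8e4b15 value on a bit pattern that is also finite in float8e4nv, so the two formats differ only by the exponent bias.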
2023-09-29 23:33:28 -07:00
Ying Zhang
ee013d8978
Fix PTX issues in bf16 / fp8_e4m3 conversion ( #2421 )
...
Fix bugs in https://github.com/openai/triton/pull/2415 . cc @htyu
Previously corresponding tests failed on H100 with latest torch version.
It passed CI because CI doesn't use latest torch, so the tests were
skipped.
2023-09-29 19:36:00 -07:00
SJW
287b0adcc2
[Stream] Fixed bug in stream-pipeline for FA ( #345 )
...
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0
* updated configs
2023-09-29 20:20:55 -05:00
Hongtao Yu
e0edb70f78
[BACKEND] support of Fp8E4M3Nv to Bf16 conversion ( #2415 )
2023-09-29 17:29:41 -07:00
Keren Zhou
e284112818
Revert "[TUTORIALS] Remove unneeded quantiles parameter ( #2408 )" ( #2419 )
...
This reverts commit 99af23f6f4 .
`quantiles` shouldn't be the problem. The documentation workflow failed
because of other issues.
2023-09-29 14:24:50 -07:00
Keren Zhou
f2f5f1d457
[TUTORIALS] Add missing docstrings ( #2420 )
...
Depend on https://github.com/openai/triton/pull/2419 to fix the
documentation workflow
2023-09-29 14:24:30 -07:00
Thomas Raoux
90bef57acf
[BACKEND] turn on MMA V3 by default on Hopper ( #2414 )
2023-09-28 22:45:28 -07:00
Thomas Raoux
d4fae90169
[BACKEND][NFC] Simplify conversion to TritonGPU ( #2416 )
...
Remove ad hoc patterns. This will help LLVM transition.
2023-09-28 13:59:15 -07:00
evelynmitchell
99af23f6f4
[TUTORIALS] Remove unneeded quantiles parameter ( #2408 )
...
The fix is to remove the quantiles parameter in both the triton and
torch calls for the benchmark.
2023-09-28 13:48:38 -04:00