Commit Graph

2386 Commits

Author SHA1 Message Date
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
Hongtao Yu
eed4559df2 [TOOLS] Enable per-pass IR printing in triton-translate (#2449)
Enable per-pass IR printing options such as `--mlir-print-ir-after-all`.
2023-10-05 13:23:46 -07:00
Thomas Raoux
38f184b7cf [BACKEND] Use native fp8 convert ops when possible (#2448)
On Hopper we can use native fp8 conversion ops that are significantly
more efficient.

Improves epilogue in matmul. 8192x8192x512xf8 goes from 567 TFlops to
630 TFlops (the kernel is highly latency bound but this is a good proxy
for epilogue performance)
2023-10-05 18:28:58 +00:00
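As a rough sanity check on the quoted numbers (not from the commit itself), an 8192x8192x512 matmul performs 2*M*N*K ≈ 6.87e10 FLOPs, so the two throughput figures correspond to kernel times of roughly 121 µs and 109 µs:

```python
# Hypothetical back-of-the-envelope check of the quoted TFlops figures.
M, N, K = 8192, 8192, 512
flops = 2 * M * N * K           # one multiply-accumulate counted as 2 FLOPs
t_before_us = flops / 567e12 * 1e6   # microseconds at 567 TFlops
t_after_us = flops / 630e12 * 1e6    # microseconds at 630 TFlops
print(round(t_before_us, 1), round(t_after_us, 1))
```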
Philippe Tillet
7bc6b99132 [DOCS] Fix FP8E4M3B15 docs (#2451) 2023-10-05 10:59:45 -07:00
jayfurmanek
be95edc63f Merge pull request #347 from ROCmSoftwarePlatform/ifu230908-2
Ifu230908 2
2023-10-05 12:21:50 -05:00
Philippe Tillet
e9d9ddd86d [OPTIMIZER] More V100 conversion removal tweaks (#2446) 2023-10-04 16:31:20 -07:00
Philippe Tillet
a7ff4eddae [OPTIMIZER] hasConvertToMMATransisitiveUse=False for MMAv1 (#2445) 2023-10-04 14:56:12 -07:00
Thomas Raoux
24560b8152 Better tuning for H100 flash attention. (#2444)
Improves performance of fwd pass from 420 to 440 TF
2023-10-04 14:43:41 -07:00
Thomas Raoux
5a0170a27c [BACKEND] Minor removing of unnecessary code and cleanup (#2443) 2023-10-04 12:14:08 -07:00
oplavsic
6a173eab8a Remove redundant fp32->fp16 conversion in FA (#349) 2023-10-04 14:10:07 -05:00
Shucai Xiao
8049891ff7 fix ifu gemm perf regression (#348) 2023-10-04 08:45:18 -05:00
Justin Lebar
71a8544ce7 Improve docs for atomic and load/store operations. (#2437)
- Move atomic_cas and atomic_xchg to "atomic ops" section of documentation.
- Don't talk about the `cmp` operand for operations which don't have it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation, apparently. (They are rendered literally, instead of being treated as a formatting command.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
2023-10-04 04:17:42 +00:00
Christian Sigg
5458014282 [BACKEND] Lower to PTX with trap-unreachable (#2429)
We've seen cases where the entire kernel is poisoned due to
division-by-zero, resulting in a single `unreachable` instruction at the
LLIR level. Emit this instruction as `trap` (instead of dropping it) so
that the kernel doesn't run successfully without writing any outputs.
2023-10-03 21:05:10 -07:00
Thomas Raoux
c656a139d3 [BACKEND] Fix for FP8 QK inputs in flash attention forward pass (#2435) 2023-10-03 21:02:13 -07:00
Zahi Moudallal
0d84a7d70c [BACKEND] Adding support for slice layout in InsertSliceAsyncOp (#2438) 2023-10-03 20:59:53 -07:00
Bin Fan
6b860e7a74 [Backend] fix a bug in lowering ExtractSliceOp from TritonGPU to LLVM (#2436) 2023-10-03 21:52:07 -04:00
Aleksandr Efimov
e6f75d05e3 fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite 2023-10-03 16:13:13 +00:00
Thomas Raoux
020f43d5a3 [NFC] Minor clean ups found during LLVM upgrade (#2433)
Pull some of the changes required for LLVM upgrade to make the upgrade
simpler.
2023-10-03 08:22:46 -07:00
Michael Melesse
31fe8aadc5 ROCM IFU: Fix minimize_alloc
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Aleksandr Efimov
88ce3b8985 ROCM IFU: Fix Conversion/AMDGPU/load_store.mlir lit test 2023-10-03 04:31:10 +00:00
Aleksandr Efimov
90a15e449e ROCM IFU: Fix tritongpu_to_llvm lit test 2023-10-03 04:31:03 +00:00
Michael Melesse
1caef34f8a ROCM IFU: Fix coalesce.mlir and stream-pipeline.mlir 2023-10-03 04:30:58 +00:00
Michael Melesse
9c7a215fed ROCM IFU: Fix triton_to_tritongpu.mlir 2023-10-03 04:30:50 +00:00
Shucai Xiao
334c9b5aed ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input 2023-10-03 04:30:44 +00:00
Lixun Zhang
a41f13adcd ROCM IFU: Extend input to 32-bit when necessary
Note: we'll need to check later whether we can use i8 for some
reduction operations
2023-10-03 04:30:37 +00:00
Jason Furmanek
92edee723b ROCM IFU: Fix getValueLivenessRange 2023-10-03 04:30:28 +00:00
Michael Melesse
28c571ea43 ROCM IFU: Fix test_if 2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce ROCM IFU: Fix layout formatting 2023-10-03 04:30:16 +00:00
Aleksandr Efimov
336c4b5f3c ROCM IFU: Fix LDS overflow issues in test_dot 2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090 ROCM IFU: Fix emitOffsetForMfmaLayout function 2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd ROCM IFU: Fix of dot operand type promotion
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c ROCM IFU: Add new CTALayout parameter to mfma layout 2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
apgoucher
cd38642ec5 Fix denormal handling in fp8e5 --> bf16 conversion PTX (#2430) 2023-10-02 17:26:30 +01:00
Keren Zhou
ac9fa68d18 [BACKEND] Fine-tune SharedMemoryObject definition and fix related problems (#2428) 2023-10-01 21:43:05 -07:00
Philippe Tillet
a0025cfc44 [FRONTEND] add missing implicit constexpr conversion in dot (#2427) 2023-10-01 16:07:50 -07:00
Tori Baker
97e35b677b [BACKEND] fix division by 0 pathway (#2412)
It was possible for multiDimWarpId[1] to be 0, which then gets translated
into a `urem 0, 0` and, after going through LLVM, results in an
unreachable instruction, an empty kernel, and NaNs. This PR uses ceiling
division to clamp the result to be >= 1.

chsigg is working on a fix to lower the unreachable in llvm to a trap
(https://github.com/llvm/llvm-project/pull/67478).
2023-09-30 10:53:43 -07:00
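The clamping idea described in the commit can be sketched in plain Python (hypothetical helper name; the actual fix lives in the TritonGPU-to-LLVM lowering):

```python
def ceil_div(a: int, b: int) -> int:
    # Ceiling division: rounds up instead of truncating toward zero.
    return (a + b - 1) // b

# Floor division can yield 0 (e.g. 3 over 4), which would later feed a
# `urem` with a zero divisor. Ceiling division keeps the result >= 1
# whenever a >= 1, avoiding the division-by-zero pathway.
assert 3 // 4 == 0
assert ceil_div(3, 4) == 1
assert ceil_div(8, 4) == 2
```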
Philippe Tillet
98039658d4 [CI] disable pypy wheel (continued) (#2424)
there's a typo in the previous commit
2023-09-30 00:38:06 -07:00
Philippe Tillet
c4f3afc020 [CI] disable pypy wheel (#2423)
Emitting warnings from C++ code requires "#include pybind11/exec.h",
which is not compatible with pypy. I think using the Python interpreter
from C++ is a bad idea in general... but we probably don't care much
about pypy wheels anyway
2023-09-29 23:48:08 -07:00
Philippe Tillet
533efd0cac [FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422)
clipping float8e4b15 to +-1.875 is a bad idea because these are
represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv.
We lose two values but this will make compatibility with float8e4nv way
less painful. (it will just be a matter of adjusting the bias)
2023-09-29 23:33:28 -07:00
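To see why 0x7f is the problem byte, here is a hedged sketch of decoding the float8e4b15 format (1 sign, 4 exponent, 3 mantissa bits, exponent bias 15; the subnormal branch is an assumption): the old clip value 1.875 is exactly 0x7f, while the new +-1.75 is 0x7e, leaving 0x7f/0xff free to mirror float8e4nv's NaN encodings on H100.

```python
def decode_fp8e4b15(byte: int) -> float:
    # 1 sign bit, 4 exponent bits (bias 15), 3 mantissa bits.
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                        # subnormal (assumed encoding)
        return sign * (man / 8.0) * 2.0 ** (1 - 15)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 15)

assert decode_fp8e4b15(0x7F) == 1.875   # old clip value; NaN pattern in float8e4nv
assert decode_fp8e4b15(0x7E) == 1.75    # new clip value
assert decode_fp8e4b15(0xFE) == -1.75
```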
Ying Zhang
ee013d8978 Fix PTX issues in bf16 / fp8_e4m3 conversion (#2421)
Fix bugs in https://github.com/openai/triton/pull/2415. cc @htyu 

Previously corresponding tests failed on H100 with latest torch version.
It passed CI because CI doesn't use latest torch, so the tests were
skipped.
2023-09-29 19:36:00 -07:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0
* updated configs
2023-09-29 20:20:55 -05:00
Hongtao Yu
e0edb70f78 [BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415) 2023-09-29 17:29:41 -07:00
Keren Zhou
e284112818 Revert "[TUTORIALS] Remove unneeded quantiles parameter (#2408)" (#2419)
This reverts commit 99af23f6f4.

`quantiles` shouldn't be the problem. The documentation workflow failed
because of other issues.
2023-09-29 14:24:50 -07:00
Keren Zhou
f2f5f1d457 [TUTORIALS] Add missing docstrings (#2420)
Depend on https://github.com/openai/triton/pull/2419 to fix the
documentation workflow
2023-09-29 14:24:30 -07:00
Thomas Raoux
90bef57acf [BACKEND] turn on MMA V3 by default on Hopper (#2414) 2023-09-28 22:45:28 -07:00
Thomas Raoux
d4fae90169 [BACKEND][NFC] Simplify conversion to TritonGPU (#2416)
Remove ad hoc patterns. This will help LLVM transition.
2023-09-28 13:59:15 -07:00
evelynmitchell
99af23f6f4 [TUTORIALS] Remove unneeded quantiles parameter (#2408)
The fix is to remove the quantiles parameter in both the triton and
torch calls for the benchmark.
2023-09-28 13:48:38 -04:00