github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts; fix more conflicts; Resolve merge conflicts; Some more build and conflict fixes; Resolve conflicts for 06-fused-attention.py; resolve merge conflicts for the tutorial group gemm example; Fixes for some LIT tests; resolve remaining conflicts in tests; Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
Jason Furmanek	74fd8e9754	Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2 Conflicts: .gitignore bin/triton-translate.cpp include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir test/TritonGPU/coalesce.mlir unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt	2023-10-02 18:01:04 +00:00
Keren Zhou	57fc6d1f13	[BACKEND] `shfl` ptx insts should have side effects (#2376) Otherwise, an LLVM pass could generate a very odd CFG structure and yield incorrect results. https://github.com/openai/triton/issues/2361	2023-09-23 10:05:20 -07:00
Thomas Raoux	e36c99b588	[BACKEND] Handle scan of non-commutative functions (#2362) Make sure we accumulate in the right order for scans so that non-commutative operations are handled correctly.	2023-09-21 12:00:41 -07:00
Keren Zhou	ed5a53057d	[BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345) https://github.com/openai/triton/issues/2298	2023-09-20 12:25:52 -04:00
Keren Zhou	307b5caa49	[BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330) 1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw number of warps, to avoid communication among warps that handle the same piece of data. 2. When there's a single warp on the axis, use warp intrinsics for communication and skip shared memory. A follow-up PR is needed for code cleanup.	2023-09-18 17:45:05 -04:00
Lixun Zhang	74ea0c87de	Generalize warpSize - We have to change the API of shflUpSync to pass laneId to the rocm implementation of shfl_up - And we also distinguish laneIdAxis and laneId	2023-09-12 10:16:44 -05:00
Lixun Zhang	ed20089bc8	Add shfl_up implementation for AMD backend copied from `f58b93693b/include/hc.hpp (L2879-L2885)`	2023-09-12 10:16:44 -05:00
Lixun Zhang	a5e483652b	[NFC] Remove hard-coded warpSize=32 in scanOp lowering (#2272) - To make the development on AMD GPUs a little easier - Also changed `laneId` to `laneIdAxis` in some helper functions in scanOp lowering	2023-09-11 19:35:18 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the troubleshooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Thomas	2e3182bab7	[BACKEND] Support scan on dimensions other than the fastest-moving one (#1863) This relaxes the restriction in the scan lowering to support layouts where we scan along a dimension that isn't the fastest-moving one. This is done by relaxing how we access elements during scanning and allowing elements to be strided.	2023-06-30 12:40:48 -07:00
Thomas	3be060849a	[FEATURE] Add associative_scan support (#1858) Implement associative_scan in the front end and implement lowering to LLVM for blocked layouts where the scan happens on the fastest-moving dimension. This will later be generalized to support more layouts.	2023-06-29 14:37:51 -07:00

13 Commits
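
A minimal Python sketch of the ordering issue behind the scan commits listed above (#2362 and the `shfl_up` changes). It simulates, on a plain list, an inclusive scan built from shuffle-up steps, and always places the value arriving from the lower lane on the left of the combine. The helper names `shfl_up` and `warp_inclusive_scan` and the list-based lane model are illustrative assumptions, not Triton's lowering code.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def shfl_up(values: List[T], delta: int) -> List[T]:
    """Simulate a shuffle-up: lane i reads the value held by lane i - delta.

    Lanes with i < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values: List[T], combine: Callable[[T, T], T]) -> List[T]:
    """Inclusive scan using log-step shuffle-up rounds (hypothetical helper).

    `combine` must be associative but need not be commutative, so the value
    from the lower lane is always passed as the LEFT operand.
    """
    acc = list(values)
    delta = 1
    while delta < len(acc):
        shifted = shfl_up(acc, delta)
        acc = [combine(shifted[i], acc[i]) if i >= delta else acc[i]
               for i in range(len(acc))]
        delta *= 2
    return acc


# String concatenation is associative but not commutative, so it exposes
# any mistake in operand order.
result = warp_inclusive_scan(list("abcd"), lambda lhs, rhs: lhs + rhs)
print(result)
```

Swapping the operands to `combine(acc[i], shifted[i])` would still give correct results for addition, but would scramble the concatenation example, which is the kind of bug a commutative-only test cannot catch.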