There's no guarantee that `/tmp/triton/*/*.json` existing means
that the corresponding `/tmp/triton/*/*.cubin` file also exists, because `/tmp` makes no guarantee that files persist.
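A minimal sketch of the stricter check this implies, assuming a cache layout where both artifacts share a base path (the function name and layout are illustrative, not the actual cache code):

```python
import os

def cache_hit(base_path: str) -> bool:
    # Only treat the cache entry as valid if every artifact survived /tmp cleanup.
    return all(os.path.exists(base_path + ext) for ext in (".json", ".cubin"))
```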
I noticed that Triton uses the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
that. Moving the version calculation into the backend makes sense to me
architecturally, so that's my approach here. I'm less confident in the
implementation, so please let me know if you have any feedback.
Without this change, a constexpr assignment (i.e. `A = B & C`, where `B`
and `C` are both constexpr) gets assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an `if`
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0`, the assigned value is
also a constexpr; but in this case we do want to assign a triton tensor
to `x`, so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.
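A hedged sketch of the pattern in question (the kernel is illustrative, not a test from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr, B: tl.constexpr, C: tl.constexpr):
    A = B & C  # B and C are constexpr, so A should stay constexpr
    if A:      # previously failed because A had become a triton tensor
        tl.store(out_ptr, 1)
    x = 0               # a plain constant still becomes a triton tensor,
    x = x.to(tl.int64)  # so casts like this keep working
```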
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
By default, ptxas enables fusion of mul/add into fma instructions. The
backend was also unconditionally configured to enable this on
conversion from LLVM IR to PTX. This commit adds an option that can be
used to disable the FP fusion behavior in both locations.
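A usage sketch, assuming the option is exposed as a per-launch flag named `enable_fp_fusion` (the exact name is an assumption based on this description):

```python
import triton
import triton.language as tl

@triton.jit
def axpy(x_ptr, y_ptr, a):
    i = tl.program_id(0)
    # a * x + y is exactly the pattern ptxas would fuse into an fma
    tl.store(y_ptr + i, a * tl.load(x_ptr + i) + tl.load(y_ptr + i))

# Assumed launch syntax with fusion disabled:
# axpy[(n,)](x, y, 2.0, enable_fp_fusion=False)
```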
* enforce cc=None on ROCm
* Comment
* Update approach to ignore integer cc values
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
* Restructure ROCM Library Search
Currently there are a handful of ROCM-dependent files that triton
requires to run: the linker (ld.lld), the include files, and multiple
hip/hsa shared objects.
This change provides three locations to search for these files, always
checked in the same order (a sketch of the lookup follows the list):
1. third_party/rocm. This location is within the python/triton directory
   and is carried over when triton is built. If all necessary files
   are in this location, there is no need to have ROCM installed on the
   system at all.
2. The $ROCM_PATH environment variable. If this is set, it overrides
   all other locations for finding the necessary ROCM files.
3. /opt/rocm. The default location for ROCm installations. Finding one
   here tells triton that ROCM is installed in this environment.
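A sketch of the lookup order described above (the function name and file layout are illustrative, not the actual code):

```python
import os

def find_rocm_dir():
    candidates = [
        os.path.join(os.path.dirname(__file__), "third_party", "rocm"),
        os.environ.get("ROCM_PATH", ""),
        "/opt/rocm",
    ]
    for path in candidates:
        # Use the bundled linker as a marker that the location is usable.
        if path and os.path.exists(os.path.join(path, "llvm", "bin", "ld.lld")):
            return path
    raise RuntimeError("No usable ROCM installation found")
```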
To help populate location 1, a new script, scripts/amd/setup_rocm_libs.sh,
has been added to the repo. Executing this script downloads all necessary
ROCM files from their respective packages on repo.radeon.com
and installs them in third_party/rocm, allowing triton to run without
installing the full ROCM stack. setup_rocm_libs.sh takes an env var,
ROCM_VERSION, if a user wishes to install a ROCM version other than the
default (currently 5.4.2).
When triton wheels are built to support PyTorch, method 1 will be used, to stay in
sync with PyTorch's approach of bringing along any needed libraries rather than
requiring ROCM to be installed.
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)
* Fix default rocm path
Running into `fatal error: hip/hip_runtime.h: No such file or directory` with the latest wheel due to an incorrect directory for the ROCm libs
(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)
* setup_rocm_libs.sh manylinux refactor
(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)
* Set setup_rocm_libs.sh to be executable
(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)
* Revert to using numbered .so files to fix upstream
(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)
* Remove drm script
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
- Move atomic_cas and atomic_xchg to the "atomic ops" section of the
documentation.
- Don't talk about the `cmp` operand for operations which don't have
it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation,
apparently. (They are rendered literally instead of being treated
as formatting commands.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
Hi,
I'm adding some features to
`triton.runtime.jit.JITFunction._make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.
In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Adding a utility method that generates a temporary file to create a
launcher when debugging a kernel in the main module~~
What remains is using a closure to hold the launcher's body, as sketched below.
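A minimal sketch of why the closure helps, using a toy launcher (all names here are illustrative, not the actual JITFunction code):

```python
# Old style: the launcher body only exists as an exec'd string, so editors
# can't show it and debuggers can't step into it.
src = "def launcher(*args):\n    return run(*args)"
scope = {"run": print}
exec(compile(src, "<generated>", "exec"), scope)
launcher_from_exec = scope["launcher"]

# New style: the body lives in this source file, so it is fully debuggable.
def make_launcher(run):
    def launcher(*args):
        return run(*args)
    return launcher

launcher_from_closure = make_launcher(print)
```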
Because this feature might be useful to others, I have opened this
Pull Request.
~~Tests are yet to be added; if this submission is accepted, I
will add them later.~~
Since this change is a refactor, no new test was added.
This fixes a few bugs I've encountered (a repro sketch for the first
follows the list):
- `atomic_add` with int64/uint64 -> `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
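A hedged repro sketch for the int64 case (launch parameters are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one(ptr):
    tl.atomic_add(ptr, 1)  # previously failed to compile for int64/uint64

x = torch.zeros(1, dtype=torch.int64, device="cuda")
add_one[(1,)](x)
assert x.item() == 1
```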
This was regressed by #2185 because we didn't realise the CUDA_CHECK macro
could make Python calls (similar to what led to #2225). I think the
PyErr_Occurred call got removed in that PR because there was missing error
handling before the call to _launch, so it looked like it was simply in
the wrong place.
It looks like there are also potentially a couple of places in cuda.c that
can return with the error indicator set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled, etc., but those are all
pre-existing and not affected by recent changes.
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
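A minimal sketch of the new accumulator argument, assuming it is exposed as an `acc` keyword on `tl.dot` (the keyword name is an assumption based on this description):

```python
import triton
import triton.language as tl

@triton.jit
def dot_accumulate(a, b, acc):
    # Feed the running accumulator into the dot instead of adding afterwards,
    # letting the compiler choose the accumulation precision.
    return tl.dot(a, b, acc=acc)
```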
Low-tech but very useful way to override kernels on the fly. This can be
used for debugging functionality or performance problems: it lets users
dump, modify, and feed IR back into the JIT compiler.
This fixes a few bugs related to scalar tensors (a sketch follows the
list):
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of the 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
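A sketch exercising the fixed cases (the kernel is illustrative, not a test from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr):
    s = tl.full([], 7, tl.int32)  # 0-d scalar; used to raise TypeError
    v = s[None]                   # shape [1]
    m = s[None, None]             # shape [1, 1]
    z = tl.zeros_like(s)          # another 0-d scalar after the fix
    tl.store(out_ptr, s + z)
```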
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It may look like nit-picking, but we'd really
appreciate it if these could be addressed in OSS Triton (to reduce
differences from OSS), and we believe the changes are reasonable in
general. Neither performance nor functionality is affected by this PR.
1. Type cast in `python/triton/runtime/backends/cuda.c`. An implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (as in
`backends/hip.c`).
2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike the other
`DotOpToLLVM/*.cpp` files, the include paths used in `WGMMA.cpp` are not
relative. This is problematic in some compilation settings, since the
compiler then somehow needs to find headers in a parent directory. It'd
be great to use relative paths, like the other source files in Triton.
cc: @yuguo68