github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
oplavsic	c65f1e6211	Add OptimizeEpilogue pass. (#346 ) * optimize_epilogue * Add config * Remove licenses * Comment out Hopper specific parameters when printing out configs * Add benchmark parameters from flash-attention repo * Add Z and H in the key of autotuner --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-03 16:46:24 -05:00
Shucai Xiao	cb02a0b346	remove unnecessary arch names (#388 )	2023-11-03 12:04:02 -05:00
Lixun Zhang	a66270165a	Move fa-transV to the new perf-kernels dir (#387 )	2023-11-03 00:09:48 -05:00
Shucai Xiao	79bebc4ffe	fp8 type support (#357 ) * add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton. * add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16` * change flashattention to support fp8 in q/k.	2023-11-02 15:51:23 -05:00
Lixun Zhang	38f9136fc8	Add FA with transV (#385 )	2023-11-02 08:52:33 -05:00
Lixun Zhang	9517d4c256	Tweak matmul tutorial on MI2xx GPU (#376 ) * Tweak matmul tutorial on MI2xx GPU * Add config for 9728 --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-27 10:40:11 -05:00
Michael Melesse	1fd9b40f2f	Works as StandAlone and Backend and also Perf is Good This is a combination of 4 commits. Works as StandAlone and Backend Works as StandAlone and Backend This is a combination of 13 commits. Works StandAlone and as Backend This is a combination of 7 commits. backend set default dir with flag move bitcode to backend dir copy backend save empty test work in backendmode enable backend mode when copying to upstream clean up fix failure minimize diff add skip function fix bug with corrupted dwarf exp match num_wraps fix multi threaded test issue move bitcode file out of lib move backend to python/triton/third_party/hip move libhsa backend works again restart ci clean upstream location first before copy match scripts fix new error memoize backend stuff fix bug	2023-10-26 14:27:18 -05:00
Michael Melesse	09ba348f87	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this expose a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving Functionality and Performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-26 08:36:49 -05:00
Alexander Efimov	5a86b46bb1	[MFMA] FP8 and BF8 support (#355 ) * [MFMA] FP8 and BF8 support This PR adds support of fp8 and bf8 in AccelerateMatmul pass and Introduces generation of float8 mfma instructions in ttg to llvm conversion. * add tests * fix tests * review fix: fix variable naming and dot operand promotion. * review comments fixes --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-25 13:27:10 -05:00
Shucai Xiao	8547694665	set correct arch info for unit test (#370 ) * set correct arch info for unit test * address review comments	2023-10-25 13:06:45 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Alexander Efimov	20f316b19a	[MFMA] Switch between MFMA types (#352 ) This PR introduces matrix_instr_nonkdim flag to switch between MFMA 16 and MFMA 32.	2023-10-18 16:57:34 +02:00
Lixun Zhang	821e75a2b0	Improve FA fwd kernel with causal=True (#356 ) * Attempt to absorb upstream's changes to improve causal=True * Add autotuner * Optimize for AMD MI250 - add pre_load_v as a tuning parameter - do not define N_CTX as constexpr - perform the second dot before sum - remove qk_scale out of the inner loop - add more configs in the autotuner Note that bwd kernel is disabled for now. This is because we enabled autotuning and grid becomes a function. So ctx.grid[0] no longer works. * Enable bwd kernel	2023-10-12 12:34:27 -05:00
Jack Taylor	5d44c60d17	enforce cc=None on PyTorch ROCm (#296 ) * enforce cc=None on ROCm * Comment * Update approach to ignore integer cc values Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>	2023-10-12 10:17:26 +01:00
Shucai Xiao	99fa2e4237	add tutorial group gemm example (#343 ) * [DOCS] Add a tutorial example of grouped gemm (#2326) Co-authored-by: Bin Fan <binf@nvidia.com>	2023-10-11 15:13:17 -05:00
Jack Taylor	47563240f8	PyTorch triton branch synchronisation (#354 ) * Restructure ROCM Library Search Currently there are a handful of ROCM dependant files which are required for triton to run. The linker(ld.lld), the include files, and multiple hip/hsa shared objects. This change will provide three search areas to find these files. All in the same order. 1. third_party/rocm. This location is within the python/triton directory and is carried over when triton is built. IF all necessary files are in this location there will be no need to have ROCM installed at all on the system. 2. $ROCM_PATH environmental variable. If this exists it will override all other locations to find ROCM necessary files 3. /opt/rocm. The default location for ROCm installations. Finding one here will notify triton that ROCM is installed in this environment To ease with step 3. A new script scripts/amd/setup_rocm_libs.sh has been added to the repo. Executing this script will cause all necessary ROCM files to be downloaded from their respective packages on repo.radeon.com and installed in third_party/rocm. Allowing for triton to run without installing the full ROCM stack. setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user wishes to install a ROCM version other than the default (currently 5.4.2) When triton whls are built to support Pytorch, method 3 will be used to stay in sync with PyTorch's approach of bringing along any libraries needed and not requiring ROCM to be installed. 
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702) * Fix default rocm path Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs (cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a) * setup_rocm_libs.sh manylinux refactor (cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191) * Set setup_rocm_libs.sh to be executable (cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013) * Revert to using numbered so files to fix upstream (cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e) * Remove drm script --------- Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2023-10-11 15:30:39 +01:00
Shucai Xiao	d6d1cf2859	add gfx942 to support matrix_core (#358 )	2023-10-10 22:46:24 -05:00
Alexander Efimov	7e34c244c2	[Triton] Mfma16 support (#251) * [MFMA] Support mfma with NM size 16 This PR adds code emission of MFMA instructions with size 16. * add control over mfma type with MFMA_TYPE=16 env var	2023-10-09 13:59:54 -05:00
oplavsic	e801638b40	Add waves_per_eu as kernel parameter (#319 ) * Add waves_per_eu as kernel parameter * Fix failing tests * Add default value for waves_per_eu for ttgir_to_llir function * Remove aot.py	2023-10-06 12:08:34 -05:00
oplavsic	6a173eab8a	Remove redundant fp32->fp16 conversion in FA (#349 )	2023-10-04 14:10:07 -05:00
Michael Melesse	31fe8aadc5	ROCM IFU: Fix minimize_alloc ROCM IFU: Small fixes	2023-10-03 05:34:44 +00:00
Shucai Xiao	334c9b5aed	ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input	2023-10-03 04:30:44 +00:00
Lixun Zhang	a41f13adcd	ROCM IFU: Extend input to 32-bit when necessary Note: we'll need to check this later if we can use i8 for some reduction operations	2023-10-03 04:30:37 +00:00
Michael Melesse	28c571ea43	ROCM IFU: Fix test_if	2023-10-03 04:30:22 +00:00
Aleksandr Efimov	8ccc4b0cce	ROCM IFU: Fix layout formatting	2023-10-03 04:30:16 +00:00
wenchenvincent	42a5bf9c7c	ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335 )	2023-10-03 04:30:01 +00:00
Jason Furmanek	e5d7bb4fae	Initial commit to resolve merge conflicts rename tl.float8e4 to tl.float8e4nv to align with upstream ROCM IFU: Fix python arch issues ROCM IFU: Fix kernel launcher ROCM IFU: Fix merge conflicts fix debug build Set correct threadsPerCTA	2023-10-03 04:04:26 +00:00
Jason Furmanek	74fd8e9754	Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2 Conflicts: .gitignore bin/triton-translate.cpp include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir test/TritonGPU/coalesce.mlir unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt	2023-10-02 18:01:04 +00:00
Philippe Tillet	a0025cfc44	[FRONTEND] add missing implicit constexpr conversion in `dot` (#2427 )	2023-10-01 16:07:50 -07:00
Philippe Tillet	533efd0cac	[FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422 ) clipping float8e4b15 to +-1.875 is a bad idea because these are represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv. We lose two values but this will make compatibility with float8e4nv way less painful. (it will just be a matter of adjusting the bias)	2023-09-29 23:33:28 -07:00
SJW	287b0adcc2	[Stream] Fixed bug in stream-pipeline for FA (#345 ) * [Stream] Fixed bug in stream-pipeline for FA * updated gemm tutorial for num_stages=0 * * updated configs	2023-09-29 20:20:55 -05:00
Hongtao Yu	e0edb70f78	[BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415 )	2023-09-29 17:29:41 -07:00
Keren Zhou	e284112818	Revert "[TUTORIALS] Remove unneeded quantiles parameter (#2408 )" (#2419 ) This reverts commit `99af23f6f4`. `quantiles` shouldn't be the problem. The documentation workflow failed because of other issues.	2023-09-29 14:24:50 -07:00
Keren Zhou	f2f5f1d457	[TUTORIALS] Add missing docstrings (#2420 ) Depend on https://github.com/openai/triton/pull/2419 to fix the documentation workflow	2023-09-29 14:24:30 -07:00
Thomas Raoux	90bef57acf	[BACKEND] turn on MMA V3 by default on Hopper (#2414 )	2023-09-28 22:45:28 -07:00
evelynmitchell	99af23f6f4	[TUTORIALS] Remove unneeded quantiles parameter (#2408 ) The fix is to remove the quantiles parameter in both the triton and torch calls for the benchmark.	2023-09-28 13:48:38 -04:00
Thomas Raoux	721bdebee1	[OPTIMIZATION] Fix performance for attention backward path with mma v3 (#2411 ) Support having chain of mma with mixed size. Serialize the different block calculation in backward attention to workaround problem with ptxas and wgmma.	2023-09-28 10:29:08 -07:00
Simon Boehm	b25edc139e	[FRONTEND] fix out_path parsing in AOT compiler (#2409 ) `out_path.with_suffix` (penultimate line) fails if out_path is string.	2023-09-27 22:15:17 -07:00
Justin Lebar	9bf9c20f30	[DOCS] update build instructions, and add testing instrs. (#2400 ) - Note `wheel` as a build-time dependency. - Add tips for getting a faster build. - Add instructions for running tests. - Add flag to build with ccache. (Thanks to @ThomasRaoux for most of these instructions!)	2023-09-27 22:13:03 -07:00
Ying Zhang	78c28bf5f6	Support scalar fp8 conversions by packing (#2379 ) Support fp8 scalar conversions by packing fp8 with undef values. Also add simple unittests to cover this change.	2023-09-27 08:29:53 -07:00
Shucai Xiao	6e82aa8dbc	support gemm fp8/fp16 mixed input (#333 ) * changes to support fp8/fp16 mixed inputs * add unit test for fp8/fp16 mixed input for gemm	2023-09-27 08:00:31 -05:00
Philippe Tillet	7432fff4be	[FRONTEND] add limited introspection capabilities in `tl.extra.cuda` ; rename `arch` into `target` (#2385 )	2023-09-25 23:58:25 -07:00
Philippe Tillet	eea0718445	[TESTING] better cudagraph-based benchmarking (#2394 )	2023-09-25 21:41:26 -07:00
ben-zhang-609	d040b58547	[HOPPER] fix ref check failure of flash attention with mma v3 (#2384 )	2023-09-25 11:29:49 -07:00
Aleksandr Efimov	d80cd2d374	[MFMA] Change kWidth parameter semantics This PR changes kWidth semantics "from elements per instruction" to "elements per thread per instruction" along k axis.	2023-09-25 10:56:44 -05:00
Keren Zhou	57fc6d1f13	[BACKEND] `shfl` ptx insts should have side effects (#2376 ) Otherwise, llvm pass could generate very weird structure of CFG and yield incorrect results. https://github.com/openai/triton/issues/2361	2023-09-23 10:05:20 -07:00
edimetia3d	cb83b42ed6	[FRONTEND] using closure to create jit launcher (#2289 ) Hi, I'm adding some features to `triton.runtime.jit.JITFunction_make_launcher` and found it is hard to debug it: 1. The inlined Python code is hard to inspect in my editor. 2. My debugger fails to step into these inlined codes. In response, I've introduced some code to solve these issues. My modifications include: ~~1. Refactoring the launcher's inline Python code, ensuring it only relies on the "self" object.~~ ~~2. Add a utility method that generates a temporary file to create a launcher when debugging kernel in main module~~ Using a closure to hold the launcher's body Because this features might be good to others, I have initiated this Pull Request. ~~Tests are yet to be added; if this submission might be accepted, I will add it later.~~ Since this change is a refactor, no new test was added.	2023-09-22 17:01:54 -07:00
Bin Fan	1724604bd9	[DOCS] Add a tutorial example of grouped gemm (#2326 )	2023-09-22 11:16:35 -07:00
q.yao	413b18eb73	[FRONTEND] fix core.dtype.__repr__ (#2372) `function_type` does not have a `name` field, which leads to an error when debugging with gdb.	2023-09-22 08:34:20 -07:00
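
The clipping change described in commit 533efd0cac can be checked by decoding the bit patterns by hand. The following is a minimal sketch, assuming both formats use 1 sign bit, 4 exponent bits and 3 mantissa bits, with exponent bias 15 for float8e4b15 and bias 7 for float8e4nv. The helper `decode_e4m3` is hypothetical and not part of Triton, and its subnormal handling is an assumption that does not affect the values discussed here.

```python
import math


def decode_e4m3(byte: int, bias: int, all_ones_is_nan: bool = False) -> float:
    """Decode an 8-bit float with 1 sign, 4 exponent and 3 mantissa bits.

    Hypothetical helper for illustration only; not a Triton API.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if all_ones_is_nan and exponent == 0xF and mantissa == 0x7:
        # Assumed float8e4nv behaviour: the all-ones pattern encodes NaN.
        return math.nan
    if exponent == 0:
        # Assumption: IEEE-style subnormals (no implicit leading 1).
        return sign * (mantissa / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - bias)


# float8e4b15: 0x7f / 0xff are the old clipping bounds, 0x7e / 0xfe the new ones.
b15_old_max = decode_e4m3(0x7F, bias=15)
b15_new_max = decode_e4m3(0x7E, bias=15)

# float8e4nv: the same 0x7f pattern is NaN, so 0x7e is the largest finite value.
nv_7f = decode_e4m3(0x7F, bias=7, all_ones_is_nan=True)
nv_7e = decode_e4m3(0x7E, bias=7, all_ones_is_nan=True)

print(b15_old_max, b15_new_max, nv_7f, nv_7e)
```

Under these assumptions, clipping at 0x7e keeps every float8e4b15 value on a bit pattern that is also finite in float8e4nv, which matches the commit's remark that converting between the two becomes a matter of adjusting the bias.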


1337 Commits
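
Commit 47563240f8 describes three places where Triton looks for the ROCm files it needs: `third_party/rocm` inside the Python package, the `$ROCM_PATH` environment variable, and `/opt/rocm`. The lookup could be sketched roughly as below. This is an illustration only, not the code from that commit: the function names, the `package_dir` parameter and the `include/hip/hip_runtime.h` probe path are assumptions, and the sketch follows the numbered order as listed even though the message also says `$ROCM_PATH` overrides the other locations.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_rocm_roots(package_dir: Path) -> List[Path]:
    """Return candidate ROCm roots in the order listed in the commit message."""
    roots = [package_dir / "third_party" / "rocm"]  # 1. bundled copy
    env_root = os.environ.get("ROCM_PATH")
    if env_root:
        roots.append(Path(env_root))                # 2. $ROCM_PATH, if set
    roots.append(Path("/opt/rocm"))                 # 3. default install location
    return roots


def find_rocm_root(roots: Iterable[Path], required: Iterable[str]) -> Optional[Path]:
    """Return the first root that contains every required relative path."""
    required = list(required)
    for root in roots:
        if all((root / rel).exists() for rel in required):
            return root
    return None


if __name__ == "__main__":
    # Hypothetical probe file; the real list of required files is not given here.
    probe = ["include/hip/hip_runtime.h"]
    roots = candidate_rocm_roots(Path("python/triton"))
    print(find_rocm_root(roots, probe))
```

Requiring every file to be present before accepting a root reflects the commit's note that the bundled location is only sufficient if all necessary files are there.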