github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Jason Furmanek	39e8901d7a	ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp fix merge error fix dot fix make_range additional fix	2023-11-07 04:29:38 +00:00
Jason Furmanek	c3132eeda8	ROCM IFU: Third-party fixes: preload ROCDL Additional 3rd-party fix Remove redundant mfma_supported defines	2023-11-07 03:29:16 +00:00
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts fix more conflicts Resolve merge conflicts Some more build and conflict fixes Resolve conflicts for 06-fused-attention.py resolve merge conflicts for the tutorial group gemm example Fixes for some LIT tests resolve remaining conflicts in tests Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
oplavsic	c65f1e6211	Add OptimizeEpilogue pass. (#346 ) * optimize_epilogue * Add config * Remove licenses * Comment out Hopper specific parameters when printing out configs * Add benchmark parameters from flash-attention repo * Add Z and H in the key of autotuner --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-03 16:46:24 -05:00
Shucai Xiao	cb02a0b346	remove unnecessary arch names (#388 )	2023-11-03 12:04:02 -05:00
Lixun Zhang	a66270165a	Move fa-transV to the new perf-kernels dir (#387 )	2023-11-03 00:09:48 -05:00
Shucai Xiao	79bebc4ffe	fp8 type support (#357 ) * add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton. * add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16` * change flashattention to support fp8 in q/k.	2023-11-02 15:51:23 -05:00
Lixun Zhang	38f9136fc8	Add FA with transV (#385 )	2023-11-02 08:52:33 -05:00
Alexander Efimov	74c5fd46ee	[RemoveLayoutConversions] Fix reduce failed infer type error (#377 ) * [RemoveLayoutConversions] Fix reduce failed infer type error This PR fixes layout propagation algorithm in RemoveLayoutConversions pass. In some cases during rewriteSlice process, reduce operation with multiple outputs rewrites only one output layout, which breaks assumption that both outputs should have same layout. This change is a minimal part of https://github.com/openai/triton/pull/2331 change and small lit test for regression testing. * fix combine test * Fix issue with incorrect inference layout of make_range output result	2023-11-01 13:31:13 -05:00
Alexander Efimov	d62a3ffdbe	[RemoveLayoutConversions] Remove PatternSharedInfo structure (#378 ) This structure is not used anymore after massive refactoring of RemoveLayoutConversion pass in September IFU.	2023-11-01 12:57:35 -05:00
Lixun Zhang	9517d4c256	Tweak matmul tutorial on MI2xx GPU (#376 ) * Tweak matmul tutorial on MI2xx GPU * Add config for 9728 --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-27 10:40:11 -05:00
jayfurmanek	26debc92a0	Merge pull request #363 from ROCmSoftwarePlatform/post_ifu_rebase_empty_kernel_works Third Party Backend Merge	2023-10-26 16:28:00 -05:00
Michael Melesse	1fd9b40f2f	Works as StandAlone and Backend and also Perf is Good This is a combination of 4 commits. Works as StandAlone and Backend Works as StandAlone and Backend This is a combination of 13 commits. Works StandAlone and as Backend This is a combination of 7 commits. backend set default dir with flag move bitcode to backend dir copy backend save empty test work in backendmode enable backend mode when copying to upstream clean up fix failure minimize diff add skip function fix bug with corrupted dwarf exp match num_wraps fix multi threaded test issue move bitcode file out of lib move backend to python/triton/third_party/hip move libhsa backend works again restart ci clean upstream location first before copy match scripts fix new error memoize backend stuff fix bug	2023-10-26 14:27:18 -05:00
Michael Melesse	09ba348f87	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this expose a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving Functionality and Performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-26 08:36:49 -05:00
Michael Melesse	833c9b985f	Backend Dir with Empty Kernel Working This is a combination of 9 commits. Empty Kernel Works rebase minimize diff: add libs move to backend dir match python add includes move everything to backend dir match include and lib create a backend build mode simplify backend	2023-10-26 08:36:49 -05:00
Shucai Xiao	2729ae6c6f	use different int8 mfma instructions on different GPUs. (#368 ) * changes support to choose different int8 instructions * rename an instruction name Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-10-25 19:12:21 -05:00
Alexander Efimov	5a86b46bb1	[MFMA] FP8 and BF8 support (#355 ) * [MFMA] FP8 and BF8 support This PR adds support of fp8 and bf8 in AccelerateMatmul pass and Introduces generation of float8 mfma instructions in ttg to llvm conversion. * add tests * fix tests * review fix: fix variable naming and dot operand promotion. * review comments fixes --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-25 13:27:10 -05:00
Shucai Xiao	8547694665	set correct arch info for unit test (#370 ) * set correct arch info for unit test * address review comments	2023-10-25 13:06:45 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
jayfurmanek	e74bdb1581	Always promote to int32 in commonShflSync (#369 )	2023-10-23 12:27:11 -05:00
Lixun Zhang	f963c04034	Use the same heuristics for mfma type as PR#352 (#366 )	2023-10-18 20:32:44 -05:00
Alexander Efimov	20f316b19a	[MFMA] Switch between MFMA types (#352 ) This PR introduces matrix_instr_nonkdim flag to switch between MFMA 16 and MFMA 32.	2023-10-18 16:57:34 +02:00
Alexander Efimov	4d539d7dae	Add licenses to AMD related files (#351 )	2023-10-16 15:18:01 -05:00
Lixun Zhang	1de859df32	[GEMM] [Tuning] Add `waves_per_eu` to gemm tuning (#362 ) * Add waves_per_eu in the tuning space * Do not allocate tensor on device during kernel compilation step * Add breakdown elapsed time * Parallelize the post-processing step * Parallelize the profile step with --ngpus * Better timing info printout	2023-10-16 13:50:03 -05:00
Lixun Zhang	821e75a2b0	Improve FA fwd kernel with causal=True (#356 ) * Attempt to absorb upstream's changes to improve causal=True * Add autotuner * Optimize for AMD MI250 - add pre_load_v as a tuning parameter - do not define N_CTX as constexpr - perform the second dot before sum - remove qk_scale out of the inner loop - add more configs in the autotuner Note that bwd kernel is disabled for now. This is because we enabled autotuning and grid becomes a function. So ctx.grid[0] no longer works. * Enable bwd kernel	2023-10-12 12:34:27 -05:00
Jack Taylor	6f073a43f6	Remove old ROCM_LIBRARIES set (#360 ) In the last PR I forgot to overwrite the initial setting of ROCM_LIBRARIES causing an error in the wheel building process	2023-10-12 16:01:39 +01:00
Jack Taylor	5d44c60d17	enforce cc=None on PyTorch ROCm (#296 ) * enforce cc=None on ROCm * Comment * Update approach to ignore integer cc values Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>	2023-10-12 10:17:26 +01:00
Shucai Xiao	99fa2e4237	add tutorial group gemm example (#343 ) * [DOCS] Add a tutorial example of grouped gemm (#2326) Co-authored-by: Bin Fan <binf@nvidia.com>	2023-10-11 15:13:17 -05:00
Jack Taylor	47563240f8	PyTorch triton branch synchronisation (#354 ) * Restructure ROCM Library Search Currently there are a handful of ROCM-dependent files which are required for triton to run: the linker (ld.lld), the include files, and multiple hip/hsa shared objects. This change will provide three search areas to find these files, all in the same order. 1. third_party/rocm. This location is within the python/triton directory and is carried over when triton is built. If all necessary files are in this location there will be no need to have ROCM installed at all on the system. 2. $ROCM_PATH environmental variable. If this exists it will override all other locations to find ROCM necessary files. 3. /opt/rocm. The default location for ROCm installations. Finding one here will notify triton that ROCM is installed in this environment. To ease with step 3, a new script scripts/amd/setup_rocm_libs.sh has been added to the repo. Executing this script will cause all necessary ROCM files to be downloaded from their respective packages on repo.radeon.com and installed in third_party/rocm, allowing triton to run without installing the full ROCM stack. setup_rocm_libs.sh takes an env_var ROCM_VERSION if a user wishes to install a ROCM version other than the default (currently 5.4.2). When triton whls are built to support Pytorch, method 3 will be used to stay in sync with PyTorch's approach of bringing along any libraries needed and not requiring ROCM to be installed. 
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702) * Fix default rocm path Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs (cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a) * setup_rocm_libs.sh manylinux refactor (cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191) * Set setup_rocm_libs.sh to be executable (cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013) * Revert to using numbered so files to fix upstream (cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e) * Remove drm script --------- Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2023-10-11 15:30:39 +01:00
Shucai Xiao	d6d1cf2859	add gfx942 to support matrix_core (#358 )	2023-10-10 22:46:24 -05:00
Lixun Zhang	515525d068	[GEMM] Tuning script v2 (#350 ) * [GEMM] Tuning script v2 * Extend tuning space to include BLOCK_SIZE = 256 Check LDS in a more smart way * Added README * Add git branch and commit to the default tuning result filename	2023-10-10 20:49:49 -05:00
Alexander Efimov	7e34c244c2	[Triton] Mfma16 support (#251 ) * [MFMA] Support mfma with NM size 16 This PR adds code emitting MFMA instructions with size 16. * add control over mfma type with MFMA_TYPE=16 env var	2023-10-09 13:59:54 -05:00
oplavsic	e801638b40	Add waves_per_eu as kernel parameter (#319 ) * Add waves_per_eu as kernel parameter * Fix failing tests * Add default value for waves_per_eu for ttgir_to_llir function * Remove aot.py	2023-10-06 12:08:34 -05:00
jayfurmanek	be95edc63f	Merge pull request #347 from ROCmSoftwarePlatform/ifu230908-2 Ifu230908 2	2023-10-05 12:21:50 -05:00
oplavsic	6a173eab8a	Remove redundant fp32->fp16 conversion in FA (#349 )	2023-10-04 14:10:07 -05:00
Shucai Xiao	8049891ff7	fix ifu gemm perf regression (#348 )	2023-10-04 08:45:18 -05:00
Aleksandr Efimov	e6f75d05e3	fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite	2023-10-03 16:13:13 +00:00
Michael Melesse	31fe8aadc5	ROCM IFU: Fix minimize_alloc ROCM IFU: Small fixes	2023-10-03 05:34:44 +00:00
Aleksandr Efimov	88ce3b8985	ROCM IFU: Fix Conversion/AMDGPU/load_store.mlir lit test	2023-10-03 04:31:10 +00:00
Aleksandr Efimov	90a15e449e	ROCM IFU: Fix tritongpu_to_llvm lit test	2023-10-03 04:31:03 +00:00
Michael Melesse	1caef34f8a	ROCM IFU: Fix coalesce.mlir and stream-pipeline.mlir	2023-10-03 04:30:58 +00:00
Michael Melesse	9c7a215fed	ROCM IFU: Fix triton_to_tritongpu.mlir	2023-10-03 04:30:50 +00:00
Shucai Xiao	334c9b5aed	ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input	2023-10-03 04:30:44 +00:00
Lixun Zhang	a41f13adcd	ROCM IFU: Extend input to 32-bit when necessary Note: we'll need to check this later if we can use i8 for some reduction operations	2023-10-03 04:30:37 +00:00
Jason Furmanek	92edee723b	ROCM IFU: Fix getValueLivenessRange	2023-10-03 04:30:28 +00:00
Michael Melesse	28c571ea43	ROCM IFU: Fix test_if	2023-10-03 04:30:22 +00:00
Aleksandr Efimov	8ccc4b0cce	ROCM IFU: Fix layout formatting	2023-10-03 04:30:16 +00:00
Aleksandr Efimov	336c4b5f3c	ROCM IFU: Fix LDS overflow issues in test_dot	2023-10-03 04:30:09 +00:00
wenchenvincent	42a5bf9c7c	ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335 )	2023-10-03 04:30:01 +00:00


2205 Commits
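
Note on commit 47563240f8 (#354): its message describes three search areas for the ROCm files, tried in a fixed order (third_party/rocm inside the triton package, then $ROCM_PATH, then /opt/rocm). A minimal sketch of that lookup follows. It is not the code from the commit. The function name `find_rocm_root` and its parameters are hypothetical, and the sketch only checks that a directory exists, whereas the message says the real search looks for "all necessary files". The message also says $ROCM_PATH "will override all other locations" while listing it second; the sketch follows the listed order.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_rocm_root(
    triton_dir: Path,
    env: Optional[Mapping[str, str]] = None,
    default: Path = Path("/opt/rocm"),
) -> Optional[Path]:
    """Return the first existing ROCm location, or None if none exists.

    Hypothetical helper: the search order is taken from the commit
    message, everything else (names, directory-only check) is assumed.
    """
    env = os.environ if env is None else env

    # 1. Bundled copy shipped inside the triton package.
    candidates = [Path(triton_dir) / "third_party" / "rocm"]

    # 2. Location named by the ROCM_PATH environment variable, if set.
    rocm_path = env.get("ROCM_PATH")
    if rocm_path:
        candidates.append(Path(rocm_path))

    # 3. Default system-wide ROCm install location.
    candidates.append(Path(default))

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Passing `env` and `default` explicitly keeps the lookup testable without touching the real environment or /opt/rocm. Per the commit message, the bundled directory can be populated with scripts/amd/setup_rocm_libs.sh, which reads ROCM_VERSION to select a ROCm version other than the default.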