github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Shucai Xiao	8547694665	set correct arch info for unit test (#370 ) * set correct arch info for unit test * address review comments	2023-10-25 13:06:45 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Alexander Efimov	20f316b19a	[MFMA] Switch between MFMA types (#352 ) This PR introduces matrix_instr_nonkdim flag to switch between MFMA 16 and MFMA 32.	2023-10-18 16:57:34 +02:00
Lixun Zhang	821e75a2b0	Improve FA fwd kernel with causal=True (#356 ) * Attempt to absorb upstream's changes to improve causal=True * Add autotuner * Optimize for AMD MI250 - add pre_load_v as a tuning parameter - do not define N_CTX as constexpr - perform the second dot before sum - remove qk_scale out of the inner loop - add more configs in the autotuner Note that bwd kernel is disabled for now. This is because we enabled autotuning and grid becomes a function. So ctx.grid[0] no longer works. * Enable bwd kernel	2023-10-12 12:34:27 -05:00
Jack Taylor	5d44c60d17	enforce cc=None on PyTorch ROCm (#296 ) * enforce cc=None on ROCm * Comment * Update approach to ignore integer cc values Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>	2023-10-12 10:17:26 +01:00
Shucai Xiao	99fa2e4237	add tutorial group gemm example (#343 ) * [DOCS] Add a tutorial example of grouped gemm (#2326) Co-authored-by: Bin Fan <binf@nvidia.com>	2023-10-11 15:13:17 -05:00
Jack Taylor	47563240f8	PyTorch triton branch synchronisation (#354 ) * Restructure ROCM Library Search Currently there are a handful of ROCM dependant files which are required for triton to run. The linker(ld.lld), the include files, and multiple hip/hsa shared objects. This change will provide three search areas to find these files. All in the same order. 1. third_party/rocm. This location is within the python/triton directory and is carried over when triton is built. IF all necessary files are in this location there will be no need to have ROCM installed at all on the system. 2. $ROCM_PATH environmental variable. If this exists it will override all other locations to find ROCM necessary files 3. /opt/rocm. The default location for ROCm installations. Finding one here will notify triton that ROCM is installed in this environment To ease with step 3. A new script scripts/amd/setup_rocm_libs.sh has been added to the repo. Executing this script will cause all necessary ROCM files to be downloaded from their respective packages on repo.radeon.com and installed in third_party/rocm. Allowing for triton to run without installing the full ROCM stack. setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user wishes to install a ROCM version other than the default (currently 5.4.2) When triton whls are built to support Pytorch, method 3 will be used to stay in sync with PyTorch's approach of bringing along any libraries needed and not requiring ROCM to be installed. (cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702) * Fix default rocm path Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs (cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a) * setup_rocm_libs.sh manylinux refactor (cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191) * Set setup_rocm_libs.sh to be executable (cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013) * Revert to using numbered so files to fix upstream (cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e) * Remove drm script --------- Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2023-10-11 15:30:39 +01:00
Shucai Xiao	d6d1cf2859	add gfx942 to support matrix_core (#358 )	2023-10-10 22:46:24 -05:00
Alexander Efimov	7e34c244c2	[Triton] Mfma16 support (#251 ) * [MFAM] Support mfma with NM size 16 This PR code emitting of MFMA instructions with size 16. * add control over mfma type with MFMA_TYPE=16 env var	2023-10-09 13:59:54 -05:00
oplavsic	e801638b40	Add waves_per_eu as kernel parameter (#319 ) * Add waves_per_eu as kernel parameter * Fix failing tests * Add default value for waves_per_eu for ttgir_to_llir function * Remove aot.py	2023-10-06 12:08:34 -05:00
oplavsic	6a173eab8a	Remove redundant fp32->fp16 conversion in FA (#349 )	2023-10-04 14:10:07 -05:00
Michael Melesse	31fe8aadc5	ROCM IFU: Fix minimize_alloc ROCM IFU: Small fixes	2023-10-03 05:34:44 +00:00
Shucai Xiao	334c9b5aed	ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input	2023-10-03 04:30:44 +00:00
Lixun Zhang	a41f13adcd	ROCM IFU: Extend input to 32-bit when necessary Note: we'll need to check this later if we can use i8 for some reduction operations	2023-10-03 04:30:37 +00:00
Michael Melesse	28c571ea43	ROCM IFU: Fix test_if	2023-10-03 04:30:22 +00:00
Aleksandr Efimov	8ccc4b0cce	ROCM IFU: Fix layout formatting	2023-10-03 04:30:16 +00:00
wenchenvincent	42a5bf9c7c	ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335 )	2023-10-03 04:30:01 +00:00
Jason Furmanek	e5d7bb4fae	Initial commit to resolve merge conflicts rename tl.float8e4 to tl.float8e4nv to align with upstream ROCM IFU: Fix python arch issues ROCM IFU: Fix kernel launcher ROCM IFU: Fix merge conflicts fix debug build Set correct threadsPerCTA	2023-10-03 04:04:26 +00:00
Jason Furmanek	74fd8e9754	Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2 Conflicts: .gitignore bin/triton-translate.cpp include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/triton_to_tritongpu.mlir test/Conversion/tritongpu_to_llvm.mlir test/TritonGPU/coalesce.mlir unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt	2023-10-02 18:01:04 +00:00
SJW	287b0adcc2	[Stream] Fixed bug in stream-pipeline for FA (#345 ) * [Stream] Fixed bug in stream-pipeline for FA * updated gemm tutorial for num_stages=0 * * updated configs	2023-09-29 20:20:55 -05:00
Shucai Xiao	6e82aa8dbc	support gemm fp8/fp16 mixed input (#333 ) * changes to support fp8/fp16 mixed inputs * add unit test for fp8/fp16 mixed input for gemm	2023-09-27 08:00:31 -05:00
Aleksandr Efimov	d80cd2d374	[MFMA] Change kWidth parameter semantics This PR changes kWidth semantics "from elements per instruction" to "elements per thread per instruction" along k axis.	2023-09-25 10:56:44 -05:00
Shucai Xiao	10795d8fd3	Fixed a bug related to split_k and prune unnecessary tuning space (#332 ) * refine tuning scrit by adding prune_configs, also fixed a bug in generating tuning configs * fixed a bug in returning the empty config	2023-09-21 23:47:14 -05:00
Ognjen Plavsic	a8574be74d	[FA] Fix bug from IFU merge This commit reverts num_warps back to 4 for FA forward pass with d=128.	2023-09-21 10:48:41 -05:00
Justin Lebar	2a3746bac5	[BUILD] use ninja (#2318 )	2023-09-18 15:30:06 -05:00
Alexander Efimov	b25557ad5e	[FA] Upstream FA qk initialization (#328 ) This PR replaces qk matrix initialization with upstream version	2023-09-14 00:34:14 -05:00
Lixun Zhang	ea397b49aa	Fix the issue when CTA coverage is larger than the tile	2023-09-12 10:16:44 -05:00
Alexander Efimov	a06072f8ff	Fix dangling gpu_has_mfma use (#325 ) * Fix dangling gpu_has_mfma use This PR replaces gpu_has_mfma use with gpu_matrix_core_version * add basic test	2023-09-11 12:31:48 -05:00
Alexander Efimov	6691de65db	[MFMA] Support BFloat16 on MI100 (#295 ) * [MFMA] Support BFloat16 on MI100 This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100. * fix tests, fix mfma encoding comment, fix switch between mfma versions. * replace kDim from mfma layout with kWidth from dotOp layout * rebase fix * fix mfma to dot op shortcut for bfloat16 * fix review comments	2023-09-08 15:08:34 -05:00
SJW	491eb9ddfe	[MLIR] Added tritongpu-stream-pipeline pass (#305 ) * [MLIR] Added tritongpu-stream-pipeline pass - Prologue: Hoist the pipelinable load operations and shared memory store for the ramp up stage - Pipelined Loop: Assemble the loop body minus last iteration - Prefetch next tile from global into regs (while computing from previous) - Non-load loop body - Store next tile into shared mem - Epilogue: Peeled non-load loop body for last iteration * * updated comment	2023-09-07 15:24:59 -05:00
jayfurmanek	83a0958566	Merge pull request #322 from ROCmSoftwarePlatform/f8_and_bf16_conversions Enable fp8 conversions and fix bf16 conversions	2023-09-07 14:38:16 -05:00
Shucai Xiao	fb3f2d6feb	refine gemm tuning scripts (#309 ) * refine the gemm tuning scripts to reduce tuning space and better perf numbers * added code to support tuning in full tuning space * add a function to get best tuning config * refine the matmul tutorial example to print out best tuning config for each input * added even_k to gemm kernel heuristic for better performance * address review comments	2023-09-07 08:09:11 -05:00
Wen Chen	ffc230ebfe	[ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm.	2023-09-06 18:10:54 -05:00
Wen Chen	2d3e38e182	[ROCM] Added ROCm support for int8 to bfloat16 conversion.	2023-09-06 18:10:54 -05:00
Wen Chen	59a40d3f72	[ROCM] Added ROCm support for the conversions of following data types: [float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]	2023-09-06 18:10:54 -05:00
Jason Furmanek	320b1029da	Temporarily disable F8 tests on ROCm	2023-09-01 04:02:14 +00:00
Jason Furmanek	df5c263a19	Fix merge conflicts	2023-09-01 04:01:32 +00:00
Jason Furmanek	3eaeb89d18	Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2 Conflicts: include/triton/Dialect/Triton/Transforms/Passes.h include/triton/Dialect/TritonGPU/IR/Dialect.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/triton/compiler/compiler.py python/triton/ops/flash_attention.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/triton/tools/aot.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir test/Target/tritongpu_to_llvmir.mlir test/Target/tritongpu_to_llvmir_noinline.mlir	2023-09-01 03:25:33 +00:00
Philippe Tillet	ec51552fff	[BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212 )	2023-08-30 14:06:31 -07:00
jon-chuang	9af76e7d5a	[RUNTIME] Fix cache dir (#2196 ) --------- Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-08-29 21:07:16 -04:00
Vinayak Gokhale	9cdf3a58c3	Enable split kernel in bwd pass (#303 ) * Add fwd and bwd v2 Changes are largely from upstream. * Split bwd kernel in dq and dk+dv Only adds the split kernels. They are not enabled yet. * Pull scalar multiplies out of the loop * Enable split kernel for bwd pass * Put back P_SEQ=128 in fwd test Not used for bwd test * Address review comments * Address comments Conditionally set causal/ splitkernel to False for bwd. * Add block pointer semantics to bwd pass This significantly increases perf for bwd, similar to fwd.	2023-08-29 13:51:29 -05:00
Lixun Zhang	b834f42ae4	[autotuner] Add an option to print best_config for each key	2023-08-28 14:45:54 -05:00
goostavz	1465b573e8	[TESTS][HOPPER] Prune hopper tests to speedup CI (#2193 ) Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-27 20:45:23 -07:00
Philippe Tillet	5f448b2f08	[FRONTEND] remove dead libhopper_helpers.bc file (#2190 )	2023-08-26 12:17:17 -07:00
Greg Brockman	a9b8c8c37d	[FRONTEND] drop GIL for launch, and set value=false upon pointer error (#2185 )	2023-08-26 17:07:57 +00:00
Keren Zhou	6e4932cda8	[BACKEND] Fix fma mixed-precision (#2184 ) and expose the allow_tf32 argument to the matmul op @shunting314	2023-08-26 09:49:58 -07:00
Greg Brockman	ab3e8b0dad	[FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188 )	2023-08-26 09:19:34 -07:00
Mohammed Anany	ebfe0ffb29	[FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-08-25 10:28:23 -07:00
Ethan Pronovost	56fee37a0d	[FRONTEND] Fix benchmark plotting (#2177 )	2023-08-24 20:34:04 -07:00
Greg Brockman	64d8df4c69	[FRONTEND] handle errors from launch_enter_hook (#2178 )	2023-08-24 20:32:01 -07:00
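
Commit 47563240f8 above describes a three-location search for the ROCm files that triton needs: third_party/rocm inside the python/triton directory, then $ROCM_PATH, then /opt/rocm. The sketch below illustrates that lookup in the numbered order given in the message. It is not the code from the commit: the function name `find_rocm_root` and its parameters are hypothetical, and a directory-existence check stands in for whatever file checks the real code performs. The message also says $ROCM_PATH "will override all other locations", which does not match the numbered order; this sketch follows the numbered order only.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def find_rocm_root(
    package_dir: Union[str, Path],
    env: Optional[Mapping[str, str]] = None,
    default_root: str = "/opt/rocm",
) -> Optional[Path]:
    """Return the first existing ROCm directory, or None if none is found.

    Hypothetical helper: the search order is taken from the commit
    message, not from the triton source code.
    """
    env = os.environ if env is None else env

    # 1. Files bundled with the package under third_party/rocm.
    candidates = [Path(package_dir) / "third_party" / "rocm"]

    # 2. Location named by the ROCM_PATH environment variable, if set.
    rocm_path = env.get("ROCM_PATH")
    if rocm_path:
        candidates.append(Path(rocm_path))

    # 3. Default system-wide installation directory.
    candidates.append(Path(default_root))

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Calling `find_rocm_root("python/triton")` returns the bundled directory when it exists, so a wheel that ships third_party/rocm would not need a system ROCm installation, which is the situation the commit message describes for PyTorch builds.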

1262 Commits