github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Jason Furmanek	5c87f363e4	Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117 Conflicts: lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp python/setup.py python/test/unit/language/assert_helper.py python/test/unit/operators/test_flash_attention.py python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/language/semantic.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/tutorials/03-matrix-multiplication.py python/tutorials/05-layer-norm.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-17 20:42:12 +00:00
jayfurmanek	e1513b34e1	Merge pull request #395 from ROCmSoftwarePlatform/ifu-231108 Ifu 231108	2023-11-17 14:14:14 -06:00
jayfurmanek	bb56be9d16	Merge branch 'triton-mlir' into ifu-231108	2023-11-16 19:30:43 -06:00
Alexander Efimov	dfb76540b4	[Tutorial] Fix post IFU issues with FA (#398 ) * [Tutorial] Fix post IFU issues with FA * Remove redundant kernels in 06-fused-attention.py * Added README for scripts in perf-kernels dir * Fix bwd kernel --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-17 01:28:49 +00:00
Alexander Efimov	096def0c9b	[Test] Disable mma layout for amd hardware (#384 ) Disable mma layout testing by looking at is_hip instead of wave size. This fixes tests on Navi GPUs with wave size == 32.	2023-11-17 01:28:49 +00:00
Lixun Zhang	181bdbd410	Benchmark FA on 2 GCDs (#393 )	2023-11-17 01:28:49 +00:00
Jason Furmanek	44b155f41b	ROCM IFU: Resolve merge conflicts in tutorial 06 Resolve merge conflicts in tutorial 06 - 2	2023-11-17 01:28:40 +00:00
Ognjen	9f3d6656a7	ROCM IFU: Fix reduce_slice lit test Skip tritongpu_to_llvm_hopper test as it is nvidia specific	2023-11-17 01:28:33 +00:00
Ognjen	38fbb7e472	ROCM IFU: Enable slice layout for insertSliceAsync AMD path Fix basic_insert_slice_async_1d lit test Remove code added for debugging Return hopper test	2023-11-17 01:27:57 +00:00
Alexander Efimov	5b06b168aa	[Tutorial] Fix post IFU issues with FA (#398 ) * [Tutorial] Fix post IFU issues with FA * Remove redundant kernels in 06-fused-attention.py * Added README for scripts in perf-kernels dir * Fix bwd kernel --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-14 10:46:45 -06:00
Alexander Efimov	9941ce7aa5	[Test] Disable mma layout for amd hardware (#384 ) Disable mma layout testing by looking at is_hip instead of wave size. This fixes tests on Navi GPUs with wave size == 32.	2023-11-13 17:52:35 +01:00
Jason Furmanek	484852876e	Resolve merge conflicts; AMD adjustments for new LLVM version	2023-11-09 19:00:49 +00:00
Jason Furmanek	977d5aa267	Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108 Conflicts: bin/triton-translate.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/triton/compiler/compiler.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-08 18:51:23 +00:00
Lixun Zhang	d4eda83b33	Benchmark FA on 2 GCDs (#393 )	2023-11-08 12:42:54 -06:00
Lixun Zhang	1af893d8a2	[FRONTEND] Add input dtypes to autotuning key (#2534 ) (#374 ) * [FRONTEND] Add input dtypes to autotuning key (#2534) * Fix conflict in 06-fused-attention * Fix get_best_config in FA-transV.py * Fix leftover get_best_config() --------- Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>	2023-11-07 19:36:57 -06:00
jayfurmanek	3c1fe617c1	Merge pull request #382 from ROCmSoftwarePlatform/ifu231005-rebase Ifu231005	2023-11-06 22:55:25 -06:00
Jason Furmanek	85216ea5c5	ROCM IFU: Resolve conflicts in FA tutorial	2023-11-07 04:29:45 +00:00
oplavsic	502525ff11	ROCM IFU: Fix ScanOp test by implementing idx ShflKind in commonShflSync (#391 ) * Fix ScanOp test * Remove commented-out code --------- Co-authored-by: Ognjen <oplavsic@luxoft.com> Fix typo in commonShflSync	2023-11-07 04:29:45 +00:00
Alexander Efimov	aefc94bd25	ROCM IFU: fix test_dot_mfma_vector_load test fix for previous commit	2023-11-07 04:29:45 +00:00
Alexander Efimov	8bc417b9b7	do not emit nvidia inline asm	2023-11-07 04:29:44 +00:00
Jason Furmanek	39e8901d7a	ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp fix merge error fix dot fix make_range additional fix	2023-11-07 04:29:38 +00:00
Jason Furmanek	c3132eeda8	ROCM IFU: Third-party fixes: preload ROCDL Additional 3rd-party fix Remove redundant mfma_supported defines	2023-11-07 03:29:16 +00:00
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts fix more conflicts Resolve merge conflicts Some more build and conflict fixes Resolve conflicts for 06-fused-attention.py resolve merge conflicts for the tutorial group gemm example Fixes for some LIT tests resolve remaining conflicts in tests Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
oplavsic	c65f1e6211	Add OptimizeEpilogue pass. (#346 ) * optimize_epilogue * Add config * Remove licenses * Comment out Hopper specific parameters when printing out configs * Add benchmark parameters from flash-attention repo * Add Z and H in the key of autotuner --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-03 16:46:24 -05:00
Thomas Raoux	cb3d79a185	[BACKEND] Prevent emitting multiple dot_wait after pipelined loop (#2598) Patch based on @donproc findings and suggested optimization. Emitting multiple wait ops may confuse ptxas and cause it to fall back to a conservative mode.	2023-11-03 14:29:50 -07:00
Weixing Zhang	34b89a1173	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581 ) In current implementation, warpsPerCTA is always set to [numWarps, 1] for 2 tt.dot fusion scenario. But, it is not optimal for cases such that tt.dot doesn't have enough parallelism on row dimension but on column dimension.	2023-11-03 16:40:03 -04:00
Thomas Raoux	6ac9d51ff0	[OPTIMIZATION] Enable pipelining for bwd flash attention (#2590) This allows pipelining when a load is used by multiple dots in a loop. Relax the condition to pipeline dot operands for mma v3 case. This improves performance for the bwd pass from 260TF to 275TF. However, this exposes a performance problem due to the wmma pipelining as ptxas will now fall back to serial wgmma. A follow-up PR will fix a bug in how we emit wgmma_wait during pipelining and will bring performance to 335TF	2023-11-03 11:46:51 -07:00
Shucai Xiao	cb02a0b346	remove unnecessary arch names (#388 )	2023-11-03 12:04:02 -05:00
Lixun Zhang	a66270165a	Move fa-transV to the new perf-kernels dir (#387 )	2023-11-03 00:09:48 -05:00
Justin Lebar	df08301e76	Reformat Python code with yapf. (#2589) I've added an option to yapf to do what we want for long lines, see https://github.com/google/yapf/pull/1177. We can now have a real Python formatter, yay! To make this PR, I ran my modified yapf over the repository, then looked over the full diff. Where yapf was mangling the param list of long function decls/calls (mostly kernels), I manually added `#` to put linebreaks where we want. I fixed up other formatting too -- mostly adding or removing a trailing comma from lists. Overall, trailing `#` was sufficient to get formatting similar to our current code. I didn't have to disable yapf anywhere. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-11-02 20:44:17 -07:00
Shucai Xiao	79bebc4ffe	fp8 type support (#357 ) * add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton. * add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16` * change flashattention to support fp8 in q/k.	2023-11-02 15:51:23 -05:00
Thomas Raoux	dced22c4b7	[BACKEND] Remove workaround for NVPTX bug after LLVM upgrade (#2585 ) This was a workaround for a bug exposed in test_core when generating ext_load in NVPTX. The backend bug seems fixed in latest LLVM upgrade so removing the workaround.	2023-11-02 17:31:58 +00:00
jeromeku	37cd3d5339	[FIX][AOT Compiler] Fix No Specialization Edge Case (#2584 ) The [hints dispatching](`218492cd65/python/triton/tools/link.py (L161)`) logic currently fails for the edge case where a single kernel with no specializations is to be linked in the [AOT compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py). Since the dispatcher inserts a conditional branch for each specialization case, this results in an `if ()` being inserted into the `C` source, which clearly breaks downstream artifacts. Fix: - Added simple check for this edge case - Added unit test that mirrors the existing [`test_compile_link_matmul`](`218492cd65/python/test/unit/tools/test_aot.py (L224)`) test case save for the aforementioned condition.	2023-11-02 10:02:41 -07:00
Thomas Raoux	ca8f110617	[BACKEND] Pipeliner refactoring (#2565) Refactor the pipeliner pass in order to make it more generic. The main change is that the pipeliner is now broken into 2 pieces: one that calculates a modulo schedule and creates async ops based on the IR, and an expander that generates the pipelined IR based on the modulo schedule. The advantage of separating the two pieces is that it will allow us to create different schedules without having to change the expander, and it will allow for more complex schedules. For now the schedule generated for the matmul case roughly matches the schedule picked by the previous pipeliner in order to avoid changes. This also creates a different sequence of insert/extract slice for the alloc. We should probably change shared alloc to use memory semantic.	2023-11-02 09:56:39 -07:00
Lixun Zhang	38f9136fc8	Add FA with transV (#385 )	2023-11-02 08:52:33 -05:00
Thomas Raoux	218492cd65	[BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583 )	2023-11-02 05:32:16 +00:00
Dongdong Li	d0098da7b1	[BACKEND] Add error reporting to report non-kernel-argument (#2552 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-11-01 20:22:10 -04:00
Vedant Roy	702cde0d6f	[FRONTEND] Implement ternary operator for dynamic values (#2560 )	2023-11-01 20:21:32 -04:00
Alexander Efimov	74c5fd46ee	[RemoveLayoutConversions] Fix reduce failed infer type error (#377 ) * [RemoveLayoutConversions] Fix reduce failed infer type error This PR fixes layout propagation algorithm in RemoveLayoutConversions pass. In some cases during rewriteSlice process, reduce operation with multiple outputs rewrites only one output layout, which breaks assumption that both outputs should have same layout. This change is a minimal part of https://github.com/openai/triton/pull/2331 change and small lit test for regression testing. * fix combine test * Fix issue with incorrect inference layout of make_range output result	2023-11-01 13:31:13 -05:00
Alexander Efimov	d62a3ffdbe	[RemoveLayoutConversions] Remove PatternSharedInfo structure (#378 ) This structure is not used anymore after massive refactoring of RemoveLayoutConversion pass in September IFU.	2023-11-01 12:57:35 -05:00
Chenggang Zhao	e7fdfd76fb	[FRONTEND] Add value restoration for autotuner (#2549 ) For in-place kernels, neither `reset_to_zero` nor `Config.prehook` provided in the autotuner can restore the values changed during the tuning process, so I propose a recovery mechanism here. --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-31 21:37:44 -04:00
Zahi Moudallal	3650213218	[OPTIMIZER] Thread local reduction optimization (#2542 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-31 16:13:36 -07:00
Justin Lebar	258399c114	Enable ruff linter instead of flake8 (#2574 ) [FRONTEND] Enable ruff linter instead of flake8. This fixes a few issues automatically, and also flagged two issues to fix manually in test_core.py: We had two duplicate function names! One of these function bodies was a duplicate, so I deleted it. The other function body was not a duplicate, so I gave it a new name. AIUI all of these errors should have been picked up by flake8. I'm confused why it wasn't working. Anyway this is working, and it's faster than flake8, so it seems like an improvement in all dimensions.	2023-10-31 21:28:24 +00:00
Goran Flegar	601b95cdbb	[DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570 ) Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>	2023-10-31 12:06:25 -07:00
Zahi Moudallal	943330790a	[FRONTEND] add do_not_specialize property back to JITFunction (#2573 )	2023-10-31 12:02:45 -07:00
Nhat Nguyen	0cf3a67f04	[BUILD] Disable W503 in pyproject.toml (#2575 ) This PR https://github.com/openai/triton/pull/2555 disabled `W503` (means line breaks can now occur before a binary operator). The change surprisingly didn't take any effect nor required any style changes in `triton` main `pre-commit` stage. But our `triton-shared` [pipeline run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821) (see `Check pre-commit` stage) picked this up correctly and complained about formatting issues. I'm not entirely sure what could be the cause for such difference, but if we also disable `W503` in `pyproject.toml` then the rule is picked up correctly.	2023-10-31 11:57:02 -07:00
daemondzh	96cf8f979a	[OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545 ) Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>	2023-10-31 02:28:27 +00:00
Justin Lebar	29a9245559	[BUILD] use clang+lld in CI builds. (#2564 ) Use clang+lld in CI builds. This is significantly faster.	2023-10-30 19:19:27 -07:00
Chris Jones	2398b82f18	[FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-30 19:18:27 -07:00

2322 Commits