github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Jason Furmanek	5c87f363e4	Merge commit 'cb3d79a185e40c9d8a579bea07747a8a8d157d52' into ifu-231117 Conflicts: lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Dialect/TritonGPU/IR/Dialect.cpp python/setup.py python/test/unit/language/assert_helper.py python/test/unit/operators/test_flash_attention.py python/test/unit/runtime/test_subproc.py python/triton/compiler/compiler.py python/triton/language/semantic.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/tutorials/03-matrix-multiplication.py python/tutorials/05-layer-norm.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-17 20:42:12 +00:00
Alexander Efimov	dfb76540b4	[Tutorial] Fix post IFU issues with FA (#398 ) * [Tutorial] Fix post IFU issues with FA * Remove redundant kernels in 06-fused-attention.py * Added README for scripts in perf-kernels dir * Fix bwd kernel --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-17 01:28:49 +00:00
Alexander Efimov	096def0c9b	[Test] Disable mma layout for amd hardware (#384 ) Disable mma layout testing by looking at is_hip instead of wave size. This fixes tests on Navi GPUs with wave size == 32.	2023-11-17 01:28:49 +00:00
Lixun Zhang	181bdbd410	Benchmark FA on 2 GCDs (#393 )	2023-11-17 01:28:49 +00:00
Jason Furmanek	44b155f41b	ROCM IFU: Resolve merge conflicts in tutorial 06 Resolve merge conflicts in tutorial 06 - 2	2023-11-17 01:28:40 +00:00
Jason Furmanek	484852876e	Resolve merge conflicts; AMD adjustments for new LLVM version	2023-11-09 19:00:49 +00:00
Jason Furmanek	977d5aa267	Merge commit '721897fcc4f942aa97d2e9ba3787a5e213758177' into ifu-231108 Conflicts: bin/triton-translate.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp python/triton/compiler/compiler.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-08 18:51:23 +00:00
Lixun Zhang	1af893d8a2	[FRONTEND] Add input dtypes to autotuning key (#2534 ) (#374 ) * [FRONTEND] Add input dtypes to autotuning key (#2534) * Fix conflict in 06-fused-attention * Fix get_best_config in FA-transV.py * Fix leftover get_best_config() --------- Co-authored-by: Adnan Akhundov <adnan.akhundov@gmail.com>	2023-11-07 19:36:57 -06:00
Jason Furmanek	85216ea5c5	ROCM IFU: Resolve conflicts in FA tutorial	2023-11-07 04:29:45 +00:00
Alexander Efimov	aefc94bd25	ROCM IFU: fix test_dot_mfma_vector_load test fix for previous commit	2023-11-07 04:29:45 +00:00
Jason Furmanek	39e8901d7a	ROCM IFU: Resolve merge conflicts in RemoveLayoutConversions.cpp fix merge error fix dot fix make_range additional fix	2023-11-07 04:29:38 +00:00
Jason Furmanek	c3132eeda8	ROCM IFU: Third-party fixes: preload ROCDL Additional 3rd-party fix Remove redundant mfma_supported defines	2023-11-07 03:29:16 +00:00
Jason Furmanek	3a6dc5ad8d	resolve some merge conflicts fix more conflicts Resolve merge conflicts Some more build and conflict fixes Resolve conflicts for 06-fused-attention.py resolve merge conflicts for the tutorial group gemm example Fixes for some LIT tests resolve remaining conflicts in tests Fix empty kernel set capability 0	2023-11-06 23:13:10 +00:00
Jason Furmanek	33151a860f	Merge commit 'ac9fa68d18c777e421bd3f6fb1ddcfd60b6fda33' into ifu-rebase-again Conflicts: .gitignore .gitmodules README.md bin/triton-translate.cpp include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td include/triton/Target/AMDGCN/AMDGCNTranslation.h include/triton/Target/HSACO/HSACOTranslation.h lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/CMakeLists.txt lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ScanOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/Utility.cpp lib/Conversion/TritonGPUToLLVM/Utility.h lib/Dialect/TritonGPU/IR/Dialect.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Target/HSACO/CMakeLists.txt lib/Target/HSACO/HSACOTranslation.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/test/unit/language/test_core.py python/test/unit/operators/test_flash_attention.py python/triton/compiler/compiler.py python/triton/compiler/make_launcher.py python/triton/language/semantic.py python/triton/runtime/jit.py python/tutorials/06-fused-attention.py python/tutorials/11-grouped-gemm.py test/Conversion/tritongpu_to_llvm.mlir	2023-11-06 23:10:10 +00:00
oplavsic	c65f1e6211	Add OptimizeEpilogue pass. (#346 ) * optimize_epilogue * Add config * Remove licenses * Comment out Hopper specific parameters when printing out configs * Add benchmark parameters from flash-attention repo * Add Z and H in the key of autotuner --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-11-03 16:46:24 -05:00
Weixing Zhang	34b89a1173	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581 ) In current implementation, warpsPerCTA is always set to [numWarps, 1] for 2 tt.dot fusion scenario. But, it is not optimal for cases such that tt.dot doesn't have enough parallelism on row dimension but on column dimension.	2023-11-03 16:40:03 -04:00
Thomas Raoux	6ac9d51ff0	[OPTIMIZATION] Enable pipelining for bwd flash attention (#2590 ) This allows pipelining when a load is used by multiple dots in a loop. Relax the condition to pipeline dot operands for mma v3 case. This improves performance for the bwd pass from 260TF to 275TF. However, this exposes a performance problem due to the wmma pipelining, as ptxas will now fall back to serial wgmma. A follow-up PR will fix a bug in how we emit wgmma_wait during pipelining and will bring performance to 335TF	2023-11-03 11:46:51 -07:00
Shucai Xiao	cb02a0b346	remove unnecessary arch names (#388 )	2023-11-03 12:04:02 -05:00
Lixun Zhang	a66270165a	Move fa-transV to the new perf-kernels dir (#387 )	2023-11-03 00:09:48 -05:00
Justin Lebar	df08301e76	Reformat Python code with yapf. (#2589 ) I've added an option to yapf to do what we want for long lines, see https://github.com/google/yapf/pull/1177. We can now have a real Python formatter, yay! To make this PR, I ran my modified yapf over the repository, then looked over the full diff. Where yapf was mangling the param list of long function decls/calls (mostly kernels), I manually added `#` to put linebreaks where we want. I fixed up other formatting too -- mostly adding or removing a trailing comma from lists. Overall, trailing `#` was sufficient to get formatting similar to our current code. I didn't have to disable yapf anywhere. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-11-02 20:44:17 -07:00
Shucai Xiao	79bebc4ffe	fp8 type support (#357 ) * add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to triton. * add SW type conversion between `tl.float8e4b8/tl.float8e5b16` and `fp16` * change flashattention to support fp8 in q/k.	2023-11-02 15:51:23 -05:00
jeromeku	37cd3d5339	[FIX][AOT Compiler] Fix No Specialization Edge Case (#2584 ) The [hints dispatching](`218492cd65/python/triton/tools/link.py (L161)`) logic currently fails for the edge case where a single kernel with no specializations is to be linked in the [AOT compiler](https://github.com/openai/triton/blob/main/python/triton/tools/link.py). Since the dispatcher inserts a conditional branch for each specialization case, this results in an `if ()` being inserted into the `C` source, which clearly breaks downstream artifacts. Fix: - Added simple check for this edge case - Added unit test that mirrors the existing [`test_compile_link_matmul`](`218492cd65/python/test/unit/tools/test_aot.py (L224)`) test case save for the aforementioned condition.	2023-11-02 10:02:41 -07:00
Thomas Raoux	ca8f110617	[BACKEND] Pipeliner refactoring (#2565 ) Refactor the pipeliner pass in order to make it more generic. The main change is that the pipeliner is now broken into 2 pieces: one calculating a modulo schedule and creating async ops based on the IR, and an expander that will generate the pipelined IR based on the modulo schedule. The advantage of separating the two pieces is that it will allow us to create different schedules without having to change the expander, and it will allow for more complex schedules. For now the schedule generated for the matmul case roughly matches the schedule picked by the previous pipeliner in order to avoid changes. This also creates a different sequence of insert/extract slice for the alloc. We should probably change shared alloc to use memory semantic.	2023-11-02 09:56:39 -07:00
Lixun Zhang	38f9136fc8	Add FA with transV (#385 )	2023-11-02 08:52:33 -05:00
Vedant Roy	702cde0d6f	[FRONTEND] Implement ternary operator for dynamic values (#2560 )	2023-11-01 20:21:32 -04:00
Chenggang Zhao	e7fdfd76fb	[FRONTEND] Add value restoration for autotuner (#2549 ) For in-place kernels, neither `reset_to_zero` nor `Config.prehook` provided in the autotuner can restore the values changed during the tuning process, so I propose a recovery mechanism here. --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-31 21:37:44 -04:00
Zahi Moudallal	3650213218	[OPTIMIZER] Thread local reduction optimization (#2542 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-31 16:13:36 -07:00
Justin Lebar	258399c114	Enable ruff linter instead of flake8 (#2574 ) [FRONTEND] Enable ruff linter instead of flake8. This fixes a few issues automatically, and also flagged two issues to fix manually in test_core.py: We had two duplicate function names! One of these function bodies was a duplicate, so I deleted it. The other function body was not a duplicate, so I gave it a new name. AIUI all of these errors should have been picked up by flake8. I'm confused why it wasn't working. Anyway this is working, and it's faster than flake8, so it seems like an improvement in all dimensions.	2023-10-31 21:28:24 +00:00
Goran Flegar	601b95cdbb	[DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570 ) Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>	2023-10-31 12:06:25 -07:00
Zahi Moudallal	943330790a	[FRONTEND] add do_not_specialize property back to JITFunction (#2573 )	2023-10-31 12:02:45 -07:00
Nhat Nguyen	0cf3a67f04	[BUILD] Disable W503 in pyproject.toml (#2575 ) This PR https://github.com/openai/triton/pull/2555 disabled `W503` (means line breaks can now occur before a binary operator). The change surprisingly didn't take any effect nor required any style changes in `triton` main `pre-commit` stage. But our `triton-shared` [pipeline run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821) (see `Check pre-commit` stage) picked this up correctly and complained about formatting issues. I'm not entirely sure what could be the cause for such difference, but if we also disable `W503` in `pyproject.toml` then the rule is picked up correctly.	2023-10-31 11:57:02 -07:00
Chris Jones	2398b82f18	[FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-30 19:18:27 -07:00
Keren Zhou	70fca00b67	[BACKEND] Fix `device_print` without arguments (#2566 )	2023-10-30 20:04:44 -04:00
Keren Zhou	492886fcde	[FRONTEND] Add reverse eq and ne (#2563 )	2023-10-30 16:56:43 -04:00
Justin Lebar	12f906287f	[FRONTEND] Refactor jit.py. (#2556 ) [FRONTEND] Refactor jit.py. The goal is to simplify the code and make it more flexible before we change the kernel launch syntax to `kernel[grid, compiler_flags(...)](...)`. The main changes here are: - Get rid of the eval'ed code in make_launcher. We can do everything using bind(). - Add KernelParam and KernelArg classes, letting us get rid of the parallel arrays/dicts indexed by parameter index. - Get rid of duplicated kernel launch code in the cache-hit/cache-miss branches.	2023-10-30 13:14:51 -07:00
Justin Lebar	f88b01f558	Apply `ruff` pre-commit to python/triton/runtime. (#2558 ) We're in the process of incrementally converting from autopep8 + flake8 + isort to ruff, on a directory-by-directory basis. The motivation to switch away from autopep8 is that I can't get it to wrap long lines, even with -aaa. This seems to be a known problem, https://github.com/hhatto/autopep8/issues/497. See more details about alternatives tried in https://github.com/openai/triton/pull/2557.	2023-10-30 11:06:44 -07:00
Lixun Zhang	9517d4c256	Tweak matmul tutorial on MI2xx GPU (#376 ) * Tweak matmul tutorial on MI2xx GPU * Add config for 9728 --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-27 10:40:11 -05:00
Someone	cde42e6221	[BUILD] make cuda tools vendoring optional (#2546 )	2023-10-26 23:16:41 -07:00
Michael Melesse	1fd9b40f2f	Works as StandAlone and Backend and also Perf is Good This is a combination of 4 commits. Works as StandAlone and Backend Works as StandAlone and Backend This is a combination of 13 commits. Works StandAlone and as Backend This is a combination of 7 commits. backend set default dir with flag move bitcode to backend dir copy backend save empty test work in backendmode enable backend mode when copying to upstream clean up fix failure minimize diff add skip function fix bug with corrupted dwarf exp match num_wraps fix multi threaded test issue move bitcode file out of lib move backend to python/triton/third_party/hip move libhsa backend works again restart ci clean upstream location first before copy match scripts fix new error memoize backend stuff fix bug	2023-10-26 14:27:18 -05:00
Dongdong Li	0469d5fccd	[OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-10-26 16:35:36 +00:00
zhu jianjiang	cfae7e2a25	[BACKEND] Fix matmul downcast path (#2528 ) for https://github.com/openai/triton/issues/2523, add regression test --------- Co-authored-by: Jokeren <robinho364@gmail.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-26 09:43:49 -04:00
Michael Melesse	09ba348f87	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this exposes a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving Functionality and Performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-26 08:36:49 -05:00
runseny	4c816c2f59	[OPS] enable flash_attention_v2 TMA (#2544 )	2023-10-25 23:31:17 -07:00
Hongtao Yu	2323adb387	[BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514 ) Addressing https://github.com/openai/triton/issues/2011 Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-25 15:20:07 -04:00
Alexander Efimov	5a86b46bb1	[MFMA] FP8 and BF8 support (#355 ) * [MFMA] FP8 and BF8 support This PR adds support of fp8 and bf8 in AccelerateMatmul pass and Introduces generation of float8 mfma instructions in ttg to llvm conversion. * add tests * fix tests * review fix: fix variable naming and dot operand promotion. * review comments fixes --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-25 13:27:10 -05:00
Shucai Xiao	8547694665	set correct arch info for unit test (#370 ) * set correct arch info for unit test * address review comments	2023-10-25 13:06:45 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Justin Lebar	e70e11e834	[BACKEND] Improve printf. (#2532 ) [BACKEND] Improve printf. Previously, we printed all of a GPU thread's values in a single printf() call, and this, plus the user-specified prefix, was all we printed. This caused a few problems. - nvptx printf can only handle 32 arguments; if you pass more than that, it prints garbage. So if a thread had more than 32 values, you couldn't print them, issue #2486. - The order of the values within the Triton program (GPU thread block) is an implementation detail -- it depends on the layout the compiler assigns to a tensor. So this also prevented you from interpreting the printed output. To address this, we now print the Triton pid and multi-dimensional Tensor index for each value. And each value gets its own line to avoid passing too many args to printf. Example output: ``` pid (0, 1, 2) idx (36, 127) x: 42 ``` If you want to observe all the values in a tensor in order, you can grep and then sort the output. We also make a UX enhancement to print: The printed label always ends with ": "; you don't have to add it yourself. Fixes #2486.	2023-10-25 08:47:55 +00:00
Justin Lebar	9b4d91b132	Add TRITON_BUILD_WITH_ASAN envvar. (#2537 ) Note that asan doesn't work with programs that use the GPU, so this is only useful for running tools like triton-opt. I was not able to get msan working. libstdc++'s std::string implementation seems to use uninitialized memory in a way that seems safe but triggers an msan error. I tried and gave up on switching to libc++ and teaching msan to ignore this error.	2023-10-24 10:30:30 -07:00
Philippe Tillet	3f2b7263e8	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2541 ) Reverts openai/triton#2525	2023-10-24 10:23:19 -07:00
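
Commit e70e11e834 above notes that, with one value per line in the form `pid (0, 1, 2) idx (36, 127) x: 42`, you can grep and then sort the output to see a tensor's values in order. A minimal sketch of that post-processing step follows, assuming the captured output matches the example line from the commit message exactly. The helper names `parse_line` and `sort_device_print` are hypothetical and are not part of Triton.

```python
import re

# Assumed line shape, taken from the example in the commit message:
#   pid (0, 1, 2) idx (36, 127) x: 42
LINE_RE = re.compile(r"pid \(([^)]*)\) idx \(([^)]*)\) (.*?): (.*)")


def _to_ints(text):
    """Turn '0, 1, 2' into (0, 1, 2); empty text gives ()."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def parse_line(line):
    """Hypothetical helper: return (pid, idx, label, value) or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None  # not a device-print line, skip it
    pid = _to_ints(match.group(1))
    idx = _to_ints(match.group(2))
    return pid, idx, match.group(3), match.group(4).strip()


def sort_device_print(lines, label):
    """Hypothetical helper: keep rows for one label, ordered by pid then idx."""
    rows = [row for row in map(parse_line, lines) if row is not None]
    rows = [row for row in rows if row[2] == label]
    return sorted(rows, key=lambda row: (row[0], row[1]))


if __name__ == "__main__":
    captured = [
        "pid (0, 1, 2) idx (36, 127) x: 42",
        "pid (0, 0, 0) idx (1, 5) x: 7",
        "unrelated log line",
        "pid (0, 0, 0) idx (0, 9) x: 1",
    ]
    for pid, idx, name, value in sort_device_print(captured, "x"):
        print(pid, idx, name, value)
```

Values are kept as strings because the commit message does not specify how non-integer values are rendered. Sorting on the parsed integer tuples avoids the lexicographic ordering problems a plain text sort would have with multi-digit indices.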

1401 Commits