github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Thomas Raoux	218492cd65	[BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583 )	2023-11-02 05:32:16 +00:00
Dongdong Li	d0098da7b1	[BACKEND] Add error reporting to report non-kernel-argument (#2552 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-11-01 20:22:10 -04:00
Vedant Roy	702cde0d6f	[FRONTEND] Implement ternary operator for dynamic values (#2560 )	2023-11-01 20:21:32 -04:00
Alexander Efimov	74c5fd46ee	[RemoveLayoutConversions] Fix reduce failed infer type error (#377 ) * [RemoveLayoutConversions] Fix reduce failed infer type error This PR fixes layout propagation algorithm in RemoveLayoutConversions pass. In some cases during rewriteSlice process, reduce operation with multiple outputs rewrites only one output layout, which breaks assumption that both outputs should have same layout. This change is a minimal part of https://github.com/openai/triton/pull/2331 change and small lit test for regression testing. * fix combine test * Fix issue with incorrect inference layout of make_range output result	2023-11-01 13:31:13 -05:00
Alexander Efimov	d62a3ffdbe	[RemoveLayoutConversions] Remove PatternSharedInfo structure (#378 ) This structure is not used anymore after massive refactoring of RemoveLayoutConversion pass in September IFU.	2023-11-01 12:57:35 -05:00
Chenggang Zhao	e7fdfd76fb	[FRONTEND] Add value restoration for autotuner (#2549 ) For in-place kernels, neither `reset_to_zero` nor `Config.prehook` provided in the autotuner can restore the values changed during the tuning process, so I propose a recovery mechanism here. --------- Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-31 21:37:44 -04:00
Zahi Moudallal	3650213218	[OPTIMIZER] Thread local reduction optimization (#2542 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-31 16:13:36 -07:00
Justin Lebar	258399c114	Enable ruff linter instead of flake8 (#2574 ) [FRONTEND] Enable ruff linter instead of flake8. This fixes a few issues automatically, and also flagged two issues to fix manually in test_core.py: We had two duplicate function names! One of these function bodies was a duplicate, so I deleted it. The other function body was not a duplicate, so I gave it a new name. AIUI all of these errors should have been picked up by flake8. I'm confused why it wasn't working. Anyway this is working, and it's faster than flake8, so it seems like an improvement in all dimensions.	2023-10-31 21:28:24 +00:00
Goran Flegar	601b95cdbb	[DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570 ) Co-authored-by: Ashay Rane <ashay@users.noreply.github.com> Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>	2023-10-31 12:06:25 -07:00
Zahi Moudallal	943330790a	[FRONTEND] add do_not_specialize property back to JITFunction (#2573 )	2023-10-31 12:02:45 -07:00
Nhat Nguyen	0cf3a67f04	[BUILD] Disable W503 in pyproject.toml (#2575 ) This PR https://github.com/openai/triton/pull/2555 disabled `W503` (means line breaks can now occur before a binary operator). The change surprisingly didn't take any effect nor required any style changes in `triton` main `pre-commit` stage. But our `triton-shared` [pipeline run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821) (see `Check pre-commit` stage) picked this up correctly and complained about formatting issues. I'm not entirely sure what could be the cause for such difference, but if we also disable `W503` in `pyproject.toml` then the rule is picked up correctly.	2023-10-31 11:57:02 -07:00
daemondzh	96cf8f979a	[OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545 ) Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com> Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com> Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>	2023-10-31 02:28:27 +00:00
Justin Lebar	29a9245559	[BUILD] use clang+lld in CI builds. (#2564 ) Use clang+lld in CI builds. This is significantly faster.	2023-10-30 19:19:27 -07:00
Chris Jones	2398b82f18	[FRONTEND][BACKEND] dd memory synchronization scope parameter to atomic ops. (#2562 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-30 19:18:27 -07:00
Keren Zhou	70fca00b67	[BACKEND] Fix `device_print` without arguments (#2566 )	2023-10-30 20:04:44 -04:00
Keren Zhou	492886fcde	[FRONTEND] Add reverse eq and ne (#2563 )	2023-10-30 16:56:43 -04:00
Justin Lebar	12f906287f	[FRONTEND] Refactor jit.py. (#2556 ) [FRONTEND] Refactor jit.py. The goal is to simplify the code and make it more flexible before we change the kernel launch syntax to `kernel[grid, compiler_flags(...)](...)`. The main changes here are: - Get rid of the eval'ed code in make_launcher. We can do everything using bind(). - Add KernelParam and KernelArg classes, letting us get rid of the parallel arrays/dicts indexed by parameter index. - Get rid of duplicated kernel launch code in the cache-hit/cache-miss branches.	2023-10-30 13:14:51 -07:00
Justin Lebar	f88b01f558	Apply `ruff` pre-commit to python/triton/runtime. (#2558 ) We're in the process of incrementally converting from autopep8 + flake8 + isort to ruff, on a directory-by-directory basis. The motivation to switch away from autopep8 is that I can't get it to wrap long lines, even with -aaa. This seems to be a known problem, https://github.com/hhatto/autopep8/issues/497. See more details about alternatives tried in https://github.com/openai/triton/pull/2557.	2023-10-30 11:06:44 -07:00
Justin Lebar	f7be5f8fa5	[BUILD] Disable W503 in flake8 config. (#2555 ) It seems that by default, flake8 warns on both "linebreak occurred before binary operator" (W503) and "linebreak occurred after binary operator" (W504). You...kind of have to pick one of these. :) According to the docs, W503 is deprecated, so we disable that one. https://www.flake8rules.com/rules/W503.html	2023-10-30 09:28:34 -07:00
Justin Lebar	1ea5c0f675	[DOCS] Add instrs for setting up C++ intellisense. (#2554 )	2023-10-27 12:03:09 -07:00
Lixun Zhang	9517d4c256	Tweak matmul tutorial on MI2xx GPU (#376 ) * Tweak matmul tutorial on MI2xx GPU * Add config for 9728 --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-27 10:40:11 -05:00
Someone	cde42e6221	[BUILD] make cuda tools vendoring optional (#2546 )	2023-10-26 23:16:41 -07:00
jayfurmanek	26debc92a0	Merge pull request #363 from ROCmSoftwarePlatform/post_ifu_rebase_empty_kernel_works Third Party Backend Merge	2023-10-26 16:28:00 -05:00
Michael Melesse	1fd9b40f2f	Works as StandAlone and Backend and also Perf is Good This is a combination of 4 commits. Works as StandAlone and Backend Works as StandAlone and Backend This is a combination of 13 commits. Works StandAlone and as Backend This is a combination of 7 commits. backend set default dir with flag move bitcode to backend dir copy backend save empty test work in backendmode enable backend mode when copying to upstream clean up fix failure minimize diff add skip function fix bug with corrupted dwarf exp match num_wraps fix multi threaded test issue move bitcode file out of lib move backend to python/triton/third_party/hip move libhsa backend works again restart ci clean upstream location first before copy match scripts fix new error memoize backend stuff fix bug	2023-10-26 14:27:18 -05:00
Dongdong Li	0469d5fccd	[OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-10-26 16:35:36 +00:00
zhu jianjiang	cfae7e2a25	[BACKEND] Fix matmul downcast path (#2528 ) for https://github.com/openai/triton/issues/2523 ,add regression test --------- Co-authored-by: Jokeren <robinho364@gmail.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-26 09:43:49 -04:00
Michael Melesse	09ba348f87	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this expose a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving Functionality and Performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-26 08:36:49 -05:00
Michael Melesse	833c9b985f	Backend Dir with Empty Kernel Working This is a combination of 9 commits. Empty Kernel Works rebase minimzie diff: add libs move to backend dir match python add includes move everything to backend dir match include and lib create a backend build mode simplify backend	2023-10-26 08:36:49 -05:00
Phil Tillet	07baf3a102	[CI] move llvm-build.yml to the top of workflow directory hierarchy	2023-10-26 02:08:56 -07:00
ian Bearman	3f95a6fb81	[DOCS] refactor meetup subdir; add triton-shared introduction slides Adding `Introduction to Triton-Shared.pptx` as presented at the Oct. 25 Triton community meeting. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-26 02:02:02 -07:00
Eikan Wang	40766928f1	Add the update for Intel XPU Backend. (#2551 ) - Intel XPU Backend Status Update - GEMM Lowering for Intel GPU - Questions to the Triton community	2023-10-26 08:33:37 +00:00
kshama-msft	746d411ead	[DOCS] create 10-25-2023.md (#2548 )	2023-10-26 01:32:09 -07:00
Keren Zhou	bc72294507	[CI] Reenable torchinductor workflow (#2527 )	2023-10-25 23:44:02 -07:00
runseny	4c816c2f59	[OPS] enable flash_attention_v2 TMA (#2544 )	2023-10-25 23:31:17 -07:00
Shucai Xiao	2729ae6c6f	use different int8 mfma instructions on different GPUs. (#368 ) * changes support to choose different int8 instructions * rename an instruction name Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>	2023-10-25 19:12:21 -05:00
Hongtao Yu	2323adb387	[BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514 ) Addressing https://github.com/openai/triton/issues/2011 Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-10-25 15:20:07 -04:00
Alexander Efimov	5a86b46bb1	[MFMA] FP8 and BF8 support (#355 ) * [MFMA] FP8 and BF8 support This PR adds support of fp8 and bf8 in AccelerateMatmul pass and Introduces generation of float8 mfma instructions in ttg to llvm conversion. * add tests * fix tests * review fix: fix variable naming and dot operand promotion. * review comments fixes --------- Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>	2023-10-25 13:27:10 -05:00
Shucai Xiao	8547694665	set correct arch info for unit test (#370 ) * set correct arch info for unit test * address review comments	2023-10-25 13:06:45 -05:00
oplavsic	715a589ce3	[FA fwd D=128] Reduce LDS usage in epilogue (#340 ) * rebase onto improve_fwd_fa * Fixed a leftover from rebase * rebase onto improve_fa_fwd * Reduce tuning space * Disable bwd with D=128 * Add test for d=128 * Fix an issue with get_best_config when there is only one config * Added better configs for d=128 * Fix typos --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>	2023-10-25 12:10:34 -05:00
Adnan Akhundov	7d55968fee	[BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512 ) ### Summary When Triton GPU IR is lowered into LLVM IR, we can make use of the constancy information about the result of the elementwise ops to deduplicate otherwise redundant computation. That is the contribution of this PR: the constancy is checked and, if possible, some of the values in LLVM IR are reused multiple times instead of computing equal values separately. The change is beneficial for the PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given sufficiently large 2^n factor in the rightmost rightmost dimension(s). This makes the computation resulting in those sub-indices redundant. Consequently, under the necessary constancy conditions, the redundant indexing arithmetics can be deduplicated. We observe up to 29% decrease in the latency of some of our jagged tensor kernels	2023-10-25 11:25:29 -04:00
Justin Lebar	e70e11e834	[BACKEND] Improve printf. (#2532 ) [BACKEND] Improve printf. Previously, we printed all of a GPU thread's values in a single printf() call, and this, plus the user-specified prefix, was all we printed. This caused a few problems. - nvptx printf can only handle 32 arguments; if you pass more than that, it prints garbage. So if a thread had more than 32 values, you couldn't print them, issue #2486. - The order of the values within the Triton program (GPU thread block) is an implementation detail -- it depends on the layout the compiler assigns to a tensor. So this also prevented you from interpreting the printed output. To address this, we now print the Triton pid and multi-dimensional Tensor index for each value. And each value gets its own line to avoid passing too many args to printf. Example output: ``` pid (0, 1, 2) idx (36, 127) x: 42 ``` If you want to observe all the values in a tensor in order, you can grep and then sort the output. We also make a UX enhancement to print: The printed label always ends with ": "; you don't have to add it yourself. Fixes #2486.	2023-10-25 08:47:55 +00:00
Justin Lebar	2217bd2f5c	[BACKEND] Delete dead vprintf and vprintf_array functions (#2531 ) Delete dead vprintf and vprintf_array functions. These were introduced in `88498d104a` and appear to have been dead at the time of introduction.	2023-10-25 00:22:53 +00:00
Philippe Tillet	31c76ddd05	[CI] revert recent changes (#2543 )	2023-10-24 17:00:31 -07:00
Phil Tillet	5181d62b1b	[CI] renamed third party test workflow	2023-10-24 12:12:52 -07:00
Justin Lebar	9b4d91b132	Add TRITON_BUILD_WITH_ASAN envvar. (#2537 ) Note that asan doesn't work with programs that use the GPU, so this is only useful for running tools like triton-opt. I was not able to get msan working. libstdc++'s std::string implementation seems to use uninitialized memory in a way that seems safe but triggers an msan error. I tried and gave up on switching to libc++ and teaching msan to ignore this error.	2023-10-24 10:30:30 -07:00
Philippe Tillet	3f2b7263e8	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2541 ) Reverts openai/triton#2525	2023-10-24 10:23:19 -07:00
Phil Tillet	96b04493f1	[CI] move workflow around	2023-10-24 04:01:53 -07:00
Sam Shleifer	12da43084b	[TESTING] add diff column, option to return df in benchmark (#2469 )	2023-10-24 05:17:00 +00:00
Philippe Tillet	8f467f1ea9	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) (#2525 ) Reverts openai/triton#2497	2023-10-23 21:50:58 -07:00
Adnan Akhundov	50add54334	[FRONTEND] Add input dtypes to autotuning key (#2534 )	2023-10-24 03:29:30 +00:00

1 2 3 4 5 ...

2386 Commits