Commit Graph

1434 Commits

Author SHA1 Message Date
Nhat Nguyen
0cf3a67f04 [BUILD] Disable W503 in pyproject.toml (#2575)
PR https://github.com/openai/triton/pull/2555 disabled `W503` (meaning line breaks may now occur before a binary operator).

Surprisingly, the change neither took effect nor required any style changes in the `pre-commit` stage on `triton` main. But our `triton-shared` [pipeline run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821) (see the `Check pre-commit` stage) picked the rule up correctly and complained about formatting issues. I'm not entirely sure what causes the difference, but if we also disable `W503` in `pyproject.toml`, the rule is picked up correctly.
2023-10-31 11:57:02 -07:00
daemondzh
96cf8f979a [OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545)
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
2023-10-31 02:28:27 +00:00
Justin Lebar
29a9245559 [BUILD] use clang+lld in CI builds. (#2564)
This is significantly faster.
2023-10-30 19:19:27 -07:00
Chris Jones
2398b82f18 [FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-30 19:18:27 -07:00
Keren Zhou
70fca00b67 [BACKEND] Fix device_print without arguments (#2566) 2023-10-30 20:04:44 -04:00
Keren Zhou
492886fcde [FRONTEND] Add reverse eq and ne (#2563) 2023-10-30 16:56:43 -04:00
Justin Lebar
12f906287f [FRONTEND] Refactor jit.py. (#2556)
The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.

The main changes here are:

 - Get rid of the eval'ed code in make_launcher.  We can do everything
   using bind().
 - Add KernelParam and KernelArg classes (see the sketch after this entry), letting us get rid of the parallel arrays/dicts indexed by parameter index.
 - Get rid of duplicated kernel launch code in the cache-hit/cache-miss
   branches.
2023-10-30 13:14:51 -07:00
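Below is a minimal sketch of the consolidation this commit describes, under stated assumptions: `KernelParam` and `KernelArg` are the class names from the message, but their fields and the `bind_args` helper are illustrative guesses, not Triton's actual implementation.

```python
import inspect
from dataclasses import dataclass

@dataclass
class KernelParam:
    # One formal parameter of a @jit'ed kernel (fields are assumptions).
    num: int            # positional index in the signature
    name: str
    is_constexpr: bool

@dataclass
class KernelArg:
    # A runtime value bound to one KernelParam.
    value: object
    param: KernelParam

def bind_args(fn, *args, **kwargs):
    # inspect.Signature.bind() does the name/position bookkeeping that
    # the eval'ed launcher code used to do by hand.
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    return bound.arguments  # mapping: parameter name -> value
```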
Justin Lebar
f88b01f558 Apply ruff pre-commit to python/triton/runtime. (#2558)
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.

The motivation to switch away from autopep8 is that I can't get it to
wrap long lines, even with -aaa.  This seems to be a known problem,
https://github.com/hhatto/autopep8/issues/497.

See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
2023-10-30 11:06:44 -07:00
Justin Lebar
f7be5f8fa5 [BUILD] Disable W503 in flake8 config. (#2555)
It seems that by default, flake8 warns on both "linebreak occurred before binary operator" (W503) and "linebreak occurred *after* binary operator" (W504). You...kind of have to pick one of these. :)

According to the docs, W503 is deprecated, so we disable that one:
https://www.flake8rules.com/rules/W503.html
2023-10-30 09:28:34 -07:00
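For illustration of the two mutually exclusive styles this commit chooses between (a minimal sketch; the variables are hypothetical):

```python
first_term, second_term = 1, 2

# W503 (now disabled) flags a line break *before* the binary operator:
total = (first_term
         + second_term)

# W504 flags a break *after* the operator; that is the one left enabled:
total = (first_term +
         second_term)
```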
Justin Lebar
1ea5c0f675 [DOCS] Add instrs for setting up C++ intellisense. (#2554) 2023-10-27 12:03:09 -07:00
Someone
cde42e6221 [BUILD] make cuda tools vendoring optional (#2546) 2023-10-26 23:16:41 -07:00
Dongdong Li
0469d5fccd [OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-10-26 16:35:36 +00:00
zhu jianjiang
cfae7e2a25 [BACKEND] Fix matmul downcast path (#2528)
Adds a regression test for https://github.com/openai/triton/issues/2523.

---------

Co-authored-by: Jokeren <robinho364@gmail.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-26 09:43:49 -04:00
Phil Tillet
07baf3a102 [CI] move llvm-build.yml to the top of the workflow directory hierarchy 2023-10-26 02:08:56 -07:00
ian Bearman
3f95a6fb81 [DOCS] refactor meetup subdir; add triton-shared introduction slides
Adding `Introduction to Triton-Shared.pptx` as presented at the Oct. 25
Triton community meeting.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-26 02:02:02 -07:00
Eikan Wang
40766928f1 Add the update for Intel XPU Backend. (#2551)
- Intel XPU Backend Status Update
- GEMM Lowering for Intel GPU
- Questions to the Triton community
2023-10-26 08:33:37 +00:00
kshama-msft
746d411ead [DOCS] create 10-25-2023.md (#2548) 2023-10-26 01:32:09 -07:00
Keren Zhou
bc72294507 [CI] Reenable torchinductor workflow (#2527) 2023-10-25 23:44:02 -07:00
runseny
4c816c2f59 [OPS] enable flash_attention_v2 TMA (#2544) 2023-10-25 23:31:17 -07:00
Hongtao Yu
2323adb387 [BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514)
Addressing https://github.com/openai/triton/issues/2011

Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-25 15:20:07 -04:00
Adnan Akhundov
7d55968fee [BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512)
### Summary

When Triton GPU IR is lowered into LLVM IR, we can make use of the
constancy information about the result of the elementwise ops to
deduplicate otherwise redundant computation. That is the contribution of
this PR: the constancy is checked and, if possible, some of the values
in LLVM IR are reused multiple times instead of computing equal values
separately.

The change is beneficial for PyTorch 2 / TorchInductor-generated Triton code, as the leftmost sub-indices extracted from the flat index by div / mod operations can be equal, given a sufficiently large 2^n factor in the rightmost dimension(s). This makes the computation resulting in those sub-indices redundant. Consequently, under the necessary constancy conditions, the redundant indexing arithmetic can be deduplicated. We observe up to a 29% decrease in the latency of some of our jagged tensor kernels.
2023-10-25 11:25:29 -04:00
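A small plain-Python illustration of the constancy being exploited (shapes are hypothetical): with a rightmost dimension of 64, the row index `i // 64` is identical across an aligned group of 64 consecutive flat indices, so it only needs to be computed once per group.

```python
W = 64                          # rightmost dimension, a power of two
group = range(2 * W, 3 * W)     # one aligned group of 64 flat indices

rows = {i // W for i in group}  # the div result is constant on the group
assert rows == {2}

cols = [i % W for i in group]   # only the mod varies per element
assert cols == list(range(W))
```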
Justin Lebar
e70e11e834 [BACKEND] Improve printf. (#2532)
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.

This caused a few problems.

 - nvptx printf can only handle 32 arguments; if you pass more than
   that, it prints garbage.  So if a thread had more than 32 values, you
   couldn't print them, issue #2486.

 - The order of the values within the Triton program (GPU thread block)
   is an implementation detail -- it depends on the layout the compiler
   assigns to a tensor.  So this also prevented you from interpreting
   the printed output.

To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value.  And each value gets its own line to avoid
passing too many args to printf.

Example output:

```
pid (0, 1, 2) idx (36, 127) x: 42
```

If you want to observe all the values in a tensor in order, you can grep
and then sort the output.

We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.

Fixes #2486.
2023-10-25 08:47:55 +00:00
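A usage sketch for the behavior described above (`tl.device_print` is Triton's device-side print; the kernel itself is a hypothetical example):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def debug_kernel(x_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # Prints one line per value, e.g. "pid (0, 0, 0) idx (3) x: 3.000000"
    tl.device_print("x", x)

x = torch.arange(8, device="cuda", dtype=torch.float32)
debug_kernel[(1,)](x, BLOCK=8)
```

Because each value gets its own line, piping the captured output through `sort` recovers tensor order.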
Justin Lebar
2217bd2f5c [BACKEND] Delete dead vprintf and vprintf_array functions (#2531)
These were introduced in 88498d104a and
appear to have been dead at the time of introduction.
2023-10-25 00:22:53 +00:00
Philippe Tillet
31c76ddd05 [CI] revert recent changes (#2543) 2023-10-24 17:00:31 -07:00
Phil Tillet
5181d62b1b [CI] renamed third party test workflow 2023-10-24 12:12:52 -07:00
Justin Lebar
9b4d91b132 Add TRITON_BUILD_WITH_ASAN envvar. (#2537)
Note that asan doesn't work with programs that use the GPU, so this is
only useful for running tools like triton-opt.

I was not able to get msan working.  libstdc++'s std::string
implementation seems to use uninitialized memory in a way that seems
safe but triggers an msan error.  I tried and gave up on switching to
libc++ and teaching msan to ignore this error.
2023-10-24 10:30:30 -07:00
Philippe Tillet
3f2b7263e8 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2541)
Reverts openai/triton#2525
2023-10-24 10:23:19 -07:00
Phil Tillet
96b04493f1 [CI] move workflow around 2023-10-24 04:01:53 -07:00
Sam Shleifer
12da43084b [TESTING] add diff column, option to return df in benchmark (#2469) 2023-10-24 05:17:00 +00:00
Philippe Tillet
8f467f1ea9 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485) (#2525)
Reverts openai/triton#2497
2023-10-23 21:50:58 -07:00
Adnan Akhundov
50add54334 [FRONTEND] Add input dtypes to autotuning key (#2534) 2023-10-24 03:29:30 +00:00
Thomas Raoux
5e6071254c [BACKEND] Use our internal slice implementation to avoid combinatorial explosion (#2535)
2023-10-24 03:06:34 +00:00
Phil Tillet
c65d2c2ed6 [CI] run wheels job on CPU worker 2023-10-23 20:04:37 -07:00
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the attribute-related bug is fixed in MLIR, we can use arith ops for cmp and select.
2023-10-23 19:35:46 -07:00
Zahi Moudallal
b0c166b9e3 [BACKEND] Fixing bug in elementwise conversion (#2517) 2023-10-20 09:11:15 -07:00
runseny
dc9e3063d7 [HOPPER] Move to tl.make_block_ptr in flash_attention backward scripts (#2395) 2023-10-20 11:06:48 +08:00
Justin Lebar
30186f401e Fix segfault in assertion test. (#2520)
The issue here is that we were not checking the return values of the CUDA API calls we were making. We call one function and then use the data it returns as input to another call. Obviously this doesn't work if the first call returns an error and doesn't actually return meaningful data.

I don't know why this was passing in CI, but it failed consistently for me.
2023-10-19 13:42:38 -07:00
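The pattern the fix enforces, as a hedged sketch (the `cu_check` helper is hypothetical; `CUDA_SUCCESS` is genuinely 0 in the driver API):

```python
CUDA_SUCCESS = 0

def cu_check(code: int) -> None:
    # Fail fast instead of feeding garbage output into the next call.
    if code != CUDA_SUCCESS:
        raise RuntimeError(f"CUDA driver call failed with error {code}")

# Usage sketch: check every call before consuming what it returns.
#   cu_check(cuda.cuDeviceGet(byref(device), 0))
#   cu_check(cuda.cuCtxCreate(byref(ctx), 0, device))  # device is now valid
```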
Justin Lebar
bdf464e4a8 Make kernel_static_print test work when called twice. (#2518)
This test is checking that a message is printed when the kernel is compiled. But the test had nothing to force the kernel to be compiled every time you ran the test. So after you ran it once, the test would fail every time until you cleared the cache.
2023-10-19 13:17:38 -07:00
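One way to force recompilation on every run, sketched under the assumption that the test can redirect Triton's on-disk cache (`TRITON_CACHE_DIR` is Triton's cache-location environment variable; the helper is hypothetical):

```python
import os
import tempfile

def run_with_fresh_cache(test_fn):
    # Point Triton at an empty cache directory so the kernel is
    # recompiled and the compile-time message is emitted again.
    old = os.environ.get("TRITON_CACHE_DIR")
    with tempfile.TemporaryDirectory() as tmp:
        os.environ["TRITON_CACHE_DIR"] = tmp
        try:
            test_fn()
        finally:
            if old is None:
                os.environ.pop("TRITON_CACHE_DIR", None)
            else:
                os.environ["TRITON_CACHE_DIR"] = old
```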
ian Bearman
0d57820be9 update triton-shared ref (#2506) 2023-10-19 11:53:37 -07:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Horace He
a4f373938c [RUNTIME] Filter out paths that don't exist in json group cache (#2511)
There's no guarantee that `/tmp/triton/*/*.json` existing means that the corresponding `/tmp/triton/*/*.cubin` file also exists, because the tmp directory doesn't guarantee file stability.
2023-10-18 16:44:34 -04:00
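A sketch of the filtering described (the helper name is illustrative):

```python
import glob
import os

def valid_group_entries(pattern="/tmp/triton/*/*.json"):
    # Keep only entries whose companion .cubin still exists; /tmp makes
    # no stability guarantee, so either file may have been evicted.
    for json_path in glob.glob(pattern):
        cubin_path = json_path[: -len(".json")] + ".cubin"
        if os.path.exists(cubin_path):
            yield json_path
```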
Zahi Moudallal
a980ec50f1 [BACKEND] Fixing f8e5m2 to bf16 conversion on A100 (#2508) 2023-10-18 17:23:39 +01:00
Thomas Raoux
e36d1665ca [BACKEND] Fix unsupported view op created during optimizations (#2510)
When propagating layouts, we were generating a view op whose total number of elements per thread did not match. Lowering such an op would require exchanging data across threads. This change prevents the optimizer from generating such cases; it may require further optimizations in the future.
2023-10-18 16:37:13 +01:00
ian Bearman
768fc1fcd9 [FRONTEND] change hash to not require ptxas (#2476)
I noticed that Triton is using the `ptxas` version as part of the version hash even for non-CUDA targets. This is an attempt at fixing that. Moving the version calculation to the backend makes sense to me from an architectural standpoint, so that's my approach here. I'm not as confident in the implementation, so if folks have any feedback, please let me know.
2023-10-17 10:28:51 -07:00
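A hedged sketch of the architectural direction (names are assumptions, not Triton's actual API): each backend reports its own toolchain version, and the frontend hashes only what the active backend provides.

```python
import hashlib

def version_key(triton_version: str, backend_toolchain_version: str) -> str:
    # A CUDA backend would report the ptxas version here; other backends
    # report their own toolchain, so ptxas is never required.
    blob = f"{triton_version}-{backend_toolchain_version}".encode()
    return hashlib.sha256(blob).hexdigest()
```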
Thomas Raoux
376acb610b [BUILD] Fix macos x86 build (#2505)
There was a mismatch in the llvm link name
2023-10-17 09:49:09 -07:00
Philippe Tillet
05dc28be0e [CI] refactor workflows (#2504)
no longer run third-party tests on every PR
2023-10-17 00:27:17 -07:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Maksim Levental
87a223d76f bump triton_shared (#2501)
Bump to tip.

cc @manbearian @nhat-nguyen
2023-10-16 13:49:20 -07:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496)
Without this change, a constexpr assignment (i.e. `A = B & C`, where `B` and `C` are both constexpr) gets assigned to a Triton tensor, which becomes an issue when `A` is used as the condition of an `if` statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the condition, because if we are assigning `x = 0` then the assigned value is also a constexpr, but in this case we do want to assign a Triton tensor to `x` so that we can do `x.to(tl.int64)`, for example, which cannot be done on a constexpr.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
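A hypothetical kernel sketch illustrating the two cases the message distinguishes:

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr, B: tl.constexpr, C: tl.constexpr):
    A = B & C                 # both operands constexpr -> A stays constexpr
    if A:                     # so A can drive compile-time control flow
        x = 0                 # a literal is still materialized as a tensor,
        x = x.to(tl.int64)    # so runtime ops like .to() keep working
        tl.store(out_ptr, x)
```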
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
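Why fusion changes numerics, in a short plain-Python demonstration (`math.fma` requires Python 3.13+): an fma keeps the intermediate product at full precision, so fused and unfused results can differ.

```python
import math  # math.fma is available from Python 3.13

a, b, c = 1.0 + 2**-29, 1.0 - 2**-29, -1.0

unfused = a * b + c        # a*b rounds to 1.0 first, so the sum is 0.0
fused = math.fma(a, b, c)  # the exact product 1 - 2**-58 is kept

assert unfused == 0.0
assert fused == -(2 ** -58)
```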