github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Philippe Tillet	3f2b7263e8	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2541 ) Reverts openai/triton#2525	2023-10-24 10:23:19 -07:00
Philippe Tillet	8f467f1ea9	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) (#2525 ) Reverts openai/triton#2497	2023-10-23 21:50:58 -07:00
Thomas Raoux	cba7abd682	[BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526 ) Now that the bug related to attribute is fixed in MLIR we can use arith ops for cmp and select ops.	2023-10-23 19:35:46 -07:00
Zahi Moudallal	b0c166b9e3	[BACKEND] Fixing bug in elementwise conversion (#2517 )	2023-10-20 09:11:15 -07:00
runseny	dc9e3063d7	[HOPPER] Move to tl.make_block_ptr in flash_attention backward scripts (#2395 )	2023-10-20 11:06:48 +08:00
Justin Lebar	30186f401e	Fix segfault in assertion test. (#2520 ) The issue here is that we were not checking the return values of the CUDA API calls we were making. We call one function and then use the data it returns as input to another call. Obviously this doesn't work if the first call returns an error and doesn't actually return meaningful data. I don't know why this was passing in CI, but it failed consistently for me.	2023-10-19 13:42:38 -07:00
Justin Lebar	bdf464e4a8	Make kernel_static_print test work when called twice. (#2518 ) This test is checking that a message is printed when the kernel is compiled. But the test had nothing to force the kernel to be compiled every time you ran the test. So after you ran it once, the test would fail every time until you cleared the cache.	2023-10-19 13:17:38 -07:00
ian Bearman	768fc1fcd9	[FRONTEND] change hash to not require ptxas (#2476 ) I noticed that Triton is using the `ptxas` version as part of the version hash even for non-CUDA targets. This is an attempt at fixing this. Moving the version calculation to the back-end makes sense to me from an architectural standpoint, so that's my approach here. I'm not as confident in the implementation, so please if folks have any feedback let me know.	2023-10-17 10:28:51 -07:00
Mehdi Amini	721897fcc4	upgrade llvm to `b1115f8c` (NFC) (#2403 ) Co-authored-by: Thomas Raoux <thomas.raoux@openai.com> Co-authored-by: Keren Zhou <kerenzhou@openai.com> Co-authored-by: Phil Tillet <phil@openai.com>	2023-10-16 16:38:49 -07:00
Zahi Moudallal	726bdb984f	[FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496 ) Without this change, a constexpr assignment (i.e. `A = B & C`, where `B` and `C` are both constexpr) is getting assigned to a triton tensor, which becomes an issue when `A` is used as the condition of an If statement. Note: I had to add `not isinstance(node.value, ast.Constant)` to the condition because if we are assigning `x = 0` then the assigned value is also a constexpr, but in this case we do want to assign a triton tensor to `x` so that we can do `x.to(tl.int64)` for example, which cannot be done on a constexpr. --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-10-16 12:35:19 -07:00
Stewart Hall	29828fe491	[FRONTEND] add option to disable fp mul/add fusion (#2495 ) By default, ptxas will enable fusion of mul/add to fma instructions. The backend was also being configured unconditionally to enable this on conversion from LLVM IR to PTX. This commit adds an option which can be used to disable the FP fusion behavior in both locations.	2023-10-14 12:23:30 -07:00
Philippe Tillet	8db4fac3b0	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2497 ) Reverts openai/triton#2485	2023-10-13 23:32:59 -07:00
Weixing Zhang	76858bd917	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) In the current implementation, warpsPerCTA is always set to [numWarps, 1] for the scenario where 2 tt.dot ops are fused. However, this is not optimal when tt.dot lacks parallelism along the row dimension but has it along the column dimension.	2023-10-12 22:25:42 -07:00
Keren Zhou	f81d9d876f	[FRONTEND] Fix math for constant values (#2472 ) https://github.com/openai/triton/issues/2470	2023-10-12 12:11:42 -07:00
Zahi Moudallal	be19cf3103	[BACKEND] Enable reduce with 3D tensors and added tests (#2460 )	2023-10-06 15:08:22 -07:00
Philippe Tillet	533efd0cac	[FRONTEND][BACKEND] changed float8e4b15 clipping semantics from +-1.875 to +-1.75 (#2422 ) clipping float8e4b15 to +-1.875 is a bad idea because these are represented as 0x7f and 0xff, which are +- nan on H100 for float8e4nv. We lose two values but this will make compatibility with float8e4nv way less painful. (it will just be a matter of adjusting the bias)	2023-09-29 23:33:28 -07:00
Hongtao Yu	e0edb70f78	[BACKEND] support of Fp8E4M3Nv to Bf16 conversion (#2415 )	2023-09-29 17:29:41 -07:00
Thomas Raoux	90bef57acf	[BACKEND] turn on MMA V3 by default on Hopper (#2414 )	2023-09-28 22:45:28 -07:00
Ying Zhang	78c28bf5f6	Support scalar fp8 conversions by packing (#2379 ) Support fp8 scalar conversions by packing fp8 with undef values. Also add simple unittests to cover this change.	2023-09-27 08:29:53 -07:00
Philippe Tillet	7432fff4be	[FRONTEND] add limited introspection capabilities in `tl.extra.cuda` ; rename `arch` into `target` (#2385 )	2023-09-25 23:58:25 -07:00
Philippe Tillet	eea0718445	[TESTING] better cudagraph-based benchmarking (#2394 )	2023-09-25 21:41:26 -07:00
ben-zhang-609	d040b58547	[HOPPER] fix ref check failure of flash attention with mma v3 (#2384 )	2023-09-25 11:29:49 -07:00
Keren Zhou	57fc6d1f13	[BACKEND] `shfl` ptx insts should have side effects (#2376 ) Otherwise, llvm pass could generate very weird structure of CFG and yield incorrect results. https://github.com/openai/triton/issues/2361	2023-09-23 10:05:20 -07:00
Zahi Moudallal	293b7fd592	[TESTING] cleanup (#2293 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-09-22 05:37:14 +00:00
Philippe Tillet	c71ec14f31	[TEST] only test 4 configs without TF32 (#2370 )	2023-09-21 21:23:19 -07:00
Alexander Zinoviev	d543eb1a36	[BACKEND] implement `dot` for INT8 on Turing (#2364 ) Replace a single mma.sync.aligned.m16n8k32.row.col.satfinite.s32.s8.s8.s32 instruction that is used on Ampere with 4 x mma.sync.aligned.m8n8k16.row.col.satfinite.s32.s8.s8.s32 instructions for Turing. Extracted the Turing-int8, Turing-fp16 and Ampere to separate functions. Somehow I messed up with my previous PR, so just open a new one. --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-09-21 16:40:53 -07:00
Philippe Tillet	32c9d2bb8f	[FRONTEND] improved error messages (#2363 ) this is a combination of #1774 and #2006, which I cannot edit but fail CI pre-commit hook	2023-09-21 15:05:57 -07:00
Thomas Raoux	e36c99b588	[BACKEND] Handle scan of function non commutative (#2362 ) Make sure we accumulate in the right order for scans so that non commutative operations are handled correctly.	2023-09-21 12:00:41 -07:00
peterbell10	8094f46632	[FRONTEND][BACKEND] Fix various atomic_rmw bugs (#2355 ) This fixes a few bugs I've encountered - `atomic_add` with int64/uint64 `Operation .add requires .u32 or .s32 or .u64 [...] for instruction 'atom'` - `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type of size 64 to data-type of size 32')` - `atomic_min/max` with float32 returns the old value as int32	2023-09-21 03:31:20 +00:00
ben-zhang-609	bcaf14755a	[HOPPER] enable flash attention with tma (#2336 )	2023-09-20 14:06:56 -07:00
Thomas Raoux	9cab885dff	[BACKEND] Optimize wgmma with accumulator source equal to 0 (#2343 ) Also add a test for MMA v3 reduction.	2023-09-20 14:05:12 -07:00
Keren Zhou	ed5a53057d	[BACKEND] Handle repetitive threads in scan op when the tensor dim is small (#2345 ) https://github.com/openai/triton/issues/2298	2023-09-20 12:25:52 -04:00
Dongdong Li	e5eda098b3	[TESTS] fix flash attention (#2086 ) Co-authored-by: dongdongl <dongdongl@nvidia.com>	2023-09-20 14:23:46 +08:00
Keren Zhou	307b5caa49	[BACKEND] Fix scan issues on repetitive warps and improve perf when there's a single warp on the axis (#2330 ) 1. On the axis, using `getAxisNumWarpsWithUniqueData` instead of getting the raw number of warps to avoid communication among warps that handle the same piece of data. 2. When there's a single warp on the axis, using warp Intrinsics for communication and skip shared memory. Need a follow up PR for code clean up.	2023-09-18 17:45:05 -04:00
Philippe Tillet	e686b4d6d4	[FRONTEND] interpreter rewrite (#2321 ) This is a new interpreter mode that shares semantic analysis with the JIT'ed codepath and that the Triton core team is committed to maintaining	2023-09-17 14:58:50 -07:00
Thomas Raoux	31b0c52142	[FRONTEND][BACKEND] Add flag to control accumulation for fp8 (#2300 ) Change the dot to allow taking an initial accumulator and add a flag that will allow the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default, which allows accumulating with lower precision. This only affects Hopper fp8 dot.	2023-09-15 18:42:54 -07:00
Keren Zhou	08c1658957	[FRONTEND] Accommodate new triton IR format (#2294 ) - Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`). - Support parsing function attribute, though not used yet.	2023-09-14 09:03:23 -07:00
Zahi Moudallal	36087a108f	[FRONTEND] Added SASS to asm dict (#2280 )	2023-09-13 21:21:01 +00:00
Zahi Moudallal	e95e1f12eb	[BACKEND] Convert layout illegal mem access fix (#2287 )	2023-09-13 10:02:25 -07:00
Thomas Raoux	994f7e4460	[BACKEND] Remove dependency between NVGPU and TritonNvidiaGPU (#2282 )	2023-09-12 11:02:20 -07:00
Zahi Moudallal	a47f1f5c28	[BACKEND] Unify slow/fast reduce codegen (#2220 )	2023-09-12 08:46:19 -07:00
jsh-20	fc5d7e6e7c	[FRONTEND] Improve grid calculation for persistent kernels to hoist perf on problems that need few blocks. (#2283 ) Constrain the number of launched blocks to exactly what is needed for persistent warp-specialized kernels. This is useful when problems need very few blocks, e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64, non-split-k. Experiments show it can achieve ~16% speedup.	2023-09-12 09:14:47 +00:00
peterbell10	ab9da3b2b8	[FRONTEND] Fix expand_dims and tl.full to handle scalar tensors (#2275 ) This fixes a few bugs related to scalar tensors: - `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type is forbidden')` - `scalar[None]` fails with `TypeError("'constexpr' object is not iterable")` - `scalar[None, None]` fails with `AttributeError("'dtype' object has no attribute 'shape'")` - `scalar.shape` returns `[1]` instead of 0-dim `[]` - Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of another scalar	2023-09-11 20:59:13 -07:00
Thomas Raoux	a9db6b94b9	Remove wrong dependency between TritonGPU and NVGPU dialect (#2276 )	2023-09-11 16:30:13 -07:00
jon-chuang	5231d57c71	[TESTS] replace deprecated `torch.testing.assert_allclose` (#2250 ) Prior to this PR, matmul on sm_89 (RTX 4070) (`test/unit/operators/test_matmul.py::test_op`) would result in test failure due to too strict atol/rtol. To avoid having to choose strictness ourselves, and to have better defaults based on dtype, use the non-deprecated torch testing util. See: https://github.com/pytorch/pytorch/issues/61844 Replace: https://github.com/openai/triton/pull/2242	2023-09-11 15:31:17 -04:00
Lixun Zhang	28d4c3bdb4	[BACKEND] Make sure `getAxisBlockStride` does not return 0 (#2273 ) This can happen when the CTA shape is larger than the tensor shape along the non-axis dim during scanOp lowering.	2023-09-11 11:02:56 -07:00
Keren Zhou	9e9fbe01f0	[FRONTEND] Fix specialization on triton integer types (#2236 ) https://github.com/openai/triton/issues/2231	2023-09-03 23:57:08 -07:00
Michael Melesse	c6d33dcebf	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this exposes a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving Functionality and Performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-31 14:02:00 -07:00
Philippe Tillet	ec51552fff	[BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212 )	2023-08-30 14:06:31 -07:00
goostavz	1465b573e8	[TESTS][HOPPER] Prune hopper tests to speedup CI (#2193 ) Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-27 20:45:23 -07:00


354 Commits