github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Keren Zhou	6a65c894fe	[TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153 )	2023-08-22 18:02:26 -07:00
Keren Zhou	5fa1fa1b27	[FRONTEND] Emit warning if the result of `tl.advance` is unused (#2155 ) https://github.com/openai/triton/issues/2138	2023-08-22 18:02:00 -07:00
danny.jang	c4a9006340	[FRONTEND] fix a typo (#2152 )	2023-08-22 13:02:31 -04:00
ivanyinwz	ec801ce18e	[BACKEND] Optimize performance for f16 epilogue with TMA store (#2135 ) 1. Optimize the conversion and packing for 2xf32 -> 2xf16. 2. Split TMA store block into multiple slices of size 64x64. 3. Distribute the TMA store to all the warps. 4. Fix some naming issue.	2023-08-21 12:44:11 -07:00
Philippe Tillet	ea8416164f	[FRONTEND] name mangling fixup (#2148 )	2023-08-21 12:11:52 -07:00
jayfurmanek	fa429316d4	Merge pull request #268 from ROCmSoftwarePlatform/improve_reduce_for_fa [CHERRY-PICKED FROM UPSTREAM][BACKEND] no longer uses shared mem or barriers for single-warp reductions (openai#1915)	2023-08-21 13:29:11 -05:00
Beal Wang	7e5cd95bf2	[OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146 ) For a warp specialized persistent kernel, the instruction sequences for the warp groups are ``` // warp group 0 for wave in 0..num_waves: idx = wave * num_inner_loop_steps; for k_tile_idx in 0..num_k_tiles: mbarrier.wait EB[idx]; W0; mbarrier.arrive FB[idx]; idx++; ``` ``` // warp group 1 for wave in 0..num_waves: idx = wave * num_inner_loop_steps; for k_tile_idx in 0..num_k_tiles: mbarrier.wait FB[idx]; R0; mbarrier.arrive EB[idx]; idx++; ``` which form a sequence of morally-strong relations W0 -> R0 -> W1 -> R1 in causality order. But if GEMM K is smaller than K-TileShape, then the num_inner_loop_steps of the persistent kernel is 0. The buffer id and mbarrier id will always be 0 in this case, and this may form the order W0 -> W1 -> R0 -> R1, which contradicts atomicity -- "If a read R precedes an overlapping write W in causality order, then R cannot read from W."	2023-08-21 14:46:57 +08:00
Thomas	54ca7fcb35	[FRONTEND] Use inline asm for global timer and smid functions (#2143 ) Simplify the code by using inline asm to implement globaltimer and smid instead of relying on bc file.	2023-08-20 22:56:37 -07:00
Keren Zhou	584d5c263f	[FRONTEND] Disable IfExp on dynamic conditions (#2100 ) Evaluating `if _unwrap_if_constexpr(cond)` and then entering `node.body` is wrong when cond is a tensor, since we cannot statically evaluate a dynamic tensor's value. The right way to solve the problem is probably: 1. visit the ast of IfExp (do not build IRs) 2. get the type of the last statement 3. initialize the return value and assign it to livein 4. call visit_If	2023-08-20 12:58:10 -07:00
Alexander Zinoviev	a7b40a10f9	[TESTS] Fix tl.dot test on sm75 (#2140 ) Disable tf32 if run on sm75 and below. Fix the pattern match to compare the generated ptx against if run on sm75.	2023-08-19 22:21:18 -07:00
Zahi Moudallal	23dd11d471	[BACKEND] Solidify f8e4m3 (#2105 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-18 19:12:09 -07:00
Thomas	23ef2615d2	[BACKEND] Merge TT_ElementwisePureExtern and TT_ElementwiseImpureExtern (#2137 ) Use getEffect instead to tell passes whether the op has side effects or not. This doesn't change functionality otherwise.	2023-08-18 20:56:10 +00:00
Thomas	bf351b9ba2	[FRONTENT][BACKEND] Add support for elementwise inline assembly (#2136 ) Add a new operation to be able to implement packed inline assembly for elementwise operations. This way inline assembly can be used to control elementwise operations. It also allows packing elements so that operations can be manually vectorized.	2023-08-18 12:57:52 -07:00
Alexander Efimov	d86b19f7a3	[CI] [Dot] Reduced test suite (#302 ) Use the upstream list of tests for the dot op on machines with no MFMA support. This is needed to reduce the time required for PR testing.	2023-08-18 07:47:14 -05:00
Alexander Efimov	23979098c8	[MFMA] MI200 bfloat16 support (#294 ) This PR enables bfloat16 support in MFMA dot on MI200. Used mfma_f32_32x32x8bf16_1k instruction.	2023-08-18 07:28:18 -05:00
Whitney Tsang	100cabd0e4	[FRONTEND] use enum instead of bool to select target (#2118 ) Before this PR, the determination of `TritonGPUToLLVMIRPass` to generate NVVM-compatible LLVM or ROCDL-compatible LLVM is controlled by a boolean `isROCM`. This method is hard to scale. This PR changes it to use an enum instead, where new target can be added easily when needed. --------- Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com> Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-17 18:37:09 -07:00
youkaichao	871ec2ad37	[FRONTEND] add informative erro msg to help find libcuda (#2019 ) When I'm using Kaggle GPU (https://www.kaggle.com/), I find that `ldconfig -p` does not show libcuda.so, but requires `ldconfig` (run with sudo) to refresh the cache to find the libcuda.so. Therefore, I added this informative message to help users find libcuda.so.	2023-08-17 19:32:00 -04:00
Thomas	387fc890a5	[FRONTEND][BACKEND] Add a performance test for reductions (#2125 ) Also stop promoting integer types, as it doesn't give better perf; this will allow more vectorization opportunities in the future.	2023-08-17 16:30:33 -07:00
YouJiacheng	0970a297b2	[TUTORIALS] alow BLOCK(bwd) != BLOCK_M(fwd) in flash attention (#2020 )	2023-08-17 18:31:53 -04:00
Shucai Xiao	f7cf2c032b	Changes of the tutorial matmul scripts to get good performance (#297 ) * simple changes of the matmul scripts to get good performance. The specific reason for the performance boost needs further investigation and is tracked * fix review comments * change the num_warps in the autotuning config for hip to work around an error and change the rtol so the correctness check passes	2023-08-17 13:24:49 -05:00
Keren Zhou	2d513dbf50	[FRONTEND] Fix addptr code generation (#2122 ) `offset + ptr` and `ptr + offset` both work now	2023-08-17 04:22:08 +00:00
Zahi Moudallal	557b2d4b34	[CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111 )	2023-08-16 20:34:40 -07:00
Lixun Zhang	87e45cb011	Set vecSize and maxPhase more generically	2023-08-16 08:30:32 -05:00
Lixun Zhang	7156fcb0ef	Set `vecSize` = 4 and `maxPhase` = BLOCK_K/4	2023-08-16 08:30:32 -05:00
darkbuck	a3df6068b4	[BACKEND] Minor fixes found when building triton with LLVM 17/main branches (#2089 ) - These minor fixes are not specific to interface changes from LLVM main or the official llvm-17 branch and can be applied on the triton main branch. - https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch has extra changes to build against the LLVM main branch, to enable me to work on other backends on the main branch only. That's a hobby effort and just FYR.	2023-08-16 01:18:06 +00:00
Philippe Tillet	4215086931	[BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915 ) 0-bytes shared mem buffers don't materialize empty allocation buffers; this could lead to unnecessary barriers. note: reduceop code has become quite messy and will require some cleanup	2023-08-15 11:51:20 +00:00
Zahi Moudallal	0312ed3473	[CI] Update kernels names (#2093 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-14 19:41:41 -07:00
jsh-20	9055af1a5d	Update test_user_defined_persistent_warp_specialized_gemm for num-CTA > 1 (#2101 ) - remove auto-tune for test_user_defined_persistent_warp_specialized_gemm. - remove unnecessary perf evaluation parts. - add test cases of num-CTA > 1 for test_user_defined_persistent_warp_specialized_gemm.	2023-08-14 08:51:35 +00:00
Philippe Tillet	facc1dcbac	[TESTS] better matmul unit testing (#2098 )	2023-08-13 17:54:32 -07:00
Izzy Putterman	fc667d1f8f	[FRONTEND] fix new absolute imports (#2072 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-13 14:23:36 +00:00
danny.jang	b878c8826f	[DOCS] fix generating document failures (#2096 )	2023-08-13 06:53:34 -07:00
danny.jang	cd114c0bd5	[DOCS] Fix typos (#2097 )	2023-08-13 06:52:43 -07:00
Thomas	98372f46d3	[FRONTEND] Remove extra calls to _get_config causing runtime overhead (#2094 )	2023-08-13 06:51:26 -07:00
Zahi Moudallal	a01c116f76	[FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090 )	2023-08-11 17:49:52 -07:00
Keren Zhou	382e8fb1fa	[RUNTIME] Make apis compatible with cuda 11 drivers (#2081 ) https://github.com/openai/triton/issues/2042	2023-08-11 17:46:56 -07:00
Thomas	421ce18988	[FRONTEND] Refactor min/max to unify tl.maximum and tl.math.max (#2091 ) maximum used to generate a cmp/sel even for floating point types. Always using max op allows better code quality and avoids having different behavior than tl.math.max	2023-08-11 17:46:20 -07:00
Keren Zhou	5162871c6c	[TUTORIAL] flash attention d128 improvement (#2074 ) `ptxas` is able to automatically generate a call instruction to "call" the loop body so that instructions are better scheduled.	2023-08-12 00:31:48 +00:00
Zahi Moudallal	b62b6d6a71	[FRONTEND] Remove cache key from metadata (#2082 )	2023-08-11 08:58:00 -07:00
Zahi Moudallal	4d373aa103	[BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047 )	2023-08-10 15:52:37 -07:00
Beal Wang	d1ce4c4950	[TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075 ) remove unnecessary skips. decompose UTs in persistent-warp-specialized-gemm into vintage and stylish	2023-08-10 06:57:04 +00:00
Shantanu	776b3784c2	[FRONTEND] further improve version_key speed (#2073 ) Realised I could do this right after my first PR got merged. This saves another 100ms	2023-08-09 22:29:36 +00:00
Shantanu	0e11257b8d	[FRONTEND] improve speed of computing version_key (#2071 ) libtriton.so is pretty large these days and hashing it is slow. Switching the hash from md5 to sha1 shaves close to 300ms off the time for me (as well as being a better hash, for whatever that's worth). As far as I could tell, sha1 is the fastest stable hash in the Python standard library, including things like zlib.crc32	2023-08-09 21:44:10 +00:00
allatit23	8a610f7cf7	[HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066 )	2023-08-09 09:07:56 +00:00
Beal Wang	de47bba07d	[OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057 ) Co-authored-by: Biao Wang <biaow@nvidia.com>	2023-08-09 16:42:01 +08:00
allatit23	6d98a0899f	[HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063 )	2023-08-09 15:45:44 +08:00
Alexander Efimov	1c45836d5d	[ROCM] fix device_type name (#2061 ) Rename "rocm" -> "hip", to comply with other uses in compiler.py.	2023-08-09 01:11:09 +00:00
Alexander Efimov	af05f01218	[Tests] Fix some tests in test_core_amd.py (#288 ) This PR: - enables test_dot_mfma_vector_load for fast path in mfma dot op pipeline - fixes kernel execution for mfma enabled GPUS - disables mfma layout conversion tests on architectures which can not run these tests	2023-08-08 20:12:32 +02:00
Alexander Efimov	ff95dddf18	[tutorial] Amd specific tuning configs (#287 ) This pr adds amd specific tuning configs for matmul tutorial with num_stages == 1.	2023-08-08 20:11:30 +02:00
allatit23	6dee55c912	[HOPPER][WS] fix TMA store hang in ws mode (#2056 )	2023-08-08 19:53:52 +08:00
ben-zhang-609	2a95d9bf0d	[Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-08 16:54:21 +08:00

1256 Commits