github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Shucai Xiao	f7cf2c032b	Changes of the tutorial matmul scripts to get good performance (#297 ) * simple changes of the matmul scripts to get good performance. The specific reason for the performance boost needs further investigation and is tracked * fix review comments * change the num_warps in the autotuning config for hip to work around an error and change the rtol so correctness check passed	2023-08-17 13:24:49 -05:00
Keren Zhou	2d513dbf50	[FRONTEND] Fix addptr code generation (#2122 ) `offset + ptr` and `ptr + offset` both work now	2023-08-17 04:22:08 +00:00
Zahi Moudallal	557b2d4b34	[CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111 )	2023-08-16 20:34:40 -07:00
Lixun Zhang	87e45cb011	Set vecSize and maxPhase more generically	2023-08-16 08:30:32 -05:00
Lixun Zhang	7156fcb0ef	Set `vecSize` = 4 and `maxPhase` = BLOCK_K/4	2023-08-16 08:30:32 -05:00
darkbuck	a3df6068b4	[BACKEND] Minor fixes found when building triton with LLVM 17/main branches (#2089 ) - These minor fixes are not specific to interface changes from LLVM main or official llvm-17 branch and can be applied on triton main branch. - https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch has extra changes to build against the LLVM main branch to enable me to work on other backends on the main branch only. That's a hobby effort and just FYR.	2023-08-16 01:18:06 +00:00
Philippe Tillet	4215086931	[BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915 ) 0-bytes shared mem buffers don't materialize empty allocation buffers; this could lead to unnecessary barriers. note: reduceop code has become quite messy and will require some cleanup	2023-08-15 11:51:20 +00:00
Zahi Moudallal	0312ed3473	[CI] Update kernels names (#2093 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-14 19:41:41 -07:00
jsh-20	9055af1a5d	Update test_user_defined_persistent_warp_specialized_gemm for num-CTA > 1 (#2101 ) - remove auto-tune for test_user_defined_persistent_warp_specialized_gemm. - remove unnecessary perf evaluation parts. - add test cases of num-CTA > 1 for test_user_defined_persistent_warp_specialized_gemm.	2023-08-14 08:51:35 +00:00
Philippe Tillet	facc1dcbac	[TESTS] better matmul unit testing (#2098 )	2023-08-13 17:54:32 -07:00
Izzy Putterman	fc667d1f8f	[FRONTEND] fix new absolute imports (#2072 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-13 14:23:36 +00:00
danny.jang	b878c8826f	[DOCS] fix generating document failures (#2096 )	2023-08-13 06:53:34 -07:00
danny.jang	cd114c0bd5	[DOCS] Fix typos (#2097 )	2023-08-13 06:52:43 -07:00
Thomas	98372f46d3	[FRONTEND] Remove extra calls to _get_config causing runtime overhead (#2094 )	2023-08-13 06:51:26 -07:00
Zahi Moudallal	a01c116f76	[FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090 )	2023-08-11 17:49:52 -07:00
Keren Zhou	382e8fb1fa	[RUNTIME] Make apis compatible with cuda 11 drivers (#2081 ) https://github.com/openai/triton/issues/2042	2023-08-11 17:46:56 -07:00
Thomas	421ce18988	[FRONTEND] Refactor min/max to unify tl.maximum and tl.math.max (#2091 ) maximum used to generate a cmp/sel even for floating point types. Always using max op allows better code quality and avoids having different behavior than tl.math.max	2023-08-11 17:46:20 -07:00
Keren Zhou	5162871c6c	[TUTORIAL] flash attention d128 improvement (#2074 ) `ptxas` is able to automatically generate a call instruction to "call" the loop body so that instructions are better scheduled.	2023-08-12 00:31:48 +00:00
Zahi Moudallal	b62b6d6a71	[FRONTEND] Remove cache key from metadata (#2082 )	2023-08-11 08:58:00 -07:00
Zahi Moudallal	4d373aa103	[BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM codegen (#2047 )	2023-08-10 15:52:37 -07:00
Beal Wang	d1ce4c4950	[TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075 ) remove unnecessary skips. decompose UTs in persistent-warp-specialized-gemm into vintage and stylish	2023-08-10 06:57:04 +00:00
Shantanu	776b3784c2	[FRONTEND] further improve version_key speed (#2073 ) Realised I could do this right after my first PR got merged. This saves another 100ms	2023-08-09 22:29:36 +00:00
Shantanu	0e11257b8d	[FRONTEND] improve speed of computing version_key (#2071 ) libtriton.so is pretty large these days and hashing it is slow. Switching the hash from md5 to sha1 shaves close to 300ms off the time for me (as well as being a better hash, for whatever that's worth). As far as I could tell, sha1 is the fastest stable hash in the Python standard library, including things like zlib.crc32	2023-08-09 21:44:10 +00:00
allatit23	8a610f7cf7	[HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066 )	2023-08-09 09:07:56 +00:00
Beal Wang	de47bba07d	[OPTIMIZER] Fix the load and store fallback issue of test_persisten… (#2057 ) Co-authored-by: Biao Wang <biaow@nvidia.com>	2023-08-09 16:42:01 +08:00
allatit23	6d98a0899f	[HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063 )	2023-08-09 15:45:44 +08:00
Alexander Efimov	1c45836d5d	[ROCM] fix device_type name (#2061 ) Rename "rocm" -> "hip", to comply with other uses in compiler.py.	2023-08-09 01:11:09 +00:00
Alexander Efimov	af05f01218	[Tests] Fix some tests in test_core_amd.py (#288 ) This PR: - enables test_dot_mfma_vector_load for fast path in mfma dot op pipeline - fixes kernel execution for mfma enabled GPUS - disables mfma layout conversion tests on architectures which can not run these tests	2023-08-08 20:12:32 +02:00
Alexander Efimov	ff95dddf18	[tutorial] Amd specific tuning configs (#287 ) This pr adds amd specific tuning configs for matmul tutorial with num_stages == 1.	2023-08-08 20:11:30 +02:00
allatit23	6dee55c912	[HOPPER][WS] fix TMA store hang in ws mode (#2056 )	2023-08-08 19:53:52 +08:00
ben-zhang-609	2a95d9bf0d	[Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-08 16:54:21 +08:00
danny.jang	bb47f894f7	[FRONTEND] improve error message for shape mismatch (#2031 ) Improve error messaging for block shape and value shape mismatch.	2023-08-08 01:13:16 -07:00
Philippe Tillet	658747feff	[FRONTEND] remove ptxas from git (#2055 )	2023-08-08 01:11:32 -07:00
allatit23	11cf334730	[hopper][ws] use per-agent thread idx by default (#2054 ) Co-authored-by: Allen Zhao <allzhao@nvidia.com>	2023-08-08 15:28:10 +08:00
goostavz	b525880d8b	[Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts in hopper tests (#2041 ) Co-authored-by: goostavz <gzhu@nvidia.com> Co-authored-by: Philippe Tillet <phil@openai.com> Co-authored-by: ben-zhang-609 <110140741+ben-zhang-609@users.noreply.github.com>	2023-08-08 06:30:04 +00:00
Bin Fan	a76ecd74e7	add num_stages parameter to aot compile.py (#2000 ) This allows the AOT client to tune the number of stages for the generated kernel. set the default number to 3 to match the triton compiler.	2023-08-08 06:04:57 +00:00
BoxiangW	f21a053ee6	[TUTORIALS] support flash attention 2 with KV's sequence length longer than Q's (#2033 ) Implemented this situation with and without causal mask. My implementation with causal mask looks like: 111000 111100 111110 Where only the right upper triangle part will be masked. I added `P_SEQ` for the notation of extra sequence length for KV. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:57:44 -07:00
ben-zhang-609	31e79aa384	[TESTS] remove get_proper_err, get_variant_golden (#2039 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 22:52:55 -07:00
Alex Collins	4ed8381fdb	Linux arm64 support (#2003 ) We are interested in having python wheels for triton built for Linux arm64 platforms, such as NVIDIA's Grace CPU. This change is fairly simple, however: - It requires a linux arm64 build of LLVM to be available (see MR here: https://github.com/ptillet/triton-llvm-releases/pull/15) - For now my changes use the LLVM build hosted here: https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17 - The Triton release process will need to be updated to include arm64 wheels. Is this something you have time to work on @ptillet? It would be difficult for me to update this part without more access permissions. With these changes, I managed to build a set of python wheels and have hosted them here for us to use in the meantime: https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64	2023-08-08 12:39:41 +08:00
Qingyi Liu	341f5b61be	[BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040 ) Make sure that other threads within CTA do not operate on mbarrier until it is initialized by thread 0. Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 20:11:00 -07:00
danny.jang	6a1ac65043	[FRONTEND] improve error message for type mismatch (#2038 )	2023-08-07 19:34:09 -07:00
Keren Zhou	30a331e628	[FRONTEND] Support jit functions without arguments (#2043 ) Issue https://github.com/openai/triton/issues/1973 Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-07 19:05:56 -07:00
Thomas	98523bcc48	[BACKEND] Support MMA V3 with float16 accumulator (#2049 ) Also fixes a bug exposed in convertLayout lowering for float16. We shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values as this needs to take a 32-bit register. Also this prevented optimization at llvm ir level.	2023-08-07 15:55:44 -07:00
Phil Tillet	521cfae44d	[CI] disabled float32 perf regression tests	2023-08-07 12:43:16 -07:00
jayfurmanek	32d7c6d646	Fix runtime/test_subproc.py for hip devices (#284 ) * Fix runtime/test_subproc.py for hip devices * address review comments	2023-08-07 10:30:36 -05:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Alexander Efimov	7158ec286a	[MFMA] [Dot] Support vector loads in normal path (#275 ) * [MFMA] [Dot] Support vector loads in normal path This PR adds generation of vector loads in normal path of MFMA dot operand loading. This requires shared layout to have contiguous elements which should be loaded by one lane. * remove redundant refactoring * fix tests * extend test with transposed A/B tensors	2023-08-03 14:57:39 -05:00
Alexander Efimov	86f8b64ae0	[Dot] [MFMA] [FMA] Update Dot implementation to support upstream tests (#260 ) * [Dot] [MFMA] Support FP16 output of MFMA dot This PR adds cast of output tensor to requested data type. * add tests * fix test for FMA implementation * loosen fp16xfp16->fp16 tolerance * enable FMA fallback for unsupported sizes of dot operation * rework granularity check * add constant modifier to granularity	2023-08-03 13:47:18 -05:00
Vinayak Gokhale	f1063bb33c	Enable backward pass in FA tutorial test (#282 ) Enabled the backward pass in the fused attention tutorial. The tolerance when comparing to the naive implementation had to be changed. The block size is forced to be 64x64 due to the 64 KiB LDS. Default is block 128 for A100's larger SMEM. This creates differences in order of computation and results in a larger gap between the naive and FA implementations.	2023-08-03 10:12:46 -05:00
Shucai Xiao	31cfda8f0e	enable more gemm tests corresponding to PR#273 (#279 )	2023-08-02 16:45:31 -05:00

1337 Commits