I've added an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!
To make this PR, I ran my modified yapf over the repository, then looked
over the full diff. Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added trailing `#`
comments to put line breaks where we want them. I fixed up other
formatting too -- mostly adding or removing a trailing comma from lists.
Overall, trailing `#` was sufficient to get formatting similar to our
current code. I didn't have to disable yapf anywhere.
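For illustration, here is roughly what the trailing-`#` trick looks like on a made-up kernel signature (not a real hunk from this diff); the comment at the end of each line keeps yapf from re-wrapping it:
```python
import triton
import triton.language as tl


# Hypothetical kernel: the trailing "#" comments pin the line breaks so yapf
# keeps one logical group of parameters per line instead of re-flowing them.
@triton.jit
def _kernel(X, Y, Z,  #
            stride_xm, stride_xk,  #
            BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,  #
            BLOCK_K: tl.constexpr):
    pass
```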
---------
Co-authored-by: Phil Tillet <phil@openai.com>
Refactor the pipeliner pass in order to make it more generic. The main
change is that the pipeliner is now broken into two pieces: one that
calculates a modulo schedule and creates async ops based on the IR, and
an expander that generates the pipelined IR based on the modulo schedule.
The advantage of separating the two pieces is that it lets us create
different schedules without having to change the expander, and it will
allow for more complex schedules.
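As a rough sketch of the split (plain Python pseudocode, not the actual pass interfaces; `ModuloSchedule` and `expand` are invented names for illustration):
```python
from dataclasses import dataclass, field


@dataclass
class ModuloSchedule:
    """Records only *when* each op runs: its pipeline stage."""
    num_stages: int
    stage_of: dict = field(default_factory=dict)  # op name -> stage index


def expand(loop_ops, sched):
    """Turn a modulo schedule into prologue / steady-state / epilogue ops.

    The expander never decides the schedule itself, so a smarter schedule
    (more stages, different stage assignment) needs no expander changes.
    """
    prologue = [op for s in range(sched.num_stages - 1)
                for op in loop_ops if sched.stage_of[op] <= s]
    kernel = list(loop_ops)  # steady state: all stages run each iteration
    epilogue = [op for s in range(1, sched.num_stages)
                for op in loop_ops if sched.stage_of[op] >= s]
    return prologue, kernel, epilogue


# e.g. a 2-stage matmul pipeline: async loads in stage 0, the dot in stage 1
sched = ModuloSchedule(num_stages=2, stage_of={"load_a": 0, "load_b": 0, "dot": 1})
print(expand(["load_a", "load_b", "dot"], sched))
# (['load_a', 'load_b'], ['load_a', 'load_b', 'dot'], ['dot'])
```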
For now the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner, in order to avoid changes.
This also creates a different sequence of insert/extract slice ops for
the alloc. We should probably change shared alloc to use memory semantics.
…rf on problems that need few blocks.
Constrain the number of launched blocks to exactly what the persistent
warp-specialized kernel needs. It's useful when problems need very
few blocks.
e.g. MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64,
non-split-k. Experiments show it can achieve ~16% speedup.
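A minimal host-side sketch of what "launch only as many blocks as needed" can look like (the helper and the exact formula here are assumptions for illustration, not the code in this PR):
```python
import torch
import triton


def persistent_grid(M, N, BLOCK_M=128, BLOCK_N=128):
    # A persistent kernel normally launches one block per SM and loops over
    # tiles; when the problem has fewer output tiles than SMs, cap the grid
    # at the tile count so no idle blocks are launched.
    num_sms = torch.cuda.get_device_properties(0).multi_processor_count
    num_tiles = triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N)
    return (min(num_sms, num_tiles),)


# e.g. M = N = 800 with 128x128 tiles -> 7 * 7 = 49 tiles, so at most 49
# blocks are launched even on a GPU with more SMs.
```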
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would fail due to
overly strict atol/rtol.
To avoid having to choose the strictness ourselves, and to get better
dtype-based defaults, use the non-deprecated torch testing util.
See: https://github.com/pytorch/pytorch/issues/61844
Replace: https://github.com/openai/triton/pull/2242
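Presumably the util in question is `torch.testing.assert_close`, which replaces the deprecated `assert_allclose` and derives default atol/rtol from the dtype; a minimal sketch of the swap:
```python
import torch

out = torch.randn(128, 128, dtype=torch.float16)
ref = out.clone()

# Before: hand-picked tolerances with the deprecated helper.
# torch.testing.assert_allclose(out, ref, atol=1e-2, rtol=0)

# After: per-dtype default tolerances (looser for fp16/bf16 than for fp32).
torch.testing.assert_close(out, ref)
```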
For the warp-specialized persistent kernel, the instruction sequences for
the two warp groups are:
```
// warp group 0
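// producer: wait until buffer idx is empty (EB), write it (W), then mark it full (FB)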
for wave in 0..num_waves:
  idx = wave * num_inner_loop_steps;
  for k_tile_idx in 0..num_k_tiles:
    mbarrier.wait EB[idx];
    W0;
    mbarrier.arrive FB[idx];
    idx++;
```
```
// warp group 1
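// consumer: wait until buffer idx is full (FB), read it (R), then mark it empty (EB)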
for wave in 0..num_waves:
  idx = wave * num_inner_loop_steps;
  for k_tile_idx in 0..num_k_tiles:
    mbarrier.wait FB[idx];
    R0;
    mbarrier.arrive EB[idx];
    idx++;
```
This forms a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K is smaller than the K tile shape, then
num_inner_loop_steps of the persistent kernel is 0, so the buffer id and
mbarrier id are always 0. In that case the order W0 -> W1 -> R0 -> R1
may form, which contradicts atomicity --
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
Make sure that other threads within the CTA do not operate on the mbarrier
until it has been initialized by thread 0.
Co-authored-by: Philippe Tillet <phil@openai.com>
Initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. They are recommended for a trial once version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>