Commit Graph

1215 Commits

Author SHA1 Message Date
Ethan Pronovost
56fee37a0d [FRONTEND] Fix benchmark plotting (#2177) 2023-08-24 20:34:04 -07:00
Keren Zhou
f6cdcf1d77 [BACKEND] Fix BF16 dot operand type mismatch (#2162)
https://github.com/openai/triton/issues/2156
2023-08-24 20:32:33 -07:00
Greg Brockman
64d8df4c69 [FRONTEND] handle errors from launch_enter_hook (#2178) 2023-08-24 20:32:01 -07:00
Shantanu
7083dae4f2 [FRONTEND] drop the GIL around more CUDA ops (#2173) 2023-08-24 20:31:38 -07:00
ben-zhang-609
22a2fe3e55 [Optimizer][Hopper] change mmaV3InstrN for flash attention (#2169)
Revise the logic for assigning the MMA layout to serialized dots. This
is a heuristic rule for FlashAttention.
2023-08-25 01:54:38 +00:00
Zahi Moudallal
120ce0a5bf [DOCS] Fixing docs (#2175) 2023-08-24 15:58:59 -07:00
kshama-msft
387c8d94e1 [DOCS] update README.md (#2174)
Added speaker details
2023-08-24 09:55:13 -07:00
chengjunlu
6cb67185f8 [FRONTEND] Use proper default num_warps and num_stages based on the device backend in JITFunction (#2130)
The default values used by JITFunction for num_warps and num_stages are
coupled with Nvidia GPU architecture. We should use the proper default
values based on the device backend for the kernel to be compiled to.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-08-24 21:58:18 +08:00
Phil Tillet
b43c28fdd7 [DOCS] fix typo in readme 2023-08-23 21:42:40 -07:00
kshama-msft
a1c5ef3387 [DOCS] update 08-22-2023.md (#2164)
Added recording link and minutes for meeting.
2023-08-23 15:39:11 -07:00
kshama-msft
2231403668 [DOCS] Update README.md (#2168)
Added Triton Conference Registration details.
2023-08-23 15:38:46 -07:00
Thomas
3116933ccd [BACKEND] Don't do dead code elimination on volatile load (#2165) 2023-08-23 14:59:18 -07:00
jayfurmanek
9ad6327fde [DOCS] Add AMD GPU 08/23 status update doc to /docs/meetups (#2167)
Charts shown at the 08/23 community meetup
2023-08-23 14:59:03 -07:00
Wang Weihan
5d47054a05 [DOCS] Add Intel XPU status update doc to Meetup Doc (#2160) 2023-08-23 13:44:19 -07:00
Bin Fan
dad83f9dcb [TOOLS] Add support for autotuning AOT kernel (#2123)
This PR makes the following change to AOT kernel

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of constexprs and
meta-parameters is referred to as an "algo". Within an algo, the client
can still give different hints about integer arguments.
- Add an API `int ${kernel_name}_get_num_algos()` that returns the total
number of algos.
- Add an algo_id parameter so that the client of the generated kernel
can select the algo.
- Remove gX, gY and gZ from the kernel parameter list. The launch grid
usually differs between algos, and the client should not need to care
about how to compute the launch grid for each algo. Instead, we ask the
client to pass the expressions for computing gX, gY and gZ to compile.py
(when AOT kernels are generated). The expressions can only use kernel
parameters or constant values.
- We also change the testing flow. Now we first build the kernels into a
shared library, libkernel.so; then the client test.c code is built and
linked against libkernel.so. This is closer to a typical AOT kernel
usage flow.
2023-08-23 09:38:29 -07:00
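The client-side flow described above can be sketched as follows. This is a hypothetical illustration in plain Python: `kernel_get_num_algos`, `kernel_run`, and `autotune` are stand-ins modeled on the `${kernel_name}_` prefix convention; the real generated code is C, and the actual symbol names and signatures may differ.

```python
# Stand-ins for the generated AOT symbols (hypothetical names/behavior).

def kernel_get_num_algos():
    # Models `int ${kernel_name}_get_num_algos()`: the total number of
    # constexpr/meta-parameter combinations baked into the library.
    return 2

def kernel_run(algo_id):
    # Models launching the kernel with a chosen algo. The generated code
    # computes gX/gY/gZ internally from kernel parameters, so the client
    # no longer passes a launch grid. Returns 0 on success.
    return 0 if 0 <= algo_id < kernel_get_num_algos() else -1

def autotune():
    # Try every algo and keep the ones that run; a real client would time
    # each and select the fastest (timing elided for brevity).
    return [i for i in range(kernel_get_num_algos()) if kernel_run(i) == 0]

valid_algos = autotune()
```

The key design point mirrored here is that the launch grid is an internal detail of each algo rather than part of the client-facing parameter list.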
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
Keren Zhou
6a65c894fe [TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153) 2023-08-22 18:02:26 -07:00
Keren Zhou
5fa1fa1b27 [FRONTEND] Emit warning if the result of tl.advance is unused (#2155)
https://github.com/openai/triton/issues/2138
2023-08-22 18:02:00 -07:00
danny.jang
c4a9006340 [FRONTEND] fix a typo (#2152) 2023-08-22 13:02:31 -04:00
kshama-msft
0410652666 [DOCS] update meetup/08-22-2023.md (#2149) 2023-08-21 17:09:28 -07:00
ivanyinwz
ec801ce18e [BACKEND] Optimize performance for f16 epilogue with TMA store (#2135)
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split the TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store across all the warps.
4. Fix some naming issues.
2023-08-21 12:44:11 -07:00
Philippe Tillet
ea8416164f [FRONTEND] name mangling fixup (#2148) 2023-08-21 12:11:52 -07:00
Beal Wang
7e5cd95bf2 [OPTIMIZER] Fix Warp Specialized kernel launch failure (#2146)
For a warp-specialized persistent kernel, the instruction sequences for
the warp groups are
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if GEMM K is smaller than K-TileShape, then the num_inner_loop_steps
of the persistent kernel is 0. The buffer id and mbarrier id will always
be 0 in this case, and it may form the order W0 -> W1 -> R0 -> R1, which
contradicts the atomicity rule --
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
2023-08-21 14:46:57 +08:00
Thomas
54ca7fcb35 [FRONTEND] Use inline asm for global timer and smid functions (#2143)
Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on a .bc file.
2023-08-20 22:56:37 -07:00
Thomas
ad3e363a44 [BACKEND] Remove dead code related to old libhopper_helpers.bc (#2145) 2023-08-20 22:56:20 -07:00
Keren Zhou
584d5c263f [FRONTEND] Disable IfExp on dynamic conditions (#2100)
Using `if _unwrap_if_constexpr(cond)` to decide whether to enter
`node.body` is wrong when `cond` is a tensor, since we cannot statically
evaluate a dynamic tensor's value.

The right way to solve the problem is probably:

1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
2023-08-20 12:58:10 -07:00
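The failure mode above can be sketched in plain Python. This is purely illustrative: `Tensor` and `lower_ifexp` are hypothetical stand-ins, not Triton's actual tensor class or frontend code.

```python
# Why an IfExp lowering must not evaluate a dynamic condition statically.
# `Tensor` stands in for a runtime tensor whose value is unknown at
# compile time.

class Tensor:
    def __bool__(self):
        # A dynamic tensor's truth value cannot be known at compile time.
        raise TypeError("cannot statically evaluate a dynamic tensor")

def lower_ifexp(cond, then_val, else_val):
    # The buggy approach unwraps the condition and picks a branch at
    # compile time. That is only valid for compile-time constants
    # (constexpr); for a dynamic tensor we must refuse (or build real
    # control-flow IR, as the message sketches via visit_If).
    if isinstance(cond, Tensor):
        raise NotImplementedError("IfExp on dynamic conditions is disabled")
    return then_val if cond else else_val
```

With a constexpr condition, `lower_ifexp(True, a, b)` simply returns `a`; with a `Tensor` condition it refuses, mirroring what this PR disables.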
Alexander Zinoviev
a7b40a10f9 [TESTS] Fix tl.dot test on sm75 (#2140)
Disable tf32 when running on sm75 and below.
Fix the pattern that the generated PTX is compared against when running
on sm75.
2023-08-19 22:21:18 -07:00
Alexander Zinoviev
e072da5b57 [OPTIMIZER] Remove dead code (#2141)
A little code cleanup
2023-08-19 15:22:53 -07:00
Zahi Moudallal
3c8f959f91 [CI] Adding workflow_run (#2120) 2023-08-18 23:58:41 -07:00
Alexander Zinoviev
d5188fa230 [BACKEND] enable transpose for float16 on sm75 (#2139)
Change the Turing lowering of the dot operation to follow the Ampere
version instead of the Volta version.

Update the code generator to produce two m16n8k8 MMAs for Turing instead
of the one m16n8k16 MMA we have for Ampere.
2023-08-18 22:20:17 -07:00
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Thomas
23ef2615d2 [BACKEND] Merge TT_ElementwisePureExtern and TT_ElementwiseImpureExtern (#2137)
Use getEffect instead to tell passes whether the op has side effects or
not. This doesn't change functionality otherwise.
2023-08-18 20:56:10 +00:00
Thomas
bf351b9ba2 [FRONTEND][BACKEND] Add support for elementwise inline assembly (#2136)
Add a new operation to implement packed inline assembly for elementwise
operations. This way, inline assembly can be used to control elementwise
operations. It also allows packing elements together to manually
vectorize operations.
2023-08-18 12:57:52 -07:00
Thomas
c736ea8492 [BACKEND] Minor clean up and remove loop fixup as it is not needed anymore (#2116) 2023-08-18 12:12:45 -07:00
Zahi Moudallal
1faf93e6fb [CI] Fix PR comment (#2131) 2023-08-18 09:16:18 -07:00
Whitney Tsang
100cabd0e4 [FRONTEND] use enum instead of bool to select target (#2118)
Before this PR, whether `TritonGPUToLLVMIRPass` generates
NVVM-compatible LLVM or ROCDL-compatible LLVM was controlled by a
boolean `isROCM`. This approach is hard to scale.
This PR changes it to use an enum instead, so that new targets can be
added easily when needed.

---------

Signed-off-by: Tsang, Whitney <whitney.tsang@intel.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-17 18:37:09 -07:00
Zahi Moudallal
b33f97a682 [CI] Fix bug in Compare Artifacts workflow (#2128)
Forgot to remove this line
2023-08-17 18:06:36 -07:00
Zahi Moudallal
6f654cfbbf [CI] Testing PR comment from another workflow (#2127) 2023-08-17 17:34:59 -07:00
youkaichao
871ec2ad37 [FRONTEND] add informative error message to help find libcuda (#2019)
When I'm using a Kaggle GPU (https://www.kaggle.com/), I find that
`ldconfig -p` does not show libcuda.so; `ldconfig` (run with sudo) is
required to refresh the cache before libcuda.so can be found.

Therefore, I added this informative message to help users find
libcuda.so.
2023-08-17 19:32:00 -04:00
Thomas
387fc890a5 [FRONTEND][BACKEND] Add a performance test for reductions (#2125)
Also stop promoting integer types, as it doesn't give better perf; this
will allow more vectorization opportunities in the future.
2023-08-17 16:30:33 -07:00
Zahi Moudallal
3fa6d51bc9 [CI] Adding new github workflow for testing (#2121) 2023-08-17 15:32:38 -07:00
YouJiacheng
0970a297b2 [TUTORIALS] allow BLOCK (bwd) != BLOCK_M (fwd) in flash attention (#2020) 2023-08-17 18:31:53 -04:00
Keren Zhou
2d513dbf50 [FRONTEND] Fix addptr code generation (#2122)
`offset + ptr` and `ptr + offset` both work now
2023-08-17 04:22:08 +00:00
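The fix can be sketched in plain Python. This is purely illustrative: `Pointer` and `add` are hypothetical stand-ins, not Triton's frontend code.

```python
# Making pointer arithmetic commutative: if the pointer arrives on the
# right-hand side of `+`, swap it to the left before emitting the
# (stand-in) addptr operation.

class Pointer:
    def __init__(self, base):
        self.base = base

def add(lhs, rhs):
    if isinstance(rhs, Pointer) and not isinstance(lhs, Pointer):
        lhs, rhs = rhs, lhs  # normalize `offset + ptr` to `ptr + offset`
    if isinstance(lhs, Pointer):
        return Pointer(lhs.base + rhs)  # stands in for emitting addptr
    return lhs + rhs  # plain arithmetic addition
```

With the operands normalized, `add(4, ptr)` and `add(ptr, 4)` produce the same result, which is the behavior this commit restores.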
Lixun Zhang
eb940823c4 [OPTIMIZER][BACKEND] Rename MMAv2kWidth to kWidth (#2119) 2023-08-17 04:00:20 +00:00
Zahi Moudallal
557b2d4b34 [CI] upload only test/unit/operators cache to artifacts and rely on kernel names in cache to compare artifacts (#2111) 2023-08-16 20:34:40 -07:00
apgoucher
49941b9c17 [BACKEND] improve inline PTX for various type conversions (#2117) 2023-08-16 16:33:58 -07:00
Qingyi Liu
ecc5e8c558 [BACKEND] Fix several shape errors when numCTAs > 1 (#2107) 2023-08-16 02:14:10 +00:00
darkbuck
a3df6068b4 [BACKEND] Minor fixes found when building triton with LLVM 17/main branches (#2089)
- These minor fixes are not specific to interface changes from the LLVM
main or official llvm-17 branches and can be applied on the triton main
branch.
- https://github.com/darkbuck/triton/tree/darkbuck/main/llvm-main-branch
has extra changes to build against the LLVM main branch, enabling me to
work on other backends on the main branch only. That's a hobby effort
and just FYR.
2023-08-16 01:18:06 +00:00
Whitney Tsang
129e7dfc6f [TritonGPUToLLVM] Correct the usage of option passing (#2104)
For example, when given `--convert-triton-gpu-to-llvm="is-rocm=true"`,
`ConvertTritonGPUToLLVMPass` should generate ROCM-compatible LLVM.
Before this PR, transformation options passed in command line are not
respected.
2023-08-16 00:56:01 +00:00
Qingyi Liu
780266c3a2 [BACKEND] Fix nPerWarp == 8 in MMA16816SmemLoader (#2109) 2023-08-15 17:32:27 -07:00