github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Philippe Tillet	bf4f9375a7	[FRONTEND] allow mixed precision FP8 matmul on pre-H100 hardware (#2281 )	2023-09-11 20:54:29 -07:00
Shintaro Iwasaki	8da27c1c95	[Build] Fix very minor compilation problems (#2277 ) This PR fixes a few very minor compilation issues found in internal deployment at Meta. It looks like nit-picking, but it'd be really appreciated if it could be addressed in OSS Triton (to reduce differences from OSS), and we believe these changes are not bad in general. Neither performance nor functionality is affected by this PR. 1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void *` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with certain flags). It'd be nice to add an explicit cast (like `backends/hip.c`). 2. Inconsistent include path specification in `lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other `DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative. This is problematic in some compilation settings since a compiler somehow needs to find headers in a parent directory. It'd be great to use a relative path, like other source files in Triton. cc: @yuguo68	2023-09-11 19:28:31 -07:00
Thomas Raoux	a9db6b94b9	Remove wrong dependency between TritonGPU and NVGPU dialect (#2276 )	2023-09-11 16:30:13 -07:00
danny.jang	ec4a968d44	[TESTS] Enhance benchmark flexibility (#2239 ) User can pass custom arguments to benchmarks. For example, user can pass `dtype` which will be used to create tensors in a benchmark. Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-09-11 15:31:30 -04:00
jon-chuang	5231d57c71	[TESTS] replace deprecated `torch.testing.assert_allclose` (#2250 ) Prior to this PR, matmul on sm_89 (RTX 4070) (`test/unit/operators/test_matmul.py::test_op`) would result in test failure due to too strict atol/rtol. To avoid having to choose strictness ourselves, and to have better defaults based on dtype, use the non-deprecated torch testing util. See: https://github.com/pytorch/pytorch/issues/61844 Replace: https://github.com/openai/triton/pull/2242	2023-09-11 15:31:17 -04:00
Lixun Zhang	28d4c3bdb4	[BACKEND] Make sure `getAxisBlockStride` does not return 0 (#2273 ) This can happen when the CTA shape is larger than the tensor shape along the non-axis dim during scanOp lowering.	2023-09-11 11:02:56 -07:00
Alexander Efimov	a06072f8ff	Fix dangling gpu_has_mfma use (#325 ) * Fix dangling gpu_has_mfma use This PR replaces gpu_has_mfma use with gpu_matrix_core_version * add basic test	2023-09-11 12:31:48 -05:00
Alexander Efimov	6691de65db	[MFMA] Support BFloat16 on MI100 (#295 ) * [MFMA] Support BFloat16 on MI100 This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100. * fix tests, fix mfma encoding comment, fix switch between mfma versions. * replace kDim from mfma layout with kWidth from dotOp layout * rebase fix * fix mfma to dot op shortcut for bfloat16 * fix review comments	2023-09-08 15:08:34 -05:00
Keren Zhou	10f59d8ce0	[RUNTIME] Get the correct end idx for regular arguments of GPU kernels (#2262 ) Previously, if there were any specializations of "1" or "constexpr" mixed with unspecialized arguments in arbitrary order, we might have encountered errors due to passing incorrect arguments. This was because the length of the signature did not indicate the maximum index of regular arguments. https://github.com/openai/triton/issues/2229 @shunting314 @amjames More specifically for cases like: ``` kernel( b: tl.tensor, a: tl.constexpr, c: tl.int = 1, d, e: tl.constexpr, ... ) ```	2023-09-07 23:31:07 -07:00
SJW	491eb9ddfe	[MLIR] Added tritongpu-stream-pipeline pass (#305 ) * [MLIR] Added tritongpu-stream-pipeline pass - Prologue: Hoist the pipelinable load operations and shared memory store for the ramp up stage - Pipelined Loop: Assemble the loop body minus last iteration - Prefetch next tile from global into regs (while computing from previous) - Non-load loop body - Store next tile into shared mem - Epilogue: Peeled non-load loop body for last iteration * * updated comment	2023-09-07 15:24:59 -05:00
jayfurmanek	83a0958566	Merge pull request #322 from ROCmSoftwarePlatform/f8_and_bf16_conversions Enable fp8 conversions and fix bf16 conversions	2023-09-07 14:38:16 -05:00
Izzy Putterman	7d01c1852a	Revert unintentional change (#2257 ) This change seems to have been unintentionally reverted in the hopper PR: `38d767ea93` Adding it back.	2023-09-07 10:48:12 -07:00
Shucai Xiao	fb3f2d6feb	refine gemm tuning scripts (#309 ) * refine the gemm tuning scripts to reduce tuning space and better perf numbers * added code to support tuning in full tuning space * add a function to get best tuning config * refine the matmul tutorial example to print out best tuning config for each input * added even_k to gemm kernel heuristic for better performance * address review comments	2023-09-07 08:09:11 -05:00
Wen Chen	ffc230ebfe	[ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm.	2023-09-06 18:10:54 -05:00
Wen Chen	2d3e38e182	[ROCM] Added ROCm support for int8 to bfloat16 conversion.	2023-09-06 18:10:54 -05:00
Wen Chen	59a40d3f72	[ROCM] Added ROCm support for the conversions of following data types: [float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]	2023-09-06 18:10:54 -05:00
Zahi Moudallal	f21b36c8c5	[CLEANUP] Delete binaries that went in by mistake (#2256 )	2023-09-06 20:42:42 +00:00
jon-chuang	36859aebff	[DOCS] Add MLIR Autogenerated Docs to Sphinx Docs (#2234 ) Partially fixes: https://github.com/openai/triton/issues/2226 Here are some example renderings: ![Screenshot from 2023-09-04 18-39-20](https://github.com/openai/triton/assets/9093549/e9c4af04-aeae-4021-a8db-6a4a82b59ae7) ![Screenshot from 2023-09-04 18-39-30](https://github.com/openai/triton/assets/9093549/410391b8-e07e-4bed-909c-8ce5484072d1) ![Screenshot from 2023-09-04 18-39-41](https://github.com/openai/triton/assets/9093549/f1eaef95-66c1-4506-a153-c6069e2b5072)	2023-09-06 08:17:12 +00:00
Wang Weihan	e721911705	[FRONTEND] clean build directory when executing python setup.py clean (#2238 ) The current setup.py could not clean the build directory because the default build directory has been changed in `CMakeBuild`. This PR cleans the build directory accordingly.	2023-09-04 21:31:38 -07:00
jon-chuang	99f8f912aa	[OPS] Remove unnecessary perf bug workaround (#2240 ) This bug previously existed and I verified it in a previous nightly release of triton (20230714). However, according to new benchmarks, this bug no longer exists on Triton main. See: https://github.com/google/jax/pull/17328#issuecomment-1705010065	2023-09-04 21:30:54 -07:00
Keren Zhou	9e9fbe01f0	[FRONTEND] Fix specialization on triton integer types (#2236 ) https://github.com/openai/triton/issues/2231	2023-09-03 23:57:08 -07:00
Shantanu	a4df60e20a	[FRONTEND] Fix GIL handling in error conditions (#2225 ) The use of the opaque GIL state APIs should mean that the PyErr_SetString is now safe, regardless of whether the caller has the GIL or not.	2023-09-01 13:30:42 -07:00
Ethan Pronovost	1367f3a6d2	[FRONTEND/OPS] swap `stride_vn` and `stride_vk` in flash attention (#2208 ) I'm not sure if this was a typo or if I'm missing something. To me code like ``` (offs_n[:, None] * stride_vk + offs_k[None, :] * stride_vn) ``` seems off. In case this is a typo I made this PR to correct it. This PR should have no functional changes. If this is not a typo would you mind explaining the reasoning behind these variable names?	2023-08-31 23:19:40 -07:00
Jason Furmanek	320b1029da	Temporarily disable F8 tests on ROCm	2023-09-01 04:02:14 +00:00
Jason Furmanek	df5c263a19	Fix merge conflicts	2023-09-01 04:01:32 +00:00
Jason Furmanek	3eaeb89d18	Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2 Conflicts: include/triton/Dialect/Triton/Transforms/Passes.h include/triton/Dialect/TritonGPU/IR/Dialect.h include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td lib/Analysis/Allocation.cpp lib/Analysis/Utility.cpp lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp lib/Target/LLVMIR/LLVMIRTranslation.cpp python/src/triton.cc python/triton/compiler/compiler.py python/triton/ops/flash_attention.py python/triton/runtime/autotuner.py python/triton/runtime/jit.py python/triton/tools/aot.py python/tutorials/06-fused-attention.py test/Conversion/tritongpu_to_llvm.mlir test/Target/tritongpu_to_llvmir.mlir test/Target/tritongpu_to_llvmir_noinline.mlir	2023-09-01 03:25:33 +00:00
Michael Melesse	c6d33dcebf	[ROCM] Core Functionality for AMD (#1983 ) * this pr adds a third party backend for triton that works on AMD * this exposes a lot of the work that has been done in our [fork](https://github.com/ROCmSoftwarePlatform/triton) * most unit tests on `test_core.py` pass * it skips some unit tests for various reasons * we plan to follow up with more prs improving functionality and performance in the future --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-08-31 14:02:00 -07:00
Philippe Tillet	ec51552fff	[BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212 )	2023-08-30 14:06:31 -07:00
jon-chuang	9af76e7d5a	[RUNTIME] Fix cache dir (#2196 ) --------- Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-08-29 21:07:16 -04:00
Vinayak Gokhale	9cdf3a58c3	Enable split kernel in bwd pass (#303 ) * Add fwd and bwd v2 Changes are largely from upstream. * Split bwd kernel in dq and dk+dv Only adds the split kernels. They are not enabled yet. * Pull scalar multiplies out of the loop * Enable split kernel for bwd pass * Put back P_SEQ=128 in fwd test Not used for bwd test * Address review comments * Address comments Conditionally set causal/ splitkernel to False for bwd. * Add block pointer semantics to bwd pass This significantly increases perf for bwd, similar to fwd.	2023-08-29 13:51:29 -05:00
Lixun Zhang	b834f42ae4	[autotuner] Add an option to print best_config for each key	2023-08-28 14:45:54 -05:00
goostavz	1465b573e8	[TESTS][HOPPER] Prune hopper tests to speedup CI (#2193 ) Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-27 20:45:23 -07:00
Philippe Tillet	5f448b2f08	[FRONTEND] remove dead libhopper_helpers.bc file (#2190 )	2023-08-26 12:17:17 -07:00
Greg Brockman	a9b8c8c37d	[FRONTEND] drop GIL for launch, and set value=false upon pointer error (#2185 )	2023-08-26 17:07:57 +00:00
Keren Zhou	6e4932cda8	[BACKEND] Fix fma mixed-precision (#2184 ) and expose the allow_tf32 argument to the matmul op @shunting314	2023-08-26 09:49:58 -07:00
Greg Brockman	ab3e8b0dad	[FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188 )	2023-08-26 09:19:34 -07:00
Mohammed Anany	ebfe0ffb29	[FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114 ) Co-authored-by: Keren Zhou <kerenzhou@openai.com>	2023-08-25 10:28:23 -07:00
Ethan Pronovost	56fee37a0d	[FRONTEND] Fix benchmark plotting (#2177 )	2023-08-24 20:34:04 -07:00
Greg Brockman	64d8df4c69	[FRONTEND] handle errors from launch_enter_hook (#2178 )	2023-08-24 20:32:01 -07:00
Shantanu	7083dae4f2	[FRONTEND] drop the GIL around more CUDA ops (#2173 )	2023-08-24 20:31:38 -07:00
Zahi Moudallal	120ce0a5bf	[DOCS] Fixing docs (#2175 )	2023-08-24 15:58:59 -07:00
jayfurmanek	ff7e707f87	Enable usage of block pointer semantics for AMD gpus (#301 ) * Enable usage of block pointer semantics for AMD gpus This commit enables usage of block pointer semantics by enabling rewrite_tensor_pointer_pass that rewrites block pointer loads/stores to legacy loads/stores. * Update FA fwd in tutorial to use the block pointers * use 90 compute capability for amd gpus in python/triton/compiler/compiler.py Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com> --------- Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com> Co-authored-by: Lixun Zhang <lixun.zhang@amd.com> Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com> Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>	2023-08-24 13:05:12 -05:00
chengjunlu	6cb67185f8	[FRONTEND] To use proper default num_warps and num_stages based on the device backend in JITFunction (#2130 ) The default values used by JITFunction for num_warps and num_stages are coupled with Nvidia GPU architecture. We should use the proper default values based on the device backend for the kernel to be compiled to. 1. Add two functions to return the default num_warps and num_stages for the specific device backend. 2. JITFunction uses the proper default num_warps and num_stages based on the specific device backend. Co-authored-by: Wang Weihan <eikan.wang@intel.com>	2023-08-24 21:58:18 +08:00
Bin Fan	dad83f9dcb	[TOOLS] Add support for autotuning AOT kernel (#2123 ) This PR makes the following changes to the AOT kernel - Allow the client to generate AOT kernels with different sets of constexprs and meta-parameters. Each combination of constexpr set and meta-parameters is referred to as an "algo". Within an algo the client can still give different hints about integer arguments. - Add an API int ${kernel_name}_get_num_algos() that returns the total number of algos. - Add an algo_id to allow the client of the generated kernel to select the algo - Remove gX, gY and gZ from the kernel parameter list. This is because the launch grid is usually different with different algos, and the client should not need to care about how to compute the launch grid for each algo. Instead, we ask the client to pass the expression for computing gX, gY and gZ to compile.py (when AOT kernels are generated). The expression can only use kernel parameters or const values. - We also change the testing flow. Now we first build the kernels into a shared library libkernel.so, then the client test.c code is built and linked with libkernel.so. This is closer to a typical AOT kernel usage flow.	2023-08-23 09:38:29 -07:00
Zahi Moudallal	5282ed890d	[CI] Add back pre-commit to nvidia CI job (#2159 )	2023-08-23 01:11:03 +00:00
Keren Zhou	6a65c894fe	[TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153 )	2023-08-22 18:02:26 -07:00
Keren Zhou	5fa1fa1b27	[FRONTEND] Emit warning if the result of `tl.advance` is unused (#2155 ) https://github.com/openai/triton/issues/2138	2023-08-22 18:02:00 -07:00
danny.jang	c4a9006340	[FRONTEND] fix a typo (#2152 )	2023-08-22 13:02:31 -04:00
ivanyinwz	ec801ce18e	[BACKEND] Optimize performance for f16 epilogue with TMA store (#2135 ) 1. Optimize the conversion and packing for 2xf32 -> 2xf16. 2. Split TMA store block into multiple slices of size 64x64. 3. Distribute the TMA store to all the warps. 4. Fix some naming issues.	2023-08-21 12:44:11 -07:00
Philippe Tillet	ea8416164f	[FRONTEND] name mangling fixup (#2148 )	2023-08-21 12:11:52 -07:00
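
Note on 491eb9ddfe (`tritongpu-stream-pipeline`): the prologue / pipelined loop / epilogue structure described in that commit message can be illustrated with a small host-side sketch. This is not the MLIR pass itself. The function names `stream_pipelined`, `load`, and `compute`, and the variables `shared` and `regs`, are hypothetical stand-ins for the global loads, the non-load loop body, shared memory, and registers.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def stream_pipelined(
    tiles: Sequence[T],
    load: Callable[[T], T],
    compute: Callable[[T], R],
) -> List[R]:
    """Software-pipelined loop: load tile i+1 while computing on tile i."""
    results: List[R] = []
    if not tiles:
        return results

    # Prologue: hoist the first load and the "shared memory" store.
    shared = load(tiles[0])

    # Pipelined loop: the loop body minus the last iteration.
    for i in range(len(tiles) - 1):
        regs = load(tiles[i + 1])        # prefetch next tile into "registers"
        results.append(compute(shared))  # non-load loop body on current tile
        shared = regs                    # store next tile into "shared memory"

    # Epilogue: peeled non-load loop body for the last iteration.
    results.append(compute(shared))
    return results


def naive(tiles, load, compute):
    """Reference loop without pipelining: load, then compute, per tile."""
    return [compute(load(t)) for t in tiles]
```

For any side-effect-free `load` and `compute`, both functions return the same list; the pipelined form only reorders the loads relative to the compute steps.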


1401 Commits
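
Note on 10f59d8ce0 (correct end idx for regular arguments): the commit message says the length of the signature does not indicate the maximum index of regular arguments when specialized or `constexpr` arguments are interleaved with them. A minimal sketch of that distinction follows, assuming the signature is modeled as a mapping from argument position to a type string. The helper names and the type strings are illustrative, not taken from the Triton source.

```python
from typing import Dict


def end_idx_from_length(signature: Dict[int, str]) -> int:
    """Mistaken assumption: regular arguments occupy positions 0..len-1."""
    return len(signature)


def end_idx_from_max_key(signature: Dict[int, str]) -> int:
    """One past the largest position that holds a regular argument."""
    return max(signature) + 1 if signature else 0


# Hypothetical kernel(b, a: constexpr, c = 1 (specialized), d, e: constexpr):
# only positions 0 (b) and 3 (d) remain regular arguments.
signature = {0: "*fp32", 3: "i32"}
```

With this mapping, the length-based end index is 2, which would cut off the argument at position 3, while the max-key-based end index is 4.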