oplavsic
e801638b40
Add waves_per_eu as kernel parameter ( #319 )
...
* Add waves_per_eu as kernel parameter
* Fix failing tests
* Add default value for waves_per_eu for ttgir_to_llir function
* Remove aot.py
2023-10-06 12:08:34 -05:00
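The `waves_per_eu` hint caps how many registers each wave may use so the requested number of waves fits on one execution unit. A minimal pure-Python sketch of that trade-off; the register totals and rounding granularity here are illustrative assumptions, not values from the commit:

```python
# Hedged sketch (not ROCm's actual formula): how a waves_per_eu hint can
# translate into a per-wave VGPR budget on an AMD EU. Numbers are illustrative.
def max_vgprs_per_wave(waves_per_eu, total_vgprs=512, granularity=8):
    """Cap per-wave VGPRs so `waves_per_eu` waves fit on one EU."""
    budget = total_vgprs // waves_per_eu
    # register allocations come in fixed-size chunks, so round down
    return budget - budget % granularity

print(max_vgprs_per_wave(2))  # 256
print(max_vgprs_per_wave(3))  # 168
```

Raising `waves_per_eu` trades registers per wave for occupancy, which can help latency-bound kernels.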
Shucai Xiao
8049891ff7
fix ifu gemm perf regression ( #348 )
2023-10-04 08:45:18 -05:00
Aleksandr Efimov
e6f75d05e3
fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite
2023-10-03 16:13:13 +00:00
Aleksandr Efimov
90a15e449e
ROCM IFU: Fix tritongpu_to_llvm lit test
2023-10-03 04:31:03 +00:00
Jason Furmanek
92edee723b
ROCM IFU: Fix getValueLivenessRange
2023-10-03 04:30:28 +00:00
Aleksandr Efimov
336c4b5f3c
ROCM IFU: Fix LDS overflow issues in test_dot
2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c
ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. ( #335 )
2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090
ROCM IFU: Fix emitOffsetForMfmaLayout function
2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd
ROCM IFU: Fix of dot operand type promotion
...
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c
ROCM IFU: Add new CTALayout parameter to mfma layout
2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae
Initial commit to resolve merge conflicts
...
rename tl.float8e4 to tl.float8e4nv to align with upstream
ROCM IFU: Fix python arch issues
ROCM IFU: Fix kernel launcher
ROCM IFU: Fix merge conflicts
fix debug build
Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754
Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
...
Conflicts:
.gitignore
bin/triton-translate.cpp
include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Conversion/TritonGPUToLLVM/Utility.h
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/IR/Dialect.cpp
lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/test/unit/runtime/test_subproc.py
python/triton/compiler/compiler.py
python/triton/compiler/make_launcher.py
python/triton/language/semantic.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
test/Conversion/triton_to_tritongpu.mlir
test/Conversion/tritongpu_to_llvm.mlir
test/TritonGPU/coalesce.mlir
unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2
[Stream] Fixed bug in stream-pipeline for FA ( #345 )
...
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0
* * updated configs
2023-09-29 20:20:55 -05:00
Shucai Xiao
6e82aa8dbc
support gemm fp8/fp16 mixed input ( #333 )
...
* changes to support fp8/fp16 mixed inputs
* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
SJW
0a7b1c7c12
[MLIR] Fixed support for mixed data-types in stream-pipeline ( #329 )
...
* [MLIR] Fixed support for mixed data-types in stream-pipeline
* added test
* * fixed test
* * cleanup
* * consolidated code
* * fixed build error
2023-09-26 21:26:50 -05:00
SJW
4db99e0139
[Alloc] Enhanced SharedMem Allocation for mutually exclusive but aliased buffers ( #337 )
...
* [Alloc] Enhanced for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* * fix for allocation test
* * added test
* fixed mfma_enc printer
* * fixed test
2023-09-25 20:09:33 -05:00
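The idea behind the allocation change above can be sketched in plain Python: buffers whose live ranges never overlap in time may share the same shared-memory offset. This is a toy greedy packer, not Triton's actual `Allocation.cpp` logic:

```python
# Hedged sketch of liveness-based shared-memory packing: buffers with
# disjoint live ranges may be assigned the same offset.
def allocate(buffers):
    """buffers: list of (size, start, end) half-open live ranges.
    Returns (offsets, total_size)."""
    placed = []   # (offset, size, start, end)
    offsets = []
    for size, start, end in buffers:
        # candidate offsets: 0 and the end of every placed buffer
        candidates = sorted([0] + [off + sz for off, sz, _, _ in placed])
        for cand in candidates:
            # conflict only if buffers overlap in BOTH time and space
            ok = all(not (s < end and start < e and
                          off < cand + size and cand < off + sz)
                     for off, sz, s, e in placed)
            if ok:
                placed.append((cand, size, start, end))
                offsets.append(cand)
                break
    total = max(off + sz for off, sz, _, _ in placed)
    return offsets, total

# Two buffers alive at different times share offset 0:
offs, total = allocate([(100, 0, 5), (100, 5, 10)])
print(offs, total)  # [0, 0] 100
```

A real implementation would additionally consult alias analysis, as the commit notes, so that aliased views of the same buffer are not double-counted.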
Aleksandr Efimov
7af5e42fbe
review fix: fix semantics of chooseMfmaDimensions func
2023-09-25 10:56:44 -05:00
Alexander Efimov
5ac8c7afc1
change to the comment on kWidth parameter
2023-09-25 10:56:44 -05:00
Aleksandr Efimov
d80cd2d374
[MFMA] Change kWidth parameter semantics
...
This PR changes kWidth semantics from "elements per instruction" to
"elements per thread per instruction" along the k axis.
2023-09-25 10:56:44 -05:00
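The semantics change is a small arithmetic distinction. The shape and wave size below are illustrative examples, not values stated in the commit:

```python
# Illustrative arithmetic for the kWidth semantics change: old kWidth
# counted elements per *instruction* along k; new kWidth counts elements
# per *thread* per instruction along k.
wave_size = 64
m, n, k_per_instr = 32, 32, 8              # e.g. an mfma_f32_32x32x8 shape
threads_along_k = wave_size // m           # lanes left over after covering m
old_kwidth = k_per_instr                   # whole-instruction count
new_kwidth = k_per_instr // threads_along_k  # per-thread count
print(old_kwidth, new_kwidth)  # 8 4
```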
Lixun Zhang
23465f3416
Add assert for shuffleType and default value for laneid
2023-09-12 10:16:44 -05:00
Lixun Zhang
5972eafbc9
Use ceil to align with upstream
2023-09-12 10:16:44 -05:00
Lixun Zhang
1c653dc438
Move shfl_up impl into commonShflSync
2023-09-12 10:16:44 -05:00
Lixun Zhang
74ea0c87de
Generalize warpSize
...
- We have to change the API of shflUpSync to pass laneId to the ROCm
implementation of shfl_up
- We also distinguish laneIdAxis from laneId
2023-09-12 10:16:44 -05:00
Lixun Zhang
ea397b49aa
Fix the issue when CTA coverage is larger than the tile
2023-09-12 10:16:44 -05:00
Lixun Zhang
ed20089bc8
Add shfl_up implementation for AMD backend
...
copied from f58b93693b/include/hc.hpp (L2879-L2885)
2023-09-12 10:16:44 -05:00
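The shuffle the commit ports can be modeled in plain Python: lane `i` reads the value from lane `i - delta`, never crossing the boundary of its `width`-sized segment, and keeps its own value when there is no source lane. This models the semantics only, not the AMD intrinsics:

```python
# Hedged Python model of shfl_up semantics for a wave of lanes.
def shfl_up(values, delta, width=64):
    out = []
    for lane, v in enumerate(values):
        src = lane - delta
        segment_start = (lane // width) * width
        if src >= segment_start:
            out.append(values[src])  # read from the lower lane
        else:
            out.append(v)            # no source: keep own value
    return out

print(shfl_up([0, 1, 2, 3], 1, width=4))  # [0, 0, 1, 2]
```

This is the building block for warp-level inclusive scans, which is why the scan-related commits above need it on the AMD backend.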
Alexander Efimov
6691de65db
[MFMA] Support BFloat16 on MI100 ( #295 )
...
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
2023-09-08 15:08:34 -05:00
SJW
491eb9ddfe
[MLIR] Added tritongpu-stream-pipeline pass ( #305 )
...
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and shared memory store
for the ramp up stage
- Pipelined Loop: Assemble the loop body minus last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* * updated comment
2023-09-07 15:24:59 -05:00
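The prologue / pipelined-loop / epilogue structure described above can be sketched in plain Python (a stand-in for the MLIR transformation, with `sum` playing the role of the compute and list indexing playing the role of the loads):

```python
# Hedged sketch of stream pipelining: hoist the first load (prologue),
# prefetch the next tile while computing on the previous one in the loop,
# and peel the last compute into an epilogue.
def pipelined_sum(tiles):
    if not tiles:
        return 0
    acc = 0
    current = tiles[0]            # prologue: ramp-up load
    for i in range(1, len(tiles)):
        nxt = tiles[i]            # prefetch next tile from "global"
        acc += sum(current)       # compute on the previously loaded tile
        current = nxt             # "store" next tile for the next iteration
    acc += sum(current)           # epilogue: peeled last iteration
    return acc

print(pipelined_sum([[1, 2], [3, 4]]))  # 10
```

The payoff on hardware is that the prefetch overlaps with the compute, hiding global-memory latency.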
Wen Chen
076a04d5eb
[ROCM] Optimized int8 to bf16 conversion by not reusing FpToFpOpConversion::convertFp32ToBf16.
...
Changed the lit test rules for vectorized int8 to bf16 conversion on
ROCm as ROCm has a different implementation.
2023-09-07 17:26:43 +00:00
Wen Chen
ffc230ebfe
[ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm.
2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182
[ROCM] Added ROCm support for int8 to bfloat16 conversion.
2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72
[ROCM] Added ROCm support for the conversions of following data types:
...
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
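One of the listed conversions has a particularly simple bit-level structure: float8e5m2 shares fp16's 5-bit exponent, so widening is a left shift of the byte into the high byte of a 16-bit value. The snippet below models the math only, not the ROCm codegen:

```python
import struct

# Hedged illustration: fp8e5m2 (1 sign, 5 exp, 2 mantissa bits) widens to
# fp16 (1/5/10) by shifting the byte into the top of the 16-bit pattern.
def fp8e5m2_to_fp16_bits(b):
    return b << 8

def fp16_bits_to_float(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# 0x3C is 0b0_01111_00: exponent 15 (bias 15), mantissa 0 -> 1.0
print(fp16_bits_to_float(fp8e5m2_to_fp16_bits(0x3C)))  # 1.0
```

The float8e4m3 variants have a different exponent width (and, for b15, a different bias), so those conversions need re-biasing rather than a plain shift.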
Aleksandr Efimov
751edfb3b9
[BACKEND] Fix fma mixed-precision
...
This is partial cherry-pick of https://github.com/openai/triton/pull/2184
Dropped code unrelated to the dot fix.
2023-09-05 21:16:50 +00:00
Aleksandr Efimov
591681d36e
Revert "[Dot] Fix FMA fp16xfp16 dot ( #315 )"
...
This reverts commit 11752a6993.
2023-09-05 21:12:56 +00:00
Alexander Efimov
11752a6993
[Dot] Fix FMA fp16xfp16 dot ( #315 )
...
Disable reorder of FMA dot arguments for amd gpu.
2023-09-05 20:52:08 +00:00
Keren Zhou
c0f418bcdd
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-09-05 20:46:31 +00:00
Aleksandr Efimov
2f7ead6f3b
Fix subprocess tests for IFU
...
This PR changes the printf ttgir -> llvm conversion: an unknown location
is assigned to the global constant holding the format string.
This fixes a problem in the test_subprocess.py tests,
which failed while constructing the file location for format string constants.
2023-09-05 20:46:04 +00:00
Corbin Robeck
007bea9994
Add bitcode writer to AMDGCN hsaco output
2023-09-01 04:02:29 +00:00
Jason Furmanek
df5c263a19
Fix merge conflicts
2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18
Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
...
Conflicts:
include/triton/Dialect/Triton/Transforms/Passes.h
include/triton/Dialect/TritonGPU/IR/Dialect.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
lib/Analysis/Allocation.cpp
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/triton/compiler/compiler.py
python/triton/ops/flash_attention.py
python/triton/runtime/autotuner.py
python/triton/runtime/jit.py
python/triton/tools/aot.py
python/tutorials/06-fused-attention.py
test/Conversion/tritongpu_to_llvm.mlir
test/Target/tritongpu_to_llvmir.mlir
test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Philippe Tillet
36fc54b6f2
[BACKEND] cleaner V100 tensor core packing ( #2222 )
2023-08-31 14:00:51 -07:00
Thomas
fff97d864a
[BACKEND] Add support for propagating convert op through while loops ( #2213 )
...
Support forward propagation through while loops.
2023-08-31 09:26:04 -07:00
Philippe Tillet
c4b27d04e3
[BACKEND] Added float8e4b15 codegen for SM < 80 ( #2216 )
2023-08-30 21:57:49 -07:00
Philippe Tillet
ec51552fff
[BACKEND] Lift restriction for float8e4b15 to only support row-col layout ( #2212 )
2023-08-30 14:06:31 -07:00
Thomas
3175ee4ce7
[BACKEND] Handle more cases of folding convert into reduce op ( #2209 )
...
Handle cases of reduce with multiple operands returning scalars
2023-08-30 11:04:38 -07:00
Thomas
2ff88c1368
[BACKEND] Extend hoisting of convert op above ext ops ( #2206 )
...
Handle more cases of hoisting convert above ext ops. If there are
multiple ext ops in the slice but only one requires inserting a convert,
we can still apply the optimization.
2023-08-29 17:36:34 -07:00
Thomas
17d633a64e
[BACKEND] Fix crash when propagating layout and slice axis doesn't match reduce ( #2205 )
2023-08-29 17:11:03 +00:00
Thomas
d4644d6cb3
[BACKEND] Refactor RemoveLayoutConversion pass ( #2181 )
...
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into 2
main phases. First, forward propagation of layout starting from ops that
we don't want to change, propagating to all the nodes. If there is a
single layout needed for the op then we can rewrite the op; if there are
multiple layouts required based on dependencies, we need a tie-break.
The second phase is backward propagation, which gets a backward slice of
operations starting from the convert and, if all the operations in the
slice can be rematerialized, rewrites the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future to add a cost model to
decide which converts to leave and which to fold.
2023-08-28 19:05:16 -07:00
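The forward-propagation phase described in that commit can be sketched over a toy op graph (plain Python, not the actual MLIR pass): layouts flow from fixed "anchor" ops to their users, and an op that receives more than one candidate layout needs a tie-break:

```python
# Hedged sketch of forward layout propagation to a fixed point.
def propagate_layouts(anchors, users):
    """anchors: {op: layout}; users: {op: [user ops]}.
    Returns op -> set of candidate layouts."""
    layouts = {op: {lay} for op, lay in anchors.items()}
    work = list(anchors)
    while work:
        op = work.pop()
        for user in users.get(op, []):
            before = set(layouts.get(user, set()))
            layouts.setdefault(user, set()).update(layouts[op])
            if layouts[user] != before:   # changed: keep propagating
                work.append(user)
    return layouts

lays = propagate_layouts({'load': 'blocked', 'dot': 'mma'},
                         {'load': ['add'], 'dot': ['add'], 'add': ['store']})
print(lays['add'])  # both 'blocked' and 'mma' reach 'add': needs a tie-break
```

The real pass then runs the backward phase from each remaining convert, rematerializing its backward slice when possible.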
Keren Zhou
6e4932cda8
[BACKEND] Fix fma mixed-precision ( #2184 )
...
and expose the allow_tf32 argument to the matmul op
@shunting314
2023-08-26 09:49:58 -07:00
peterbell10
fa03b92109
[OPTIMIZER] Add folder for MakeRangeOp ( #2187 )
...
This folds `tl.arange(x, x + 1)` into a constant. This shows up for
example when autotuning and one of the block sizes gets set to 1.
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-26 16:44:13 +00:00
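The fold described above is a one-element-range check. A pure-Python stand-in for the MLIR folder (the tuple return shape is an illustrative assumption, not Triton's API):

```python
# Hedged model of the MakeRangeOp folder: a range with a single element
# is a constant (splat) rather than a range op.
def fold_make_range(start, end):
    """Return ('constant', value) when the range has one element, else None."""
    if end - start == 1:
        return ('constant', start)
    return None  # not foldable; keep the op

print(fold_make_range(4, 5))   # ('constant', 4)
print(fold_make_range(0, 16))  # None
```

This matters for autotuning because a block size of 1 turns `tl.arange(x, x + 1)` into exactly this foldable pattern.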
Keren Zhou
f6cdcf1d77
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-08-24 20:32:33 -07:00