oplavsic
e801638b40
Add waves_per_eu as kernel parameter ( #319 )
...
* Add waves_per_eu as kernel parameter
* Fix failing tests
* Add default value for waves_per_eu for ttgir_to_llir function
* Remove aot.py
2023-10-06 12:08:34 -05:00
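The `waves_per_eu` hint caps how many registers each wave may use so the requested number of waves fits on one execution unit. A minimal pure-Python sketch of that trade-off; the register totals and rounding granularity here are illustrative assumptions, not values from the commit:

```python
# Hedged sketch (not ROCm's actual formula): how a waves_per_eu hint can
# translate into a per-wave VGPR budget on an AMD EU. Numbers are illustrative.
def max_vgprs_per_wave(waves_per_eu, total_vgprs=512, granularity=8):
    """Cap per-wave VGPRs so `waves_per_eu` waves fit on one EU."""
    budget = total_vgprs // waves_per_eu
    # register allocations come in fixed-size chunks, so round down
    return budget - budget % granularity

print(max_vgprs_per_wave(2))  # 256
print(max_vgprs_per_wave(3))  # 168
```

Raising `waves_per_eu` trades registers per wave for occupancy, which can help latency-bound kernels.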
Shucai Xiao
8049891ff7
fix ifu gemm perf regression ( #348 )
2023-10-04 08:45:18 -05:00
Aleksandr Efimov
e6f75d05e3
fix sum_reduction lit test in Conversion/tritongpu_to_llvm.mlir testsuite
2023-10-03 16:13:13 +00:00
Aleksandr Efimov
90a15e449e
ROCM IFU: Fix tritongpu_to_llvm lit test
2023-10-03 04:31:03 +00:00
Jason Furmanek
92edee723b
ROCM IFU: Fix getValueLivenessRange
2023-10-03 04:30:28 +00:00
Aleksandr Efimov
336c4b5f3c
ROCM IFU: Fix LDS overflow issues in test_dot
2023-10-03 04:30:09 +00:00
wenchenvincent
42a5bf9c7c
ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. ( #335 )
2023-10-03 04:30:01 +00:00
Aleksandr Efimov
634f66a090
ROCM IFU: Fix emitOffsetForMfmaLayout function
2023-10-03 04:29:54 +00:00
Aleksandr Efimov
78faa65dbd
ROCM IFU: Fix of dot operand type promotion
...
ROCM IFU: Fix formatting
2023-10-03 04:29:29 +00:00
Aleksandr Efimov
bae0e4527c
ROCM IFU: Add new CTALayout parameter to mfma layout
2023-10-03 04:29:21 +00:00
Jason Furmanek
e5d7bb4fae
Initial commit to resolve merge conflicts
...
rename tl.float8e4 to tl.float8e4nv to align with upstream
ROCM IFU: Fix python arch issues
ROCM IFU: Fix kernel launcher
ROCM IFU: Fix merge conflicts
fix debug build
Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754
Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
...
Conflicts:
.gitignore
bin/triton-translate.cpp
include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Conversion/TritonGPUToLLVM/Utility.h
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/IR/Dialect.cpp
lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/test/unit/runtime/test_subproc.py
python/triton/compiler/compiler.py
python/triton/compiler/make_launcher.py
python/triton/language/semantic.py
python/triton/runtime/jit.py
python/tutorials/06-fused-attention.py
test/Conversion/triton_to_tritongpu.mlir
test/Conversion/tritongpu_to_llvm.mlir
test/TritonGPU/coalesce.mlir
unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2
[Stream] Fixed bug in stream-pipeline for FA ( #345 )
...
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0
* * updated configs
2023-09-29 20:20:55 -05:00
Shucai Xiao
6e82aa8dbc
support gemm fp8/fp16 mixed input ( #333 )
...
* changes to support fp8/fp16 mixed inputs
* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
SJW
0a7b1c7c12
[MLIR] Fixed support for mixed data-types in stream-pipeline ( #329 )
...
* [MLIR] Fixed support for mixed data-types in stream-pipeline
* added test
* * fixed test
* * cleanup
* * consolidated code
* * fixed build error
2023-09-26 21:26:50 -05:00
SJW
4db99e0139
[Alloc] Enhanced SharedMem Allocation for mutually exclusive but aliased buffers ( #337 )
...
* [Alloc] Enhanced for mutually exclusive but aliased buffers
- Use disjoint alias analysis to minimize shared memory requirements
* * fix for allocation test
* * added test
* fixed mfma_enc printer
* * fixed test
2023-09-25 20:09:33 -05:00
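The idea behind the allocation change above can be sketched in plain Python: buffers whose live ranges never overlap in time may share the same shared-memory offset. This is a toy greedy packer, not Triton's actual `Allocation.cpp` logic:

```python
# Hedged sketch of liveness-based shared-memory packing: buffers with
# disjoint live ranges may be assigned the same offset.
def allocate(buffers):
    """buffers: list of (size, start, end) half-open live ranges.
    Returns (offsets, total_size)."""
    placed = []   # (offset, size, start, end)
    offsets = []
    for size, start, end in buffers:
        # candidate offsets: 0 and the end of every placed buffer
        candidates = sorted([0] + [off + sz for off, sz, _, _ in placed])
        for cand in candidates:
            # conflict only if buffers overlap in BOTH time and space
            ok = all(not (s < end and start < e and
                          off < cand + size and cand < off + sz)
                     for off, sz, s, e in placed)
            if ok:
                placed.append((cand, size, start, end))
                offsets.append(cand)
                break
    total = max(off + sz for off, sz, _, _ in placed)
    return offsets, total

# Two buffers alive at different times share offset 0:
offs, total = allocate([(100, 0, 5), (100, 5, 10)])
print(offs, total)  # [0, 0] 100
```

A real implementation would additionally consult alias analysis, as the commit notes, so that aliased views of the same buffer are not double-counted.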
Aleksandr Efimov
7af5e42fbe
review fix: fix semantics of chooseMfmaDimensions func
2023-09-25 10:56:44 -05:00
Alexander Efimov
5ac8c7afc1
change to the comment on kWidth parameter
2023-09-25 10:56:44 -05:00
Aleksandr Efimov
d80cd2d374
[MFMA] Change kWidth parameter semantics
...
This PR changes kWidth semantics from "elements per instruction" to
"elements per thread per instruction" along the k axis.
2023-09-25 10:56:44 -05:00
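The semantics change is a small arithmetic distinction. The shape and wave size below are illustrative examples, not values stated in the commit:

```python
# Illustrative arithmetic for the kWidth semantics change: old kWidth
# counted elements per *instruction* along k; new kWidth counts elements
# per *thread* per instruction along k.
wave_size = 64
m, n, k_per_instr = 32, 32, 8              # e.g. an mfma_f32_32x32x8 shape
threads_along_k = wave_size // m           # lanes left over after covering m
old_kwidth = k_per_instr                   # whole-instruction count
new_kwidth = k_per_instr // threads_along_k  # per-thread count
print(old_kwidth, new_kwidth)  # 8 4
```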
Lixun Zhang
23465f3416
Add assert for shuffleType and default value for laneid
2023-09-12 10:16:44 -05:00
Lixun Zhang
5972eafbc9
Use ceil to align with upstream
2023-09-12 10:16:44 -05:00
Lixun Zhang
1c653dc438
Move shfl_up impl into commonShflSync
2023-09-12 10:16:44 -05:00
Lixun Zhang
74ea0c87de
Generalize warpSize
...
- We have to change the API of shflUpSync to pass laneId to the ROCm
implementation of shfl_up
- We also distinguish laneIdAxis from laneId
2023-09-12 10:16:44 -05:00
Lixun Zhang
ea397b49aa
Fix the issue when CTA coverage is larger than the tile
2023-09-12 10:16:44 -05:00
Lixun Zhang
ed20089bc8
Add shfl_up implementation for AMD backend
...
copied from f58b93693b/include/hc.hpp (L2879-L2885)
2023-09-12 10:16:44 -05:00
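The shuffle the commit ports can be modeled in plain Python: lane `i` reads the value from lane `i - delta`, never crossing the boundary of its `width`-sized segment, and keeps its own value when there is no source lane. This models the semantics only, not the AMD intrinsics:

```python
# Hedged Python model of shfl_up semantics for a wave of lanes.
def shfl_up(values, delta, width=64):
    out = []
    for lane, v in enumerate(values):
        src = lane - delta
        segment_start = (lane // width) * width
        if src >= segment_start:
            out.append(values[src])  # read from the lower lane
        else:
            out.append(v)            # no source: keep own value
    return out

print(shfl_up([0, 1, 2, 3], 1, width=4))  # [0, 0, 1, 2]
```

This is the building block for warp-level inclusive scans, which is why the scan-related commits above need it on the AMD backend.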
Alexander Efimov
6691de65db
[MFMA] Support BFloat16 on MI100 ( #295 )
...
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
2023-09-08 15:08:34 -05:00
SJW
491eb9ddfe
[MLIR] Added tritongpu-stream-pipeline pass ( #305 )
...
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and shared memory store
for the ramp up stage
- Pipelined Loop: Assemble the loop body minus last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* * updated comment
2023-09-07 15:24:59 -05:00
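The prologue / pipelined-loop / epilogue structure described above can be sketched in plain Python (a stand-in for the MLIR transformation, with `sum` playing the role of the compute and list indexing playing the role of the loads):

```python
# Hedged sketch of stream pipelining: hoist the first load (prologue),
# prefetch the next tile while computing on the previous one in the loop,
# and peel the last compute into an epilogue.
def pipelined_sum(tiles):
    if not tiles:
        return 0
    acc = 0
    current = tiles[0]            # prologue: ramp-up load
    for i in range(1, len(tiles)):
        nxt = tiles[i]            # prefetch next tile from "global"
        acc += sum(current)       # compute on the previously loaded tile
        current = nxt             # "store" next tile for the next iteration
    acc += sum(current)           # epilogue: peeled last iteration
    return acc

print(pipelined_sum([[1, 2], [3, 4]]))  # 10
```

The payoff on hardware is that the prefetch overlaps with the compute, hiding global-memory latency.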
Wen Chen
076a04d5eb
[ROCM] Optimized int8 to bf16 conversion by not reusing FpToFpOpConversion::convertFp32ToBf16.
...
Changed the lit test rules for vectorized int8 to bf16 conversion on
ROCm as ROCm has a different implementation.
2023-09-07 17:26:43 +00:00
Wen Chen
ffc230ebfe
[ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm.
2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182
[ROCM] Added ROCm support for int8 to bfloat16 conversion.
2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72
[ROCM] Added ROCm support for the conversions of following data types:
...
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
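One of the listed conversions has a particularly simple bit-level structure: float8e5m2 shares fp16's 5-bit exponent, so widening is a left shift of the byte into the high byte of a 16-bit value. The snippet below models the math only, not the ROCm codegen:

```python
import struct

# Hedged illustration: fp8e5m2 (1 sign, 5 exp, 2 mantissa bits) widens to
# fp16 (1/5/10) by shifting the byte into the top of the 16-bit pattern.
def fp8e5m2_to_fp16_bits(b):
    return b << 8

def fp16_bits_to_float(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

# 0x3C is 0b0_01111_00: exponent 15 (bias 15), mantissa 0 -> 1.0
print(fp16_bits_to_float(fp8e5m2_to_fp16_bits(0x3C)))  # 1.0
```

The float8e4m3 variants have a different exponent width (and, for b15, a different bias), so those conversions need re-biasing rather than a plain shift.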
Aleksandr Efimov
751edfb3b9
[BACKEND] Fix fma mixed-precision
...
This is partial cherry-pick of https://github.com/openai/triton/pull/2184
Dropped code unrelated to the dot fix.
2023-09-05 21:16:50 +00:00
Aleksandr Efimov
591681d36e
Revert "[Dot] Fix FMA fp16xfp16 dot ( #315 )"
...
This reverts commit 11752a6993.
2023-09-05 21:12:56 +00:00
Alexander Efimov
11752a6993
[Dot] Fix FMA fp16xfp16 dot ( #315 )
...
Disable reorder of FMA dot arguments for amd gpu.
2023-09-05 20:52:08 +00:00
Keren Zhou
c0f418bcdd
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-09-05 20:46:31 +00:00
Aleksandr Efimov
2f7ead6f3b
Fix subprocess tests for IFU
...
This PR changes the printf ttgir -> llvm conversion: an unknown location
is assigned to the global constant holding the format string.
This fixes a problem in the test_subprocess.py tests,
which failed while constructing the file location for format string constants.
2023-09-05 20:46:04 +00:00
Corbin Robeck
007bea9994
Add bitcode writer to AMDGCN hsaco output
2023-09-01 04:02:29 +00:00
Jason Furmanek
df5c263a19
Fix merge conflicts
2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18
Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
...
Conflicts:
include/triton/Dialect/Triton/Transforms/Passes.h
include/triton/Dialect/TritonGPU/IR/Dialect.h
include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
lib/Analysis/Allocation.cpp
lib/Analysis/Utility.cpp
lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/src/triton.cc
python/triton/compiler/compiler.py
python/triton/ops/flash_attention.py
python/triton/runtime/autotuner.py
python/triton/runtime/jit.py
python/triton/tools/aot.py
python/tutorials/06-fused-attention.py
test/Conversion/tritongpu_to_llvm.mlir
test/Target/tritongpu_to_llvmir.mlir
test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Philippe Tillet
36fc54b6f2
[BACKEND] cleaner V100 tensor core packing ( #2222 )
2023-08-31 14:00:51 -07:00
Thomas
fff97d864a
[BACKEND] Add support for propagating convert op through while loops ( #2213 )
...
Support forward propagation through while loops.
2023-08-31 09:26:04 -07:00
Philippe Tillet
c4b27d04e3
[BACKEND] Added float8e4b15 codegen for SM < 80 ( #2216 )
2023-08-30 21:57:49 -07:00
Philippe Tillet
ec51552fff
[BACKEND] Lift restriction for float8e4b15 to only support row-col layout ( #2212 )
2023-08-30 14:06:31 -07:00
Thomas
3175ee4ce7
[BACKEND] Handle more cases of folding convert into reduce op ( #2209 )
...
Handle cases of reduce with multiple operands returning scalars
2023-08-30 11:04:38 -07:00
Thomas
2ff88c1368
[BACKEND] Extend hoisting of convert op above ext ops ( #2206 )
...
Handle more cases of hoisting convert above ext ops. If there are
multiple ext ops in the slice but only one requires inserting a convert,
we can still apply the optimization.
2023-08-29 17:36:34 -07:00
Thomas
17d633a64e
[BACKEND] Fix crash when propagating layout and slice axis doesn't match reduce ( #2205 )
2023-08-29 17:11:03 +00:00
Thomas
d4644d6cb3
[BACKEND] Refactor RemoveLayoutConversion pass ( #2181 )
...
Significant changes to the pass logic. Move away from greedy rewrites
and use more global analysis instead. The pass is now broken down into 2
main phases. First, forward propagation of layout starting from ops that
we don't want to change, propagating to all the nodes. If there is a
single layout needed for the op then we can rewrite the op; if there are
multiple layouts required based on dependencies, we need a tie-break.
The second phase is backward propagation, which gets a backward slice of
operations starting from the convert and, if all the operations in the
slice can be rematerialized, rewrites the slice. This backward phase now
supports going through loop arguments.
This will allow more complex logic in the future to add a cost model to
decide which converts to leave and which to fold.
2023-08-28 19:05:16 -07:00
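The forward-propagation phase described in that commit can be sketched over a toy op graph (plain Python, not the actual MLIR pass): layouts flow from fixed "anchor" ops to their users, and an op that receives more than one candidate layout needs a tie-break:

```python
# Hedged sketch of forward layout propagation to a fixed point.
def propagate_layouts(anchors, users):
    """anchors: {op: layout}; users: {op: [user ops]}.
    Returns op -> set of candidate layouts."""
    layouts = {op: {lay} for op, lay in anchors.items()}
    work = list(anchors)
    while work:
        op = work.pop()
        for user in users.get(op, []):
            before = set(layouts.get(user, set()))
            layouts.setdefault(user, set()).update(layouts[op])
            if layouts[user] != before:   # changed: keep propagating
                work.append(user)
    return layouts

lays = propagate_layouts({'load': 'blocked', 'dot': 'mma'},
                         {'load': ['add'], 'dot': ['add'], 'add': ['store']})
print(lays['add'])  # both 'blocked' and 'mma' reach 'add': needs a tie-break
```

The real pass then runs the backward phase from each remaining convert, rematerializing its backward slice when possible.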
Keren Zhou
6e4932cda8
[BACKEND] Fix fma mixed-precision ( #2184 )
...
and expose the allow_tf32 argument to the matmul op
@shunting314
2023-08-26 09:49:58 -07:00
peterbell10
fa03b92109
[OPTIMIZER] Add folder for MakeRangeOp ( #2187 )
...
This folds `tl.arange(x, x + 1)` into a constant. This shows up for
example when autotuning and one of the block sizes gets set to 1.
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-26 16:44:13 +00:00
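The fold described above is a one-element-range check. A pure-Python stand-in for the MLIR folder (the tuple return shape is an illustrative assumption, not Triton's API):

```python
# Hedged model of the MakeRangeOp folder: a range with a single element
# is a constant (splat) rather than a range op.
def fold_make_range(start, end):
    """Return ('constant', value) when the range has one element, else None."""
    if end - start == 1:
        return ('constant', start)
    return None  # not foldable; keep the op

print(fold_make_range(4, 5))   # ('constant', 4)
print(fold_make_range(0, 16))  # None
```

This matters for autotuning because a block size of 1 turns `tl.arange(x, x + 1)` into exactly this foldable pattern.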
Keren Zhou
f6cdcf1d77
[BACKEND] Fix BF16 dot operand type mismatch ( #2162 )
...
https://github.com/openai/triton/issues/2156
2023-08-24 20:32:33 -07:00