Commit Graph

584 Commits

Author SHA1 Message Date
Rohit Santhanam
cd9ae1cd36 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023 2023-02-23 21:41:54 +00:00
Chenggang Zhao
b5efa91e2a [Backend] Fix a bug in swizzling store (#1235)
The function calculates the swizzled address to **store** (not load), so
we should use `outOrder` instead of `inOrder`. Current tests do not
cover this case, but at NVIDIA we have an `sm_90`-related case that
could trigger it. Already discussed in the Slack channel with @Jokeren.
2023-02-22 19:13:21 -08:00
Keren Zhou
6a9316e69a [BACKEND] Clean up SCF -> CF conversion (#1234) 2023-02-22 23:49:47 +00:00
Philippe Tillet
0ec277efc5 [OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232)
This shouldn't actually change the behavior of Triton -- only clean things up.
2023-02-22 13:54:55 -08:00
Yu Guo
19228d88bc [FRONTEND][BACKEND] add env variable TRITON_LIBDEVICE_PATH (#1166)
We may compile kernels on remote machines that do not have a local
`libdevice.10.bc`.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-02-21 20:15:12 +00:00
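The override described in the commit above can be sketched as follows. This is a hypothetical illustration of honoring an env-variable path override, not Triton's actual loader code; the default path and the helper name are assumptions.

```python
import os

# Assumed default location; real installs may differ.
DEFAULT_LIBDEVICE = "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc"

def resolve_libdevice_path() -> str:
    """Prefer TRITON_LIBDEVICE_PATH when set, else fall back to a default."""
    override = os.environ.get("TRITON_LIBDEVICE_PATH")
    if override:
        return override
    return DEFAULT_LIBDEVICE

# On a remote machine without a local libdevice, point at a copied file:
os.environ["TRITON_LIBDEVICE_PATH"] = "/remote/toolchain/libdevice.10.bc"
print(resolve_libdevice_path())
```

When the variable is unset, the resolver simply falls back to the default path, so local builds are unaffected.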
Rohit Santhanam
4b56eb2fd4 Performance enhancements for AMDGPU atomics.
Add "agent" syncscope specification to prevent large performance loss
for gfx90a.

Add LLVM function attributes to enable fp32 atomic adds for archs that
support it.
2023-02-20 06:28:00 +00:00
Keren Zhou
123c687ed9 [BACKEND] Rewrite Membar to fit the CF dialect (#1213) 2023-02-19 14:54:33 -08:00
Rohit Santhanam
841784d1e3 Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head 2023-02-18 09:25:20 +00:00
Christian Sigg
9ef4b5d773 Rebase to LLVM-head. (#1200)
Rebase to
37b7a60cd7
2023-02-17 13:16:11 -08:00
Rohit Santhanam
1d8fd49254 Remove the unsafe tmpnam call to generate the HSACO file name. 2023-02-17 13:09:26 +00:00
Goran Flegar
3b72ebd199 [BACKEND] correctly propagate address spaces through GEP (#1207)
While the incorrect propagation happens to work at the moment, it will
cause a validation error once we rebase Triton closer to LLVM head,
since validation for some LLVM Dialect ops has gotten stricter.

Specifically, if we remove the shared memory space attribute, a
subsequent bitcast tries to add it back, which is illegal.
2023-02-17 11:19:03 +00:00
Christian Sigg
fc7a8e3581 Rebase Triton to LLVM-15. (#1070)
This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are
mechanical, except for the analysis framework changes.
2023-02-16 06:40:53 -08:00
Philippe Tillet
e3941f9d09 [OPTIMIZER][BACKEND] Cleaned up Volta codegen (#1185) 2023-02-14 22:39:35 -08:00
Philippe Tillet
8bca84ce3d [OPTIMIZER] Bugfix in Combine.cpp ; Added trans support in Pipeline.cpp (#1174) 2023-02-14 13:36:44 -08:00
rsanthanam-amd
2ec42ea37b Merge pull request #117 from ROCmSoftwarePlatform/fix_sramecc_xnack_warnings_navi21
Fix warnings on some AMDGPU archs (e.g., navi21)
2023-02-14 07:37:42 -06:00
Chao Chen
c0a8c72343 Update the function to get full arch details and compile with those details instead of hardcoded values 2023-02-14 12:59:26 +00:00
Keren Zhou
6413c7b9de [BACKEND] Calculate correct warp ids for small matrices (#1180)
Fixing https://github.com/openai/triton/issues/1162

Added tests for 16x16x16.
2023-02-14 05:28:03 +00:00
rsanthanam-amd
44f69bea81 Merge pull request #113 from ROCmSoftwarePlatform/triton-mlir-IFU-02112023
Triton mlir ifu 02112023
2023-02-13 09:26:10 -06:00
rsanthanam-amd
ec387d5bf4 Merge pull request #109 from dfukalov/dfukalov/work-3
[ROCM] Enable part of tl.dot operations.
2023-02-12 13:50:20 -06:00
Daniil Fukalov
a6596fc634 [ROCM] Enable part of tl.dot operations.
The change enables the fall-through FMA path for ROCm. It works for
the float32 type, but not for all tensor sizes. The change switches
off reporting MMA and async-op support to avoid NV inline-asm
generation.
2023-02-12 17:25:48 +01:00
Rohit Santhanam
a2416e0901 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023 2023-02-11 14:48:19 +00:00
Philippe Tillet
3fa8a5a864 [OPTIMIZER] Fixed load/store rematerialization (#1177) 2023-02-11 01:21:10 -08:00
Nikita Shulga
2d4370bc9f [LINKER] search for libdevice relative to shared library (#1176) 2023-02-11 02:24:33 +00:00
Philippe Tillet
2aba985daa [OPTIMIZER] Improved layout simplifications heuristics (#1168) 2023-02-09 20:17:25 -08:00
Keren Zhou
c61c8a123f [BACKEND] Disallow the CombineSelectMaskedLoad pattern if conditions of select and broadcast are different (#1170) 2023-02-09 18:03:22 -05:00
Philippe Tillet
0cbe368fe5 [OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169) 2023-02-09 18:43:33 +00:00
Yan Chunwei
850f808b55 [Backend] Split declarations and definitions of functions in DotOpHelpers.h (#1163)
This is the remaining piece of the former Backend File Partition work.
2023-02-08 15:07:02 +08:00
Philippe Tillet
3cfa474d97 [OPTIMIZER] Remove dead code (#1160) 2023-02-06 23:02:58 -08:00
Philippe Tillet
1b31ad997f [OPTIMIZER] Improved memory coalescing heuristics (#1159) 2023-02-06 22:32:28 -08:00
Keren Zhou
681d04cf2b [BACKEND] Fix axisInfo analysis for div ops (#1157) 2023-02-07 02:25:23 +00:00
Keren Zhou
546f2377ae [BACKEND] Get the right operand and result types in forward rematerialization passes (#1152) 2023-02-04 16:34:35 -08:00
Yu Guo
474ed978b9 [BUILD] Fixed typo in CMake type tablegen (#1124) 2023-02-03 18:46:11 -08:00
Mehdi Amini
ce6d74e0b6 [BACKEND] Fix crash in test/TritonGPU/coalesce.mlir (#1148)
The call to `coalesceOp` is deleting the op it is processing and
replacing it with a new one. We can't `dyn_cast` the `curr` pointer
because it is dangling at this point.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-02-04 02:40:27 +00:00
Keren Zhou
bde52f9db2 [BACKEND] Fix alignment calculation (#1149)
`getDivisibility` represents whether the address in bytes is divisible
by a certain number, so we should convert `#aligned bytes` to `#aligned
elements`.
2023-02-03 17:20:23 -08:00
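The byte-to-element conversion in the fix above can be sketched with plain integer arithmetic. This is a minimal illustration, not Triton's actual code; the helper name and element sizes are assumptions.

```python
def aligned_elements(aligned_bytes: int, elem_size_bytes: int) -> int:
    """Convert a divisibility guarantee in bytes to one in elements.

    An address divisible by 16 bytes is divisible by 16 // 4 = 4
    four-byte (fp32) elements. If the byte alignment is not a multiple
    of the element size, no whole-element alignment can be guaranteed.
    """
    if aligned_bytes % elem_size_bytes != 0:
        return 1
    return aligned_bytes // elem_size_bytes

print(aligned_elements(16, 4))  # 16-byte alignment -> 4 fp32 elements
```

Mixing the two units up (treating 16-byte alignment as 16-element alignment) overstates vectorization opportunities, which is the kind of miscalculation the commit corrects.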
Philippe Tillet
43798ab27e [BUILD] Restored wheels workflow (#1146)
- Dependent CUDA files (ptxas, cuda.h, libdevice.10.bc) are now packaged in
`triton/third_party/cuda`. `ptxas` is downloaded from the conda repo at
install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
2023-02-03 16:22:10 -08:00
Rohit Santhanam
8cb6ab5b1a Merge remote-tracking branch 'upstream/main' into triton_mlir_IFU_02022023 2023-02-02 22:54:53 +00:00
Keren Zhou
82befe32ad [BACKEND] Improve torch inductor performance (#1108)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is considered not expensive since all threads will
load the same element.
  - The targetEncoding may expose more vectorization opportunities (more
elements per thread on the first dim).

**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
2023-02-01 18:21:15 -08:00
Keren Zhou
1ec39fdf99 [BACKEND] Refactored the MoveConvertOutOfIf conversion to handle scf.if correctly (#1114)
Also removed duplicate code for `simulateBackwardRematerialization`.
2023-02-01 08:49:19 -08:00
Keren Zhou
5dd8ce3745 [BACKEND] Fix topological sort and add new test cases (#1132)
The previous https://github.com/openai/triton/pull/1113 forgot to consider
that a node may have multiple parents; visiting an instruction before all
of its parents have been visited violates the semantics of topological sort.

The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
2023-01-31 23:41:20 -08:00
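The fix described above can be sketched as a Kahn-style sort. This is a simplified model, not Triton's implementation: ops are keyed by name, and `operands` maps each op to the ops it depends on.

```python
from collections import deque

def multi_root_topo_sort(operands: dict) -> list:
    # operands: op -> list of ops it depends on (its "parents").
    # An op enters the "ready" queue only once *all* of its operands have
    # been visited -- exactly the invariant the buggy version broke for
    # nodes with multiple parents.
    remaining = {op: set(deps) for op, deps in operands.items()}
    ready = deque(op for op, deps in remaining.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for other, deps in remaining.items():
            if op in deps:
                deps.remove(op)
                if not deps:  # last unvisited parent just got visited
                    ready.append(other)
    return order

# A diamond: d depends on both b and c, which both depend on a.
graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
order = multi_root_topo_sort(graph)
print(order)  # d appears only after both b and c
```

In the diamond, a sort that releases `d` after seeing just one of `b` or `c` would emit it too early; tracking the full remaining-operand set prevents that.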
Philippe Tillet
8fea1fb478 [FRONTEND] Adding static range (#1130)
Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a
custom implementation (#1113)"
2023-01-31 18:04:19 -08:00
Philippe Tillet
c4b9d699d2 [FRONTEND][BACKEND] Fixed many bugs (#1122)
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to fix
this! But for now the following patches will unblock a number of
users.**
- Fixed a frontend codegen issue for If / For / While. Emit an error when
replaced values' types mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
2023-01-30 23:22:36 -08:00
goostavz
3e8d83b7cc Minor fix to support sm_90 (#1125)
This fix enables support on sm_90 (otherwise it will crash).

Logs like

> 'sm_90' is not a recognized processor for this target (ignoring processor)

can be ignored and should be eliminated with the update of the LLVM NVPTX
backend.
2023-01-31 14:08:02 +08:00
Michael Melesse
bfee8247f5 fix ops 2023-01-30 14:22:51 -06:00
Michael Melesse
a9f955f862 Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-2023-30-1 2023-01-30 14:04:01 -06:00
Keren Zhou
bc8a26d56f [BACKEND] Replace mlir::topologicalSort with a custom implementation (#1113)
`multiRootTopologicalSort` is faster than `mlir::topologicalSort`
because it prunes nodes that have already been visited.
2023-01-29 18:57:21 -08:00
Keren Zhou
5bcf60a5c0 [BACKEND] Refactored the code to no longer include static functions in header files. (#1109) 2023-01-28 14:58:28 -08:00
Da Yan
82f5e988be [OPTIMIZER] Improve bf16 and i8 matmul performance (#1107)
Use i32 as the storage type for <2xi16> and <4xi8>, as NVPTX inserts
extra integer instructions for vector int types.

Performance before this PR: (8192x8192x8192-TN input)
bf16: 222 TFLOPS
i8:   339 TOPS

After this PR:
bf16: 272 TFLOPS
i8:   548 TOPS
2023-01-27 22:13:14 +00:00
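The storage-type trick above can be illustrated with plain integer arithmetic. This is only a sketch of the packing idea; the real change is in which LLVM types NVPTX sees, not in user-level code.

```python
def pack2xi16(lo: int, hi: int) -> int:
    """Store two unsigned 16-bit values in one 32-bit word, so the backend
    manipulates a single i32 instead of a <2 x i16> vector."""
    assert 0 <= lo < (1 << 16) and 0 <= hi < (1 << 16)
    return (hi << 16) | lo

def unpack2xi16(word: int):
    """Recover the two 16-bit halves from the packed i32."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

w = pack2xi16(0x1234, 0xABCD)
print(hex(w))          # 0xabcd1234
print(unpack2xi16(w))  # (4660, 43981) == (0x1234, 0xABCD)
```

Because the packed word is a scalar integer, the backend emits plain i32 moves and shifts rather than the extra integer instructions NVPTX inserts for vector int types.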
Da Yan
394f2e6991 [OPTIMIZER] improved prefetch width (#1106)
Before this PR: 16
After this PR: 16 (fp16/bf16), 32 (int8/fp8), 8 (tf32)
The new prefetch width works better with i8/f8/tf32 tensor cores.
2023-01-27 17:41:49 +00:00
binarman
3d447534ec - removed semicolon
- removed redundant newEmptyOperand method
2023-01-24 22:42:36 +01:00
binarman
7cd3c89a15 [Test] Fix ctest tests
Fixed GcnAsmFormatTest.basic and GcnAsmFormatTest.complexInstruction tests
- Added lost "off" operand
- Added semicolon at the end of instructions
- Removed redundant comma between args and modifiers
2023-01-24 22:42:36 +01:00