The function calculates the swizzled address to **store** (not load), so
we should use `outOrder` instead of `inOrder`. Current tests do not
cover this case, but at NVIDIA we have an `sm_90`-related case that
could trigger it. Already discussed in the Slack channel with @Jokeren.
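For context, shared-memory swizzling typically XORs the column index with a row-dependent phase to avoid bank conflicts; which tensor dimension plays the "row" versus "column" role is decided by the layout order, which is why store-side code must consult the store order. A minimal sketch (parameter names are hypothetical, not Triton's actual implementation):

```python
def swizzled_col(row, col, per_phase, max_phase, vec):
    """Swizzle the column index so consecutive rows hit different banks.

    Hypothetical sketch of an XOR-based swizzle; the row/col roles are
    taken from the layout order of the tensor being stored.
    """
    phase = (row // per_phase) % max_phase  # phase cycles across rows
    group = col // vec                      # vectorized column group
    return (group ^ phase) * vec + col % vec
```

Within each row the mapping permutes the columns, so no data is lost; only the bank assignment changes.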
Add the "agent" syncscope specification to prevent a large performance
loss on gfx90a.
Add LLVM function attributes to enable fp32 atomic adds on
architectures that support them.
While the incorrect version happens to work at the moment, it will
cause a validation error once we rebase Triton closer to LLVM head,
since validation of some LLVM Dialect ops has become stricter.
Specifically, if we remove the shared memory space attribute, a
subsequent bitcast tries to add it back, which is illegal.
The change enables a fall-through FMA path for ROCm. It works for the
float32 type, though not for all tensor sizes. The change switches off
reporting MMA and async-op support to avoid generating NVIDIA inline
asm.
The call to `coalesceOp` deletes the op it is processing and replaces
it with a new one. We can't `dyn_cast` the `curr` pointer because it
is dangling at that point.
Co-authored-by: Philippe Tillet <phil@openai.com>
- Dependent CUDA files (`ptxas`, `cuda.h`, `libdevice.10.bc`) are now packaged in
  `triton/third_party/cuda`. `ptxas` is downloaded from the conda repo at
  install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit-shift, min/max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is considered not expensive, since all threads will
    load the same element.
  - The targetEncoding may expose more vectorization opportunities (more
    elements per thread along the first dimension).
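Per-op case handling in such an analysis typically boils down to small transfer rules, e.g. for divisibility. A sketch (rule names are illustrative, not Triton's actual API):

```python
from math import gcd

# Hypothetical divisibility transfer rules for an AxisInfo-style analysis:
# div(x) is the largest known power-of-two (or general) divisor of x.
def visit_add(div_a, div_b):
    # a + b is divisible by the common divisor of its operands
    return gcd(div_a, div_b)

def visit_mul(div_a, div_b):
    # divisors multiply: (k*m) * (k'*n) is divisible by k*k'
    return div_a * div_b

def visit_select(div_t, div_f):
    # either branch may be chosen, so only the common divisor survives
    return gcd(div_t, div_f)
```

The select rule is the conservative one: taking the gcd of both branches is always sound, even when the condition is unknown.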
**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
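The reported ratios follow directly from the numbers above:

```python
# Recompute the slowdown ratios from the res2next figures quoted above.
triton_avg, mlir_avg = 16838630, 17105166
avg_slowdown = mlir_avg / triton_avg          # ~1.016x
triton_total, mlir_total = 6511735460, 6512370620
total_slowdown = mlir_total / triton_total    # effectively parity
print(f"average: {avg_slowdown:.3f}x, total: {total_slowdown:.4f}x")
```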
The previous https://github.com/openai/triton/pull/1113 forgot to
consider that a node may have multiple parents; visiting the
instruction before all of its parents have been visited violates the
semantics of topological sort.
The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
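The fixed scheme can be sketched as a Kahn-style sort (illustrative, not the actual Triton code): an op becomes "ready" only after *all* of its operands are visited, so a node with multiple parents is never emitted early.

```python
from collections import deque

def topo_sort(ops, operands):
    """ops: list of op names; operands: op -> ops it depends on."""
    remaining = {op: set(operands.get(op, ())) for op in ops}
    ready = deque(op for op, deps in remaining.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        # an op moves to "ready" only once its last operand is visited
        for other, deps in remaining.items():
            if op in deps:
                deps.remove(op)
                if not deps:
                    ready.append(other)
    return order
```

On a diamond graph (`d` depends on both `b` and `c`), `d` is emitted only after both parents, which is exactly the case the previous implementation got wrong.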
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to
  fix this! But for now the following patches will unblock a number of
  users.**
- Fixed a frontend codegen issue for If / For / While. Emit an error
  when replaced values' types mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
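The random.py fix mentioned above amounts to reinterpreting a float's bits as an unsigned integer rather than converting its value; in plain Python the same reinterpretation can be sketched with `struct` (illustrative, not the Triton frontend code):

```python
import struct

def bitcast_f32_to_u32(x):
    """Reinterpret the raw bits of a float32 as a uint32 (a bitcast,
    not a value conversion)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]
```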
This fix enables support on sm_90 (otherwise it crashes).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring
processor)

can be ignored and should disappear once the LLVM NVPTX backend is
updated.
Use i32 as the storage type for <2xi16> and <4xi8>, as NVPTX inserts
extra integer instructions for vector int types.
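The storage trick is plain bit packing: two 16-bit lanes live in one 32-bit word. A sketch of the idea (illustrative; the actual codegen emits LLVM IR):

```python
def pack_2xi16(lo, hi):
    """Pack two 16-bit values into one 32-bit word, mirroring the idea
    of storing <2 x i16> as i32."""
    return (hi & 0xFFFF) << 16 | (lo & 0xFFFF)

def unpack_2xi16(word):
    """Recover the two 16-bit lanes from the packed word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF
```

Because the packed word is an ordinary i32, the backend moves it with scalar integer instructions instead of expanding vector int ops.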
Performance before this PR: (8192x8192x8192-TN input)
bf16: 222 TFLOPS
i8: 339 TOPS
After this PR:
bf16: 272 TFLOPS
i8: 548 TOPS
Fixed the GcnAsmFormatTest.basic and GcnAsmFormatTest.complexInstruction tests:
- Added the missing `off` operand
- Added a semicolon at the end of each instruction
- Removed the redundant comma between args and modifiers