The call to `coalesceOp` deletes the op it is processing and
replaces it with a new one. We can't `dyn_cast` the `curr` pointer
afterwards because it is dangling at that point.
Co-authored-by: Philippe Tillet <phil@openai.com>
- cancels CI runs in progress when a PR is updated
- atomics tests now use small int values that can be represented exactly
- replaced some old-style string formatting with f-strings
- Dependent CUDA files (ptxas, cuda.h, libdevice.bc.10) are now packaged in
`triton/third_party/cuda`. `ptxas` is downloaded from conda repo at
install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
The literal syntax can give minor performance bumps compared to function
calls when creating a dict, list, or tuple: the name `dict` must be looked
up in the global scope in case it has been rebound, and the same goes for
`list()` and `tuple()`.
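As a minimal sketch, the difference shows up in the bytecode (on CPython; only stdlib names are used here):

```python
import dis

# Sketch: literal construction compiles to a single opcode (e.g. BUILD_MAP
# on CPython), while dict() needs a global name lookup plus a call, since
# `dict` could have been rebound.
def with_literal():
    return {}

def with_call():
    return dict()

dis.dis(with_literal)  # shows BUILD_MAP
dis.dis(with_call)     # shows LOAD_GLOBAL (dict) + a call opcode

# Both build an equal (empty) dict; the literal just skips the lookup/call.
assert with_literal() == with_call() == {}
```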
Signed-off-by: nishantsikarwar <nsikarwar@ch.iitr.ac.in>
Co-authored-by: Philippe Tillet <phil@openai.com>
- Rewrite the AxisInfo analysis to handle each op case by case.
- Add bit shift, min max, div/rem, and select ops to AxisInfo.
- Rematerialize across load/store ops in the following two cases:
  - A size-1 tensor is not considered expensive, since all threads load
    the same value
  - the targetEncoding may expose more vectorization opportunities (more
    elements per thread along the first dim)
**_res2next_** benchmark GPU Kernel time comparison on A100.
- Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns.
**1.016x slowdown**.
- Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.
**~1.0001x slowdown** (effectively parity).
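For reference, the ratios above can be recomputed from the raw timings:

```python
# Recomputing the slowdown ratios quoted above from the raw timings (ns).
avg_triton, avg_mlir = 16_838_630, 17_105_166
tot_triton, tot_mlir = 6_511_735_460, 6_512_370_620

avg_slowdown = avg_mlir / avg_triton   # ~1.016x
tot_slowdown = tot_mlir / tot_triton   # ~1.0001x, i.e. parity
print(f"average: {avg_slowdown:.3f}x, total: {tot_slowdown:.4f}x")
```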
Previous https://github.com/openai/triton/pull/1113 forgot to consider
that a node may have multiple parents; visiting an instruction before all
of its parents have been visited violates the semantics of topological
sort.
The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
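The scheme above can be sketched as a plain Kahn-style sort (illustrative names only, not Triton's actual API):

```python
from collections import deque

# Sketch of the fixed scheme: an op becomes "ready" only once *all* of
# its operands (parents) have been visited, so a node with multiple
# parents can never be emitted too early.
def topo_sort(ops, parents):
    """ops: list of node ids; parents: dict mapping node -> set of parent ids."""
    remaining = {op: set(parents.get(op, ())) for op in ops}
    ready = deque(op for op, deps in remaining.items() if not deps)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for other, deps in remaining.items():
            if op in deps:
                deps.remove(op)
                if not deps:        # all parents visited -> now ready
                    ready.append(other)
    return order
```

A diamond graph (one child with two parents) exercises exactly the case PR 1113 got wrong: the child must wait for both parents.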
1. try to search for llvm-config up to the llvm-config-17 version
2. MLIR_DIR expects LLVM_LIBRARY_DIR, so set it to
LLVM_LIBRARY_DIRS, which is output by FindLLVM
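The versioned-binary probing in step 1 amounts to something like the following sketch (the real logic lives in the build scripts; the function name and version floor here are illustrative):

```python
import shutil

# Sketch: prefer versioned llvm-config binaries, newest first, up to
# llvm-config-17, falling back to the unversioned name.
def find_llvm_config(max_version=17, min_version=11):
    for v in range(max_version, min_version - 1, -1):
        path = shutil.which(f"llvm-config-{v}")
        if path:
            return path
    return shutil.which("llvm-config")
```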
Co-authored-by: Philippe Tillet <phil@openai.com>
- **temporarily commenting out an assertion in `MemBar.cpp`. We need to
fix this! But for now the following patches will unblock a number of
users.**
- Fixed a frontend codegen issue for If / For / While: emit an error when
replaced values' types mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed an issue in random.py: bitcast some values to uint when needed.
- Added support for `Not`
- Fixed nondeterministic compilation issue
This fix enables support on sm_90 (otherwise it crashes).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring processor)

can be ignored; they should disappear once the LLVM NVPTX backend is
updated.
Using range(len(...)) is not pythonic.
Python does not have index-based loops; instead, it uses collection
iterators. Python has a built-in function, enumerate, which adds a
counter to an iterable. Using it, you can access the counter and the
value from the iterable at the same time. It is therefore recommended to
replace range(len(...)) with enumerate(...).
For example:
5bcf60a5c0/python/triton/language/extern.py (L68)
f62d556fff/python/triton/language/extern.py (L68)
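A minimal sketch of the recommended refactor:

```python
# Sketch: replacing range(len(...)) with enumerate(...).
args = ["x", "y", "z"]

# Before (not pythonic): index-based iteration.
pairs_old = []
for i in range(len(args)):
    pairs_old.append((i, args[i]))

# After: enumerate yields the counter and the value together.
pairs_new = list(enumerate(args))

assert pairs_old == pairs_new  # [(0, 'x'), (1, 'y'), (2, 'z')]
```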
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Use i32 as the storage type for <2xi16> and <4xi8>, as NVPTX inserts
extra integer instructions for vector int types.
Performance before this PR: (8192x8192x8192-TN input)
bf16: 222 TFLOPS
i8: 339 TOPS
After this PR:
bf16: 272 TFLOPS
i8: 548 TOPS
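The storage-type change amounts to packing the small vector elements into one 32-bit word; the sketch below illustrates the idea with pure-Python bit math (not Triton's actual codegen):

```python
# Sketch: storing a <2 x i16> vector in a single i32 word, low element in
# the low 16 bits (layout assumed for illustration).
def pack2xi16(lo, hi):
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)

def unpack2xi16(word):
    return word & 0xFFFF, (word >> 16) & 0xFFFF

w = pack2xi16(0x1234, 0xABCD)
assert unpack2xi16(w) == (0x1234, 0xABCD)
```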
Otherwise it fails with
```
File "setup.py", line 147, in build_extension
"-DLLVM_EXTERNAL_LIT=" + lit_dir,`
TypeError: can only concatenate str (not "NoneType") to str
```
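A hedged sketch of the kind of guard that avoids this error (the actual fix in setup.py may differ):

```python
# Only pass -DLLVM_EXTERNAL_LIT when lit_dir was actually found;
# concatenating None to a str is what raised the TypeError above.
lit_dir = None  # e.g. the lit binary was not found on this machine
cmake_args = []
if lit_dir is not None:
    cmake_args.append("-DLLVM_EXTERNAL_LIT=" + lit_dir)
```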
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
The `TritonAnalysis` target depends on `TritonTableGen` because it
includes `Triton/IR/Dialect.h`, which itself includes
`Triton/IR/Dialect.h.inc`, a file generated by `TritonTableGen`.
Without an explicit dependency, the build can fail intermittently due to
multithreaded builds.
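A one-line sketch of the explicit dependency that fixes this (target names taken from the text above; the actual CMakeLists change may differ):

```cmake
# Ensure the tablegen'd headers exist before TritonAnalysis sources compile.
add_dependencies(TritonAnalysis TritonTableGen)
```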
When `TRITON_BUILD_PYTHON_MODULE` is set to `OFF`, the
`triton` target ends up having no sources, which triggers a CMake
configuration error.
Fix this by only generating the target when building with python module
support.
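A sketch of the guard (variable and target names from the description above; the source list is hypothetical):

```cmake
# Only create the `triton` target when the Python module is being built,
# so the target never ends up with an empty source list.
if(TRITON_BUILD_PYTHON_MODULE)
  add_library(triton SHARED ${PYTHON_SRC})  # hypothetical source variable
endif()
```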
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR adds an UpdateMmaForVolta pass to help update the MMA encoding
for Volta.
Some context is given in https://github.com/openai/triton/pull/1014
# Changes
1. Moving the related MMAv1 patterns from the GPUCombine pass to the
UpdateMmaForVolta pass,
2. Updating both the versionMinor and warpsPerCTA fields for Volta MMA
encodings, since they can only be determined after the GPUCombine pass,
3. Moving the FixupLoop pattern from Combine.cpp to new
Utility.h/.cpp files,
4. Adding an ID field (5 bits, storing an integer) to versionMinor to
help assign a unique ID (on Volta) to each MMA encoding. The reasoning
is as follows:
- Currently, there is a cyclic dependency between {DotOperand, Slice}
and MMA layouts; we use a map to cluster all the DotOperand,
Slice, and MMA layout instances into the same group for further updating
in bulk
- When there are multiple DotOps in a module with the same (literally
equivalent) MMA, it is possible to get the wrong groups
- An ID field helps identify the MMAs coming from different DotOps,
thus placing all the MMA, DotOperand, and Slice layout instances in the
right groups
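The 5-bit ID field can be sketched with plain bit math (the exact position of the field inside versionMinor is assumed here, not Triton's actual encoding):

```python
# Sketch: stashing a 5-bit MMA ID in the low bits of versionMinor so
# literally-equivalent MMA encodings from different DotOps stay distinct.
ID_BITS = 5
ID_MASK = (1 << ID_BITS) - 1  # 0b11111

def set_mma_id(version_minor, mma_id):
    assert 0 <= mma_id <= ID_MASK, "ID must fit in 5 bits"
    return (version_minor & ~ID_MASK) | mma_id

def get_mma_id(version_minor):
    return version_minor & ID_MASK

vm = set_mma_id(0, 3)
assert get_mma_id(vm) == 3
```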