On some machines, the amount of available RAM might not be enough to
compile Triton with `2 * num_cpus` parallelism. For example, CircleCI's
`large` instance can't handle Triton compilation as is due to
insufficient memory.
Instead, I propose taking PyTorch's approach, where a
[`MAX_JOBS` env
var](0e4ddc2b40/tools/setup_helpers/cmake.py (L366-L368))
lets the user reduce (or increase) the parallelism during compilation.
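A minimal sketch of how such an override could look (illustrative, not the exact Triton build code):
```python
# Minimal sketch (illustrative): cap build parallelism with MAX_JOBS,
# keeping 2 * num_cpus as the default.
import multiprocessing
import os

default_jobs = 2 * multiprocessing.cpu_count()
max_jobs = int(os.environ.get("MAX_JOBS", default_jobs))
build_args = ["--", f"-j{max_jobs}"]  # forwarded to the underlying make/ninja
```
With this, e.g. `MAX_JOBS=4 pip install -e .` would compile with four jobs.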
Co-authored-by: Philippe Tillet <phil@openai.com>
While merging `triton-mlir`, it seems that the libdevice tutorial was
missed. This PR adds it back and updates it to the current `tl.math`
interface.
Also found a bug in `test_core.py`: the `extern_libs` argument should
still pass `libdevice`, otherwise my added test fails. The legacy code
didn't fail because `lib_path` was `None` and therefore ignored.
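A minimal sketch of what the updated tutorial exercises (the function choice and library path are illustrative):
```python
import torch
import triton
import triton.language as tl

@triton.jit
def asin_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, tl.math.asin(x), mask=mask)

x = torch.rand(1024, device="cuda")
y = torch.empty_like(x)
# `extern_libs` must still map "libdevice" to a bitcode file (the bug this
# PR fixes in `test_core.py`); the path is illustrative.
asin_kernel[(triton.cdiv(x.numel(), 256),)](
    x, y, x.numel(), BLOCK=256,
    extern_libs={"libdevice": "/path/to/libdevice.10.bc"})
```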
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
One possible optimization for kernel launch overhead: avoid running
`hasattr` and `isinstance` for each argument by adding type hints to the
kernel definition. Also added a regression unit test to make sure we keep
the launch overhead within an expected range.
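A minimal sketch (illustrative, not this PR's code) of a kernel whose annotated arguments let the launcher skip the per-argument type probing:
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr,
               n: tl.int32,           # annotated scalar: no hasattr/isinstance probing
               BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```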
This PR introduces new semantics: the **block pointer**, which makes it
easier and faster for users to load a block from a parent tensor.
Below is a detailed walkthrough of the API changes by example:
```python
# Make a block pointer, which points to a block in the parent shape
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D row-major matrix
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))
# Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")
# Store through a block pointer; the pointer and the value tensor must have the same block shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```
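Putting the pieces together, a minimal end-to-end sketch (the kernel is illustrative; argument names follow the snippet above) that copies one tile through block pointers:
```python
import triton
import triton.language as tl

@triton.jit
def copy_tile(src_ptr, dst_ptr, M, N, stride_m, stride_n,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    src = tl.make_block_ptr(base=src_ptr, shape=(M, N),
                            strides=(stride_m, stride_n), offsets=(0, 0),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    dst = tl.make_block_ptr(base=dst_ptr, shape=(M, N),
                            strides=(stride_m, stride_n), offsets=(0, 0),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    # Out-of-bound reads are zero-padded; out-of-bound writes are dropped.
    val = tl.load(src, boundary_check=(0, 1), padding="zero")
    tl.store(dst, val, boundary_check=(0, 1))
```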
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This generates identical PTX for floating point, but for integer types
the resulting PTX is much better. For example, `tl.abs` on `int16`
currently generates
```ptx
cvt.s32.s16 %r1, %rs2;
neg.s16 %rs4, %rs2;
setp.lt.s32 %p4, %r1, 0;
selp.b16 %rs3, %rs4, %rs2, %p4;
```
After this change, it becomes a single `abs.s16` instruction.
This also improves LLVM's ability to optimize floating-point code: e.g.
`abs(t) * abs(t)` is now optimized to `t * t`, which didn't happen before.
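For reference, a minimal sketch (illustrative, not from the PR) of an `int16` kernel that exercises the improved lowering:
```python
import triton
import triton.language as tl

@triton.jit
def abs_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)          # int16 elements
    tl.store(y_ptr + offs, tl.abs(x), mask=mask)  # now a single abs.s16
```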
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
This PR is the first in a series of PRs to import the changes that we
have made to enable ROCm on [our
fork](https://github.com/ROCmSoftwarePlatform/triton) of Triton.
The PR contains the major changes to the Python frontend and enough
changes to the C++ backend to allow compiling and running the empty
kernel. We use the ROCm CI added a few weeks ago to verify things.
---------
Co-authored-by: Ronan Keryell <ronan@keryell.fr>
The exponent compensation should be 0x3800 (112) instead of 0x3000 (96).
Also add a mantissa rounding bit so that float16 conversion rounds to the
nearest float8e5m2 value.
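A sketch of the bias arithmetic behind the constant (my reading, not stated in the PR):
```python
# float32 uses exponent bias 127 and float8e5m2 uses bias 15, so rebiasing
# needs a compensation of 127 - 15 = 112 (0x70); 96 would be off by 16.
FP32_EXP_BIAS = 127
FP8E5M2_EXP_BIAS = 15
assert FP32_EXP_BIAS - FP8E5M2_EXP_BIAS == 112 == 0x70
```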
Co-authored-by: Philippe Tillet <phil@openai.com>
- Fixed `CompilationError._format_message` failing when `error_message`
is a `constexpr`
- Factored out `_is_constexpr()` checks and `_unwrap_if_constexpr()`
idioms
- Added `UnsupportedLanguageConstruct` exception and replaced some Python
builtin exceptions raised in such cases
- Some hardening in `.visit_If()`
- Cleaner exception handling in `build_triton_ir()`
Contents of this change to `CodeGenerator`:
- addressed mutable default value in constructor (GitHub #1353)
- structured and faster name lookup (replaces `.get_value`)
- added informative error messages in some places
- tidy mechanism for "static" (compile-time) functions replaces the
inline `if ... elif ...` chain in `.visit_Call`
- more robust `static_assert` and `static_print`
- more informative `CompilationError` display (saves scrolling up
through long tracebacks)
- dedicated `CompileTimeAssertionFailure` exception for `static_assert`,
which can be specially treated upstream by `Autotuner` to skip
configurations that violate constraints (as for `OutOfResources`); see
the sketch after this list
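A hypothetical sketch of that upstream treatment (the loop and helpers are assumptions, not this PR's code):
```python
# Hypothetical: skip autotuning configs whose static_asserts fail at
# compile time, just like configs that exhaust hardware resources.
timings = {}
for config in configs:                            # hypothetical candidates
    try:
        candidate = compile_with(kernel, config)  # hypothetical helper
    except CompileTimeAssertionFailure:
        continue  # violates a tl.static_assert; skip this config
    except OutOfResources:
        continue  # exceeds shared memory/register limits; skip
    timings[config] = benchmark(candidate)        # hypothetical
```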
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR:
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to `k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
This addition allows explanatory messages upon assertion failures:
```python
@triton.jit
def my_single_block_kernel(
matrix_extent: tl.constexpr,
block_size: tl.constexpr, # must be >= extent (single block)
matrix: Tensor,
...
):
tl.static_assert(matrix_extent <= block_size,
f"`matrix_extent` should not be more than the block size ({block_size}), but is {matrix_extent}")
```
Yielding, when called incorrectly:
```
AssertionError: `matrix_extent` should not be more than the block size (32), but is 57
```
This PR applies a minor patch that removes unnecessary masks in
`_dsd_kernel()`.
### Details
`offs_bn` is defined as follows and is not updated afterwards.
```py
offs_bn = pid_m * TILE_N + tl.arange(0, TILE_N)
offs_bn = tl.max_contiguous(tl.multiple_of(offs_bn % DS0, TILE_N), TILE_N)
```
Because `offs_bn` has already been reduced modulo `DS0`, the mask below is always `True`.
```py
b = tl.load(pb, mask=offs_bn[None, :] < DS0)
```
This PR removes this mask (as well as explicit `mask=True`).
Fixed `JITFunction.__init__` to mark args as constexpr only when the
annotation is actually `tl.constexpr`, rather than treating any
annotated arg as constexpr.
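A hypothetical sketch of the corrected check (the helper is illustrative, not the actual `JITFunction` code):
```python
import inspect
import triton.language as tl

def constexpr_positions(fn):
    # An arg is constexpr only if annotated *as* tl.constexpr; merely
    # having some annotation (e.g. `n: tl.int32`) no longer counts.
    sig = inspect.signature(fn)
    return [i for i, p in enumerate(sig.parameters.values())
            if p.annotation is tl.constexpr]
```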
When the user sets `LLVM_SYSPATH` to use a custom LLVM build, an error
is thrown because there is no version.txt under the custom build.
This PR skips the version check if `LLVM_SYSPATH` is set.
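A hypothetical sketch of the skip (the function name is assumed, not the actual build script):
```python
import os

def check_llvm_version(llvm_dir: str) -> None:
    if "LLVM_SYSPATH" in os.environ:
        return  # custom LLVM build: no version.txt to check
    with open(os.path.join(llvm_dir, "version.txt")) as f:
        expected = f.read().strip()
    # ... compare `expected` against the required LLVM version ...
```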
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This is cherry-picked from #1305
If you call a `JITFunction` twice in the same kernel, first with `int32`
and then with `uint32`, the second call will treat the unsigned value as
signed. This passes through MLIR without error because MLIR uses the same
underlying type for both, but signed and unsigned values lower to
different operations, so you may silently get the wrong result.
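A minimal sketch (kernel and helper are hypothetical) of the pattern that used to miscompile:
```python
import triton
import triton.language as tl

@triton.jit
def halve(x):
    return x // 2  # signed and unsigned division lower differently

@triton.jit
def kernel(i_ptr, u_ptr, i_out, u_out):
    tl.store(i_out, halve(tl.load(i_ptr)))  # int32 call site
    # Previously reused the int32 lowering, treating the value as signed:
    tl.store(u_out, halve(tl.load(u_ptr)))  # uint32 call site
```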