Conflicts:
lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
lib/Target/LLVMIR/LLVMIRTranslation.cpp
python/test/unit/language/assert_helper.py
python/triton/third_party/cuda/bin/ptxas
test/Conversion/tritongpu_to_llvm.mlir
It looks like you may be committing a merge.
If this is not correct, please remove the file
.git/MERGE_HEAD
and try again.
This exposes `semantic.expand_dims` in the public API and builds upon it
with support for expanding multiple dimensions at once. e.g.
```python
tl.expand_dims(tl.arange(0, N), (0, -1)) # shape = [1, N, 1]
```
Compared to indexing with `None`, this API is useful because the
dimensions can be constexpr values rather than hard-coded into the
source. As a basic example
```python
@triton.jit
def max_keepdim(value, dim):
res = tl.max(value, dim)
return tl.expand_dims(res, dim)
```
The current main would fail on `math.scalbn` because we implicitly cast
the first argument from `int32` to `float32`, while the function only
accepts `int32` as the first argument and `float32` as the second
argument.
So we update the type matching logic as follows:
1. Check if there's a type tuple that matches the types of the input
arguments
2. If yes, we don't allow arithmetic check.
3. If not, we will do arithmetic check to implicitly cast types among
arguments.
4. If we still don't find a corresponding function that accepts the
casted types, throwing an error.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
`tl.reduction` is currently tested indirectly through the existing
reduction operators, but it's good to have a direct test for the
function itself.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
A small oversight in #1305, since `view` can rearrange elements it
should be avoided here. Instead I use indexing with `None` to create new
dimensions.
Co-authored-by: Philippe Tillet <phil@openai.com>
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
component, but at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend supports multiple devices in
a more modular way.
This PR introduces a new semantics: **block pointer**, which makes users
easier & faster to load a block from a parent tensor.
Below is a detailed API change by an example:
```
# Make a block pointer, which points to a block in the parent shape
# `base`: the parent tensor
# `shape`: the shape of the parent tensor
# `strides`: the strides of the parent tensor
# `offsets`: the offsets of the block in the parent tensor
# `order`: the order of the data arrangement in memory
# Below is an example loading a 2D column-major matrix
block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
# Advance the offsets; note that the striding information is already saved in `block_ptr`
# `base`: the block pointer to be advanced
# `offsets`: the offsets for each dimension
block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N))
block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N))
# Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32>
# `ptr`: the block pointer to be loaded
# `boundary_check`: a tuple of dimensions to check the boundary
# `padding`: padding strategy for elements out of bound
val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero")
# Store by a block pointer, in which the pointer and the value tensor should have the same shape
# `ptr`: the block pointer to be stored
# `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound)
tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1))
```
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This generates identical PTX for floating point, but for integer types
the resulting PTX is much better. For example `tl.abs` for int16
currently generates
```mlir
cvt.s32.s16 %r1, %rs2;
neg.s16 %rs4, %rs2;
setp.lt.s32 %p4, %r1, 0;
selp.b16 %rs3, %rs4, %rs2, %p4;
```
After, it becomes a single `abs.s16` instruction.
This also improves LLVM's ability to optimize floats. e.g. `abs(t) *
abs(t)` is optimized to `t * t` now which didn't happen before.
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
This PR;
- Fixes syntax errors like `.type values: dict[str,
Callable[[list[Any]], Any]]` to `:type values: dict[str,
Callable[[list[Any]], Any]]`,
- Fixes typos,
- Fixes formatting like `k ++` to ` k++`,
- Increases consistency (e.g. by transforming the minority `cd dir/` to
the majority `cd dir`).
Fix issue https://github.com/openai/triton/issues/244
Check `end` is greater than `start`.
Check if the range can fit in `int32`.
Check the number of elements less than or equal to
`TRITON_MAX_TENSOR_NUMEL = 131072`.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
- **temporarily commenting assertion in `MemBar.cpp`. We need to fix
this! but for now the following patches will unblock a number of
users.**
- Fixed frontend codegen issue for If / For / While. Emit an error when
replaced values' type mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue