`tl.reduction` is currently tested indirectly through the existing
reduction operators, but it's good to have a direct test for the
function itself.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
When running Python 3.8, I've found that process creation gets slower
over time (e.g. after creating a CUDA context, it can take 50-300ms per
`subprocess.run`). We make one of these calls to `ptxas --version` for
every kernel, so a model with thousands of kernels can end up spending
substantial time just calling ptxas redundantly.
Co-authored-by: Philippe Tillet <phil@openai.com>
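One way to avoid the redundant `ptxas --version` calls described above is to memoize the lookup per executable path. This is a minimal sketch, not necessarily the change made in this PR: `tool_version` is a hypothetical helper name, and the Python interpreter stands in for ptxas so the example runs without CUDA installed.

```python
import functools
import subprocess
import sys


@functools.lru_cache(maxsize=None)
def tool_version(executable: str) -> str:
    """Return the output of `<executable> --version`.

    The subprocess is spawned only once per distinct path; later calls
    with the same path are served from the in-process cache.
    """
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some tools print their version to stderr
        check=True,
    )
    return result.stdout.decode("utf-8").strip()


if __name__ == "__main__":
    # Assumption: sys.executable is used here in place of a real ptxas path.
    print(tool_version(sys.executable))  # spawns a subprocess
    print(tool_version(sys.executable))  # served from the cache
```

The cache is keyed on the path only, so it would need to be cleared with `tool_version.cache_clear()` if the binary at that path were replaced while the process is running.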
This PR contains:
- Several fixes for the matrix multiplication (the M and N dimensions
could have out-of-bounds accesses)
- A type check for block-based store
- The tutorial for block pointers
- Some formatting fixes
A small oversight in #1305: since `view` can rearrange elements, it
should be avoided here. Instead, I use indexing with `None` to create new
dimensions.
Co-authored-by: Philippe Tillet <phil@openai.com>
MLIR currently only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.
https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3
We could revert this once they've designed a better inliner interface.
Inlining attributes will be implemented in the next PR since this PR is
already huge.
[`mlir-reduce`](https://mlir.llvm.org/docs/Tools/mlir-reduce/) is a tool
to reduce the complexity of bug reproducers written in MLIR. As with
`triton-opt`, Triton needs its own version with the dialects
registered properly for it to work.
When the language model grows really large, axis 1 of the original
grid shape (`c.shape[1]`, which corresponds to the number of nonzero
elements in the layout) can exceed the CUDA limit of 65535 for that
axis, which results in `[CUDA]: invalid argument`.
This PR moves axis 1 of the original grid to axis 0, as the limit
for axis 0 is 2^31 - 1.
Thank you for your time reviewing this PR :)
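The effect of the axis swap can be illustrated with a small check against the per-axis grid limits. This is only a sketch: the helper names and the example grid are hypothetical, and the limits are the ones CUDA documents for devices of compute capability 3.0 and later.

```python
from typing import Tuple

# Maximum grid size per axis (x, y, z) on CUDA devices of compute capability 3.0+.
MAX_GRID_DIMS = (2**31 - 1, 65535, 65535)


def grid_fits(grid: Tuple[int, int, int]) -> bool:
    """Return True if every axis of the launch grid is within the CUDA limit."""
    return all(1 <= size <= limit for size, limit in zip(grid, MAX_GRID_DIMS))


def swap_first_two_axes(grid: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Move the potentially huge axis 1 to axis 0, which has a far larger limit."""
    return (grid[1], grid[0], grid[2])


# Hypothetical grid: 4 programs on axis 0, 100_000 nonzero blocks on axis 1.
original = (4, 100_000, 1)
swapped = swap_first_two_axes(original)
print(grid_fits(original))  # False
print(grid_fits(swapped))   # True
```

A kernel launched on the swapped grid has to read its indices from the swapped program-ID axes as well.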
The changes here come in a few separate parts:
- Allow replacing the cache manager via an environment variable to make
it pluggable.
- Make the `make_path` API private, since it leaks some internal details
of the cache and allows file access. Use a get operation instead.
- The `compile` operation produces several small files as part of a
single compile pipeline, which may not perform well with remote caches.
Also, some operations like `_triton.get_shared_memory_size` only work
when either everything is cached or nothing is (or at least some key
files are not); otherwise they segfault. Grouping these files as a
single entity avoids that.
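A pluggable cache manager selected through an environment variable could be resolved along these lines. This is a sketch under stated assumptions, not the code in this PR: the variable name `EXAMPLE_CACHE_MANAGER`, the `module:ClassName` format, and `DefaultCacheManager` are all hypothetical.

```python
import importlib
import os


class DefaultCacheManager:
    """Hypothetical stand-in for the built-in, file-based cache manager."""


def load_cache_manager_class(env_var: str = "EXAMPLE_CACHE_MANAGER"):
    """Resolve the cache manager class named by `env_var`.

    The value is assumed to look like 'module:ClassName'. When the variable
    is unset or empty, the default manager is returned.
    """
    spec = os.environ.get(env_var)
    if not spec:
        return DefaultCacheManager
    module_name, _, class_name = spec.partition(":")
    if not module_name or not class_name:
        raise ValueError(f"{env_var} must look like 'module:ClassName', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

Resolving the class lazily at lookup time, rather than at import time, lets a remote-cache implementation live in a separate package that the compiler never has to import directly.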
The `triton/compiler`, `triton/runtime/driver`, and `triton/third_party`
subpackages were missing from the distribution built with the old
`setup.py` after #1464, causing an immediate error upon importing Triton
with a non-editable installation. This change adds the missing Python
subpackages and moves `triton/third_party` inclusion to `MANIFEST.in`,
where it will automatically be included in wheels due to the existing
`include_package_data` setup flag.
No longer needed now that we initialize all registers. The motivation
for reverting this workaround, now that we can, is that it introduced
performance regressions.
The Autotuner is a handy utility. By allowing external access to the
Autotuner, users can overwrite some functions (e.g., `run`) to
load/store best configurations, initialize tensors based on
configuration values, and change the benchmarking criterion (e.g., based
on bytes instead of time).
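The idea of overriding `run` to load and store the best configuration can be sketched with a toy stand-in. `ToyAutotuner` below is hypothetical and deliberately does not mirror Triton's actual `Autotuner` signature; only the subclassing pattern is the point.

```python
import json
import os
import time
from typing import Callable, Dict, List


class ToyAutotuner:
    """Hypothetical stand-in for an autotuner; NOT Triton's actual class."""

    def __init__(self, fn: Callable[..., object], configs: List[Dict[str, int]]):
        self.fn = fn
        self.configs = configs

    def bench(self, config: Dict[str, int], *args) -> float:
        """Time a single call of `fn` with the given configuration."""
        start = time.perf_counter()
        self.fn(*args, **config)
        return time.perf_counter() - start

    def run(self, *args):
        """Benchmark every configuration, then call `fn` with the fastest one."""
        best = min(self.configs, key=lambda cfg: self.bench(cfg, *args))
        return self.fn(*args, **best)


class PersistentAutotuner(ToyAutotuner):
    """Overrides `run` to load/store the best configuration in a JSON file."""

    def __init__(self, fn, configs, cache_path: str):
        super().__init__(fn, configs)
        self.cache_path = cache_path

    def run(self, *args):
        if os.path.exists(self.cache_path):
            # Reuse the stored result and skip benchmarking entirely.
            with open(self.cache_path) as f:
                best = json.load(f)
        else:
            best = min(self.configs, key=lambda cfg: self.bench(cfg, *args))
            with open(self.cache_path, "w") as f:
                json.dump(best, f)
        return self.fn(*args, **best)
```

Overriding `bench` in the same way would be the place to swap the timing-based criterion for another one, such as bytes moved.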
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.