We have had occasional reports of a zombie Python process holding this lock. We don't need the lock, since renames are atomic on POSIX, so refactor this to make temp files unique and then use replace
(https://docs.python.org/3/library/os.html#os.replace).
Fixes #1545
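For reference, a minimal sketch of the write-to-temp-then-replace pattern this describes (the helper name and payload are illustrative, not the actual Triton code):
```python
import os
import tempfile

def atomic_write(target: str, data: bytes) -> None:
    # The temp file must be unique and on the same filesystem as the
    # target, so that os.replace() is a single atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(target) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, target)  # atomic on POSIX; no lock needed
    except BaseException:
        os.unlink(tmp_path)
        raise
```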
`build_temp` is a temporary directory which `distutils` used to keep in
the `./build` directory, but when `pyproject.toml` is present `pip` now
puts it in `/tmp` and removes it at the end of the build.
Instead, this creates a new permanent directory like
`python/build/cmake.linux_x86_64-cpython-3.8` (the old name but with
cmake instead of temp).
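For illustration, one way such a stable directory name could be derived (the helper name and exact naming scheme are my assumptions, not necessarily what this change does):
```python
import sys
import sysconfig

def get_cmake_dir():
    # Mimic distutils' old "build/temp.<plat>-<impl>" naming, but with
    # "cmake" instead of "temp", rooted in a permanent location.
    plat = sysconfig.get_platform().replace("-", "_")  # e.g. linux_x86_64
    py = f"cpython-{sys.version_info.major}.{sys.version_info.minor}"
    return f"python/build/cmake.{plat}-{py}"
```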
While I was looking at the verbose pip output, I also noticed a bunch of
warnings like
```
Python recognizes 'triton/runtime.backends' as an importable package,
but it is not listed in the `packages` configuration of setuptools.
'triton/runtime.backends' has been automatically added to the distribution only
because it may contain data files, but this behavior is likely to change
in future versions of setuptools (and therefore is considered deprecated).
```
So I've also added these to the packages list.
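As a minimal sketch (using setuptools' conventional dotted names; only `triton.runtime.backends` comes from the warning quoted above, the rest is abbreviated):
```python
from setuptools import setup

setup(
    name="triton",
    packages=[
        "triton",
        "triton.runtime",
        "triton.runtime.backends",  # previously only auto-discovered
        # ... the other subpackages flagged by similar warnings
    ],
)
```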
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Change the usage of the LRU cache decorator from `@functools.lru_cache` to `@functools.lru_cache()`.
The former raises `TypeError: Expected maxsize to be an integer or None` on Python 3.7 and older.
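A minimal demonstration of the difference:
```python
import functools

# Works only on Python 3.8+, where a bare lru_cache is allowed:
@functools.lru_cache
def square_new(x):
    return x * x

# Portable form; on Python 3.7 and older the bare decorator raises
# TypeError: Expected maxsize to be an integer or None.
@functools.lru_cache()
def square_old(x):
    return x * x
```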
This reduces build time with an assertions-enabled LLVM and dramatically speeds up Triton's build with a "debug" LLVM.
Co-authored-by: Philippe Tillet <phil@openai.com>
`tl.reduction` is currently tested indirectly through the existing
reduction operators, but it's good to have a direct test for the
function itself.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
When running Python 3.8, I've found that process creation gets slower
over time (e.g. after creating a CUDA context, each `subprocess.run` can
take 50-300 ms), and we make one of these calls to `ptxas --version` for
every kernel, so a model with thousands of kernels can end up spending
substantial time just calling `ptxas` redundantly.
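A minimal sketch of the obvious fix, caching the lookup per process (function name assumed, not necessarily the PR's exact change):
```python
import functools
import subprocess

@functools.lru_cache()
def ptxas_version() -> str:
    # One subprocess.run per interpreter, no matter how many kernels compile.
    result = subprocess.run(["ptxas", "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```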
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR contains:
- Several fixes for matrix multiplication (the M and N dimensions may
have out-of-bounds accesses; see the sketch below)
- A type check for block-based stores
- A tutorial for block pointers
- Some formatting fixes
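As a rough illustration of the out-of-bounds concern, a boundary-checked block-pointer load/store (a hypothetical kernel, not the tutorial's code):
```python
import triton
import triton.language as tl

@triton.jit
def copy_tile(in_ptr, out_ptr, M, N, stride_m, stride_n,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    src = tl.make_block_ptr(base=in_ptr, shape=(M, N),
                            strides=(stride_m, stride_n), offsets=(0, 0),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    dst = tl.make_block_ptr(base=out_ptr, shape=(M, N),
                            strides=(stride_m, stride_n), offsets=(0, 0),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    # boundary_check masks rows/columns outside (M, N) when the block
    # shape does not evenly divide the tensor shape.
    tile = tl.load(src, boundary_check=(0, 1))
    tl.store(dst, tile, boundary_check=(0, 1))
```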
A small oversight in #1305: since `view` can rearrange elements, it
should be avoided here. Instead, I use indexing with `None` to create new
dimensions.
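A minimal illustration of the substitution (a hypothetical kernel):
```python
import triton
import triton.language as tl

@triton.jit
def add_axis(x_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # tl.view may reorder elements, so don't use it to add dimensions;
    # indexing with None inserts a size-1 axis and preserves order.
    y = x[:, None]  # shape (BLOCK, 1)
    tl.store(out_ptr + offs[:, None], y)
```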
Co-authored-by: Philippe Tillet <phil@openai.com>
MLIR currently only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.
https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3
We can revert this once they've designed a better inliner interface.
Inlining attributes will be implemented in a follow-up PR, since this one
is already huge.
When the language model grows very large, axis 1 of the original grid
shape (`c.shape[1]`, corresponding to the number of nonzero elements in
the layout) can exceed 65535, the CUDA limit for that axis, which results
in `[CUDA]: invalid argument`.
This PR moves axis 1 of the original grid to axis 0, whose limit is
2^31 - 1.
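A hedged sketch of the launch-side change (names are illustrative):
```python
def launch(kernel, c, *args):
    # gridDim.y is capped at 65535, while gridDim.x may be up to 2**31 - 1,
    # so the potentially huge nonzero-count dimension goes on axis 0.
    grid = (c.shape[1], c.shape[0])  # was (c.shape[0], c.shape[1])
    kernel[grid](*args)
```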
Thank you for taking the time to review this PR :)
The changes here come in a few separate bits:
- Allow replacing the cache manager via an environment variable, to make
it pluggable (see the sketch after this list).
- Make the `make_path` API private, since it leaks some internal bits of
the cache and allows direct file access. Use a get operation instead.
- The `compile` operation produces several small files as part of a
single compile pipeline, which can perform poorly with remote caches.
Also, some operations like `_triton.get_shared_memory_size` only work
when everything is cached or nothing is; they segfault if only some key
files are missing. Grouping these files as a single entity avoids that.
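A minimal sketch of the env-based pluggability (the variable name and `module:Class` format are assumptions, not necessarily the actual interface):
```python
import importlib
import os

class FileCacheManager:
    """Stand-in for the default on-disk cache manager."""

def get_cache_manager_class():
    # e.g. TRITON_CACHE_MANAGER="my_pkg.cache:RemoteCacheManager"
    spec = os.environ.get("TRITON_CACHE_MANAGER")
    if spec is None:
        return FileCacheManager
    module_name, _, class_name = spec.partition(":")
    return getattr(importlib.import_module(module_name), class_name)
```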
The `triton/compiler`, `triton/runtime/driver`, and `triton/third_party`
subpackages were missing from the distribution built with the old
`setup.py` after #1464, causing an immediate error upon importing Triton
with a non-editable installation. This change adds the missing Python
subpackages and moves `triton/third_party` inclusion to `MANIFEST.in`,
where it will automatically be included in wheels due to the existing
`include_package_data` setup flag.
The Autotuner is a handy utility. By allowing external access to the
Autotuner, users can override some of its functions (e.g., `run`) to
load/store best configurations, initialize tensors based on
configuration values, and change the benchmarking criterion (e.g., based
on bytes transferred instead of time).
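A minimal sketch of the kind of customization this enables (the import location and `self.cache` attribute are assumptions):
```python
from triton.runtime import Autotuner  # assumed location of the exposed class

class PersistentAutotuner(Autotuner):
    def run(self, *args, **kwargs):
        # e.g. restore previously saved best configs into self.cache here ...
        ret = super().run(*args, **kwargs)
        # ... and persist self.cache (the best-config table) afterwards.
        return ret
```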
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it's at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
On some machines, the amount of available RAM might not be enough to
compile Triton with `2 * num_cpus` parallelism. For example, CircleCI's
`large` instance can't handle Triton compilation as is due to
insufficient memory.
Instead, I propose taking PyTorch's approach, where a [`MAX_JOBS` env
var](0e4ddc2b40/tools/setup_helpers/cmake.py (L366-L368)) gives the user
the possibility to reduce (or increase) the parallelism during
compilation.
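A minimal sketch of the proposed knob (helper name assumed):
```python
import os

def build_parallelism() -> int:
    # MAX_JOBS wins if set; otherwise keep the current 2 * num_cpus default.
    max_jobs = os.environ.get("MAX_JOBS")
    if max_jobs is not None:
        return int(max_jobs)
    return 2 * os.cpu_count()
```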
Co-authored-by: Philippe Tillet <phil@openai.com>