Simple mechanism to run Triton kernels on PyTorch for debugging purposes
(upstreamed from Kernl).
Todo:
- random grid iteration
- support for atomic ops
- more unit tests
- cover new APIs?
Fixes #1545
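A minimal usage sketch: the kernel below is plain Triton; the assumption is that the debug path is toggled via a `TRITON_INTERPRET=1` environment variable, which is how current Triton exposes its interpreter and may differ from the exact switch this change introduces.
```python
# Hedged sketch: run a trivial Triton kernel through the debug/interpreter
# path so it executes eagerly on PyTorch tensors.
# Assumption: the mechanism is enabled via TRITON_INTERPRET=1 (may differ).
import os
os.environ["TRITON_INTERPRET"] = "1"  # must be set before importing triton

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(1024, 256),)](x, y, out, 1024, BLOCK=256)
```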
`build_temp` is a temporary directory which `distutils` used to keep in
the `./build` directory, but when `pyproject.toml` is present `pip` now
puts it in `/tmp` and removes it at the end of the build.
Instead, this creates a new permanent directory like
`python/build/cmake.linux_x86_64-cpython-3.8` (the old name, but with
`cmake` instead of `temp`).
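For reference, here is a rough sketch of how such a persistent, platform-specific build directory can be derived; the helper name and exact formatting are illustrative, not the actual `setup.py` code.
```python
# Illustrative sketch: a persistent cmake build directory named after the
# platform and Python version, e.g. python/build/cmake.linux-x86_64-cpython-3.8
import os
import sys
import sysconfig

def get_cmake_dir():
    plat = sysconfig.get_platform()  # e.g. "linux-x86_64"
    py = f"{sys.implementation.name}-{sys.version_info.major}.{sys.version_info.minor}"
    cmake_dir = os.path.join("python", "build", f"cmake.{plat}-{py}")
    os.makedirs(cmake_dir, exist_ok=True)
    return cmake_dir
```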
While I was looking at the verbose pip output, I also noticed a bunch of
warnings like
```
Python recognizes 'triton/runtime.backends' as an importable package,
but it is not listed in the `packages` configuration of setuptools.
'triton/runtime.backends' has been automatically added to the distribution only
because it may contain data files, but this behavior is likely to change
in future versions of setuptools (and therefore is considered deprecated).
```
So I've also added these to the packages list.
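For context, fixing that warning amounts to listing the subpackages explicitly; a minimal sketch follows (the dotted spellings and the exact package set are illustrative, not the literal Triton list).
```python
# Illustrative setup.py fragment: declare each importable subpackage so
# setuptools stops auto-including it as a data-only directory.
from setuptools import setup

setup(
    name="triton",
    packages=[
        "triton",
        "triton.runtime",
        "triton.runtime.backends",
        # ... remaining subpackages
    ],
)
```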
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
This reduces build time when LLVM is built with assertions enabled, and
dramatically speeds up Triton's build with a "debug" LLVM.
Co-authored-by: Philippe Tillet <phil@openai.com>
The `triton/compiler`, `triton/runtime/driver`, and `triton/third_party`
subpackages were missing from the distribution built with the old
`setup.py` after #1464, causing an immediate error upon importing Triton
with a non-editable installation. This change adds the missing Python
subpackages and moves `triton/third_party` inclusion to `MANIFEST.in`,
where it will automatically be included in wheels due to the existing
`include_package_data` setup flag.
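A minimal sketch of how the two pieces fit together, assuming a `recursive-include` line in `MANIFEST.in`; the package names below are illustrative.
```python
# Illustrative: with include_package_data=True, files matched by MANIFEST.in
# (e.g. a line such as "recursive-include triton/third_party *") are copied
# into the wheel alongside the listed Python subpackages.
from setuptools import setup

setup(
    name="triton",
    packages=["triton", "triton.compiler", "triton.runtime.driver"],
    include_package_data=True,
)
```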
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect:
`triton.compile` still includes a few runtime, architecture-specific
components, but it is much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
On some machines, the amount of available RAM might not be enough to
compile Triton with `2 * num_cpus` parallelism. For example, CircleCI's
`large` instance can't handle Triton compilation as is due to
insufficient memory.
Instead, I propose adopting PyTorch's approach, where a
[`MAX_JOBS` env
var](0e4ddc2b40/tools/setup_helpers/cmake.py (L366-L368))
lets the user reduce (or increase) the parallelism used during
compilation.
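A hedged sketch of the idea; the helper name is illustrative and not the literal `setup.py` change.
```python
# Illustrative: honor MAX_JOBS if set, otherwise fall back to 2 * num_cpus.
import os

def build_parallelism():
    max_jobs = os.environ.get("MAX_JOBS")
    if max_jobs is not None:
        return max(1, int(max_jobs))
    return 2 * (os.cpu_count() or 1)

# e.g. forwarded to the build tool as ["--", "-j", str(build_parallelism())]
```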
Co-authored-by: Philippe Tillet <phil@openai.com>
When the user sets `LLVM_SYSPATH` to use a custom LLVM build, an error
is thrown because there is no `version.txt` under the custom build.
This PR skips the version check if `LLVM_SYSPATH` is set.
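Roughly, the guard looks like the following sketch; the function and variable names are illustrative, not the actual `setup.py` code.
```python
# Illustrative: skip the version.txt check when LLVM_SYSPATH points at a
# user-provided LLVM build, which has no version.txt to read.
import os

def llvm_version_matches(llvm_dir, expected_version):
    if os.environ.get("LLVM_SYSPATH"):
        return True  # custom build: nothing to validate
    with open(os.path.join(llvm_dir, "version.txt")) as f:
        return f.read().strip() == expected_version
```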
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
Make CMake happier; it doesn't like multiple `target_link_libraries`
definitions for the same target.
Use `find_package` for dlfcn-win32 on Windows instead.
Set LLVM_SYS_PATH on Windows for the Python setup.
Debug build almost working; an AlwaysCreate error is still thrown.
- Dependent CUDA files (`ptxas`, `cuda.h`, `libdevice.10.bc`) are now packaged in
`triton/third_party/cuda`. `ptxas` is downloaded from the conda repo at
install time.
- Can now be built with an old glibc (such as the one used by manylinux2014).
Otherwise it fails with
```
File "setup.py", line 147, in build_extension
"-DLLVM_EXTERNAL_LIT=" + lit_dir,`
TypeError: can only concatenate str (not "NoneType") to str
```
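In essence, the fix is to only pass the flag when `lit` was actually found; a sketch (not the literal patch):
```python
# Illustrative guard that avoids the TypeError above.
import shutil

cmake_args = []
lit_dir = shutil.which("lit")  # may be None if lit is not installed
if lit_dir is not None:
    cmake_args.append("-DLLVM_EXTERNAL_LIT=" + lit_dir)
```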
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Based on the discussion in #700, this PR makes `setup.py` download
pybind11 (without a `git submodule`) instead of copy-pasting the
pybind11 code. The downloaded pybind11 lives in `~/.triton/pybind`
(like `llvm`).
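A hedged sketch of that download step; the version, URL, and helper name are placeholders rather than what `setup.py` actually pins.
```python
# Illustrative: fetch a pybind11 release tarball into ~/.triton/pybind.
import os
import tarfile
import urllib.request

def get_pybind11(version="2.10.0"):  # placeholder version
    dest = os.path.expanduser("~/.triton/pybind")
    pkg_dir = os.path.join(dest, f"pybind11-{version}")
    if not os.path.isdir(pkg_dir):
        os.makedirs(dest, exist_ok=True)
        url = f"https://github.com/pybind/pybind11/archive/refs/tags/v{version}.tar.gz"
        archive, _ = urllib.request.urlretrieve(url)
        with tarfile.open(archive) as tar:
            tar.extractall(dest)  # creates pybind11-<version>/ under dest
    return pkg_dir
```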
This PR changes the `pybind11` source code management from copy-pasted
code to a package managed as a git submodule.
See the discussion in #694 for details.
This is a more stable commit that produces bitwise-identical code to earlier
versions. Using commits after this one may lead to slightly different numerics.