For example, when given `--convert-triton-gpu-to-llvm="is-rocm=true"`,
`ConvertTritonGPUToLLVMPass` should generate ROCM-compatible LLVM.
Before this PR, transformation options passed on the command line were
not respected.
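For illustration, the `is-rocm=true` payload follows MLIR's textual pass-option syntax: `key=value` pairs attached to the pass name. The sketch below shows roughly how such a spec decomposes; it is a simplified illustration, not MLIR's actual `PassRegistry` parser, and `parse_pass_option_spec` is a hypothetical helper:

```python
def parse_pass_option_spec(spec: str):
    """Split a command-line pass spec such as
    convert-triton-gpu-to-llvm="is-rocm=true" into the pass name and a
    dict of its options. Simplified sketch only; real parsing is done
    by MLIR's pass infrastructure."""
    name, sep, opts = spec.partition("=")
    options = {}
    if sep:
        # Options arrive as space-separated key=value pairs.
        for pair in opts.strip('"').split():
            key, _, value = pair.partition("=")
            options[key] = value
    return name, options

print(parse_pass_option_spec('convert-triton-gpu-to-llvm="is-rocm=true"'))
# → ('convert-triton-gpu-to-llvm', {'is-rocm': 'true'})
```

The point of the fix is that the parsed options actually reach the pass constructor instead of being dropped.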
No functional changes intended, and it might slightly speed up the
build.
This allows a downstream Bazel build of Triton to avoid building a
number of dialects and passes that Triton doesn't need.
This is the initial code merge of Nvidia Hopper feature support. Please
be aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. We recommend trying them once version 3.0 is released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
That library makes use of the `dladdr` function, so it eventually needs
to be linked with `-ldl`, which may not be done automatically. This
commit adds a link dependency on `${CMAKE_DL_LIBS}`, which is CMake's
portable way of specifying that library.
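As a minimal way to see what `dladdr` does, here is a sketch via Python's `ctypes` rather than the C++ code this commit touches. The `Dl_info` layout matches the glibc definition; note that on glibc >= 2.34 `dladdr` lives in libc itself rather than a separate libdl, which is exactly why letting `${CMAKE_DL_LIBS}` pick the right flag per platform is preferable to hard-coding `-ldl`:

```python
import ctypes

# dladdr() maps an address back to the shared object containing it.
class Dl_info(ctypes.Structure):
    _fields_ = [
        ("dli_fname", ctypes.c_char_p),  # path of the containing object
        ("dli_fbase", ctypes.c_void_p),  # base address it was loaded at
        ("dli_sname", ctypes.c_char_p),  # nearest symbol name
        ("dli_saddr", ctypes.c_void_p),  # address of that symbol
    ]

libc = ctypes.CDLL(None)  # handle to the running process's symbols
dladdr = libc.dladdr
dladdr.argtypes = [ctypes.c_void_p, ctypes.POINTER(Dl_info)]
dladdr.restype = ctypes.c_int

info = Dl_info()
addr = ctypes.cast(libc.printf, ctypes.c_void_p)  # any resolvable symbol
if dladdr(addr, ctypes.byref(info)):  # nonzero return means success
    print("printf resolved from:", info.dli_fname.decode())
```

In CMake the fix corresponds to `target_link_libraries(<target> PRIVATE ${CMAKE_DL_LIBS})`, which expands to `-ldl` where needed and to nothing on platforms that do not require it.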
Also try to switch API accesses to the new upstream APIs that
explicitly separate access to "discardable" and "inherent" attributes
(the latter being stored in properties now).
Generic accessors like `getAttr()`, `setAttr()`, and `setAttrs()` are
much more expensive and should be avoided.
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Follow-up PRs will make the frontend support multiple devices in a
more modular way.
This PR is a first in a series of PRs to import the changes that we have
made to enable ROCM on [our
fork](https://github.com/ROCmSoftwarePlatform/triton) of triton.
The PR contains the major changes to the Python frontend and enough
changes to the C++ backend to allow compiling and running the empty
kernel. We use the ROCm CI added a few weeks ago to verify things.
---------
Co-authored-by: Ronan Keryell <ronan@keryell.fr>
* Cleaned up the pipeline pass. It now works when there are
element-wise ops between the load and the dot
* Made `splat` compatible with variables that have DotOperandLayout
* Moves rematerialization utils to separate Transforms/Utility.cpp file.
* Frontend:
- `int` kernel arguments are always signed
- Loop induction variable is now determined by integer promotion on
lb/ub/step
* Optimizer:
- Added new ExtractSliceOp that enforces 32-bit offsets
* Backend:
- Use 64-bit indices when lowering functions and control flow
- Removed `idx_val` macro and replaced it with `i32_val`
- Cleaned up comments
- Added a new ArithToIndex pass to make sure operations on indices are
done with the `index` dialect, which gets converted to LLVM separately
using a 64-bit target
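The integer-promotion rule for the induction variable can be modeled roughly as follows. This is a hypothetical `promote` helper for illustration, not Triton's actual frontend code, and signedness handling is omitted:

```python
# Hypothetical sketch of C-style integer promotion over lb/ub/step;
# Triton's real rule lives in the frontend's semantic analysis.
WIDTH = {"i1": 1, "i8": 8, "i16": 16, "i32": 32, "i64": 64}

def promote(*types):
    """Return the narrowest type that is at least 32 bits wide and at
    least as wide as every operand."""
    width = max(32, *(WIDTH[t] for t in types))
    return f"i{width}"

# The loop induction variable takes the promoted type of lb/ub/step:
print(promote("i32", "i64", "i32"))  # → i64
print(promote("i16", "i16", "i16"))  # → i32 (narrow types are promoted)
```

So a loop whose bound is an `i64` argument gets a 64-bit induction variable, while all-narrow bounds still yield a 32-bit one.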
- Dependent CUDA files (ptxas, cuda.h, libdevice.bc.10) are now packaged in
`triton/third_party/cuda`. `ptxas` is downloaded from the conda repo at
install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
This fix enables support on sm_90 (otherwise it would crash).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring
processor)

can be ignored and should be eliminated once the LLVM NVPTX backend is
updated.