There's no guarantee that `/tmp/triton/*/*.json` existing means
that the corresponding `/tmp/triton/*/*.cubin` file also exists, because `/tmp` makes no guarantee that files persist.
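A minimal sketch of the stricter check this implies, assuming a cache layout where both artifacts share a base path (the function name and layout are illustrative, not the actual cache code):

```python
import os

def cache_hit(base_path: str) -> bool:
    # Only treat the cache entry as valid if every artifact survived /tmp cleanup.
    return all(os.path.exists(base_path + ext) for ext in (".json", ".cubin"))
```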
I noticed that Triton uses the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
that. Moving the version calculation into the backend makes sense to me
architecturally, so that's my approach here. I'm less confident in the
implementation, so please let me know if you have any feedback.
Without this change, a constexpr assignment (i.e. `A = B & C`, where `B`
and `C` are both constexpr) gets assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an `if`
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0`, the assigned value is
also a constexpr; but in this case we do want to assign a triton tensor
to `x`, so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.
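A hedged sketch of the pattern in question (the kernel is illustrative, not a test from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr, B: tl.constexpr, C: tl.constexpr):
    A = B & C  # B and C are constexpr, so A should stay constexpr
    if A:      # previously failed because A had become a triton tensor
        tl.store(out_ptr, 1)
    x = 0               # a plain constant still becomes a triton tensor,
    x = x.to(tl.int64)  # so casts like this keep working
```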
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
By default, ptxas enables fusion of mul/add into fma instructions. The
backend was also unconditionally configured to enable this on
conversion from LLVM IR to PTX. This commit adds an option that can be
used to disable the FP fusion behavior in both locations.
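A usage sketch, assuming the option is exposed as a per-launch flag named `enable_fp_fusion` (the exact name is an assumption based on this description):

```python
import triton
import triton.language as tl

@triton.jit
def axpy(x_ptr, y_ptr, a):
    i = tl.program_id(0)
    # a * x + y is exactly the pattern ptxas would fuse into an fma
    tl.store(y_ptr + i, a * tl.load(x_ptr + i) + tl.load(y_ptr + i))

# Assumed launch syntax with fusion disabled:
# axpy[(n,)](x, y, 2.0, enable_fp_fusion=False)
```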
* enforce cc=None on ROCm
* Comment
* Update approach to ignore integer cc values
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
* Restructure ROCM Library Search
Currently there are a handful of ROCM-dependent files that triton
requires to run: the linker (ld.lld), the include files, and multiple
hip/hsa shared objects.
This change provides three locations to search for these files, always
checked in the same order (a sketch of the lookup follows the list):
1. third_party/rocm. This location is within the python/triton directory
   and is carried over when triton is built. If all necessary files
   are in this location, there is no need to have ROCM installed on the
   system at all.
2. The $ROCM_PATH environment variable. If this is set, it overrides
   all other locations for finding the necessary ROCM files.
3. /opt/rocm. The default location for ROCm installations. Finding one
   here tells triton that ROCM is installed in this environment.
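A sketch of the lookup order described above (the function name and file layout are illustrative, not the actual code):

```python
import os

def find_rocm_dir():
    candidates = [
        os.path.join(os.path.dirname(__file__), "third_party", "rocm"),
        os.environ.get("ROCM_PATH", ""),
        "/opt/rocm",
    ]
    for path in candidates:
        # Use the bundled linker as a marker that the location is usable.
        if path and os.path.exists(os.path.join(path, "llvm", "bin", "ld.lld")):
            return path
    raise RuntimeError("No usable ROCM installation found")
```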
To help populate location 1, a new script, scripts/amd/setup_rocm_libs.sh,
has been added to the repo. Executing this script downloads all necessary
ROCM files from their respective packages on repo.radeon.com
and installs them in third_party/rocm, allowing triton to run without
installing the full ROCM stack. setup_rocm_libs.sh takes an env var,
ROCM_VERSION, if a user wishes to install a ROCM version other than the
default (currently 5.4.2).
When triton wheels are built to support PyTorch, method 1 will be used, to stay in
sync with PyTorch's approach of bringing along any needed libraries rather than
requiring ROCM to be installed.
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)
* Fix default rocm path
Running into `fatal error: hip/hip_runtime.h: No such file or directory` with the latest wheel due to an incorrect directory for the ROCm libs
(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)
* setup_rocm_libs.sh manylinux refactor
(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)
* Set setup_rocm_libs.sh to be executable
(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)
* Revert to using numbered .so files to fix upstream
(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)
* Remove drm script
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
- Move atomic_cas and atomic_xchg to the "atomic ops" section of the
documentation.
- Don't talk about the `cmp` operand for operations which don't have
it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation,
apparently. (They are rendered literally instead of being treated
as formatting commands.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
Hi,
I'm adding some features to
`triton.runtime.jit.JITFunction._make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.
In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Adding a utility method that generates a temporary file to create a
launcher when debugging a kernel in the main module~~
What remains is using a closure to hold the launcher's body, as sketched below.
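A minimal sketch of why the closure helps, using a toy launcher (all names here are illustrative, not the actual JITFunction code):

```python
# Old style: the launcher body only exists as an exec'd string, so editors
# can't show it and debuggers can't step into it.
src = "def launcher(*args):\n    return run(*args)"
scope = {"run": print}
exec(compile(src, "<generated>", "exec"), scope)
launcher_from_exec = scope["launcher"]

# New style: the body lives in this source file, so it is fully debuggable.
def make_launcher(run):
    def launcher(*args):
        return run(*args)
    return launcher

launcher_from_closure = make_launcher(print)
```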
Because this feature might be useful to others, I have opened this
Pull Request.
~~Tests are yet to be added; if this submission is accepted, I
will add them later.~~
Since this change is a refactor, no new test was added.
This fixes a few bugs I've encountered (a repro sketch for the first
follows the list):
- `atomic_add` with int64/uint64 -> `Operation .add requires .u32 or .s32
or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
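A hedged repro sketch for the int64 case (launch parameters are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one(ptr):
    tl.atomic_add(ptr, 1)  # previously failed to compile for int64/uint64

x = torch.zeros(1, dtype=torch.int64, device="cuda")
add_one[(1,)](x)
assert x.item() == 1
```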
This was regressed by #2185 because we didn't realise the CUDA_CHECK macro
could make Python calls (similar to what led to #2225). I think the
PyErr_Occurred call got removed in that PR because there was missing error
handling before the call to _launch, so it looked like it was simply in
the wrong place.
It looks like there are also potentially a couple of places in cuda.c that
can return with the error indicator set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled, etc., but those are all
pre-existing and not affected by recent changes.
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
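A minimal sketch of the new accumulator argument, assuming it is exposed as an `acc` keyword on `tl.dot` (the keyword name is an assumption based on this description):

```python
import triton
import triton.language as tl

@triton.jit
def dot_accumulate(a, b, acc):
    # Feed the running accumulator into the dot instead of adding afterwards,
    # letting the compiler choose the accumulation precision.
    return tl.dot(a, b, acc=acc)
```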
Low-tech but very useful way to override kernels on the fly. This can be
used for debugging functionality or performance problems: it lets users
dump, modify, and feed IR back into the JIT compiler.
This fixes a few bugs related to scalar tensors (a sketch follows the
list):
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of the 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
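A sketch exercising the fixed cases (the kernel is illustrative, not a test from this PR):

```python
import triton
import triton.language as tl

@triton.jit
def kernel(out_ptr):
    s = tl.full([], 7, tl.int32)  # 0-d scalar; used to raise TypeError
    v = s[None]                   # shape [1]
    m = s[None, None]             # shape [1, 1]
    z = tl.zeros_like(s)          # another 0-d scalar after the fix
    tl.store(out_ptr, s + z)
```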
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It may look like nit-picking, but we'd really
appreciate it if these could be addressed in OSS Triton (to reduce
differences from OSS), and we believe the changes are reasonable in
general. Neither performance nor functionality is affected by this PR.
1. Type cast in `python/triton/runtime/backends/cuda.c`. An implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (as in
`backends/hip.c`).
2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike the other
`DotOpToLLVM/*.cpp` files, the include paths used in `WGMMA.cpp` are not
relative. This is problematic in some compilation settings, since the
compiler then somehow needs to find headers in a parent directory. It'd
be great to use relative paths, like the other source files in Triton.
cc: @yuguo68