This PR disables the following sub-tests because they are PTX-specific:
- basic_async_wait
- convert_dot
- matmul_kernel_dot_operand_layout
- matmul884_kernel_dot_operand_layout
- matmul_tf32dot
This is to solve https://github.com/openai/triton/issues/1236
This commit hides the symbols of the shared libraries bundled into
`libtriton.so`, so that when other objects link against `libtriton.so`,
there are no symbol conflicts.
The function calculates the swizzled address to **store** (not load), so
we should use `outOrder` instead of `inOrder`. The current tests do not
cover this case, but at NVIDIA we have an `sm_90`-related case that
could trigger it. Already discussed in the Slack channel with @Jokeren.
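To illustrate why the order matters, here is a minimal, hypothetical sketch of an XOR-based shared-memory swizzle; the function name and parameters are invented for this example. The point is only that `row`/`col` must be taken in the order of the layout being written to (the `outOrder`), not the layout being read from:

```python
def swizzled_store_offset(row, col, cols_per_row, per_phase=1, max_phase=8):
    # XOR-based swizzle: derive a phase from the row and XOR it into the
    # column so consecutive rows hit different memory banks.
    phase = (row // per_phase) % max_phase
    return row * cols_per_row + (col ^ phase)
```

If the dimensions are read in the wrong order, the computed phase (and thus the bank pattern) is wrong, even though the code may appear to work on layouts where the two orders happen to coincide.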
Per issue https://github.com/openai/triton/issues/1228. I believe we are
potentially exposed when a Triton executor (PyTorch, for example) links
in two or more `triton_.so` shared objects, each of which has a stub for
`_launch`.
This fix ensures the `_launch` function is tied locally to the calling
`__triton_launcher` and can't be misused by another library.
By default, Triton generates MLIR in which `tt.dot` on f16-typed
operands produces an f32 result, so we have `tt.dot(f16,f16,f32)->f32`
types in `.ttgir`. But the LLVM FMA instruction requires the same type
for all three operands, so the first two operands are implicitly cast
f16->f32 as `unrealized_conversion_cast struct{f16,f16,...}->struct{f32,f32}`.
This change fixes incorrect generation of that implicit cast.
For int8-typed operands, the result operand is also cast after performing the dot.
As a next step to improve FMA-based dot, target-specific FMA intrinsics
for f16 and int8 (e.g. `fma(f16,f16,f16)->f16`) could be used, perhaps as
an option.
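A minimal sketch of the promotion using only the standard library (`f16_to_f32` and `fma_dot` are hypothetical helpers, not Triton APIs): each f16 element is widened to f32 so the multiply-accumulate runs on a single type, mirroring the `unrealized_conversion_cast` described above.

```python
import struct

def f16_to_f32(h: int) -> float:
    # Reinterpret a raw IEEE-754 half (given as a uint16 bit pattern) as a
    # wider float, mirroring the implicit f16->f32 promotion.
    return struct.unpack("<e", struct.pack("<H", h))[0]

def fma_dot(a_f16, b_f16, acc_f32):
    # All three FMA operands must share one type, so each f16 element is
    # widened before the multiply-accumulate; accumulation stays in f32.
    return acc_f32 + sum(f16_to_f32(x) * f16_to_f32(y)
                         for x, y in zip(a_f16, b_f16))
```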
Python 3.10 changed where packages are installed by default, causing
problems on Ubuntu, where installs end up under `/local`. See
[this](https://lists.debian.org/debian-python/2022/03/msg00039.html) and
[this](https://bugs.launchpad.net/ubuntu/+source/python3.10/+bug/1967920).
Triton seems to break when using 3.10 because it looks for the headers
under `/local`, but they are not there, e.g. they are at
`/usr/include/python3.X` rather than `/usr/local/include/python3.X`.
Not 100% sure what's going on here since it's deep in Python/pip, but
I think this should fix it. Otherwise, you have to hack around it in
Dockerfiles, e.g. `ENV DEB_PYTHON_INSTALL_LAYOUT=deb`, which breaks
with the pip release that just went out.
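A small sketch of the safer approach: query `sysconfig` for where the interpreter's headers actually live instead of assuming a `/usr/local` prefix.

```python
import sysconfig

# Ask the interpreter where its C headers actually live; on Debian/Ubuntu
# Python 3.10 this may be /usr/include/python3.10 even though pip installs
# packages under /usr/local.
include_dir = sysconfig.get_path("include")
print(include_dir)
```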
---------
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Fix issue https://github.com/openai/triton/issues/244.
- Check that `end` is greater than `start`.
- Check that the range fits in `int32`.
- Check that the number of elements is less than or equal to
`TRITON_MAX_TENSOR_NUMEL = 131072`.
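The three checks above can be sketched as follows (`check_arange` is a hypothetical stand-in for the frontend validation, not the actual Triton function):

```python
TRITON_MAX_TENSOR_NUMEL = 131072
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def check_arange(start: int, end: int) -> int:
    # 1. end must be greater than start
    if end <= start:
        raise ValueError("arange: end must be greater than start")
    # 2. the range must fit in int32
    if start < INT32_MIN or end > INT32_MAX:
        raise ValueError("arange: range must fit in int32")
    # 3. the number of elements must not exceed TRITON_MAX_TENSOR_NUMEL
    if end - start > TRITON_MAX_TENSOR_NUMEL:
        raise ValueError("arange: too many elements")
    return end - start
```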
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This pull request addresses a crash that occurs when casting to a
`tl.constexpr` type in the frontend.
More info and repro code available in:
https://github.com/openai/triton/issues/1221
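A hedged sketch of the kind of fix involved (the `Constexpr` class and `resolve_dtype` helper are invented for illustration; the real frontend types differ): unwrap a constexpr wrapper before using it as a cast target, so the crash path is never reached.

```python
class Constexpr:
    # Stand-in for tl.constexpr: a value wrapped at compile time.
    def __init__(self, value):
        self.value = value

def resolve_dtype(dtype):
    # If the cast target arrived wrapped in a constexpr (possibly nested),
    # unwrap it first instead of crashing on the wrapper type.
    while isinstance(dtype, Constexpr):
        dtype = dtype.value
    return dtype
```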
Add "agent" syncscope specification to prevent large performance loss
for gfx90a.
Add LLVM function attributes to enable fp32 atomic adds for archs that
support it.
Make CMake happier; it doesn't like multiple `target_link_libraries`
definitions for the same name.
Use `find_package` instead on Windows for dlfcn-win32.
Set `LLVM_SYS_PATH` on Windows for the Python setup.
Debug build almost working; an AlwaysCreate error is still thrown.
Minor bug: the autotuner currently throws an error when certain
configs go `OutOfResources` (e.g. the matmul example when testing on GPUs
with less shared memory).
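One way to make the autotuner tolerate such configs is to treat out-of-resources as an infinitely bad timing rather than a fatal error. A sketch under that assumption (`OutOfResources`, `autotune`, and the benchmark callback are simplified stand-ins for Triton's internals):

```python
import math

class OutOfResources(Exception):
    """Raised when a config exceeds hardware limits (e.g. shared memory)."""

def autotune(configs, bench_fn):
    # Benchmark every config; configs that run out of resources are kept
    # with an infinite timing so they can never win, instead of aborting
    # the whole tuning sweep.
    timings = {}
    for cfg in configs:
        try:
            timings[cfg] = bench_fn(cfg)
        except OutOfResources:
            timings[cfg] = math.inf
    return min(timings, key=timings.get)
```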
This is a combination of 6 commits.
use local bitcode
This is a combination of 3 commits.
add bit code to repo
update test
change bit code path
move bit code
update path
update scripts
update test
fix path issue
While doing it incorrectly happens to work at the moment, it will cause a
validation error once we rebase Triton closer to LLVM head, since
validation for some LLVM dialect ops has gotten stricter.
Specifically, if we remove the shared memory address space attribute, a
subsequent bitcast tries to add it back, which is illegal.