This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in MFMA layout.
However, the actual dot operation must produce at least 64 output elements; this is a limitation of other layouts that appear during result processing (i.e. the blocked layout cannot handle tensors smaller than the wave size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), or (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4), (4x64) x (64x8) -> (4x8).
This is a first version of dot using MFMA 4x4 instructions, with redundancy and reductions.
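As an illustration, a kernel using one of the supported shapes might look like the following. This is a minimal sketch: the pointer names and row-major contiguous strides are assumptions, and it presumes the frontend accepts the small N dimension after this change.

```
import triton
import triton.language as tl

@triton.jit
def small_dot_kernel(a_ptr, b_ptr, c_ptr):
    # (16 x 64) x (64 x 4) -> (16 x 4): 64 output elements, so it is supported.
    offs_m = tl.arange(0, 16)
    offs_n = tl.arange(0, 4)
    offs_k = tl.arange(0, 64)
    a = tl.load(a_ptr + offs_m[:, None] * 64 + offs_k[None, :])  # 16 x 64 tile
    b = tl.load(b_ptr + offs_k[:, None] * 4 + offs_n[None, :])   # 64 x 4 tile
    c = tl.dot(a, b)                                             # lowered via MFMA 4x4
    tl.store(c_ptr + offs_m[:, None] * 4 + offs_n[None, :], c)
```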
* use hardware instruction for type conversion between fp8 and fp32
* move gpu_matrix_core_version from semantics.py to hip_backend.py
---------
Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
fix more conflicts
Resolve merge conflicts
Some more build and conflict fixes
Resolve conflicts for 06-fused-attention.py
resolve merge conflicts for the tutorial group gemm example
Fixes for some LIT tests
resolve remaining conflicts in tests
Fix empty kernel
set capability 0
I've added an option to yapf to do what we want for long lines, see
https://github.com/google/yapf/pull/1177. We can now have a real Python
formatter, yay!
To make this PR, I ran my modified yapf over the repository, then looked
over the full diff. Where yapf was mangling the param list of long
function decls/calls (mostly kernels), I manually added `#` to put
linebreaks where we want. I fixed up other formatting too -- mostly
adding or removing a trailing comma from lists.
Overall, trailing `#` was sufficient to get formatting similar to our
current code. I didn't have to disable yapf anywhere.
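For example, a trailing `#` keeps yapf from rejoining the parameter list of a kernel onto fewer lines. The kernel below is hypothetical, just to illustrate the technique:

```
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr,  #
               n_elements,  #
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
```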
---------
Co-authored-by: Phil Tillet <phil@openai.com>
* add two fp8 data types `tl.float8e4b8` and `tl.float8e5b16` to Triton.
* add SW type conversion between `tl.float8e4b8`/`tl.float8e5b16` and `fp16` (see the sketch after this list).
* change flash attention to support fp8 in q/k.
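A minimal sketch of the new types in use; the buffer names are hypothetical, and the conversion is assumed to go through the usual `.to()` cast:

```
import triton
import triton.language as tl

@triton.jit
def fp16_to_fp8_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)  # fp16 input
    y = x.to(tl.float8e4b8)               # SW conversion fp16 -> fp8
    tl.store(y_ptr + offs, y, mask=mask)
```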
This is a combination of 4 commits.
Works as StandAlone and Backend
This is a combination of 13 commits.
Works StandAlone and as Backend
This is a combination of 7 commits.
backend set default dir with flag
move bitcode to backend dir
copy backend
save
empty test works in backend mode
enable backend mode when copying to upstream
clean up
fix failure
minimize diff
add skip function
fix bug with corrupted dwarf exp
match num_warps
fix multi threaded test issue
move bitcode file out of lib
move backend to python/triton/third_party/hip
move libhsa
backend works again
restart ci
clean upstream location first before copy
match scripts
fix new error
memoize backend stuff
fix bug
* this PR adds a third-party backend for Triton that works on AMD
* this exposes much of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests in `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more PRs improving functionality and
performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
* [MFMA] FP8 and BF8 support
This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 MFMA instructions in the TTG-to-LLVM conversion.
* add tests
* fix tests
* review fix: fix variable naming and dot operand promotion.
* review comments fixes
---------
Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
[BACKEND] Improve printf.
Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.
This caused a few problems.
- nvptx printf can only handle 32 arguments; if you pass more than
  that, it prints garbage. So if a thread had more than 32 values, you
  couldn't print them (issue #2486).
- The order of the values within the Triton program (GPU thread block)
is an implementation detail -- it depends on the layout the compiler
assigns to a tensor. So this also prevented you from interpreting
the printed output.
To address this, we now print the Triton pid and the multi-dimensional
tensor index for each value. And each value gets its own line to avoid
passing too many args to printf.
Example output:
```
pid (0, 1, 2) idx (36, 127) x: 42
```
If you want to observe all the values in a tensor in order, you can grep
and then sort the output.
We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.
Fixes #2486.
* Restructure ROCM Library Search
Currently there are a handful of ROCm-dependent files required for
Triton to run: the linker (ld.lld), the include files, and multiple hip/hsa
shared objects.
This change provides three search areas to find these files, always in
the same order.
1. third_party/rocm. This location is within the python/triton directory
   and is carried over when Triton is built. If all necessary files
   are in this location, there is no need to have ROCm installed on
   the system at all.
2. $ROCM_PATH environment variable. If this exists, it overrides
   all other locations for finding the necessary ROCm files.
3. /opt/rocm. The default location for ROCm installations. Finding one
   here tells Triton that ROCm is installed in this environment
   (see the sketch after this list).
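A rough Python sketch of that search order, following the numbering above; the helper name and error handling are hypothetical, purely to illustrate the lookup:

```
import os

def find_rocm_file(relpath):
    # 1. bundled third_party/rocm, 2. $ROCM_PATH, 3. default /opt/rocm install
    bundled = os.path.join(os.path.dirname(__file__), "third_party", "rocm")
    for base in (bundled, os.environ.get("ROCM_PATH"), "/opt/rocm"):
        if base is not None:
            candidate = os.path.join(base, relpath)
            if os.path.exists(candidate):
                return candidate
    raise RuntimeError(f"could not find {relpath}; is ROCm installed?")
```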
To ease step 3, a new script scripts/amd/setup_rocm_libs.sh
has been added to the repo. Executing this script causes all necessary
ROCm files to be downloaded from their respective packages on repo.radeon.com
and installed in third_party/rocm, allowing Triton to run without installing
the full ROCm stack. setup_rocm_libs.sh takes an environment variable ROCM_VERSION
if a user wishes to install a ROCm version other than the default (currently 5.4.2).
When Triton wheels are built to support PyTorch, method 3 will be used to stay in
sync with PyTorch's approach of bringing along any needed libraries and not
requiring ROCm to be installed.
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)
* Fix default rocm path
Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs
(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)
* setup_rocm_libs.sh manylinux refactor
(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)
* Set setup_rocm_libs.sh to be executable
(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)
* Revert to using numbered so files to fix upstream
(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)
* Remove drm script
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
- Move atomic_cas and atomic_xchg to "atomic ops" section of
documentation.
- Don't talk about the `cmp` operand for operations which don't have
it.
- Document the `sem` operand.
- :code:`foo` and ``foo`` don't work inside a :type: annotation,
apparently. (They are rendered literally, instead of being treated
as a formatting command.) Get rid of them.
- Format the bulleted lists in the load/store operations as intended.
This fixes a few bugs I've encountered (a sketch exercising the fixes follows the list)
- `atomic_add` with int64/uint64 -> `Operation .add requires .u32 or .s32
  or .u64 [...] for instruction 'atom'`
- `atomic_min/max` with float64 -> `ValueError('Cannot bitcast data-type
of size 64 to data-type of size 32')`
- `atomic_min/max` with float32 returns the old value as int32
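A small sketch exercising the fixed cases; the pointer arguments are hypothetical and assumed to point at int64, float64, and float32 values respectively:

```
import triton
import triton.language as tl

@triton.jit
def atomics_kernel(i64_ptr, f64_ptr, f32_ptr):
    tl.atomic_add(i64_ptr, 1)          # int64 add no longer emits an invalid 'atom' op
    tl.atomic_min(f64_ptr, 0.5)        # float64 min no longer hits the bitcast error
    old = tl.atomic_max(f32_ptr, 1.0)  # old value now comes back as float32, not int32
```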
Change the dot to allow taking an initial accumulator and add a flag
that will allow the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating in
lower precision.
This only affects Hopper fp8 dot.
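In kernel code the accumulator argument looks roughly like the following single-tile sketch; the kernel name, pointers, and row-major strides are assumptions:

```
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, K, N,  #
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # One program computes a single BLOCK_M x BLOCK_N output tile.
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        acc = tl.dot(a, b, acc)  # accumulator passed in, instead of acc += tl.dot(a, b)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```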
This fixes a few bugs related to scalar tensors (a sketch follows the list):
- `tl.full([], fill_value, dtype)` fails with `TypeError('0d block_type
is forbidden')`
- `scalar[None]` fails with `TypeError("'constexpr' object is not
iterable")`
- `scalar[None, None]` fails with `AttributeError("'dtype' object has no
attribute 'shape'")`
- `scalar.shape` returns `[1]` instead of 0-dim `[]`
- Also related, `tl.zeros_like(scalar)` returns a 1d tensor instead of
another scalar
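All of the above now behave as expected inside a JIT'd function; a minimal sketch with a hypothetical output pointer:

```
import triton
import triton.language as tl

@triton.jit
def scalar_kernel(out_ptr):
    x = tl.full([], 42.0, tl.float32)  # 0-d scalar; x.shape is now [] rather than [1]
    y = tl.zeros_like(x)               # another 0-d scalar, not a 1-d tensor
    t = x[None]                        # expanding a scalar with None works
    tl.store(out_ptr + tl.arange(0, 1), t + y)
```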
* [MFMA] Support BFloat16 on MI100
This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
Add a new operation to implement packed inline assembly for
elementwise operations. This way inline assembly can be used to control
how elementwise operations are lowered. It also allows packing elements so
that operations can be manually vectorized.
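A sketch of how this might be used from Triton via the `tl.inline_asm_elementwise` frontend binding; the kernel and the PTX snippet are illustrative assumptions:

```
import triton
import triton.language as tl

@triton.jit
def square_kernel(x_ptr, y_ptr, BLOCK: tl.constexpr):
    x = tl.load(x_ptr + tl.arange(0, BLOCK))
    y = tl.inline_asm_elementwise(
        "mul.f32 $0, $1, $1;",  # square each element with inline PTX
        "=f,f",                 # one float output, one float input
        [x],
        dtype=tl.float32,
        is_pure=True,
        pack=1,                 # pack=1: the asm consumes one element per invocation
    )
    tl.store(y_ptr + tl.arange(0, BLOCK), y)
```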