This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It looks like nit-picking, but it'd be really
appreciated if it could be addressed in OSS Triton (to reduce
differences from OSS), and we believe these changes are not bad in
general. Neither performance nor functionality is affected by this PR.
1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).
2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.
cc: @yuguo68
User can pass custom arguments to benchmarks. For example, user can pass
`dtype` which will be used to create tensors in a benchmark.
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Prior to this PR, matmul on sm_89 (RTX 4070)
(`test/unit/operators/test_matmul.py::test_op`) would result in test
failure due to too strict atol/rtol.
To avoid having to choose strictness ourselves, and to have better
defaults based on dtype, use the non-deprecated torch testing util.
See: https://github.com/pytorch/pytorch/issues/61844
Replace: https://github.com/openai/triton/pull/2242
* [MFMA] Support BFloat16 on MI100
This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
Previously, if there were any specializations of "1" or "constexpr"
mixed with unspecialized arguments in arbitrary order, we might have
encountered errors due to passing incorrect arguments. This was because
the length of the signature did not indicate the maximum index of
regular arguments.
https://github.com/openai/triton/issues/2229
@shunting314 @amjames
More specifically for cases like:
```
kernel(
b: tl.tensor,
a: tl.constexpr,
c: tl.int = 1,
d,
e: tl.constexpr,
...
)
```
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and shared memory store
for the ramp up stage
- Pipelined Loop: Assemble the loop body minus last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* * updated comment
* refine the gemm tuning scripts to reduce tuning space and better perf numbers
* added code to support tuning in full tuning space
* add a function to get best tuning config
* refine the matmul tutorial example to print out best tuning config for each input
* added even_k to gemm kernel heuristic for better performance
* address review comments
Current setup.py could not clean the build directly because the default
build directly has been changed in `CMakeBuild`. This PR is to clean
build directly in this regard.
I'm not sure if this was a typo or if I'm missing something. To me code
like
```
(offs_n[:, None] * stride_vk + offs_k[None, :] * stride_vn)
```
seems off.
In case this is a typo I made this PR to correct it. This PR should have
no functional changes.
If this is not a typo would you mind explaining the reasoning behind
these variable names?
* this pr adds a third party backend for triton that works on AMD
* this expose a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
* Add fwd and bwd v2
Changes are largely from upstream.
* Split bwd kernel in dq and dk+dv
Only adds the split kernels. They are not enabled yet.
* Pull scalar multiplies out of the loop
* Enable split kernel for bwd pass
* Put back P_SEQ=128 in fwd test
Not used for bwd test
* Address review comments
* Address comments
Conditionally set causal/ splitkernel to False for bwd.
* Add block pointer semantics to bwd pass
This significantly increases perf for bwd, similar to fwd.
* Enable usage of block pointer semantics for AMD gpus
This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass that rewrites block pointer loads/stores
to legacy loads/stores.
* Update FA fwd in tutorial to use the block pointers
* use 90 compute capability for amd gpus in python/triton/compiler/compiler.py
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
The default values used by JITFunction for num_warps and num_stages are
coupled with Nvidia GPU architecture. We should use the proper default
values based on the device backend for the kernel to be compiled to.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.
Co-authored-by: Wang Weihan <eikan.wang@intel.com>
This PR makes the following change to AOT kernel
- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of constexpr set and
meta-parameters is referred to an "algo". Within an algo client can
still give different hints about integer arguments.
- Add a API int ${kernle_name}_get_num_algos() that returns the total
number of algos.
- Add a algo_id to allow client to the generated kernel to select the
algo
- Remove gX, gY and gZ from the kernel parameter list. This is because
the launch grid is usually different with different algos, and the
client should not need to care about how to compute the launch grid for
each algo. Instead, we ask the client to pass the expression of
computing gX, gY and gZ for compile.py (when AOT kernels are generated).
The expression can only use kernel parameter or const values.
- We also change the testing flow. Now we first build the kernels into a
shared library libkernel.so, then the client test.c code is built and
link with libkernel.so. This is closer to a typical AOT kernel usage
flow.
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issue.