* rebase onto improve_fwd_fa
* Fixed a leftover from rebase
* rebase onto improve_fa_fwd
* Reduce tuning space
* Disable bwd with D=128
* Add test for d=128
* Fix an issue with get_best_config when there is only one config
* Added better configs for d=128
* Fix typos
---------
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
* Attempt to absorb upstream's changes to improve causal=True
* Add autotuner
* Optimize for AMD MI250
- add pre_load_v as a tuning parameter
- do not define N_CTX as constexpr
- perform the second dot before sum
- remove qk_scale out of the inner loop
- add more configs in the autotuner
Note that bwd kernel is disabled for now. This is because we enabled
autotuning and grid becomes a function. So ctx.grid[0] no longer works.
* Enable bwd kernel
* enforce cc=None on ROCm
* Comment
* Update approach to ignore integer cc values
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
---------
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
* Restructure ROCM Library Search
Currently there are a handful of ROCM dependant files which are required for
triton to run. The linker(ld.lld), the include files, and multiple hip/hsa
shared objects.
This change will provide three search areas to find these files. All in
the same order.
1. third_party/rocm. This location is within the python/triton directory
and is carried over when triton is built. IF all necessary files
are in this location there will be no need to have ROCM installed at
all on the system.
2. $ROCM_PATH environmental variable. If this exists it will override
all other locations to find ROCM necessary files
3. /opt/rocm. The default location for ROCm installations. Finding one
here will notify triton that ROCM is installed in this environment
To ease with step 3. A new script scripts/amd/setup_rocm_libs.sh
has been added to the repo. Executing this script will cause all necessary
ROCM files to be downloaded from their respective packages on repo.radeon.com
and installed in third_party/rocm. Allowing for triton to run without installing
the full ROCM stack. setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user
wishes to install a ROCM version other than the default (currently 5.4.2)
When triton whls are built to support Pytorch, method 3 will be used to stay in
sync with PyTorch's approach of bringing along any libraries needed and not
requiring ROCM to be installed.
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)
* Fix default rocm path
Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs
(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)
* setup_rocm_libs.sh manylinux refactor
(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)
* Set setup_rocm_libs.sh to be executable
(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)
* Revert to using numbered so files to fix upstream
(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)
* Remove drm script
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* [MFMA] Support BFloat16 on MI100
This PR makes use of mfma_f32_32x32x4bf16 instruction, available on MI100.
* fix tests, fix mfma encoding comment, fix switch between mfma versions.
* replace kDim from mfma layout with kWidth from dotOp layout
* rebase fix
* fix mfma to dot op shortcut for bfloat16
* fix review comments
* [MLIR] Added tritongpu-stream-pipeline pass
- Prologue: Hoist the pipelinable load operations and shared memory store
for the ramp up stage
- Pipelined Loop: Assemble the loop body minus last iteration
- Prefetch next tile from global into regs (while computing from previous)
- Non-load loop body
- Store next tile into shared mem
- Epilogue: Peeled non-load loop body for last iteration
* * updated comment
* refine the gemm tuning scripts to reduce tuning space and better perf numbers
* added code to support tuning in full tuning space
* add a function to get best tuning config
* refine the matmul tutorial example to print out best tuning config for each input
* added even_k to gemm kernel heuristic for better performance
* address review comments
* Add fwd and bwd v2
Changes are largely from upstream.
* Split bwd kernel in dq and dk+dv
Only adds the split kernels. They are not enabled yet.
* Pull scalar multiplies out of the loop
* Enable split kernel for bwd pass
* Put back P_SEQ=128 in fwd test
Not used for bwd test
* Address review comments
* Address comments
Conditionally set causal/ splitkernel to False for bwd.
* Add block pointer semantics to bwd pass
This significantly increases perf for bwd, similar to fwd.