* Add gemm tuning script v3
* Introduce --jobs to control the number of files to generate
* Switch to the trans convention used by Tensile
* Rerun rocprof if it crashes
* Update README
* Remove peak perf and efficiency
This PR enables a 4x4 tile size in MFMA-based dot operations.
The supported tiled dot is (4x64) x (64x4) -> (4x4) in the MFMA layout.
However, the actual dot operation must produce at least 64 output elements; this is a limitation of the other layouts that appear during result processing (i.e. the blocked layout cannot handle tensors smaller than the wavefront size).
For example, the following dots are supported: (4x64) x (64x16) -> (4x16), (16x64) x (64x4) -> (16x4), and (8x64) x (64x8) -> (8x8).
The following dots are not supported: (4x128) x (128x4) -> (4x4) and (4x64) x (64x8) -> (4x8).
This is a first version of dot using mfma 4x4 instructions; it still involves redundant computation and reductions.
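As a rough illustration, here is a minimal sketch encoding the shape rules above. The helper name and the exact divisibility rules are assumptions inferred from the examples, not the actual compiler check:

```python
def mfma_4x4_dot_supported(m: int, n: int, k: int) -> bool:
    """Hypothetical shape check for the MFMA 4x4 tiled dot described above."""
    # The base tile is (4x64) x (64x4) -> (4x4), so K is consumed in chunks of 64.
    if k % 64 != 0:
        return False
    # Result-processing layouts need at least a wavefront (64) of output elements.
    if m * n < 64:
        return False
    # Output dimensions must tile evenly into 4x4 blocks (assumption).
    return m % 4 == 0 and n % 4 == 0

# Shapes from the examples above:
assert mfma_4x4_dot_supported(4, 16, 64)      # (4x64) x (64x16) -> (4x16)
assert mfma_4x4_dot_supported(16, 4, 64)      # (16x64) x (64x4) -> (16x4)
assert mfma_4x4_dot_supported(8, 8, 64)       # (8x64) x (64x8)  -> (8x8)
assert not mfma_4x4_dot_supported(4, 4, 128)  # only 16 output elements
assert not mfma_4x4_dot_supported(4, 8, 64)   # only 32 output elements
```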
This PR adds:
- a verbose tuning mode: prints the standard output of compilation and tuning calls
- collection of information about failed compilations
- correctness-check output printed as a word
- dimensions in the generated scripts separated with "-"
- a gpu_ids option to select particular GPUs
* Add waves_per_eu in the tuning space
* Do not allocate tensors on the device during the kernel compilation step
* Add a breakdown of elapsed time
* Parallelize the post-processing step
* Parallelize the profiling step with --ngpus
* Better timing info printout
* Restructure ROCM Library Search
Currently there are a handful of ROCm-dependent files required for
triton to run: the linker (ld.lld), the include files, and multiple hip/hsa
shared objects.
This change provides three locations to search for these files, always
in the same order:
1. third_party/rocm. This location is within the python/triton directory
   and is carried over when triton is built. If all necessary files
   are in this location, there is no need to have ROCm installed on
   the system at all.
2. The $ROCM_PATH environment variable. If this is set, it overrides
   all other locations for finding the necessary ROCm files.
3. /opt/rocm. The default location for ROCm installations. Finding one
   here tells triton that ROCm is installed in this environment.
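A minimal sketch of that lookup, following the order listed above (the function name and structure are illustrative assumptions, not the actual triton code):

```python
import os

def find_rocm_dir(triton_package_dir: str) -> str:
    """Return the first ROCm directory found, per the order above (sketch)."""
    # 1. The copy bundled with the triton package (python/triton/third_party/rocm).
    bundled = os.path.join(triton_package_dir, "third_party", "rocm")
    if os.path.isdir(bundled):
        return bundled
    # 2. $ROCM_PATH; per the description it overrides other locations when set,
    #    so a real implementation may consult it before the bundled copy.
    rocm_path = os.environ.get("ROCM_PATH")
    if rocm_path and os.path.isdir(rocm_path):
        return rocm_path
    # 3. /opt/rocm, the default ROCm install location.
    if os.path.isdir("/opt/rocm"):
        return "/opt/rocm"
    raise RuntimeError(
        "ROCm files not found; run scripts/amd/setup_rocm_libs.sh or install ROCm")
```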
To make option 1 practical, a new script, scripts/amd/setup_rocm_libs.sh,
has been added to the repo. Executing this script downloads all necessary
ROCm files from their respective packages on repo.radeon.com and installs
them in third_party/rocm, allowing triton to run without installing the
full ROCm stack. setup_rocm_libs.sh honors the env var ROCM_VERSION if a
user wishes to install a ROCm version other than the default (currently 5.4.2).
When triton wheels are built to support PyTorch, option 1 will be used to
stay in sync with PyTorch's approach of bringing along any libraries needed
and not requiring ROCm to be installed.
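For example: `ROCM_VERSION=5.4.2 bash scripts/amd/setup_rocm_libs.sh` (the ROCM_VERSION variable and the script path come from this PR; the exact invocation form is illustrative).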
(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)
* Fix default rocm path
Running into `fatal error: hip/hip_runtime.h: No such file or directory` with the latest wheel due to an incorrect directory for the ROCm libs
(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)
* setup_rocm_libs.sh manylinux refactor
(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)
* Set setup_rocm_libs.sh to be executable
(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)
* Revert to using numbered .so files to fix upstream
(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)
* Remove drm script
---------
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* [GEMM] Tuning script v2
* Extend tuning space to include BLOCK_SIZE = 256
Check LDS usage in a smarter way
* Added README
* Add git branch and commit to the default tuning result filename
* Refine the gemm tuning scripts to reduce the tuning space and get better perf numbers
* Add code to support tuning over the full tuning space
* Add a function to get the best tuning config
* Refine the matmul tutorial example to print out the best tuning config for each input
* Add even_k to the gemm kernel heuristics for better performance (see the sketch after this list)
* Address review comments
* Stop adding multiple architectures to the ISA header
* Add a mask for GPU memory loads in the gemm tuning script 'scripts/amd/gemm/matmul.py'
* Move the scripts to a better place 'scripts/amd/gemm/'
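For reference, an even_k heuristic like the one flagged above typically looks like the following sketch. It uses Triton's @triton.heuristics decorator to derive a constexpr EVEN_K flag from the runtime K, so the compiler can drop the K-dimension bounds checks when K divides evenly by BLOCK_K. This is a simplified illustration that assumes M and N are multiples of the block sizes; it is not the exact kernel from this PR:

```python
import triton
import triton.language as tl

# EVEN_K is computed from the runtime arguments at launch time and passed to
# the kernel as a compile-time constant, specializing the generated code.
@triton.heuristics({"EVEN_K": lambda args: args["K"] % args["BLOCK_K"] == 0})
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr, EVEN_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        if EVEN_K:
            # Fast path: K % BLOCK_K == 0, no mask needed on the K dimension.
            a = tl.load(a_ptrs)
            b = tl.load(b_ptrs)
        else:
            k_remaining = K - k * BLOCK_K
            a = tl.load(a_ptrs, mask=offs_k[None, :] < k_remaining, other=0.0)
            b = tl.load(b_ptrs, mask=offs_k[:, None] < k_remaining, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```

Because EVEN_K is a constexpr, the non-taken branch is eliminated at compile time, which is where the performance gain comes from.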
This is a combination of 7 commits.
use PyTorch nightly with root
reproduce with a PyTorch unit test
hardcode isROCM to true
set is_cuda to False
ignore cc arg
clean up
match triton-mlir branch
This is a combination of 6 commits.
use local bitcode
This is a combination of 3 commits.
add bitcode to repo
update test
change bitcode path
move bitcode
update path
update scripts
update test
fix path issue