Mirror of https://github.com/ROCm/ROCm.git (synced 2026-01-09 22:58:17 -05:00)

Compare commits: community_... → rocm-6.0.2 (109 commits)
.github/CODEOWNERS (vendored, Normal file → Executable file): 4 changes

@@ -1 +1,5 @@
* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
# Documentation files
docs/* @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.github/ISSUE_TEMPLATE/0_issue_report.yml (vendored, deleted): 76 lines

@@ -1,76 +0,0 @@
name: Issue Report
description: File a report for something not working correctly.
title: "[Issue]: "

body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to fill out this report!

        On a Linux system, you can acquire your OS, CPU, GPU, and ROCm version (for filling out this report) with the following commands:
          echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
          echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
          echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
          echo "ROCm in /opt:" && ls -1 /opt | grep -E "rocm-";
  - type: textarea
    attributes:
      label: Problem Description
      description: Describe the issue you encountered.
      placeholder: "The steps to reproduce can be included here, or in the dedicated section further below."
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: What is the name and version number of the OS?
      placeholder: "e.g. Ubuntu 22.04.3 LTS (Jammy Jellyfish)"
    validations:
      required: true
  - type: input
    attributes:
      label: CPU
      description: What CPU did you encounter the issue on?
      placeholder: "e.g. AMD Ryzen 9 5900HX with Radeon Graphics"
    validations:
      required: true
  - type: input
    attributes:
      label: GPU
      description: What GPU(s) did you encounter the issue on?
      placeholder: "e.g. MI200"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Version
      description: What version(s) of ROCm did you encounter the issue on?
      placeholder: "e.g. 5.7.0"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"

  - type: textarea
    attributes:
      label: Steps to Reproduce
      description: (Optional) Detailed steps to reproduce the issue.
      placeholder: Please also include what you expected to happen, and what actually did, at the failing step(s).
    validations:
      required: false

  - type: textarea
    attributes:
      label: Output of /opt/rocm/bin/rocminfo --support
      description: The output of rocminfo --support will help to better address the problem.
      placeholder: |
        ROCk module is loaded
        =====================
        HSA System Attributes
        =====================
        [...]
    validations:
      required: true
.github/ISSUE_TEMPLATE/1_feature_request.yml (vendored, deleted): 32 lines

@@ -1,32 +0,0 @@
name: Feature Suggestion
description: Suggest an additional functionality, or new way of handling an existing functionality.
title: "[Feature]: "

body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to make a suggestion!

  - type: textarea
    attributes:
      label: Suggestion Description
      description: Describe your suggestion.
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: (Optional) If this is for a specific OS, you can mention it here.
      placeholder: "e.g. Ubuntu"
  - type: input
    attributes:
      label: GPU
      description: (Optional) If this is for a specific GPU or GPU family, you can mention it here.
      placeholder: "e.g. MI200"
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"
.github/ISSUE_TEMPLATE/config.yml (vendored, deleted): 5 lines

@@ -1,5 +0,0 @@
blank_issues_enabled: false
contact_links:
  - name: ROCm Community Discussions
    url: https://github.com/RadeonOpenCompute/ROCm/discussions
    about: Please ask and answer questions here for anything ROCm.
.github/workflows/issue_retrieval.yml (vendored, new file): 22 lines

@@ -0,0 +1,22 @@
name: Issue retrieval

on:
  issues:
    types: [opened]

jobs:
  auto-retrieve:
    runs-on: ubuntu-latest
    steps:
      - name: Generate a token
        id: generate_token
        uses: actions/create-github-app-token@v1
        with:
          app_id: ${{ secrets.ACTION_APP_ID }}
          private_key: ${{ secrets.ACTION_PEM }}
      - name: 'Retrieve Issue'
        uses: abhimeda/rocm_issue_management@main
        with:
          authentication-token: ${{ steps.generate_token.outputs.token }}
          github-organization: 'ROCm'
          project-num: '6'
(spell-check wordlist; file name not captured)

@@ -23,6 +23,7 @@ ASan
ASIC
ASICs
ASm
ATI
atmi
atomics
autogenerated
@@ -50,6 +51,7 @@ changelog
chiplet
CIFAR
CLI
CLion
CMake
cmake
CMakeLists
@@ -61,6 +63,7 @@ Codespaces
comgr
Commitizen
CommonMark
completers
composable
concretization
Concretized
@@ -81,6 +84,7 @@ CSE
CSn
csn
CSV
CTests
CU
cuBLAS
CUDA
@@ -90,6 +94,7 @@ cuRAND
CUs
cuSOLVER
cuSPARSE
CXX
dataset
datasets
dataspace
@@ -103,7 +108,9 @@ Dependabot
deserializers
detections
dev
DevCap
devicelibs
devsel
DGEMM
disambiguates
distro
@@ -112,7 +119,6 @@ DMA
DNN
DNNL
Dockerfile
DockerHub
Doxygen
DPM
DRI
@@ -151,6 +157,7 @@ GDR
GDS
GEMM
GEMMs
GenZ
gfortran
gfx
GIM
@@ -194,6 +201,7 @@ hipSPARSELt
hipTensor
HPC
HPCG
HPE
HPL
HSA
hsa
@@ -201,6 +209,8 @@ hsakmt
HWE
ib_core
ICV
IDE
IDEs
ImageNet
IMDB
inband
@@ -224,6 +234,7 @@ IOP
IOPM
IOV
ipo
IRQ
ISA
ISV
ISVs
@@ -236,6 +247,7 @@ KVM
LAPACK
LCLK
LDS
libfabric
libjpeg
libs
linearized
@@ -268,6 +280,8 @@ mivisionx
mkdir
mlirmiopen
MMA
MMIO
MMIOH
MNIST
MPI
MSVC
@@ -329,14 +343,17 @@ perl
PIL
PILImage
PowerShell
PnP
pragma
pre
prebuilt
precompiled
prefetch
prefetchable
preprocess
preprocessing
preq
prequantized
prerequisites
PRNG
profiler
@@ -348,6 +365,7 @@ PyPi
PyTorch
Qcycles
quasirandom
queueing
Radeon
RadeonOpenCompute
RCCL
@@ -369,6 +387,7 @@ Rickle
roadmap
roc
ROC
RoCE
rocAL
rocALUTION
rocalution
@@ -385,6 +404,7 @@ rocm
ROCm
ROCmCC
rocminfo
rocMLIR
ROCmSoftwarePlatform
ROCmValidationSuite
rocPRIM
@@ -410,6 +430,7 @@ RST
runtime
runtimes
RW
Ryzen
SALU
SBIOS
SCA
@@ -431,11 +452,13 @@ Shlens
sigmoid
SIGQUIT
SIMD
SIMDs
SKU
SKUs
skylake
sL
SLES
sm
SMEM
SMI
smi
@@ -455,6 +478,7 @@ subexpression
subfolder
subfolders
supercomputing
Supermicro
SWE
Szegedy
tagram
@@ -477,6 +501,7 @@ toolchains
toolset
toolsets
TorchAudio
TorchMIGraphX
TorchScript
TorchServe
TorchVision
@@ -494,6 +519,7 @@ UCX
UIF
Uncached
uncached
uncorrectable
Unhandled
uninstallation
unsqueeze
CHANGELOG.md: 3467 changes (file diff suppressed because it is too large)
CMakeLists.txt (new file): 40 lines

@@ -0,0 +1,40 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

cmake_minimum_required(VERSION 3.18.0)

project(ROCm VERSION 5.7.1 LANGUAGES NONE)

option(BUILD_DOCS "Build ROCm documentation" ON)

include(GNUInstallDirs)

# Adding default path cmake modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")

# Handle dependencies
include(Dependencies)

# Build docs
if(BUILD_DOCS)
  add_subdirectory(docs)
endif()
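The new top-level CMakeLists.txt above only drives documentation. As a usage sketch (the `doc` target name is taken from the README change later in this comparison; everything else is assumed), it could be configured and built like this:

```bash
# Configure the standalone docs build; BUILD_DOCS defaults to ON.
cmake -B build -DBUILD_DOCS=ON
# Build the Sphinx HTML output via the target referenced in the README example.
cmake --build build --target=doc
```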
CONTRIBUTING.md: 323 changes

@@ -1,229 +1,94 @@
# Contributing to ROCm documentation

AMD values and encourages contributions to our code and documentation. If you choose to
contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
process, to which we are dedicated.

If you have issues when trying to contribute, refer to the
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
repository.

## Folder structure and naming convention

Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
`/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
(`/`) folder.

All images are stored in the `/docs/data` folder. An image's file path mirrors that of the documentation
file where it is used.

Our naming structure uses kebab case; for example, `my-file-name.rst`.

## Supported formats and syntax

Our documentation includes both Markdown and RST files. We are gradually transitioning existing
Markdown to RST in order to more effectively meet our documentation needs. When contributing,
RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.

We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) syntax and compile
our API references using [Doxygen](https://www.doxygen.nl/).

The following table shows some common documentation components and the syntax convention we
use for each:

<table>
<tr>
<th>Component</th>
<th>RST syntax</th>
</tr>
<tr>
<td>Code blocks</td>
<td>

```rst
.. code-block:: language-name

   My code block.
```

</td>
</tr>
<tr>
<td>Cross-referencing internal files</td>
<td>

```rst
:doc:`Title <../path/to/file/filename>`
```

</td>
</tr>
<tr>
<td>External links</td>
<td>

```rst
`link name <URL>`_
```

</td>
</tr>
<tr>
<tr>
<td>Headings</td>
<td>

```rst
******************
Chapter title (H1)
******************

Section title (H2)
===============

Subsection title (H3)
---------------------

Sub-subsection title (H4)
^^^^^^^^^^^^^^^^^^^^
```

</td>
</tr>
<tr>
<td>Images</td>
<td>

```rst
.. image:: image1.png
```

</td>
</tr>
<tr>
<td>Internal links</td>
<td>

```rst
1. Add a tag to the section you want to reference:

.. _my-section-tag: section-1

Section 1
==========

2. Link to your tag:

As shown in :ref:`section-1`.
```

</td>
</tr>
<tr>
<tr>
<td>Lists</td>
<td>

```rst
# Ordered (numbered) list item

* Unordered (bulleted) list item
```

</td>
</tr>
<tr>
<tr>
<td>Math (block)</td>
<td>

```rst
.. math::

   A = \begin{pmatrix}
       0.0 & 1.0 & 1.0 & 3.0 \\
       4.0 & 5.0 & 6.0 & 7.0 \\
       \end{pmatrix}
```

</td>
</tr>
<tr>
<td>Math (inline)</td>
<td>

```rst
:math:`2 \times 2 `
```

</td>
</tr>
<tr>
<td>Notes</td>
<td>

```rst
.. note::

   My note here.
```

</td>
</tr>
<tr>
<td>Tables</td>
<td>

```rst
.. csv-table:: Optional title here
   :widths: 30, 70 #optional column widths
   :header: "entry1 header", "entry2 header"

   "entry1", "entry2"
```

</td>
</tr>
</table>

## Language and style

We use the
[Google developer documentation style guide](https://developers.google.com/style/highlights) to
guide our content.

Font size and type, page layout, white space control, and other formatting
details are controlled via
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
of any formatting issues, create a pull request in our
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.

## Building our documentation

<!-- % TODO: Fix the link to be able to work at every files -->
To learn how to build our documentation, refer to
[Building documentation](./building.md).
<head>
  <meta charset="UTF-8">
  <meta name="description" content="Contributing to ROCm">
  <meta name="keywords" content="ROCm, contributing, contribute, maintainer, contributor">
</head>

# Contribute to ROCm

AMD values and encourages contributions to our code and documentation. If you want to contribute
to our ROCm repositories, first review the following guidance. For documentation-specific information,
see [Contributing to ROCm docs](https://rocm.docs.amd.com/en/latest/contribute/contribute-docs.html).

ROCm is a software stack made up of a collection of drivers, development tools, and APIs that enable
GPU programming from low-level kernel to end-user applications. Because some of our components
are inherited from external projects (such as
[LLVM](https://github.com/ROCm/llvm-project) and
[Kernel driver](https://github.com/ROCm/ROCK-Kernel-Driver)), these use
project-specific contribution guidelines and workflow. Refer to their repositories for more information.
All other ROCm components follow the workflow described in the following sections.

## Development workflow

ROCm uses GitHub to host code, collaborate, and manage version control. We use pull requests (PRs)
for all changes within our repositories. We use
[GitHub issues](https://github.com/ROCm/ROCm/issues) to track known issues, such as
bugs.

### Issue tracking

Before filing a new issue, search the
[existing issues](https://github.com/ROCm/ROCm/issues) to make sure your issue isn't
already listed.

General issue guidelines:

* Use your best judgement for issue creation. If your issue is already listed, upvote the issue and
  comment or post to provide additional details, such as how you reproduced this issue.
* If you're not sure if your issue is the same, err on the side of caution and file your issue.
  You can add a comment to include the issue number (and link) for the similar issue. If we evaluate
  your issue as being the same as the existing issue, we'll close the duplicate.
* If your issue doesn't exist, use the issue template to file a new issue.
* When filing an issue, be sure to provide as much information as possible, including script output so
  we can collect information about your configuration. This helps reduce the time required to
  reproduce your issue.
* Check your issue regularly, as we may require additional information to successfully reproduce the
  issue.

### Pull requests

When you create a pull request, you should target the default branch. Our repositories typically use the **develop** branch as the default integration branch.

When creating a PR, use the following process. Note that each repository may include additional,
project-specific steps. Refer to each repository's PR process for any additional steps.

* Identify the issue you want to fix
* Target the default branch (usually the **develop** branch) for integration
* Ensure your code builds successfully
* Each component has a suite of test cases to run; include the log of the successful test run in your PR
* Do not break existing test cases
* New functionality is only merged with new unit tests
* If your PR includes a new feature, you must provide an application or test so we can ensure that the
  feature works and continues to be valid in the future
* Tests must have good code coverage
* Submit your PR and work with the reviewer or maintainer to get your PR approved
* Once approved, the PR is brought onto internal CI systems and may be merged into the component
  during our release cycle, as coordinated by the maintainer
* We'll inform you once your change is committed

:::{important}
By creating a PR, you agree to allow your contribution to be licensed under the
terms of the LICENSE.txt file in the corresponding repository. Different repositories may use different
licenses.
:::

You can look up each license on the [ROCm licensing](https://rocm.docs.amd.com/en/latest/about/license.html) page.

### New feature development

Use the [GitHub Discussion forum](https://github.com/ROCm/ROCm/discussions)
(Ideas category) to propose new features. Our maintainers are happy to provide direction and
feedback on feature development.

### Documentation

Submit ROCm documentation changes to our
[documentation repository](https://github.com/ROCm/ROCm). You must update
documentation related to any new feature or API contribution.

Note that each ROCm project uses its own repository for documentation.

## Future development workflow

The current ROCm development workflow is GitHub-based. If, in the future, we change this platform,
the tools and links may change. In this instance, we will update contribution guidelines accordingly.
GOVERNANCE.md (new file): 60 lines

@@ -0,0 +1,60 @@
<head>
  <meta charset="UTF-8">
  <meta name="description" content="ROCm governance model">
  <meta name="keywords" content="ROCm, governance">
</head>

# Governance model

ROCm is a software stack made up of a collection of drivers, development tools, and APIs that enable
GPU programming from the low-level kernel to end-user applications.

Components of ROCm that are inherited from external projects (such as
[LLVM](https://github.com/ROCm/llvm-project) and
[Kernel driver](https://github.com/ROCm/ROCK-Kernel-Driver)) follow their own
governance model and code of conduct. All other components of ROCm are governed by this
document.

## Governance

ROCm is led and managed by AMD.

We welcome contributions from the community. Our maintainers review all proposed changes to
ROCm.

## Roles

* **Maintainers** are responsible for their designated component and repositories.
* **Contributors** provide input and suggest changes to existing components.

### Maintainers

Maintainers are appointed by AMD. They are able to approve changes and can commit to our
repositories. They must use pull requests (PRs) for all changes.

You can find the list of maintainers in the CODEOWNERS file of each repository. Code owners differ
between repositories.

### Contributors

If you're not a maintainer, you're a contributor. We encourage the ROCm community to contribute in
several ways:

* Help other community members by posting questions or solutions on our
  [GitHub discussion forums](https://github.com/ROCm/ROCm/discussions)
* Notify us of bugs by filing an issue report on
  [GitHub Issues](https://github.com/ROCm/ROCm/issues)
* Improve our documentation by submitting a PR to our
  [repository](https://github.com/ROCm/ROCm/)
* Improve the code base (for smaller or contained changes) by submitting a PR to the component
* Suggest larger features by adding to the *Ideas* category in the
  [GitHub discussion forum](https://github.com/ROCm/ROCm/discussions)

For more information, refer to our [contribution guidelines](CONTRIBUTING.md).

## Code of conduct

To engage with any AMD ROCm component that is hosted on GitHub, you must abide by the
[GitHub community guidelines](https://docs.github.com/en/site-policy/github-terms/github-community-guidelines)
and the
[GitHub community code of conduct](https://docs.github.com/en/site-policy/github-terms/github-community-code-of-conduct).
README.md: 16 changes

@@ -1,4 +1,4 @@
# AMD ROCm™ platform
# AMD ROCm Software

ROCm is an open-source stack, composed primarily of open-source software, designed for graphics
processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and
@@ -34,7 +34,7 @@ The ROCm documentation homepage is [rocm.docs.amd.com](https://rocm.docs.amd.com
### Building our documentation

For a quick-start build, use the following code. For more options and detail, refer to
[Building documentation](./contribute/building.md).
[Building documentation](./docs/contribute/building.md).

```bash
cd docs
@@ -44,7 +44,15 @@ pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```

Alternatively, CMake build is supported.

```bash
cmake -B build

cmake --build build --target=doc
```

## Older ROCm releases

For release information for older ROCm releases, refer to
[`CHANGELOG`](./CHANGELOG.md).
For release information for older ROCm releases, refer to the
[CHANGELOG](./CHANGELOG.md).
RELEASE.md: 87 changes

@@ -1,7 +1,4 @@
# Release Notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->

# Release notes
<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
<!-- markdownlint-disable no-duplicate-header -->
@@ -11,65 +8,47 @@

<!-- spellcheck-disable -->

Welcome to the release notes for the ROCm platform.
This page contains the release notes for AMD ROCm Software.

-------------------

## ROCm 5.7.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
## ROCm 6.0.2

### What's New in This Release
The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.

### ROCm Libraries

#### rocBLAS
A new functionality, rocblas-gemm-tune, and an environment variable, ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, are added to rocBLAS in the ROCm 5.7.1 release.

*rocblas-gemm-tune* is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command line interface, which mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used by setting the environment variable ROCBLAS_LAYER to 4.

For more information on rocBLAS logging, see Logging in rocBLAS, in the [API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).

In an example input file and its expected output (note that the selected GEMM index may differ), the far-right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code that uses kernels directly via their indices.

If the output is stored in a file, the results can be used to override the default kernel selection with the kernels found, by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH to point to the stored file.

For more details, refer to the [rocBLAS Programmer's Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune).
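A rough sketch of that tuning flow follows; the application name, file paths, and the ROCBLAS_LOG_PROFILE_PATH variable are illustrative assumptions rather than part of the release note:

```bash
# 1. Capture GEMM problem sizes from a normal run via rocBLAS profile logging.
export ROCBLAS_LAYER=4                                # enables profile logging
export ROCBLAS_LOG_PROFILE_PATH=./gemm_problems.yaml  # assumed log destination
./my_app                                              # placeholder application

# 2. Find the best-performing kernel index for each logged GEMM problem.
/opt/rocm/bin/rocblas-gemm-tune --yaml ./gemm_problems.yaml > tuned_gemms.csv

# 3. Override the default kernel selection with the tuned solution indices.
export ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH=$PWD/tuned_gemms.csv
./my_app
```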
#### HIP 5.7.1 (for ROCm 5.7.1)

ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.

### Fixed defects
The *hipPointerGetAttributes* API returns the correct HIP memory type as *hipMemoryTypeManaged* for managed memory.

### Library Changes in ROCM 5.7.1
### Library changes in ROCm 6.0.2

| Library | Version |
|---------|---------|
| hipBLAS | [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.1) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.1) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.1) |
| hipSOLVER | 1.8.1 ⇒ [1.8.2](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.1) |
| hipSPARSE | [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.1) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.1) |
| rocALUTION | [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.1) |
| rocBLAS | [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.1) |
| rocFFT | [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.1) |
| rocm-cmake | [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.1) |
| rocPRIM | [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.1) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.1) |
| rocSOLVER | [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.1) |
| rocSPARSE | [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.1) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.1) |
| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.1) |
| Tensile | [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.1) |
| AMDMIGraphX | ⇒ [2.8](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.0.2) |
| hipBLAS | ⇒ [2.0.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.0.2) |
| hipBLASLt | ⇒ [0.6.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.0.2) |
| hipCUB | ⇒ [3.0.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.0.2) |
| hipFFT | ⇒ [1.0.13](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.0.2) |
| hipRAND | ⇒ [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.0.2) |
| hipSOLVER | ⇒ [2.0.0](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.0.2) |
| hipSPARSE | ⇒ [3.0.0](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.0.2) |
| hipSPARSELt | ⇒ [0.1.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.0.2) |
| hipTensor | ⇒ [1.1.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.0.2) |
| MIOpen | ⇒ [2.19.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.0.2) |
| rccl | ⇒ [2.15.5](https://github.com/ROCm/rccl/releases/tag/rocm-6.0.2) |
| rocALUTION | ⇒ [3.0.3](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.0.2) |
| rocBLAS | ⇒ [4.0.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.0.2) |
| rocFFT | ⇒ [1.0.25](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.0.2) |
| rocm-cmake | ⇒ [0.11.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.0.2) |
| rocPRIM | ⇒ [3.0.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.0.2) |
| rocRAND | ⇒ [3.0.0](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.0.2) |
| rocSOLVER | ⇒ [3.24.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.0.2) |
| rocSPARSE | ⇒ [3.0.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.0.2) |
| rocThrust | ⇒ [3.0.0](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.0.2) |
| rocWMMA | ⇒ [1.3.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.0.2) |
| Tensile | ⇒ [4.39.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.0.2) |

#### hipSOLVER 1.8.2
#### hipFFT 1.0.13

hipSOLVER 1.8.2 for ROCm 5.7.1
hipFFT 1.0.13 for ROCm 6.0.2

##### Fixed
##### Changes

- Fixed conflicts between the hipsolver-dev and -asan packages by excluding
  hipsolver_module.f90 from the latter
* Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files
  over (this should help simplify downstream builds and packaging)
cmake/Modules/Dependencies.cmake (new file): 47 lines

@@ -0,0 +1,47 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# ###########################
# ROCm dependencies
# ###########################

include(FetchContent)

if(BUILD_DOCS)
  find_package(ROCM 0.11.0 CONFIG QUIET PATHS "${ROCM_PATH}") # First version with Sphinx doc gen improvement
  if(NOT ROCM_FOUND)
    message(STATUS "ROCm CMake not found. Fetching...")
    set(rocm_cmake_tag
        "c044bb52ba85058d28afe2313be98d9fed02e293" # develop@2023.09.12. (move to 6.0 tag when released)
        CACHE STRING "rocm-cmake tag to download")
    FetchContent_Declare(
      rocm-cmake
      GIT_REPOSITORY https://github.com/RadeonOpenCompute/rocm-cmake.git
      GIT_TAG ${rocm_cmake_tag}
      SOURCE_SUBDIR "DISABLE ADDING TO BUILD" # We don't really want to consume the build and test targets of ROCm CMake.
    )
    FetchContent_MakeAvailable(rocm-cmake)
    find_package(ROCM CONFIG REQUIRED NO_DEFAULT_PATH PATHS "${rocm-cmake_SOURCE_DIR}")
  else()
    find_package(ROCM 0.11.0 CONFIG REQUIRED PATHS "${ROCM_PATH}")
  endif()
endif()
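As a configuration sketch for the dependency logic above (the install path and tag value are assumptions), either branch can be selected from the command line:

```bash
# Reuse an already-installed ROCm CMake package instead of fetching one.
cmake -B build -DROCM_PATH=/opt/rocm
# Or pin the revision fetched by FetchContent through the cache variable defined above.
cmake -B build -Drocm_cmake_tag=rocm-6.0.2
```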
default.xml: 106 changes

@@ -1,22 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remote name="roc-github"
          fetch="https://github.com/RadeonOpenCompute/" />
  <remote name="rocm-devtools"
          fetch="https://github.com/ROCm-Developer-Tools/" />
  <remote name="rocm-swplat"
          fetch="https://github.com/ROCmSoftwarePlatform/" />
  <remote name="gpuopen-libs"
          fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
  <remote name="gpuopen-tools"
          fetch="https://github.com/GPUOpen-Tools/" />
  <remote name="KhronosGroup"
          fetch="https://github.com/KhronosGroup/" />
  <default revision="refs/tags/rocm-5.7.1"
           remote="roc-github"
  <remote name="rocm-org" fetch="https://github.com/ROCm/" />
  <remote name="roc-github" fetch="https://github.com/RadeonOpenCompute/" />
  <remote name="rocm-devtools" fetch="https://github.com/ROCm-Developer-Tools/" />
  <remote name="rocm-swplat" fetch="https://github.com/ROCmSoftwarePlatform/" />
  <remote name="gpuopen-libs" fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
  <remote name="gpuopen-tools" fetch="https://github.com/GPUOpen-Tools/" />
  <remote name="KhronosGroup" fetch="https://github.com/KhronosGroup/" />
  <default revision="refs/tags/rocm-6.0.2"
           remote="rocm-org"
           sync-c="true"
           sync-j="4" />
  <!--list of projects for ROCM-->
  <!--list of projects for ROCm-->
  <project name="ROCK-Kernel-Driver" />
  <project name="ROCT-Thunk-Interface" />
  <project name="ROCR-Runtime" />
@@ -26,54 +21,57 @@ fetch="https://github.com/KhronosGroup/" />
  <project name="rocm-cmake" />
  <project name="rocminfo" />
  <project name="rocm_bandwidth_test" />
  <project name="rocprofiler" remote="rocm-devtools" />
  <project name="roctracer" remote="rocm-devtools" />
  <project name="rocprofiler" />
  <project name="roctracer" />
  <project path="ROCm-OpenCL-Runtime/api/opencl/khronos/icd" name="OpenCL-ICD-Loader" remote="KhronosGroup" revision="6c03f8b58fafd9dd693eaac826749a5cfad515f8" />
  <project name="clang-ocl" />
  <project name="rdc" />
  <!--HIP Projects-->
  <project name="HIP" remote="rocm-devtools" />
  <project name="HIP-Examples" remote="rocm-devtools" />
  <project name="clr" remote="rocm-devtools" />
  <project name="HIPIFY" remote="rocm-devtools" />
  <project name="HIPCC" remote="rocm-devtools" />
  <project name="HIP" />
  <project name="HIP-Examples" />
  <project name="clr" />
  <project name="hipother" />
  <project name="HIPIFY" />
  <project name="HIPCC" />
  <!-- The following projects are all associated with the AMDGPU LLVM compiler -->
  <project name="llvm-project" />
  <project name="ROCm-Device-Libs" />
  <project name="ROCm-CompilerSupport" />
  <project name="half" remote="rocm-swplat" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
  <project name="half" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
  <!-- gdb projects -->
  <project name="ROCgdb" remote="rocm-devtools" />
  <project name="ROCdbgapi" remote="rocm-devtools" />
  <project name="rocr_debug_agent" remote="rocm-devtools" />
  <project name="ROCgdb" />
  <project name="ROCdbgapi" />
  <project name="rocr_debug_agent" />
  <!-- ROCm Libraries -->
  <project groups="mathlibs" name="rocBLAS" remote="rocm-swplat" />
  <project groups="mathlibs" name="Tensile" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipTensor" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipBLAS" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocFFT" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipFFT" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocRAND" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocSPARSE" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocSOLVER" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipSOLVER" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipSPARSE" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocALUTION" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocThrust" remote="rocm-swplat" />
  <project groups="mathlibs" name="hipCUB" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocPRIM" remote="rocm-swplat" />
  <project groups="mathlibs" name="rocWMMA" remote="rocm-swplat" />
  <project groups="mathlibs" name="rccl" remote="rocm-swplat" />
  <project name="rocMLIR" remote="rocm-swplat" />
  <project name="MIOpen" remote="rocm-swplat" />
  <project name="composable_kernel" remote="rocm-swplat" />
  <project name="MIVisionX" remote="gpuopen-libs" />
  <project name="rpp" remote="gpuopen-libs" />
  <project name="hipfort" remote="rocm-swplat" />
  <project name="AMDMIGraphX" remote="rocm-swplat" />
  <project name="ROCmValidationSuite" remote="rocm-devtools" />
  <project groups="mathlibs" name="rocBLAS" />
  <project groups="mathlibs" name="Tensile" />
  <project groups="mathlibs" name="hipTensor" />
  <project groups="mathlibs" name="hipBLAS" />
  <project groups="mathlibs" name="hipBLASLt" />
  <project groups="mathlibs" name="rocFFT" />
  <project groups="mathlibs" name="hipFFT" />
  <project groups="mathlibs" name="rocRAND" />
  <project groups="mathlibs" name="hipRAND" />
  <project groups="mathlibs" name="rocSPARSE" />
  <project groups="mathlibs" name="hipSPARSELt" />
  <project groups="mathlibs" name="rocSOLVER" />
  <project groups="mathlibs" name="hipSOLVER" />
  <project groups="mathlibs" name="hipSPARSE" />
  <project groups="mathlibs" name="rocALUTION" />
  <project groups="mathlibs" name="rocThrust" />
  <project groups="mathlibs" name="hipCUB" />
  <project groups="mathlibs" name="rocPRIM" />
  <project groups="mathlibs" name="rocWMMA" />
  <project groups="mathlibs" name="rccl" />
  <project name="MIOpen" />
  <project name="composable_kernel" />
  <project name="MIVisionX" />
  <project name="rpp" />
  <project name="hipfort" />
  <project name="AMDMIGraphX" />
  <project name="ROCmValidationSuite" />
  <!-- Projects for OpenMP-Extras -->
  <project name="aomp" path="openmp-extras/aomp" remote="rocm-devtools" />
  <project name="aomp-extras" path="openmp-extras/aomp-extras" remote="rocm-devtools" />
  <project name="flang" path="openmp-extras/flang" remote="rocm-devtools" />
  <project name="aomp" path="openmp-extras/aomp" />
  <project name="aomp-extras" path="openmp-extras/aomp-extras" />
  <project name="flang" path="openmp-extras/flang" />
</manifest>
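default.xml is a manifest for the repo tool, so a hypothetical checkout of the source tree it describes could look like the following sketch (the repo tool itself and the branch/tag argument are assumptions, not part of the manifest):

```bash
# Fetch the ROCm source tree described by default.xml with the "repo" tool.
mkdir -p ~/rocm-src && cd ~/rocm-src
repo init -u https://github.com/ROCm/ROCm.git -b refs/tags/rocm-6.0.2   # tag name assumed
repo sync -j4   # -j4 mirrors the sync-j="4" default in the manifest
```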
docs/CMakeLists.txt (new file): 33 lines

@@ -0,0 +1,33 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

include(ROCMSphinxDoc)

rocm_add_sphinx_doc(
  "${CMAKE_CURRENT_SOURCE_DIR}"
  OUTPUT_DIR html
  BUILDER html
)

install(
  DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/html"
  DESTINATION "${CMAKE_INSTALL_DOCDIR}")
(deleted file; name not captured)

@@ -1,63 +0,0 @@
# Third party support matrix

ROCm™ supports various 3rd party libraries and frameworks. Supported versions
are tested and known to work. Non-supported versions of 3rd parties may also
work, but aren't tested.

## Deep learning

ROCm releases support the most recent and two prior releases of PyTorch and
TensorFlow.

| ROCm | [PyTorch](https://github.com/pytorch/pytorch/releases/) | [TensorFlow](https://github.com/tensorflow/tensorflow/releases/) |
|:------|:--------------------------:|:--------------------:|
| 5.0.2 | 1.8, 1.9, 1.10 | 2.6, 2.7, 2.8 |
| 5.1.3 | 1.9, 1.10, 1.11 | 2.7, 2.8, 2.9 |
| 5.2.x | 1.10, 1.11, 1.12 | 2.8, 2.9, 2.9 |
| 5.3.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.8, 2.9, 2.10 |
| 5.4.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.8, 2.9, 2.10, 2.11 |
| 5.5.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.10, 2.11, 2.13 |
| 5.6.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |
| 5.7.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |

(communication-libraries)=

## Communication libraries

ROCm supports [OpenUCX](https://openucx.org/), an open-source,
production-grade communication framework for data-centric and high performance
applications.

| UCX version | ROCm 5.4 and older | ROCm 5.5 and newer |
|:----------|:------------------:|:------------------:|
| -1.14.0 | COMPATIBLE | INCOMPATIBLE |
| 1.14.1+ | COMPATIBLE | COMPATIBLE |

The Unified Collective Communication ([UCC](https://github.com/openucx/ucc)) library also has
support for ROCm devices.

| UCC version | ROCm 5.5 and older | ROCm 5.6 and newer |
|:----------|:------------------:|:------------------:|
| -1.1.0 | COMPATIBLE | INCOMPATIBLE |
| 1.2.0+ | COMPATIBLE | COMPATIBLE |

## Algorithm libraries

ROCm releases provide algorithm libraries with interfaces compatible with
contemporary CUDA / NVIDIA HPC SDK alternatives.

* Thrust → rocThrust
* CUB → hipCUB

| ROCm | Thrust / CUB | HPC SDK |
|:------|:------------:|:-------:|
| 5.0.2 | 1.14 | 21.9 |
| 5.1.3 | 1.15 | 22.1 |
| 5.2.x | 1.15 | 22.2, 22.3 |
| 5.3.x | 1.16 | 22.7 |
| 5.4.x | 1.16 | 22.9 |
| 5.5.x | 1.17 | 22.9 |
| 5.6.x | 1.17.2 | 22.9 |
| 5.7.x | 1.17.2 | 22.9 |

For the latest documentation of these libraries, refer to [API libraries](../../reference/library-index.md).
(deleted file; name not captured)

@@ -1,130 +0,0 @@
******************************************************************
Docker image support matrix
******************************************************************

AMD validates and publishes `PyTorch <https://hub.docker.com/r/rocm/pytorch>`_ and
`TensorFlow <https://hub.docker.com/r/rocm/tensorflow>`_ containers on dockerhub. The following
tags, and associated inventories, are validated with ROCm 5.7.

.. tab-set::

   .. tab-item:: PyTorch

      .. tab-set::

         .. tab-item:: Ubuntu 22.04

            Tag: `rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1/images/sha256-21df283b1712f3d73884b9bc4733919374344ceacb694e8fbc2c50bdd3e767ee>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.10 <https://www.python.org/downloads/release/python-31013/>`_
              * `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.15.0 <https://github.com/pytorch/vision/tree/release/0.15>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

         .. tab-item:: Ubuntu 20.04

            Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4)>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1/images/sha256-e67db9373c045a7b6defd43cc3d067e7d49fd5d380f3f8582d2fb219c1756e1f>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.12>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.13.1 <https://github.com/pytorch/vision/tree/v0.13.1>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1/images/sha256-ed99d159026093d2aaf5c48c1e4b0911508773430377051372733f75c340a4c1>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.13>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.14.0 <https://github.com/pytorch/vision/tree/v0.14.0>`_
              * `Tensorboard 2.12.0 <https://github.com/tensorflow/tensorboard/tree/2.12.0>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.15.2 <https://github.com/pytorch/vision/tree/release/0.15>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

         .. tab-item:: CentOS 7

            Tag: `rocm/pytorch:rocm5.7_centos7_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_centos7_py3.9_pytorch_staging/images/sha256-92240cdf0b4aa7afa76fc78be995caa19ee9c54b5c9f1683bdcac28cedb58d2b>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/yum/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_

   .. tab-item:: TensorFlow

      .. tab-set::

         .. tab-item:: Ubuntu 20.04

            Tag: `rocm5.7-tf2.12-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.12-dev/images/sha256-e0ac4d49122702e5167175acaeb98a79b9500f585d5e74df18facf6b52ce3e59>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `tensorflow-rocm 2.12.1 <https://pypi.org/project/tensorflow-rocm/2.12.1.570/>`_
              * `Tensorboard 2.12.3 <https://github.com/tensorflow/tensorboard/tree/2.12>`_

            Tag: `rocm5.7-tf2.13-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.13-dev/images/sha256-6f995539eebc062aac2b53db40e2b545192d8b032d0deada8c24c6651a7ac332>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `tensorflow-rocm 2.13.0 <https://pypi.org/project/tensorflow-rocm/2.13.0.570/>`_
              * `Tensorboard 2.13.0 <https://github.com/tensorflow/tensorboard/tree/2.13>`_
@@ -1,116 +0,0 @@
|
||||
# GPU and OS support (Linux)
|
||||
|
||||
(linux-support)=
|
||||
|
||||
## Supported Linux distributions
|
||||
|
||||
AMD ROCm™ Platform supports the following Linux distributions.
|
||||
|
||||
::::{tab-set}
|
||||
|
||||
:::{tab-item} Supported
|
||||
|
||||
| Distribution | Processor Architectures | Validated Kernel | Support |
|
||||
| :----------- | :---------------------: | :--------------: | ------: |
|
||||
| RHEL 9.2 | x86-64 | 5.14 (5.14.0-284.11.1.el9_2.x86_64) | ✅ |
|
||||
| RHEL 9.1 | x86-64 | 5.14.0-284.11.1.el9_2.x86_64 | ✅ |
|
||||
| RHEL 8.8 | x86-64 | 4.18.0-477.el8.x86_64 | ✅ |
|
||||
| RHEL 8.7 | x86-64 | 4.18.0-425.10.1.el8_7.x86_64 | ✅ |
|
||||
| SLES 15 SP5 | x86-64 | 5.14.21-150500.53-default | ✅ |
|
||||
| SLES 15 SP4 | x86-64 | 5.14.21-150400.24.63-default | ✅ |
|
||||
| Ubuntu 22.04.2 | x86-64 | 5.19.0-45-generic | ✅ |
|
||||
| Ubuntu 20.04.5 | x86-64 | 5.15.0-75-generic | ✅ |
|
||||
|
||||
:::{versionadded} 5.6
|
||||
|
||||
* RHEL 8.8 and 9.2 support is added.
|
||||
* SLES 15 SP5 support is added
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Unsupported
|
||||
|
||||
| Distribution | Processor Architectures | Validated Kernel | Support |
|
||||
| :----------- | :---------------------: | :--------------: | ------: |
|
||||
| RHEL 9.0 | x86-64 | 5.14 | ❌ |
|
||||
| RHEL 8.6 | x86-64 | 5.14 | ❌ |
|
||||
| SLES 15 SP3 | x86-64 | 5.3 | ❌ |
|
||||
| Ubuntu 22.04.0 | x86-64 | 5.15 LTS, 5.17 OEM | ❌ |
|
||||
| Ubuntu 20.04.4 | x86-64 | 5.13 HWE, 5.13 OEM | ❌ |
|
||||
| Ubuntu 22.04.1 | x86-64 | 5.15 LTS | ❌ |
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
✅: **Supported** - AMD performs full testing of all ROCm components on distro
|
||||
GA image.
|
||||
❌: **Unsupported** - AMD no longer performs builds and testing on these
|
||||
previously supported distro GA images.
|
||||
|
||||
## Virtualization support
|
||||
|
||||
ROCm supports virtualization for select GPUs only as shown below.
|
||||
|
||||
| Hypervisor | Version | GPU | Validated Guest OS (validated kernel) |
|
||||
|----------------|----------|-------|----------------------------------------------------------------------------------|
|
||||
| VMWare | ESXi 8 | MI250 | Ubuntu 20.04 (`5.15.0-56-generic`) |
|
||||
| VMWare | ESXi 8 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
|
||||
| VMWare | ESXi 7 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
|
||||
|
||||
## Linux-supported GPUs
|
||||
|
||||
The table below shows supported GPUs for Instinct™, Radeon Pro™ and Radeon™
|
||||
GPUs. Please click the tabs below to switch between GPU product lines. If a GPU
|
||||
is not listed on this table, the GPU is not officially supported by AMD.
|
||||
|
||||
:::::{tab-set}
|
||||
|
||||
::::{tab-item} AMD Instinct™
|
||||
:sync: instinct
|
||||
|
||||
| Product Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) |Support |
|
||||
|:------------:|:------------:|:--------------------------------------------------------------------:|:-------:|
|
||||
| AMD Instinct™ MI250X | CDNA2 | gfx90a | ✅ |
|
||||
| AMD Instinct™ MI250 | CDNA2 | gfx90a | ✅ |
|
||||
| AMD Instinct™ MI210 | CDNA2 | gfx90a | ✅ |
|
||||
| AMD Instinct™ MI100 | CDNA | gfx908 | ✅ |
|
||||
| AMD Instinct™ MI50 | GCN5.1 | gfx906 | ✅ |
|
||||
| AMD Instinct™ MI25 | GCN5.0 | gfx900 | ❌ |
|
||||
|
||||
::::
|
||||
|
||||
::::{tab-item} Radeon Pro™
|
||||
:sync: radeonpro
|
||||
|
||||
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|
||||
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|
|
||||
| AMD Radeon™ Pro W7900 | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
|
||||
| AMD Radeon™ Pro W6800 | RDNA2 | gfx1030 | ✅ |
|
||||
| AMD Radeon™ Pro V620 | RDNA2 | gfx1030 | ✅ |
|
||||
| AMD Radeon™ Pro VII | GCN5.1 | gfx906 | ✅ |
|
||||
::::
|
||||
|
||||
::::{tab-item} Radeon™
|
||||
:sync: radeonpro
|
||||
|
||||
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|
||||
|:----:|:---------------:|:--------------------------------------------------------------------:|:-------:|
|
||||
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
|
||||
| AMD Radeon™ VII | GCN5.1 | gfx906 | ✅ |
|
||||
|
||||
::::
|
||||
:::::
|
||||
|
||||
### Support status
|
||||
|
||||
✅: **Supported** - AMD enables these GPUs in our software distributions for
|
||||
the corresponding ROCm product.
|
||||
⚠️: **Deprecated** - Support will be removed in a future release.
|
||||
❌: **Unsupported** - This configuration is not enabled in our software
|
||||
distributions.
|
||||
|
||||
## CPU support
|
||||
|
||||
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
|
||||
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="OpenMP support in ROCm">
|
||||
<meta name="keywords" content="OpenMP, LLVM, OpenMP toolchain">
|
||||
</head>
|
||||
|
||||
# OpenMP support in ROCm
|
||||
|
||||
## Introduction
|
||||
@@ -9,7 +15,8 @@ Along with host APIs, the OpenMP compilers support offloading code and data onto
|
||||
GPU devices. This document briefly describes the installation location of the
|
||||
OpenMP toolchain, example usage of device offloading, and usage of `rocprof`
|
||||
with OpenMP applications. The GPUs supported are the same as those supported by
|
||||
this ROCm release. See the list of supported GPUs for [Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md).
|
||||
this ROCm release. See the list of supported GPUs for {doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
|
||||
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.
|
||||
|
||||
The ROCm OpenMP compiler is implemented using LLVM compiler technology.
|
||||
The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.
|
||||
@@ -41,10 +48,10 @@ cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
|
||||
sudo make run
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`sudo` is required since we are building inside the `/opt` directory.
|
||||
Alternatively, copy the files to your home directory first.
|
||||
```
|
||||
:::
|
||||
|
||||
The above invocation of Make compiles and runs the program. Note the options
|
||||
that are required for target offload from an OpenMP program:
|
||||
@@ -53,13 +60,15 @@ that are required for target offload from an OpenMP program:
|
||||
-fopenmp --offload-arch=<gpu-arch>
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The compiler also accepts the alternative offloading notation:
|
||||
|
||||
```bash
|
||||
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
|
||||
```
|
||||
|
||||
:::
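
For reference, a minimal single-file sketch in the spirit of the `veccopy` example might look like the following (the compiler invocation, GPU architecture and array size are placeholder assumptions for your installation, not values taken from this page):

```cpp
// vecscale.cpp -- minimal OpenMP target-offload sketch (not one of the shipped examples).
// Build, e.g.: amdclang++ -fopenmp --offload-arch=<gpu-arch> vecscale.cpp -o vecscale
#include <cstdio>

int main() {
  constexpr int n = 1 << 20;
  static double in[n], out[n];
  for (int i = 0; i < n; ++i) in[i] = i;

  // Offload the loop to the GPU; map() moves the arrays between host and device.
  #pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
  for (int i = 0; i < n; ++i) out[i] = 2.0 * in[i];

  std::printf("out[42] = %g\n", out[42]);
  return 0;
}
```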
|
||||
|
||||
Obtain the value of `gpu-arch` by running the following command:
|
||||
|
||||
```bash
|
||||
@@ -321,10 +330,10 @@ double a = 0.0;
|
||||
a = a + 1.0;
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
|
||||
`AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
|
||||
```
|
||||
:::
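
As a hedged illustration (not taken from the shipped examples), the kind of update affected by these options looks like the following; whether the increment compiles to a fast hardware floating-point atomic or a compare-and-swap loop depends on the build options discussed in this section:

```cpp
// fp_atomic.cpp -- sketch of a floating-point atomic update subject to the
// fast/safe FP-atomics code generation discussed on this page.
// Build with -fopenmp --offload-arch=<gpu-arch> plus the FP-atomics option of your choice.
#include <cstdio>

int main() {
  constexpr int n = 1 << 16;
  double sum = 0.0;

  #pragma omp target teams distribute parallel for map(tofrom: sum)
  for (int i = 0; i < n; ++i) {
    #pragma omp atomic update
    sum += 1.0;  // fast HW atomic or CAS loop, depending on the flags above
  }

  std::printf("sum = %g (expected %d)\n", sum, n);
  return 0;
}
```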
|
||||
|
||||
To disable the generation of fast floating-point atomic instructions at the file
|
||||
level, build using the option `-msafe-fp-atomics` or use a hint clause on a
|
||||
|
||||
@@ -1,24 +0,0 @@
|
||||
# User/kernel-space support matrix
|
||||
|
||||
ROCm™ provides forward and backward compatibility between the Kernel Fusion
|
||||
Driver (KFD) and its user space software for +/- 2 releases. This table shows
|
||||
the compatibility combinations that are currently supported.
|
||||
|
||||
| KFD | Tested user space versions |
|
||||
|:------|:--------------------------:|
|
||||
| 5.0.2 | 5.1.0, 5.2.0 |
|
||||
| 5.1.0 | 5.0.2 |
|
||||
| 5.1.3 | 5.2.0, 5.3.0 |
|
||||
| 5.2.0 | 5.0.2, 5.1.3 |
|
||||
| 5.2.3 | 5.3.0, 5.4.0 |
|
||||
| 5.3.0 | 5.1.3, 5.2.3 |
|
||||
| 5.3.3 | 5.4.0, 5.5.0 |
|
||||
| 5.4.0 | 5.2.3, 5.3.3 |
|
||||
| 5.4.3 | 5.5.0, 5.6.0 |
|
||||
| 5.4.4 | 5.5.0 |
|
||||
| 5.5.0 | 5.3.3, 5.4.3 |
|
||||
| 5.5.1 | 5.6.0, 5.7.0 |
|
||||
| 5.6.0 | 5.4.3, 5.5.1 |
|
||||
| 5.6.1 | 5.7.0 |
|
||||
| 5.7.0 | 5.5.0, 5.6.1 |
|
||||
| 5.7.1 | 5.5.0, 5.6.1 |
|
||||
@@ -1,80 +0,0 @@
|
||||
# GPU and OS support (Windows)
|
||||
|
||||
(windows-support)=
|
||||
|
||||
## Supported SKUs
|
||||
|
||||
AMD HIP SDK supports the following Windows variants.
|
||||
|
||||
| Distribution |Processor Architectures| Validated update |
|
||||
|---------------------|-----------------------|--------------------|
|
||||
| Windows 10 | x86-64 | 22H2 (GA) |
|
||||
| Windows 11 | x86-64 | 22H2 (GA) |
|
||||
| Windows Server 2022 | x86-64 | |
|
||||
|
||||
## Windows-supported GPUs
|
||||
|
||||
The table below shows supported GPUs for Radeon Pro™ and Radeon™ GPUs. Please
|
||||
click the tabs below to switch between GPU product lines. If a GPU is not listed
|
||||
on this table, the GPU is not officially supported by AMD.
|
||||
|
||||
::::{tab-set}
|
||||
|
||||
:::{tab-item} Radeon Pro™
|
||||
:sync: radeonpro
|
||||
|
||||
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|
||||
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
|
||||
| AMD Radeon Pro™ W7900 | RDNA3 | gfx1100 | ✅ | ✅ |
|
||||
| AMD Radeon Pro™ W7800 | RDNA3 | gfx1100 | ✅ | ✅ |
|
||||
| AMD Radeon Pro™ W6800 | RDNA2 | gfx1030 | ✅ | ✅ |
|
||||
| AMD Radeon Pro™ W6600 | RDNA2 | gfx1032 | ✅ | ❌ |
|
||||
| AMD Radeon Pro™ W5500 | RDNA1 | gfx1012 | ❌ | ❌ |
|
||||
| AMD Radeon Pro™ VII | GCN5.1 | gfx906 | ❌ | ❌ |
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} Radeon™
|
||||
:sync: radeon
|
||||
|
||||
| Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|
||||
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
|
||||
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 7900 XT | RDNA3 | gfx1100 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 7600 | RDNA3 | gfx1102 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 6950 XT | RDNA2 | gfx1030 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 6900 XT | RDNA2 | gfx1030 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 6800 XT | RDNA2 | gfx1030 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 6800 | RDNA2 | gfx1030 | ✅ | ✅ |
|
||||
| AMD Radeon™ RX 6750 XT | RDNA2 | gfx1031 | ✅ | ❌ |
|
||||
| AMD Radeon™ RX 6700 XT | RDNA2 | gfx1031 | ✅ | ❌ |
|
||||
| AMD Radeon™ RX 6700 | RDNA2 | gfx1031 | ✅ | ❌ |
|
||||
| AMD Radeon™ RX 6650 XT | RDNA2 | gfx1032 | ✅ | ❌ |
|
||||
| AMD Radeon™ RX 6600 XT | RDNA2 | gfx1032 | ✅ | ❌ |
|
||||
| AMD Radeon™ RX 6600 | RDNA2 | gfx1032 | ✅ | ❌ |
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
### Component support
|
||||
|
||||
ROCm components are described in [What is ROCm?](../../what-is-rocm.md) Support
|
||||
on Windows is provided with two levels of enablement.
|
||||
|
||||
* **Runtime**: Runtime enables the use of the HIP and OpenCL runtimes only.
|
||||
* **HIP SDK**: Runtime plus additional components are listed in [Libraries](../../reference/library-index.md).
|
||||
Note that some math libraries are Linux exclusive.
|
||||
|
||||
### Support status
|
||||
|
||||
✅: **Supported** - AMD enables these GPUs in our software distributions for
|
||||
the corresponding ROCm product.
|
||||
⚠️: **Deprecated** - Support will be removed in a future release.
|
||||
❌: **Unsupported** - This configuration is not enabled in our software
|
||||
distributions.
|
||||
|
||||
## CPU support
|
||||
|
||||
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
|
||||
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.
|
||||
@@ -1,6 +1,10 @@
|
||||
# License
|
||||
|
||||
> Note: This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that primarily contains documentation. For other licensing information, refer to the [Licensing Terms page](./licensing).
|
||||
:::{note}
|
||||
This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that
|
||||
primarily contains documentation. For other licensing information, refer to the
|
||||
[Licensing Terms page](./licensing).
|
||||
:::
|
||||
|
||||
```{include} ../../LICENSE
|
||||
```
|
||||
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm licensing terms">
|
||||
<meta name="keywords" content="license, licensing terms">
|
||||
</head>
|
||||
|
||||
# ROCm licensing terms
|
||||
|
||||
ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately.
|
||||
@@ -108,7 +114,7 @@ companies.
|
||||
|
||||
## Package licensing
|
||||
|
||||
```{attention}
|
||||
:::{attention}
|
||||
AQL Profiler and AOCC CPU optimization are both provided in binary form, each
|
||||
subject to the license agreement enclosed in the directory for the binary and is
|
||||
available here: `/opt/rocm/share/doc/rocm-llvm-alt/EULA`. By using, installing,
|
||||
@@ -116,7 +122,7 @@ copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to
|
||||
the terms and conditions of this license agreement. If you do not agree to the
|
||||
terms of this agreement, do not install, copy or use the AQL Profiler and/or the
|
||||
AOCC CPU Optimizations.
|
||||
```
|
||||
:::
|
||||
|
||||
For the rest of the ROCm packages, you can find the licensing information at the
|
||||
following location: `/opt/rocm/share/doc/<component-name>/`
|
||||
|
||||
@@ -1,93 +0,0 @@
|
||||
# What's new in ROCm?
|
||||
|
||||
ROCm is now supported on Windows.
|
||||
|
||||
## Windows support
|
||||
|
||||
Starting with ROCm 5.5, the HIP SDK brings a subset of ROCm to developers on Windows.
|
||||
The collection of features enabled on Windows is referred to as the HIP SDK.
|
||||
These features allow developers to use the HIP runtime, HIP math libraries
|
||||
and HIP Primitive libraries. The following table shows the differences
|
||||
between Windows and Linux releases.
|
||||
|
||||
|Component|Linux|Windows|
|
||||
|---------|-----|-------|
|
||||
|Driver|Radeon Software for Linux |AMD Software Pro Edition|
|
||||
|Compiler|`hipcc`/`amdclang++`|`hipcc`/`clang++`|
|
||||
|Debugger|`rocgdb`|no debugger available|
|
||||
|Profiler|`rocprof`|[Radeon GPU Profiler](https://gpuopen.com/rgp/)|
|
||||
|Porting Tools|HIPIFY|Coming Soon|
|
||||
|Runtime|HIP (Open Sourced)|HIP (closed source)|
|
||||
|Math Libraries|Supported|Supported|
|
||||
|Primitives Libraries|Supported|Supported|
|
||||
|Communication Libraries|Supported|Not Available|
|
||||
|AI Libraries|MIOpen, MIGraphX|Not Available|
|
||||
|System Management|`rocm-smi-lib`, RDC, `rocminfo`|`amdsmi`, `hipInfo`|
|
||||
|AI Frameworks|PyTorch, TensorFlow, etc.|Not Available|
|
||||
|CMake HIP Language|Enabled|Unsupported|
|
||||
|Visual Studio| Not applicable| Plugin Available|
|
||||
|HIP Ray Tracing| Supported|Supported|
|
||||
|
||||
AMD is continuing to invest in Windows support and AMD plans to release enhanced
|
||||
features in subsequent revisions.
|
||||
|
||||
```{note}
|
||||
The 5.5 Windows Installer collectively groups the Math and Primitives
|
||||
libraries.
|
||||
```
|
||||
|
||||
```{note}
|
||||
GPU support on Windows and Linux may differ. You must refer to
|
||||
Windows and Linux GPU support tables separately.
|
||||
```
|
||||
|
||||
```{note}
|
||||
HIP Ray Tracing is not distributed via ROCm in Linux.
|
||||
```
|
||||
|
||||
## ROCm release versioning
|
||||
|
||||
Linux OS releases set the canonical version numbers for ROCm. Windows will
|
||||
follow Linux version numbers as Windows releases are based on Linux ROCm
|
||||
releases. However, not all Linux ROCm releases will have a corresponding Windows
|
||||
release. The following table shows the ROCm releases on Windows and Linux. Releases
|
||||
with both Windows and Linux are referred to as a joint release. Releases with
|
||||
only Linux support are referred to as a skipped release from the Windows
|
||||
perspective.
|
||||
|
||||
|Release version|Linux|Windows|
|
||||
|---------------|-----|-------|
|
||||
|5.5|✅|✅|
|
||||
|5.6|✅|❌|
|
||||
|
||||
ROCm Linux releases are versioned following the Major.Minor.Patch
|
||||
version number system. Windows releases will only be versioned with Major.Minor.
|
||||
|
||||
In general, Windows releases will trail Linux releases. Software developers that
|
||||
wish to support both Linux and Windows using a single ROCm version should
|
||||
refrain from upgrading ROCm unless there is a joint release.
|
||||
|
||||
## Windows documentation implications
|
||||
|
||||
The ROCm documentation website contains both Windows and Linux documentation.
|
||||
Just below each article title, a convenient article information section states
|
||||
whether the page applies to Linux only, Windows only or both OSes. To find the
|
||||
exact Windows documentation for a release of the HIP SDK, please view the ROCm documentation with the same
|
||||
Major.Minor version number while ignoring the Patch version. The Patch version
|
||||
only matters for Linux releases. For convenience,
|
||||
Windows documentation will continue to be included in the overall ROCm
|
||||
documentation for the skipped Windows releases.
|
||||
|
||||
Windows release notes will contain only information pertinent to Windows.
|
||||
The software developer must read all the previous ROCm release notes (including
|
||||
skipped ROCm versions on Windows) for information on all the changes present in
|
||||
the Windows release.
|
||||
|
||||
## Windows builds from source
|
||||
|
||||
Not all source code required to build Windows from source is available under a
|
||||
permissive open source license. Build instructions on Windows are only provided
|
||||
for projects that can be built from source on Windows using a toolchain that
|
||||
has closed source build prerequisites. The ROCm manifest file is not valid for
|
||||
Windows. AMD does not release a manifest or tag our components in Windows.
|
||||
Users may use corresponding Linux tags to build on Windows.
|
||||
@@ -1,36 +1,61 @@
|
||||
===========================
|
||||
How ROCm uses PCIe atomics
|
||||
===========================
|
||||
.. meta::
|
||||
:description: How ROCm uses PCIe atomics
|
||||
:keywords: PCIe, PCIe atomics, atomics, BAR memory, AMD, ROCm
|
||||
|
||||
*****************************************************************************
|
||||
How ROCm uses PCIe atomics
|
||||
*****************************************************************************
|
||||
|
||||
ROCm PCIe feature and overview of BAR memory
|
||||
======================================================================
|
||||
================================================================
|
||||
|
||||
ROCm is an extension of HSA platform architecture, so it shares the queuing model, memory model,
|
||||
signaling and synchronization protocols. Platform atomics are integral to perform queuing and
|
||||
signaling memory operations where there may be multiple-writers across CPU and GPU agents.
|
||||
|
||||
ROCm is an extension of HSA platform architecture, so it shares the queueing model, memory model, signaling and synchronization protocols. Platform atomics are integral to perform queuing and signaling memory operations where there may be multiple-writers across CPU and GPU agents.
|
||||
The full list of HSA system architecture platform requirements is here:
|
||||
`HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
|
||||
|
||||
The full list of HSA system architecture platform requirements are here: `HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
|
||||
AMD ROCm Software uses the new PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe]
|
||||
3.0) features for atomic read-modify-write transactions which extends inter-processor synchronization
|
||||
mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling
|
||||
memory operations.
|
||||
|
||||
The ROCm Platform uses the new PCI Express 3.0 (PCIe 3.0) features for Atomic Read-Modify-Write Transactions which extends inter-processor synchronization mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling memory operations.
|
||||
|
||||
The new PCIe AtomicOps operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``, ``SWAP`` atomics. The AtomicsOps are initiated by the
|
||||
I/O device which support 32-bit, 64-bit and 128-bit operand which target address have to be naturally aligned to operation sizes.
|
||||
The new PCIe atomic operations operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``,
|
||||
``SWAP`` atomics. The atomic operations are initiated by the I/O device, support 32-bit, 64-bit and
128-bit operands, and their target addresses have to be naturally aligned to the operation sizes.
|
||||
|
||||
For ROCm the Platform atomics are used in ROCm in the following ways:
|
||||
|
||||
* Update HSA queue’s read_dispatch_id: 64 bit atomic add used by the command processor on the GPU agent to update the packet ID it processed.
|
||||
* Update HSA queue’s write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to support multi-writer queue insertions.
|
||||
* Update HSA Signals – 64bit atomic ops are used for CPU & GPU synchronization.
|
||||
* Update HSA queue's read_dispatch_id: 64 bit atomic add used by the command processor on the
|
||||
GPU agent to update the packet ID it processed.
|
||||
* Update HSA queue's write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to
|
||||
support multi-writer queue insertions.
|
||||
* Update HSA Signals -- 64bit atomic ops are used for CPU & GPU synchronization.
|
||||
|
||||
The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through and completed by PCIe components. Routing and completion does not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled or the Atomic Operations will fail even though PCIe endpoint and PCIe I/O devices has the capability to Atomics Operations.
|
||||
The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through
|
||||
and completed by PCIe components. Routing and completion does not require software support.
|
||||
Component support for each is detectable via the Device Capabilities 2 (DevCap2) register. Upstream
|
||||
bridges need to have atomic operations routing enabled or the atomic operations will fail even though
|
||||
PCIe endpoints and PCIe I/O devices have the capability for atomic operations.
|
||||
|
||||
To do AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp routing supported bit in the Device Capabilities 2 register.
|
||||
To enable atomic operations routing between two or more Root Ports, each associated Root Port
|
||||
must indicate that capability via the atomic operations routing supported bit in the DevCap2 register.
|
||||
|
||||
If your system has a PCIe Express Switch it needs to support AtomicsOp routing. AtomicOp requests are permitted only if a component’s ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 Routing is supported, AtomicOp Routing Support=0 routing is not supported.
|
||||
If your system has a PCIe Express Switch it needs to support atomic operations routing. Atomic
|
||||
operations requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
|
||||
field is set. These requests can only be serviced if the upstream components support atomic operation
|
||||
completion and/or routing to a component which does. Atomic operations routing support=1, routing
|
||||
is supported; atomic operations routing support=0, routing is not supported.
|
||||
|
||||
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).
|
||||
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there
|
||||
must be a response for Completion containing the result of the operation. Errors associated with the
|
||||
operation (uncorrectable error accessing the target location or carrying out the atomic operation) are
|
||||
signaled to the requester by setting the Completion Status field in the completion descriptor, they are
|
||||
set to Completer Abort (CA) or Unsupported Request (UR).
|
||||
|
||||
To understand more about how PCIe atomic operations work, see `PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
|
||||
To understand more about how PCIe atomic operations work, see
|
||||
`PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
|
||||
|
||||
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
|
||||
|
||||
@@ -39,56 +64,60 @@ There are also a number of papers which talk about these new capabilities:
|
||||
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
|
||||
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
|
||||
* `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
|
||||
* `PCIe Generation 4 Base Specification includes Atomics Operation <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
|
||||
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
|
||||
|
||||
Other I/O devices with PCIe atomics support
|
||||
|
||||
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
|
||||
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
|
||||
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
|
||||
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
|
||||
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
|
||||
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
|
||||
|
||||
Future bus technology with richer I/O atomics operation Support
|
||||
|
||||
* GenZ
|
||||
|
||||
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPU’s with PCIe Generation 3.0 support.
|
||||
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
|
||||
with PCIe Generation 3.0 support.
|
||||
|
||||
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
|
||||
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
|
||||
|
||||
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
|
||||
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
|
||||
originates two writes to two different targets:
|
||||
|
||||
| 1. write to another GPU memory,
|
||||
* Write to another GPU memory
|
||||
* Write to system memory to indicate transfer complete
|
||||
|
||||
| 2. then write to system memory to indicate transfer complete.
|
||||
|
||||
They are routed off to different ends of the computer but we want to make sure the write to system memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.
|
||||
They are routed off to different ends of the computer but we want to make sure the write to system
|
||||
memory indicating transfer complete occurs AFTER the P2P write to the GPU has completed.
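
As a hedged sketch of this pattern from the HIP runtime (the device IDs, payload size and the
completion mechanism used here are illustrative assumptions, not part of this page), the
peer-to-peer payload write and the completion signal are enqueued on the same stream, so the
completion is only observed after the P2P write has finished:

.. code-block:: cpp

   // Sketch only: assumes at least two P2P-capable GPUs are visible.
   #include <hip/hip_runtime.h>
   #include <cstdio>

   int main() {
     constexpr size_t bytes = 1 << 20;   // placeholder payload size
     void *src = nullptr, *dst = nullptr;
     hipStream_t stream;
     hipEvent_t done;

     hipSetDevice(0);
     hipDeviceEnablePeerAccess(1, 0);    // let GPU 0 write GPU 1 memory
     hipMalloc(&src, bytes);
     hipSetDevice(1);
     hipMalloc(&dst, bytes);

     hipSetDevice(0);
     hipStreamCreate(&stream);
     hipEventCreate(&done);

     // 1. P2P write to the other GPU's memory ...
     hipMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
     // 2. ... then the completion signal; stream ordering guarantees it is
     //    observed only after the peer write has completed.
     hipEventRecord(done, stream);
     hipEventSynchronize(done);
     std::printf("peer copy complete\n");
     return 0;
   }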
|
||||
|
||||
BAR memory overview
|
||||
***************************************************************************************************
|
||||
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set MMIO Base address ( MMIOH Base) and Range ( MMIO High Size) in the BIOS.
|
||||
----------------------------------------------------------------------------------------------------
|
||||
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing; if so, we need to set
|
||||
memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) in the BIOS.
|
||||
|
||||
In SuperMicro system in the system bios you need to see the following
|
||||
In a Supermicro system, you need to set the following in the system BIOS:
|
||||
|
||||
* Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
|
||||
* Advanced->PCIe/PCI/PnP configuration-\> Above 4G Decoding = Enabled
|
||||
* Advanced->PCIe/PCI/PnP Configuration-\>MMIOH Base = 512G
|
||||
* Advanced->PCIe/PCI/PnP Configuration-\>MMIO High Size = 256G
|
||||
|
||||
* Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
|
||||
|
||||
* Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G
|
||||
|
||||
When we support Large Bar Capability there is a Large Bar Vbios which also disable the IO bar.
|
||||
When we support Large BAR capability there is a Large BAR VBIOS which also disables the IO BAR.
|
||||
|
||||
For GFX9 and Vega10, which have a physical address of up to 44 bits and a 48-bit virtual address:
|
||||
|
||||
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must be placed < 2^44 to support P2P access from other Vega10.
|
||||
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed < 2^44 to support P2P access from other Vega10.
|
||||
* BAR4 register: Optional, not a boot device.
|
||||
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
|
||||
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must
|
||||
be placed < 2^44 to support P2P access from other Vega10.
|
||||
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed \< 2^44 to support P2P access from
|
||||
other Vega10.
|
||||
* BAR4 register: Optional, not a boot device.
|
||||
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed \< 4GB.
|
||||
|
||||
Here is how our base address register (BAR) works on GFX 8 GPU’s with 40 bit Physical Address Limit ::
|
||||
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
|
||||
|
||||
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
|
||||
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
|
||||
Series] (rev c1)
|
||||
|
||||
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
|
||||
|
||||
@@ -106,40 +135,23 @@ Here is how our base address register (BAR) works on GFX 8 GPU’s with 40 bit P
|
||||
|
||||
Legend:
|
||||
|
||||
1 : GPU Frame Buffer BAR – In this example it happens to be 256M, but typically this will be size of the GPU memory (typically 4GB+). This BAR has to be placed < 2^40 to allow peer-to-peer access from other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed < 2^44 to allow peer-to-peer access from other GFX9 AMD GPUs.
|
||||
1 : GPU Frame Buffer BAR -- In this example it happens to be 256M, but typically this will be the size of the
|
||||
GPU memory (typically 4GB+). This BAR has to be placed \< 2^40 to allow peer-to-peer access from
|
||||
other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed \< 2^44 to allow peer-to-peer
|
||||
access from other GFX9 AMD GPUs.
|
||||
|
||||
2 : Doorbell BAR – The size of the BAR is typically will be < 10MB (currently fixed at 2MB) for this generation GPUs. This BAR has to be placed < 2^40 to allow peer-to-peer access from other current generation AMD GPUs.
|
||||
2 : Doorbell BAR -- The size of the BAR will typically be \< 10MB (currently fixed at 2MB) for this
|
||||
generation GPUs. This BAR has to be placed \< 2^40 to allow peer-to-peer access from other current
|
||||
generation AMD GPUs.
|
||||
|
||||
3 : IO BAR - This is for legacy VGA and boot device support, but since this the GPUs in this project are not VGA devices (headless), this is not a concern even if the SBIOS does not setup.
|
||||
3 : IO BAR -- This is for legacy VGA and boot device support, but since the GPUs in this project are
not VGA devices (headless), this is not a concern even if the SBIOS does not set it up.
|
||||
|
||||
4 : MMIO BAR – This is required for the AMD Driver SW to access the configuration registers. Since the reminder of the BAR available is only 1 DWORD (32bit), this is placed < 4GB. This is fixed at 256KB.
|
||||
4 : MMIO BAR -- This is required for the AMD Driver SW to access the configuration registers. Since the
remainder of the BAR available is only 1 DWORD (32bit), this is placed \< 4GB. This is fixed at 256KB.
|
||||
|
||||
5 : Expansion ROM – This is required for the AMD Driver SW to access the GPU’s video-bios. This is currently fixed at 128KB.
|
||||
5 : Expansion ROM -- This is required for the AMD Driver SW to access the GPU video-bios. This is
|
||||
currently fixed at 128KB.
|
||||
|
||||
Excerpts from 'Overview of Changes to PCI Express 3.0'
|
||||
================================================================
|
||||
By Mike Jackson, Senior Staff Architect, MindShare, Inc.
|
||||
***************************************************************************************************
|
||||
Atomic operations – goal:
|
||||
***************************************************************************************************
|
||||
Support SMP-type operations across a PCIe network to allow for things like offloading tasks between CPU cores and accelerators like a GPU. The spec says this enables advanced synchronization mechanisms that are particularly useful with multiple producers or consumers that need to be synchronized in a non-blocking fashion. Three new atomic non-posted requests were added, plus the corresponding completion (the address must be naturally aligned with the operand size or the TLP is malformed):
|
||||
|
||||
* Fetch and Add – uses one operand as the “add” value. Reads the target location, adds the operand, and then writes the result back to the original location.
|
||||
|
||||
* Unconditional Swap – uses one operand as the “swap” value. Reads the target location and then writes the swap value to it.
|
||||
|
||||
* Compare and Swap – uses 2 operands: first data is compare value, second is swap value. Reads the target location, checks it against the compare value and, if equal, writes the swap value to the target location.
|
||||
|
||||
* AtomicOpCompletion – new completion to return the result of an atomic request and indicate that the atomicity of the transaction has been maintained.
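
In CPU terms these map onto familiar read-modify-write primitives; the following host-side
sketch (plain C++, unrelated to any PCIe device) only illustrates the semantics of the three
request types listed above:

.. code-block:: cpp

   // Illustration of the FetchAdd / Swap / CAS semantics in plain C++.
   #include <atomic>
   #include <cstdint>
   #include <cstdio>

   int main() {
     std::atomic<std::uint64_t> target{10};

     std::uint64_t before_add  = target.fetch_add(5);            // Fetch and Add: 10 -> 15
     std::uint64_t before_swap = target.exchange(42);            // Unconditional Swap: 15 -> 42
     std::uint64_t expected    = 42;
     bool swapped = target.compare_exchange_strong(expected, 7); // Compare and Swap: 42 -> 7

     std::printf("%llu %llu %d %llu\n",
                 (unsigned long long)before_add,
                 (unsigned long long)before_swap,
                 (int)swapped,
                 (unsigned long long)target.load());
     return 0;
   }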
|
||||
|
||||
Since atomic operations are not locked they don't have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.
|
||||
|
||||
Atomic operations can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route atomic operations is also indicated in the registers for a given port.
|
||||
|
||||
ID-based ordering – goal:
|
||||
***************************************************************************************************
|
||||
Improve performance by avoiding stalls caused by ordering rules. For example, posted writes are never normally allowed to pass each other in a queue, but if they are requested by different functions, we can have some confidence that the requests are not dependent on each other. The previously reserved Attribute bit [2] is now combined with the RO bit to indicate ID ordering with or without relaxed ordering.
|
||||
|
||||
This only has meaning for memory requests, and is reserved for Configuration or IO requests. Completers are not required to copy this bit into a completion, and only use the bit if their enable bit is set for this operation.
|
||||
|
||||
To read more on PCIe Gen 3 new options https://www.mindshare.com/files/resources/PCIe%203-0.pdf
|
||||
For more information, you can review
|
||||
`Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Inference optimization with MIGraphX">
|
||||
<meta name="keywords" content="Inference optimization, MIGraphX, deep-learning, MIGraphX
|
||||
installation, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Inference optimization with MIGraphX
|
||||
|
||||
The following sections cover inferencing and introduces [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/).
|
||||
@@ -209,23 +216,23 @@ Follow these steps:
|
||||
./inception_inference
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Set `LD_LIBRARY_PATH` to `/opt/rocm/lib` if required during the build. Additional examples can be found in the MIGraphX repository under the `/examples/` directory.
|
||||
```
|
||||
:::
|
||||
|
||||
## Tuning MIGraphX
|
||||
|
||||
MIGraphX uses MIOpen kernels to target AMD GPUs. For a model compiled with MIGraphX, tune MIOpen to pick the best possible kernel implementation. MIOpen tuning results in a significant performance boost. Tuning can be done by setting the environment variable `MIOPEN_FIND_ENFORCE=3`.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The tuning process can take a long time to finish.
|
||||
```
|
||||
:::
|
||||
|
||||
**Example:** The average inference time of the inception model example shown previously over 100 iterations using untuned kernels is 0.01383ms. After tuning, it reduces to 0.00459ms, which is a 3x improvement. This result is from ROCm v4.5 on a MI100 GPU.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The results may vary depending on the system configurations.
|
||||
```
|
||||
:::
|
||||
|
||||
For reference, the following code snippet shows inference runs for only the first 10 iterations for both tuned and untuned kernels:
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Inception V3 with PyTorch">
|
||||
<meta name="keywords" content="PyTorch, Inception V3, deep-learning, training data, optimization
|
||||
algorithm, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Deep learning: Inception V3 with PyTorch
|
||||
|
||||
## Deep learning training
|
||||
@@ -36,7 +43,7 @@ Training is different from inference, particularly from the hardware perspective
|
||||
| Data for training is available on the disk before the training process and is generally significant. The training performance is measured by how fast the data batches can be processed. | Inference data usually arrive stochastically, which may be batched to improve performance. Inference performance is generally measured in throughput speed to process the batch of data and the delay in responding to the input (latency). |
|
||||
:::
|
||||
|
||||
Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other datatypes, leading to improvement in performance if a faster datatype can be selected for the corresponding task.
|
||||
Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other data types, leading to improvement in performance if a faster datatype can be selected for the corresponding task.
|
||||
|
||||
## Case studies
|
||||
|
||||
@@ -56,7 +63,7 @@ This example is adapted from the PyTorch research hub page on [Inception V3](htt
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
|
||||
1. Run the PyTorch ROCm-based Docker image or refer to the section {doc}`Installing PyTorch <rocm-install-on-linux:how-to/3rd-party/pytorch-install>` for setting up a PyTorch environment on ROCm.
|
||||
|
||||
```dockerfile
|
||||
docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
|
||||
@@ -146,7 +153,7 @@ The previous section focused on downloading and using the Inception V3 model for
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
|
||||
1. Run the PyTorch ROCm Docker image or refer to the section {doc}`Installing PyTorch <rocm-install-on-linux:how-to/3rd-party/pytorch-install>` for setting up a PyTorch environment on ROCm.
|
||||
|
||||
```dockerfile
|
||||
docker pull rocm/pytorch:latest
|
||||
@@ -208,9 +215,9 @@ Follow these steps:
|
||||
|
||||
7. Set parameters to guide the training process.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
The device is set to `"cuda"`. In PyTorch, `"cuda"` is a generic keyword to denote a GPU.
|
||||
```
|
||||
:::
|
||||
|
||||
```py
|
||||
device = "cuda"
|
||||
@@ -270,9 +277,9 @@ Follow these steps:
|
||||
lr_gamma = 0.1
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
One training epoch is when the neural network passes an entire dataset forward and backward.
|
||||
```
|
||||
:::
|
||||
|
||||
```py
|
||||
epochs = 90
|
||||
@@ -333,9 +340,9 @@ Follow these steps:
|
||||
)
|
||||
```
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
Use torchvision to obtain the Inception V3 model. Use the pre-trained model weights to speed up training.
|
||||
```
|
||||
:::
|
||||
|
||||
```py
|
||||
print("Creating model")
|
||||
@@ -1155,9 +1162,10 @@ To prepare the data for training, follow these steps:
|
||||
print("Accuracy: ", accuracy)
|
||||
```
|
||||
|
||||
```{note}
|
||||
model.fit() returns a History object that contains a dictionary with everything that happened during training.
|
||||
```
|
||||
:::{note}
|
||||
`model.fit()` returns a History object that contains a dictionary with everything that happened during
|
||||
training.
|
||||
:::
|
||||
|
||||
```py
|
||||
history_dict = history.history
|
||||
|
||||
@@ -1,34 +1,40 @@
|
||||
***********
|
||||
.. meta::
|
||||
:description: Using CMake
|
||||
:keywords: CMake, dependencies, HIP, C++, AMD, ROCm
|
||||
|
||||
*********************************
|
||||
Using CMake
|
||||
***********
|
||||
*********************************
|
||||
|
||||
Most components in ROCm support CMake. Projects depending on header-only or
|
||||
library components typically require CMake 3.5 or higher whereas those wanting
|
||||
to make use of CMake's HIP language support will require CMake 3.21 or higher.
|
||||
to make use of the CMake HIP language support will require CMake 3.21 or higher.
|
||||
|
||||
Finding dependencies
|
||||
====================
|
||||
|
||||
.. note::
|
||||
For a complete
|
||||
reference on how to deal with dependencies in CMake, refer to the CMake docs
|
||||
on `find_package
|
||||
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
|
||||
`Using Dependencies Guide
|
||||
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
|
||||
to get an overview of CMake's related facilities.
|
||||
|
||||
For a complete
|
||||
reference on how to deal with dependencies in CMake, refer to the CMake docs
|
||||
on `find_package
|
||||
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
|
||||
`Using Dependencies Guide
|
||||
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
|
||||
to get an overview of CMake related facilities.
|
||||
|
||||
In short, CMake supports finding dependencies in two ways:
|
||||
|
||||
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to
|
||||
find the component in typical install locations and layouts. CMake ships a
|
||||
few dozen such scripts, but users and projects may ship them as well.
|
||||
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
|
||||
``<PackageName>Config.cmake`` which describes the installed component in all
|
||||
regards needed to consume it.
|
||||
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to find the component
|
||||
in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects
|
||||
may ship them as well.
|
||||
|
||||
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
|
||||
``<PackageName>Config.cmake`` which describes the installed component in all regards needed to
|
||||
consume it.
|
||||
|
||||
ROCm predominantly relies on Config mode, one notable exception being the Module
|
||||
driving the compilation of HIP programs on Nvidia runtimes. As such, when
|
||||
driving the compilation of HIP programs on NVIDIA runtimes. As such, when
|
||||
dependencies are not found in standard system locations, one either has to
|
||||
instruct CMake to search for package config files in additional folders using
|
||||
the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of file system
|
||||
@@ -40,9 +46,9 @@ it to your CMake configuration command on the command line via
|
||||
``-D CMAKE_PREFIX_PATH=....`` . AMD packaged ROCm installs can typically be
|
||||
added to the config file search paths such as:
|
||||
|
||||
- Windows: ``-D CMAKE_PREFIX_PATH=${env:HIP_PATH}``
|
||||
* Windows: ``-D CMAKE_PREFIX_PATH=${env:HIP_PATH}``
|
||||
|
||||
- Linux: ``-D CMAKE_PREFIX_PATH=/opt/rocm``
|
||||
* Linux: ``-D CMAKE_PREFIX_PATH=/opt/rocm``
|
||||
|
||||
ROCm provides the respective *config-file* packages, and this enables
|
||||
``find_package`` to be used directly. ROCm does not require any Find module as
|
||||
@@ -50,14 +56,16 @@ the *config-file* packages are shipped with the upstream projects, such as
|
||||
rocPRIM and other ROCm libraries.
|
||||
|
||||
For a complete guide on where and how ROCm may be installed on a system, refer
|
||||
to the installation guides for `Linux <../install/linux/install.html>`_ and
|
||||
`Windows <../install/windows/install.html>`_.
|
||||
to the installation guides for
|
||||
`Linux <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html>`_
|
||||
and
|
||||
`Windows <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html>`_.
|
||||
|
||||
Using HIP in CMake
|
||||
==================
|
||||
|
||||
ROCm components providing a C/C++ interface support consumption via any
|
||||
C/C++ toolchain that CMake knows how to drive. ROCm also supports CMake's HIP
|
||||
C/C++ toolchain that CMake knows how to drive. ROCm also supports the CMake HIP
|
||||
language features, allowing users to program using the HIP single-source
|
||||
programming model. When a program (or translation-unit) uses the HIP API without
|
||||
compiling any GPU device code, HIP can be treated in CMake as a simple C/C++
|
||||
@@ -70,22 +78,22 @@ Source code written in the HIP dialect of C++ typically uses the `.hip`
|
||||
extension. When the HIP CMake language is enabled, it will automatically
|
||||
associate such source files with the HIP toolchain being used.
|
||||
|
||||
::
|
||||
.. code-block:: cmake
|
||||
|
||||
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
|
||||
cmake_policy(VERSION 3.21.3...3.27)
|
||||
project(MyProj LANGUAGES HIP)
|
||||
add_executable(MyApp Main.hip)
|
||||
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
|
||||
cmake_policy(VERSION 3.21.3...3.27)
|
||||
project(MyProj LANGUAGES HIP)
|
||||
add_executable(MyApp Main.hip)
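
For completeness, a minimal ``Main.hip`` to pair with the snippet above could look like the
following; the kernel, grid size and data size are placeholders, not taken from any shipped
sample:

.. code-block:: cpp

   // Main.hip -- minimal single-source HIP program for the CMake example above.
   #include <hip/hip_runtime.h>
   #include <cstdio>

   __global__ void scale(float* x, float a, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) x[i] *= a;
   }

   int main() {
     constexpr int n = 1 << 20;
     float* d = nullptr;
     hipMalloc(&d, n * sizeof(float));
     hipMemset(d, 0, n * sizeof(float));

     scale<<<dim3((n + 255) / 256), dim3(256), 0, 0>>>(d, 2.0f, n);
     hipDeviceSynchronize();

     hipFree(d);
     std::printf("kernel finished\n");
     return 0;
   }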
|
||||
|
||||
Should you have existing CUDA code that is in the source-compatible subset of
|
||||
HIP, you can tell CMake that despite their `.cu` extension, they're HIP sources.
|
||||
Do note that this mostly facilitates compiling kernel code-only source files,
|
||||
as host-side CUDA API won't compile in this fashion.
|
||||
|
||||
::
|
||||
.. code-block:: cmake
|
||||
|
||||
add_library(MyLib MyLib.cu)
|
||||
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
|
||||
add_library(MyLib MyLib.cu)
|
||||
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
|
||||
|
||||
CMake itself only hosts part of the HIP language support, such as defining
|
||||
HIP-specific properties, etc. while the other half ships with the HIP
|
||||
@@ -97,6 +105,10 @@ there's a catch-all, last resort variable consulted locating this file,
|
||||
``-D CMAKE_HIP_COMPILER_ROCM_ROOT:PATH=`` which should be set to the root of the
|
||||
ROCm installation.
|
||||
|
||||
.. note::
|
||||
Imported targets defined by `hip-lang-config.cmake` are for internal use
|
||||
only.
|
||||
|
||||
If the user doesn't provide a semi-colon delimited list of device architectures
|
||||
via ``CMAKE_HIP_ARCHITECTURES``, CMake will select some sensible default. It is
|
||||
advised though that if a user knows what devices they wish to target, then set
|
||||
@@ -110,45 +122,57 @@ Illustrated in the example below is a C++ application using MIOpen from CMake.
|
||||
It calls ``find_package(miopen)``, which provides the ``MIOpen`` imported
|
||||
target. This can be linked with ``target_link_libraries``
|
||||
|
||||
::
|
||||
.. code-block:: cmake
|
||||
|
||||
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
|
||||
cmake_policy(VERSION 3.5...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(miopen)
|
||||
add_library(MyLib ...)
|
||||
target_link_libraries(MyLib PUBLIC MIOpen)
|
||||
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
|
||||
cmake_policy(VERSION 3.5...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(miopen)
|
||||
add_library(MyLib ...)
|
||||
target_link_libraries(MyLib PUBLIC MIOpen)
|
||||
|
||||
.. note::
|
||||
Most libraries are designed as host-only API, so using a GPU device
|
||||
compiler is not necessary for downstream projects unless they use GPU device
|
||||
code.
|
||||
|
||||
Most libraries are designed as host-only API, so using a GPU device
|
||||
compiler is not necessary for downstream projects unless they use GPU device
|
||||
code.
|
||||
|
||||
Consuming the HIP API in C++ code
|
||||
---------------------------------
|
||||
|
||||
Use the HIP API without compiling the GPU device code. As there is no GPU code,
|
||||
any C or C++ compiler can be used. The ``find_package(hip)`` provides the
|
||||
``hip::host`` imported target to use HIP in this context.
|
||||
Consuming the HIP API without compiling single-source GPU device code can be
|
||||
done using any C++ compiler. The ``find_package(hip)`` provides the
|
||||
``hip::host`` imported target to use HIP in this scenario.
|
||||
|
||||
::
|
||||
.. code-block:: cmake
|
||||
|
||||
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
|
||||
cmake_policy(VERSION 3.5...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(hip REQUIRED)
|
||||
add_executable(MyApp ...)
|
||||
target_link_libraries(MyApp PRIVATE hip::host)
|
||||
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
|
||||
cmake_policy(VERSION 3.5...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(hip REQUIRED)
|
||||
add_executable(MyApp ...)
|
||||
target_link_libraries(MyApp PRIVATE hip::host)
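
As a sketch of what such a host-only consumer can look like (the property fields printed here
are illustrative and vary between HIP versions):

.. code-block:: cpp

   // Host-only HIP API usage; builds with any C++ compiler when linked against hip::host.
   #include <hip/hip_runtime_api.h>
   #include <iostream>

   int main() {
     int count = 0;
     if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
       std::cerr << "no HIP devices visible\n";
       return 1;
     }
     for (int i = 0; i < count; ++i) {
       hipDeviceProp_t props;
       hipGetDeviceProperties(&props, i);
       std::cout << "device " << i << ": " << props.name << "\n";
     }
     return 0;
   }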
|
||||
|
||||
When mixing such ``CXX`` sources with ``HIP`` sources holding device-code, link
|
||||
only to `hip::host`. If HIP sources don't have `.hip` as their extension, use
|
||||
`set_source_files_properties(<hip_sources>... PROPERTIES LANGUAGE HIP)` on them.
|
||||
Linking to `hip::host` will set all the necessary flags for the ``CXX`` sources
|
||||
while ``HIP`` sources inherit all flags from the built-in language support.
|
||||
Having HIP sources in a target will turn the |LINK_LANG|_ into ``HIP``.
|
||||
|
||||
.. |LINK_LANG| replace:: ``LINKER_LANGUAGE``
|
||||
.. _LINK_LANG: https://cmake.org/cmake/help/latest/prop_tgt/LINKER_LANGUAGE.html
|
||||
|
||||
Compiling device code in C++ language mode
|
||||
------------------------------------------
|
||||
|
||||
.. attention::
|
||||
The workflow detailed here is considered legacy and is shown for
|
||||
understanding's sake. It pre-dates the existence of HIP language support in
|
||||
CMake. If source code has HIP device code in it, it is a HIP source file
|
||||
and should be compiled as such. Only resort to the method below if your
|
||||
HIP-enabled CMake codepath can't mandate CMake version 3.21.
|
||||
|
||||
The workflow detailed here is considered legacy and is shown for
|
||||
understanding's sake. It pre-dates the existence of HIP language support in
|
||||
CMake. If source code has HIP device code in it, it is a HIP source file
|
||||
and should be compiled as such. Only resort to the method below if your
|
||||
HIP-enabled CMake code path can't mandate CMake version 3.21.
|
||||
|
||||
If code uses the HIP API and compiles GPU device code, it requires using a
|
||||
device compiler. The compiler for CMake can be set using either the
|
||||
@@ -160,20 +184,21 @@ compiler that supports AMD GPU targets, which is usually Clang.
|
||||
The ``find_package(hip)`` provides the ``hip::device`` imported target to add
|
||||
all the flags necessary for device compilation.
|
||||
|
||||
::
|
||||
.. code-block:: cmake
|
||||
|
||||
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
|
||||
cmake_policy(VERSION 3.8...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(hip REQUIRED)
|
||||
add_library(MyLib ...)
|
||||
target_link_libraries(MyLib PRIVATE hip::device)
|
||||
target_compile_features(MyLib PRIVATE cxx_std_11)
|
||||
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
|
||||
cmake_policy(VERSION 3.8...3.27)
|
||||
project(MyProj LANGUAGES CXX)
|
||||
find_package(hip REQUIRED)
|
||||
add_library(MyLib ...)
|
||||
target_link_libraries(MyLib PRIVATE hip::device)
|
||||
target_compile_features(MyLib PRIVATE cxx_std_11)
|
||||
|
||||
.. note::
|
||||
Compiling for the GPU device requires at least C++11.
|
||||
|
||||
This project can then be configured with for eg.
|
||||
Compiling for the GPU device requires at least C++11.
|
||||
|
||||
This project can then be configured with the following CMake commands.
|
||||
|
||||
- Windows: ``cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe``
|
||||
|
||||
@@ -183,11 +208,11 @@ Which use the device compiler provided from the binary packages of
|
||||
`ROCm HIP SDK <https://www.amd.com/en/developer/rocm-hub.html>`_ and
|
||||
`repo.radeon.com <https://repo.radeon.com>`_ respectively.
|
||||
|
||||
When using the CXX language support to compile HIP device code, selecting the
|
||||
When using the ``CXX`` language support to compile HIP device code, selecting the
|
||||
target GPU architectures is done via setting the ``GPU_TARGETS`` variable.
|
||||
``CMAKE_HIP_ARCHITECTURES`` only exists when the HIP language is enabled. By
|
||||
default, this is set to some subset of the currently supported architectures of
|
||||
AMD ROCm. It can be set to eg. ``-D GPU_TARGETS="gfx1032;gfx1035"``.
|
||||
AMD ROCm. It can be set to the CMake option ``-D GPU_TARGETS="gfx1032;gfx1035"``.
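
As a sketch, the same selection can also be made inside the project before HIP
is found; the architecture list below simply mirrors the example above and is
illustrative only.

.. code-block:: cmake

   # Select AMD GPU targets before find_package(hip) so that the device flags
   # added through hip::device match the intended architectures.
   set(GPU_TARGETS "gfx1032;gfx1035" CACHE STRING "AMD GPU architectures to build for")
   find_package(hip REQUIRED)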
|
||||
|
||||
ROCm CMake packages
|
||||
-------------------
|
||||
@@ -252,13 +277,12 @@ options.
|
||||
|
||||
IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, etc.) all came
|
||||
up with their own way to register command-line fragments for different purposes in
|
||||
a setup'n'forget fashion for quick assembly using graphical front-ends. This is
|
||||
a setup-and-forget fashion for quick assembly using graphical front-ends. This is
|
||||
all nice, but configurations aren't portable, nor can they be reused in
|
||||
Continuous Intergration (CI) pipelines. CMake has condensed existing practice
|
||||
Continuous Integration (CI) pipelines. CMake has condensed existing practice
|
||||
into a portable JSON format that works in all IDEs and can be invoked from any
|
||||
command line. This is
|
||||
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_
|
||||
.
|
||||
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_.
|
||||
|
||||
There are two types of preset files: one supplied by the project, called
|
||||
``CMakePresets.json`` which is meant to be committed to version control,
|
||||
@@ -275,109 +299,110 @@ Following is an example ``CMakeUserPresets.json`` file which actually compiles
|
||||
the `amd/rocm-examples <https://github.com/amd/rocm-examples>`_ suite of sample
|
||||
applications on a typical ROCm installation:
|
||||
|
||||
::
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"version": 3,
|
||||
"cmakeMinimumRequired": {
|
||||
"major": 3,
|
||||
"minor": 21,
|
||||
"patch": 0
|
||||
{
|
||||
"version": 3,
|
||||
"cmakeMinimumRequired": {
|
||||
"major": 3,
|
||||
"minor": 21,
|
||||
"patch": 0
|
||||
},
|
||||
"configurePresets": [
|
||||
{
|
||||
"name": "layout",
|
||||
"hidden": true,
|
||||
"binaryDir": "${sourceDir}/build/${presetName}",
|
||||
"installDir": "${sourceDir}/install/${presetName}"
|
||||
},
|
||||
"configurePresets": [
|
||||
{
|
||||
"name": "layout",
|
||||
"hidden": true,
|
||||
"binaryDir": "${sourceDir}/build/${presetName}",
|
||||
"installDir": "${sourceDir}/install/${presetName}"
|
||||
},
|
||||
{
|
||||
"name": "generator-ninja-multi-config",
|
||||
"hidden": true,
|
||||
"generator": "Ninja Multi-Config"
|
||||
},
|
||||
{
|
||||
"name": "toolchain-makefiles-c/c++-amdclang",
|
||||
"hidden": true,
|
||||
"cacheVariables": {
|
||||
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
|
||||
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
|
||||
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "clang-strict-iso-high-warn",
|
||||
"hidden": true,
|
||||
"cacheVariables": {
|
||||
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
|
||||
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
|
||||
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm",
|
||||
"displayName": "Ninja Multi-Config ROCm",
|
||||
"inherits": [
|
||||
"layout",
|
||||
"generator-ninja-multi-config",
|
||||
"toolchain-makefiles-c/c++-amdclang",
|
||||
"clang-strict-iso-high-warn"
|
||||
]
|
||||
{
|
||||
"name": "generator-ninja-multi-config",
|
||||
"hidden": true,
|
||||
"generator": "Ninja Multi-Config"
|
||||
},
|
||||
{
|
||||
"name": "toolchain-makefiles-c/c++-amdclang",
|
||||
"hidden": true,
|
||||
"cacheVariables": {
|
||||
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
|
||||
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
|
||||
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
|
||||
}
|
||||
],
|
||||
"buildPresets": [
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug",
|
||||
"displayName": "Debug",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm"
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release",
|
||||
"displayName": "Release",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm"
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug-verbose",
|
||||
"displayName": "Debug (verbose)",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"verbose": true
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release-verbose",
|
||||
"displayName": "Release (verbose)",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"verbose": true
|
||||
},
|
||||
{
|
||||
"name": "clang-strict-iso-high-warn",
|
||||
"hidden": true,
|
||||
"cacheVariables": {
|
||||
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
|
||||
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
|
||||
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
|
||||
}
|
||||
],
|
||||
"testPresets": [
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug",
|
||||
"displayName": "Debug",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"execution": {
|
||||
"jobs": 0
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release",
|
||||
"displayName": "Release",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"execution": {
|
||||
"jobs": 0
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm",
|
||||
"displayName": "Ninja Multi-Config ROCm",
|
||||
"inherits": [
|
||||
"layout",
|
||||
"generator-ninja-multi-config",
|
||||
"toolchain-makefiles-c/c++-amdclang",
|
||||
"clang-strict-iso-high-warn"
|
||||
]
|
||||
}
|
||||
],
|
||||
"buildPresets": [
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug",
|
||||
"displayName": "Debug",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm"
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release",
|
||||
"displayName": "Release",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm"
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug-verbose",
|
||||
"displayName": "Debug (verbose)",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"verbose": true
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release-verbose",
|
||||
"displayName": "Release (verbose)",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"verbose": true
|
||||
}
|
||||
],
|
||||
"testPresets": [
|
||||
{
|
||||
"name": "ninja-mc-rocm-debug",
|
||||
"displayName": "Debug",
|
||||
"configuration": "Debug",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"execution": {
|
||||
"jobs": 0
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "ninja-mc-rocm-release",
|
||||
"displayName": "Release",
|
||||
"configuration": "Release",
|
||||
"configurePreset": "ninja-mc-rocm",
|
||||
"execution": {
|
||||
"jobs": 0
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
.. note::
|
||||
Getting presets to work reliably on Windows requires some CMake improvements
|
||||
and/or support from compiler vendors. (Refer to
|
||||
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
|
||||
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
|
||||
.)
|
||||
|
||||
Getting presets to work reliably on Windows requires some CMake improvements
|
||||
and/or support from compiler vendors. (Refer to
|
||||
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
|
||||
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
|
||||
.)
|
||||
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm compilers disambiguation">
|
||||
<meta name="keywords" content="compilers, compiler naming, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# ROCm compilers disambiguation
|
||||
|
||||
ROCm ships multiple compilers of varying origins and purposes. This article
|
||||
|
||||
@@ -1,8 +1,15 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm Linux Filesystem Hierarchy Standard reorganization">
|
||||
<meta name="keywords" content="FHS, Linux Filesystem Hierarchy Standard, directory structure,
|
||||
AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# ROCm Linux Filesystem Hierarchy Standard reorganization
|
||||
|
||||
## Introduction
|
||||
|
||||
The ROCm platform has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.
|
||||
The ROCm Software has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.
|
||||
|
||||
## Adopting the FHS
|
||||
|
||||
@@ -152,7 +159,7 @@ correct header file and use correct search paths.
|
||||
|
||||
## Changes in versioning specifications
|
||||
|
||||
In order to better manage ROCm dependencies specification and allow smoother releases of ROCm while avoiding dependency conflicts, the ROCm platform shall adhere to the following scheme when numbering and incrementing ROCm files versions:
|
||||
In order to better manage the specification of ROCm dependencies and allow smoother ROCm releases while avoiding dependency conflicts, ROCm software shall adhere to the following scheme when numbering and incrementing ROCm file versions:
|
||||
|
||||
rocm-\<ver\>, where \<ver\> = \<x.y.z\>
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="GPU architecture">
|
||||
<meta name="keywords" content="GPU architecture, architecture support, MI200, MI250, RDNA,
|
||||
MI100, AMD Instinct">
|
||||
</head>
|
||||
|
||||
# GPU architecture documentation
|
||||
|
||||
:::::{grid} 1 1 2 2
|
||||
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="AMD Instinct MI100 microarchitecture">
|
||||
<meta name="keywords" content="Instinct, MI100, microarchitecture, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# AMD Instinct™ MI100 microarchitecture
|
||||
|
||||
The following image shows the node-level architecture of a system that
|
||||
|
||||
@@ -1,455 +1,578 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="MI200 performance counters and metrics">
|
||||
<meta name="keywords" content="MI200, performance counters, counters, GRBM counters, GRBM,
|
||||
CPF counters, CPF, CPC counters, CPC, command processor counters, SPI counters, SPI, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# MI200 performance counters and metrics
|
||||
<!-- markdownlint-disable no-duplicate-header -->
|
||||
|
||||
This document lists and describes the hardware performance counters and the derived metrics available on the AMD Instinct™ MI200 GPU. All hardware performance monitors, and the derived performance metrics are accessible via AMD ROCm™ Profiler tool.
|
||||
This document lists and describes the hardware performance counters and derived metrics available on the AMD Instinct™ MI200 GPU. All the basic hardware counters and derived metrics are accessible via the {doc}`ROCProfiler tool <rocprofiler:rocprofv1>`.
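
As an illustrative sketch (the file and application names are placeholders, and the exact invocation should be checked against the ROCProfiler documentation), a `rocprof` input file that samples a few of the counters listed below could look like this:

```text
# counters.txt -- basic counters to collect in a single pass
pmc: GRBM_COUNT SQ_WAVES SQ_INSTS_VALU
```

It would then typically be passed to the profiler as `rocprof -i counters.txt -o results.csv ./my_app`.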
|
||||
|
||||
## MI200 performance counters list
|
||||
|
||||
```{note}
|
||||
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
|
||||
```
|
||||
See the category-wise listing of MI200 performance counters in the following tables.
|
||||
|
||||
### GRBM
|
||||
:::{note}
|
||||
Preliminary validation of all MI200 performance counters is in progress. Those with “*” appended to the names require further evaluation.
|
||||
:::
|
||||
|
||||
#### GRBM counters
|
||||
### Graphics Register Bus Management (GRBM) counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
|--------------------|--------| ------------------------------------------------------|
|
||||
| `grbm_count` | Cycles | Free-running GPU clock |
|
||||
| `grbm_gui_active` | Cycles | GPU active cycles |
|
||||
| `grbm_cp_busy` | Cycles | Any of the command processor (CPC/CPF) blocks are busy. |
|
||||
| `grbm_spi_busy` | Cycles | Any of the shader processor input (SPI) are busy in the shader engine(s). |
|
||||
| `grbm_ta_busy` | Cycles | Any of the texture addressing unit are busy in the shader engine(s). |
|
||||
| `grbm_tc_busy` | Cycles | Any of the texture cache blocks (TCP/TCI/TCA/TCC) are busy. |
|
||||
| `grbm_cpc_busy` | Cycles | The command processor - compute (CPC) is busy. |
|
||||
| `grbm_cpf_busy` | Cycles | The command processor - fetcher (CPF) is busy. |
|
||||
| `grbm_utcl2_busy` | Cycles | The unified translation cache - level 2 (UTCL2) block is busy. |
|
||||
| `grbm_ea_busy` | Cycles | The efficiency arbiter (EA) block is busy. |
|
||||
| Hardware Counter | Unit | Definition |
|
||||
|:--------------------|:--------|:--------------------------------------------------------------------------|
|
||||
| `GRBM_COUNT` | Cycles | Number of free-running GPU cycles |
|
||||
| `GRBM_GUI_ACTIVE` | Cycles | Number of GPU active cycles |
|
||||
| `GRBM_CP_BUSY` | Cycles | Number of cycles any of the Command Processor (CP) blocks are busy |
|
||||
| `GRBM_SPI_BUSY` | Cycles | Number of cycles any of the Shader Processor Input (SPI) are busy in the shader engine(s) |
|
||||
| `GRBM_TA_BUSY` | Cycles | Number of cycles any of the Texture Addressing Unit (TA) are busy in the shader engine(s) |
|
||||
| `GRBM_TC_BUSY` | Cycles | Number of cycles any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy |
|
||||
| `GRBM_CPC_BUSY` | Cycles | Number of cycles the Command Processor - Compute (CPC) is busy |
|
||||
| `GRBM_CPF_BUSY` | Cycles | Number of cycles the Command Processor - Fetcher (CPF) is busy |
|
||||
| `GRBM_UTCL2_BUSY` | Cycles | Number of cycles the Unified Translation Cache - Level 2 (UTCL2) block is busy |
|
||||
| `GRBM_EA_BUSY` | Cycles | Number of cycles the Efficiency Arbiter (EA) block is busy |
|
||||
|
||||
### Command processor
|
||||
### Command Processor (CP) counters
|
||||
|
||||
The command processor counters are further classified into fetcher and compute.
|
||||
The CP counters are further classified into CP-Fetcher (CPF) and CP-Compute (CPC).
|
||||
|
||||
#### CPF
|
||||
#### CPF counters
|
||||
|
||||
##### CPF counters
|
||||
| Hardware Counter | Unit | Definition |
|
||||
|:--------------------------------------|:--------|:-------------------------------------------------------------|
|
||||
| `CPF_CMP_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the Compute UTCL1s is stalled waiting on translation |
|
||||
| `CPF_CPF_STAT_BUSY` | Cycles | Number of cycles CPF is busy |
|
||||
| `CPF_CPF_STAT_IDLE*` | Cycles | Number of cycles CPF is idle |
|
||||
| `CPF_CPF_STAT_STALL` | Cycles | Number of cycles CPF is stalled |
|
||||
| `CPF_CPF_TCIU_BUSY` | Cycles | Number of cycles CPF Texture Cache Interface Unit (TCIU) interface is busy |
|
||||
| `CPF_CPF_TCIU_IDLE` | Cycles | Number of cycles CPF TCIU interface is idle |
|
||||
| `CPF_CPF_TCIU_STALL*` | Cycles | Number of cycles CPF TCIU interface is stalled waiting on free tags |
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
|--------------------------------------|--------|--------------------------------------------------------------|
|
||||
| `cpf_cmp_utcl1_stall_on_translation` | Cycles | One of the compute UTCL1s is stalled waiting on translation. |
|
||||
| `cpf_cpf_stat_idle[∗]` | Cycles | CPF idle |
|
||||
| `cpf_cpf_stat_stall` | Cycles | CPF stall |
|
||||
| `cpf_cpf_tciu_busy` | Cycles | CPF TCIU interface busy |
|
||||
| `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle |
|
||||
| `cpf_cpf_tciu_stall[∗]` | Cycles | CPF TCIU interface is stalled waiting on free tags. |
|
||||
|
||||
#### CPC
|
||||
|
||||
##### CPC counters
|
||||
#### CPC counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| ---------------------------------| -------| --------------------------------------------------- |
|
||||
| `cpc_me1_busy_for_packet_decode` | Cycles | CPC ME1 busy decoding packets |
|
||||
| `cpc_utcl1_stall_on_translation` | Cycles | One of the UTCL1s is stalled waiting on translation |
|
||||
| `cpc_cpc_stat_busy` | Cycles | CPC busy |
|
||||
| `cpc_cpc_stat_idle` | Cycles | CPC idle |
|
||||
| `cpc_cpc_stat_stall` | Cycles | CPC stalled |
|
||||
| `cpc_cpc_tciu_busy` | Cycles | CPC TCIU interface busy |
|
||||
| `cpc_cpc_tciu_idle` | Cycles | CPC TCIU interface idle |
|
||||
| `cpc_cpc_utcl2iu_busy` | Cycles | CPC UTCL2 interface busy |
|
||||
| `cpc_cpc_utcl2iu_idle` | Cycles | CPC UTCL2 interface idle |
|
||||
| `cpc_cpc_utcl2iu_stall[∗]` | Cycles | CPC UTCL2 interface stalled waiting |
|
||||
| `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy |
|
||||
|:---------------------------------|:-------|:---------------------------------------------------|
|
||||
| `CPC_ME1_BUSY_FOR_PACKET_DECODE` | Cycles | Number of cycles CPC Micro Engine (ME1) is busy decoding packets |
|
||||
| `CPC_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the UTCL1s is stalled waiting on translation |
|
||||
| `CPC_CPC_STAT_BUSY` | Cycles | Number of cycles CPC is busy |
|
||||
| `CPC_CPC_STAT_IDLE` | Cycles | Number of cycles CPC is idle |
|
||||
| `CPC_CPC_STAT_STALL` | Cycles | Number of cycles CPC is stalled |
|
||||
| `CPC_CPC_TCIU_BUSY` | Cycles | Number of cycles CPC TCIU interface is busy |
|
||||
| `CPC_CPC_TCIU_IDLE` | Cycles | Number of cycles CPC TCIU interface is idle |
|
||||
| `CPC_CPC_UTCL2IU_BUSY` | Cycles | Number of cycles CPC UTCL2 interface is busy |
|
||||
| `CPC_CPC_UTCL2IU_IDLE` | Cycles | Number of cycles CPC UTCL2 interface is idle |
|
||||
| `CPC_CPC_UTCL2IU_STALL` | Cycles | Number of cycles CPC UTCL2 interface is stalled |
|
||||
| `CPC_ME1_DC0_SPI_BUSY` | Cycles | Number of cycles CPC ME1 Processor is busy |
|
||||
|
||||
### SPI
|
||||
|
||||
#### SPI counters
|
||||
### Shader Processor Input (SPI) counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :----------------------------| :-----------| -----------------------------------------------------------: |
|
||||
| `spi_csn_busy` | Cycles | Number of clocks with outstanding waves |
|
||||
| `spi_csn_window_valid` | Cycles | Clock count enabled by perfcounter_start event |
|
||||
| `spi_csn_num_threadgroups` | Workgroups | Total number of dispatched workgroups |
|
||||
| `spi_csn_wave` | Wavefronts | Total number of dispatched wavefronts |
|
||||
| `spi_ra_req_no_alloc` | Cycles | Arb cycles with requests but no allocation (need to multiply this value by 4) |
|
||||
|`spi_ra_req_no_alloc_csn` | Cycles | Arb cycles with CSn req and no CSn alloc (need to multiply this value by 4) |
|
||||
| `spi_ra_res_stall_csn` | Cycles | Arb cycles with CSn req and no CSn fits (need to multiply this value by 4) |
|
||||
| `spi_ra_tmp_stall_csn[∗]` | Cycles | Cycles where CSn wants to req but does not fit in temp space |
|
||||
| `spi_ra_wave_simd_full_csn` | SIMD-cycles | Sum of SIMD where WAVE cannot take csn wave when not fits |
|
||||
| `spi_ra_vgpr_simd_full_csn[∗]` | SIMD-cycles | Sum of SIMD where VGPR cannot take csn wave when not fits |
|
||||
| `spi_ra_sgpr_simd_full_csn[∗]` | SIMD-cycles | Sum of SIMD where SGPR cannot take csn wave when not fits |
|
||||
| `spi_ra_lds_cu_full_csn` | CUs | Sum of CU where LDS cannot take csn wave when not fits |
|
||||
| `spi_ra_bar_cu_full_csn[∗]` | CUs | Sum of CU where BARRIER cannot take csn wave when not fits |
|
||||
| `spi_ra_bulky_cu_full_csn[∗]` | CUs | Sum of CU where BULKY cannot take csn wave when not fits |
|
||||
| `spi_ra_tglim_cu_full_csn[∗]` | Cycles | Cycles where csn wants to req but all CUs are at tg_limit |
|
||||
| `spi_ra_wvlim_cu_full_csn[∗]` | Cycles | Number of clocks csn is stalled due to WAVE LIMIT |
|
||||
| `spi_vwc_csc_wr` | Cycles | Number of clocks to write CSC waves to VGPRs (need to multiply this value by 4) |
|
||||
| `spi_swc_csc_wr` | Cycles | Number of clocks to write CSC waves to SGPRs (need to multiply this value by 4) |
|
||||
|:----------------------------|:-----------|:-----------------------------------------------------------|
|
||||
| `SPI_CSN_BUSY` | Cycles | Number of cycles with outstanding waves |
|
||||
| `SPI_CSN_WINDOW_VALID` | Cycles | Number of cycles enabled by `perfcounter_start` event |
|
||||
| `SPI_CSN_NUM_THREADGROUPS` | Workgroups | Number of dispatched workgroups |
|
||||
| `SPI_CSN_WAVE` | Wavefronts | Number of dispatched wavefronts |
|
||||
| `SPI_RA_REQ_NO_ALLOC` | Cycles | Number of Arb cycles with requests but no allocation |
|
||||
| `SPI_RA_REQ_NO_ALLOC_CSN` | Cycles | Number of Arb cycles with Compute Shader, n-th pipe (CSn) requests but no CSn allocation |
|
||||
| `SPI_RA_RES_STALL_CSN` | Cycles | Number of Arb stall cycles due to shortage of CSn pipeline slots |
|
||||
| `SPI_RA_TMP_STALL_CSN*` | Cycles | Number of stall cycles due to shortage of temp space |
|
||||
| `SPI_RA_WAVE_SIMD_FULL_CSN` | SIMD-cycles | Accumulated number of Single Instruction Multiple Data (SIMDs) per cycle affected by shortage of wave slots for CSn wave dispatch |
|
||||
| `SPI_RA_VGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of VGPR slots for CSn wave dispatch |
|
||||
| `SPI_RA_SGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of SGPR slots for CSn wave dispatch |
|
||||
| `SPI_RA_LDS_CU_FULL_CSN` | CUs | Number of Compute Units (CUs) affected by shortage of LDS space for CSn wave dispatch |
|
||||
| `SPI_RA_BAR_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting at a BARRIER |
|
||||
| `SPI_RA_BULKY_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting for BULKY resource |
|
||||
| `SPI_RA_TGLIM_CU_FULL_CSN*` | Cycles | Number of CSn wave stall cycles due to restriction of `tg_limit` for thread group size |
|
||||
| `SPI_RA_WVLIM_STALL_CSN*` | Cycles | Number of cycles CSn is stalled due to WAVE_LIMIT |
|
||||
| `SPI_VWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Vector General Purpose Register (VGPRs) when launching waves |
|
||||
| `SPI_SWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Vector General Purpose Register (SGPRs) when launching waves |
|
||||
|
||||
### Compute unit
|
||||
### Compute Unit (CU) counters
|
||||
|
||||
The compute unit counters are further classified into instruction mix, MFMA operation counters, level counters, wavefront counters, wavefront cycle counters, local data share counters, and others.
|
||||
The CU counters are further classified into instruction mix, Matrix Fused Multiply Add (MFMA) operation counters, level counters, wavefront counters, wavefront cycle counters and Local Data Share (LDS) counters.
|
||||
|
||||
#### Instruction mix
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :-----------------------| :-----:| -----------------------------------------------------------------------: |
|
||||
| `sq_insts` | Instr | Number of instructions issued |
|
||||
| `sq_insts_valu` | Instr | Number of VALU instructions issued, including MFMA |
|
||||
| `sq_insts_valu_add_f16` | Instr | Number of VALU F16 Add instructions issued |
|
||||
| `sq_insts_valu_mul_f16` | Instr | Number of VALU F16 Multiply instructions issued |
|
||||
| `sq_insts_valu_fma_f16` | Instr | Number of VALU F16 FMA instructions issued |
|
||||
| `sq_insts_valu_trans_f16` | Instr | Number of VALU F16 Transcendental instructions issued |
|
||||
| `sq_insts_valu_add_f32` | Instr | Number of VALU F32 Add instructions issued |
|
||||
| `sq_insts_valu_mul_f32` | Instr | Number of VALU F32 Multiply instructions issued |
|
||||
| `sq_insts_valu_fma_f32` | Instr | Number of VALU F32 FMA instructions issued |
|
||||
| `sq_insts_valu_trans_f32` | Instr | Number of VALU F32 Transcendental instructions issued |
|
||||
| `sq_insts_valu_add_f64` | Instr | Number of VALU F64 Add instructions issued |
|
||||
| `sq_insts_valu_mul_f64` | Instr | Number of VALU F64 Multiply instructions issued |
|
||||
| `sq_insts_valu_fma_f64` | Instr | Number of VALU F64 FMA instructions issued |
|
||||
| `sq_insts_valu_trans_f64` | Instr | Number of VALU F64 Transcendental instructions issued |
|
||||
| `sq_insts_valu_int32` | Instr | Number of VALU 32-bit integer instructions issued (signed or unsigned) |
|
||||
| `sq_insts_valu_int64` | Instr | Number of VALU 64-bit integer instructions issued (signed or unsigned) |
|
||||
| `sq_insts_valu_cvt` | Instr | Number of VALU Conversion instructions issued |
|
||||
| `sq_insts_valu_mfma_i8` | Instr | Number of 8-bit Integer MFMA instructions issued |
|
||||
| `sq_insts_valu_mfma_f16` | Instr | Number of F16 MFMA instructions issued |
|
||||
| `sq_insts_valu_mfma_bf16` | Instr | Number of BF16 MFMA instructions issued |
|
||||
| `sq_insts_valu_mfma_f32` | Instr | Number of F32 MFMA instructions issued |
|
||||
| `sq_insts_valu_mfma_f64` | Instr | Number of F64 MFMA instructions issued |
|
||||
| `sq_insts_mfma` | Instr | Number of MFMA instructions issued |
|
||||
| `sq_insts_vmem_wr` | Instr | Number of VMEM write instructions issued |
|
||||
| `sq_insts_vmem_rd` | Instr | Number of VMEM read instructions issued |
|
||||
| `sq_insts_vmem` | Instr | Number of VMEM instructions issued, including both FLAT and buffer instructions |
|
||||
| `sq_insts_salu` | Instr | Number of SALU instructions issued |
|
||||
| `sq_insts_smem` | Instr | Number of SMEM instructions issued |
|
||||
| `sq_insts_smem_norm` | Instr | Number of SMEM instructions issued to normalize to match `smem_level`. Used in measuring SMEM latency |
|
||||
| `sq_insts_flat` | Instr | Number of FLAT instructions issued |
|
||||
| `sq_insts_flat_lds_only` | Instr | Number of FLAT instructions issued that read/write only from/to LDS |
|
||||
| `sq_insts_lds` | Instr | Number of LDS instructions issued |
|
||||
| `sq_insts_gds` | Instr | Number of GDS instructions issued |
|
||||
| `sq_insts_exp_gds` | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued |
|
||||
| `sq_insts_branch` | Instr | Number of Branch instructions issued |
|
||||
| `sq_insts_sendmsg` | Instr | Number of SENDMSG instructions including s_endpgm issued |
|
||||
| `sq_insts_vskipped[∗]` | Instr | Number of VSkipped instructions issued |
|
||||
|:-----------------------|:-----|:-----------------------------------------------------------------------|
|
||||
| `SQ_INSTS` | Instr | Number of instructions issued. |
|
||||
| `SQ_INSTS_VALU` | Instr | Number of Vector Arithmetic Logic Unit (VALU) instructions including MFMA issued. |
|
||||
| `SQ_INSTS_VALU_ADD_F16` | Instr | Number of VALU Half Precision Floating Point (F16) ADD/SUB instructions issued. |
|
||||
| `SQ_INSTS_VALU_MUL_F16` | Instr | Number of VALU F16 Multiply instructions issued. |
|
||||
| `SQ_INSTS_VALU_FMA_F16` | Instr | Number of VALU F16 Fused Multiply Add (FMA)/ Multiply Add (MAD) instructions issued. |
|
||||
| `SQ_INSTS_VALU_TRANS_F16` | Instr | Number of VALU F16 Transcendental instructions issued. |
|
||||
| `SQ_INSTS_VALU_ADD_F32` | Instr | Number of VALU Full Precision Floating Point (F32) ADD/SUB instructions issued. |
|
||||
| `SQ_INSTS_VALU_MUL_F32` | Instr | Number of VALU F32 Multiply instructions issued. |
|
||||
| `SQ_INSTS_VALU_FMA_F32` | Instr | Number of VALU F32 FMA/MAD instructions issued. |
|
||||
| `SQ_INSTS_VALU_TRANS_F32` | Instr | Number of VALU F32 Transcendental instructions issued. |
|
||||
| `SQ_INSTS_VALU_ADD_F64` | Instr | Number of VALU F64 ADD/SUB instructions issued. |
|
||||
| `SQ_INSTS_VALU_MUL_F64` | Instr | Number of VALU F64 Multiply instructions issued. |
|
||||
| `SQ_INSTS_VALU_FMA_F64` | Instr | Number of VALU F64 FMA/MAD instructions issued. |
|
||||
| `SQ_INSTS_VALU_TRANS_F64` | Instr | Number of VALU F64 Transcendental instructions issued. |
|
||||
| `SQ_INSTS_VALU_INT32` | Instr | Number of VALU 32-bit integer instructions (signed or unsigned) issued. |
|
||||
| `SQ_INSTS_VALU_INT64` | Instr | Number of VALU 64-bit integer instructions (signed or unsigned) issued. |
|
||||
| `SQ_INSTS_VALU_CVT` | Instr | Number of VALU Conversion instructions issued. |
|
||||
| `SQ_INSTS_VALU_MFMA_I8` | Instr | Number of 8-bit Integer MFMA instructions issued. |
|
||||
| `SQ_INSTS_VALU_MFMA_F16` | Instr | Number of F16 MFMA instructions issued. |
|
||||
| `SQ_INSTS_VALU_MFMA_BF16` | Instr | Number of Brain Floating Point - 16 (BF16) MFMA instructions issued. |
|
||||
| `SQ_INSTS_VALU_MFMA_F32` | Instr | Number of F32 MFMA instructions issued. |
|
||||
| `SQ_INSTS_VALU_MFMA_F64` | Instr | Number of F64 MFMA instructions issued. |
|
||||
| `SQ_INSTS_MFMA` | Instr | Number of MFMA instructions issued. |
|
||||
| `SQ_INSTS_VMEM_WR` | Instr | Number of Vector Memory (VMEM) Write instructions (including FLAT) issued. |
|
||||
| `SQ_INSTS_VMEM_RD` | Instr | Number of VMEM Read instructions (including FLAT) issued. |
|
||||
| `SQ_INSTS_VMEM` | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions. |
|
||||
| `SQ_INSTS_SALU` | Instr | Number of SALU instructions issued. |
|
||||
| `SQ_INSTS_SMEM` | Instr | Number of Scalar Memory (SMEM) instructions issued. |
|
||||
| `SQ_INSTS_SMEM_NORM` | Instr | Number of SMEM instructions normalized to match `smem_level` issued. |
|
||||
| `SQ_INSTS_FLAT` | Instr | Number of FLAT instructions issued. |
|
||||
| `SQ_INSTS_FLAT_LDS_ONLY` | Instr | Number of FLAT instructions that read/write only from/to LDS issued. Works only if `EARLY_TA_DONE` is enabled. |
|
||||
| `SQ_INSTS_LDS` | Instr | Number of Local Data Share (LDS) instructions issued (including FLAT). |
|
||||
| `SQ_INSTS_GDS` | Instr | Number of Global Data Share (GDS) instructions issued. |
|
||||
| `SQ_INSTS_EXP_GDS` | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued. |
|
||||
| `SQ_INSTS_BRANCH` | Instr | Number of Branch instructions issued. |
|
||||
| `SQ_INSTS_SENDMSG` | Instr | Number of `SENDMSG` instructions including `s_endpgm` issued. |
|
||||
| `SQ_INSTS_VSKIPPED*` | Instr | Number of vector instructions skipped. |
|
||||
|
||||
#### MFMA operation counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :----------------------------| :-----| ----------------------------------------------: |
|
||||
| `sq_insts_valu_mfma_mops_I8` | IOP | Number of 8-bit integer MFMA ops in unit of 512 |
|
||||
| `sq_insts_valu_mfma_mops_F16` | FLOP | Number of F16 floating MFMA ops in unit of 512 |
|
||||
| `sq_insts_valu_mfma_mops_BF16` | FLOP | Number of BF16 floating MFMA ops in unit of 512 |
|
||||
| `sq_insts_valu_mfma_mops_F32` | FLOP | Number of F32 floating MFMA ops in unit of 512 |
|
||||
| `sq_insts_valu_mfma_mops_F64` | FLOP | Number of F64 floating MFMA ops in unit of 512 |
|
||||
|:----------------------------|:-----|:----------------------------------------------|
|
||||
| `SQ_INSTS_VALU_MFMA_MOPS_I8` | IOP | Number of 8-bit integer MFMA ops in the unit of 512 |
|
||||
| `SQ_INSTS_VALU_MFMA_MOPS_F16` | FLOP | Number of F16 floating MFMA ops in the unit of 512 |
|
||||
| `SQ_INSTS_VALU_MFMA_MOPS_BF16` | FLOP | Number of BF16 floating MFMA ops in the unit of 512 |
|
||||
| `SQ_INSTS_VALU_MFMA_MOPS_F32` | FLOP | Number of F32 floating MFMA ops in the unit of 512 |
|
||||
| `SQ_INSTS_VALU_MFMA_MOPS_F64` | FLOP | Number of F64 floating MFMA ops in the unit of 512 |
|
||||
|
||||
#### Level counters
|
||||
|
||||
:::{note}
|
||||
All level counters must be followed by `SQ_ACCUM_PREV_HIRES` counter to measure average latency.
|
||||
:::
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :-------------------| :-----| -------------------------------------: |
|
||||
| `sq_accum_prev` | Count | Accumulated counter sample value where accumulation takes place once every four cycles |
|
||||
| `sq_accum_prev_hires` | Count | Accumulated counter sample value where accumulation takes place once every cycle |
|
||||
| `sq_level_waves` | Waves | Number of inflight waves |
|
||||
| `sq_insts_level_vmem` | Instr | Number of inflight VMEM instructions |
|
||||
| `sq_insts_level_smem` | Instr | Number of inflight SMEM instructions |
|
||||
| `sq_insts_level_lds` | Instr | Number of inflight LDS instructions |
|
||||
| `sq_ifetch_level` | Instr | Number of inflight instruction fetches |
|
||||
|:-------------------|:-----|:-------------------------------------|
|
||||
| `SQ_ACCUM_PREV` | Count | Accumulated counter sample value where accumulation takes place once every four cycles. |
|
||||
| `SQ_ACCUM_PREV_HIRES` | Count | Accumulated counter sample value where accumulation takes place once every cycle. |
|
||||
| `SQ_LEVEL_WAVES` | Waves | Number of inflight waves. To calculate the wave latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_WAVE`. |
|
||||
| `SQ_INST_LEVEL_VMEM` | Instr | Number of inflight VMEM (including FLAT) instructions. To calculate the VMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_VMEM`. |
|
||||
| `SQ_INST_LEVEL_SMEM` | Instr | Number of inflight SMEM instructions. To calculate the SMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_SMEM_NORM`. |
|
||||
| `SQ_INST_LEVEL_LDS` | Instr | Number of inflight LDS (including FLAT) instructions. To calculate the LDS latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_LDS`. |
|
||||
| `SQ_IFETCH_LEVEL` | Instr | Number of inflight instruction fetch requests from the cache. To calculate the instruction fetch latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_IFETCH`. |
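
For example, the average in-flight VMEM latency over the profiled range follows directly from the division rule in the table above (a restatement of that rule, not an additional counter): average VMEM latency in cycles = `SQ_ACCUM_PREV_HIRES` / `SQ_INSTS_VMEM`, with the analogous ratios applying to SMEM, LDS, and instruction fetch.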
|
||||
|
||||
#### Wavefront counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :--------------------| :-----| ----------------------------------------------------------------: |
|
||||
| `sq_waves` | Waves | Number of wavefronts dispatch to SQs, including both new and restored wavefronts |
|
||||
| `sq_waves_saved[∗]` | Waves | Number of context-saved wavefronts |
|
||||
| `sq_waves_restored[∗]` | Waves | Number of context-restored wavefronts |
|
||||
| `sq_waves_eq_64` | Waves | Number of wavefronts with exactly 64 active threads sent to SQs |
|
||||
| `sq_waves_lt_64` | Waves | Number of wavefronts with less than 64 active threads sent to SQs |
|
||||
| `sq_waves_lt_48` | Waves | Number of wavefronts with less than 48 active threads sent to SQs |
|
||||
| `sq_waves_lt_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
|
||||
| `sq_waves_lt_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |
|
||||
|:--------------------|:-----|:----------------------------------------------------------------|
|
||||
| `SQ_WAVES` | Waves | Number of wavefronts dispatched to Sequencers (SQs), including both new and restored wavefronts |
|
||||
| `SQ_WAVES_SAVED*` | Waves | Number of context-saved waves |
|
||||
| `SQ_WAVES_RESTORED*` | Waves | Number of context-restored waves sent to SQs |
|
||||
| `SQ_WAVES_EQ_64` | Waves | Number of wavefronts with exactly 64 active threads sent to SQs |
|
||||
| `SQ_WAVES_LT_64` | Waves | Number of wavefronts with less than 64 active threads sent to SQs |
|
||||
| `SQ_WAVES_LT_48` | Waves | Number of wavefronts with less than 48 active threads sent to SQs |
|
||||
| `SQ_WAVES_LT_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
|
||||
| `SQ_WAVES_LT_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |
|
||||
|
||||
#### Wavefront cycle counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :------------------------| :-------| --------------------------------------------------------------------: |
|
||||
| `sq_cycles` | Cycles | Free-running SQ clocks |
|
||||
| `sq_busy_cycles` | Cycles | Number of cycles while SQ reports it to be busy |
|
||||
| `sq_busy_cu_cycles` | Qcycles | Number of quad cycles each CU is busy |
|
||||
| `sq_valu_mfma_busy_cycles` | Cycles | Number of cycles the MFMA ALU is busy |
|
||||
| `sq_wave_cycles` | Qcycles | Number of quad cycles spent by waves in the CUs |
|
||||
| `sq_wait_any` | Qcycles | Number of quad cycles spent waiting for anything |
|
||||
| `sq_wait_inst_any` | Qcycles | Number of quad cycles spent waiting for an issued instruction |
|
||||
| `sq_active_inst_any` | Qcycles | Number of quad cycles spent by each wave to work on an instruction |
|
||||
| `sq_active_inst_vmem` | Qcycles | Number of quad cycles spent by each wave to work on a non-FLAT VMEM instruction |
|
||||
| `sq_active_inst_lds` | Qcycles | Number of quad cycles spent by each wave to work on an LDS instruction |
|
||||
| `sq_active_inst_valu` | Qcycles | Number of quad cycles spent by each wave to work on a VALU instruction |
|
||||
| `sq_active_inst_sca` | Qcycles | Number of quad cycles spent by each wave to work on an SCA instruction |
|
||||
| `sq_active_inst_exp_gds` | Qcycles | Number of quad cycles spent by each wave to work on EXP or GDS instruction |
|
||||
| `sq_active_inst_misc` | Qcycles | Number of quad cycles spent by each wave to work on an MISC instruction, including branch and sendmsg |
|
||||
| `sq_active_inst_flat` | Qcycles | Number of quad cycles spent by each wave to work on a FLAT instruction |
|
||||
| `sq_inst_cycles_vmem_wr` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM write instructions, including both FLAT and buffer |
|
||||
| `sq_inst_cycles_vmem_rd` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM read instructions, including both FLAT and buffer |
|
||||
| `sq_inst_cycles_smem` | Qcycles | Number of quad cycles spent to execute scalar memory reads |
|
||||
| `sq_inst_cycles_salu` | Cycles | Number of cycles spent to execute non-memory read scalar operations |
|
||||
| `sq_thread_cycles_valu` | Cycles | Number of thread cycles spent to execute VALU operations |
|
||||
|:------------------------|:-------|:--------------------------------------------------------------------|
|
||||
| `SQ_CYCLES` | Cycles | Clock cycles. |
|
||||
| `SQ_BUSY_CYCLES` | Cycles | Number of cycles while SQ reports it to be busy. |
|
||||
| `SQ_BUSY_CU_CYCLES` | Qcycles | Number of quad-cycles each CU is busy. |
|
||||
| `SQ_VALU_MFMA_BUSY_CYCLES` | Cycles | Number of cycles the MFMA ALU is busy. |
|
||||
| `SQ_WAVE_CYCLES` | Qcycles | Number of quad-cycles spent by waves in the CUs. |
|
||||
| `SQ_WAIT_ANY` | Qcycles | Number of quad-cycles spent waiting for anything. |
|
||||
| `SQ_WAIT_INST_ANY` | Qcycles | Number of quad-cycles spent waiting for any instruction to be issued. |
|
||||
| `SQ_ACTIVE_INST_ANY` | Qcycles | Number of quad-cycles spent by each wave to work on an instruction. |
|
||||
| `SQ_ACTIVE_INST_VMEM` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VMEM instruction. |
|
||||
| `SQ_ACTIVE_INST_LDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an LDS instruction. |
|
||||
| `SQ_ACTIVE_INST_VALU` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VALU instruction. |
|
||||
| `SQ_ACTIVE_INST_SCA` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a SALU or SMEM instruction. |
|
||||
| `SQ_ACTIVE_INST_EXP_GDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an EXPORT or GDS instruction. |
|
||||
| `SQ_ACTIVE_INST_MISC` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a BRANCH or `SENDMSG` instruction. |
|
||||
| `SQ_ACTIVE_INST_FLAT` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a FLAT instruction. |
|
||||
| `SQ_INST_CYCLES_VMEM_WR` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions. |
|
||||
| `SQ_INST_CYCLES_VMEM_RD` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions. |
|
||||
| `SQ_INST_CYCLES_SMEM` | Qcycles | Number of quad-cycles spent to execute scalar memory reads. |
|
||||
| `SQ_INST_CYCLES_SALU` | Qcycles | Number of quad-cycles spent to execute non-memory read scalar operations. |
|
||||
| `SQ_THREAD_CYCLES_VALU` | Cycles | Number of thread-cycles spent to execute VALU operations. This is similar to `INST_CYCLES_VALU` but multiplied by the number of active threads. |
|
||||
| `SQ_WAIT_INST_LDS` | Qcycles | Number of quad-cycles spent waiting for LDS instruction to be issued. |
|
||||
|
||||
#### Local data share
|
||||
#### LDS counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :--------------------------| :------| --------------------------------------------------------: |
|
||||
| `sq_lds_atomic_return` | Cycles | Number of atomic return cycles in LDS |
|
||||
| `sq_lds_bank_conflict` | Cycles | Number of cycles LDS is stalled by bank conflicts |
|
||||
| `sq_lds_addr_conflict[∗]` | Cycles | Number of cycles LDS is stalled by address conflicts |
|
||||
| `sq_lds_unaligned_stalls[∗]` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
|
||||
| `sq_lds_mem_violations[∗]` | Count | Number of threads that have a memory violation in the LDS |
|
||||
|:--------------------------|:------|:--------------------------------------------------------|
|
||||
| `SQ_LDS_ATOMIC_RETURN` | Cycles | Number of atomic return cycles in LDS |
|
||||
| `SQ_LDS_BANK_CONFLICT` | Cycles | Number of cycles LDS is stalled by bank conflicts |
|
||||
| `SQ_LDS_ADDR_CONFLICT*` | Cycles | Number of cycles LDS is stalled by address conflicts |
|
||||
| `SQ_LDS_UNALIGNED_STALL*` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
|
||||
| `SQ_LDS_MEM_VIOLATIONS*` | Count | Number of threads that have a memory violation in the LDS |
|
||||
| `SQ_LDS_IDX_ACTIVE` | Cycles | Number of cycles LDS is used for indexed operations |
|
||||
|
||||
#### Miscellaneous
|
||||
#### Miscellaneous counters
|
||||
|
||||
##### Local data share
|
||||
| Hardware Counter | Unit | Definition |
|
||||
|:--------------------------|:------|:--------------------------------------------------------|
|
||||
| `SQ_IFETCH` | Count | Number of instruction fetch requests from `L1I` cache, in 32-byte width |
|
||||
| `SQ_ITEMS` | Threads | Number of valid items per wave |
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :----------------| :-------| --------------------------------------------------------: |
|
||||
| `sq_ifetch` | Count | Number of fetch requests from L1I cache, in 32-byte width |
|
||||
| `sq_items` | Threads | Number of valid threads |
|
||||
|
||||
### L1I and sL1D caches
|
||||
|
||||
#### L1I and sL1D caches
|
||||
### L1I and sL1D cache counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :----------------------------| :------| ----------------------------------------------------------------: |
|
||||
| `sqc_icache_req` | Req | Number of L1I cache requests |
|
||||
| `sqc_icache_hits` | Count | Number of L1I cache lookup-hits |
|
||||
| `sqc_icache_misses` | Count | Number of L1I cache non-duplicate lookup-misses |
|
||||
| `sqc_icache_misses_duplicate` | Count | Number of L1I cache duplicate lookup misses whose previous lookup miss on the same cache line is not fulfilled yet |
|
||||
| `sqc_dcache_req` | Req | Number of sL1D cache requests |
|
||||
| `sqc_dcache_input_valid_readb` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
|
||||
| `sqc_dcache_hits` | Count | Number of sL1D cache lookup-hits |
|
||||
| `sqc_dcache_misses` | Count | Number of sL1D non-duplicate lookup-misses |
|
||||
| `sqc_dcache_misses_duplicate` | Count | Number of sL1D duplicate lookup-misses |
|
||||
| `sqc_dcache_req_read_1` | Req | Number of read requests in a single 32-bit data word, DWORD (DW) |
|
||||
| `sqc_dcache_req_read_2` | Req | Number of read requests in 2 DW |
|
||||
| `sqc_dcache_req_read_4` | Req | Number of read requests in 4 DW |
|
||||
| `sqc_dcache_req_read_8` | Req | Number of read requests in 8 DW |
|
||||
| `sqc_dcache_req_read_16` | Req | Number of read requests in 16 DW |
|
||||
| `sqc_dcache_atomic[∗]` | Req | Number of atomic requests |
|
||||
| `sqc_tc_req` | Req | Number of L2 cache requests that were issued by instruction and constant caches |
|
||||
| `sqc_tc_inst_req` | Req | Number of instruction cache line requests to L2 cache |
|
||||
| `sqc_tc_data_read_req` | Req | Number of data read requests to the L2 cache |
|
||||
| `sqc_tc_data_write_req[∗]` | Req | Number of data write requests to the L2 cache |
|
||||
| `sqc_tc_data_atomic_req[∗]` | Req | Number of data atomic requests to the L2 cache |
|
||||
| `sqc_tc_stall[∗]` | Cycles | Number of cycles while the valid requests to L2 cache are stalled |
|
||||
|:----------------------------|:------|:----------------------------------------------------------------|
|
||||
| `SQC_ICACHE_REQ` | Req | Number of `L1I` cache requests |
|
||||
| `SQC_ICACHE_HITS` | Count | Number of `L1I` cache hits |
|
||||
| `SQC_ICACHE_MISSES` | Count | Number of non-duplicate `L1I` cache misses including uncached requests |
|
||||
| `SQC_ICACHE_MISSES_DUPLICATE` | Count | Number of duplicate `L1I` cache misses whose previous lookup miss on the same cache line is not fulfilled yet |
|
||||
| `SQC_DCACHE_REQ` | Req | Number of `sL1D` cache requests |
|
||||
| `SQC_DCACHE_INPUT_VALID_READYB` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
|
||||
| `SQC_DCACHE_HITS` | Count | Number of `sL1D` cache hits |
|
||||
| `SQC_DCACHE_MISSES` | Count | Number of non-duplicate `sL1D` cache misses including uncached requests |
|
||||
| `SQC_DCACHE_MISSES_DUPLICATE` | Count | Number of duplicate `sL1D` cache misses |
|
||||
| `SQC_DCACHE_REQ_READ_1` | Req | Number of constant cache read requests in a single DW |
|
||||
| `SQC_DCACHE_REQ_READ_2` | Req | Number of constant cache read requests in two DW |
|
||||
| `SQC_DCACHE_REQ_READ_4` | Req | Number of constant cache read requests in four DW |
|
||||
| `SQC_DCACHE_REQ_READ_8` | Req | Number of constant cache read requests in eight DW |
|
||||
| `SQC_DCACHE_REQ_READ_16` | Req | Number of constant cache read requests in 16 DW |
|
||||
| `SQC_DCACHE_ATOMIC*` | Req | Number of atomic requests |
|
||||
| `SQC_TC_REQ` | Req | Number of TC requests that were issued by instruction and constant caches |
|
||||
| `SQC_TC_INST_REQ` | Req | Number of instruction requests to the L2 cache |
|
||||
| `SQC_TC_DATA_READ_REQ` | Req | Number of data read requests to the L2 cache |
|
||||
| `SQC_TC_DATA_WRITE_REQ*` | Req | Number of data write requests to the L2 cache |
|
||||
| `SQC_TC_DATA_ATOMIC_REQ*` | Req | Number of data atomic requests to the L2 cache |
|
||||
| `SQC_TC_STALL*` | Cycles | Number of cycles while the valid requests to the L2 cache are stalled |
|
||||
|
||||
### Vector L1 cache subsystem
|
||||
|
||||
The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data unit, vector L1D cache, and texture cache arbiter.
|
||||
The vector L1 cache subsystem counters are further classified into Texture Addressing Unit (TA), Texture Data Unit (TD), vector L1D cache or Texture Cache per Pipe (TCP), and Texture Cache Arbiter (TCA) counters.
|
||||
|
||||
#### Texture addressing unit
|
||||
|
||||
##### Texture addressing unit counters
|
||||
#### TA counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :--------------------------------| :------| ------------------------------------------------: |
|
||||
| `ta_ta_busy` | Cycles | texture addressing unit busy cycles |
|
||||
| `ta_total_wavefronts` | Instr | Number of wavefront instructions |
|
||||
| `ta_buffer_wavefronts` | Instr | Number of buffer wavefront instructions |
|
||||
| `ta_buffer_read_wavefronts` | Instr | Number of buffer read wavefront instructions |
|
||||
| `ta_buffer_write_wavefronts` | Instr | Number of buffer write wavefront instructions |
|
||||
| `ta_buffer_atomic_wavefronts[∗]` | Instr | Number of buffer atomic wavefront instructions |
|
||||
| `ta_buffer_total_cycles` | Cycles | Number of buffer cycles, including read and write |
|
||||
| `ta_buffer_coalesced_read_cycles` | Cycles | Number of coalesced buffer read cycles |
|
||||
| `ta_buffer_coalesced_write_cycles` | Cycles | Number of coalesced buffer write cycles |
|
||||
| `ta_addr_stalled_by_tc` | Cycles | Number of cycles texture addressing unit address is stalled by TCP |
|
||||
| `ta_data_stalled_by_tc` | Cycles | Number of cycles texture addressing unit data is stalled by TCP |
|
||||
| `ta_addr_stalled_by_td_cycles[∗]` | Cycles | Number of cycles texture addressing unit address is stalled by TD |
|
||||
| `ta_flat_wavefronts` | Instr | Number of flat wavefront instructions |
|
||||
| `ta_flat_read_wavefronts` | Instr | Number of flat read wavefront instructions |
|
||||
| `ta_flat_write_wavefronts` | Instr | Number of flat write wavefront instructions |
|
||||
| `ta_flat_atomic_wavefronts` | Instr | Number of flat atomic wavefront instructions |
|
||||
|:--------------------------------|:------|:------------------------------------------------|
|
||||
| `TA_TA_BUSY[n]` | Cycles | TA busy cycles. Value range for n: [0-15]. |
|
||||
| `TA_TOTAL_WAVEFRONTS[n]` | Instr | Number of wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_WAVEFRONTS[n]` | Instr | Number of buffer wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_READ_WAVEFRONTS[n]` | Instr | Number of buffer read wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_WRITE_WAVEFRONTS[n]` | Instr | Number of buffer write wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_ATOMIC_WAVEFRONTS[n]` | Instr | Number of buffer atomic wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_TOTAL_CYCLES[n]` | Cycles | Number of buffer cycles (including read and write) issued to TC. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_COALESCED_READ_CYCLES[n]` | Cycles | Number of coalesced buffer read cycles issued to TC. Value range for n: [0-15]. |
|
||||
| `TA_BUFFER_COALESCED_WRITE_CYCLES[n]` | Cycles | Number of coalesced buffer write cycles issued to TC. Value range for n: [0-15]. |
|
||||
| `TA_ADDR_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TC. Value range for n: [0-15]. |
|
||||
| `TA_DATA_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA data path is stalled by TC. Value range for n: [0-15]. |
|
||||
| `TA_ADDR_STALLED_BY_TD_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TD. Value range for n: [0-15]. |
|
||||
| `TA_FLAT_WAVEFRONTS[n]` | Instr | Number of flat opcode wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_FLAT_READ_WAVEFRONTS[n]` | Instr | Number of flat opcode read wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_FLAT_WRITE_WAVEFRONTS[n]` | Instr | Number of flat opcode write wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
| `TA_FLAT_ATOMIC_WAVEFRONTS[n]` | Instr | Number of flat opcode atomic wavefronts processed by TA. Value range for n: [0-15]. |
|
||||
|
||||
#### Texture data unit
|
||||
|
||||
##### Texture data unit counters
|
||||
#### TD counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :------------------------| :-----| ---------------------------------------------------: |
|
||||
| `td_td_busy` | Cycle | TD busy cycles |
|
||||
| `td_tc_stall` | Cycle | Number of cycles TD is stalled by TCP |
|
||||
| `td_spi_stall[∗]` | Cycle | Number of cycles TD is stalled by SPI |
|
||||
| `td_load_wavefront` | Instr | Number of wavefront instructions (read/write/atomic) |
|
||||
| `td_store_wavefront` | Instr | Number of write wavefront instructions |
|
||||
| `td_atomic_wavefront` | Instr | Number of atomic wavefront instructions |
|
||||
| `td_coalescable_wavefront` | Instr | Number of coalescable instructions |
|
||||
|:------------------------|:-----|:---------------------------------------------------|
|
||||
| `TD_TD_BUSY[n]` | Cycle | TD busy cycles while it is processing or waiting for data. Value range for n: [0-15]. |
|
||||
| `TD_TC_STALL[n]` | Cycle | Number of cycles TD is stalled waiting for TC data. Value range for n: [0-15]. |
|
||||
| `TD_SPI_STALL[n]` | Cycle | Number of cycles TD is stalled by SPI. Value range for n: [0-15]. |
|
||||
| `TD_LOAD_WAVEFRONT[n]` | Instr |Number of wavefront instructions (read/write/atomic). Value range for n: [0-15]. |
|
||||
| `TD_STORE_WAVEFRONT[n]` | Instr | Number of write wavefront instructions. Value range for n: [0-15].|
|
||||
| `TD_ATOMIC_WAVEFRONT[n]` | Instr | Number of atomic wavefront instructions. Value range for n: [0-15]. |
|
||||
| `TD_COALESCABLE_WAVEFRONT[n]` | Instr | Number of coalescable wavefronts according to TA. Value range for n: [0-15]. |
|
||||
|
||||
#### Vector L1D cache
|
||||
#### TCP counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :-----------------------------------| :------| ----------------------------------------------------------: |
|
||||
| `tcp_gate_en1` | Cycles | Number of cycles/ vL1D interface clocks are turned on |
|
||||
| `tcp_gate_en2` | Cycles | Number of cycles vL1D core clocks are turned on |
|
||||
| `tcp_td_tcp_stall_cycles` | Cycles | Number of cycles TD stalls vL1D |
|
||||
| `tcp_tcr_tcp_stall_cycles` | Cycles | Number of cycles TCR stalls vL1D |
|
||||
| `tcp_read_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a read |
|
||||
| `tcp_write_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a write |
|
||||
| `tcp_atomic_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on an atomic |
|
||||
| `tcp_pending_stall_cycles` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 cache |
|
||||
| `tcp_ta_tcp_state_read` | Req | Number of wavefront instruction requests to vL1D |
|
||||
| `tcp_volatile[∗]` | Req | Number of L1 volatile pixels/buffers from texture addressing unit |
|
||||
| `tcp_total_accesses` | Req | Number of vL1D accesses |
|
||||
| `tcp_total_read` | Req | Number of vL1D read accesses |
|
||||
| `tcp_total_write` | Req | Number of vL1D write accesses |
|
||||
| `tcp_total_atomic_with_ret` | Req | Number of vL1D atomic with return |
|
||||
| `tcp_total_atomic_without_ret` | Req | Number of vL1D atomic without return |
|
||||
| `tcp_total_writeback_invalidates` | Count | Number of vL1D writebacks and Invalidates |
|
||||
| `tcp_utcl1_request` | Req | Number of address translation requests to UTCL1 |
|
||||
| `tcp_utcl1_translation_hit` | Req | Number of UTCL1 translation hits |
|
||||
| `tcp_utcl1_translation_miss` | Req | Number of UTCL1 translation misses |
|
||||
| `tcp_utcl1_persmission_miss` | Req | Number of UTCL1 permission misses |
|
||||
| `tcp_total_cache_accesses` | Req | Number of vL1D cache accesses |
|
||||
| `tcp_tcp_latency` | Cycles | Accumulated wave access latency to vL1D over all wavefronts |
|
||||
| `tcp_tcc_read_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for reads and atomics with return |
|
||||
| `tcp_tcc_write_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for writes and atomics without return |
|
||||
| `tcp_tcc_read_req` | Req | Number of read requests to L2 cache |
|
||||
| `tcp_tcc_write_req` | Req | Number of write requests to L2 cache |
|
||||
| `tcp_tcc_atomic_with_ret_req` | Req | Number of atomic requests to L2 cache with return |
|
||||
| `tcp_tcc_atomic_without_ret_req` | Req | Number of atomic requests to L2 cache without return |
|
||||
| `tcp_tcc_nc_read_req` | Req | Number of NC read requests to L2 cache |
|
||||
| `tcp_tcc_uc_read_req` | Req | Number of UC read requests to L2 cache |
|
||||
| `tcp_tcc_cc_read_req` | Req | Number of CC read requests to L2 cache |
|
||||
| `tcp_tcc_rw_read_req` | Req | Number of RW read requests to L2 cache |
|
||||
| `tcp_tcc_nc_write_req` | Req | Number of NC write requests to L2 cache |
|
||||
| `tcp_tcc_uc_write_req` | Req | Number of UC write requests to L2 cache |
|
||||
| `tcp_tcc_cc_write_req` | Req | Number of CC write requests to L2 cache |
|
||||
| `tcp_tcc_rw_write_req` | Req | Number of RW write requests to L2 cache |
|
||||
| `tcp_tcc_nc_atomic_req` | Req | Number of NC atomic requests to L2 cache |
|
||||
| `tcp_tcc_uc_atomic_req` | Req | Number of UC atomic requests to L2 cache |
|
||||
| `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache |
|
||||
| `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache |
|
||||
|:-----------------------------------|:------|:----------------------------------------------------------|
|
||||
| `TCP_GATE_EN1[n]` | Cycles | Number of cycles vL1D interface clocks are turned on. Value range for n: [0-15]. |
|
||||
| `TCP_GATE_EN2[n]` | Cycles | Number of cycles vL1D core clocks are turned on. Value range for n: [0-15]. |
|
||||
| `TCP_TD_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TD stalls vL1D. Value range for n: [0-15]. |
|
||||
| `TCP_TCR_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TCR stalls vL1D. Value range for n: [0-15]. |
|
||||
| `TCP_READ_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a read. Value range for n: [0-15]. |
|
||||
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a write. Value range for n: [0-15]. |
|
||||
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on an atomic. Value range for n: [0-15]. |
|
||||
| `TCP_PENDING_STALL_CYCLES[n]` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCP_TA_DATA_STALL_CYCLES` | Cycles | Number of cycles TCP stalls TA data interface. |
|
||||
| `TCP_TA_TCP_STATE_READ[n]` | Req | Number of state reads. Value range for n: [0-15]. |
|
||||
| `TCP_VOLATILE[n]` | Req | Number of L1 volatile pixels/buffers from TA. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_ACCESSES[n]` | Req | Number of vL1D accesses. Equals `TCP_PERF_SEL_TOTAL_READ`+`TCP_PERF_SEL_TOTAL_NONREAD`. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_READ[n]` | Req | Number of vL1D read accesses. Equals `TCP_PERF_SEL_TOTAL_HIT_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_READ`. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_WRITE[n]` | Req | Number of vL1D write accesses. `Equals TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE`+ `TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE`. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_ATOMIC_WITH_RET[n]` | Req | Number of vL1D atomic requests with return. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_ATOMIC_WITHOUT_RET[n]` | Req | Number of vL1D atomic without return. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_WRITEBACK_INVALIDATES[n]` | Count | Total number of vL1D writebacks and invalidates. Equals `TCP_PERF_SEL_TOTAL_WBINVL1`+ `TCP_PERF_SEL_TOTAL_WBINVL1_VOL`+ `TCP_PERF_SEL_CP_TCP_INVALIDATE`+ `TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL`. Value range for n: [0-15]. |
|
||||
| `TCP_UTCL1_REQUEST[n]` | Req | Number of address translation requests to UTCL1. Value range for n: [0-15]. |
|
||||
| `TCP_UTCL1_TRANSLATION_HIT[n]` | Req | Number of UTCL1 translation hits. Value range for n: [0-15]. |
|
||||
| `TCP_UTCL1_TRANSLATION_MISS[n]` | Req | Number of UTCL1 translation misses. Value range for n: [0-15]. |
|
||||
| `TCP_UTCL1_PERMISSION_MISS[n]` | Req | Number of UTCL1 permission misses. Value range for n: [0-15]. |
|
||||
| `TCP_TOTAL_CACHE_ACCESSES[n]` | Req | Number of vL1D cache accesses including hits and misses. Value range for n: [0-15]. |
|
||||
| `TCP_TCP_LATENCY[n]` | Cycles | Accumulated wave access latency to vL1D over all wavefronts. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_READ_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_WRITE_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_READ_REQ[n]` | Req | Number of read requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_WRITE_REQ[n]` | Req | Number of write requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_ATOMIC_WITH_RET_REQ[n]` | Req | Number of atomic requests to L2 cache with return. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]` | Req | Number of atomic requests to L2 cache without return. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_NC_READ_REQ[n]` | Req | Number of NC read requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_UC_READ_REQ[n]` | Req | Number of UC read requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_CC_READ_REQ[n]` | Req | Number of CC read requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_RW_READ_REQ[n]` | Req | Number of RW read requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_NC_WRITE_REQ[n]` | Req | Number of NC write requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_UC_WRITE_REQ[n]` | Req | Number of UC write requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_CC_WRITE_REQ[n]` | Req | Number of CC write requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_RW_WRITE_REQ[n]` | Req | Number of RW write requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_NC_ATOMIC_REQ[n]` | Req | Number of NC atomic requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_UC_ATOMIC_REQ[n]` | Req | Number of UC atomic requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_CC_ATOMIC_REQ[n]` | Req | Number of CC atomic requests to L2 cache. Value range for n: [0-15]. |
|
||||
| `TCP_TCC_RW_ATOMIC_REQ[n]` | Req | Number of RW atomic requests to L2 cache. Value range for n: [0-15]. |
|
||||
|
||||
#### TCA
|
||||
#### TCA counters
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :----------------| :------| ------------------------------------------: |
|
||||
| `tca_cycle` | Cycles | TCA cycles |
|
||||
| `tca_busy` | Cycles | Number of cycles TCA has a pending request |
|
||||
|:----------------|:------|:------------------------------------------|
|
||||
| `TCA_CYCLE[n]` | Cycles | Number of TCA cycles. Value range for n: [0-31]. |
|
||||
| `TCA_BUSY[n]` | Cycles | Number of cycles TCA has a pending request. Value range for n: [0-31]. |
|
||||
|
||||
### L2 cache access
|
||||
### L2 cache access counters
|
||||
|
||||
#### L2 cache access counters
|
||||
L2 Cache is also known as Texture Cache per Channel (TCC).
|
||||
|
||||
| Hardware Counter | Unit | Definition |
|
||||
| :--------------------------------| :------| -------------------------------------------------------------: |
|
||||
| `tcc_cycle` |Cycle | L2 cache free-running clocks |
|
||||
| `tcc_busy` |Cycle | L2 cache busy cycles |
|
||||
| `tcc_req` |Req | Number of L2 cache requests |
|
||||
| `tcc_streaming_req[∗]` |Req | Number of L2 cache streaming requests |
|
||||
| `tcc_NC_req` |Req | Number of NC requests |
|
||||
| `tcc_UC_req` |Req | Number of UC requests |
|
||||
| `tcc_CC_req` |Req | Number of CC requests |
|
||||
| `tcc_RW_req` |Req | Number of RW requests |
|
||||
| `tcc_probe` |Req | Number of L2 cache probe requests |
|
||||
| `tcc_probe_all[∗]` |Req | Number of external probe requests with EA_TCC_preq_all== 1 |
|
||||
| `tcc_read_req` |Req | Number of L2 cache read requests |
|
||||
| `tcc_write_req` |Req | Number of L2 cache write requests |
|
||||
| `tcc_atomic_req` |Req | Number of L2 cache atomic requests |
|
||||
| `tcc_hit` |Req | Number of L2 cache lookup-hits |
|
||||
| `tcc_miss` |Req | Number of L2 cache lookup-misses |
|
||||
| `tcc_writeback` |Req | Number of lines written back to main memory, including writebacks of dirty lines and uncached write/atomic requests |
|
||||
| `tcc_ea_wrreq` |Req | Total number of 32-byte and 64-byte write requests to EA |
|
||||
| `tcc_ea_wrreq_64B` |Req | Total number of 64-byte write requests to EA |
|
||||
| `tcc_ea_wr_uncached_32B` |Req | Number of 32-byte write/atomic going over the TC_EA_wrreq interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
|
||||
| `tcc_ea_wrreq_stall` | Cycles | Number of cycles a write request was stalled |
|
||||
| `tcc_ea_wrreq_io_credit_stall[∗]` | Cycles | Number of cycles an EA write request runs out of IO credits |
|
||||
| `tcc_ea_wrreq_gmi_credit_stall[∗]` | Cycles | Number of cycles an EA write request runs out of GMI credits |
|
||||
| `tcc_ea_wrreq_dram_credit_stall` | Cycles | Number of cycles an EA write request runs out of DRAM credits |
|
||||
| `tcc_too_many_ea_wrreqs_stall[∗]` | Cycles | Number of cycles the L2 cache reaches maximum number of pending EA write requests |
|
||||
| `tcc_ea_wrreq_level` | Req | Accumulated number of L2 cache-EA write requests in flight |
|
||||
| `tcc_ea_atomic` | Req | Number of 32-byte and 64-byte atomic requests to EA |
|
||||
| `tcc_ea_atomic_level` | Req | Accumulated number of L2 cache-EA atomic requests in flight |
|
||||
| `tcc_ea_rdreq` | Req | Total number of 32-byte and 64-byte read requests to EA |
|
||||
| `tcc_ea_rdreq_32B` | Req | Total number of 32-byte read requests to EA |
|
||||
| `tcc_ea_rd_uncached_32B` | Req | Number of 32-byte L2 cache-EA read due to uncached traffic. A 64-byte request is counted as 2. |
|
||||
| `tcc_ea_rdreq_io_credit_stall[∗]` | Cycles | Number of cycles read request interface runs out of IO credits |
|
||||
| `tcc_ea_rdreq_gmi_credit_stall[∗]` | Cycles | Number of cycles read request interface runs out of GMI credits |
|
||||
| `tcc_ea_rdreq_dram_credit_stall` | Cycles | Number of cycles read request interface runs out of DRAM credits |
|
||||
| `tcc_ea_rdreq_level` | Req | Accumulated number of L2 cache-EA read requests in flight |
|
||||
| `tcc_ea_rdreq_dram` | Req | Number of 32-byte and 64-byte read requests to HBM |
|
||||
| `tcc_ea_wrreq_dram` | Req | Number of 32-byte and 64-byte write requests to HBM |
|
||||
| `tcc_tag_stall` | Cycles | Number of cycles the normal request pipeline in the tag was stalled for any reason |
|
||||
| `tcc_normal_writeback` | Req | Number of L2 cache normal writeback |
|
||||
| `tcc_all_tc_op_wb_writeback[∗]` | Req | Number of instruction-triggered writeback requests |
|
||||
| `tcc_normal_evict` | Req | Number of L2 cache normal evictions |
|
||||
| `tcc_all_tc_op_inv_evict[∗]` | Req | Number of instruction-triggered eviction requests |
|
||||
|:--------------------------------|:------|:-------------------------------------------------------------|
|
||||
| `TCC_CYCLE[n]` |Cycle | Number of L2 cache free-running clocks. Value range for n: [0-31]. |
|
||||
| `TCC_BUSY[n]` |Cycle | Number of L2 cache busy cycles. Value range for n: [0-31]. |
|
||||
| `TCC_REQ[n]` |Req | Number of L2 cache requests of all types. This is measured at the tag block. This may be more than the number of requests arriving at the TCC, but it is a good indication of the total amount of work that needs to be performed. Value range for n: [0-31]. |
|
||||
| `TCC_STREAMING_REQ[n]` |Req | Number of L2 cache streaming requests. This is measured at the tag block. Value range for n: [0-31]. |
|
||||
| `TCC_NC_REQ[n]` |Req | Number of NC requests. This is measured at the tag block. Value range for n: [0-31]. |
|
||||
| `TCC_UC_REQ[n]` |Req | Number of UC requests. This is measured at the tag block. Value range for n: [0-31]. |
|
||||
| `TCC_CC_REQ[n]` |Req | Number of CC requests. This is measured at the tag block. Value range for n: [0-31]. |
|
||||
| `TCC_RW_REQ[n]` |Req | Number of RW requests. This is measured at the tag block. Value range for n: [0-31]. |
|
||||
| `TCC_PROBE[n]` |Req | Number of probe requests. Value range for n: [0-31]. |
|
||||
| `TCC_PROBE_ALL[n]` |Req | Number of external probe requests with `EA_TCC_preq_all`== 1. Value range for n: [0-31]. |
|
||||
| `TCC_READ[n]` |Req | Number of L2 cache read requests. This includes compressed reads but not metadata reads. Value range for n: [0-31]. |
|
||||
| `TCC_WRITE[n]` |Req | Number of L2 cache write requests. Value range for n: [0-31]. |
|
||||
| `TCC_ATOMIC[n]` |Req | Number of L2 cache atomic requests of all types. Value range for n: [0-31]. |
|
||||
| `TCC_HIT[n]` |Req | Number of L2 cache hits. Value range for n: [0-31]. |
|
||||
| `TCC_MISS[n]` |Req | Number of L2 cache misses. Value range for n: [0-31]. |
|
||||
| `TCC_WRITEBACK[n]` |Req | Number of lines written back to the main memory, including writebacks of dirty lines and uncached write/atomic requests. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ[n]` |Req | Number of 32-byte and 64-byte transactions going over the `TC_EA_wrreq` interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_64B[n]` |Req | Total number of 64-byte transactions (write or `CMPSWAP`) going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WR_UNCACHED_32B[n]` |Req | Number of 32-byte write/atomic going over the `TC_EA_wrreq` interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. Value range for n: [0-31].|
|
||||
| `TCC_EA_WRREQ_STALL[n]` | Cycles | Number of cycles a write request is stalled. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of IO credits. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of GMI credits. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of DRAM credits. Value range for n: [0-31]. |
|
||||
| `TCC_TOO_MANY_EA_WRREQS_STALL[n]` | Cycles | Number of cycles the L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests. Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_LEVEL[n]` | Req | The accumulated number of EA write requests in flight. This is primarily intended to measure average EA write latency. Average write latency = `TCC_PERF_SEL_EA_WRREQ_LEVEL`/`TCC_PERF_SEL_EA_WRREQ`. Value range for n: [0-31]. |
|
||||
| `TCC_EA_ATOMIC[n]` | Req | Number of 32-byte or 64-byte atomic requests going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
|
||||
| `TCC_EA_ATOMIC_LEVEL[n]` | Req | The accumulated number of EA atomic requests in flight. This is primarily intended to measure average EA atomic latency. Average atomic latency = `TCC_PERF_SEL_EA_WRREQ_ATOMIC_LEVEL`/`TCC_PERF_SEL_EA_WRREQ_ATOMIC`. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ[n]` | Req | Number of 32-byte or 64-byte read requests to EA. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_32B[n]` | Req | Number of 32-byte read requests to EA. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RD_UNCACHED_32B[n]` | Req | Number of 32-byte EA reads due to uncached traffic. A 64-byte request is counted as 2. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of IO credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of GMI credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of DRAM credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_LEVEL[n]` | Req | The accumulated number of EA read requests in flight. This is primarily intended to measure average EA read latency. Average read latency = `TCC_PERF_SEL_EA_RDREQ_LEVEL`/`TCC_PERF_SEL_EA_RDREQ`. Value range for n: [0-31]. |
|
||||
| `TCC_EA_RDREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA read requests to High Bandwidth Memory (HBM). Value range for n: [0-31]. |
|
||||
| `TCC_EA_WRREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA write requests to HBM. Value range for n: [0-31]. |
|
||||
| `TCC_TAG_STALL[n]` | Cycles | Number of cycles the normal request pipeline in the tag is stalled for any reason. Normally, stalls of this nature are measured exactly at one point in the pipeline however in case of this counter, probes can stall the pipeline at a variety of places and there is no single point that can reasonably measure the total stalls accurately. Value range for n: [0-31]. |
|
||||
| `TCC_NORMAL_WRITEBACK[n]` | Req | Number of writebacks due to requests that are not writeback requests. Value range for n: [0-31]. |
|
||||
| `TCC_ALL_TC_OP_WB_WRITEBACK[n]` | Req | Number of writebacks due to all `TC_OP` writeback requests. Value range for n: [0-31]. |
|
||||
| `TCC_NORMAL_EVICT[n]` | Req | Number of evictions due to requests that are not invalidate or probe requests. Value range for n: [0-31]. |
|
||||
| `TCC_ALL_TC_OP_INV_EVICT[n]` | Req | Number of evictions due to all `TC_OP` invalidate requests. Value range for n: [0-31]. |
|
||||
|
||||
## MI200 derived metrics list
|
||||
|
||||
### Derived metrics on MI200 GPUs
|
||||
|
||||
| Derived Metric | Description |
|
||||
| :----------------| -------------------------------------------------------------------------------------: |
|
||||
| `VFetchInsts` | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory |
|
||||
| `VWriteInsts` | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory |
|
||||
| `FlatVMemInsts` | The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch |
|
||||
| `LDSInsts` | The average number of LDS read/write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS |
|
||||
| `FlatLDSInsts` | The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control) |
|
||||
| `VALUUtilization` | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence) |
|
||||
| `VALUBusy` | The percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal) |
|
||||
| `SALUBusy` | The percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal) |
|
||||
| `MemWrites32B` | The total number of effective 32B write transactions to the memory |
|
||||
| `L2CacheHit` | The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal) |
|
||||
| `MemUnitStalled` | The percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad) |
|
||||
| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) |
|
||||
| `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |
|
||||
|:----------------|:-------------------------------------------------------------------------------------|
|
||||
| `ALUStalledByLDS` | Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready. Reduce this by reducing the LDS bank conflicts or the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
|
||||
| `FetchSize` | Total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
|
||||
| `FlatLDSInsts` | Average number of FLAT instructions that read from or write to LDS, executed per work item (affected by flow control). |
|
||||
| `FlatVMemInsts` | Average number of FLAT instructions that read from or write to the video memory, executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
|
||||
| `GDSInsts` | Average number of GDS read/write instructions executed per work item (affected by flow control). |
|
||||
| `GPUBusy` | Percentage of time GPU is busy. |
|
||||
| `L2CacheHit` | Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
|
||||
| `LDSBankConflict` | Percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
|
||||
| `LDSInsts` | Average number of LDS read/write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
|
||||
| `MemUnitBusy` | Percentage of GPU time the memory unit is active. The result includes the stall time (`MemUnitStalled`). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
|
||||
| `MemUnitStalled` | Percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
|
||||
| `MemWrites32B` | Total number of effective 32B write transactions to the memory. |
|
||||
| `SALUBusy` | Percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
|
||||
| `SALUInsts` | Average number of scalar ALU instructions executed per work item (affected by flow control). |
|
||||
| `SFetchInsts` | Average number of scalar fetch instructions from the video memory executed per work item (affected by flow control). |
|
||||
| `TA_ADDR_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA address path is stalled by TC, over all TA instances. |
|
||||
| `TA_ADDR_STALLED_BY_TD_CYCLES_sum` | Total number of cycles TA address path is stalled by TD, over all TA instances. |
|
||||
| `TA_BUFFER_WAVEFRONTS_sum` | Total number of buffer wavefronts processed by all TA instances. |
|
||||
| `TA_BUFFER_READ_WAVEFRONTS_sum` | Total number of buffer read wavefronts processed by all TA instances. |
|
||||
| `TA_BUFFER_WRITE_WAVEFRONTS_sum` | Total number of buffer write wavefronts processed by all TA instances. |
|
||||
| `TA_BUFFER_ATOMIC_WAVEFRONTS_sum` | Total number of buffer atomic wavefronts processed by all TA instances. |
|
||||
| `TA_BUFFER_TOTAL_CYCLES_sum` | Total number of buffer cycles (including read and write) issued to TC by all TA instances. |
|
||||
| `TA_BUFFER_COALESCED_READ_CYCLES_sum` | Total number of coalesced buffer read cycles issued to TC by all TA instances. |
|
||||
| `TA_BUFFER_COALESCED_WRITE_CYCLES_sum` | Total number of coalesced buffer write cycles issued to TC by all TA instances. |
|
||||
| `TA_BUSY_avr` | Average number of busy cycles over all TA instances. |
|
||||
| `TA_BUSY_max` | Maximum number of TA busy cycles over all TA instances. |
|
||||
| `TA_BUSY_min` | Minimum number of TA busy cycles over all TA instances. |
|
||||
| `TA_DATA_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA data path is stalled by TC, over all TA instances. |
|
||||
| `TA_FLAT_READ_WAVEFRONTS_sum` | Sum of flat opcode reads processed by all TA instances. |
|
||||
| `TA_FLAT_WRITE_WAVEFRONTS_sum` | Sum of flat opcode writes processed by all TA instances. |
|
||||
| `TA_FLAT_WAVEFRONTS_sum` | Total number of flat opcode wavefronts processed by all TA instances. |
|
||||
| `TA_FLAT_READ_WAVEFRONTS_sum` | Total number of flat opcode read wavefronts processed by all TA instances. |
|
||||
| `TA_FLAT_ATOMIC_WAVEFRONTS_sum` | Total number of flat opcode atomic wavefronts processed by all TA instances. |
|
||||
| `TA_TA_BUSY_sum` | Total number of TA busy cycles over all TA instances. |
|
||||
| `TA_TOTAL_WAVEFRONTS_sum` | Total number of wavefronts processed by all TA instances. |
|
||||
| `TCA_BUSY_sum` | Total number of cycles TCA has a pending request, over all TCA instances. |
|
||||
| `TCA_CYCLE_sum` | Total number of cycles over all TCA instances. |
|
||||
| `TCC_ALL_TC_OP_WB_WRITEBACK_sum` | Total number of writebacks due to all TC_OP writeback requests, over all TCC instances. |
|
||||
| `TCC_ALL_TC_OP_INV_EVICT_sum` | Total number of evictions due to all TC_OP invalidate requests, over all TCC instances. |
|
||||
| `TCC_ATOMIC_sum` | Total number of L2 cache atomic requests of all types, over all TCC instances. |
|
||||
| `TCC_BUSY_avr` | Average number of L2 cache busy cycles, over all TCC instances. |
|
||||
| `TCC_BUSY_sum` | Total number of L2 cache busy cycles, over all TCC instances. |
|
||||
| `TCC_CC_REQ_sum` | Total number of CC requests over all TCC instances. |
|
||||
| `TCC_CYCLE_sum` | Total number of L2 cache free running clocks, over all TCC instances. |
|
||||
| `TCC_EA_WRREQ_sum` | Total number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface, over all TCC instances. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. |
|
||||
| `TCC_EA_WRREQ_64B_sum` | Total number of 64-byte transactions (write or `CMPSWAP`) going over the TC_EA_wrreq interface, over all TCC instances. |
|
||||
| `TCC_EA_WR_UNCACHED_32B_sum` | Total Number of 32-byte write/atomic going over the TC_EA_wrreq interface due to uncached traffic, over all TCC instances. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
|
||||
| `TCC_EA_WRREQ_STALL_sum` | Total Number of cycles a write request is stalled, over all instances. |
|
||||
| `TCC_EA_WRREQ_IO_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of IO credits, over all instances. |
|
||||
| `TCC_EA_WRREQ_GMI_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of GMI credits, over all instances. |
|
||||
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of DRAM credits, over all instances. |
|
||||
| `TCC_EA_WRREQ_LEVEL_sum` | Total number of EA write requests in flight over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_LEVEL_sum` | Total number of EA read requests in flight over all TCC instances. |
|
||||
| `TCC_EA_ATOMIC_sum` | Total Number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface, over all TCC instances. |
|
||||
| `TCC_EA_ATOMIC_LEVEL_sum` | Total number of EA atomic requests in flight, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_sum` | Total number of 32-byte or 64-byte read requests to EA, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_32B_sum` | Total number of 32-byte read requests to EA, over all TCC instances. |
|
||||
| `TCC_EA_RD_UNCACHED_32B_sum` | Total number of 32-byte EA reads due to uncached traffic, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_IO_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of IO credits, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_GMI_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of GMI credits, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of DRAM credits, over all TCC instances. |
|
||||
| `TCC_EA_RDREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA read requests to HBM, over all TCC instances. |
|
||||
| `TCC_EA_WRREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA write requests to HBM, over all TCC instances. |
|
||||
| `TCC_HIT_sum` | Total number of L2 cache hits over all TCC instances. |
|
||||
| `TCC_MISS_sum` | Total number of L2 cache misses over all TCC instances. |
|
||||
| `TCC_NC_REQ_sum` | Total number of NC requests over all TCC instances. |
|
||||
| `TCC_NORMAL_WRITEBACK_sum` | Total number of writebacks due to requests that are not writeback requests, over all TCC instances. |
|
||||
| `TCC_NORMAL_EVICT_sum` | Total number of evictions due to requests that are not invalidate or probe requests, over all TCC instances. |
|
||||
| `TCC_PROBE_sum` | Total number of probe requests over all TCC instances. |
|
||||
| `TCC_PROBE_ALL_sum` | Total number of external probe requests with EA_TCC_preq_all== 1, over all TCC instances. |
|
||||
| `TCC_READ_sum` | Total number of L2 cache read requests (including compressed reads but not metadata reads) over all TCC instances. |
|
||||
| `TCC_REQ_sum` | Total number of all types of L2 cache requests over all TCC instances. |
|
||||
| `TCC_RW_REQ_sum` | Total number of RW requests over all TCC instances. |
|
||||
| `TCC_STREAMING_REQ_sum` | Total number of L2 cache streaming requests over all TCC instances. |
|
||||
| `TCC_TAG_STALL_sum` | Total number of cycles the normal request pipeline in the tag is stalled for any reason, over all TCC instances. |
|
||||
| `TCC_TOO_MANY_EA_WRREQS_STALL_sum` | Total number of cycles L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests, over all TCC instances. |
|
||||
| `TCC_UC_REQ_sum` | Total number of UC requests over all TCC instances. |
|
||||
| `TCC_WRITE_sum` | Total number of L2 cache write requests over all TCC instances. |
|
||||
| `TCC_WRITEBACK_sum` | Total number of lines written back to the main memory including writebacks of dirty lines and uncached write/atomic requests, over all TCC instances. |
|
||||
| `TCC_WRREQ_STALL_max` | Maximum number of cycles a write request is stalled, over all TCC instances. |
|
||||
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on an atomic, over all TCP instances. |
|
||||
| `TCP_GATE_EN1_sum` | Total number of cycles vL1D interface clocks are turned on, over all TCP instances. |
|
||||
| `TCP_GATE_EN2_sum` | Total number of cycles vL1D core clocks are turned on, over all TCP instances. |
|
||||
| `TCP_PENDING_STALL_CYCLES_sum` | Total number of cycles vL1D cache is stalled due to data pending from L2 Cache, over all TCP instances. |
|
||||
| `TCP_READ_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a read, over all TCP instances. |
|
||||
| `TCP_TA_TCP_STATE_READ_sum` | Total number of state reads by all TCP instances. |
|
||||
| `TCP_TCC_ATOMIC_WITH_RET_REQ_sum` | Total number of atomic requests to L2 cache with return, over all TCP instances. |
|
||||
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum` | Total number of atomic requests to L2 cache without return, over all TCP instances. |
|
||||
| `TCP_TCC_CC_READ_REQ_sum` | Total number of CC read requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_CC_WRITE_REQ_sum` | Total number of CC write requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_CC_ATOMIC_REQ_sum` | Total number of CC atomic requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_NC_READ_REQ_sum` | Total number of NC read requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_NC_WRITE_REQ_sum` | Total number of NC write requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_NC_ATOMIC_REQ_sum` | Total number of NC atomic requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_READ_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return for all TCP instances. |
|
||||
| `TCP_TCC_READ_REQ_sum` | Total number of read requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_RW_READ_REQ_sum` | Total number of RW read requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_RW_WRITE_REQ_sum` | Total number of RW write requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_RW_ATOMIC_REQ_sum` | Total number of RW atomic requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_UC_READ_REQ_sum` | Total number of UC read requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_UC_WRITE_REQ_sum` | Total number of UC write requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_UC_ATOMIC_REQ_sum` | Total number of UC atomic requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCC_WRITE_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return for all TCP instances. |
|
||||
| `TCP_TCC_WRITE_REQ_sum` | Total number of write requests to L2 cache, over all TCP instances. |
|
||||
| `TCP_TCP_LATENCY_sum` | Total wave access latency to vL1D over all wavefronts for all TCP instances. |
|
||||
| `TCP_TCR_TCP_STALL_CYCLES_sum` | Total number of cycles TCR stalls vL1D, over all TCP instances. |
|
||||
| `TCP_TD_TCP_STALL_CYCLES_sum` | Total number of cycles TD stalls vL1D, over all TCP instances. |
|
||||
| `TCP_TOTAL_ACCESSES_sum` | Total number of vL1D accesses, over all TCP instances. |
|
||||
| `TCP_TOTAL_READ_sum` | Total number of vL1D read accesses, over all TCP instances. |
|
||||
| `TCP_TOTAL_WRITE_sum` | Total number of vL1D write accesses, over all TCP instances. |
|
||||
| `TCP_TOTAL_ATOMIC_WITH_RET_sum` | Total number of vL1D atomic requests with return, over all TCP instances. |
|
||||
| `TCP_TOTAL_ATOMIC_WITHOUT_RET_sum` | Total number of vL1D atomic requests without return, over all TCP instances. |
|
||||
| `TCP_TOTAL_CACHE_ACCESSES_sum` | Total number of vL1D cache accesses (including hits and misses) by all TCP instances. |
|
||||
| `TCP_TOTAL_WRITEBACK_INVALIDATES_sum` | Total number of vL1D writebacks and invalidates, over all TCP instances. |
|
||||
| `TCP_UTCL1_PERMISSION_MISS_sum` | Total number of UTCL1 permission misses by all TCP instances. |
|
||||
| `TCP_UTCL1_REQUEST_sum` | Total number of address translation requests to UTCL1 by all TCP instances. |
|
||||
| `TCP_UTCL1_TRANSLATION_MISS_sum` | Total number of UTCL1 translation misses by all TCP instances. |
|
||||
| `TCP_UTCL1_TRANSLATION_HIT_sum` | Total number of UTCL1 translation hits by all TCP instances. |
|
||||
| `TCP_VOLATILE_sum` | Total number of L1 volatile pixels/buffers from TA, over all TCP instances. |
|
||||
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a write, over all TCP instances. |
|
||||
| `TD_ATOMIC_WAVEFRONT_sum` | Total number of atomic wavefront instructions, over all TD instances. |
|
||||
| `TD_COALESCABLE_WAVEFRONT_sum` | Total number of coalescable wavefronts according to TA, over all TD instances. |
|
||||
| `TD_LOAD_WAVEFRONT_sum` | Total number of wavefront instructions (read/write/atomic), over all TD instances. |
|
||||
| `TD_SPI_STALL_sum` | Total number of cycles TD is stalled by SPI, over all TD instances. |
|
||||
| `TD_STORE_WAVEFRONT_sum` | Total number of write wavefront instructions, over all TD instances. |
|
||||
| `TD_TC_STALL_sum` | Total number of cycles TD is stalled waiting for TC data, over all TD instances. |
|
||||
| `TD_TD_BUSY_sum` | Total number of TD busy cycles while it is processing or waiting for data, over all TD instances. |
|
||||
| `VALUBusy` | Percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
|
||||
| `VALUInsts` | Average number of vector ALU instructions executed per work item (affected by flow control). |
|
||||
| `VALUUtilization` | Percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
|
||||
| `VFetchInsts` | Average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
|
||||
| `VWriteInsts` | Average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
|
||||
| `Wavefronts` | Total wavefronts. |
|
||||
| `WRITE_REQ_32B` | Total number of 32-byte effective memory writes. |
|
||||
| `WriteSize` | Total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
|
||||
| `WriteUnitStalled` | Percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad). |
|
||||
|
||||
## MI200 acronyms
|
||||
## Abbreviations
|
||||
|
||||
| Abbreviation | Meaning |
|
||||
| :------------| --------------------------------------------------------------------------------: |
|
||||
| `ALU` | Arithmetic logic unit |
|
||||
| `Arb` | Arbiter |
|
||||
| `BF16` | Brain floating point – 16 |
|
||||
| `CC` | Coherently cached |
|
||||
| `CP` | Command processor |
|
||||
| `CPC` | Command processor – compute |
|
||||
| `CPF` | Command processor – fetcher |
|
||||
| `CS` | Compute shader |
|
||||
| `CSC` | Compute shader controller |
|
||||
| `CSn` | Compute Shader, the n-th pipe |
|
||||
| `CU` | Compute unit |
|
||||
| `DW` | 32-bit data word, DWORD |
|
||||
| `EA` | Efficiency arbiter |
|
||||
| `F16` | Half-precision floating point |
|
||||
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>• Global Memory<br>• Scratch (“private”)<br>• LDS (“shared”)<br>• Invalid – MEM_VIOL TrapStatus |
|
||||
| `FMA` | Fused multiply-add |
|
||||
| `GDS` | Global data share |
|
||||
| `GRBM` | Graphics register bus manager |
|
||||
| `HBM` | High bandwidth memory |
|
||||
| `Instr` | Instructions |
|
||||
| `IOP` | Integer operation |
|
||||
| `L2` | Level-2 cache |
|
||||
| `LDS` | Local data share |
|
||||
| `ME1` | Micro-engine, running packet processing firmware on CPC |
|
||||
| `MFMA` | Matrix fused multiply-add |
|
||||
| `NC` | Noncoherently cached |
|
||||
| `RW` | Coherently cached with write |
|
||||
| `SALU` | Scalar ALU |
|
||||
| `SGPR` | Scalar GPR |
|
||||
| `SIMD` | Single instruction multiple data |
|
||||
| `sL1D` | Scalar Level-1 data cache |
|
||||
| `SMEM` | Scalar memory |
|
||||
| `SPI` | Shader processor input |
|
||||
| `SQ` | Sequencer |
|
||||
| `TA` | Texture addressing unit |
|
||||
| `TC` | Texture cache |
|
||||
| `TCA` | Texture cache arbiter |
|
||||
| `TCC` | Texture cache per channel, known as L2 cache |
|
||||
| `TCIU` | Texture cache interface unit, command processor’s interface to memory system |
|
||||
| `TCP` | Texture cache per pipe, known as vector L1 cache |
|
||||
| `TCR` | Texture cache router |
|
||||
| `TD` | Texture data unit |
|
||||
| `UC` | Uncached |
|
||||
| `UTCL1` | Unified translation cache – level 1 |
|
||||
| `UTCL2` | Unified translation cache – level 2 |
|
||||
| `VALU` | Vector ALU |
|
||||
| `VGPR` | Vector GPR |
|
||||
| `vL1D` | Vector level 1 data cache |
|
||||
| `VMEM` | Vector memory |
|
||||
|:------------|:--------------------------------------------------------------------------------|
|
||||
| `ALU` | Arithmetic Logic Unit |
|
||||
| `Arb` | Arbiter |
|
||||
| `BF16` | Brain Floating Point - 16 bits |
|
||||
| `CC` | Coherently Cached |
|
||||
| `CP` | Command Processor |
|
||||
| `CPC` | Command Processor - Compute |
|
||||
| `CPF` | Command Processor - Fetcher |
|
||||
| `CS` | Compute Shader |
|
||||
| `CSC` | Compute Shader Controller |
|
||||
| `CSn` | Compute Shader, the n-th pipe |
|
||||
| `CU` | Compute Unit |
|
||||
| `DW` | 32-bit Data Word, DWORD |
|
||||
| `EA` | Efficiency Arbiter |
|
||||
| `F16` | Half Precision Floating Point |
|
||||
| `F32` | Full Precision Floating Point |
|
||||
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>. Global Memory<br>. Scratch ("private")<br>. LDS ("shared")<br>. Invalid - MEM_VIOL TrapStatus |
|
||||
| `FMA` | Fused Multiply Add |
|
||||
| `GDS` | Global Data Share |
|
||||
| `GRBM` | Graphics Register Bus Manager |
|
||||
| `HBM` | High Bandwidth Memory |
|
||||
| `Instr` | Instructions |
|
||||
| `IOP` | Integer Operation |
|
||||
| `L2` | Level-2 Cache |
|
||||
| `LDS` | Local Data Share |
|
||||
| `ME1` | Micro Engine, running packet processing firmware on CPC |
|
||||
| `MFMA` | Matrix Fused Multiply Add |
|
||||
| `NC` | Noncoherently Cached |
|
||||
| `RW` | Coherently Cached with Write |
|
||||
| `SALU` | Scalar ALU |
|
||||
| `SGPR` | Scalar General Purpose Register |
|
||||
| `SIMD` | Single Instruction Multiple Data |
|
||||
| `sL1D` | Scalar Level-1 Data Cache |
|
||||
| `SMEM` | Scalar Memory |
|
||||
| `SPI` | Shader Processor Input |
|
||||
| `SQ` | Sequencer |
|
||||
| `TA` | Texture Addressing Unit |
|
||||
| `TC` | Texture Cache |
|
||||
| `TCA` | Texture Cache Arbiter |
|
||||
| `TCC` | Texture Cache per Channel, known as L2 Cache |
|
||||
| `TCIU` | Texture Cache Interface Unit (interface between CP and the memory system) |
|
||||
| `TCP` | Texture Cache per Pipe, known as vector L1 Cache |
|
||||
| `TCR` | Texture Cache Router |
|
||||
| `TD` | Texture Data Unit |
|
||||
| `UC` | Uncached |
|
||||
| `UTCL1` | Unified Translation Cache - Level 1 |
|
||||
| `UTCL2` | Unified Translation Cache - Level 2 |
|
||||
| `VALU` | Vector ALU |
|
||||
| `VGPR` | Vector General Purpose Register |
|
||||
| `vL1D` | Vector Level -1 Data Cache |
|
||||
| `VMEM` | Vector Memory |
|
||||
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="AMD Instinct MI250 microarchitecture">
|
||||
<meta name="keywords" content="Instinct, MI250, microarchitecture, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# AMD Instinct™ MI250 microarchitecture
|
||||
|
||||
The microarchitecture of the AMD Instinct MI250 accelerators is based on the
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="GPU isolation techniques">
|
||||
<meta name="keywords" content="GPU isolation techniques, UUID, universally unique identifier,
|
||||
environment variables, virtual machines, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# GPU isolation techniques
|
||||
|
||||
Restricting the access of applications to a subset of GPUs, aka isolating
|
||||
@@ -22,7 +29,7 @@ A list of device indices or {abbr}`UUID (universally unique identifier)`s
|
||||
that will be exposed to applications.
|
||||
|
||||
Runtime
|
||||
: ROCm Platform Runtime. Applies to all applications using the user mode ROCm
|
||||
: ROCm Software Runtime. Applies to all applications using the user mode ROCm
|
||||
software stack.
|
||||
|
||||
```{code-block} shell
|
||||
|
||||
@@ -1,9 +1,16 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="GPU memory">
|
||||
<meta name="keywords" content="GPU memory, VRAM, video random access memory, pageable
|
||||
memory, pinned memory, managed memory, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# GPU memory
|
||||
|
||||
For the HIP reference documentation, see:
|
||||
|
||||
* {doc}`hip:.doxygen/docBin/html/group___memory`
|
||||
* {doc}`hip:.doxygen/docBin/html/group___memory_m`
|
||||
* {doc}`hip:doxygen/html/group___memory`
|
||||
* {doc}`hip:doxygen/html/group___memory_m`
|
||||
|
||||
Host memory exists on the host (e.g. CPU) of the machine in random access memory (RAM).
|
||||
|
||||
@@ -170,8 +177,8 @@ Fine-grained memory implies that up-to-date data may be made visible to others r
|
||||
|
||||
| API | Flag | Coherence |
|
||||
|-------------------------|------------------------------|----------------|
|
||||
| `hipExtMallocWithFlags` | `hipHostMallocDefault` | Fine-grained |
|
||||
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Coarse-grained |
|
||||
| `hipExtMallocWithFlags` | `hipDeviceMallocDefault` | Coarse-grained |
|
||||
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Fine-grained |
|
||||
|
||||
| API | `hipMemAdvise` argument | Coherence |
|
||||
|-------------------------|------------------------------|----------------|
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Using the LLVM ASan on a GPU">
|
||||
<meta name="keywords" content="LLVM, ASan, address sanitizer, AddressSanitizer, instrumented
|
||||
libraries, instrumented applications, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Using the LLVM ASan on a GPU (beta release)
|
||||
|
||||
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
|
||||
@@ -7,7 +14,9 @@ Until now, the LLVM ASan process was only available for traditional purely CPU a
|
||||
This document provides documentation on using ROCm ASan.
|
||||
For information about LLVM ASan, see the [LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
|
||||
|
||||
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
|
||||
:::{note}
|
||||
The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
|
||||
:::
|
||||
|
||||
## Compiling for ASan
|
||||
|
||||
|
||||
30
docs/conf.py
30
docs/conf.py
@@ -8,15 +8,11 @@ import shutil
|
||||
import jinja2
|
||||
import os
|
||||
|
||||
from rocm_docs import ROCmDocs
|
||||
|
||||
# Environement to process Jinja templates.
|
||||
# Environment to process Jinja templates.
|
||||
jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader("."))
|
||||
|
||||
# Jinja templates to render out.
|
||||
templates = [
|
||||
|
||||
]
|
||||
templates = []
|
||||
|
||||
# Render templates and output files without the last extension.
|
||||
# For example: 'install.md.jinja' becomes 'install.md'.
|
||||
@@ -42,9 +38,9 @@ latex_elements = {
|
||||
# configurations for PDF output by Read the Docs
|
||||
project = "ROCm Documentation"
|
||||
author = "Advanced Micro Devices, Inc."
|
||||
copyright = "Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved."
|
||||
version = "5.7.1"
|
||||
release = "5.7.1"
|
||||
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
|
||||
version = "6.0.1"
|
||||
release = "6.0.1"
|
||||
setting_all_article_info = True
|
||||
all_article_info_os = ["linux", "windows"]
|
||||
all_article_info_author = ""
|
||||
@@ -54,7 +50,7 @@ article_pages = [
|
||||
{
|
||||
"file":"release",
|
||||
"os":["linux", "windows"],
|
||||
"date":"2023-07-27"
|
||||
"date":"2024-01-09"
|
||||
},
|
||||
|
||||
{"file":"install/windows/install-quick", "os":["windows"]},
|
||||
@@ -74,9 +70,6 @@ article_pages = [
|
||||
{"file":"install/windows/cli/index", "os":["windows"]},
|
||||
{"file":"install/windows/gui/index", "os":["windows"]},
|
||||
|
||||
{"file":"about/compatibility/linux-support", "os":["linux"]},
|
||||
{"file":"about/compatibility/windows-support", "os":["windows"]},
|
||||
|
||||
{"file":"about/compatibility/docker-image-support-matrix", "os":["linux"]},
|
||||
{"file":"about/compatibility/user-kernel-space-compat-matrix", "os":["linux"]},
|
||||
|
||||
@@ -89,19 +82,22 @@ article_pages = [
|
||||
|
||||
{"file":"rocm-a-z", "os":["linux", "windows"]},
|
||||
|
||||
{"file":"about/release-notes", "os":["linux"]},
|
||||
]
|
||||
|
||||
exclude_patterns = ['temp']
|
||||
|
||||
external_toc_path = "./sphinx/_toc.yml"
|
||||
|
||||
docs_core = ROCmDocs("ROCm Documentation")
|
||||
docs_core.setup()
|
||||
extensions = ["rocm_docs"]
|
||||
|
||||
external_projects_current_project = "rocm"
|
||||
|
||||
for sphinx_var in ROCmDocs.SPHINX_VARS:
|
||||
globals()[sphinx_var] = getattr(docs_core, sphinx_var)
|
||||
html_theme = "rocm_docs_theme"
|
||||
html_theme_options = {"flavor": "rocm-docs-home"}
|
||||
|
||||
html_title = "ROCm Documentation"
|
||||
|
||||
html_theme_options = {
|
||||
"link_main_doc": False
|
||||
}
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Building ROCm documentation">
|
||||
<meta name="keywords" content="documentation, Visual Studio Code, GitHub, command line,
|
||||
AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Building documentation
|
||||
|
||||
You can build our documentation via GitHub (in a pull request) or locally (using the command line or
|
||||
|
||||
229
docs/contribute/contribute-docs.md
Normal file
229
docs/contribute/contribute-docs.md
Normal file
@@ -0,0 +1,229 @@
|
||||
# Contributing to ROCm documentation
|
||||
|
||||
AMD values and encourages contributions to our code and documentation. If you choose to
|
||||
contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
|
||||
process, to which we are dedicated.
|
||||
|
||||
If you have issues when trying to contribute, refer to the
|
||||
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
|
||||
repository.
|
||||
|
||||
## Folder structure and naming convention
|
||||
|
||||
Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
|
||||
`/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
|
||||
(`/`) folder.
|
||||
|
||||
All images are stored in the `/docs/data` folder. An image's file path mirrors that of the documentation
|
||||
file where it is used.
|
||||
|
||||
Our naming structure uses kebab case; for example, `my-file-name.rst`.
|
||||
|
||||
## Supported formats and syntax
|
||||
|
||||
Our documentation includes both Markdown and RST files. We are gradually transitioning existing
|
||||
Markdown to RST in order to more effectively meet our documentation needs. When contributing,
|
||||
RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.
|
||||
|
||||
We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) syntax and compile
|
||||
our API references using [Doxygen](https://www.doxygen.nl/).
|
||||
|
||||
The following table shows some common documentation components and the syntax convention we
|
||||
use for each:
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<th>Component</th>
|
||||
<th>RST syntax</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Code blocks</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
.. code-block:: language-name
|
||||
|
||||
My code block.
|
||||
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Cross-referencing internal files</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
:doc:`Title <../path/to/file/filename>`
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>External links</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
`link name <URL>`_
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<tr>
|
||||
<td>Headings</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
******************
|
||||
Chapter title (H1)
|
||||
******************
|
||||
|
||||
Section title (H2)
|
||||
===============
|
||||
|
||||
Subsection title (H3)
|
||||
---------------------
|
||||
|
||||
Sub-subsection title (H4)
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Images</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
.. image:: image1.png
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Internal links</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
1. Add a tag to the section you want to reference:
|
||||
|
||||
.. _my-section-tag: section-1
|
||||
|
||||
Section 1
|
||||
==========
|
||||
|
||||
2. Link to your tag:
|
||||
|
||||
As shown in :ref:`section-1`.
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<tr>
|
||||
<td>Lists</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
# Ordered (numbered) list item
|
||||
|
||||
* Unordered (bulleted) list item
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<tr>
|
||||
<td>Math (block)</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
.. math::
|
||||
|
||||
A = \begin{pmatrix}
|
||||
0.0 & 1.0 & 1.0 & 3.0 \\
|
||||
4.0 & 5.0 & 6.0 & 7.0 \\
|
||||
\end{pmatrix}
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Math (inline)</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
:math:`2 \times 2 `
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Notes</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
.. note::
|
||||
|
||||
My note here.
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Tables</td>
|
||||
<td>
|
||||
|
||||
```rst
|
||||
|
||||
.. csv-table:: Optional title here
|
||||
:widths: 30, 70 #optional column widths
|
||||
:header: "entry1 header", "entry2 header"
|
||||
|
||||
"entry1", "entry2"
|
||||
|
||||
```
|
||||
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## Language and style
|
||||
|
||||
We use the
|
||||
[Google developer documentation style guide](https://developers.google.com/style/highlights) to
|
||||
guide our content.
|
||||
|
||||
Font size and type, page layout, white space control, and other formatting
|
||||
details are controlled via
|
||||
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
|
||||
of any formatting issues, create a pull request in our
|
||||
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.
|
||||
|
||||
## Building our documentation
|
||||
|
||||
<!-- % TODO: Fix the link to be able to work at every files -->
|
||||
To learn how to build our documentation, refer to
|
||||
[Building documentation](./building.md).
|
||||
@@ -1,4 +1,10 @@
|
||||
# How to provide feedback for ROCm documentation
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Providing feedback for ROCm documentation">
|
||||
<meta name="keywords" content="documentation, pull request, GitHub, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Providing feedback for ROCm documentation
|
||||
|
||||
There are four standard ways to provide feedback for this repository.
|
||||
|
||||
|
||||
@@ -1,3 +1,9 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm documentation toolchain">
|
||||
<meta name="keywords" content="documentation, toolchain, Sphinx, Doxygen, MyST, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# ROCm documentation toolchain
|
||||
|
||||
Our documentation relies on several open source toolchains and sites.
|
||||
|
||||
@@ -1,15 +1,22 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Deep learning using ROCm">
|
||||
<meta name="keywords" content="deep learning, frameworks, installation, PyTorch, TensorFlow,
|
||||
MAGMA, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Deep learning guide
|
||||
|
||||
The following sections cover the different framework installations for ROCm and
|
||||
deep-learning applications. The following image provides
|
||||
the sequential flow for the use of each framework. For each framework's most
|
||||
current release notes, refer to
|
||||
[Third party support](../about/compatibility/3rd-party-support-matrix.md).
|
||||
{doc}`Third-party support<rocm-install-on-linux:reference/3rd-party-support-matrix>`.
|
||||
|
||||

|
||||
|
||||
## Frameworks installation
|
||||
|
||||
* [Installing PyTorch](../install/pytorch-install.md)
|
||||
* [Installing TensorFlow](../install/tensorflow-install.md)
|
||||
* [Installing MAGMA](../install/magma-install.md)
|
||||
* {doc}`PyTorch for ROCm<rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
|
||||
* {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
|
||||
* {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`
|
||||
|
||||
@@ -1,189 +0,0 @@
|
||||
# GPU-enabled MPI
|
||||
|
||||
The Message Passing Interface ([MPI](https://www.mpi-forum.org)) is a standard
|
||||
API for distributed and parallel application development that can scale to
|
||||
multi-node clusters. To facilitate the porting of applications to clusters with
|
||||
GPUs, ROCm enables various technologies. These technologies allow users to
|
||||
directly use GPU pointers in MPI calls and enable ROCm-aware MPI libraries to
|
||||
deliver optimal performance for both intra-node and inter-node GPU-to-GPU
|
||||
communication.
|
||||
|
||||
The AMD kernel driver exposes Remote Direct Memory Access (RDMA) through the
|
||||
*PeerDirect* interfaces to allow Host Channel Adapters (HCA, a type of
|
||||
Network Interface Card or NIC) to directly read and write to the GPU device
|
||||
memory with RDMA capabilities. These interfaces are currently registered as a
|
||||
*peer_memory_client* with Mellanox’s OpenFabrics Enterprise Distribution (OFED)
|
||||
`ib_core` kernel module to allow high-speed DMA transfers between GPU and HCA.
|
||||
These interfaces are used to optimize inter-node MPI message communication.
|
||||
|
||||
This chapter describes how to set up Open MPI with the ROCm platform. The Open
|
||||
MPI project is an open source implementation of the MPI standard that is developed and maintained by a consortium of academic, research,
|
||||
and industry partners.
|
||||
|
||||
Several MPI implementations can be made ROCm-aware by compiling them with
|
||||
[Unified Communication Framework](https://www.openucx.org/) (UCX) support. One
|
||||
notable exception is MVAPICH2: It directly supports AMD GPUs without using UCX,
|
||||
and you can download it [here](http://mvapich.cse.ohio-state.edu/downloads/).
|
||||
Use the latest version of the MVAPICH2-GDR package.
|
||||
|
||||
The Unified Communication Framework is an open source, cross-platform framework
|
||||
whose goal is to provide a common set of communication interfaces that targets a
|
||||
broad set of network programming models and interfaces. UCX is ROCm-aware, and
|
||||
ROCm technologies are used directly to implement various network operation
|
||||
primitives. For more details on the UCX design, refer to its
|
||||
[documentation](https://www.openucx.org/documentation).
|
||||
|
||||
## Building UCX
|
||||
|
||||
The following section describes how to set up UCX so it can be used to compile
|
||||
Open MPI. The following environment variables are set, such that all software
|
||||
components will be installed in the same base directory (we assume they are installed
|
||||
in your home directory; for other locations, adjust the below environment
|
||||
variables accordingly, and make sure you have write permission for that
|
||||
location):
|
||||
|
||||
```shell
|
||||
export INSTALL_DIR=$HOME/ompi_for_gpu
|
||||
export BUILD_DIR=/tmp/ompi_for_gpu_build
|
||||
mkdir -p $BUILD_DIR
|
||||
```
|
||||
|
||||
```{note}
|
||||
The following sequences of build commands assume either the ROCmCC or the AOMP
|
||||
compiler is active in the environment that will execute the commands.
|
||||
```
|
||||
|
||||
## Install UCX
|
||||
|
||||
The next step is to set up UCX by compiling its source code and installing it:
|
||||
|
||||
```shell
|
||||
export UCX_DIR=$INSTALL_DIR/ucx
|
||||
cd $BUILD_DIR
|
||||
git clone https://github.com/openucx/ucx.git -b v1.14.1
|
||||
cd ucx
|
||||
./autogen.sh
|
||||
mkdir build
|
||||
cd build
|
||||
../configure --prefix=$UCX_DIR \
|
||||
--with-rocm=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make -j $(nproc) install
|
||||
```
|
||||
|
||||
The [communication libraries tables](../reference/library-index.md)
|
||||
document the compatibility of UCX versions with ROCm versions.
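As an optional sanity check (a sketch; transport names vary slightly across UCX versions), you can confirm that the build detected ROCm by listing the transports UCX recognizes:

```shell
$UCX_DIR/bin/ucx_info -d | grep -i rocm
```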
|
||||
|
||||
## Install Open MPI
|
||||
|
||||
These are the steps to build Open MPI:
|
||||
|
||||
```shell
|
||||
export OMPI_DIR=$INSTALL_DIR/ompi
|
||||
cd $BUILD_DIR
|
||||
git clone --recursive https://github.com/open-mpi/ompi.git \
|
||||
-b v5.0.x
|
||||
cd ompi
|
||||
./autogen.pl
|
||||
mkdir build
|
||||
cd build
|
||||
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
|
||||
--with-rocm=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make -j $(nproc) install
|
||||
```
|
||||
|
||||
## ROCm-enabled OSU
|
||||
|
||||
The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of
|
||||
various primitives with an AMD GPU device and ROCm support. This functionality
|
||||
is exposed when configured with the `--enable-rocm` option. We can use the following
|
||||
steps to compile OMB:
|
||||
|
||||
```shell
|
||||
export OSU_DIR=$INSTALL_DIR/osu
|
||||
cd $BUILD_DIR
|
||||
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
|
||||
tar xfz osu-micro-benchmarks-5.9.tar.gz
|
||||
cd osu-micro-benchmarks-5.9
|
||||
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
|
||||
--with-rocm=/opt/rocm \
|
||||
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
|
||||
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
|
||||
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
|
||||
make -j $(nproc)
|
||||
```
|
||||
|
||||
## Intra-node run
|
||||
|
||||
Before running an Open MPI job, it is essential to set some environment variables to
|
||||
ensure that the correct version of Open MPI and UCX is being used.
|
||||
|
||||
```shell
|
||||
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
|
||||
export PATH=$OMPI_DIR/bin:$PATH
|
||||
```
|
||||
|
||||
The following command runs the OSU bandwidth benchmark between the first two GPU
|
||||
devices (i.e., GPU 0 and GPU 1, same OAM) by default inside the same node. It
|
||||
measures the unidirectional bandwidth from the first device to the other.
|
||||
|
||||
```shell
|
||||
$OMPI_DIR/bin/mpirun -np 2 \
|
||||
-x UCX_TLS=sm,self,rocm \
|
||||
--mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
|
||||
```
|
||||
|
||||
To select different devices, for example 2 and 3, use the following command:
|
||||
|
||||
```shell
|
||||
export HIP_VISIBLE_DEVICES=2,3
|
||||
export HSA_ENABLE_SDMA=0
|
||||
```
|
||||
|
||||
The following output shows the effective transfer bandwidth measured for
|
||||
inter-die data transfer between GPU device 2 and 3 (same OAM). For messages
|
||||
larger than 67 MB, an effective utilization of about 150 GB/sec is achieved, which
|
||||
corresponds to 75% of the peak transfer bandwidth of 200 GB/sec for that
|
||||
connection:
|
||||
|
||||

|
||||
|
||||
## Collective operations
|
||||
|
||||
Collective operations on GPU buffers are best handled through the
|
||||
Unified Collective Communication Library (UCC) component in Open MPI.
|
||||
For this, the UCC library has to be configured and compiled with ROCm
|
||||
support.
|
||||
|
||||
Refer to the compatibility tables in the [communication libraries](../reference/library-index.md)
|
||||
reference for supported UCC and ROCm version combinations.
|
||||
|
||||
An example for configuring UCC and Open MPI with ROCm support
|
||||
is shown below:
|
||||
|
||||
```shell
|
||||
export UCC_DIR=$INSTALL_DIR/ucc
|
||||
git clone https://github.com/openucx/ucc.git
|
||||
cd ucc
|
||||
./configure --with-rocm=/opt/rocm \
|
||||
--with-ucx=$UCX_DIR \
|
||||
--prefix=$UCC_DIR
|
||||
make -j && make install
|
||||
|
||||
# Configure and compile Open MPI with UCX, UCC, and ROCm support
|
||||
cd ompi
|
||||
./configure --with-rocm=/opt/rocm \
|
||||
--with-ucx=$UCX_DIR \
|
||||
--with-ucc=$UCC_DIR \
|
||||
--prefix=$OMPI_DIR
|
||||
```
|
||||
|
||||
Using the UCC component with an MPI application requires setting some
|
||||
additional parameters:
|
||||
|
||||
```shell
|
||||
mpirun --mca pml ucx --mca osc ucx \
|
||||
--mca coll_ucc_enable 1 \
|
||||
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
|
||||
```
|
||||
264
docs/how-to/gpu-enabled-mpi.rst
Normal file
@@ -0,0 +1,264 @@
|
||||
.. meta::
|
||||
:description: GPU-enabled Message Passing Interface
|
||||
:keywords: Message Passing Interface, MPI, AMD, ROCm
|
||||
|
||||
***************************************************************************************************
|
||||
GPU-enabled Message Passing Interface
|
||||
***************************************************************************************************
|
||||
|
||||
The Message Passing Interface (`MPI <https://www.mpi-forum.org>`_) is a standard API for distributed
|
||||
and parallel application development that can scale to multi-node clusters. To facilitate the porting of
|
||||
applications to clusters with GPUs, ROCm enables various technologies. You can use these
|
||||
technologies to add GPU pointers to MPI calls and enable ROCm-aware MPI libraries to deliver optimal
|
||||
performance for both intra-node and inter-node GPU-to-GPU communication.
|
||||
|
||||
The AMD kernel driver exposes remote direct memory access (RDMA) through *PeerDirect* interfaces.
|
||||
This allows network interface cards (NICs) to directly read and write to RDMA-capable GPU device
|
||||
memory, resulting in high-speed direct memory access (DMA) transfers between GPU and NIC. These
|
||||
interfaces are used to optimize inter-node MPI message communication.
|
||||
|
||||
The Open MPI project is an open source implementation of the MPI. It's developed and maintained by
|
||||
a consortium of academic, research, and industry partners. To compile Open MPI with ROCm support,
|
||||
refer to the following sections:
|
||||
|
||||
* :ref:`open-mpi-ucx`
|
||||
* :ref:`open-mpi-libfabric`
|
||||
|
||||
.. _open-mpi-ucx:
|
||||
|
||||
ROCm-aware Open MPI on InfiniBand and RoCE networks using UCX
|
||||
================================================================
|
||||
|
||||
The `Unified Communication Framework <https://www.openucx.org/documentation>`_ (UCX), is an
|
||||
open source, cross-platform framework designed to provide a common set of communication
|
||||
interfaces for various network programming models and interfaces. UCX uses ROCm technologies to
|
||||
implement various network operation primitives. UCX is the standard communication library for
|
||||
InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnect. To optimize data
|
||||
transfer operations, many MPI libraries, including Open MPI, can leverage UCX internally.
|
||||
|
||||
UCX and Open MPI have a compile option to enable ROCm support. To install and configure UCX to compile Open MPI for ROCm, use the following instructions.
|
||||
|
||||
1. Set environment variables to install all software components in the same base directory. We use the
|
||||
home directory in our example, but you can specify a different location if you want.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export INSTALL_DIR=$HOME/ompi_for_gpu
|
||||
export BUILD_DIR=/tmp/ompi_for_gpu_build
|
||||
mkdir -p $BUILD_DIR
|
||||
|
||||
2. Install UCX. To view UCX and ROCm version compatibility, refer to the
|
||||
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export UCX_DIR=$INSTALL_DIR/ucx
|
||||
cd $BUILD_DIR
|
||||
git clone https://github.com/openucx/ucx.git -b v1.15.x
|
||||
cd ucx
|
||||
./autogen.sh
|
||||
mkdir build
|
||||
cd build
|
||||
../configure --prefix=$UCX_DIR \
|
||||
--with-rocm=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make -j $(nproc) install
|
||||
|
||||
3. Install Open MPI.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OMPI_DIR=$INSTALL_DIR/ompi
|
||||
cd $BUILD_DIR
|
||||
git clone --recursive https://github.com/open-mpi/ompi.git \
|
||||
-b v5.0.x
|
||||
cd ompi
|
||||
./autogen.pl
|
||||
mkdir build
|
||||
cd build
|
||||
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
|
||||
--with-rocm=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make install
|
||||
|
||||
.. _rocm-enabled-osu:
|
||||
|
||||
ROCm-enabled OSU benchmarks
|
||||
---------------------------------------------------------------------------------------------------------------
|
||||
|
||||
You can use OSU Micro Benchmarks (OMB) to evaluate the performance of various primitives on
|
||||
ROCm-supported AMD GPUs. The ``--enable-rocm`` option exposes this functionality.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OSU_DIR=$INSTALL_DIR/osu
|
||||
cd $BUILD_DIR
|
||||
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
|
||||
tar xfz osu-micro-benchmarks-7.2.tar.gz
|
||||
cd osu-micro-benchmarks-7.2
|
||||
./configure --enable-rocm \
|
||||
--with-rocm=/opt/rocm \
|
||||
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
|
||||
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
|
||||
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
|
||||
make -j $(nproc)
|
||||
|
||||
Intra-node run
|
||||
----------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Before running an Open MPI job, you must set the following environment variables to ensure that
|
||||
you're using the correct versions of Open MPI and UCX.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
|
||||
export PATH=$OMPI_DIR/bin:$PATH
|
||||
|
||||
To run the OSU bandwidth benchmark between the first two GPU devices (``GPU 0`` and ``GPU 1``)
|
||||
inside the same node, use the following code.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$OMPI_DIR/bin/mpirun -np 2 \
|
||||
-x UCX_TLS=sm,self,rocm \
|
||||
--mca pml ucx \
|
||||
./c/mpi/pt2pt/standard/osu_bw D D
|
||||
|
||||
This measures the unidirectional bandwidth from the first device (``GPU 0``) to the second device
|
||||
(``GPU 1``). To select specific devices, for example ``GPU 2`` and ``GPU 3``, include the following
|
||||
command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export HIP_VISIBLE_DEVICES=2,3
|
||||
|
||||
To force using a copy kernel instead of a DMA engine for the data transfer, use the following
|
||||
command:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export HSA_ENABLE_SDMA=0
|
||||
|
||||
The following output shows the effective transfer bandwidth measured for inter-die data transfer
|
||||
between ``GPU 2`` and ``GPU 3`` on a system with MI250 GPUs. For messages larger than 67 MB, an effective
|
||||
utilization of about 150 GB/sec is achieved:
|
||||
|
||||
.. image:: ../data/how-to/gpu-enabled-mpi-1.png
|
||||
:width: 400
|
||||
:alt: Inter-GPU bandwidth for various payload sizes
|
||||
|
||||
Collective operations
|
||||
----------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Collective operations on GPU buffers are best handled through the Unified Collective Communication
|
||||
(UCC) library component in Open MPI. To accomplish this, you must configure and compile the UCC
|
||||
library with ROCm support.
|
||||
|
||||
.. note::
|
||||
|
||||
You can verify UCC and ROCm version compatibility using the
|
||||
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export UCC_DIR=$INSTALL_DIR/ucc
|
||||
git clone https://github.com/openucx/ucc.git -b v1.2.x
|
||||
cd ucc
|
||||
./autogen.sh
|
||||
./configure --with-rocm=/opt/rocm \
|
||||
--with-ucx=$UCX_DIR \
|
||||
--prefix=$UCC_DIR
|
||||
make -j && make install
|
||||
|
||||
# Configure and compile Open MPI with UCX, UCC, and ROCm support
|
||||
cd ompi
|
||||
./configure --with-rocm=/opt/rocm \
|
||||
--with-ucx=$UCX_DIR \
|
||||
--with-ucc=$UCC_DIR \
|
||||
--prefix=$OMPI_DIR
|
||||
|
||||
To use the UCC component with an MPI application, you must set additional parameters:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
mpirun --mca pml ucx --mca osc ucx \
|
||||
--mca coll_ucc_enable 1 \
|
||||
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
|
||||
|
||||
.. _open-mpi-libfabric:
|
||||
|
||||
ROCm-aware Open MPI using libfabric
|
||||
================================================================
|
||||
|
||||
For network interconnects that are not covered in the previous category, such as HPE Slingshot,
|
||||
ROCm-aware communication can often be achieved through the libfabric library. For more information,
|
||||
refer to the `libfabric documentation <https://github.com/ofiwg/libfabric/wiki>`_.
|
||||
|
||||
.. note::
|
||||
|
||||
When using Open MPI v5.0.x with libfabric support, shared memory communication between
|
||||
processes on the same node goes through the *ob1/sm* component. This component has
|
||||
fundamental support for GPU memory, which is accomplished by using a staging host buffer.
|
||||
Consequently, the performance of device-to-device shared memory communication is lower than
|
||||
the theoretical peak performance allowed by the GPU-to-GPU interconnect.
|
||||
|
||||
1. Install libfabric. Note that libfabric is often pre-installed. To determine if it's already installed, run:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
module avail libfabric
|
||||
|
||||
Alternatively, you can download and compile libfabric with ROCm support. Note that not all
|
||||
components required to support some networks (e.g., HPE Slingshot) are available in the open source
|
||||
repository. Therefore, using a pre-installed libfabric library is strongly recommended over compiling
|
||||
libfabric manually.
|
||||
|
||||
If a pre-compiled libfabric library is available on your system, you can skip the following step.
|
||||
|
||||
2. Compile libfabric with ROCm support.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OFI_DIR=$INSTALL_DIR/ofi
|
||||
cd $BUILD_DIR
|
||||
git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
|
||||
cd libfabric
|
||||
./autogen.sh
|
||||
./configure --prefix=$OFI_DIR \
|
||||
--with-rocr=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make install
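As an optional check (a sketch; the provider list depends on your system's network stack), you can ask the newly built libfabric which providers it detects:

.. code-block:: shell

   $OFI_DIR/bin/fi_info --list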
|
||||
|
||||
Installing Open MPI with libfabric support
|
||||
----------------------------------------------------------------------------------------------------------------
|
||||
|
||||
To build Open MPI with libfabric, use the following code:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export OMPI_DIR=$INSTALL_DIR/ompi
|
||||
cd $BUILD_DIR
|
||||
git clone --recursive https://github.com/open-mpi/ompi.git \
|
||||
-b v5.0.x
|
||||
cd ompi
|
||||
./autogen.pl
|
||||
mkdir build
|
||||
cd build
|
||||
../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
|
||||
--with-rocm=/opt/rocm
|
||||
make -j $(nproc)
|
||||
make install
|
||||
|
||||
ROCm-aware OSU with Open MPI and libfabric
|
||||
----------------------------------------------------------------------------------------------------------------
|
||||
|
||||
Compiling a ROCm-aware version of OSU benchmarks with Open MPI and libfabric uses the same
|
||||
process described in :ref:`rocm-enabled-osu`.
|
||||
|
||||
To run an OSU benchmark using multiple nodes, use the following code:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
|
||||
$OMPI_DIR/bin/mpirun -np 2 \
|
||||
./c/mpi/pt2pt/standard/osu_bw D D
|
||||
@@ -1,6 +1,13 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="System debugging guide">
|
||||
<meta name="keywords" content="debug, system-level debug, debug flags, PCIe debug, AMD,
|
||||
ROCm">
|
||||
</head>
|
||||
|
||||
# System debugging guide
|
||||
|
||||
## ROCm language and system level debug, flags, and environment variables
|
||||
## ROCm language and system-level debug, flags, and environment variables
|
||||
|
||||
To avoid the Ethernet port being renamed every time you change graphics cards, add the kernel options `net.ifnames=0 biosdevname=0`.
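For example, on a GRUB-based distribution (a sketch; adjust the file and the regeneration command for your bootloader), you would append these options to the kernel command line and regenerate the bootloader configuration:

```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX="... net.ifnames=0 biosdevname=0"

# Regenerate the GRUB configuration (Debian/Ubuntu):
sudo update-grub
# On RHEL-based systems: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```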
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Tuning guides">
|
||||
<meta name="keywords" content="high-performance computing, HPC, Instinct accelerators,
|
||||
Radeon, tuning, tuning guide, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Tuning guides
|
||||
|
||||
Use case-specific system setup and tuning guides.
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="MI100 high-performance computing and tuning guide">
|
||||
<meta name="keywords" content="MI100, high-performance computing, HPC, tuning, BIOS
|
||||
settings, NBIO, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# MI100 high-performance computing and tuning guide
|
||||
|
||||
## System settings
|
||||
@@ -352,15 +359,15 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
|
||||
[...]
|
||||
```
|
||||
|
||||
Once the system is properly configured, the AMD ROCm platform can be
|
||||
Once the system is properly configured, ROCm software can be
|
||||
installed.
|
||||
|
||||
## System management
|
||||
|
||||
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
|
||||
[Installing ROCm on Linux](../../install/linux/install.md). For verifying that the
|
||||
installation was successful, refer to
|
||||
{ref}`verifying-kernel-mode-driver-installation` and
|
||||
{doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. To verify that the installation was
|
||||
successful, refer to the
|
||||
{doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
|
||||
[Validation Tools](../../reference/library-index.md). Should verification
|
||||
fail, consult the [System Debugging Guide](../system-debugging.md).
|
||||
|
||||
@@ -405,7 +412,8 @@ SIMD pipelines, memory information, and Instruction Set Architecture:
|
||||

|
||||
|
||||
For a complete list of architecture (LLVM target) names, refer to
|
||||
[Linux support](../../about/compatibility/linux-support.md) and [Windows support](../../about/compatibility/windows-support.md).
|
||||
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
|
||||
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>` support.
|
||||
|
||||
### Testing inter-device bandwidth
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="MI200 high-performance computing and tuning guide">
|
||||
<meta name="keywords" content="MI200, high-performance computing, HPC, tuning, BIOS
|
||||
settings, NBIO, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# MI200 high-performance computing and tuning guide
|
||||
|
||||
## System settings
|
||||
@@ -27,7 +34,7 @@ Analogous settings for other non-AMI System BIOS providers could be set
|
||||
similarly. For systems with Intel processors, some settings may not apply or be
|
||||
available as listed in the following table.
|
||||
|
||||
```{list-table} Recommended settings for the system BIOS in a GIGABYTE platform.
|
||||
```{list-table}
|
||||
:header-rows: 1
|
||||
:name: mi200-bios
|
||||
|
||||
@@ -337,15 +344,15 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
|
||||
[...]
|
||||
```
|
||||
|
||||
Once the system is properly configured, the AMD ROCm platform can be
|
||||
Once the system is properly configured, ROCm software can be
|
||||
installed.
|
||||
|
||||
## System management
|
||||
|
||||
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
|
||||
[Installing ROCm on Linux](../../install/linux/install.md). For verifying that the
|
||||
installation was successful, refer to
|
||||
{ref}`verifying-kernel-mode-driver-installation` and
|
||||
{doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. For verifying that the
|
||||
installation was successful, refer to the
|
||||
{doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
|
||||
[Validation Tools](../../reference/library-index.md). Should verification
|
||||
fail, consult the [System Debugging Guide](../system-debugging.md).
|
||||
|
||||
@@ -390,7 +397,8 @@ Instruction Set Architecture (ISA):
|
||||

|
||||
|
||||
For a complete list of architecture (LLVM target) names, refer to GPU OS Support for
|
||||
[Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md).
|
||||
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
|
||||
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.
|
||||
|
||||
### Testing inter-device bandwidth
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="RDNA2 workstation tuning guide">
|
||||
<meta name="keywords" content="RDNA2, workstation tuning, BIOS settings, installation, AMD,
|
||||
ROCm">
|
||||
</head>
|
||||
|
||||
# RDNA2 workstation tuning guide
|
||||
|
||||
## System settings
|
||||
@@ -5,16 +12,16 @@
|
||||
This chapter reviews system settings that are required to configure the system
|
||||
for ROCm virtualization on RDNA2-based AMD Radeon™ PRO GPUs. Installing ROCm on
|
||||
Bare Metal follows the routine ROCm
|
||||
[installation procedure](../../install/linux/install.md).
|
||||
{doc}`installation procedure<rocm-install-on-linux:how-to/native-install/index>`.
|
||||
|
||||
To enable ROCm virtualization on V620, you must set up Single Root I/O
|
||||
Virtualization (SR-IOV) in the BIOS via the settings described in
|
||||
({ref}`bios-settings`). A tested configuration is described in
|
||||
({ref}`os-settings`).
|
||||
|
||||
```{attention}
|
||||
:::{attention}
|
||||
SR-IOV is supported on V620 and unsupported on W6800.
|
||||
```
|
||||
:::
|
||||
|
||||
(bios-settings)=
|
||||
|
||||
@@ -160,6 +167,6 @@ First, assign GPU virtual function (VF) to VM using the following steps.
|
||||
Then start the VM.
|
||||
|
||||
Finally install ROCm on the virtual machine (VM). For detailed instructions,
|
||||
refer to the [ROCm Installation Guide](../../install/linux/install.md). For any
|
||||
refer to the {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`. For any
|
||||
issue encountered during installation, write to us
|
||||
[here](mailto:CloudGPUsupport@amd.com).
|
||||
|
||||
@@ -1,12 +1,22 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="AMD ROCm documentation">
|
||||
<meta name="keywords" content="documentation, guides, installation, compatibility, support,
|
||||
reference, ROCm, AMD">
|
||||
</head>
|
||||
|
||||
# AMD ROCm™ documentation
|
||||
|
||||
Welcome to the ROCm docs home page! If you're new to ROCm, you can review the following
|
||||
resources to learn more about our products and what we support:
|
||||
|
||||
* [What is ROCm?](./what-is-rocm.md)
|
||||
* [What's new?](about/whats-new/whats-new)
|
||||
* [Release notes](./about/release-notes.md)
|
||||
|
||||
You can install ROCm on our Radeon™, Radeon Pro™, and Instinct™ GPUs. If you're using Radeon
|
||||
GPUs, we recommend reading the
|
||||
{doc}`Radeon-specific ROCm documentation<radeon:index>`.
|
||||
|
||||
Our documentation is organized into the following categories:
|
||||
|
||||
::::{grid} 1 2 2 2
|
||||
@@ -20,34 +30,34 @@ Installation guides
|
||||
^^^
|
||||
|
||||
* Linux
|
||||
* [Quick-start (Linux)](./install/linux/install-quick.md)
|
||||
* [Linux install guide](./install/linux/install.md)
|
||||
* [Package manager integration](./install/linux/package-manager-integration.md)
|
||||
* {doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`
|
||||
* {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`
|
||||
* {doc}`Package manager integration<rocm-install-on-linux:how-to/native-install/package-manager-integration>`
|
||||
* Windows
|
||||
* [Quick-start (Windows)](./install/windows/install-quick.md)
|
||||
* [Windows install guide](./install/windows/install.md)
|
||||
* [Application deployment guidelines](./install/windows/windows-app-deployment-guidelines.md)
|
||||
* [Deploy ROCm Docker containers](./install/docker.md)
|
||||
* [PyTorch for ROCm](./install/pytorch-install.md)
|
||||
* [TensorFlow for ROCm](./install/tensorflow-install.md)
|
||||
* [MAGMA for ROCm](./install/magma-install.md)
|
||||
* [ROCm & Spack](./install/spack-intro.md)
|
||||
* {doc}`Windows install guide<rocm-install-on-windows:how-to/install>`
|
||||
* {doc}`Application deployment guidelines<rocm-install-on-windows:conceptual/deployment-guidelines>`
|
||||
* {doc}`Install Docker containers<rocm-install-on-linux:how-to/docker>`
|
||||
* {doc}`PyTorch for ROCm<rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
|
||||
* {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
|
||||
* {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`
|
||||
* {doc}`ROCm & Spack<rocm-install-on-linux:how-to/spack>`
|
||||
|
||||
:::
|
||||
|
||||
:::{grid-item-card}
|
||||
:padding: 2
|
||||
**Compatibility & Support**
|
||||
**Compatibility & support**
|
||||
|
||||
ROCm compatibility information
|
||||
^^^
|
||||
|
||||
* [Linux (GPU & OS)](./about/compatibility/linux-support.md)
|
||||
* [Windows (GPU & OS)](./about/compatibility/windows-support.md)
|
||||
* [Third-party](./about/compatibility/3rd-party-support-matrix.md)
|
||||
* [User/kernel space](./about/compatibility/user-kernel-space-compat-matrix.md)
|
||||
* [Docker](./about/compatibility/docker-image-support-matrix.rst)
|
||||
* {doc}`System requirements (Linux)<rocm-install-on-linux:reference/system-requirements>`
|
||||
* {doc}`System requirements (Windows)<rocm-install-on-windows:reference/system-requirements>`
|
||||
* {doc}`Third-party<rocm-install-on-linux:reference/3rd-party-support-matrix>`
|
||||
* {doc}`User/kernel space<rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`
|
||||
* {doc}`Docker<rocm-install-on-linux:reference/docker-image-support-matrix>`
|
||||
* [OpenMP](./about/compatibility/openmp.md)
|
||||
{doc}`ROCm on Radeon GPUs<radeon:index>`
|
||||
|
||||
:::
|
||||
|
||||
@@ -63,7 +73,7 @@ Task-oriented walkthroughs
|
||||
* [MI200](./how-to/tuning-guides/mi200.md)
|
||||
* [RDNA2](./how-to/tuning-guides/w6000-v620.md)
|
||||
* [Setting up for deep learning with ROCm](./how-to/deep-learning-rocm.md)
|
||||
* [GPU-enabled MPI](./how-to/gpu-enabled-mpi.md)
|
||||
* [GPU-enabled MPI](./how-to/gpu-enabled-mpi.rst)
|
||||
* [System level debugging](./how-to/system-debugging.md)
|
||||
* [GitHub examples](https://github.com/amd/rocm-examples)
|
||||
|
||||
@@ -95,7 +105,7 @@ Topic overviews & background information
|
||||
* [Compiler disambiguation](./conceptual/compiler-disambiguation.md)
|
||||
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
|
||||
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
|
||||
* [LLVN ASan](./conceptual/using-gpu-sanitizer.md)
|
||||
* [LLVM ASan](./conceptual/using-gpu-sanitizer.md)
|
||||
* [Using CMake](./conceptual/cmake-packages.rst)
|
||||
* [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
|
||||
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)
|
||||
|
||||
@@ -1,90 +0,0 @@
|
||||
# Deploy ROCm Docker containers
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Docker containers share the kernel with the host operating system, therefore the
|
||||
ROCm kernel-mode driver must be installed on the host. Please refer to
|
||||
{ref}`linux-install-methods` on installing `amdgpu-dkms`. The other
|
||||
user-space parts (like the HIP-runtime or math libraries) of the ROCm stack will
|
||||
be loaded from the container image and don't need to be installed to the host.
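A quick way to confirm that the kernel-mode driver is present on the host (a minimal check, not a full validation) is:

```shell
lsmod | grep amdgpu
```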
|
||||
|
||||
(docker-access-gpus-in-container)=
|
||||
|
||||
## Accessing GPUs in containers
|
||||
|
||||
In order to access GPUs in a container (to run applications using HIP, OpenCL or
|
||||
OpenMP offloading) explicit access to the GPUs must be granted.
|
||||
|
||||
The ROCm runtimes make use of multiple device files:
|
||||
|
||||
* `/dev/kfd`: the main compute interface shared by all GPUs
|
||||
* `/dev/dri/renderD<node>`: direct rendering interface (DRI) devices for each
|
||||
GPU. **`<node>`** is a number for each card in the system starting from 128.
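To see which render nodes exist on your host, list them:

```shell
ls -l /dev/dri/renderD*
```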
|
||||
|
||||
Exposing these devices to a container is done by using the
|
||||
[`--device`](https://docs.docker.com/engine/reference/commandline/run/#device)
|
||||
option, i.e. to allow access to all GPUs expose `/dev/kfd` and all
|
||||
`/dev/dri/renderD` devices:
|
||||
|
||||
```shell
|
||||
docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD129 ...
|
||||
```
|
||||
|
||||
More conveniently, instead of listing all devices, the entire `/dev/dri` folder
|
||||
can be exposed to the new container:
|
||||
|
||||
```shell
|
||||
docker run --device /dev/kfd --device /dev/dri
|
||||
```
|
||||
|
||||
Note that this gives more access than strictly required, as it also exposes the
|
||||
other device files found in that folder to the container.
|
||||
|
||||
(docker-restrict-gpus)=
|
||||
|
||||
### Restricting a container to a subset of the GPUs
|
||||
|
||||
If a `/dev/dri/renderD` device is not exposed to a container, then it cannot use
|
||||
the GPU associated with it; this allows you to restrict a container to any subset of
|
||||
devices.
|
||||
|
||||
For example, to allow the container to access the first and third GPU, start it
|
||||
as follows:
|
||||
|
||||
```shell
|
||||
docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD130 <image>
|
||||
```
|
||||
|
||||
### Additional options
|
||||
|
||||
The performance of an application can vary depending on the assignment of GPUs
|
||||
and CPUs to the task. Typically, `numactl` is installed as part of many HPC
|
||||
applications to provide GPU/CPU mappings. The following Docker runtime option supports
|
||||
memory mapping and can improve performance.
|
||||
|
||||
```shell
|
||||
--security-opt seccomp=unconfined
|
||||
```
|
||||
|
||||
This option is recommended for Docker containers running HPC applications.
|
||||
|
||||
```shell
|
||||
docker run --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined ...
|
||||
```
|
||||
|
||||
## Docker images in the ROCm ecosystem
|
||||
|
||||
### Base images
|
||||
|
||||
<https://github.com/RadeonOpenCompute/ROCm-docker> hosts images useful for users
|
||||
wishing to build their own containers leveraging ROCm. The built images are
|
||||
available from [Docker Hub](https://hub.docker.com/u/rocm). In particular,
|
||||
`rocm/rocm-terminal` is a small image with the prerequisites to build HIP
|
||||
applications, but does not include any libraries.
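For example, to try it out (assuming the device flags described earlier in this guide):

```shell
docker pull rocm/rocm-terminal
docker run -it --device /dev/kfd --device /dev/dri rocm/rocm-terminal
```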
|
||||
|
||||
### Applications
|
||||
|
||||
AMD provides pre-built images for various GPU-ready applications through its
|
||||
Infinity Hub at <https://www.amd.com/en/technologies/infinity-hub>.
|
||||
Examples for invoking each application and suggested parameters used for
|
||||
benchmarking are also provided there.
|
||||
@@ -1,64 +0,0 @@
|
||||
# MAGMA installation for ROCm
|
||||
|
||||
## MAGMA for ROCm
|
||||
|
||||
Matrix Algebra on GPU and Multicore Architectures (MAGMA) is a
|
||||
collection of next-generation dense linear algebra libraries that is designed
|
||||
for heterogeneous architectures, such as multiple GPUs and multi- or many-core
|
||||
CPUs.
|
||||
|
||||
MAGMA provides implementations for CUDA, HIP, Intel Xeon Phi, and OpenCL™. For
|
||||
more information, refer to
|
||||
[https://icl.utk.edu/magma/index.html](https://icl.utk.edu/magma/index.html).
|
||||
|
||||
### Using MAGMA for PyTorch
|
||||
|
||||
Tensors are fundamental to deep-learning techniques because they provide extensive
|
||||
representational functionalities and math operations. This data structure is
|
||||
represented as a multidimensional matrix. MAGMA accelerates tensor operations
|
||||
with a variety of solutions including driver routines, computational routines,
|
||||
BLAS routines, auxiliary routines, and utility routines.
|
||||
|
||||
### Building MAGMA from source
|
||||
|
||||
To build MAGMA from the source, follow these steps:
|
||||
|
||||
1. If you want to compile only for your uarch, use:
|
||||
|
||||
```bash
|
||||
export PYTORCH_ROCM_ARCH=<uarch>
|
||||
```
|
||||
|
||||
`<uarch>` is the architecture reported by the `rocminfo` command.
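   To find `<uarch>`, you can run the same check used elsewhere in this documentation:

   ```bash
   rocminfo | grep gfx
   ```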
|
||||
|
||||
2. Use the following:
|
||||
|
||||
```bash
|
||||
export PYTORCH_ROCM_ARCH=<uarch>
|
||||
|
||||
# "install" hipMAGMA into /opt/rocm/magma by copying after build
|
||||
git clone https://bitbucket.org/icl/magma.git
|
||||
pushd magma
|
||||
# Fixes memory leaks of MAGMA found while executing linalg UTs
|
||||
git checkout 5959b8783e45f1809812ed96ae762f38ee701972
|
||||
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
|
||||
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
|
||||
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
|
||||
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
|
||||
export PATH="${PATH}:/opt/rocm/bin"
|
||||
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
|
||||
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
|
||||
else
|
||||
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
|
||||
fi
|
||||
for arch in $amdgpu_targets; do
|
||||
echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
|
||||
done
|
||||
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
|
||||
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
|
||||
make -f make.gen.hipMAGMA -j $(nproc)
|
||||
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
|
||||
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
|
||||
popd
|
||||
mv magma /opt/rocm
|
||||
```
|
||||
@@ -1,446 +0,0 @@
|
||||
# Installing PyTorch for ROCm
|
||||
|
||||
[PyTorch](https://pytorch.org/) is an open-source tensor library designed for deep learning. PyTorch on
|
||||
ROCm provides mixed-precision and large-scale training using our
|
||||
[MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) and
|
||||
[RCCL](https://github.com/ROCmSoftwarePlatform/rccl) libraries.
|
||||
|
||||
To install [PyTorch for ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), you have the following options:
|
||||
|
||||
* [Use a Docker image with PyTorch pre-installed](#using-a-docker-image-with-pytorch-pre-installed)
|
||||
(recommended)
|
||||
* [Use a wheels package](#using-a-wheels-package)
|
||||
* [Use the PyTorch ROCm base Docker image](#using-the-pytorch-rocm-base-docker-image)
|
||||
* [Use the PyTorch upstream Docker file](#using-the-pytorch-upstream-docker-file)
|
||||
|
||||
For hardware, software, and third-party framework compatibility between ROCm and PyTorch, refer to:
|
||||
|
||||
* [GPU and OS support (Linux)](../about/compatibility/linux-support.md)
|
||||
* [Compatibility](../about/compatibility/3rd-party-support-matrix.md)
|
||||
|
||||
## Using a Docker image with PyTorch pre-installed
|
||||
|
||||
1. Download the latest public PyTorch Docker image
|
||||
([https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch)).
|
||||
|
||||
```bash
|
||||
docker pull rocm/pytorch:latest
|
||||
```
|
||||
|
||||
You can also download a specific and supported configuration with different user-space ROCm
|
||||
versions, PyTorch versions, and operating systems.
|
||||
|
||||
2. Start a Docker container using the image.
|
||||
|
||||
```bash
|
||||
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
|
||||
--device=/dev/kfd --device=/dev/dri --group-add video \
|
||||
--ipc=host --shm-size 8G rocm/pytorch:latest
|
||||
```
|
||||
|
||||
:::{note}
|
||||
This will automatically download the image if it does not exist on the host. You can also pass the `-v`
|
||||
argument to mount any data directories from the host onto the container.
|
||||
:::
|
||||
|
||||
(install_pytorch_wheels)=
|
||||
|
||||
## Using a wheels package
|
||||
|
||||
PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, go
|
||||
to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). For the correct
|
||||
wheels command, you must select 'Linux', 'Python', 'pip', and 'ROCm' in the matrix.
|
||||
|
||||
1. Choose one of the following three options:
|
||||
|
||||
**Option 1:**
|
||||
|
||||
a. Download a base Docker image with the correct user-space ROCm version.
|
||||
| Base OS | Docker image | Link to Docker image|
|
||||
|----------------|-----------------------------|----------------|
|
||||
| Ubuntu 20.04 | `rocm/dev-ubuntu-20.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-20.04](https://hub.docker.com/r/rocm/dev-ubuntu-20.04)
|
||||
| Ubuntu 22.04 | `rocm/dev-ubuntu-22.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-22.04](https://hub.docker.com/r/rocm/dev-ubuntu-22.04)
|
||||
| CentOS 7 | `rocm/dev-centos-7` | [https://hub.docker.com/r/rocm/dev-centos-7](https://hub.docker.com/r/rocm/dev-centos-7)
|
||||
|
||||
b. Pull the selected image.
|
||||
|
||||
```bash
|
||||
docker pull rocm/dev-ubuntu-20.04:latest
|
||||
```
|
||||
|
||||
c. Start a Docker container using the downloaded image.
|
||||
|
||||
```bash
|
||||
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
|
||||
```
|
||||
|
||||
**Option 2:**
|
||||
|
||||
Select a base OS Docker image (Check [OS compatibility](../about/compatibility/linux-support.md))
|
||||
|
||||
Pull selected base OS image (Ubuntu 20.04 for example)
|
||||
|
||||
```docker
|
||||
docker pull ubuntu:20.04
|
||||
```
|
||||
|
||||
Start a Docker container using the downloaded image
|
||||
|
||||
```docker
|
||||
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04
|
||||
```
|
||||
|
||||
Install ROCm using the directions in the [Installation section](./linux/install.md).
|
||||
|
||||
**Option 3:**
|
||||
|
||||
Install on bare metal. Check [OS compatibility](../about/compatibility/linux-support.md) and install ROCm using the
|
||||
directions in the [Installation section](./linux/install.md).
|
||||
|
||||
2. Install the required dependencies for the wheels package.
|
||||
|
||||
```bash
|
||||
sudo apt update
|
||||
sudo apt install libjpeg-dev python3-dev python3-pip
|
||||
pip3 install wheel setuptools
|
||||
```
|
||||
|
||||
3. Install `torch`, `torchvision`, and `torchaudio`, as specified in the
|
||||
[installation matrix](https://pytorch.org/get-started/locally/).
|
||||
|
||||
:::{note}
|
||||
The following command uses the ROCm 5.6 PyTorch wheel. If you want a different version of ROCm,
|
||||
modify the command accordingly.
|
||||
:::
|
||||
|
||||
```bash
|
||||
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6/
|
||||
```
|
||||
|
||||
4. (Optional) Use MIOpen kdb files with ROCm PyTorch wheels.
|
||||
|
||||
PyTorch uses [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for machine learning
|
||||
primitives, which are compiled into kernels at runtime. Runtime compilation causes a small warm-up
|
||||
phase when starting PyTorch, and MIOpen kdb files contain precompiled kernels that can speed up
|
||||
application warm-up phases. For more information, refer to the
|
||||
{doc}`MIOpen installation page <miopen:install>`.
|
||||
|
||||
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in
|
||||
a specific location with respect to the PyTorch installation path. A helper script simplifies this task by
|
||||
taking the ROCm version and GPU architecture as inputs. This works for Ubuntu and CentOS.
|
||||
|
||||
You can download the helper script here:
|
||||
[install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh), or use:
|
||||
|
||||
`wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh`
|
||||
|
||||
After installing ROCm PyTorch wheels, run the following code:
|
||||
|
||||
```bash
|
||||
#Optional; replace 'gfx90a' with your architecture and 5.6 with your preferred ROCm version
|
||||
export GFX_ARCH=gfx90a
|
||||
|
||||
#Optional
|
||||
export ROCM_VERSION=5.6
|
||||
|
||||
./install_kdb_files_for_pytorch_wheels.sh
|
||||
```
|
||||
|
||||
## Using the PyTorch ROCm base Docker image
|
||||
|
||||
The pre-built base Docker image has all dependencies installed, including:
|
||||
|
||||
* ROCm
|
||||
* Torchvision
|
||||
* Conda packages
|
||||
* The compiler toolchain
|
||||
|
||||
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, which is used by the build
|
||||
scripts to determine the configuration of the build environment.
|
||||
|
||||
1. Download the Docker image. This is the base image, which does not contain PyTorch.
|
||||
|
||||
```bash
|
||||
docker pull rocm/pytorch:latest-base
|
||||
```
|
||||
|
||||
2. Start a Docker container using the downloaded image.
|
||||
|
||||
```bash
|
||||
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
|
||||
```
|
||||
|
||||
You can also pass the `-v` argument to mount any data directories from the host onto the container.
|
||||
|
||||
3. Clone the PyTorch repository.
|
||||
|
||||
```bash
|
||||
cd ~
|
||||
git clone https://github.com/pytorch/pytorch.git
|
||||
cd pytorch
|
||||
git submodule update --init --recursive
|
||||
```
|
||||
|
||||
4. Set ROCm architecture (optional). The Docker image tag is `rocm/pytorch:latest-base`.
|
||||
|
||||
:::{note}
|
||||
By default in the `rocm/pytorch:latest-base` image, PyTorch builds simultaneously for the following
|
||||
architectures:
|
||||
* gfx900
|
||||
* gfx906
|
||||
* gfx908
|
||||
* gfx90a
|
||||
* gfx1030
|
||||
:::
|
||||
|
||||
If you want to compile _only_ for your microarchitecture (uarch), run:
|
||||
|
||||
```bash
|
||||
export PYTORCH_ROCM_ARCH=<uarch>
|
||||
```
|
||||
|
||||
Where `<uarch>` is the architecture reported by the `rocminfo` command.
|
||||
|
||||
To find your uarch, run:
|
||||
|
||||
```bash
|
||||
rocminfo | grep gfx
|
||||
```
|
||||
|
||||
5. Build PyTorch.
|
||||
|
||||
```bash
|
||||
./.ci/pytorch/build.sh
|
||||
```
|
||||
|
||||
This converts PyTorch sources for
|
||||
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
|
||||
PyTorch framework.
|
||||
|
||||
To check if your build is successful, run:
|
||||
|
||||
```bash
|
||||
echo $? # should return 0 if success
|
||||
```
|
||||
|
||||
## Using the PyTorch upstream Docker file
|
||||
|
||||
If you don't want to use a prebuilt base Docker image, you can build a custom base Docker image
|
||||
using scripts from the PyTorch repository. This uses a standard Docker image from operating system
|
||||
maintainers and installs all the required dependencies, including:
|
||||
|
||||
* ROCm
|
||||
* Torchvision
|
||||
* Conda packages
|
||||
* The compiler toolchain
|
||||
|
||||
1. Clone the PyTorch repository.
|
||||
|
||||
```bash
|
||||
cd ~
|
||||
git clone https://github.com/pytorch/pytorch.git
|
||||
cd pytorch
|
||||
git submodule update --init --recursive
|
||||
```
|
||||
|
||||
2. Build the PyTorch Docker image.
|
||||
|
||||
```bash
|
||||
cd .ci/docker
|
||||
./build.sh pytorch-linux-<os-version>-rocm<rocm-version>-py<python-version> -t rocm/pytorch:build_from_dockerfile
|
||||
```
|
||||
|
||||
Where:
|
||||
* `<os-version>`: `ubuntu20.04` (or `focal`), `ubuntu22.04` (or `jammy`), `centos7.5`, or `centos9`
|
||||
* `<rocm-version>`: `5.4`, `5.5`, or `5.6`
|
||||
* `<python-version>`: `3.8`-`3.11`
|
||||
|
||||
To verify that your image was successfully created, run:
|
||||
|
||||
`docker image ls rocm/pytorch:build_from_dockerfile`
|
||||
|
||||
If successful, the output looks like this:
|
||||
|
||||
```bash
|
||||
REPOSITORY TAG IMAGE ID CREATED SIZE
|
||||
rocm/pytorch build_from_dockerfile 17071499be47 2 minutes ago 32.8GB
|
||||
```
|
||||
|
||||
3. Start a Docker container using the image with the mounted PyTorch folder.
|
||||
|
||||
```bash
|
||||
docker run -it --cap-add=SYS_PTRACE --user root \
|
||||
--security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri \
|
||||
--group-add video --ipc=host --shm-size 8G \
|
||||
-v ~/pytorch:/pytorch rocm/pytorch:build_from_dockerfile
|
||||
```
|
||||
|
||||
You can also pass the `-v` argument to mount any data directories from the host onto the container.
|
||||
|
||||
4. Go to the PyTorch directory.
|
||||
|
||||
```bash
|
||||
cd pytorch
|
||||
```
|
||||
|
||||
5. Set ROCm architecture.
|
||||
|
||||
To determine your AMD architecture, run:
|
||||
|
||||
```bash
|
||||
rocminfo | grep gfx
|
||||
```
|
||||
|
||||
The result looks like this (for `gfx1030` architecture):
|
||||
|
||||
```bash
|
||||
Name: gfx1030
|
||||
Name: amdgcn-amd-amdhsa--gfx1030
|
||||
```
|
||||
|
||||
Set the `PYTORCH_ROCM_ARCH` environment variable to specify the architectures you want to
|
||||
build PyTorch for.
|
||||
|
||||
```bash
|
||||
export PYTORCH_ROCM_ARCH=<uarch>
|
||||
```
|
||||
|
||||
where `<uarch>` is the architecture reported by the `rocminfo` command.
|
||||
|
||||
6. Build PyTorch.
|
||||
|
||||
```bash
|
||||
./.ci/pytorch/build.sh
|
||||
```
|
||||
|
||||
This converts PyTorch sources for
|
||||
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
|
||||
PyTorch framework.
|
||||
|
||||
To check if your build is successful, run:
|
||||
|
||||
```bash
|
||||
echo $? # should return 0 if success
|
||||
```
|
||||
|
||||
## Testing the PyTorch installation
|
||||
|
||||
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
|
||||
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
|
||||
**official wheels package**, validation tests are not necessary.
|
||||
|
||||
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps:
|
||||
|
||||
1. Import the torch package in Python to test if PyTorch is installed and accessible.
|
||||
|
||||
:::{note}
|
||||
Do not run the following command in the PyTorch git folder.
|
||||
:::
|
||||
|
||||
```bash
|
||||
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
|
||||
|
||||
2. Check if the GPU is accessible from PyTorch. In the PyTorch framework, `torch.cuda` is a generic way
|
||||
to access the GPU. This can only access an AMD GPU if one is available.
|
||||
|
||||
```bash
|
||||
python3 -c 'import torch; print(torch.cuda.is_available())'
|
||||
```
|
||||
|
||||
3. Run unit tests to validate the PyTorch installation fully.
|
||||
|
||||
:::{note}
|
||||
You must run the following command from the PyTorch home directory.
|
||||
:::
|
||||
|
||||
```bash
|
||||
PYTORCH_TEST_WITH_ROCM=1 python3 test/run_test.py --verbose \
|
||||
--include test_nn test_torch test_cuda test_ops \
|
||||
test_unary_ufuncs test_binary_ufuncs test_autograd
|
||||
```
|
||||
|
||||
This command ensures that the required environment variable is set to skip certain unit tests for
|
||||
ROCm. This also applies to wheel installs in a non-controlled environment.
|
||||
|
||||
:::{note}
|
||||
Make sure your PyTorch source code corresponds to the PyTorch wheel or the installation in the
|
||||
Docker image. Incompatible PyTorch source code can give errors when running unit tests.
|
||||
:::
|
||||
|
||||
Some tests may be skipped, as appropriate, based on your system configuration. ROCm doesn't
|
||||
support all PyTorch features; tests that evaluate unsupported features are skipped. Other tests might
|
||||
be skipped, depending on the host or GPU memory and the number of available GPUs.
|
||||
|
||||
If the compilation and installation are correct, all tests will pass.
|
||||
|
||||
4. Run individual unit tests.
|
||||
|
||||
```bash
|
||||
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
|
||||
```
|
||||
|
||||
You can replace `test_nn.py` with any other test set.
|
||||
|
||||
## Running a basic PyTorch example
|
||||
|
||||
The PyTorch examples repository provides basic examples that exercise the functionality of your
|
||||
framework.
|
||||
|
||||
Two of our favorite testing databases are:
|
||||
|
||||
* **MNIST** (Modified National Institute of Standards and Technology): A database of handwritten
|
||||
digits that can be used to train a Convolutional Neural Network for **handwriting recognition**.
|
||||
* **ImageNet**: A database of images that can be used to train a network for
|
||||
**visual object recognition**.
|
||||
|
||||
### MNIST PyTorch example
|
||||
|
||||
1. Clone the PyTorch examples repository.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/pytorch/examples.git
|
||||
```
|
||||
|
||||
2. Go to the MNIST example folder.
|
||||
|
||||
```bash
|
||||
cd examples/mnist
|
||||
```
|
||||
|
||||
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
|
||||
|
||||
```bash
|
||||
python3 main.py
|
||||
```
|
||||
|
||||
This generates the following output:
|
||||
|
||||
```bash
|
||||
...
|
||||
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.010128
|
||||
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001348
|
||||
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.005261
|
||||
|
||||
Test set: Average loss: 0.0252, Accuracy: 9921/10000 (99%)
|
||||
```
|
||||
|
||||
### ImageNet PyTorch example
|
||||
|
||||
1. Clone the PyTorch examples repository (if you didn't already do this step in the preceding MNIST example).
|
||||
|
||||
```bash
|
||||
git clone https://github.com/pytorch/examples.git
|
||||
```
|
||||
|
||||
2. Go to the ImageNet example folder.
|
||||
|
||||
```bash
|
||||
cd examples/imagenet
|
||||
```
|
||||
|
||||
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
|
||||
|
||||
```bash
|
||||
python3 main.py
|
||||
```
|
||||
@@ -1,421 +0,0 @@
|
||||
# Introduction to Spack
|
||||
|
||||
Spack is a package management tool designed to support multiple software versions and
|
||||
configurations on a wide variety of platforms and environments. It was designed for large
|
||||
supercomputing centers, where many users share common software installations on clusters with
|
||||
exotic architectures using libraries that do not have a standard ABI. Spack is non-destructive: installing
|
||||
a new version does not break existing installations, so many configurations can coexist on the same
|
||||
system.
|
||||
|
||||
Most importantly, Spack is *simple*. It offers a simple *spec* syntax, so users can concisely specify
|
||||
versions and configuration options. Spack is also simple for package authors: package files are written
|
||||
in pure Python, and specs allow package authors to maintain a single file for many different builds of
|
||||
the same package. For more information on Spack, see
|
||||
[https://spack-tutorial.readthedocs.io/en/latest/](https://spack-tutorial.readthedocs.io/en/latest/).
|
||||
|
||||
## ROCm packages in Spack
|
||||
|
||||
| **Component** | **Spack Package Name** |
|
||||
|---------------------------|------------------------|
|
||||
| **rocm-cmake** | rocm-cmake |
|
||||
| **thunk** | hsakmt-roct |
|
||||
| **rocm-smi-lib** | rocm-smi-lib |
|
||||
| **hsa** | hsa-rocr-dev |
|
||||
| **lightning** | llvm-amdgpu |
|
||||
| **devicelibs** | rocm-device-libs |
|
||||
| **comgr** | comgr |
|
||||
| **rocclr (vdi)** | hip-rocclr |
|
||||
| **hipify_clang** | hipify-clang |
|
||||
| **hip (hip_in_vdi)** | hip |
|
||||
| **ocl (opencl_on_vdi)**   | rocm-opencl |
|
||||
| **rocminfo** | rocminfo |
|
||||
| **clang-ocl** | rocm-clang-ocl |
|
||||
| **rccl** | rccl |
|
||||
| **atmi** | atmi |
|
||||
| **rocm_debug_agent** | rocm-debug-agent |
|
||||
| **rocm_bandwidth_test** | rocm-bandwidth-test |
|
||||
| **rocprofiler** | rocprofiler-dev |
|
||||
| **roctracer-dev-api** | roctracer-dev-api |
|
||||
| **roctracer** | roctracer-dev |
|
||||
| **dbgapi** | rocm-dbgapi |
|
||||
| **rocm-gdb** | rocm-gdb |
|
||||
| **openmp-extras** | rocm-openmp-extras |
|
||||
| **rocBLAS** | rocblas |
|
||||
| **hipBLAS** | hipblas |
|
||||
| **rocFFT** | rocfft |
|
||||
| **rocRAND** | rocrand |
|
||||
| **rocSPARSE** | rocsparse |
|
||||
| **hipSPARSE** | hipsparse |
|
||||
| **rocALUTION** | rocalution |
|
||||
| **rocSOLVER** | rocsolver |
|
||||
| **rocPRIM** | rocprim |
|
||||
| **rocThrust** | rocthrust |
|
||||
| **hipCUB** | hipcub |
|
||||
| **hipfort** | hipfort |
|
||||
| **ROCmValidationSuite** | rocm-validation-suite |
|
||||
| **MIOpenGEMM** | miopengemm |
|
||||
| **MIOpen(Hip variant)** | miopen-hip |
|
||||
| **MIOpen(opencl)** | miopen-opencl |
|
||||
| **MIVisionX** | mivisionx |
|
||||
| **AMDMIGraphX** | migraphx |
|
||||
| **rocm-tensile** | rocm-tensile |
|
||||
| **hipfft** | hipfft |
|
||||
| **RDC** | rdc |
|
||||
| **hipsolver** | hipsolver |
|
||||
| **mlirmiopen** | mlirmiopen |
|
||||
|
||||
```{note}
|
||||
You must install all prerequisites before installing Spack.
|
||||
```
|
||||
|
||||
::::{tab-set}
|
||||
:::{tab-item} Ubuntu
|
||||
:sync: Ubuntu
|
||||
|
||||
```shell
|
||||
# Install some essential utilities:
|
||||
apt-get update
|
||||
apt-get install make patch bash tar gzip unzip bzip2 file gnupg2 git gawk
|
||||
apt-get update -y
|
||||
apt-get install -y xz-utils
|
||||
apt-get install build-essential
|
||||
apt-get install vim
|
||||
# Install Python:
|
||||
apt-get install python3
|
||||
apt-get install python3-pip
|
||||
# Install Compilers:
|
||||
apt-get install gcc
|
||||
apt-get install gfortran
|
||||
```
|
||||
|
||||
:::
|
||||
:::{tab-item} SLES
|
||||
:sync: SLES
|
||||
|
||||
```shell
|
||||
# Install some essential utilities:
|
||||
zypper update
|
||||
zypper install make patch bash tar gzip unzip bzip2 xz file gnupg2 git gawk
|
||||
zypper in -t pattern
|
||||
zypper install vim
|
||||
# Install Python:
|
||||
zypper install python3
|
||||
zypper install python3-pip
|
||||
# Install Compilers:
|
||||
zypper install gcc
|
||||
zypper install gcc-fortran
|
||||
zypper install gcc-c++
|
||||
```
|
||||
|
||||
:::
|
||||
:::{tab-item} CentOS
|
||||
:sync: CentOS
|
||||
|
||||
```shell
|
||||
# Install some essential utilities:
|
||||
yum update
|
||||
yum install make
|
||||
yum install patch bash tar gzip unzip bzip2 xz file gnupg2 git gawk
|
||||
yum group install "Development Tools"
|
||||
yum install vim
|
||||
# Install Python:
|
||||
yum install python3
|
||||
pip3 install --upgrade pip
|
||||
# Install compilers:
|
||||
yum install gcc
|
||||
yum install gcc-gfortran
|
||||
yum install gcc-c++
|
||||
```
|
||||
|
||||
:::
|
||||
::::
|
||||
|
||||
## Steps to build ROCm components using Spack
|
||||
|
||||
1. To use the Spack package manager, clone the Spack project from GitHub.
|
||||
|
||||
```bash
|
||||
git clone https://github.com/spack/spack
|
||||
```
|
||||
|
||||
2. Initialize Spack.
|
||||
|
||||
The `setup-env.sh` script initializes the Spack environment.
|
||||
|
||||
```bash
|
||||
cd spack
|
||||
|
||||
. share/spack/setup-env.sh
|
||||
```
|
||||
|
||||
Spack commands are available once the above steps are completed. To list the available commands,
|
||||
use `help`.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/spack# spack help
|
||||
```
|
||||
|
||||
## Using Spack to install ROCm components
|
||||
|
||||
1. `rocm-cmake`
|
||||
|
||||
Install the default variants and the latest version of `rocm-cmake`.
|
||||
|
||||
```bash
|
||||
spack install rocm-cmake
|
||||
```
|
||||
|
||||
To install a specific version of `rocm-cmake`, use:
|
||||
|
||||
```bash
|
||||
spack install rocm-cmake@<version number>
|
||||
```
|
||||
|
||||
For example, `spack install rocm-cmake@5.2.0`
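To see which versions Spack knows about before choosing one, you can list them first (the output varies with your Spack checkout):

```bash
spack versions rocm-cmake
```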
|
||||
|
||||
2. `info`
|
||||
|
||||
The `info` command displays basic package information. It shows the preferred, safe, and
|
||||
deprecated versions, in addition to the available variants. It also shows the dependencies with other
|
||||
packages.
|
||||
|
||||
```bash
|
||||
spack info mivisionx
|
||||
```
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/spack# spack info mivisionx
|
||||
CMakePackage: mivisionx
|
||||
|
||||
Description:
|
||||
MIVisionX toolkit is a set of comprehensive computer vision and machine
|
||||
intelligence libraries, utilities, and applications bundled into a
|
||||
single toolkit.
|
||||
|
||||
Homepage: <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX>
|
||||
|
||||
Preferred version:
|
||||
5.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.3.0.tar.gz>
|
||||
|
||||
Safe versions:
|
||||
5.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.3.0.tar.gz>
|
||||
5.2.3 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.3.tar.gz>
|
||||
5.2.1 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.1.tar.gz>
|
||||
5.2.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.0.tar.gz>
|
||||
5.1.3 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.1.3.tar.gz>
|
||||
5.1.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.1.0.tar.gz>
|
||||
5.0.2 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.0.2.tar.gz>
|
||||
5.0.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.0.0.tar.gz>
|
||||
4.5.2 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.5.2.tar.gz>
|
||||
4.5.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.5.0.tar.gz>
|
||||
|
||||
Deprecated versions:
|
||||
4.3.1 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.3.1.tar.gz>
|
||||
4.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.3.0.tar.gz>
|
||||
4.2.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.2.0.tar.gz>
|
||||
4.1.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.1.0.tar.gz>
|
||||
4.0.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.0.0.tar.gz>
|
||||
3.10.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.10.0.tar.gz>
|
||||
3.9.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.9.0.tar.gz>
|
||||
3.8.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.8.0.tar.gz>
|
||||
3.7.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.7.0.tar.gz>
|
||||
1.7 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/1.7.tar.gz>
|
||||
|
||||
Variants:
|
||||
Name [Default] When Allowed values Description
|
||||
==================== ==== ==================== ==================================
|
||||
|
||||
build_type [Release] -- Release, Debug, CMake build type
|
||||
RelWithDebInfo
|
||||
hip [on] -- on, off Use HIP as backend
|
||||
ipo [off] -- on, off CMake interprocedural optimization
|
||||
opencl [off] -- on, off Use OPENCL as the backend
|
||||
|
||||
Build Dependencies:
|
||||
cmake ffmpeg libjpeg-turbo miopen-hip miopen-opencl miopengemm opencv openssl protobuf rocm-cmake rocm-opencl
|
||||
|
||||
Link Dependencies:
|
||||
miopen-hip miopen-opencl miopengemm openssl rocm-opencl
|
||||
|
||||
Run Dependencies:
|
||||
None
|
||||
|
||||
root@ixt-rack-104:/spack#
|
||||
```
|
||||
|
||||
## Installing variants for ROCm components
|
||||
|
||||
The variants listed above indicate that the `mivisionx` package is built by default with
|
||||
`build_type=Release` and the `hip` backend, and without the `opencl` backend. `build_type=Debug` and
|
||||
`RelWithDebInfo`, with `opencl` and without `hip`, are also supported.
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
spack install mivisionx build_type=Debug          # backend is hip, the default
|
||||
spack install mivisionx+opencl build_type=Debug   # backend is opencl; hip is disabled by the conflict defined in the recipe
|
||||
```
|
||||
|
||||
* `spack spec` command
|
||||
|
||||
To display the dependency tree, use the `spack spec` command with the same spec syntax.
|
||||
|
||||
For example:
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/spack# spack spec mivisionx
|
||||
Input spec
|
||||
--------------------------------
|
||||
mivisionx
|
||||
|
||||
Concretized
|
||||
--------------------------------
|
||||
mivisionx@5.3.0%gcc@9.4.0+hip~ipo~opencl build_type=Release arch=linux-ubuntu20.04-skylake_avx512
|
||||
```
|
||||
|
||||
## Creating an environment
|
||||
|
||||
You can create an environment that contains all the required components for the ROCm version you need.
|
||||
|
||||
1. In the root folder, create a new folder where you can create a `.yaml` file. This file is used to
|
||||
create an environment.
|
||||
|
||||
```bash
|
||||
mkdir /localscratch
cd /localscratch
vi sample.yaml
|
||||
```
|
||||
|
||||
2. Add all the required components in the `sample.yaml` file:
|
||||
|
||||
```yaml
spack:
  concretization: separately
  packages:
    all:
      compiler: [gcc@8.5.0]
  specs:
  - matrix:
    - ['%gcc@8.5.0^cmake@3.19.7']
    - [rocm-cmake@5.3.2, rocm-dbgapi@5.3.2, rocm-debug-agent@5.3.2, rocm-gdb@5.3.2,
       rocminfo@5.3.2, rocm-opencl@5.3.2, rocm-smi-lib@5.3.2, rocm-tensile@5.3.2, rocm-validation-suite@4.3.1,
       rocprim@5.3.2, rocprofiler-dev@5.3.2, rocrand@5.3.2, rocsolver@5.3.2, rocsparse@5.3.2,
       rocthrust@5.3.2, roctracer-dev@5.3.2]
  view: true
```
|
||||
|
||||
3. Once you've created the `.yaml` file, you can use it to create an environment.
|
||||
|
||||
```bash
|
||||
spack env create -d /localscratch/MyEnvironment /localscratch/sample.yaml
|
||||
```
|
||||
|
||||
4. Activate the environment.
|
||||
|
||||
```bash
|
||||
spack env activate /localscratch/MyEnvironment
|
||||
```
|
||||
|
||||
5. Verify that the environment lists all the component versions you want.
|
||||
|
||||
```bash
|
||||
spack find    # lists all components in the environment (initially, 0 installed)
|
||||
```
|
||||
|
||||
6. Install all the components in the `.yaml` file.
|
||||
|
||||
```bash
|
||||
cd /localscratch/MyEnvironment
spack install -j 50
|
||||
```
|
||||
|
||||
7. Check that all components are successfully installed.
|
||||
|
||||
```bash
|
||||
spack find
|
||||
```
|
||||
|
||||
8. If you modify the `.yaml` file, you must deactivate the existing environment and create a new one in order for the modifications to take effect.
|
||||
|
||||
To deactivate, use:
|
||||
|
||||
```bash
|
||||
spack env deactivate
|
||||
```
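For example, after editing `sample.yaml`, the cycle reuses the commands from the earlier steps (a new environment directory is used here so it doesn't clash with the old one):

```bash
spack env deactivate
spack env create -d /localscratch/MyEnvironment2 /localscratch/sample.yaml
spack env activate /localscratch/MyEnvironment2
spack install -j 50
```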
|
||||
|
||||
## Create and apply a patch before installation
|
||||
|
||||
Spack installs ROCm packages after pulling the source code from GitHub and building it locally. In
|
||||
order to build a component with any modification to the source code, you must generate a patch and
|
||||
apply it before the build phase.
|
||||
|
||||
To generate a patch and build with the changes:
|
||||
|
||||
1. Stage the source code.
|
||||
|
||||
```bash
|
||||
spack stage hip@5.2.0    # pulls the 5.2.0 release source of hip and prints the path to the spack-src directory containing the full source
|
||||
|
||||
root@ixt-rack-104:/spack# spack stage hip@5.2.0
|
||||
==> Fetching <https://github.com/ROCm-Developer-Tools/HIP/archive/rocm-5.2.0.tar.gz>
|
||||
==> Fetching <https://github.com/ROCm-Developer-Tools/hipamd/archive/rocm-5.2.0.tar.gz>
|
||||
==> Fetching <https://github.com/ROCm-Developer-Tools/ROCclr/archive/rocm-5.2.0.tar.gz>
|
||||
==> Moving resource stage
|
||||
source: /tmp/root/spack-stage/resource-hipamd-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
|
||||
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/hipamd
|
||||
==> Moving resource stage
|
||||
source: /tmp/root/spack-stage/resource-opencl-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
|
||||
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/opencl
|
||||
==> Moving resource stage
|
||||
source: /tmp/root/spack-stage/resource-rocclr-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
|
||||
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/rocclr
|
||||
==> Staged hip in /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7
|
||||
```
|
||||
|
||||
2. Change directory to `spack-src` inside the staged directory.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/spack# cd /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7# cd spack-src/
|
||||
```
|
||||
|
||||
3. Create a new Git repository.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# git init
|
||||
```
|
||||
|
||||
4. Add the entire directory to the repository.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# git add .
|
||||
```
|
||||
|
||||
5. Make the required changes to the source code.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# vi hipamd/CMakeLists.txt    # make the required changes to the source code
|
||||
```
|
||||
|
||||
6. Generate the patch using the `git diff` command.
|
||||
|
||||
```bash
|
||||
git diff > /spack/var/spack/repos/builtin/packages/hip/0001-modifications.patch
|
||||
```
|
||||
|
||||
7. Update the recipe with the patch file name and any conditions you want to apply.
|
||||
|
||||
```bash
|
||||
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# spack edit hip
|
||||
```
|
||||
|
||||
Provide the patch file name and the conditions for the patch:
|
||||
|
||||
`patch("0001-modifications.patch", when="@5.2.0")`
|
||||
|
||||
Spack applies `0001-modifications.patch` on the `5.2.0` release code before starting the `hip` build.
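For orientation, the directive sits in the recipe opened by `spack edit hip` alongside the existing `version()` and `patch()` calls. The fragment below is only an illustrative sketch, not the full upstream recipe, and the import line may differ between Spack releases:

```python
# /spack/var/spack/repos/builtin/packages/hip/package.py (illustrative fragment)
from spack.package import *


class Hip(CMakePackage):
    """HIP: a C++ runtime API and kernel language for AMD GPUs."""

    version("5.2.0", sha256="...")  # existing version entry; checksum elided

    # Apply the locally generated patch only when building the 5.2.0 release.
    patch("0001-modifications.patch", when="@5.2.0")
```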
|
||||
|
||||
After each modification, you must update the recipe. If there is no change to the recipe, run
|
||||
`touch /spack/var/spack/repos/builtin/packages/hip/package.py`
|
||||
@@ -1,191 +0,0 @@
|
||||
# Installing TensorFlow for ROCm
|
||||
|
||||
## TensorFlow
|
||||
|
||||
TensorFlow is an open-source library for solving machine-learning,
|
||||
deep-learning, and artificial-intelligence problems. It can be used to solve
|
||||
many problems across different sectors and industries but primarily focuses on
|
||||
training and inference in neural networks. It is one of the most popular and
|
||||
in-demand frameworks and is very active in open source contribution and
|
||||
development.
|
||||
|
||||
:::{warning}
|
||||
ROCm 5.6 and 5.7 deviate from the standard practice of supporting the last three
|
||||
TensorFlow versions. This is due to incompatibilities between earlier TensorFlow
|
||||
versions and changes introduced in the ROCm 5.6 compiler. Refer to the following
|
||||
version support matrix:
|
||||
|
||||
| ROCm | TensorFlow |
|
||||
|:-----:|:----------:|
|
||||
| 5.6.x | 2.12 |
|
||||
| 5.7.0 | 2.12, 2.13 |
|
||||
| Post-5.7.0 | Last three versions at ROCm release. |
|
||||
:::
|
||||
|
||||
### Installing TensorFlow
|
||||
|
||||
The following sections contain options for installing TensorFlow.
|
||||
|
||||
#### Option 1: using a Docker image
|
||||
|
||||
To install ROCm on bare metal, follow the section
|
||||
[Linux installation guide](../install/linux/install.md). The recommended option to
|
||||
get a TensorFlow environment is through Docker.
|
||||
|
||||
Using Docker provides portability and access to a prebuilt Docker container that
|
||||
has been rigorously tested within AMD. This might also save compilation time and
|
||||
should perform as tested without facing potential installation issues.
|
||||
Follow these steps:
|
||||
|
||||
1. Pull the latest public TensorFlow Docker image.
|
||||
|
||||
```bash
|
||||
docker pull rocm/tensorflow:latest
|
||||
```
|
||||
|
||||
2. Once you have pulled the image, run it by using the command below:
|
||||
|
||||
```bash
|
||||
docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
|
||||
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined rocm/tensorflow:latest
|
||||
```
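Once inside the container, you can quickly confirm that TensorFlow sees the GPU (this uses the standard TensorFlow device-listing API and assumes the image's default Python environment):

```bash
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
```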
|
||||
|
||||
#### Option 2: using a wheels package
|
||||
|
||||
To install TensorFlow using the wheels package, follow these steps:
|
||||
|
||||
1. Check the Python version.
|
||||
|
||||
```bash
|
||||
python3 --version
|
||||
```
|
||||
|
||||
| If: | Then: |
|
||||
|:-----------------------------------:|:--------------------------------:|
|
||||
| The Python version is older than 3.7 | Upgrade Python. |
|
||||
| The Python version is 3.7 or newer | Skip this step and go to Step 3. |
|
||||
|
||||
```{note}
|
||||
The supported Python versions are:
|
||||
|
||||
* 3.7
|
||||
* 3.8
|
||||
* 3.9
|
||||
* 3.10
|
||||
```
|
||||
|
||||
```bash
|
||||
sudo apt-get install python3.7 # or python3.8, python3.9, or python3.10
|
||||
```
|
||||
|
||||
2. Set up multiple Python versions using update-alternatives.
|
||||
|
||||
```bash
|
||||
update-alternatives --query python3
|
||||
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python[version] [priority]
|
||||
```
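For example, registering two interpreters so that Python 3.10 takes priority (the paths and priority numbers are illustrative):

```bash
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 2
```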
|
||||
|
||||
```{note}
|
||||
If the active Python version is incompatible, repeat Step 2 to register a supported version.
|
||||
```
|
||||
|
||||
```bash
|
||||
sudo update-alternatives --config python3
|
||||
```
|
||||
|
||||
3. Follow the screen prompts, and select the Python version installed in Step 2.
|
||||
|
||||
4. Install or upgrade PIP.
|
||||
|
||||
```bash
|
||||
sudo apt install python3-pip
|
||||
```
|
||||
|
||||
To install pip for a specific Python version, use the following:
|
||||
|
||||
```bash
|
||||
/usr/bin/python[version] -m pip install --upgrade pip
|
||||
```
|
||||
|
||||
To upgrade pip for the Python version installed in Step 2, run:
|
||||
|
||||
```bash
|
||||
sudo pip3 install --upgrade pip
|
||||
```
|
||||
|
||||
5. Install TensorFlow for the Python version as indicated in Step 2.
|
||||
|
||||
```bash
|
||||
/usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] --upgrade
|
||||
```
|
||||
|
||||
For valid wheel versions for each ROCm release, see the note at the end of this section.
|
||||
|
||||
|
||||
|
||||
6. Update `protobuf` to 3.19 or lower.
|
||||
|
||||
```bash
|
||||
/usr/bin/python3.7 -m pip install protobuf==3.19.0
|
||||
sudo pip3 install tensorflow
|
||||
```
|
||||
|
||||
7. Set the environment variable `PYTHONPATH`.
|
||||
|
||||
```bash
|
||||
export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" #Use same python version as in step 2
|
||||
```
|
||||
|
||||
8. Install libraries.
|
||||
|
||||
```bash
|
||||
sudo apt install rocm-libs rccl
|
||||
```
|
||||
|
||||
9. Test installation.
|
||||
|
||||
```bash
|
||||
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
|
||||
|
||||
```{note}
|
||||
For details on `tensorflow-rocm` wheels and ROCm version compatibility, see:
|
||||
[https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md)
|
||||
```
|
||||
|
||||
### Test the TensorFlow installation
|
||||
|
||||
To test the installation of TensorFlow, run the container image as specified in
|
||||
the previous section Installing TensorFlow. Ensure you have access to the Python
|
||||
shell in the Docker container.
|
||||
|
||||
```bash
|
||||
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
|
||||
```
|
||||
|
||||
### Run a basic TensorFlow example
|
||||
|
||||
The TensorFlow examples repository provides basic examples that exercise the
|
||||
framework's functionality. The MNIST database is a collection of handwritten
|
||||
digits that may be used to train a Convolutional Neural Network for handwriting
|
||||
recognition.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Clone the TensorFlow example repository.
|
||||
|
||||
```bash
|
||||
cd ~
|
||||
git clone https://github.com/tensorflow/models.git
|
||||
```
|
||||
|
||||
2. Install the dependencies of the code, and run the code.
|
||||
|
||||
```bash
|
||||
pip3 install -r requirements.txt
python3 mnist_tf.py
|
||||
```
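If that script is not present in your checkout of the models repository, the following self-contained sketch exercises the same MNIST workflow with `tf.keras` (the file name and network shape are illustrative, not taken from the repository):

```python
# mnist_quickcheck.py: confirm the GPU is visible and train a small MNIST model.
import tensorflow as tf

print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2)
model.evaluate(x_test, y_test, verbose=2)
```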
|
||||
@@ -1,140 +0,0 @@
|
||||
# Windows quick-start installation guide
|
||||
|
||||
For a quick summary on installing ROCm (HIP SDK) on Windows, follow the steps listed on this page. If
|
||||
you want a more in-depth installation guide, see
|
||||
[Installing ROCm on Windows](./install.md).
|
||||
|
||||
## System requirements
|
||||
|
||||
The HIP SDK is supported on Windows 10 and 11. The HIP SDK may be installed on a
|
||||
system without AMD GPUs to use the build toolchains. To run HIP applications, a
|
||||
compatible GPU is required. Please see the supported GPU guide for more details.
|
||||
|
||||
## HIP SDK installation
|
||||
|
||||
### Download the installer
|
||||
|
||||
Download the installer from the
|
||||
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
|
||||
|
||||
### Launch the installer
|
||||
|
||||
To launch the AMD HIP SDK Installer, click the **Setup** icon shown in the following image.
|
||||
|
||||

|
||||
|
||||
The installer requires Administrator Privileges, so you may be greeted with a
|
||||
User Access Control (UAC) pop-up. Click Yes.
|
||||
|
||||

|
||||
|
||||

|
||||
|
||||
The installer executable will temporarily extract installer packages to `C:\AMD`
|
||||
which it will remove after installation completes. This extraction is signified
|
||||
by the "Initializing install" window in the following image.
|
||||
|
||||

|
||||
|
||||
The installer will then detect your system configuration to determine which installable components
|
||||
are applicable to your system.
|
||||
|
||||

|
||||
|
||||
### Customize the install
|
||||
|
||||
When the installer launches, it displays a window that lets the user customize
|
||||
the installation. By default, all components are selected for installation.
|
||||
Refer to the following image for an example where the Select All option
|
||||
is turned on.
|
||||
|
||||

|
||||
|
||||
#### HIP SDK installer
|
||||
|
||||
The HIP SDK installation options are listed in the following table.
|
||||
|
||||
```{table} HIP SDK Components for Installation
|
||||
:name: hip-sdk-options-win
|
||||
| **HIP Components** | **Install Type** | **Additional Options** |
|
||||
|:------------------:|:----------------:|:----------------------:|
|
||||
| HIP SDK Core | 5.5.0 | Install location |
|
||||
| HIP Libraries | Full, Partial, None | Runtime, Development (Libs and headers) |
|
||||
| HIP Runtime Compiler | Full, Partial, None | Runtime, Development (Headers) |
|
||||
| HIP Ray Tracing | Full, Partial, None | Runtime, Development (Headers) |
|
||||
| Visual Studio Plugin | Full, Partial, None | Visual Studio 2017, 2019, 2022 Plugin |
|
||||
```
|
||||
|
||||
```{note}
|
||||
The Select/DeSelect All option only applies to the installation of HIP SDK
|
||||
components. To install the bundled AMD Display Driver, manually select the
|
||||
install type.
|
||||
```
|
||||
|
||||
```{tip}
|
||||
Should you only wish to install a few select components,
|
||||
DeSelecting All and then picking the individual components may be more
|
||||
convenient.
|
||||
```
|
||||
|
||||
#### AMD display driver
|
||||
|
||||
The HIP SDK installer bundles an AMD Radeon Software PRO 23.10 installer. The
|
||||
supported install options are summarized in the following table:
|
||||
|
||||
```{table} AMD Display Driver Install Options
|
||||
:name: display-driver-install-win
|
||||
| **Install Option** | **Description** |
|
||||
|:------------------:|:---------------:|
|
||||
| Install Location | Location on disk to store driver files. |
|
||||
| Install Type | The breadth of components to be installed. |
|
||||
| Factory Reset (Optional) | A Factory Reset will remove all prior versions of AMD HIP SDK and drivers. You will not be able to roll back to previously installed drivers. |
|
||||
```
|
||||
|
||||
```{table} AMD Display Driver Install Types
|
||||
:name: display-driver-win-types
|
||||
| **Install Type** | **Description** |
|
||||
|:----------------:|:---------------:|
|
||||
| Full Install | Provides all AMD Software features and controls for gaming, recording, streaming, and tweaking the performance on your graphics hardware. |
|
||||
| Minimal Install | Provides only the basic controls for AMD Software features and does not include advanced features such as performance tweaking or recording and capturing content. |
|
||||
| Driver Only | Provides no user interface for AMD Software features. |
|
||||
```
|
||||
|
||||
```{note}
|
||||
You must perform a system restart for a complete installation of the
|
||||
Display Driver.
|
||||
```
|
||||
|
||||
### Install components
|
||||
|
||||
Please wait for the installation to complete, as shown in the following image.
|
||||
|
||||

|
||||
|
||||
### Installation complete
|
||||
|
||||
Once the installation is complete, the installer window may prompt you for a
|
||||
system restart. Click **Restart** at the lower right corner, shown in the following image.
|
||||
|
||||

|
||||
|
||||
```{error}
|
||||
Should the installer terminate due to unexpected circumstances, or should the user
forcibly terminate it, the temporary directory created under
|
||||
`C:\AMD` may be safely removed. Installed components will not depend on this
|
||||
folder (unless the user specifies `C:\AMD` as an install folder explicitly).
|
||||
```
|
||||
|
||||
## Uninstall
|
||||
|
||||
All HIP SDK components, except the Visual Studio plugin, can be uninstalled through the
Windows Settings app. Navigate to "Apps > Installed apps", click the "..." on the far
right next to the component you want to uninstall, and click "Uninstall".
For Visual Studio extension uninstallation, refer to
<https://github.com/ROCm-Developer-Tools/HIP-VS/blob/master/README.md>.
|
||||
|
||||

|
||||
|
||||

|
||||
@@ -1,258 +0,0 @@
|
||||
# Install HIP SDK on Windows
|
||||
|
||||
To install the HIP SDK on Windows, use the [quick-start guide](./install-quick.md) or follow the instructions below.
|
||||
|
||||
|
||||
|
||||
**Topics:**
|
||||
|
||||
* [Prerequisites](#prerequisites)
|
||||
* [Install HIP SDK](#install-hip-sdk)
|
||||
* [Upgrade HIP SDK](#upgrade-hip-sdk)
|
||||
* [Uninstall HIP SDK](#uninstall-hip-sdk)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Verify that your system meets all the installation requirements. The installation is only supported
|
||||
on specific host architectures, Windows editions, and update versions.
|
||||
|
||||
The HIP SDK is supported on Windows 10 and 11. It can be installed on a
|
||||
system without AMD GPUs to use the build toolchains, but to run HIP applications, a
|
||||
compatible GPU is required. Please see the
|
||||
[supported GPU guide](../../about/compatibility/windows-support.md) for more details.
|
||||
|
||||
::::{tab-set}
|
||||
|
||||
:::{tab-item} CLI
|
||||
:sync: cli
|
||||
|
||||
1. Type the following command on your system from a PowerShell command-line interface (CLI):
|
||||
|
||||
```pwsh
|
||||
Get-ComputerInfo | Format-Table CsSystemType,OSName,OSDisplayVersion
|
||||
```
|
||||
|
||||
Running this command on a Windows system may result in the following output:
|
||||
|
||||
```output
|
||||
CsSystemType OsName OSDisplayVersion
|
||||
------------ ------ ----------------
|
||||
x64-based PC Microsoft Windows 11 Pro 22H2
|
||||
```
|
||||
|
||||
2. Confirm that the obtained information matches the requirements listed in {ref}`windows-support`.
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} GUI
|
||||
:sync: gui
|
||||
|
||||
1. Open the **Settings** app.
|
||||
|
||||

|
||||
|
||||
2. Navigate to **System > About**.
|
||||
|
||||

|
||||
|
||||
3. Confirm that the obtained information matches {ref}`windows-support`.
|
||||
|
||||
:::
|
||||
::::
|
||||
|
||||
## Install HIP SDK
|
||||
|
||||
::::{tab-set}
|
||||
|
||||
:::{tab-item} CLI
|
||||
:sync: cli
|
||||
|
||||
CLI options are listed in the following table:
|
||||
|
||||
```{table}
|
||||
:name: hip-sdk-cli-install
|
||||
| **Install Option** | **Description** |
|
||||
|:------------------|:---------------|
|
||||
| `-install` | Command used to install packages, both driver and applications. No output to the screen. |
|
||||
| `-install -boot` | Silent install with auto reboot. |
|
||||
| `-install -log <absolute path>` | Write install result code to the specified log file. The specified log file must be on a local machine. Double quotes are needed if there are spaces in the log file path. |
|
||||
| `-uninstall` | Command to uninstall all packages installed by this installer on the system. There is no option to specify which packages to uninstall. |
|
||||
| `-uninstall -boot` | Silent uninstall with auto reboot. |
|
||||
| `/?` or `/help` | Shows a brief description of all switch commands. |
|
||||
```
|
||||
|
||||
```{note}
|
||||
Unlike the GUI, the CLI doesn't support selectively installing parts of the SDK bundle.
|
||||
```
|
||||
|
||||
To start the installation, follow these steps:
|
||||
|
||||
1. Download the installer from the
|
||||
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
|
||||
|
||||
2. Launch the installer. Note that the installer is a graphical application with a `WinMain` entry
|
||||
point, even when called on the command line. This means that the application lifetime is tied to a
|
||||
window, even on headless systems where that window may not be visible.
|
||||
|
||||
```pwsh
|
||||
Start-Process $InstallerExecutable -ArgumentList $InstallerArgs -NoNewWindow -Wait
|
||||
```
|
||||
|
||||
```{important}
|
||||
Running the installer requires Administrator Privileges.
|
||||
```
|
||||
|
||||
To install all components:
|
||||
|
||||
```pwsh
|
||||
Start-Process ~\Downloads\Setup.exe -ArgumentList '-install','-log',"${env:USERPROFILE}\installer_log.txt" -NoNewWindow -Wait
|
||||
```
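For an unattended installation that reboots automatically when finished, combine the flags from the table above:

```pwsh
Start-Process ~\Downloads\Setup.exe -ArgumentList '-install','-boot' -NoNewWindow -Wait
```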
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} GUI
|
||||
:sync: gui
|
||||
|
||||
The HIP SDK installation options are listed in the following table.
|
||||
|
||||
```{table}
|
||||
:name: hip-sdk-options
|
||||
| **HIP Components** | **Install Type** | **Additional Options** |
|
||||
|:------------------|:----------------|:----------------------|
|
||||
| HIP SDK Core | 5.5.0 | Install location |
|
||||
| HIP Libraries | Full, Partial, None | Runtime, Development (Libs and headers) |
|
||||
| HIP Runtime Compiler | Full, Partial, None | Runtime, Development (Headers) |
|
||||
| HIP Ray Tracing | Full, Partial, None | Runtime, Development (Headers) |
|
||||
| Visual Studio Plugin | Full, Partial, None | Visual Studio 2017, 2019, 2022 Plugin |
|
||||
```
|
||||
|
||||
```{note}
|
||||
The Select/DeSelect All option only applies to the installation of HIP SDK
|
||||
components. To install the bundled AMD Display Driver, manually select the
|
||||
install type.
|
||||
```
|
||||
|
||||
```{tip}
|
||||
Should you only wish to install a few select components,
|
||||
DeSelecting All and then picking the individual components may be more
|
||||
convenient.
|
||||
```
|
||||
|
||||
The HIP SDK installer bundles an AMD Radeon Software PRO 23.10 installer. The
|
||||
supported install options are summarized in the following table:
|
||||
|
||||
```{table}
|
||||
:name: display-driver-install-options
|
||||
| **Install Option** | **Description** |
|
||||
|:------------------|:---------------|
|
||||
| Install Location | Location on disk to store driver files. |
|
||||
| Install Type | The breadth of components to be installed. |
|
||||
| Factory Reset (Optional) | A Factory Reset will remove all prior versions of AMD HIP SDK and drivers. You will not be able to roll back to previously installed drivers. |
|
||||
```
|
||||
|
||||
```{table} AMD Display Driver Install Types
|
||||
:name: display-driver-install-types
|
||||
| **Install Type** | **Description** |
|
||||
|:----------------|:---------------|
|
||||
| Full Install | Provides all AMD Software features and controls for gaming, recording, streaming, and tweaking the performance on your graphics hardware. |
|
||||
| Minimal Install | Provides only the basic controls for AMD Software features and does not include advanced features such as performance tweaking or recording and capturing content. |
|
||||
| Driver Only | Provides no user interface for AMD Software features. |
|
||||
```
|
||||
|
||||
```{note}
|
||||
You must perform a system restart for a complete installation of the
|
||||
Display Driver.
|
||||
```
|
||||
|
||||
To start the installation, follow these steps:
|
||||
|
||||
1. Download the installer from the
|
||||
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
|
||||
|
||||
2. Launch the installer by clicking the **Setup** icon.
|
||||
|
||||

|
||||
|
||||
The installer requires Administrator Privileges, so you may be greeted with a
|
||||
User Access Control (UAC) pop-up. Click Yes.
|
||||
|
||||

|
||||
|
||||
The installer executable temporarily extracts installer packages to `C:\AMD`; it removes these after the
|
||||
installation completes.
|
||||
|
||||

|
||||
|
||||
The installer detects your system configuration to determine which installable components
|
||||
are applicable to your system.
|
||||
|
||||

|
||||
|
||||
3. Customize your installation. When the installer launches, it displays a window that lets you customize
|
||||
your installation. By default, all components are selected.
|
||||
|
||||

|
||||
|
||||
4. Wait for the installation to complete.
|
||||
|
||||

|
||||
|
||||
When installation is complete, the installer window may prompt you for a system restart.
|
||||
|
||||

|
||||
|
||||
```{error}
|
||||
If the installer terminates mid-installation, the temporary directory created under `C:\AMD` can be
|
||||
safely removed. Installed components don't depend on this folder unless you explicitly choose this
|
||||
as the install folder.
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
::::
|
||||
|
||||
## Upgrade HIP SDK
|
||||
|
||||
To upgrade the HIP SDK, you can run the installer for the newer version without uninstalling the
|
||||
existing version. You can also uninstall the HIP SDK before installing the newest version.
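Using the CLI flags documented above, an in-place upgrade is simply a silent install of the newer package (the installer file name here is a placeholder for the newer release you downloaded):

```pwsh
Start-Process ~\Downloads\Setup_new.exe -ArgumentList '-install','-log',"${env:USERPROFILE}\upgrade_log.txt" -NoNewWindow -Wait
```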
|
||||
|
||||
## Uninstall HIP SDK
|
||||
|
||||
::::{tab-set}
|
||||
|
||||
:::{tab-item} CLI
|
||||
:sync: cli
|
||||
|
||||
Launch the installer. Note that the installer is a graphical application with a `WinMain` entry
|
||||
point, even when called on the command line. This means that the application lifetime is tied to a
|
||||
window, even on headless systems where that window may not be visible.
|
||||
|
||||
```pwsh
|
||||
Start-Process $InstallerExecutable -ArgumentList $InstallerArgs -NoNewWindow -Wait
|
||||
```
|
||||
|
||||
```{important}
|
||||
Running the installer requires Administrator Privileges.
|
||||
```
|
||||
|
||||
To uninstall all components:
|
||||
|
||||
```pwsh
|
||||
Start-Process ~\Downloads\Setup.exe -ArgumentList '-uninstall' -NoNewWindow -Wait
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
:::{tab-item} GUI
|
||||
:sync: gui
|
||||
|
||||
Uninstallation of HIP SDK components can be done through the Windows Settings app. Navigate to
|
||||
"Apps > Installed apps" and click the ellipsis (...) on the far right next to the component you want to uninstall. Click "Uninstall".
|
||||
|
||||

|
||||
|
||||
For Visual Studio extension uninstallation, refer to
|
||||
<https://github.com/ROCm-Developer-Tools/HIP-VS/blob/master/README.md>.
|
||||
:::
|
||||
|
||||
::::
|
||||
@@ -1,71 +0,0 @@
|
||||
# Application deployment guidelines for Windows
|
||||
|
||||
ISVs deploying applications built with the HIP SDK depend on the AMD GPU Drivers, HIP
Runtime Library, and HIP SDK Libraries. A compatibility matrix table provides
details on AMD's support model. AMD GPU Drivers are distributed with a HIP
runtime included, and each HIP runtime is associated with a HIP compiler version.

Applications built with a particular HIP compiler should document the associated
HIP runtime version and AMD GPU Driver as minimum version requirements for their
end users. Applications do not distribute the HIP runtime; instead, end users
use the HIP runtime provided by an AMD GPU Driver. AMD provides backward
compatibility for applications dynamically linked to the HIP runtime based on
our Driver and HIP support policy.

ISV applications using the HIP SDK Libraries, for example hipBLAS, should
distribute the HIP SDK Library as part of their installer package. It is
recommended not to require end users to install the HIP SDK. AMD provides
backward compatibility for the AMD Driver and HIP runtime for the HIP SDK
Libraries based on our support policy. The AMD support policy for Visual
Studio and other third-party compilers is documented here.
|
||||
|
||||
## Usage scenario
|
||||
|
||||
This guide is intended for Independent Software Vendors (ISVs) and other
|
||||
software developers intending to build applications with the HIP SDK for
|
||||
Windows. The HIP SDK is intended for developer distribution in contrast to the
|
||||
AMD GPU driver which is intended for all end users. The guide discusses how to
|
||||
use and distribute components from the HIP SDK. The HIP SDK is the collection of
|
||||
the AMD GPU Driver, HIP runtime and the HIP Libraries. These three parts are
|
||||
distributed in the HIP SDK installer. The compatibility and versioning relation
|
||||
between these three parts is documented here. AMD’s support policies for the
|
||||
developer tools give ISVs the stability to plan their toolchain usage.
|
||||
|
||||
## Recommended library distribution model
|
||||
|
||||
The HIP SDK is distributed via a Windows installer. This distribution system is
|
||||
only intended for software developers and testers. AMD recommends that end
users of programs built against HIP SDK components not be required to
install the HIP SDK. There are two types of ISV applications that use the HIP
SDK.
|
||||
|
||||
The first group of ISV applications has a dependency on the HIP runtime and
|
||||
select HIP Header Only Libraries (rocPRIM, hipCUB and rocThrust). This group of
|
||||
ISV applications needs to require that end users install an AMD GPU Driver. Each
|
||||
AMD GPU driver has a HIP runtime library bundled with it. The ISV application
|
||||
should ensure that the HIP runtime library has a minimum version associated with
|
||||
it. As the HIP runtime library does not have semantic versioning, the ISV
|
||||
application cannot check for compatibility. However, AMD is committed to not
|
||||
breaking API/ABI compatibility unless the major version number of the HIP
|
||||
runtime is incremented. ISV applications may run without user warning if the HIP
|
||||
major version available in the driver is the same as the HIP major version
|
||||
associated with the compiler it was built with. The ISV at its discretion may
|
||||
throw a warning if the HIP major version is higher than the associated HIP major
|
||||
version of the compiler it was built with.
|
||||
|
||||
The second group of ISV applications has a dependency on the HIP runtime and one
|
||||
or more Dynamically Linked HIP Libraries including the HIP RT library. ISV
|
||||
applications with this dependency need to ensure the end user installs an AMD
|
||||
GPU Driver, and should distribute the dynamically linked HIP library
|
||||
in the installer package of its application. This allows end users to avoid
|
||||
installing the HIP SDK. One benefit of this model is smaller disk space required
|
||||
as only required binaries are distributed by the ISV application. It also spares
|
||||
the end user from having to agree to licensing agreements for the entire HIP SDK.
|
||||
The version checks recommended for the ISV application including dynamically
|
||||
linked HIP Libraries follow the same requirements as the ISV applications that
|
||||
only have the HIP runtime and header only library. In addition, each dynamically
|
||||
linked HIP library also has a minimum HIP runtime requirement. Checks for the
|
||||
minimum HIP version for each dynamically linked HIP library may be added at the
|
||||
ISVs discretion. Usually, the minimum HIP version check for the HIP runtime is
|
||||
sufficient if dynamically linked HIP libraries come from the same SDK package as
|
||||
the HIP compiler.
|
||||
|
||||
Please note AMD does not support static linking to any components distributed in
|
||||
the HIP SDK.
|
||||
@@ -1,6 +1,14 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm API libraries & tools">
|
||||
<meta name="keywords" content="ROCm, API, libraries, tools, artificial intelligence, development,
|
||||
Communications, C++ primitives, Fast Fourier transforms, FFTs, random number generators, linear
|
||||
algebra, AMD">
|
||||
</head>
|
||||
|
||||
# ROCm API libraries & tools
|
||||
|
||||
::::{grid} 1 2 2 2
|
||||
::::{grid} 1 3 3 3
|
||||
:class-container: rocm-doc-grid
|
||||
|
||||
:::{grid-item-card}
|
||||
@@ -10,8 +18,9 @@
|
||||
^^^
|
||||
|
||||
* {doc}`Composable Kernel <composable_kernel:index>`
|
||||
* {doc}`MIOpen <miopen:index>`
|
||||
* {doc}`MIGraphX <amdmigraphx:index>`
|
||||
* {doc}`MIOpen <miopen:index>`
|
||||
* {doc}`MIVisionX <mivisionx:doxygen/html/index>`
|
||||
|
||||
:::
|
||||
|
||||
@@ -44,7 +53,6 @@
|
||||
|
||||
^^^
|
||||
|
||||
* {doc}`hipCC <hipcc:index>`
|
||||
* {doc}`ROCdbgapi <rocdbgapi:index>`
|
||||
* [ROCmCC](./rocmcc.md)
|
||||
* {doc}`ROCm debugger (ROCgdb) <rocgdb:index>`
|
||||
@@ -99,7 +107,7 @@
|
||||
|
||||
^^^
|
||||
|
||||
* {doc}`ROCProfiler <rocprofiler:rocprof>`
|
||||
* {doc}`ROCProfiler <rocprofiler:profiler_home_page>`
|
||||
* {doc}`ROCTracer <roctracer:index>`
|
||||
|
||||
:::
|
||||
@@ -121,8 +129,9 @@
|
||||
|
||||
^^^
|
||||
|
||||
* {doc}`AMD SMI <amdsmi:index>`
|
||||
* {doc}`ROCm Data Center Tool <rdc:index>`
|
||||
* {doc}`ROCm SMI LIB <rocm_smi_lib:index>`
|
||||
* {doc}`ROCm SMI <rocm_smi_lib:index>`
|
||||
* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
|
||||
* {doc}`TransferBench <transferbench:index>`
|
||||
|
||||
|
||||
@@ -1,3 +1,10 @@
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="Compiler reference guide">
|
||||
<meta name="keywords" content="compiler, hipCC, Clang, amdclang, optimizations, LLVM,
|
||||
rocm-llvm, AMD, ROCm">
|
||||
</head>
|
||||
|
||||
# Compiler reference guide
|
||||
|
||||
## Introduction to compiler reference guide
|
||||
@@ -134,12 +141,12 @@ The `-famd-opt` flag is useful when a user wants to build with the proprietary
|
||||
optimization compiler and not have to depend on setting any of the other
|
||||
proprietary optimization flags.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
`-famd-opt` can be used in addition to the other proprietary CPU optimization
|
||||
flags. The table of optimizations below implicitly enables the invocation of the
|
||||
AMD proprietary optimizations compiler, whereas the `-famd-opt` flag requires
|
||||
this to be handled explicitly.
|
||||
```
|
||||
:::
|
||||
|
||||
#### `-fstruct-layout=[1,2,3,4,5,6,7]`
|
||||
|
||||
@@ -255,12 +262,12 @@ loop. The heuristic can be controlled with the following options:
|
||||
|
||||
Where, `n` is a positive integer and higher value of `<n>` facilitates more unswitching.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
These options may facilitate more unswitching under some workloads. Since
|
||||
loop-unswitching inherently leads to code bloat, facilitating more
|
||||
unswitching may significantly increase the code size. Hence, it may also lead
|
||||
to longer compilation times.
|
||||
```
|
||||
:::
|
||||
|
||||
##### `-enable-strided-vectorization`
|
||||
|
||||
@@ -451,11 +458,11 @@ supports ASM statements, their use is not recommended for the following reasons:
|
||||
* Writing correct ASM statements is often difficult; we strongly recommend
|
||||
thorough testing of any use of ASM statements.
|
||||
|
||||
```{note}
|
||||
:::{note}
|
||||
For developers who choose to include ASM statements in the code, AMD is
|
||||
interested in understanding the use case and appreciates feedback at
|
||||
[https://github.com/RadeonOpenCompute/ROCm/issues](https://github.com/RadeonOpenCompute/ROCm/issues)
|
||||
```
|
||||
:::
|
||||
|
||||
### Miscellaneous OpenMP compiler features
|
||||
|
||||
|
||||
@@ -1,23 +1,33 @@
|
||||
# ROCm release history
|
||||
|
||||
| Version | Release Date |
|
||||
| ------- | ------------ |
|
||||
| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
|
||||
| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
|
||||
| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
|
||||
| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 |
|
||||
| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 |
|
||||
| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 |
|
||||
| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 |
|
||||
| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 |
|
||||
| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 |
|
||||
| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 |
|
||||
| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 |
|
||||
| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 |
|
||||
| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 |
|
||||
| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 |
|
||||
| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 |
|
||||
| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 |
|
||||
| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 |
|
||||
| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 |
|
||||
| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 |
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="description" content="ROCm release history">
|
||||
<meta name="keywords" content="documentation, release history, ROCm, AMD">
|
||||
</head>
|
||||
|
||||
# ROCm release history
|
||||
|
||||
| Version | Release date |
|
||||
| ------- | ------------ |
|
||||
| [6.0.0](https://rocm.docs.amd.com/en/docs-6.0.0/) | Dec 15, 2023 |
|
||||
| [5.7.1](https://rocm.docs.amd.com/en/docs-5.7.1/) | Oct 13, 2023 |
|
||||
| [5.7.0](https://rocm.docs.amd.com/en/docs-5.7.0/) | Sep 15, 2023 |
|
||||
| [5.6.1](https://rocm.docs.amd.com/en/docs-5.6.1/) | Aug 29, 2023 |
|
||||
| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
|
||||
| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
|
||||
| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
|
||||
| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 |
|
||||
| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 |
|
||||
| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 |
|
||||
| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 |
|
||||
| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 |
|
||||
| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 |
|
||||
| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 |
|
||||
| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 |
|
||||
| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 |
|
||||
| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 |
|
||||
| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 |
|
||||
| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 |
|
||||
| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 |
|
||||
| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 |
|
||||
| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 |
|
||||
| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 |
|
||||
@@ -7,60 +7,39 @@ root: index
|
||||
subtrees:
|
||||
- entries:
|
||||
- file: what-is-rocm.md
|
||||
- file: about/whats-new/whats-new.md
|
||||
|
||||
- caption: Installation
|
||||
entries:
|
||||
- file: install/windows/install-quick.md
|
||||
title: Quick start (Windows)
|
||||
- file: install/windows/install.md
|
||||
title: Windows install guide
|
||||
subtrees:
|
||||
- entries:
|
||||
- file: install/windows/windows-app-deployment-guidelines.md
|
||||
title: Application deployment guidelines
|
||||
- file: install/docker.md
|
||||
title: ROCm Docker containers
|
||||
- file: install/pytorch-install.md
|
||||
title: PyTorch for ROCm
|
||||
- file: install/tensorflow-install.md
|
||||
title: Tensorflow for ROCm
|
||||
- file: install/magma-install.md
|
||||
title: MAGMA for ROCm
|
||||
- file: install/spack-intro.md
|
||||
title: ROCm & Spack
|
||||
|
||||
- caption: Compatibility & support
|
||||
entries:
|
||||
- file: about/compatibility/linux-support.md
|
||||
title: Linux (GPU & OS)
|
||||
- file: about/compatibility/windows-support.md
|
||||
title: Windows (GPU & OS)
|
||||
- file: about/compatibility/3rd-party-support-matrix.md
|
||||
title: Third-party
|
||||
- file: about/compatibility/user-kernel-space-compat-matrix.md
|
||||
title: User/kernel space support
|
||||
- file: about/compatibility/docker-image-support-matrix.rst
|
||||
title: Docker
|
||||
- file: about/compatibility/openmp.md
|
||||
title: OpenMP
|
||||
|
||||
- caption: Release information
|
||||
entries:
|
||||
- file: about/release-notes.md
|
||||
title: Release notes
|
||||
- file: about/CHANGELOG.md
|
||||
title: Changelog
|
||||
- file: about/release-history.md
|
||||
title: Release history
|
||||
subtrees:
|
||||
- entries:
|
||||
- file: about/CHANGELOG.md
|
||||
title: Changelog
|
||||
- url: https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue
|
||||
title: Known issues
|
||||
|
||||
- caption: Install
|
||||
entries:
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/
|
||||
title: ROCm on Linux
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/
|
||||
title: HIP SDK on Windows
|
||||
|
||||
- caption: Supported configurations
|
||||
entries:
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/reference/system-requirements.html
|
||||
title: Linux
|
||||
- url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/reference/system-requirements.html
|
||||
title: Windows
|
||||
|
||||
- caption: Reference
|
||||
entries:
|
||||
- file: reference/library-index.md
|
title: API libraries & tools

- caption: How-to
entries:
- file: how-to/deep-learning-rocm.md
title: Deep learning
- file: how-to/gpu-enabled-mpi.md
- file: how-to/gpu-enabled-mpi.rst
title: Using MPI
- file: how-to/system-debugging.md
title: Debugging
@@ -77,79 +56,6 @@ subtrees:
- url: https://github.com/amd/rocm-examples
title: GitHub examples

- caption: Reference
entries:
- file: reference/library-index.md
title: API libraries & tools
subtrees:
- entries:
- url: ${project:composable_kernel}
title: Composable kernel
- url: ${project:hipblas}
title: hipBLAS
- url: ${project:hipblaslt}
title: hipBLASLt
- url: ${project:hipcc}
title: hipCC
- url: ${project:hipcub}
title: hipCUB
- url: ${project:hipfft}
title: hipFFT
- url: ${project:hipify}
title: HIPIFY
- url: ${project:hiprand}
title: hipRAND
- url: ${project:hip}
title: HIP runtime
- url: ${project:hipsolver}
title: hipSOLVER
- url: ${project:hipsparse}
title: hipSPARSE
- url: ${project:hipsparselt}
title: hipSPARSELt
- url: ${project:hiptensor}
title: hipTensor
- url: ${project:miopen}
title: MIOpen
- url: ${project:amdmigraphx}
title: MIGraphX
- url: ${project:rccl}
title: RCCL
- url: ${project:rocalution}
title: rocALUTION
- url: ${project:rocblas}
title: rocBLAS
- url: ${project:rocdbgapi}
title: ROCdbgapi
- url: ${project:rocfft}
title: rocFFT
- file: reference/rocmcc.md
title: ROCmCC
- url: ${project:rdc}
title: ROCm Data Center Tool
- url: ${project:rocm_smi_lib}
title: ROCm SMI LIB
- url: ${project:rocmvalidationsuite}
title: ROCm validation suite
- url: ${project:rocprim}
title: rocPRIM
- url: ${project:rocprofiler}
title: ROCProfiler
- url: ${project:rocrand}
title: rocRAND
- url: ${project:rocsolver}
title: rocSOLVER
- url: ${project:rocsparse}
title: rocSPARSE
- url: ${project:rocthrust}
title: rocThrust
- url: ${project:roctracer}
title: rocTracer
- url: ${project:rocwmma}
title: rocWMMA
- url: ${project:transferbench}
title: TransferBench

- caption: Conceptual
entries:
- file: conceptual/gpu-arch.md
@@ -178,12 +84,14 @@ subtrees:
title: GPU memory
- file: conceptual/compiler-disambiguation.md
title: Compiler disambiguation
- file: about/compatibility/openmp.md
title: OpenMP
- file: conceptual/file-reorg.md
title: File structure (Linux FHS)
- file: conceptual/gpu-isolation.md
title: GPU isolation techniques
- file: conceptual/using-gpu-sanitizer.md
title: LLVN ASan
title: LLVM ASan
- file: conceptual/cmake-packages.rst
title: Using CMake
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
@@ -196,14 +104,19 @@ subtrees:
- caption: Contribute
entries:
- file: contribute/index.md
title: Contribute to ROCm docs
title: Contribute to ROCm
subtrees:
- entries:
- file: contribute/toolchain.md
title: Documentation tools
- file: contribute/building.md
title: Building documentation
- file: contribute/feedback.md
title: Providing feedback
- file: contribute/contribute-docs.md
title: Contribute to ROCm docs
subtrees:
- entries:
- file: contribute/toolchain.md
title: Documentation tools
- file: contribute/building.md
title: Building documentation
- file: contribute/feedback.md
title: Provide feedback
- file: about/license.md
title: ROCm license

@@ -1 +1 @@
rocm-docs-core==0.26.0
rocm-docs-core==0.33.0

@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements.in
@@ -40,17 +40,17 @@ fastjsonschema==2.16.3
# via rocm-docs-core
gitdb==4.0.10
# via gitpython
gitpython==3.1.30
gitpython==3.1.41
# via rocm-docs-core
idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==7.0.0
# via sphinx
importlib-resources==6.1.0
importlib-resources==6.1.1
# via rocm-docs-core
jinja2==3.1.2
jinja2==3.1.3
# via
# myst-parser
# sphinx
@@ -84,7 +84,9 @@ pygments==2.15.0
# pydata-sphinx-theme
# sphinx
pyjwt[crypto]==2.6.0
# via pygithub
# via
# pygithub
# pyjwt
pynacl==1.5.0
# via pygithub
pytz==2022.7.1
@@ -98,7 +100,7 @@ requests==2.31.0
# via
# pygithub
# sphinx
rocm-docs-core==0.26.0
rocm-docs-core==0.33.0
# via -r requirements.in
smmap==5.0.0
# via gitdb

@@ -18,7 +18,7 @@ Installation of various deep learning frameworks and applications.
:::

:::{grid-item-card}
**[GPU-enabled MPI](./gpu-enabled-mpi.md)**
**[GPU-enabled MPI](./gpu-enabled-mpi.rst)**

This chapter exemplifies how to set up Open MPI with the ROCm platform.

@@ -29,11 +29,11 @@ To implement a workaround, follow these steps:
roc-obj-ls -v $TORCHDIR/lib/libtorch_hip.so # check for gfx target
```

```{note}
:::{note}
Recompile PyTorch with the right gfx target if compiling from the source if
the hardware is not supported. For wheels or Docker installation, contact
ROCm support [^ROCm_issues].
```
:::

**Q: Why am I unable to access Docker or GPU in user accounts?**

@@ -43,7 +43,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`.
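The group-permission fix referenced above usually amounts to adding the account to the GPU device groups. A minimal sketch, assuming the standard `video` and `render` groups used by ROCm on Linux (log out and back in afterward):

```sh
# Add the current user to the groups that own /dev/kfd and the /dev/dri render nodes
sudo usermod -aG video,render $LOGNAME
```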
**Q: Can I install PyTorch directly on bare metal?**

Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to
Option 2: Install PyTorch Using Wheels Package. See [Installing PyTorch](../install/pytorch-install.md) for more information.
Option 2: Install PyTorch Using Wheels Package. See {doc}`PyTorch for ROCm<rocm-install-on-linux:pytorch-install>` for more information.

**Q: How do I profile PyTorch workloads?**

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="What is ROCm">
<meta name="keywords" content="documentation, projects, introduction, ROCm, AMD">
</head>

# What is ROCm?

ROCm is an open-source stack, composed primarily of open-source software, designed for
@@ -19,6 +25,11 @@ ROCm supports programming models, such as OpenMP and OpenCL, and includes all ne
source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning
(ML) frameworks, such as PyTorch and TensorFlow.

```{tip}
If you're using Radeon GPUs, refer to the
{doc}`Radeon-specific ROCm documentation<radeon:index>`
```

## ROCm projects

ROCm consists of the following drivers, development tools, and APIs.
@@ -70,7 +81,7 @@ ROCm consists of the following drivers, development tools, and APIs.
| [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/profiler_home_page.html) | A profiling tool for HIP applications |
| [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/latest/) | Provides functions that generate pseudorandom and quasirandom numbers |
| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents |
| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on the ROCm platform, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs |
| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs |
| [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) | Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language) |
| [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/latest/) | A parallel algorithm library |
| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | User-mode API interfaces used to interact with the ROCk driver |

@@ -2,6 +2,7 @@

## Pre-requisites

* Python 3.10
* Create a GitHub Personal Access Token.
* Tested with all the read-only permissions, but public_repo, read:project read:user, and repo:status should be enough.
* Copy the token somewhere safe.
@@ -17,23 +18,16 @@
* Run this for 5.6.0 (change for whatever version you require)
* `GITHUB_ACCESS_TOKEN=my_token_here`

<<<<<<< HEAD
To generate the changelog from 5.0.0 up to and including 5.7.0:
To generate the changelog from 5.0.0 up to and including 6.0.1:

```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.0
=======
To generate the changelog from 5.0.0 up to and including 5.7.1:

```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.1
>>>>>>> roc-5.7.x
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
```

To generate the changelog only for 5.7.1:
To generate the changelog only for 6.0.1:

```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.1
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
```

### Notes

@@ -227,9 +227,9 @@ def run_tagging():

# Creates a collection of ROCm libraries grouped by release.
release_bundle_factory = ReleaseBundleFactory(
"RadeonOpenCompute/ROCm",
"ROCm/ROCm",
Github(**gh_args), Github(**pr_args),
"RadeonOpenCompute",
"ROCm",
remote_map,
args.branch
)

@@ -1,4 +1,4 @@
# Release Notes
# Release notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->

@@ -16,7 +16,7 @@

<!-- spellcheck-disable -->

The release notes for the ROCm platform.
This page contains the release notes for AMD ROCm Software.

{%- for version, release in releases %}

@@ -27,7 +27,7 @@ The release notes for the ROCm platform.
{%- set rocm_changes = "./rocm_changes/" ~ version ~ ".md" %}
{% include rocm_changes ignore missing %}

### Library Changes in ROCM {{version}}
### Library changes in ROCM {{version}}

| Library | Version |
|---------|---------|

@@ -16,27 +16,27 @@ Refer to the HIP Installation Guide v5.0 for more details.

Managed memory, including the `__managed__` keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.

> **Note**
>
> In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,
>
> ```cpp
> int managed_memory = 0;
> HIPCHECK(hipDeviceGetAttribute(&managed_memory,
> hipDeviceAttributeManagedMemory,p_gpuDevice));
> if (!managed_memory ) {
> printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
> }
> else {
> HIPCHECK(hipSetDevice(p_gpuDevice));
> HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
> . . .
> }
> ```
:::{note}
In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,

> **Note**
>
> The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
```cpp
int managed_memory = 0;
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
hipDeviceAttributeManagedMemory,p_gpuDevice));
if (!managed_memory ) {
printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
}
else {
HIPCHECK(hipSetDevice(p_gpuDevice));
HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
. . .
}
```
:::

:::{note}
The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
:::

Refer to the HIP API documentation for more details on managed memory APIs.

@@ -264,13 +264,17 @@ typedef enum hipDeviceAttribute_t {

#### Incorrect dGPU behavior when using AMDVBFlash tool

The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU to a sleep state.
The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the
ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime
power management feature, puts the dGPU to a sleep state.

As a workaround, users can run amdgpu.runpm=0, which temporarily disables the runtime power management feature from the driver and dynamically changes some power control-related sysfs files.
As a workaround, users can run amdgpu.runpm=0, which temporarily disables the runtime power
management feature from the driver and dynamically changes some power control-related sysfs files.

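`amdgpu.runpm=0` is a kernel module parameter, so the workaround is applied at boot rather than as a command. A hedged sketch assuming a GRUB-based distribution; the file path and update command are typical for Ubuntu/Debian and may differ elsewhere:

```sh
# /etc/default/grub: append the parameter to the kernel command line, for example
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0"
# then regenerate the GRUB configuration and reboot
sudo update-grub
sudo reboot
```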
#### Issue with START timestamp in ROCProfiler

Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:
Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple
counters. ROCProfiler outputs the following four timestamps for each kernel:

* Dispatch
* Start
@@ -279,7 +283,8 @@ Users may encounter an issue with the enabled timestamp functionality for monito

##### Issue

This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.
This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than
the Dispatch timestamp.

To reproduce the issue,

@@ -301,20 +306,22 @@ The correct order is:

Dispatch < Start < End < Complete

Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect timestamp with counter collection enabled.
Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect
timestamp with counter collection enabled.

##### Recommended workaround

Users are recommended to collect kernel execution timestamps without monitoring counters, as follows:
Users are recommended to collect kernel execution timestamps without monitoring counters, as
follows:

1. Enable timing using the --timestamp on flag, and run the application.

2. Rerun the application using the -i option with the input filename that contains the name of the counter(s) to monitor, and save this to a different output file using the -o flag.
2. Rerun the application using the -i option with the input filename that contains the name of the
counter(s) to monitor, and save this to a different output file using the -o flag.

3. Check the output result file from step 1.

4. The order of timestamps correctly displays as:
DispatchNS < BeginNS < EndNS < CompleteNS
4. The order of timestamps correctly displays as: DispatchNS < BeginNS < EndNS < CompleteNS

5. Users can find the values of the collected counters in the output file generated in step 2.

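A hedged sketch of the two runs described above using the `rocprof` command-line tool; `./my_app`, `counters.txt`, and the output file names are placeholders:

```sh
# Run 1: timestamps only (no counters), used for kernel timing
rocprof --timestamp on -o timing.csv ./my_app

# Run 2: counter collection only, saved to a separate output file
rocprof -i counters.txt -o counters.csv ./my_app
```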
@@ -322,17 +329,21 @@ Users are recommended to collect kernel execution timestamps without monitoring

##### No support for SMI and ROCDebugger on SRIOV

System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the Systems Management Interface documentation.
System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment
on any GPU. For more information, refer to the Systems Management Interface documentation.

### Deprecations and warnings

#### ROCm libraries changes – deprecations and deprecation removal

* The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.
* The `hipFFT.h` header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get
`hipFFT.h` in the rocFFT package too.

* The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class instead.
* The GlobalPairwiseAMG class is now entirely removed, users should use the PairwiseAMG class
instead.

* The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm
* The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0,
rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm
rocsparse_spmm in 5.0

```cpp
@@ -374,11 +385,15 @@ System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV

In this release, arithmetic operators of HIP complex and vector types are deprecated.

* As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types.
* As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of
`std::complex` types.

* As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.
* As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native
clang vector type associated with the data member of HIP vector types.

During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and `_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.
During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and
`_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic
operators of HIP complex or vector types.

Note, the two macros are mutually exclusive and, by default, set to Off.

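The notes do not spell out how these macros are meant to be set; a hedged sketch assuming they are ordinary preprocessor definitions passed at compile time (enable only one of the two, since they are mutually exclusive, and confirm the expected value against the HIP headers for your release):

```sh
# Opt back in to HIP complex-type operators during the deprecation window (value is an assumption)
hipcc -D_HIP_ENABLE_COMPLEX_OPERATORS=1 my_kernels.hip -o my_app
```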
@@ -388,7 +403,8 @@ Refer to the HIP API Guide for more information.

#### Warning - compiler-generated code object version 4 deprecation

Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.
Support for loading compiler-generated code object version 4 will be deprecated in a future release
with no release announcement and replaced with code object 5 as the default version.

The current default is code object version 4.

@@ -3,10 +3,17 @@

#### Refactor of HIPCC/HIPCONFIG

In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately.
In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target
compiler options, target platform, compiler, and runtime appropriately.

In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as the compiled binary implementations of the hipcc and hipconfig. These new binaries are currently a work-in-progress, considered, and marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in the a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to hipcc.pl and hipconfig.pl respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the Perl script or the compiled binary based on the environment variable HIPCC_USE_PERL_SCRIPT.
In ROCm v5.0.1, hipcc.bin and hipconfig.bin have been added as the compiled binary implementations
of the hipcc and hipconfig. These new binaries are currently a work-in-progress, considered, and
marked as experimental. ROCm plans to fully transition to hipcc.bin and hipconfig.bin in the a future
ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to `hipcc.pl` and `hipconfig.pl`
respectively. New top-level hipcc and hipconfig Perl scripts are created, which can switch between the
Perl script or the compiled binary based on the environment variable `HIPCC_USE_PERL_SCRIPT`.

In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts.
In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl
scripts.

Subsequently, Perl scripts will no longer be available in ROCm in a future release.
Subsequent Perl scripts will no longer be available in ROCm in a future release.

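A hedged sketch of toggling the switch described above; the exact value the wrapper scripts check is not stated here, so treat `1`/unset as an assumption and confirm against the hipcc documentation for your release:

```sh
# Route the top-level hipcc/hipconfig wrappers to the Perl scripts (the ROCm 5.0.1 default)
export HIPCC_USE_PERL_SCRIPT=1
hipconfig --version

# Unset it to try the experimental compiled hipcc.bin/hipconfig.bin instead
unset HIPCC_USE_PERL_SCRIPT
```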
@@ -1,18 +1,26 @@
<!-- markdownlint-disable first-line-h1 -->
### Fixed defects
### Defect fixes

The following defects are fixed in the ROCm v5.0.2 release.

#### Issue with hostcall facility in HIP runtime

In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the “assert()” call.
In ROCm v5.0, when using the “assert()” call in a HIP kernel, the compiler may sometimes fail to emit
kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall
facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the
“assert()” call.

The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.
The root cause was an incorrect check in the compiler to determine whether the hostcall facility is
required by the kernel. This is fixed in the ROCm v5.0.2 release.

The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the “assert()” call never fails.
The resolution includes a compiler change, which emits the required metadata by default, unless the
compiler can prove that the hostcall facility is not required by the kernel. This ensures that the
“assert()” call never fails.

Note:
This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release.
Compatibility Matrix Updates to the [Deep-learning guide](./how-to/deep-learning-rocm.md)
:::{note}
This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region
and result in an abort in device code. The issue will be fixed in a future release.
:::

The compatibility matrix in the [Deep-learning guide](./how-to/deep-learning-rocm.md) is updated for ROCm v5.0.2.
The compatibility matrix in the [Deep-learning guide](./how-to/deep-learning-rocm.md) is updated for
ROCm v5.0.2.

@@ -8,7 +8,8 @@ The ROCm v5.1 release consists of the following HIP enhancements.

##### HIP installation guide updates

The HIP Installation Guide is updated to include installation and building HIP from source on the AMD and NVIDIA platforms.
The HIP installation guide now includes information on installing and building HIP from source on
AMD and NVIDIA platforms.

Refer to the HIP Installation Guide v5.1 for more details.

@@ -20,11 +21,14 @@ ROCm v5.1 extends support for HIP Graph.

###### Separation of hiprtc (libhiprtc) library from hip runtime (amdhip64)

On ROCm/Linux, to maintain backward compatibility, the hipruntime library (amdhip64) will continue to include hiprtc symbols in future releases. The backward compatible support may be discontinued by removing hiprtc symbols from the hipruntime library (amdhip64) in the next major release.
On ROCm/Linux, to maintain backward compatibility, the hipruntime library (amdhip64) will continue
to include hiprtc symbols in future releases. The backward compatible support may be discontinued by
removing hiprtc symbols from the hipruntime library (amdhip64) in the next major release.

###### hipDeviceProp_t structure enhancements

Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases.
Changes to the hipDeviceProp_t structure in the next major release may result in backward
incompatibility. More details on these changes will be provided in subsequent releases.

#### ROCDebugger enhancements

@@ -34,15 +38,19 @@ The compiler now generates a source-level variable and function argument debug i

The accuracy is guaranteed if the compiler options `-g -O0` are used and apply only to HIP.

This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments.
This enhancement enables ROCDebugger users to interact with the HIP source-level variables and
function arguments.

> **Note**
>
> The newly-suggested compiler -g option must be used instead of the previously-suggested `-ggdb` option. Although the effect of these two options is currently equivalent, this is not guaranteed for the future and might get changed by the upstream LLVM community.
:::{note}
The newly-suggested compiler -g option must be used instead of the previously-suggested `-ggdb`
option. Although the effect of these two options is currently equivalent, this is not guaranteed for the
future, as changes might be made by the upstream LLVM community.
:::

##### Machine interface lanes support

ROCDebugger Machine Interface (MI) extends support to lanes. The following enhancements are made:
ROCDebugger Machine Interface (MI) extends support to lanes, which includes the following
enhancements:

* Added a new -lane-info command, listing the current thread's lanes.

@@ -52,24 +60,29 @@ ROCDebugger Machine Interface (MI) extends support to lanes. The following enhan
-thread-select -l LANE THREAD
```

* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected.
* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which
lane of the thread was selected.

* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop.
* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates
which lane is selected, and the latter indicates which lanes explain the stop.

* MI commands now accept a global --lane option, similar to the global --thread and --frame options.

* MI varobjs are now lane-aware.

For more information, refer to the ROC Debugger User Guide at
{doc}`ROCgdb <rocgdb:index>`.
For more information, refer to the ROC Debugger User Guide at {doc}`ROCgdb <rocgdb:index>`.

##### Enhanced - clone-inferior command

The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the 'set environment' or 'unset environment' commands are also copied to the new inferior.
The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY
settings are copied from the original inferior to the new one. All modifications to the environment
variables done using the 'set environment' or 'unset environment' commands are also copied to the
new inferior.

#### MIOpen support for RDNA GPUs

This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below:
This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and
performance improvements as listed below:

* MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)

@@ -87,11 +100,13 @@ For more information, see {doc}`Documentation <miopen:index>`.

#### Checkpoint restore support with CRIU

The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications.
The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU
and ROCm applications.

CRIU is a userspace tool to Checkpoint and Restore an application.

CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:
CRIU lacked the support for checkpoint restore applications that used device files such as a GPU. With
this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:

* Single and Multi GPU systems (Gfx9)
* Checkpoint / Restore on a different system
@@ -100,15 +115,19 @@ CRIU lacked the support for checkpoint restore applications that used device fil
* TensorFlow
* Using CRIU Image Streamer

For more information, refer to <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>
For more information, refer to
<https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>

> **Note**
>
> The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch.
:::{note}
The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel
patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1
release branch.
:::

> **Note**
>
> This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release.
:::{note}
This is a Beta release of the Checkpoint and Restore functionality, and some features are not available
in this release.
:::

For more information, refer to the following websites:

@@ -116,7 +135,7 @@ For more information, refer to the following websites:

* <https://criu.org/Main_Page>

### Fixed defects
### Defect fixes

The following defects are fixed in this release.

@@ -126,37 +145,48 @@ The issue with the driver failing to load after ROCm installation is now fixed.

The driver installs successfully, and the server reboots with working rocminfo and clinfo.

#### ROCDebugger fixed defects
#### ROCDebugger defect fixes

##### Breakpoints in GPU kernel code before kernel is loaded

Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code.
Previously, setting a breakpoint in device code by line number before the device code was loaded into
the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that
contains host code.

Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel.
Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a
location in the kernel.

##### Registers invalidated after write

Previously, the stale just-written value was presented as a current value.

ROCgdb now invalidates the cached values of registers whose content might differ after being written. For example, registers with read-only bits.
ROCgdb now invalidates the cached values of registers whose content might differ after being written.
For example, registers with read-only bits.

ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change.
ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing
VCC invalidates the content of STATUS as STATUS.VCCZ may change.

##### Scheduler-locking and GPU wavefronts

When scheduler-locking is in effect, new wavefronts created by a resumed thread, CPU, or GPU wavefront, are held in the halt state. For example, the "set scheduler-locking" command.
When scheduler-locking is in effect, new wavefronts created by a resumed thread, CPU, or GPU
wavefront, are held in the halt state. For example, the "set scheduler-locking" command.

##### ROCDebugger fails before completion of kernel execution

It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information.
It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the
device, start executing a kernel on the device, and dispose of the original code before the kernel had
finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error
while trying to access the debug information.

This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information.
This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and
debug information.

### Known issues

#### Random memory access fault errors observed while running math libraries unit tests

**Issue:** Random memory access fault issues are observed while running Math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.
**Issue:** Random memory access fault issues are observed while running Math libraries unit tests.
This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.

Note, the faults only occur in the SRIOV environment.

@@ -178,13 +208,15 @@ Where expectation is 0.

#### CU masking causes application to freeze

Using CU Masking results in an application freeze or runs exceptionally slowly. This issue is noticed only in the GFX10 suite of products. Note, this issue is observed only in GFX10 suite of products.
Using CU Masking results in an application freeze or runs exceptionally slowly. This issue is noticed
only in the GFX10 suite of products. Note, this issue is observed only in GFX10 suite of products.

This issue is under active investigation at this time.

#### Failed checkpoint in Docker containers

A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID.
A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS
results in incorrect reporting of the mount ID.

This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.

@@ -192,8 +224,8 @@ As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-

#### Issue with restoring workloads using cooperative groups feature

Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly.
This issue is under investigation and will be fixed in a future release.
Workloads that use the cooperative groups function to ensure all waves can be resident at the same
time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.

#### Radeon Pro V620 and W6800 workstation GPUs

@@ -8,21 +8,26 @@ The ROCm v5.2 release consists of the following HIP enhancements:

##### HIP installation guide updates

The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.
The HIP Installation Guide is updated to include building HIP tests from source on the AMD and
NVIDIA platforms.

For more details, refer to the HIP Installation Guide v5.2.

##### Support for device-side malloc on HIP-Clang

HIP-Clang now supports device-side malloc. This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallocHeapSize,value)` nor respect any setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.
HIP-Clang now supports device-side malloc. This implementation does not require the use of
`hipDeviceSetLimit(hipLimitMallocHeapSize,value)` nor respect any setting. The heap is fully dynamic
and can grow until the available free memory on the device is consumed.

The test codes at the following link show how to implement applications using malloc and free functions in device kernels:
The test codes at the following link show how to implement applications using malloc and free
functions in device kernels:

<https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp>

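Beyond the linked test, a minimal hedged sketch of the idea; the kernel name, sizes, and launch configuration are illustrative only:

```cpp
#include <hip/hip_runtime.h>

// Each thread allocates a small scratch buffer from the device heap, uses it,
// and frees it. No hipDeviceSetLimit(hipLimitMallocHeapSize, ...) call is needed
// on HIP-Clang; the heap grows dynamically.
__global__ void scratch_kernel(int *out) {
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int *scratch = static_cast<int *>(malloc(4 * sizeof(int)));
    if (scratch == nullptr) { out[tid] = -1; return; }
    for (int i = 0; i < 4; ++i) scratch[i] = tid + i;
    out[tid] = scratch[0] + scratch[3];
    free(scratch);
}

int main() {
    int *out = nullptr;
    hipMalloc(&out, 64 * sizeof(int));
    hipLaunchKernelGGL(scratch_kernel, dim3(1), dim3(64), 0, 0, out);
    hipDeviceSynchronize();
    hipFree(out);
    return 0;
}
```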
##### New HIP APIs in this release

The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:
The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official
version (beta) release of the new APIs:

###### Device management HIP APIs

@@ -34,13 +39,11 @@ The new device management HIP APIs are as follows:
hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
```

> **Note**
>
> This new API corresponds to the following CUDA API:
>
> ```cpp
> CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
> ```
Note that this new API corresponds to the following CUDA API:

```cpp
CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
```

* Gets default memory pool of the specified device

@@ -62,7 +65,7 @@ The new device management HIP APIs are as follows:

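A hedged usage sketch of the `hipDeviceGetUuid` API quoted above; the device-handle lookup via `hipDeviceGet` and the `bytes` field of `hipUUID` are assumptions to verify against the HIP headers for your release:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipInit(0);
    hipDevice_t device;
    hipDeviceGet(&device, 0);          // first GPU

    hipUUID uuid;
    hipDeviceGetUuid(&uuid, device);   // new in ROCm v5.2 (beta)

    // Print the 16 raw UUID bytes in hex (assumes hipUUID exposes a bytes[16] member)
    for (int i = 0; i < 16; ++i)
        printf("%02x", static_cast<unsigned char>(uuid.bytes[i]));
    printf("\n");
    return 0;
}
```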
###### New HIP runtime APIs in memory management

The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are as follows:
The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are:

* Allocates memory with stream ordered semantics

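The hunk elides the individual signatures; as a hedged illustration, the stream-ordered allocator is exposed through calls such as `hipMallocAsync`/`hipFreeAsync`, which a caller might use like this (buffer size and stream setup are illustrative):

```cpp
#include <hip/hip_runtime.h>

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);

    // Allocation and free are ordered with respect to the other work on `stream`
    void *buf = nullptr;
    hipMallocAsync(&buf, 1 << 20, stream);   // 1 MiB, stream-ordered
    hipMemsetAsync(buf, 0, 1 << 20, stream);
    hipFreeAsync(buf, stream);

    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
    return 0;
}
```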
@@ -180,7 +183,7 @@ The new HIP Graph Management APIs are as follows:
* Gets a node attribute

```cpp
hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
```

###### Support for virtual memory management APIs
@@ -244,7 +247,7 @@ The new APIs for virtual memory management are as follows:
* Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays

```cpp
hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
```

* Release a memory handle representing a memory allocation, that was previously allocated through hipMemCreate
@@ -272,45 +275,71 @@ The new APIs for virtual memory management are as follows:
```

For more information, refer to the HIP API documentation at
{doc}`hip:.doxygen/docBin/html/modules`.
{doc}`hip:doxygen/html/modules`.

##### Planned HIP changes in future releases

Changes to `hipDeviceProp_t`, `HIPMEMCPY_3D`, and `hipArray` structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.
Changes to `hipDeviceProp_t`, `HIPMEMCPY_3D`, and `hipArray` structures (and related HIP APIs) are
planned in the next major release. These changes may impact backward compatibility.

Refer to the Release Notes document in subsequent releases for more information.
ROCm Math and Communication Libraries
Refer to the release notes in subsequent releases for more information.

In this release, ROCm Math and Communication Libraries consist of the following enhancements and fixes:
New rocWMMA for Matrix Multiplication and Accumulation Operations Acceleration
#### ROCm math and communication libraries

This release introduces a new ROCm C++ library for accelerating mixed-precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.
In this release, ROCm math and communication libraries consist of the following enhancements and
fixes:

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
* New rocWMMA for matrix multiplication and accumulation operations acceleration

For more information, refer to
[Communication Libraries](./docs/reference/library-index.md)
This release introduces a new ROCm C++ library for accelerating mixed-precision matrix multiplication
and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a
C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using
them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a
header library of GPU device code, meaning matrix core acceleration may be compiled directly into
your kernel device code. This can benefit from compiler optimization in the generation of kernel
assembly and does not incur additional overhead costs of linking to external runtime libraries or having
to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and
illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation
given the heavy precedent for the library. However, the usage portfolio is growing significantly and
demonstrates different ways rocWMMA may be consumed.

For more information, refer to [Communication Libraries](../reference/library-index.md)

#### OpenMP enhancements in this release

##### OMPT target support

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the
OpenMP specification document. These are APIs that allow first-party tools to examine the profile
and traces for kernels that execute on a device. A tool may register callbacks for data transfer and
kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities,
such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled,
trace records for device activities are collected during program execution and returned to the tool
using the APIs described in the specification.

Following is an example demonstrating how a tool would use the OMPT target APIs supported. The README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the provided example as indicated below:
Following is an example demonstrating how a tool would use the OMPT target APIs supported. The
README in /opt/rocm/llvm/examples/tools/ompt outlines the steps to follow, and you can run the
provided example as indicated below:

```sh
cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run
```

The file `veccopy-ompt-target-tracing.c` simulates how a tool would initiate device activity tracing. The file `callbacks.h` shows the callbacks that may be registered and implemented by the tool.
The file `veccopy-ompt-target-tracing.c` simulates how a tool would initiate device activity tracing. The
file `callbacks.h` shows the callbacks that may be registered and implemented by the tool.

### Deprecations and warnings

#### Linux file system hierarchy standard for ROCm

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.

##### New file system hierarchy

@@ -346,23 +375,27 @@ The following is the new file system hierarchy:

```

> **Note**
>
> ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
:::{note}
ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major
release.
:::

For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.

##### Backward compatibility with older file systems

ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
ROCm has moved header files and libraries to its new location as indicated in the above structure and
included symbolic-link and wrapper header files in its old location for backward compatibility.

> **Note**
>
> ROCm will continue supporting backward compatibility until the next major release.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::

##### Wrapper header files

Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:

```cpp
// Code snippet from hip_runtime.h
@@ -379,7 +412,8 @@ The wrapper header files’ backward compatibility deprecation is as follows:

##### Library files

Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.

Example:

@@ -392,7 +426,9 @@ lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64

##### CMake config files

All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of
a soft link to the new CMake config.

Example:

@@ -404,20 +440,26 @@ lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake

#### Planned deprecation of hip-rocclr and hip-base packages

In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.
In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for
deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace
these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and
hip-dev, to get the same set of packages installed by hip-rocclr previously.

Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported.
Deprecation of Integrated HIP Directed Tests
Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are
supported.

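On a Debian-based system, the replacement install described above could look like the following; on RPM-based distributions the `hip-devel` package name and `dnf`/`zypper` would be used instead:

```sh
# Replace the deprecated hip-rocclr/hip-base pair with the new package names
sudo apt update
sudo apt install hip-runtime-amd hip-dev
```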
The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in future release.
#### Deprecation of integrated HIP directed tests

### Fixed defects
The integrated HIP directed tests, which are currently built by default, are deprecated in this release.
The default building and execution support through CMake will be removed in future release.

| Fixed Defect | Fix |
|------------------------------------------------------------------------------|----------|
| ROCmInfo does not list gpus | Code fix |
| Hang observed while restoring cooperative group samples | Code fix |
| ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | Code fix |
### Defect fixes

| Defect | Fix |
|--------|------|
| ROCmInfo does not list gpus | code fix |
| Hang observed while restoring cooperative group samples | code fix |
| ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | code fix |

### Known issues

@@ -427,35 +469,44 @@ This section consists of known issues in this release.
|
||||
|
||||
##### Issue
|
||||
|
||||
A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet, which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.
|
||||
A compiler error occurs when using -O0 flag to compile code for gfx1030 that calls atomicAddNoRet,
|
||||
which is defined in amd_hip_atomic.h. The compiler generates an illegal instruction for gfx1030.
|
||||
|
||||
##### Workaround
|
||||
|
||||
The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.
|
||||
The workaround is not to use the -O0 flag for this case. For higher optimization levels, the compiler
|
||||
does not generate an invalid instruction.
|
||||
|
||||
#### System freeze observed during CUDA memtest checkpoint
|
||||
|
||||
##### Issue
|
||||
|
||||
Checkpoint/Restore in Userspace (CRIU) requires 20 MB of VRAM approximately to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.
|
||||
Checkpoint/Restore in Userspace (CRIU) requires 20 MB of VRAM approximately to checkpoint and
|
||||
restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to
|
||||
checkpoint applications.
|
||||
|
||||
##### Workaround
|
||||
|
||||
To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.
|
||||
To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses
|
||||
to ensure at least 20 MB is available.
|
||||
|
||||
#### HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error

##### Issue

The compiler may incorrectly compile a program that uses the `__shfl_sync(mask, value, srcLane)`
function when the "value" parameter to the function is undefined along some path to the function. For
most functions, uninitialized inputs cause undefined behavior, but the definition for `__shfl_sync` should
allow for undefined values.

##### Workaround

The workaround is to initialize the parameters to `__shfl_sync`.

:::{note}
When the `-Wall` compilation flag is used, the compiler generates a warning indicating the variable is
initialized along some path.
:::

Example:

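A minimal sketch of the workaround is shown below. The call `res = __shfl_sync(mask, res, 0);` is the line from the original example; the surrounding kernel, buffer names, and the full-wavefront mask are illustrative assumptions.

```cpp
__global__ void broadcast_lane0(const float* in, float* out, int n) {
  // Initialize res unconditionally so no undefined value can reach __shfl_sync,
  // even for threads whose index is out of range.
  float res = 0.0f;
  unsigned long long mask = 0xffffffffffffffffull;  // assumed full-wavefront mask
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    res = in[idx];
  }
  // Every lane now passes a defined value to the shuffle.
  res = __shfl_sync(mask, res, 0);
  if (idx < n) {
    out[idx] = res;
  }
}
```
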
##### Issue

In recent changes to Clang, insertion of the `noundef` attribute to all the function arguments has been
enabled by default.

In the HIP kernel, the variable `var` in `shfl_sync` may not be initialized, so LLVM IR treats it as undef.

So, the function argument that is potentially undef (because it is not initialized) has always been
assumed to be noundef by LLVM IR (since Clang has inserted the noundef attribute). This leads to
ambiguous kernel execution.

##### Workaround

* Skip adding the `noundef` attribute to functions tagged with the convergent attribute. Refer to
  <https://reviews.llvm.org/D124158> for more information.

* Introduce a shuffle attribute and add it to `__shfl`-like APIs in the HIP headers. Clang can skip adding
  the `noundef` attribute if it finds that the argument is tagged with the shuffle attribute. Refer to
  <https://reviews.llvm.org/D125378> for more information.

* Introduce a Clang builtin for `__shfl` to identify it and skip adding the `noundef` attribute.

* Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header
  needs to insert freezes on the relevant inputs.

#### Issue with applications triggering oversubscription

There is a known issue with applications that trigger oversubscription. A hardware hang occurs when
ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.

This issue is under investigation and will be fixed in a future release.

#### Ubuntu 18.04 end-of-life announcement

Support for Ubuntu 18.04 ends in this release. Future releases of ROCm will not provide prebuilt
packages for Ubuntu 18.04.

#### HIP runtime

##### Fixes

* A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the
  same kernel is called twice (with different argument values) in a graph capture, the implementation
  only kept the argument values for the second kernel call.

* A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the
  `hiprtcGetLoweredName` call to fail for named expressions with whitespace in it.

  Example:

  The named expression `my_sqrt<complex<double>>` passed but `my_sqrt<complex<double >>`
  failed.

#### RCCL

##### Additions

Compatibility with NCCL 2.12.10
* Added experimental support for using multiple ranks per device

  * Requires using a new interface to create a communicator (ncclCommInitRankMulti); refer to the
    interface documentation for details.

  * To avoid potential deadlocks, users might have to set environment variables increasing the
    number of hardware queues. For example,

    ```sh
    export GPU_MAX_HW_QUEUES=16
    ```

* Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1

  * When the "Call to bind failed: Address already in use" error happens in large-scale AlltoAll
    (for example, >=64 MI200 nodes), users are suggested to opt in to either one or both of the
    options to resolve the massive port usage issue.

  * Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is
    tuned >1.

##### Removals

* Removed experimental clique-based kernels

#### Development tools

No notable changes in this release for development tools, including the compiler, profiler, and
debugger.

#### Deployment and management tools

No notable changes in this release for deployment and management tools.

#### Older ROCm releases

For release information for older ROCm releases, refer to
<https://github.com/RadeonOpenCompute/ROCm/blob/master/CHANGELOG.md>

#### HIP Perl scripts deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

#### Linux file system hierarchy standard for ROCm

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.

##### New file system hierarchy

The following is the new file system hierarchy:

:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major
release.
:::

For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.

##### Backward compatibility with older file systems

ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic-link and wrapper header files in the old locations for backward compatibility.

:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::

##### Wrapper header files

Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:

```cpp
// Code snippet from hip_runtime.h
```

The wrapper header files’ backward compatibility deprecation is as follows:

##### Library files

Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.

Example:

lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64

##### CMake config files

All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of
a soft link to the new CMake config.

Example:

```
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```

### Defect fixes

The following defects are fixed in this release.

These defects were identified and documented as known issues in previous ROCm releases and are
fixed in the ROCm v5.3 release.

#### Kernel produces incorrect results with ROCm 5.2

User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of
the C++ standard suggests that failing to initialize these data constructs is undefined behavior.
However, a special case was added for a specific compiler builtin to handle the uninitialized data in a
defined manner.

The compiler fix consists of the following patches:

* A new `noundef` attribute is added. This attribute denotes when a function call argument or return
  value may never contain uninitialized bits. For more information, see
  <https://reviews.llvm.org/D81678>.
* The application of this attribute was refined such that it was not added to a specific compiler builtin
  where the compiler knows that inactive lanes do not impact program execution. For more
  information, see
  <https://github.com/RadeonOpenCompute/llvm-project/commit/accf36c58409268ca1f216cdf5ad812ba97ceccd>.

### Known issues

This section consists of known issues in this release.

#### Issue with OpenMP-extras package upgrade

The `openmp-extras` package has been split into runtime (`openmp-extras-runtime`) and dev
(`openmp-extras-devel`) packages. This change has broken the upgrade support for the
`openmp-extras` package in RHEL/SLES.

An available workaround in RHEL is to use the following command for upgrades:

An available workaround in SLES is to use the following command for upgrades:

```sh
zypper update --force-resolution <meta-package>
```

#### AMD Instinct™ MI200 SRIOV virtualization issue

There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within
a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of
SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal.

Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.

#### System crash when IOMMU is enabled

If the input-output memory management unit (IOMMU) is enabled in SBIOS and ROCm is installed, the
system may report the following failure or errors when running workloads such as bandwidth test,
clinfo, and HelloWord.cl and cause a system crash.

* IO PAGE FAULT
* IRQ remapping does not support X2APIC mode
* NMI error

Workaround: To avoid the system crash, add `amd_iommu=on iommu=pt` as the kernel bootparam, as
indicated in the warning message.

<!-- markdownlint-disable first-line-h1 -->

### Defect fixes

The following known issues in ROCm v5.3.2 are fixed in this release.

#### Peer-to-peer DMA mapping errors with SLES and RHEL

Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with
SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release.

Previously, running `rocminfo` resulted in Peer-to-Peer DMA mapping errors.

#### RCCL tuning table

The RCCL tuning table is updated for supported platforms.

#### SGEMM (F32 GEMM) routines in rocBLAS

Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes
and ranges are fixed in this release.

### Known issues

This section consists of known issues in this release.

#### AMD Instinct™ MI200 SRIOV virtualization issue

There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within
a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of
SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal.

Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.

Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators.

An updated firmware maintenance bundle consisting of an installation tool and images specific to
AMD Instinct™ MI200 accelerators is under planning and will be available soon.

#### Known issue with rocThrust and rocPRIM libraries

There is a known issue with rocThrust and rocPRIM libraries supporting iterator and types in
ROCm v5.3.x releases.

* `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and
  `keys_input2`.

* `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and
  `keys_input2`.

This issue is currently under investigation and will be resolved in a future release.

<!-- markdownlint-disable first-line-h1 -->

### Defect fixes

#### Issue with rocTHRUST and rocPRIM libraries

There was a known issue with rocTHRUST and rocPRIM libraries supporting iterator and types in ROCm
v5.3.x releases.

* `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and
  `keys_input2`.
* `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and
  `keys_input2`.

This issue is resolved with the following fixes to compilation failures:

The ROCm v5.4 release consists of the following HIP enhancements:

##### Support for wall_clock64

A new timer function wall_clock64() is supported, which returns the wall clock count at a constant
frequency on the device.

```cpp
long long int wall_clock64();
```

It returns the wall clock count at a constant frequency on the device, which can be queried via the HIP
API with the hipDeviceAttributeWallClockRate attribute of the device in the HIP application code.

Example:

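A minimal sketch of the example is shown below. The declaration `int wallClkRate = 0; //in kilohertz` is from the original snippet; the `hipDeviceGetAttribute` query and the timing kernel around it are illustrative assumptions.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void timed_kernel(long long int* ticks) {
  long long int start = wall_clock64();
  // ... device work ...
  *ticks = wall_clock64() - start;  // elapsed device wall clock ticks
}

void report_elapsed(long long int ticks, int deviceId) {
  int wallClkRate = 0; //in kilohertz
  // Query the constant wall clock frequency of the device.
  hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId);
  double elapsedMs = static_cast<double>(ticks) / wallClkRate;  // kHz ticks -> milliseconds
  std::printf("kernel took %f ms\n", elapsedMs);
}
```
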
Where hipDeviceAttributeWallClockRate is a device attribute.

:::{note}
The wall clock frequency is a per-device attribute.
:::

##### New registry added for GPU_MAX_HW_QUEUES

The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues
allocated per process per device.

The environment variable controls how many independent hardware queues the HIP runtime can create
per process, per device. If the application allocates more HIP streams than this number, then the HIP
runtime reuses the same hardware queues for the new streams in a round-robin manner.

:::{note}
This maximum number does not apply to hardware queues created for CU-masked HIP streams or
cooperative queues for HIP Cooperative Groups (there is only one queue per device).
:::

For more details, refer to the HIP Programming Guide.

The following new HIP APIs are available in the ROCm v5.4 release.

:::{note}
This is a pre-official version (beta) release of the new APIs.
:::

##### Error handling

##### OpenMP enhancements

This release consists of the following OpenMP enhancements:

* Enable new device RTL in libomptarget as default.
* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is
  generated.
* Initial implementation of optimized cross-team sum reduction for float and double type scalars.
* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.

#### HIP Perl scripts deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

##### Linux file system hierarchy standard for ROCm

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.

##### New file system hierarchy

The following is the new file system hierarchy:

:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major
release.
:::

For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.

##### Backward compatibility with older file systems

ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic-link and wrapper header files in the old locations for backward compatibility.

:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::

##### Wrapper header files

Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:

```cpp
// Code snippet from hip_runtime.h
```

The wrapper header files’ backward compatibility deprecation is as follows:

##### Library files

Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.

Example:

lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64

##### CMake config files

All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of
a soft link to the new CMake config.

Example:

```
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```

### Defect fixes

The following defects are fixed in this release.

These defects were identified and documented as known issues in previous ROCm releases and are
fixed in this release.

#### Memory allocated using hipHostMalloc() with flags didn't exhibit fine-grain behavior

##### Issue

The test was incorrectly using the `hipDeviceAttributePageableMemoryAccess` device attribute to
determine coherent support.

##### Fix

`hipHostMalloc()` allocates memory with fine-grained access by default when the environment variable
`HIP_HOST_COHERENT=1` is used.

For more information, refer to {doc}`hip:doxygen/html/index`.

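The sketch below illustrates the behavior described above; it assumes the process is started with `HIP_HOST_COHERENT=1`, and the buffer size and error handling are illustrative.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  // With HIP_HOST_COHERENT=1 set in the environment, the default allocation below is
  // fine grained; hipHostMallocCoherent can also be passed to request it explicitly.
  void* hostBuf = nullptr;
  hipError_t err = hipHostMalloc(&hostBuf, 1024 * sizeof(float), hipHostMallocDefault);
  if (err != hipSuccess) {
    std::printf("hipHostMalloc failed: %s\n", hipGetErrorString(err));
    return 1;
  }
  hipHostFree(hostBuf);
  return 0;
}
```
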
#### SoftHang with `hipStreamWithCUMask` test on AMD Instinct™

##### Issue

On GFX10 GPUs, kernel execution hangs when it is launched on streams created using
`hipStreamWithCUMask`.

##### Fix

On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units
must be enabled as a pair. The `hipStreamWithCUMask` API unit test cases are updated to set the
compute unit mask (cuMask) in pairs for GFX10 GPUs.

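A sketch of building such a paired mask is shown below. It assumes `hipExtStreamCreateWithCUMask` is the stream-creation API exercised by these tests; the number of enabled workgroup processors is illustrative.

```cpp
#include <hip/hip_runtime.h>
#include <cstdint>

int main() {
  // On GFX10, each workgroup processor owns two compute units, so CU mask bits
  // must be enabled in pairs (0x3 per workgroup processor).
  uint32_t cuMask[2] = {0, 0};
  for (int wgp = 0; wgp < 4; ++wgp) {  // enable the first four workgroup processors
    cuMask[0] |= 0x3u << (2 * wgp);
  }
  hipStream_t stream;
  hipExtStreamCreateWithCUMask(&stream, 2, cuMask);  // assumed API; mask passed as two dwords
  // ... launch work on 'stream' ...
  hipStreamDestroy(stream);
  return 0;
}
```
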
#### ROCm tools GPU IDs

The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are
globally unique and guaranteed to be consistent across APIs and processes.

GPU IDs reported by ROCTracer and ROCProfiler or ROCm Tools are the HSA Driver Node ID of that
GPU, as it is a unique ID for that device in that particular node.

The ROCm v5.4.1 release consists of the following new HIP API:

The following new HIP API is introduced in the ROCm v5.4.1 release.

:::{note}
This is a pre-official version (beta) release of the new APIs.
:::

```cpp
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
```

This swaps the stream capture mode of a thread.

This parameter returns `#hipSuccess`, `#hipErrorInvalidValue`.

For more information, refer to the HIP API documentation at
/bundle/HIP_API_Guide/page/modules.html.

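A hedged usage sketch for the `hipLaunchHostFunc` API listed above follows; the callback body, stream setup, and user data are illustrative assumptions.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

static void on_stream_done(void* userData) {
  // Runs on a host thread once all previously enqueued work in the stream completes.
  std::printf("stream finished, tag = %d\n", *static_cast<int*>(userData));
}

int main() {
  hipStream_t stream;
  hipStreamCreate(&stream);
  static int tag = 42;
  // ... enqueue kernels or copies on 'stream' here ...
  hipLaunchHostFunc(stream, on_stream_done, &tag);
  hipStreamSynchronize(stream);
  hipStreamDestroy(stream);
  return 0;
}
```
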
### Deprecations and warnings

#### HIP Perl scripts deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

### IFWI fixes

These defects were identified and documented as known issues in previous ROCm releases and are
fixed in this release.

#### AMD Instinct™ MI200 firmware IFWI maintenance update #3

This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.

After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded
way resulting in application failures.

In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware
installation tool – AMD FW FLASH 1.2.

| GPU   | Production part number | SKU    | IFWI name     |
|-------|------------------------|--------|---------------|
| MI210 | 113-D673XX             | D67302 | D6730200V.110 |
| MI210 | 113-D673XX             | D67301 | D6730100V.073 |

Instructions on how to download and apply MI200 maintenance updates are available

#### AMD Instinct™ MI200 SRIOV virtualization support

Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all
AMD Instinct™ MI200 devices.

#### HIP Perl scripts deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

#### `hipcc` options deprecation

The following hipcc options are being deprecated and will be removed in a future release:

* The `--amdgpu-target` option is being deprecated; users must use the `--offload-arch` option to
  specify the GPU architecture.
* The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM
  option `-mllvm -mcode-object-version` to debug issues related to code object versions.
* The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as the function calls
  are already supported in production on AMD GPUs.

### Known issues

Under certain circumstances typified by high register pressure, users may encounter a compiler abort
with one of the following error messages:

* > `error: unhandled SGPR spill to memory`

#### HIP Perl scripts deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

##### Linux file system hierarchy standard for ROCm

ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.

##### New file system hierarchy

The following is the new file system hierarchy:

:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major
release.
:::

For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.

##### Backward compatibility with older file systems

ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic-link and wrapper header files in the old locations for backward compatibility.

:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::

##### Wrapper header files

Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:

```cpp
// Code snippet from hip_runtime.h
```

The wrapper header files’ backward compatibility deprecation is as follows:

##### Library files

Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.

Example:

lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64

##### CMake config files

All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of
a soft link to the new CMake config.

Example:

```
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```

### Defect fixes

#### Compiler improvements

In ROCm v5.4.3, improvements to the compiler address errors with the following s

#### Compiler option error at runtime

Some users may encounter a “Cannot find Symbol” error at runtime when using `-save-temps`. While
most `-save-temps` use cases work correctly, this error may appear occasionally.

This issue is under investigation, and the known workaround is not to use `-save-temps` when the error
appears.

Applications that need to update the stack size can use the hipDeviceSetLimit API.

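A minimal sketch follows; `hipLimitStackSize` and the 16 KiB value are illustrative assumptions rather than recommendations.

```cpp
#include <hip/hip_runtime.h>

int main() {
  // Raise the per-thread device stack size before launching kernels that need it.
  hipError_t err = hipDeviceSetLimit(hipLimitStackSize, 16 * 1024);
  if (err != hipSuccess) {
    return 1;
  }
  // ... launch kernels that rely on the larger stack ...
  return 0;
}
```
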
The following hipcc changes are implemented in this release:

* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link-time dependence
  for HIP programs. Applications that depend on these libraries must explicitly link to them.
* `-use-staticlib` and `-use-sharedlib` options are deprecated.

##### Future changes

* Separation of `hipcc` binaries (Perl scripts) from HIP to the `hipcc` project. Users will access a
  separate `hipcc` package for installing the `hipcc` binaries in future ROCm releases.

* In a future ROCm release, the following samples will be removed from the `hip-tests` project.
  * `hipBusbandWidth` at
    <https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/shipBusBandwidth>

##### New HIP APIs in this release

:::{note}
This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.
:::

###### Memory management HIP APIs

The new memory management HIP API is as follows:

The new module management HIP APIs are as follows:

* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed
  to `kernelParams`, where thread blocks can cooperate and synchronize as they run.

  ```cpp
  hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
  ```

* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they
  run.

  ```cpp
  hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
  ```

###### HIP graph management APIs

The new HIP graph management APIs are as follows:

* Creates a memory allocation node and adds it to a graph \[BETA]
##### OpenMP enhancements

This release consists of the following OpenMP enhancements:

* Additional support for OMPT functions `get_device_time` and `get_record_type`
* Added support for min/max fast fp atomics on AMD GPUs
* Fixed the use of the abs function in C device regions

### Deprecations and warnings

#### HIP deprecation

The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.

:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterparts. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::

##### Linux file system hierarchy standard for ROCm

The following is the new file system hierarchy:

:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major
release.
:::

For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.

##### Backward compatibility with older file systems

ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic-link and wrapper header files in the old locations for backward compatibility.

:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::

##### Wrapper header files

Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:

```cpp
// Code snippet from hip_runtime.h
```

The wrapper header files’ backward compatibility deprecation is as follows:

##### Library files
|
||||
|
||||
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
|
||||
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
|
||||
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
|
||||
|
||||
Example:
|
||||
|
||||
@@ -237,7 +251,8 @@ lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64
|
||||
##### CMake config files
|
||||
|
||||
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder.
|
||||
For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
|
||||
For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`)
|
||||
consist of a soft link to the new CMake config.
|
||||
|
||||
Example:
|
||||
|
||||
@@ -253,7 +268,8 @@ Support for Code Object v3 is deprecated and will be removed in a future release
|
||||
|
||||
#### Comgr V3.0 changes
|
||||
|
||||
The following APIs and macros have been marked as deprecated. These are expected to be removed in a future ROCm release and coincides with the release of Comgr v3.0.
|
||||
The following APIs and macros have been marked as deprecated. These are expected to be removed in
|
||||
a future ROCm release and coincides with the release of Comgr v3.0.
|
||||
|
||||
##### API changes
|
||||
|
||||
@@ -265,7 +281,8 @@ The following APIs and macros have been marked as deprecated. These are expected
|
||||
* `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES`
|
||||
* `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN`
|
||||
|
||||
For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, and the `AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros.
|
||||
For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, and the
|
||||
`AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros.
|
||||
|
||||
#### Deprecated environment variables
|
||||
|
||||
|
||||
#### HIP SDK for Windows

AMD is pleased to announce the availability of the HIP SDK for Windows as part of ROCm software. The [HIP SDK OS and GPU support page](https://rocm.docs.amd.com/en/docs-5.5.1/release/windows_support.html) lists the versions of Windows and GPUs validated by AMD. HIP SDK features on Windows are described in detail in our documentation.

The following HIP API is updated in the ROCm 5.5.1 release:

##### `hipDeviceSetCacheConfig`

* The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess`

<!-- markdownlint-disable header-increment -->
### Release highlights

ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A few examples include:

* New documentation portal at https://rocm.docs.amd.com
* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
* OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
* Improved ROCm deployment and development tools, including the CPU-GPU (rocGDB) debugger, profiler, and Docker containers
* New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and a linear system solver for sparse matrices in rocSOLVER.

### OS and GPU support changes

* SLES15 SP5 support was added this release. SLES15 SP3 support was dropped.
* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering maintenance mode starting Q3 2023. This will be aligned with the ROCm 5.7 GA release date.
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM will be aligned with the closest ROCm release)
* Bug fixes during the maintenance will be made to the next ROCm point release
* Bug fixes will not be backported to older ROCm releases for this SKU
* Distro / operating system updates will continue per the ROCm release cadence for gfx906 GPUs till EOM.

### AMDSMI CLI 23.0.0.4

#### Additions

* AMDSMI CLI tool enabled for Linux Bare Metal & Guest

#### Fixes

* Stability fix for multi-GPU systems, reproducible via ROCm_Bandwidth_Test, as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).

### HIP 5.6 (for ROCm 5.6)

* Consolidation of hipamd, rocclr, and OpenCL projects in clr
* Optimized lock for graph global capture mode

#### Additions

* Added hipRTC support for amd_hip_fp16
* Added `hipStreamGetDevice` implementation to get the device associated with the stream (see the sketch below)
* `hipArrayGetDescriptor` for getting a 1D or 2D array descriptor
* `hipArray3DGetDescriptor` to get a 3D array descriptor
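
The new stream-to-device query can be used roughly as follows. This is a minimal sketch, assuming the `hipError_t hipStreamGetDevice(hipStream_t, hipDevice_t*)` signature; it is not taken from the release notes.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipStream_t stream = nullptr;
    if (hipStreamCreate(&stream) != hipSuccess) {
        return 1;
    }

    // Ask the runtime which device this stream was created on
    // (assumed signature: hipStreamGetDevice(hipStream_t, hipDevice_t*)).
    hipDevice_t device = 0;
    if (hipStreamGetDevice(stream, &device) == hipSuccess) {
        std::printf("stream is associated with device %d\n", static_cast<int>(device));
    }

    hipStreamDestroy(stream);
    return 0;
}
```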

#### Changes

* hipMallocAsync to return success for zero-size allocation to match hipMalloc
* Separation of hipcc Perl binaries from the HIP project to the hipcc project. The hip-devel package depends on the newly added hipcc package
* Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
* Removed hipBusBandwidth and hipCommander samples from hip-tests

#### Fixes

* Fixed regression in hipMemCpyParam3D when offset is applied

### ROCgdb-13 (For ROCm 5.6.0)

#### Optimizations

* Improved performance when handling the end of a process with a large number of threads.

#### Known issues

* On certain configurations, ROCgdb can show the following warning message:
  `warning: Probes-based dynamic linker interface failed. Reverting to original interface.`
  This does not affect ROCgdb's functionalities.

Building with `gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2` produces an `a.out` that will depend on `/opt/rocm-5.6.0/lib/librocprofiler64.so.2`.

#### Optimizations

* Improved test suite

#### Additions

* 'end_time' needs to be disabled in roctx_trace.txt

#### Fixes

* rocprof in ROCm 5.4.0: GPU selector was broken.
* rocprof in ROCm 5.4.1: failed to generate kernel info.

ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.

#### HIP 5.6.1 (for ROCm 5.6.1)

### Defect fixes

* `hipMemcpy` device-to-device (inter-device) is now asynchronous with respect to the host
* Fixed a hang when executing HIP catch2 tests with the xnack+ check enabled
* Fixed a memory leak when code object files are loaded/unloaded via the `hipModuleLoad`/`hipModuleUnload` APIs
* Using `hipGraphAddMemFreeNode` no longer results in a crash

### Release highlights for ROCm 5.7

New features include:

* A new library (hipTensor)
* Optimizations for rocRAND and MIVisionX
* AddressSanitizer for host and device code (GPU) is now available as a beta

Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major releases in the ROCm 5 series. This release is Linux-only.

:::{important}
The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
:::

#### AMD Instinct™ MI50 end-of-support notice

AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.

As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.

* ROCm 6.0 release will show MI50s as "under maintenance" for {doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and {doc}`Windows<rocm-install-on-windows:reference/system-requirements>`
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.

#### Feature updates

**Current behavior**

The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those environments. Users may see the following error from the runtime (with `AMD_LOG_LEVEL` 1 and above):

```shell
Pcie atomics not enabled, hostcall not supported
```

**Workaround**

The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.

:::{note}
This option is less robust than the hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
:::

The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is supported only for HIP programs and takes the following values:

* `hostcall` – This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* `buffered` – This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during kernel execution, and the runtime then handles the actual printing once the kernel finishes execution.

**Note**: With the new workaround:

* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to a full buffer), unlike the hostcall scheme, which returns the number of characters printed.
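
To illustrate the workaround described above, here is a rough sketch of a HIP program whose kernel calls printf. The `-mprintf-kind=buffered` option in the comment is the flag introduced in this section; the file and kernel names, and the exact hipcc invocation, are assumptions for the example.

```cpp
// Build sketch (assumed invocation): hipcc -mprintf-kind=buffered printf_demo.cpp -o printf_demo
// Omitting the flag (or passing -mprintf-kind=hostcall) keeps the default hostcall scheme.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello_kernel() {
    // With the buffered scheme this returns 0 on success and -1 once the fixed-size
    // buffer is full; with the hostcall scheme it returns the number of characters printed.
    int rc = printf("hello from thread %d\n", static_cast<int>(threadIdx.x));
    (void)rc;
}

int main() {
    hello_kernel<<<1, 4>>>();
    hipDeviceSynchronize();
    return 0;
}
```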

##### Beta release of LLVM AddressSanitizer (ASan) with the GPU

The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.

Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.

Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](../conceptual/using-gpu-sanitizer.md).

:::{note}
The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
:::

#### Defect fixes

The following defects are fixed in ROCm v5.7:

##### Optimizations

##### Additions

* Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
* `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level

##### Changes

##### Fixes

##### Known issues

* HIP memory type enum values currently don't support an equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` could return an invalid value if the input memory pointer was not allocated through any HIP API on device or host.

##### Upcoming changes for HIP in ROCm 6.0 release

* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as `hipMemcpyAtoH` and `hipMemcpyHtoA`
* `HIP_MEMCPY3D` fields correction to avoid truncation of `size_t` to `unsigned int` inside `hipMemcpy3D()`
* Renaming of 'memoryType' in the `hipPointerAttribute_t` structure to 'type'
* Correct `hipGetLastError` to return the last error instead of the last API call's return code
* Update `hipExternalSemaphoreHandleDesc` to add `unsigned int reserved[16]`
* Correct handling of flag values in `hipIpcOpenMemHandle` for `hipIpcMemLazyEnablePeerAccess`
* Remove `hiparray*` and make it opaque with `hipArray_t`

<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->

### What's new in this release

ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.

#### Installing all GPU AddressSanitizer packages with a single command

ROCm 5.7.1 simplifies the installation steps for the optional AddressSanitizer (ASan) packages. This release provides the meta package *rocm-ml-sdk-asan* for ease of ASan installation. The following command can be used to install all ASan packages rather than installing each package separately:

    sudo apt-get install rocm-ml-sdk-asan

For more detailed information about using the GPU AddressSanitizer, refer to the [user guide](https://rocm.docs.amd.com/en/docs-5.7.1/understand/using_gpu_sanitizer.html).

### ROCm libraries

#### rocBLAS

A new functionality, `rocblas-gemm-tune`, and an environment variable, `ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH`, are added to rocBLAS in the ROCm 5.7.1 release.

`rocblas-gemm-tune` is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command-line interface that mimics the `--yaml` input used by `rocblas-bench`. To generate the expected `--yaml` input, profile logging can be used by setting the environment variable `ROCBLAS_LAYER` to 4.

For more information on rocBLAS logging, see Logging in rocBLAS in the [API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).

An example input file: Expected output (note that the selected GEMM idx may differ): The far-right values (`solution_index`) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See `rocBLAS/samples/example_user_driven_tuning.cpp` for sample code that uses kernels directly via their indices.

If the output is stored in a file, the results can be used to override the default kernel selection with the kernels found, by setting the environment variable `ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH`, which points to the stored file.

For more details, refer to the [rocBLAS Programmer's Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune).

#### HIP 5.7.1 (for ROCm 5.7.1)

### Defect fixes

The `hipPointerGetAttributes` API returns the correct HIP memory type as `hipMemoryTypeManaged` for managed memory.

|
||||
<!-- markdownlint-disable first-line-h1 -->
|
||||
<!-- markdownlint-disable no-duplicate-header -->
|
||||
|
||||
ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library
|
||||
support, and improved developer experience. This includes initial enablement of the AMD Instinct™
|
||||
MI300 series. Future releases will further enable and optimize this new platform. Key features include:
|
||||
|
||||
* Improved performance in areas like lower precision math and attention layers.
|
||||
* New hipSPARSELt library to accelerate AI workloads via AMD's sparse matrix core technique.
|
||||
* Latest upstream support for popular AI frameworks like PyTorch, TensorFlow, and JAX.
|
||||
* New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.
|
||||
* Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and
|
||||
tutorials on the [AMD ROCm Docs](https://rocm.docs.amd.com) site.
|
||||
* Consolidated developer resources and training on the new AMD ROCm Developer Hub.
|
||||
|
||||
The following sections provide a release overview for ROCm 6.0. For additional details, you can refer to
|
||||
the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html).
|
||||
|
||||
### OS and GPU support changes
|
||||
|
||||
AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating
|
||||
systems.
|
||||
|
||||
* Ubuntu 22.04.3 (MI300A and MI300X)
|
||||
* RHEL 8.9 (MI300A)
|
||||
* SLES 15 SP5 (MI300A)
|
||||
|
||||
We've added support for the following operating systems:
|
||||
|
||||
* RHEL 9.3
|
||||
* RHEL 8.9
|
||||
|
||||
Note that, as of ROCm 6.2, we've planned end-of-support (EoS) for the following operating systems:
|
||||
|
||||
* Ubuntu 20.04.5
|
||||
* SLES 15 SP4
|
||||
* RHEL/CentOS 7.9
|
||||
|
||||
### New ROCm meta package
|
||||
|
||||
We've added a new ROCm meta package for easy installation of all ROCm core packages, tools, and
|
||||
libraries. For example, the following command will install the full ROCm package: `apt-get install rocm`
|
||||
(Ubuntu), or `yum install rocm` (RHEL).
|
||||
|
||||
### Filesystem Hierarchy Standard
|
||||
|
||||
ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We've removed
|
||||
the backward compatibility support for old file locations.
|
||||
|
||||
### Compiler location change
|
||||
|
||||
* The installation path of LLVM has been changed from `/opt/rocm-<rel>/llvm` to
|
||||
`/opt/rocm-<rel>/lib/llvm`. For backward compatibility, a symbolic link is provided to the old
|
||||
location and will be removed in a future release.
|
||||
* The installation path of the device library bitcode has changed from `/opt/rocm-<rel>/amdgcn` to
|
||||
`/opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn`. For backward compatibility, a symbolic link
|
||||
is provided and will be removed in a future release.
|
||||
|
||||
### Documentation
|
||||
|
||||
CMake support has been added for documentation in the
|
||||
[ROCm repository](https://github.com/RadeonOpenCompute/ROCm).
|
||||
|
||||
### AMD Instinct™ MI50 end-of-support notice
|
||||
|
||||
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enter
|
||||
maintenance mode in ROCm 6.0.
|
||||
|
||||
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 was the
|
||||
final release for gfx906 GPUs in a fully supported state.
|
||||
|
||||
* Henceforth, no new features and performance optimizations will be supported for the gfx906 GPUs.
|
||||
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2
|
||||
2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
|
||||
* Bug fixes will be made up to the next ROCm point release.
|
||||
* Bug fixes will not be backported to older ROCm releases for gfx906.
|
||||
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906
|
||||
GPUs until EOM.
|
||||
|
||||
### Known issues
|
||||
|
||||
* Hang is observed with rocSPARSE tests: [Issue 2726](https://github.com/ROCm/ROCm/issues/2726).
|
||||
* AddressSanitizer instrumentation is incorrect for device global variables:
|
||||
[Issue 2551](https://github.com/ROCm/ROCm/issues/2551).
|
||||
* Dynamically loaded HIP runtime library references incorrect version of `hipDeviceGetProperties`
|
||||
API: [Issue 2728](https://github.com/ROCm/ROCm/issues/2728).
|
||||
* Memory access violations when running rocFFT-HMM:
|
||||
[Issue 2730](https://github.com/ROCm/ROCm/issues/2730).
|
||||
|
||||
### Library changes
|
||||
|
||||
| Library | Version |
|
||||
|---------|---------|
|
||||
| AMDMIGraphX | ⇒ [2.8](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/releases/tag/rocm-6.0.0) |
|
||||
| HIP | [6.0.0](https://github.com/ROCm/HIP/releases/tag/rocm-6.0.0) |
|
||||
| hipBLAS | ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-6.0.0) |
|
||||
| hipCUB | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-6.0.0) |
|
||||
| hipFFT | ⇒ [1.0.13](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-6.0.0) |
|
||||
| hipSOLVER | ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-6.0.0) |
|
||||
| hipSPARSE | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-6.0.0) |
|
||||
| hipTensor | ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipTensor/releases/tag/rocm-6.0.0) |
|
||||
| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-6.0.0) |
|
||||
| rccl | ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-6.0.0) |
|
||||
| rocALUTION | ⇒ [3.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-6.0.0) |
|
||||
| rocBLAS | ⇒ [4.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-6.0.0) |
|
||||
| rocFFT | ⇒ [1.0.25](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-6.0.0) |
|
||||
| ROCgdb | [13.2](https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.0.0) |
|
||||
| rocm-cmake | ⇒ [0.11.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-6.0.0) |
|
||||
| rocPRIM | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-6.0.0) |
|
||||
| rocprofiler | [2.0.0](https://github.com/ROCm/rocprofiler/releases/tag/rocm-6.0.0) |
|
||||
| rocRAND | ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-6.0.0) |
|
||||
| rocSOLVER | ⇒ [3.24.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-6.0.0) |
|
||||
| rocSPARSE | ⇒ [3.0.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-6.0.0) |
|
||||
| rocThrust | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-6.0.0) |
|
||||
| rocWMMA | ⇒ [1.3.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-6.0.0) |
|
||||
| Tensile | ⇒ [4.39.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-6.0.0) |
|
||||
|
||||
#### AMDMIGraphX 2.8
|
||||
|
||||
MIGraphX 2.8 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Support for TorchMIGraphX via PyTorch
|
||||
* Boosted overall performance by integrating rocMLIR
|
||||
* INT8 support for ONNX Runtime
|
||||
* Support for ONNX version 1.14.1
|
||||
* Added new operators: `Qlinearadd`, `QlinearGlobalAveragePool`, `Qlinearconv`, `Shrink`, `CastLike`,
|
||||
and `RandomUniform`
|
||||
* Added an error message for when `gpu_targets` is not set during MIGraphX compilation
|
||||
* Added parameter to set tolerances with `migraphx-driver` verify
|
||||
* Added support for MXR files > 4 GB
|
||||
* Added `MIGRAPHX_TRACE_MLIR` flag
|
||||
* BETA added capability for using ROCm Composable Kernels via the `MIGRAPHX_ENABLE_CK=1`
|
||||
environment variable
|
||||
|
||||
##### Optimizations
|
||||
|
||||
* Improved performance support for INT8
|
||||
* Improved time precision while benchmarking candidate kernels from CK or MLIR
|
||||
* Removed contiguous from reshape parsing
|
||||
* Updated the `ConstantOfShape` operator to support Dynamic Batch
|
||||
* Simplified dynamic shapes-related operators to their static versions, where possible
|
||||
* Improved debugging tools for accuracy issues
|
||||
* Included a print warning about `miopen_fusion` while generating `mxr`
|
||||
* General reduction in system memory usage during model compilation
|
||||
* Created additional fusion opportunities during model compilation
|
||||
* Improved debugging for matchers
|
||||
* Improved general debug messages
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
|
||||
* Provided a compile option to improve the accuracy of some models by disabling Fast-Math
|
||||
* Improved layernorm + pointwise fusion matching to ignore argument order
|
||||
* Fixed accuracy issue with `ROIAlign` operator
|
||||
* Fixed computation logic for the `Trilu` operator
|
||||
* Fixed support for the DETR model
|
||||
|
||||
##### Changes
|
||||
|
||||
* Changed MIGraphX version to 2.8
|
||||
* Extracted the test packages into a separate deb file when building MIGraphX from source
|
||||
|
||||
##### Removals
|
||||
|
||||
* Removed building Python 2.7 bindings
|
||||
|
||||
#### AMD SMI
|
||||
|
||||
* Integrated the E-SMI library: You can now query CPU-related information directly through AMD SMI.
|
||||
Metrics include power, energy, performance, and other system details.
|
||||
|
||||
* Added support for gfx942 metrics: You can now query MI300 device metrics to get real-time
|
||||
information. Metrics include power, temperature, energy, and performance.
|
||||
|
||||
* Added support for compute and memory partitions
|
||||
|
||||
#### HIP 6.0.0
|
||||
|
||||
HIP 6.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* New fields and structs for external resource interoperability
|
||||
* `hipExternalMemoryHandleDesc_st`
|
||||
* `hipExternalMemoryBufferDesc_st`
|
||||
* `hipExternalSemaphoreHandleDesc_st`
|
||||
* `hipExternalSemaphoreSignalParams_st`
|
||||
* `hipExternalSemaphoreWaitParams_st Enumerations`
|
||||
* `hipExternalMemoryHandleType_enum`
|
||||
* `hipExternalSemaphoreHandleType_enum`
|
||||
* `hipExternalMemoryHandleType_enum`
|
||||
|
||||
* New environment variable `HIP_LAUNCH_BLOCKING`
|
||||
* Used for serialization of kernel execution. The default value is 0 (disabled): kernels execute normally, as defined in the queue. When this environment variable is set to 1 (enabled), the HIP runtime serializes the kernel enqueue; this behaves the same as `AMD_SERIALIZE_KERNEL`.
|
||||
|
||||
* More members are added in HIP struct `hipDeviceProp_t`, for new feature capabilities including:
|
||||
* Texture
|
||||
* `int maxTexture1DMipmap;`
|
||||
* `int maxTexture2DMipmap[2];`
|
||||
* `int maxTexture2DLinear[3];`
|
||||
* `int maxTexture2DGather[2];`
|
||||
* `int maxTexture3DAlt[3];`
|
||||
* `int maxTextureCubemap;`
|
||||
* `int maxTexture1DLayered[2];`
|
||||
* `int maxTexture2DLayered[3];`
|
||||
* `int maxTextureCubemapLayered[2];`
|
||||
* Surface
|
||||
* `int maxSurface1D;`
|
||||
* `int maxSurface2D[2];`
|
||||
* `int maxSurface3D[3];`
|
||||
* `int maxSurface1DLayered[2];`
|
||||
* `int maxSurface2DLayered[3];`
|
||||
* `int maxSurfaceCubemap;`
|
||||
* `int maxSurfaceCubemapLayered[2];`
|
||||
* Device
|
||||
* `hipUUID uuid;`
|
||||
* `char luid[8];` this is an 8-byte unique identifier. Only valid on Windows
|
||||
* `unsigned int luidDeviceNodeMask;`
|
||||
|
||||
* LUID (Locally Unique Identifier) is supported for interoperability between devices. In HIP, more
|
||||
members are added in the struct `hipDeviceProp_t`, as properties to identify each device:
|
||||
* `char luid[8];`
|
||||
* `unsigned int luidDeviceNodeMask;`
|
||||
|
||||
:::{note}
|
||||
HIP only supports LUID on Windows OS.
|
||||
:::
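
As a quick orientation, the sketch below reads a few of the new `hipDeviceProp_t` members through `hipGetDeviceProperties`. The member names come from the list above; the rest of the program is illustrative only.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, /*deviceId=*/0) != hipSuccess) {
        std::printf("hipGetDeviceProperties failed\n");
        return 1;
    }

    // Newly added texture/surface limits.
    std::printf("device 0: %s\n", prop.name);
    std::printf("maxTexture1DMipmap: %d\n", prop.maxTexture1DMipmap);
    std::printf("maxSurfaceCubemap:  %d\n", prop.maxSurfaceCubemap);

    // prop.uuid (hipUUID) identifies the device; prop.luid and
    // prop.luidDeviceNodeMask are only meaningful on Windows.
    return 0;
}
```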
|
||||
|
||||
##### Changes
|
||||
|
||||
* Some OpenGL Interop HIP APIs are moved from the hip_runtime_api header to a new header file hip_gl_interop.h for the AMD platform, as follows:
|
||||
* `hipGLGetDevices`
|
||||
* `hipGraphicsGLRegisterBuffer`
|
||||
* `hipGraphicsGLRegisterImage`
|
||||
|
||||
###### Changes impacting backward incompatibility
|
||||
|
||||
* Data types for members in `HIP_MEMCPY3D` structure are changed from `unsigned int` to `size_t`.
|
||||
* The value of the flag `hipIpcMemLazyEnablePeerAccess` is changed to `0x01`, which was previously
|
||||
defined as `0`
|
||||
* Some device property attributes are not currently supported in HIP runtime. In order to maintain
|
||||
consistency, the following related enumeration names are changed in `hipDeviceAttribute_t`
|
||||
* `hipDeviceAttributeName` is changed to `hipDeviceAttributeUnused1`
|
||||
* `hipDeviceAttributeUuid` is changed to `hipDeviceAttributeUnused2`
|
||||
* `hipDeviceAttributeArch` is changed to `hipDeviceAttributeUnused3`
|
||||
* `hipDeviceAttributeGcnArch` is changed to `hipDeviceAttributeUnused4`
|
||||
* `hipDeviceAttributeGcnArchName` is changed to `hipDeviceAttributeUnused5`
|
||||
* HIP struct `hipArray` is removed from driver type header to comply with CUDA
|
||||
* `hipArray_t` replaces `hipArray*`, as the pointer to array.
|
||||
* This allows `hipMemcpyAtoH` and `hipMemcpyHtoA` to have the correct array type which is
|
||||
equivalent to corresponding CUDA driver APIs.
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Kernel launch maximum dimension validation is added specifically on gridY and gridZ in the HIP API `hipModuleLaunchKernel`. As a result, when `hipGetDeviceAttribute` is called for the value of `hipDeviceAttributeMaxGridDim`, the behavior on the AMD platform is equivalent to NVIDIA.

* The HIP stream synchronization behavior is changed in internal stream functions: a "wait" flag is added and set when the current stream is a null pointer while executing stream synchronization on other explicitly created streams. This change avoids blocking execution on the null/default stream. The change won't affect application usage, and it makes applications behave the same on the AMD platform as on NVIDIA.

* Error handling behavior on unsupported GPUs is fixed: the HIP runtime now logs an error message instead of raising a signal abort error that is invisible to developers while the kernel execution process continues. This applies to the case where developers compile an application via hipcc with the `--offload-arch` option set to a GPU ID that differs from the one on the system.
|
||||
|
||||
* HIP complex vector type multiplication and division operations. On the AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, `hipFloatComplex` and `hipDoubleComplex` are defined as complex data types: `typedef float2 hipFloatComplex;` and `typedef double2 hipDoubleComplex;`. Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following (see the sketch after this list):
|
||||
* `hipCmulf()` and `hipCdivf()` for `hipFloatComplex`
|
||||
* `hipCmul()` and `hipCdiv()` for `hipDoubleComplex`
|
||||
Note: These complex operations are equivalent to corresponding types/functions on NVIDIA platform.
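
A minimal host-side sketch of the replacement helpers named above; it assumes the usual `make_hipFloatComplex`, `hipCrealf`, and `hipCimagf` helpers from `hip/hip_complex.h`.

```cpp
#include <hip/hip_complex.h>
#include <cstdio>

int main() {
    hipFloatComplex a = make_hipFloatComplex(1.0f, 2.0f);
    hipFloatComplex b = make_hipFloatComplex(3.0f, -1.0f);

    // Use the explicit helpers instead of the removed '*' and '/' operators.
    hipFloatComplex prod = hipCmulf(a, b);
    hipFloatComplex quot = hipCdivf(a, b);

    std::printf("prod = %f + %fi\n", hipCrealf(prod), hipCimagf(prod));
    std::printf("quot = %f + %fi\n", hipCrealf(quot), hipCimagf(quot));
    return 0;
}
```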
|
||||
|
||||
##### Removals
|
||||
|
||||
* Deprecated Heterogeneous Compute (HCC) symbols and flags are removed from the HIP source code, including:
|
||||
* Build options on obsolete `HCC_OPTIONS` were removed from cmake.
|
||||
* Micro definitions are removed:
|
||||
* `HIP_INCLUDE_HIP_HCC_DETAIL_DRIVER_TYPES_H`
|
||||
* `HIP_INCLUDE_HIP_HCC_DETAIL_HOST_DEFINES_H`
|
||||
* Compilation flags for the platform definitions
|
||||
* AMD platform
|
||||
* `HIP_PLATFORM_HCC`
|
||||
* `HCC`
|
||||
* `HIP_ROCclr`
|
||||
* NVIDIA platform
|
||||
* `HIP_PLATFORM_NVCC`
|
||||
* File directories in the clr repository are removed. For more details, see https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/hcc_detail and https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/nvcc_detail
|
||||
* Deprecated gcnArch is removed from hip device struct `hipDeviceProp_t`.
|
||||
* Deprecated `enum hipMemoryType memoryType;` is removed from HIP struct `hipPointerAttribute_t` union.
|
||||
|
||||
#### hipBLAS 2.0.0
|
||||
|
||||
hipBLAS 2.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* New option to define `HIPBLAS_USE_HIP_BFLOAT16` to switch API to use the `hip_bfloat16` type
|
||||
* New `hipblasGemmExWithFlags` API
|
||||
|
||||
##### Deprecations
|
||||
|
||||
* `hipblasDatatype_t`; use `hipDataType` instead
|
||||
* `hipblasComplex`; use `hipComplex` instead
|
||||
* `hipblasDoubleComplex`; use `hipDoubleComplex` instead
|
||||
* Use of `hipblasDatatype_t` for `hipblasGemmEx` for compute-type; use `hipblasComputeType_t` instead
|
||||
|
||||
##### Removals
|
||||
|
||||
* `hipblasXtrmm` (calculates B <- alpha * op(A) * B) has been replaced with `hipblasXtrmm` (calculates
|
||||
C <- alpha * op(A) * B)
|
||||
|
||||
|
||||
#### hipCUB 3.0.0
|
||||
|
||||
hipCUB 3.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Changes
|
||||
|
||||
* Removed `DOWNLOAD_ROCPRIM`: you can force rocPRIM to download using
|
||||
`DEPENDENCIES_FORCE_DOWNLOAD`
|
||||
|
||||
#### hipFFT 1.0.13
|
||||
|
||||
hipFFT 1.0.13 for ROCm 6.0.0
|
||||
|
||||
##### Changes
|
||||
|
||||
* `hipfft-rider` has been renamed to `hipfft-bench`; it is controlled by the `BUILD_CLIENTS_BENCH`
|
||||
CMake option (note that a link for the old file name is installed, and the old `BUILD_CLIENTS_RIDER`
|
||||
CMake option is accepted for backwards compatibility, but both will be removed in a future release)
|
||||
* Binaries in debug builds no longer have a `-d` suffix
|
||||
* The minimum rocFFT required version has been updated to 1.0.21
|
||||
|
||||
##### Additions
|
||||
|
||||
* `hipfftXtSetGPUs`, `hipfftXtMalloc, hipfftXtMemcpy`, `hipfftXtFree`, and `hipfftXtExecDescriptor` APIs
|
||||
have been implemented to allow FFT computing on multiple devices in a single process
|
||||
|
||||
#### hipSOLVER 2.0.0
|
||||
|
||||
hipSOLVER 2.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Added hipBLAS as an optional dependency to `hipsolver-test`
|
||||
* You can use the `BUILD_HIPBLAS_TESTS` CMake option to test the compatibility between hipSOLVER
|
||||
and hipBLAS
|
||||
|
||||
##### Changes
|
||||
|
||||
* The `hipsolverOperation_t` type is now an alias of `hipblasOperation_t`
|
||||
* The `hipsolverFillMode_t` type is now an alias of `hipblasFillMode_t`
|
||||
* The `hipsolverSideMode_t` type is now an alias of `hipblasSideMode_t`
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Tests for hipSOLVER info updates in `ORGBR/UNGBR`, `ORGQR/UNGQR`, `ORGTR/UNGTR`,
|
||||
`ORMQR/UNMQR`, and `ORMTR/UNMTR`
|
||||
|
||||
#### hipSPARSE 3.0.0
|
||||
|
||||
hipSPARSE 3.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Added `hipsparseGetErrorName` and `hipsparseGetErrorString`
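
A small sketch of how the new helpers could be used for diagnostics; it assumes they take a `hipsparseStatus_t` and return a `const char*`, mirroring their cuSPARSE counterparts.

```cpp
#include <hipsparse/hipsparse.h>
#include <cstdio>

int main() {
    hipsparseHandle_t handle;
    hipsparseStatus_t status = hipsparseCreate(&handle);

    // Assumed signatures: const char* hipsparseGetErrorName(hipsparseStatus_t)
    //                     const char* hipsparseGetErrorString(hipsparseStatus_t)
    std::printf("hipsparseCreate: %s (%s)\n",
                hipsparseGetErrorName(status),
                hipsparseGetErrorString(status));

    if (status == HIPSPARSE_STATUS_SUCCESS) {
        hipsparseDestroy(handle);
    }
    return 0;
}
```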
|
||||
|
||||
##### Changes
|
||||
|
||||
* Changed the `hipsparseSpSV_solve()` API function to match the cuSPARSE API
|
||||
* Changed generic API functions to use const descriptors
|
||||
* Improved documentation
|
||||
|
||||
#### hipTensor 1.1.0
|
||||
|
||||
hipTensor 1.1.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Architecture support for gfx942
|
||||
* Client tests configuration parameters now support YAML file input format
|
||||
|
||||
##### Changes
|
||||
|
||||
* Doxygen now treats warnings as errors
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Client tests output redirections now behave accordingly
|
||||
* Removed dependency static library deployment
|
||||
* Security issues for documentation
|
||||
* Compile issues in debug mode
|
||||
* Corrected soft link for ROCm deployment
|
||||
|
||||
#### MIOpen 2.19.0
|
||||
|
||||
MIOpen 2.19.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* ROCm 5.5 support for gfx1101 (Navi32)
|
||||
|
||||
##### Changes
|
||||
|
||||
* Tuning results for MLIR on ROCm 5.5
|
||||
* Bumped MLIR commit to 5.5.0 release tag
|
||||
|
||||
##### Fixes
|
||||
|
||||
* 3-D convolution host API bug
|
||||
* `[HOTFIX][MI200][FP16]` has been disabled for `ConvHipImplicitGemmBwdXdlops` when FP16_ALT is
|
||||
required
|
||||
|
||||
#### MIVisionX
|
||||
|
||||
* Added Comprehensive CTests to aid developers
|
||||
* Introduced Doxygen support for complete API documentation
|
||||
* Simplified dependencies for rocAL
|
||||
|
||||
#### OpenMP
|
||||
|
||||
* MI300:
|
||||
* Added support for gfx942 targets
|
||||
* Fixed declare target variable access in unified_shared_memory mode
|
||||
* Enabled OMPX_APU_MAPS environment variable for MI200 and gfx942
|
||||
* Handled global pointers in forced USM (`OMPX_APU_MAPS`)
|
||||
|
||||
* Nextgen AMDGPU plugin:
|
||||
* Respect `GPU_MAX_HW_QUEUES` in the AMDGPU Nextgen plugin, which takes precedence over the
|
||||
standard `LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES` environment variable
|
||||
* Changed the default for `LIBOMPTARGET_AMDGPU_TEAMS_PER_CU` from 4 to 6
|
||||
* Fixed the behavior of the `OMPX_FORCE_SYNC_REGIONS` environment variable, which is used to
|
||||
force synchronous target regions (the default is to use an asynchronous implementation)
|
||||
* Added support for and enabled default of code object version 5
|
||||
* Implemented target OMPT callbacks and trace records support in the nextgen plugin
|
||||
|
||||
* Specialized kernels:
|
||||
* Removes redundant copying of arrays when xteam reductions are active but not offloaded
|
||||
* Tuned the number of teams for BigJumpLoop
|
||||
* Enables specialized kernel generation with nested OpenMP pragma, as long as there is no nested
|
||||
omp-parallel directive
|
||||
|
||||
##### Additions
|
||||
|
||||
* `-fopenmp-runtimelib={lib,lib-perf,lib-debug}` to select libs
|
||||
* Warning if mixed HIP / OpenMP offloading (i.e., if HIP language mode is active, but OpenMP target
|
||||
directives are encountered)
|
||||
* Introduced compile-time limit for the number of GPUs supported in a system: 16 GPUs in a single
|
||||
node is currently the maximum supported
|
||||
|
||||
##### Changes
|
||||
|
||||
* Correctly compute number of waves when workgroup size is less than the wave size
|
||||
* Implemented `LIBOMPTARGET_KERNEL_TRACE=3`, which prints DEVID traces and API timings
|
||||
* ASAN support for openmp release, debug, and perf libraries
|
||||
* Changed LDS lowering default to hybrid
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Fixed RUNPATH for gdb plugin
|
||||
* Fixed hang in OMPT support if flush trace is called when there are no helper threads
|
||||
|
||||
#### rccl 2.15.5
|
||||
|
||||
RCCL 2.15.5 for ROCm 6.0.0
|
||||
|
||||
##### Changes
|
||||
|
||||
* Compatibility with NCCL 2.15.5
|
||||
* Renamed the unit test executable to `rccl-UnitTests`
|
||||
|
||||
##### Additions
|
||||
|
||||
* HW-topology-aware binary tree implementation
|
||||
* Experimental support for MSCCL
|
||||
* New unit tests for hipGraph support
|
||||
* NPKit integration
|
||||
|
||||
##### Fixes
|
||||
|
||||
* rocm-smi ID conversion
|
||||
* Support for `HIP_VISIBLE_DEVICES` for unit tests
|
||||
* Support for p2p transfers to non (HIP) visible devices
|
||||
|
||||
##### Removals
|
||||
|
||||
* Removed TransferBench from tools as it exists in standalone repo:
|
||||
[https://github.com/ROCmSoftwarePlatform/TransferBench](https://github.com/ROCmSoftwarePlatform/TransferBench)
|
||||
|
||||
#### rocALUTION 3.0.3
|
||||
|
||||
rocALUTION 3.0.3 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Support for 64bit integer vectors
|
||||
* Inclusive and exclusive sum functionality for vector classes
|
||||
* Transpose functionality for `GlobalMatrix` and `LocalMatrix`
|
||||
* `TripleMatrixProduct` functionality for `LocalMatrix`
|
||||
* `Sort()` function for `LocalVector` class
|
||||
* Multiple stream support to the HIP backend
|
||||
|
||||
##### Optimizations
|
||||
|
||||
* `GlobalMatrix::Apply()` now uses multiple streams to better hide communication
|
||||
|
||||
##### Changes
|
||||
|
||||
* Matrix dimensions and number of non-zeros are now stored using 64-bit integers
|
||||
* Improved the ILUT preconditioner
|
||||
|
||||
##### Removals
|
||||
|
||||
* `LocalVector::GetIndexValues(ValueType*)`
|
||||
* `LocalVector::SetIndexValues(const ValueType*)`
|
||||
* `LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)`
|
||||
* `LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)`
|
||||
* `LocalMatrix::RugeStueben()`
|
||||
* `LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)`
|
||||
* `LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)`
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Unit tests no longer ignore BCSR block dimension
|
||||
* Fixed documentation typos
|
||||
* Bug in multi-coloring for non-symmetric matrix patterns
|
||||
|
||||
#### rocBLAS 4.0.0
|
||||
|
||||
rocBLAS 4.0.0 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Beta API `rocblas_gemm_batched_ex3` and `rocblas_gemm_strided_batched_ex3`
|
||||
* Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and
|
||||
gemv_strided_batched
|
||||
* Use of `rocblas_status_excluded_from_build` when calling functions that require Tensile (when using
|
||||
rocBLAS built without Tensile)
|
||||
* System for asynchronous kernel launches that set a `rocblas_status` failure based on a
|
||||
`hipPeekAtLastError` discrepancy
|
||||
|
||||
##### Optimizations
|
||||
|
||||
* TRSM performance for small sizes (m < 32 && n < 32)
|
||||
|
||||
##### Deprecations
|
||||
|
||||
* Atomic operations will be disabled by default in a future release of rocBLAS (you can enable atomic
|
||||
operations using the `rocblas_set_atomics_mode` function)
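
If the default does flip in a later release, opting back in would look roughly like the sketch below; it assumes the standard rocBLAS handle calls and the `rocblas_atomics_allowed` enumerator.

```cpp
#include <rocblas/rocblas.h>
#include <cstdio>

int main() {
    rocblas_handle handle;
    if (rocblas_create_handle(&handle) != rocblas_status_success) {
        std::printf("failed to create rocBLAS handle\n");
        return 1;
    }

    // Explicitly allow atomics for this handle (the current default).
    rocblas_set_atomics_mode(handle, rocblas_atomics_allowed);

    rocblas_atomics_mode mode;
    rocblas_get_atomics_mode(handle, &mode);
    std::printf("atomics mode: %d\n", static_cast<int>(mode));

    rocblas_destroy_handle(handle);
    return 0;
}
```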
|
||||
|
||||
##### Removals
|
||||
|
||||
* `rocblas_gemm_ext2` API function
|
||||
* In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and
|
||||
out-of-place trmm
|
||||
* int8x4 support is removed (int8 support is unchanged)
|
||||
* `#define __STDC_WANT_IEC_60559_TYPES_EXT__` is removed from `rocblas-types.h` (if you want
|
||||
ISO/IEC TS 18661-3:2015 functionality, you must define `__STDC_WANT_IEC_60559_TYPES_EXT__`
|
||||
before including `float.h`, `math.h`, and `rocblas.h`)
|
||||
* The default build removes device code for gfx803 architecture from the fat binary
|
||||
|
||||
##### Fixes
|
||||
|
||||
* Made offset calculations for 64-bit rocBLAS functions safe
|
||||
* Fixes for very large leading dimension or increment potentially causing overflow:
|
||||
* Level2: `gbmv`, `gemv`, `hbmv`, `sbmv`, `spmv`, `tbmv`, `tpmv`, `tbsv`, and `tpsv`
|
||||
* Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files,
|
||||
based on device architecture
|
||||
* Guards against no-op kernel launches that result in a potential `hipGetLastError`
|
||||
|
||||
##### Changes
|
||||
|
||||
* Reduced the default verbosity of `rocblas-test` (you can see all tests by setting the
|
||||
`GTEST_LISTENER=PASS_LINE_IN_LOG` environment variable)
|
||||
|
||||
#### rocFFT 1.0.25
|
||||
|
||||
rocFFT 1.0.25 for ROCm 6.0.0
|
||||
|
||||
##### Additions
|
||||
|
||||
* Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices
|
||||
in a single process
|
||||
|
||||
* `rocfft_field` is a new type that can be added to a plan description to describe the layout of FFT
|
||||
input or output
|
||||
* `rocfft_field_add_brick` can be called to describe the brick decomposition of an FFT field, where each
|
||||
brick can be assigned a different device
|
||||
|
||||
These interfaces are still experimental and subject to change. Your feedback is appreciated.
|
||||
You can raise questions and concerns by opening issues in the
|
||||
[rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).
|
||||
|
||||
Note that multi-device FFTs currently have several limitations (we plan to address these in future
|
||||
releases):
|
||||
|
||||
* Real-complex (forward or inverse) FFTs are not supported
|
||||
* Planar format fields are not supported
|
||||
* Batch (the `number_of_transforms` provided to `rocfft_plan_create`) must be 1
|
||||
* FFT input is gathered to the current device at run time, so all FFT data must fit on that device

##### Optimizations

* Improved the performance of several 2D/3D real FFTs supported by the `2D_SINGLE` kernel; offline tuning provides more optimization for gfx90a
* Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks

##### Changes

* Built kernels in a solution map to the library kernel cache
* Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real inverse (complex-to-real) input, as this allows for faster performance
* `rocfft-rider` and `dyna-rocfft-rider` have been renamed to `rocfft-bench` and `dyna-rocfft-bench`; these are controlled by the `BUILD_CLIENTS_BENCH` CMake option
  * Links for the former file names are installed, and the former `BUILD_CLIENTS_RIDER` CMake option is accepted for compatibility, but both will be removed in a future release
* Binaries in debug builds no longer have a `-d` suffix

##### Fixes

* rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit integers -> 32-bit float)

#### ROCgdb 13.2

ROCgdb 13.2 for ROCm 6.0.0

##### Additions

* Added support for watchpoints on scratch memory addresses
* Added support for gfx1100, gfx1101, and gfx1102
* Added support for gfx942

##### Optimizations

* Improved performance when handling the end of a process with a large number of threads

##### Known issues

* On certain configurations, ROCgdb can show the following warning message:

  `warning: Probes-based dynamic linker interface failed. Reverting to original interface.`

  This does not affect ROCgdb's functionality.

* ROCgdb cannot debug a program on an AMDGPU device past a `s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)` instruction. If an exception is reported after this instruction has been executed (including asynchronous exceptions), the wave is killed and the exceptions are only reported by the ROCm runtime.

#### rocm-cmake 0.11.0

rocm-cmake 0.11.0 for ROCm 6.0.0

##### Changes

* Improved validation, documentation, and rocm-docs-core integration for ROCMSphinxDoc

##### Fixes

* Fixed extra `make` flags passed for Clang-Tidy (ROCMClangTidy)
* Fixed issues with ROCMTest when using a module in a subdirectory

#### ROCm Compiler

* On MI300, kernel arguments can be preloaded into SGPRs rather than passed in memory. This feature is enabled with a compiler option, which also controls the number of arguments to pass in SGPRs.
* Improved register allocation at `-O0` to avoid compiler crashes ('ran out of registers during register allocation')
* Improved generation of debug information:
  * Improved compile time
  * Avoided compiler crashes

#### rocPRIM 3.0.0

rocPRIM 3.0.0 for ROCm 6.0.0

##### Additions

* `block_sort::sort()` overload for keys and values with a dynamic size, for all block sort algorithms
* All `block_sort::sort()` overloads with a dynamic size are now supported for `block_sort_algorithm::merge_sort` and `block_sort_algorithm::bitonic_sort`
* New two-way partition primitive `partition_two_way`, which can write to two separate iterators

##### Optimizations

* Improved `partition` performance

##### Fixes

* Fixed `rocprim::MatchAny` for devices with a 64-bit warp size
  * Note that `rocprim::MatchAny` is deprecated; use `rocprim::match_any` instead

#### ROCProfiler 2.0.0

ROCProfiler 2.0.0 for ROCm 6.0.0

##### Additions

* Updated the supported GPU architectures in the README with profiler versions
* Automatic ISA dumping for ATT (see the README)
* CSV mode for ATT (see the README)
* Added an option to control kernel name truncation
* Limited rocprof (v1) script usage to supported architectures only
* Added tool versioning so that rocprofv2 can be run through rocprof (see the README for more information)
* Added plugin versioning in rocprofv2 (see the README for more details)
* Added a `--version` option to rocprof and rocprofv2 that shows the current rocprof or rocprofv2 version along with ROCm version information

#### rocRAND 2.10.17

rocRAND 2.10.17 for ROCm 6.0.0

##### Changes

* Generator classes from `rocrand.hpp` are no longer copyable (in previous versions these copies would copy internal references to the generators and would lead to double free or memory leak errors)
  * These types should be moved instead of copied; move constructors and operators are now defined (as sketched below)
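
A minimal sketch (not from the release notes) of what the move-only behavior means for callers, assuming the ROCm 6.0 header location `<rocrand/rocrand.hpp>` and the `rocrand_cpp::philox4x32_10` engine as an example.

```cpp
// Hedged sketch: rocrand.hpp engines are move-only now, so transfer ownership
// with std::move instead of copying (copies previously shared the underlying
// generator and could double-free it).
#include <rocrand/rocrand.hpp>
#include <utility>

int main()
{
    rocrand_cpp::philox4x32_10 engine;                     // owns a generator
    // rocrand_cpp::philox4x32_10 copy = engine;           // no longer compiles: copying is deleted
    rocrand_cpp::philox4x32_10 owner = std::move(engine);  // ownership transferred safely
    (void)owner;
    return 0;
}
```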

##### Optimizations

* Improved MT19937 initialization and generation performance

##### Removals

* Removed the hipRAND submodule from rocRAND; hipRAND is now only available as a separate package
* Removed references to, and workarounds for, the deprecated hcc

##### Fixes

* `mt19937_engine` from `rocrand.hpp` is now move-constructible and move-assignable (the move constructor and move assignment operator were previously deleted for this class)
* Various fixes for the C++ wrapper header `rocrand.hpp`
  * The name of `mrg31k3p` is now spelled correctly (it was incorrectly named `mrg31k3a` in previous versions)
  * Added the missing `order` setter method for `threefry4x64`
  * Fixed the default ordering parameter for `lfsr113`
* Fixed a build error caused by unsupported `amdgpu-target` references when using Clang++ directly

#### rocSOLVER 3.24.0

rocSOLVER 3.24.0 for ROCm 6.0.0

##### Additions

* Cholesky refactorization for sparse matrices: `CSRRF_REFACTCHOL`
* Added `rocsolver_rfinfo_mode` and the ability to specify the desired refactorization routine (see `rocsolver_set_rfinfo_mode`)

##### Changes

* `CSRRF_ANALYSIS` and `CSRRF_SOLVE` now support sparse Cholesky factorization

#### rocSPARSE 3.0.2

rocSPARSE 3.0.2 for ROCm 6.0.0

##### Changes

* Function arguments for `rocsparse_spmv`
* Function arguments for `rocsparse_xbsrmv` routines
* When using host pointer mode, you must now call `hipStreamSynchronize` following `doti`, `dotci`, `spvv`, and `csr2ell` (see the sketch after this list)
* Improved documentation
* Improved verbose output during argument checking on API function calls
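
The following hedged sketch (not from the release notes) illustrates the new synchronization requirement for the `doti` case above, using `rocsparse_sdoti` in host pointer mode; the vector sizes and values are illustrative.

```cpp
// Hedged sketch: with host pointer mode, the result written by doti is only
// guaranteed to be valid after synchronizing the handle's stream.
#include <hip/hip_runtime.h>
#include <rocsparse/rocsparse.h>
#include <cstdio>

int main()
{
    rocsparse_handle handle;
    rocsparse_create_handle(&handle);
    rocsparse_set_pointer_mode(handle, rocsparse_pointer_mode_host);

    // Sparse x: 3 non-zeros at indices {0, 2, 5}; dense y of length 6.
    const rocsparse_int nnz = 3;
    const float h_x_val[] = {1.0f, 2.0f, 3.0f};
    const rocsparse_int h_x_ind[] = {0, 2, 5};
    const float h_y[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    float* d_x_val;
    rocsparse_int* d_x_ind;
    float* d_y;
    hipMalloc(&d_x_val, sizeof(h_x_val));
    hipMalloc(&d_x_ind, sizeof(h_x_ind));
    hipMalloc(&d_y, sizeof(h_y));
    hipMemcpy(d_x_val, h_x_val, sizeof(h_x_val), hipMemcpyHostToDevice);
    hipMemcpy(d_x_ind, h_x_ind, sizeof(h_x_ind), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y, sizeof(h_y), hipMemcpyHostToDevice);

    float result = 0.0f;
    rocsparse_sdoti(handle, nnz, d_x_val, d_x_ind, d_y, &result, rocsparse_index_base_zero);

    // New in this release: synchronize the handle's stream before reading the host-side result.
    hipStream_t stream;
    rocsparse_get_stream(handle, &stream);
    hipStreamSynchronize(stream);

    std::printf("dot = %f\n", result);

    hipFree(d_x_val);
    hipFree(d_x_ind);
    hipFree(d_y);
    rocsparse_destroy_handle(handle);
    return 0;
}
```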

##### Removals

* Auto stages from `spmv`, `spmm`, `spgemm`, `spsv`, `spsm`, and `spitsv`
* Formerly deprecated `rocsparse_spmm_ex` routine

##### Fixes

* Bug in `rocsparse-bench` where the SpMV algorithm was not taken into account in CSR format
* BSR and GEBSR routines (`bsrmv`, `bsrsv`, `bsrmm`, `bsrgeam`, `gebsrmv`, `gebsrmm`) didn't always show `block_dim==0` as an invalid size
* Passing `nnz = 0` to `doti` or `dotci` wasn't always returning a dot product of 0

##### Additions

* `rocsparse_inverse_permutation`
* Mixed precisions for SpVV
* Uniform int8 precision for gather and scatter

#### rocThrust 3.0.0

rocThrust 3.0.0 for ROCm 6.0.0

##### Additions

* Updated to match upstream Thrust 2.0.1
* `NV_IF_TARGET` macro from libcu++ for the NVIDIA backend, and a HIP implementation for the HIP backend

##### Changes

* The CMake build system now accepts `GPU_TARGETS` in addition to `AMDGPU_TARGETS` for setting targeted GPU architectures
  * `GPU_TARGETS=all` compiles for all supported architectures
  * `AMDGPU_TARGETS` is only provided for backwards compatibility (`GPU_TARGETS` is preferred)
* Removed the CUB symlink from the root of the repository
* Removed support for deprecated macros (`THRUST_DEVICE_BACKEND` and `THRUST_HOST_BACKEND`)

##### Known issues

* The `THRUST_HAS_CUDART` macro, which is no longer used in Thrust (it's provided only for legacy support), is replaced with `NV_IF_TARGET` and `THRUST_RDC_ENABLED` in the NVIDIA backend. The HIP backend doesn't have a `THRUST_RDC_ENABLED` macro, so some branches in Thrust code may be unreachable in the HIP backend.

#### rocWMMA 1.3.0

rocWMMA 1.3.0 for ROCm 6.0.0

##### Additions

* Support for gfx942
* Support for f8, bf8, and xfloat32 data types
* Support for `HIP_NO_HALF`, `__HIP_NO_HALF_CONVERSIONS__`, and `__HIP_NO_HALF_OPERATORS__` (e.g., for PyTorch environments)

##### Changes

* rocWMMA with hipRTC now supports the `bfloat16_t` data type
* gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment
* Updated the samples' GEMM parameter validation on the host architecture

##### Fixes

* Disabled GoogleTest static library deployment
* Extended tests now build with the large code model

#### Tensile 4.39.0

Tensile 4.39.0 for ROCm 6.0.0

##### Additions

* Added `aquavanjaram` support: gfx942, fp8/bf8 data types, the xf32 data type, and stochastic rounding for various data types
* Added and updated tuning scripts
* Added `DirectToLds` support for larger data types with 32-bit global load (the old `DirectToLds` parameter is replaced with `DirectToLdsA` and `DirectToLdsB`), and the corresponding test cases
* Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
* Added an asmcap check for MFMA + const src
* Added support for wider local read + pack with v_perm (with `VgprForLocalReadPacking=True`)
* Added a new parameter to increase `miLatencyLeft`

##### Optimizations

* Enabled `InitAccVgprOpt` for `MatrixInstruction` cases
* Implemented local-read-related parameter calculations with `DirectToVgpr`
* Enabled dedicated vgpr allocation for local read + pack
* Optimized code initialization
* Optimized sgpr allocation
* Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
* Enabled `miLatency` optimization for specific data types, and fixed instruction scheduling

##### Changes

* Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
* Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
* Removed unused `CustomKernels` and `ReplacementKernels`
* Added a reject condition for DTVB + TransposeLDS=False (not yet supported)
* Removed unused code for DirectToLds
* Updated test cases for DTV + TransposeLDS=False
* Moved the `MinKForGSU` parameter from `globalparameter` to `BenchmarkCommonParameter` to support smaller K
* Changed how `latencyForLR` is calculated for miLatency
* Set a minimum value of `latencyForLRCount` for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
* Refactored `allowLRVWBforTLUandMI` and renamed it to `VectorWidthB`
* Added support for multiple GPUs of different architectures in lazy library loading
* Enabled the dtree library for batch > 1
* Added a problem scale feature for dtree selection
* Modified the non-lazy load build to skip experimental logic

##### Fixes

* Predicate ordering for the fp16alt implementation's round-near-zero mode, to unbreak distance modes
* Boundary check for mirror dims, and re-enabled the disabled mirror dims test cases
* Merge error affecting i8 with WMMA
* Mismatch issue with DTLds + TSGR + TailLoop
* Bug with `InitAccVgprOpt` + GSU>1, and a mismatch issue with PGR=0
* Override for unloaded solutions when lazy loading
* Added missing headers
* Boost linking for a clean build on Ubuntu 22
* Bug in `forcestoresc1` arch selection
* Compiler directive for gfx942
* Formatting for `DecisionTree_test.cpp`
10
tools/autotag/templates/rocm_changes/6.0.2.md
Normal file
@@ -0,0 +1,10 @@

The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.

#### hipFFT 1.0.13

hipFFT 1.0.13 for ROCm 6.0.2

##### Changes

* Removed the Git submodule for shared files between rocFFT and hipFFT; the files are now copied over instead (this should help simplify downstream builds and packaging)