Compare commits

..

30 Commits

Author SHA1 Message Date
Sam Wu
2edcc3a6c6 Update documentation requirements 2024-09-16 10:12:40 -08:00
Sam Wu
ae2409fa47 Update documentation requirements 2024-06-06 16:58:35 -06:00
Sam Wu
574d62b077 Fix RTD config 2024-05-02 08:54:54 -06:00
Sam Wu
ff9c523c3e Update documentation requirements 2024-05-01 16:59:20 -06:00
Sam Wu
5712fd2b98 Update documentation requirements 2024-05-01 16:53:13 -06:00
Mátyás Aradi
f0a9e81a9a Update Linux and ROCm versions in ROCm 5.4.1 (#2877)
* Update Linux and ROCm versions

* Update CI to use build.os
2024-02-08 09:07:08 -07:00
Sam Wu
829d91892b add version to html title 2023-08-04 17:17:07 -06:00
Sam Wu
1e5228b65f update pdf 2023-06-30 09:39:15 -06:00
Sam Wu
5ae4c333c5 rocm-docs-core v0.18.3 2023-06-30 09:39:02 -06:00
Máté Ferenc Nagy-Egri
15292ddebe Downgrade license notice to 5.4.1 2023-06-22 18:37:26 +02:00
Máté Ferenc Nagy-Egri
89986d332d Downgrade changelog to 5.4.1 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
de6fc1634a Downgrade install instructions to 5.4.1 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
52986c3635 Downgrade release notes to 5.4.1 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
b86717e454 Downgrade license notice to 5.4.2 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
7dbd277203 Downgrade changelog to 5.4.2 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
31ee8e712c Downgrade install instructions to 5.4.2 2023-06-22 18:37:25 +02:00
Máté Ferenc Nagy-Egri
55eda666d5 Downgrade release notes to 5.4.2 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
01e24da121 Downgrade support matrices to 5.4.3 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
f68c47d748 Downgrade changelog to 5.4.3 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
2c0a351bbd Downgrade install instructions to 5.4.3 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
b6509809d3 Downgrade OS support to 5.4.3 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
a29205cc5c Downgrade release notes to 5.4.3 2023-06-22 18:37:24 +02:00
Máté Ferenc Nagy-Egri
16c4d22099 Downgrade changelog to 5.5.0 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
ed3335c3a5 Downgrade install instructions to 5.5.0 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
30f27c4644 Downgrade OS support to 5.5.0 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
f4be54f896 Downgrade release notes to 5.5.0 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
9c04aef6a5 Fix HIP API reference links 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
c7d4e75e95 Add support for non-Pro Radeon VII 2023-06-22 18:37:23 +02:00
Máté Ferenc Nagy-Egri
aabbea88f2 Fix ROCm SMI links 2023-06-22 18:37:14 +02:00
Máté Ferenc Nagy-Egri
7747e130b9 Update rocm-docs-core to 0.16.0 2023-06-22 18:37:04 +02:00
236 changed files with 7349 additions and 11317 deletions

2
.github/CODEOWNERS vendored
View File

@@ -1 +1 @@
* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus

View File

@@ -1,76 +0,0 @@
name: Issue Report
description: File a report for something not working correctly.
title: "[Issue]: "
body:
- type: markdown
attributes:
value: |
Thank you for taking the time to fill out this report!
On a Linux system, you can acquire your OS, CPU, GPU, and ROCm version (for filling out this report) with the following commands:
echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
echo "ROCm in /opt:" && ls -1 /opt | grep -E "rocm-";
- type: textarea
attributes:
label: Problem Description
description: Describe the issue you encountered.
placeholder: "The steps to reproduce can be included here, or in the dedicated section further below."
validations:
required: true
- type: input
attributes:
label: Operating System
description: What is the name and version number of the OS?
placeholder: "e.g. Ubuntu 22.04.3 LTS (Jammy Jellyfish)"
validations:
required: true
- type: input
attributes:
label: CPU
description: What CPU did you encounter the issue on?
placeholder: "e.g. AMD Ryzen 9 5900HX with Radeon Graphics"
validations:
required: true
- type: input
attributes:
label: GPU
description: What GPU(s) did you encounter the issue on?
placeholder: "e.g. MI200"
validations:
required: true
- type: input
attributes:
label: ROCm Version
description: What version(s) of ROCm did you encounter the issue on?
placeholder: "e.g. 5.7.0"
validations:
required: true
- type: input
attributes:
label: ROCm Component
description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
placeholder: "e.g. rocBLAS"
- type: textarea
attributes:
label: Steps to Reproduce
description: (Optional) Detailed steps to reproduce the issue.
placeholder: Please also include what you expected to happen, and what actually did, at the failing step(s).
validations:
required: false
- type: textarea
attributes:
label: Output of /opt/rocm/bin/rocminfo --support
description: The output of rocminfo --support will help to better address the problem.
placeholder: |
ROCk module is loaded
=====================
HSA System Attributes
=====================
[...]
validations:
required: true

View File

@@ -1,32 +0,0 @@
name: Feature Suggestion
description: Suggest an additional functionality, or new way of handling an existing functionality.
title: "[Feature]: "
body:
- type: markdown
attributes:
value: |
Thank you for taking the time to make a suggestion!
- type: textarea
attributes:
label: Suggestion Description
description: Describe your suggestion.
validations:
required: true
- type: input
attributes:
label: Operating System
description: (Optional) If this is for a specific OS, you can mention it here.
placeholder: "e.g. Ubuntu"
- type: input
attributes:
label: GPU
description: (Optional) If this is for a specific GPU or GPU family, you can mention it here.
placeholder: "e.g. MI200"
- type: input
attributes:
label: ROCm Component
description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
placeholder: "e.g. rocBLAS"

View File

@@ -1,5 +0,0 @@
blank_issues_enabled: false
contact_links:
- name: ROCm Community Discussions
url: https://github.com/RadeonOpenCompute/ROCm/discussions
about: Please ask and answer questions here for anything ROCm.

View File

@@ -10,4 +10,3 @@ updates:
open-pull-requests-limit: 10
schedule:
interval: "daily"
versioning-strategy: increase

View File

@@ -5,16 +5,52 @@ on:
branches:
- develop
- main
- 'docs/*'
- 'roc**'
pull_request:
branches:
- develop
- main
- 'docs/*'
- 'roc**'
concurrency:
group: ${{ github.ref }}-${{ github.workflow }}
cancel-in-progress: true
jobs:
call-workflow-passing-data:
name: Documentation
uses: RadeonOpenCompute/rocm-docs-core/.github/workflows/linting.yml@develop
lint-rest:
name: "RestructuredText"
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Install rst-lint
run: pip install restructuredtext-lint
- name: Lint ResT files
run: rst-lint ${{ join(github.workspace, '/docs') }}
lint-md:
name: "Markdown"
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Use markdownlint-cli2
uses: DavidAnson/markdownlint-cli2-action@v10.0.1
with:
globs: '**/*.md'
spelling:
name: "Spelling"
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Fetch config
shell: sh
run: |
curl --silent --show-error --fail --location https://raw.github.com/RadeonOpenCompute/rocm-docs-core/develop/.spellcheck.yaml -O
curl --silent --show-error --fail --location https://raw.github.com/RadeonOpenCompute/rocm-docs-core/develop/.wordlist.txt >> .wordlist.txt
- name: Run spellcheck
uses: rojopolis/spellcheck-github-actions@0.30.0
- name: On fail
if: failure()
run: |
echo "Please check for spelling mistakes or add them to '.wordlist.txt' in either the root of this project or in rocm-docs-core."

5
.gitignore vendored
View File

@@ -13,7 +13,6 @@ _doxygen/
_readthedocs/
# avoid duplicating contributing.md due to conf.py
docs/contributing.md
docs/release.md
docs/CHANGELOG.md
docs/contribute/index.md
docs/about/release-notes.md
docs/about/CHANGELOG.md

View File

@@ -1,7 +1,5 @@
config:
default: true
MD004:
style: asterisk
MD013: false
MD026:
punctuation: '.,;:!'
@@ -10,9 +8,7 @@ config:
MD033: false
MD034: false
MD041: false
MD051: false
ignores:
- CHANGELOG.md
- docs/CHANGELOG.md
- "{,docs/}{RELEASE,release}.md"
- tools/autotag/templates/**/*.md

View File

@@ -3,16 +3,19 @@
version: 2
sphinx:
configuration: docs/conf.py
formats: [htmlzip, pdf]
build:
os: ubuntu-22.04
tools:
python: "3.10"
apt_packages:
- "doxygen"
- "graphviz" # For dot graphs in doxygen
python:
install:
- requirements: docs/sphinx/requirements.txt
build:
os: ubuntu-20.04
tools:
python: "3.8"
sphinx:
configuration: docs/conf.py
formats: []

View File

@@ -1,584 +1,29 @@
# isv_deployment_win
ABI
activations
addr
AddressSanitizer
AlexNet
alloc
allocator
allocators
ALU
AMD
AMDGPU
amdgpu
AMDGPUs
AMDMIGraphX
AMI
AOCC
AOMP
api
APIC
APIs
Arb
ASan
ASIC
ASICs
ASm
ATI
atmi
atomics
autogenerated
avx
awk
backend
# gpu_aware_mpi
DMA
GDR
HCA
MPI
MVAPICH
Mellanox's
NIC
OFED
OSU
OpenFabrics
PeerDirect
RDMA
UCX
ib_core
# linear algebra
LAPACK
MMA
backends
benchmarking
bilinear
BitCode
BLAS
Blit
blit
BMC
buildable
bursty
bzip
cacheable
CCD
cd
CDNA
CentOS
centric
changelog
chiplet
CIFAR
CLI
CLion
CMake
cmake
CMakeLists
CMakePackage
cmd
coalescable
codename
Codespaces
comgr
Commitizen
CommonMark
completers
composable
concretization
Concretized
Conda
config
conformant
convolutional
convolves
CoRR
CP
CPC
CPF
CPP
CPU
CPUs
CSC
CSE
CSn
csn
CSV
CTests
CU
cuBLAS
CUDA
cuFFT
cuLIB
cuRAND
CUs
cuSOLVER
cuSPARSE
CXX
dataset
datasets
dataspace
datatype
datatypes
dbgapi
de
deallocation
denormalize
Dependabot
deserializers
detections
dev
DevCap
devicelibs
devsel
# tuning_guides
BMC
DGEMM
disambiguates
distro
DL
DMA
DNN
DNNL
Dockerfile
Doxygen
DPM
DRI
DW
DWORD
el
enablement
endpgm
env
epilog
EPYC
ESXi
ethernet
exascale
executables
ffmpeg
FFT
FFTs
FHS
filesystem
Filesystem
Flang
FMA
Fortran
fortran
FP
galb
gcc
GCD
GCDs
GCN
GDB
gdb
GDDR
GDR
GDS
GEMM
GEMMs
GenZ
gfortran
gfx
GIM
github
Gitpod
GL
GLXT
GMI
gnupg
GPG
GPR
GPU
GPUs
grayscale
GRBM
gzip
Haswell
HBM
HCA
heterogenous
hipamd
hipBLAS
hipblas
hipBLASLt
HIPCC
hipCUB
hipcub
HIPExtension
hipFFT
hipfft
hipfort
HIPIFY
hipify
hipLIB
hipRAND
hipSOLVER
hipsolver
hipSPARSE
hipsparse
hipSPARSELt
hipTensor
HPC
HPCG
HPE
HPL
HSA
hsa
hsakmt
HWE
ib_core
ICV
IDE
IDEs
ImageNet
IMDB
inband
incrementing
inferencing
InfiniBand
inflight
init
Inlines
inlining
installable
IntelliSense
interprocedural
Intersphinx
intra
invariants
invocating
Ioffe
IOMMU
IOP
IOPM
IOV
ipo
IRQ
ISA
ISV
ISVs
JSON
Jupyter
kdb
KFD
Khronos
KVM
LAPACK
LCLK
LDS
libfabric
libjpeg
libs
linearized
linter
linux
llvm
LLVM
localscratch
logits
lossy
LSAN
LTS
Makefile
Makefiles
matchers
Matplotlib
Mellanox's
MEM
MERCHANTABILITY
MFMA
microarchitecture
MIGraphX
migraphx
MIOpen
miopen
MIOpenGEMM
miopengemm
MIVisionX
mivisionx
mkdir
mlirmiopen
MMA
MMIO
MMIOH
MNIST
MPI
MSVC
mtypes
Multicore
Multithreaded
MVAPICH
mvffr
MyEnvironment
MyST
namespace
namespaces
Nano
Navi
NBIO
NBIOs
NIC
NICs
Noncoherently
NPS
NUMA
NumPy
numref
NVCC
NVPTX
OAM
OAMs
ocl
OCP
OEM
OFED
OMP
OMPT
OMPX
ONNX
OpenCL
opencl
opencv
OpenFabrics
OpenGL
OpenMP
openmp
openssl
OpenVX
optimizers
os
OSS
OSU
Pageable
pageable
passthrough
PCI
PCIe
PeerDirect
perfcounter
Perfetto
performant
perl
PIL
PILImage
PowerShell
PnP
pragma
pre
prebuilt
precompiled
prefetch
prefetchable
preprocess
preprocessing
preq
prequantized
prerequisites
PRNG
profiler
protobuf
PRs
pseudorandom
py
PyPi
PyTorch
Qcycles
quasirandom
queueing
Radeon
RadeonOpenCompute
RCCL
rccl
RDC
rdc
RDMA
RDNA
reformats
RelWithDebInfo
repos
Req
req
resampling
RST
reStructuredText
RHEL
Rickle
roadmap
roc
ROC
RoCE
rocAL
rocALUTION
rocalution
rocBLAS
rocblas
rocclr
ROCdbgapi
rocFFT
rocfft
ROCgdb
ROCk
rocLIB
rocm
ROCm
ROCmCC
rocminfo
rocMLIR
ROCmSoftwarePlatform
ROCmValidationSuite
rocPRIM
rocprim
rocprof
ROCProfiler
rocprofiler
ROCr
rocr
rocRAND
rocrand
rocSOLVER
rocsolver
rocSPARSE
rocsparse
roct
rocThrust
rocthrust
ROCTracer
roctracer
rocWMMA
RST
runtime
runtimes
RW
Ryzen
SALU
SBIOS
SCA
scalability
SDK
SDMA
SDRAM
SENDMSG
sendmsg
SENDMSG
sendmsg
SerDes
serializers
SGPR
SGPRs
SHA
shader
Shlens
sigmoid
SIGQUIT
SIMD
SIMDs
SKU
SKUs
skylake
sL
SLES
sm
SMEM
SMI
smi
SMT
softmax
Spack
spack
SPI
SQs
SRAM
SRAMECC
src
stochastically
strided
subdirectory
subexpression
subfolder
subfolders
supercomputing
Supermicro
SWE
Szegedy
tagram
TCA
TCC
TCI
TCIU
TCP
TCR
TensorBoard
TensorFlow
TFLOPS
tg
th
tmp
ToC
tokenize
toolchain
toolchains
toolset
toolsets
TorchAudio
TorchMIGraphX
TorchScript
TorchServe
TorchVision
torchvision
tracebacks
TransferBench
TrapStatus
txt
UAC
uarch
ubuntu
UC
UCC
UCX
UIF
Uncached
uncached
uncorrectable
Unhandled
uninstallation
unsqueeze
unstacking
unswitching
untrusted
untuned
USM
UTCL
UTIL
utils
VALU
Vanhoucke
VBIOS
vdi
vectorizable
vectorization
vectorize
vectorized
vectorizer
vectorizes
VGPR
VGPRs
vjxb
vL
VM
VMEM
VMWare
VRAM
VSIX
VSkipped
Vulkan
walkthrough
walkthroughs
wavefront
wavefronts
WGP
whitespaces
Wojna
workgroup
Workgroups
workgroups
writeback
Writebacks
writebacks
wrreq
WX
wzo
Xeon
XGMI
Xnack
XT
Xteam
XTX
xz
YAML
yaml
YML
YModel
ysvmadyb
ZenDNN
zypper

File diff suppressed because it is too large Load Diff

View File

@@ -1,40 +0,0 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
cmake_minimum_required(VERSION 3.18.0)
project(ROCm VERSION 5.7.1 LANGUAGES NONE)
option(ROCM_BUILD_DOCS "Build ROCm documentation" ON)
include(GNUInstallDirs)
# Adding default path cmake modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
# Handle dependencies
include(Dependencies)
# Build docs
if(ROCM_BUILD_DOCS)
add_subdirectory(docs)
endif()

View File

@@ -1,229 +1,246 @@
# Contributing to ROCm documentation
# Contributing to ROCm Docs
AMD values and encourages contributions to our code and documentation. If you choose to
contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
process, to which we are dedicated.
AMD values and encourages the ROCm community to contribute to our code and
documentation. This repository is focused on ROCm documentation and this
contribution guide describes the recommend method for creating and modifying our
documentation.
If you have issues when trying to contribute, refer to the
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
repository.
While interacting with ROCm Documentation, we encourage you to be polite and
respectful in your contributions, content or otherwise. Authors, maintainers of
these docs act on good intentions and to the best of their knowledge.
Keep that in mind while you engage. Should you have issues with contributing
itself, refer to
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) on the
GitHub repository.
## Folder structure and naming convention
## Supported Formats
Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
`/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
(`/`) folder.
Our documentation includes both markdown and rst files. Markdown is encouraged
over rst due to the lower barrier to participation. GitHub flavored markdown is preferred
for all submissions as it will render accurately on our GitHub repositories. For existing documentation,
[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html) markdown
is used to implement certain features unsupported in GitHub markdown. This is
not encouraged for new documentation. AMD will transition
to stricter use of GitHub flavored markdown with a few caveats. ROCm documentation
also uses [sphinx-design](https://sphinx-design.readthedocs.io/en/latest/index.html)
in our markdown and rst files. We also will use breathe syntax for doxygen documentation
in our markdown files. Other design elements for effective HTML rendering of the documents
may be added to our markdown files. Please see
[GitHub](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github)'s
guide on writing and formatting on GitHub as a starting point.
All images are stored in the `/docs/data` folder. An image's file path mirrors that of the documentation
file where it is used.
ROCm documentation adds additional requirements to markdown and rst based files
as follows:
Our naming structure uses kebab case; for example, `my-file-name.rst`.
- Level one headers are only used for page titles. There must be only one level
1 header per file for both Markdown and Restructured Text.
- Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via
our automated github action on a Pull Request (PR).
## Supported formats and syntax
## Filenames and folder structure
Our documentation includes both Markdown and RST files. We are gradually transitioning existing
Markdown to RST in order to more effectively meet our documentation needs. When contributing,
RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.
Please use snake case for file names. Our documentation follows pitchfork for
folder structure. All documentation is in /docs except for special files like
the contributing guide in the / folder. All images used in the documentation are
place in the /docs/data folder.
We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) syntax and compile
our API references using [Doxygen](https://www.doxygen.nl/).
## How to provide feedback for for ROCm documentation
The following table shows some common documentation components and the syntax convention we
use for each:
There are three standard ways to provide feedback for this repository.
<table>
<tr>
<th>Component</th>
<th>RST syntax</th>
</tr>
<tr>
<td>Code blocks</td>
<td>
### Pull Request
```rst
All contributions to ROCm documentation should arrive via the
[GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)
targetting the develop branch of the repository. If you are unable to contribute
via the GitHub Flow, feel free to email us. TODO, confirm email address.
.. code-block:: language-name
### GitHub Issue
My code block.
Issues on existing or absent docs can be filed as [GitHub issues
](https://github.com/RadeonOpenCompute/ROCm/issues).
### Email Feedback
## Language and Style
Adopting Microsoft CPP-Docs guidelines for [Voice and Tone
](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md).
ROCm documentation templates to be made public shortly. ROCm templates dictate
the recommended structure and flow of the documentation. Guidelines on how to
integrate figures, equations, and tables are all based off
[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html).
Font size and selection, page layout, white space control, and other formatting
details are controlled via rocm-docs-core, sphinx extention. Please raise issues
in rocm-docs-core for any formatting concerns and changes requested.
## Building Documentation
While contributing, one may build the documentation locally on the command-line
or rely on Continuous Integration for previewing the resulting HTML pages in a
browser.
### Command line documentation builds
Python versions known to build documentation:
- 3.8
To build the docs locally using Python Virtual Environment (`venv`), execute the
following commands from the project root:
```sh
python3 -mvenv .venv
# Windows
.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
# Linux
.venv/bin/python -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
```
</td>
</tr>
<tr>
<td>Cross-referencing internal files</td>
<td>
Then open up `_build/html/index.html` in your favorite browser.
```rst
### Pull Requests documentation builds
:doc:`Title <../path/to/file/filename>`
When opening a PR to the `develop` branch on GitHub, the page corresponding to
the PR (`https://github.com/RadeonOpenCompute/ROCm/pull/<pr_number>`) will have
a summary at the bottom. This requires the user be logged in to GitHub.
```
- There, click `Show all checks` and `Details` of the Read the Docs pipeline. It
will take you to `https://readthedocs.com/projects/advanced-micro-devices-rocm/
builds/<some_build_num>/`
- The list of commands shown are the exact ones used by CI to produce a render
of the documentation.
- There, click on the small blue link `View docs` (which is not the same as the
bigger button with the same text). It will take you to the built HTML site with
a URL of the form `https://
advanced-micro-devices-demo--<pr_number>.com.readthedocs.build/projects/alpha/en
/<pr_number>/`.
</td>
</tr>
<tr>
<td>External links</td>
<td>
### Build the docs using VS Code
```rst
One can put together a productive environment to author documentation and also
test it locally using VS Code with only a handful of extensions. Even though the
extension landscape of VS Code is ever changing, here is one example setup that
proved useful at the time of writing. In it, one can change/add content, build a
new version of the docs using a single VS Code Task (or hotkey), see all errors/
warnings emitted by Sphinx in the Problems pane and immediately see the
resulting website show up on a locally serving web server.
`link name <URL>`_
#### Configuring VS Code
```
1. Install the following extensions:
</td>
</tr>
<tr>
<tr>
<td>Headings</td>
<td>
- Python (ms-python.python)
- Live Server (ritwickdey.LiveServer)
```rst
2. Add the following entries in `.vscode/settings.json`
******************
Chapter title (H1)
******************
```json
{
"liveServer.settings.root": "/.vscode/build/html",
"liveServer.settings.wait": 1000,
"python.terminal.activateEnvInCurrentTerminal": true
}
```
Section title (H2)
===============
The settings in order are set for the following reasons:
- Sets the root of the output website for live previews. Must be changed
alongside the `tasks.json` command.
- Tells live server to wait with the update to give time for Sphinx to
regenerate site contents and not refresh before all is don. (Empirical value)
- Automatic virtual env activation is a nice touch, should you want to build
the site from the integrated terminal.
Subsection title (H3)
---------------------
3. Add the following tasks in `.vscode/tasks.json`
Sub-subsection title (H4)
^^^^^^^^^^^^^^^^^^^^
```json
{
"version": "2.0.0",
"tasks": [
{
"label": "Build Docs",
"type": "process",
"windows": {
"command": "${workspaceFolder}/.venv/Scripts/python.exe"
},
"command": "${workspaceFolder}/.venv/bin/python3",
"args": [
"-m",
"sphinx",
"-j",
"auto",
"-T",
"-b",
"html",
"-d",
"${workspaceFolder}/.vscode/build/doctrees",
"-D",
"language=en",
"${workspaceFolder}/docs",
"${workspaceFolder}/.vscode/build/html"
],
"problemMatcher": [
{
"owner": "sphinx",
"fileLocation": "absolute",
"pattern": {
"regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
"file": 1,
"line": 2,
"severity": 3,
"message": 4
},
},
{
"owner": "sphinx",
"fileLocation": "absolute",
"pattern": {
"regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
"file": 1,
"severity": 2,
"message": 3
}
}
],
"group": {
"kind": "build",
"isDefault": true
}
},
],
}
```
> (Implementation detail: two problem matchers were needed to be defined,
> because VS Code doesn't tolerate some problem information being potentially
> absent. While a single regex could match all types of errors, if a capture
> group remains empty (the line number doesn't show up in all warning/error
> messages) but the `pattern` references said empty capture group, VS Code
> discards the message completely.)
```
4. Configure Python virtual environment (venv)
</td>
</tr>
<tr>
<td>Images</td>
<td>
- From the Command Palette, run `Python: Create Environment`
- Select `venv` environment and the `docs/sphinx/requirements.txt` file.
_(Simply pressing enter while hovering over the file from the dropdown is
insufficient, one has to select the radio button with the 'Space' key if
using the keyboard.)_
```rst
5. Build the docs
.. image:: image1.png
- Launch the default build Task using either:
- a hotkey _(default is 'Ctrl+Shift+B')_ or
- by issuing the `Tasks: Run Build Task` from the Command Palette.
```
6. Open the live preview
</td>
</tr>
<tr>
<td>Internal links</td>
<td>
- Navigate to the output of the site within VS Code, right-click on
`.vscode/build/html/index.html` and select `Open with Live Server`. The
contents should update on every rebuild without having to refresh the
browser.
```rst
1. Add a tag to the section you want to reference:
.. _my-section-tag: section-1
Section 1
==========
2. Link to your tag:
As shown in :ref:`section-1`.
```
</td>
</tr>
<tr>
<tr>
<td>Lists</td>
<td>
```rst
# Ordered (numbered) list item
* Unordered (bulleted) list item
```
</td>
</tr>
<tr>
<tr>
<td>Math (block)</td>
<td>
```rst
.. math::
A = \begin{pmatrix}
0.0 & 1.0 & 1.0 & 3.0 \\
4.0 & 5.0 & 6.0 & 7.0 \\
\end{pmatrix}
```
</td>
</tr>
<tr>
<td>Math (inline)</td>
<td>
```rst
:math:`2 \times 2 `
```
</td>
</tr>
<tr>
<td>Notes</td>
<td>
```rst
.. note::
My note here.
```
</td>
</tr>
<tr>
<td>Tables</td>
<td>
```rst
.. csv-table:: Optional title here
:widths: 30, 70 #optional column widths
:header: "entry1 header", "entry2 header"
"entry1", "entry2"
```
</td>
</tr>
</table>
## Language and style
We use the
[Google developer documentation style guide](https://developers.google.com/style/highlights) to
guide our content.
Font size and type, page layout, white space control, and other formatting
details are controlled via
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
of any formatting issues, create a pull request in our
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.
## Building our documentation
<!-- % TODO: Fix the link to be able to work at every files -->
To learn how to build our documentation, refer to
[Building documentation](./building.md).
<!-- markdownlint-restore -->

View File

@@ -1,40 +1,44 @@
# AMD ROCm Software
# AMD ROCm™ Platform
ROCm is an open-source stack, composed primarily of open-source software, designed for graphics
processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and
APIs that enable GPU programming from low-level kernel to end-user applications.
ROCm is an open-source stack for GPU computation. ROCm is primarily Open-Source
Software (OSS) that allows developers the freedom to customize and tailor their
GPU software for their own needs while collaborating with a community of other
developers, and helping each other find solutions in an agile, flexible, rapid
and secure manner.
With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
collaborate, test, and deploy your applications in a free, open source, integrated, and secure software
ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
artificial intelligence (AI), scientific computing, and computer aided design (CAD).
ROCm is a collection of drivers, development tools and APIs enabling GPU
programming from the low-level kernel to end-user applications. ROCm is powered
by AMDs Heterogeneous-computing Interface for Portability (HIP), an OSS C++ GPU
programming environment and its corresponding runtime. HIP allows ROCm
developers to create portable applications on different platforms by deploying
code on a range of platforms, from dedicated gaming GPUs to exascale HPC
clusters. ROCm supports programming models such as OpenMP and OpenCL, and
includes all the necessary OSS compilers, debuggers and libraries. ROCm is fully
integrated into ML frameworks such as PyTorch and TensorFlow. ROCm can be
deployed in many ways, including through the use of containers such as Docker,
Spack, and your own build from source.
ROCm is powered by AMDs
[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
an open-source software C++ GPU programming environment and its corresponding runtime. HIP
allows ROCm developers to create portable applications on different platforms by deploying code on a
range of platforms, from dedicated gaming GPUs to exascale HPC clusters.
ROCms goal is to allow our users to maximize their GPU hardware investment.
ROCm is designed to help develop, test and deploy GPU accelerated HPC, AI,
scientific computing, CAD, and other applications in a free, open-source,
integrated and secure software ecosystem.
ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary open
source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning
(ML) frameworks, such as PyTorch and TensorFlow.
This repository contains the manifest file for ROCm™ releases, changelogs, and
release information. The file default.xml contains information for all
repositories and the associated commit used to build the current ROCm release.
## ROCm documentation
The default.xml file uses the repo Manifest format.
This repository contains the manifest file for ROCm releases, changelogs, and release information.
The develop branch of this repository contains content for the next
ROCm release.
The `default.xml` file contains information for all repositories and the associated commit used to build
the current ROCm release; `default.xml` uses the Manifest Format repository.
## ROCm Documentation
Source code for our documentation is located in the `/docs` folder of most ROCm repositories. The
`develop` branch of our repositories contains content for the next ROCm release.
ROCm Documentation is available online at
[rocm.docs.amd.com](https://rocm.docs.amd.com). Source code for the documenation
is located in the docs folder of most repositories that are part of ROCm.
The ROCm documentation homepage is [rocm.docs.amd.com](https://rocm.docs.amd.com).
### Building our documentation
For a quick-start build, use the following code. For more options and detail, refer to
[Building documentation](./contribute/building.md).
### How to build documentation via Sphinx
```bash
cd docs
@@ -44,15 +48,7 @@ pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
Alternatively, CMake build is supported.
## Older ROCm™ Releases
```bash
cmake -B build
cmake --build build --target=doc
```
## Older ROCm releases
For release information for older ROCm releases, refer to the
For release information for older ROCm™ releases, refer to
[CHANGELOG](./CHANGELOG.md).

View File

@@ -1,248 +1,83 @@
# Release notes for AMD ROCm™ 6.0
# Release Notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->
ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library
support, and improved developer experience. This includes initial enablement of the AMD Instinct™
MI300 series.Future releases will further enable and optimize this new platform. Key features include:
<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
<!-- markdownlint-disable no-duplicate-header -->
<!-- markdownlint-disable no-blanks-blockquote -->
<!-- markdownlint-disable ul-indent -->
<!-- markdownlint-disable no-trailing-spaces -->
* Improved performance in areas like lower precision math and attention layers.
* New hipSPARSELt library accelerates AI workloads via AMD's sparse matrix core technique.
* Upstream support is now available for popular AI frameworks like TensorFlow, JAX, and PyTorch.
* New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.
* Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and
tutorials on the [AMD ROCm Docs](https://rocm.docs.amd.com) site.
* Consolidated developer resources and training on the new
[AMD ROCm Developer Hub](https://www.amd.com/en/developer/resources/rocm-hub.html).
<!-- spellcheck-disable -->
The following section provide a release overview for ROCm 6.0. For additional details, you can refer to
the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html). We list known
issues on [GitHub](https://github.com/ROCm/ROCm/issues).
The release notes for the ROCm platform.
## OS and GPU support changes
-------------------
ROCm 6.0 enables the use of MI300A and MI300X Accelerators with a limited operating systems
support. Future releases will add additional OS's to match our general offering.
## ROCm 5.4.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### What's New in This Release
| Operating Systems | MI300A | MI300X |
|:---:|:---:|:---:|
| Ubuntu 22.04.5 | Supported | Supported |
| RHEL 8.9 | Supported | |
| SLES15 SP5 | Supported | |
#### HIP Enhancements
For older generations of supported Instinct products we've added the following operating systems:
The ROCm v5.4.1 release consists of the following new HIP API:
* RHEL 9.3
* RHEL 8.9
##### New HIP API - hipLaunchHostFunc
Note: For ROCm 6.2 and beyond, we've planned for end-of-support (EoS) for the following operating
systems:
The following new HIP API is introduced in the ROCm v5.4.1 release.
* Ubuntu 20.04.5
* SLES 15 SP4
* RHEL/CentOS 7.9
> **Note**
>
> This is a pre-official version (beta) release of the new APIs.
## New ROCm meta package
```h
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
```
We've added a new ROCm meta package for easy installation of all ROCm core packages, tools, and
libraries. For example, the following command will install the full ROCm package: `apt-get install rocm`
(Ubuntu), or `yum install rocm` (RHEL).
This swaps the stream capture mode of a thread.
## Filesystem Hierarchy Standard
```text
@param [in] mode - Pointer to mode value to swap with the current mode
```
ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We've removed
the backward compatibility support for old file locations.
This parameter returns `#hipSuccess`, `#hipErrorInvalidValue`.
## Compiler location change
For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html.
* The installation path of LLVM has been changed from `/opt/rocm-<rel>/llvm` to
`/opt/rocm-<rel>/lib/llvm`. For backward compatibility, a symbolic link is provided to the old
location and will be removed in a future release.
* The installation path of the device library bitcode has changed from `/opt/rocm-<rel>/amdgcn` to
`/opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn`. For backward compatibility, a symbolic link
is provided and will be removed in a future release.
### Deprecations and Warnings
## Documentation
#### HIP Perl Scripts Deprecation
CMake support has been added for documentation in the
[ROCm repository](https://github.com/RadeonOpenCompute/ROCm).
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
## AMD Instinct™ MI50 end-of-support notice
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enters
maintenance mode in ROCm 6.0.
### IFWI Fixes
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 was the
final release for gfx906 GPUs in a fully supported state.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
AMD Instinct™ MI200 Firmware IFWI Maintenance Update #3
* Henceforth, no new features and performance optimizations will be supported for the gfx906 GPUs.
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2
2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
* Bug fixes will be made up to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906
GPUs until EOM.
This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.
## ROCm projects
After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures.
The following sections contains project-specific release notes for ROCm 6.0. For additional details, you
can refer to the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html).
In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool AMD FW FLASH 1.2.
### AMD SMI
| GPU | Production Part Number | SKU | IFWI Name |
|-------|------------|--------|---------------|
| MI210 | 113-D673XX | D67302 | D6730200V.110 |
| MI210 | 113-D673XX | D67301 | D6730100V.073 |
| MI250 | 113-D652XX | D65209 | D6520900.073 |
| MI250 | 113-D652XX | D65210 | D6521000.073 |
* **Integrated the E-SMI (EPYC-SMI) library**.
You can now query CPU-related information directly through AMD SMI. Metrics include power,
energy, performance, and other system details.
Instructions on how to download and apply MI200 maintenance updates are available at:
* **Added support for gfx942 metrics**.
You can now query MI300 device metrics to get real-time information. Metrics include power,
temperature, energy, and performance.
<https://www.amd.com/en/support/server-accelerators/amd-instinct/amd-instinct-mi-series/amd-instinct-mi210>
### HIP
#### AMD Instinct™ MI200 SRIOV Virtualization Support
* **New features to improve resource interoperability**.
* For external resource interoperability, we've added new structs and enums.
* We've added new members to HIP struct `hipDeviceProp_t` for surfaces, textures, and device
identifiers.
* **Changes impacting backward compatibility**.
There are several changes impacting backward compatibility: we changed some struct members and
some enum values, and removed some deprecated flags. For additional information, please refer to
the Changelog.
### hipCUB
* **Additional CUB API support**.
The hipCUB backend is updated to CUB and Thrust 2.1.
### HIPIFY
* **Enhanced CUDA2HIP document generation**.
API versions are now listed in the CUDA2HIP documentation. To see if the application binary
interface (ABI) has changed, refer to the
[*C* column](https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUDA_Runtime_API_functions_supported_by_HIP.html)
in our API documentation.
* **Hipified rocSPARSE**.
We've implemented support for the direct hipification of additional cuSPARSE APIs into rocSPARSE
APIs under the `--roc` option. This covers a major milestone in the roadmap towards complete
cuSPARSE-to-rocSPARSE hipification.
### hipRAND
* **Official release**.
hipRAND is now a *standalone project*--it's no longer available as a submodule for rocRAND.
### hipTensor
* **Added architecture support**.
We've added contraction support for gfx942 architectures, and f32 and f64 data
types.
* **Upgraded testing infrastructure**.
hipTensor will now support dynamic parameter configuration with input YAML config.
### MIGraphX
* **Added TorchMIGraphX**.
We introduced a Dynamo backend for Torch, which allows PyTorch to use MIGraphX directly
without first requiring a model to be converted to the ONNX model format. With a single line of
code, PyTorch users can utilize the performance and quantization benefits provided by MIGraphX.
* **Boosted overall performance with rocMLIR**.
We've integrated the rocMLIR library for ROCm-supported RDNA and CDNA GPUs. This
technology provides MLIR-based convolution and GEMM kernel generation.
* **Added INT8 support across the MIGraphX portfolio**.
We now support the INT8 data type. MIGraphX can perform the quantization or ingest
prequantized models. INT8 support extends to the MIGraphX execution provider for ONNX Runtime.
### ROCgdb
* **Added support for additional GPU architectures**.
* Navi 3 series: gfx1100, gfx1101, and gfx1102.
* MI300 series: gfx942.
### rocm-smi-lib
* **Improved accessibility to GPU partition nodes**.
You can now view, set, and reset the compute and memory partitions. You'll also get notifications of
a GPU busy state, which helps you avoid partition set or reset failure.
* **Upgraded GPU metrics version 1.4**.
The upgraded GPU metrics binary has an improved metric version format with a content version
appended to it. You can read each metric within the binary without the full `rsmi_gpu_metric_t` data
structure.
* **Updated GPU index sorting**.
We made GPU index sorting consistent with other ROCm software tools by optimizing it to use
`Bus:Device.Function` (BDF) instead of the card number.
### ROCm Compiler
* **Added kernel argument optimization on gfx942**.
With the new feature, you can preload kernel arguments into Scalar General-Purpose Registers
(SGPRs) rather than pass them in memory. This feature is enabled with a compiler option, which also
controls the number of arguments to pass in SGPRs. For more information, see:
[https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments](https://llvm.org/docs/AMDGPUUsage.html#preloaded-kernel-arguments)
* **Improved register allocation at -O0**.
We've improved the register allocator used at -O0 to avoid compiler crashes (when the signature is
'ran out of registers during register allocation').
* **Improved generation of debug information**.
We've improved compile time when generating debug information for certain corner cases. We've
also improved the compiler to eliminate compiler crashes when generating debug information.
### ROCmValidationSuite
* **Added GPU and operating system support**.
We added support for MI300X GPU in GPU Stress Test (GST).
### Roc Profiler
* **Added option to specify desired Roc Profiler version**.
You can now use rocProfV1 or rocProfV2 by specifying your desired version, as the legacy rocProf
(`rocprofv1`) provides the option to use the latest version (`rocprofv2`).
* **Automated the ISA dumping process by Advance Thread Tracer**.
Advance Thread Tracer (ATT) no longer depends on user-supplied Instruction Set Architecture (ISA)
and compilation process (using ``hipcc --save-temps``) to dump ISA from the running kernels.
* **Added ATT support for parallel kernels**.
The automatic ISA dumping process also helps ATT successfully parse multiple kernels running in
parallel, and provide cycle-accurate occupancy information for multiple kernels at the same time.
### ROCr
* **Support for SDMA link aggregation**.
If multiple XGMI links are available when making SDMA copies between GPUs, the copy is
distributed over multiple links to increase peak bandwidth.
### rocThrust
* **Added Thrust 2.1 API support**.
rocThrust backend is updated to Thrust and CUB 2.1.
### rocWMMA
* **Added new architecture support**.
We added support for gfx942 architectures.
* **Added data type support**.
We added support for f8, bf8, xf32 data types on supporting architectures, and for bf16 in the HIP RTC
environment.
* **Added support for the PyTorch kernel plugin**.
We added awareness of `__HIP_NO_HALF_CONVERSIONS__` to support PyTorch users.
### TransferBench
* **Improved ordering control**.
You can now set the thread block size (`BLOCK_SIZE`) and the thread block order (`BLOCK_ORDER`)
in which thread blocks from different transfers are run when using a single stream.
* **Added comprehensive reports**.
We modified individual transfers to report X Compute Clusters (XCC) ID when `SHOW_ITERATIONS`
is set to 1.
* **Improved accuracy in result validation**.
You can now validate results for each iteration instead of just once for all iterations.
Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices.

View File

@@ -1,101 +0,0 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ###########################
# ROCm dependencies
# ###########################
include(FetchContent)
if(ROCM_BUILD_DOCS)
find_package(ROCM 0.11.0 CONFIG QUIET PATHS "${ROCM_PATH}") # First version with Sphinx doc gen improvement
if(NOT ROCM_FOUND)
message(STATUS "ROCm CMake not found. Fetching...")
set(rocm_cmake_tag
"c044bb52ba85058d28afe2313be98d9fed02e293" # develop@2023.09.12. (move to 6.0 tag when released)
CACHE STRING "rocm-cmake tag to download")
FetchContent_Declare(
rocm-cmake
GIT_REPOSITORY https://github.com/RadeonOpenCompute/rocm-cmake.git
GIT_TAG ${rocm_cmake_tag}
SOURCE_SUBDIR "DISABLE ADDING TO BUILD" # We don't really want to consume the build and test targets of ROCm CMake.
)
FetchContent_MakeAvailable(rocm-cmake)
find_package(ROCM CONFIG REQUIRED NO_DEFAULT_PATH PATHS "${rocm-cmake_SOURCE_DIR}")
else()
find_package(ROCM 0.11.0 CONFIG REQUIRED PATHS "${ROCM_PATH}")
endif()
if(Python_FIND_VIRTUALENV STREQUAL "ONLY" AND NOT DEFINED ENV{VIRTUAL_ENV})
if(NOT EXISTS "${CMAKE_CURRENT_BINARY_DIR}/.venv")
message(STATUS "Python virtualenv use requested but not found. Fetching...")
find_program(BOOTSTRAP_PYTHON_EXE python3 REQUIRED)
execute_process(
COMMAND "${BOOTSTRAP_PYTHON_EXE}" -m pip install --user virtualenv
OUTPUT_QUIET
COMMAND_ERROR_IS_FATAL ANY
)
execute_process(
COMMAND "${BOOTSTRAP_PYTHON_EXE}" -m virtualenv "${CMAKE_CURRENT_BINARY_DIR}/.venv"
OUTPUT_QUIET
COMMAND_ERROR_IS_FATAL ANY
)
endif()
set(ENV{VIRTUAL_ENV} "${CMAKE_CURRENT_BINARY_DIR}/.venv")
if(WIN32)
set(ENV{PATH} "${CMAKE_CURRENT_BINARY_DIR}/.venv/Scripts;$ENV{PATH}")
else()
set(ENV{PATH} "${CMAKE_CURRENT_BINARY_DIR}/.venv/bin:$ENV{PATH}")
endif()
find_package(Python REQUIRED)
# TODO: shortcircuit if installed
execute_process(
COMMAND "${Python_EXECUTABLE}" -m pip install pip-tools
OUTPUT_QUIET
COMMAND_ERROR_IS_FATAL ANY
)
list(APPEND CMAKE_CONFIGURE_DEPENDS "${PROJECT_SOURCE_DIR}/docs/sphinx/requirements.in")
file(MAKE_DIRECTORY "$ENV{VIRTUAL_ENV}/usr/share/${PROJECT_NAME}")
if("${PROJECT_SOURCE_DIR}/docs/sphinx/requirements.in" IS_NEWER_THAN "$ENV{VIRTUAL_ENV}/usr/share/${PROJECT_NAME}/requirements.txt")
execute_process(
COMMAND "${Python_EXECUTABLE}" -m piptools compile
"${PROJECT_SOURCE_DIR}/docs/sphinx/requirements.in"
--output-file
"$ENV{VIRTUAL_ENV}/usr/share/${PROJECT_NAME}/requirements.txt"
OUTPUT_QUIET
ERROR_QUIET
COMMAND_ERROR_IS_FATAL ANY
)
execute_process(
COMMAND "${Python_EXECUTABLE}" -m piptools sync
"$ENV{VIRTUAL_ENV}/usr/share/${PROJECT_NAME}/requirements.txt"
OUTPUT_QUIET
ERROR_QUIET
COMMAND_ERROR_IS_FATAL ANY
)
endif()
endif()
endif()

View File

@@ -1,76 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
<remote name="rocm-org" fetch="https://github.com/ROCm/" />
<remote name="roc-github" fetch="https://github.com/RadeonOpenCompute/" />
<remote name="rocm-devtools" fetch="https://github.com/ROCm-Developer-Tools/" />
<remote name="rocm-swplat" fetch="https://github.com/ROCmSoftwarePlatform/" />
<remote name="gpuopen-libs" fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
<remote name="gpuopen-tools" fetch="https://github.com/GPUOpen-Tools/" />
<remote name="KhronosGroup" fetch="https://github.com/KhronosGroup/" />
<default revision="refs/tags/rocm-6.0.0"
remote="rocm-org"
<remote name="roc-github"
fetch="https://github.com/RadeonOpenCompute/" />
<remote name="rocm-devtools"
fetch="https://github.com/ROCm-Developer-Tools/" />
<remote name="rocm-swplat"
fetch="https://github.com/ROCmSoftwarePlatform/" />
<remote name="gpuopen-libs"
fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
<remote name="gpuopen-tools"
fetch="https://github.com/GPUOpen-Tools/" />
<remote name="KhronosGroup"
fetch="https://github.com/KhronosGroup/" />
<default revision="refs/tags/rocm-5.5.1"
remote="roc-github"
sync-c="true"
sync-j="4" />
<!--list of projects for ROCm-->
<!--list of projects for ROCM-->
<project name="ROCK-Kernel-Driver" />
<project name="ROCT-Thunk-Interface" />
<project name="ROCR-Runtime" />
<project name="amdsmi" />
<project name="rocm_smi_lib" />
<project name="rocm-core" />
<project name="rocm-cmake" />
<project name="rocminfo" />
<project name="rocm_bandwidth_test" />
<project name="rocprofiler" />
<project name="roctracer" />
<project name="rocprofiler" remote="rocm-devtools" />
<project name="roctracer" remote="rocm-devtools" />
<project name="ROCm-OpenCL-Runtime" />
<project path="ROCm-OpenCL-Runtime/api/opencl/khronos/icd" name="OpenCL-ICD-Loader" remote="KhronosGroup" revision="6c03f8b58fafd9dd693eaac826749a5cfad515f8" />
<project name="clang-ocl" />
<project name="rdc" />
<!--HIP Projects-->
<project name="HIP" />
<project name="HIP-Examples" />
<project name="clr" />
<project name="hipother" />
<project name="HIPIFY" />
<project name="HIPCC" />
<project name="HIP" remote="rocm-devtools" />
<project name="hipamd" remote="rocm-devtools" />
<project name="HIP-Examples" remote="rocm-devtools" />
<project name="ROCclr" remote="rocm-devtools" />
<project name="HIPIFY" remote="rocm-devtools" />
<project name="HIPCC" remote="rocm-devtools" />
<!-- The following projects are all associated with the AMDGPU LLVM compiler -->
<project name="llvm-project" />
<project name="ROCm-Device-Libs" />
<project name="atmi" />
<project name="ROCm-CompilerSupport" />
<project name="half" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
<project name="rocr_debug_agent" remote="rocm-devtools" />
<project name="rocm_bandwidth_test" />
<project name="half" remote="rocm-swplat" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
<project name="RCP" remote="gpuopen-tools" revision="3a49405a1500067c49d181844ec90aea606055bb" />
<!-- gdb projects -->
<project name="ROCgdb" />
<project name="ROCdbgapi" />
<project name="rocr_debug_agent" />
<project name="ROCgdb" remote="rocm-devtools" />
<project name="ROCdbgapi" remote="rocm-devtools" />
<!-- ROCm Libraries -->
<project groups="mathlibs" name="rocBLAS" />
<project groups="mathlibs" name="Tensile" />
<project groups="mathlibs" name="hipTensor" />
<project groups="mathlibs" name="hipBLAS" />
<project groups="mathlibs" name="rocFFT" />
<project groups="mathlibs" name="hipFFT" />
<project groups="mathlibs" name="rocRAND" />
<project groups="mathlibs" name="hipRAND" />
<project groups="mathlibs" name="rocSPARSE" />
<project groups="mathlibs" name="hipSPARSELt" />
<project groups="mathlibs" name="rocSOLVER" />
<project groups="mathlibs" name="hipSOLVER" />
<project groups="mathlibs" name="hipSPARSE" />
<project groups="mathlibs" name="rocALUTION" />
<project groups="mathlibs" name="rocThrust" />
<project groups="mathlibs" name="hipCUB" />
<project groups="mathlibs" name="rocPRIM" />
<project groups="mathlibs" name="rocWMMA" />
<project groups="mathlibs" name="rccl" />
<project name="MIOpen" />
<project name="composable_kernel" />
<project name="MIVisionX" />
<project name="rpp" />
<project name="hipfort" />
<project name="AMDMIGraphX" />
<project name="ROCmValidationSuite" />
<project name="rdc" />
<project groups="mathlibs" name="rocBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="Tensile" remote="rocm-swplat" />
<project groups="mathlibs" name="hipBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="rocFFT" remote="rocm-swplat" />
<project groups="mathlibs" name="hipFFT" remote="rocm-swplat" />
<project groups="mathlibs" name="rocRAND" remote="rocm-swplat" />
<project groups="mathlibs" name="rocSPARSE" remote="rocm-swplat" />
<project groups="mathlibs" name="rocSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSPARSE" remote="rocm-swplat" />
<project groups="mathlibs" name="rocALUTION" remote="rocm-swplat" />
<project name="MIOpenGEMM" remote="rocm-swplat" />
<project name="MIOpen" remote="rocm-swplat" />
<project groups="mathlibs" name="rccl" remote="rocm-swplat" />
<project name="MIVisionX" remote="gpuopen-libs" />
<project groups="mathlibs" name="rocThrust" remote="rocm-swplat" />
<project groups="mathlibs" name="hipCUB" remote="rocm-swplat" />
<project groups="mathlibs" name="rocPRIM" remote="rocm-swplat" />
<project groups="mathlibs" name="rocWMMA" remote="rocm-swplat" />
<project name="hipfort" remote="rocm-swplat" />
<project name="AMDMIGraphX" remote="rocm-swplat" />
<project name="ROCmValidationSuite" remote="rocm-devtools" />
<!-- Projects for OpenMP-Extras -->
<project name="aomp" path="openmp-extras/aomp" />
<project name="aomp-extras" path="openmp-extras/aomp-extras" />
<project name="flang" path="openmp-extras/flang" />
<project name="aomp" path="openmp-extras/aomp" remote="rocm-devtools" />
<project name="aomp-extras" path="openmp-extras/aomp-extras" remote="rocm-devtools" />
<project name="flang" path="openmp-extras/flang" remote="rocm-devtools" />
</manifest>

6
docs/404.md Normal file
View File

@@ -0,0 +1,6 @@
# 404 Page Not Found
Page could not be found.
Return to [home](./index) or please use the links from the sidebar to find what
you are looking for.

View File

@@ -1,33 +0,0 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
include(ROCMSphinxDoc)
rocm_add_sphinx_doc(
"${CMAKE_CURRENT_SOURCE_DIR}"
OUTPUT_DIR html
BUILDER html
)
install(
DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/html"
DESTINATION "${CMAKE_INSTALL_DOCDIR}")

74
docs/about.md Normal file
View File

@@ -0,0 +1,74 @@
# About ROCm Documentation
ROCm documentation is made available under open source [licenses](licensing.md).
Documentation is built using open source toolchains. Contributions to our
documentation is encouraged and welcome. As a contributor, please familiarize
yourself with our documentation toolchain.
## ReadTheDocs
[ReadTheDocs](https://docs.readthedocs.io/en/stable/) is our front end for the
our documentation. By front end, this is the tool that serves our HTML based
documentation to our end users.
## Doxygen
[Doxygen](https://www.doxygen.nl/) is the most common inline code documentation
standard. ROCm projects are use Doxygen for public API documentation (unless the
upstream project is using a different tool).
## Sphinx
[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator
originally used for python. It is now widely used in the Open Source community.
Originally, sphinx supported RST based documentation. Markdown support is now
available. ROCm documentation plans to default to markdown for new projects.
Existing projects using RST are under no obligation to convert to markdown. New
projects that believe markdown is not suitable should contact the documentation
team prior to selecting RST.
### MyST
[Markedly Structured Text (MyST)](https://myst-tools.org/docs/spec) is an extended
flavor of Markdown ([CommonMark](https://commonmark.org/)) influenced by reStructuredText (RST) and Sphinx.
It is integrated via [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
A cheat sheet that showcases how to use the MyST syntax is available over at [the Jupyter
reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).
### Sphinx Theme
ROCm is using the
[Sphinx Book Theme](https://sphinx-book-theme.readthedocs.io/en/latest/). This
theme is used by Jupyter books. ROCm documentation applies some customization
include a header and footer on top of the Sphinx Book Theme. A future custom
ROCm theme will be part of our documentation goals.
### Sphinx Design
Sphinx Design is an extension for sphinx based websites that add design
functionality. Please see the documentation
[here](https://sphinx-design.readthedocs.io/en/latest/index.html). ROCm
documentation uses sphinx design for grids, cards, and synchronized tabs.
Other features may be used in the future.
### Sphinx External TOC
ROCm uses the
[sphinx-external-toc](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
for our navigation. This tool allows a YAML file based left navigation menu. This
tool was selected due to its flexibility that allows scripts to operate on the
YAML file. Please transition to this file for the project's navigation. You can
see the `_toc.yml.in` file in this repository in the docs/sphinx folder for an
example.
### Breathe
Sphinx uses [Breathe](https://www.breathe-doc.org/) to integrate Doxygen
content.
## `rocm-docs-core` pip package
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) is an AMD
maintained project that applies customization for our documentation. This
project is the tool most ROCm repositories will use as part of the documentation
build.

View File

@@ -1,122 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Linux GPU and OS support">
<meta name="keywords" content="Linux support, ROCm distributions">
</head>
# GPU and OS support (Linux)
(linux-support)=
## Supported Linux distributions
AMD ROCm™ Platform supports the following Linux distributions.
::::{tab-set}
:::{tab-item} Supported
| Distribution | Processor Architectures | Validated Kernel | Support |
| :----------- | :---------------------: | :--------------: | ------: |
| RHEL 9.2 | x86-64 | 5.14 (5.14.0-284.11.1.el9_2.x86_64) | ✅ |
| RHEL 9.1 | x86-64 | 5.14.0-284.11.1.el9_2.x86_64 | ✅ |
| RHEL 8.8 | x86-64 | 4.18.0-477.el8.x86_64 | ✅ |
| RHEL 8.7 | x86-64 | 4.18.0-425.10.1.el8_7.x86_64 | ✅ |
| SLES 15 SP5 | x86-64 | 5.14.21-150500.53-default | ✅ |
| SLES 15 SP4 | x86-64 | 5.14.21-150400.24.63-default | ✅ |
| Ubuntu 22.04.2 | x86-64 | 5.19.0-45-generic | ✅ |
| Ubuntu 20.04.5 | x86-64 | 5.15.0-75-generic | ✅ |
:::{versionadded} 5.6
* RHEL 8.8 and 9.2 support is added.
* SLES 15 SP5 support is added
:::
:::{tab-item} Unsupported
| Distribution | Processor Architectures | Validated Kernel | Support |
| :----------- | :---------------------: | :--------------: | ------: |
| RHEL 9.0 | x86-64 | 5.14 | ❌ |
| RHEL 8.6 | x86-64 | 5.14 | ❌ |
| SLES 15 SP3 | x86-64 | 5.3 | ❌ |
| Ubuntu 22.04.0 | x86-64 | 5.15 LTS, 5.17 OEM | ❌ |
| Ubuntu 20.04.4 | x86-64 | 5.13 HWE, 5.13 OEM | ❌ |
| Ubuntu 22.04.1 | x86-64 | 5.15 LTS | ❌ |
:::
::::
✅: **Supported** - AMD performs full testing of all ROCm components on distro
GA image.
❌: **Unsupported** - AMD no longer performs builds and testing on these
previously supported distro GA images.
## Virtualization support
ROCm supports virtualization for select GPUs only as shown below.
| Hypervisor | Version | GPU | Validated Guest OS (validated kernel) |
|----------------|----------|-------|----------------------------------------------------------------------------------|
| VMWare | ESXi 8 | MI250 | Ubuntu 20.04 (`5.15.0-56-generic`) |
| VMWare | ESXi 8 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
| VMWare | ESXi 7 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
## Linux-supported GPUs
The table below shows supported GPUs for Instinct™, Radeon Pro™ and Radeon™
GPUs. Please click the tabs below to switch between GPU product lines. If a GPU
is not listed on this table, the GPU is not officially supported by AMD.
:::::{tab-set}
::::{tab-item} AMD Instinct™
:sync: instinct
| Product Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) |Support |
|:------------:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Instinct™ MI250X | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI250 | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI210 | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI100 | CDNA | gfx908 | ✅ |
| AMD Instinct™ MI50 | GCN5.1 | gfx906 | ✅ |
| AMD Instinct™ MI25 | GCN5.0 | gfx900 | ❌ |
::::
::::{tab-item} Radeon Pro™
:sync: radeonpro
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ Pro W7900 | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ Pro W6800 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro V620 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro VII | GCN5.1 | gfx906 | ✅ |
::::
::::{tab-item} Radeon™
:sync: radeonpro
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:---------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ VII | GCN5.1 | gfx906 | ✅ |
::::
:::::
### Support status
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.
## CPU support
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.

View File

@@ -1,86 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Windows GPU and OS support">
<meta name="keywords" content="Windows support, ROCm distributions">
</head>
# GPU and OS support (Windows)
(windows-support)=
## Supported SKUs
AMD HIP SDK supports the following Windows variants.
| Distribution |Processor Architectures| Validated update |
|---------------------|-----------------------|--------------------|
| Windows 10 | x86-64 | 22H2 (GA) |
| Windows 11 | x86-64 | 22H2 (GA) |
| Windows Server 2022 | x86-64 | |
## Windows-supported GPUs
The table below shows supported GPUs for Radeon Pro™ and Radeon™ GPUs. Please
click the tabs below to switch between GPU product lines. If a GPU is not listed
on this table, the GPU is not officially supported by AMD.
::::{tab-set}
:::{tab-item} Radeon Pro™
:sync: radeonpro
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
| AMD Radeon Pro™ W7900 | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon Pro™ W7800 | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon Pro™ W6800 | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon Pro™ W6600 | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon Pro™ W5500 | RDNA1 | gfx1012 | ❌ | ❌ |
| AMD Radeon Pro™ VII | GCN5.1 | gfx906 | ❌ | ❌ |
:::
:::{tab-item} Radeon™
:sync: radeon
| Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7900 XT | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7600 | RDNA3 | gfx1102 | ✅ | ✅ |
| AMD Radeon™ RX 6950 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6900 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6750 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6650 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 | RDNA2 | gfx1032 | ✅ | ❌ |
:::
::::
### Component support
ROCm components are described in [What is ROCm?](../../what-is-rocm.md) Support
on Windows is provided with two levels on enablement.
* **Runtime**: Runtime enables the use of the HIP and OpenCL runtimes only.
* **HIP SDK**: Runtime plus additional components are listed in [Libraries](../../reference/library-index.md).
Note that some math libraries are Linux exclusive.
### Support status
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.
## CPU support
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.

View File

@@ -1,9 +0,0 @@
# License
> Note: This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that primarily contains documentation. For other licensing information, refer to the [Licensing Terms page](./licensing).
```{include} ../../LICENSE
```
```{include} ./licensing.md
```

View File

@@ -1,133 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm licensing terms">
<meta name="keywords" content="license, licensing terms">
</head>
# ROCm licensing terms
ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately.
The following table is a list of ROCm components with links to their respective license
terms. These components may include third party components subject to
additional licenses. Please review individual repositories for more information.
The table shows ROCm components, the name of license, and link to the license terms.
The table is ordered to follow the ROCm manifest file.
<!-- spellcheck-disable -->
| Component | License |
|:---------------------|:-------------------------|
| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | [MIT](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/LICENSE) |
| [HIPCC](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) | [MIT](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) |
| [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/) | [MIT](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/LICENSE.txt) |
| [HIP](https://github.com/ROCm-Developer-Tools/HIP/) | [MIT](https://github.com/ROCm-Developer-Tools/HIP/blob/develop/LICENSE.txt) |
| [MIOpenGEMM](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpenGEMM/blob/master/LICENSE.txt) |
| [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen/) | [MIT](https://github.com/ROCmSoftwarePlatform/MIOpen/blob/master/LICENSE.txt) |
| [MIVisionX](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/) | [MIT](https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/blob/master/LICENSE.txt) |
| [RCP](https://github.com/GPUOpen-Tools/radeon_compute_profiler/) | [MIT](https://github.com/GPUOpen-Tools/radeon_compute_profiler/blob/master/LICENSE) |
| [ROCK-Kernel-Driver](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/) | [GPL 2.0 WITH Linux-syscall-note](https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/master/COPYING) |
| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/LICENSE.txt) |
| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | [MIT](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/master/LICENSE.md) |
| [ROCclr](https://github.com/ROCm-Developer-Tools/ROCclr/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCclr/blob/develop/LICENSE.txt) |
| [ROCdbgapi](https://github.com/ROCm-Developer-Tools/ROCdbgapi/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCdbgapi/blob/amd-master/LICENSE.txt) |
| [ROCgdb](https://github.com/ROCm-Developer-Tools/ROCgdb/) | [GNU General Public License v2.0](https://github.com/ROCm-Developer-Tools/ROCgdb/blob/amd-master/COPYING) |
| [ROCm-CompilerSupport](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/blob/amd-stg-open/LICENSE.txt) |
| [ROCm-Device-Libs](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/amd-stg-open/LICENSE.TXT) |
| [ROCm-OpenCL-Runtime/api/opencl/khronos/icd](https://github.com/KhronosGroup/OpenCL-ICD-Loader/) | [Apache 2.0](https://github.com/KhronosGroup/OpenCL-ICD-Loader/blob/main/LICENSE) |
| [ROCm-OpenCL-Runtime](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/) | [MIT](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/develop/LICENSE.txt) |
| [ROCmValidationSuite](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/) | [MIT](https://github.com/ROCm-Developer-Tools/ROCmValidationSuite/blob/master/LICENSE) |
| [Tensile](https://github.com/ROCmSoftwarePlatform/Tensile/) | [MIT](https://github.com/ROCmSoftwarePlatform/Tensile/blob/develop/LICENSE.md) |
| [aomp-extras](https://github.com/ROCm-Developer-Tools/aomp-extras/) | [MIT](https://github.com/ROCm-Developer-Tools/aomp-extras/blob/aomp-dev/LICENSE) |
| [aomp](https://github.com/ROCm-Developer-Tools/aomp/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/LICENSE) |
| [atmi](https://github.com/RadeonOpenCompute/atmi/) | [MIT](https://github.com/RadeonOpenCompute/atmi/blob/master/LICENSE.txt) |
| [clang-ocl](https://github.com/RadeonOpenCompute/clang-ocl/) | [MIT](https://github.com/RadeonOpenCompute/clang-ocl/blob/master/LICENSE) |
| [flang](https://github.com/ROCm-Developer-Tools/flang/) | [Apache 2.0](https://github.com/ROCm-Developer-Tools/flang/blob/master/LICENSE.txt) |
| [half](https://github.com/ROCmSoftwarePlatform/half/) | [MIT](https://github.com/ROCmSoftwarePlatform/half/blob/master/LICENSE.txt) |
| [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipBLAS/blob/develop/LICENSE.md) |
| [hipCUB](https://github.com/ROCmSoftwarePlatform/hipCUB/) | [Custom](https://github.com/ROCmSoftwarePlatform/hipCUB/blob/develop/LICENSE.txt) |
| [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipFFT/blob/develop/LICENSE.md) |
| [hipSOLVER](https://github.com/ROCmSoftwarePlatform/hipSOLVER/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSOLVER/blob/develop/LICENSE.md) |
| [hipSPARSELt](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSELt/blob/develop/LICENSE.md) |
| [hipSPARSE](https://github.com/ROCmSoftwarePlatform/hipSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipSPARSE/blob/develop/LICENSE.md) |
| [hipTensor](https://github.com/ROCmSoftwarePlatform/hipTensor) | [MIT](https://github.com/ROCmSoftwarePlatform/hipTensor/blob/develop/LICENSE) |
| [hipamd](https://github.com/ROCm-Developer-Tools/hipamd/) | [MIT](https://github.com/ROCm-Developer-Tools/hipamd/blob/develop/LICENSE.txt) |
| [hipfort](https://github.com/ROCmSoftwarePlatform/hipfort/) | [MIT](https://github.com/ROCmSoftwarePlatform/hipfort/blob/master/LICENSE) |
| [llvm-project](https://github.com/ROCm-Developer-Tools/llvm-project/) | [Apache](https://github.com/ROCm-Developer-Tools/llvm-project/blob/main/LICENSE.TXT) |
| [rccl](https://github.com/ROCmSoftwarePlatform/rccl/) | [Custom](https://github.com/ROCmSoftwarePlatform/rccl/blob/develop/LICENSE.txt) |
| [rdc](https://github.com/RadeonOpenCompute/rdc/) | [MIT](https://github.com/RadeonOpenCompute/rdc/blob/master/LICENSE) |
| [rocALUTION](https://github.com/ROCmSoftwarePlatform/rocALUTION/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocALUTION/blob/develop/LICENSE.md) |
| [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/develop/LICENSE.md) |
| [rocFFT](https://github.com/ROCmSoftwarePlatform/rocFFT/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocFFT/blob/develop/LICENSE.md) |
| [rocPRIM](https://github.com/ROCmSoftwarePlatform/rocPRIM/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocPRIM/blob/develop/LICENSE.txt) |
| [rocRAND](https://github.com/ROCmSoftwarePlatform/rocRAND/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocRAND/blob/develop/LICENSE.txt) |
| [rocSOLVER](https://github.com/ROCmSoftwarePlatform/rocSOLVER/) | [BSD-2-Clause](https://github.com/ROCmSoftwarePlatform/rocSOLVER/blob/develop/LICENSE.md) |
| [rocSPARSE](https://github.com/ROCmSoftwarePlatform/rocSPARSE/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocSPARSE/blob/develop/LICENSE.md) |
| [rocThrust](https://github.com/ROCmSoftwarePlatform/rocThrust/) | [Apache 2.0](https://github.com/ROCmSoftwarePlatform/rocThrust/blob/develop/LICENSE) |
| [rocWMMA](https://github.com/ROCmSoftwarePlatform/rocWMMA/) | [MIT](https://github.com/ROCmSoftwarePlatform/rocWMMA/blob/develop/LICENSE.md) |
| [rocm-cmake](https://github.com/RadeonOpenCompute/rocm-cmake/) | [MIT](https://github.com/RadeonOpenCompute/rocm-cmake/blob/develop/LICENSE) |
| [rocm_bandwidth_test](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_bandwidth_test/blob/master/LICENSE.txt) |
| [rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocm_smi_lib/blob/master/License.txt) |
| [rocminfo](https://github.com/RadeonOpenCompute/rocminfo/) | [The University of Illinois/NCSA](https://github.com/RadeonOpenCompute/rocminfo/blob/master/License.txt) |
| [rocprofiler](https://github.com/ROCm-Developer-Tools/rocprofiler/) | [MIT](https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/LICENSE) |
| [rocr_debug_agent](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/) | [The University of Illinois/NCSA](https://github.com/ROCm-Developer-Tools/rocr_debug_agent/blob/master/LICENSE.txt) |
| [roctracer](https://github.com/ROCm-Developer-Tools/roctracer/) | [MIT](https://github.com/ROCm-Developer-Tools/roctracer/blob/amd-master/LICENSE) |
| rocm-llvm-alt | [AMD Proprietary License](https://www.amd.com/en/support/amd-software-eula)
Open sourced ROCm components are released via public GitHub
repositories, packages on https://repo.radeon.com and other distribution channels.
Proprietary products are only available on https://repo.radeon.com. Currently, only
one component of ROCm, rocm-llvm-alt is governed by a proprietary license.
Proprietary components are organized in a proprietary subdirectory in the package
repositories to distinguish from open sourced packages.
The additional terms and conditions below apply to your use of ROCm technical
documentation.
©2023 Advanced Micro Devices, Inc. All rights reserved.
The information presented in this document is for informational purposes only
and may contain technical inaccuracies, omissions, and typographical errors. The
information contained herein is subject to change and may be rendered inaccurate
for many reasons, including but not limited to product and roadmap changes,
component and motherboard version changes, new model and/or product releases,
product differences between differing manufacturers, software changes, BIOS
flashes, firmware upgrades, or the like. Any computer system has risks of
security vulnerabilities that cannot be completely prevented or mitigated. AMD
assumes no obligation to update or otherwise correct or revise this information.
However, AMD reserves the right to revise this information and to make changes
from time to time to the content hereof without obligation of AMD to notify any
person of such revisions or changes.
THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES
WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD
SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER
CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN,
EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD Arrow logo, ROCm, and combinations thereof are trademarks of
Advanced Micro Devices, Inc. Other product names used in this publication are
for identification purposes only and may be trademarks of their respective
companies.
## Package licensing
```{attention}
AQL Profiler and AOCC CPU optimization are both provided in binary form, each
subject to the license agreement enclosed in the directory for the binary and is
available here: `/opt/rocm/share/doc/rocm-llvm-alt/EULA`. By using, installing,
copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to
the terms and conditions of this license agreement. If you do not agree to the
terms of this agreement, do not install, copy or use the AQL Profiler and/or the
AOCC CPU Optimizations.
```
For the rest of the ROCm packages, you can find the licensing information at the
following location: `/opt/rocm/share/doc/<component-name>/`
For example, you can fetch the licensing information of the `_amd_comgr_`
component (Code Object Manager) from the `amd_comgr` folder. A file named
`LICENSE.txt` contains the license details at:
`/opt/rocm-5.4.3/share/doc/amd_comgr/LICENSE.txt`

View File

@@ -1,157 +0,0 @@
.. meta::
:description: How ROCm uses PCIe atomics
:keywords: PCIe, PCIe atomics, atomics, BAR memory
*****************************************************************************
How ROCm uses PCIe atomics
*****************************************************************************
ROCm PCIe feature and overview of BAR memory
================================================================
ROCm is an extension of HSA platform architecture, so it shares the queuing model, memory model,
signaling and synchronization protocols. Platform atomics are integral to perform queuing and
signaling memory operations where there may be multiple-writers across CPU and GPU agents.
The full list of HSA system architecture platform requirements are here:
`HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
AMD ROCm Software uses the new PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe]
3.0) features for atomic read-modify-write transactions which extends inter-processor synchronization
mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling
memory operations.
The new PCIe atomic operations operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``,
``SWAP`` atomics. The atomic operations are initiated by the I/O device which support 32-bit, 64-bit and
128-bit operand which target address have to be naturally aligned to operation sizes.
For ROCm the Platform atomics are used in ROCm in the following ways:
* Update HSA queue's read_dispatch_id: 64 bit atomic add used by the command processor on the
GPU agent to update the packet ID it processed.
* Update HSA queue's write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to
support multi-writer queue insertions.
* Update HSA Signals -- 64bit atomic ops are used for CPU & GPU synchronization.
The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through
and completed by PCIe components. Routing and completion does not require software support.
Component support for each is detectable via the Device Capabilities 2 (DevCap2) register. Upstream
bridges need to have atomic operations routing enabled or the atomic operations will fail even though
PCIe endpoint and PCIe I/O devices has the capability to atomic operations.
To do atomic operations routing capability between two or more Root Ports, each associated Root Port
must indicate that capability via the atomic operations routing supported bit in the DevCap2 register.
If your system has a PCIe Express Switch it needs to support atomic operations routing. Atomic
operations requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
field is set. These requests can only be serviced if the upstream components support atomic operation
completion and/or routing to a component which does. Atomic operations routing support=1, routing
is supported; atomic operations routing support=0, routing is not supported.
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there
must be a response for Completion containing the result of the operation. Errors associated with the
operation (uncorrectable error accessing the target location or carrying out the atomic operation) are
signaled to the requester by setting the Completion Status field in the completion descriptor, they are
set to to Completer Abort (CA) or Unsupported Request (UR).
To understand more about how PCIe atomic operations work, see
`PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
There are also a number of papers which talk about these new capabilities:
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
* `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
Other I/O devices with PCIe atomics support
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
Future bus technology with richer I/O atomics operation Support
* GenZ
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
with PCIe Generation 3.0 support.
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
originates two writes to two different targets:
* Write to another GPU memory
* Write to system memory to indicate transfer complete
They are routed off to different ends of the computer but we want to make sure the write to system
memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.
BAR memory overview
----------------------------------------------------------------------------------------------------
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set
memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) in the BIOS.
In the Supermicro system in the system bios you need to see the following
* Advanced->PCIe/PCI/PnP configuration-\> Above 4G Decoding = Enabled
* Advanced->PCIe/PCI/PnP Configuration-\>MMIOH Base = 512G
* Advanced->PCIe/PCI/PnP Configuration-\>MMIO High Size = 256G
When we support Large Bar Capability there is a Large Bar VBIOS which also disable the IO bar.
For GFX9 and Vega10 which have Physical Address up 44 bit and 48 bit Virtual address.
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must
be placed < 2^44 to support P2P access from other Vega10.
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed \< 2^44 to support P2P access from
other Vega10.
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed \< 4GB.
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
Series] (rev c1)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
Flags: bus master, fast devsel, latency 0, IRQ 119
Memory at bf40000000 (64-bit, prefetchable) [size=256M]
Memory at bf50000000 (64-bit, prefetchable) [size=2M]
I/O ports at 3000 [size=256]
Memory at c7400000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at c7440000 [disabled] [size=128K]
Legend:
1 : GPU Frame Buffer BAR -- In this example it happens to be 256M, but typically this will be size of the
GPU memory (typically 4GB+). This BAR has to be placed \< 2^40 to allow peer-to-peer access from
other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed \< 2^44 to allow peer-to-peer
access from other GFX9 AMD GPUs.
2 : Doorbell BAR -- The size of the BAR is typically will be \< 10MB (currently fixed at 2MB) for this
generation GPUs. This BAR has to be placed \< 2^40 to allow peer-to-peer access from other current
generation AMD GPUs.
3 : IO BAR -- This is for legacy VGA and boot device support, but since this the GPUs in this project are
not VGA devices (headless), this is not a concern even if the SBIOS does not setup.
4 : MMIO BAR -- This is required for the AMD Driver SW to access the configuration registers. Since the
reminder of the BAR available is only 1 DWORD (32bit), this is placed \< 4GB. This is fixed at 256KB.
5 : Expansion ROM -- This is required for the AMD Driver SW to access the GPU video-bios. This is
currently fixed at 128KB.
For more information, you can review
`Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.

View File

@@ -1,58 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="GPU architecture">
<meta name="keywords" content="GPU architecture, architecture support, MI200, MI250, RDNA,
MI100, AMD Instinct">
</head>
# GPU architecture documentation
:::::{grid} 1 1 2 2
:gutter: 1
:::{grid-item-card}
**AMD Instinct MI200 series**
Review hardware aspects of the AMD Instinct™ MI200 series of GPU
accelerators and the CDNA™ 2 architecture.
* [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md)
* [AMD Instinct MI200/CDNA2 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
* [White paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
* [Performance counters](./gpu-arch/mi200-performance-counters.md)
:::
:::{grid-item-card}
**AMD Instinct MI100**
Review hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.
* [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md)
* [AMD Instinct MI100/CDNA1 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
* [White paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)
:::
:::{grid-item-card}
**RDNA**
* [AMD RDNA3 ISA](https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf)
* [AMD RDNA2 ISA](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf)
* [AMD RDNA ISA](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf)
* [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)
:::
:::{grid-item-card}
**Older architectures**
* [AMD Instinct MI50/Vega 7nm ISA](https://www.amd.com/system/files/TechDocs/vega-7nm-shader-instruction-set-architecture.pdf)
* [AMD Instinct MI25/Vega ISA](https://www.amd.com/system/files/TechDocs/vega-shader-instruction-set-architecture.pdf)
* [AMD GCN3 ISA](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf)
* [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)
:::
:::::

View File

@@ -1,578 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="MI200 performance counters and metrics">
<meta name="keywords" content="MI200, performance counters, counters, GRBM counters, GRBM,
CPF counters, CPF, CPC counters, CPC, command processor counters, SPI counters, SPI">
</head>
# MI200 performance counters and metrics
<!-- markdownlint-disable no-duplicate-header -->
This document lists and describes the hardware performance counters and derived metrics available on the AMD Instinct™ MI200 GPU. All the hardware basic counters and derived metrics are accessible via {doc}`ROCProfiler tool <rocprofiler:rocprofv1>`.
## MI200 performance counters list
See the category-wise listing of MI200 performance counters in the following tables.
```{note}
Preliminary validation of all MI200 performance counters is in progress. Those with “*” appended to the names require further evaluation.
```
### Graphics Register Bus Management (GRBM) counters
| Hardware Counter | Unit | Definition |
|:--------------------|:--------|:--------------------------------------------------------------------------|
| `GRBM_COUNT` | Cycles | Number of free-running GPU cycles |
| `GRBM_GUI_ACTIVE` | Cycles | Number of GPU active cycles |
| `GRBM_CP_BUSY` | Cycles | Number of cycles any of the Command Processor (CP) blocks are busy |
| `GRBM_SPI_BUSY` | Cycles | Number of cycles any of the Shader Processor Input (SPI) are busy in the shader engine(s) |
| `GRBM_TA_BUSY` | Cycles | Number of cycles any of the Texture Addressing Unit (TA) are busy in the shader engine(s) |
| `GRBM_TC_BUSY` | Cycles | Number of cycles any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy |
| `GRBM_CPC_BUSY` | Cycles | Number of cycles the Command Processor - Compute (CPC) is busy |
| `GRBM_CPF_BUSY` | Cycles | Number of cycles the Command Processor - Fetcher (CPF) is busy |
| `GRBM_UTCL2_BUSY` | Cycles | Number of cycles the Unified Translation Cache - Level 2 (UTCL2) block is busy |
| `GRBM_EA_BUSY` | Cycles | Number of cycles the Efficiency Arbiter (EA) block is busy |
### Command Processor (CP) counters
The CP counters are further classified into CP-Fetcher (CPF) and CP-Compute (CPC).
#### CPF counters
| Hardware Counter | Unit | Definition |
|:--------------------------------------|:--------|:-------------------------------------------------------------|
| `CPF_CMP_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the Compute UTCL1s is stalled waiting on translation |
| `CPF_CPF_STAT_BUSY` | Cycles | Number of cycles CPF is busy |
| `CPF_CPF_STAT_IDLE*` | Cycles | Number of cycles CPF is idle |
| `CPF_CPF_STAT_STALL` | Cycles | Number of cycles CPF is stalled |
| `CPF_CPF_TCIU_BUSY` | Cycles | Number of cycles CPF Texture Cache Interface Unit (TCIU) interface is busy |
| `CPF_CPF_TCIU_IDLE` | Cycles | Number of cycles CPF TCIU interface is idle |
| `CPF_CPF_TCIU_STALL*` | Cycles | Number of cycles CPF TCIU interface is stalled waiting on free tags |
#### CPC counters
| Hardware Counter | Unit | Definition |
|:---------------------------------|:-------|:---------------------------------------------------|
| `CPC_ME1_BUSY_FOR_PACKET_DECODE` | Cycles | Number of cycles CPC Micro Engine (ME1) is busy decoding packets |
| `CPC_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the UTCL1s is stalled waiting on translation |
| `CPC_CPC_STAT_BUSY` | Cycles | Number of cycles CPC is busy |
| `CPC_CPC_STAT_IDLE` | Cycles | Number of cycles CPC is idle |
| `CPC_CPC_STAT_STALL` | Cycles | Number of cycles CPC is stalled |
| `CPC_CPC_TCIU_BUSY` | Cycles | Number of cycles CPC TCIU interface is busy |
| `CPC_CPC_TCIU_IDLE` | Cycles | Number of cycles CPC TCIU interface is idle |
| `CPC_CPC_UTCL2IU_BUSY` | Cycles | Number of cycles CPC UTCL2 interface is busy |
| `CPC_CPC_UTCL2IU_IDLE` | Cycles | Number of cycles CPC UTCL2 interface is idle |
| `CPC_CPC_UTCL2IU_STALL` | Cycles | Number of cycles CPC UTCL2 interface is stalled |
| `CPC_ME1_DC0_SPI_BUSY` | Cycles | Number of cycles CPC ME1 Processor is busy |
### Shader Processor Input (SPI) counters
| Hardware Counter | Unit | Definition |
|:----------------------------|:-----------|:-----------------------------------------------------------|
| `SPI_CSN_BUSY` | Cycles | Number of cycles with outstanding waves |
| `SPI_CSN_WINDOW_VALID` | Cycles | Number of cycles enabled by `perfcounter_start` event |
| `SPI_CSN_NUM_THREADGROUPS` | Workgroups | Number of dispatched workgroups |
| `SPI_CSN_WAVE` | Wavefronts | Number of dispatched wavefronts |
| `SPI_RA_REQ_NO_ALLOC` | Cycles | Number of Arb cycles with requests but no allocation |
|`SPI_RA_REQ_NO_ALLOC_CSN` | Cycles | Number of Arb cycles with Compute Shader, n-th pipe (CSn) requests but no CSn allocation |
| `SPI_RA_RES_STALL_CSN` | Cycles | Number of Arb stall cycles due to shortage of CSn pipeline slots |
| `SPI_RA_TMP_STALL_CSN*` | Cycles | Number of stall cycles due to shortage of temp space |
| `SPI_RA_WAVE_SIMD_FULL_CSN` | SIMD-cycles | Accumulated number of Single Instruction Multiple Data (SIMDs) per cycle affected by shortage of wave slots for CSn wave dispatch |
| `SPI_RA_VGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of VGPR slots for CSn wave dispatch |
| `SPI_RA_SGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of SGPR slots for CSn wave dispatch |
| `SPI_RA_LDS_CU_FULL_CSN` | CUs | Number of Compute Units (CUs) affected by shortage of LDS space for CSn wave dispatch |
| `SPI_RA_BAR_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting at a BARRIER |
| `SPI_RA_BULKY_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting for BULKY resource |
| `SPI_RA_TGLIM_CU_FULL_CSN*` | Cycles | Number of CSn wave stall cycles due to restriction of `tg_limit` for thread group size |
| `SPI_RA_WVLIM_STALL_CSN*` | Cycles | Number of cycles CSn is stalled due to WAVE_LIMIT |
| `SPI_VWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Vector General Purpose Register (VGPRs) when launching waves |
| `SPI_SWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Vector General Purpose Register (SGPRs) when launching waves |
### Compute Unit (CU) counters
The CU counters are further classified into instruction mix, Matrix Fused Multiply Add (MFMA) operation counters, level counters, wavefront counters, wavefront cycle counters and Local Data Share (LDS) counters.
#### Instruction mix
| Hardware Counter | Unit | Definition |
|:-----------------------|:-----|:-----------------------------------------------------------------------|
| `SQ_INSTS` | Instr | Number of instructions issued. |
| `SQ_INSTS_VALU` | Instr | Number of Vector Arithmetic Logic Unit (VALU) instructions including MFMA issued. |
| `SQ_INSTS_VALU_ADD_F16` | Instr | Number of VALU Half Precision Floating Point (F16) ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F16` | Instr | Number of VALU F16 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F16` | Instr | Number of VALU F16 Fused Multiply Add (FMA)/ Multiply Add (MAD) instructions issued. |
| `SQ_INSTS_VALU_TRANS_F16` | Instr | Number of VALU F16 Transcendental instructions issued. |
| `SQ_INSTS_VALU_ADD_F32` | Instr | Number of VALU Full Precision Floating Point (F32) ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F32` | Instr | Number of VALU F32 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F32` | Instr | Number of VALU F32 FMA/MAD instructions issued. |
| `SQ_INSTS_VALU_TRANS_F32` | Instr | Number of VALU F32 Transcendental instructions issued. |
| `SQ_INSTS_VALU_ADD_F64` | Instr | Number of VALU F64 ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F64` | Instr | Number of VALU F64 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F64` | Instr | Number of VALU F64 FMA/MAD instructions issued. |
| `SQ_INSTS_VALU_TRANS_F64` | Instr | Number of VALU F64 Transcendental instructions issued. |
| `SQ_INSTS_VALU_INT32` | Instr | Number of VALU 32-bit integer instructions (signed or unsigned) issued. |
| `SQ_INSTS_VALU_INT64` | Instr | Number of VALU 64-bit integer instructions (signed or unsigned) issued. |
| `SQ_INSTS_VALU_CVT` | Instr | Number of VALU Conversion instructions issued. |
| `SQ_INSTS_VALU_MFMA_I8` | Instr | Number of 8-bit Integer MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F16` | Instr | Number of F16 MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_BF16` | Instr | Number of Brain Floating Point - 16 (BF16) MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F32` | Instr | Number of F32 MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F64` | Instr | Number of F64 MFMA instructions issued. |
| `SQ_INSTS_MFMA` | Instr | Number of MFMA instructions issued. |
| `SQ_INSTS_VMEM_WR` | Instr | Number of Vector Memory (VMEM) Write instructions (including FLAT) issued. |
| `SQ_INSTS_VMEM_RD` | Instr | Number of VMEM Read instructions (including FLAT) issued. |
| `SQ_INSTS_VMEM` | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions. |
| `SQ_INSTS_SALU` | Instr | Number of SALU instructions issued. |
| `SQ_INSTS_SMEM` | Instr | Number of Scalar Memory (SMEM) instructions issued. |
| `SQ_INSTS_SMEM_NORM` | Instr | Number of SMEM instructions normalized to match `smem_level` issued. |
| `SQ_INSTS_FLAT` | Instr | Number of FLAT instructions issued. |
| `SQ_INSTS_FLAT_LDS_ONLY` | Instr | Number of FLAT instructions that read/write only from/to LDS issued. Works only if `EARLY_TA_DONE` is enabled. |
| `SQ_INSTS_LDS` | Instr | Number of Local Data Share (LDS) instructions issued (including FLAT). |
| `SQ_INSTS_GDS` | Instr | Number of Global Data Share (GDS) instructions issued. |
| `SQ_INSTS_EXP_GDS` | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued. |
| `SQ_INSTS_BRANCH` | Instr | Number of Branch instructions issued. |
| `SQ_INSTS_SENDMSG` | Instr | Number of `SENDMSG` instructions including `s_endpgm` issued. |
| `SQ_INSTS_VSKIPPED*` | Instr | Number of vector instructions skipped. |
#### MFMA operation counters
| Hardware Counter | Unit | Definition |
|:----------------------------|:-----|:----------------------------------------------|
| `SQ_INSTS_VALU_MFMA_MOPS_I8` | IOP | Number of 8-bit integer MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F16` | FLOP | Number of F16 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_BF16` | FLOP | Number of BF16 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F32` | FLOP | Number of F32 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F64` | FLOP | Number of F64 floating MFMA ops in the unit of 512 |
#### Level counters
:::{note}
All level counters must be followed by `SQ_ACCUM_PREV_HIRES` counter to measure average latency.
:::
| Hardware Counter | Unit | Definition |
|:-------------------|:-----|:-------------------------------------|
| `SQ_ACCUM_PREV` | Count | Accumulated counter sample value where accumulation takes place once every four cycles. |
| `SQ_ACCUM_PREV_HIRES` | Count | Accumulated counter sample value where accumulation takes place once every cycle. |
| `SQ_LEVEL_WAVES` | Waves | Number of inflight waves. To calculate the wave latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_WAVE`. |
| `SQ_INST_LEVEL_VMEM` | Instr | Number of inflight VMEM (including FLAT) instructions. To calculate the VMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_VMEM`. |
| `SQ_INST_LEVEL_SMEM` | Instr | Number of inflight SMEM instructions. To calculate the SMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_SMEM_NORM`. |
| `SQ_INST_LEVEL_LDS` | Instr | Number of inflight LDS (including FLAT) instructions. To calculate the LDS latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_LDS`. |
| `SQ_IFETCH_LEVEL` | Instr | Number of inflight instruction fetch requests from the cache. To calculate the instruction fetch latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_IFETCH`. |
#### Wavefront counters
| Hardware Counter | Unit | Definition |
|:--------------------|:-----|:----------------------------------------------------------------|
| `SQ_WAVES` | Waves | Number of wavefronts dispatched to Sequencers (SQs), including both new and restored wavefronts |
| `SQ_WAVES_SAVED*` | Waves | Number of context-saved waves |
| `SQ_WAVES_RESTORED*` | Waves | Number of context-restored waves sent to SQs |
| `SQ_WAVES_EQ_64` | Waves | Number of wavefronts with exactly 64 active threads sent to SQs |
| `SQ_WAVES_LT_64` | Waves | Number of wavefronts with less than 64 active threads sent to SQs |
| `SQ_WAVES_LT_48` | Waves | Number of wavefronts with less than 48 active threads sent to SQs |
| `SQ_WAVES_LT_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
| `SQ_WAVES_LT_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |
#### Wavefront cycle counters
| Hardware Counter | Unit | Definition |
|:------------------------|:-------|:--------------------------------------------------------------------|
| `SQ_CYCLES` | Cycles | Clock cycles. |
| `SQ_BUSY_CYCLES` | Cycles | Number of cycles while SQ reports it to be busy. |
| `SQ_BUSY_CU_CYCLES` | Qcycles | Number of quad-cycles each CU is busy. |
| `SQ_VALU_MFMA_BUSY_CYCLES` | Cycles | Number of cycles the MFMA ALU is busy. |
| `SQ_WAVE_CYCLES` | Qcycles | Number of quad-cycles spent by waves in the CUs. |
| `SQ_WAIT_ANY` | Qcycles | Number of quad-cycles spent waiting for anything. |
| `SQ_WAIT_INST_ANY` | Qcycles | Number of quad-cycles spent waiting for any instruction to be issued. |
| `SQ_ACTIVE_INST_ANY` | Qcycles | Number of quad-cycles spent by each wave to work on an instruction. |
| `SQ_ACTIVE_INST_VMEM` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VMEM instruction. |
| `SQ_ACTIVE_INST_LDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an LDS instruction. |
| `SQ_ACTIVE_INST_VALU` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VALU instruction. |
| `SQ_ACTIVE_INST_SCA` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a SALU or SMEM instruction. |
| `SQ_ACTIVE_INST_EXP_GDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an EXPORT or GDS instruction. |
| `SQ_ACTIVE_INST_MISC` | Qcycles | Number of quad-cycles spent by the SQ instruction aribter to work on a BRANCH or `SENDMSG` instruction. |
| `SQ_ACTIVE_INST_FLAT` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a FLAT instruction. |
| `SQ_INST_CYCLES_VMEM_WR` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions. |
| `SQ_INST_CYCLES_VMEM_RD` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions. |
| `SQ_INST_CYCLES_SMEM` | Qcycles | Number of quad-cycles spent to execute scalar memory reads. |
| `SQ_INST_CYCLES_SALU` | Qcycles | Number of quad-cycles spent to execute non-memory read scalar operations. |
| `SQ_THREAD_CYCLES_VALU` | Cycles | Number of thread-cycles spent to execute VALU operations. This is similar to `INST_CYCLES_VALU` but multiplied by the number of active threads. |
| `SQ_WAIT_INST_LDS` | Qcycles | Number of quad-cycles spent waiting for LDS instruction to be issued. |
#### LDS counters
| Hardware Counter | Unit | Definition |
|:--------------------------|:------|:--------------------------------------------------------|
| `SQ_LDS_ATOMIC_RETURN` | Cycles | Number of atomic return cycles in LDS |
| `SQ_LDS_BANK_CONFLICT` | Cycles | Number of cycles LDS is stalled by bank conflicts |
| `SQ_LDS_ADDR_CONFLICT*` | Cycles | Number of cycles LDS is stalled by address conflicts |
| `SQ_LDS_UNALIGNED_STALL*` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
| `SQ_LDS_MEM_VIOLATIONS*` | Count | Number of threads that have a memory violation in the LDS |
| `SQ_LDS_IDX_ACTIVE` | Cycles | Number of cycles LDS is used for indexed operations |
#### Miscellaneous counters
| Hardware Counter | Unit | Definition |
|:--------------------------|:------|:--------------------------------------------------------|
| `SQ_IFETCH` | Count | Number of instruction fetch requests from `L1I` cache, in 32-byte width |
| `SQ_ITEMS` | Threads | Number of valid items per wave |
### L1I and sL1D cache counters
| Hardware Counter | Unit | Definition |
|:----------------------------|:------|:----------------------------------------------------------------|
| `SQC_ICACHE_REQ` | Req | Number of `L1I` cache requests |
| `SQC_ICACHE_HITS` | Count | Number of `L1I` cache hits |
| `SQC_ICACHE_MISSES` | Count | Number of non-duplicate `L1I` cache misses including uncached requests |
| `SQC_ICACHE_MISSES_DUPLICATE` | Count | Number of duplicate `L1I` cache misses whose previous lookup miss on the same cache line is not fulfilled yet |
| `SQC_DCACHE_REQ` | Req | Number of `sL1D` cache requests |
| `SQC_DCACHE_INPUT_VALID_READYB` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
| `SQC_DCACHE_HITS` | Count | Number of `sL1D` cache hits |
| `SQC_DCACHE_MISSES` | Count | Number of non-duplicate `sL1D` cache misses including uncached requests |
| `SQC_DCACHE_MISSES_DUPLICATE` | Count | Number of duplicate `sL1D` cache misses |
| `SQC_DCACHE_REQ_READ_1` | Req | Number of constant cache read requests in a single DW |
| `SQC_DCACHE_REQ_READ_2` | Req | Number of constant cache read requests in two DW |
| `SQC_DCACHE_REQ_READ_4` | Req | Number of constant cache read requests in four DW |
| `SQC_DCACHE_REQ_READ_8` | Req | Number of constant cache read requests in eight DW |
| `SQC_DCACHE_REQ_READ_16` | Req | Number of constant cache read requests in 16 DW |
| `SQC_DCACHE_ATOMIC*` | Req | Number of atomic requests |
| `SQC_TC_REQ` | Req | Number of TC requests that were issued by instruction and constant caches |
| `SQC_TC_INST_REQ` | Req | Number of instruction requests to the L2 cache |
| `SQC_TC_DATA_READ_REQ` | Req | Number of data Read requests to the L2 cache |
| `SQC_TC_DATA_WRITE_REQ*` | Req | Number of data write requests to the L2 cache |
| `SQC_TC_DATA_ATOMIC_REQ*` | Req | Number of data atomic requests to the L2 cache |
| `SQC_TC_STALL*` | Cycles | Number of cycles while the valid requests to the L2 cache are stalled |
### Vector L1 cache subsystem
The vector L1 cache subsystem counters are further classified into Texture Addressing Unit (TA), Texture Data Unit (TD), vector L1D cache or Texture Cache per Pipe (TCP), and Texture Cache Arbiter (TCA) counters.
#### TA counters
| Hardware Counter | Unit | Definition |
|:--------------------------------|:------|:------------------------------------------------|
| `TA_TA_BUSY[n]` | Cycles | TA busy cycles. Value range for n: [0-15]. |
| `TA_TOTAL_WAVEFRONTS[n]` | Instr | Number of wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_WAVEFRONTS[n]` | Instr | Number of buffer wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_READ_WAVEFRONTS[n]` | Instr | Number of buffer read wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_WRITE_WAVEFRONTS[n]` | Instr | Number of buffer write wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_ATOMIC_WAVEFRONTS[n]` | Instr | Number of buffer atomic wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_TOTAL_CYCLES[n]` | Cycles | Number of buffer cycles (including read and write) issued to TC. Value range for n: [0-15]. |
| `TA_BUFFER_COALESCED_READ_CYCLES[n]` | Cycles | Number of coalesced buffer read cycles issued to TC. Value range for n: [0-15]. |
| `TA_BUFFER_COALESCED_WRITE_CYCLES[n]` | Cycles | Number of coalesced buffer write cycles issued to TC. Value range for n: [0-15]. |
| `TA_ADDR_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TC. Value range for n: [0-15]. |
| `TA_DATA_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA data path is stalled by TC. Value range for n: [0-15]. |
| `TA_ADDR_STALLED_BY_TD_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TD. Value range for n: [0-15]. |
| `TA_FLAT_WAVEFRONTS[n]` | Instr | Number of flat opcode wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_READ_WAVEFRONTS[n]` | Instr | Number of flat opcode read wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_WRITE_WAVEFRONTS[n]` | Instr | Number of flat opcode write wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_ATOMIC_WAVEFRONTS[n]` | Instr | Number of flat opcode atomic wavefronts processed by TA. Value range for n: [0-15]. |
#### TD counters
| Hardware Counter | Unit | Definition |
|:------------------------|:-----|:---------------------------------------------------|
| `TD_TD_BUSY[n]` | Cycle | TD busy cycles while it is processing or waiting for data. Value range for n: [0-15]. |
| `TD_TC_STALL[n]` | Cycle | Number of cycles TD is stalled waiting for TC data. Value range for n: [0-15]. |
| `TD_SPI_STALL[n]` | Cycle | Number of cycles TD is stalled by SPI. Value range for n: [0-15]. |
| `TD_LOAD_WAVEFRONT[n]` | Instr |Number of wavefront instructions (read/write/atomic). Value range for n: [0-15]. |
| `TD_STORE_WAVEFRONT[n]` | Instr | Number of write wavefront instructions. Value range for n: [0-15].|
| `TD_ATOMIC_WAVEFRONT[n]` | Instr | Number of atomic wavefront instructions. Value range for n: [0-15]. |
| `TD_COALESCABLE_WAVEFRONT[n]` | Instr | Number of coalescable wavefronts according to TA. Value range for n: [0-15]. |
#### TCP counters
| Hardware Counter | Unit | Definition |
|:-----------------------------------|:------|:----------------------------------------------------------|
| `TCP_GATE_EN1[n]` | Cycles | Number of cycles vL1D interface clocks are turned on. Value range for n: [0-15]. |
| `TCP_GATE_EN2[n]` | Cycles | Number of cycles vL1D core clocks are turned on. Value range for n: [0-15]. |
| `TCP_TD_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TD stalls vL1D. Value range for n: [0-15]. |
| `TCP_TCR_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TCR stalls vL1D. Value range for n: [0-15]. |
| `TCP_READ_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a read. Value range for n: [0-15]. |
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a write. Value range for n: [0-15]. |
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on an atomic. Value range for n: [0-15]. |
| `TCP_PENDING_STALL_CYCLES[n]` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache. Value range for n: [0-15]. |
| `TCP_TCP_TA_DATA_STALL_CYCLES` | Cycles | Number of cycles TCP stalls TA data interface. |
| `TCP_TA_TCP_STATE_READ[n]` | Req | Number of state reads. Value range for n: [0-15]. |
| `TCP_VOLATILE[n]` | Req | Number of L1 volatile pixels/buffers from TA. Value range for n: [0-15]. |
| `TCP_TOTAL_ACCESSES[n]` | Req | Number of vL1D accesses. Equals `TCP_PERF_SEL_TOTAL_READ`+`TCP_PERF_SEL_TOTAL_NONREAD`. Value range for n: [0-15]. |
| `TCP_TOTAL_READ[n]` | Req | Number of vL1D read accesses. Equals `TCP_PERF_SEL_TOTAL_HIT_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_READ`. Value range for n: [0-15]. |
| `TCP_TOTAL_WRITE[n]` | Req | Number of vL1D write accesses. `Equals TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE`+ `TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE`. Value range for n: [0-15]. |
| `TCP_TOTAL_ATOMIC_WITH_RET[n]` | Req | Number of vL1D atomic requests with return. Value range for n: [0-15]. |
| `TCP_TOTAL_ATOMIC_WITHOUT_RET[n]` | Req | Number of vL1D atomic without return. Value range for n: [0-15]. |
| `TCP_TOTAL_WRITEBACK_INVALIDATES[n]` | Count | Total number of vL1D writebacks and invalidates. Equals `TCP_PERF_SEL_TOTAL_WBINVL1`+ `TCP_PERF_SEL_TOTAL_WBINVL1_VOL`+ `TCP_PERF_SEL_CP_TCP_INVALIDATE`+ `TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL`. Value range for n: [0-15]. |
| `TCP_UTCL1_REQUEST[n]` | Req | Number of address translation requests to UTCL1. Value range for n: [0-15]. |
| `TCP_UTCL1_TRANSLATION_HIT[n]` | Req | Number of UTCL1 translation hits. Value range for n: [0-15]. |
| `TCP_UTCL1_TRANSLATION_MISS[n]` | Req | Number of UTCL1 translation misses. Value range for n: [0-15]. |
| `TCP_UTCL1_PERMISSION_MISS[n]` | Req | Number of UTCL1 permission misses. Value range for n: [0-15]. |
| `TCP_TOTAL_CACHE_ACCESSES[n]` | Req | Number of vL1D cache accesses including hits and misses. Value range for n: [0-15]. |
| `TCP_TCP_LATENCY[n]` | Cycles | Accumulated wave access latency to vL1D over all wavefronts. Value range for n: [0-15]. |
| `TCP_TCC_READ_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return. Value range for n: [0-15]. |
| `TCP_TCC_WRITE_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return. Value range for n: [0-15]. |
| `TCP_TCC_READ_REQ[n]` | Req | Number of read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_WRITE_REQ[n]` | Req | Number of write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_ATOMIC_WITH_RET_REQ[n]` | Req | Number of atomic requests to L2 cache with return. Value range for n: [0-15]. |
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]` | Req | Number of atomic requests to L2 cache without return. Value range for n: [0-15]. |
| `TCP_TCC_NC_READ_REQ[n]` | Req | Number of NC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_READ_REQ[n]` | Req | Number of UC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_READ_REQ[n]` | Req | Number of CC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_READ_REQ[n]` | Req | Number of RW read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_NC_WRITE_REQ[n]` | Req | Number of NC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_WRITE_REQ[n]` | Req | Number of UC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_WRITE_REQ[n]` | Req | Number of CC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_WRITE_REQ[n]` | Req | Number of RW write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_NC_ATOMIC_REQ[n]` | Req | Number of NC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_ATOMIC_REQ[n]` | Req | Number of UC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_ATOMIC_REQ[n]` | Req | Number of CC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_ATOMIC_REQ[n]` | Req | Number of RW atomic requests to L2 cache. Value range for n: [0-15]. |
#### TCA counters
| Hardware Counter | Unit | Definition |
|:----------------|:------|:------------------------------------------|
| `TCA_CYCLE[n]` | Cycles | Number of TCA cycles. Value range for n: [0-31]. |
| `TCA_BUSY[n]` | Cycles | Number of cycles TCA has a pending request. Value range for n: [0-31]. |
### L2 cache access counters
L2 Cache is also known as Texture Cache per Channel (TCC).
| Hardware Counter | Unit | Definition |
|:--------------------------------|:------|:-------------------------------------------------------------|
| `TCC_CYCLE[n]` |Cycle | Number of L2 cache free-running clocks. Value range for n: [0-31]. |
| `TCC_BUSY[n]` |Cycle | Number of L2 cache busy cycles. Value range for n: [0-31]. |
| `TCC_REQ[n]` |Req | Number of L2 cache requests of all types. This is measured at the tag block. This may be more than the number of requests arriving at the TCC, but it is a good indication of the total amount of work that needs to be performed. Value range for n: [0-31]. |
| `TCC_STREAMING_REQ[n]` |Req | Number of L2 cache streaming requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_NC_REQ[n]` |Req | Number of NC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_UC_REQ[n]` |Req | Number of UC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_CC_REQ[n]` |Req | Number of CC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_RW_REQ[n]` |Req | Number of RW requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_PROBE[n]` |Req | Number of probe requests. Value range for n: [0-31]. |
| `TCC_PROBE_ALL[n]` |Req | Number of external probe requests with `EA_TCC_preq_all`== 1. Value range for n: [0-31]. |
| `TCC_READ[n]` |Req | Number of L2 cache read requests. This includes compressed reads but not metadata reads. Value range for n: [0-31]. |
| `TCC_WRITE[n]` |Req | Number of L2 cache write requests. Value range for n: [0-31]. |
| `TCC_ATOMIC[n]` |Req | Number of L2 cache atomic requests of all types. Value range for n: [0-31]. |
| `TCC_HIT[n]` |Req | Number of L2 cache hits. Value range for n: [0-31]. |
| `TCC_MISS[n]` |Req | Number of L2 cache misses. Value range for n: [0-31]. |
| `TCC_WRITEBACK[n]` |Req | Number of lines written back to the main memory, including writebacks of dirty lines and uncached write/atomic requests. Value range for n: [0-31]. |
| `TCC_EA_WRREQ[n]` |Req | Number of 32-byte and 64-byte transactions going over the `TC_EA_wrreq` interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_64B[n]` |Req | Total number of 64-byte transactions (write or `CMPSWAP`) going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
| `TCC_EA_WR_UNCACHED_32B[n]` |Req | Number of 32-byte write/atomic going over the `TC_EA_wrreq` interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. Value range for n: [0-31].|
| `TCC_EA_WRREQ_STALL[n]` | Cycles | Number of cycles a write request is stalled. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of IO credits. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of GMI credits. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of DRAM credits. Value range for n: [0-31]. |
| `TCC_TOO_MANY_EA_WRREQS_STALL[n]` | Cycles | Number of cycles the L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_LEVEL[n]` | Req | The accumulated number of EA write requests in flight. This is primarily intended to measure average EA write latency. Average write latency = `TCC_PERF_SEL_EA_WRREQ_LEVEL`/`TCC_PERF_SEL_EA_WRREQ`. Value range for n: [0-31]. |
| `TCC_EA_ATOMIC[n]` | Req | Number of 32-byte or 64-byte atomic requests going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
| `TCC_EA_ATOMIC_LEVEL[n]` | Req | The accumulated number of EA atomic requests in flight. This is primarily intended to measure average EA atomic latency. Average atomic latency = `TCC_PERF_SEL_EA_WRREQ_ATOMIC_LEVEL`/`TCC_PERF_SEL_EA_WRREQ_ATOMIC`. Value range for n: [0-31]. |
| `TCC_EA_RDREQ[n]` | Req | Number of 32-byte or 64-byte read requests to EA. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_32B[n]` | Req | Number of 32-byte read requests to EA. Value range for n: [0-31]. |
| `TCC_EA_RD_UNCACHED_32B[n]` | Req | Number of 32-byte EA reads due to uncached traffic. A 64-byte request is counted as 2. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of IO credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of GMI credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of DRAM credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_LEVEL[n]` | Req | The accumulated number of EA read requests in flight. This is primarily intended to measure average EA read latency. Average read latency = `TCC_PERF_SEL_EA_RDREQ_LEVEL`/`TCC_PERF_SEL_EA_RDREQ`. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA read requests to High Bandwidth Memory (HBM). Value range for n: [0-31]. |
| `TCC_EA_WRREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA write requests to HBM. Value range for n: [0-31]. |
| `TCC_TAG_STALL[n]` | Cycles | Number of cycles the normal request pipeline in the tag is stalled for any reason. Normally, stalls of this nature are measured exactly at one point in the pipeline however in case of this counter, probes can stall the pipeline at a variety of places and there is no single point that can reasonably measure the total stalls accurately. Value range for n: [0-31]. |
| `TCC_NORMAL_WRITEBACK[n]` | Req | Number of writebacks due to requests that are not writeback requests. Value range for n: [0-31]. |
| `TCC_ALL_TC_OP_WB_WRITEBACK[n]` | Req | Number of writebacks due to all `TC_OP` writeback requests. Value range for n: [0-31]. |
| `TCC_NORMAL_EVICT[n]` | Req | Number of evictions due to requests that are not invalidate or probe requests. Value range for n: [0-31]. |
| `TCC_ALL_TC_OP_INV_EVICT[n]` | Req | Number of evictions due to all `TC_OP` invalidate requests. Value range for n: [0-31]. |
## MI200 derived metrics list
| Derived Metric | Description |
|:----------------|:-------------------------------------------------------------------------------------|
| `ALUStalledByLDS` | Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready. Reduce this by reducing the LDS bank conflicts or the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
| `FetchSize` | Total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
| `FlatLDSInsts` | Average number of FLAT instructions that read from or write to LDS, executed per work item (affected by flow control). |
| `FlatVMemInsts` | Average number of FLAT instructions that read from or write to the video memory, executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
| `GDSInsts` | Average number of GDS read/write instructions executed per work item (affected by flow control). |
| `GPUBusy` | Percentage of time GPU is busy. |
| `L2CacheHit` | Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
| `LDSBankConflict` | Percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
| `LDSInsts` | Average number of LDS read/write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
| `MemUnitBusy` | Percentage of GPU time the memory unit is active. The result includes the stall time (`MemUnitStalled`). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
| `MemUnitStalled` | Percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
| `MemWrites32B` | Total number of effective 32B write transactions to the memory. |
| `SALUBusy` | Percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
| `SALUInsts` | Average number of scalar ALU instructions executed per work item (affected by flow control). |
| `SFetchInsts` | Average number of scalar fetch instructions from the video memory executed per work item (affected by flow control). |
| `TA_ADDR_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA address path is stalled by TC, over all TA instances. |
| `TA_ADDR_STALLED_BY_TD_CYCLES_sum` | Total number of cycles TA address path is stalled by TD, over all TA instances. |
| `TA_BUFFER_WAVEFRONTS_sum` | Total number of buffer wavefronts processed by all TA instances. |
| `TA_BUFFER_READ_WAVEFRONTS_sum` | Total number of buffer read wavefronts processed by all TA instances. |
| `TA_BUFFER_WRITE_WAVEFRONTS_sum` | Total number of buffer write wavefronts processed by all TA instances. |
| `TA_BUFFER_ATOMIC_WAVEFRONTS_sum` | Total number of buffer atomic wavefronts processed by all TA instances. |
| `TA_BUFFER_TOTAL_CYCLES_sum` | Total number of buffer cycles (including read and write) issued to TC by all TA instances. |
| `TA_BUFFER_COALESCED_READ_CYCLES_sum` | Total number of coalesced buffer read cycles issued to TC by all TA instances. |
| `TA_BUFFER_COALESCED_WRITE_CYCLES_sum` | Total number of coalesced buffer write cycles issued to TC by all TA instances. |
| `TA_BUSY_avr` | Average number of busy cycles over all TA instances. |
| `TA_BUSY_max` | Maximum number of TA busy cycles over all TA instances. |
| `TA_BUSY_min` | Minimum number of TA busy cycles over all TA instances. |
| `TA_DATA_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA data path is stalled by TC, over all TA instances. |
| `TA_FLAT_READ_WAVEFRONTS_sum` | Sum of flat opcode reads processed by all TA instances. |
| `TA_FLAT_WRITE_WAVEFRONTS_sum` | Sum of flat opcode writes processed by all TA instances. |
| `TA_FLAT_WAVEFRONTS_sum` | Total number of flat opcode wavefronts processed by all TA instances. |
| `TA_FLAT_READ_WAVEFRONTS_sum` | Total number of flat opcode read wavefronts processed by all TA instances. |
| `TA_FLAT_ATOMIC_WAVEFRONTS_sum` | Total number of flat opcode atomic wavefronts processed by all TA instances. |
| `TA_TA_BUSY_sum` | Total number of TA busy cycles over all TA instances. |
| `TA_TOTAL_WAVEFRONTS_sum` | Total number of wavefronts processed by all TA instances. |
| `TCA_BUSY_sum` | Total number of cycles TCA has a pending request, over all TCA instances. |
| `TCA_CYCLE_sum` | Total number of cycles over all TCA instances. |
| `TCC_ALL_TC_OP_WB_WRITEBACK_sum` | Total number of writebacks due to all TC_OP writeback requests, over all TCC instances. |
| `TCC_ALL_TC_OP_INV_EVICT_sum` | Total number of evictions due to all TC_OP invalidate requests, over all TCC instances. |
| `TCC_ATOMIC_sum` | Total number of L2 cache atomic requests of all types, over all TCC instances. |
| `TCC_BUSY_avr` | Average number of L2 cache busy cycles, over all TCC instances. |
| `TCC_BUSY_sum` | Total number of L2 cache busy cycles, over all TCC instances. |
| `TCC_CC_REQ_sum` | Total number of CC requests over all TCC instances. |
| `TCC_CYCLE_sum` | Total number of L2 cache free running clocks, over all TCC instances. |
| `TCC_EA_WRREQ_sum` | Total number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface, over all TCC instances. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. |
| `TCC_EA_WRREQ_64B_sum` | Total number of 64-byte transactions (write or `CMPSWAP`) going over the TC_EA_wrreq interface, over all TCC instances. |
| `TCC_EA_WR_UNCACHED_32B_sum` | Total Number of 32-byte write/atomic going over the TC_EA_wrreq interface due to uncached traffic, over all TCC instances. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
| `TCC_EA_WRREQ_STALL_sum` | Total Number of cycles a write request is stalled, over all instances. |
| `TCC_EA_WRREQ_IO_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of IO credits, over all instances. |
| `TCC_EA_WRREQ_GMI_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of GMI credits, over all instances. |
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of DRAM credits, over all instances. |
| `TCC_EA_WRREQ_LEVEL_sum` | Total number of EA write requests in flight over all TCC instances. |
| `TCC_EA_RDREQ_LEVEL_sum` | Total number of EA read requests in flight over all TCC instances. |
| `TCC_EA_ATOMIC_sum` | Total Number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface, over all TCC instances. |
| `TCC_EA_ATOMIC_LEVEL_sum` | Total number of EA atomic requests in flight, over all TCC instances. |
| `TCC_EA_RDREQ_sum` | Total number of 32-byte or 64-byte read requests to EA, over all TCC instances. |
| `TCC_EA_RDREQ_32B_sum` | Total number of 32-byte read requests to EA, over all TCC instances. |
| `TCC_EA_RD_UNCACHED_32B_sum` | Total number of 32-byte EA reads due to uncached traffic, over all TCC instances. |
| `TCC_EA_RDREQ_IO_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of IO credits, over all TCC instances. |
| `TCC_EA_RDREQ_GMI_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of GMI credits, over all TCC instances. |
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of DRAM credits, over all TCC instances. |
| `TCC_EA_RDREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA read requests to HBM, over all TCC instances. |
| `TCC_EA_WRREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA write requests to HBM, over all TCC instances. |
| `TCC_HIT_sum` | Total number of L2 cache hits over all TCC instances. |
| `TCC_MISS_sum` | Total number of L2 cache misses over all TCC instances. |
| `TCC_NC_REQ_sum` | Total number of NC requests over all TCC instances. |
| `TCC_NORMAL_WRITEBACK_sum` | Total number of writebacks due to requests that are not writeback requests, over all TCC instances. |
| `TCC_NORMAL_EVICT_sum` | Total number of evictions due to requests that are not invalidate or probe requests, over all TCC instances. |
| `TCC_PROBE_sum` | Total number of probe requests over all TCC instances. |
| `TCC_PROBE_ALL_sum` | Total number of external probe requests with EA_TCC_preq_all== 1, over all TCC instances. |
| `TCC_READ_sum` | Total number of L2 cache read requests (including compressed reads but not metadata reads) over all TCC instances. |
| `TCC_REQ_sum` | Total number of all types of L2 cache requests over all TCC instances. |
| `TCC_RW_REQ_sum` | Total number of RW requests over all TCC instances. |
| `TCC_STREAMING_REQ_sum` | Total number of L2 cache streaming requests over all TCC instances. |
| `TCC_TAG_STALL_sum` | Total number of cycles the normal request pipeline in the tag is stalled for any reason, over all TCC instances. |
| `TCC_TOO_MANY_EA_WRREQS_STALL_sum` | Total number of cycles L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests, over all TCC instances. |
| `TCC_UC_REQ_sum` | Total number of UC requests over all TCC instances. |
| `TCC_WRITE_sum` | Total number of L2 cache write requests over all TCC instances. |
| `TCC_WRITEBACK_sum` | Total number of lines written back to the main memory including writebacks of dirty lines and uncached write/atomic requests, over all TCC instances. |
| `TCC_WRREQ_STALL_max` | Maximum number of cycles a write request is stalled, over all TCC instances. |
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on an atomic, over all TCP instances. |
| `TCP_GATE_EN1_sum` | Total number of cycles vL1D interface clocks are turned on, over all TCP instances. |
| `TCP_GATE_EN2_sum` | Total number of cycles vL1D core clocks are turned on, over all TCP instances. |
| `TCP_PENDING_STALL_CYCLES_sum` | Total number of cycles vL1D cache is stalled due to data pending from L2 Cache, over all TCP instances. |
| `TCP_READ_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a read, over all TCP instances. |
| `TCP_TA_TCP_STATE_READ_sum` | Total number of state reads by all TCP instances. |
| `TCP_TCC_ATOMIC_WITH_RET_REQ_sum` | Total number of atomic requests to L2 cache with return, over all TCP instances. |
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum` | Total number of atomic requests to L2 cache without return, over all TCP instances. |
| `TCP_TCC_CC_READ_REQ_sum` | Total number of CC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_CC_WRITE_REQ_sum` | Total number of CC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_CC_ATOMIC_REQ_sum` | Total number of CC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_READ_REQ_sum` | Total number of NC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_WRITE_REQ_sum` | Total number of NC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_ATOMIC_REQ_sum` | Total number of NC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_READ_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return for all TCP instances. |
| `TCP_TCC_READ_REQ_sum` | Total number of read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_READ_REQ_sum` | Total number of RW read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_WRITE_REQ_sum` | Total number of RW write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_ATOMIC_REQ_sum` | Total number of RW atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_READ_REQ_sum` | Total number of UC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_WRITE_REQ_sum` | Total number of UC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_ATOMIC_REQ_sum` | Total number of UC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_WRITE_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return for all TCP instances. |
| `TCP_TCC_WRITE_REQ_sum` | Total number of write requests to L2 cache, over all TCP instances. |
| `TCP_TCP_LATENCY_sum` | Total wave access latency to vL1D over all wavefronts for all TCP instances. |
| `TCP_TCR_TCP_STALL_CYCLES_sum` | Total number of cycles TCR stalls vL1D, over all TCP instances. |
| `TCP_TD_TCP_STALL_CYCLES_sum` | Total number of cycles TD stalls vL1D, over all TCP instances. |
| `TCP_TOTAL_ACCESSES_sum` | Total number of vL1D accesses, over all TCP instances. |
| `TCP_TOTAL_READ_sum` | Total number of vL1D read accesses, over all TCP instances. |
| `TCP_TOTAL_WRITE_sum` | Total number of vL1D write accesses, over all TCP instances. |
| `TCP_TOTAL_ATOMIC_WITH_RET_sum` | Total number of vL1D atomic requests with return, over all TCP instances. |
| `TCP_TOTAL_ATOMIC_WITHOUT_RET_sum` | Total number of vL1D atomic requests without return, over all TCP instances. |
| `TCP_TOTAL_CACHE_ACCESSES_sum` | Total number of vL1D cache accesses (including hits and misses) by all TCP instances. |
| `TCP_TOTAL_WRITEBACK_INVALIDATES_sum` | Total number of vL1D writebacks and invalidates, over all TCP instances. |
| `TCP_UTCL1_PERMISSION_MISS_sum` | Total number of UTCL1 permission misses by all TCP instances. |
| `TCP_UTCL1_REQUEST_sum` | Total number of address translation requests to UTCL1 by all TCP instances. |
| `TCP_UTCL1_TRANSLATION_MISS_sum` | Total number of UTCL1 translation misses by all TCP instances. |
| `TCP_UTCL1_TRANSLATION_HIT_sum` | Total number of UTCL1 translation hits by all TCP instances. |
| `TCP_VOLATILE_sum` | Total number of L1 volatile pixels/buffers from TA, over all TCP instances. |
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a write, over all TCP instances. |
| `TD_ATOMIC_WAVEFRONT_sum` | Total number of atomic wavefront instructions, over all TD instances. |
| `TD_COALESCABLE_WAVEFRONT_sum` | Total number of coalescable wavefronts according to TA, over all TD instances. |
| `TD_LOAD_WAVEFRONT_sum` | Total number of wavefront instructions (read/write/atomic), over all TD instances. |
| `TD_SPI_STALL_sum` | Total number of cycles TD is stalled by SPI, over all TD instances. |
| `TD_STORE_WAVEFRONT_sum` | Total number of write wavefront instructions, over all TD instances. |
| `TD_TC_STALL_sum` | Total number of cycles TD is stalled waiting for TC data, over all TD instances. |
| `TD_TD_BUSY_sum` | Total number of TD busy cycles while it is processing or waiting for data, over all TD instances. |
| `VALUBusy` | Percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
| `VALUInsts` | Average number of vector ALU instructions executed per work item (affected by flow control). |
| `VALUUtilization` | Percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
| `VFetchInsts` | Average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
| `VWriteInsts` | Average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
| `Wavefronts` | Total wavefronts. |
| `WRITE_REQ_32B` | Total number of 32-byte effective memory writes. |
| `WriteSize` | Total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
| `WriteUnitStalled` | Percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad). |
## Abbreviations
| Abbreviation | Meaning |
|:------------|:--------------------------------------------------------------------------------|
| `ALU` | Arithmetic Logic Unit |
| `Arb` | Arbiter |
| `BF16` | Brain Floating Point - 16 bits |
| `CC` | Coherently Cached |
| `CP` | Command Processor |
| `CPC` | Command Processor - Compute |
| `CPF` | Command Processor - Fetcher |
| `CS` | Compute Shader |
| `CSC` | Compute Shader Controller |
| `CSn` | Compute Shader, the n-th pipe |
| `CU` | Compute Unit |
| `DW` | 32-bit Data Word, DWORD |
| `EA` | Efficiency Arbiter |
| `F16` | Half Precision Floating Point |
| `F32` | Full Precision Floating Point |
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>. Global Memory<br>. Scratch ("private")<br>. LDS ("shared")<br>. Invalid - MEM_VIOL TrapStatus |
| `FMA` | Fused Multiply Add |
| `GDS` | Global Data Share |
| `GRBM` | Graphics Register Bus Manager |
| `HBM` | High Bandwidth Memory |
| `Instr` | Instructions |
| `IOP` | Integer Operation |
| `L2` | Level-2 Cache |
| `LDS` | Local Data Share |
| `ME1` | Micro Engine, running packet processing firmware on CPC |
| `MFMA` | Matrix Fused Multiply Add |
| `NC` | Noncoherently Cached |
| `RW` | Coherently Cached with Write |
| `SALU` | Scalar ALU |
| `SGPR` | Scalar General Purpose Register |
| `SIMD` | Single Instruction Multiple Data |
| `sL1D` | Scalar Level-1 Data Cache |
| `SMEM` | Scalar Memory |
| `SPI` | Shader Processor Input |
| `SQ` | Sequencer |
| `TA` | Texture Addressing Unit |
| `TC` | Texture Cache |
| `TCA` | Texture Cache Arbiter |
| `TCC` | Texture Cache per Channel, known as L2 Cache |
| `TCIU` | Texture Cache Interface Unit (interface between CP and the memory system) |
| `TCP` | Texture Cache per Pipe, known as vector L1 Cache |
| `TCR` | Texture Cache Router |
| `TD` | Texture Data Unit |
| `UC` | Uncached |
| `UTCL1` | Unified Translation Cache - Level 1 |
| `UTCL2` | Unified Translation Cache - Level 2 |
| `VALU` | Vector ALU |
| `VGPR` | Vector General Purpose Register |
| `vL1D` | Vector Level -1 Data Cache |
| `VMEM` | Vector Memory |

View File

@@ -1,241 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="GPU memory">
<meta name="keywords" content="GPU memory, VRAM, video random access memory, pageable
memory, pinned memory, managed memory">
</head>
# GPU memory
For the HIP reference documentation, see:
* {doc}`hip:.doxygen/docBin/html/group___memory`
* {doc}`hip:.doxygen/docBin/html/group___memory_m`
Host memory exists on the host (e.g. CPU) of the machine in random access memory (RAM).
Device memory exists on the device (e.g. GPU) of the machine in video random access memory (VRAM).
Recent architectures use graphics double data rate (GDDR) synchronous dynamic random-access memory (SDRAM)such as GDDR6, or high-bandwidth memory (HBM) such as HBM2e.
## Memory allocation
Memory can be allocated in two ways: pageable memory, and pinned memory.
The following API calls with result in these allocations:
| API | Data location | Allocation |
|--------------------|---------------|------------|
| System allocated | Host | Pageable |
| `hipMallocManaged` | Host | Managed |
| `hipHostMalloc` | Host | Pinned |
| `hipMalloc` | Device | Pinned |
:::{tip}
`hipMalloc` and `hipFree` are blocking calls, however, HIP recently added non-blocking versions `hipMallocAsync` and `hipFreeAsync` which take in a stream as an additional argument.
:::
### Pageable memory
Pageable memory is usually gotten when calling `malloc` or `new` in a C++ application.
It is unique in that it exists on "pages" (blocks of memory), which can be migrated to other memory storage.
For example, migrating memory between CPU sockets on a motherboard, or a system that runs out of space in RAM and starts dumping pages of RAM into the swap partition of your hard drive.
### Pinned memory
Pinned memory (or page-locked memory, or non-pageable memory) is host memory that is mapped into the address space of all GPUs, meaning that the pointer can be used on both host and device.
Accessing host-resident pinned memory in device kernels is generally not recommended for performance, as it can force the data to traverse the host-device interconnect (e.g. PCIe), which is much slower than the on-device bandwidth (>40x on MI200).
Pinned host memory can be allocated with one of two types of coherence support:
:::{note}
In HIP, pinned memory allocations are coherent by default (`hipHostMallocDefault`).
There are additional pinned memory flags (e.g. `hipHostMallocMapped` and `hipHostMallocPortable`).
On MI200 these options do not impact performance.
<!-- TODO: link to programming_manual#memory-allocation-flags -->
For more information, see the section *memory allocation flags* in the HIP Programming Guide: {doc}`hip:user_guide/programming_manual`.
:::
Much like how a process can be locked to a CPU core by setting affinity, a pinned memory allocator does this with the memory storage system.
On multi-socket systems it is important to ensure that pinned memory is located on the same socket as the owning process, or else each cache line will be moved through the CPU-CPU interconnect, thereby increasing latency and potentially decreasing bandwidth.
In practice, pinned memory is used to improve transfer times between host and device.
For transfer operations, such as `hipMemcpy` or `hipMemcpyAsync`, using pinned memory instead of pageable memory on host can lead to a ~3x improvement in bandwidth.
:::{tip}
If the application needs to move data back and forth between device and host (separate allocations), use pinned memory on the host side.
:::
### Managed memory
Managed memory refers to universally addressable, or unified memory available on the MI200 series of GPUs.
Much like pinned memory, managed memory shares a pointer between host and device and (by default) supports fine-grained coherence, however, managed memory can also automatically migrate pages between host and device.
The allocation will be managed by AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism.
If heterogenous memory management (HMM) is not available, then `hipMallocManaged` will default back to using system memory and will act like pinned host memory.
Other managed memory API calls will have undefined behavior.
It is therefore recommended to check for managed memory capability with: `hipDeviceGetAttribute` and `hipDeviceAttributeManagedMemory`.
HIP supports additional calls that work with page migration:
* `hipMemAdvise`
* `hipMemPrefetchAsync`
:::{tip}
If the application needs to use data on both host and device regularly, does not want to deal with separate allocations, and is not worried about maxing out the VRAM on MI200 GPUs (64 GB per GCD), use managed memory.
:::
:::{tip}
If managed memory performance is poor, check to see if managed memory is supported on your system and if page migration (XNACK) is enabled.
:::
## Access behavior
Memory allocations for GPUs behave as follow:
| API | Data location | Host access | Device access |
|--------------------|---------------|--------------|----------------------|
| System allocated | Host | Local access | Unhandled page fault |
| `hipMallocManaged` | Host | Local access | Zero-copy |
| `hipHostMalloc` | Host | Local access | Zero-copy* |
| `hipMalloc` | Device | Zero-copy | Local access |
Zero-copy accesses happen over the Infinity Fabric interconnect or PCI-E lanes on discrete GPUs.
:::{note}
While `hipHostMalloc` allocated memory is accessible by a device, the host pointer must be converted to a device pointer with `hipHostGetDevicePointer`.
Memory allocated through standard system allocators such as `malloc`, can be accessed a device by registering the memory via `hipHostRegister`.
The device pointer to be used in kernels can be retrieved with `hipHostGetDevicePointer`.
Registered memory is treated like `hipHostMalloc` and will have similar performance.
On devices that support and have [](#xnack) enabled, such as the MI250X, `hipHostRegister` is not required as memory accesses are handled via automatic page migration.
:::
### XNACK
Normally, host and device memory are separate and data has to be transferred manually via `hipMemcpy`.
On a subset of GPUs, such as the MI200, there is an option to automatically migrate pages of memory between host and device.
This is important for managed memory, where the locality of the data is important for performance.
Depending on the system, page migration may be disabled by default in which case managed memory will act like pinned host memory and suffer degraded performance.
*XNACK* describes the GPUs ability to retry memory accesses that failed due a page fault (which normally would lead to a memory access error), and instead retrieve the missing page.
This also affects memory allocated by the system as indicated by the following table:
| API | Data location | Host after device access | Device after host access |
|--------------------|---------------|--------------------------|--------------------------|
| System allocated | Host | Migrate page to host | Migrate page to device |
| `hipMallocManaged` | Host | Migrate page to host | Migrate page to device |
| `hipHostMalloc` | Host | Local access | Zero-copy |
| `hipMalloc` | Device | Zero-copy | Local access |
To check if page migration is available on a platform, use `rocminfo`:
```sh
$ rocminfo | grep xnack
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
```
Here, `xnack-` means that XNACK is available but is disabled by default.
Turning on XNACK by setting the environment variable `HSA_XNACK=1` and gives the expected result, `xnack+`:
```sh
$ HSA_XNACK=1 rocminfo | grep xnack
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack+
```
`hipcc`by default will generate code that runs correctly with both XNACK enabled or disabled.
Setting the `--offload-arch=`-option with `xnack+` or `xnack-` forces code to be only run with XNACK enabled or disabled respectively.
```sh
# Compiled kernels will run regardless if XNACK is enabled or is disabled.
hipcc --offload-arch=gfx90a
# Compiled kernels will only be run if XNACK is enabled with XNACK=1.
hipcc --offload-arch=gfx90a:xnack+
# Compiled kernels will only be run if XNACK is disabled with XNACK=0.
hipcc --offload-arch=gfx90a:xnack-
```
:::{tip}
If you want to make use of page migration, use managed memory. While pageable memory will migrate correctly, it is not a portable solution and can have performance issues if the accessed data isn't page aligned.
:::
### Coherence
* *Coarse-grained coherence* means that memory is only considered up to date at kernel boundaries, which can be enforced through `hipDeviceSynchronize`, `hipStreamSynchronize`, or any blocking operation that acts on the null stream (e.g. `hipMemcpy`).
For example, cacheable memory is a type of coarse-grained memory where an up-to-date copy of the data can be stored elsewhere (e.g. in an L2 cache).
* *Fine-grained coherence* means the coherence is supported while a CPU/GPU kernel is running.
This can be useful if both host and device are operating on the same dataspace using system-scope atomic operations (e.g. updating an error code or flag to a buffer).
Fine-grained memory implies that up-to-date data may be made visible to others regardless of kernel boundaries as discussed above.
| API | Flag | Coherence |
|-------------------------|------------------------------|----------------|
| `hipHostMalloc` | `hipHostMallocDefault` | Fine-grained |
| `hipHostMalloc` | `hipHostMallocNonCoherent` | Coarse-grained |
| API | Flag | Coherence |
|-------------------------|------------------------------|----------------|
| `hipExtMallocWithFlags` | `hipHostMallocDefault` | Fine-grained |
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Coarse-grained |
| API | `hipMemAdvise` argument | Coherence |
|-------------------------|------------------------------|----------------|
| `hipMallocManaged` | | Fine-grained |
| `hipMallocManaged` | `hipMemAdviseSetCoarseGrain` | Coarse-grained |
| `malloc` | | Fine-grained |
| `malloc` | `hipMemAdviseSetCoarseGrain` | Coarse-grained |
:::{tip}
Try to design your algorithms to avoid host-device memory coherence (e.g. system scope atomics). While it can be a useful feature in very specific cases, it is not supported on all systems, and can negatively impact performance by introducing the host-device interconnect bottleneck.
:::
The availability of fine- and coarse-grained memory pools can be checked with `rocminfo`:
```sh
$ rocminfo
...
*******
Agent 1
*******
Name: AMD EPYC 7742 64-Core Processor
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
...
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
...
*******
Agent 9
*******
Name: gfx90a
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
...
```
## System direct memory access
In most cases, the default behavior for HIP in transferring data from a pinned host allocation to device will run at the limit of the interconnect.
However, there are certain cases where the interconnect is not the bottleneck.
The primary way to transfer data onto and off of a GPU, such as the MI200, is to use the onboard System Direct Memory Access engine, which is used to feed blocks of memory to the off-device interconnect (either GPU-CPU or GPU-GPU).
Each GCD has a separate SDMA engine for host-to-device and device-to-host memory transfers.
Importantly, SDMA engines are separate from the computing infrastructure, meaning that memory transfers to and from a device will not impact kernel compute performance, though they do impact memory bandwidth to a limited extent.
The SDMA engines are mainly tuned for PCIe-4.0 x16, which means they are designed to operate at bandwidths up to 32 GB/s.
:::{note}
An important feature of the MI250X platform is the Infinity Fabric™ interconnect between host and device.
The Infinity Fabric interconnect supports improved performance over standard PCIe-4.0 (usually ~50% more bandwidth); however, since the SDMA engine does not run at this speed, it will not max out the bandwidth of the faster interconnect.
:::
The bandwidth limitation can be countered by bypassing the SDMA engine and replacing it with a type of copy kernel known as a "blit" kernel.
Blit kernels will use the compute units on the GPU, thereby consuming compute resources, which may not always be beneficial.
The easiest way to enable blit kernels is to set an environment variable `HSA_ENABLE_SDMA=0`, which will disable the SDMA engine.
On systems where the GPU uses a PCIe interconnect instead of an Infinity Fabric interconnect, blit kernels will not impact bandwidth, but will still consume compute resources.
The use of SDMA vs blit kernels also applies to MPI data transfers and GPU-GPU transfers.

View File

@@ -1,248 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Using the LLVM ASan on a GPU">
<meta name="keywords" content="LLVM, ASan, address sanitizer, AddressSanitizer, instrumented
libraries, instrumented applications">
</head>
# Using the LLVM ASan on a GPU (beta release)
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet.
This document provides documentation on using ROCm ASan.
For information about LLVM ASan, see the [LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
## Compiling for ASan
The ASan process begins by compiling the application of interest with the ASan instrumentation.
Recommendations for doing this are:
* Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`.
* Add the following options to the existing compiler and linker options:
* `-fsanitize=address` - enables instrumentation
* `-shared-libsan` - use shared version of runtime
* `-g` - add debug info for improved reporting
* Explicitly use `xnack+` in the offload architecture option. For example, `--offload-arch=gfx90a:xnack+`
Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted.
It is not an error to compile some files without ASan instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program "`a.out`" does not directly depend on the ASan runtime (`libclang_rt.asan-x86_64.so`) after the build completes (check by running `ldd` (List Dynamic Dependencies) or `readelf`), the application will immediately report an error at runtime as described in the next section.
### About compilation time
When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all of the downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.
There are a few options if the compile time becomes unacceptable:
* Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the ASan process.
* Add the option `-fsanitize-recover=address` to the compiles with the worst compile times. This option simplifies the added instrumentation resulting in faster compilation. See below for more information.
* Disable instrumentation on a per-function basis by adding `__attribute__`((no_sanitize("address"))) to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process.
## Installing ROCm GPU ASan packages
For a complete ROCm GPU Sanitizer installation, including packages, instrumented HSA and HIP runtimes, tools, and math libraries, use the following instruction,
```bash
sudo apt-get install rocm-ml-sdk-asan
```
## Using AMD-supplied ASan instrumented libraries
ROCm releases have optional packages that contain additional ASan instrumented builds of the ROCm libraries (usually found in `/opt/rocm-<version>/lib`). The instrumented libraries have identical names to the regular uninstrumented libraries, and are located in `/opt/rocm-<version>/lib/asan`.
These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan` and `-g`.
These additional libraries avoid additional developer effort to locate repositories, identify the correct branch, check out the correct tags, and other efforts needed to build the libraries from the source. And they extend the ability of the process to detect addressing errors into the ROCm libraries themselves.
When adjusting an application build to add instrumentation, linking against these instrumented libraries is unnecessary. For example, any `-L` `/opt/rocm-<version>/lib` compiler options need not be changed. However, the instrumented libraries should be used when the application is run. It is particularly important that the instrumented language runtimes, like `libamdhip64.so` and `librocm-core.so`, are used; otherwise, device invalid access detections may not be reported.
## Running ASan instrumented applications
### Preparing to run an instrumented application
Here are a few recommendations to consider before running an ASan instrumented heterogeneous application.
* Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient.
* Ensure XNACK is enabled
* For `gfx90a` (MI-2X0) or `gfx940` (MI-3X0) use environment `HSA_XNACK = 1`.
* For `gfx906` (MI-50) or `gfx908` (MI-100) use environment `HSA_XNACK = 1` but also ensure the amdgpu kernel module is loaded with module argument `noretry=0`.
This requirement is due to the fact that the XNACK setting for these GPUs is system-wide.
* Ensure that the application will use the instrumented libraries when it runs. The output from the shell command `ldd <application name>` can be used to see which libraries will be used.
If the instrumented libraries are not listed by `ldd`, the environment variable `LD_LIBRARY_PATH` may need to be adjusted, or in some cases an `RPATH` compiled into the application may need to be changed and the application recompiled.
* Ensure that the application depends on the ASan runtime. This can be checked by running the command `readelf -d <application name> | grep NEEDED` and verifying that shared library: `libclang_rt.asan-x86_64.so` appears in the output.
If it does not appear, when executed the application will quickly output an ASan error that looks like:
```bash
==3210==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
```
* Ensure that the application `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-<version>/llvm/bin`. This executable is not strictly required, but if found is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information).
There is an environment variable, `ASAN_OPTIONS`, that can be used to adjust the runtime behavior of the ASAN runtime itself. There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)) but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASAN runtime. The device runtime only currently supports the default settings for the few relevant options.
There are two `ASAN_OPTION` flags of particular note.
* `halt_on_error=0/1 default 1`.
This tells the ASAN runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional ASan instrumented libraries are not compiled with this option and if an error is detected within one of them, but halt_on_error is set to 0, more undefined behavior will occur.
* `detect_leaks=0/1 default 1`.
This option directs the ASan runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSAN). Unfortunately, for heterogeneous applications, this default will result in significant output from the leak sanitizer when the application exits due to allocations made by the language runtime which are not considered to be to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSAN suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`.
## Runtime overhead
Running an ASan instrumented application incurs
overheads which may result in unacceptably long runtimes
or failure to run at all.
### Higher execution time
ASan detection works by checking each address at runtime
before the address is actually accessed by a load, store, or atomic
instruction.
This checking involves an additional load to "shadow" memory which
records whether the address is "poisoned" or not, and additional logic
that decides whether to produce an detection report or not.
This extra runtime work can cause the application to slow down by
a factor of three or more, depending on how many memory accesses are
executed.
For heterogeneous applications, the shadow memory must be accessible by all devices
and this can mean that shadow accesses from some devices may be more costly
than non-shadow accesses.
### Higher memory use
The address checking described above relies on the compiler to surround
each program variable with a red zone and on ASan
runtime to surround each runtime memory allocation with a red zone and
fill the shadow corresponding to each red zone with poison.
The added memory for the red zones is additional overhead on top
of the 13% overhead for the shadow memory itself.
Applications which consume most one or more available memory pools when
run normally are likely to encounter allocation failures when run with
instrumentation.
## Runtime reporting
It is not the intention of this document to provide a detailed explanation of all of the types of reports that can be output by the ASan runtime. Instead, the focus is on the differences between the standard reports for CPU issues, and reports for GPU issues.
An invalid address detection report for the CPU always starts with
```bash
==<PID>==ERROR: AddressSanitizer: <problem type> on address <memory address> at pc <pc> bp <bp> sp <sp> <access> of size <N> at <memory address> thread T0
```
and continues with a stack trace for the access, a stack trace for the allocation and deallocation, if relevant, and a dump of the shadow near the <memory address>.
In contrast, an invalid address detection report for the GPU always starts with
```bash
==<PID>==ERROR: AddressSanitizer: <problem type> on amdgpu device <device> at pc <pc> <access> of size <n> in workgroup id (<X>,<Y>,<Z>)
```
Above, `<device>` is the integer device ID, and `(<X>, <Y>, <Z>)` is the ID of the workgroup or block where the invalid address was detected.
While the CPU report include a call stack for the thread attempting the invalid access, the GPU is currently to a call stack of size one, i.e. the (symbolized) of the invalid access, e.g.
```bash
#0 <pc> in <fuction signature> at /path/to/file.hip:<line>:<column>
```
This short call stack is followed by a GPU unique section that looks like
```bash
Thread ids and accessed addresses:
<lid0> <maddr 0> : <lid1> <maddr1> : ...
```
where each `<lid j> <maddr j>` indicates the lane ID and the invalid memory address held by lane `j` of the wavefront attempting the invalid access.
Additionally, reports for invalid GPU accesses to memory allocated by GPU code via `malloc` or new starting with, for example,
```bash
==1234==ERROR: AddressSanitizer: heap-buffer-overflow on amdgpu device 0 at pc 0x7fa9f5c92dcc
```
or
```bash
==5678==ERROR: AddressSanitizer: heap-use-after-free on amdgpu device 3 at pc 0x7f4c10062d74
```
currently may include one or two surprising CPU side tracebacks mentioning :`hostcall`". This is due to how `malloc` and `free` are implemented for GPU code and these call stacks can be ignored.
### Running with `rocgdb`
`rocgdb` can be used to further investigate ASan detected errors, with some preparation.
Currently, the ASan runtime complains when starting `rocgdb` without preparation.
```bash
$ rocgdb my_app
==1122==ASan` runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
```
This is solved by setting environment variable `LD_PRELOAD` to the path to the ASan runtime, whose path can be obtained using the command
```bash
amdclang++ -print-file-name=libclang_rt.asan-x86_64.so
```
It is also recommended to set the environment variable `HIP_ENABLE_DEFERRED_LOADING=0` before debugging HIP applications.
After starting `rocgdb` breakpoints can be set on the ASan runtime error reporting entry points of interest. For example, if an ASan error report includes
```bash
WRITE of size 4 in workgroup id (10,0,0)
```
the `rocgdb` command needed to stop the program before the report is printed is
```bash
(gdb) break __asan_report_store4
```
Similarly, the appropriate command for a report including
```bash
READ of size <N> in workgroup ID (1,2,3)
```
is
```bash
(gdb) break __asan_report_load<N>
```
It is possible to set breakpoints on all ASan report functions using these commands:
```bash
$ rocgdb <path to application>
(gdb) start <commmand line arguments>
(gdb) rbreak ^__asan_report
(gdb) c
```
### Using ASan with a short HIP application
Refer to the following example to use ASan with a short HIP application,
https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md
### Known issues with using GPU sanitizer
* Red zones must have limited size and it is possible for an invalid access to completely miss a red zone and not be detected.
* Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows.
* Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented.
* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.

View File

@@ -5,102 +5,72 @@
# https://www.sphinx-doc.org/en/master/usage/configuration.html
import shutil
import jinja2
import os
# Environment to process Jinja templates.
jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader("."))
from rocm_docs import ROCmDocs
# Jinja templates to render out.
templates = []
# Render templates and output files without the last extension.
# For example: 'install.md.jinja' becomes 'install.md'.
for template in templates:
rendered = jinja_env.get_template(template).render()
with open(os.path.splitext(template)[0], 'w') as file:
file.write(rendered)
shutil.copy2('../CONTRIBUTING.md','./contribute/index.md')
shutil.copy2('../RELEASE.md','./about/release-notes.md')
shutil.copy2('../CONTRIBUTING.md','./contributing.md')
shutil.copy2('../RELEASE.md','./release.md')
# Keep capitalization due to similar linking on GitHub's markdown preview.
shutil.copy2('../CHANGELOG.md','./about/CHANGELOG.md')
latex_engine = "xelatex"
latex_elements = {
"fontpkg": r"""
\usepackage{tgtermes}
\usepackage{tgheros}
\renewcommand\ttdefault{txtt}
"""
}
shutil.copy2('../CHANGELOG.md','./CHANGELOG.md')
# configurations for PDF output by Read the Docs
project = "ROCm Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2023-2024 Advanced Micro Devices, Inc. All rights reserved."
version = "6.0.0"
release = "6.0.0"
copyright = "Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved."
version = "5.4.1"
release = "5.4.1"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_os = ["linux"]
all_article_info_author = ""
# pages with specific settings
article_pages = [
{
"file":"release",
"os":["linux", "windows"],
"date":"2023-07-27"
},
{"file":"deploy/linux/index", "os":["linux"]},
{"file":"deploy/linux/install_overview", "os":["linux"]},
{"file":"deploy/linux/prerequisites", "os":["linux"]},
{"file":"deploy/linux/quick_start", "os":["linux"]},
{"file":"deploy/linux/install", "os":["linux"]},
{"file":"deploy/linux/upgrade", "os":["linux"]},
{"file":"deploy/linux/uninstall", "os":["linux"]},
{"file":"deploy/linux/package_manager_integration", "os":["linux"]},
{"file":"deploy/docker", "os":["linux"]},
{"file":"release/gpu_os_support", "os":["linux"]},
{"file":"release/docker_support_matrix", "os":["linux"]},
{"file":"reference/gpu_libraries/communication", "os":["linux"]},
{"file":"reference/ai_tools", "os":["linux"]},
{"file":"reference/management_tools", "os":["linux"]},
{"file":"reference/validation_tools", "os":["linux"]},
{"file":"reference/framework_compatibility/framework_compatibility", "os":["linux"]},
{"file":"reference/computer_vision", "os":["linux"]},
{"file":"how_to/deep_learning_rocm", "os":["linux"]},
{"file":"how_to/gpu_aware_mpi", "os":["linux"]},
{"file":"how_to/magma_install/magma_install", "os":["linux"]},
{"file":"how_to/pytorch_install/pytorch_install", "os":["linux"]},
{"file":"how_to/system_debugging", "os":["linux"]},
{"file":"how_to/tensorflow_install/tensorflow_install", "os":["linux"]},
{"file":"install/windows/install-quick", "os":["windows"]},
{"file":"install/linux/install-quick", "os":["linux"]},
{"file":"examples/machine_learning", "os":["linux"]},
{"file":"examples/inception_casestudy/inception_casestudy", "os":["linux"]},
{"file":"understand/file_reorg", "os":["linux"]},
{"file":"install/linux/install", "os":["linux"]},
{"file":"install/linux/install-options", "os":["linux"]},
{"file":"install/linux/prerequisites", "os":["linux"]},
{"file":"install/docker", "os":["linux"]},
{"file":"install/magma-install", "os":["linux"]},
{"file":"install/pytorch-install", "os":["linux"]},
{"file":"install/tensorflow-install", "os":["linux"]},
{"file":"install/windows/install", "os":["windows"]},
{"file":"install/windows/prerequisites", "os":["windows"]},
{"file":"install/windows/cli/index", "os":["windows"]},
{"file":"install/windows/gui/index", "os":["windows"]},
{"file":"about/compatibility/linux-support", "os":["linux"]},
{"file":"about/compatibility/windows-support", "os":["windows"]},
{"file":"about/compatibility/docker-image-support-matrix", "os":["linux"]},
{"file":"about/compatibility/user-kernel-space-compat-matrix", "os":["linux"]},
{"file":"reference/library-index", "os":["linux"]},
{"file":"how-to/deep-learning-rocm", "os":["linux"]},
{"file":"how-to/gpu-enabled-mpi", "os":["linux"]},
{"file":"how-to/system-debugging", "os":["linux"]},
{"file":"how-to/tuning-guides", "os":["linux", "windows"]},
{"file":"rocm-a-z", "os":["linux", "windows"]},
{"file":"about/release-notes", "os":["linux"]},
{"file":"understand/isv_deployment_win", "os":["windows"]},
]
exclude_patterns = ['temp']
external_toc_path = "./sphinx/_toc.yml"
extensions = ["rocm_docs"]
docs_core = ROCmDocs("ROCm 5.4.1 Documentation Home")
docs_core.setup()
external_projects_current_project = "rocm"
html_theme = "rocm_docs_theme"
html_theme_options = {"flavor": "rocm-docs-home"}
html_title = "ROCm Documentation"
for sphinx_var in ROCmDocs.SPHINX_VARS:
globals()[sphinx_var] = getattr(docs_core, sphinx_var)
html_theme_options = {
"link_main_doc": False
}

View File

@@ -1,154 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Building ROCm documentation">
<meta name="keywords" content="documentation, Visual Studio Code, GitHub, command line">
</head>
# Building documentation
You can build our documentation via GitHub (in a pull request) or locally (using the command line or
Visual Studio (VS) Code.
## GitHub
If you open a pull request on the `develop` branch of a ROCm repository and scroll to the bottom of
the page, there is a summary panel. Next to the line
`docs/readthedocs.com:advanced-micro-devices-demo`, there is a `Details` link. If you click this, it takes
you to the Read the Docs build for your pull request.
![Screenshot of the GitHub documentation build link](../data/contribute/github-docs-build.png)
If you don't see this line, click `Show all checks` to get an itemized view.
## Command line
You can build our documentation via the command line using Python. We use Python 3.8; other
versions may not support the build.
Use the Python Virtual Environment (`venv`) and run the following commands from the project root:
```sh
python3 -mvenv .venv
# Windows
.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
# Linux
.venv/bin/python -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
```
Navigate to `_build/html/index.html` and open this file in a web browser.
## Visual Studio Code
With the help of a few extensions, you can create a productive environment to author and test
documentation locally using Visual Studio (VS) Code. Follow these steps to configure VS Code:
1. Install the required extensions:
* Python: `(ms-python.python)`
* Live Server: `(ritwickdey.LiveServer)`
2. Add the following entries to `.vscode/settings.json`.
```json
{
"liveServer.settings.root": "/.vscode/build/html",
"liveServer.settings.wait": 1000,
"python.terminal.activateEnvInCurrentTerminal": true
}
```
* `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed
alongside the `tasks.json` command.
* `liveServer.settings.wait`: Tells the live server to wait with the update in order to give Sphinx time to
regenerate the site contents and not refresh before the build is complete.
* `python.terminal.activateEnvInCurrentTerminal`: Activates the automatic virtual environment, so you
can build the site from the integrated terminal.
3. Add the following tasks to `.vscode/tasks.json`.
```json
{
"version": "2.0.0",
"tasks": [
{
"label": "Build Docs",
"type": "process",
"windows": {
"command": "${workspaceFolder}/.venv/Scripts/python.exe"
},
"command": "${workspaceFolder}/.venv/bin/python3",
"args": [
"-m",
"sphinx",
"-j",
"auto",
"-T",
"-b",
"html",
"-d",
"${workspaceFolder}/.vscode/build/doctrees",
"-D",
"language=en",
"${workspaceFolder}/docs",
"${workspaceFolder}/.vscode/build/html"
],
"problemMatcher": [
{
"owner": "sphinx",
"fileLocation": "absolute",
"pattern": {
"regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
"file": 1,
"line": 2,
"severity": 3,
"message": 4
}
},
{
"owner": "sphinx",
"fileLocation": "absolute",
"pattern": {
"regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
"file": 1,
"severity": 2,
"message": 3
}
}
],
"group": {
"kind": "build",
"isDefault": true
}
}
]
}
```
> (Implementation detail: two problem matchers were needed to be defined,
> because VS Code doesn't tolerate some problem information being potentially
> absent. While a single regex could match all types of errors, if a capture
> group remains empty (the line number doesn't show up in all warning/error
> messages) but the `pattern` references said empty capture group, VS Code
> discards the message completely.)
4. Configure the Python virtual environment (`venv`).
From the Command Palette, run `Python: Create Environment`. Select `venv` environment and
`docs/sphinx/requirements.txt`.
5. Build the docs.
Launch the default build task using one of the following options:
* A hotkey (the default is `Ctrl+Shift+B`)
* Issuing the `Tasks: Run Build Task` from the Command Palette
6. Open the live preview.
Navigate to the site output within VS Code: right-click on `.vscode/build/html/index.html` and
select `Open with Live Server`. The contents should update on every rebuild without having to
refresh the browser.

View File

@@ -1,33 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Providing feedback for ROCm documentation">
<meta name="keywords" content="documentation, pull request, GitHub">
</head>
# Providing feedback for ROCm documentation
There are four standard ways to provide feedback for this repository.
## Pull request
All contributions to ROCm documentation should arrive via the
[GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)
targeting the develop branch of the repository. If you are unable to contribute
via the GitHub Flow, feel free to email us at [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com?subject=Documentation%20Feedback).
## GitHub discussions
To ask questions or view answers to frequently asked questions, refer to
[GitHub Discussions](https://github.com/RadeonOpenCompute/ROCm/discussions).
On GitHub Discussions, in addition to asking and answering questions,
members can share updates, have open-ended conversations,
and follow along on via public announcements.
## GitHub issue
Issues on existing or absent docs can be filed as
[GitHub Issues](https://github.com/RadeonOpenCompute/ROCm/issues).
## Email
Send other feedback or questions to [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com?subject=Documentation%20Feedback).

View File

@@ -1,77 +0,0 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm documentation toolchain">
<meta name="keywords" content="documentation, toolchain, Sphinx, Doxygen, MyST">
</head>
# ROCm documentation toolchain
Our documentation relies on several open source toolchains and sites.
## `rocm-docs-core`
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) is an AMD-maintained
project that applies customization for our documentation. This
project is the tool most ROCm repositories use as part of the documentation
build. It is also available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/).
See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation<rocm-docs-core:index>`.
## Sphinx
[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator
originally used for Python. It is now widely used in the open source community.
Originally, Sphinx supported reStructuredText (RST) based documentation, but
Markdown support is now available.
ROCm documentation plans to default to Markdown for new projects.
Existing projects using RST are under no obligation to convert to Markdown. New
projects that believe Markdown is not suitable should contact the documentation
team prior to selecting RST.
## Read the Docs
[Read the Docs](https://docs.readthedocs.io/en/stable/) is the service that builds
and hosts the HTML documentation generated using Sphinx to our end users.
## Doxygen
[Doxygen](https://www.doxygen.nl/) is a documentation generator that extracts
information from inline code.
ROCm projects typically use Doxygen for public API documentation unless the
upstream project uses a different tool.
### Breathe
[Breathe](https://www.breathe-doc.org/) is a Sphinx plugin to integrate Doxygen
content.
### MyST
[Markedly Structured Text (MyST)](https://myst-tools.org/docs/spec) is an extended
flavor of Markdown ([CommonMark](https://commonmark.org/)) influenced by reStructuredText (RST) and Sphinx.
It is integrated into ROCm documentation by the Sphinx extension [`myst-parser`](https://myst-parser.readthedocs.io/en/latest/).
A cheat sheet that showcases how to use the MyST syntax is available over at
the [Jupyter reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).
### Sphinx External ToC
[Sphinx External ToC](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left
based on a YAML file that specifies the table of contents.
It was selected due to its flexibility that allows scripts to operate on the
YAML file. Please transition to this file for the project's navigation. You can
see the `_toc.yml.in` file in this repository in the `docs/sphinx` folder for an
example.
### Sphinx-book-theme
[Sphinx-book-theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme
that defines the base appearance for ROCm documentation.
ROCm documentation applies some customization,
such as a custom header and footer on top of the Sphinx Book Theme.
### Sphinx design
[Sphinx design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design
functionality.
ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 10 KiB

View File

Before

Width:  |  Height:  |  Size: 939 KiB

After

Width:  |  Height:  |  Size: 939 KiB

View File

Before

Width:  |  Height:  |  Size: 537 KiB

After

Width:  |  Height:  |  Size: 537 KiB

View File

Before

Width:  |  Height:  |  Size: 292 KiB

After

Width:  |  Height:  |  Size: 292 KiB

View File

Before

Width:  |  Height:  |  Size: 1.3 MiB

After

Width:  |  Height:  |  Size: 1.3 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 163 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.1 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 34 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 183 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 36 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 407 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 465 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 207 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 461 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 461 KiB

View File

Before

Width:  |  Height:  |  Size: 3.5 KiB

After

Width:  |  Height:  |  Size: 3.5 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 412 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

View File

@@ -0,0 +1 @@

View File

Before

Width:  |  Height:  |  Size: 10 KiB

After

Width:  |  Height:  |  Size: 10 KiB

View File

Before

Width:  |  Height:  |  Size: 13 KiB

After

Width:  |  Height:  |  Size: 13 KiB

View File

Before

Width:  |  Height:  |  Size: 13 KiB

After

Width:  |  Height:  |  Size: 13 KiB

View File

Before

Width:  |  Height:  |  Size: 88 KiB

After

Width:  |  Height:  |  Size: 88 KiB

View File

Before

Width:  |  Height:  |  Size: 32 KiB

After

Width:  |  Height:  |  Size: 32 KiB

View File

Before

Width:  |  Height:  |  Size: 99 KiB

After

Width:  |  Height:  |  Size: 99 KiB

View File

Before

Width:  |  Height:  |  Size: 130 KiB

After

Width:  |  Height:  |  Size: 130 KiB

View File

Before

Width:  |  Height:  |  Size: 21 KiB

After

Width:  |  Height:  |  Size: 21 KiB

View File

Before

Width:  |  Height:  |  Size: 8.8 KiB

After

Width:  |  Height:  |  Size: 8.8 KiB

View File

Before

Width:  |  Height:  |  Size: 14 KiB

After

Width:  |  Height:  |  Size: 14 KiB

View File

Before

Width:  |  Height:  |  Size: 25 KiB

After

Width:  |  Height:  |  Size: 25 KiB

View File

Before

Width:  |  Height:  |  Size: 144 KiB

After

Width:  |  Height:  |  Size: 144 KiB

View File

Before

Width:  |  Height:  |  Size: 17 KiB

After

Width:  |  Height:  |  Size: 17 KiB

View File

Before

Width:  |  Height:  |  Size: 47 KiB

After

Width:  |  Height:  |  Size: 47 KiB

View File

Before

Width:  |  Height:  |  Size: 41 KiB

After

Width:  |  Height:  |  Size: 41 KiB

View File

Before

Width:  |  Height:  |  Size: 14 KiB

After

Width:  |  Height:  |  Size: 14 KiB

View File

Before

Width:  |  Height:  |  Size: 19 KiB

After

Width:  |  Height:  |  Size: 19 KiB

View File

Before

Width:  |  Height:  |  Size: 57 KiB

After

Width:  |  Height:  |  Size: 57 KiB

View File

Before

Width:  |  Height:  |  Size: 36 KiB

After

Width:  |  Height:  |  Size: 36 KiB

View File

Before

Width:  |  Height:  |  Size: 102 KiB

After

Width:  |  Height:  |  Size: 102 KiB

View File

Before

Width:  |  Height:  |  Size: 114 KiB

After

Width:  |  Height:  |  Size: 114 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.6 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.5 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 114 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 110 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 26 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 26 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 228 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 796 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 310 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 789 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 801 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 102 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 102 KiB

View File

Before

Width:  |  Height:  |  Size: 103 KiB

After

Width:  |  Height:  |  Size: 103 KiB

View File

Before

Width:  |  Height:  |  Size: 59 KiB

After

Width:  |  Height:  |  Size: 59 KiB

View File

Before

Width:  |  Height:  |  Size: 41 KiB

After

Width:  |  Height:  |  Size: 41 KiB

View File

Before

Width:  |  Height:  |  Size: 39 KiB

After

Width:  |  Height:  |  Size: 39 KiB

View File

Before

Width:  |  Height:  |  Size: 47 KiB

After

Width:  |  Height:  |  Size: 47 KiB

View File

Before

Width:  |  Height:  |  Size: 33 KiB

After

Width:  |  Height:  |  Size: 33 KiB

File diff suppressed because one or more lines are too long

Before

Width:  |  Height:  |  Size: 323 KiB

View File

Before

Width:  |  Height:  |  Size: 58 KiB

After

Width:  |  Height:  |  Size: 58 KiB

View File

Before

Width:  |  Height:  |  Size: 64 KiB

After

Width:  |  Height:  |  Size: 64 KiB

View File

Before

Width:  |  Height:  |  Size: 28 KiB

After

Width:  |  Height:  |  Size: 28 KiB

Some files were not shown because too many files have changed in this diff Show More