# Comparing `rocm-5.7.0...roc-5.7.x`

**103 commits**
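The same comparison can be reproduced locally from a clone of the ROCm repository, assuming both refs have been fetched:

```bash
# List the commits that roc-5.7.x adds on top of the rocm-5.7.0 tag
git log --oneline rocm-5.7.0..roc-5.7.x

# Count them (should match the 103 commits reported above)
git log --oneline rocm-5.7.0..roc-5.7.x | wc -l
```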
**Commits (SHA1):**

`497a5a26d8`, `a0d75e9720`, `9d8a830851`, `23d563eefb`, `7585e9b165`, `f0f4fa15b4`, `549b23b521`, `b0caf52156`, `201f626887`, `37db70c914`,
`244c6a6823`, `ce82a047bf`, `17a1cb8bbb`, `afa14c518e`, `b61a54e4f3`, `227e135f5a`, `1e9a1ca55a`, `20f3c28345`, `ef93b5e176`, `2dd6923ab9`,
`59b53af074`, `fd927e514d`, `72d4da7da0`, `aac49cef23`, `69b8117726`, `9ac4a7b194`, `00163edd45`, `80fd791421`, `f65ab4ce27`, `365b31728d`,
`b6c71018a6`, `54177e8b96`, `74f4f86c92`, `74d8f95afb`, `50ad3847e5`, `c6e2856822`, `444efec642`, `4b7775d264`, `5700b8f9e8`, `e87dba01c6`,
`7d22b96c5d`, `4496b2abc8`, `2b788350e4`, `e607ba6259`, `0e7ae20a32`, `033b6d089e`, `4b62e9b90f`, `cf0798ec0d`, `75456466e7`, `3176676240`,
`24614972d3`, `1e96665c34`, `42a44e020f`, `99073fb9fc`, `9f2c53ef0a`, `acd247bfe8`, `6e70c6026f`, `315b8770a4`, `060838bcc2`, `8d68b6618b`,
`b0d773d2a9`, `aff08a5f42`, `39e0150f94`, `d856e6fa3e`, `64496f2838`, `60491de85f`, `2065ff398f`, `64ad833c33`, `d8d55a1717`, `ee6c183aa9`,
`948bb14cce`, `e29f654883`, `7b3e6364f9`, `5c1b2a7a5f`, `a45c51475e`, `0fa1796636`, `84f2c86126`, `35122729b8`, `8252721a31`, `c98da4a11a`,
`14e0fae0fe`, `f6f6bc7b24`, `13bea6bf4e`, `7a5f2eb508`, `786b44d8eb`, `fac4843569`, `940d2933ff`, `80d8eb84ef`, `acde6284a0`, `63a45a168e`,
`fe3c9ebf38`, `03f78be781`, `c2a4257103`, `d0d4eed1a6`, `772b51a7d2`, `006546e9e6`, `fdc2f51b25`, `89dc44ea6c`, `3825dbc2b3`, `1e92ef9a2d`,
`7c5976004f`, `dba06fe315`, `890c735f53`

## Files changed
### `.github/CODEOWNERS` (vendored; 2 changed lines)

```diff
@@ -1 +1 @@
-* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus
+* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
```
### `.github/ISSUE_TEMPLATE/0_issue_report.yml` (new file; vendored; 76 lines)

```yaml
name: Issue Report
description: File a report for something not working correctly.
title: "[Issue]: "

body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to fill out this report!

        On a Linux system, you can acquire your OS, CPU, GPU, and ROCm version (for filling out this report) with the following commands:
        echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
        echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
        echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
        echo "ROCm in /opt:" && ls -1 /opt | grep -E "rocm-";
  - type: textarea
    attributes:
      label: Problem Description
      description: Describe the issue you encountered.
      placeholder: "The steps to reproduce can be included here, or in the dedicated section further below."
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: What is the name and version number of the OS?
      placeholder: "e.g. Ubuntu 22.04.3 LTS (Jammy Jellyfish)"
    validations:
      required: true
  - type: input
    attributes:
      label: CPU
      description: What CPU did you encounter the issue on?
      placeholder: "e.g. AMD Ryzen 9 5900HX with Radeon Graphics"
    validations:
      required: true
  - type: input
    attributes:
      label: GPU
      description: What GPU(s) did you encounter the issue on?
      placeholder: "e.g. MI200"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Version
      description: What version(s) of ROCm did you encounter the issue on?
      placeholder: "e.g. 5.7.0"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"
  - type: textarea
    attributes:
      label: Steps to Reproduce
      description: (Optional) Detailed steps to reproduce the issue.
      placeholder: Please also include what you expected to happen, and what actually did, at the failing step(s).
    validations:
      required: false
  - type: textarea
    attributes:
      label: Output of /opt/rocm/bin/rocminfo --support
      description: The output of rocminfo --support will help to better address the problem.
      placeholder: |
        ROCk module is loaded
        =====================
        HSA System Attributes
        =====================
        [...]
    validations:
      required: true
```
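For convenience, the system-information commands embedded in the template above can be run directly as a shell snippet:

```bash
# Gather the environment details the issue form asks for
echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)"
echo "CPU:" && cat /proc/cpuinfo | grep "model name" | sort --unique
echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)"
echo "ROCm in /opt:" && ls -1 /opt | grep -E "rocm-"
```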
### `.github/ISSUE_TEMPLATE/1_feature_request.yml` (new file; vendored; 32 lines)

```yaml
name: Feature Suggestion
description: Suggest an additional functionality, or new way of handling an existing functionality.
title: "[Feature]: "

body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to make a suggestion!
  - type: textarea
    attributes:
      label: Suggestion Description
      description: Describe your suggestion.
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: (Optional) If this is for a specific OS, you can mention it here.
      placeholder: "e.g. Ubuntu"
  - type: input
    attributes:
      label: GPU
      description: (Optional) If this is for a specific GPU or GPU family, you can mention it here.
      placeholder: "e.g. MI200"
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"
```
### `.github/ISSUE_TEMPLATE/config.yml` (new file; vendored; 5 lines)

```yaml
blank_issues_enabled: false
contact_links:
  - name: ROCm Community Discussions
    url: https://github.com/RadeonOpenCompute/ROCm/discussions
    about: Please ask and answer questions here for anything ROCm.
```
### `.github/dependabot.yml` (vendored; 1 changed line)

```diff
@@ -10,3 +10,4 @@ updates:
     open-pull-requests-limit: 10
     schedule:
       interval: "daily"
+    versioning-strategy: increase
```
### `.gitignore` (vendored; 5 changed lines)

```diff
@@ -13,6 +13,7 @@ _doxygen/
 _readthedocs/

 # avoid duplicating contributing.md due to conf.py
 docs/contributing.md
-docs/release.md
-docs/CHANGELOG.md
+docs/contribute/index.md
+docs/about/release-notes.md
+docs/about/CHANGELOG.md
```
### Markdownlint configuration

```diff
@@ -1,5 +1,7 @@
 config:
   default: true
+  MD004:
+    style: asterisk
   MD013: false
   MD026:
     punctuation: '.,;:!'
@@ -8,7 +10,9 @@ config:
   MD033: false
   MD034: false
   MD041: false
+  MD051: false
 ignores:
   - CHANGELOG.md
   - docs/CHANGELOG.md
+  - "{,docs/}{RELEASE,release}.md"
+  - tools/autotag/templates/**/*.md
```
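These rules are enforced by an automated check on pull requests (see the CONTRIBUTING.md changes below). To exercise them locally, a minimal sketch, assuming the configuration lives in a standard `.markdownlint.yaml` file and the `markdownlint-cli` npm package is available:

```bash
# Lint the repository's Markdown against the rules above
npx markdownlint '**/*.md' --config .markdownlint.yaml
```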
### Read the Docs configuration

```diff
@@ -6,7 +6,7 @@ version: 2
 sphinx:
   configuration: docs/conf.py

-formats: [htmlzip, pdf, epub]
+formats: [htmlzip, pdf]

 python:
   install:
```
### `.wordlist.txt` (598 changed lines)

This hunk reworks the spell-check wordlist from a 122-line, topic-grouped list into a 558-line alphabetized list. The entries touched by the change, in diff order (in the file itself, each entry sits on its own line):

```text
@@ -1,122 +1,558 @@
# building
matchers
# file_reorg
FHS incrementing Filesystem filesystem rocm
# gpu_aware_mpi
DMA GDR HCA MPI MVAPICH Mellanox's NIC OFED OSU OpenFabrics PeerDirect RDMA UCX ib_core
# isv_deployment_win
ABI
# linear algebra
LAPACK MMA backends cuSOLVER cuSPARSE
# mi200_performance_counters
activations addr AddressSanitizer AlexNet alloc allocator allocators ALU AMD AMDGPU amdgpu AMDGPUs
AMDMIGraphX AMI AOCC AOMP api APIC APIs Arb ASan ASIC ASICs ASm atmi atomics autogenerated avx awk
backend backends benchmarking bilinear BitCode BLAS Blit blit BMC buildable bursty bzip
cacheable CCD cd CDNA CentOS centric changelog chiplet CIFAR CLI CMake cmake CMakeLists CMakePackage
cmd coalescable codename Codespaces comgr Commitizen CommonMark composable concretization Concretized
Conda config conformant convolutional convolves CoRR CP CPC CPF CPP CPU CPUs CSC CSE CSn csn CSV CU
cuBLAS CUDA cuFFT cuLIB cuRAND CUs cuSOLVER cuSPARSE
dataset datasets dataspace datatype datatypes dbgapi de deallocation denormalize Dependabot
deserializers detections dev devicelibs DGEMM disambiguates distro DL DMA DNN DNNL Dockerfile
DockerHub Doxygen DPM DRI DW DWORD
el enablement endpgm env epilog EPYC ESXi ethernet exascale executables
ffmpeg FFT FFTs FHS filesystem Filesystem Flang FMA Fortran fortran FP
galb gcc GCD GCDs GCN GDB gdb GDDR GDR GDS GEMM GEMMs gfortran gfx GIM github Gitpod GL GLXT GMI
gnupg GPG GPR GPU GPUs grayscale GRBM gzip
Haswell HBM HCA heterogenous hipamd hipBLAS hipblas hipBLASLt HIPCC hipCUB hipcub HIPExtension
hipFFT hipfft hipfort HIPIFY hipify hipLIB hipRAND hipSOLVER hipsolver hipSPARSE hipsparse
hipSPARSELt hipTensor HPC HPCG HPL HSA hsa hsakmt HWE
ib_core ICV ImageNet IMDB inband incrementing inferencing InfiniBand inflight init Inlines inlining
installable IntelliSense interprocedural Intersphinx intra invariants invocating Ioffe IOMMU IOP
IOPM IOV ipo ISA ISV ISVs
JSON Jupyter
kdb KFD Khronos KVM
LAPACK LCLK LDS libjpeg libs linearized linter linux llvm LLVM localscratch logits lossy LSAN LTS
Makefile Makefiles matchers Matplotlib Mellanox's MEM MERCHANTABILITY MFMA microarchitecture
MIGraphX migraphx MIOpen miopen MIOpenGEMM miopengemm MIVisionX mivisionx mkdir mlirmiopen MMA MNIST
MPI MSVC mtypes Multicore Multithreaded MVAPICH mvffr MyEnvironment MyST
namespace namespaces Nano Navi NBIO NBIOs NIC NICs Noncoherently NPS NUMA NumPy numref NVCC NVPTX
OAM OAMs ocl OCP OEM OFED OMP OMPT OMPX ONNX OpenCL opencl opencv OpenFabrics OpenGL OpenMP openmp
openssl OpenVX optimizers os OSS OSU
Pageable pageable passthrough PCI PCIe PeerDirect perfcounter Perfetto performant perl PIL PILImage
PowerShell pragma pre prebuilt precompiled prefetch preprocess preprocessing preq prerequisites PRNG
profiler protobuf PRs pseudorandom py PyPi PyTorch
Qcycles RW quasirandom
Radeon RadeonOpenCompute RCCL rccl RDC rdc RDMA RDNA reformats RelWithDebInfo repos Req req
resampling RST reStructuredText RHEL Rickle roadmap roc ROC rocAL rocALUTION rocalution rocBLAS
rocblas rocclr ROCdbgapi rocFFT rocfft ROCgdb ROCk rocLIB rocm ROCm ROCmCC rocminfo
ROCmSoftwarePlatform ROCmValidationSuite rocPRIM rocprim rocprof ROCProfiler rocprofiler ROCr rocr
rocRAND rocrand rocSOLVER rocsolver rocSPARSE rocsparse roct rocThrust rocthrust ROCTracer roctracer
rocWMMA RST runtime runtimes RW
SALU SBIOS SCA scalability SDK SDMA SDRAM SENDMSG sendmsg SENDMSG sendmsg SerDes serializers SGPR
SGPRs SHA shader Shlens sigmoid SIGQUIT SIMD SKU SKUs skylake sL SLES SMEM SMI smi SMT softmax Spack
spack SPI SQs SRAM SRAMECC src stochastically strided subdirectory subexpression subfolder
subfolders supercomputing SWE Szegedy
tagram TCA TCC TCI TCIU TCP TCR TrapStatus UC UTCL Uncached VALU VMEM VSkipped Workgroups Writebacks
addr alloc cmd coalescable csn endpgm inflight mtypes perfcounter preq req sL sendmsg tagram
TensorBoard TensorFlow TFLOPS tg th tmp ToC tokenize toolchain toolchains toolset toolsets
TorchAudio TorchScript TorchServe TorchVision torchvision tracebacks TransferBench TrapStatus txt
UAC uarch ubuntu UC UCC UCX UIF Uncached uncached Unhandled uninstallation unsqueeze unstacking
unswitching untrusted untuned USM UTCL UTIL utils
VALU Vanhoucke VBIOS vdi vectorizable vectorization vectorize vectorized vectorizer vectorizes VGPR
VGPRs vjxb vL VM VMEM VMWare VRAM VSIX VSkipped Vulkan
walkthrough walkthroughs wavefront wavefronts WGP whitespaces Wojna workgroup Workgroups workgroups
writeback Writebacks writebacks wrreq
# openmp
ICV Multithreaded
# tuning_guides
BMC DGEMM HPCG HPL IOPM
# windows
SKU SKUs PowerShell UAC
# pytorch_install
kdb precompiled
# gpu_os_support
HWE el
# using_gpu_sanitizer
LSAN deallocation detections tracebacks workgroup
WX wzo Xeon XGMI Xnack XT Xteam XTX xz
YAML yaml YML YModel ysvmadyb
ZenDNN zypper
```
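Contributors can check whether a term is already on the spell-check allow-list before adding it; a trivial sketch:

```bash
# Exit 0 if the word is already in the allow-list (fixed string, whole line)
grep -Fxq "hipTensor" .wordlist.txt && echo "already allowed"
```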
### `CHANGELOG.md` (1,690 changed lines; diff not expanded)
### `CONTRIBUTING.md` (270 changed lines)

````diff
@@ -1,71 +1,229 @@
-# Contributing to ROCm Docs
-
-AMD values and encourages the ROCm community to contribute to our code and
-documentation. This repository is focused on ROCm documentation and this
-contribution guide describes the recommended method for creating and modifying
-our documentation.
-
-While interacting with ROCm Documentation, we encourage you to be polite and
-respectful in your contributions, content or otherwise. Authors, maintainers of
-these docs act on good intentions and to the best of their knowledge.
-Keep that in mind while you engage. Should you have issues with contributing
-itself, refer to
-[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) on the
-GitHub repository.
-
-For additional information on documentation functionalities,
-see the user and developer guides for rocm-docs-core
-at {doc}`rocm-docs-core documentation <rocm-docs-core:index>`.
-
-## Supported Formats
-
-Our documentation includes both Markdown and RST files. Markdown is encouraged
-over RST due to the lower barrier to participation. GitHub-flavored Markdown is
-preferred for all submissions as it renders accurately on our GitHub
-repositories. For existing documentation,
-[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html) Markdown
-is used to implement certain features unsupported in GitHub Markdown. This is
-not encouraged for new documentation. AMD will transition
-to stricter use of GitHub-flavored Markdown with a few caveats. ROCm
-documentation also uses
-[Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html)
-in our Markdown and RST files. We also use Breathe syntax for Doxygen
-documentation in our Markdown files. See
-[GitHub](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github)'s
-guide on writing and formatting on GitHub as a starting point.
-
-ROCm documentation adds additional requirements to Markdown and RST based files
-as follows:
-
-- Level one headers are only used for page titles. There must be only one level
-  1 header per file for both Markdown and Restructured Text.
-- Pass [markdownlint](https://github.com/markdownlint/markdownlint) check via
-  our automated GitHub action on a Pull Request (PR). See the
-  {doc}`rocm-docs-core linting user guide <rocm-docs-core:user_guide/linting>`
-  for more details.
-
-## Filenames and folder structure
-
-Please use snake case (all lower case letters and underscores instead of spaces)
-for file names. For example, `example_file_name.md`.
-
-Our documentation follows Pitchfork for folder structure.
-All documentation is in `/docs` except for special files like
-the contributing guide in the `/` folder. All images used in the documentation
-are placed in the `/docs/data` folder.
-
-## Language and Style
-
-Adopt Microsoft CPP-Docs guidelines for
-[Voice and Tone](https://github.com/MicrosoftDocs/cpp-docs/blob/main/styleguide/voice-tone.md).
-
-ROCm documentation templates to be made public shortly. ROCm templates dictate
-the recommended structure and flow of the documentation. Guidelines on how to
-integrate figures, equations, and tables are all based off
-[MyST](https://myst-parser.readthedocs.io/en/latest/intro.html).
-
-Font size and selection, page layout, white space control, and other formatting
-details are controlled via
-[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
-Raise issues in `rocm-docs-core` for any formatting concerns and changes
-requested.
-
-## More
-
-For more topics, such as submitting feedback and ways to build documentation,
-see the [Contributing Section](https://rocm.docs.amd.com/en/latest/contributing.html)
-at [rocm.docs.amd.com](https://rocm.docs.amd.com)
+# Contributing to ROCm documentation
+
+AMD values and encourages contributions to our code and documentation. If you
+choose to contribute, we encourage you to be polite and respectful. Improving
+documentation is a long-term process, to which we are dedicated.
+
+If you have issues when trying to contribute, refer to the
+[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our
+GitHub repository.
+
+## Folder structure and naming convention
+
+Our documentation follows the Pitchfork folder structure. Most documentation
+files are stored in the `/docs` folder. Some special files (such as release,
+contributing, and changelog) are stored in the root (`/`) folder.
+
+All images are stored in the `/docs/data` folder. An image's file path mirrors
+that of the documentation file where it is used.
+
+Our naming structure uses kebab case; for example, `my-file-name.rst`.
+
+## Supported formats and syntax
+
+Our documentation includes both Markdown and RST files. We are gradually
+transitioning existing Markdown to RST in order to more effectively meet our
+documentation needs. When contributing, RST is preferred; if you must use
+Markdown, use GitHub-flavored Markdown.
+
+We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html)
+syntax and compile our API references using [Doxygen](https://www.doxygen.nl/).
+
+The following table shows some common documentation components and the syntax
+convention we use for each:
+
+<table>
+<tr>
+<th>Component</th>
+<th>RST syntax</th>
+</tr>
+<tr>
+<td>Code blocks</td>
+<td>
+
+```rst
+.. code-block:: language-name
+
+   My code block.
+```
+
+</td>
+</tr>
+<tr>
+<td>Cross-referencing internal files</td>
+<td>
+
+```rst
+:doc:`Title <../path/to/file/filename>`
+```
+
+</td>
+</tr>
+<tr>
+<td>External links</td>
+<td>
+
+```rst
+`link name <URL>`_
+```
+
+</td>
+</tr>
+<tr>
+<td>Headings</td>
+<td>
+
+```rst
+******************
+Chapter title (H1)
+******************
+
+Section title (H2)
+==================
+
+Subsection title (H3)
+---------------------
+
+Sub-subsection title (H4)
+^^^^^^^^^^^^^^^^^^^^^^^^^
+```
+
+</td>
+</tr>
+<tr>
+<td>Images</td>
+<td>
+
+```rst
+.. image:: image1.png
+```
+
+</td>
+</tr>
+<tr>
+<td>Internal links</td>
+<td>
+
+```rst
+1. Add a tag to the section you want to reference:
+
+.. _section-1:
+
+Section 1
+=========
+
+2. Link to your tag:
+
+As shown in :ref:`section-1`.
+```
+
+</td>
+</tr>
+<tr>
+<td>Lists</td>
+<td>
+
+```rst
+#. Ordered (numbered) list item
+
+* Unordered (bulleted) list item
+```
+
+</td>
+</tr>
+<tr>
+<td>Math (block)</td>
+<td>
+
+```rst
+.. math::
+
+   A = \begin{pmatrix}
+   0.0 & 1.0 & 1.0 & 3.0 \\
+   4.0 & 5.0 & 6.0 & 7.0 \\
+   \end{pmatrix}
+```
+
+</td>
+</tr>
+<tr>
+<td>Math (inline)</td>
+<td>
+
+```rst
+:math:`2 \times 2`
+```
+
+</td>
+</tr>
+<tr>
+<td>Notes</td>
+<td>
+
+```rst
+.. note::
+
+   My note here.
+```
+
+</td>
+</tr>
+<tr>
+<td>Tables</td>
+<td>
+
+```rst
+.. csv-table:: Optional title here
+   :widths: 30, 70 #optional column widths
+   :header: "entry1 header", "entry2 header"
+
+   "entry1", "entry2"
+```
+
+</td>
+</tr>
+</table>
+
+## Language and style
+
+We use the
+[Google developer documentation style guide](https://developers.google.com/style/highlights)
+to guide our content.
+
+Font size and type, page layout, white space control, and other formatting
+details are controlled via
+[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you
+want to notify us of any formatting issues, create a pull request in our
+[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub
+repository.
+
+## Building our documentation
+
+<!-- % TODO: Fix the link to be able to work at every files -->
+To learn how to build our documentation, refer to
+[Building documentation](./building.md).
````
### `README.md` (46 changed lines)

````diff
@@ -1,40 +1,40 @@
-# AMD ROCm™ Platform
+# AMD ROCm™ platform

-ROCm is an open-source stack, composed primarily of open-source software (OSS), designed for
-graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
-tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
+ROCm is an open-source stack, composed primarily of open-source software, designed for graphics
+processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and
+APIs that enable GPU programming from low-level kernel to end-user applications.

 With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
-collaborate, test, and deploy your applications in a free, open-source, integrated, and secure software
+collaborate, test, and deploy your applications in a free, open source, integrated, and secure software
 ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
 artificial intelligence (AI), scientific computing, and computer aided design (CAD).

 ROCm is powered by AMD’s
 [Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
-an OSS C++ GPU programming environment and its corresponding runtime. HIP allows ROCm
-developers to create portable applications on different platforms by deploying code on a range of
-platforms, from dedicated gaming GPUs to exascale HPC clusters.
+an open-source software C++ GPU programming environment and its corresponding runtime. HIP
+allows ROCm developers to create portable applications on different platforms by deploying code on a
+range of platforms, from dedicated gaming GPUs to exascale HPC clusters.

-ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary OSS
-compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks,
-such as PyTorch and TensorFlow.
+ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary open
+source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning
+(ML) frameworks, such as PyTorch and TensorFlow.

-## ROCm Documentation
-
-The ROCm Documentation site is [rocm.docs.amd.com](https://rocm.docs.amd.com).
-
-Source code for the documentation is located in the docs folder of most repositories that are part of
-ROCm.
-
-The file `default.xml` contains information for all repositories and the associated commit used to build
-the current ROCm release.
-
-The `default.xml` file uses the repo Manifest Format.
-
-The develop branch of this repository contains content for the next ROCm release.
-
-### How to build documentation via Sphinx
+## ROCm documentation
+
+This repository contains the manifest file for ROCm releases, changelogs, and release information.
+
+The `default.xml` file contains information for all repositories and the associated commit used to build
+the current ROCm release; `default.xml` uses the Manifest Format repository.
+
+Source code for our documentation is located in the `/docs` folder of most ROCm repositories. The
+`develop` branch of our repositories contains content for the next ROCm release.
+
+The ROCm documentation homepage is [rocm.docs.amd.com](https://rocm.docs.amd.com).
+
+### Building our documentation
+
+For a quick-start build, use the following code. For more options and detail, refer to
+[Building documentation](./contribute/building.md).

 ```bash
 cd docs
@@ -44,7 +44,7 @@ pip3 install -r sphinx/requirements.txt
 python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
 ```

-## Older ROCm Releases
+## Older ROCm releases

 For release information for older ROCm releases, refer to
 [`CHANGELOG`](./CHANGELOG.md).
````
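The README above refers to the repo Manifest Format; a minimal sketch of checking out the sources pinned by `default.xml` with the standard `repo` tool (the manifest URL and branch are assumptions):

```bash
# Fetch the 'repo' launcher (standard Google repo-tool workflow)
mkdir -p ~/bin
curl https://storage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
chmod a+x ~/bin/repo

# Initialize from the ROCm manifest and sync the pinned repositories
mkdir ROCm && cd ROCm
~/bin/repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-5.7.x
~/bin/repo sync
```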
### `RELEASE.md` (482 changed lines)

```diff
@@ -11,473 +11,65 @@

 <!-- spellcheck-disable -->

-The release notes for the ROCm platform.
+Welcome to the release notes for the ROCm platform.

 -------------------

-## ROCm 5.7.0
+## ROCm 5.7.1
 <!-- markdownlint-disable first-line-h1 -->
 <!-- markdownlint-disable no-duplicate-header -->

-### Release Highlights for ROCm v5.7
-
-ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), debugger (ROCgdb) support for Fortran and OMPD, and optimizations for rocRAND and MIVisionX. Address sanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
-
-Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
-
-#### AMD Instinct™ MI50 End of Support Notice
-
-AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
-
-As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
-
-- ROCm 6.0 release will show MI50s as "under maintenance" mode for [Linux](./about/release/linux_support) and [Windows](./about/release/windows_support)
-- No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
-- Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
-- Bug fixes during the maintenance will be made to the next ROCm point release.
-- Bug fixes will not be backported to older ROCm releases for gfx906.
-- Distro / Operating system updates will continue as per the ROCm release cadence for gfx906 GPUs till EOM.
+### What's New in This Release
+
+### ROCm Libraries
+
+#### rocBLAS
+
+A new functionality rocblas-gemm-tune and an environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH are added to rocBLAS in the ROCm 5.7.1 release.
+
+*rocblas-gemm-tune* is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command line interface, which mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used, by setting the environment variable ROCBLAS_LAYER4.
+
+For more information on rocBLAS logging, see Logging in rocBLAS, in the [API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).
+
+An example input file: Expected output (note selected GEMM idx may differ): Where the far right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code of directly using kernels via their indices.
+
+If the output is stored in a file, the results can be used to override default kernel selection with the kernels found, by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, where `<path>` points to the stored file.
+
+For more details, refer to the [rocBLAS Programmer's Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune).
+
+#### HIP 5.7.1 (for ROCm 5.7.1)
+
+ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
+
+### Fixed defects
+
+The *hipPointerGetAttributes* API returns the correct HIP memory type as *hipMemoryTypeManaged* for managed memory.
```
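A sketch of the rocblas-gemm-tune workflow added in the 5.7.1 notes above. The application name and file names are hypothetical, and `ROCBLAS_LAYER=4` is an assumption about the profile-logging setting the notes render as `ROCBLAS_LAYER4`:

```bash
# 1. Capture the GEMM problems an application actually runs (profile logging).
export ROCBLAS_LAYER=4              # assumption: 4 enables rocBLAS profile logging
./my_app 2> gemm_problems.yaml      # hypothetical application and capture file

# 2. Let rocblas-gemm-tune search for the best kernel (solution_index) per problem.
rocblas-gemm-tune --yaml gemm_problems.yaml > tuned_solutions.csv

# 3. Override rocBLAS's default kernel selection with the tuned results.
export ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH="$PWD/tuned_solutions.csv"
./my_app
```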
````diff
-#### Feature Updates
-
-##### Non-hostcall HIP Printf
-
-**Current behavior**
-
-The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCIe atomics are unavailable in some environments, and, as a result, HIP-printf does not work in those environments. Users may see the following error from runtime (with AMD_LOG_LEVEL 1 and above),
-
-```
-Pcie atomics not enabled, hostcall not supported
-```
-
-**Workaround**
-
-The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
-
-Note: This option is less robust than hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
-
-The printf variant is now controlled via a new compiler option -mprintf-kind=<value>. This is supported only for HIP programs and takes the following values,
-
-- “hostcall” – This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
-- “buffered” – This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
-
-**NOTE**: With the new workaround,
-
-- The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
-- The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
````
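A minimal sketch of opting into the buffered printf scheme described above; the `-mprintf-kind` flag comes straight from the notes, while the file name is a hypothetical example:

```bash
# Compile a HIP program with the OpenCL-style buffered printf
# (workaround for systems without PCIe atomics).
hipcc -mprintf-kind=buffered printf_demo.cpp -o printf_demo
./printf_demo
```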
```diff
-##### Beta Release of LLVM Address Sanitizer (ASAN) with the GPU
-
-The ROCm v5.7 release introduces the beta release of LLVM Address Sanitizer (ASAN) with the GPU. The LLVM Address Sanitizer provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
-
-Until now, the LLVM Address Sanitizer process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
-
-Refer to the documentation on LLVM Address Sanitizer with the GPU at [LLVM Address Sanitizer User Guide](./docs/understand/using_gpu_sanitizer.md).
-
-**Note**: The beta release of LLVM Address Sanitizer for ROCm is currently tested and validated on Ubuntu 20.04.
```
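For orientation only, a sketch of how ASan instrumentation is typically enabled for a HIP build. These flags and the `HSA_XNACK` setting are assumptions based on standard LLVM AddressSanitizer usage, not taken from the notes; consult the linked user guide for the supported invocation:

```bash
# Instrument host and device code with AddressSanitizer (assumed flags).
hipcc -g -fsanitize=address -shared-libasan asan_demo.cpp -o asan_demo

# Run with XNACK enabled so device allocations can be tracked (assumption).
HSA_XNACK=1 ./asan_demo
```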
```diff
-#### Fixed Defects
-
-The following defects are fixed in ROCm v5.7,
-
-- Test hangs observed in HMM RCCL
-- NoGpuTst test of Catch2 fails with Docker
-- Failures observed with non-HMM HIP directed catch2 tests with XNACK+
-- Multiple test failures and test hangs observed in hip-directed catch2 tests with xnack+
-
-#### HIP 5.7.0
-
-##### Optimizations
-
-##### Added
-
-- Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
-- Added new APIs supporting Windows only, under development on Linux
-  - `hipMallocMipmappedArray` for allocating a mipmapped array on the device
-  - `hipFreeMipmappedArray` for freeing a mipmapped array on the device
-  - `hipGetMipmappedArrayLevel` for getting a mipmap level of a HIP mipmapped array
-  - `hipMipmappedArrayCreate` for creating a mipmapped array
-  - `hipMipmappedArrayDestroy` for destroying a mipmapped array
-  - `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
-
-##### Changed
-
-##### Fixed
-
-##### Known Issues
-
-- HIP memory type enum values currently don't support equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
-- HIP API `hipPointerGetAttributes` could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
-
-##### Upcoming changes for HIP in ROCm 6.0 release
-
-- Removal of gcnarch from hipDeviceProp_t structure
-- Addition of new fields in hipDeviceProp_t structure
-  - maxTexture1D
-  - maxTexture2D
-  - maxTexture1DLayered
-  - maxTexture2DLayered
-  - sharedMemPerMultiprocessor
-  - deviceOverlap
-  - asyncEngineCount
-  - surfaceAlignment
-  - unifiedAddressing
-  - computePreemptionSupported
-  - hostRegisterSupported
-  - uuid
-- Removal of deprecated code -hip-hcc codes from hip code tree
-- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
-- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
-- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
-- Correct hipGetLastError to return the last error instead of last API call's return code
-- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
-- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
-- Remove hiparray* and make it opaque with hipArray_t
-
-### Library Changes in ROCM 5.7.0
+### Library Changes in ROCM 5.7.1

 | Library | Version |
 |---------|---------|
-| hipBLAS | ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.0) |
-| hipCUB | ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.0) |
-| hipFFT | ⇒ [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.0) |
-| hipSOLVER | ⇒ [1.8.1](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.0) |
-| hipSPARSE | ⇒ [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.0) |
-| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.0) |
-| rccl | ⇒ [2.17.1-1](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-5.7.0) |
-| rocALUTION | ⇒ [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.0) |
-| rocBLAS | ⇒ [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.0) |
-| rocFFT | ⇒ [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.0) |
-| rocm-cmake | ⇒ [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.0) |
-| rocPRIM | ⇒ [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.0) |
-| rocRAND | ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.0) |
-| rocSOLVER | ⇒ [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.0) |
-| rocSPARSE | ⇒ [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.0) |
-| rocThrust | ⇒ [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.0) |
-| rocWMMA | ⇒ [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.0) |
-| Tensile | ⇒ [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.0) |
+| hipBLAS | [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.1) |
+| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.1) |
+| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.1) |
+| hipSOLVER | 1.8.1 ⇒ [1.8.2](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.1) |
+| hipSPARSE | [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.1) |
+| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.1) |
+| rocALUTION | [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.1) |
+| rocBLAS | [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.1) |
+| rocFFT | [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.1) |
+| rocm-cmake | [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.1) |
+| rocPRIM | [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.1) |
+| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.1) |
+| rocSOLVER | [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.1) |
+| rocSPARSE | [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.1) |
+| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.1) |
+| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.1) |
+| Tensile | [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.1) |

+#### hipSOLVER 1.8.2
+
+hipSOLVER 1.8.2 for ROCm 5.7.1
+
+##### Fixed
+
+- Fixed conflicts between the hipsolver-dev and -asan packages by excluding
+  hipsolver_module.f90 from the latter
+
-#### hipBLAS 1.1.0
-
-hipBLAS 1.1.0 for ROCm 5.7.0
-
-##### Changed
-
-- updated documentation requirements
-
-##### Dependencies
-
-- dependency rocSOLVER now depends on rocSPARSE
-
-#### hipCUB 2.13.1
-
-hipCUB 2.13.1 for ROCm 5.7.0
-
-##### Changed
-
-- CUB backend references CUB and Thrust version 2.0.1.
-- Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one.
-- Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments. An equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`.
-
-##### Known Issues
-
-- `debug_synchronous` no longer works on CUDA platform. `CUB_DEBUG_SYNC` should be used to enable those checks.
-- `DeviceReduce::Sum` does not compile on CUDA platform for mixed extended-floating-point/floating-point InputT and OutputT types.
-- `DeviceHistogram::HistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`.
-- `DeviceHistogram::MultiHistogramEven` fails on CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`.
-
-#### hipFFT 1.0.12
-
-hipFFT 1.0.12 for ROCm 5.7.0
-
-##### Added
-
-- Implemented the hipfftXtMakePlanMany, hipfftXtGetSizeMany, hipfftXtExec APIs, to allow requesting half-precision transforms.
-
-##### Changed
-
-- Added --precision argument to benchmark/test clients. --double is still accepted but is deprecated as a method to request a double-precision transform.
-
-#### hipSOLVER 1.8.1
-
-hipSOLVER 1.8.1 for ROCm 5.7.0
-
-##### Changed
-
-- Changed hipsolver-test sparse input data search paths to be relative to the test executable
-
-#### hipSPARSE 2.3.8
-
-hipSPARSE 2.3.8 for ROCm 5.7.0
-
-##### Improved
-
-- Fix compilation failures when using cusparse 12.1.0 backend
-- Fix compilation failures when using cusparse 12.0.0 backend
-- Fix compilation failures when using cusparse 10.1 (non-update versions) as backend
-- Minor improvements
-
-#### MIOpen 2.19.0
-
-MIOpen 2.19.0 for ROCm 5.7.0
-
-##### Added
-
-- ROCm 5.5 support for gfx1101 (Navi32)
-
-##### Changed
-
-- Tuning results for MLIR on ROCm 5.5
-- Bumping MLIR commit to 5.5.0 release tag
-
-##### Fixed
-
-- Fix 3d convolution Host API bug
-- [HOTFIX][MI200][FP16] Disabled ConvHipImplicitGemmBwdXdlops when FP16_ALT is required.
-
-#### RCCL 2.17.1-1
-
-RCCL 2.17.1-1 for ROCm 5.7.0
-
-##### Changed
-
-- Compatibility with NCCL 2.17.1-1
-- Performance tuning for some collective operations
-
-##### Added
-
-- Minor improvements to MSCCL codepath
-- NCCL_NCHANNELS_PER_PEER support
-- Improved compilation performance
-- Support for gfx94x
-
-##### Fixed
-
-- Potential race-condition during ncclSocketClose()
-
-#### rocALUTION 2.1.11
-
-rocALUTION 2.1.11 for ROCm 5.7.0
-
-##### Added
-
-- Added support for gfx940, gfx941 and gfx942
-
-##### Improved
-
-- Fixed OpenMP runtime issue with Windows toolchain
-
-#### rocBLAS 3.1.0
-
-rocBLAS 3.1.0 for ROCm 5.7.0
-
-##### Added
-
-- yaml lock step argument scanning for rocblas-bench and rocblas-test clients. See Programmers Guide for details.
-- rocblas-gemm-tune is used to find the best performing GEMM kernel for each of a given set of GEMM problems.
-
-##### Fixed
-
-- make offset calculations for rocBLAS functions 64 bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
-  - Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
-  - Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
-  - Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
-  - General: set_vector, get_vector, set_matrix, get_matrix
-  - Related fixes: internal scalar loads with > 32bit offsets
-- fix in-place functionality for all trtri sizes
-
-##### Changed
-
-- dot when using rocblas_pointer_mode_host is now synchronous to match legacy BLAS as it stores results in host memory
-- enhanced reporting of installation issues caused by runtime libraries (Tensile)
-- standardized internal rocblas C++ interface across most functions
-
-##### Deprecated
-
-- Removal of __STDC_WANT_IEC_60559_TYPES_EXT__ define in future release
-
-##### Dependencies
-
-- optional use of AOCL BLIS 4.0 on Linux for clients
-- optional build tool only dependency on python psutil
-
-#### rocFFT 1.0.24
-
-rocFFT 1.0.24 for ROCm 5.7.0
-
-##### Optimizations
-
-- Improved performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use Bluestein's algorithm.
-
-##### Added
-
-- Implemented a solution map version converter and finish the first conversion from ver.0 to ver.1. Where version 1 removes some incorrect kernels (sbrc/sbcr using half_lds)
-
-##### Changed
-
-- Moved rocfft_rtc_helper executable to lib/rocFFT directory on Linux.
-- Moved library kernel cache to lib/rocFFT directory.
-
-#### rocm-cmake 0.10.0
-
-rocm-cmake 0.10.0 for ROCm 5.7.0
-
-##### Added
-
-- Added ROCMTest module
-- ROCMCreatePackage: Added support for ASAN packages
-
-#### rocPRIM 2.13.1
-
-rocPRIM 2.13.1 for ROCm 5.7.0
-
-##### Changed
-
-- Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead.
-- Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation.
-- The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
-- Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
-
-##### Fixed
-
-- Fixed build issue caused by missing header in `thread/thread_search.hpp`.
-
-#### rocRAND 2.10.17
-
-rocRAND 2.10.17 for ROCm 5.7.0
-
-##### Added
-
-- MT19937 pseudo random number generator based on M. Matsumoto and T. Nishimura, 1998, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
-- New benchmark for the device API using Google Benchmark, `benchmark_rocrand_device_api`, replacing `benchmark_rocrand_kernel`. `benchmark_rocrand_kernel` is deprecated and will be removed in a future version. Likewise, `benchmark_curand_host_api` is added to replace `benchmark_curand_generate` and `benchmark_curand_device_api` is added to replace `benchmark_curand_kernel`.
-- experimental HIP-CPU feature
-- ThreeFry pseudorandom number generator based on Salmon et al., 2011, "Parallel random numbers: as easy as 1, 2, 3".
-
-##### Changed
-
-- Python 2.7 is no longer officially supported.
-
-#### rocSOLVER 3.23.0
-
-rocSOLVER 3.23.0 for ROCm 5.7.0
-
-##### Added
-
-- LU factorization without pivoting for block tridiagonal matrices:
-  - GEBLTTRF_NPVT now supports interleaved\_batched format
-- Linear system solver without pivoting for block tridiagonal matrices:
-  - GEBLTTRS_NPVT now supports interleaved\_batched format
-
-##### Fixed
-
-- Fixed stack overflow in sparse tests on Windows
-
-##### Changed
-
-- Changed rocsolver-test sparse input data search paths to be relative to the test executable
-- Changed build scripts to default to compressed debug symbols in Debug builds
-
-#### rocSPARSE 2.5.4
-
-rocSPARSE 2.5.4 for ROCm 5.7.0
-
-##### Added
-
-- Added more mixed precisions for SpMV, (matrix: float, vectors: double, calculation: double) and (matrix: rocsparse_float_complex, vectors: rocsparse_double_complex, calculation: rocsparse_double_complex)
-- Added support for gfx940, gfx941 and gfx942
-
-##### Improved
-
-- Fixed a bug in csrsm and bsrsm
-
-##### Known Issues
-
-In csritlu0, the algorithm rocsparse_itilu0_alg_sync_split_fusion has some accuracy issues to investigate with XNACK enabled. The fallback is rocsparse_itilu0_alg_sync_split.
-
-#### rocThrust 2.18.0
-
-rocThrust 2.18.0 for ROCm 5.7.0
-
-##### Fixed
-
-- `lower_bound`, `upper_bound`, and `binary_search` failed to compile for certain types.
-- Fixed issue where `transform_iterator` would not compile with `__device__`-only operators.
-
-##### Changed
-
-- Updated `docs` directory structure to match the standard of [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core).
-- Removed references to and workarounds for deprecated hcc
-
-#### rocWMMA 1.2.0
-
-rocWMMA 1.2.0 for ROCm 5.7.0
-
-##### Changed
-
-- Fixed a bug with synchronization
-- Updated rocWMMA cmake versioning
-
-#### Tensile 4.38.0
-
-Tensile 4.38.0 for ROCm 5.7.0
-
-##### Added
-
-- Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
-- Added user-driven solution selection feature
-
-##### Optimizations
-
-- Enabled LocalSplitU with MFMA for I8 data type
-- Optimized K mask code in mfmaIter
-- Enabled TailLoop code in NoLoadLoop to prefetch global/local read
-- Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
-- Optimized DirectToLds test cases to reduce the test duration
-
-##### Changed
-
-- Removed DGEMM NT custom kernels and related test cases
-- Changed noTailLoop logic to apply noTailLoop only for NT
-- Changed the range of AssertFree0ElementMultiple and Free1
-- Unified aStr, bStr generation code in mfmaIter
-
-##### Fixed
-
-- Fixed LocalSplitU mismatch issue for SGEMM
-- Fixed BufferStore=0 and Ldc != Ldd case
-- Fixed mismatch issue with TailLoop + MatrixInstB > 1
```
45
default.xml
@@ -12,43 +12,44 @@ fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
fetch="https://github.com/GPUOpen-Tools/" />
<remote name="KhronosGroup"
fetch="https://github.com/KhronosGroup/" />
<default revision="refs/tags/rocm-5.6.1"
<default revision="refs/tags/rocm-5.7.1"
remote="roc-github"
sync-c="true"
sync-j="4" />
<!--list of projects for ROCM-->
<project name="ROCK-Kernel-Driver" remote="roc-github" />
<project name="ROCT-Thunk-Interface" remote="roc-github" />
<project name="ROCR-Runtime" remote="roc-github" />
<project name="rocm_smi_lib" remote="roc-github" />
<project name="rocm-core" remote="roc-github" />
<project name="rocm-cmake" remote="roc-github" />
<project name="rocminfo" remote="roc-github" />
<project name="ROCK-Kernel-Driver" />
<project name="ROCT-Thunk-Interface" />
<project name="ROCR-Runtime" />
<project name="amdsmi" />
<project name="rocm_smi_lib" />
<project name="rocm-core" />
<project name="rocm-cmake" />
<project name="rocminfo" />
<project name="rocm_bandwidth_test" />
<project name="rocprofiler" remote="rocm-devtools" />
<project name="roctracer" remote="rocm-devtools" />
<project path="ROCm-OpenCL-Runtime/api/opencl/khronos/icd" name="OpenCL-ICD-Loader" remote="KhronosGroup" revision="6c03f8b58fafd9dd693eaac826749a5cfad515f8" />
<project name="clang-ocl" remote="roc-github" />
<project name="clang-ocl" />
<project name="rdc" />
<!--HIP Projects-->
<project name="HIP" remote="rocm-devtools" />
<project name="clr" remote="rocm-devtools" />
<project name="HIP-Examples" remote="rocm-devtools" />
<project name="clr" remote="rocm-devtools" />
<project name="HIPIFY" remote="rocm-devtools" />
<project name="HIPCC" remote="rocm-devtools" />
<!-- The following projects are all associated with the AMDGPU LLVM compiler -->
<project name="llvm-project" remote="roc-github" />
<project name="ROCm-Device-Libs" remote="roc-github" />
<project name="ROCm-CompilerSupport" remote="roc-github" />
<project name="rocr_debug_agent" remote="rocm-devtools" />
<project name="rocm_bandwidth_test" remote="roc-github" />
<project name="llvm-project" />
<project name="ROCm-Device-Libs" />
<project name="ROCm-CompilerSupport" />
<project name="half" remote="rocm-swplat" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
<project name="RCP" remote="gpuopen-tools" revision="3a49405a1500067c49d181844ec90aea606055bb" />
<!-- gdb projects -->
<project name="ROCgdb" remote="rocm-devtools" />
<project name="ROCdbgapi" remote="rocm-devtools" />
<project name="rocr_debug_agent" remote="rocm-devtools" />
<!-- ROCm Libraries -->
<project name="rdc" remote="roc-github" />
<project groups="mathlibs" name="rocBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="Tensile" remote="rocm-swplat" />
<project groups="mathlibs" name="hipTensor" remote="rocm-swplat" />
<project groups="mathlibs" name="hipBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="rocFFT" remote="rocm-swplat" />
<project groups="mathlibs" name="hipFFT" remote="rocm-swplat" />
@@ -57,15 +58,17 @@ fetch="https://github.com/KhronosGroup/" />
<project groups="mathlibs" name="rocSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSPARSE" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSPARSELt" remote="rocm-swplat" />
<project groups="mathlibs" name="rocALUTION" remote="rocm-swplat" />
<project name="MIOpen" remote="rocm-swplat" />
<project groups="mathlibs" name="rccl" remote="rocm-swplat" />
<project name="MIVisionX" remote="gpuopen-libs" />
<project groups="mathlibs" name="rocThrust" remote="rocm-swplat" />
<project groups="mathlibs" name="hipCUB" remote="rocm-swplat" />
<project groups="mathlibs" name="rocPRIM" remote="rocm-swplat" />
<project groups="mathlibs" name="rocWMMA" remote="rocm-swplat" />
<project groups="mathlibs" name="rccl" remote="rocm-swplat" />
<project name="rocMLIR" remote="rocm-swplat" />
<project name="MIOpen" remote="rocm-swplat" />
<project name="composable_kernel" remote="rocm-swplat" />
<project name="MIVisionX" remote="gpuopen-libs" />
<project name="rpp" remote="gpuopen-libs" />
<project name="hipfort" remote="rocm-swplat" />
<project name="AMDMIGraphX" remote="rocm-swplat" />
<project name="ROCmValidationSuite" remote="rocm-devtools" />
@@ -1,7 +0,0 @@
# 404 - Page Not Found

```{figure} ./data/AMD-404.png
:align: center
```

Return [home](./index) or use the sidebar navigation to get back on track.
@@ -1,12 +1,10 @@
# 3rd Party Support Matrix
# Third party support matrix

ROCm™ supports various 3rd party libraries and frameworks. Supported versions
are tested and known to work. Non-supported versions of 3rd parties may also
work, but aren't tested.

(ml_framework_compat_matrix)=

## Deep Learning
## Deep learning

ROCm releases support the most recent and two prior releases of PyTorch and
TensorFlow.

@@ -22,19 +20,21 @@ TensorFlow.
| 5.6.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |
| 5.7.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |

(communication-libraries)=

## Communication libraries

ROCm supports [OpenUCX](https://openucx.org/) an "an open-source,
production-grade communication framework for data-centric and high-performance
applications".
ROCm supports [OpenUCX](https://openucx.org/), an open-source,
production-grade communication framework for data-centric and high performance
applications.

| UCX version | ROCm 5.4 and older | ROCm 5.5 and newer |
|:----------|:------------------:|:------------------:|
| -1.14.0 | COMPATIBLE | INCOMPATIBLE |
| 1.14.1+ | COMPATIBLE | COMPATIBLE |
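For example, a hedged sketch of building a compatible, ROCm-enabled UCX from source (`--with-rocm` is UCX's configure option for ROCm support; the version tag and install prefix here are placeholders):

```bash
# Build a UCX release compatible with ROCm 5.5 and newer (1.14.1+)
git clone --branch v1.14.1 https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
./configure --prefix="$HOME/ucx-rocm" --with-rocm=/opt/rocm
make -j"$(nproc)" && make install
```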
The Unified Collective Communication Library [UCC](https://https://github.com/openucx/ucc)
also has support for ROCm devices.
The Unified Collective Communication ([UCC](https://github.com/openucx/ucc)) library also has
support for ROCm devices.

| UCC version | ROCm 5.5 and older | ROCm 5.6 and newer |
|:----------|:------------------:|:------------------:|

@@ -46,8 +46,8 @@ UCC version | ROCm 5.5 and older | ROCm 5.6 and newer |
ROCm releases provide algorithm libraries with interfaces compatible with
contemporary CUDA / NVIDIA HPC SDK alternatives.

- Thrust → rocThrust
- CUB → hipCUB
* Thrust → rocThrust
* CUB → hipCUB

| ROCm | Thrust / CUB | HPC SDK |
|:------|:------------:|:-------:|

@@ -57,8 +57,7 @@ contemporary CUDA / NVIDIA HPC SDK alternatives.
| 5.3.x | 1.16 | 22.7 |
| 5.4.x | 1.16 | 22.9 |
| 5.5.x | 1.17 | 22.9 |
| 5.6 | 1.17.2 | 22.9 |
| 5.7 | 1.17.2 | 22.9 |
| 5.6.x | 1.17.2 | 22.9 |
| 5.7.x | 1.17.2 | 22.9 |

For the latest documentation of these libraries, refer to the
[associated documentation](../reference/gpu_libraries/c%2B%2B_primitives.md).
For the latest documentation of these libraries, refer to [API libraries](../../reference/library-index.md).
130
docs/about/compatibility/docker-image-support-matrix.rst
Normal file
@@ -0,0 +1,130 @@
******************************************************************
Docker image support matrix
******************************************************************

AMD validates and publishes `PyTorch <https://hub.docker.com/r/rocm/pytorch>`_ and
`TensorFlow <https://hub.docker.com/r/rocm/tensorflow>`_ containers on Docker Hub. The following
tags and associated inventories are validated with ROCm 5.7.

.. tab-set::

   .. tab-item:: PyTorch

      .. tab-set::

         .. tab-item:: Ubuntu 22.04

            Tag: `rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1/images/sha256-21df283b1712f3d73884b9bc4733919374344ceacb694e8fbc2c50bdd3e767ee>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.10 <https://www.python.org/downloads/release/python-31013/>`_
              * `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.15.0 <https://github.com/pytorch/vision/tree/release/0.15>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

         .. tab-item:: Ubuntu 20.04

            Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1/images/sha256-e67db9373c045a7b6defd43cc3d067e7d49fd5d380f3f8582d2fb219c1756e1f>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.12>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.13.1 <https://github.com/pytorch/vision/tree/v0.13.1>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1/images/sha256-ed99d159026093d2aaf5c48c1e4b0911508773430377051372733f75c340a4c1>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.13>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.14.0 <https://github.com/pytorch/vision/tree/v0.14.0>`_
              * `Tensorboard 2.12.0 <https://github.com/tensorflow/tensorboard/tree/2.12.0>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

            Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.15.2 <https://github.com/pytorch/vision/tree/release/0.15>`_
              * `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
              * `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
              * `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
              * `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_

         .. tab-item:: CentOS 7

            Tag: `rocm/pytorch:rocm5.7_centos7_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_centos7_py3.9_pytorch_staging/images/sha256-92240cdf0b4aa7afa76fc78be995caa19ee9c54b5c9f1683bdcac28cedb58d2b>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/yum/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
              * `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
              * `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
              * `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_

   .. tab-item:: TensorFlow

      .. tab-set::

         .. tab-item:: Ubuntu 20.04

            Tag: `rocm5.7-tf2.12-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.12-dev/images/sha256-e0ac4d49122702e5167175acaeb98a79b9500f585d5e74df18facf6b52ce3e59>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `tensorflow-rocm 2.12.1 <https://pypi.org/project/tensorflow-rocm/2.12.1.570/>`_
              * `Tensorboard 2.12.3 <https://github.com/tensorflow/tensorboard/tree/2.12>`_

            Tag: `rocm5.7-tf2.13-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.13-dev/images/sha256-6f995539eebc062aac2b53db40e2b545192d8b032d0deada8c24c6651a7ac332>`_

            * Inventory:

              * `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
              * `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
              * `tensorflow-rocm 2.13.0 <https://pypi.org/project/tensorflow-rocm/2.13.0.570/>`_
              * `Tensorboard 2.13.0 <https://github.com/tensorflow/tensorboard/tree/2.13>`_
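As a sketch of how one of these images might be pulled and launched (the device and
security flags shown are the ones commonly documented for ROCm containers; adjust
the tag to the image you want):

.. code-block:: shell

   docker pull rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
   docker run -it --device=/dev/kfd --device=/dev/dri \
       --security-opt seccomp=unconfined --group-add video \
       rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1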
@@ -1,8 +1,8 @@
# GPU Support and OS Compatibility (Linux)
# GPU and OS support (Linux)

(supported_distributions)=
(linux-support)=

## Supported Linux Distributions
## Supported Linux distributions

AMD ROCm™ Platform supports the following Linux distributions.

@@ -12,23 +12,19 @@ AMD ROCm™ Platform supports the following Linux distributions.

| Distribution | Processor Architectures | Validated Kernel | Support |
| :----------- | :---------------------: | :--------------: | ------: |
| CentOS 7.9 | x86-64 | 3.10 | ✅ |
| RHEL 7.9 | x86-64 | 3.10 | ✅ |
| RHEL 8.7 | x86-64 | 4.18 | ✅ |
| RHEL 8.8 | x86-64 | 4.18 | ✅ |
| RHEL 9.1 | x86-64 | 5.14 | ✅ |
| RHEL 9.2 | x86-64 | 5.14 | ✅ |
| SLES 15 SP4 | x86-64 | 5.14.21 | ✅ |
| SLES 15 SP5 | x86-64 | 5.14.21 | ✅ |
| Ubuntu 20.04.5 | x86-64 | 5.15 | ✅ |
| Ubuntu 20.04.6 | x86-64 | 5.15 | ✅ |
| Ubuntu 22.04.2 | x86-64 | 5.19 | ✅ |
| Ubuntu 22.04.3 | x86-64 | 6.2 | ✅ |
| RHEL 9.2 | x86-64 | 5.14 (5.14.0-284.11.1.el9_2.x86_64) | ✅ |
| RHEL 9.1 | x86-64 | 5.14.0-284.11.1.el9_2.x86_64 | ✅ |
| RHEL 8.8 | x86-64 | 4.18.0-477.el8.x86_64 | ✅ |
| RHEL 8.7 | x86-64 | 4.18.0-425.10.1.el8_7.x86_64 | ✅ |
| SLES 15 SP5 | x86-64 | 5.14.21-150500.53-default | ✅ |
| SLES 15 SP4 | x86-64 | 5.14.21-150400.24.63-default | ✅ |
| Ubuntu 22.04.2 | x86-64 | 5.19.0-45-generic | ✅ |
| Ubuntu 20.04.5 | x86-64 | 5.15.0-75-generic | ✅ |

:::{versionadded} 5.6

- RHEL 8.8 and 9.2 support is added.
- SLES 15 SP5 support is added
* RHEL 8.8 and 9.2 support is added.
* SLES 15 SP5 support is added

:::

@@ -47,12 +43,12 @@ AMD ROCm™ Platform supports the following Linux distributions.

::::

- ✅: **Supported** - AMD performs full testing of all ROCm components on distro
✅: **Supported** - AMD performs full testing of all ROCm components on distro
GA image.
- ❌: **Unsupported** - AMD no longer performs builds and testing on these
❌: **Unsupported** - AMD no longer performs builds and testing on these
previously supported distro GA images.

## Virtualization Support
## Virtualization support

ROCm supports virtualization for select GPUs only as shown below.

@@ -62,19 +58,17 @@ ROCm supports virtualization for select GPUs only as shown below.
| VMWare | ESXi 8 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
| VMWare | ESXi 7 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |

## Linux Supported GPUs
## Linux-supported GPUs

The table below shows supported GPUs for Instinct™, Radeon Pro™ and Radeon™
GPUs. Please click the tabs below to switch between GPU product lines. If a GPU
is not listed on this table, the GPU is not officially supported by AMD.

::::{tab-set}
:::::{tab-set}

:::{tab-item} AMD Instinct™
::::{tab-item} AMD Instinct™
:sync: instinct

Use Driver Shipped with ROCm

| Product Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) |Support |
|:------------:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Instinct™ MI250X | CDNA2 | gfx90a | ✅ |

@@ -84,42 +78,39 @@ Use Driver Shipped with ROCm
| AMD Instinct™ MI50 | GCN5.1 | gfx906 | ✅ |
| AMD Instinct™ MI25 | GCN5.0 | gfx900 | ❌ |

:::
::::

:::{tab-item} Radeon Pro™
::::{tab-item} Radeon Pro™
:sync: radeonpro

[Use Radeon Pro Driver](https://www.amd.com/en/support/linux-drivers)

| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ Pro W7900 | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ Pro W6800 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro V620 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro VII | GCN5.1 | gfx906 | ✅ |
:::
::::

:::{tab-item} Radeon™
::::{tab-item} Radeon™
:sync: radeonpro

[Use Radeon Pro Driver](https://www.amd.com/en/support/linux-drivers)

| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:---------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ VII | GCN5.1 | gfx906 | ✅ |

:::

::::
:::::

### Support Status
### Support status

- ✅: **Supported** - AMD enables these GPUs in our software distributions for
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
- ⚠️: **Deprecated** - Support will be removed in a future release.
- ❌: **Unsupported** - This configuration is not enabled in our software
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.

## CPU Support
## CPU support

ROCm requires CPUs that support PCIe™ Atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe Atomics.
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.
@@ -1,4 +1,4 @@
# OpenMP Support in ROCm
# OpenMP support in ROCm

## Introduction

@@ -9,12 +9,12 @@ Along with host APIs, the OpenMP compilers support offloading code and data onto
GPU devices. This document briefly describes the installation location of the
OpenMP toolchain, example usage of device offloading, and usage of `rocprof`
with OpenMP applications. The GPUs supported are the same as those supported by
this ROCm release. See the list of supported GPUs in {doc}`/release/gpu_os_support`.
this ROCm release. See the list of supported GPUs for [Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md).

The ROCm OpenMP compiler is implemented using LLVM compiler technology.
{numref}`openmp-toolchain` illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.
The following image illustrates the internal steps taken to translate a user’s application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process. Pass 1 compiles the application to generate the CPU code and Pass 2 links the CPU code to the AMDGPU device code.



### Installation

@@ -22,17 +22,13 @@ The OpenMP toolchain is automatically installed as part of the standard ROCm
installation and is available under `/opt/rocm-{version}/llvm`. The
sub-directories are:

bin: Compilers (`flang` and `clang`) and other binaries.
* bin: Compilers (`flang` and `clang`) and other binaries.
* examples: The usage section below shows how to compile and run these programs.
* include: Header files.
* lib: Libraries including those required for target offload.
* lib-debug: Debug versions of the above libraries.

- examples: The usage section below shows how to compile and run these programs.

- include: Header files.

- lib: Libraries including those required for target offload.

- lib-debug: Debug versions of the above libraries.

## OpenMP: Usage
## OpenMP: usage

The example programs can be compiled and run by pointing the environment
variable `ROCM_PATH` to the ROCm install directory.

@@ -45,10 +41,10 @@ cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
sudo make run
```

:::{note}
```{note}
`sudo` is required since we are building inside the `/opt` directory.
Alternatively, copy the files to your home directory first.
:::
```

The above invocation of Make compiles and runs the program. Note the options
that are required for target offload from an OpenMP program:

@@ -57,15 +53,13 @@ that are required for target offload from an OpenMP program:
-fopenmp --offload-arch=<gpu-arch>
```

:::{note}
```{note}
The compiler also accepts the alternative offloading notation:

```bash
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
```

:::
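Putting the pieces together, a minimal sketch of compiling and running a single-file OpenMP offload program (the file name `veccopy.c` and the `gfx90a` target are placeholders; substitute your own source file and the architecture reported for your GPU):

```bash
# Compile host + device code in one step with the ROCm LLVM compiler
/opt/rocm/llvm/bin/clang -fopenmp --offload-arch=gfx90a veccopy.c -o veccopy
./veccopy
```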
Obtain the value of `gpu-arch` by running the following command:

```bash

@@ -114,7 +108,7 @@ code compiled with AOMP:

For more details on `rocprof`, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.

### Using Tracing Options
### Using tracing options

**Prerequisite:** When using the `--sys-trace` option, compile the OpenMP
program with:

@@ -125,10 +119,10 @@ program with:

The following tracing options are widely used to generate useful information:

- **`--hsa-trace`**: This option is used to get a JSON output file with the HSA
* **`--hsa-trace`**: This option is used to get a JSON output file with the HSA
API execution traces and a flat profile in a CSV file.

- **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls.
* **`--sys-trace`**: This allows programmers to trace both HIP and HSA calls.
Since this option results in loading ``libamdhip64.so``, follow the
prerequisite as mentioned above.
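A hedged sketch of invoking these options (`./veccopy` stands in for any OpenMP offload binary):

```bash
# HSA API trace (JSON) plus a flat CSV profile
rocprof --hsa-trace ./veccopy
# Trace both HIP and HSA calls
rocprof --sys-trace ./veccopy
```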
@@ -140,7 +134,7 @@ HSA calls.
For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocprofiler:rocprofv1>`.

### Environment Variables
### Environment variables

:::{table}
:widths: auto

@@ -155,29 +149,29 @@ For more details on tracing, refer to the {doc}`ROCProfilerV1 User Manual <rocpr
| `OMPX_FORCE_SYNC_REGIONS` | To force the runtime to execute all operations synchronously, i.e., wait for an operation to complete immediately. This affects data transfers and kernel execution. While it is mainly designed for debugging, it may have a minor positive effect on performance in certain situations. |
:::

## OpenMP: Features
## OpenMP: features

The OpenMP programming model is greatly enhanced with the following new features
implemented in the past releases.

(openmp_usm)=

### Asynchronous Behavior in OpenMP Target Regions
### Asynchronous behavior in OpenMP target regions

- Controlling Asynchronous Behavior
* Controlling Asynchronous Behavior

The OpenMP offloading runtime executes in an asynchronous fashion by default, allowing multiple data transfers to start concurrently. However, if the data to be transferred becomes larger than the default threshold of 1MB, the runtime falls back to a synchronous data transfer. The buffers that have been locked already are always executed asynchronously.
You can overrule this default behavior by setting `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` and `OMPX_FORCE_SYNC_REGIONS`. See the [Environment Variables](#environment-variables) table for details, and the sketch after this list.

- Multithreaded Offloading on the Same Device
* Multithreaded Offloading on the Same Device

The `libomptarget` plugin for GPU offloading allows creation of separate configurable HSA queues per chiplet, which enables two or more threads to concurrently offload to the same device.

- Parallel Memory Copy Invocations
* Parallel Memory Copy Invocations

Implicit asynchronous execution of single target region enables parallel memory copy invocations.
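A minimal sketch of overriding these defaults at run time (the values are illustrative; `./veccopy` stands in for any offload binary):

```bash
# Raise the asynchronous-copy threshold from the 1MB default to 4 MiB
export LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES=4194304
# Or force fully synchronous transfers and kernels while debugging
export OMPX_FORCE_SYNC_REGIONS=1
./veccopy
```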
### Unified Shared Memory
### Unified shared memory

Unified Shared Memory (USM) provides a pointer-based approach to memory
management. To implement USM, fulfill the following system requirements along

@@ -185,14 +179,12 @@ with Xnack capability.

#### Prerequisites

- Linux Kernel versions above 5.14

- Latest KFD driver packaged in ROCm stack

- Xnack, as USM support can only be tested with applications compiled with Xnack
* Linux Kernel versions above 5.14
* Latest KFD driver packaged in ROCm stack
* Xnack, as USM support can only be tested with applications compiled with Xnack
capability

#### Xnack Capability
#### Xnack capability

When enabled, Xnack capability allows GPU threads to access CPU (system) memory,
allocated with OS-allocators, such as `malloc`, `new`, and `mmap`. Xnack must be

@@ -218,15 +210,15 @@ HSA_XNACK=1

When Xnack support is not needed:

- Build the applications to maximize resource utilization using:
* Build the applications to maximize resource utilization using:

```bash
--offload-arch=gfx908:xnack-
```

- At runtime, set the `HSA_XNACK` environment variable to 0.
* At runtime, set the `HSA_XNACK` environment variable to 0.

#### Unified Shared Memory Pragma
#### Unified shared memory pragma

This OpenMP pragma is available on MI200 through `xnack+` support.
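A minimal sketch of the directive in question (the `requires unified_shared_memory` spelling is standard OpenMP; the function body and sizes below are illustrative, and an XNACK-enabled MI200 system is assumed):

```c
#include <stdlib.h>

#pragma omp requires unified_shared_memory

int main(void) {
  int n = 1024;
  double *b = malloc(n * sizeof *b);   // plain host allocation
  // With unified shared memory, no map clause is needed for b
  #pragma omp target teams distribute parallel for
  for (int i = 0; i < n; ++i)
    b[i] = 2.0 * i;
  free(b);
  return 0;
}
```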
@@ -280,7 +272,7 @@ to by “b” are in coarse-grain memory during and after the execution of the
target region. This is accomplished in the OpenMP runtime library with calls to
the ROCr runtime to set the pages pointed by “b” as coarse grain.

### OMPT Target Support
### OMPT target support

The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as
described in the OpenMP specification document. These APIs allow first-party

@@ -305,7 +297,7 @@ The file `veccopy-ompt-target-tracing.c` simulates how a tool initiates device
activity tracing. The file `callbacks.h` shows the callbacks registered and
implemented by the tool.

### Floating Point Atomic Operations
### Floating point atomic operations

The MI200-series GPUs support the generation of hardware floating-point atomics
using the OpenMP atomic pragma. The support includes single- and

@@ -329,10 +321,10 @@ double a = 0.0;
a = a + 1.0;
```

:::{note}
```{note}
`AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
`AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
:::
```

To disable the generation of fast floating-point atomic instructions at the file
level, build using the option `-msafe-fp-atomics` or use a hint clause on a

@@ -365,44 +357,36 @@ double b = 0.0;
b = b + 1.0;
```
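For illustration, a sketch of the hint clause mentioned above, using the AMD hint constants named in the note (OpenMP 5.0 allows a `hint` clause on the `atomic` construct; treat the exact spelling as an assumption and verify against your compiler's documentation):

```c
double sum = 0.0;
// Request safe (compare-and-swap loop) atomics for this construct only,
// even when fast FP atomics are enabled for the rest of the file.
#pragma omp atomic hint(AMD_safe_fp_atomics)
sum = sum + 1.0;
```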
### Address Sanitizer (ASan) Tool
### AddressSanitizer tool

Address Sanitizer is a memory error detector tool utilized by applications to
AddressSanitizer (ASan) is a memory error detector tool utilized by applications to
detect various errors ranging from spatial issues such as out-of-bound access to
temporal issues such as use-after-free. The AOMP compiler supports ASan for AMD
GPUs with applications written in both HIP and OpenMP.

**Features Supported on Host Platform (Target x86_64):**
**Features supported on host platform (Target x86_64):**

- Use-after-free
* Use-after-free
* Buffer overflows
* Heap buffer overflow
* Stack buffer overflow
* Global buffer overflow
* Use-after-return
* Use-after-scope
* Initialization order bugs

- Buffer overflows
**Features supported on AMDGPU platform (`amdgcn-amd-amdhsa`):**

- Heap buffer overflow
* Heap buffer overflow
* Global buffer overflow

- Stack buffer overflow

- Global buffer overflow

- Use-after-return

- Use-after-scope

- Initialization order bugs

**Features Supported on AMDGPU Platform (`amdgcn-amd-amdhsa`):**

- Heap buffer overflow

- Global buffer overflow

**Software (Kernel/OS) Requirements:** Unified Shared Memory support with Xnack
**Software (kernel/OS) requirements:** Unified Shared Memory support with Xnack
capability. See the section on [Unified Shared Memory](#unified-shared-memory)
for prerequisites and details on Xnack.

**Example:**

- Heap buffer overflow
* Heap buffer overflow

```c
void main() {

@@ -422,7 +406,7 @@ void main() {
See the complete sample code for heap buffer overflow
[here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/heap_buffer_overflow/openmp/vecadd-HBO.cpp).

- Global buffer overflow
* Global buffer overflow

```c
#pragma omp declare target

@@ -447,46 +431,44 @@ for(int i=0; i<N; i++){
See the complete sample code for global buffer overflow
[here](https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/examples/tools/asan/global_buffer_overflow/openmp/vecadd-GBO.cpp).

### Clang Compiler Option for Kernel Optimization
### Clang compiler option for kernel optimization

You can use the clang compiler option `-fopenmp-target-fast` for kernel optimization if certain constraints implied by its component options are satisfied. `-fopenmp-target-fast` enables the following options:

- `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including No-loop and Cross-team reductions.
* `-fopenmp-target-ignore-env-vars`: It enables code generation of specialized kernels including no-loop and Cross-team reductions.

- `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution.
* `-fopenmp-assume-no-thread-state`: It enables the compiler to assume that no thread in a parallel region modifies an Internal Control Variable (`ICV`), thus potentially reducing the device runtime code execution.

- `-fopenmp-assume-no-nested-parallelism`: It enables the compiler to assume that no thread in a parallel region encounters a parallel region, thus potentially reducing the device runtime code execution.
* `-fopenmp-assume-no-nested-parallelism`: It enables the compiler to assume that no thread in a parallel region encounters a parallel region, thus potentially reducing the device runtime code execution.

- `-O3` if no `-O*` is specified by the user.
* `-O3` if no `-O*` is specified by the user.

### Specialized Kernels
### Specialized kernels

Clang will attempt to generate specialized kernels based on compiler options and OpenMP constructs. The following specialized kernels are supported:

- No-Loop

- Big-Jump-Loop

- Cross-Team (Xteam) Reductions
* No-loop
* Big-jump-loop
* Cross-team reductions

To enable the generation of specialized kernels, follow these guidelines (a sketch follows this list):

- Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the No-Loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the Big-Jump-Loop kernel instead of the No-Loop kernel.
* Do not specify teams, threads, and schedule-related environment variables. The `num_teams` clause in an OpenMP target construct acts as an override and prevents the generation of the no-loop kernel. If the specification of `num_teams` clause is a user requirement then clang tries to generate the big-jump-loop kernel instead of the no-loop kernel.

- Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`.
* Assert the absence of the teams, threads, and schedule-related environment variables by adding the command-line option `-fopenmp-target-ignore-env-vars`.

- To automatically enable the specialized kernel generation, use `-Ofast` or `-fopenmp-target-fast` for compilation.
* To automatically enable the specialized kernel generation, use `-Ofast` or `-fopenmp-target-fast` for compilation.

- To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`.
* To disable specialized kernel generation, use `-fno-openmp-target-ignore-env-vars`.
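Following these guidelines, a hedged sketch of a loop that is a candidate for these specializations — no `num_teams` clause and no schedule-related environment variables set — compiled with something like `clang -fopenmp --offload-arch=gfx90a -fopenmp-target-fast sum.c` (the architecture and file name are placeholders):

```c
#include <stdio.h>

#define N (1 << 20)
static double a[N];

int main(void) {
  for (int i = 0; i < N; ++i) a[i] = 1.0;

  double sum = 0.0;
  // With -fopenmp-target-fast this construct is a candidate for the
  // no-loop / cross-team (Xteam) reduction specialized kernels.
  #pragma omp target teams distribute parallel for map(to: a) reduction(+: sum)
  for (int i = 0; i < N; ++i)
    sum += a[i];

  printf("sum = %f\n", sum);
  return 0;
}
```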
#### No-Loop Kernel Generation
#### No-loop kernel generation

The No-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.
The no-loop kernel generation feature optimizes the compiler performance by generating a specialized kernel for certain OpenMP target constructs such as target teams distribute parallel for. The specialized kernel generation feature assumes every thread executes a single iteration of the user loop, which leads the runtime to launch a total number of GPU threads equal to or greater than the iteration space size of the target region loop. This allows the compiler to generate code for the loop body without an enclosing loop, resulting in reduced control-flow complexity and potentially better performance.

#### Big-Jump-Loop Kernel Generation
#### Big-jump-loop kernel generation

A No-Loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the Big-Jump-Loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.
A no-loop kernel is not generated if the OpenMP teams construct uses a `num_teams` clause. Instead, the compiler attempts to generate a different specialized kernel called the big-jump-loop kernel. The compiler launches the kernel with a grid size determined by the number of teams specified by the OpenMP `num_teams` clause and the `blocksize` chosen either by the compiler or specified by the corresponding OpenMP clause.

#### Xteam Optimized Reduction Kernel Generation
#### Cross-team optimized reduction kernel generation

If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient Xteam communication. New APIs for Xteam reduction are implemented in the device runtime and are automatically generated by clang.
If the OpenMP construct has a reduction clause, the compiler attempts to generate optimized code by utilizing efficient cross-team communication. New APIs for cross-team reduction are implemented in the device runtime and are automatically generated by clang.
@@ -1,4 +1,4 @@
# User/Kernel-Space Support Matrix
# User/kernel-space support matrix

ROCm™ provides forward and backward compatibility between the Kernel Fusion
Driver (KFD) and its user space software for +/- 2 releases. This table shows

@@ -21,3 +21,4 @@ the compatibility combinations that are currently supported.
| 5.6.0 | 5.4.3, 5.5.1 |
| 5.6.1 | 5.7.0 |
| 5.7.0 | 5.5.0, 5.6.1 |
| 5.7.1 | 5.5.0, 5.6.1 |
@@ -1,10 +1,10 @@
# GPU and OS Support (Windows)
# GPU and OS support (Windows)

(supported_skus)=
(windows-support)=

## Supported SKUs

AMD ROCm™ Platform supports the following Windows SKU.
AMD HIP SDK supports the following Windows variants.

| Distribution |Processor Architectures| Validated update |
|---------------------|-----------------------|--------------------|

@@ -12,7 +12,7 @@ AMD ROCm™ Platform supports the following Windows SKU.
| Windows 11 | x86-64 | 22H2 (GA) |
| Windows Server 2022 | x86-64 | |

## Windows Supported GPUs
## Windows-supported GPUs

The table below shows supported GPUs for Radeon Pro™ and Radeon™ GPUs. Please
click the tabs below to switch between GPU product lines. If a GPU is not listed

@@ -41,14 +41,14 @@ on this table, the GPU is not officially supported by AMD.
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7900 XT | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7600 | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7600 | RDNA3 | gfx1102 | ✅ | ✅ |
| AMD Radeon™ RX 6950 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6900 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6750 | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6700 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6700 | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6750 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6650 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 | RDNA2 | gfx1032 | ✅ | ❌ |

@@ -57,27 +57,24 @@ on this table, the GPU is not officially supported by AMD.

::::

### Component Support
### Component support

ROCm components are described in the [reference](../reference/all) page. Support
ROCm components are described in [What is ROCm?](../../what-is-rocm.md) Support
on Windows is provided with two levels of enablement.

- **Runtime**: Runtime enables the use of the HIP/OpenCL runtimes only.
- **HIP SDK**: Runtime plus additional components refer to libraries found under
[Math Libraries](../reference/gpu_libraries/math.md) and
[C++ Primitive Libraries](../reference/gpu_libraries/c%2B%2B_primitives.md).
Some [Math Libraries](../reference/gpu_libraries/math.md) are Linux exclusive,
please check the library details.
* **Runtime**: Runtime enables the use of the HIP and OpenCL runtimes only.
* **HIP SDK**: Runtime plus additional components are listed in [Libraries](../../reference/library-index.md).
Note that some math libraries are Linux exclusive.

### Support Status
### Support status

- ✅: **Supported** - AMD enables these GPUs in our software distributions for
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
- ⚠️: **Deprecated** - Support will be removed in a future release.
- ❌: **Unsupported** - This configuration is not enabled in our software
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.

## CPU Support
## CPU support

ROCm requires CPUs that support PCIe™ Atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe Atomics.
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of
1st generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.
9
docs/about/license.md
Normal file
@@ -0,0 +1,9 @@
# License

> Note: This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that primarily contains documentation. For other licensing information, refer to the [Licensing Terms page](./licensing).

```{include} ../../LICENSE
```

```{include} ./licensing.md
```
@@ -1,15 +1,16 @@
# Licensing Terms
# ROCm licensing terms

ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately.
The following table is a list of ROCm components with links to their respective license
terms. These components may include third party components subject to
additional licenses. Please review individual repositories for more information.
The table shows ROCm components, the name of license and link to the license terms.
The table is ordered to follow ROCm's manifest file.

The table shows ROCm components, the name of license, and link to the license terms.
The table is ordered to follow the ROCm manifest file.

<!-- spellcheck-disable -->
| Component | License |
|:------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------------------:|
|:---------------------|:-------------------------|
| [AMDMIGraphX](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/) | [MIT](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/LICENSE) |
| [HIPCC](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) | [MIT](https://github.com/ROCm-Developer-Tools/HIPCC/blob/develop/LICENSE.txt) |
| [HIPIFY](https://github.com/ROCm-Developer-Tools/HIPIFY/) | [MIT](https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-staging/LICENSE.txt) |

@@ -105,7 +106,7 @@ Advanced Micro Devices, Inc. Other product names used in this publication are
for identification purposes only and may be trademarks of their respective
companies.

## Package Licensing
## Package licensing

```{attention}
AQL Profiler and AOCC CPU optimization are both provided in binary form, each
@@ -1,7 +1,9 @@
# ROCm Release History
# ROCm release history

| Version | Release Date |
| ------- | ------------ |
| [5.7.1](https://rocm.docs.amd.com/en/docs-5.7.1/) | Oct 13, 2023 |
| [5.7.0](https://rocm.docs.amd.com/en/docs-5.7.0/) | Sep 15, 2023 |
| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
@@ -1,25 +1,8 @@
# What is ROCm?
# What's new in ROCm?

ROCm is an open-source stack, composed primarily of open-source software (OSS), designed for
graphics processing unit (GPU) computation. ROCm consists of a collection of drivers, development
tools, and APIs that enable GPU programming from low-level kernel to end-user applications.
ROCm is now supported on Windows.

With ROCm, you can customize your GPU software to meet your specific needs. You can develop,
collaborate, test, and deploy your applications in a free, open-source, integrated, and secure software
ecosystem. ROCm is particularly well-suited to GPU-accelerated high-performance computing (HPC),
artificial intelligence (AI), scientific computing, and computer aided design (CAD).

ROCm is powered by AMD’s
[Heterogeneous-computing Interface for Portability (HIP)](https://github.com/ROCm-Developer-Tools/HIP),
an OSS C++ GPU programming environment and its corresponding runtime. HIP allows ROCm
developers to create portable applications on different platforms by deploying code on a range of
platforms, from dedicated gaming GPUs to exascale HPC clusters.

ROCm supports programming models, such as OpenMP and OpenCL, and includes all necessary OSS
compilers, debuggers, and libraries. ROCm is fully integrated into machine learning (ML) frameworks,
such as PyTorch and TensorFlow.

## ROCm on Windows
## Windows support

Starting with ROCm 5.5, the HIP SDK brings a subset of ROCm to developers on Windows.
The collection of features enabled on Windows is referred to as the HIP SDK.

@@ -62,7 +45,7 @@ Windows and Linux GPU support tables separately.
HIP Ray Tracing is not distributed via ROCm in Linux.
```

### ROCm release versioning
## ROCm release versioning

Linux OS releases set the canonical version numbers for ROCm. Windows will
follow Linux version numbers as Windows releases are based on Linux ROCm

@@ -84,7 +67,7 @@ In general, Windows releases will trail Linux releases. Software developers that
wish to support both Linux and Windows using a single ROCm version should
refrain from upgrading ROCm unless there is a joint release.

### Windows Documentation implications
## Windows documentation implications

The ROCm documentation website contains both Windows and Linux documentation.
Just below each article title, a convenient article information section states

@@ -100,7 +83,7 @@ The software developer must read all the previous ROCm release notes (including)
skipped ROCm versions on Windows for information on all the changes present in
the Windows release.

### Windows Builds from Source
## Windows builds from source

Not all source code required to build Windows from source is available under a
permissive open source license. Build instructions on Windows are only provided
@@ -1,10 +1,10 @@
===========================
How ROCm uses PCIe Atomics
How ROCm uses PCIe atomics
===========================

ROCm PCIe Feature and Overview BAR Memory
==========================================
ROCm PCIe feature and overview of BAR memory
======================================================================

ROCm is an extension of HSA platform architecture, so it shares the queueing model, memory model, signaling and synchronization protocols. Platform atomics are integral to perform queuing and signaling memory operations where there may be multiple-writers across CPU and GPU agents.

@@ -22,33 +22,33 @@ For ROCm the Platform atomics are used in ROCm in the following ways:
* Update HSA queue’s write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to support multi-writer queue insertions.
* Update HSA Signals – 64bit atomic ops are used for CPU & GPU synchronization.

The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through and completed by PCIe components. Routing and completion does not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled or the Atomic Operations will fail even though PCIe endpoint and PCIe I/O Devices has the capability to Atomics Operations.
The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through and completed by PCIe components. Routing and completion does not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled or the Atomic Operations will fail even though PCIe endpoint and PCIe I/O devices has the capability to Atomics Operations.

To do AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp Routing Supported bit in the Device Capabilities 2 register.
To do AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp routing supported bit in the Device Capabilities 2 register.

If your system has a PCIe Express Switch it needs to support AtomicsOp routing. Again AtomicOp requests are permitted only if a component’s ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 Routing is supported, AtomicOp Routing Support=0 routing is not supported.
If your system has a PCIe Express Switch it needs to support AtomicsOp routing. AtomicOp requests are permitted only if a component’s ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 Routing is supported, AtomicOp Routing Support=0 routing is not supported.
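As a quick check (a sketch — field names vary slightly across ``lspci`` versions), the AtomicOp capability and control bits described above show up in the DevCap2 and DevCtl2 lines of verbose ``lspci`` output::

   sudo lspci -vvv | grep -E 'AtomicOpsCap|AtomicOpsCtl'
   # Example output for a capable device:
   #   AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
   #   AtomicOpsCtl: ReqEn+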
|
||||
Atomic Operation is a Non-Posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).
|
||||
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).
|
||||
|
||||
To understand more about how PCIe Atomic operations work `PCIe Atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
|
||||
To understand more about how PCIe atomic operations work, see `PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
|
||||
|
||||
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
|
||||
|
||||
There are also a number of papers which talk about these new capabilities:
|
||||
|
||||
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
|
||||
* `PCI express 3 Accelerator Whitepaper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
|
||||
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
|
||||
* `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
|
||||
* `PCIe Generation 4 Base Specification, which includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
|
||||
|
||||
Other I/O devices with PCIe Atomics support
|
||||
Other I/O devices with PCIe atomics support
|
||||
|
||||
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
|
||||
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
|
||||
* `Xilinx PCIe Ultrascale Whitepaper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
|
||||
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
|
||||
|
||||
Future bus technology with richer I/O Atomics Operation Support
|
||||
Future bus technology with richer I/O atomics operation support
|
||||
|
||||
* GenZ
|
||||
|
||||
@@ -57,22 +57,22 @@ New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or
|
||||
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
|
||||
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
|
||||
|
||||
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
|
||||
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
|
||||
|
||||
| 1. write to another GPU memory,
|
||||
|
||||
|
||||
| 2. then write to system memory to indicate transfer complete.
|
||||
|
||||
They are routed off to different ends of the computer, but we want to make sure the write to system memory indicating transfer complete occurs AFTER the P2P write to the GPU has completed.
|
||||
|
||||
BAR Memory Overview
|
||||
*******************
|
||||
BAR memory overview
|
||||
***************************************************************************************************
|
||||
On a Xeon E5-based system, you can enable above-4GB PCIe addressing in the BIOS; if you do, you need to set the MMIO base address (MMIOH Base) and range (MMIO High Size) in the BIOS.
|
||||
|
||||
On a Supermicro system, you need to set the following in the system BIOS:
|
||||
|
||||
* Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
|
||||
|
||||
|
||||
* Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
|
||||
|
||||
* Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G
|
||||
@@ -86,22 +86,22 @@ For GFX9 and Vega10 which have Physical Address up 44 bit and 48 bit Virtual add
|
||||
* BAR4 register: Optional, not a boot device.
|
||||
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
|
||||
|
||||
Here is how our BAR works on GFX 8 GPU’s with 40 bit Physical Address Limit ::
|
||||
Here is how our base address register (BAR) works on GFX8 GPUs with a 40-bit physical address limit::
|
||||
|
||||
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
|
||||
|
||||
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
|
||||
|
||||
|
||||
Flags: bus master, fast devsel, latency 0, IRQ 119
|
||||
|
||||
|
||||
Memory at bf40000000 (64-bit, prefetchable) [size=256M]
|
||||
|
||||
|
||||
Memory at bf50000000 (64-bit, prefetchable) [size=2M]
|
||||
|
||||
|
||||
I/O ports at 3000 [size=256]
|
||||
|
||||
|
||||
Memory at c7400000 (32-bit, non-prefetchable) [size=256K]
|
||||
|
||||
|
||||
Expansion ROM at c7440000 [disabled] [size=128K]
|
||||
|
||||
Legend:
|
||||
@@ -116,12 +116,12 @@ Legend:
|
||||
|
||||
5 : Expansion ROM – This is required for the AMD driver software to access the GPU’s video BIOS. This is currently fixed at 128KB.
|
||||
|
||||
Excepts form Overview of Changes to PCI Express 3.0
|
||||
===================================================
|
||||
Excerpts from 'Overview of Changes to PCI Express 3.0'
|
||||
================================================================
|
||||
By Mike Jackson, Senior Staff Architect, MindShare, Inc.
|
||||
********************************************************
|
||||
Atomic Operations – Goal:
|
||||
*************************
|
||||
***************************************************************************************************
|
||||
Atomic operations – goal:
|
||||
***************************************************************************************************
|
||||
Support SMP-type operations across a PCIe network to allow for things like offloading tasks between CPU cores and accelerators like a GPU. The spec says this enables advanced synchronization mechanisms that are particularly useful with multiple producers or consumers that need to be synchronized in a non-blocking fashion. Three new atomic non-posted requests were added, plus the corresponding completion (the address must be naturally aligned with the operand size or the TLP is malformed):
|
||||
|
||||
* Fetch and Add – uses one operand as the “add” value. Reads the target location, adds the operand, and then writes the result back to the original location.
|
||||
@@ -132,12 +132,12 @@ Support SMP-type operations across a PCIe network to allow for things like offlo
|
||||
|
||||
* AtomicOpCompletion – new completion type that returns the result of an atomic request and indicates that the atomicity of the transaction has been maintained.
|
||||
|
||||
Since AtomicOps are not locked they don't have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.
|
||||
Since atomic operations are not locked they don't have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.
|
||||
|
||||
AtomicOps can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route AtomicOps is also indicated in the registers for a given port.
|
||||
Atomic operations can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route atomic operations is also indicated in the registers for a given port.
|
||||
|
||||
ID-based Ordering – Goal:
|
||||
*************************
|
||||
ID-based ordering – goal:
|
||||
***************************************************************************************************
|
||||
Improve performance by avoiding stalls caused by ordering rules. For example, posted writes are never normally allowed to pass each other in a queue, but if they are requested by different functions, we can have some confidence that the requests are not dependent on each other. The previously reserved Attribute bit [2] is now combined with the RO bit to indicate ID ordering with or without relaxed ordering.
|
||||
|
||||
This only has meaning for memory requests, and is reserved for Configuration or IO requests. Completers are not required to copy this bit into a completion, and only use the bit if their enable bit is set for this operation.
|
||||
@@ -1,36 +1,30 @@
|
||||
# Inference Optimization with MIGraphX
|
||||
# Inference optimization with MIGraphX
|
||||
|
||||
The following sections cover inferencing and introduces MIGraphX.
|
||||
The following sections cover inferencing and introduce [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/).
|
||||
|
||||
## Inference
|
||||
|
||||
The inference is where capabilities learned during Deep Learning training are put to work. It refers to using a fully trained neural network to make conclusions (predictions) on unseen data that the model has never interacted with before. Deep Learning inferencing is achieved by feeding new data, such as new images, to the network, giving the Deep Neural Network a chance to classify the image.
|
||||
Inference is where capabilities learned during deep-learning training are put to work. It refers to using a fully trained neural network to make conclusions (predictions) on unseen data that the model has never interacted with before. Deep-learning inferencing is achieved by feeding new data, such as new images, to the network, giving the deep neural network a chance to classify the image.
|
||||
|
||||
Taking our previous example of MNIST, the DNN can be fed new images of handwritten digit images, allowing the neural network to classify digits. A fully trained DNN should make accurate predictions about what an image represents, and inference cannot happen without training.
|
||||
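As a concrete illustration, inference typically amounts to a single forward pass with gradients disabled. The following is a minimal PyTorch sketch; the pre-trained `resnet18` model and the random input tensor are stand-ins for a real classifier and a properly preprocessed image, and the `weights` argument assumes a reasonably recent `torchvision`:

```py
import torch
import torchvision

# Stand-ins: a pre-trained classifier and a dummy "image" tensor.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
image = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed image

model.eval()                  # switch off dropout/batch-norm updates
with torch.no_grad():         # gradients are not needed for inference
    logits = model(image)
print(logits.argmax(dim=1).item())  # index of the predicted class
```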
|
||||
## MIGraphX Introduction
|
||||
## MIGraphX introduction
|
||||
|
||||
MIGraphX is a graph compiler focused on accelerating the Machine Learning inference that can target AMD GPUs and CPUs. MIGraphX accelerates the Machine Learning models by leveraging several graph-level transformations and optimizations. These optimizations include:
|
||||
MIGraphX is a graph compiler focused on accelerating machine-learning inference that can target AMD GPUs and CPUs. MIGraphX accelerates machine-learning models by leveraging several graph-level transformations and optimizations. These optimizations include:
|
||||
|
||||
- Operator fusion
|
||||
|
||||
- Arithmetic simplifications
|
||||
|
||||
- Dead-code elimination
|
||||
|
||||
- Common subexpression elimination (CSE)
|
||||
|
||||
- Constant propagation
|
||||
* Operator fusion
|
||||
* Arithmetic simplifications
|
||||
* Dead-code elimination
|
||||
* Common subexpression elimination (CSE)
|
||||
* Constant propagation
|
||||
|
||||
After performing all these transformations, MIGraphX emits code for the AMD GPU by calling MIOpen or rocBLAS, or by creating HIP kernels for a particular operator. MIGraphX can also target CPUs using the DNNL or ZenDNN libraries.
|
||||
|
||||
MIGraphX provides easy-to-use APIs in C++ and Python to import machine models in ONNX or TensorFlow. Users can compile, save, load, and run these models using MIGraphX's C++ and Python APIs. Internally, MIGraphX parses ONNX or TensorFlow models into internal graph representation where each operator in the model gets mapped to an operator within MIGraphX. Each of these operators defines various attributes such as:
|
||||
MIGraphX provides easy-to-use APIs in C++ and Python to import machine-learning models in ONNX or TensorFlow format. Users can compile, save, load, and run these models using the MIGraphX C++ and Python APIs. Internally, MIGraphX parses ONNX or TensorFlow models into an internal graph representation where each operator in the model gets mapped to an operator within MIGraphX. Each of these operators defines various attributes, such as:
|
||||
|
||||
- Number of arguments
|
||||
|
||||
- Type of arguments
|
||||
|
||||
- Shape of arguments
|
||||
* Number of arguments
|
||||
* Type of arguments
|
||||
* Shape of arguments
|
||||
|
||||
After optimization passes, all these operators get mapped to different kernels on GPUs or CPUs.
|
||||
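To make this import-compile-run flow concrete, here is a minimal sketch using the MIGraphX Python module. It assumes a local `model.onnx` file, a GPU build of MIGraphX, and float32 inputs; exact API details may vary between MIGraphX releases:

```py
import numpy as np
import migraphx

# Parse an ONNX model into MIGraphX's internal graph representation.
prog = migraphx.parse_onnx("model.onnx")

# Compile for the GPU target; this is where the graph-level
# transformations (fusion, CSE, constant propagation, ...) are applied.
# offload_copy=True lets us pass host (NumPy) buffers directly.
prog.compile(migraphx.get_target("gpu"), offload_copy=True)

# Build random inputs matching the model's parameter shapes
# (assumes all inputs are float32 for the sake of the sketch).
params = {}
for name, shape in prog.get_parameter_shapes().items():
    params[name] = np.random.rand(*shape.lens()).astype(np.float32)

result = np.array(prog.run(params)[0])
print(result.shape)
```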
|
||||
@@ -40,7 +34,7 @@ After importing a model into MIGraphX, the model is represented as `migraphx::pr
|
||||
|
||||
There are three options to get started with MIGraphX installation. MIGraphX depends on ROCm libraries; the following assumes that the machine already has ROCm installed.
|
||||
|
||||
### Option 1: Installing Binaries
|
||||
### Option 1: installing binaries
|
||||
|
||||
To install MIGraphX on Debian-based systems like Ubuntu, use the following command:
|
||||
|
||||
@@ -50,21 +44,21 @@ sudo apt update && sudo apt install -y migraphx
|
||||
|
||||
The header files and libraries are installed under `/opt/rocm-\<version\>`, where \<version\> is the ROCm version.
|
||||
|
||||
### Option 2: Building from Source
|
||||
### Option 2: building from source
|
||||
|
||||
There are two ways to build the MIGraphX sources.
|
||||
|
||||
- [Use the ROCm build tool](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and build the libraries with just one command.
|
||||
* [Use the ROCm build tool](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-the-rocm-build-tool-rbuild) - This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and build the libraries with just one command.
|
||||
|
||||
or
|
||||
|
||||
- [Use CMake](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source.
|
||||
* [Use CMake](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#use-cmake-to-build-migraphx) - This approach uses a script to install the prerequisites, then uses CMake to build the source.
|
||||
|
||||
For detailed steps on building from source and installing dependencies, refer to the following `README` file:
|
||||
|
||||
[https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#building-from-source](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX#building-from-source)
|
||||
|
||||
### Option 3: Use Docker
|
||||
### Option 3: use Docker
|
||||
|
||||
To use Docker, follow these steps:
|
||||
|
||||
@@ -88,7 +82,7 @@ To use Docker, follow these steps:
|
||||
|
||||
The Docker image contains all the prerequisites required for the installation, so users can go to the folder `/code/AMDMIGraphX` and follow the steps mentioned in [Option 2: Building from Source](#option-2-building-from-source).
|
||||
|
||||
## MIGraphX Example
|
||||
## MIGraphX example
|
||||
|
||||
MIGraphX provides both C++ and Python APIs. The following sections show examples of both using the Inception v3 model. To walk through the examples, fetch the Inception v3 ONNX model by running the following:
|
||||
|
||||
@@ -215,23 +209,23 @@ Follow these steps:
|
||||
./inception_inference
|
||||
```
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
Set `LD_LIBRARY_PATH` to `/opt/rocm/lib` if required during the build. Additional examples can be found in the MIGraphX repository under the `/examples/` directory.
|
||||
:::
|
||||
```
|
||||
|
||||
## Tuning MIGraphX
|
||||
|
||||
MIGraphX uses MIOpen kernels to target AMD GPUs. For a model compiled with MIGraphX, tune MIOpen to pick the best possible kernel implementation. MIOpen tuning results in a significant performance boost. Tuning can be done by setting the environment variable `MIOPEN_FIND_ENFORCE=3`.
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
The tuning process can take a long time to finish.
|
||||
:::
|
||||
```
|
||||
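For example, here is a minimal sketch of setting the variable for a single run from Python; the `./inception_inference` binary is the example built earlier, and you can equally `export` the variable in your shell before launching the application:

```py
import os
import subprocess

# Enable exhaustive MIOpen kernel search for this run only; MIOpen
# persists the winning configurations in its user database.
env = dict(os.environ, MIOPEN_FIND_ENFORCE="3")

# Example application binary; substitute your own MIGraphX program.
subprocess.run(["./inception_inference"], env=env, check=True)
```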
|
||||
**Example:** The average inference time of the Inception model example shown previously over 100 iterations using untuned kernels is 0.01383 ms. After tuning, it reduces to 0.00459 ms, which is a 3x improvement. This result is from ROCm v4.5 on an MI100 GPU.
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
The results may vary depending on the system configurations.
|
||||
:::
|
||||
```
|
||||
|
||||
For reference, the following code snippet shows inference runs for only the first 10 iterations for both tuned and untuned kernels:
|
||||
|
||||
@@ -311,7 +305,7 @@ MIGraphX introduces a feature, known as YModel, that stores the kernel config pa
|
||||
|
||||
The YModel feature is available starting from ROCm 5.4.1 and UIF 1.1.
|
||||
|
||||
#### YModel Example
|
||||
#### YModel example
|
||||
|
||||
Through the `migraphx-driver` functionality, you can generate `.mxr` files with tuning information stored inside them by passing the additional flags `--binary --output model.mxr` to `migraphx-driver` along with the rest of the necessary flags.
|
||||
|
||||
@@ -327,12 +321,6 @@ To run generated `.mxr` files through `migraphx-driver`, use the following:
|
||||
./path/to/migraphx-driver run --migraphx resnet50.mxr --enable-offload-copy
|
||||
```
|
||||
|
||||
Alternatively, you can use MIGraphX's C++ or Python API to generate `.mxr` file. Refer to {numref}`image018` for an example.
|
||||
Alternatively, you can use the MIGraphX C++ or Python API to generate `.mxr` files.
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/image.018.png
|
||||
:name: image018
|
||||
---
|
||||
align: center
|
||||
---
|
||||
Generating a `.mxr` File
|
||||
```
|
||||

|
||||
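As a rough sketch of the Python route (assuming the `migraphx` Python module; the serialization call and its `format` argument follow recent MIGraphX documentation, so verify against your installed version):

```py
import migraphx

# Parse and compile a model, then serialize it together with its
# tuning information to an .mxr file.
prog = migraphx.parse_onnx("resnet50.onnx")
prog.compile(migraphx.get_target("gpu"), offload_copy=True)
migraphx.save(prog, "resnet50.mxr", format="msgpack")
```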
@@ -1,19 +1,19 @@
|
||||
# Inception V3 with PyTorch
|
||||
# Deep learning: Inception V3 with PyTorch
|
||||
|
||||
## Deep Learning Training
|
||||
## Deep learning training
|
||||
|
||||
Deep Learning models are designed to capture the complexity of the problem and the underlying data. These models are "deep," comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective.
|
||||
Deep-learning models are designed to capture the complexity of the problem and the underlying data. These models are "deep," comprising multiple component layers. Training is finding the best parameters for each model layer to achieve a well-defined objective.
|
||||
|
||||
In supervised learning, the training data consists of input features similar to what the learned model is expected to see during the evaluation or inference phase. The target output is also included, which serves to teach the model. A loss metric is defined as part of training to evaluate the model's performance during the training process.
|
||||
|
||||
Training also includes the choice of an optimization algorithm that reduces the loss by adjusting the model's parameters. Training is an iterative process where training data is fed in, usually split into different batches, with the entirety of the training data passed during one training epoch. Training is usually run for multiple epochs; a minimal sketch of such a loop follows.
|
||||
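The following is a minimal, self-contained PyTorch sketch of that loop; the tiny linear model and the random batches are placeholders standing in for a real network and dataset:

```py
import torch

# Tiny synthetic setup standing in for a real model and dataset.
model = torch.nn.Linear(10, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.rand(16, 10), torch.randint(0, 2, (16,))) for _ in range(4)]

for epoch in range(3):            # one epoch passes over the whole dataset
    for inputs, targets in data:  # one iteration per batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)  # forward pass
        loss.backward()           # backward pass: compute gradients
        optimizer.step()          # optimization pass: update parameters
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```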
|
||||
## Training Phases
|
||||
## Training phases
|
||||
|
||||
Training occurs in multiple phases for every batch of training data. {numref}`TypesOfTrainingPhases` provides an explanation of the types of training phases.
|
||||
Training occurs in multiple phases for every batch of training data. The following table provides an explanation of the types of training phases.
|
||||
|
||||
:::{table} Types of Training Phases
|
||||
:name: TypesOfTrainingPhases
|
||||
:name: training-phases
|
||||
:widths: auto
|
||||
| Types of Phases | |
|
||||
| ----------------- | --- |
|
||||
@@ -23,10 +23,10 @@ Training occurs in multiple phases for every batch of training data. {numref}`Ty
|
||||
| Optimization Pass | The optimization algorithm updates the model parameters using the stored error gradients. |
|
||||
:::
|
||||
|
||||
Training is different from inference, particularly from the hardware perspective. {numref}`TrainingVsInference` shows the contrast between training and inference.
|
||||
Training is different from inference, particularly from the hardware perspective. The following table shows the contrast between training and inference.
|
||||
|
||||
:::{table} Training vs. Inference
|
||||
:name: TrainingVsInference
|
||||
:name: training-inference
|
||||
:widths: auto
|
||||
| Training | Inference |
|
||||
| ----------- | ----------- |
|
||||
@@ -38,25 +38,25 @@ Training is different from inference, particularly from the hardware perspective
|
||||
|
||||
Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations for different datatypes, leading to improved performance if a faster datatype can be selected for the corresponding task.
|
||||
|
||||
## Case Studies
|
||||
## Case studies
|
||||
|
||||
The following sections contain case studies for the Inception v3 model.
|
||||
The following sections contain case studies for the Inception V3 model.
|
||||
|
||||
### Inception v3 with PyTorch
|
||||
### Inception V3 with PyTorch
|
||||
|
||||
Convolution Neural Networks are forms of artificial neural networks commonly used for image processing. One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. Inception v3[^inception_arch] is an architectural development over the ImageNet competition-winning entry, AlexNet, using more profound and broader networks while attempting to meet computational and memory budgets.
|
||||
Convolutional neural networks are forms of artificial neural networks commonly used for image processing. One of the core layers of such a network is the convolutional layer, which convolves the input with a weight tensor and passes the result to the next layer. Inception V3[^inception_arch] is an architectural development over the ImageNet competition-winning entry, AlexNet, using deeper and broader networks while attempting to meet computational and memory budgets.
|
||||
|
||||
The implementation uses PyTorch as a framework. This case study utilizes `torchvision`[^torch_vision], a repository of popular datasets and model architectures, for obtaining the model. `torchvision` also provides pre-trained weights as a starting point to develop new models or fine-tune the model for a new task.
|
||||
The implementation uses PyTorch as a framework. This case study utilizes [TorchVision](https://pytorch.org/vision/stable/index.html), a repository of popular datasets and model architectures, for obtaining the model. TorchVision also provides pre-trained weights as a starting point to develop new models or fine-tune the model for a new task.
|
||||
|
||||
#### Evaluating a Pre-Trained Model
|
||||
#### Evaluating a pre-trained model
|
||||
|
||||
The Inception v3 model introduces a simple image classification task with the pre-trained model. This does not involve training but utilizes an already pre-trained model from `torchvision`.
|
||||
This example introduces a simple image classification task using the pre-trained Inception V3 model. This does not involve training but utilizes an already pre-trained model from TorchVision.
|
||||
|
||||
This example is adapted from the PyTorch research hub page on Inception v3[^torch_vision_inception].
|
||||
This example is adapted from the PyTorch research hub page on [Inception V3](https://pytorch.org/vision/master/models/inception.html).
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](/how_to/pytorch_install/pytorch_install.md) for setting up a PyTorch environment on ROCm.
|
||||
1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
|
||||
|
||||
```dockerfile
|
||||
docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
|
||||
@@ -85,7 +85,7 @@ Follow these steps:
|
||||
except: urllib.request.urlretrieve(url, filename)
|
||||
```
|
||||
|
||||
5. Import `torchvision` and `PIL.Image` support libraries.
|
||||
5. Import the TorchVision and `PIL.Image` support libraries.
|
||||
|
||||
```py
|
||||
from PIL import Image
|
||||
@@ -140,13 +140,13 @@ Follow these steps:
|
||||
print(categories[top5_catid[i]], top5_prob[i].item())
|
||||
```
|
||||
|
||||
#### Training Inception v3
|
||||
#### Training Inception V3
|
||||
|
||||
The previous section focused on downloading and using the Inception v3 model for a simple image classification task. This section walks through training the model on a new dataset.
|
||||
The previous section focused on downloading and using the Inception V3 model for a simple image classification task. This section walks through training the model on a new dataset.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](how_to/pytorch_install/pytorch_install.md) for setting up a PyTorch environment on ROCm.
|
||||
1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
|
||||
|
||||
```dockerfile
|
||||
docker pull rocm/pytorch:latest
|
||||
@@ -196,7 +196,7 @@ Follow these steps:
|
||||
|
||||
5. Open a Python shell.
|
||||
|
||||
6. Import dependencies, including `torch`, `os`, and `torchvision`.
|
||||
6. Import dependencies, including Torch, OS, and [TorchVision](https://github.com/pytorch/vision).
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -208,9 +208,9 @@ Follow these steps:
|
||||
|
||||
7. Set parameters to guide the training process.
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
The device is set to `"cuda"`. In PyTorch, `"cuda"` is a generic keyword to denote a GPU.
|
||||
:::
|
||||
```
|
||||
|
||||
```py
|
||||
device = "cuda"
|
||||
@@ -222,7 +222,7 @@ Follow these steps:
|
||||
data_path = "tiny-imagenet-200"
|
||||
```
|
||||
|
||||
The training image size is cropped for input into Inception v3.
|
||||
The training image size is cropped for input into Inception V3.
|
||||
|
||||
```py
|
||||
train_crop_size = 299
|
||||
@@ -241,7 +241,7 @@ Follow these steps:
|
||||
val_resize_size = 342
|
||||
```
|
||||
|
||||
The pre-trained Inception v3 model is chosen to be downloaded from `torchvision`.
|
||||
   The pre-trained Inception V3 model is chosen to be downloaded from TorchVision.
|
||||
|
||||
```py
|
||||
model_name = "inception_v3"
|
||||
@@ -270,9 +270,9 @@ Follow these steps:
|
||||
lr_gamma = 0.1
|
||||
```
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
One training epoch is when the neural network passes an entire dataset forward and backward.
|
||||
:::
|
||||
```
|
||||
|
||||
```py
|
||||
epochs = 90
|
||||
@@ -333,9 +333,9 @@ Follow these steps:
|
||||
)
|
||||
```
|
||||
|
||||
:::{note}
|
||||
Use `torchvision` to obtain the Inception v3 model. Use the pre-trained model weights to speed up training.
|
||||
:::
|
||||
```{note}
|
||||
Use TorchVision to obtain the Inception V3 model. Use the pre-trained model weights to speed up training.
|
||||
```
|
||||
|
||||
```py
|
||||
print("Creating model")
|
||||
@@ -343,7 +343,7 @@ Follow these steps:
|
||||
model = torchvision.models.__dict__[model_name](pretrained=pretrained)
|
||||
```
|
||||
|
||||
11. Adapt Inception v3 for the current dataset. `tiny-imagenet-200` contains only 200 classes, whereas Inception v3 is designed for 1,000-class output. The last layer of Inception v3 is replaced to match the output features required.
|
||||
11. Adapt Inception V3 for the current dataset. `tiny-imagenet-200` contains only 200 classes, whereas Inception V3 is designed for 1,000-class output. The last layer of Inception V3 is replaced to match the output features required.
|
||||
|
||||
```py
|
||||
model.fc = torch.nn.Linear(model.fc.in_features, len(dataset.classes))
|
||||
@@ -461,23 +461,17 @@ Follow these steps:
|
||||
torch.save(model.state_dict(), "trained_inception_v3.pt")
|
||||
```
|
||||
|
||||
Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in {numref}`inceptionV3`.
|
||||
Plotting the train and test loss shows both metrics reducing over training epochs. This is demonstrated in the following image.
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/inception_v3.png
|
||||
:name: inceptionV3
|
||||
---
|
||||
align: center
|
||||
---
|
||||
Inception v3 Train and Loss Graph
|
||||
```
|
||||

|
||||
|
||||
### Custom Model with CIFAR-10 on PyTorch
|
||||
### Custom model with CIFAR-10 on PyTorch
|
||||
|
||||
The CIFAR-10 (Canadian Institute for Advanced Research) dataset is a subset of the Tiny Images dataset (which contains 80 million images of 32x32 collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework and go step-by-step as illustrated below.
|
||||
The Canadian Institute for Advanced Research (CIFAR)-10 dataset is a subset of the Tiny Images dataset (which contains 80 million 32x32 images collected from the Internet) and consists of 60,000 32x32 color images. The images are labeled with one of 10 mutually exclusive classes: airplane, motor car, bird, cat, deer, dog, frog, cruise ship, stallion, and truck (but not pickup truck). There are 6,000 images per class, with 5,000 training and 1,000 testing images per class. Let us prepare a custom model for classifying these images using the PyTorch framework, going step by step as illustrated below.
|
||||
|
||||
Follow these steps:
|
||||
|
||||
1. Import dependencies, including `torch`, `os`, and `torchvision`.
|
||||
1. Import dependencies, including Torch, OS, and [TorchVision](https://github.com/pytorch/vision).
|
||||
|
||||
```py
|
||||
import torch
|
||||
@@ -487,7 +481,7 @@ Follow these steps:
|
||||
import numpy as np
|
||||
```
|
||||
|
||||
2. The output of `torchvision` datasets is `PILImage` images of range [0, 1]. Transform them to Tensors of normalized range [-1, 1].
|
||||
2. The output of TorchVision datasets is `PILImage` images in the range [0, 1]. Transform them to tensors in the normalized range [-1, 1].
|
||||
|
||||
```py
|
||||
transform = transforms.Compose(
|
||||
@@ -668,13 +662,13 @@ Follow these steps:
|
||||
print("Accuracy for class {:5s} is: {:.1f} %".format(classname,accuracy))
|
||||
```
|
||||
|
||||
### Case Study: TensorFlow with Fashion MNIST
|
||||
### Case study: TensorFlow with Fashion-MNIST
|
||||
|
||||
Fashion MNIST is a dataset that contains 70,000 grayscale images in 10 categories.
|
||||
Fashion-MNIST is a dataset that contains 70,000 grayscale images in 10 categories.
|
||||
|
||||
Implement and train a neural network model using the TensorFlow framework to classify images of clothing, like sneakers and shirts.
|
||||
|
||||
The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion MNIST dataset can be accessed via TensorFlow internal libraries.
|
||||
The dataset has 60,000 images you will use to train the network and 10,000 to evaluate how accurately the network learned to classify images. The Fashion-MNIST dataset can be accessed via TensorFlow internal libraries.
|
||||
|
||||
Access the source code from the following repository:
|
||||
|
||||
@@ -696,7 +690,7 @@ To understand the code step by step, follow these steps:
|
||||
print(tf.__version__)
|
||||
```
|
||||
|
||||
3. Load the dataset from the available internal libraries to analyze and train a neural network upon the MNIST Fashion Dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, train_images and train_labels, to learn.
|
||||
3. Load the dataset from the available internal libraries to analyze and train a neural network on the Fashion-MNIST dataset. Loading the dataset returns four NumPy arrays. The model uses the training set arrays, `train_images` and `train_labels`, to learn.
|
||||
|
||||
4. The model is tested against the test set: the `test_images` and `test_labels` arrays, as sketched below.
|
||||
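   A minimal sketch of the load call described in steps 3 and 4 (the dataset ships with TensorFlow's Keras API):

   ```py
   import tensorflow as tf

   # Loading Fashion-MNIST returns four NumPy arrays: training images
   # and labels, plus test images and labels.
   fashion_mnist = tf.keras.datasets.fashion_mnist
   (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
   print(train_images.shape, test_images.shape)  # (60000, 28, 28) (10000, 28, 28)
   ```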
|
||||
@@ -741,11 +735,7 @@ To understand the code step by step, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/mnist_1.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
10. From the above picture, you can see that the pixel values range from zero to 255. Before training the neural network on them, you must bring them into the range of zero to one. Hence, divide the values by 255, as sketched below.
|
||||
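    A sketch of that scaling step, assuming the `train_images` and `test_images` arrays loaded earlier:

    ```py
    # Scale pixel values from the range [0, 255] to [0, 1] before training.
    train_images = train_images / 255.0
    test_images = test_images / 255.0
    ```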
|
||||
@@ -769,13 +759,9 @@ To understand the code step by step, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/mnist_2.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep Learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training.
|
||||
The basic building block of a neural network is the layer. Layers extract representations from the data fed into them. Deep learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training.
|
||||
|
||||
```py
|
||||
model = tf.keras.Sequential([
|
||||
@@ -785,9 +771,9 @@ To understand the code step by step, follow these steps:
|
||||
])
|
||||
```
|
||||
|
||||
- The first layer in this network `tf.keras.layers.Flatten` transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.
|
||||
* The first layer in this network `tf.keras.layers.Flatten` transforms the format of the images from a two-dimensional array (of 28 x 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.
|
||||
|
||||
- After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes.
|
||||
* After the pixels are flattened, the network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected or fully connected neural layers. The first Dense layer has 128 nodes (or neurons). The second (and last) layer returns a logits array with a length of 10. Each node contains a score that indicates the current image belongs to one of the 10 classes.
|
||||
|
||||
12. You must add the loss function, metrics, and optimizer at the time of model compilation.
|
||||
|
||||
@@ -797,11 +783,11 @@ To understand the code step by step, follow these steps:
|
||||
metrics=['accuracy'])
|
||||
```
|
||||
|
||||
- Loss function —This measures how accurate the model is during training when you are looking to minimize this function to "steer" the model in the right direction.
|
||||
* Loss function – This measures how accurate the model is during training; you want to minimize this function to "steer" the model in the right direction.
|
||||
|
||||
- Optimizer —This is how the model is updated based on the data it sees and its loss function.
|
||||
* Optimizer – This is how the model is updated based on the data it sees and its loss function.
|
||||
|
||||
- Metrics —This is used to monitor the training and testing steps.
|
||||
* Metrics – This is used to monitor the training and testing steps.
|
||||
|
||||
The following example uses accuracy, the fraction of the correctly classified images.
|
||||
|
||||
@@ -895,11 +881,7 @@ To understand the code step by step, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/mnist_3.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
```py
|
||||
i = 12
|
||||
@@ -911,11 +893,7 @@ To understand the code step by step, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/mnist_4.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
10. Use the trained model to predict a single image.
|
||||
|
||||
@@ -946,11 +924,7 @@ To understand the code step by step, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/mnist_5.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
13. `tf.keras.Model.predict` returns a list of lists—one for each image in the batch of data. Grab the predictions for our (only) image in the batch.
|
||||
|
||||
@@ -958,7 +932,7 @@ To understand the code step by step, follow these steps:
|
||||
np.argmax(predictions_single[0])
|
||||
```
|
||||
|
||||
### Case Study: TensorFlow with Text Classification
|
||||
### Case study: TensorFlow with text classification
|
||||
|
||||
This procedure demonstrates text classification starting from plain text files stored on disk. You will train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try in which you will train a multi-class classifier to predict the tag for a programming question on Stack Overflow.
|
||||
|
||||
@@ -988,7 +962,7 @@ Follow these steps:
|
||||
cache_subdir='')
|
||||
```
|
||||
|
||||
```py
|
||||
```bash
|
||||
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
|
||||
84131840/84125825 [==============================] – 1s 0us/step
|
||||
84149932/84125825 [==============================] – 1s 0us/step
|
||||
@@ -1115,11 +1089,7 @@ To prepare the data for training, follow these steps:
|
||||
print("Vectorized review", vectorize_text(first_review, first_label))
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/TextClassification_3.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
5. As you can see above, each token has been replaced by an integer. Look up the token (string) that each integer corresponds to by calling `get_vocabulary()` on the layer, for example:
|
||||
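   The following sketch assumes the tutorial's `vectorize_layer` (a `TextVectorization` layer) has already been adapted to the dataset:

   ```py
   # Map a few integer token ids back to their string tokens.
   vocab = vectorize_layer.get_vocabulary()
   print("1287 ---> ", vocab[1287])
   print(" 313 ---> ", vocab[313])
   print("Vocabulary size: {}".format(len(vocab)))
   ```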
|
||||
@@ -1158,11 +1128,7 @@ To prepare the data for training, follow these steps:
|
||||
model.summary()
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/TextClassification_4.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
8. A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), use the [`losses.BinaryCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) loss function, as sketched below.
|
||||
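   A sketch of the corresponding compile call, mirroring the upstream TensorFlow text-classification tutorial and assuming the `model` built in the previous step (the metric threshold of 0.0 assumes raw logits rather than probabilities):

   ```py
   import tensorflow as tf
   from tensorflow.keras import losses

   # Binary classification on raw logits: pair BinaryCrossentropy(from_logits=True)
   # with a BinaryAccuracy metric thresholded at 0.0 instead of 0.5.
   model.compile(
       loss=losses.BinaryCrossentropy(from_logits=True),
       optimizer="adam",
       metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)],
   )
   ```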
|
||||
@@ -1178,11 +1144,7 @@ To prepare the data for training, follow these steps:
|
||||
history = model.fit(train_ds,validation_data=val_ds,epochs=epochs)
|
||||
```
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/TextClassification_5.png
|
||||
---
|
||||
align: center
|
||||
---
|
||||
```
|
||||

|
||||
|
||||
10. See how the model performs. Two values are returned: loss (a number representing our error; lower values are better) and accuracy.
|
||||
|
||||
@@ -1193,9 +1155,9 @@ To prepare the data for training, follow these steps:
|
||||
print("Accuracy: ", accuracy)
|
||||
```
|
||||
|
||||
:::{note}
|
||||
```{note}
|
||||
`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training.
|
||||
:::
|
||||
```
|
||||
|
||||
```py
|
||||
history_dict = history.history
|
||||
@@ -1224,23 +1186,11 @@ To prepare the data for training, follow these steps:
|
||||
plt.show()
|
||||
```
|
||||
|
||||
{numref}`TextClassification6` and {numref}`TextClassification7` illustrate the training and validation loss and the training and validation accuracy.
|
||||
The following images illustrate the training and validation loss and the training and validation accuracy.
|
||||
|
||||
```{figure} ../../data/understand/deep_learning/TextClassification_6.png
|
||||
:name: TextClassification6
|
||||
---
|
||||
align: center
|
||||
---
|
||||
Training and Validation Loss
|
||||
```
|
||||

|
||||
|
||||
```{figure} ../../data/understand/deep_learning/TextClassification_7.png
|
||||
:name: TextClassification7
|
||||
---
|
||||
align: center
|
||||
---
|
||||
Training and Validation Accuracy
|
||||
```
|
||||

|
||||
|
||||
12. Export the model.
|
||||
|
||||
@@ -1271,15 +1221,3 @@ To prepare the data for training, follow these steps:
|
||||
|
||||
export_model.predict(examples)
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
[^inception_arch]: C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," CoRR, p. abs/1512.00567, 2015
|
||||
|
||||
[^torch_vision]: PyTorch, \[Online\]. Available: [https://pytorch.org/vision/stable/index.html](https://pytorch.org/vision/stable/index.html)
|
||||
|
||||
[^torch_vision_inception]: PyTorch, \[Online\]. Available: [https://pytorch.org/hub/pytorch_vision_inception_v3/](https://pytorch.org/hub/pytorch_vision_inception_v3/)
|
||||
|
||||
[^Stanford_deep_learning]: Stanford, \[Online\]. Available: [http://cs231n.stanford.edu/](http://cs231n.stanford.edu/)
|
||||
|
||||
[^cross_entropy]: Wikipedia, \[Online\]. Available: [https://en.wikipedia.org/wiki/Cross_entropy](https://en.wikipedia.org/wiki/Cross_entropy)
|
||||
@@ -6,7 +6,7 @@ Most components in ROCm support CMake. Projects depending on header-only or
|
||||
library components typically require CMake 3.5 or higher whereas those wanting
|
||||
to make use of CMake's HIP language support will require CMake 3.21 or higher.
|
||||
|
||||
Finding Dependencies
|
||||
Finding dependencies
|
||||
====================
|
||||
|
||||
.. note::
|
||||
@@ -20,10 +20,10 @@ Finding Dependencies
|
||||
|
||||
In short, CMake supports finding dependencies in two ways:
|
||||
|
||||
- In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to
|
||||
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to
|
||||
find the component in typical install locations and layouts. CMake ships a
|
||||
few dozen such scripts, but users and projects may ship them as well.
|
||||
- In Config mode, it locates a file named ``<packagename>-config.cmake`` or
|
||||
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
|
||||
``<PackageName>Config.cmake`` which describes the installed component in all
|
||||
regards needed to consume it.
|
||||
|
||||
@@ -31,12 +31,12 @@ ROCm predominantly relies on Config mode, one notable exception being the Module
|
||||
driving the compilation of HIP programs on Nvidia runtimes. As such, when
|
||||
dependencies are not found in standard system locations, one either has to
|
||||
instruct CMake to search for package config files in additional folders using
|
||||
the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of filesystem
|
||||
the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of file system
|
||||
paths), or using the ``<PackageName>_ROOT`` variable on a project-specific basis.
|
||||
|
||||
There are nearly a dozen ways to set these variables. One may be more convenient
|
||||
than another, depending on your workflow. Conceptually the simplest is adding
|
||||
it to your CMake configuration command on the command-line via
|
||||
it to your CMake configuration command on the command line via
|
||||
``-D CMAKE_PREFIX_PATH=....`` . AMD packaged ROCm installs can typically be
|
||||
added to the config file search paths such as:
|
||||
|
||||
@@ -50,12 +50,13 @@ the *config-file* packages are shipped with the upstream projects, such as
|
||||
rocPRIM and other ROCm libraries.
|
||||
|
||||
For a complete guide on where and how ROCm may be installed on a system, refer
|
||||
to the installation guides in these docs (`Linux <../deploy/linux/index.html>`_).
|
||||
to the installation guides for `Linux <../install/linux/install.html>`_ and
|
||||
`Windows <../install/windows/install.html>`_.
|
||||
|
||||
Using HIP in CMake
|
||||
==================
|
||||
|
||||
ROCm componenents providing a C/C++ interface support being consumed using any
|
||||
ROCm components providing a C/C++ interface support consumption via any
|
||||
C/C++ toolchain that CMake knows how to drive. ROCm also supports CMake's HIP
|
||||
language features, allowing users to program using the HIP single-source
|
||||
programming model. When a program (or translation-unit) uses the HIP API without
|
||||
@@ -101,7 +102,7 @@ via ``CMAKE_HIP_ARCHITECTURES``, CMake will select some sensible default. It is
|
||||
advised though that if a user knows what devices they wish to target, then set
|
||||
this variable explicitly.
|
||||
|
||||
Consuming ROCm C/C++ Libraries
|
||||
Consuming ROCm C/C++ libraries
|
||||
------------------------------
|
||||
|
||||
Libraries such as rocBLAS, rocFFT, MIOpen, etc. behave as C/C++ libraries.
|
||||
@@ -188,7 +189,7 @@ target GPU architectures is done via setting the ``GPU_TARGETS`` variable.
|
||||
default, this is set to some subset of the currently supported architectures of
|
||||
AMD ROCm. It can be set to, e.g., ``-D GPU_TARGETS="gfx1032;gfx1035"``.
|
||||
|
||||
ROCm CMake Packages
|
||||
ROCm CMake packages
|
||||
-------------------
|
||||
|
||||
+-----------+----------+--------------------------------------------------------+
|
||||
@@ -229,10 +230,10 @@ ROCm CMake Packages
|
||||
| | | ``migraphx::migraphx_onnx``, ``migraphx::migraphx_tf`` |
|
||||
+-----------+----------+--------------------------------------------------------+
|
||||
|
||||
Using CMake Presets
|
||||
Using CMake presets
|
||||
===================
|
||||
|
||||
CMake command-lines depending on how specific users like to be when compiling
|
||||
CMake command lines, depending on how specific users like to be when compiling
|
||||
code, can grow to unwieldy lengths. This is the primary reason why projects tend
|
||||
to bake script snippets into their build definitions controlling compiler
|
||||
warning levels, changing CMake defaults (``CMAKE_BUILD_TYPE`` or
|
||||
@@ -255,7 +256,7 @@ a setup'n'forget fashion for quick assembly using graphical front-ends. This is
|
||||
all nice, but configurations aren't portable, nor can they be reused in
|
||||
Continuous Integration (CI) pipelines. CMake has condensed existing practice
|
||||
into a portable JSON format that works in all IDEs and can be invoked from any
|
||||
command-line. This is
|
||||
command line. This is
|
||||
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_
|
||||
.
|
||||
|
||||
@@ -376,7 +377,7 @@ applications on a typical ROCm installation:
|
||||
|
||||
.. note::
|
||||
Getting presets to work reliably on Windows requires some CMake improvements
|
||||
and/or support from compiler vendors. (Refer to
|
||||
and/or support from compiler vendors. (Refer to
|
||||
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
|
||||
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
|
||||
.)
|
||||
@@ -1,15 +1,15 @@
|
||||
# ROCm Compilers Disambiguation
|
||||
# ROCm compilers disambiguation
|
||||
|
||||
ROCm ships multiple compilers of varying origins and purposes. This article
|
||||
disambiguates compiler naming used throughout the documentation.
|
||||
|
||||
## Compiler Terms
|
||||
## Compiler terms
|
||||
|
||||
| Term | Description |
|
||||
| - | - |
|
||||
| `amdclang++` | Clang/LLVM-based compiler that is part of `rocm-llvm` package. The source code is available at <a href="https://github.com/RadeonOpenCompute/llvm-project" target="_blank">https://github.com/RadeonOpenCompute/llvm-project</a>. |
|
||||
| AOCC | Closed-source Clang-based compiler that includes additional CPU optimizations. Offered as part of ROCm via the `rocm-llvm-alt` package. For details, see <a href="https://developer.amd.com/amd-aocc/" target="_blank">https://developer.amd.com/amd-aocc/</a>. |
|
||||
| HIP-Clang | Informal term for the `amdclang++` compiler |
|
||||
| HIPify | Tools including `hipify-clang` and `hipify-perl`, used to automatically translate CUDA source code into portable HIP C++. The source code is available at <a href="https://github.com/ROCm-Developer-Tools/HIPIFY" target="_blank">https://github.com/ROCm-Developer-Tools/HIPIFY</a> |
|
||||
| HIPIFY | Tools including `hipify-clang` and `hipify-perl`, used to automatically translate CUDA source code into portable HIP C++. The source code is available at <a href="https://github.com/ROCm-Developer-Tools/HIPIFY" target="_blank">https://github.com/ROCm-Developer-Tools/HIPIFY</a> |
|
||||
| `hipcc` | HIP compiler driver. A utility that invokes `clang` or `nvcc` depending on the target and passes the appropriate include and library options for the target compiler and HIP infrastructure. The source code is available at <a href="https://github.com/ROCm-Developer-Tools/HIPCC" target="_blank">https://github.com/ROCm-Developer-Tools/HIPCC</a>. |
|
||||
| ROCmCC | Clang/LLVM-based compiler. ROCmCC in itself is not a binary but refers to the overall compiler. |
|
||||
@@ -1,10 +1,10 @@
|
||||
# ROCm FHS Reorganization
|
||||
# ROCm Linux Filesystem Hierarchy Standard reorganization
|
||||
|
||||
## Introduction
|
||||
|
||||
The ROCm platform has adopted the Linux foundation Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm filesystem is supported, and how improved versioning specifications are applied to ROCm.
|
||||
The ROCm platform has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to the FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.
|
||||
|
||||
## Adopting the Linux foundation Filesystem Hierarchy Standard (FHS)
|
||||
## Adopting the FHS
|
||||
|
||||
In order to standardize the ROCm directory structure and directory content layout, ROCm has adopted the [FHS](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html), adhering to open source conventions for Linux-based distributions. The FHS ensures internal consistency within the ROCm stack, as well as external consistency with other systems and distributions. The proposed ROCm file structure is outlined below:
|
||||
|
||||
@@ -44,7 +44,7 @@ In order to standardize ROCm directory structure and directory content layout RO
|
||||
| -- architecture independent misc files
|
||||
```
|
||||
|
||||
## Changes From Earlier ROCm Versions
|
||||
## Changes from earlier ROCm versions
|
||||
|
||||
The following table provides a brief overview of the new ROCm FHS layout, compared to the layout of earlier ROCm versions. Note that `/opt/` is used to denote the default ROCm installation path and should be replaced in case of a non-standard installation location of the ROCm distribution.
|
||||
|
||||
@@ -73,11 +73,11 @@ The following table provides a brief overview of the new ROCm FHS layout, compar
|
||||
|______________________________________________________|
|
||||
```
|
||||
|
||||
## ROCm FHS Reorganization: Backward Compatibility
|
||||
## ROCm FHS reorganization: backward compatibility
|
||||
|
||||
The FHS file organization for ROCm was first introduced in the release of ROCm 5.2. Backward compatibility was implemented to make sure users could still run their ROCm applications while transitioning to the new FHS. ROCm has moved header files and libraries to their new locations as indicated in the above structure, and included symbolic links and wrapper header files in their old locations for backward compatibility. The following sections detail the ROCm backward compatibility implementation for wrapper header files, executable files, library files, and CMake config files.
|
||||
|
||||
### Wrapper Header Files
|
||||
### Wrapper header files
|
||||
|
||||
Wrapper header files are placed in the old location (
|
||||
`/opt/rocm-<ver>/<component>/include`) with a warning message to include files
|
||||
@@ -88,10 +88,10 @@ from the new location (`/opt/rocm-<ver>/include`) as shown in the example below.
|
||||
#include <hip/hip_runtime.h>
|
||||
```
|
||||
|
||||
- Starting at ROCm 5.2 release, the deprecation for backward compatibility wrapper header files is: `#pragma` message announcing `#warning`.
|
||||
- Starting from ROCm 6.0 (tentatively) backward compatibility for wrapper header files will be removed, and the `#pragma` message will be announcing `#error`.
|
||||
* Starting with the ROCm 5.2 release, the deprecation notice for backward-compatibility wrapper header files is a `#pragma` message announcing `#warning`.
|
||||
* Starting from ROCm 6.0 (tentatively), backward compatibility for wrapper header files will be removed, and the `#pragma` message will announce `#error`.
|
||||
|
||||
### Executable Files
|
||||
### Executable files
|
||||
|
||||
Executable files are available in the `/opt/rocm-<ver>/bin` folder. For backward
|
||||
compatibility, the old library location (`/opt/rocm-<ver>/<component>/bin`) has a
|
||||
@@ -103,7 +103,7 @@ $ ls -l /opt/rocm/hip/bin/
|
||||
lrwxrwxrwx 1 root root 24 Jan 1 23:32 hipcc -> ../../bin/hipcc
|
||||
```
|
||||
|
||||
### Library Files
|
||||
### Library files
|
||||
|
||||
Library files are available in the `/opt/rocm-<ver>/lib` folder. For backward
|
||||
compatibility, the old library location (`/opt/rocm-<ver>/<component>/lib`) has a
|
||||
@@ -116,7 +116,7 @@ drwxr-xr-x 4 root root 4096 Jan 1 10:45 cmake
|
||||
lrwxrwxrwx 1 root root 24 Jan 1 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
|
||||
```
|
||||
|
||||
### CMake Config Files
|
||||
### CMake config files
|
||||
|
||||
All CMake configuration files are available in the
|
||||
`/opt/rocm-<ver>/lib/cmake/<component>` folder. For backward compatibility, the
|
||||
@@ -129,7 +129,7 @@ $ ls -l /opt/rocm/hip/lib/cmake/hip/
|
||||
lrwxrwxrwx 1 root root 42 Jan 1 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
|
||||
```
|
||||
|
||||
## Changes Required in Applications Using ROCm
|
||||
## Changes required in applications using ROCm
|
||||
|
||||
Applications using ROCm are advised to use the new file paths, as the old files
|
||||
will be deprecated in a future release. Applications must make sure to include the
|
||||
@@ -150,7 +150,7 @@ correct header file and use correct search paths.
|
||||
3. Any reference to `/opt/rocm/<component>/bin` or `/opt/rocm/<component>/lib`
|
||||
needs to be changed to `/opt/rocm/bin` and `/opt/rocm/lib/`, respectively.
|
||||
|
||||
## Changes in Versioning Specifications
|
||||
## Changes in versioning specifications
|
||||
|
||||
In order to better manage ROCm dependency specification and allow smoother releases of ROCm while avoiding dependency conflicts, the ROCm platform shall adhere to the following scheme when numbering and incrementing ROCm file versions:
|
||||
|
||||
51
docs/conceptual/gpu-arch.md
Normal file
@@ -0,0 +1,51 @@
# GPU architecture documentation

:::::{grid} 1 1 2 2
:gutter: 1

:::{grid-item-card}
**AMD Instinct MI200 series**

Review hardware aspects of the AMD Instinct™ MI200 series of GPU
accelerators and the CDNA™ 2 architecture.

* [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md)
* [AMD Instinct MI200/CDNA2 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf)
* [White paper](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf)
* [Performance counters](./gpu-arch/mi200-performance-counters.md)

:::

:::{grid-item-card}
**AMD Instinct MI100**

Review hardware aspects of the AMD Instinct™ MI100
accelerators and the CDNA™ 1 architecture that is the foundation of these GPUs.

* [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md)
* [AMD Instinct MI100/CDNA1 ISA](https://www.amd.com/system/files/TechDocs/instinct-mi100-cdna1-shader-instruction-set-architecture%C2%A0.pdf)
* [White paper](https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf)

:::

:::{grid-item-card}
**RDNA**

* [AMD RDNA3 ISA](https://www.amd.com/system/files/TechDocs/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf)
* [AMD RDNA2 ISA](https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf)
* [AMD RDNA ISA](https://www.amd.com/system/files/TechDocs/rdna-shader-instruction-set-architecture.pdf)
* [AMD RDNA Architecture White Paper](https://www.amd.com/system/files/documents/rdna-whitepaper.pdf)

:::

:::{grid-item-card}
**Older architectures**

* [AMD Instinct MI50/Vega 7nm ISA](https://www.amd.com/system/files/TechDocs/vega-7nm-shader-instruction-set-architecture.pdf)
* [AMD Instinct MI25/Vega ISA](https://www.amd.com/system/files/TechDocs/vega-shader-instruction-set-architecture.pdf)
* [AMD GCN3 ISA](https://www.amd.com/system/files/TechDocs/gcn3-instruction-set-architecture.pdf)
* [AMD Vega Architecture White Paper](https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf)

:::

:::::
@@ -1,12 +1,6 @@
# AMD Instinct™ MI100 Hardware
# AMD Instinct™ MI100 microarchitecture

In this chapter, we briefly review hardware aspects of the AMD
Instinct™ MI100 accelerators and the CDNA architecture that is the foundation of
these GPUs.

## System Architecture

{numref}`mi100-arch` shows the node-level architecture of a system that
The following image shows the node-level architecture of a system that
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
The two EPYC processors are connected to each other with the AMD Infinity™
fabric which provides a high-bandwidth (up to 18 GT/sec) and coherent links such
@@ -17,12 +11,7 @@ available to connect the processors plus one PCIe Gen 4 x16 link per processor
can attach additional I/O devices such as the host adapters for the network
fabric.

:::{figure-md} mi100-arch

<img src="../../data/reference/gpu_arch/image.004.png" alt="Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.">

Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.
:::



In a typical node configuration, each processor can host up to four AMD
Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
@@ -34,34 +23,29 @@ links. This inter-GPU link can be established in certified server systems if the
GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity
Fabric™ bridge for the AMD Instinct™ accelerators.

## Micro-architecture
## Microarchitecture

The micro-architecture of the AMD Instinct accelerators is based on the AMD CDNA
The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the world's largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.

:::{figure-md} mi100-block

<img src="../../data/reference/gpu_arch/image.005.png" alt="Structure of the AMD Instinct accelerator (MI100 generation).">

Structure of the AMD Instinct accelerator (MI100 generation).
:::



{numref}`mi100-block` shows the AMD Instinct accelerator with its PCIe Gen 4 x16
The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16
link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
processor(s). It also shows the three AMD Infinity Fabric ports that provide
high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
hive as shown in {numref}`mi100-arch`.
hive.

On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPU's memory controller. The MI100 generation of the AMD
attaches via the GPU memory controller. The MI100 generation of the AMD
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the
attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.

The execution units of the GPU are depicted in {numref}`mi100-block` as Compute
The execution units of the GPU are depicted in the above image as Compute
Units (CU). There are a total of 120 compute units that are physically organized
into eight Shader Engines (SE) with fifteen compute units per shader engine.
Each compute unit is further subdivided into four SIMD units that process SIMD
@@ -70,15 +54,9 @@ instructions of 16 data elements per instruction. This enables the CU to process
Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
(`4 [SIMD units] x 16 [elements per instruction] x 120 [CU] x 1.5 [GHz]`).
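Written out, with 4 × 16 = 64 FP64 elements per CU per clock, the product is:

```{math}
64 \times 120\ \text{CUs} \times 1.5\,\text{GHz} = 11520\ \text{GFLOPS} \approx 11.5\ \text{TFLOPS}
```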
:::{figure-md} mi100-gcd

<img src="../../data/reference/gpu_arch/image.006.png" alt="Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture">

Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA
architecture
:::



{numref}`mi100-gcd` shows the block diagram of a single CU of an AMD Instinct™
The preceding image shows the block diagram of a single CU of an AMD Instinct™
MI100 accelerator and summarizes how instructions flow through the execution
engines. The CU fetches the instructions via a 32KB instruction cache and moves
them forward to execution via a dispatcher. The CU can handle up to ten
@@ -1,51 +1,51 @@
# MI200 Performance Counters and Metrics
# MI200 performance counters and metrics
<!-- markdownlint-disable no-duplicate-header -->

This document lists and describes the hardware performance counters and the derived metrics available on the AMD Instinct™ MI200 GPU. All hardware performance monitors and derived performance metrics are accessible via the AMD ROCm™ profiler tool.

## MI200 Performance Counters List
## MI200 performance counters list

:::{note}
```{note}
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
:::
```

### Graphics Register Bus Management (GRBM)
### GRBM

#### GRBM Counters
#### GRBM counters

| Hardware Counter   | Unit   | Definition                                                                 |
|--------------------|--------| --------------------------------------------------------------------------|
| `grbm_count`       | Cycles | Free-running GPU clock                                                     |
| `grbm_gui_active`  | Cycles | GPU active cycles                                                          |
| `grbm_cp_busy`     | Cycles | Any of the CP (CPC/CPF) blocks are busy.                                   |
| `grbm_spi_busy`    | Cycles | Any of the Shader Processor Input (SPI) are busy in the shader engine(s).  |
| `grbm_ta_busy`     | Cycles | Any of the Texture Addressing Unit (TA) are busy in the shader engine(s).  |
| `grbm_tc_busy`     | Cycles | Any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy.                |
| `grbm_cpc_busy`    | Cycles | The Command Processor - Compute (CPC) is busy.                             |
| `grbm_cpf_busy`    | Cycles | The Command Processor - Fetcher (CPF) is busy.                             |
| `grbm_utcl2_busy`  | Cycles | The Unified Translation Cache - Level 2 (UTCL2) block is busy.             |
| `grbm_ea_busy`     | Cycles | The Efficiency Arbiter (EA) block is busy.                                 |
| Hardware Counter   | Unit   | Definition                                             |
|--------------------|--------| ------------------------------------------------------|
| `grbm_count`       | Cycles | Free-running GPU clock                                 |
| `grbm_gui_active`  | Cycles | GPU active cycles                                      |
| `grbm_cp_busy`     | Cycles | Any of the command processor (CPC/CPF) blocks are busy. |
| `grbm_spi_busy`    | Cycles | Any of the shader processor input (SPI) are busy in the shader engine(s). |
| `grbm_ta_busy`     | Cycles | Any of the texture addressing unit are busy in the shader engine(s). |
| `grbm_tc_busy`     | Cycles | Any of the texture cache blocks (TCP/TCI/TCA/TCC) are busy. |
| `grbm_cpc_busy`    | Cycles | The command processor - compute (CPC) is busy.         |
| `grbm_cpf_busy`    | Cycles | The command processor - fetcher (CPF) is busy.         |
| `grbm_utcl2_busy`  | Cycles | The unified translation cache - level 2 (UTCL2) block is busy. |
| `grbm_ea_busy`     | Cycles | The efficiency arbiter (EA) block is busy.             |
### Command Processor (CP)
### Command processor

The command processor counters are further classified into fetcher and compute.

#### Command Processor - Fetcher (CPF)
#### CPF

##### CPF Counters
##### CPF counters

| Hardware Counter                     | Unit   | Definition                                                   |
|--------------------------------------|--------|--------------------------------------------------------------|
| `cpf_cmp_utcl1_stall_on_translation` | Cycles | One of the Compute UTCL1s is stalled waiting on translation. |
| `cpf_cmp_utcl1_stall_on_translation` | Cycles | One of the compute UTCL1s is stalled waiting on translation. |
| `cpf_cpf_stat_idle[∗]`               | Cycles | CPF idle                                                     |
| `cpf_cpf_stat_stall`                 | Cycles | CPF stall                                                    |
| `cpf_cpf_tciu_busy`                  | Cycles | CPF TCIU interface busy                                      |
| `cpf_cpf_tciu_idle`                  | Cycles | CPF TCIU interface idle                                      |
| `cpf_cpf_tciu_stall[∗]`              | Cycles | CPF TCIU interface is stalled waiting on free tags.          |

#### Command Processor - Compute (CPC)
#### CPC

##### CPC Counters
##### CPC counters

| Hardware Counter                  | Unit   | Definition                                           |
| ---------------------------------| -------| --------------------------------------------------- |
@@ -61,9 +61,9 @@ The command processor counters are further classified into fetcher and compute.
| `cpc_cpc_utcl2iu_stall[∗]` | Cycles | CPC UTCL2 interface stalled waiting |
| `cpc_me1_dci0_spi_busy`    | Cycles | CPC ME1 Processor busy              |

### Shader Processor Input (SPI)
### SPI

#### SPI Counters
#### SPI counters

| Hardware Counter              | Unit        | Definition                                                    |
| :----------------------------| :-----------| -----------------------------------------------------------: |
@@ -86,11 +86,11 @@ The command processor counters are further classified into fetcher and compute.
| `spi_vwc_csc_wr` | Cycles | Number of clocks to write CSC waves to VGPRs (need to multiply this value by 4) |
| `spi_swc_csc_wr` | Cycles | Number of clocks to write CSC waves to SGPRs (need to multiply this value by 4) |

### Compute Unit
### Compute unit

The compute unit counters are further classified into instruction mix, MFMA operation counters, level counters, wavefront counters, wavefront cycle counters, local data share counters, and others.

#### Instruction Mix
#### Instruction mix

| Hardware Counter         | Unit   | Definition                                                                |
| :-----------------------| :-----:| -----------------------------------------------------------------------: |
@@ -117,9 +117,9 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sq_insts_valu_mfma_f32` | Instr | Number of F32 MFMA instructions issued |
| `sq_insts_valu_mfma_f64` | Instr | Number of F64 MFMA instructions issued |
| `sq_insts_mfma` | Instr | Number of MFMA instructions issued |
| `sq_insts_vmem_wr` | Instr | Number of VMEM Write instructions issued |
| `sq_insts_vmem_rd` | Instr | Number of VMEM Read instructions issued |
| `sq_insts_vmem` | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions |
| `sq_insts_vmem_wr` | Instr | Number of VMEM write instructions issued |
| `sq_insts_vmem_rd` | Instr | Number of VMEM read instructions issued |
| `sq_insts_vmem` | Instr | Number of VMEM instructions issued, including both FLAT and buffer instructions |
| `sq_insts_salu` | Instr | Number of SALU instructions issued |
| `sq_insts_smem` | Instr | Number of SMEM instructions issued |
| `sq_insts_smem_norm` | Instr | Number of SMEM instructions issued to normalize to match `smem_level`. Used in measuring SMEM latency |
@@ -132,7 +132,7 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sq_insts_sendmsg` | Instr | Number of SENDMSG instructions including s_endpgm issued |
| `sq_insts_vskipped[∗]` | Instr | Number of VSkipped instructions issued |

#### MFMA Operation Counters
#### MFMA operation counters

| Hardware Counter | Unit | Definition |
| :----------------------------| :-----| ----------------------------------------------: |
@@ -142,7 +142,7 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sq_insts_valu_mfma_mops_F32` | FLOP | Number of F32 floating MFMA ops in unit of 512 |
| `sq_insts_valu_mfma_mops_F64` | FLOP | Number of F64 floating MFMA ops in unit of 512 |

#### Level Counters
#### Level counters

| Hardware Counter | Unit | Definition |
| :-------------------| :-----| -------------------------------------: |
@@ -154,7 +154,7 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sq_insts_level_lds` | Instr | Number of inflight LDS instructions |
| `sq_ifetch_level` | Instr | Number of inflight instruction fetches |

#### Wavefront Counters
#### Wavefront counters

| Hardware Counter | Unit | Definition |
| :--------------------| :-----| ----------------------------------------------------------------: |
@@ -167,32 +167,32 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sq_waves_lt_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
| `sq_waves_lt_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |

#### Wavefront Cycle Counters
#### Wavefront cycle counters

| Hardware Counter | Unit | Definition |
| :------------------------| :-------| --------------------------------------------------------------------: |
| `sq_cycles` | Cycles | Free-running SQ clocks |
| `sq_busy_cycles` | Cycles | Number of cycles while SQ reports it to be busy |
| `sq_busy_cu_cycles` | Qcycles | Number of quad-cycles each CU is busy |
| `sq_busy_cu_cycles` | Qcycles | Number of quad cycles each CU is busy |
| `sq_valu_mfma_busy_cycles` | Cycles | Number of cycles the MFMA ALU is busy |
| `sq_wave_cycles` | Qcycles | Number of quad-cycles spent by waves in the CUs |
| `sq_wait_any` | Qcycles | Number of quad-cycles spent waiting for anything |
| `sq_wait_inst_any` | Qcycles | Number of quad-cycles spent waiting for an issued instruction |
| `sq_active_inst_any` | Qcycles | Number of quad-cycles spent by each wave to work on an instruction |
| `sq_active_inst_vmem` | Qcycles | Number of quad-cycles spent by each wave to work on a non-FLAT VMEM instruction |
| `sq_active_inst_lds` | Qcycles | Number of quad-cycles spent by each wave to work on an LDS instruction |
| `sq_active_inst_valu` | Qcycles | Number of quad-cycles spent by each wave to work on a VALU instruction |
| `sq_active_inst_sca` | Qcycles | Number of quad-cycles spent by each wave to work on an SCA instruction |
| `sq_active_inst_exp_gds` | Qcycles | Number of quad-cycles spent by each wave to work on EXP or GDS instruction |
| `sq_active_inst_misc` | Qcycles | Number of quad-cycles spent by each wave to work on an MISC instruction, including branch and sendmsg |
| `sq_active_inst_flat` | Qcycles | Number of quad-cycles spent by each wave to work on a FLAT instruction |
| `sq_inst_cycles_vmem_wr` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions, including both FLAT and Buffer |
| `sq_inst_cycles_vmem_rd` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions, including both FLAT and Buffer |
| `sq_inst_cycles_smem` | Qcycles | Number of quad-cycles spent to execute scalar memory reads |
| `sq_wave_cycles` | Qcycles | Number of quad cycles spent by waves in the CUs |
| `sq_wait_any` | Qcycles | Number of quad cycles spent waiting for anything |
| `sq_wait_inst_any` | Qcycles | Number of quad cycles spent waiting for an issued instruction |
| `sq_active_inst_any` | Qcycles | Number of quad cycles spent by each wave to work on an instruction |
| `sq_active_inst_vmem` | Qcycles | Number of quad cycles spent by each wave to work on a non-FLAT VMEM instruction |
| `sq_active_inst_lds` | Qcycles | Number of quad cycles spent by each wave to work on an LDS instruction |
| `sq_active_inst_valu` | Qcycles | Number of quad cycles spent by each wave to work on a VALU instruction |
| `sq_active_inst_sca` | Qcycles | Number of quad cycles spent by each wave to work on an SCA instruction |
| `sq_active_inst_exp_gds` | Qcycles | Number of quad cycles spent by each wave to work on EXP or GDS instruction |
| `sq_active_inst_misc` | Qcycles | Number of quad cycles spent by each wave to work on an MISC instruction, including branch and sendmsg |
| `sq_active_inst_flat` | Qcycles | Number of quad cycles spent by each wave to work on a FLAT instruction |
| `sq_inst_cycles_vmem_wr` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM write instructions, including both FLAT and buffer |
| `sq_inst_cycles_vmem_rd` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM read instructions, including both FLAT and buffer |
| `sq_inst_cycles_smem` | Qcycles | Number of quad cycles spent to execute scalar memory reads |
| `sq_inst_cycles_salu` | Cycles | Number of cycles spent to execute non-memory read scalar operations |
| `sq_thread_cycles_valu` | Cycles | Number of thread-cycles spent to execute VALU operations |
| `sq_thread_cycles_valu` | Cycles | Number of thread cycles spent to execute VALU operations |
#### Local Data Share
#### Local data share

| Hardware Counter | Unit | Definition |
| :--------------------------| :------| --------------------------------------------------------: |
@@ -204,16 +204,16 @@ The compute unit counters are further classified into instruction mix, MFMA oper

#### Miscellaneous

##### Local Data Share
##### Local data share

| Hardware Counter | Unit | Definition |
| :----------------| :-------| --------------------------------------------------------: |
| `sq_ifetch` | Count | Number of fetch requests from L1I cache, in 32-byte width |
| `sq_items` | Threads | Number of valid threads |

### L1I and sL1D Caches
### L1I and sL1D caches

#### L1I and sL1D Caches
#### L1I and sL1D caches

| Hardware Counter | Unit | Definition |
| :----------------------------| :------| ----------------------------------------------------------------: |
@@ -226,61 +226,61 @@ The compute unit counters are further classified into instruction mix, MFMA oper
| `sqc_dcache_hits` | Count | Number of sL1D cache lookup-hits |
| `sqc_dcache_misses` | Count | Number of sL1D non-duplicate lookup-misses |
| `sqc_dcache_misses_duplicate` | Count | Number of sL1D duplicate lookup-misses |
| `sqc_dcache_req_read_1` | Req | Number of Read requests in a single 32-bit Data Word, DWORD (DW) |
| `sqc_dcache_req_read_2` | Req | Number of Read requests in 2 DW |
| `sqc_dcache_req_read_4` | Req | Number of Read requests in 4 DW |
| `sqc_dcache_req_read_8` | Req | Number of Read requests in 8 DW |
| `sqc_dcache_req_read_16` | Req | Number of Read requests in 16 DW |
| `sqc_dcache_atomic[∗]` | Req | Number of Atomic requests |
| `sqc_dcache_req_read_1` | Req | Number of read requests in a single 32-bit data word, DWORD (DW) |
| `sqc_dcache_req_read_2` | Req | Number of read requests in 2 DW |
| `sqc_dcache_req_read_4` | Req | Number of read requests in 4 DW |
| `sqc_dcache_req_read_8` | Req | Number of read requests in 8 DW |
| `sqc_dcache_req_read_16` | Req | Number of read requests in 16 DW |
| `sqc_dcache_atomic[∗]` | Req | Number of atomic requests |
| `sqc_tc_req` | Req | Number of L2 cache requests that were issued by instruction and constant caches |
| `sqc_tc_inst_req` | Req | Number of instruction cache line requests to L2 cache |
| `sqc_tc_data_read_req` | Req | Number of data Read requests to the L2 cache |
| `sqc_tc_data_write_req[∗]` | Req | Number of data Write requests to the L2 cache |
| `sqc_tc_data_atomic_req[∗]` | Req | Number of data Atomic requests to the L2 cache |
| `sqc_tc_stall[∗]` | Cycles | Number of cycles while the valid requests to L2 Cache are stalled |
| `sqc_tc_data_read_req` | Req | Number of data read requests to the L2 cache |
| `sqc_tc_data_write_req[∗]` | Req | Number of data write requests to the L2 cache |
| `sqc_tc_data_atomic_req[∗]` | Req | Number of data atomic requests to the L2 cache |
| `sqc_tc_stall[∗]` | Cycles | Number of cycles while the valid requests to L2 cache are stalled |
### Vector L1 Cache Subsystem
### Vector L1 cache subsystem

The vector L1 cache subsystem counters are further classified into texture addressing unit, texture data unit, vector L1D cache, and texture cache arbiter.

#### Texture Addressing Unit
#### Texture addressing unit

##### Texture Addressing Unit Counters
##### Texture addressing unit counters

| Hardware Counter | Unit | Definition |
| :--------------------------------| :------| ------------------------------------------------: |
| `ta_ta_busy` | Cycles | TA busy cycles |
| `ta_ta_busy` | Cycles | Texture addressing unit busy cycles |
| `ta_total_wavefronts` | Instr | Number of wavefront instructions |
| `ta_buffer_wavefronts` | Instr | Number of Buffer wavefront instructions |
| `ta_buffer_read_wavefronts` | Instr | Number of Buffer Read wavefront instructions |
| `ta_buffer_write_wavefronts` | Instr | Number of Buffer Write wavefront instructions |
| `ta_buffer_atomic_wavefronts[∗]` | Instr | Number of Buffer Atomic wavefront instructions |
| `ta_buffer_total_cycles` | Cycles | Number of Buffer cycles, including Read and Write |
| `ta_buffer_coalesced_read_cycles` | Cycles | Number of coalesced Buffer read cycles |
| `ta_buffer_coalesced_write_cycles` | Cycles | Number of coalesced Buffer write cycles |
| `ta_addr_stalled_by_tc` | Cycles | Number of cycles TA address is stalled by TCP |
| `ta_data_stalled_by_tc` | Cycles | Number of cycles TA data is stalled by TCP |
| `ta_addr_stalled_by_td_cycles[∗]` | Cycles | Number of cycles TA address is stalled by TD |
| `ta_flat_wavefronts` | Instr | Number of Flat wavefront instructions |
| `ta_flat_read_wavefronts` | Instr | Number of Flat Read wavefront instructions |
| `ta_flat_write_wavefronts` | Instr | Number of Flat Write wavefront instructions |
| `ta_flat_atomic_wavefronts` | Instr | Number of Flat Atomic wavefront instructions |
| `ta_buffer_wavefronts` | Instr | Number of buffer wavefront instructions |
| `ta_buffer_read_wavefronts` | Instr | Number of buffer read wavefront instructions |
| `ta_buffer_write_wavefronts` | Instr | Number of buffer write wavefront instructions |
| `ta_buffer_atomic_wavefronts[∗]` | Instr | Number of buffer atomic wavefront instructions |
| `ta_buffer_total_cycles` | Cycles | Number of buffer cycles, including read and write |
| `ta_buffer_coalesced_read_cycles` | Cycles | Number of coalesced buffer read cycles |
| `ta_buffer_coalesced_write_cycles` | Cycles | Number of coalesced buffer write cycles |
| `ta_addr_stalled_by_tc` | Cycles | Number of cycles texture addressing unit address is stalled by TCP |
| `ta_data_stalled_by_tc` | Cycles | Number of cycles texture addressing unit data is stalled by TCP |
| `ta_addr_stalled_by_td_cycles[∗]` | Cycles | Number of cycles texture addressing unit address is stalled by TD |
| `ta_flat_wavefronts` | Instr | Number of flat wavefront instructions |
| `ta_flat_read_wavefronts` | Instr | Number of flat read wavefront instructions |
| `ta_flat_write_wavefronts` | Instr | Number of flat write wavefront instructions |
| `ta_flat_atomic_wavefronts` | Instr | Number of flat atomic wavefront instructions |

#### Texture Data Unit
#### Texture data unit

##### Texture Data Unit Counters
##### Texture data unit counters

| Hardware Counter | Unit | Definition |
| :------------------------| :-----| ---------------------------------------------------: |
| `td_td_busy` | Cycle | TD busy cycles |
| `td_tc_stall` | Cycle | Number of cycles TD is stalled by TCP |
| `td_spi_stall[∗]` | Cycle | Number of cycles TD is stalled by SPI |
| `td_load_wavefront` | Instr | Number of wavefront instructions (Read/Write/Atomic) |
| `td_store_wavefront` | Instr | Number of Write wavefront instructions |
| `td_atomic_wavefront` | Instr | Number of Atomic wavefront instructions |
| `td_load_wavefront` | Instr | Number of wavefront instructions (read/write/atomic) |
| `td_store_wavefront` | Instr | Number of write wavefront instructions |
| `td_atomic_wavefront` | Instr | Number of atomic wavefront instructions |
| `td_coalescable_wavefront` | Instr | Number of coalescable instructions |

#### Vector L1D Cache
#### Vector L1D cache

| Hardware Counter | Unit | Definition |
| :-----------------------------------| :------| ----------------------------------------------------------: |
@@ -288,101 +288,101 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `tcp_gate_en2` | Cycles | Number of cycles vL1D core clocks are turned on |
| `tcp_td_tcp_stall_cycles` | Cycles | Number of cycles TD stalls vL1D |
| `tcp_tcr_tcp_stall_cycles` | Cycles | Number of cycles TCR stalls vL1D |
| `tcp_read_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a Read |
| `tcp_write_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a Write |
| `tcp_atomic_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on an Atomic |
| `tcp_pending_stall_cycles` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache |
| `tcp_read_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a read |
| `tcp_write_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on a write |
| `tcp_atomic_tagconflict_stall_cycles` | Cycles | Number of cycles tagram conflict stalls on an atomic |
| `tcp_pending_stall_cycles` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 cache |
| `tcp_ta_tcp_state_read` | Req | Number of wavefront instruction requests to vL1D |
| `tcp_volatile[∗]` | Req | Number of L1 volatile pixels/buffers from TA |
| `tcp_volatile[∗]` | Req | Number of L1 volatile pixels/buffers from texture addressing unit |
| `tcp_total_accesses` | Req | Number of vL1D accesses |
| `tcp_total_read` | Req | Number of vL1D Read accesses |
| `tcp_total_write` | Req | Number of vL1D Write accesses |
| `tcp_total_atomic_with_ret` | Req | Number of vL1D Atomic with return |
| `tcp_total_atomic_without_ret` | Req | Number of vL1D Atomic without return |
| `tcp_total_writeback_invalidates` | Count | Number of vL1D Writebacks and Invalidates |
| `tcp_total_read` | Req | Number of vL1D read accesses |
| `tcp_total_write` | Req | Number of vL1D write accesses |
| `tcp_total_atomic_with_ret` | Req | Number of vL1D atomic with return |
| `tcp_total_atomic_without_ret` | Req | Number of vL1D atomic without return |
| `tcp_total_writeback_invalidates` | Count | Number of vL1D writebacks and invalidates |
| `tcp_utcl1_request` | Req | Number of address translation requests to UTCL1 |
| `tcp_utcl1_translation_hit` | Req | Number of UTCL1 translation hits |
| `tcp_utcl1_translation_miss` | Req | Number of UTCL1 translation misses |
| `tcp_utcl1_persmission_miss` | Req | Number of UTCL1 permission misses |
| `tcp_total_cache_accesses` | Req | Number of vL1D cache accesses |
| `tcp_tcp_latency` | Cycles | Accumulated wave access latency to vL1D over all wavefronts |
| `tcp_tcc_read_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for Reads and Atomics with return |
| `tcp_tcc_write_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for Writes and Atomics without return |
| `tcp_tcc_read_req` | Req | Number of Read requests to L2 Cache |
| `tcp_tcc_write_req` | Req | Number of Write requests to L2 Cache |
| `tcp_tcc_atomic_with_ret_req` | Req | Number of Atomic requests to L2 Cache with return |
| `tcp_tcc_atomic_without_ret_req` | Req | Number of Atomic requests to L2 Cache without return |
| `tcp_tcc_nc_read_req` | Req | Number of NC Read requests to L2 Cache |
| `tcp_tcc_uc_read_req` | Req | Number of UC Read requests to L2 Cache |
| `tcp_tcc_cc_read_req` | Req | Number of CC Read requests to L2 Cache |
| `tcp_tcc_rw_read_req` | Req | Number of RW Read requests to L2 Cache |
| `tcp_tcc_nc_write_req` | Req | Number of NC Write requests to L2 Cache |
| `tcp_tcc_uc_write_req` | Req | Number of UC Write requests to L2 Cache |
| `tcp_tcc_cc_write_req` | Req | Number of CC Write requests to L2 Cache |
| `tcp_tcc_rw_write_req` | Req | Number of RW Write requests to L2 Cache |
| `tcp_tcc_nc_atomic_req` | Req | Number of NC Atomic requests to L2 Cache |
| `tcp_tcc_uc_atomic_req` | Req | Number of UC Atomic requests to L2 Cache |
| `tcp_tcc_cc_atomic_req` | Req | Number of CC Atomic requests to L2 Cache |
| `tcp_tcc_rw_atomic_req` | Req | Number of RW Atomic requests to L2 Cache |
| `tcp_tcc_read_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for reads and atomics with return |
| `tcp_tcc_write_req_latency` | Cycles | Accumulated vL1D-L2 request latency over all wavefronts for writes and atomics without return |
| `tcp_tcc_read_req` | Req | Number of read requests to L2 cache |
| `tcp_tcc_write_req` | Req | Number of write requests to L2 cache |
| `tcp_tcc_atomic_with_ret_req` | Req | Number of atomic requests to L2 cache with return |
| `tcp_tcc_atomic_without_ret_req` | Req | Number of atomic requests to L2 cache without return |
| `tcp_tcc_nc_read_req` | Req | Number of NC read requests to L2 cache |
| `tcp_tcc_uc_read_req` | Req | Number of UC read requests to L2 cache |
| `tcp_tcc_cc_read_req` | Req | Number of CC read requests to L2 cache |
| `tcp_tcc_rw_read_req` | Req | Number of RW read requests to L2 cache |
| `tcp_tcc_nc_write_req` | Req | Number of NC write requests to L2 cache |
| `tcp_tcc_uc_write_req` | Req | Number of UC write requests to L2 cache |
| `tcp_tcc_cc_write_req` | Req | Number of CC write requests to L2 cache |
| `tcp_tcc_rw_write_req` | Req | Number of RW write requests to L2 cache |
| `tcp_tcc_nc_atomic_req` | Req | Number of NC atomic requests to L2 cache |
| `tcp_tcc_uc_atomic_req` | Req | Number of UC atomic requests to L2 cache |
| `tcp_tcc_cc_atomic_req` | Req | Number of CC atomic requests to L2 cache |
| `tcp_tcc_rw_atomic_req` | Req | Number of RW atomic requests to L2 cache |

#### Texture Cache Arbiter (TCA)
#### TCA

| Hardware Counter | Unit | Definition |
| :----------------| :------| ------------------------------------------: |
| `tca_cycle` | Cycles | TCA cycles |
| `tca_busy` | Cycles | Number of cycles TCA has a pending request |

### L2 Cache Access
### L2 cache access

#### L2 Cache Access Counters
#### L2 cache access counters

| Hardware Counter | Unit | Definition |
| :--------------------------------| :------| -------------------------------------------------------------: |
| `tcc_cycle` |Cycle | L2 Cache free-running clocks |
| `tcc_busy` |Cycle | L2 Cache busy cycles |
| `tcc_req` |Req | Number of L2 Cache requests |
| `tcc_streaming_req[∗]` |Req | Number of L2 Cache Streaming requests |
| `tcc_cycle` |Cycle | L2 cache free-running clocks |
| `tcc_busy` |Cycle | L2 cache busy cycles |
| `tcc_req` |Req | Number of L2 cache requests |
| `tcc_streaming_req[∗]` |Req | Number of L2 cache streaming requests |
| `tcc_NC_req` |Req | Number of NC requests |
| `tcc_UC_req` |Req | Number of UC requests |
| `tcc_CC_req` |Req | Number of CC requests |
| `tcc_RW_req` |Req | Number of RW requests |
| `tcc_probe` |Req | Number of L2 Cache probe requests |
| `tcc_probe` |Req | Number of L2 cache probe requests |
| `tcc_probe_all[∗]` |Req | Number of external probe requests with EA_TCC_preq_all == 1 |
| `tcc_read_req` |Req | Number of L2 Cache Read requests |
| `tcc_write_req` |Req | Number of L2 Cache Write requests |
| `tcc_atomic_req` |Req | Number of L2 Cache Atomic requests |
| `tcc_hit` |Req | Number of L2 Cache lookup-hits |
| `tcc_read_req` |Req | Number of L2 cache read requests |
| `tcc_write_req` |Req | Number of L2 cache write requests |
| `tcc_atomic_req` |Req | Number of L2 cache atomic requests |
| `tcc_hit` |Req | Number of L2 cache lookup-hits |
| `tcc_miss` |Req | Number of L2 cache lookup-misses |
| `tcc_writeback` |Req | Number of lines written back to main memory, including writebacks of dirty lines and uncached Write/Atomic requests |
| `tcc_ea_wrreq` |Req | Total number of 32-byte and 64-byte Write requests to EA |
| `tcc_ea_wrreq_64B` |Req | Total number of 64-byte Write requests to EA |
| `tcc_ea_wr_uncached_32B` |Req | Number of 32-byte Write/Atomic going over the TC_EA_wrreq interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
| `tcc_ea_wrreq_stall` | Cycles | Number of cycles a Write request was stalled |
| `tcc_ea_wrreq_io_credit_stall[∗]` | Cycles | Number of cycles an EA Write request runs out of IO credits |
| `tcc_ea_wrreq_gmi_credit_stall[∗]` | Cycles | Number of cycles an EA Write request runs out of GMI credits |
| `tcc_ea_wrreq_dram_credit_stall` | Cycles | Number of cycles an EA Write request runs out of DRAM credits |
| `tcc_too_many_ea_wrreqs_stall[∗]` | Cycles | Number of cycles the L2 Cache reaches maximum number of pending EA Write requests |
| `tcc_ea_wrreq_level` | Req | Accumulated number of L2 Cache-EA Write requests in flight |
| `tcc_ea_atomic` | Req | Number of 32-byte and 64-byte Atomic requests to EA |
| `tcc_ea_atomic_level` | Req | Accumulated number of L2 Cache-EA Atomic requests in flight |
| `tcc_ea_rdreq` | Req | Total number of 32-byte and 64-byte Read requests to EA |
| `tcc_ea_rdreq_32B` | Req | Total number of 32-byte Read requests to EA |
| `tcc_ea_rd_uncached_32B` | Req | Number of 32-byte L2 Cache-EA Read due to uncached traffic. A 64-byte request is counted as 2. |
| `tcc_ea_rdreq_io_credit_stall[∗]` | Cycles | Number of cycles Read request interface runs out of IO credits |
| `tcc_ea_rdreq_gmi_credit_stall[∗]` | Cycles | Number of cycles Read request interface runs out of GMI credits |
| `tcc_ea_rdreq_dram_credit_stall` | Cycles | Number of cycles Read request interface runs out of DRAM credits |
| `tcc_ea_rdreq_level` | Req | Accumulated number of L2 Cache-EA Read requests in flight |
| `tcc_ea_rdreq_dram` | Req | Number of 32-byte and 64-byte Read requests to HBM |
| `tcc_ea_wrreq_dram` | Req | Number of 32-byte and 64-byte Write requests to HBM |
| `tcc_writeback` |Req | Number of lines written back to main memory, including writebacks of dirty lines and uncached write/atomic requests |
| `tcc_ea_wrreq` |Req | Total number of 32-byte and 64-byte write requests to EA |
| `tcc_ea_wrreq_64B` |Req | Total number of 64-byte write requests to EA |
| `tcc_ea_wr_uncached_32B` |Req | Number of 32-byte write/atomic going over the TC_EA_wrreq interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. |
| `tcc_ea_wrreq_stall` | Cycles | Number of cycles a write request was stalled |
| `tcc_ea_wrreq_io_credit_stall[∗]` | Cycles | Number of cycles an EA write request runs out of IO credits |
| `tcc_ea_wrreq_gmi_credit_stall[∗]` | Cycles | Number of cycles an EA write request runs out of GMI credits |
| `tcc_ea_wrreq_dram_credit_stall` | Cycles | Number of cycles an EA write request runs out of DRAM credits |
| `tcc_too_many_ea_wrreqs_stall[∗]` | Cycles | Number of cycles the L2 cache reaches maximum number of pending EA write requests |
| `tcc_ea_wrreq_level` | Req | Accumulated number of L2 cache-EA write requests in flight |
| `tcc_ea_atomic` | Req | Number of 32-byte and 64-byte atomic requests to EA |
| `tcc_ea_atomic_level` | Req | Accumulated number of L2 cache-EA atomic requests in flight |
| `tcc_ea_rdreq` | Req | Total number of 32-byte and 64-byte read requests to EA |
| `tcc_ea_rdreq_32B` | Req | Total number of 32-byte read requests to EA |
| `tcc_ea_rd_uncached_32B` | Req | Number of 32-byte L2 cache-EA read due to uncached traffic. A 64-byte request is counted as 2. |
| `tcc_ea_rdreq_io_credit_stall[∗]` | Cycles | Number of cycles read request interface runs out of IO credits |
| `tcc_ea_rdreq_gmi_credit_stall[∗]` | Cycles | Number of cycles read request interface runs out of GMI credits |
| `tcc_ea_rdreq_dram_credit_stall` | Cycles | Number of cycles read request interface runs out of DRAM credits |
| `tcc_ea_rdreq_level` | Req | Accumulated number of L2 cache-EA read requests in flight |
| `tcc_ea_rdreq_dram` | Req | Number of 32-byte and 64-byte read requests to HBM |
| `tcc_ea_wrreq_dram` | Req | Number of 32-byte and 64-byte write requests to HBM |
| `tcc_tag_stall` | Cycles | Number of cycles the normal request pipeline in the tag was stalled for any reason |
| `tcc_normal_writeback` | Req | Number of L2 cache normal writebacks |
| `tcc_all_tc_op_wb_writeback[∗]` | Req | Number of instruction-triggered writeback requests |
| `tcc_normal_evict` | Req | Number of L2 cache normal evictions |
| `tcc_all_tc_op_inv_evict[∗]` | Req | Number of instruction-triggered eviction requests |
## MI200 Derived Metrics List
## MI200 derived metrics list

### Derived Metrics on MI200 GPUs
### Derived metrics on MI200 GPUs

| Derived Metric | Description |
| :----------------| -------------------------------------------------------------------------------------: |
@@ -400,58 +400,56 @@ The vector L1 cache subsystem counters are further classified into texture addre
| `WriteUnitStalled` | The percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad) |
| `LDSBankConflict` | The percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad) |

## Abbreviations

### MI200 Abbreviations
## MI200 acronyms

| Abbreviation | Meaning |
| :------------| --------------------------------------------------------------------------------: |
| `ALU` | Arithmetic Logic Unit |
| `Arb` | Arbiter |
| `BF16` | Brain Floating Point – 16 bits |
| `CC` | Coherently Cached |
| `CP` | Command Processor |
| `CPC` | Command Processor – Compute |
| `CPF` | Command Processor – Fetcher |
| `CS` | Compute Shader |
| `CSC` | Compute Shader Controller |
| `CSn` | Compute Shader, the n-th pipe |
| `CU` | Compute Unit |
| `DW` | 32-bit Data Word, DWORD |
| `EA` | Efficiency Arbiter |
| `F16` | Half Precision Floating Point |
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>• Global Memory<br>• Scratch (“private”)<br>• LDS (“shared”)<br>• Invalid – MEM_VIOL TrapStatus |
| `FMA` | Fused Multiply Add |
| `GDS` | Global Data Share |
| `GRBM` | Graphics Register Bus Manager |
| `HBM` | High Bandwidth Memory |
| `Instr` | Instructions |
| `IOP` | Integer Operation |
| `L2` | Level-2 Cache |
| `LDS` | Local Data Share |
| `ME1` | Micro Engine, running packet processing firmware on CPC |
| `MFMA` | Matrix Fused Multiply Add |
| `NC` | Noncoherently Cached |
| `RW` | Coherently Cached with Write |
| `SALU` | Scalar ALU |
| `SGPR` | Scalar GPR |
| `SIMD` | Single Instruction Multiple Data |
| `sL1D` | Scalar Level-1 Data Cache |
| `SMEM` | Scalar Memory |
| `SPI` | Shader Processor Input |
| `SQ` | Sequencer |
| `TA` | Texture Addressing Unit |
| `TC` | Texture Cache |
| `TCA` | Texture Cache Arbiter |
| `TCC` | Texture Cache per Channel, known as L2 Cache |
| `TCIU` | Texture Cache Interface Unit, Command Processor (CP)’s interface to memory system |
| `TCP` | Texture Cache per Pipe, known as vector L1 Cache |
| `TCR` | Texture Cache Router |
| `TD` | Texture Data Unit |
| `UC` | Uncached |
| `UTCL1` | Unified Translation Cache – Level 1 |
| `UTCL2` | Unified Translation Cache – Level 2 |
| `VALU` | Vector ALU |
| `VGPR` | Vector GPR |
| `vL1D` | Vector Level-1 Data Cache |
| `VMEM` | Vector Memory |
| `ALU` | Arithmetic logic unit |
| `Arb` | Arbiter |
| `BF16` | Brain floating point – 16 bits |
| `CC` | Coherently cached |
| `CP` | Command processor |
| `CPC` | Command processor – compute |
| `CPF` | Command processor – fetcher |
| `CS` | Compute shader |
| `CSC` | Compute shader controller |
| `CSn` | Compute shader, the n-th pipe |
| `CU` | Compute unit |
| `DW` | 32-bit data word, DWORD |
| `EA` | Efficiency arbiter |
| `F16` | Half-precision floating point |
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>• Global Memory<br>• Scratch (“private”)<br>• LDS (“shared”)<br>• Invalid – MEM_VIOL TrapStatus |
| `FMA` | Fused multiply-add |
| `GDS` | Global data share |
| `GRBM` | Graphics register bus manager |
| `HBM` | High bandwidth memory |
| `Instr` | Instructions |
| `IOP` | Integer operation |
| `L2` | Level-2 cache |
| `LDS` | Local data share |
| `ME1` | Micro-engine, running packet processing firmware on CPC |
| `MFMA` | Matrix fused multiply-add |
| `NC` | Noncoherently cached |
| `RW` | Coherently cached with write |
| `SALU` | Scalar ALU |
| `SGPR` | Scalar GPR |
| `SIMD` | Single instruction multiple data |
| `sL1D` | Scalar Level-1 data cache |
| `SMEM` | Scalar memory |
| `SPI` | Shader processor input |
| `SQ` | Sequencer |
| `TA` | Texture addressing unit |
| `TC` | Texture cache |
| `TCA` | Texture cache arbiter |
| `TCC` | Texture cache per channel, known as L2 cache |
| `TCIU` | Texture cache interface unit, command processor’s interface to memory system |
| `TCP` | Texture cache per pipe, known as vector L1 cache |
| `TCR` | Texture cache router |
| `TD` | Texture data unit |
| `UC` | Uncached |
| `UTCL1` | Unified translation cache – level 1 |
| `UTCL2` | Unified translation cache – level 2 |
| `VALU` | Vector ALU |
| `VGPR` | Vector GPR |
| `vL1D` | Vector level 1 data cache |
| `VMEM` | Vector memory |
@@ -1,19 +1,13 @@
# AMD Instinct Hardware
# AMD Instinct™ MI250 microarchitecture

This chapter briefly reviews hardware aspects of the AMD Instinct MI250
accelerators and the CDNA™ 2 architecture that is the foundation of these GPUs.

## AMD CDNA 2 Micro-architecture

The micro-architecture of the AMD Instinct MI250 accelerators is based on the
The microarchitecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture that targets compute applications such as HPC,
artificial intelligence (AI), and Machine Learning (ML) and that run on
artificial intelligence (AI), and machine learning (ML) and that run on
everything from individual servers to the world’s largest exascale
supercomputers. The overall system architecture is designed for extreme
scalability and compute performance.

{numref}`mi250-gcd` shows the components of a single Graphics Compute Die (GCD
) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™
The following image shows the components of a single Graphics Compute Die (GCD) of the CDNA 2 architecture. On the top and the bottom are AMD Infinity Fabric™
interfaces and their physical links that are used to connect the GPU die to the
other system-level components of the node (see also Section 2.2). Both
interfaces can drive four AMD Infinity Fabric links. One of the AMD Infinity
@@ -28,7 +22,7 @@ To the left and the right are memory controllers that attach the High Bandwidth
Memory (HBM) modules to the GCD. AMD Instinct MI250 GPUs use HBM2e, which offers
a peak memory bandwidth of 1.6 TB/sec per GCD.

The execution units of the GPU are depicted in {numref}`mi250-gcd` as Compute
The execution units of the GPU are depicted in the following image as Compute
Units (CU). The MI250 GCD has 104 active CUs. Each compute unit is further
subdivided into four SIMD units that process SIMD instructions of 16 data
elements per instruction (for the FP64 data type). This enables the CU to
@@ -39,16 +33,11 @@ execution units (also called matrix cores), which are geared toward executing
matrix operations like matrix-matrix multiplications. For FP64, the peak
performance of these units amounts to 90.5 TFLOPS.
|
||||
|
||||
<img src="../../data/reference/gpu_arch/image.001.png" alt="Structure of a single GCD in the AMD Instinct MI250 accelerator.">
|
||||
|
||||
Figure 1: Structure of a single GCD in the AMD Instinct MI250 accelerator.
|
||||
:::
|
||||

|
||||
|
||||
```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
|
||||
:header-rows: 1
|
||||
:name: mi250-perf
|
||||
:name: mi250-perf-table
|
||||
|
||||
*
|
||||
- Computation and Data Type
|
||||
@@ -88,7 +77,7 @@ Figure 1: Structure of a single GCD in the AMD Instinct MI250 accelerator.
|
||||
- 362.1
|
||||
```
|
||||
|
||||
{numref}`mi250-perf` summarizes the aggregated peak performance of the AMD
|
||||
The above table summarizes the aggregated peak performance of the AMD
|
||||
Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute
|
||||
Platform) and its two GCDs for different data types and execution units. The
|
||||
middle column lists the peak performance (number of data elements processed in a
|
||||
@@ -97,23 +86,18 @@ is being retired in each clock cycle. The third column lists the theoretical
|
||||
peak performance of the OAM module. The theoretical aggregated peak memory
|
||||
bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).
|
||||
|
||||
:::{figure-md} mi250-arch
|
||||

|
||||
|
||||
<img src="../../data/reference/gpu_arch/image.002.png" alt="Dual-GCD architecture of the AMD Instinct MI250 accelerators.">
|
||||
|
||||
Dual-GCD architecture of the AMD Instinct MI250 accelerators.
|
||||
:::
|
||||
|
||||
{numref}`mi250-arch` shows the block diagram of an OAM package that consists
|
||||
The following image shows the block diagram of an OAM package that consists
|
||||
of two GCDs, each of which constitutes one GPU device in the system. The two
|
||||
GCDs in the package are connected via four AMD Infinity Fabric links running at
|
||||
a theoretical peak rate of 25 GT/sec, giving 200 GB/sec peak transfer bandwidth
|
||||
between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of
|
||||
400 GB/sec for the same.
|
||||
|
||||
## Node-level Architecture
|
||||
## Node-level architecture
|
||||
|
||||
{numref}`mi250-block` shows the node-level architecture of a system that is
|
||||
The following image shows the node-level architecture of a system that is
|
||||
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host
|
||||
system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe
|
||||
x16 link to the host part of the system. Depending on the server platform, the
|
||||
@@ -121,15 +105,9 @@ GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch
|
||||
. Note that some platforms may offer an x8 interface to the GCDs, which reduces
|
||||
the available host-to-GPU bandwidth.
|
||||
|
||||
:::{figure-md} mi250-block
|
||||

|
||||
|
||||
<img src="../../data/reference/gpu_arch/image.003.png" alt="Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor.">
|
||||
|
||||
Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation
|
||||
AMD EPYC processor.
|
||||
:::
|
||||
|
||||
{numref}`mi250-block` shows the node-level architecture of a system with AMD
|
||||
The preceding image shows the node-level architecture of a system with AMD
|
||||
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
|
||||
accelerators. The MI250 OAMs attach to the host processors system via PCIe Gen 4
|
||||
x16 links (yellow lines). Depending on the system design, a PCIe switch may
|
||||
@@ -146,4 +124,4 @@ two GPU dies in the MI250 OAM and operates at 25 GT/sec, which corresponds to a
|
||||
theoretical peak transfer rate of 50 GB/sec per link (or 100 GB/sec
|
||||
bidirectional peak transfer bandwidth). The GCD pairs 2 and 6 as well as GCDs 0
|
||||
and 4 connect via two XGMI links, which is indicated by the thicker red line in
|
||||
{numref}`mi250-block`.
|
||||
the preceding image.
|
||||
@@ -1,4 +1,4 @@
# GPU Isolation Techniques
# GPU isolation techniques

Restricting the access of applications to a subset of GPUs, also known as isolating
GPUs, allows users to hide GPU resources from programs. The programs by default
@@ -8,7 +8,7 @@ There are multiple ways to achieve isolation of GPUs in the ROCm software stack,
differing in which applications they apply to and the security they provide.
This page serves as an overview of the techniques.

## Environment Variables
## Environment variables

The runtimes in the ROCm software stack read these environment variables to
select the exposed or default device to present to applications using them.
@@ -49,8 +49,7 @@ export GPU_DEVICE_ORDINAL="0,2"

Device indices exposed to HIP applications.

Runtime
: HIP Runtime. Applies only to applications using HIP on the AMD platform.
Runtime: HIP runtime. Applies only to applications using HIP on the AMD platform.

```{code-block} shell
:caption: Example to expose the 1st and 3rd devices in the system.
@@ -92,7 +91,7 @@ to all programs that use the `amdgpu` kernel module interfaces.
Even programs that don't use the ROCm runtime, like graphics applications
using OpenGL or Vulkan, can only access the GPUs exposed to the container.
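Whichever isolation technique is in effect, an application going through the HIP runtime simply observes a smaller device set. A minimal sketch of enumerating whatever is exposed (standard HIP API; the printed text is illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    // Only the devices exposed by the isolation technique in effect are counted.
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("No HIP devices visible.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t props;
        hipGetDeviceProperties(&props, i);  // index i is relative to the exposed set
        std::printf("Device %d: %s\n", i, props.name);
    }
    return 0;
}
```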
## GPU Passthrough to Virtual Machines
## GPU passthrough to virtual machines

Virtual machines achieve the highest level of isolation, because even the kernel
of the virtual machine is isolated from the host. Devices physically installed

234
docs/conceptual/gpu-memory.md
Normal file
@@ -0,0 +1,234 @@
# GPU memory

For the HIP reference documentation, see:

* {doc}`hip:.doxygen/docBin/html/group___memory`
* {doc}`hip:.doxygen/docBin/html/group___memory_m`

Host memory exists on the host (e.g. CPU) of the machine in random access memory (RAM).

Device memory exists on the device (e.g. GPU) of the machine in video random access memory (VRAM).
Recent architectures use graphics double data rate (GDDR) synchronous dynamic random-access memory (SDRAM), such as GDDR6, or high-bandwidth memory (HBM) such as HBM2e.

## Memory allocation

Memory can be allocated as pageable, pinned, or managed memory.
The following API calls result in these allocations:

| API                | Data location | Allocation |
|--------------------|---------------|------------|
| System allocated   | Host          | Pageable   |
| `hipMallocManaged` | Host          | Managed    |
| `hipHostMalloc`    | Host          | Pinned     |
| `hipMalloc`        | Device        | Pinned     |

:::{tip}
`hipMalloc` and `hipFree` are blocking calls; however, HIP recently added non-blocking versions, `hipMallocAsync` and `hipFreeAsync`, which take a stream as an additional argument.
:::
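A minimal sketch of the four allocation paths from the table above (error handling elided; the buffer size is illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, illustrative

    // System allocated: pageable host memory.
    double* pageable = static_cast<double*>(std::malloc(bytes));

    // Pinned (page-locked) host memory.
    double* pinned = nullptr;
    hipHostMalloc(reinterpret_cast<void**>(&pinned), bytes, hipHostMallocDefault);

    // Managed memory, migratable between host and device.
    double* managed = nullptr;
    hipMallocManaged(reinterpret_cast<void**>(&managed), bytes);

    // Device (VRAM) memory.
    double* device = nullptr;
    hipMalloc(reinterpret_cast<void**>(&device), bytes);

    hipFree(device);
    hipFree(managed);
    hipHostFree(pinned);
    std::free(pageable);
    return 0;
}
```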
### Pageable memory
|
||||
|
||||
Pageable memory is usually gotten when calling `malloc` or `new` in a C++ application.
|
||||
It is unique in that it exists on "pages" (blocks of memory), which can be migrated to other memory storage.
|
||||
For example, migrating memory between CPU sockets on a motherboard, or a system that runs out of space in RAM and starts dumping pages of RAM into the swap partition of your hard drive.

### Pinned memory

Pinned memory (or page-locked memory, or non-pageable memory) is host memory that is mapped into the address space of all GPUs, meaning that the pointer can be used on both host and device.
Accessing host-resident pinned memory in device kernels is generally not recommended for performance, as it can force the data to traverse the host-device interconnect (e.g. PCIe), which is much slower than the on-device bandwidth (>40x on MI200).

Pinned host memory can be allocated with one of two types of coherence support; see the Coherence section below.

:::{note}
In HIP, pinned memory allocations are coherent by default (`hipHostMallocDefault`).
There are additional pinned memory flags (e.g. `hipHostMallocMapped` and `hipHostMallocPortable`).
On MI200 these options do not impact performance.
<!-- TODO: link to programming_manual#memory-allocation-flags -->
For more information, see the section *memory allocation flags* in the HIP Programming Guide: {doc}`hip:user_guide/programming_manual`.
:::

Much like how a process can be locked to a CPU core by setting affinity, a pinned memory allocator does this with the memory storage system.
On multi-socket systems it is important to ensure that pinned memory is located on the same socket as the owning process; otherwise, each cache line will be moved through the CPU-CPU interconnect, increasing latency and potentially decreasing bandwidth.

In practice, pinned memory is used to improve transfer times between host and device.
For transfer operations, such as `hipMemcpy` or `hipMemcpyAsync`, using pinned memory instead of pageable memory on the host can lead to a ~3x improvement in bandwidth.

:::{tip}
If the application needs to move data back and forth between device and host (separate allocations), use pinned memory on the host side.
:::
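
The following sketch (not from the original page) stages transfers through a pinned host buffer so that `hipMemcpyAsync` can overlap with other work on the stream:

```cpp
#include <hip/hip_runtime.h>

int main() {
  const size_t n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  float* h_pinned = nullptr;
  hipHostMalloc(reinterpret_cast<void**>(&h_pinned), bytes, hipHostMallocDefault);
  float* d_buf = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d_buf), bytes);

  hipStream_t stream;
  hipStreamCreate(&stream);

  // Asynchronous copies are only truly asynchronous with pinned host
  // memory; with pageable memory the runtime falls back to a staged,
  // effectively synchronous copy.
  hipMemcpyAsync(d_buf, h_pinned, bytes, hipMemcpyHostToDevice, stream);
  // ... launch kernels on `stream` here ...
  hipMemcpyAsync(h_pinned, d_buf, bytes, hipMemcpyDeviceToHost, stream);
  hipStreamSynchronize(stream);

  hipStreamDestroy(stream);
  hipFree(d_buf);
  hipHostFree(h_pinned);
  return 0;
}
```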

### Managed memory

Managed memory refers to universally addressable, or unified, memory available on the MI200 series of GPUs.
Much like pinned memory, managed memory shares a pointer between host and device and (by default) supports fine-grained coherence; however, managed memory can also automatically migrate pages between host and device.
The allocation is managed by the AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism.

If HMM is not available, then `hipMallocManaged` falls back to using system memory and acts like pinned host memory.
Other managed memory API calls will have undefined behavior.
It is therefore recommended to check for managed memory capability with `hipDeviceGetAttribute` and `hipDeviceAttributeManagedMemory`.

HIP supports additional calls that work with page migration (see the sketch after this list):

* `hipMemAdvise`
* `hipMemPrefetchAsync`
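
As a hedged sketch (not from the original page), the capability check and the page migration calls can be combined like this:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  int has_managed = 0;
  hipDeviceGetAttribute(&has_managed, hipDeviceAttributeManagedMemory, device);
  if (!has_managed) {
    std::printf("Managed memory is not supported on device %d\n", device);
    return 0;
  }

  const size_t n = 1 << 20;
  float* data = nullptr;
  hipMallocManaged(reinterpret_cast<void**>(&data), n * sizeof(float));

  for (size_t i = 0; i < n; ++i) data[i] = 1.0f; // pages touched on host first

  // Hint that the device will mostly read the data, then migrate the pages
  // ahead of kernel launches instead of on first access.
  hipMemAdvise(data, n * sizeof(float), hipMemAdviseSetReadMostly, device);
  hipMemPrefetchAsync(data, n * sizeof(float), device, /*stream=*/0);
  hipDeviceSynchronize();

  hipFree(data);
  return 0;
}
```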

:::{tip}
If the application needs to use data on both host and device regularly, does not want to deal with separate allocations, and is not worried about maxing out the VRAM on MI200 GPUs (64 GB per GCD), use managed memory.
:::

:::{tip}
If managed memory performance is poor, check whether managed memory is supported on your system and whether page migration (XNACK) is enabled.
:::
## Access behavior

Memory allocations for GPUs behave as follows:

| API                | Data location | Host access  | Device access        |
|--------------------|---------------|--------------|----------------------|
| System allocated   | Host          | Local access | Unhandled page fault |
| `hipMallocManaged` | Host          | Local access | Zero-copy            |
| `hipHostMalloc`    | Host          | Local access | Zero-copy*           |
| `hipMalloc`        | Device        | Zero-copy    | Local access         |

Zero-copy accesses happen over the Infinity Fabric interconnect or PCIe lanes on discrete GPUs.

:::{note}
While `hipHostMalloc` allocated memory is accessible by a device, the host pointer must be converted to a device pointer with `hipHostGetDevicePointer`.

Memory allocated through standard system allocators, such as `malloc`, can be accessed by a device by registering the memory via `hipHostRegister`.
The device pointer to be used in kernels can then be retrieved with `hipHostGetDevicePointer`.
Registered memory is treated like `hipHostMalloc` and will have similar performance.

On devices that support and have [](#xnack) enabled, such as the MI250X, `hipHostRegister` is not required, as memory accesses are handled via automatic page migration.
:::
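
A minimal sketch (not from the original page) of making a `malloc`'d buffer visible to the device:

```cpp
#include <hip/hip_runtime.h>
#include <cstdlib>

int main() {
  const size_t bytes = 1 << 20;
  void* host_buf = std::malloc(bytes);

  // Pin the existing allocation and map it into the device address space.
  hipHostRegister(host_buf, bytes, hipHostRegisterDefault);

  void* dev_ptr = nullptr;
  hipHostGetDevicePointer(&dev_ptr, host_buf, 0); // flags must be 0
  // ... pass dev_ptr to kernels here ...

  hipHostUnregister(host_buf);
  std::free(host_buf);
  return 0;
}
```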

### XNACK

Normally, host and device memory are separate, and data has to be transferred manually via `hipMemcpy`.

On a subset of GPUs, such as the MI200, there is an option to automatically migrate pages of memory between host and device.
This is important for managed memory, where the locality of the data is important for performance.
Depending on the system, page migration may be disabled by default, in which case managed memory will act like pinned host memory and suffer degraded performance.

*XNACK* describes the GPU's ability to retry memory accesses that fail due to a page fault (which would normally lead to a memory access error), and instead retrieve the missing page.

This also affects memory allocated by the system, as indicated by the following table:

| API                | Data location | Host after device access | Device after host access |
|--------------------|---------------|--------------------------|--------------------------|
| System allocated   | Host          | Migrate page to host     | Migrate page to device   |
| `hipMallocManaged` | Host          | Migrate page to host     | Migrate page to device   |
| `hipHostMalloc`    | Host          | Local access             | Zero-copy                |
| `hipMalloc`        | Device        | Zero-copy                | Local access             |

To check if page migration is available on a platform, use `rocminfo`:

```sh
$ rocminfo | grep xnack
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
```

Here, `xnack-` means that XNACK is available but is disabled by default.
Turning on XNACK by setting the environment variable `HSA_XNACK=1` gives the expected result, `xnack+`:

```sh
$ HSA_XNACK=1 rocminfo | grep xnack
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack+
```

By default, `hipcc` generates code that runs correctly with XNACK either enabled or disabled.
Setting the `--offload-arch=` option with `xnack+` or `xnack-` forces the code to run only with XNACK enabled or disabled, respectively.

```sh
# Compiled kernels will run regardless of whether XNACK is enabled or disabled.
hipcc --offload-arch=gfx90a

# Compiled kernels will only run if XNACK is enabled with HSA_XNACK=1.
hipcc --offload-arch=gfx90a:xnack+

# Compiled kernels will only run if XNACK is disabled with HSA_XNACK=0.
hipcc --offload-arch=gfx90a:xnack-
```

:::{tip}
If you want to make use of page migration, use managed memory. While pageable memory will migrate correctly, it is not a portable solution and can have performance issues if the accessed data isn't page aligned.
:::
### Coherence

* *Coarse-grained coherence* means that memory is only considered up to date at kernel boundaries, which can be enforced through `hipDeviceSynchronize`, `hipStreamSynchronize`, or any blocking operation that acts on the null stream (e.g. `hipMemcpy`).
  For example, cacheable memory is a type of coarse-grained memory where an up-to-date copy of the data can be stored elsewhere (e.g. in an L2 cache).
* *Fine-grained coherence* means the coherence is supported while a CPU/GPU kernel is running.
  This can be useful if both host and device are operating on the same dataspace using system-scope atomic operations (e.g. updating an error code or flag in a buffer).
  Fine-grained memory implies that up-to-date data may be made visible to others, regardless of the kernel boundaries discussed above.

| API             | Flag                       | Coherence      |
|-----------------|----------------------------|----------------|
| `hipHostMalloc` | `hipHostMallocDefault`     | Fine-grained   |
| `hipHostMalloc` | `hipHostMallocNonCoherent` | Coarse-grained |

| API                     | Flag                         | Coherence      |
|-------------------------|------------------------------|----------------|
| `hipExtMallocWithFlags` | `hipDeviceMallocDefault`     | Coarse-grained |
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Fine-grained   |

| API                | `hipMemAdvise` argument      | Coherence      |
|--------------------|------------------------------|----------------|
| `hipMallocManaged` |                              | Fine-grained   |
| `hipMallocManaged` | `hipMemAdviseSetCoarseGrain` | Coarse-grained |
| `malloc`           |                              | Fine-grained   |
| `malloc`           | `hipMemAdviseSetCoarseGrain` | Coarse-grained |
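
As a hedged illustration (not from the original page, and assuming your ROCm version provides the `_system` atomic variants and the `hipHostMallocCoherent` flag), fine-grained pinned memory lets a running kernel publish updates that the host can observe:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void report_progress(int* progress) {
  // System-scope atomic, visible to the host even before kernel completion.
  atomicAdd_system(progress, 1);
}

int main() {
  int* progress = nullptr;
  // Request explicitly coherent (fine-grained) pinned host memory.
  hipHostMalloc(reinterpret_cast<void**>(&progress), sizeof(int), hipHostMallocCoherent);
  *progress = 0;

  report_progress<<<dim3(64), dim3(1)>>>(progress);
  hipDeviceSynchronize();

  std::printf("progress = %d\n", *progress); // expect 64
  hipHostFree(progress);
  return 0;
}
```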

:::{tip}
Try to design your algorithms to avoid host-device memory coherence (e.g. system-scope atomics). While it can be a useful feature in very specific cases, it is not supported on all systems, and it can negatively impact performance by introducing the host-device interconnect as a bottleneck.
:::

The availability of fine- and coarse-grained memory pools can be checked with `rocminfo`:

```sh
$ rocminfo
...
*******
Agent 1
*******
  Name:                    AMD EPYC 7742 64-Core Processor
  ...
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      ...
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      ...
*******
Agent 9
*******
  Name:                    gfx90a
  ...
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      ...
```
## System direct memory access

In most cases, the default behavior for HIP when transferring data from a pinned host allocation to the device will run at the limit of the interconnect.
However, there are certain cases where the interconnect is not the bottleneck.

The primary way to transfer data onto and off of a GPU, such as the MI200, is to use the onboard System Direct Memory Access (SDMA) engine, which feeds blocks of memory to the off-device interconnect (either GPU-CPU or GPU-GPU).
Each GCD has a separate SDMA engine for host-to-device and device-to-host memory transfers.
Importantly, SDMA engines are separate from the computing infrastructure, meaning that memory transfers to and from a device will not impact kernel compute performance, though they do impact memory bandwidth to a limited extent.
The SDMA engines are mainly tuned for PCIe 4.0 x16, which means they are designed to operate at bandwidths up to 32 GB/s.

:::{note}
An important feature of the MI250X platform is the Infinity Fabric™ interconnect between host and device.
The Infinity Fabric interconnect supports improved performance over standard PCIe 4.0 (usually ~50% more bandwidth); however, since the SDMA engine does not run at this speed, it will not max out the bandwidth of the faster interconnect.
:::

The bandwidth limitation can be countered by bypassing the SDMA engine and replacing it with a type of copy kernel known as a "blit" kernel.
Blit kernels use the compute units on the GPU, thereby consuming compute resources, which may not always be beneficial.
The easiest way to enable blit kernels is to set the environment variable `HSA_ENABLE_SDMA=0`, which disables the SDMA engine.
On systems where the GPU uses a PCIe interconnect instead of an Infinity Fabric interconnect, blit kernels will not impact bandwidth, but will still consume compute resources.
The use of SDMA versus blit kernels also applies to MPI data transfers and GPU-GPU transfers.
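
As a usage sketch (the application name `./my_app` is a placeholder), blit kernels can be selected per run through the environment:

```sh
# Force copies through blit kernels instead of the SDMA engine.
HSA_ENABLE_SDMA=0 ./my_app
```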
@@ -1,90 +1,102 @@

# Using the LLVM ASan on a GPU (beta release)

The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.

Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications exactly like pure CPU applications. However, this simplicity has not been achieved yet.

This document describes how to use ROCm ASan.
For information about LLVM ASan, see the [LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).

**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.

## Compiling for ASan

The ASan process begins by compiling the application of interest with the ASan instrumentation.

Recommendations for doing this are:

* Compile as many application and dependent library sources as possible using an AMD-built clang-based compiler such as `amdclang++`.
* Add the following options to the existing compiler and linker options:
  * `-fsanitize=address` - enables instrumentation
  * `-shared-libsan` - uses the shared version of the runtime
  * `-g` - adds debug info for improved reporting
* Explicitly use `xnack+` in the offload architecture option, for example, `--offload-arch=gfx90a:xnack+` (see the sketch after this list).
  Other architectures are allowed, but their device code will not be instrumented and a warning will be emitted.
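
As a sketch following these recommendations (the file and output names are hypothetical):

```bash
# Instrumented build of a single-file HIP app; `my_app.hip` is a placeholder.
hipcc -fsanitize=address -shared-libsan -g \
      --offload-arch=gfx90a:xnack+ \
      my_app.hip -o my_app
```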

It is not an error to compile some files without ASan instrumentation, but doing so reduces the ability of the process to detect addressing errors. However, if the main program "`a.out`" does not directly depend on the ASan runtime (`libclang_rt.asan-x86_64.so`) after the build completes (check by running `ldd` (List Dynamic Dependencies) or `readelf`), the application will immediately report an error at runtime as described in the next section.

### About compilation time

When `-fsanitize=address` is used, the LLVM compiler adds instrumentation code around every memory operation. This added code must be handled by all of the downstream components of the compiler toolchain and results in increased overall compilation time. This increase is especially evident in the AMDGPU device compiler and has in a few instances raised the compile time to an unacceptable level.

There are a few options if the compile time becomes unacceptable:

* Avoid instrumentation of the files which have the worst compile times. This will reduce the effectiveness of the ASan process.
* Add the option `-fsanitize-recover=address` to the compiles with the worst compile times. This option simplifies the added instrumentation, resulting in faster compilation. See below for more information.
* Disable instrumentation on a per-function basis by adding `__attribute__((no_sanitize("address")))` to functions found to be responsible for the large compile time. Again, this will reduce the effectiveness of the process.

## Installing ROCm GPU ASan packages

For a complete ROCm GPU Sanitizer installation, including packages, instrumented HSA and HIP runtimes, tools, and math libraries, use the following instruction:

```bash
sudo apt-get install rocm-ml-sdk-asan
```

## Using AMD-supplied ASan instrumented libraries

ROCm releases have optional packages that contain additional ASan instrumented builds of the ROCm libraries (usually found in `/opt/rocm-<version>/lib`). The instrumented libraries have identical names to the regular uninstrumented libraries, and are located in `/opt/rocm-<version>/lib/asan`.
These additional libraries are built using the `amdclang++` and `hipcc` compilers, while some uninstrumented libraries are built with g++. The preexisting build options are used but, as described above, additional options are used: `-fsanitize=address`, `-shared-libsan` and `-g`.

These additional libraries avoid additional developer effort to locate repositories, identify the correct branch, check out the correct tags, and other efforts needed to build the libraries from source. They also extend the ability of the process to detect addressing errors into the ROCm libraries themselves.

When adjusting an application build to add instrumentation, linking against these instrumented libraries is unnecessary. For example, any `-L /opt/rocm-<version>/lib` compiler options need not be changed. However, the instrumented libraries should be used when the application is run. It is particularly important that the instrumented language runtimes, like `libamdhip64.so` and `librocm-core.so`, are used; otherwise, device invalid access detections may not be reported.

## Running ASan instrumented applications

### Preparing to run an instrumented application

Here are a few recommendations to consider before running an ASan instrumented heterogeneous application; a combined launch sketch follows this list.

* Ensure the Linux kernel running on the system has Heterogeneous Memory Management (HMM) support. A kernel version of 5.6 or higher should be sufficient.
* Ensure XNACK is enabled:
  * For `gfx90a` (MI-2X0) or `gfx940` (MI-3X0), use the environment setting `HSA_XNACK=1`.
  * For `gfx906` (MI-50) or `gfx908` (MI-100), use the environment setting `HSA_XNACK=1`, but also ensure the amdgpu kernel module is loaded with module argument `noretry=0`.
    This requirement is due to the fact that the XNACK setting for these GPUs is system-wide.
* Ensure that the application will use the instrumented libraries when it runs. The output from the shell command `ldd <application name>` can be used to see which libraries will be used.
  If the instrumented libraries are not listed by `ldd`, the environment variable `LD_LIBRARY_PATH` may need to be adjusted, or in some cases an `RPATH` compiled into the application may need to be changed and the application recompiled.
* Ensure that the application depends on the ASan runtime. This can be checked by running the command `readelf -d <application name> | grep NEEDED` and verifying that the shared library `libclang_rt.asan-x86_64.so` appears in the output.
  If it does not appear, when executed the application will quickly output an ASan error that looks like:

  ```bash
  ==3210==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
  ```

* Ensure that `llvm-symbolizer` can be executed, and that it is located in `/opt/rocm-<version>/llvm/bin`. This executable is not strictly required, but if found it is used to translate ("symbolize") a host-side instruction address into a more useful function name, file name, and line number (assuming the application has been built to include debug information).
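
Putting these checks together, a typical launch of an instrumented application might look like the following sketch (the paths and the application name are illustrative):

```bash
# Adjust the ROCm version and application name to your installation.
export HSA_XNACK=1
export LD_LIBRARY_PATH=/opt/rocm-5.7.1/lib/asan:$LD_LIBRARY_PATH
./my_app
```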

There is an environment variable, `ASAN_OPTIONS`, that can be used to adjust the runtime behavior of the ASan runtime itself. There are more than a hundred "flags" that can be adjusted (see an old list at [flags](https://github.com/google/sanitizers/wiki/AddressSanitizerFlags)), but the default settings are correct and should be used in most cases. It must be noted that these options only affect the host ASan runtime. The device runtime currently supports only the default settings for the few relevant options.

There are two `ASAN_OPTIONS` flags of particular note; a combined usage sketch follows their descriptions.

* `halt_on_error=0/1`, default `1`.

  This tells the ASan runtime to halt the application immediately after detecting and reporting an addressing error. The default makes sense because the application has entered the realm of undefined behavior. If the developer wishes to have the application continue anyway, this option can be set to zero. However, the application and libraries should then be compiled with the additional option `-fsanitize-recover=address`. Note that the ROCm optional ASan instrumented libraries are not compiled with this option, and if an error is detected within one of them while `halt_on_error` is set to `0`, more undefined behavior will occur.

* `detect_leaks=0/1`, default `1`.

  This option directs the ASan runtime to enable the [Leak Sanitizer](https://clang.llvm.org/docs/LeakSanitizer.html) (LSan). Unfortunately, for heterogeneous applications, this default will result in significant output from the leak sanitizer when the application exits, due to allocations made by the language runtime which are not considered to be leaks. This output can be avoided by adding `detect_leaks=0` to the `ASAN_OPTIONS`, or alternatively by producing an LSan suppression file (syntax described [here](https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer)) and activating it with the environment variable `LSAN_OPTIONS=suppressions=/path/to/suppression/file`. When using a suppression file, a suppression report is printed by default. The suppression report can be disabled by using the `LSAN_OPTIONS` flag `print_suppressions=0`.
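
For instance (the application name is hypothetical), both flags can be combined on the command line; note that `ASAN_OPTIONS` flags are colon-separated:

```bash
# Continue after errors (requires -fsanitize-recover=address at build time)
# and silence runtime leak reports.
ASAN_OPTIONS="halt_on_error=0:detect_leaks=0" ./my_app
```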

## Runtime overhead

Running an ASan instrumented application incurs
overheads which may result in unacceptably long runtimes
or failure to run at all.

### Higher execution time

ASan detection works by checking each address at runtime
before the address is actually accessed by a load, store, or atomic
instruction.
This checking involves an additional load to "shadow" memory which
@@ -98,10 +110,10 @@ For heterogeneous applications, the shadow memory must be accessible by all devi

and this can mean that shadow accesses from some devices may be more costly
than non-shadow accesses.

### Higher memory use

The address checking described above relies on the compiler to surround
each program variable with a red zone and on the ASan
runtime to surround each runtime memory allocation with a red zone and
fill the shadow corresponding to each red zone with poison.
The added memory for the red zones is additional overhead on top
@@ -111,9 +123,9 @@ Applications which consume most one or more available memory pools when

run normally are likely to encounter allocation failures when run with
instrumentation.

## Runtime reporting

It is not the intention of this document to provide a detailed explanation of all of the types of reports that can be output by the ASan runtime. Instead, the focus is on the differences between the standard reports for CPU issues and reports for GPU issues.

An invalid address detection report for the CPU always starts with
@@ -162,16 +174,16 @@ currently may include one or two surprising CPU side tracebacks mentioning :`hos

### Running with `rocgdb`

`rocgdb` can be used to further investigate ASan detected errors, with some preparation.

Currently, the ASan runtime complains when starting `rocgdb` without preparation:
```bash
$ rocgdb my_app
==1122==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
```

This is solved by setting the environment variable `LD_PRELOAD` to the path of the ASan runtime, which can be obtained using the command

```bash
amdclang++ -print-file-name=libclang_rt.asan-x86_64.so
```
@@ -179,7 +191,7 @@ amdclang++ -print-file-name=libclang_rt.asan-x86_64.so

It is also recommended to set the environment variable `HIP_ENABLE_DEFERRED_LOADING=0` before debugging HIP applications.

After starting `rocgdb`, breakpoints can be set on the ASan runtime error reporting entry points of interest. For example, if an ASan error report includes
```bash
WRITE of size 4 in workgroup id (10,0,0)
```

@@ -203,7 +215,7 @@ is

```bash
(gdb) break __asan_report_load<N>
```

It is possible to set breakpoints on all ASan report functions using these commands:

```bash
$ rocgdb <path to application>
@@ -212,14 +224,18 @@ $ rocgdb <path to application>
(gdb) c
```

### Using ASan with a short HIP application

Refer to the following example of using ASan with a short HIP application:

https://github.com/Rmalavally/rocm-examples/blob/Rmalavally-patch-1/LLVM_ASAN/Using-Address-Sanitizer-with-a-Short-HIP-Application.md
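
As an illustrative sketch (not the linked example; the file name is hypothetical), the kind of defect ASan can catch is a device-side out-of-bounds access:

```cpp
// mini.hip - build with:
//   hipcc -fsanitize=address -shared-libsan -g --offload-arch=gfx90a:xnack+ mini.hip -o mini
// and run with HSA_XNACK=1. Thread 256 writes one element past the end of
// the 256-element allocation, which an instrumented build should report.
#include <hip/hip_runtime.h>

__global__ void overrun(int* data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i <= n) {  // off-by-one: i == n writes out of bounds
    data[i] = static_cast<int>(i);
  }
}

int main() {
  const size_t n = 256;
  int* d_data = nullptr;
  hipMalloc(reinterpret_cast<void**>(&d_data), n * sizeof(int));
  overrun<<<dim3(1), dim3(n + 1)>>>(d_data, n); // n + 1 threads, indices 0..n
  hipDeviceSynchronize();
  hipFree(d_data);
  return 0;
}
```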

### Known issues with using GPU sanitizer

* Red zones must have limited size, and it is possible for an invalid access to completely miss a red zone and not be detected.
* Lack of detection or false reports can be caused by the runtime not properly maintaining red zone shadows.
* Lack of detection on the GPU might also be due to the implementation not instrumenting accesses to all GPU specific address spaces. For example, in the current implementation, accesses to "private" or "stack" variables on the GPU are not instrumented, and accesses to HIP shared variables (also known as "local data store" or "LDS") are also not instrumented.
* It can also be the case that a memory fault is hit for an invalid address even with the instrumentation. This is usually caused by the invalid address being so wild that its shadow address is outside of any memory region, and the fault actually occurs on the access to the shadow address. It is also possible to hit a memory fault for the `NULL` pointer. While address 0 does have a shadow location, it is not poisoned by the runtime.
110
docs/conf.py
@@ -5,24 +5,46 @@

# https://www.sphinx-doc.org/en/master/usage/configuration.html

import shutil
import jinja2
import os

from rocm_docs import ROCmDocs

# Environment used to process Jinja templates.
jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader("."))

# Jinja templates to render out.
templates = [
]

# Render templates and output files without the last extension.
# For example: 'install.md.jinja' becomes 'install.md'.
for template in templates:
    rendered = jinja_env.get_template(template).render()
    with open(os.path.splitext(template)[0], 'w') as file:
        file.write(rendered)

shutil.copy2('../CONTRIBUTING.md','./contribute/index.md')
shutil.copy2('../RELEASE.md','./about/release-notes.md')
# Keep capitalization due to similar linking on GitHub's markdown preview.
shutil.copy2('../CHANGELOG.md','./about/CHANGELOG.md')

latex_engine = "xelatex"
latex_elements = {
    "fontpkg": r"""
\usepackage{tgtermes}
\usepackage{tgheros}
\renewcommand\ttdefault{txtt}
"""
}

# configurations for PDF output by Read the Docs
project = "ROCm Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved."
version = "5.7.1"
release = "5.7.1"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_author = ""
@@ -35,57 +57,45 @@ article_pages = [

        "date":"2023-07-27"
    },

    {"file":"install/windows/install-quick", "os":["windows"]},
    {"file":"install/linux/install-quick", "os":["linux"]},

    {"file":"install/linux/install", "os":["linux"]},
    {"file":"install/linux/install-options", "os":["linux"]},
    {"file":"install/linux/prerequisites", "os":["linux"]},

    {"file":"install/docker", "os":["linux"]},
    {"file":"install/magma-install", "os":["linux"]},
    {"file":"install/pytorch-install", "os":["linux"]},
    {"file":"install/tensorflow-install", "os":["linux"]},

    {"file":"install/windows/install", "os":["windows"]},
    {"file":"install/windows/prerequisites", "os":["windows"]},
    {"file":"install/windows/cli/index", "os":["windows"]},
    {"file":"install/windows/gui/index", "os":["windows"]},

    {"file":"about/compatibility/linux-support", "os":["linux"]},
    {"file":"about/compatibility/windows-support", "os":["windows"]},
    {"file":"about/compatibility/docker-image-support-matrix", "os":["linux"]},
    {"file":"about/compatibility/user-kernel-space-compat-matrix", "os":["linux"]},

    {"file":"reference/library-index", "os":["linux"]},

    {"file":"how-to/deep-learning-rocm", "os":["linux"]},
    {"file":"how-to/gpu-enabled-mpi", "os":["linux"]},
    {"file":"how-to/system-debugging", "os":["linux"]},
    {"file":"how-to/tuning-guides", "os":["linux", "windows"]},

    {"file":"rocm-a-z", "os":["linux", "windows"]},

    {"file":"understand/isv_deployment_win", "os":["windows"]},
]

exclude_patterns = ['temp']

external_toc_path = "./sphinx/_toc.yml"

docs_core = ROCmDocs("ROCm Documentation")
docs_core.setup()

external_projects_current_project = "rocm"
@@ -1,139 +1,125 @@

# Building documentation

You can build our documentation via GitHub (in a pull request) or locally (using the command line or
Visual Studio (VS) Code).

## GitHub

If you open a pull request on the `develop` branch of a ROCm repository and scroll to the bottom of
the page, there is a summary panel. Next to the line
`docs/readthedocs.com:advanced-micro-devices-demo`, there is a `Details` link. If you click this, it takes
you to the Read the Docs build for your pull request.

![GitHub docs build](/data/contribute/github-docs-build.png)

If you don't see this line, click `Show all checks` to get an itemized view.

## Command line

You can build our documentation via the command line using Python. We use Python 3.8; other
versions may not support the build.

Use the Python Virtual Environment (`venv`) and run the following commands from the project root:

```sh
python3 -m venv .venv

# Windows
.venv/Scripts/python -m pip install -r docs/sphinx/requirements.txt
.venv/Scripts/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html

# Linux
.venv/bin/python -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python -m sphinx -T -E -b html -d _build/doctrees -D language=en docs _build/html
```

Navigate to `_build/html/index.html` and open this file in a web browser.

## Visual Studio Code

With the help of a few extensions, you can create a productive environment to author and test
documentation locally using Visual Studio (VS) Code. Follow these steps to configure VS Code:

1. Install the required extensions:

   * Python: `(ms-python.python)`
   * Live Server: `(ritwickdey.LiveServer)`

2. Add the following entries to `.vscode/settings.json`.

   ```json
   {
     "liveServer.settings.root": "/.vscode/build/html",
     "liveServer.settings.wait": 1000,
     "python.terminal.activateEnvInCurrentTerminal": true
   }
   ```

   The settings above are used for the following reasons:

   * `liveServer.settings.root`: Sets the root of the output website for live previews. Must be changed
     alongside the `tasks.json` command.
   * `liveServer.settings.wait`: Tells the live server to wait with the update in order to give Sphinx time to
     regenerate the site contents and not refresh before the build is complete.
   * `python.terminal.activateEnvInCurrentTerminal`: Activates the automatic virtual environment, so you
     can build the site from the integrated terminal.

3. Add the following tasks to `.vscode/tasks.json`.

   ```json
   {
     "version": "2.0.0",
     "tasks": [
       {
         "label": "Build Docs",
         "type": "process",
         "windows": {
           "command": "${workspaceFolder}/.venv/Scripts/python.exe"
         },
         "command": "${workspaceFolder}/.venv/bin/python3",
         "args": [
           "-m",
           "sphinx",
           "-j",
           "auto",
           "-T",
           "-b",
           "html",
           "-d",
           "${workspaceFolder}/.vscode/build/doctrees",
           "-D",
           "language=en",
           "${workspaceFolder}/docs",
           "${workspaceFolder}/.vscode/build/html"
         ],
         "problemMatcher": [
           {
             "owner": "sphinx",
             "fileLocation": "absolute",
             "pattern": {
               "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):(\\d+):\\s+(WARNING|ERROR):\\s+(.*)$",
               "file": 1,
               "line": 2,
               "severity": 3,
               "message": 4
             }
           },
           {
             "owner": "sphinx",
             "fileLocation": "absolute",
             "pattern": {
               "regexp": "^(?:.*\\.{3}\\s+)?(\\/[^:]*|[a-zA-Z]:\\\\[^:]*):{1,2}\\s+(WARNING|ERROR):\\s+(.*)$",
               "file": 1,
               "severity": 2,
               "message": 3
             }
           }
         ],
         "group": {
           "kind": "build",
           "isDefault": true
         }
       }
     ]
   }
   ```

> (Implementation detail: two problem matchers needed to be defined,
@@ -143,23 +129,20 @@ resulting website show up on a locally-served web server.
> messages) but the `pattern` references said empty capture group, VS Code
> discards the message completely.)

4. Configure the Python virtual environment (`venv`).

   From the Command Palette, run `Python: Create Environment`. Select the `venv` environment and
   the `docs/sphinx/requirements.txt` file.

5. Build the docs.

   Launch the default build task using one of the following options:

   * A hotkey (the default is `Ctrl+Shift+B`)
   * Issuing the `Tasks: Run Build Task` from the Command Palette

6. Open the live preview.

   Navigate to the site output within VS Code: right-click on `.vscode/build/html/index.html` and
   select `Open with Live Server`. The contents should update on every rebuild without having to
   refresh the browser.
@@ -2,14 +2,14 @@

There are four standard ways to provide feedback for this repository.

## Pull request

All contributions to ROCm documentation should arrive via the
[GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)
targeting the develop branch of the repository. If you are unable to contribute
via the GitHub Flow, feel free to email us at [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com?subject=Documentation%20Feedback).

## GitHub discussions

To ask questions or view answers to frequently asked questions, refer to
[GitHub Discussions](https://github.com/RadeonOpenCompute/ROCm/discussions).
@@ -17,11 +17,11 @@ On GitHub Discussions, in addition to asking and answering questions,

members can share updates, have open-ended conversations,
and follow along via public announcements.

## GitHub issue

Issues about existing or missing docs can be filed as
[GitHub Issues](https://github.com/RadeonOpenCompute/ROCm/issues).

## Email

Send other feedback or questions to [rocm-feedback@amd.com](mailto:rocm-feedback@amd.com?subject=Documentation%20Feedback).
@@ -1,9 +1,6 @@

# ROCm documentation toolchain

Our documentation relies on several open source toolchains and sites.

## `rocm-docs-core`
@@ -12,12 +9,12 @@ project that applies customization for our documentation. This

project that applies customization for our documentation. This
project is the tool most ROCm repositories use as part of the documentation
build. It is also available as a [pip package on PyPI](https://pypi.org/project/rocm-docs-core/).

See the user and developer guides for rocm-docs-core at {doc}`rocm-docs-core documentation<rocm-docs-core:index>`.

## Sphinx

[Sphinx](https://www.sphinx-doc.org/en/master/) is a documentation generator
originally used for Python. It is now widely used in the open source community.
Originally, Sphinx supported reStructuredText (RST) based documentation, but
Markdown support is now available.
ROCm documentation plans to default to Markdown for new projects.
@@ -50,9 +47,9 @@ It is integrated into ROCm documentation by the Sphinx extension [`myst-parser`]

A cheat sheet that showcases how to use the MyST syntax is available over at
the [Jupyter reference](https://jupyterbook.org/en/stable/reference/cheatsheet.html).

### Sphinx External ToC

[Sphinx External ToC](https://sphinx-external-toc.readthedocs.io/en/latest/intro.html)
is a Sphinx extension used for ROCm documentation navigation. This tool generates a navigation menu on the left
based on a YAML file that specifies the table of contents.
It was selected due to its flexibility that allows scripts to operate on the
@@ -60,15 +57,15 @@ YAML file. Please transition to this file for the project's navigation. You can
|
||||
see the `_toc.yml.in` file in this repository in the `docs/sphinx` folder for an
|
||||
example.
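Below is a sketch of how sphinx-external-toc is typically wired up. The YAML shape in the comment follows the extension's documented format, while the file path is a placeholder (ROCm projects generate `_toc.yml` from `_toc.yml.in`).

```python
# conf.py -- navigation comes from a YAML table of contents instead of
# toctree directives scattered through the pages.
extensions = ["sphinx_external_toc"]

# Placeholder path; ROCm repositories typically generate
# sphinx/_toc.yml from sphinx/_toc.yml.in.
external_toc_path = "sphinx/_toc.yml"

# A minimal _toc.yml might look like:
#   root: index
#   subtrees:
#     - caption: Tutorials
#       entries:
#         - file: tutorials/install
#         - file: tutorials/quick-start
```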

### Sphinx Book Theme
### Sphinx-book-theme

[Sphinx Book Theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme
[Sphinx-book-theme](https://sphinx-book-theme.readthedocs.io/en/latest/) is a Sphinx theme
that defines the base appearance for ROCm documentation.
ROCm documentation applies some customization,
such as a custom header and footer on top of the Sphinx Book Theme.
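For comparison, this is roughly what selecting the base theme looks like in a plain Sphinx project. The `html_theme_options` shown are standard sphinx-book-theme options; the repository URL is a placeholder.

```python
# conf.py -- use sphinx-book-theme directly (rocm-docs-core layers its
# customization, such as the header and footer, on top of this).
html_theme = "sphinx_book_theme"

html_theme_options = {
    # Placeholder repository; drives the "edit this page" button.
    "repository_url": "https://github.com/example/example-docs",
    "use_edit_page_button": True,
}
```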

### Sphinx Design
### Sphinx design

[Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design
[Sphinx design](https://sphinx-design.readthedocs.io/en/latest/index.html) is a Sphinx extension that adds design
functionality.
ROCm documentation uses Sphinx Design for grids, cards, and synchronized tabs.
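A sketch of enabling the extension is shown below; the directive names in the comment (`grid`, `grid-item-card`, `tab-set`, `tab-item`) are part of Sphinx Design's documented MyST syntax.

```python
# conf.py -- add layout components (grids, cards, tabs) to the docs.
extensions = ["sphinx_design"]

# In MyST Markdown pages, the components are used as directives, e.g.:
#   ::::{grid} 2
#   :::{grid-item-card} Title
#   Card body text.
#   :::
#   ::::
# Synchronized tabs use the {tab-set} and {tab-item} directives.
```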

BIN docs/data/contribute/github-docs-build.png (new file, 10 KiB)