mirror of
https://github.com/ROCm/ROCm.git
synced 2026-04-05 03:01:17 -04:00
@@ -69,6 +69,7 @@ ROCm documentation is organized into the following categories:
* [Environment variables](./reference/env-variables.rst)
* [Data types and precision support](./reference/precision-support.rst)
* [Graph safe support](./reference/graph-safe-support.rst)
* [ROCm glossary](./reference/glossary.rst)
<!-- markdownlint-enable MD051 -->
:::
24  docs/reference/glossary.rst  Normal file
@@ -0,0 +1,24 @@
.. meta::
   :description: AMD ROCm Glossary
   :keywords: AMD, ROCm, glossary, terminology, device hardware,
              device software, host software, performance

.. _glossary:

********************************************************************************
ROCm glossary
********************************************************************************

This glossary provides concise definitions of key terms and concepts in AMD ROCm
programming. Each entry includes a brief description and a link to detailed
documentation for in-depth information.

The glossary is organized into four sections:

* :doc:`glossary/device-hardware` — Hardware components (for example, compute
  units, cores, memory)
* :doc:`glossary/device-software` — Software abstractions (programming model,
  ISA, thread hierarchy)
* :doc:`glossary/host-software` — Development tools (HIP, compilers, libraries,
  profilers)
* :doc:`glossary/performance` — Performance metrics and optimization concepts
254  docs/reference/glossary/device-hardware.rst  Normal file
@@ -0,0 +1,254 @@
.. meta::
   :description: Device hardware glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, device hardware, compute units, cores, MFMA,
              architecture, register file, cache, HBM

.. _glossary-device-hardware:

************************
Device hardware glossary
************************

This section provides concise definitions of hardware components and architectural
features of AMD GPUs.

.. glossary::
   :sorted:

   AMD device architecture
      AMD device architecture is based on unified, programmable compute
      engines known as :term:`compute units (CUs) <Compute units>`. See
      :ref:`hip:hardware_implementation` for details.

   Compute units
      Compute units (CUs) are the fundamental programmable execution engines
      in AMD GPUs, capable of running complex programs. See
      :ref:`hip:compute_unit` for details.

   ALU
      Arithmetic logic units (ALUs) are the primary arithmetic engines that
      execute mathematical and logical operations within
      :term:`compute units <Compute units>`. See :ref:`hip:valu` for details.

   SALU
      Scalar :term:`ALUs <ALU>` (SALUs) operate on a single value per
      :term:`wavefront <Wavefront (Warp)>` and manage all control flow.

   VALU
      Vector :term:`ALUs <ALU>` (VALUs) perform an arithmetic or logical
      operation on data for each :term:`work-item <Work-item (Thread)>` in a
      :term:`wavefront <Wavefront (Warp)>`, enabling data-parallel execution.

   Special function unit
      Special function units (SFUs) accelerate transcendental and reciprocal
      mathematical functions such as ``exp``, ``log``, ``sin``, and ``cos``.
      See :ref:`hip:sfu` for details.

   Load/store unit
      Load/store units (LSUs) handle data transfer between
      :term:`compute units <Compute units>` and the GPU's memory subsystems,
      managing thousands of concurrent memory operations. See :ref:`hip:lsu`
      for details.

   Work-group (Block)
      A work-group (also called a block) is a collection of
      :term:`wavefronts <Wavefront (Warp)>` scheduled together on a single
      :term:`compute unit <Compute units>` that can coordinate through
      :term:`local data share <Local data share>` memory. See
      :ref:`hip:inherent_thread_hierarchy_block` for work-group details.

   Work-item (Thread)
      A work-item (also called a thread) is the smallest unit of execution on
      an AMD GPU and represents a single element of work. See
      :ref:`hip:work-item` for thread hierarchy details.

   Wavefront (Warp)
      A wavefront (also called a warp) is a group of
      :term:`work-items <Work-item (Thread)>` that execute in parallel on a
      single :term:`compute unit <Compute units>`, sharing one
      instruction stream. See :ref:`hip:wavefront` for execution details.

   Wavefront scheduler
      The wavefront scheduler in each :term:`compute unit <Compute units>`
      decides which :term:`wavefront <Wavefront (Warp)>` to execute each
      clock cycle, enabling rapid context switching for latency hiding. See
      :ref:`hip:wave-scheduling` for details.

   Wavefront size
      The wavefront size is the number of
      :term:`work-items <Work-item (Thread)>` that execute together in a
      single :term:`wavefront <Wavefront (Warp)>`. For AMD Instinct GPUs, the
      wavefront size is 64 threads, while AMD Radeon GPUs have a wavefront
      size of 32 threads. See :ref:`hip:wavefront` for details.
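Because a partially filled wavefront still occupies a full scheduling slot, the mapping from work-group size to wavefront count is a ceiling division. A minimal sketch of the arithmetic (plain Python for illustration, not a ROCm API):

```python
def wavefronts_per_work_group(work_group_size: int, wavefront_size: int) -> int:
    """Number of wavefronts needed to run one work-group.

    A partially filled wavefront still occupies a full wavefront slot,
    so the division rounds up.
    """
    return -(-work_group_size // wavefront_size)  # ceiling division

# A 256-thread work-group maps to 4 wavefronts on AMD Instinct (size 64)
# and 8 wavefronts on AMD Radeon (size 32).
print(wavefronts_per_work_group(256, 64))  # 4
print(wavefronts_per_work_group(256, 32))  # 8
# 100 threads need 2 wavefronts of 64; the second is only partially filled.
print(wavefronts_per_work_group(100, 64))  # 2
```

This is one reason work-group sizes are usually chosen as a multiple of the wavefront size: otherwise the last wavefront carries inactive lanes.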
   SIMD core
      SIMD cores are execution lanes that perform scalar and vector arithmetic
      operations inside each :term:`compute unit <Compute units>`. See
      :ref:`hip:cdna_architecture` and :ref:`hip:rdna_architecture` for
      details.

   Matrix cores (MFMA units)
      Matrix cores (MFMA units) are specialized execution units that perform
      large-scale matrix operations in a single instruction, delivering high
      throughput for AI and HPC workloads. See :ref:`hip:mfma_units` for
      details.

   Data movement engine
      Data movement engines (DMEs) are specialized hardware units in AMD
      Instinct MI300 and MI350 series GPUs that accelerate multi-dimensional
      tensor data copies between global memory and on-chip memory. See
      :ref:`hip:dme` for details.

   GFX IP
      GFX IP (Graphics IP) versions are identifiers that specify which
      instruction formats, memory models, and compute features are supported
      by each AMD GPU generation. See :ref:`hip:gfx_ip` for versioning
      information.

   GFX IP major version
      The :term:`GFX IP <GFX IP>` major version represents the GPU's core
      instruction set and architecture. For example, a GFX IP ``11`` major
      version corresponds to the RDNA3 architecture, influencing driver
      support and available compute features. See :ref:`hip:gfx_ip` for
      versioning information.

   GFX IP minor version
      The :term:`GFX IP <GFX IP>` minor version represents specific variations
      within a :term:`GFX IP <GFX IP>` major version and affects feature sets,
      optimizations, and driver behavior. Different GPU models within the same
      major version can have unique capabilities, impacting performance and
      supported instructions. See :ref:`hip:gfx_ip` for versioning
      information.

   Compute unit versioning
      :term:`Compute units <Compute units>` are versioned with
      :term:`GFX IP <GFX IP>` identifiers that define their microarchitectural
      features and instruction set compatibility. See :ref:`hip:gfx_ip` for
      details.

   Register file
      The register file is the primary on-chip memory store in each
      :term:`compute unit <Compute units>`, holding data between arithmetic
      and memory operations. See :ref:`hip:memory_hierarchy` for details.

   SGPR file
      The :term:`SGPR <SGPR>` file is the
      :term:`register file <Register file>` that holds data used by the
      :term:`scalar ALU <SALU>`.

   VGPR file
      The :term:`VGPR <VGPR>` file is the
      :term:`register file <Register file>` that holds data used by the
      :term:`vector ALU <VALU>`. GPUs with
      :term:`matrix cores <Matrix cores (MFMA units)>` also have
      :term:`AccVGPR <AccVGPR>` files, used specifically for matrix
      instructions.

   L0 instruction cache
      On AMD Radeon GPUs, the level 0 (L0) instruction cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`.

   L0 scalar cache
      On AMD Radeon GPUs, the level 0 (L0) scalar data cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`. It provides the
      :term:`scalar ALU <SALU>` with fast access to recently used data.

   L0 vector cache
      On AMD Radeon GPUs, the level 0 (L0) vector data cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`. It provides the
      :term:`vector ALU <VALU>` with fast access to recently used data.

   L1 instruction cache
      On AMD Instinct GPUs, the level 1 (L1) instruction cache is local to
      each :term:`compute unit <Compute units>`. On AMD Radeon GPUs, the
      L1 instruction cache does not exist as a separate cache level, and
      instructions are stored in the
      :term:`L0 instruction cache <L0 instruction cache>`.

   L1 scalar cache
      On AMD Instinct GPUs, the level 1 (L1) scalar data cache is local to
      each :term:`compute unit <Compute units>`, providing the
      :term:`scalar ALU <SALU>` with fast access to recently used data. On AMD
      Radeon GPUs, the L1 scalar cache does not exist as a separate cache
      level, and recently used scalar data is stored in the
      :term:`L0 scalar cache <L0 scalar cache>`.

   L1 vector cache
      On AMD Instinct GPUs, the level 1 (L1) vector data cache is local to
      each :term:`compute unit <Compute units>`, providing the
      :term:`vector ALU <VALU>` with fast access to recently used data. On AMD
      Radeon GPUs, the L1 vector cache does not exist as a separate cache
      level, and recently used vector data is stored in the
      :term:`L0 vector cache <L0 vector cache>`.

   Graphics L1 cache
      On AMD Radeon GPUs, the read-only graphics level 1 (L1) cache is local
      to groups of :term:`WGPs <WGP>` called shader arrays, providing fast
      access to recently used data. AMD Instinct GPUs do not feature the
      graphics L1 cache.

   L2 cache
      On AMD Instinct MI100 series GPUs, the L2 cache is shared across the
      entire chip, while for all other AMD GPUs the L2 caches are shared by
      the :term:`compute units <Compute units>` on the same :term:`GCD <GCD>`
      or :term:`XCD <XCD>`.

   Infinity Cache (L3 cache)
      On AMD Instinct MI300 and MI350 series GPUs and AMD Radeon GPUs, the
      Infinity Cache is the last-level cache of the cache hierarchy. It is
      shared by all :term:`compute units <Compute units>` and
      :term:`WGPs <WGP>` on the GPU.

   GPU RAM (VRAM)
      GPU RAM, also known as :term:`global memory <Global memory>` in the HIP
      programming model, is the large, high-capacity off-chip memory subsystem
      accessible by all :term:`compute units <Compute units>`, forming the
      foundation of the device's :ref:`memory hierarchy <hip:hbm>`.

   Local data share
      Local data share (LDS) is fast on-chip memory local to each
      :term:`compute unit <Compute units>` and shared among
      :term:`work-items <Work-item (Thread)>` in a
      :term:`work-group <Work-group (Block)>`, enabling efficient coordination
      and data reuse. In the HIP programming model, the LDS is known as shared
      memory. See :ref:`hip:lds` for LDS programming details.

   Registers
      Registers are the lowest level of the memory hierarchy, storing
      per-thread temporary variables and intermediate results. See
      :ref:`hip:memory_hierarchy` for register usage details.

   SGPR
      Scalar general-purpose :term:`registers <Registers>` (SGPRs) hold data
      produced and consumed by a :term:`compute unit <Compute units>`'s
      :term:`scalar ALU <SALU>`.

   VGPR
      Vector general-purpose :term:`registers <Registers>` (VGPRs) hold data
      produced and consumed by a :term:`compute unit <Compute units>`'s
      :term:`vector ALU <VALU>`.

   AccVGPR
      Accumulation general-purpose vector registers (AccVGPRs) are a special
      type of :term:`VGPR <VGPR>` used exclusively for matrix operations.

   XCD
      On AMD Instinct MI300 and MI350 series GPUs, the Accelerator Complex Die
      (XCD) contains the GPU's computational elements and lower levels of the
      cache hierarchy. See :doc:`../../conceptual/gpu-arch/mi300` for details.

   GCD
      On AMD Instinct MI100 and MI250 series GPUs and AMD Radeon GPUs, the
      Graphics Compute Die (GCD) contains the GPU's computational elements
      and lower levels of the cache hierarchy. See
      :doc:`../../conceptual/gpu-arch/mi250` for details.

   WGP
      A workgroup processor (WGP) is a hardware unit on AMD Radeon GPUs that
      contains two :term:`compute units <Compute units>` and their associated
      resources, enabling efficient scheduling and execution of
      :term:`wavefronts <Wavefront (Warp)>`. See :ref:`hip:rdna_architecture`
      for details.
74  docs/reference/glossary/device-software.rst  Normal file
@@ -0,0 +1,74 @@
.. meta::
   :description: Device software glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, device software, programming model, AMDGPU,
              assembly, IR, GFX IP, wavefront, work-group, HIP kernel,
              thread hierarchy

.. _glossary-device-software:

************************
Device software glossary
************************

This section provides brief definitions of software abstractions and programming
models that run on AMD GPUs.

.. glossary::
   :sorted:

   ROCm programming model
      The ROCm programming model defines how AMD GPUs execute massively
      parallel programs using hierarchical
      :term:`work-groups <Work-group (Block)>`, memory scopes, and barrier
      synchronization. See :ref:`hip:programming_model` for complete details.

   AMDGPU assembly
      AMDGPU assembly (GFX ISA) is the low-level assembly format for programs
      running on AMD GPUs, generated by the
      :term:`ROCm compiler toolchain <HIP compiler>`. See
      :ref:`hip:amdgpu_assembly` for instruction set details.

   AMDGPU intermediate representation
      AMDGPU IR is an intermediate representation for GPU code, serving as a
      virtual instruction set between high-level languages and
      :term:`architecture-specific assembly <AMDGPU assembly>`. See
      :ref:`hip:amdgpu_ir` for compilation details.

   LLVM target name
      The LLVM target name is a string identifier corresponding to a specific
      :term:`GFX IP <GFX IP>` version that is passed to the
      :term:`HIP compiler <HIP compiler>` toolchain to specify the target GPU
      architecture for code generation.
      See :doc:`llvm-project:reference/rocmcc` for details.

   Grid
      A grid represents the collection of all
      :term:`work-groups <Work-group (Block)>` executing a single
      :term:`kernel <HIP kernel>` across the entire GPU. See
      :ref:`hip:inherent_thread_hierarchy_grid` for grid execution details.

   HIP kernel
      A HIP kernel is the unit of GPU code that executes in parallel across
      many :term:`threads <Work-item (Thread)>`, distributed across the GPU's
      :term:`compute units <Compute units>`. See :ref:`hip:device_program` for
      kernel programming details.

   HIP thread hierarchy
      The thread hierarchy structures parallel work from individual
      :term:`threads <Work-item (Thread)>` to
      :term:`blocks <Work-group (Block)>` to :term:`grids <Grid>`, mapping
      onto hardware from :term:`SIMD lanes <SIMD core>` to
      :term:`compute units <Compute units>` to the entire GPU. See
      :ref:`hip:inherent_thread_model` for complete details.
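The thread hierarchy gives each thread a unique global index. A minimal sketch of the 1D indexing arithmetic, written in plain Python to mirror HIP's ``blockIdx.x * blockDim.x + threadIdx.x`` expression (the function name is illustrative, not a ROCm API):

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Unique 1D global index of a thread, as a HIP kernel computes it
    with blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks with 256 threads each covers global IDs 0..1023.
print(global_thread_id(0, 256, 0))    # 0    (first thread of first block)
print(global_thread_id(2, 256, 10))   # 522  (thread 10 of block 2)
print(global_thread_id(3, 256, 255))  # 1023 (last thread of the grid)
```

Kernels typically use this global index to select which array element a given thread processes.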
   HIP memory hierarchy
      The memory hierarchy pairs each
      :term:`thread hierarchy <HIP thread hierarchy>` level with corresponding
      memory scopes, from :term:`private registers <Registers>` to
      :term:`LDS <Local data share>` to :term:`GPU RAM <GPU RAM (VRAM)>`. See
      :ref:`hip:memory_hierarchy` for memory architecture details.

   Global memory
      Global memory is the :term:`device-wide memory <GPU RAM (VRAM)>`
      accessible to all :term:`threads <Work-item (Thread)>`, physically
      implemented as HBM or GDDR. See :ref:`hip:memory_hierarchy` for global
      memory details.
67  docs/reference/glossary/host-software.rst  Normal file
@@ -0,0 +1,67 @@
.. meta::
   :description: Host software glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, host software, HIP, compiler, runtime, libraries,
              profiler, amd-smi

.. _glossary-host-software:

**********************
Host software glossary
**********************

This section provides brief definitions of development tools, compilers,
libraries, and runtime environments for programming AMD GPUs.

.. glossary::
   :sorted:

   ROCm software platform
      ROCm is AMD's GPU software stack, providing compiler
      toolchains, runtime environments, and performance libraries for HPC and
      AI applications. See :doc:`../../what-is-rocm` for a complete component
      overview.

   HIP C++ language extension
      HIP extends the C++ language with additional features designed for
      programming heterogeneous applications. These extensions mostly relate
      to the kernel language, but some can also be applied to host
      functionality. See :doc:`hip:how-to/hip_cpp_language_extensions` for
      language fundamentals.

   AMD SMI
      The ``amd-smi`` command-line utility queries, monitors, and manages
      AMD GPU state, providing hardware information and performance metrics.
      See :doc:`amdsmi:index` for detailed usage.

   HIP runtime API
      The HIP runtime API provides an interface for GPU programming, offering
      functions for memory management, kernel launches, and synchronization.
      See :ref:`hip:hip_runtime_api_how-to` for an API overview.

   HIP compiler
      The HIP compiler, ``amdclang++``, compiles HIP C++ programs into binaries
      that contain both host CPU and device GPU code. See
      :doc:`llvm-project:reference/rocmcc` for compiler flags and options.

   HIP runtime compiler
      The HIP runtime compiler (HIPRTC) compiles HIP source code at runtime
      into :term:`AMDGPU <AMDGPU assembly>` binary code objects, enabling
      just-in-time kernel generation, device-specific optimization, and
      dynamic code creation for different GPUs. See
      :ref:`hip:hip_runtime_compiler_how-to` for API details.

   ROCgdb
      ROCgdb is AMD's source-level debugger for HIP and ROCm applications,
      enabling debugging of both host CPU and GPU device code, including
      kernel breakpoints, stepping, and variable inspection. See
      :doc:`rocgdb:index` for usage and command reference.

   rocprofv3
      ``rocprofv3`` is AMD's primary performance analysis tool, providing
      profiling, tracing, and performance counter collection.
      See :ref:`rocprofiler-sdk:using-rocprofv3` for profiling workflows.

   ROCm and LLVM binary utilities
      ROCm and LLVM binary utilities are command-line tools for examining and
      manipulating GPU binaries and code objects. See
      :ref:`hip:binary_utilities` for utility details.
135  docs/reference/glossary/performance.rst  Normal file
@@ -0,0 +1,135 @@
.. meta::
   :description: Performance glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, performance, optimization, roofline, bottleneck,
              occupancy, bandwidth, latency hiding, divergence

.. _glossary-performance:

*****************************
Performance analysis glossary
*****************************

This section provides brief definitions of performance analysis concepts and
optimization techniques.

.. glossary::
   :sorted:

   Roofline model
      The roofline model is a visual performance model that determines whether
      a program is :term:`compute-bound <Compute-bound>` or
      :term:`memory-bound <Memory-bound>`. See :ref:`hip:roofline_model` for
      roofline analysis.

   Compute-bound
      Compute-bound kernels are limited by the
      :term:`arithmetic bandwidth <Arithmetic bandwidth>` of the GPU's
      :term:`compute units <Compute units>` rather than
      :term:`memory bandwidth <Memory bandwidth>`. See
      :ref:`hip:compute_bound` for compute-bound analysis.

   Memory-bound
      Memory-bound kernels are limited by
      :term:`memory bandwidth <Memory bandwidth>` rather than
      :term:`arithmetic bandwidth <Arithmetic bandwidth>`, typically due to
      low :term:`arithmetic intensity <Arithmetic intensity>`. See
      :ref:`hip:memory_bound` for memory-bound analysis.

   Arithmetic intensity
      Arithmetic intensity is the ratio of arithmetic operations to memory
      operations in a kernel, which determines its performance
      characteristics. See :ref:`hip:arithmetic_intensity` for intensity
      analysis.
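The roofline relationship between these terms can be sketched in a few lines: a kernel is memory-bound when its arithmetic intensity falls below the machine balance (peak FLOP/s divided by peak bytes/s). The device numbers below are illustrative placeholders, not the specifications of any particular GPU:

```python
def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float,
                      peak_gbytes_per_s: float) -> float:
    """Roofline model: attainable throughput is capped either by the
    memory roof (bandwidth * arithmetic intensity) or the compute roof."""
    return min(peak_gflops, peak_gbytes_per_s * intensity_flops_per_byte)

def is_compute_bound(intensity: float,
                     peak_gflops: float,
                     peak_gbytes_per_s: float) -> bool:
    # The ridge point (machine balance) separates the two regimes.
    return intensity >= peak_gflops / peak_gbytes_per_s

# Illustrative device: 20000 GFLOP/s compute roof, 1600 GB/s memory roof,
# so the machine balance is 12.5 FLOP/byte.
print(attainable_gflops(4.0, 20000, 1600))   # 6400.0 (memory-bound region)
print(attainable_gflops(50.0, 20000, 1600))  # 20000  (compute-bound region)
print(is_compute_bound(4.0, 20000, 1600))    # False
```

Raising a kernel's arithmetic intensity (for example, through data reuse in the LDS) moves it rightward along the roofline toward the compute roof.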
   Overhead
      Overhead latency is the time spent with no useful work being done, often
      due to CPU-side bottlenecks or kernel launch delays. See
      :ref:`hip:performance_bottlenecks` for details.

   Little's Law
      Little's Law relates concurrency, latency, and throughput, determining
      how much independent work must be in flight to hide latency. See
      :ref:`hip:littles_law` for latency hiding details.
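Little's Law can be written as concurrency = latency × throughput: to sustain a target throughput against a fixed latency, that much independent work must be in flight at once. A sketch with illustrative numbers (not measured values for any GPU):

```python
def required_in_flight(latency_s: float, throughput_ops_per_s: float) -> float:
    """Little's Law: concurrency = latency * throughput.

    The number of independent operations that must be in flight to keep
    a pipeline fully utilized despite its latency."""
    return latency_s * throughput_ops_per_s

# Illustrative: a 400 ns memory latency at 10^9 requests/s requires
# roughly 400 independent requests in flight to hide the latency.
print(required_in_flight(400e-9, 1e9))  # ~400 requests
```

This is why GPUs rely on many resident wavefronts per compute unit: each wavefront contributes independent operations toward the required concurrency.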
   Memory bandwidth
      Memory bandwidth is the maximum rate at which data can be transferred
      between memory hierarchy levels, typically measured in bytes per
      second. See :ref:`hip:memory_bound` for details.

   Arithmetic bandwidth
      Arithmetic bandwidth is the peak rate at which arithmetic work can be
      performed, defining the compute roof in
      :term:`roofline models <Roofline model>`. See :ref:`hip:compute_bound`
      for details.

   Latency hiding
      Latency hiding masks long-latency operations by running many concurrent
      threads, keeping execution pipelines busy. See :ref:`hip:latency_hiding`
      for details.

   Wavefront execution state
      Wavefront execution states (*active*, *stalled*, *eligible*, *selected*)
      describe the scheduling status of
      :term:`wavefronts <Wavefront (Warp)>` on AMD GPUs. See
      :ref:`hip:wavefront_execution` for state definitions.

   Active cycle
      An active cycle is a clock cycle in which a
      :term:`compute unit <Compute units>` has at least one active
      :term:`wavefront <Wavefront (Warp)>` resident. See
      :ref:`hip:wavefront_execution` for details.

   Occupancy
      Occupancy is the ratio of active :term:`wavefronts <Wavefront (Warp)>`
      to the maximum number of wavefronts that can be active on a
      :term:`compute unit <Compute units>`. See :ref:`hip:occupancy` for
      occupancy analysis.
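Occupancy is a simple ratio, but the active wavefront count in the numerator is itself capped by per-wavefront resource use such as registers. A sketch of both steps; the resource limits below are illustrative placeholders, not the specifications of a particular compute unit:

```python
def occupancy(active_wavefronts: int, max_wavefronts_per_cu: int) -> float:
    """Occupancy = active wavefronts / maximum resident wavefronts per CU."""
    return active_wavefronts / max_wavefronts_per_cu

def wavefronts_limited_by_registers(vgprs_per_cu: int,
                                    vgprs_per_wavefront: int,
                                    max_wavefronts_per_cu: int) -> int:
    # Each resident wavefront needs its own VGPR allocation, so high
    # per-wavefront register usage lowers the number that can be active.
    return min(max_wavefronts_per_cu, vgprs_per_cu // vgprs_per_wavefront)

# Illustrative CU: at most 32 resident wavefronts, 2048 VGPR slots.
waves = wavefronts_limited_by_registers(2048, 128, 32)
print(waves)                 # 16
print(occupancy(waves, 32))  # 0.5
```

Halving per-wavefront register use in this sketch would restore full occupancy, which is the basic trade-off behind register-pressure tuning.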
   Pipe utilization
      Pipe utilization measures how effectively a kernel uses the execution
      pipelines within each :term:`compute unit <Compute units>`. See
      :ref:`hip:pipe_utilization` for utilization details.

   Peak rate
      Peak rate is the theoretical maximum throughput at which a hardware
      system can complete work under ideal conditions. See
      :ref:`hip:theoretical_performance_limits` for details.

   Issue efficiency
      Issue efficiency measures how effectively the
      :term:`wavefront scheduler <Wavefront scheduler>` keeps
      execution pipelines busy by issuing instructions. See
      :ref:`hip:issue_efficiency` for efficiency metrics.

   CU utilization
      CU utilization measures the percentage of time that
      :term:`compute units <Compute units>` are actively executing
      instructions. See :ref:`hip:cu_utilization` for utilization analysis.

   Wavefront divergence
      Wavefront divergence occurs when threads within a
      :term:`wavefront <Wavefront (Warp)>` take different execution paths due
      to conditional statements. See :ref:`hip:branch_efficiency` for
      divergence handling details.

   Branch efficiency
      Branch efficiency measures how often all threads within a
      :term:`wavefront <Wavefront (Warp)>` take the same execution path,
      quantifying control-flow uniformity. See :ref:`hip:branch_efficiency`
      for branch analysis.

   Memory coalescing
      Memory coalescing improves :term:`memory bandwidth <Memory bandwidth>`
      utilization by servicing many logical loads or stores with fewer
      physical memory transactions. See :ref:`hip:memory_coalescing_theory`
      for coalescing patterns.

   Bank conflict
      A bank conflict occurs when multiple threads simultaneously access
      different addresses in the same :term:`LDS bank <Local data share>`,
      serializing accesses. See :ref:`hip:bank_conflicts_theory` for details.
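How bank conflicts arise can be sketched with a word-interleaved bank model. The 32-bank, 4-byte-word layout below is an assumption for illustration (the exact bank count and width vary by architecture); the worst-case serialization factor is the largest number of distinct addresses that map to one bank:

```python
def lds_bank(byte_address: int, num_banks: int = 32, bank_width: int = 4) -> int:
    """Bank index of an LDS access, assuming word-interleaved banks."""
    return (byte_address // bank_width) % num_banks

def conflict_degree(addresses: list[int]) -> int:
    """Worst-case serialization factor: the maximum number of *distinct*
    addresses that fall into the same bank. Accesses to the same address
    can typically be broadcast and do not conflict."""
    per_bank: dict[int, int] = {}
    for addr in set(addresses):
        per_bank[lds_bank(addr)] = per_bank.get(lds_bank(addr), 0) + 1
    return max(per_bank.values())

# Stride-1 word accesses hit 32 different banks: no conflict.
print(conflict_degree([4 * i for i in range(32)]))    # 1
# Stride-32 word accesses (128-byte stride) all hit bank 0: 32-way conflict.
print(conflict_degree([128 * i for i in range(32)]))  # 32
```

This is why padding an LDS array by one element per row is a common fix: it shifts each row's starting bank and breaks the stride pattern.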
   Register pressure
      Register pressure occurs when excessive register demand limits the
      number of active :term:`wavefronts <Wavefront (Warp)>` per
      :term:`compute unit <Compute units>`, reducing
      :term:`occupancy <Occupancy>`. See
      :ref:`hip:register_pressure_theory` for details.
@@ -9,6 +9,12 @@ The following tables provide an overview of the hardware specifications for AMD

For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix <https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html>`_.

For a description of the terms used in the tables, see the
:ref:`ROCm glossary <glossary>`. For more detailed information about GPU
architecture and programming models, see the
:ref:`specific documents and guides <gpu-arch-documentation>` or
:doc:`Understanding the HIP programming model <hip:understand/programming_model>`.

.. tab-set::

   .. tab-item:: AMD Instinct GPUs
@@ -1127,125 +1133,3 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
     - 32
     - 11
     - 5
Glossary
========

For more information about the terms used, see the
:ref:`specific documents and guides <gpu-arch-documentation>`, or
:doc:`Understanding the HIP programming model <hip:understand/programming_model>`.

**LLVM target name**

Argument to pass to clang in ``--offload-arch`` to compile code for the given
architecture.

**VRAM**

Amount of memory available on the GPU.

**Compute Units**

Number of compute units on the GPU.

**Wavefront Size**

Number of work items that execute in parallel on a single compute unit. This
is equivalent to the warp size in HIP.

**LDS**

The Local Data Share (LDS) is a low-latency, high-bandwidth scratch pad
memory. It is local to the compute units and can be shared by all work items
in a work group. In HIP, the LDS can be used for shared memory, which is
shared by all threads in a block.

**L3 Cache (CDNA/GCN only)**

Size of the level 3 cache. Shared by all compute units on the same GPU. Caches
data and instructions. Similar to the Infinity Cache on RDNA architectures.

**Infinity Cache (RDNA only)**

Size of the Infinity Cache. Shared by all compute units on the same GPU. Caches
data and instructions. Similar to the L3 cache on CDNA/GCN architectures.

**L2 Cache**

Size of the level 2 cache. Shared by all compute units on the same GCD. Caches
data and instructions.

**Graphics L1 Cache (RDNA only)**

An additional cache level that only exists in RDNA architectures. Local to a
shader array.

**L1 Vector Cache (CDNA/GCN only)**

Size of the level 1 vector data cache. Local to a compute unit. This is the L0
vector cache in RDNA architectures.

**L1 Scalar Cache (CDNA/GCN only)**

Size of the level 1 scalar data cache. Usually shared by several compute
units. This is the L0 scalar cache in RDNA architectures.

**L1 Instruction Cache (CDNA/GCN only)**

Size of the level 1 instruction cache. Usually shared by several compute
units. This is the L0 instruction cache in RDNA architectures.

**L0 Vector Cache (RDNA only)**

Size of the level 0 vector data cache. Local to a compute unit. This is the L1
vector cache in CDNA/GCN architectures.

**L0 Scalar Cache (RDNA only)**

Size of the level 0 scalar data cache. Usually shared by several compute
units. This is the L1 scalar cache in CDNA/GCN architectures.

**L0 Instruction Cache (RDNA only)**

Size of the level 0 instruction cache. Usually shared by several compute
units. This is the L1 instruction cache in CDNA/GCN architectures.

**VGPR File**

Size of the Vector General Purpose Register (VGPR) file. It holds data used in
vector instructions. GPUs with matrix cores also have AccVGPRs, which are
Accumulation General Purpose Vector Registers, used specifically in matrix
instructions.

**SGPR File**

Size of the Scalar General Purpose Register (SGPR) file. Holds data used in
scalar instructions.

**GFXIP**

GFXIP (Graphics IP) is a versioning system used by AMD to identify the GPU
architecture and its instruction set. It helps categorize different generations
of GPUs and their feature sets.

**GFXIP major version**

Defines the GPU's core instruction set and architecture, which determines
compatibility with software stacks such as HIP and OpenCL. For example, a GFXIP
11 major version corresponds to the RDNA 3 (Navi 3x) architecture, influencing
driver support and available compute features.

**GFXIP minor version**

Represents specific variations within a GFXIP major version and affects feature
sets, optimizations, and driver behavior in software stacks such as HIP and
OpenCL. Different GPU models within the same major version can have unique
capabilities, impacting performance and supported instructions.

**GCD**

Graphics Compute Die.

**XCD**

Accelerator Complex Die.
@@ -232,6 +232,18 @@ subtrees:
        title: Data types and precision support
      - file: reference/graph-safe-support.rst
        title: Graph safe support
      - file: reference/glossary.rst
        title: ROCm glossary
        subtrees:
        - entries:
          - file: reference/glossary/device-hardware.rst
            title: Device hardware
          - file: reference/glossary/device-software.rst
            title: Device software
          - file: reference/glossary/host-software.rst
            title: Host software
          - file: reference/glossary/performance.rst
            title: Performance

  - caption: Contribute
    entries: