Add glossary

Jan Stephan
2026-02-06 16:55:53 +01:00
committed by Istvan Kiss
parent a0f56927ba
commit debd213a58
7 changed files with 400 additions and 0 deletions

View File

@@ -64,6 +64,7 @@ ROCm documentation is organized into the following categories:
<!-- markdownlint-disable MD051 -->
* [ROCm libraries](./reference/api-libraries.md)
* [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md)
* [ROCm glossary](./reference/glossary.rst)
* [GPU hardware specifications](./reference/gpu-arch-specs.rst)
* [Hardware atomics operation support](./reference/gpu-atomics-operation.rst)
* [Environment variables](./reference/env-variables.rst)

View File

@@ -0,0 +1,24 @@
.. meta::
:description: AMD ROCm Glossary
:keywords: AMD, ROCm, glossary, terminology, device hardware,
device software, host software, performance
.. _glossary:
********************************************************************************
Glossary
********************************************************************************
This glossary provides concise definitions of key terms and concepts in AMD ROCm
programming. Each entry includes a brief description and a link to detailed
documentation for in-depth information.
The glossary is organized into four sections:
* :doc:`glossary/device-hardware` — Hardware components (Compute Units, cores,
memory)
* :doc:`glossary/device-software` — Software abstractions (programming model,
ISA, thread hierarchy)
* :doc:`glossary/host-software` — Development tools (HIP, compilers, libraries,
profilers)
* :doc:`glossary/performance` — Performance metrics and optimization concepts

View File

@@ -0,0 +1,84 @@
.. meta::
:description: Device hardware glossary for AMD GPUs
:keywords: AMD, ROCm, GPU, device hardware, compute units, cores, MFMA,
architecture, register file, cache, HBM
.. _glossary-device-hardware:
***************
Device hardware
***************
This section provides brief definitions of hardware components and architectural
features of AMD GPUs.
.. glossary::
:sorted:
AMD device architecture
AMD's device architecture is based on unified, programmable compute
engines called Compute Units. See :ref:`hip:hardware_implementation` for
details.
Compute units
Compute Units (CUs) are the fundamental programmable execution engines
in AMD GPUs that manage thousands of lightweight threads. See
:ref:`hip:compute_unit` for details.
Vector arithmetic logic units
Vector arithmetic logic units (VALUs) are the primary arithmetic engines
that execute mathematical and logical operations within AMD Compute
Units. See :ref:`hip:valu` for details.
Special function unit
Special Function Units (SFUs) accelerate transcendental and reciprocal
mathematical functions such as ``exp``, ``log``, ``sin``, and ``cos``.
See :ref:`hip:sfu` for details.
Load and store unit
Load/Store Units (LSUs) handle data transfer between Compute Units and
the GPU's memory subsystems, managing thousands of concurrent memory
operations. See :ref:`hip:lsu` for details.
Wavefront scheduler
The Wavefront Scheduler in each Compute Unit decides which group of
threads to execute each clock cycle, enabling rapid context switching
for latency hiding. See :ref:`hip:wave-scheduling` for details.
SIMD core
SIMD Cores are execution lanes that perform scalar and vector arithmetic
operations inside each Compute Unit. See :ref:`hip:cdna_architecture`
and :ref:`hip:rdna_architecture` for details.
Matrix core and MFMA
Matrix Cores (MFMA units) are specialized execution units that perform
large-scale matrix operations in a single instruction, delivering high
throughput for AI and HPC workloads. See :ref:`hip:mfma_units` for
details.
Data movement engine
Data Movement Engines (DMEs) are specialized hardware units in CDNA3 and
CDNA4 that accelerate multi-dimensional tensor data copies between
global memory and on-chip memory. See :ref:`hip:dme` for details.
Compute unit versioning
Compute Units are versioned with GFX IP identifiers that define their
microarchitectural features and instruction set compatibility. See
:ref:`hip:gfx_ip` for details.
Register file
The register file is the primary on-chip memory store in each Compute
Unit, holding data between arithmetic and memory operations. See
:ref:`hip:memory_hierarchy` for details.
L1 data cache
The L1 Data Cache is the private on-chip memory associated with each
Compute Unit, providing fast access to recently used data. See
:ref:`hip:vl1`, :ref:`hip:sl1`, and :ref:`hip:memory_coherence` for
details.
GPU RAM and HBM
GPU RAM, also known as global memory in the HIP programming model, is
the large, high-capacity High Bandwidth Memory (HBM) subsystem
accessible by all Compute Units, forming the foundation of the device's
data storage hierarchy. See :ref:`hip:hbm` for details.

View File

@@ -0,0 +1,92 @@
.. meta::
:description: Device software glossary for AMD GPUs
:keywords: AMD, ROCm, GPU, device software, programming model, AMDGPU,
assembly, IR, GFX IP, wavefront, work-group, HIP kernel, thread hierarchy
.. _glossary-device-software:
***************
Device software
***************
This section provides brief definitions of software abstractions and programming
models that run on AMD GPUs.
.. glossary::
:sorted:
ROCm programming model
The ROCm programming model defines how AMD GPUs execute massively
parallel programs through hierarchical work-groups, memory scopes, and
barrier synchronization. See :ref:`hip:programming_model` for complete
details.
AMDGPU assembly
AMDGPU Assembly (GFX ISA) is the low-level assembly format for programs
running on AMD GPUs, generated by the ROCm compiler toolchain. See
:ref:`hip:amdgpu_assembly` for instruction set details.
AMDGPU intermediate representation
AMDGPU IR is an intermediate representation for GPU code, serving as a
virtual instruction set between high-level languages and
architecture-specific assembly. See :ref:`hip:amdgpu_ir` for compilation
details.
GFX IP
GFX IP versions are identifiers that specify which instruction formats,
memory models, and compute features are supported by each AMD GPU
generation. See :ref:`hip:gfx_ip` for versioning information.
Work-item
A work-item (also called a thread) is the smallest unit of execution in
the AMD GPU programming model. See :ref:`hip:work-item` for thread
hierarchy details.
Wavefront
A wavefront is a group of threads that execute together in parallel on a
single Compute Unit, sharing one instruction stream. See
:ref:`hip:wavefront` for execution details.
Work-group
A work-group is a collection of threads scheduled together on a single
Compute Unit that can coordinate through Local Data Share memory. A
work-group may consist of multiple wavefronts that execute in parallel
on the same Compute Unit. See
:ref:`hip:inherent_thread_hierarchy_block` for work-group details.
Grid
A grid represents the collection of all work-groups executing a single
kernel across the entire GPU. See :ref:`hip:inherent_thread_hierarchy_`
for grid execution details.
HIP kernel
A kernel is the unit of GPU code that executes in parallel across many
threads, distributed across the GPU's Compute Units. See
:ref:`hip:device_program` for kernel programming details.
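As an illustrative sketch (the kernel name ``vector_add``, the array length, and the work-group size of 256 are examples, not prescribed by HIP), a kernel and its launch might look like this:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each work-item computes one element of the output array.
   __global__ void vector_add(const float* a, const float* b, float* c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;  // global work-item index
       if (i < n)
           c[i] = a[i] + b[i];
   }

   int main()
   {
       int n = 1 << 20;
       float *a, *b, *c;
       hipMalloc(&a, n * sizeof(float));
       hipMalloc(&b, n * sizeof(float));
       hipMalloc(&c, n * sizeof(float));

       int threads = 256;                         // work-group size
       int blocks = (n + threads - 1) / threads;  // grid size in work-groups
       vector_add<<<blocks, threads>>>(a, b, c, n);
       hipDeviceSynchronize();

       hipFree(a);
       hipFree(b);
       hipFree(c);
       return 0;
   }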
HIP thread hierarchy
The thread hierarchy structures parallel work from individual threads to
work-groups to grids, mapping onto hardware from SIMD lanes to Compute
Units to the entire GPU. See :ref:`hip:inherent_thread_model` for
complete details.
HIP memory hierarchy
The memory hierarchy pairs each thread hierarchy level with
corresponding memory scopes, from private registers to shared LDS to
global HBM. See :ref:`hip:memory_hierarchy` for memory architecture
details.
Registers
Registers are the lowest level of the memory hierarchy, storing per-thread
temporary variables and intermediate results. See
:ref:`hip:memory_hierarchy` for register usage details.
Local data share
Local Data Share (LDS) is fast on-chip memory shared among threads in a
work-group, enabling efficient coordination and data reuse. See
:ref:`hip:lds` for LDS programming details.
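A minimal sketch of LDS usage, assuming a work-group of exactly 256 threads (the kernel name ``block_sum`` and the buffer names are illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each work-group stages its inputs in LDS and reduces them to one value.
   __global__ void block_sum(const float* in, float* out)
   {
       __shared__ float tile[256];              // lives in LDS, one copy per work-group
       int tid = threadIdx.x;
       tile[tid] = in[blockIdx.x * blockDim.x + tid];
       __syncthreads();                         // wait until the whole tile is filled

       // Tree reduction within the work-group (assumes blockDim.x == 256).
       for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
           if (tid < stride)
               tile[tid] += tile[tid + stride];
           __syncthreads();
       }
       if (tid == 0)
           out[blockIdx.x] = tile[0];           // one partial sum per work-group
   }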
Global memory
Global memory is the device-wide memory accessible to all threads,
physically implemented in HBM or GDDR. See
:ref:`hip:memory_hierarchy` for global memory details.

View File

@@ -0,0 +1,66 @@
.. meta::
:description: Host software glossary for AMD GPUs
:keywords: AMD, ROCm, GPU, host software, HIP, compiler, runtime, libraries,
profiler, amd-smi
.. _glossary-host-software:
*************
Host software
*************
This section provides brief definitions of development tools, compilers,
libraries, and runtime environments for programming AMD GPUs.
.. glossary::
:sorted:
ROCm software platform
ROCm is AMD's GPU software stack, providing compiler
toolchains, runtime environments, and performance libraries for HPC and
AI applications. See :doc:`../../what-is-rocm` for a complete component
overview.
HIP C++ language extension
HIP extends the C++ language with additional features designed for
programming heterogeneous applications. These extensions mostly relate
to the kernel language, but some can also be applied to host
functionality. See :doc:`hip:how-to/hip_cpp_language_extensions` for
language fundamentals.
amd-smi
The ``amd-smi`` command-line utility queries, monitors, and manages AMD GPU
state, providing hardware information and performance metrics. See
:doc:`amdsmi:index` for detailed usage.
HIP runtime API
The HIP runtime API provides an interface for GPU programming, offering
functions for memory management, kernel launches, and synchronization. See
:ref:`hip:hip_runtime_api_how-to` for API overview.
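A sketch of typical runtime API usage (buffer size and variable names are illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <cstdio>
   #include <vector>

   int main()
   {
       std::vector<float> host(1024, 1.0f);
       float* device = nullptr;

       // Allocate device (global) memory and check for errors.
       hipError_t err = hipMalloc(&device, host.size() * sizeof(float));
       if (err != hipSuccess) {
           std::printf("hipMalloc failed: %s\n", hipGetErrorString(err));
           return 1;
       }

       // Copy input to the device, run work, then copy results back.
       hipMemcpy(device, host.data(), host.size() * sizeof(float), hipMemcpyHostToDevice);
       // ... kernel launches would go here ...
       hipDeviceSynchronize();  // wait for outstanding device work
       hipMemcpy(host.data(), device, host.size() * sizeof(float), hipMemcpyDeviceToHost);

       hipFree(device);
       return 0;
   }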
HIP compiler
The HIP compiler ``amdclang++`` compiles HIP C++ programs into binaries
containing both host CPU and device GPU code. See
:doc:`llvm-project:reference/rocmcc` for compiler flags and options.
HIP runtime compiler
The HIP Runtime Compiler (HIPRTC) compiles HIP source code at runtime
into AMDGPU binary code objects, enabling just-in-time kernel generation,
device-specific optimization, and dynamic code creation for different
GPUs. See :ref:`hip:hip_runtime_compiler_how-to` for API details.
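A condensed sketch of runtime compilation, omitting error checks (the kernel source, its name ``scale``, and the file name ``scale.cu`` are illustrative):

.. code-block:: cpp

   #include <hip/hiprtc.h>
   #include <hip/hip_runtime.h>
   #include <vector>

   int main()
   {
       const char* src = R"(
           extern "C" __global__ void scale(float* x, float s) {
               x[threadIdx.x] *= s;
           })";

       // Compile the HIP source at runtime into a device code object.
       hiprtcProgram prog;
       hiprtcCreateProgram(&prog, src, "scale.cu", 0, nullptr, nullptr);
       hiprtcCompileProgram(prog, 0, nullptr);

       size_t size = 0;
       hiprtcGetCodeSize(prog, &size);
       std::vector<char> code(size);
       hiprtcGetCode(prog, code.data());
       hiprtcDestroyProgram(&prog);

       // Load the code object and look up the kernel for later launching.
       hipModule_t module;
       hipFunction_t kernel;
       hipModuleLoadData(&module, code.data());
       hipModuleGetFunction(&kernel, module, "scale");
       // ... launch with hipModuleLaunchKernel ...
       hipModuleUnload(module);
       return 0;
   }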
ROCgdb
ROCgdb is AMD's source-level debugger for HIP and ROCm applications,
enabling debugging of both host CPU and GPU device code, including
kernel breakpoints, stepping, and variable inspection. See
:doc:`rocgdb:index` for usage and command reference.
ROCm profiler
The ROCm profiler (``rocprofv3``) is AMD's primary performance analysis
tool, providing profiling, tracing, and performance counter collection.
See :ref:`rocprofiler-sdk:using-rocprofv3` for profiling workflows.
ROCm and LLVM binary utilities
ROCm and LLVM binary utilities are command-line tools for examining and
manipulating GPU binaries and code objects. See
:ref:`hip:binary_utilities` for utility details.

View File

@@ -0,0 +1,121 @@
.. meta::
:description: Performance glossary for AMD GPUs
:keywords: AMD, ROCm, GPU, performance, optimization, roofline, bottleneck,
occupancy, bandwidth, latency hiding, divergence
.. _glossary-performance:
***********
Performance
***********
This section provides brief definitions of performance analysis concepts and
optimization techniques.
.. glossary::
:sorted:
Roofline model
The roofline model is a visual performance model that determines whether
a program is compute-bound or memory-bound. See
:ref:`hip:roofline_model` for roofline analysis.
Compute-bound
Compute-bound kernels are limited by the arithmetic bandwidth of the
GPU's compute units rather than memory bandwidth. See
:ref:`hip:compute_bound` for compute-bound analysis.
Memory-bound
Memory-bound kernels are limited by memory bandwidth rather than
arithmetic throughput, typically due to low arithmetic intensity. See
:ref:`hip:memory_bound` for memory-bound analysis.
Arithmetic intensity
Arithmetic intensity is the ratio of arithmetic operations to memory
operations in a kernel, determining performance characteristics. See
:ref:`hip:arithmetic_intensity` for intensity analysis.
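As an informal sketch of how these terms combine in the roofline model (the symbols here are illustrative, not taken from the linked documentation):

.. math::

   I = \frac{\text{arithmetic operations}}{\text{bytes moved}}, \qquad
   P_{\text{attainable}} = \min\left(P_{\text{peak}},\ I \times B_{\text{peak}}\right)

where :math:`P_{\text{peak}}` is the peak arithmetic bandwidth and :math:`B_{\text{peak}}` is the peak memory bandwidth; kernels with intensity below the ridge point :math:`P_{\text{peak}} / B_{\text{peak}}` are memory-bound, and those above it are compute-bound.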
Overhead
Overhead latency is time spent with no useful work being done, often
from CPU-side bottlenecks or kernel launch delays. See
:ref:`hip:performance_bottlenecks` for details.
Little's Law
Little's Law relates concurrency, latency, and throughput, determining
how much independent work must be in flight to hide latency. See
:ref:`hip:littles_law` for latency hiding details.
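Informally, and with terms chosen here only for illustration:

.. math::

   \text{concurrency in flight} \approx \text{latency} \times \text{throughput}

For example, sustaining a throughput of 4 memory operations per cycle against a 400-cycle memory latency requires roughly 1600 independent operations in flight.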
Memory bandwidth
Memory bandwidth is the maximum rate at which data can be transferred
between memory hierarchy levels, typically measured in bytes per
second. See :ref:`hip:memory_bound` for details.
Arithmetic bandwidth
Arithmetic bandwidth is the peak rate at which arithmetic work can be
performed, defining the compute roof in roofline models. See
:ref:`hip:compute_bound` for details.
Latency hiding
Latency hiding masks long-latency operations by running many concurrent
threads, keeping execution pipelines busy. See :ref:`hip:latency_hiding`
for details.
Wavefront execution state
Wavefront execution states (*active*, *stalled*, *eligible*, *selected*)
describe the scheduling status of wavefronts on AMD GPUs. See
:ref:`hip:wavefront_execution` for state definitions.
Active cycle
An active cycle is a clock cycle in which a Compute Unit has at least
one active wavefront resident. See :ref:`hip:wavefront_execution` for
details.
Occupancy
Occupancy is the ratio of active wavefronts to the maximum number of
wavefronts that can be active on a Compute Unit. See
:ref:`hip:occupancy` for occupancy analysis.
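Expressed as a simple ratio (this formulation paraphrases the definition above):

.. math::

   \text{occupancy} = \frac{\text{active wavefronts per CU}}{\text{maximum resident wavefronts per CU}}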
Pipe utilization
Pipe utilization measures how effectively a kernel uses the execution
pipelines within each Compute Unit. See :ref:`hip:pipe_utilization` for
utilization details.
Peak rate
Peak rate is the theoretical maximum throughput at which a hardware
system can complete work under ideal conditions. See
:ref:`hip:theoretical_performance_limits` for details.
Issue efficiency
Issue efficiency measures how effectively the wavefront scheduler keeps
execution pipelines busy by issuing instructions. See
:ref:`hip:issue_efficiency` for efficiency metrics.
CU utilization
CU utilization measures the percentage of time that Compute Units are
actively executing instructions. See :ref:`hip:cu_utilization` for
utilization analysis.
Wavefront divergence
Wavefront divergence occurs when threads within a wavefront take
different execution paths due to conditional statements. See
:ref:`hip:branch_efficiency` for divergence handling details.
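A small sketch of a divergent branch (the kernel and variable names are illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Even and odd lanes of the same wavefront take different paths, so the
   // hardware executes both branches with part of the wavefront masked off.
   __global__ void divergent(const float* in, float* out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           if (i % 2 == 0)
               out[i] = in[i] * 2.0f;
           else
               out[i] = in[i] + 1.0f;
       }
   }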
Branch efficiency
Branch efficiency measures how often all threads within a wavefront take
the same execution path, quantifying control flow uniformity. See
:ref:`hip:branch_efficiency` for branch analysis.
Memory coalescing
Memory coalescing improves memory bandwidth by servicing many logical
loads or stores with fewer physical memory transactions. See
:ref:`hip:memory_coalescing_theory` for coalescing patterns.
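For illustration, two access patterns that move the same useful data with very different transaction counts (the kernel names and the ``stride`` parameter are examples):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Coalesced: consecutive work-items read consecutive addresses, so a
   // wavefront's loads combine into a few wide memory transactions.
   __global__ void copy_coalesced(const float* in, float* out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           out[i] = in[i];
   }

   // Strided: consecutive work-items touch addresses far apart, so each load
   // falls into a different cache line and many more transactions are issued.
   __global__ void copy_strided(const float* in, float* out, int n, int stride)
   {
       int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
       if (i < n)
           out[i] = in[i];
   }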
Bank conflict
A bank conflict occurs when multiple threads simultaneously access
different addresses in the same LDS bank, serializing accesses. See
:ref:`hip:bank_conflicts_theory` for details.
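A common mitigation is to pad LDS arrays, sketched here under the assumption of a square matrix whose side is a multiple of the tile size and a (32, 32) work-group (names are illustrative):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   constexpr int TILE = 32;

   // Without the +1 padding, work-items reading a column of `tile` would hit
   // the same LDS bank and serialize; the padding shifts each row to a
   // different bank.
   __global__ void transpose_tile(const float* in, float* out, int width)
   {
       __shared__ float tile[TILE][TILE + 1];

       int x = blockIdx.x * TILE + threadIdx.x;
       int y = blockIdx.y * TILE + threadIdx.y;
       tile[threadIdx.y][threadIdx.x] = in[y * width + x];
       __syncthreads();

       int tx = blockIdx.y * TILE + threadIdx.x;  // transposed coordinates
       int ty = blockIdx.x * TILE + threadIdx.y;
       out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
   }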
Register pressure
Register pressure occurs when excessive register demand limits the
number of active wavefronts per Compute Unit, reducing occupancy. See
:ref:`hip:register_pressure_theory` for details.
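One way to influence this trade-off, shown as a sketch (the bound of 256 threads is an example):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // __launch_bounds__ tells the compiler the largest work-group size this
   // kernel will be launched with, letting it cap register usage so that more
   // wavefronts can remain resident per Compute Unit.
   __global__ void __launch_bounds__(256) scale_in_place(float* data, float s)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       data[i] *= s;
   }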

View File

@@ -224,6 +224,18 @@ subtrees:
title: ROCm libraries
- file: reference/rocm-tools.md
title: ROCm tools, compilers, and runtimes
- file: reference/glossary.rst
title: ROCm glossary
subtrees:
- entries:
- file: reference/glossary/device-hardware.rst
title: Device hardware
- file: reference/glossary/device-software.rst
title: Device software
- file: reference/glossary/host-software.rst
title: Host software
- file: reference/glossary/performance.rst
title: Performance
- file: reference/gpu-arch-specs.rst
- file: reference/gpu-atomics-operation.rst
- file: reference/env-variables.rst