mirror of
https://github.com/ROCm/ROCm.git
synced 2026-04-05 03:01:17 -04:00
@@ -69,6 +69,7 @@ ROCm documentation is organized into the following categories:
* [Environment variables](./reference/env-variables.rst)
* [Data types and precision support](./reference/precision-support.rst)
* [Graph safe support](./reference/graph-safe-support.rst)
* [ROCm glossary](./reference/glossary.rst)
<!-- markdownlint-enable MD051 -->
:::
24  docs/reference/glossary.rst  Normal file
@@ -0,0 +1,24 @@
.. meta::
   :description: AMD ROCm Glossary
   :keywords: AMD, ROCm, glossary, terminology, device hardware,
              device software, host software, performance

.. _glossary:

********************************************************************************
ROCm glossary
********************************************************************************

This glossary provides concise definitions of key terms and concepts in AMD ROCm
programming. Each entry includes a brief description and a link to detailed
documentation for in-depth information.

The glossary is organized into four sections:

* :doc:`glossary/device-hardware` — Hardware components (for example, compute
  units, cores, memory)
* :doc:`glossary/device-software` — Software abstractions (programming model,
  ISA, thread hierarchy)
* :doc:`glossary/host-software` — Development tools (HIP, compilers, libraries,
  profilers)
* :doc:`glossary/performance` — Performance metrics and optimization concepts
254  docs/reference/glossary/device-hardware.rst  Normal file
@@ -0,0 +1,254 @@
.. meta::
   :description: Device hardware glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, device hardware, compute units, cores, MFMA,
              architecture, register file, cache, HBM

.. _glossary-device-hardware:

************************
Device hardware glossary
************************

This section provides concise definitions of hardware components and architectural
features of AMD GPUs.

.. glossary::
   :sorted:

   AMD device architecture
      AMD device architecture is based on unified, programmable compute
      engines known as :term:`compute units (CUs) <Compute units>`. See
      :ref:`hip:hardware_implementation` for details.

   Compute units
      Compute units (CUs) are the fundamental programmable execution engines
      in AMD GPUs, capable of running complex programs. See
      :ref:`hip:compute_unit` for details.

   ALU
      Arithmetic logic units (ALUs) are the primary arithmetic engines that
      execute mathematical and logical operations within
      :term:`compute units <Compute units>`. See :ref:`hip:valu` for details.

   SALU
      Scalar :term:`ALUs <ALU>` (SALUs) operate on a single value per
      :term:`wavefront <Wavefront (Warp)>` and manage all control flow.

   VALU
      Vector :term:`ALUs <ALU>` (VALUs) perform an arithmetic or logical
      operation on data for each :term:`work-item <Work-item (Thread)>` in a
      :term:`wavefront <Wavefront (Warp)>`, enabling data-parallel execution.

   Special function unit
      Special function units (SFUs) accelerate transcendental and reciprocal
      mathematical functions such as ``exp``, ``log``, ``sin``, and ``cos``.
      See :ref:`hip:sfu` for details.

   Load/store unit
      Load/store units (LSUs) handle data transfer between
      :term:`compute units <Compute units>` and the GPU's memory subsystems,
      managing thousands of concurrent memory operations. See :ref:`hip:lsu`
      for details.

   Work-group (Block)
      A work-group (also called a block) is a collection of
      :term:`wavefronts <Wavefront (Warp)>` scheduled together on a single
      :term:`compute unit <Compute units>` that can coordinate through
      :term:`local data share <Local data share>` memory. See
      :ref:`hip:inherent_thread_hierarchy_block` for work-group details.

   Work-item (Thread)
      A work-item (also called a thread) is the smallest unit of execution on
      an AMD GPU and represents a single element of work. See
      :ref:`hip:work-item` for thread hierarchy details.

   Wavefront (Warp)
      A wavefront (also called a warp) is a group of
      :term:`work-items <Work-item (Thread)>` that execute in parallel on a
      single :term:`compute unit <Compute units>`, sharing one
      instruction stream. See :ref:`hip:wavefront` for execution details.

   Wavefront scheduler
      The wavefront scheduler in each :term:`compute unit <Compute units>`
      decides which :term:`wavefront <Wavefront (Warp)>` to execute each
      clock cycle, enabling rapid context switching for latency hiding. See
      :ref:`hip:wave-scheduling` for details.

   Wavefront size
      The wavefront size is the number of
      :term:`work-items <Work-item (Thread)>` that execute together in a
      single :term:`wavefront <Wavefront (Warp)>`. For AMD Instinct GPUs, the
      wavefront size is 64 threads, while AMD Radeon GPUs have a wavefront
      size of 32 threads. See :ref:`hip:wavefront` for details.
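Because a partially filled wavefront still occupies a full scheduling slot, the mapping from work-group size to wavefront count is a ceiling division. A minimal sketch of the arithmetic (plain Python for illustration, not a ROCm API):

```python
def wavefronts_per_work_group(work_group_size: int, wavefront_size: int) -> int:
    """Number of wavefronts needed to run one work-group.

    A partially filled wavefront still occupies a full wavefront slot,
    so the division rounds up.
    """
    return -(-work_group_size // wavefront_size)  # ceiling division

# A 256-thread work-group maps to 4 wavefronts on AMD Instinct (size 64)
# and 8 wavefronts on AMD Radeon (size 32).
print(wavefronts_per_work_group(256, 64))  # 4
print(wavefronts_per_work_group(256, 32))  # 8
# 100 threads need 2 wavefronts of 64; the second is only partially filled.
print(wavefronts_per_work_group(100, 64))  # 2
```

This is one reason work-group sizes are usually chosen as a multiple of the wavefront size: otherwise the last wavefront carries inactive lanes.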
   SIMD core
      SIMD cores are execution lanes that perform scalar and vector arithmetic
      operations inside each :term:`compute unit <Compute units>`. See
      :ref:`hip:cdna_architecture` and :ref:`hip:rdna_architecture` for
      details.

   Matrix cores (MFMA units)
      Matrix cores (MFMA units) are specialized execution units that perform
      large-scale matrix operations in a single instruction, delivering high
      throughput for AI and HPC workloads. See :ref:`hip:mfma_units` for
      details.

   Data movement engine
      Data movement engines (DMEs) are specialized hardware units in AMD
      Instinct MI300 and MI350 series GPUs that accelerate multi-dimensional
      tensor data copies between global memory and on-chip memory. See
      :ref:`hip:dme` for details.

   GFX IP
      GFX IP (Graphics IP) versions are identifiers that specify which
      instruction formats, memory models, and compute features are supported
      by each AMD GPU generation. See :ref:`hip:gfx_ip` for versioning
      information.

   GFX IP major version
      The :term:`GFX IP <GFX IP>` major version represents the GPU's core
      instruction set and architecture. For example, a GFX IP ``11`` major
      version corresponds to the RDNA3 architecture, influencing driver
      support and available compute features. See :ref:`hip:gfx_ip` for
      versioning information.

   GFX IP minor version
      The :term:`GFX IP <GFX IP>` minor version represents specific variations
      within a :term:`GFX IP <GFX IP>` major version and affects feature sets,
      optimizations, and driver behavior. Different GPU models within the same
      major version can have unique capabilities, impacting performance and
      supported instructions. See :ref:`hip:gfx_ip` for versioning
      information.

   Compute unit versioning
      :term:`Compute units <Compute units>` are versioned with
      :term:`GFX IP <GFX IP>` identifiers that define their microarchitectural
      features and instruction set compatibility. See :ref:`hip:gfx_ip` for
      details.

   Register file
      The register file is the primary on-chip memory store in each
      :term:`compute unit <Compute units>`, holding data between arithmetic
      and memory operations. See :ref:`hip:memory_hierarchy` for details.

   SGPR file
      The :term:`SGPR <SGPR>` file is the
      :term:`register file <Register file>` that holds data used by the
      :term:`scalar ALU <SALU>`.

   VGPR file
      The :term:`VGPR <VGPR>` file is the
      :term:`register file <Register file>` that holds data used by the
      :term:`vector ALU <VALU>`. GPUs with
      :term:`matrix cores <Matrix cores (MFMA units)>` also have
      :term:`AccVGPR <AccVGPR>` files, used specifically for matrix
      instructions.

   L0 instruction cache
      On AMD Radeon GPUs, the level 0 (L0) instruction cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`.

   L0 scalar cache
      On AMD Radeon GPUs, the level 0 (L0) scalar data cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`. It provides the
      :term:`scalar ALU <SALU>` with fast access to recently used data.

   L0 vector cache
      On AMD Radeon GPUs, the level 0 (L0) vector data cache is local to each
      :term:`WGP <WGP>` and thus shared between the WGP's
      :term:`compute units <Compute units>`. It provides the
      :term:`vector ALU <VALU>` with fast access to recently used data.

   L1 instruction cache
      On AMD Instinct GPUs, the level 1 (L1) instruction cache is local to
      each :term:`compute unit <Compute units>`. On AMD Radeon GPUs, the
      L1 instruction cache does not exist as a separate cache level, and
      instructions are stored in the
      :term:`L0 instruction cache <L0 instruction cache>`.

   L1 scalar cache
      On AMD Instinct GPUs, the level 1 (L1) scalar data cache is local to
      each :term:`compute unit <Compute units>`, providing the
      :term:`scalar ALU <SALU>` with fast access to recently used data. On AMD
      Radeon GPUs, the L1 scalar cache does not exist as a separate cache
      level, and recently used scalar data is stored in the
      :term:`L0 scalar cache <L0 scalar cache>`.

   L1 vector cache
      On AMD Instinct GPUs, the level 1 (L1) vector data cache is local to
      each :term:`compute unit <Compute units>`, providing the
      :term:`vector ALU <VALU>` with fast access to recently used data. On AMD
      Radeon GPUs, the L1 vector cache does not exist as a separate cache
      level, and recently used vector data is stored in the
      :term:`L0 vector cache <L0 vector cache>`.

   Graphics L1 cache
      On AMD Radeon GPUs, the read-only graphics level 1 (L1) cache is local
      to groups of :term:`WGPs <WGP>` called shader arrays, providing fast
      access to recently used data. AMD Instinct GPUs do not feature the
      graphics L1 cache.

   L2 cache
      On AMD Instinct MI100 series GPUs, the L2 cache is shared across the
      entire chip, while for all other AMD GPUs the L2 caches are shared by
      the :term:`compute units <Compute units>` on the same :term:`GCD <GCD>`
      or :term:`XCD <XCD>`.

   Infinity Cache (L3 cache)
      On AMD Instinct MI300 and MI350 series GPUs and AMD Radeon GPUs, the
      Infinity Cache is the last-level cache of the cache hierarchy. It is
      shared by all :term:`compute units <Compute units>` and
      :term:`WGPs <WGP>` on the GPU.

   GPU RAM (VRAM)
      GPU RAM, also known as :term:`global memory <Global memory>` in the HIP
      programming model, is the large, high-capacity off-chip memory subsystem
      accessible by all :term:`compute units <Compute units>`, forming the
      foundation of the device's :ref:`memory hierarchy <hip:hbm>`.

   Local data share
      Local data share (LDS) is fast on-chip memory local to each
      :term:`compute unit <Compute units>` and shared among
      :term:`work-items <Work-item (Thread)>` in a
      :term:`work-group <Work-group (Block)>`, enabling efficient coordination
      and data reuse. In the HIP programming model, the LDS is known as shared
      memory. See :ref:`hip:lds` for LDS programming details.

   Registers
      Registers are the lowest level of the memory hierarchy, storing
      per-thread temporary variables and intermediate results. See
      :ref:`hip:memory_hierarchy` for register usage details.

   SGPR
      Scalar general-purpose :term:`registers <Registers>` (SGPRs) hold data
      produced and consumed by a :term:`compute unit <Compute units>`'s
      :term:`scalar ALU <SALU>`.

   VGPR
      Vector general-purpose :term:`registers <Registers>` (VGPRs) hold data
      produced and consumed by a :term:`compute unit <Compute units>`'s
      :term:`vector ALU <VALU>`.

   AccVGPR
      Accumulation general-purpose vector registers (AccVGPRs) are a special
      type of :term:`VGPR <VGPR>` used exclusively for matrix operations.

   XCD
      On AMD Instinct MI300 and MI350 series GPUs, the Accelerator Complex Die
      (XCD) contains the GPU's computational elements and lower levels of the
      cache hierarchy. See :doc:`../../conceptual/gpu-arch/mi300` for details.

   GCD
      On AMD Instinct MI100 and MI250 series GPUs and AMD Radeon GPUs, the
      Graphics Compute Die (GCD) contains the GPU's computational elements
      and lower levels of the cache hierarchy. See
      :doc:`../../conceptual/gpu-arch/mi250` for details.

   WGP
      A workgroup processor (WGP) is a hardware unit on AMD Radeon GPUs that
      contains two :term:`compute units <Compute units>` and their associated
      resources, enabling efficient scheduling and execution of
      :term:`wavefronts <Wavefront (Warp)>`. See :ref:`hip:rdna_architecture`
      for details.
74  docs/reference/glossary/device-software.rst  Normal file
@@ -0,0 +1,74 @@
.. meta::
   :description: Device software glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, device software, programming model, AMDGPU,
              assembly, IR, GFX IP, wavefront, work-group, HIP kernel,
              thread hierarchy

.. _glossary-device-software:

************************
Device software glossary
************************

This section provides brief definitions of software abstractions and programming
models that run on AMD GPUs.

.. glossary::
   :sorted:

   ROCm programming model
      The ROCm programming model defines how AMD GPUs execute massively
      parallel programs using hierarchical
      :term:`work-groups <Work-group (Block)>`, memory scopes, and barrier
      synchronization. See :ref:`hip:programming_model` for complete details.

   AMDGPU assembly
      AMDGPU assembly (GFX ISA) is the low-level assembly format for programs
      running on AMD GPUs, generated by the
      :term:`ROCm compiler toolchain <HIP compiler>`. See
      :ref:`hip:amdgpu_assembly` for instruction set details.

   AMDGPU intermediate representation
      AMDGPU IR is an intermediate representation for GPU code, serving as a
      virtual instruction set between high-level languages and
      :term:`architecture-specific assembly <AMDGPU assembly>`. See
      :ref:`hip:amdgpu_ir` for compilation details.

   LLVM target name
      The LLVM target name is a string identifier corresponding to a specific
      :term:`GFX IP <GFX IP>` version that is passed to the
      :term:`HIP compiler <HIP compiler>` toolchain to specify the target GPU
      architecture for code generation.
      See :doc:`llvm-project:reference/rocmcc` for details.

   Grid
      A grid represents the collection of all
      :term:`work-groups <Work-group (Block)>` executing a single
      :term:`kernel <HIP kernel>` across the entire GPU. See
      :ref:`hip:inherent_thread_hierarchy_grid` for grid execution details.

   HIP kernel
      A HIP kernel is the unit of GPU code that executes in parallel across
      many :term:`threads <Work-item (Thread)>`, distributed across the GPU's
      :term:`compute units <Compute units>`. See :ref:`hip:device_program` for
      kernel programming details.

   HIP thread hierarchy
      The thread hierarchy structures parallel work from individual
      :term:`threads <Work-item (Thread)>` to
      :term:`blocks <Work-group (Block)>` to :term:`grids <Grid>`, mapping
      onto hardware from :term:`SIMD lanes <SIMD core>` to
      :term:`compute units <Compute units>` to the entire GPU. See
      :ref:`hip:inherent_thread_model` for complete details.
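The thread hierarchy gives each thread a unique global index. A minimal sketch of the 1D indexing arithmetic, written in plain Python to mirror HIP's ``blockIdx.x * blockDim.x + threadIdx.x`` expression (the function name is illustrative, not a ROCm API):

```python
def global_thread_id(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Unique 1D global index of a thread, as a HIP kernel computes it
    with blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# A grid of 4 blocks with 256 threads each covers global IDs 0..1023.
print(global_thread_id(0, 256, 0))    # 0    (first thread of first block)
print(global_thread_id(2, 256, 10))   # 522  (thread 10 of block 2)
print(global_thread_id(3, 256, 255))  # 1023 (last thread of the grid)
```

Kernels typically use this global index to select which array element a given thread processes.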
   HIP memory hierarchy
      The memory hierarchy pairs each
      :term:`thread hierarchy <HIP thread hierarchy>` level with corresponding
      memory scopes, from :term:`private registers <Registers>` to
      :term:`LDS <Local data share>` to :term:`GPU RAM <GPU RAM (VRAM)>`. See
      :ref:`hip:memory_hierarchy` for memory architecture details.

   Global memory
      Global memory is the :term:`device-wide memory <GPU RAM (VRAM)>`
      accessible to all :term:`threads <Work-item (Thread)>`, physically
      implemented as HBM or GDDR. See :ref:`hip:memory_hierarchy` for global
      memory details.
67  docs/reference/glossary/host-software.rst  Normal file
@@ -0,0 +1,67 @@
.. meta::
   :description: Host software glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, host software, HIP, compiler, runtime, libraries,
              profiler, amd-smi

.. _glossary-host-software:

**********************
Host software glossary
**********************

This section provides brief definitions of development tools, compilers,
libraries, and runtime environments for programming AMD GPUs.

.. glossary::
   :sorted:

   ROCm software platform
      ROCm is AMD's GPU software stack, providing compiler
      toolchains, runtime environments, and performance libraries for HPC and
      AI applications. See :doc:`../../what-is-rocm` for a complete component
      overview.

   HIP C++ language extension
      HIP extends the C++ language with additional features designed for
      programming heterogeneous applications. These extensions mostly relate
      to the kernel language, but some can also be applied to host
      functionality. See :doc:`hip:how-to/hip_cpp_language_extensions` for
      language fundamentals.

   AMD SMI
      The ``amd-smi`` command-line utility queries, monitors, and manages
      AMD GPU state, providing hardware information and performance metrics.
      See :doc:`amdsmi:index` for detailed usage.

   HIP runtime API
      The HIP runtime API provides an interface for GPU programming, offering
      functions for memory management, kernel launches, and synchronization.
      See :ref:`hip:hip_runtime_api_how-to` for an API overview.

   HIP compiler
      The HIP compiler, ``amdclang++``, compiles HIP C++ programs into binaries
      that contain both host CPU and device GPU code. See
      :doc:`llvm-project:reference/rocmcc` for compiler flags and options.

   HIP runtime compiler
      The HIP runtime compiler (HIPRTC) compiles HIP source code at runtime
      into :term:`AMDGPU <AMDGPU assembly>` binary code objects, enabling
      just-in-time kernel generation, device-specific optimization, and
      dynamic code creation for different GPUs. See
      :ref:`hip:hip_runtime_compiler_how-to` for API details.

   ROCgdb
      ROCgdb is AMD's source-level debugger for HIP and ROCm applications,
      enabling debugging of both host CPU and GPU device code, including
      kernel breakpoints, stepping, and variable inspection. See
      :doc:`rocgdb:index` for usage and command reference.

   rocprofv3
      ``rocprofv3`` is AMD's primary performance analysis tool, providing
      profiling, tracing, and performance counter collection.
      See :ref:`rocprofiler-sdk:using-rocprofv3` for profiling workflows.

   ROCm and LLVM binary utilities
      ROCm and LLVM binary utilities are command-line tools for examining and
      manipulating GPU binaries and code objects. See
      :ref:`hip:binary_utilities` for utility details.
135  docs/reference/glossary/performance.rst  Normal file
@@ -0,0 +1,135 @@
.. meta::
   :description: Performance glossary for AMD GPUs
   :keywords: AMD, ROCm, GPU, performance, optimization, roofline, bottleneck,
              occupancy, bandwidth, latency hiding, divergence

.. _glossary-performance:

*****************************
Performance analysis glossary
*****************************

This section provides brief definitions of performance analysis concepts and
optimization techniques.

.. glossary::
   :sorted:

   Roofline model
      The roofline model is a visual performance model that determines whether
      a program is :term:`compute-bound <Compute-bound>` or
      :term:`memory-bound <Memory-bound>`. See :ref:`hip:roofline_model` for
      roofline analysis.

   Compute-bound
      Compute-bound kernels are limited by the
      :term:`arithmetic bandwidth <Arithmetic bandwidth>` of the GPU's
      :term:`compute units <Compute units>` rather than
      :term:`memory bandwidth <Memory bandwidth>`. See
      :ref:`hip:compute_bound` for compute-bound analysis.

   Memory-bound
      Memory-bound kernels are limited by
      :term:`memory bandwidth <Memory bandwidth>` rather than
      :term:`arithmetic bandwidth <Arithmetic bandwidth>`, typically due to
      low :term:`arithmetic intensity <Arithmetic intensity>`. See
      :ref:`hip:memory_bound` for memory-bound analysis.

   Arithmetic intensity
      Arithmetic intensity is the ratio of arithmetic operations to memory
      operations in a kernel, which determines its performance
      characteristics. See :ref:`hip:arithmetic_intensity` for intensity
      analysis.
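The roofline relationship between these terms can be sketched in a few lines: a kernel is memory-bound when its arithmetic intensity falls below the machine balance (peak FLOP/s divided by peak bytes/s). The device numbers below are illustrative placeholders, not the specifications of any particular GPU:

```python
def attainable_gflops(intensity_flops_per_byte: float,
                      peak_gflops: float,
                      peak_gbytes_per_s: float) -> float:
    """Roofline model: attainable throughput is capped either by the
    memory roof (bandwidth * arithmetic intensity) or the compute roof."""
    return min(peak_gflops, peak_gbytes_per_s * intensity_flops_per_byte)

def is_compute_bound(intensity: float,
                     peak_gflops: float,
                     peak_gbytes_per_s: float) -> bool:
    # The ridge point (machine balance) separates the two regimes.
    return intensity >= peak_gflops / peak_gbytes_per_s

# Illustrative device: 20000 GFLOP/s compute roof, 1600 GB/s memory roof,
# so the machine balance is 12.5 FLOP/byte.
print(attainable_gflops(4.0, 20000, 1600))   # 6400.0 (memory-bound region)
print(attainable_gflops(50.0, 20000, 1600))  # 20000  (compute-bound region)
print(is_compute_bound(4.0, 20000, 1600))    # False
```

Raising a kernel's arithmetic intensity (for example, through data reuse in the LDS) moves it rightward along the roofline toward the compute roof.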
   Overhead
      Overhead latency is the time spent with no useful work being done, often
      due to CPU-side bottlenecks or kernel launch delays. See
      :ref:`hip:performance_bottlenecks` for details.

   Little's Law
      Little's Law relates concurrency, latency, and throughput, determining
      how much independent work must be in flight to hide latency. See
      :ref:`hip:littles_law` for latency hiding details.
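Little's Law can be written as concurrency = latency × throughput: to sustain a target throughput against a fixed latency, that much independent work must be in flight at once. A sketch with illustrative numbers (not measured values for any GPU):

```python
def required_in_flight(latency_s: float, throughput_ops_per_s: float) -> float:
    """Little's Law: concurrency = latency * throughput.

    The number of independent operations that must be in flight to keep
    a pipeline fully utilized despite its latency."""
    return latency_s * throughput_ops_per_s

# Illustrative: a 400 ns memory latency at 10^9 requests/s requires
# roughly 400 independent requests in flight to hide the latency.
print(required_in_flight(400e-9, 1e9))  # ~400 requests
```

This is why GPUs rely on many resident wavefronts per compute unit: each wavefront contributes independent operations toward the required concurrency.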
   Memory bandwidth
      Memory bandwidth is the maximum rate at which data can be transferred
      between memory hierarchy levels, typically measured in bytes per
      second. See :ref:`hip:memory_bound` for details.

   Arithmetic bandwidth
      Arithmetic bandwidth is the peak rate at which arithmetic work can be
      performed, defining the compute roof in
      :term:`roofline models <Roofline model>`. See :ref:`hip:compute_bound`
      for details.

   Latency hiding
      Latency hiding masks long-latency operations by running many concurrent
      threads, keeping execution pipelines busy. See :ref:`hip:latency_hiding`
      for details.

   Wavefront execution state
      Wavefront execution states (*active*, *stalled*, *eligible*, *selected*)
      describe the scheduling status of
      :term:`wavefronts <Wavefront (Warp)>` on AMD GPUs. See
      :ref:`hip:wavefront_execution` for state definitions.

   Active cycle
      An active cycle is a clock cycle in which a
      :term:`compute unit <Compute units>` has at least one active
      :term:`wavefront <Wavefront (Warp)>` resident. See
      :ref:`hip:wavefront_execution` for details.

   Occupancy
      Occupancy is the ratio of active :term:`wavefronts <Wavefront (Warp)>`
      to the maximum number of wavefronts that can be active on a
      :term:`compute unit <Compute units>`. See :ref:`hip:occupancy` for
      occupancy analysis.
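Occupancy is a simple ratio, but the active wavefront count in the numerator is itself capped by per-wavefront resource use such as registers. A sketch of both steps; the resource limits below are illustrative placeholders, not the specifications of a particular compute unit:

```python
def occupancy(active_wavefronts: int, max_wavefronts_per_cu: int) -> float:
    """Occupancy = active wavefronts / maximum resident wavefronts per CU."""
    return active_wavefronts / max_wavefronts_per_cu

def wavefronts_limited_by_registers(vgprs_per_cu: int,
                                    vgprs_per_wavefront: int,
                                    max_wavefronts_per_cu: int) -> int:
    # Each resident wavefront needs its own VGPR allocation, so high
    # per-wavefront register usage lowers the number that can be active.
    return min(max_wavefronts_per_cu, vgprs_per_cu // vgprs_per_wavefront)

# Illustrative CU: at most 32 resident wavefronts, 2048 VGPR slots.
waves = wavefronts_limited_by_registers(2048, 128, 32)
print(waves)                 # 16
print(occupancy(waves, 32))  # 0.5
```

Halving per-wavefront register use in this sketch would restore full occupancy, which is the basic trade-off behind register-pressure tuning.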
   Pipe utilization
      Pipe utilization measures how effectively a kernel uses the execution
      pipelines within each :term:`compute unit <Compute units>`. See
      :ref:`hip:pipe_utilization` for utilization details.

   Peak rate
      Peak rate is the theoretical maximum throughput at which a hardware
      system can complete work under ideal conditions. See
      :ref:`hip:theoretical_performance_limits` for details.

   Issue efficiency
      Issue efficiency measures how effectively the
      :term:`wavefront scheduler <Wavefront scheduler>` keeps
      execution pipelines busy by issuing instructions. See
      :ref:`hip:issue_efficiency` for efficiency metrics.

   CU utilization
      CU utilization measures the percentage of time that
      :term:`compute units <Compute units>` are actively executing
      instructions. See :ref:`hip:cu_utilization` for utilization analysis.

   Wavefront divergence
      Wavefront divergence occurs when threads within a
      :term:`wavefront <Wavefront (Warp)>` take different execution paths due
      to conditional statements. See :ref:`hip:branch_efficiency` for
      divergence handling details.

   Branch efficiency
      Branch efficiency measures how often all threads within a
      :term:`wavefront <Wavefront (Warp)>` take the same execution path,
      quantifying control-flow uniformity. See :ref:`hip:branch_efficiency`
      for branch analysis.

   Memory coalescing
      Memory coalescing improves :term:`memory bandwidth <Memory bandwidth>`
      utilization by servicing many logical loads or stores with fewer
      physical memory transactions. See :ref:`hip:memory_coalescing_theory`
      for coalescing patterns.

   Bank conflict
      A bank conflict occurs when multiple threads simultaneously access
      different addresses in the same :term:`LDS bank <Local data share>`,
      serializing accesses. See :ref:`hip:bank_conflicts_theory` for details.
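How bank conflicts arise can be sketched with a word-interleaved bank model. The 32-bank, 4-byte-word layout below is an assumption for illustration (the exact bank count and width vary by architecture); the worst-case serialization factor is the largest number of distinct addresses that map to one bank:

```python
def lds_bank(byte_address: int, num_banks: int = 32, bank_width: int = 4) -> int:
    """Bank index of an LDS access, assuming word-interleaved banks."""
    return (byte_address // bank_width) % num_banks

def conflict_degree(addresses: list[int]) -> int:
    """Worst-case serialization factor: the maximum number of *distinct*
    addresses that fall into the same bank. Accesses to the same address
    can typically be broadcast and do not conflict."""
    per_bank: dict[int, int] = {}
    for addr in set(addresses):
        per_bank[lds_bank(addr)] = per_bank.get(lds_bank(addr), 0) + 1
    return max(per_bank.values())

# Stride-1 word accesses hit 32 different banks: no conflict.
print(conflict_degree([4 * i for i in range(32)]))    # 1
# Stride-32 word accesses (128-byte stride) all hit bank 0: 32-way conflict.
print(conflict_degree([128 * i for i in range(32)]))  # 32
```

This is why padding an LDS array by one element per row is a common fix: it shifts each row's starting bank and breaks the stride pattern.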
   Register pressure
      Register pressure occurs when excessive register demand limits the
      number of active :term:`wavefronts <Wavefront (Warp)>` per
      :term:`compute unit <Compute units>`, reducing
      :term:`occupancy <Occupancy>`. See
      :ref:`hip:register_pressure_theory` for details.
@@ -9,6 +9,12 @@ The following tables provide an overview of the hardware specifications for AMD

For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix <https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html>`_.

For a description of the terms used in the tables, see the
:ref:`ROCm glossary <glossary>`. For more detailed information about GPU
architecture and programming models, see the
:ref:`specific documents and guides <gpu-arch-documentation>` or
:doc:`Understanding the HIP programming model <hip:understand/programming_model>`.

.. tab-set::

   .. tab-item:: AMD Instinct GPUs
@@ -1127,125 +1133,3 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
     - 32
     - 11
     - 5
Glossary
========

For more information about the terms used, see the
:ref:`specific documents and guides <gpu-arch-documentation>`, or
:doc:`Understanding the HIP programming model <hip:understand/programming_model>`.

**LLVM target name**

Argument to pass to clang in ``--offload-arch`` to compile code for the given
architecture.

**VRAM**

Amount of memory available on the GPU.

**Compute Units**

Number of compute units on the GPU.

**Wavefront Size**

Number of work items that execute in parallel on a single compute unit. This
is equivalent to the warp size in HIP.

**LDS**

The Local Data Share (LDS) is a low-latency, high-bandwidth scratch pad
memory. It is local to the compute units and can be shared by all work items
in a work group. In HIP, the LDS can be used for shared memory, which is
shared by all threads in a block.

**L3 Cache (CDNA/GCN only)**

Size of the level 3 cache. Shared by all compute units on the same GPU. Caches
data and instructions. Similar to the Infinity Cache on RDNA architectures.

**Infinity Cache (RDNA only)**

Size of the Infinity Cache. Shared by all compute units on the same GPU. Caches
data and instructions. Similar to the L3 cache on CDNA/GCN architectures.

**L2 Cache**

Size of the level 2 cache. Shared by all compute units on the same GCD. Caches
data and instructions.

**Graphics L1 Cache (RDNA only)**

An additional cache level that only exists in RDNA architectures. Local to a
shader array.

**L1 Vector Cache (CDNA/GCN only)**

Size of the level 1 vector data cache. Local to a compute unit. This is the L0
vector cache in RDNA architectures.

**L1 Scalar Cache (CDNA/GCN only)**

Size of the level 1 scalar data cache. Usually shared by several compute
units. This is the L0 scalar cache in RDNA architectures.

**L1 Instruction Cache (CDNA/GCN only)**

Size of the level 1 instruction cache. Usually shared by several compute
units. This is the L0 instruction cache in RDNA architectures.

**L0 Vector Cache (RDNA only)**

Size of the level 0 vector data cache. Local to a compute unit. This is the L1
vector cache in CDNA/GCN architectures.

**L0 Scalar Cache (RDNA only)**

Size of the level 0 scalar data cache. Usually shared by several compute
units. This is the L1 scalar cache in CDNA/GCN architectures.

**L0 Instruction Cache (RDNA only)**

Size of the level 0 instruction cache. Usually shared by several compute
units. This is the L1 instruction cache in CDNA/GCN architectures.

**VGPR File**

Size of the Vector General Purpose Register (VGPR) file. It holds data used in
vector instructions. GPUs with matrix cores also have AccVGPRs, which are
Accumulation General Purpose Vector Registers, used specifically in matrix
instructions.

**SGPR File**

Size of the Scalar General Purpose Register (SGPR) file. Holds data used in
scalar instructions.

**GFXIP**

GFXIP (Graphics IP) is a versioning system used by AMD to identify the GPU
architecture and its instruction set. It helps categorize different generations
of GPUs and their feature sets.

**GFXIP major version**

Defines the GPU's core instruction set and architecture, which determines
compatibility with software stacks such as HIP and OpenCL. For example, a GFXIP
11 major version corresponds to the RDNA 3 (Navi 3x) architecture, influencing
driver support and available compute features.

**GFXIP minor version**

Represents specific variations within a GFXIP major version and affects feature
sets, optimizations, and driver behavior in software stacks such as HIP and
OpenCL. Different GPU models within the same major version can have unique
capabilities, impacting performance and supported instructions.

**GCD**

Graphics Compute Die.

**XCD**

Accelerator Complex Die.
@@ -232,6 +232,18 @@ subtrees:
        title: Data types and precision support
      - file: reference/graph-safe-support.rst
        title: Graph safe support
      - file: reference/glossary.rst
        title: ROCm glossary
        subtrees:
        - entries:
          - file: reference/glossary/device-hardware.rst
            title: Device hardware
          - file: reference/glossary/device-software.rst
            title: Device software
          - file: reference/glossary/host-software.rst
            title: Host software
          - file: reference/glossary/performance.rst
            title: Performance

  - caption: Contribute
    entries: