.. meta::
   :description: Learn about vLLM V1 inference tuning on AMD Instinct GPUs for optimal performance.
   :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
              environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL,
              MIOpen, GPU, resource utilization

.. _mi300x-vllm-optimization:
.. _vllm-optimization:

********************************
vLLM V1 performance optimization
********************************

This guide helps you maximize vLLM throughput and minimize latency on AMD
Instinct MI300X, MI325X, MI350X, and MI355X GPUs. Learn how to:

* Enable AITER (AI Tensor Engine for ROCm) for speedups on LLM models.
* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance.
* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton).
* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments.
* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss.
* Tune engine arguments (batch size, memory utilization, graph modes) for your use case.
* Benchmark and scale across single-node and multi-node configurations.

Performance environment variables
=================================

The following variables are generally useful for Instinct MI300X/MI355X GPUs and vLLM; a combined launch sketch follows the list:

* **HIP and math libraries**

  * ``export HIP_FORCE_DEV_KERNARG=1`` — improves kernel launch performance by
    placing kernel arguments directly in device memory. This is already set by
    default in :doc:`vLLM ROCm Docker images
    </how-to/rocm-for-ai/inference/benchmark-docker/vllm>`. Bare-metal users
    should set this manually.
  * ``export TORCH_BLAS_PREFER_HIPBLASLT=1`` — explicitly prefers hipBLASLt
    over hipBLAS for GEMM operations. By default, PyTorch uses heuristics to
    choose the best BLAS library. Setting this can improve linear layer
    performance in some workloads.

* **RCCL (collectives for multi-GPU)**

  * ``export NCCL_MIN_NCHANNELS=112`` — increases RCCL channels from the default
    (typically 32-64) to 112 on the Instinct MI300X. **Only beneficial for
    multi-GPU distributed workloads** (tensor parallelism, pipeline
    parallelism). Single-GPU inference does not need this.

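As a quick reference, this minimal sketch combines the settings above for a multi-GPU launch; the model path and tensor-parallel size are placeholders to adapt to your deployment:

.. code-block:: bash

   # HIP/math-library settings (already the default in ROCm vLLM Docker images)
   export HIP_FORCE_DEV_KERNARG=1
   export TORCH_BLAS_PREFER_HIPBLASLT=1

   # Only for multi-GPU runs (tensor or pipeline parallelism)
   export NCCL_MIN_NCHANNELS=112

   vllm serve /path/to/model --tensor-parallel-size 8
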
.. _vllm-optimization-aiter-switches:

AITER (AI Tensor Engine for ROCm) switches
==========================================

AITER (AI Tensor Engine for ROCm) provides ROCm-specific fused kernels optimized for Instinct MI350 Series and MI300X GPUs in vLLM V1.

How AITER flags work:

* ``VLLM_ROCM_USE_AITER`` is the master switch (defaults to ``False``/``0``).
* Individual feature flags (``VLLM_ROCM_USE_AITER_LINEAR``, ``VLLM_ROCM_USE_AITER_MOE``, and so on) default to ``True`` but only activate when the master switch is enabled.
* To enable a specific AITER feature, you must set both ``VLLM_ROCM_USE_AITER=1`` and the specific feature flag to ``1``.

Quick start examples:

.. code-block:: bash

   # Enable all AITER optimizations (recommended for most workloads)
   export VLLM_ROCM_USE_AITER=1
   vllm serve MODEL_NAME

   # Enable AITER Fused MoE and Triton Prefill-Decode (split) attention
   export VLLM_ROCM_USE_AITER=1
   export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
   export VLLM_ROCM_USE_AITER_MHA=0
   vllm serve MODEL_NAME

   # Disable AITER entirely (that is, use the vLLM Triton Unified Attention kernel)
   export VLLM_ROCM_USE_AITER=0
   vllm serve MODEL_NAME

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Environment variable
     - Description (default behavior)
   * - ``VLLM_ROCM_USE_AITER``
     - Master switch to enable AITER kernels (``0``/``False`` by default). All other ``VLLM_ROCM_USE_AITER_*`` flags require this to be set to ``1``.
   * - ``VLLM_ROCM_USE_AITER_LINEAR``
     - Use AITER quantization operators + GEMM for linear layers (defaults to ``True`` when AITER is on). Accelerates matrix multiplications in all transformer layers. **Recommended to keep enabled**.
   * - ``VLLM_ROCM_USE_AITER_MOE``
     - Use AITER fused-MoE kernels (defaults to ``True`` when AITER is on). Accelerates Mixture-of-Experts routing and computation. See the note on :ref:`AITER MoE requirements <vllm-optimization-aiter-moe-requirements>`.
   * - ``VLLM_ROCM_USE_AITER_RMSNORM``
     - Use AITER RMSNorm kernels (defaults to ``True`` when AITER is on). Accelerates normalization layers. **Recommended to keep enabled**.
   * - ``VLLM_ROCM_USE_AITER_MLA``
     - Use AITER Multi-head Latent Attention for supported models, for example, DeepSeek-V3/R1 (defaults to ``True`` when AITER is on). See the section on :ref:`AITER MLA requirements <vllm-optimization-aiter-mla-requirements>`.
   * - ``VLLM_ROCM_USE_AITER_MHA``
     - Use AITER Multi-Head Attention kernels (defaults to ``True`` when AITER is on; set to ``0`` to use the Triton unified or Prefill-Decode attention backends instead). See :ref:`attention backend selection <vllm-optimization-aiter-backend-selection>`.
   * - ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION``
     - Enable AITER's optimized unified attention kernel (defaults to ``False``). Only takes effect when AITER is enabled, unified attention mode is active (``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0``), and AITER MHA is disabled (``VLLM_ROCM_USE_AITER_MHA=0``). When disabled, falls back to vLLM's Triton unified attention.
   * - ``VLLM_ROCM_USE_AITER_FP8BMM``
     - Use AITER ``FP8`` batched matmul (defaults to ``True`` when AITER is on). Fuses ``FP8`` per-token quantization with batched GEMM (used in MLA models like DeepSeek-V3). Requires an Instinct MI300X/MI355X GPU.
   * - ``VLLM_ROCM_USE_SKINNY_GEMM``
     - Prefer skinny-GEMM kernel variants for small batch sizes (defaults to ``True``). Improves performance when the ``M`` dimension is small. **Recommended to keep enabled**.
   * - ``VLLM_ROCM_FP8_PADDING``
     - Pad ``FP8`` linear weight tensors to improve memory locality (defaults to ``True``). Minor memory overhead for better performance.
   * - ``VLLM_ROCM_MOE_PADDING``
     - Pad MoE weight tensors for better memory access patterns (defaults to ``True``). Same memory/performance tradeoff as ``FP8`` padding.
   * - ``VLLM_ROCM_CUSTOM_PAGED_ATTN``
     - Use the custom paged-attention decode kernel when the Prefill-Decode attention backend is selected (defaults to ``True``). See :ref:`Attention backend selection with AITER <vllm-optimization-aiter-backend-selection>`.

.. note::

   When ``VLLM_ROCM_USE_AITER=1``, most AITER component flags (``LINEAR``,
   ``MOE``, ``RMSNORM``, ``MLA``, ``MHA``, ``FP8BMM``) automatically default to
   ``True``. You typically only need to set the master switch
   ``VLLM_ROCM_USE_AITER=1`` to enable all optimizations. ROCm provides a
   prebuilt optimized Docker image for validating the performance of LLM
   inference with vLLM on MI300X Series GPUs. The Docker image includes ROCm,
   vLLM, and PyTorch. For more information, see
   :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`.

.. _vllm-optimization-aiter-moe-requirements:

AITER MoE requirements (Mixtral, DeepSeek-V2/V3, Qwen-MoE models)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``VLLM_ROCM_USE_AITER_MOE`` enables AITER's optimized Mixture-of-Experts kernels, such as expert routing (top-k selection) and expert computation, for better performance.

Applicable models:

* Mixtral series: for example, Mixtral-8x7B / Mixtral-8x22B
* Llama-4 family: for example, Llama-4-Scout-17B-16E / Llama-4-Maverick-17B-128E
* DeepSeek family: DeepSeek-V2 / DeepSeek-V3 / DeepSeek-R1
* Qwen family: Qwen1.5-MoE / Qwen2-MoE / Qwen2.5-MoE series
* Other MoE architectures

When to enable:

* **Enable (default):** For all MoE models on the Instinct MI300X/MI355X for best throughput
* **Disable:** Only for debugging or if you encounter numerical issues

Example usage:

.. code-block:: bash

   # Standard MoE model (Mixtral)
   VLLM_ROCM_USE_AITER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1

   # Hybrid MoE+MLA model (DeepSeek-V3) - requires both MoE and MLA flags
   VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V3 \
      --block-size 1 \
      --tensor-parallel-size 8

.. _vllm-optimization-aiter-mla-requirements:

AITER MLA requirements (DeepSeek-V3/R1 models)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``VLLM_ROCM_USE_AITER_MLA`` enables AITER MLA (Multi-head Latent Attention) optimization for supported models. Defaults to ``True`` when AITER is on.

Critical requirement:

* You **must** explicitly set ``--block-size 1``

.. important::

   If you omit ``--block-size 1``, vLLM raises an error rather than defaulting to 1.

Applicable models:

* DeepSeek-V3 / DeepSeek-R1
* DeepSeek-V2
* Other models using the multi-head latent attention (MLA) architecture

Example usage:

.. code-block:: bash

   # DeepSeek-R1 with AITER MLA (requires 8 GPUs)
   VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \
      --block-size 1 \
      --tensor-parallel-size 8

.. _vllm-optimization-aiter-backend-selection:

Attention backend selection with AITER
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Understanding which attention backend to use helps optimize your deployment.

Quick reference: Which attention backend will I get?

**Default behavior (no configuration)**

Without setting any environment variables, vLLM uses:

* **vLLM Triton Unified Attention** — A single Triton kernel handling both prefill and decode phases

  * Works on all ROCm platforms
  * Good baseline performance

**Recommended**: Enable AITER (set ``VLLM_ROCM_USE_AITER=1``)

When you enable AITER, the backend is automatically selected based on your model:

.. code-block:: text

   Is your model using MLA architecture? (DeepSeek-V3/R1/V2)
   ├─ YES → AITER MLA Backend
   │        • Requires --block-size 1
   │        • Best performance for MLA models
   │        • Automatically selected
   │
   └─ NO → AITER MHA Backend
            • For standard transformer models (Llama, Mistral, etc.)
            • Optimized for Instinct MI300X/MI355X
            • Automatically selected

**Advanced**: Manual backend selection

Most users won't need this, but you can override the defaults:

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - To use this backend
     - Set these flags
   * - AITER MLA (MLA models only)
     - ``VLLM_ROCM_USE_AITER=1`` (auto-selects for DeepSeek-V3/R1)
   * - AITER MHA (standard models)
     - ``VLLM_ROCM_USE_AITER=1`` (auto-selects for non-MLA models)
   * - vLLM Triton Unified (default)
     - ``VLLM_ROCM_USE_AITER=0`` (or unset)
   * - Triton Prefill-Decode (split) without AITER
     - ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
   * - Triton Prefill-Decode (split) along with AITER Fused MoE
     - | ``VLLM_ROCM_USE_AITER=1``
       | ``VLLM_ROCM_USE_AITER_MHA=0``
       | ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
   * - AITER Unified Attention
     - | ``VLLM_ROCM_USE_AITER=1``
       | ``VLLM_ROCM_USE_AITER_MHA=0``
       | ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``

**Quick start examples**:

.. code-block:: bash

   # Recommended: Standard model with AITER (Llama, Mistral, Qwen, etc.)
   VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct

   # MLA model with AITER (DeepSeek-V3/R1)
   VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \
      --block-size 1 \
      --tensor-parallel-size 8

   # Advanced: Use Prefill-Decode split (for short input cases) with AITER Fused MoE
   VLLM_ROCM_USE_AITER=1 \
   VLLM_ROCM_USE_AITER_MHA=0 \
   VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
   vllm serve meta-llama/Llama-4-Scout-17B-16E

**Which backend should I choose?**

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Your use case
     - Recommended backend
   * - **Standard transformer models** (Llama, Mistral, Qwen, Mixtral)
     - **AITER MHA** (``VLLM_ROCM_USE_AITER=1``) — **Recommended for most workloads** on Instinct MI300X/MI355X. Provides optimized attention kernels for both prefill and decode phases.
   * - **MLA models** (DeepSeek-V3/R1/V2)
     - **AITER MLA** (auto-selected with ``VLLM_ROCM_USE_AITER=1``) — Required for optimal performance; must use ``--block-size 1``
   * - **gpt-oss models** (gpt-oss-120b/20b)
     - **AITER Unified Attention** (``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``) — Required for optimal performance
   * - **Debugging or compatibility**
     - **vLLM Triton Unified** (default with ``VLLM_ROCM_USE_AITER=0``) — Generic fallback, works everywhere

**Important notes:**

* **AITER MHA and AITER MLA are mutually exclusive** — vLLM automatically detects MLA models and selects the appropriate backend
* **For most users:** Simply set ``VLLM_ROCM_USE_AITER=1`` and let vLLM choose the right backend
* When in doubt, start with AITER enabled (the recommended configuration) and profile your specific workload

Backend choice quick recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **Standard transformers (any prompt length):** Start with ``VLLM_ROCM_USE_AITER=1`` → AITER MHA. For CUDA graph modes, see the architecture-specific guidance below (dense and MoE models have different optimal modes).
* **Latency-sensitive chat (low TTFT):** keep ``--max-num-batched-tokens`` ≤ **8k–16k** with AITER.
* **Streaming decode (low ITL):** raise ``--max-num-batched-tokens`` to **32k–64k**.
* **Offline max throughput:** ``--max-num-batched-tokens`` ≥ **32k** with ``cudagraph_mode=FULL``.

**How to verify which backend is active**

Check vLLM's startup logs to confirm which attention backend is being used:

.. code-block:: bash

   # Start vLLM and check logs
   VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct 2>&1 | grep -i attention

**Expected log messages:**

* AITER MHA: ``Using Aiter Flash Attention backend on V1 engine.``
* AITER MLA: ``Using AITER MLA backend on V1 engine.``
* vLLM Triton MLA: ``Using Triton MLA backend on V1 engine.``
* vLLM Triton Unified: ``Using Triton Attention backend on V1 engine.``
* AITER Triton Unified: ``Using Aiter Unified Attention backend on V1 engine.``
* AITER Triton Prefill-Decode: ``Using Rocm Attention backend on V1 engine.``

Attention backend technical details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section provides technical details about vLLM's attention backends on ROCm.

vLLM V1 on ROCm provides these attention implementations:

1. **vLLM Triton Unified Attention** (default when AITER is **off**)

   * Single unified Triton kernel handling both chunked prefill and decode phases
   * Generic implementation that works across all ROCm platforms
   * Good baseline performance
   * Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset)
   * Supports GPT-OSS

2. **AITER Triton Unified Attention** (advanced, requires manual configuration)

   * The AMD-optimized unified Triton kernel
   * Enable with ``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, and ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``
   * Only useful for specific workloads. Most users should use AITER MHA instead.
   * Recommended when running GPT-OSS

3. **AITER Triton Prefill-Decode Attention** (hybrid, Instinct MI300X-optimized)

   * Enable with ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
   * Uses separate kernels for prefill and decode phases:

     * **Prefill**: ``context_attention_fwd`` Triton kernel
     * **Primary decode**: ``torch.ops._rocm_C.paged_attention`` (custom ROCm kernel optimized for head sizes 64/128, block sizes 16/32, GQA 1–16, context ≤131k; sliding window not supported)
     * **Fallback decode**: ``kernel_paged_attention_2d`` Triton kernel when shapes don't meet the primary decode requirements

   * Usually better than the unified Triton kernels
   * Performance versus AITER MHA varies: AITER MHA is typically faster overall, but the Prefill-Decode split may win in short-input scenarios
   * The custom paged-attention decode kernel is controlled by ``VLLM_ROCM_CUSTOM_PAGED_ATTN`` (default ``True``)

4. **AITER Multi-Head Attention (MHA)** (default when AITER is **on**)

   * Controlled by ``VLLM_ROCM_USE_AITER_MHA`` (``1`` = enabled)
   * Best all-around performance for standard transformer models
   * Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and the model is not MLA

5. **vLLM Triton Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2)

   * Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset)

6. **AITER Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2)

   * Controlled by ``VLLM_ROCM_USE_AITER_MLA`` (``1`` = enabled)
   * Required for optimal performance on MLA architecture models
   * Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and the model uses MLA
   * Requires ``--block-size 1``

Quick Reduce (large all-reduces on ROCm)
========================================

**Quick Reduce** is an alternative to RCCL/custom all-reduce for **large** inputs (MI300-class GPUs).
It supports FP16/BF16 as well as symmetric INT8/INT6/INT4 quantized all-reduce (group size 32).

.. warning::

   Quantization can affect accuracy. Validate quality before deploying.

Control it via:

* ``VLLM_ROCM_QUICK_REDUCE_QUANTIZATION`` ∈ ``["NONE", "FP", "INT8", "INT6", "INT4"]`` (default ``NONE``).
* ``VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16``: cast BF16 input to FP16 (``1``/``True`` by default for performance).
* ``VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB``: cap the preallocated buffer (default ``NONE``, approximately ``2048`` MB).

Quick Reduce tends to help **throughput** at higher TP counts (for example, 4–8) with many concurrent requests. A minimal example follows.

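The following sketch shows one way to try the INT8 quantized all-reduce at TP=8; the model path is a placeholder, and per the warning above you should validate output quality before adopting it:

.. code-block:: bash

   # Enable Quick Reduce with symmetric INT8 quantized all-reduce
   export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8

   vllm serve /path/to/model --tensor-parallel-size 8
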
Parallelism strategies (run vLLM on multiple GPUs)
==================================================

vLLM supports the following parallelism strategies:

1. Tensor parallelism
2. Pipeline parallelism
3. Data parallelism
4. Expert parallelism

For more details, see `Parallelism and scaling <https://docs.vllm.ai/en/stable/serving/parallelism_scaling.html>`_.

**Choosing the right strategy:**

* **Tensor Parallelism (TP)**: Use when the model doesn't fit on one GPU. Prefer staying within a single XGMI island (≤8 GPUs on the Instinct MI300X).
* **Pipeline Parallelism (PP)**: Use for very large models across nodes. Set TP to the number of GPUs per node and scale with PP across nodes.
* **Data Parallelism (DP)**: Use when the model fits on a single GPU or TP group and you need higher throughput. Combine with TP/PP for large models, as in the sketch after this list.
* **Expert Parallelism (EP)**: Use for MoE models with ``--enable-expert-parallel``. More efficient than TP for MoE layers.

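As one illustration of combining strategies, a hypothetical DP=2 × TP=4 layout on an 8-GPU node might look like this (the model path is a placeholder; the environment variable and flag follow the ROCm data parallelism tip later in this section):

.. code-block:: bash

   # Two replicas, each sharded across 4 GPUs (8 GPUs total)
   VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve /path/to/model \
      --data-parallel-size 2 \
      --tensor-parallel-size 4 \
      --disable-nccl-for-dp-synchronization
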
Tensor parallelism
^^^^^^^^^^^^^^^^^^

Tensor parallelism splits each layer of the model weights across multiple GPUs when the model doesn't fit on a single GPU. This is primarily for memory capacity.

**Use tensor parallelism when:**

* The model does not fit on one GPU (OOM)
* You need to enable larger batch sizes by distributing the KV cache across GPUs

**Examples:**

.. code-block:: bash

   # Tensor parallelism: Split model across 2 GPUs
   vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2

   # Two vLLM instances, each split across 2 GPUs (4 GPUs total)
   CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8000
   CUDA_VISIBLE_DEVICES=2,3 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8001

.. note::

   **ROCm GPU visibility:** vLLM on ROCm reads ``CUDA_VISIBLE_DEVICES``. Keep ``HIP_VISIBLE_DEVICES`` unset to avoid conflicts.

.. tip::

   For structured data parallelism deployments with load balancing, see :ref:`data-parallelism-section`.

Pipeline parallelism
^^^^^^^^^^^^^^^^^^^^

Pipeline parallelism splits the model's layers across multiple GPUs or nodes, with each GPU processing different layers sequentially. This is primarily used for multi-node deployments where the model is too large for a single node.

**Use pipeline parallelism when:**

* The model is too large for a single node (combine PP with TP)
* GPUs on a node lack a high-speed interconnect (for example, no NVLink/XGMI); PP may perform better than TP
* The GPU count doesn't evenly divide the model (PP supports uneven splits)

**Common pattern for multi-node:**

.. code-block:: bash

   # 2 nodes × 8 GPUs = 16 GPUs total
   # TP=8 per node, PP=2 across nodes
   vllm serve meta-llama/Llama-3.1-405B-Instruct \
      --tensor-parallel-size 8 \
      --pipeline-parallel-size 2

.. note::

   **ROCm best practice**: On the Instinct MI300X, prefer staying within a single XGMI island (≤8 GPUs) using TP only. Use PP when scaling beyond eight GPUs or across nodes.

.. _data-parallelism-section:

Data parallelism
^^^^^^^^^^^^^^^^

Data parallelism replicates model weights across separate instances/GPUs to process independent batches of requests. This approach increases throughput by distributing the workload across multiple replicas.

**Use data parallelism when:**

* The model fits on one GPU, but you need higher request throughput
* Scaling horizontally across multiple nodes
* Combining with tensor parallelism (for example, DP=2 + TP=4 = 8 GPUs total)

**Quick start - single-node:**

.. code-block:: bash

   # Model fits on 1 GPU. Creates 2 model replicas (requires 2 GPUs)
   VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve /path/to/model \
      --data-parallel-size 2 \
      --disable-nccl-for-dp-synchronization

.. tip::

   For ROCm, currently use ``VLLM_ALL2ALL_BACKEND="allgather_reducescatter"`` and ``--disable-nccl-for-dp-synchronization`` with data parallelism.

Choosing a load balancing strategy
""""""""""""""""""""""""""""""""""

vLLM supports two modes for routing requests to DP ranks:

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * -
     - **Internal LB** (recommended)
     - **External LB**
   * - **HTTP endpoints**
     - 1 endpoint, vLLM routes internally
     - N endpoints, you provide an external router
   * - **Single-node config**
     - ``--data-parallel-size N``
     - ``--data-parallel-size N --data-parallel-rank 0..N-1`` + different ports
   * - **Multi-node config**
     - ``--data-parallel-size``, ``--data-parallel-size-local``, ``--data-parallel-address``
     - ``--data-parallel-size N --data-parallel-rank 0..N-1`` + ``--data-parallel-address``
   * - **Client view**
     - Single URL/port
     - Multiple URLs/ports
   * - **Load balancer**
     - Built-in (vLLM handles it)
     - External (Nginx, Kong, K8s Service)
   * - **Coordination**
     - DP ranks sync via RPC (for MoE/MLA)
     - DP ranks sync via RPC (for MoE/MLA)
   * - **Best for**
     - Most deployments (simpler)
     - K8s/cloud environments with an existing LB

.. tip::

   **Dense (non-MoE) models only:** You can run fully independent ``vllm serve`` instances without any DP flags, using your own load balancer. This avoids RPC coordination overhead entirely.

For more technical details, see `vLLM Data Parallel Deployment <https://docs.vllm.ai/en/stable/serving/data_parallel_deployment.html>`_.

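As an illustrative sketch of the external LB mode from the table (single-node column), the following launches two DP ranks on separate ports behind your own load balancer; the model path is a placeholder, and additional coordination flags may be needed depending on your vLLM version:

.. code-block:: bash

   # Rank 0 and rank 1 each expose their own HTTP endpoint
   CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/model \
      --data-parallel-size 2 --data-parallel-rank 0 --port 8000 &
   CUDA_VISIBLE_DEVICES=1 vllm serve /path/to/model \
      --data-parallel-size 2 --data-parallel-rank 1 --port 8001 &
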
Data Parallel Attention (advanced)
""""""""""""""""""""""""""""""""""

For models with a Multi-head Latent Attention (MLA) architecture, like DeepSeek V2, V3, and R1, vLLM supports **Data Parallel Attention**,
which provides request-level parallelism instead of model replication. This avoids KV cache duplication across tensor parallel ranks,
significantly reducing memory usage and enabling larger batch sizes.

**Key benefits for MLA models:**

* Eliminates KV cache duplication when using tensor parallelism
* Enables higher throughput for high-QPS serving scenarios
* Better memory efficiency for large context windows

**Usage with Expert Parallelism:**

Data parallel attention works seamlessly with expert parallelism for MoE models:

.. code-block:: bash

   # DeepSeek-R1 with DP attention and expert parallelism
   VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \
      --data-parallel-size 8 \
      --enable-expert-parallel \
      --disable-nccl-for-dp-synchronization

For more technical details, see `vLLM RFC #16037 <https://github.com/vllm-project/vllm/issues/16037>`_.

Expert parallelism
^^^^^^^^^^^^^^^^^^

Expert parallelism (EP) distributes the expert layers of Mixture-of-Experts (MoE) models across multiple GPUs,
where tokens are routed to the GPUs holding the experts they need.

**Performance considerations:**

Expert parallelism is designed primarily for cross-node MoE deployments where high-bandwidth interconnects (like InfiniBand) between nodes make EP communication efficient. For single-node Instinct MI300X/MI355X deployments with XGMI connectivity, tensor parallelism typically provides better performance due to optimized all-to-all collectives on XGMI.

**When to use EP:**

* Multi-node MoE deployments with fast inter-node networking
* Models with very large numbers of experts that benefit from expert distribution
* Workloads where EP's reduced data movement outweighs its communication overhead

**Single-node recommendation:** For Instinct MI300X/MI355X within a single node (≤8 GPUs), prefer tensor parallelism over expert parallelism for MoE models to leverage XGMI's high bandwidth and low latency.

**Basic usage:**

.. code-block:: bash

   # Enable expert parallelism for MoE models (DeepSeek example with 8 GPUs)
   vllm serve deepseek-ai/DeepSeek-R1 \
      --tensor-parallel-size 8 \
      --enable-expert-parallel

**Combining with Tensor Parallelism:**

When EP is enabled alongside tensor parallelism:

* Fused MoE layers use expert parallelism
* Non-fused MoE layers use tensor parallelism

**Combining with Data Parallelism:**

EP works seamlessly with Data Parallel Attention for optimal memory efficiency in MLA+MoE models (for example, DeepSeek V3):

.. code-block:: bash

   # DP attention + EP for DeepSeek-R1
   VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \
      --data-parallel-size 8 \
      --enable-expert-parallel \
      --disable-nccl-for-dp-synchronization

Throughput benchmarking
=======================

This guide evaluates LLM inference by tokens per second (TPS). vLLM provides a
built-in benchmark:

.. code-block:: bash

   # Synthetic or dataset-driven benchmark
   vllm bench throughput --model /path/to/model [other args]

* **Real-world dataset** (ShareGPT) example:

  .. code-block:: bash

     wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

     vllm bench throughput --model /path/to/model --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json

* **Synthetic**: set fixed ``--input-len`` and ``--output-len`` for reproducible runs, as in the sketch below.

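A minimal synthetic run might look like this; the lengths are illustrative, not recommendations:

.. code-block:: bash

   # Fixed prompt/output lengths give reproducible, comparable numbers
   vllm bench throughput --model /path/to/model \
      --input-len 1024 --output-len 512
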
.. tip::

   **Profiling checklist (ROCm)**

   1. Fix your prompt distribution (ISL/OSL) and **vary one knob at a time** (graph mode, MBT).
   2. Measure **TTFT**, **ITL**, and **TPS** together; don't optimize one in isolation.
   3. Compare graph modes: **PIECEWISE** (balanced) vs. **FULL**/``FULL_DECODE_ONLY`` (max throughput).
   4. Sweep ``--max-num-batched-tokens`` around **8k–64k** to find your latency/throughput balance (a sweep sketch follows).

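A simple loop can automate step 4 of the checklist. This is a sketch, assuming ``vllm bench throughput`` passes engine flags such as ``--max-num-batched-tokens`` through to the engine, as the other examples in this guide do for ``vllm serve``; the model path and lengths are placeholders:

.. code-block:: bash

   # Sweep max-num-batched-tokens and compare TPS across runs
   for mbt in 8192 16384 32768 65536; do
      vllm bench throughput --model /path/to/model \
         --max-num-batched-tokens "$mbt" \
         --input-len 1024 --output-len 512
   done
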
Maximizing instances per node
=============================

To maximize **per-node throughput**, run as many vLLM instances as model memory allows,
balancing KV-cache capacity.

* **HBM capacities**: MI300X = 192 GB HBM3; MI355X = 288 GB HBM3E.

* Up to **eight** single-GPU vLLM instances can run in parallel on an 8-GPU node (one per GPU):

  .. code-block:: bash

     for i in $(seq 0 7); do
        CUDA_VISIBLE_DEVICES="$i" vllm bench throughput \
           -tp 1 --model /path/to/model \
           --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json &
     done

Total throughput from **N** single-GPU instances usually exceeds one instance stretched across **N** GPUs (``-tp N``).

**Model coverage**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B), Qwen2 (7B/72B), Mixtral-8x7B/8x22B, and others. Llama 2 70B
and Llama 3 70B can fit on a single MI300X/MI355X GPU; Llama 3.1 405B fits on a single 8×MI300X/MI355X node.

Configure the gpu-memory-utilization parameter
==============================================

The ``--gpu-memory-utilization`` parameter controls the fraction of GPU memory that the vLLM instance may use for model weights, activations, and the KV cache. The default is **0.9** (90%).

There are two strategies:

1. **Increase** ``--gpu-memory-utilization`` to maximize throughput for a single instance (up to **0.95**).
   Example:

   .. code-block:: bash

      vllm serve meta-llama/Llama-3.3-70B-Instruct \
         --gpu-memory-utilization 0.95 \
         --max-model-len 8192 \
         --port 8000

2. **Decrease** it to pack **multiple** instances on the same GPU (for small models like 7B/8B), keeping the KV-cache viable:

   .. code-block:: bash

      # Instance 1 on GPU 0
      CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
         --gpu-memory-utilization 0.45 \
         --max-model-len 4096 \
         --port 8000

      # Instance 2 on GPU 0
      CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-Guard-3-8B \
         --gpu-memory-utilization 0.45 \
         --max-model-len 4096 \
         --port 8001

vLLM engine arguments
=====================

Selected arguments that often help on ROCm. See `Engine Arguments
<https://docs.vllm.ai/en/stable/configuration/engine_args.html>`__ in the vLLM
documentation for the full list.

Configure --max-num-seqs
^^^^^^^^^^^^^^^^^^^^^^^^

The default value is **1024** in vLLM V1 (increased from **256** in V0). This flag controls the maximum number of sequences processed per batch, directly affecting concurrency and memory usage.

* **To increase throughput**: Raise to **2048** or **4096** if memory allows, enabling more sequences per iteration.
* **To reduce memory usage**: Lower to **256** or **128** for large models or long-context generation. For example, set ``--max-num-seqs 128`` to reduce concurrency and lower memory requirements.

In vLLM V1, KV-cache token requirements are computed as ``max-num-seqs * max-model-len``.

Example usage:

.. code-block:: bash

   vllm serve <model> --max-num-seqs 128 --max-model-len 8192

Configure --max-num-batched-tokens
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Chunked prefill is enabled by default** in vLLM V1.

* Lower values improve **ITL** (less prefill interrupting decode).
* Higher values improve **TTFT** (more prefill per batch).

Defaults: **8192** for online serving, **16384** for offline. However, optimal values vary significantly by model size. Smaller models can efficiently handle larger batch sizes. Setting it near ``--max-model-len`` mimics V0 behavior and often maximizes throughput.

**Guidance:**

* **Interactive (low TTFT)**: keep MBT ≤ **8k–16k**.
* **Streaming (low ITL)**: MBT **16k–32k**.
* **Offline max throughput**: MBT **≥32k** (diminishing TPS returns beyond ~32k).

**Pattern:** Smaller, more efficient models benefit from larger batch sizes. MoE models with expert parallelism can handle very large batches efficiently.

**Rule of thumb**

* Push MBT **up** to trade TTFT↑ for ITL↓ and slightly higher TPS.
* Pull MBT **down** to trade ITL↑ for TTFT↓ (interactive UX).

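For example, the two ends of this tradeoff look like the following; the values are illustrative starting points from the guidance above, not tuned recommendations:

.. code-block:: bash

   # Interactive serving: favor TTFT
   vllm serve /path/to/model --max-num-batched-tokens 8192

   # Offline throughput: favor TPS
   vllm serve /path/to/model --max-num-batched-tokens 32768
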
Async scheduling
^^^^^^^^^^^^^^^^

``--async-scheduling`` (which replaces the deprecated ``num_scheduler_steps``) can improve throughput/ITL by trading off TTFT.
Prefer **off** for latency-sensitive serving and **on** for offline batch throughput.

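A minimal sketch for an offline-throughput launch, with a placeholder model path:

.. code-block:: bash

   # Async scheduling trades TTFT for better ITL/TPS in batch workloads
   vllm serve /path/to/model --async-scheduling
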
CUDA graphs configuration
^^^^^^^^^^^^^^^^^^^^^^^^^

CUDA graphs reduce kernel launch overhead by capturing and replaying GPU operations, improving inference throughput. Configure using ``--compilation-config '{"cudagraph_mode": "MODE"}'``.

**Available modes:**

* ``NONE`` — CUDA graphs disabled (debugging)
* ``PIECEWISE`` — Attention stays eager, other ops use CUDA graphs (most compatible)
* ``FULL`` — Full CUDA graphs for all batches (best for small models/prompts)
* ``FULL_DECODE_ONLY`` — Full CUDA graphs only for decode (saves memory in prefill/decode split setups)
* ``FULL_AND_PIECEWISE`` — **(default)** Full graphs for decode + piecewise for prefill (best performance, highest memory)

**Default behavior:** V1 defaults to ``FULL_AND_PIECEWISE`` when piecewise compilation is enabled; otherwise ``NONE``.

**Backend compatibility:** Not all attention backends support all CUDA graph modes. Choose a mode your backend supports:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Attention backend
     - CUDA graph support
   * - vLLM/AITER Triton Unified Attention, vLLM Prefill-Decode Attention
     - Full support (prefill + decode)
   * - AITER MHA, AITER MLA
     - Uniform batches only
   * - vLLM Triton MLA
     - Must exclude attention from the graph — ``PIECEWISE`` required

**Usage examples:**

.. code-block:: bash

   # Default (best performance, highest memory)
   vllm serve meta-llama/Llama-3.1-8B-Instruct

   # Decode-only graphs (lower memory, good for P/D split)
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'

   # Full graphs for offline throughput (small models)
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --compilation-config '{"cudagraph_mode": "FULL"}'

**Migration from legacy flags:**

* ``use_cudagraph=False`` → ``NONE``
* ``use_cudagraph=True, full_cuda_graph=False`` → ``PIECEWISE``
* ``full_cuda_graph=True`` → ``FULL`` (with automatic fallback)

Quantization support
====================

vLLM supports FP4/FP8 (4-bit/8-bit floating point) weight and activation quantization using hardware acceleration on the Instinct MI300X and MI355X.
Quantizing models to FP4/FP8 allows for a **2x-4x** reduction in model memory requirements and up to a **1.6x**
improvement in throughput with minimal impact on accuracy.

vLLM on ROCm supports a variety of quantization needs:

* On-the-fly quantization
* Pre-quantized models through Quark and llm-compressor

Supported quantization methods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vLLM on ROCm supports the following quantization methods for the AMD Instinct MI300 series and Instinct MI355X GPUs:

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 20 30

   * - Method
     - Precision
     - ROCm support
     - Memory reduction
     - Best use case
   * - **FP8** (W8A8)
     - 8-bit float
     - Excellent
     - 2× (50%)
     - Production, balanced speed/accuracy
   * - **PTPC-FP8**
     - 8-bit float
     - Excellent
     - 2× (50%)
     - High throughput, better accuracy than ``FP8``
   * - **AWQ**
     - 4-bit int (W4A16)
     - Good
     - 4× (75%)
     - Large models, memory-constrained
   * - **GPTQ**
     - 4-bit/8-bit int
     - Good
     - 2-4× (50-75%)
     - Pre-quantized models available
   * - **FP8 KV-cache**
     - 8-bit float
     - Excellent
     - KV cache: 50%
     - All inference workloads
   * - **Quark (AMD)**
     - ``FP8``/``MXFP4``
     - Optimized
     - 2-4× (50-75%)
     - AMD pre-quantized models
   * - **compressed-tensors**
     - W8A8 ``INT8``/``FP8``
     - Good
     - 2× (50%)
     - LLM Compressor models

**ROCm support key:**

- Excellent: Fully supported with optimized kernels
- Good: Supported, might not have AMD-optimized kernels
- Optimized: AMD-specific optimizations available

Using pre-quantized models
^^^^^^^^^^^^^^^^^^^^^^^^^^

AMD provides pre-quantized models optimized for ROCm. These models are ready to use with vLLM.

**AMD Quark-quantized models**:

Available on `Hugging Face <https://huggingface.co/models?other=quark>`_:

* `Llama-3.1-8B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.1-70B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.1-405B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Mixtral-8x7B-Instruct-v0.1-FP8-KV <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`__ (FP8 W8A8)
* `Mixtral-8x22B-Instruct-v0.1-FP8-KV <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.3-70B-Instruct-MXFP4-Preview <https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)
* `Llama-3.1-405B-Instruct-MXFP4-Preview <https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)
* `DeepSeek-R1-0528-MXFP4-Preview <https://huggingface.co/amd/DeepSeek-R1-0528-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)

**Quick start**:

.. code-block:: bash

   # FP8 W8A8 Quark model
   vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \
      --dtype auto

   # MXFP4 Quark model for MI350/MI355
   vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \
      --dtype auto \
      --tensor-parallel-size 1

**Other pre-quantized models**:

- AWQ models: `Hugging Face awq tag <https://huggingface.co/models?other=awq>`_
- GPTQ models: `Hugging Face gptq tag <https://huggingface.co/models?other=gptq>`_
- LLM Compressor models: `Hugging Face compressed-tensors tag <https://huggingface.co/models?other=compressed-tensors>`_

On-the-fly quantization
^^^^^^^^^^^^^^^^^^^^^^^

For models without pre-quantization, vLLM can quantize ``FP16``/``BF16`` models at server startup.

**Supported methods**:

- ``fp8``: Per-tensor ``FP8`` weight and activation quantization
- ``ptpc_fp8``: Per-token-activation, per-channel-weight ``FP8`` (better accuracy at the same ``FP8`` speed). See the `PTPC-FP8 on ROCm blog post <https://blog.vllm.ai/2025/02/24/ptpc-fp8-rocm.html>`_ for details

**Usage:**

.. code-block:: bash

   # On-the-fly FP8 quantization
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --quantization fp8 \
      --dtype auto

   # On-the-fly PTPC-FP8 (recommended as the default)
   vllm serve meta-llama/Llama-3.1-70B-Instruct \
      --quantization ptpc_fp8 \
      --dtype auto \
      --tensor-parallel-size 4

.. note::

   On-the-fly quantization adds two to five minutes of startup time but eliminates the pre-quantization step. For production deployments with frequent restarts, use pre-quantized models.

GPTQ
^^^^

GPTQ is a 4-bit/8-bit weight quantization method that compresses models with minimal accuracy loss. GPTQ
is fully supported on ROCm via HIP-compiled kernels in vLLM.

**ROCm support status**:

- **Fully supported** - GPTQ kernels compile and run on ROCm via HIP
- **Pre-quantized models work** with standard GPTQ kernels

**Recommendation**: For the AMD Instinct MI300X, **AWQ with Triton kernels** or **FP8 quantization** might provide better
performance due to ROCm-specific optimizations, but GPTQ is a viable alternative.

**Using pre-quantized GPTQ models**:

.. code-block:: bash

   # Using a pre-quantized GPTQ model on ROCm
   vllm serve RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 \
      --quantization gptq \
      --dtype auto \
      --tensor-parallel-size 1

**Important notes**:

- **Kernel support:** GPTQ uses standard HIP-compiled kernels on ROCm
- **Performance:** AWQ with Triton kernels might offer better throughput on AMD GPUs due to ROCm optimizations
- **Compatibility:** GPTQ models from Hugging Face work on ROCm with standard performance
- **Use case:** GPTQ is suitable when pre-quantized GPTQ models are readily available

AWQ (Activation-aware Weight Quantization)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AWQ (Activation-aware Weight Quantization) is a 4-bit weight quantization technique that provides excellent
model compression with minimal accuracy loss (<1%). ROCm supports AWQ quantization on the AMD Instinct MI300 series and
MI355X GPUs with vLLM.

**Using pre-quantized AWQ models:**

Many AWQ-quantized models are available on Hugging Face. Use them directly with vLLM:

.. code-block:: bash

   # vLLM serve with an AWQ model
   VLLM_USE_TRITON_AWQ=1 \
   vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
      --quantization awq \
      --tensor-parallel-size 1 \
      --dtype auto

**Important notes:**

* **ROCm requirement:** Set ``VLLM_USE_TRITON_AWQ=1`` to enable Triton-based AWQ kernels on ROCm
* **dtype parameter:** AWQ requires ``--dtype auto`` or ``--dtype float16``. The ``--dtype`` flag controls
  the **activation dtype** (``FP16``/``BF16`` for computations), not the weight dtype. AWQ weights remain INT4
  (4-bit integers) as specified in the model's quantization config, but are dequantized to ``FP16``/``BF16`` during
  matrix multiplication operations.
* **Group size:** 128 is recommended for an optimal performance/accuracy balance
* **Model compatibility:** AWQ is primarily tested on the Llama, Mistral, and Qwen model families

Quark (AMD quantization toolkit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AMD Quark is the AMD quantization toolkit optimized for ROCm. It supports ``FP8 W8A8``, ``MXFP4``, ``W8A8 INT8``, and
other quantization formats with native vLLM integration. The quantization format is automatically inferred
from the model config file, so you can omit ``--quantization quark``.

**Running Quark models:**

.. code-block:: bash

   # FP8 W8A8: Single GPU
   vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \
      --dtype auto \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.90

   # MXFP4: Extreme memory efficiency
   vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \
      --dtype auto \
      --tensor-parallel-size 1 \
      --max-model-len 8192

**Key features:**

- **FP8 models**: ~50% memory reduction, 2× compression
- **MXFP4 models**: ~75% memory reduction, 4× compression
- **Embedded scales**: Quark FP8-KV models include pre-calibrated KV-cache scales
- **Hardware optimized**: Leverages the AMD Instinct MI300 series ``FP8`` acceleration

For creating your own Quark-quantized models, see the `Quark documentation <https://quark.docs.amd.com/latest/>`_.

FP8 KV-cache dtype
^^^^^^^^^^^^^^^^^^

FP8 KV-cache quantization reduces the memory footprint by approximately 50%, enabling longer context lengths
or higher concurrency. ROCm supports FP8 KV-cache with both ``fp8_e4m3`` and ``fp8_e5m2`` formats on
AMD Instinct MI300 series and other CDNA™ GPUs.

Use ``--kv-cache-dtype fp8`` to enable ``FP8`` KV-cache quantization. For best accuracy, use calibrated
scaling factors generated via `LLM Compressor <https://github.com/vllm-project/llm-compressor>`_.
Without calibration, scales are calculated dynamically (``--calculate-kv-scales``) with minimal
accuracy impact.

**Quick start (dynamic scaling)**:

.. code-block:: bash

   # vLLM serve with dynamic FP8 KV-cache
   vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --kv-cache-dtype fp8 \
      --calculate-kv-scales \
      --gpu-memory-utilization 0.90

**Calibrated scaling (advanced)**:

For optimal accuracy, pre-calibrate KV-cache scales using representative data. The calibration process:

#. Runs the model on calibration data (512+ samples recommended)
#. Computes optimal ``FP8`` quantization scales for key/value cache tensors
#. Embeds these scales into the saved model as additional parameters
#. vLLM loads the model and uses the embedded scales automatically when ``--kv-cache-dtype fp8`` is specified

The quantized model can be used like any other model. The embedded scales are stored as part of the model weights.

**Using pre-calibrated models:**

AMD provides ready-to-use models with pre-calibrated ``FP8`` KV cache scales:

* `amd/Llama-3.1-8B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>`_
* `amd/Llama-3.3-70B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV>`_

To verify that a model has pre-calibrated KV cache scales, check ``config.json`` for a ``kv_cache_scheme`` of ``static``, which indicates that pre-calibrated scales are embedded:

.. code-block:: json

   "quantization_config": {
       "kv_cache_scheme": "static"
   }

**Creating your own calibrated model:**

.. code-block:: bash

   # 1. Install LLM Compressor
   pip install llmcompressor

   # 2. Run a calibration script (see the llm-compressor repo for a full example)
   python llama3_fp8_kv_example.py

   # 3. Use the calibrated model in vLLM
   vllm serve ./Meta-Llama-3-8B-Instruct-FP8-KV \
      --kv-cache-dtype fp8

For detailed instructions and the complete calibration script, see the `FP8 KV Cache Quantization Guide <https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md>`_.

**Format options**:

- ``fp8`` or ``fp8_e4m3``: Higher precision (default, recommended)
- ``fp8_e5m2``: Larger dynamic range, slightly lower precision

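If you want the wider dynamic range variant, the flag accepts the explicit format name; the model path here is a placeholder:

.. code-block:: bash

   # e5m2 variant: larger dynamic range, slightly lower precision
   vllm serve /path/to/model --kv-cache-dtype fp8_e5m2
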
Speculative decoding (experimental)
===================================

Recent vLLM versions add support for speculative decoding backends (for example, Eagle-v3). Evaluate them against your model and latency/throughput goals.
Speculative decoding is a technique to reduce latency when the maximum concurrency is low.
Depending on the method, the effective concurrency threshold varies, for example, from 16 to 64.

Example command:

.. code-block:: bash

   vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --trust-remote-code \
      --swap-space 16 \
      --disable-log-requests \
      --tensor-parallel-size 1 \
      --distributed-executor-backend mp \
      --dtype float16 \
      --quantization fp8 \
      --kv-cache-dtype fp8 \
      --no-enable-chunked-prefill \
      --max-num-seqs 300 \
      --max-num-batched-tokens 131072 \
      --gpu-memory-utilization 0.8 \
      --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2, "draft_tensor_parallel_size": 1, "dtype": "float16"}' \
      --port 8001

.. important::

   Increasing ``num_speculative_tokens`` has been observed to lower the draft-token
   acceptance rate and reduce throughput. As a workaround, set
   ``num_speculative_tokens`` to 2 or less.

Multi-node checklist and troubleshooting
========================================

1. Use ``--distributed-executor-backend ray`` across nodes to manage HIP-visible ranks and RCCL communicators. (``ray`` is the default for multi-node runs, so explicitly setting this flag is optional.)
2. Ensure ``/dev/shm`` is shared across ranks (Docker ``--shm-size``, Kubernetes ``emptyDir``), as RCCL uses shared memory for rendezvous.
3. For GPUDirect RDMA, set ``RCCL_NET_GDR_LEVEL=2`` and verify links (``ibstat``). Requires supported NICs (for example, ConnectX-6 or later).
4. Collect RCCL logs: ``RCCL_DEBUG=INFO`` and optionally ``RCCL_DEBUG_SUBSYS=INIT,GRAPH`` for init/graph stalls.

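A sketch combining these settings for a two-node launch follows; the model path and parallel sizes are placeholders, and it assumes a Ray cluster is already running across both nodes:

.. code-block:: bash

   # RCCL diagnostics and GPUDirect RDMA (items 3 and 4 above)
   export RCCL_DEBUG=INFO
   export RCCL_DEBUG_SUBSYS=INIT,GRAPH
   export RCCL_NET_GDR_LEVEL=2

   # Run on the head node: TP=8 within each node, PP=2 across nodes
   vllm serve /path/to/model \
      --tensor-parallel-size 8 \
      --pipeline-parallel-size 2 \
      --distributed-executor-backend ray
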
Further reading
===============

* :doc:`workload`
* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`