.. meta::
:description: Learn about vLLM V1 inference tuning on AMD Instinct GPUs for optimal performance.
:keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL,
MIOpen, GPU, resource utilization
.. _mi300x-vllm-optimization:
.. _vllm-optimization:
********************************
vLLM V1 performance optimization
********************************
This guide helps you maximize vLLM throughput and minimize latency on AMD
Instinct MI300X, MI325X, MI350X, and MI355X GPUs. Learn how to:
* Enable AITER (AI Tensor Engine for ROCm) for speedups on LLM models.
* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance.
* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton).
* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments.
* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss.
* Tune engine arguments (batch size, memory utilization, graph modes) for your use case.
* Benchmark and scale across single-node and multi-node configurations.
Performance environment variables
=================================
The following variables are generally useful for Instinct MI300X/MI355X GPUs and vLLM:
* **HIP and math libraries**
* ``export HIP_FORCE_DEV_KERNARG=1`` — improves kernel launch performance by
passing kernel arguments directly in device memory. This is already set by default in
:doc:`vLLM ROCm Docker images
</how-to/rocm-for-ai/inference/benchmark-docker/vllm>`. Bare-metal users
should set this manually.
* ``export TORCH_BLAS_PREFER_HIPBLASLT=1`` — explicitly prefers hipBLASLt
over hipBLAS for GEMM operations. By default, PyTorch uses heuristics to
choose the best BLAS library. Setting this can improve linear layer
performance in some workloads.
* **RCCL (collectives for multi-GPU)**
* ``export NCCL_MIN_NCHANNELS=112`` — increases RCCL channels from default
(typically 32-64) to 112 on the Instinct MI300X. **Only beneficial for
multi-GPU distributed workloads** (tensor parallelism, pipeline
parallelism). Single-GPU inference does not need this.
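For example, a bare-metal launch might export these settings before starting the server. This is a minimal sketch; the model name and ``--tensor-parallel-size`` value are placeholders for your deployment.
.. code-block:: bash
# Bare-metal setup: HIP kernel-argument and hipBLASLt preferences, plus RCCL channels
export HIP_FORCE_DEV_KERNARG=1
export TORCH_BLAS_PREFER_HIPBLASLT=1
export NCCL_MIN_NCHANNELS=112   # only useful for multi-GPU (TP/PP) workloads
vllm serve MODEL_NAME --tensor-parallel-size 8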
.. _vllm-optimization-aiter-switches:
AITER (AI Tensor Engine for ROCm) switches
==========================================
AITER (AI Tensor Engine for ROCm) provides ROCm-specific fused kernels optimized for Instinct MI350 Series and MI300X GPUs in vLLM V1.
How AITER flags work:
* ``VLLM_ROCM_USE_AITER`` is the master switch (defaults to ``False``/``0``).
* Individual feature flags (``VLLM_ROCM_USE_AITER_LINEAR``, ``VLLM_ROCM_USE_AITER_MOE``, and so on) default to ``True`` but only activate when the master switch is enabled.
* To enable a specific AITER feature, you must set both ``VLLM_ROCM_USE_AITER=1`` and the specific feature flag to ``1``.
Quick start examples:
.. code-block:: bash
# Enable all AITER optimizations (recommended for most workloads)
export VLLM_ROCM_USE_AITER=1
vllm serve MODEL_NAME
# Enable AITER fused MoE with the Triton Prefill-Decode (split) attention backend
export VLLM_ROCM_USE_AITER=1
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
vllm serve MODEL_NAME
# Disable AITER entirely (i.e., use the vLLM Triton Unified Attention kernel)
export VLLM_ROCM_USE_AITER=0
vllm serve MODEL_NAME
.. list-table::
:header-rows: 1
:widths: 30 70
* - Environment variable
- Description (default behavior)
* - ``VLLM_ROCM_USE_AITER``
- Master switch to enable AITER kernels (``0``/``False`` by default). All other ``VLLM_ROCM_USE_AITER_*`` flags require this to be set to ``1``.
* - ``VLLM_ROCM_USE_AITER_LINEAR``
- Use AITER quantization operators + GEMM for linear layers (defaults to ``True`` when AITER is on). Accelerates matrix multiplications in all transformer layers. **Recommended to keep enabled**.
* - ``VLLM_ROCM_USE_AITER_MOE``
- Use AITER fused-MoE kernels (defaults to ``True`` when AITER is on). Accelerates Mixture-of-Experts routing and computation. See the note on :ref:`AITER MoE requirements <vllm-optimization-aiter-moe-requirements>`.
* - ``VLLM_ROCM_USE_AITER_RMSNORM``
- Use AITER RMSNorm kernels (defaults to ``True`` when AITER is on). Accelerates normalization layers. **Recommended: keep enabled.**
* - ``VLLM_ROCM_USE_AITER_MLA``
- Use AITER Multi-head Latent Attention for supported models, for example, DeepSeek-V3/R1 (defaults to ``True`` when AITER is on). See the section on :ref:`AITER MLA requirements <vllm-optimization-aiter-mla-requirements>`.
* - ``VLLM_ROCM_USE_AITER_MHA``
- Use AITER Multi-Head Attention kernels (defaults to ``True`` when AITER is on; set to ``0`` to use Triton attention backends and Prefill-Decode attention backend instead). See :ref:`attention backend selection <vllm-optimization-aiter-backend-selection>`.
* - ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION``
- Enable AITER's optimized unified attention kernel (defaults to ``False``). Only takes effect when: AITER is enabled; unified attention mode is active (``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0``); and AITER MHA is disabled (``VLLM_ROCM_USE_AITER_MHA=0``). When disabled, falls back to vLLM's Triton unified attention.
* - ``VLLM_ROCM_USE_AITER_FP8BMM``
- Use AITER ``FP8`` batched matmul (defaults to ``True`` when AITER is on). Fuses ``FP8`` per-token quantization with batched GEMM (used in MLA models like DeepSeek-V3). Requires an Instinct MI300X/MI355X GPU.
* - ``VLLM_ROCM_USE_SKINNY_GEMM``
- Prefer skinny-GEMM kernel variants for small batch sizes (defaults to ``True``). Improves performance when ``M`` dimension is small. **Recommended to keep enabled**.
* - ``VLLM_ROCM_FP8_PADDING``
- Pad ``FP8`` linear weight tensors to improve memory locality (defaults to ``True``). Minor memory overhead for better performance.
* - ``VLLM_ROCM_MOE_PADDING``
- Pad MoE weight tensors for better memory access patterns (defaults to ``True``). Same memory/performance tradeoff as ``FP8`` padding.
* - ``VLLM_ROCM_CUSTOM_PAGED_ATTN``
- Use custom paged-attention decode kernel when Prefill-Decode attention backend is selected (defaults to ``True``). See :ref:`Attention backend selection with AITER <vllm-optimization-aiter-backend-selection>`.
.. note::
When ``VLLM_ROCM_USE_AITER=1``, most AITER component flags (``LINEAR``,
``MOE``, ``RMSNORM``, ``MLA``, ``MHA``, ``FP8BMM``) automatically default to
``True``. You typically only need to set the master switch
``VLLM_ROCM_USE_AITER=1`` to enable all optimizations. ROCm provides a
prebuilt optimized Docker image for validating the performance of LLM
inference with vLLM on MI300X Series GPUs. The Docker image includes ROCm,
vLLM, and PyTorch. For more information, see
:doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`.
.. _vllm-optimization-aiter-moe-requirements:
AITER MoE requirements (Mixtral, DeepSeek-V2/V3, Qwen-MoE models)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``VLLM_ROCM_USE_AITER_MOE`` enables AITER's optimized Mixture-of-Experts kernels, accelerating expert routing (top-k selection) and expert computation.
Applicable models:
* Mixtral series: for example, Mixtral-8x7B / Mixtral-8x22B
* Llama-4 family: for example, Llama-4-Scout-17B-16E / Llama-4-Maverick-17B-128E
* DeepSeek family: DeepSeek-V2 / DeepSeek-V3 / DeepSeek-R1
* Qwen family: Qwen1.5-MoE / Qwen2-MoE / Qwen2.5-MoE series
* Other MoE architectures
When to enable:
* **Enable (default):** For all MoE models on the Instinct MI300X/MI355X for best throughput
* **Disable:** Only for debugging or if you encounter numerical issues
Example usage:
.. code-block:: bash
# Standard MoE model (Mixtral)
VLLM_ROCM_USE_AITER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1
# Hybrid MoE+MLA model (DeepSeek-V3) - requires both MOE and MLA flags
VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V3 \
--block-size 1 \
--tensor-parallel-size 8
.. _vllm-optimization-aiter-mla-requirements:
AITER MLA requirements (DeepSeek-V3/R1 models)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``VLLM_ROCM_USE_AITER_MLA`` enables AITER MLA (Multi-head Latent Attention) optimization for supported models. Defaults to **True** when AITER is on.
Critical requirement:
* **Must** explicitly set ``--block-size 1``
.. important::
If you omit ``--block-size 1``, vLLM will raise an error rather than defaulting to 1.
Applicable models:
* DeepSeek-V3 / DeepSeek-R1
* DeepSeek-V2
* Other models using multi-head latent attention (MLA) architecture
Example usage:
.. code-block:: bash
# DeepSeek-R1 with AITER MLA (requires 8 GPUs)
VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \
--block-size 1 \
--tensor-parallel-size 8
.. _vllm-optimization-aiter-backend-selection:
Attention backend selection with AITER
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Understanding which attention backend to use helps optimize your deployment.
Quick reference: Which attention backend will I get?
Default behavior (no configuration)
Without setting any environment variables, vLLM uses:
* **vLLM Triton Unified Attention** — A single Triton kernel handling both prefill and decode phases
* Works on all ROCm platforms
* Good baseline performance
**Recommended**: Enable AITER (set ``VLLM_ROCM_USE_AITER=1``)
When you enable AITER, the backend is automatically selected based on your model:
.. code-block:: text
Is your model using MLA architecture? (DeepSeek-V3/R1/V2)
├─ YES → AITER MLA Backend
│ • Requires --block-size 1
│ • Best performance for MLA models
│ • Automatically selected
└─ NO → AITER MHA Backend
• For standard transformer models (Llama, Mistral, etc.)
• Optimized for Instinct MI300X/MI355X
• Automatically selected
**Advanced**: Manual backend selection
Most users won't need this, but you can override the defaults:
.. list-table::
:widths: 40 60
:header-rows: 1
* - To use this backend
- Set these flags
* - AITER MLA (MLA models only)
- ``VLLM_ROCM_USE_AITER=1`` (auto-selects for DeepSeek-V3/R1)
* - AITER MHA (standard models)
- ``VLLM_ROCM_USE_AITER=1`` (auto-selects for non-MLA models)
* - vLLM Triton Unified (default)
- ``VLLM_ROCM_USE_AITER=0`` (or unset)
* - Triton Prefill-Decode (split) without AITER
- | ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
* - Triton Prefill-Decode (split) along with AITER Fused-MoE
- | ``VLLM_ROCM_USE_AITER=1``
| ``VLLM_ROCM_USE_AITER_MHA=0``
| ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
* - AITER Unified Attention
- | ``VLLM_ROCM_USE_AITER=1``
| ``VLLM_ROCM_USE_AITER_MHA=0``
| ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``
**Quick start examples**:
.. code-block:: bash
# Recommended: Standard model with AITER (Llama, Mistral, Qwen, etc.)
VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct
# MLA model with AITER (DeepSeek-V3/R1)
VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \
--block-size 1 \
--tensor-parallel-size 8
# Advanced: Use Prefill-Decode split (for short input cases) with AITER Fused-MoE
VLLM_ROCM_USE_AITER=1 \
VLLM_ROCM_USE_AITER_MHA=0 \
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
vllm serve meta-llama/Llama-4-Scout-17B-16E
**Which backend should I choose?**
.. list-table::
:widths: 30 70
:header-rows: 1
* - Your use case
- Recommended backend
* - **Standard transformer models** (Llama, Mistral, Qwen, Mixtral)
- **AITER MHA** (``VLLM_ROCM_USE_AITER=1``) — **Recommended for most workloads** on Instinct MI300X/MI355X. Provides optimized attention kernels for both prefill and decode phases.
* - **MLA models** (DeepSeek-V3/R1/V2)
- **AITER MLA** (auto-selected with ``VLLM_ROCM_USE_AITER=1``) — Required for optimal performance, must use ``--block-size 1``
* - **gpt-oss models** (gpt-oss-120b/20b)
- **AITER Unified Attention** (``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``) — Required for optimal performance
* - **Debugging or compatibility**
- **vLLM Triton Unified** (default with ``VLLM_ROCM_USE_AITER=0``) — Generic fallback, works everywhere
**Important notes:**
* **AITER MHA and AITER MLA are mutually exclusive** — vLLM automatically detects MLA models and selects the appropriate backend
* **For 95% of users:** Simply set ``VLLM_ROCM_USE_AITER=1`` and let vLLM choose the right backend
* When in doubt, start with AITER enabled (the recommended configuration) and profile your specific workload
Backend choice quick recipes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* **Standard transformers (any prompt length):** Start with ``VLLM_ROCM_USE_AITER=1`` → AITER MHA. For CUDA graph modes, see architecture-specific guidance below (Dense vs MoE models have different optimal modes).
* **Latency-sensitive chat (low TTFT):** keep ``--max-num-batched-tokens`` at **8k–16k** with AITER.
* **Streaming decode (low ITL):** raise ``--max-num-batched-tokens`` to **32k–64k**.
* **Offline max throughput:** ``--max-num-batched-tokens`` **≥32k** with ``cudagraph_mode=FULL``.
**How to verify which backend is active**
Check vLLM's startup logs to confirm which attention backend is being used:
.. code-block:: bash
# Start vLLM and check logs
VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct 2>&1 | grep -i attention
**Expected log messages:**
* AITER MHA: ``Using Aiter Flash Attention backend on V1 engine.``
* AITER MLA: ``Using AITER MLA backend on V1 engine.``
* vLLM Triton MLA: ``Using Triton MLA backend on V1 engine.``
* vLLM Triton Unified: ``Using Triton Attention backend on V1 engine.``
* AITER Triton Unified: ``Using Aiter Unified Attention backend on V1 engine.``
* AITER Triton Prefill-Decode: ``Using Rocm Attention backend on V1 engine.``
Attention backend technical details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This section provides technical details about vLLM's attention backends on ROCm.
vLLM V1 on ROCm provides these attention implementations:
1. **vLLM Triton Unified Attention** (default when AITER is **off**)
* Single unified Triton kernel handling both chunked prefill and decode phases
* Generic implementation that works across all ROCm platforms
* Good baseline performance
* Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset)
* Supports GPT-OSS
2. **AITER Triton Unified Attention** (advanced, requires manual configuration)
* The AMD optimized unified Triton kernel
* Enable with ``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, and ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``.
* Only useful for specific workloads. Most users should use AITER MHA instead.
* This backend is recommended when running GPT-OSS.
3. **AITER Triton Prefill-Decode Attention** (hybrid, Instinct MI300X-optimized)
* Enable with ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1``
* Uses separate kernels for prefill and decode phases:
* **Prefill**: ``context_attention_fwd`` Triton kernel
* **Primary decode**: ``torch.ops._rocm_C.paged_attention`` (custom ROCm kernel optimized for head sizes 64/128, block sizes 16/32, GQA 1–16, context ≤131k; sliding window not supported)
* **Fallback decode**: ``kernel_paged_attention_2d`` Triton kernel when shapes don't meet primary decode requirements
* Usually performs better than the unified Triton kernels
* Performance vs AITER MHA varies: AITER MHA is typically faster overall, but Prefill-Decode split may win in short input scenarios
* The custom paged attention decode kernel is controlled by ``VLLM_ROCM_CUSTOM_PAGED_ATTN`` (default **True**)
4. **AITER Multi-Head Attention (MHA)** (default when AITER is **on**)
* Controlled by ``VLLM_ROCM_USE_AITER_MHA`` (**1** = enabled)
* Best all-around performance for standard transformer models
* Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and model is not MLA
5. **vLLM Triton Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2)
* Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset)
6. **AITER Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2)
* Controlled by ``VLLM_ROCM_USE_AITER_MLA`` (``1`` = enabled)
* Required for optimal performance on MLA architecture models
* Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and model uses MLA
* Requires ``--block-size 1``
Quick Reduce (large all-reduces on ROCm)
========================================
**Quick Reduce** is an alternative to RCCL/custom all-reduce for **large** inputs (MI300-class GPUs).
It supports FP16/BF16 as well as symmetric INT8/INT6/INT4 quantized all-reduce (group size 32).
.. warning::
Quantization can affect accuracy. Validate quality before deploying.
Control via:
* ``VLLM_ROCM_QUICK_REDUCE_QUANTIZATION``: one of ``["NONE", "FP", "INT8", "INT6", "INT4"]`` (default ``NONE``).
* ``VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16``: cast BF16 input to FP16 (``1/True`` by default for performance).
* ``VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB``: cap the preset buffer (default ``NONE``, which is treated as ``2048`` MB).
Quick Reduce tends to help **throughput** at higher TP counts (for example, 4–8) with many concurrent requests.
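The following sketch assumes an ``FP16``/``BF16`` model served with TP=8; the model name is a placeholder. Validate output quality before deploying a quantized all-reduce setting.
.. code-block:: bash
# Illustrative: enable INT8 quantized all-reduce via Quick Reduce for a TP=8 deployment
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT8
vllm serve MODEL_NAME --tensor-parallel-size 8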
Parallelism strategies (run vLLM on multiple GPUs)
==================================================
vLLM supports the following parallelism strategies:
1. Tensor parallelism
2. Pipeline parallelism
3. Data parallelism
4. Expert parallelism
For more details, see `Parallelism and scaling <https://docs.vllm.ai/en/stable/serving/parallelism_scaling.html>`_.
**Choosing the right strategy:**
* **Tensor Parallelism (TP)**: Use when model doesn't fit on one GPU. Prefer staying within a single XGMI island (≤8 GPUs on the Instinct MI300X).
* **Pipeline Parallelism (PP)**: Use for very large models across nodes. Set TP to GPUs per node, scale with PP across nodes.
* **Data Parallelism (DP)**: Use when model fits on single GPU or TP group, and you need higher throughput. Combine with TP/PP for large models.
* **Expert Parallelism (EP)**: Use for MoE models with ``--enable-expert-parallel``. More efficient than TP for MoE layers.
Tensor parallelism
^^^^^^^^^^^^^^^^^^
Tensor parallelism splits each layer of the model weights across multiple GPUs when the model doesn't fit on a single GPU. This is primarily for memory capacity.
**Use tensor parallelism when:**
* Model does not fit on one GPU (OOM)
* Need to enable larger batch sizes by distributing KV cache across GPUs
**Examples:**
.. code-block:: bash
# Tensor parallelism: Split model across 2 GPUs
vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2
# Two vLLM instances, each split across 2 GPUs with TP (4 GPUs total)
CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8000
CUDA_VISIBLE_DEVICES=2,3 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8001
.. note::
**ROCm GPU visibility:** vLLM on ROCm reads ``CUDA_VISIBLE_DEVICES``. Keep ``HIP_VISIBLE_DEVICES`` unset to avoid conflicts.
.. tip::
For structured data parallelism deployments with load balancing, see :ref:`data-parallelism-section`.
Pipeline parallelism
^^^^^^^^^^^^^^^^^^^^
Pipeline parallelism splits the model's layers across multiple GPUs or nodes, with each GPU processing different layers sequentially. This is primarily used for multi-node deployments where the model is too large for a single node.
**Use pipeline parallelism when:**
* Model is too large for a single node (combine PP with TP)
* GPUs on a node lack a high-speed interconnect (for example, no XGMI/NVLink); PP may perform better than TP
* GPU count doesn't evenly divide the model (PP supports uneven splits)
**Common pattern for multi-node:**
.. code-block:: bash
# 2 nodes × 8 GPUs = 16 GPUs total
# TP=8 per node, PP=2 across nodes
vllm serve meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
.. note::
**ROCm best practice**: On the Instinct MI300X, prefer staying within a single XGMI island (≤8 GPUs) using TP only. Use PP when scaling beyond eight GPUs or across nodes.
.. _data-parallelism-section:
Data parallelism
^^^^^^^^^^^^^^^^
Data parallelism replicates model weights across separate instances/GPUs to process independent batches of requests. This approach increases throughput by distributing the workload across multiple replicas.
**Use data parallelism when:**
* Model fits on one GPU, but you need higher request throughput
* Scaling across multiple nodes horizontally
* Combining with tensor parallelism (for example, DP=2 + TP=4 = 8 GPUs total)
**Quick start - single-node:**
.. code-block:: bash
# Model fits on 1 GPU. Creates 2 model replicas (requires 2 GPUs)
VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve /path/to/model \
--data-parallel-size 2 \
--disable-nccl-for-dp-synchronization
.. tip::
For ROCm, currently use ``VLLM_ALL2ALL_BACKEND="allgather_reducescatter"`` and ``--disable-nccl-for-dp-synchronization`` with data parallelism.
Choosing a load balancing strategy
"""""""""""""""""""""""""""""""""""
vLLM supports two modes for routing requests to DP ranks:
.. list-table::
:header-rows: 1
:widths: 30 35 35
* -
- **Internal LB** (recommended)
- **External LB**
* - **HTTP endpoints**
- 1 endpoint, vLLM routes internally
- N endpoints, you provide external router
* - **Single-node config**
- ``--data-parallel-size N``
- ``--data-parallel-size N --data-parallel-rank 0..N-1`` + different ports
* - **Multi-node config**
- ``--data-parallel-size``, ``--data-parallel-size-local``, ``--data-parallel-address``
- ``--data-parallel-size N --data-parallel-rank 0..N-1`` + ``--data-parallel-address``
* - **Client view**
- Single URL/port
- Multiple URLs/ports
* - **Load balancer**
- Built-in (vLLM handles)
- External (Nginx, Kong, K8s Service)
* - **Coordination**
- DP ranks sync via RPC (for MoE/MLA)
- DP ranks sync via RPC (for MoE/MLA)
* - **Best for**
- Most deployments (simpler)
- K8s/cloud environments with existing LB
.. tip::
**Dense (non-MoE) models only:** You can run fully independent ``vllm serve`` instances without any DP flags, using your own load balancer. This avoids RPC coordination overhead entirely.
For more technical details, see `vLLM Data Parallel Deployment <https://docs.vllm.ai/en/stable/serving/data_parallel_deployment.html>`_
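As an illustration of the external LB row above, the following single-node sketch starts two DP ranks on separate ports (the GPU indices, ports, and model path are placeholders); an external load balancer such as Nginx would then distribute requests across the two endpoints. Combine with the ROCm-specific DP settings from the earlier tip as appropriate.
.. code-block:: bash
# External LB sketch: two DP ranks on one node, each exposing its own port
CUDA_VISIBLE_DEVICES=0 vllm serve /path/to/model \
--data-parallel-size 2 --data-parallel-rank 0 --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve /path/to/model \
--data-parallel-size 2 --data-parallel-rank 1 --port 8001 &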
Data Parallel Attention (advanced)
""""""""""""""""""""""""""""""""""
For models with Multi-head Latent Attention (MLA) architecture like DeepSeek V2, V3, and R1, vLLM supports **Data Parallel Attention**,
which provides request-level parallelism instead of model replication. This avoids KV cache duplication across tensor parallel ranks,
significantly reducing memory usage and enabling larger batch sizes.
**Key benefits for MLA models:**
* Eliminates KV cache duplication when using tensor parallelism
* Enables higher throughput for high-QPS serving scenarios
* Better memory efficiency for large context windows
**Usage with Expert Parallelism:**
Data parallel attention works seamlessly with Expert Parallelism for MoE models:
.. code-block:: bash
# DeepSeek-R1 with DP attention and expert parallelism
VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \
--data-parallel-size 8 \
--enable-expert-parallel \
--disable-nccl-for-dp-synchronization
For more technical details, see `vLLM RFC #16037 <https://github.com/vllm-project/vllm/issues/16037>`_.
Expert parallelism
^^^^^^^^^^^^^^^^^^
Expert parallelism (EP) distributes expert layers of Mixture-of-Experts (MoE) models across multiple GPUs,
where tokens are routed to the GPUs holding the experts they need.
**Performance considerations:**
Expert parallelism is designed primarily for cross-node MoE deployments where high-bandwidth interconnects (like InfiniBand) between nodes make EP communication efficient. For single-node Instinct MI300X/MI355X deployments with XGMI connectivity, tensor parallelism typically provides better performance due to optimized all-to-all collectives on XGMI.
**When to use EP:**
* Multi-node MoE deployments with fast inter-node networking
* Models with very large numbers of experts that benefit from expert distribution
* Workloads where EP's reduced data movement outweighs communication overhead
**Single-node recommendation:** For Instinct MI300X/MI355X within a single node (≤8 GPUs), prefer tensor parallelism over expert parallelism for MoE models to leverage XGMI's high bandwidth and low latency.
**Basic usage:**
.. code-block:: bash
# Enable expert parallelism for MoE models (DeepSeek example with 8 GPUs)
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 8 \
--enable-expert-parallel
**Combining with Tensor Parallelism:**
When EP is enabled alongside tensor parallelism:
* Fused MoE layers use expert parallelism
* Non-fused MoE layers use tensor parallelism
**Combining with Data Parallelism:**
EP works seamlessly with Data Parallel Attention for optimal memory efficiency in MLA+MoE models (for example, DeepSeek V3):
.. code-block:: bash
# DP attention + EP for DeepSeek-R1
VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \
--data-parallel-size 8 \
--enable-expert-parallel \
--disable-nccl-for-dp-synchronization
Throughput benchmarking
=======================
This guide evaluates LLM inference by tokens per second (TPS). vLLM provides a
built-in benchmark:
.. code-block:: bash
# Synthetic or dataset-driven benchmark
vllm bench throughput --model /path/to/model [other args]
* **Real-world dataset** (ShareGPT) example:
.. code-block:: bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
vllm bench throughput --model /path/to/model --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
* **Synthetic**: set fixed ``--input-len`` and ``--output-len`` for reproducible runs.
.. tip::
**Profiling checklist (ROCm)**
1. Fix your prompt distribution (ISL/OSL) and **vary one knob at a time** (graph mode, ``--max-num-batched-tokens`` (MBT)).
2. Measure **TTFT**, **ITL**, and **TPS** together; don't optimize one in isolation.
3. Compare graph modes: **PIECEWISE** (balanced) vs **FULL**/``FULL_DECODE_ONLY`` (max throughput).
4. Sweep ``--max-num-batched-tokens`` around **8k–64k** to find your latency/throughput balance (a sweep sketch follows this list).
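A minimal sweep sketch, assuming the benchmark CLI accepts engine arguments such as ``--max-num-batched-tokens``; the model path, prompt lengths, and token-budget values are placeholders.
.. code-block:: bash
# Sweep --max-num-batched-tokens with a fixed synthetic prompt distribution and compare TPS
for mbt in 8192 16384 32768 65536; do
echo "=== max-num-batched-tokens=$mbt ==="
vllm bench throughput --model /path/to/model \
--input-len 1024 --output-len 512 \
--max-num-batched-tokens "$mbt"
done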
Maximizing instances per node
=============================
To maximize **per-node throughput**, run as many vLLM instances as model memory allows,
balancing KV-cache capacity.
* **HBM capacities**: MI300X = 192GB HBM3; MI355X = 288GB HBM3E.
* Up to **eight** single-GPU vLLM instances can run in parallel on an 8×GPU node (one per GPU):
.. code-block:: bash
for i in $(seq 0 7); do
CUDA_VISIBLE_DEVICES="$i" vllm bench throughput \
-tp 1 --model /path/to/model \
--dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json &
done
Total throughput from **N** single-GPU instances usually exceeds one instance stretched across **N** GPUs (``-tp N``).
**Model coverage**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B), Qwen2 (7B/72B), Mixtral-8x7B/8x22B, and others. Llama 2 70B
and Llama 3 70B fit on a single MI300X/MI355X GPU; Llama 3.1 405B fits on a single 8×MI300X/MI355X node.
Configure the gpu-memory-utilization parameter
==================================================
The ``--gpu-memory-utilization`` parameter controls the fraction of GPU memory reserved for the KV-cache. The default is **0.9** (90%).
There are two strategies:
1. **Increase** ``--gpu-memory-utilization`` to maximize throughput for a single instance (up to **0.95**).
Example:
.. code-block:: bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--gpu-memory-utilization 0.95 \
--max-model-len 8192 \
--port 8000
2. **Decrease** to pack **multiple** instances on the same GPU (for small models like 7B/8B), keeping KV-cache viable:
.. code-block:: bash
# Instance 1 on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.45 \
--max-model-len 4096 \
--port 8000
# Instance 2 on GPU 0
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-Guard-3-8B \
--gpu-memory-utilization 0.45 \
--max-model-len 4096 \
--port 8001
vLLM engine arguments
=====================
Selected arguments that often help on ROCm. See `Engine Arguments
<https://docs.vllm.ai/en/stable/configuration/engine_args.html>`__ in the vLLM
documentation for the full list.
Configure --max-num-seqs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The default value is **1024** in vLLM V1 (increased from **256** in V0). This flag controls the maximum number of sequences processed per batch, directly affecting concurrency and memory usage.
* **To increase throughput**: Raise to **2048** or **4096** if memory allows, enabling more sequences per iteration.
* **To reduce memory usage**: Lower to **256** or **128** for large models or long-context generation. For example, set ``--max-num-seqs 128`` to reduce concurrency and lower memory requirements.
In vLLM V1, KV-cache token requirements are computed as ``max-num-seqs * max-model-len``.
Example usage:
.. code-block:: bash
vllm serve <model> --max-num-seqs 128 --max-model-len 8192
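Per the formula above, these example values budget KV-cache capacity for up to 128 × 8192 = 1,048,576 tokens; lowering either flag shrinks that requirement proportionally.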
Configure --max-num-batched-tokens
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**Chunked prefill is enabled by default** in vLLM V1.
* Lower values improve **ITL** (less prefill interrupting decode).
* Higher values improve **TTFT** (more prefill per batch).
Defaults: **8192** for online serving, **16384** for offline. However, optimal values vary significantly by model size. Smaller models can efficiently handle larger batch sizes. Setting it near ``--max-model-len`` mimics V0 behavior and often maximizes throughput.
**Guidance:**
* **Interactive (low TTFT)**: keep MBT ≤ **8k–16k**.
* **Streaming (low ITL)**: MBT **16k–32k**.
* **Offline max throughput**: MBT **≥32k** (diminishing TPS returns beyond ~32k).
**Pattern:** Smaller/more efficient models benefit from larger batch sizes. MoE models with expert parallelism can handle very large batches efficiently.
**Rule of thumb**
* Push MBT **up** to trade TTFT↑ for ITL↓ and slightly higher TPS.
* Pull MBT **down** to trade ITL↑ for TTFT↓ (interactive UX).
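For instance, an offline, throughput-oriented launch might raise the budget explicitly; the model name and value below are illustrative.
.. code-block:: bash
# Offline throughput profile: larger prefill budget per batch
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-num-batched-tokens 32768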
Async scheduling
^^^^^^^^^^^^^^^^
``--async-scheduling`` (replaces deprecated ``num_scheduler_steps``) can improve throughput/ITL by trading off TTFT.
Prefer **off** for latency-sensitive serving; **on** for offline batch throughput.
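A minimal sketch for an offline, throughput-oriented launch (the model name is a placeholder):
.. code-block:: bash
# Enable asynchronous scheduling to favor throughput/ITL over TTFT
vllm serve meta-llama/Llama-3.1-8B-Instruct --async-scheduling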
CUDA graphs configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA graphs reduce kernel launch overhead by capturing and replaying GPU operations, improving inference throughput. Configure using ``--compilation-config '{"cudagraph_mode": "MODE"}'``.
**Available modes:**
* ``NONE`` — CUDA graphs disabled (debugging)
* ``PIECEWISE`` — Attention stays eager, other ops use CUDA graphs (most compatible)
* ``FULL`` — Full CUDA graphs for all batches (best for small models/prompts)
* ``FULL_DECODE_ONLY`` — Full CUDA graphs only for decode (saves memory in prefill/decode split setups)
* ``FULL_AND_PIECEWISE`` **(default)** — Full graphs for decode + piecewise for prefill (best performance, highest memory)
**Default behavior:** V1 defaults to ``FULL_AND_PIECEWISE`` with piecewise compilation enabled; otherwise ``NONE``.
**Backend compatibility:** Not all attention backends support all CUDA graph modes. Choose a mode your backend supports:
.. list-table::
:header-rows: 1
:widths: 40 60
* - Attention backend
- CUDA graph support
* - vLLM/AITER Triton Unified Attention, vLLM Prefill-Decode Attention
- Full support (prefill + decode)
* - AITER MHA, AITER MLA
- Uniform batches only
* - vLLM Triton MLA
- Must exclude attention from graph — ``PIECEWISE`` required
**Usage examples:**
.. code-block:: bash
# Default (best performance, highest memory)
vllm serve meta-llama/Llama-3.1-8B-Instruct
# Decode-only graphs (lower memory, good for P/D split)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
# Full graphs for offline throughput (small models)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--compilation-config '{"cudagraph_mode": "FULL"}'
**Migration from legacy flags:**
* ``use_cudagraph=False`` → ``NONE``
* ``use_cudagraph=True, full_cuda_graph=False`` → ``PIECEWISE``
* ``full_cuda_graph=True`` → ``FULL`` (with automatic fallback)
Quantization support
====================
vLLM supports FP4/FP8 (4-bit/8-bit floating point) weight and activation quantization using hardware acceleration on the Instinct MI300X and MI355X.
Quantization of models with FP4/FP8 allows for a **2x-4x** reduction in model memory requirements and up to a **1.6x**
improvement in throughput with minimal impact on accuracy.
vLLM on ROCm supports a variety of quantization workflows:
* On-the-fly quantization
* Pre-quantized models produced with Quark and LLM Compressor
Supported quantization methods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM on ROCm supports the following quantization methods for the AMD Instinct MI300 series and Instinct MI355X GPUs:
.. list-table::
:header-rows: 1
:widths: 20 15 15 20 30
* - Method
- Precision
- ROCm support
- Memory reduction
- Best use case
* - **FP8** (W8A8)
- 8-bit float
- Excellent
- 2× (50%)
- Production, balanced speed/accuracy
* - **PTPC-FP8**
- 8-bit float
- Excellent
- 2× (50%)
- High throughput; better accuracy than per-tensor ``FP8``
* - **AWQ**
- 4-bit int (W4A16)
- Good
- 4× (75%)
- Large models, memory-constrained
* - **GPTQ**
- 4-bit/8-bit int
- Good
- 2-4× (50-75%)
- Pre-quantized models available
* - **FP8 KV-cache**
- 8-bit float
- Excellent
- KV cache: 50%
- All inference workloads
* - **Quark (AMD)**
- ``FP8``/``MXFP4``
- Optimized
- 2-4× (50-75%)
- AMD pre-quantized models
* - **compressed-tensors**
- W8A8 ``INT8``/``FP8``
- Good
- 2× (50%)
- LLM Compressor models
**ROCm support key:**
- Excellent: Fully supported with optimized kernels
- Good: Supported, might not have AMD-optimized kernels
- Optimized: AMD-specific optimizations available
Using pre-quantized models
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AMD provides pre-quantized models optimized for ROCm. These models are ready to use with vLLM.
**AMD Quark-quantized models**:
Available on `Hugging Face <https://huggingface.co/models?other=quark>`_:
* `Llama-3.1-8B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.1-70B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.1-405B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`__ (FP8 W8A8)
* `Mixtral-8x7B-Instruct-v0.1-FP8-KV <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`__ (FP8 W8A8)
* `Mixtral-8x22B-Instruct-v0.1-FP8-KV <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`__ (FP8 W8A8)
* `Llama-3.3-70B-Instruct-MXFP4-Preview <https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)
* `Llama-3.1-405B-Instruct-MXFP4-Preview <https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)
* `DeepSeek-R1-0528-MXFP4-Preview <https://huggingface.co/amd/DeepSeek-R1-0528-MXFP4-Preview>`__ (MXFP4 for MI350/MI355)
**Quick start**:
.. code-block:: bash
# FP8 W8A8 Quark model
vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \
--dtype auto
# MXFP4 Quark model for MI350/MI355
vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \
--dtype auto \
--tensor-parallel-size 1
**Other pre-quantized models**:
- AWQ models: `Hugging Face awq flag <https://huggingface.co/models?other=awq>`_
- GPTQ models: `Hugging Face gptq flag <https://huggingface.co/models?other=gptq>`_
- LLM Compressor models: `Hugging Face compressed-tensors flag <https://huggingface.co/models?other=compressed-tensors>`_
On-the-fly quantization
^^^^^^^^^^^^^^^^^^^^^^^^
For models without pre-quantization, vLLM can quantize ``FP16``/``BF16`` models at server startup.
**Supported methods**:
- ``fp8``: Per-tensor ``FP8`` weight and activation quantization
- ``ptpc_fp8``: Per-token-activation per-channel-weight ``FP8`` (better accuracy at the same ``FP8`` speed). See the `PTPC-FP8 on ROCm blog post <https://blog.vllm.ai/2025/02/24/ptpc-fp8-rocm.html>`_ for details
**Usage:**
.. code-block:: bash
# On-the-fly FP8 quantization
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--dtype auto
# On-the-fly PTPC-FP8 (recommended as default)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--quantization ptpc_fp8 \
--dtype auto \
--tensor-parallel-size 4
.. note::
On-the-fly quantization adds two to five minutes of startup time but eliminates a separate pre-quantization step. For production deployments with frequent restarts, use pre-quantized models.
GPTQ
^^^^
GPTQ is a 4-bit/8-bit weight quantization method that compresses models with minimal accuracy loss. GPTQ
is fully supported on ROCm via HIP-compiled kernels in vLLM.
**ROCm support status**:
- **Fully supported** - GPTQ kernels compile and run on ROCm via HIP
- **Pre-quantized models work** with standard GPTQ kernels
**Recommendation**: For the AMD Instinct MI300X, **AWQ with Triton kernels** or **FP8 quantization** might provide better
performance due to ROCm-specific optimizations, but GPTQ is a viable alternative.
**Using pre-quantized GPTQ models**:
.. code-block:: bash
# Using pre-quantized GPTQ model on ROCm
vllm serve RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 \
--quantization gptq \
--dtype auto \
--tensor-parallel-size 1
**Important notes**:
- **Kernel support:** GPTQ uses standard HIP-compiled kernels on ROCm
- **Performance:** AWQ with Triton kernels might offer better throughput on AMD GPUs due to ROCm optimizations
- **Compatibility:** GPTQ models from Hugging Face work on ROCm with standard performance
- **Use case:** GPTQ is suitable when pre-quantized GPTQ models are readily available
AWQ (Activation-aware Weight Quantization)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AWQ (Activation-aware Weight Quantization) is a 4-bit weight quantization technique that provides excellent
model compression with minimal accuracy loss (<1%). ROCm supports AWQ quantization on the AMD Instinct MI300 series and
MI355X GPUs with vLLM.
**Using pre-quantized AWQ models:**
Many AWQ-quantized models are available on Hugging Face. Use them directly with vLLM:
.. code-block:: bash
# vLLM serve with AWQ model
VLLM_USE_TRITON_AWQ=1 \
vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
--quantization awq \
--tensor-parallel-size 1 \
--dtype auto
**Important Notes:**
* **ROCm requirement:** Set ``VLLM_USE_TRITON_AWQ=1`` to enable Triton-based AWQ kernels on ROCm
* **dtype parameter:** AWQ requires ``--dtype auto`` or ``--dtype float16``. The ``--dtype`` flag controls
the **activation dtype** (``FP16``/``BF16`` for computations), not the weight dtype. AWQ weights remain as INT4
(4-bit integers) as specified in the model's quantization config, but are dequantized to ``FP16``/``BF16`` during
matrix multiplication operations.
* **Group size:** 128 is recommended for optimal performance/accuracy balance
* **Model compatibility:** AWQ is primarily tested on Llama, Mistral, and Qwen model families
Quark (AMD quantization toolkit)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AMD Quark is the AMD quantization toolkit optimized for ROCm. It supports ``FP8 W8A8``, ``MXFP4``, ``W8A8 INT8``, and
other quantization formats with native vLLM integration. The quantization format will automatically be inferred
from the model config file, so you can omit ``--quantization quark``.
**Running Quark Models:**
.. code-block:: bash
# FP8 W8A8: Single GPU
vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \
--dtype auto \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
# MXFP4: Extreme memory efficiency
vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \
--dtype auto \
--tensor-parallel-size 1 \
--max-model-len 8192
**Key features:**
- **FP8 models**: ~50% memory reduction, 2× compression
- **MXFP4 models**: ~75% memory reduction, 4× compression
- **Embedded scales**: Quark FP8-KV models include pre-calibrated KV-cache scales
- **Hardware optimized**: Leverages the AMD Instinct MI300 series ``FP8`` acceleration
For creating your own Quark-quantized models, see `Quark Documentation <https://quark.docs.amd.com/latest/>`_.
FP8 KV-cache dtype
^^^^^^^^^^^^^^^^^^^^
FP8 KV-cache quantization reduces memory footprint by approximately 50%, enabling longer context lengths
or higher concurrency. ROCm supports FP8 KV-cache with both ``fp8_e4m3`` and ``fp8_e5m2`` formats on
AMD Instinct MI300 series and other CDNA™ GPUs.
Use ``--kv-cache-dtype fp8`` to enable ``FP8`` KV-cache quantization. For best accuracy, use calibrated
scaling factors generated via `LLM Compressor <https://github.com/vllm-project/llm-compressor>`_.
Without calibration, scales are calculated dynamically (``--calculate-kv-scales``) with minimal
accuracy impact.
**Quick start (dynamic scaling)**:
.. code-block:: bash
# vLLM serve with dynamic FP8 KV-cache
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-cache-dtype fp8 \
--calculate-kv-scales \
--gpu-memory-utilization 0.90
**Calibrated scaling (advanced)**:
For optimal accuracy, pre-calibrate KV-cache scales using representative data. The calibration process:
#. Runs the model on calibration data (512+ samples recommended)
#. Computes optimal ``FP8`` quantization scales for key/value cache tensors
#. Embeds these scales into the saved model as additional parameters
#. vLLM loads the model and uses the embedded scales automatically when ``--kv-cache-dtype fp8`` is specified
The quantized model can be used like any other model. The embedded scales are stored as part of the model weights.
**Using pre-calibrated models:**
AMD provides ready-to-use models with pre-calibrated ``FP8`` KV cache scales:
* `amd/Llama-3.1-8B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>`_
* `amd/Llama-3.3-70B-Instruct-FP8-KV <https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV>`_
To verify a model has pre-calibrated KV cache scales, check ``config.json`` for:
.. code-block:: json
"quantization_config": {
"kv_cache_scheme": "static" // Indicates pre-calibrated scales are embedded
}
**Creating your own calibrated model:**
.. code-block:: bash
# 1. Install LLM Compressor
pip install llmcompressor
# 2. Run calibration script (see llm-compressor repo for full example)
python llama3_fp8_kv_example.py
# 3. Use calibrated model in vLLM
vllm serve ./Meta-Llama-3-8B-Instruct-FP8-KV \
--kv-cache-dtype fp8
For detailed instructions and the complete calibration script, see the `FP8 KV Cache Quantization Guide <https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_kv_cache/README.md>`_.
**Format options**:
- ``fp8`` or ``fp8_e4m3``: Higher precision (default, recommended)
- ``fp8_e5m2``: Larger dynamic range, slightly lower precision
Speculative decoding (experimental)
===================================
Recent vLLM versions add support for speculative decoding backends (for example, EAGLE3). Evaluate them against your model and latency/throughput goals.
Speculative decoding is a technique to reduce latency when the maximum concurrency is low.
Depending on the method, the effective concurrency varies, for example, from 16 to 64.
Example command:
.. code-block:: bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--trust-remote-code \
--swap-space 16 \
--disable-log-requests \
--tensor-parallel-size 1 \
--distributed-executor-backend mp \
--dtype float16 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--no-enable-chunked-prefill \
--max-num-seqs 300 \
--max-num-batched-tokens 131072 \
--gpu-memory-utilization 0.8 \
--speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2, "draft_tensor_parallel_size": 1, "dtype": "float16"}' \
--port 8001
.. important::
Increasing ``num_speculative_tokens`` has been observed to lower the
acceptance rate of draft-model tokens and reduce throughput. As a
workaround, set ``num_speculative_tokens`` to 2 or less.
Multi-node checklist and troubleshooting
========================================
1. Use ``--distributed-executor-backend ray`` across nodes to manage HIP-visible ranks and RCCL communicators. (``ray`` is the default for multi-node. Explicitly setting this flag is optional.)
2. Ensure ``/dev/shm`` is shared across ranks (Docker ``--shm-size``, Kubernetes ``emptyDir``), as RCCL uses shared memory for rendezvous.
3. For GPUDirect RDMA, set ``RCCL_NET_GDR_LEVEL=2`` and verify links (``ibstat``). Requires supported NICs (for example, ConnectX-6 or newer).
4. Collect RCCL logs: ``RCCL_DEBUG=INFO`` and optionally ``RCCL_DEBUG_SUBSYS=INIT,GRAPH`` for init/graph stalls.
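The following container launch is an illustrative sketch combining these settings; the image name and shared-memory size are placeholders, and the device flags assume a standard ROCm container setup.
.. code-block:: bash
# Illustrative worker container: larger /dev/shm for RCCL, GDR enabled, verbose RCCL logs
docker run -it --network=host --shm-size=16g \
--device=/dev/kfd --device=/dev/dri \
-e RCCL_NET_GDR_LEVEL=2 \
-e RCCL_DEBUG=INFO \
<vllm-rocm-image> bash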
Further reading
===============
* :doc:`workload`
* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`