diff --git a/.wordlist.txt b/.wordlist.txt index 294016553..aed9dc1cc 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -27,6 +27,7 @@ ASICs ASan ASAN ASm +Async ATI atomicRMW AddressSanitizer @@ -133,6 +134,7 @@ ELMo ENDPGM EPYC ESXi +EP EoS etcd fas @@ -184,6 +186,7 @@ GPR GPT GPU GPU's +GPUDirect GPUs GraphBolt GraphSage @@ -302,6 +305,7 @@ Makefiles Matplotlib Matrox MaxText +MBT Megablocks Megatrends Megatron @@ -311,6 +315,7 @@ Meta's Miniconda MirroredStrategy Mixtral +MLA MosaicML MoEs Mooncake @@ -353,6 +358,7 @@ OFED OMM OMP OMPI +OOM OMPT OMPX ONNX @@ -398,6 +404,7 @@ Profiler's PyPi Pytest PyTorch +QPS Qcycles Qwen RAII @@ -673,6 +680,7 @@ denoised denoises denormalize dequantization +dequantized dequantizes deserializers detections @@ -788,6 +796,7 @@ linalg linearized linter linux +llm llvm lm localscratch @@ -838,6 +847,7 @@ passthrough pe perfcounter performant +piecewise perl pragma pre @@ -984,6 +994,7 @@ tokenizer tokenizes toolchain toolchains +topk toolset toolsets torchtitan @@ -1011,6 +1022,7 @@ USM UTCL UTIL utils +UX vL variational vdi diff --git a/docs/how-to/rocm-for-ai/inference-optimization/vllm-optimization.rst b/docs/how-to/rocm-for-ai/inference-optimization/vllm-optimization.rst new file mode 100644 index 000000000..d1e0f96da --- /dev/null +++ b/docs/how-to/rocm-for-ai/inference-optimization/vllm-optimization.rst @@ -0,0 +1,1139 @@ +.. meta:: + :description: Learn about vLLM V1 inference tuning on AMD Instinct GPUs for optimal performance. + :keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm, + environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL, + MIOpen, GPU, resource utilization + +.. _mi300x-vllm-optimization: +.. _vllm-optimization: + +******************************** +vLLM V1 performance optimization +******************************** + +This guide helps you maximize vLLM throughput and minimize latency on AMD +Instinct MI300X, MI325X, MI350X, and MI355X GPUs. Learn how to: + +* Enable AITER (AI Tensor Engine for ROCm) for speedups on LLM models. +* Configure environment variables for optimal HIP, RCCL, and Quick Reduce performance. +* Select the right attention backend for your workload (AITER MHA/MLA vs. Triton). +* Choose parallelism strategies (tensor, pipeline, data, expert) for multi-GPU deployments. +* Apply quantization (``FP8``/``FP4``) to reduce memory usage by 2-4× with minimal accuracy loss. +* Tune engine arguments (batch size, memory utilization, graph modes) for your use case. +* Benchmark and scale across single-node and multi-node configurations. + +Performance environment variables +================================= + +The following variables are generally useful for Instinct MI300X/MI355X GPUs and vLLM: + +* **HIP and math libraries** + + * ``export HIP_FORCE_DEV_KERNARG=1`` — improves kernel launch performance by + forcing device kernel arguments. This is already set by default in + :doc:`vLLM ROCm Docker images + `. Bare-metal users + should set this manually. + * ``export TORCH_BLAS_PREFER_HIPBLASLT=1`` — explicitly prefers hipBLASLt + over hipBLAS for GEMM operations. By default, PyTorch uses heuristics to + choose the best BLAS library. Setting this can improve linear layer + performance in some workloads. + +* **RCCL (collectives for multi-GPU)** + + * ``export NCCL_MIN_NCHANNELS=112`` — increases RCCL channels from default + (typically 32-64) to 112 on the Instinct MI300X. **Only beneficial for + multi-GPU distributed workloads** (tensor parallelism, pipeline + parallelism). 
Single-GPU inference does not need this. + +AITER (AI Tensor Engine for ROCm) switches +========================================== + +AITER (AI Tensor Engine for ROCm) provides ROCm-specific fused kernels optimized for Instinct MI350 Series and MI300X GPUs in vLLM V1. + +How AITER flags work: + +* ``VLLM_ROCM_USE_AITER`` is the master switch (defaults to ``False``/``0``). +* Individual feature flags (``VLLM_ROCM_USE_AITER_LINEAR``, ``VLLM_ROCM_USE_AITER_MOE``, and so on) default to ``True`` but only activate when the master switch is enabled. +* To enable a specific AITER feature, you must set both ``VLLM_ROCM_USE_AITER=1`` and the specific feature flag to ``1``. + +Quick start examples: + +.. code-block:: bash + + # Enable all AITER optimizations (recommended for most workloads) + export VLLM_ROCM_USE_AITER=1 + vllm serve MODEL_NAME + + # Enable only AITER Triton Prefill-Decode (split) attention + export VLLM_ROCM_USE_AITER=1 + export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 + export VLLM_ROCM_USE_AITER_MHA=0 + vllm serve MODEL_NAME + + # Disable AITER entirely (i.e, use vLLM Triton Unified Attention Kernel) + export VLLM_ROCM_USE_AITER=0 + vllm serve MODEL_NAME + +.. list-table:: + :header-rows: 1 + :widths: 30 70 + + * - Environment variable + - Description (default behavior) + + * - ``VLLM_ROCM_USE_AITER`` + - Master switch to enable AITER kernels (``0``/``False`` by default). All other ``VLLM_ROCM_USE_AITER_*`` flags require this to be set to ``1``. + + * - ``VLLM_ROCM_USE_AITER_LINEAR`` + - Use AITER quantization operators + GEMM for linear layers (defaults to ``True`` when AITER is on). Accelerates matrix multiplications in all transformer layers. **Recommended to keep enabled**. + + * - ``VLLM_ROCM_USE_AITER_MOE`` + - Use AITER fused-MoE kernels (defaults to ``True`` when AITER is on). Accelerates Mixture-of-Experts routing and computation. See the note on :ref:`AITER MoE requirements `. + + * - ``VLLM_ROCM_USE_AITER_RMSNORM`` + - Use AITER RMSNorm kernels (defaults to ``True`` when AITER is on). Accelerates normalization layers. **Recommended: keep enabled.** + + * - ``VLLM_ROCM_USE_AITER_MLA`` + - Use AITER Multi-head Latent Attention for supported models, for example, DeepSeek-V3/R1 (defaults to ``True`` when AITER is on). See the section on :ref:`AITER MLA requirements `. + + * - ``VLLM_ROCM_USE_AITER_MHA`` + - Use AITER Multi-Head Attention kernels (defaults to ``True`` when AITER is on; set to ``0`` to use Triton attention backends and Prefill-Decode attention backend instead). See :ref:`attention backend selection `. + + * - ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION`` + - Enable AITER's optimized unified attention kernel (defaults to ``False``). Only takes effect when: AITER is enabled; unified attention mode is active (``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0``); and AITER MHA is disabled (``VLLM_ROCM_USE_AITER_MHA=0``). When disabled, falls back to vLLM's Triton unified attention. + + * - ``VLLM_ROCM_USE_AITER_FP8BMM`` + - Use AITER ``FP8`` batched matmul (defaults to ``True`` when AITER is on). Fuses ``FP8`` per-token quantization with batched GEMM (used in MLA models like DeepSeek-V3). Requires an Instinct MI300X/MI355X GPU. + + * - ``VLLM_ROCM_USE_SKINNY_GEMM`` + - Prefer skinny-GEMM kernel variants for small batch sizes (defaults to ``True``). Improves performance when ``M`` dimension is small. **Recommended to keep enabled**. + + * - ``VLLM_ROCM_FP8_PADDING`` + - Pad ``FP8`` linear weight tensors to improve memory locality (defaults to ``True``). 
Minor memory overhead for better performance. + + * - ``VLLM_ROCM_MOE_PADDING`` + - Pad MoE weight tensors for better memory access patterns (defaults to ``True``). Same memory/performance tradeoff as ``FP8`` padding. + + * - ``VLLM_ROCM_CUSTOM_PAGED_ATTN`` + - Use custom paged-attention decode kernel when Prefill-Decode attention backend is selected (defaults to ``True``). See :ref:`Attention backend selection with AITER `. + +.. note:: + + When ``VLLM_ROCM_USE_AITER=1``, most AITER component flags (``LINEAR``, + ``MOE``, ``RMSNORM``, ``MLA``, ``MHA``, ``FP8BMM``) automatically default to + ``True``. You typically only need to set the master switch + ``VLLM_ROCM_USE_AITER=1`` to enable all optimizations. ROCm provides a + prebuilt optimized Docker image for validating the performance of LLM + inference with vLLM on MI300X Series GPUs. The Docker image includes ROCm, + vLLM, and PyTorch. For more information, see + :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. + +.. _vllm-optimization-aiter-moe-requirements: + +AITER MoE requirements (Mixtral, DeepSeek-V2/V3, Qwen-MoE models) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``VLLM_ROCM_USE_AITER_MOE`` enables AITER's optimized Mixture-of-Experts kernels, such as expert routing (topk selection) and expert computation for better performance. + +Applicable models: + +* Mixtral series: for example, Mixtral-8x7B / Mixtral-8x22B +* Llama-4 family: for example, Llama-4-Scout-17B-16E / Llama-4-Maverick-17B-128E +* DeepSeek family: DeepSeek-V2 / DeepSeek-V3 / DeepSeek-R1 +* Qwen family: Qwen1.5-MoE / Qwen2-MoE / Qwen2.5-MoE series +* Other MoE architectures + +When to enable: + +* **Enable (default):** For all MoE models on the Instinct MI300X/MI355X for best throughput +* **Disable:** Only for debugging or if you encounter numerical issues + +Example usage: + +.. code-block:: bash + + # Standard MoE model (Mixtral) + VLLM_ROCM_USE_AITER=1 vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 + + # Hybrid MoE+MLA model (DeepSeek-V3) - requires both MOE and MLA flags + VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-V3 \ + --block-size 1 \ + --tensor-parallel-size 8 + +.. _vllm-optimization-aiter-mla-requirements: + +AITER MLA requirements (DeepSeek-V3/R1 models) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``VLLM_ROCM_USE_AITER_MLA`` enables AITER MLA (Multi-head Latent Attention) optimization for supported models. Defaults to **True** when AITER is on. + +Critical requirement: + +* **Must** explicitly set ``--block-size 1`` + +.. important:: + + If you omit ``--block-size 1``, vLLM will raise an error rather than defaulting to 1. + +Applicable models: + +* DeepSeek-V3 / DeepSeek-R1 +* DeepSeek-V2 +* Other models using multi-head latent attention (MLA) architecture + +Example usage: + +.. code-block:: bash + + # DeepSeek-R1 with AITER MLA (requires 8 GPUs) + VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \ + --block-size 1 \ + --tensor-parallel-size 8 + +.. _vllm-optimization-aiter-backend-selection: + +Attention backend selection with AITER +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Understanding which attention backend to use helps optimize your deployment. + +Quick reference: Which attention backend will I get? 
+ +Default behavior (no configuration) + +Without setting any environment variables, vLLM uses: + +* **vLLM Triton Unified Attention** — A single Triton kernel handling both prefill and decode phases +* Works on all ROCm platforms +* Good baseline performance + +**Recommended**: Enable AITER (set ``VLLM_ROCM_USE_AITER=1``) + +When you enable AITER, the backend is automatically selected based on your model: + +.. code-block:: text + + Is your model using MLA architecture? (DeepSeek-V3/R1/V2) + ├─ YES → AITER MLA Backend + │ • Requires --block-size 1 + │ • Best performance for MLA models + │ • Automatically selected + │ + └─ NO → AITER MHA Backend + • For standard transformer models (Llama, Mistral, etc.) + • Optimized for Instinct MI300X/MI355X + • Automatically selected + +**Advanced**: Manual backend selection + +Most users won't need this, but you can override the defaults: + +.. list-table:: + :widths: 40 60 + :header-rows: 1 + + * - To use this backend + - Set these flags + + * - AITER MLA (MLA models only) + - ``VLLM_ROCM_USE_AITER=1`` (auto-selects for DeepSeek-V3/R1) + + * - AITER MHA (standard models) + - ``VLLM_ROCM_USE_AITER=1`` (auto-selects for non-MLA models) + + * - AITER Triton Prefill-Decode (split) + - | ``VLLM_ROCM_USE_AITER=1`` + | ``VLLM_ROCM_USE_AITER_MHA=0`` + | ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1`` + + * - vLLM Triton Unified (default) + - ``VLLM_ROCM_USE_AITER=0`` (or unset) + + * - AITER Unified Attention + - | ``VLLM_ROCM_USE_AITER=1`` + | ``VLLM_ROCM_USE_AITER_MHA=0`` + | ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1`` + +**Quick start examples**: + +.. code-block:: bash + + # Recommended: Standard model with AITER (Llama, Mistral, Qwen, etc.) + VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct + + # MLA model with AITER (DeepSeek-V3/R1) + VLLM_ROCM_USE_AITER=1 vllm serve deepseek-ai/DeepSeek-R1 \ + --block-size 1 \ + --tensor-parallel-size 8 + + # Advanced: Use Prefill-Decode split (for short input cases) + VLLM_ROCM_USE_AITER=1 \ + VLLM_ROCM_USE_AITER_MHA=0 \ + VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \ + vllm serve meta-llama/Llama-3.3-70B-Instruct + +**Which backend should I choose?** + +.. list-table:: + :widths: 30 70 + :header-rows: 1 + + * - Your use case + - Recommended backend + + * - **Standard transformer models** (Llama, Mistral, Qwen, Mixtral) + - **AITER MHA** (``VLLM_ROCM_USE_AITER=1``) — **Recommended for most workloads** on Instinct MI300X/MI355X. Provides optimized attention kernels for both prefill and decode phases. 
+ + * - **MLA models** (DeepSeek-V3/R1/V2) + - **AITER MLA** (auto-selected with ``VLLM_ROCM_USE_AITER=1``) — Required for optimal performance, must use ``--block-size 1`` + + * - **gpt-oss models** (gpt-oss-120b/20b) + - **AITER Unified Attention** (``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``) — Required for optimal performance + + * - **Debugging or compatibility** + - **vLLM Triton Unified** (default with ``VLLM_ROCM_USE_AITER=0``) — Generic fallback, works everywhere + +**Important notes:** + +* **AITER MHA and AITER MLA are mutually exclusive** — vLLM automatically detects MLA models and selects the appropriate backend +* **For 95% of users:** Simply set ``VLLM_ROCM_USE_AITER=1`` and let vLLM choose the right backend +* When in doubt, start with AITER enabled (the recommended configuration) and profile your specific workload + +Backend choice quick recipes +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* **Standard transformers (any prompt length):** Start with ``VLLM_ROCM_USE_AITER=1`` → AITER MHA. For CUDA graph modes, see architecture-specific guidance below (Dense vs MoE models have different optimal modes). +* **Latency-sensitive chat (low TTFT):** keep ``--max-num-batched-tokens`` ≤ **8k–16k** with AITER. +* **Streaming decode (low ITL):** raise ``--max-num-batched-tokens`` to **32k–64k**. +* **Offline max throughput:** ``--max-num-batched-tokens`` ≥ **32k** with ``cudagraph_mode=FULL``. + +**How to verify which backend is active** + +Check vLLM's startup logs to confirm which attention backend is being used: + +.. code-block:: bash + + # Start vLLM and check logs + VLLM_ROCM_USE_AITER=1 vllm serve meta-llama/Llama-3.3-70B-Instruct 2>&1 | grep -i attention + +**Expected log messages:** + +* AITER MHA: ``Using Aiter Flash Attention backend on V1 engine.`` +* AITER MLA: ``Using AITER MLA backend on V1 engine.`` +* vLLM Triton MLA: ``Using Triton MLA backend on V1 engine.`` +* vLLM Triton Unified: ``Using Triton Attention backend on V1 engine.`` +* AITER Triton Unified: ``Using Aiter Unified Attention backend on V1 engine.`` +* AITER Triton Prefill-Decode: ``Using Rocm Attention backend on V1 engine.`` + +Attention backend technical details +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This section provides technical details about vLLM's attention backends on ROCm. + +vLLM V1 on ROCm provides these attention implementations: + +1. **vLLM Triton Unified Attention** (default when AITER is **off**) + + * Single unified Triton kernel handling both chunked prefill and decode phases + * Generic implementation that works across all ROCm platforms + * Good baseline performance + * Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset) + * Supports GPT-OSS + +2. **AITER Triton Unified Attention** (advanced, requires manual configuration) + + * The AMD optimized unified Triton kernel + * Enable with ``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, and ``VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1``. + * Only useful for specific workloads. Most users should use AITER MHA instead. + * Recommended this backend when running GPT-OSS. + +3. 
**AITER Triton Prefill–Decode Attention** (hybrid, Instinct MI300X-optimized) + + * Enable with ``VLLM_ROCM_USE_AITER=1``, ``VLLM_ROCM_USE_AITER_MHA=0``, and ``VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1`` + * Uses separate kernels for prefill and decode phases: + + * **Prefill**: ``context_attention_fwd`` Triton kernel + * **Primary decode**: ``torch.ops._rocm_C.paged_attention`` (custom ROCm kernel optimized for head sizes 64/128, block sizes 16/32, GQA 1–16, context ≤131k; sliding window not supported) + * **Fallback decode**: ``kernel_paged_attention_2d`` Triton kernel when shapes don't meet primary decode requirements + + * Usually better compared to unified Triton kernels (both vLLM and AITER variants) + * Performance vs AITER MHA varies: AITER MHA is typically faster overall, but Prefill-Decode split may win in short input scenarios + * The custom paged attention decode kernel is controlled by ``VLLM_ROCM_CUSTOM_PAGED_ATTN`` (default **True**) + +4. **AITER Multi-Head Attention (MHA)** (default when AITER is **on**) + + * Controlled by ``VLLM_ROCM_USE_AITER_MHA`` (**1** = enabled) + * Best all-around performance for standard transformer models + * Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and model is not MLA + +5. **vLLM Triton Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2) + + * Automatically selected when ``VLLM_ROCM_USE_AITER=0`` (or unset) + +6. **AITER Multi-head Latent Attention (MLA)** (for DeepSeek-V3/R1/V2) + + * Controlled by ``VLLM_ROCM_USE_AITER_MLA`` (``1`` = enabled) + * Required for optimal performance on MLA architecture models + * Automatically selected when ``VLLM_ROCM_USE_AITER=1`` and model uses MLA + * Requires ``--block-size 1`` + +Quick Reduce (large all-reduces on ROCm) +======================================== + +**Quick Reduce** is an alternative to RCCL/custom all-reduce for **large** inputs (MI300-class GPUs). +It supports FP16/BF16 as well as symmetric INT8/INT6/INT4 quantized all-reduce (group size 32). + +.. warning:: + + Quantization can affect accuracy. Validate quality before deploying. + +Control via: + +* ``VLLM_ROCM_QUICK_REDUCE_QUANTIZATION`` ∈ ``["NONE","FP","INT8","INT6","INT4"]`` (default ``NONE``). +* ``VLLM_ROCM_QUICK_REDUCE_CAST_BF16_TO_FP16``: cast BF16 input to FP16 (``1/True`` by default for performance). +* ``VLLM_ROCM_QUICK_REDUCE_MAX_SIZE_BYTES_MB``: cap the preset buffer (default ``NONE`` ≈ ``2048`` MB). + +Quick Reduce tends to help **throughput** at higher TP counts (for example, 4–8) with many concurrent requests. + +Parallelism strategies (run vLLM on multiple GPUs) +================================================== + +vLLM supports the following parallelism strategies: + +1. Tensor parallelism +2. Pipeline parallelism +3. Data parallelism +4. Expert parallelism + +For more details, see `Parallelism and scaling `_. + +**Choosing the right strategy:** + +* **Tensor Parallelism (TP)**: Use when model doesn't fit on one GPU. Prefer staying within a single XGMI island (≤8 GPUs on the Instinct MI300X). +* **Pipeline Parallelism (PP)**: Use for very large models across nodes. Set TP to GPUs per node, scale with PP across nodes. +* **Data Parallelism (DP)**: Use when model fits on single GPU or TP group, and you need higher throughput. Combine with TP/PP for large models. +* **Expert Parallelism (EP)**: Use for MoE models with ``--enable-expert-parallel``. More efficient than TP for MoE layers. 
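+
+The sections below describe each strategy in detail. As a quick, hedged sizing
+sketch (the model path and GPU counts here are placeholders), keep in mind that
+the product of the parallel sizes must match the total number of GPUs in use
+(``TP × PP × DP``). For example, a single eight-GPU node could be laid out in
+any of the following ways:
+
+.. code-block:: bash
+
+   # Illustrative layouts for one 8-GPU node; total GPUs = TP x PP x DP.
+
+   # (a) One engine sharded across all 8 GPUs (TP=8)
+   vllm serve /path/to/model --tensor-parallel-size 8
+
+   # (b) Two data-parallel replicas, each sharded across 4 GPUs (DP=2 x TP=4).
+   #     See the data parallelism section below for the ROCm-specific DP flags.
+   vllm serve /path/to/model --data-parallel-size 2 --tensor-parallel-size 4
+
+   # (c) Tensor plus pipeline split within the node (TP=4 x PP=2)
+   vllm serve /path/to/model --tensor-parallel-size 4 --pipeline-parallel-size 2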
+ +Tensor parallelism +^^^^^^^^^^^^^^^^^^ + +Tensor parallelism splits each layer of the model weights across multiple GPUs when the model doesn't fit on a single GPU. This is primarily for memory capacity. + +**Use tensor parallelism when:** + +* Model does not fit on one GPU (OOM) +* Need to enable larger batch sizes by distributing KV cache across GPUs + +**Examples:** + +.. code-block:: bash + + # Tensor parallelism: Split model across 2 GPUs + vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 + + # Combining TP and two vLLM instance, each split across 2 GPUs (4 GPUs total) + CUDA_VISIBLE_DEVICES=0,1 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8000 + CUDA_VISIBLE_DEVICES=2,3 vllm serve /path/to/model --dtype float16 --tensor-parallel-size 2 --port 8001 + +.. note:: + **ROCm GPU visibility:** vLLM on ROCm reads ``CUDA_VISIBLE_DEVICES``. Keep ``HIP_VISIBLE_DEVICES`` unset to avoid conflicts. + +.. tip:: + For structured data parallelism deployments with load balancing, see :ref:`data-parallelism-section`. + +Pipeline parallelism +^^^^^^^^^^^^^^^^^^^^ + +Pipeline parallelism splits the model's layers across multiple GPUs or nodes, with each GPU processing different layers sequentially. This is primarily used for multi-node deployments where the model is too large for a single node. + +**Use pipeline parallelism when:** + +* Model is too large for a single node (combine PP with TP) +* GPUs on a node lack high-speed interconnect (e.g., no NVLink/XGMI) - PP may perform better than TP +* GPU count doesn't evenly divide the model (PP supports uneven splits) + +**Common pattern for multi-node:** + +.. code-block:: bash + + # 2 nodes × 8 GPUs = 16 GPUs total + # TP=8 per node, PP=2 across nodes + vllm serve meta-llama/Llama-3.1-405B-Instruct \ + --tensor-parallel-size 8 \ + --pipeline-parallel-size 2 + +.. note:: + **ROCm best practice**: On the Instinct MI300X, prefer staying within a single XGMI island (≤8 GPUs) using TP only. Use PP when scaling beyond eight GPUs or across nodes. + +.. _data-parallelism-section: + +Data parallelism +^^^^^^^^^^^^^^^^ + +Data parallelism replicates model weights across separate instances/GPUs to process independent batches of requests. This approach increases throughput by distributing the workload across multiple replicas. + +**Use data parallelism when:** + +* Model fits on one GPU, but you need higher request throughput +* Scaling across multiple nodes horizontally +* Combining with tensor parallelism (for example, DP=2 + TP=4 = 8 GPUs total) + +**Quick start - single-node:** + +.. code-block:: bash + + # Model fit in 1 GPU. Creates 2 model replicas (requires 2 GPUs) + VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve /path/to/model \ + --data-parallel-size 2 \ + --disable-nccl-for-dp-synchronization + +.. tip:: + For ROCm, currently use ``VLLM_ALL2ALL_BACKEND="allgather_reducescatter"`` and ``--disable-nccl-for-dp-synchronization`` with data parallelism. + +Choosing a load balancing strategy +""""""""""""""""""""""""""""""""""" + +vLLM supports two modes for routing requests to DP ranks: + +.. 
list-table:: + :header-rows: 1 + :widths: 30 35 35 + + * - + - **Internal LB** (recommended) + - **External LB** + * - **HTTP endpoints** + - 1 endpoint, vLLM routes internally + - N endpoints, you provide external router + * - **Single-node config** + - ``--data-parallel-size N`` + - ``--data-parallel-size N --data-parallel-rank 0..N-1`` + different ports + * - **Multi-node config** + - ``--data-parallel-size``, ``--data-parallel-size-local``, ``--data-parallel-address`` + - ``--data-parallel-size N --data-parallel-rank 0..N-1`` + ``--data-parallel-address`` + * - **Client view** + - Single URL/port + - Multiple URLs/ports + * - **Load balancer** + - Built-in (vLLM handles) + - External (Nginx, Kong, K8s Service) + * - **Coordination** + - DP ranks sync via RPC (for MoE/MLA) + - DP ranks sync via RPC (for MoE/MLA) + * - **Best for** + - Most deployments (simpler) + - K8s/cloud environments with existing LB + +.. tip:: + **Dense (non-MoE) models only:** You can run fully independent ``vllm serve`` instances without any DP flags, using your own load balancer. This avoids RPC coordination overhead entirely. + +For more technical details, see `vLLM Data Parallel Deployment `_ + +Data Parallel Attention (advanced) +"""""""""""""""""""""""""""""""""" + +For models with Multi-head Latent Attention (MLA) architecture like DeepSeek V2, V3, and R1, vLLM supports **Data Parallel Attention**, +which provides request-level parallelism instead of model replication. This avoids KV cache duplication across tensor parallel ranks, +significantly reducing memory usage and enabling larger batch sizes. + +**Key benefits for MLA models:** + +* Eliminates KV cache duplication when using tensor parallelism +* Enables higher throughput for high-QPS serving scenarios +* Better memory efficiency for large context windows + +**Usage with Expert Parallelism:** + +Data parallel attention works seamlessly with Expert Parallelism for MoE models: + +.. code-block:: bash + + # DeepSeek-R1 with DP attention and expert parallelism + VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \ + --data-parallel-size 8 \ + --enable-expert-parallel \ + --disable-nccl-for-dp-synchronization + +For more technical details, see `vLLM RFC #16037 `_. + +Expert parallelism +^^^^^^^^^^^^^^^^^^ + +Expert parallelism (EP) distributes expert layers of Mixture-of-Experts (MoE) models across multiple GPUs, +where tokens are routed to the GPUs holding the experts they need. + +**Performance considerations:** + +Expert parallelism is designed primarily for cross-node MoE deployments where high-bandwidth interconnects (like InfiniBand) between nodes make EP communication efficient. For single-node Instinct MI300X/MI355X deployments with XGMI connectivity, tensor parallelism typically provides better performance due to optimized all-to-all collectives on XGMI. + +**When to use EP:** + +* Multi-node MoE deployments with fast inter-node networking +* Models with very large numbers of experts that benefit from expert distribution +* Workloads where EP's reduced data movement outweighs communication overhead + +**Single-node recommendation:** For Instinct MI300X/MI355X within a single node (≤8 GPUs), prefer tensor parallelism over expert parallelism for MoE models to leverage XGMI's high bandwidth and low latency. + +**Basic usage:** + +.. 
code-block:: bash
+
+   # Enable expert parallelism for MoE models (DeepSeek example with 8 GPUs)
+   vllm serve deepseek-ai/DeepSeek-R1 \
+       --tensor-parallel-size 8 \
+       --enable-expert-parallel
+
+**Combining with Tensor Parallelism:**
+
+When EP is enabled alongside tensor parallelism:
+
+* Fused MoE layers use expert parallelism
+* Non-fused MoE layers use tensor parallelism
+
+**Combining with Data Parallelism:**
+
+EP works seamlessly with Data Parallel Attention for optimal memory efficiency in MLA+MoE models (for example, DeepSeek V3):
+
+.. code-block:: bash
+
+   # DP attention + EP for DeepSeek-R1
+   VLLM_ALL2ALL_BACKEND="allgather_reducescatter" vllm serve deepseek-ai/DeepSeek-R1 \
+       --data-parallel-size 8 \
+       --enable-expert-parallel \
+       --disable-nccl-for-dp-synchronization
+
+Throughput benchmarking
+=======================
+
+This guide evaluates LLM inference performance by throughput in tokens per second (TPS).
+vLLM provides a built-in benchmark:
+
+.. code-block:: bash
+
+   # Synthetic or dataset-driven benchmark
+
+   vllm bench throughput --model /path/to/model [other args]
+
+* **Real-world dataset** (ShareGPT) example:
+
+  .. code-block:: bash
+
+     wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+     vllm bench throughput --model /path/to/model --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json
+
+* **Synthetic**: set fixed ``--input-len`` and ``--output-len`` for reproducible runs.
+
+.. tip::
+
+   **Profiling checklist (ROCm)**
+
+   1. Fix your prompt distribution (ISL/OSL) and **vary one knob at a time** (graph mode, MBT).
+   2. Measure **TTFT**, **ITL**, and **TPS** together; don't optimize one in isolation.
+   3. Compare graph modes: **PIECEWISE** (balanced) vs **FULL**/``FULL_DECODE_ONLY`` (max throughput).
+   4. Sweep ``--max-num-batched-tokens`` around **8k–64k** to find your latency/throughput balance.
+
+Maximizing instances per node
+=============================
+
+To maximize **per-node throughput**, run as many vLLM instances as model memory allows
+while leaving each instance enough headroom for its KV cache.
+
+* **HBM capacities**: MI300X = 192 GB HBM3; MI355X = 288 GB HBM3E.
+
+* Up to **eight** single-GPU vLLM instances can run in parallel on an 8×GPU node (one per GPU):
+
+  .. code-block:: bash
+
+     # Launch one single-GPU benchmark per device
+     for i in $(seq 0 7); do
+       CUDA_VISIBLE_DEVICES="$i" vllm bench throughput \
+         -tp 1 --model /path/to/model \
+         --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json &
+     done
+
+Total throughput from **N** single-GPU instances usually exceeds that of one instance stretched across **N** GPUs (``-tp N``).
+
+**Model coverage**: Llama 2 (7B/13B/70B), Llama 3 (8B/70B), Qwen2 (7B/72B), Mixtral-8x7B/8x22B, and others.
+Llama2-70B and Llama3-70B fit on a single MI300X/MI355X GPU; Llama3.1-405B fits on a single 8×MI300X/MI355X node.
+
+Configure the gpu-memory-utilization parameter
+==============================================
+
+The ``--gpu-memory-utilization`` parameter controls the fraction of each GPU's memory that the vLLM engine may use
+for model weights, activations, and the KV cache. The default is **0.9** (90%).
+
+There are two strategies:
+
+1. **Increase** ``--gpu-memory-utilization`` to maximize throughput for a single instance (up to **0.95**).
+   Example:
+
+   .. code-block:: bash
+
+      vllm serve meta-llama/Llama-3.3-70B-Instruct \
+        --gpu-memory-utilization 0.95 \
+        --max-model-len 8192 \
+        --port 8000
+
+2. **Decrease** to pack **multiple** instances on the same GPU (for small models like 7B/8B), keeping KV-cache viable:
+
+   .. 
code-block:: bash + + # Instance 1 on GPU 0 + CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --gpu-memory-utilization 0.45 \ + --max-model-len 4096 \ + --port 8000 + + # Instance 2 on GPU 0 + CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-Guard-3-8B \ + --gpu-memory-utilization 0.45 \ + --max-model-len 4096 \ + --port 8001 + +vLLM engine arguments +===================== + +Selected arguments that often help on ROCm. See `engine args docs `_ for the full list. + +Configure --max-num-seqs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The default value is **1024** in vLLM V1 (increased from **256** in V0). This flag controls the maximum number of sequences processed per batch, directly affecting concurrency and memory usage. + +* **To increase throughput**: Raise to **2048** or **4096** if memory allows, enabling more sequences per iteration. +* **To reduce memory usage**: Lower to **256** or **128** for large models or long-context generation. For example, set ``--max-num-seqs 128`` to reduce concurrency and lower memory requirements. + +In vLLM V1, KV-cache token requirements are computed as ``max-num-seqs * max-model-len``. + +Example usage: + +.. code-block:: bash + + vllm serve --max-num-seqs 128 --max-model-len 8192 + +Configure --max-num-batched-tokens +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Chunked prefill is enabled by default** in vLLM V1. + +* Lower values improve **ITL** (less prefill interrupting decode). +* Higher values improve **TTFT** (more prefill per batch). + +Defaults: **8192** for online serving, **16384** for offline. However, optimal values vary significantly by model size. Smaller models can efficiently handle larger batch sizes. Setting it near ``--max-model-len`` mimics V0 behavior and often maximizes throughput. + +**Guidance:** + +* **Interactive (low TTFT)**: keep MBT ≤ **8k–16k**. +* **Streaming (low ITL)**: MBT **16k–32k**. +* **Offline max throughput**: MBT **≥32k** (diminishing TPS returns beyond ~32k). + +**Pattern:** Smaller/more efficient models benefit from larger batch sizes. MoE models with expert parallelism can handle very large batches efficiently. + +**Rule of thumb** + +* Push MBT **up** to trade TTFT↑ for ITL↓ and slightly higher TPS. +* Pull MBT **down** to trade ITL↑ for TTFT↓ (interactive UX). + +Async scheduling +^^^^^^^^^^^^^^^^ + +``--async-scheduling`` (replaces deprecated ``num_scheduler_steps``) can improve throughput/ITL by trading off TTFT. +Prefer **off** for latency-sensitive serving; **on** for offline batch throughput. + +CUDA graphs configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +CUDA graphs reduce kernel launch overhead by capturing and replaying GPU operations, improving inference throughput. Configure using ``--compilation-config '{"cudagraph_mode": "MODE"}'``. + +**Available modes:** + +* ``NONE`` — CUDA graphs disabled (debugging) +* ``PIECEWISE`` — Attention stays eager, other ops use CUDA graphs (most compatible) +* ``FULL`` — Full CUDA graphs for all batches (best for small models/prompts) +* ``FULL_DECODE_ONLY`` — Full CUDA graphs only for decode (saves memory in prefill/decode split setups) +* ``FULL_AND_PIECEWISE`` — **(default)** Full graphs for decode + piecewise for prefill (best performance, highest memory) + +**Default behavior:** V1 defaults to ``FULL_AND_PIECEWISE`` with piecewise compilation enabled; otherwise ``NONE``. + +**Backend compatibility:** Not all attention backends support all CUDA graph modes. Choose a mode your backend supports: + +.. 
list-table:: + :header-rows: 1 + :widths: 40 60 + + * - Attention backend + - CUDA graph support + * - vLLM/AITER Triton Unified Attention, vLLM Prefill-Decode Attention + - Full support (prefill + decode) + * - AITER MHA, AITER MLA + - Uniform batches only + * - vLLM Triton MLA + - Must exclude attention from graph — ``PIECEWISE`` required + +**Usage examples:** + +.. code-block:: bash + + # Default (best performance, highest memory) + vllm serve meta-llama/Llama-3.1-8B-Instruct + + # Decode-only graphs (lower memory, good for P/D split) + vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' + + # Full graphs for offline throughput (small models) + vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --compilation-config '{"cudagraph_mode": "FULL"}' + +**Migration from legacy flags:** + +* ``use_cudagraph=False`` → ``NONE`` +* ``use_cudagraph=True, full_cuda_graph=False`` → ``PIECEWISE`` +* ``full_cuda_graph=True`` → ``FULL`` (with automatic fallback) + +Quantization support +==================== + +vLLM supports FP4/FP8 (4-bit/8-bit floating point) weight and activation quantization using hardware acceleration on the Instinct MI300X and MI355X. +Quantization of models with FP4/FP8 allows for a **2x-4x** reduction in model memory requirements and up to a **1.6x** +improvement in throughput with minimal impact on accuracy. + +vLLM ROCm supports a variety of quantization demands: + +* On-the-fly quantization + +* Pre-quantized model through Quark and llm-compressor + +Supported quantization methods +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +vLLM on ROCm supports the following quantization methods for the AMD Instinct MI300 series and Instinct MI355X GPUs: + +.. list-table:: + :header-rows: 1 + :widths: 20 15 15 20 30 + + * - Method + - Precision + - ROCm support + - Memory reduction + - Best use case + * - **FP8** (W8A8) + - 8-bit float + - Excellent + - 2× (50%) + - Production, balanced speed/accuracy + * - **PTPC-FP8** + - 8-bit float + - Excellent + - 2× (50%) + - High throughput, better than ``FP8`` + * - **AWQ** + - 4-bit int (W4A16) + - Good + - 4× (75%) + - Large models, memory-constrained + * - **GPTQ** + - 4-bit/8-bit int + - Good + - 2-4× (50-75%) + - Pre-quantized models available + * - **FP8 KV-cache** + - 8-bit float + - Excellent + - KV cache: 50% + - All inference workloads + * - **Quark (AMD)** + - ``FP8``/``MXFP4`` + - Optimized + - 2-4× (50-75%) + - AMD pre-quantized models + * - **compressed-tensors** + - W8A8 ``INT8``/``FP8`` + - Good + - 2× (50%) + - LLM Compressor models + +**ROCm support key:** + +- Excellent: Fully supported with optimized kernels +- Good: Supported, might not have AMD-optimized kernels +- Optimized: AMD-specific optimizations available + +Using Pre-quantized Models +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +AMD provides pre-quantized models optimized for ROCm. These models are ready to use with vLLM. + +**AMD Quark-quantized models**: + +Available on `Hugging Face `_: + +* `Llama‑3.1‑8B‑Instruct‑FP8‑KV `__ (FP8 W8A8) +* `Llama‑3.1‑70B‑Instruct‑FP8‑KV `__ (FP8 W8A8) +* `Llama‑3.1‑405B‑Instruct‑FP8‑KV `__ (FP8 W8A8) +* `Mixtral‑8x7B‑Instruct‑v0.1‑FP8‑KV `__ (FP8 W8A8) +* `Mixtral‑8x22B‑Instruct‑v0.1‑FP8‑KV `__ (FP8 W8A8) +* `Llama-3.3-70B-Instruct-MXFP4-Preview `__ (MXFP4 for MI350/MI355) +* `Llama-3.1-405B-Instruct-MXFP4-Preview `__ (MXFP4 for MI350/MI355) +* `DeepSeek-R1-0528-MXFP4-Preview `__ (MXFP4 for MI350/MI355) + +**Quick start**: + +.. 
code-block:: bash + + # FP8 W8A8 Quark model + vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \ + --dtype auto + + # MXFP4 Quark model for MI350/MI355 + vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \ + --dtype auto \ + --tensor-parallel-size 1 + +**Other pre-quantized models**: + +- AWQ models: `Hugging Face awq flag `_ +- GPTQ models: `Hugging Face gptq flag `_ +- LLM Compressor models: `Hugging Face compressed-tensors flag `_ + +On-the-fly quantization +^^^^^^^^^^^^^^^^^^^^^^^^ + +For models without pre-quantization, vLLM can quantize ``FP16``/``BF16`` models at server startup. + +**Supported methods**: + +- ``fp8``: Per-tensor ``FP8`` weight and activation quantization +- ``ptpc_fp8``: Per-token-activation per-channel-weight ``FP8`` (better accuracy same ``FP8`` speed). See `PTPC-FP8 on ROCm blog post `_ for details + +**Usage:** + +.. code-block:: bash + + # On-the-fly FP8 quantization + vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --quantization fp8 \ + --dtype auto + + # On-the-fly PTPC-FP8 (recommended as default) + vllm serve meta-llama/Llama-3.1-70B-Instruct \ + --quantization ptpc_fp8 \ + --dtype auto \ + --tensor-parallel-size 4 + +.. note:: + + On-the-fly quantization adds two to five minutes of startup time but eliminates pre-quantization. For production with frequent restarts, use pre-quantized models. + +GPTQ +^^^^ + +GPTQ is a 4-bit/8-bit weight quantization method that compresses models with minimal accuracy loss. GPTQ +is fully supported on ROCm via HIP-compiled kernels in vLLM. + +**ROCm support status**: + +- **Fully supported** - GPTQ kernels compile and run on ROCm via HIP +- **Pre-quantized models work** with standard GPTQ kernels + +**Recommendation**: For the AMD Instinct MI300X, **AWQ with Triton kernels** or **FP8 quantization** might provide better +performance due to ROCm-specific optimizations, but GPTQ is a viable alternative. + +**Using pre-quantized GPTQ models**: + +.. code-block:: bash + + # Using pre-quantized GPTQ model on ROCm + vllm serve RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 \ + --quantization gptq \ + --dtype auto \ + --tensor-parallel-size 1 + +**Important notes**: + +- **Kernel support:** GPTQ uses standard HIP-compiled kernels on ROCm +- **Performance:** AWQ with Triton kernels might offer better throughput on AMD GPUs due to ROCm optimizations +- **Compatibility:** GPTQ models from Hugging Face work on ROCm with standard performance +- **Use case:** GPTQ is suitable when pre-quantized GPTQ models are readily available + +AWQ (Activation-aware Weight Quantization) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +AWQ (Activation-aware Weight Quantization) is a 4-bit weight quantization technique that provides excellent +model compression with minimal accuracy loss (<1%). ROCm supports AWQ quantization on the AMD Instinct MI300 series and +MI355X GPUs with vLLM. + +**Using pre-quantized AWQ models:** + +Many AWQ-quantized models are available on Hugging Face. Use them directly with vLLM: + +.. code-block:: bash + + # vLLM serve with AWQ model + VLLM_USE_TRITON_AWQ=1 \ + vllm serve hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \ + --quantization awq \ + --tensor-parallel-size 1 \ + --dtype auto + +**Important Notes:** + +* **ROCm requirement:** Set ``VLLM_USE_TRITON_AWQ=1`` to enable Triton-based AWQ kernels on ROCm +* **dtype parameter:** AWQ requires ``--dtype auto`` or ``--dtype float16``. The ``--dtype`` flag controls + the **activation dtype** (``FP16``/``BF16`` for computations), not the weight dtype. 
AWQ weights remain as INT4 + (4-bit integers) as specified in the model's quantization config, but are dequantized to ``FP16``/``BF16`` during + matrix multiplication operations. +* **Group size:** 128 is recommended for optimal performance/accuracy balance +* **Model compatibility:** AWQ is primarily tested on Llama, Mistral, and Qwen model families + +Quark (AMD quantization toolkit) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +AMD Quark is the AMD quantization toolkit optimized for ROCm. It supports ``FP8 W8A8``, ``MXFP4``, ``W8A8 INT8``, and +other quantization formats with native vLLM integration. The quantization format will automatically be inferred +from the model config file, so you can omit ``--quantization quark``. + +**Running Quark Models:** + +.. code-block:: bash + + # FP8 W8A8: Single GPU + vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV \ + --dtype auto \ + --max-model-len 8192 \ + --gpu-memory-utilization 0.90 + + # MXFP4: Extreme memory efficiency + vllm serve amd/Llama-3.3-70B-Instruct-MXFP4-Preview \ + --dtype auto \ + --tensor-parallel-size 1 \ + --max-model-len 8192 + +**Key features:** + +- **FP8 models**: ~50% memory reduction, 2× compression +- **MXFP4 models**: ~75% memory reduction, 4× compression +- **Embedded scales**: Quark FP8-KV models include pre-calibrated KV-cache scales +- **Hardware optimized**: Leverages the AMD Instinct MI300 series ``FP8`` acceleration + +For creating your own Quark-quantized models, see `Quark Documentation `_. + +FP8 kv-cache dtype +^^^^^^^^^^^^^^^^^^^^ + +FP8 KV-cache quantization reduces memory footprint by approximately 50%, enabling longer context lengths +or higher concurrency. ROCm supports FP8 KV-cache with both ``fp8_e4m3`` and ``fp8_e5m2`` formats on +AMD Instinct MI300 series and other CDNA™ GPUs. + +Use ``--kv-cache-dtype fp8`` to enable ``FP8`` KV-cache quantization. For best accuracy, use calibrated +scaling factors generated via `LLM Compressor `_. +Without calibration, scales are calculated dynamically (``--calculate-kv-scales``) with minimal +accuracy impact. + + +**Quick start (dynamic scaling)**: + +.. code-block:: bash + + # vLLM serve with dynamic FP8 KV-cache + vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --kv-cache-dtype fp8 \ + --calculate-kv-scales \ + --gpu-memory-utilization 0.90 + +**Calibrated scaling (advanced)**: + +For optimal accuracy, pre-calibrate KV-cache scales using representative data. The calibration process: + +#. Runs the model on calibration data (512+ samples recommended) +#. Computes optimal ``FP8`` quantization scales for key/value cache tensors +#. Embeds these scales into the saved model as additional parameters +#. vLLM loads the model and uses the embedded scales automatically when ``--kv-cache-dtype fp8`` is specified + +The quantized model can be used like any other model. The embedded scales are stored as part of the model weights. + +**Using pre-calibrated models:** + +AMD provides ready-to-use models with pre-calibrated ``FP8`` KV cache scales: + +* `amd/Llama-3.1-8B-Instruct-FP8-KV `_ +* `amd/Llama-3.3-70B-Instruct-FP8-KV `_ + +To verify a model has pre-calibrated KV cache scales, check ``config.json`` for: + +.. code-block:: json + + "quantization_config": { + "kv_cache_scheme": "static" // Indicates pre-calibrated scales are embedded + } + +**Creating your own calibrated model:** + +.. code-block:: bash + + # 1. Install LLM Compressor + pip install llmcompressor + + # 2. Run calibration script (see llm-compressor repo for full example) + python llama3_fp8_kv_example.py + + # 3. 
Use calibrated model in vLLM + vllm serve ./Meta-Llama-3-8B-Instruct-FP8-KV \ + --kv-cache-dtype fp8 + +For detailed instructions and the complete calibration script, see the `FP8 KV Cache Quantization Guide `_. + +**Format options**: + +- ``fp8`` or ``fp8_e4m3``: Higher precision (default, recommended) +- ``fp8_e5m2``: Larger dynamic range, slightly lower precision + +Speculative decoding (experimental) +=================================== + +Recent vLLM versions add support for speculative decoding backends (for example, Eagle‑v3). Evaluate for your model and latency/throughput goals. +Speculative decoding is a technique to reduce latency when max number of concurrency is low. +Depending on the methods, the effective concurrency varies, for example, from 16 to 64. + +Example command: + +.. code-block:: bash + + vllm serve meta-llama/Llama-3.1-8B-Instruct \ + --trust-remote-code \ + --swap-space 16 \ + --disable-log-requests \ + --tensor-parallel-size 1 \ + --distributed-executor-backend mp \ + --dtype float16 \ + --quantization fp8 \ + --kv-cache-dtype fp8 \ + --no-enable-chunked-prefill \ + --max-num-seqs 300 \ + --max-num-batched-tokens 131072 \ + --gpu-memory-utilization 0.8 \ + --speculative_config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 2, "draft_tensor_parallel_size": 1, "dtype": "float16"}' \ + --port 8001 + + +.. important:: + + It has been observed that more ``num_speculative_tokens`` causes less + acceptance rate of draft model tokens and a decline in throughput. As a + workaround, set ``num_speculative_tokens`` to <= 2. + + +Multi-node checklist and troubleshooting +======================================== + +1. Use ``--distributed-executor-backend ray`` across nodes to manage HIP-visible ranks and RCCL communicators. (``ray`` is the default for multi-node. Explicitly setting this flag is optional.) +2. Ensure ``/dev/shm`` is shared across ranks (Docker ``--shm-size``, Kubernetes ``emptyDir``), as RCCL uses shared memory for rendezvous. +3. For GPUDirect RDMA, set ``RCCL_NET_GDR_LEVEL=2`` and verify links (``ibstat``). Requires supported NICs (for example, ConnectX‑6+). +4. Collect RCCL logs: ``RCCL_DEBUG=INFO`` and optionally ``RCCL_DEBUG_SUBSYS=INIT,GRAPH`` for init/graph stalls. + +Further reading +=============== + +* :doc:`workload` +* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm` diff --git a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst index 30e86e277..7cd2c7fc6 100644 --- a/docs/how-to/rocm-for-ai/inference-optimization/workload.rst +++ b/docs/how-to/rocm-for-ai/inference-optimization/workload.rst @@ -15,10 +15,9 @@ using PyTorch. It delves into specific workloads such as :ref:`model inference `, offering strategies to enhance efficiency. -The following topics highlight :ref:`auto-tunable configurations ` -that streamline optimization as well as advanced techniques like -:ref:`Triton kernel optimization ` for -meticulous tuning. +The following topics highlight :ref:`auto-tunable configurations ` as +well as :ref:`Triton kernel optimization ` +for meticulous tuning. Workload tuning strategy ======================== @@ -86,23 +85,22 @@ Optimize model inference with vLLM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ vLLM provides tools and techniques specifically designed for efficient model -inference on AMD Instinct MI300X GPUs. See :ref:`fine-tuning-llms-vllm` -for installation guidance. 
Optimizing performance with vLLM -involves configuring tensor parallelism, leveraging advanced features, and -ensuring efficient execution. Here’s how to optimize vLLM performance: +inference on AMD Instinct GPUs. See the official `vLLM installation docs +`__ for +installation guidance. Optimizing performance with vLLM involves configuring +tensor parallelism, leveraging advanced features, and ensuring efficient +execution. -* Tensor parallelism: Configure the - :ref:`tensor-parallel-size parameter ` to distribute - tensor computations across multiple GPUs. Adjust parameters such as - ``batch-size``, ``input-len``, and ``output-len`` based on your workload. - -* Configuration for vLLM: Set :ref:`parameters ` - according to workload requirements. Benchmark performance to understand - characteristics and identify bottlenecks. +* Configuration for vLLM: Set engine arguments according to workload + requirements. * Benchmarking and performance metrics: Measure latency and throughput to evaluate performance. +.. seealso:: + + See :doc:`vllm-optimization`. + .. _mi300x-auto-tune: Auto-tunable configurations @@ -120,8 +118,7 @@ characteristics. For example: your specific hardware. * Triton: Use :ref:`Triton’s auto-tuning features ` - to explore various kernel configurations and automatically select the - best-performing ones. + to explore various kernel configurations and select the best-performing ones. Manual tuning ^^^^^^^^^^^^^ @@ -328,381 +325,6 @@ hardware counters are also included. ROCm Systems Profiler timeline trace example. -.. _mi300x-vllm-optimization: - -vLLM performance optimization -============================= - -vLLM is a high-throughput and memory efficient inference and serving engine for large language models that has gained traction in the AI community for -its performance and ease of use. See :ref:`fine-tuning-llms-vllm` for a primer on vLLM with ROCm. - -Performance environment variables ---------------------------------- - -The following performance tips are not *specific* to vLLM -- they are general -but relevant in this context. You can tune the following vLLM parameters to -achieve optimal request latency and throughput performance. - -* As described in `Environment variables (MI300X) - `_, - the environment variable ``HIP_FORCE_DEV_KERNARG`` can improve vLLM - performance. Set it to ``export HIP_FORCE_DEV_KERNARG=1``. - -* Set the :ref:`RCCL environment variable ` ``NCCL_MIN_NCHANNELS`` - to ``112`` to increase the number of channels on MI300X to potentially improve - performance. - -* Set the environment variable ``TORCH_BLAS_PREFER_HIPBLASLT=1`` to use hipBLASLt to improve performance. - -Auto-tuning using PyTorch TunableOp ------------------------------------- - -Since vLLM is based on the PyTorch framework, PyTorch TunableOp can be used for auto-tuning. -You can run auto-tuning with TunableOp in two simple steps without modifying your code: - -* Enable TunableOp and tuning. Optionally, enable verbose mode: - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_VERBOSE=1 your_vllm_script.sh - -* Enable TunableOp and disable tuning and measure. - - .. code-block:: shell - - PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_vllm_script.sh - -Learn more about TunableOp in the :ref:`PyTorch TunableOp ` section. - -Performance tuning based on vLLM engine configurations -------------------------------------------------------- - -The following subsections describe vLLM-specific configurations for performance tuning. 
-You can tune the following vLLM parameters to achieve optimal performance. - -* ``tensor_parallel_size`` - -* ``gpu_memory_utilization`` - -* ``dtype`` - -* ``enforce_eager`` - -* ``kv_cache_dtype`` - -* ``input_len`` - -* ``output_len`` - -* ``max_num_seqs`` - -* ``num_scheduler_steps`` - -* ``max_model_len`` - -* ``enable_chunked_prefill`` - -* ``distributed_executor_backend`` - -* ``max_seq_len_to_capture`` - -Refer to `vLLM documentation `_ -for additional performance tips. :ref:`fine-tuning-llms-vllm` describes vLLM -usage with ROCm. - -ROCm provides a prebuilt optimized Docker image for validating the performance -of LLM inference with vLLM on MI300X Series GPUs. The Docker image includes -ROCm, vLLM, and PyTorch. For more information, see -:doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`. - -.. _mi300x-vllm-throughput-measurement: - -Evaluating performance by throughput measurement -------------------------------------------------- - -This tuning guide evaluates the performance of LLM inference workloads by measuring throughput in tokens per second (TPS). Throughput can be assessed using both real-world and synthetic data, depending on your evaluation goals. - -Refer to the benchmarking script located at ``benchmarks/benchmark_throughput.py`` in the `vLLM repository `_. -Use this script to measure throughput effectively. You can assess throughput using real-world and synthetic data, depending on your evaluation goals. - -* For realistic performance evaluation, you can use datasets like Hugging Face's - ``ShareGPT_V3_unfiltered_cleaned_split.json``. This dataset includes real-world conversational - data, making it a good representation of typical use cases for language models. Download it using - the following command: - - .. code-block:: shell - - wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json - -* For standardized benchmarking, you can set fixed input and output token - lengths. Synthetic prompts provide consistent benchmarking runs, making it - easier to compare performance across different models or configurations. - Additionally, a controlled environment simplifies analysis. - -By balancing real-world data and synthetic data approaches, you can get a well-rounded understanding of model performance in varied scenarios. - -.. _mi300x-vllm-single-node: - -Maximizing vLLM instances on a single node ------------------------------------------- - -The general guideline is to maximize per-node throughput by running as many vLLM instances as possible. -However, running too many instances might lead to insufficient memory for the KV-cache, which can affect performance. - -The Instinct MI300X GPU is equipped with 192 GB of HBM3 memory capacity and bandwidth. -For models that fit in one GPU -- to maximize the accumulated throughput -- you can run as many as eight vLLM instances -simultaneously on one MI300X node (with eight GPUs). To do so, use the GPU isolation environment -variable ``CUDA_VISIBLE_DEVICES``. - -For example, this script runs eight instances of vLLM for throughput benchmarking at the same time -with a model that can fit in one GPU: - -.. 
code-block:: shell - - for i in $(seq 0 7); - do - CUDA_VISIBLE_DEVICES="$i" python3 /app/vllm/benchmarks/benchmark_throughput.py -tp 1 --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - done - -The total throughput achieved by running ``N`` instances of vLLM is generally much higher than running a -single vLLM instance across ``N`` GPUs simultaneously (that is, configuring ``tensor_parallel_size`` as N or -using the ``-tp`` N option, where ``1 < N ≤ 8``). - -vLLM on MI300X GPUs can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on. -Notable configurations include Llama2-70b and Llama3-70b models on a single MI300X GPU, and the Llama3.1 405b model can fit on one single node with 8 MI300X GPUs. - -.. _mi300x-vllm-gpu-memory-utilization: - -Configure the gpu_memory_utilization parameter ----------------------------------------------- - -There are two ways to increase throughput by configuring ``gpu-memory-utilization`` parameter. - -1. Increase ``gpu-memory-utilization`` to improve the throughput for a single instance as long as - it does not incur HIP or CUDA Out Of Memory. The default ``gpu-memory-utilization`` is 0.9. - You can set it to ``>0.9`` and ``<1``. - - For example, below benchmarking command set the ``gpu-memory-utilization`` as 0.98, or 98%. - - .. code-block:: shell - - /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.98 --input-len 1024 --output-len 128 --model /path/to/model - -2. Decrease ``gpu-memory-utilization`` to maximize the number of vLLM instances on the same GPU. - - Specify GPU memory utilization to run as many instances of vLLM as possible on a single - GPU. However, too many instances can result in no memory for KV-cache. For small models, run - multiple instances of vLLM on the same GPU by specifying a smaller ``gpu-memory-utilization`` -- as - long as it would not cause HIP Out Of Memory. - - For example, run two instances of the Llama3-8b model at the same time on a single GPU by specifying - ``--gpu-memory-utilization`` to 0.4 (40%) as follows (on GPU ``0``): - - .. code-block:: shell - - CUDA_VISIBLE_DEVICES=0 python3 /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.4 - --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - - CUDA_VISIBLE_DEVICES=0 python3 /vllm-workspace/benchmarks/benchmark_throughput.py --gpu-memory-utilization 0.4 - --dataset "/path/to/dataset/ShareGPT_V3_unfiltered_cleaned_split.json" --model /path/to/model & - -See :ref:`vllm-engine-args` for other performance suggestions. - -.. _mi300x-vllm-multiple-gpus: - -Run vLLM on multiple GPUs -------------------------- - -The two main reasons to use multiple GPUs are: - -* The model size is too big to run vLLM using one GPU as it results HIP Out of Memory. - -* To achieve better latency when using a single GPU is not desirable. - -To run one vLLM instance on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` option to -specify multiple GPUs. Optionally, use the ``CUDA_VISIBLE_DEVICES`` environment variable to specify -the GPUs. - -For example, you can use two GPUs to start an API server on port 8000: - -.. 
For online serving, you can use two GPUs to start an API server on port 8000:

.. code-block:: shell

    python -m vllm.entrypoints.api_server --model /path/to/model --dtype float16 \
        -tp 2 --port 8000 &

To achieve both latency and throughput goals when serving, you can run multiple API servers on
different GPUs by specifying a different port for each server and using ``CUDA_VISIBLE_DEVICES``
to assign GPUs to each server. For example:

.. code-block:: shell

    CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /path/to/model \
        --dtype float16 -tp 2 --port 8000 &

    CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model /path/to/model \
        --dtype float16 -tp 2 --port 8001 &

Choose an attention backend
---------------------------

vLLM on ROCm supports two attention backends, each suitable for different use cases and performance
requirements:

- **Triton Flash Attention** - This is the default setting. For benchmarking, run vLLM scripts at
  least once as a warm-up step so Triton can perform auto-tuning before collecting benchmarking
  numbers.

- **Composable Kernel (CK) Flash Attention** - To use CK Flash Attention, set the environment
  variable ``export VLLM_USE_TRITON_FLASH_ATTN=0``.

Refer to the Model acceleration libraries guide to learn more about Flash Attention with Triton or
CK backends.

.. _vllm-engine-args:

vLLM engine arguments
---------------------

The following configuration suggestions can potentially improve performance with vLLM. See the
vLLM engine arguments documentation for a full list of configurable engine arguments.

Configure the max-num-seqs parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Increase the ``max-num-seqs`` parameter from the default ``256`` to ``512``
(``--max-num-seqs 512``). This increases the maximum number of sequences per iteration and can
improve throughput.

Use the float16 dtype
^^^^^^^^^^^^^^^^^^^^^

The default data type (``dtype``) is specified in the model's configuration file. For instance,
some models use ``torch.bfloat16`` as their default ``dtype``. Use ``float16``
(``--dtype float16``) for better performance.

Multi-step scheduling
^^^^^^^^^^^^^^^^^^^^^

Setting ``num-scheduler-steps`` for multi-step scheduling can increase performance. Set it to a
value between 10 and 15 (for example, ``--num-scheduler-steps 10``).

Distributed executor backend
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vLLM supports two distributed executor backends: ``ray`` and ``mp``. When using the
`ROCm/vllm <https://github.com/ROCm/vllm>`__ fork, the ``mp`` backend
(``--distributed_executor_backend mp``) is recommended.

Graph mode max-seq-len-to-capture
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``--max-seq-len-to-capture`` sets the maximum sequence length covered by CUDA graphs. In the
default mode (where ``enforce_eager`` is ``False``), when a sequence has a context length larger
than this value, the vLLM engine falls back to eager mode. The default is 8192.

When working with models that support long context lengths, such as Qwen2-7b, set
``--max-seq-len-to-capture`` to 16384. See the vLLM blog for details.

Whether to enable chunked prefill
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Another vLLM performance tip is to enable chunked prefill to improve throughput. Chunked prefill
splits large prefills into smaller chunks that are batched together with decode requests.

You can enable the feature by specifying ``--enable-chunked-prefill`` on the command line or by
setting ``enable_chunked_prefill=True`` in the ``LLM`` constructor.
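A minimal sketch of the constructor form follows, assuming a placeholder model path
(``max_num_batched_tokens`` is discussed next):

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Enable chunked prefill through the constructor instead of the CLI flag.
    # max_num_batched_tokens bounds the number of tokens scheduled per step.
    llm = LLM(
        model="/path/to/model",
        enable_chunked_prefill=True,
        max_num_batched_tokens=4096,
    )
    print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)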
As stated in the vLLM documentation, you can tune performance by changing
``max_num_batched_tokens``. By default, it is set to 512, which is optimized for ITL (inter-token
latency). A smaller ``max_num_batched_tokens`` achieves better ITL because fewer prefills interrupt
decodes. A larger ``max_num_batched_tokens`` achieves better TTFT (time to first token) because
more prefill tokens fit in each batch.

You might experience noticeable throughput improvements when benchmarking on a single GPU or on
eight GPUs using the vLLM throughput benchmarking script with the ShareGPT dataset as input.

With fixed ``input-len``/``output-len``, enabling chunked prefill increases throughput for some
configurations. For other configurations, throughput may be worse, which indicates a need to tune
``max_num_batched_tokens`` (for example, increase it to 4096 or larger).

.. note::

   Chunked prefill is no longer recommended. See the vLLM blog post *Serving LLMs on AMD MI300X:
   Best Practices* (October 2024).

Quantization support
--------------------

Quantization reduces the precision of the model's weights and activations, which significantly
decreases the memory footprint. ``FP8`` (W8A8) and AWQ quantization are supported on ROCm.

FP8 quantization
^^^^^^^^^^^^^^^^

vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware
acceleration on the Instinct MI300X. Quantizing models to FP8 allows a 2x reduction in model memory
requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

AMD publishes Quark Quantized OCP FP8 models on Hugging Face. For example:

* ``Llama-3.1-8B-Instruct-FP8-KV``
* ``Llama-3.1-70B-Instruct-FP8-KV``
* ``Llama-3.1-405B-Instruct-FP8-KV``
* ``Mixtral-8x7B-Instruct-v0.1-FP8-KV``
* ``Mixtral-8x22B-Instruct-v0.1-FP8-KV``

To run vLLM benchmarking on FP8-quantized models, set the ``--quantization`` parameter to ``fp8``
(``--quantization fp8``).

AWQ quantization
^^^^^^^^^^^^^^^^

You can quantize your own models by installing AutoAWQ, or pick one of the more than 400
AWQ-quantized models available on Hugging Face. Be aware that AWQ support in vLLM is currently
underoptimized.

To run vLLM on AWQ-quantized models, set the ``--quantization`` parameter to ``awq``
(``--quantization awq``).

You can find more details in the vLLM AutoAWQ documentation.

fp8 kv-cache-dtype
^^^^^^^^^^^^^^^^^^

Using the ``fp8`` KV cache data type can improve performance because it reduces the size of the
KV cache, which lowers the cost of reading and writing the cache.

To use this feature, set ``--kv-cache-dtype`` to ``fp8``.

To specify the quantization scaling configuration, use the ``--quantization-param-path`` parameter.
If the parameter is not specified, the default scaling factor of ``1`` is used, which can lead to
less accurate results. To generate the KV cache scaling JSON file, see the FP8 KV Cache
documentation in the vLLM GitHub repository.

vLLM includes two sample Llama scaling configuration files, for ``llama2-70b`` and ``llama2-7b``.

If you build vLLM using ``Dockerfile.rocm``, the ``llama2-70b`` scaling configuration is available
at ``/vllm-workspace/tests/fp8_kv/llama2-70b-fp8-kv/kv_cache_scales.json`` at runtime.
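The same engine arguments can be passed to the offline ``LLM`` constructor. The following is a
sketch only; the paths are placeholders, and ``quantization_param_path`` is assumed to be accepted
by the vLLM version you are running:

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Run with an FP8 KV cache plus the Llama 2 70B scaling factors.
    # Both paths below are placeholders; adjust them to your setup.
    llm = LLM(
        model="/path/to/llama2-70b-model",
        kv_cache_dtype="fp8",
        quantization_param_path=(
            "/vllm-workspace/tests/fp8_kv/llama2-70b-fp8-kv/kv_cache_scales.json"
        ),
    )
    print(llm.generate(["Hi"], SamplingParams(max_tokens=16))[0].outputs[0].text)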
Below is a sample command to run benchmarking with this feature enabled for the ``llama2-70b``
model:

.. code-block:: shell

    python3 /vllm-workspace/benchmarks/benchmark_throughput.py --model \
        /path/to/llama2-70b-model --kv-cache-dtype "fp8" \
        --quantization-param-path \
        "/vllm-workspace/tests/fp8_kv/llama2-70b-fp8-kv/kv_cache_scales.json" \
        --input-len 512 --output-len 256 --num-prompts 500

.. _mi300x-tunableop:

PyTorch TunableOp
@@ -946,33 +568,33 @@ for details.

 .. code-block:: shell

-       HIP_FORCE_DEV_KERNARG=1  hipblaslt-bench --alpha 1 --beta 0 -r f16_r \
+       HIP_FORCE_DEV_KERNARG=1 hipblaslt-bench --alpha 1 --beta 0 -r f16_r \
            --a_type f16_r --b_type f8_r --compute_type f32_f16_r \
-           --initialization trig_float  --cold_iters 100 --iters 1000 --rotating 256
+           --initialization trig_float --cold_iters 100 --iters 1000 --rotating 256

 * Example 2: Benchmark forward epilogues and backward epilogues

-  * ``HIPBLASLT_EPILOGUE_RELU: "--activation_type relu";``
+  * ``HIPBLASLT_EPILOGUE_RELU: "--activation_type relu";``

-  * ``HIPBLASLT_EPILOGUE_BIAS: "--bias_vector";``
+  * ``HIPBLASLT_EPILOGUE_BIAS: "--bias_vector";``

-  * ``HIPBLASLT_EPILOGUE_RELU_BIAS: "--activation_type relu --bias_vector";``
+  * ``HIPBLASLT_EPILOGUE_RELU_BIAS: "--activation_type relu --bias_vector";``

-  * ``HIPBLASLT_EPILOGUE_GELU: "--activation_type gelu";``
+  * ``HIPBLASLT_EPILOGUE_GELU: "--activation_type gelu";``

   * ``HIPBLASLT_EPILOGUE_DGELU: "--activation_type gelu --gradient";``

-  * ``HIPBLASLT_EPILOGUE_GELU_BIAS: "--activation_type gelu --bias_vector";``
+  * ``HIPBLASLT_EPILOGUE_GELU_BIAS: "--activation_type gelu --bias_vector";``

-  * ``HIPBLASLT_EPILOGUE_GELU_AUX: "--activation_type gelu --use_e";``
+  * ``HIPBLASLT_EPILOGUE_GELU_AUX: "--activation_type gelu --use_e";``

-  * ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS: "--activation_type gelu --bias_vector --use_e";``
+  * ``HIPBLASLT_EPILOGUE_GELU_AUX_BIAS: "--activation_type gelu --bias_vector --use_e";``

-  * ``HIPBLASLT_EPILOGUE_DGELU_BGRAD: "--activation_type gelu --bias_vector --gradient";``
+  * ``HIPBLASLT_EPILOGUE_DGELU_BGRAD: "--activation_type gelu --bias_vector --gradient";``

-  * ``HIPBLASLT_EPILOGUE_BGRADA: "--bias_vector --gradient --bias_source a";``
+  * ``HIPBLASLT_EPILOGUE_BGRADA: "--bias_vector --gradient --bias_source a";``

-  * ``HIPBLASLT_EPILOGUE_BGRADB:  "--bias_vector --gradient --bias_source b";``
+  * ``HIPBLASLT_EPILOGUE_BGRADB: "--bias_vector --gradient --bias_source b";``

 hipBLASLt auto-tuning using hipblaslt-bench
@@ -1031,26 +653,26 @@ The tuning tool is a two-step tool. It first runs the benchmark, then it creates
 .. code-block:: python

-    defaultBenchOptions = {"ProblemType": {
-        "TransposeA": 0,
-        "TransposeB": 0,
-        "ComputeInputDataType": "s",
-        "ComputeDataType": "s",
-        "DataTypeC": "s",
-        "DataTypeD": "s",
-        "UseBias": False
-    }, "TestConfig": {
-        "ColdIter": 20,
-        "Iter": 100,
-        "AlgoMethod": "all",
-        "RequestedSolutions": 2, # Only works in AlgoMethod heuristic
-        "SolutionIndex": None, # Only works in AlgoMethod index
-        "ApiMethod": "cpp",
-        "RotatingBuffer": 0,
-    }, "TuningParameters": {
-        "SplitK": [0]
-    }, "ProblemSizes": []}
-    defaultCreateLogicOptions = {}  # Currently unused
+    defaultBenchOptions = {"ProblemType": {
+        "TransposeA": 0,
+        "TransposeB": 0,
+        "ComputeInputDataType": "s",
+        "ComputeDataType": "s",
+        "DataTypeC": "s",
+        "DataTypeD": "s",
+        "UseBias": False
+    }, "TestConfig": {
+        "ColdIter": 20,
+        "Iter": 100,
+        "AlgoMethod": "all",
+        "RequestedSolutions": 2, # Only works in AlgoMethod heuristic
+        "SolutionIndex": None, # Only works in AlgoMethod index
+        "ApiMethod": "cpp",
+        "RotatingBuffer": 0,
+    }, "TuningParameters": {
+        "SplitK": [0]
+    }, "ProblemSizes": []}
+    defaultCreateLogicOptions = {} # Currently unused

 * ``TestConfig``

   1. ``ColdIter``: This is the number of warm-up iterations before starting the kernel benchmark.
@@ -1230,7 +852,7 @@ command:

 .. code-block:: shell

-    merge.py original_dir new_tuned_yaml_dir output_dir 
+    merge.py original_dir new_tuned_yaml_dir output_dir

 The following table describes the logic YAML files.
@@ -1833,7 +1455,7 @@ de-quantize the ``int4`` key-value from the ``int4`` data type to ``fp16``.

 From the IR snippet, you can see ``i32`` data is loaded from global memory to
 registers (``%190``). With a few element-wise operations in registers, it is
-stored in shared memory (``%269``) for the transpose operation (``%270``), which
+stored in shared memory (``%269``) for the transpose operation (``%270``), which
 needs data movement across different threads. With the transpose done, it is
 loaded from LDS to register again (``%276``), and with a few more
 element-wise operations, it is stored to LDS again (``%298``). The last step
@@ -1967,7 +1589,7 @@ something similar to the following:

     loaded at: [0x7fd4f100c000-0x7fd4f100e070]

 The kernel name and the code object file should be listed. In the
-example above, the kernel name is vector_add_assert_trap, but this might
+example above, the kernel name is vector_add_assert_trap, but this might
 also look like:

 .. code-block:: text
@@ -2081,3 +1703,8 @@ Hardware efficiency is maximized with 4 or fewer HIP streams. These environment
 configuration to two compute streams and two RCCL streams, aligning with this best practice.
 Additionally, RCCL is often pre-optimized for MI300 systems in production by querying the node
 topology during startup, reducing the need for extensive manual tuning.
+
+Further reading
+===============
+
+* :doc:`vllm-optimization`
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index cbc2e7bce..a0a5084ff 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -134,6 +134,8 @@ subtrees:
         title: Profile and debug
       - file: how-to/rocm-for-ai/inference-optimization/workload.rst
         title: Workload optimization
+      - file: how-to/rocm-for-ai/inference-optimization/vllm-optimization.rst
+        title: vLLM V1 performance optimization
       - url: https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/
         title: AI tutorials