:orphan:

.. meta::
   :description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
                 ROCm vLLM Docker image.
   :keywords: model, MAD, automation, dashboarding, validate


***********************************************************
LLM inference performance validation on AMD Instinct MI300X
***********************************************************

.. caution::

   This documentation does not reflect the latest version of ROCm vLLM
   inference performance documentation. See :doc:`../vllm` for the latest version.

.. _vllm-benchmark-unified-docker:


The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on the AMD Instinct™ MI300X GPU. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for the MI300X
GPU and includes the following components:

* `ROCm 6.3.1 <https://github.com/ROCm/ROCm>`_

* `vLLM 0.6.6 <https://docs.vllm.ai/en/latest>`_

* `PyTorch 2.7.0 (2.7.0a0+git3a58512) <https://github.com/pytorch/pytorch>`_

With this Docker image, you can quickly validate the expected inference
performance numbers for the MI300X GPU. This topic also provides tips on
optimizing performance with popular AI models. For more information, see the lists of
:ref:`available models for MAD-integrated benchmarking <vllm-benchmark-mad-v066-models>`
and :ref:`standalone benchmarking <vllm-benchmark-standalone-v066-options>`.

.. _vllm-benchmark-vllm:


.. note::

   vLLM is a toolkit and library for LLM inference and serving. AMD implements
   high-performance custom kernels and modules in vLLM to enhance performance.
   See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
   more information.


Getting started
===============

Use the following procedures to reproduce the benchmark results on an
MI300X GPU with the prebuilt vLLM Docker image.

.. _vllm-benchmark-get-started:


1. Disable NUMA auto-balancing.

   To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU
   might hang until the periodic balancing is finalized. For more information,
   see the :ref:`system validation steps <rocm-for-ai-system-optimization>`.

   .. code-block:: shell

      # disable automatic NUMA balancing
      sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
      # check if NUMA balancing is disabled (returns 0 if disabled)
      cat /proc/sys/kernel/numa_balancing
      0


2. Download the :ref:`ROCm vLLM Docker image <vllm-benchmark-unified-docker>`.

   Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6

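
If you want to confirm both steps before benchmarking, the following optional
sketch re-checks the NUMA setting through ``sysctl`` (equivalent to reading
``/proc/sys/kernel/numa_balancing``) and verifies that the pulled image can see
the GPUs. It assumes ``rocm-smi`` is available inside the image, which is
typical for ROCm-based images.

.. code-block:: shell

   # Confirm NUMA auto-balancing is disabled (expect "kernel.numa_balancing = 0").
   sysctl kernel.numa_balancing

   # Confirm the image is present and that the GPUs are visible from a container.
   docker images rocm/vllm
   docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
       rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6 rocm-smi
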

Once the setup is complete, choose between two options to reproduce the
benchmark results:

- :ref:`MAD-integrated benchmarking <vllm-benchmark-mad-v066>`

- :ref:`Standalone benchmarking <vllm-benchmark-standalone-v066>`

.. _vllm-benchmark-mad-v066:


MAD-integrated benchmarking
===========================

Clone the ROCm Model Automation and Dashboarding (`MAD <https://github.com/ROCm/MAD>`__)
repository to a local directory and install the required packages on the host machine.

.. code-block:: shell

   git clone https://github.com/ROCm/MAD
   cd MAD
   pip install -r requirements.txt


Use this command to run a performance benchmark test of the Llama 3.1 8B model
on one GPU with the ``float16`` data type on the host machine.

.. code-block:: shell

   export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
   python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800

ROCm MAD launches a Docker container named
``container_ci-pyt_vllm_llama-3.1-8b``. The latency and throughput reports of the
model are collected in the following path: ``~/MAD/reports_float16/``.

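
If you want to collect results for several of the preconfigured models (listed
in the table below) in one pass, a simple wrapper loop such as the following
works. This is only a sketch that reuses the documented ``run_models.py``
options with an arbitrary selection of tags; each model can take several hours
to benchmark.

.. code-block:: shell

   export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"

   # Benchmark a few of the preconfigured models back to back.
   for tag in pyt_vllm_llama-3.1-8b pyt_vllm_mistral-7b pyt_vllm_qwen2-7b; do
       python3 tools/run_models.py --tags "$tag" --keep-model-dir --live-output --timeout 28800
   done
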

Although the following models are preconfigured to collect latency and
throughput performance data, you can also change the benchmarking parameters.
Refer to the :ref:`Standalone benchmarking <vllm-benchmark-standalone-v066>` section.

.. _vllm-benchmark-mad-v066-models:

Available models
----------------


.. list-table::
   :header-rows: 1
   :widths: 2, 3

   * - Model name
     - Tag

   * - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
     - ``pyt_vllm_llama-3.1-8b``

   * - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
     - ``pyt_vllm_llama-3.1-70b``

   * - `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_
     - ``pyt_vllm_llama-3.1-405b``

   * - `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_
     - ``pyt_vllm_llama-3.2-11b-vision-instruct``

   * - `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`__
     - ``pyt_vllm_llama-2-7b``

   * - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`__
     - ``pyt_vllm_llama-2-70b``

   * - `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_
     - ``pyt_vllm_mixtral-8x7b``

   * - `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_
     - ``pyt_vllm_mixtral-8x22b``

   * - `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_
     - ``pyt_vllm_mistral-7b``

   * - `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_
     - ``pyt_vllm_qwen2-7b``

   * - `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_
     - ``pyt_vllm_qwen2-72b``

   * - `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_
     - ``pyt_vllm_jais-13b``

   * - `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_
     - ``pyt_vllm_jais-30b``

   * - `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_
     - ``pyt_vllm_dbrx-instruct``

   * - `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_
     - ``pyt_vllm_gemma-2-27b``

   * - `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_
     - ``pyt_vllm_c4ai-command-r-plus-08-2024``

   * - `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_
     - ``pyt_vllm_deepseek-moe-16b-chat``

   * - `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_
     - ``pyt_vllm_llama-3.1-70b_fp8``

   * - `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_
     - ``pyt_vllm_llama-3.1-405b_fp8``

   * - `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_
     - ``pyt_vllm_mixtral-8x7b_fp8``

   * - `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_
     - ``pyt_vllm_mixtral-8x22b_fp8``

   * - `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_
     - ``pyt_vllm_mistral-7b_fp8``

   * - `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_
     - ``pyt_vllm_dbrx_fp8``

   * - `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_
     - ``pyt_vllm_command-r-plus_fp8``


.. _vllm-benchmark-standalone-v066:

Standalone benchmarking
=======================

You can run the vLLM benchmark tool independently by starting the
:ref:`Docker container <vllm-benchmark-get-started>` as shown in the following
snippet.


.. code-block:: shell

   docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
   docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name vllm_v0.6.6 rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6


In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ``~/MAD/scripts/vllm``.

.. code-block:: shell

   git clone https://github.com/ROCm/MAD
   cd MAD/scripts/vllm


Command
-------

To start the benchmark, use the following command with the appropriate options.
See :ref:`Options <vllm-benchmark-standalone-v066-options>` for the list of
options and their descriptions.

.. code-block:: shell

   ./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype

See the :ref:`examples <vllm-benchmark-run-benchmark-v066>` for more information.


.. note::

   The input sequence length, output sequence length, and tensor parallel (TP)
   size are already configured. You don't need to specify them with this script.

.. note::

   If you encounter the following error, pass a Hugging Face token that has been
   granted access to the gated models.

   .. code-block:: shell

      OSError: You are trying to access a gated repo.

      # Pass your Hugging Face token.
      export HF_TOKEN=$your_personal_hf_token

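
To confirm up front that your token is valid and has been granted access to a
gated model, you can query the Hugging Face Hub from inside the container. This
sketch assumes the ``huggingface-cli`` tool (part of ``huggingface_hub``, which
vLLM depends on) is available in the image; the model name is only an example.

.. code-block:: shell

   export HF_TOKEN=$your_personal_hf_token

   # Shows the account associated with the token.
   huggingface-cli whoami

   # Downloads a single small file; fails with a clear error if access
   # to the gated repository has not been granted.
   huggingface-cli download meta-llama/Llama-3.1-70B-Instruct config.json
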

.. _vllm-benchmark-standalone-v066-options:

Options and available models
----------------------------


.. list-table::
   :header-rows: 1
   :align: center

   * - Name
     - Options
     - Description

   * - ``$test_option``
     - latency
     - Measure decoding token latency

   * -
     - throughput
     - Measure token generation throughput

   * -
     - all
     - Measure both throughput and latency

   * - ``$model_repo``
     - ``meta-llama/Llama-3.1-8B-Instruct``
     - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_

   * - (``float16``)
     - ``meta-llama/Llama-3.1-70B-Instruct``
     - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_

   * -
     - ``meta-llama/Llama-3.1-405B-Instruct``
     - `Llama 3.1 405B <https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct>`_

   * -
     - ``meta-llama/Llama-3.2-11B-Vision-Instruct``
     - `Llama 3.2 11B Vision <https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct>`_

   * -
     - ``meta-llama/Llama-2-7b-chat-hf``
     - `Llama 2 7B <https://huggingface.co/meta-llama/Llama-2-7b-chat-hf>`__

   * -
     - ``meta-llama/Llama-2-70b-chat-hf``
     - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-chat-hf>`__

   * -
     - ``mistralai/Mixtral-8x7B-Instruct-v0.1``
     - `Mixtral MoE 8x7B <https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1>`_

   * -
     - ``mistralai/Mixtral-8x22B-Instruct-v0.1``
     - `Mixtral MoE 8x22B <https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1>`_

   * -
     - ``mistralai/Mistral-7B-Instruct-v0.3``
     - `Mistral 7B <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3>`_

   * -
     - ``Qwen/Qwen2-7B-Instruct``
     - `Qwen2 7B <https://huggingface.co/Qwen/Qwen2-7B-Instruct>`_

   * -
     - ``Qwen/Qwen2-72B-Instruct``
     - `Qwen2 72B <https://huggingface.co/Qwen/Qwen2-72B-Instruct>`_

   * -
     - ``core42/jais-13b-chat``
     - `JAIS 13B <https://huggingface.co/core42/jais-13b-chat>`_

   * -
     - ``core42/jais-30b-chat-v3``
     - `JAIS 30B <https://huggingface.co/core42/jais-30b-chat-v3>`_

   * -
     - ``databricks/dbrx-instruct``
     - `DBRX Instruct <https://huggingface.co/databricks/dbrx-instruct>`_

   * -
     - ``google/gemma-2-27b``
     - `Gemma 2 27B <https://huggingface.co/google/gemma-2-27b>`_

   * -
     - ``CohereForAI/c4ai-command-r-plus-08-2024``
     - `C4AI Command R+ 08-2024 <https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024>`_

   * -
     - ``deepseek-ai/deepseek-moe-16b-chat``
     - `DeepSeek MoE 16B <https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat>`_

   * - ``$model_repo``
     - ``amd/Llama-3.1-70B-Instruct-FP8-KV``
     - `Llama 3.1 70B FP8 <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>`_

   * - (``float8``)
     - ``amd/Llama-3.1-405B-Instruct-FP8-KV``
     - `Llama 3.1 405B FP8 <https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV>`_

   * -
     - ``amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV``
     - `Mixtral MoE 8x7B FP8 <https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV>`_

   * -
     - ``amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV``
     - `Mixtral MoE 8x22B FP8 <https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV>`_

   * -
     - ``amd/Mistral-7B-v0.1-FP8-KV``
     - `Mistral 7B FP8 <https://huggingface.co/amd/Mistral-7B-v0.1-FP8-KV>`_

   * -
     - ``amd/dbrx-instruct-FP8-KV``
     - `DBRX Instruct FP8 <https://huggingface.co/amd/dbrx-instruct-FP8-KV>`_

   * -
     - ``amd/c4ai-command-r-plus-FP8-KV``
     - `C4AI Command R+ 08-2024 FP8 <https://huggingface.co/amd/c4ai-command-r-plus-FP8-KV>`_

   * - ``$num_gpu``
     - 1 or 8
     - Number of GPUs

   * - ``$datatype``
     - ``float16`` or ``float8``
     - Data type


.. _vllm-benchmark-run-benchmark-v066:

Running the benchmark on the MI300X GPU
---------------------------------------

Here are some examples of running the benchmark with various options.
See :ref:`Options <vllm-benchmark-standalone-v066-options>` for the list of
options and their descriptions.


Example 1: latency benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use this command to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.

.. code-block:: shell

   ./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
   ./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

Find the latency reports at:

- ``./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv``

- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv``


Example 2: throughput benchmark
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the ``float16`` and ``float8`` data types.

.. code-block:: shell

   ./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
   ./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8

Find the throughput reports at:

- ``./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv``

- ``./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv``

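
The summary reports are plain CSV files, so you can inspect them directly from
the shell once the runs above have completed. For example, assuming the
``column`` utility is available:

.. code-block:: shell

   # Pretty-print the throughput summaries produced by the two runs above.
   column -s, -t < ./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv
   column -s, -t < ./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv
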

.. raw:: html

   <style>
   mjx-container[jax="CHTML"][display="true"] {
       text-align: left;
       margin: 0;
   }
   </style>


.. note::

   Throughput is calculated as:

   - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time

   - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time

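
As a quick sanity check of these formulas, the following sketch plugs in
illustrative numbers; the request count, sequence lengths, and elapsed time are
made up for the example.

.. code-block:: shell

   # 128 requests, 128 input tokens and 128 output tokens each, 60 s elapsed.
   awk 'BEGIN {
       requests = 128; input_len = 128; output_len = 128; elapsed = 60;
       printf "throughput_tot: %.1f tokens/s\n", requests * (input_len + output_len) / elapsed;
       printf "throughput_gen: %.1f tokens/s\n", requests * output_len / elapsed;
   }'
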

Further reading
===============

- For application performance optimization strategies for HPC and AI workloads,
  including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

- To learn more about the options for latency and throughput benchmark scripts,
  see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.

- To learn more about system settings and management practices to configure your system for
  MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.

- To learn how to run community models from Hugging Face on AMD GPUs, see
  :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

- To learn how to fine-tune LLMs and optimize inference, see
  :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference>`.

- For a list of other ready-made Docker images for AI with ROCm, see
  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.


Previous versions
=================

See :doc:`vllm-history` to find documentation for previous releases
of the ``ROCm/vllm`` Docker image.