ROCm/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst

.. meta::
   :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
                 ROCm vLLM Docker image.
   :keywords: model, MAD, automation, dashboarding, validate

********************************************************
LLM inference performance testing on AMD Instinct MI300X
********************************************************

.. _vllm-benchmark-unified-docker:

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/vllm-benchmark-models.yaml

   {% set unified_docker = data.vllm_benchmark.unified_docker.latest %}
   {% set model_groups = data.vllm_benchmark.model_groups %}

   The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
   a prebuilt, optimized environment for validating large language model (LLM)
   inference performance on AMD Instinct™ MI300X series accelerator. This ROCm vLLM
   Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
   accelerators and includes the following components:

   * `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_

   * `vLLM {{ unified_docker.vllm_version }} <https://docs.vllm.ai/en/latest>`_

   * `PyTorch {{ unified_docker.pytorch_version }} <https://github.com/pytorch/pytorch>`_

   * `hipBLASLt {{ unified_docker.hipblaslt_version }} <https://github.com/ROCm/hipBLASLt>`_

   With this Docker image, you can quickly test the :ref:`expected
   inference performance numbers <vllm-benchmark-performance-measurements>` for
   MI300X series accelerators.

   .. _vllm-benchmark-available-models:

   Available models
   ================

   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
        <div class="row">
          <div class="col-2 me-2 model-param-head">Model</div>
          <div class="row col-10">
   {% for model_group in model_groups %}
            <div class="col-3 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
   {% endfor %}
          </div>
        </div>

        <div class="row mt-1">
          <div class="col-2 me-2 model-param-head">Model variant</div>
          <div class="row col-10">
   {% for model_group in model_groups %}
      {% set models = model_group.models %}
      {% for model in models %}
         {% if models|length % 3 == 0 %}
            <div class="col-4 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
         {% else %}
            <div class="col-6 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
         {% endif %}
      {% endfor %}
   {% endfor %}
          </div>
        </div>
      </div>

   .. _vllm-benchmark-vllm:

   {% for model_group in model_groups %}
      {% for model in model_group.models %}

   .. container:: model-doc {{model.mad_tag}}

      .. note::

         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model.
         Some models require access authorization prior to use via an external license agreement through a third party.

      {% endfor %}
   {% endfor %}

   .. note::

      vLLM is a toolkit and library for LLM inference and serving. AMD implements
      high-performance custom kernels and modules in vLLM to enhance performance.
      See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
      more information.

   .. _vllm-benchmark-performance-measurements:

   Performance measurements
   ========================

   To evaluate performance, the
   `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
   page provides reference throughput and latency measurements for inferencing
   popular AI models.

   .. note::

      The performance data presented in
      `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
      should not be interpreted as the peak performance achievable by AMD
      Instinct MI325X and MI300X accelerators or ROCm software.

   Advanced features and known issues
   ==================================

   For information on experimental features and known issues related to ROCm optimization efforts on vLLM,
   see the developer's guide at `<https://github.com/ROCm/vllm/blob/main/docs/dev-docker/README.md>`__.

   Getting started
   ===============

   Use the following procedures to reproduce the benchmark results on an
   MI300X accelerator with the prebuilt vLLM Docker image.

   .. _vllm-benchmark-get-started:

   1. Disable NUMA auto-balancing.

      To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU
      might hang until the periodic balancing is finalized. For more information,
      see :ref:`AMD Instinct MI300X system optimization <mi300x-disable-numa>`.

      .. code-block:: shell

         # disable automatic NUMA balancing
         sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
         # check if NUMA balancing is disabled (returns 0 if disabled)
         cat /proc/sys/kernel/numa_balancing
         0

   2. Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_.

      Use the following command to pull the Docker image from Docker Hub.

      .. code-block:: shell

         docker pull {{ unified_docker.pull_tag }}

   Benchmarking
   ============

   Once the setup is complete, choose between two options to reproduce the
   benchmark results:

   .. _vllm-benchmark-mad:

   {% for model_group in model_groups %}
      {% for model in model_group.models %}

   .. container:: model-doc {{model.mad_tag}}

      .. tab-set::

         .. tab-item:: MAD-integrated benchmarking

            Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
            directory and install the required packages on the host machine.

            .. code-block:: shell

               git clone https://github.com/ROCm/MAD
               cd MAD
               pip install -r requirements.txt

            Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model
            using one GPU with the ``{{model.precision}}`` data type on the host machine.

            .. code-block:: shell

               export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
               python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800

            MAD launches a Docker container with the name
            ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the
            model are collected in the following path: ``~/MAD/reports_{{model.precision}}/``.

            Although the :ref:`available models <vllm-benchmark-available-models>` are preconfigured
            to collect latency and throughput performance data, you can also change the benchmarking
            parameters. See the standalone benchmarking tab for more information.

         .. tab-item:: Standalone benchmarking

            Run the vLLM benchmark tool independently by starting the
            `Docker container <{{ unified_docker.docker_hub_url }}>`_
            as shown in the following snippet.

            .. code-block::

               docker pull {{ unified_docker.pull_tag }}
               docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }}

            In the Docker container, clone the ROCm MAD repository and navigate to the
            benchmark scripts directory at ``~/MAD/scripts/vllm``.

            .. code-block::

               git clone https://github.com/ROCm/MAD
               cd MAD/scripts/vllm

            To start the benchmark, use the following command with the appropriate options.

            .. code-block::

               ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}}

            .. list-table::
               :header-rows: 1
               :align: center

               * - Name
                 - Options
                 - Description

               * - ``$test_option``
                 - latency
                 - Measure decoding token latency

               * -
                 - throughput
                 - Measure token generation throughput

               * -
                 - all
                 - Measure both throughput and latency

               * - ``$num_gpu``
                 - 1 or 8
                 - Number of GPUs

               * - ``$datatype``
                 - ``float16`` or ``float8``
                 - Data type

            .. note::

               The input sequence length, output sequence length, and tensor parallel (TP) are
               already configured. You don't need to specify them with this script.

            .. note::

               If you encounter the following error, pass your access-authorized Hugging
               Face token to the gated models.

               .. code-block::

                  OSError: You are trying to access a gated repo.

                  # pass your HF_TOKEN
                  export HF_TOKEN=$your_personal_hf_token

            Here are some examples of running the benchmark with various options.

            * Latency benchmark

              Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with the ``{{model.precision}}`` data type.

              .. code-block::

                 ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}}

              Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``.

            * Throughput benchmark

              Use this command to throughput the latency of the {{model.model}} model on eight GPUs with the ``{{model.precision}}`` data type.

              .. code-block:: shell

                 ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}}

              Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``.

            .. raw:: html

               <style>
               mjx-container[jax="CHTML"][display="true"] {
                  text-align: left;
                  margin: 0;
               }
               </style>

            .. note::

               Throughput is calculated as:

               - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time

               - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time
      {% endfor %}
   {% endfor %}

Further reading
===============

- For application performance optimization strategies for HPC and AI workloads,
  including inference with vLLM, see :doc:`../inference-optimization/workload`.

- To learn more about the options for latency and throughput benchmark scripts,
  see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.

- To learn more about system settings and management practices to configure your system for
  MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_

- To learn how to run LLM models from Hugging Face or your own model, see
  :doc:`Running models from Hugging Face <hugging-face-models>`.

- To learn how to optimize inference on LLMs, see
  :doc:`Inference optimization <../inference-optimization/index>`.

- To learn how to fine-tune LLMs, see
  :doc:`Fine-tuning LLMs <../fine-tuning/index>`.

Previous versions
=================

This table lists previous versions of the ROCm vLLM inference Docker image for
inference performance testing. For detailed information about available models
for benchmarking, see the version-specific documentation.

.. list-table::
   :header-rows: 1
   :stub-columns: 1

   * - ROCm version
     - vLLM version
     - PyTorch version
     - Resources

   * - 6.3.1
     - 0.6.6
     - 2.7.0
     -
       * `Documentation <https://rocm.docs.amd.com/en/docs-6.3.2/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.html>`_
       * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6/images/sha256-9a12ef62bbbeb5a4c30a01f702c8e025061f575aa129f291a49fbd02d6b4d6c9>`_

   * - 6.2.1
     - 0.6.4
     - 2.5.0
     -
       * `Documentation <https://rocm.docs.amd.com/en/docs-6.3.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
       * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4/images/sha256-ccbb74cc9e7adecb8f7bdab9555f7ac6fc73adb580836c2a35ca96ff471890d8>`_

   * - 6.2.0
     - 0.4.3
     - 2.4.0
     -
       * `Documentation <https://rocm.docs.amd.com/en/docs-6.2.0/how-to/performance-validation/mi300x/vllm-benchmark.html>`_
       * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50/images/sha256-9e4dd4788a794c3d346d7d0ba452ae5e92d39b8dfac438b2af8efdc7f15d22c0>`_