:orphan:

.. meta::
   :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
                 ROCm vLLM Docker image.
   :keywords: model, MAD, automation, dashboarding, validate

**********************************
vLLM inference performance testing
**********************************

.. caution::

   This documentation does not reflect the latest version of the ROCm vLLM
   inference performance documentation. See :doc:`../vllm` for the latest version.

.. _vllm-benchmark-unified-docker-702:

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/previous-versions/vllm_0.9.1_20250702-benchmark-models.yaml

   {% set unified_docker = data.vllm_benchmark.unified_docker.latest %}
   {% set model_groups = data.vllm_benchmark.model_groups %}

   The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
   a prebuilt, optimized environment for validating large language model (LLM)
   inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
   Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
   accelerators and includes the following components:

   * `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_

   * `vLLM {{ unified_docker.vllm_version }} <https://docs.vllm.ai/en/latest>`_

   * `PyTorch {{ unified_docker.pytorch_version }} <https://github.com/ROCm/pytorch.git>`_

   * `hipBLASLt {{ unified_docker.hipblaslt_version }} <https://github.com/ROCm/hipBLASLt>`_

   With this Docker image, you can quickly test the :ref:`expected
   inference performance numbers <vllm-benchmark-performance-measurements-20250702>` for
   MI300X series accelerators.

   .. _vllm-benchmark-available-models-20250702:

   Supported models
   ================

   The following models are supported for inference performance benchmarking
   with vLLM and ROCm. Some instructions, commands, and recommendations in this
   documentation might vary by model -- select one to get started.

   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
         <div class="row">
            <div class="col-2 me-2 model-param-head">Model group</div>
            <div class="row col-10">
               {% for model_group in model_groups %}
               <div class="col-3 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
               {% endfor %}
            </div>
         </div>

         <div class="row mt-1">
            <div class="col-2 me-2 model-param-head">Model</div>
            <div class="row col-10">
               {% for model_group in model_groups %}
               {% set models = model_group.models %}
               {% for model in models %}
               {% if models|length % 3 == 0 %}
               <div class="col-4 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
               {% else %}
               <div class="col-6 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
               {% endif %}
               {% endfor %}
               {% endfor %}
            </div>
         </div>
      </div>

   .. _vllm-benchmark-vllm-702:

   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{model.mad_tag}}

      .. note::

         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`_ to learn more about your selected model.
         Some models require access authorization prior to use via an external license agreement through a third party.

   {% endfor %}
   {% endfor %}

   .. note::

      vLLM is a toolkit and library for LLM inference and serving. AMD implements
      high-performance custom kernels and modules in vLLM to enhance performance.
      See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
      more information.

   .. _vllm-benchmark-performance-measurements-20250702:

   Performance measurements
   ========================

   To evaluate performance, the
   `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
   page provides reference throughput and latency measurements for inferencing popular AI models.

   .. important::

      The performance data presented in
      `Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
      only reflects the latest version of this inference benchmarking environment.
      The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.

   Advanced features and known issues
   ==================================

   For information on experimental features and known issues related to ROCm optimization efforts on vLLM,
   see the developer's guide at `<https://github.com/ROCm/vllm/tree/5486e7bc8523be0324ccd68f221959445b56cc2a/docs/dev-docker>`__.

   System validation
   =================

   Before running AI workloads, it's important to validate that your AMD hardware is configured
   correctly and performing optimally.

   To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU
   might hang until the periodic balancing completes. For more information,
   see the :ref:`system validation steps <rocm-for-ai-system-optimization>`.

   .. code-block:: shell

      # disable automatic NUMA balancing (requires root privileges)
      sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
      # check if NUMA balancing is disabled (returns 0 if disabled)
      cat /proc/sys/kernel/numa_balancing
      0

   To test for optimal performance, consult the recommended :ref:`System health benchmarks
   <rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
   system's configuration.
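
   As a quick sanity check before benchmarking, you can also confirm that ROCm
   detects all of your accelerators. This is a minimal check only; the system
   validation steps linked above are more thorough.

   .. code-block:: shell

      # list detected accelerators with utilization, temperature, and memory status
      rocm-smi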

   Pull the Docker image
   =====================

   Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_.
   Use the following command to pull the Docker image from Docker Hub.

   .. code-block:: shell

      docker pull {{ unified_docker.pull_tag }}
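
   Optionally, verify that the image is now available locally before launching
   a container.

   .. code-block:: shell

      # prints a matching row if the pull succeeded
      docker images {{ unified_docker.pull_tag }}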

   Benchmarking
   ============

   Once the setup is complete, choose between two options to reproduce the
   benchmark results:

   .. _vllm-benchmark-mad-702:

   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{model.mad_tag}}

      .. tab-set::

         .. tab-item:: MAD-integrated benchmarking

            Clone the ROCm Model Automation and Dashboarding (`MAD <https://github.com/ROCm/MAD>`__) repository to a local
            directory and install the required packages on the host machine.

            .. code-block:: shell

               git clone https://github.com/ROCm/MAD
               cd MAD
               pip install -r requirements.txt

            Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model
            using one GPU with the :literal:`{{model.precision}}` data type on the host machine.

            .. code-block:: shell

               export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
               python3 tools/run_models.py --tags {{model.mad_tag}} --keep-model-dir --live-output --timeout 28800

            MAD launches a Docker container named
            ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the
            model are collected at ``~/MAD/reports_{{model.precision}}/``.

            Although the :ref:`available models <vllm-benchmark-available-models-20250702>` are preconfigured
            to collect latency and throughput performance data, you can also change the benchmarking
            parameters. See the standalone benchmarking tab for more information.
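
            After the run completes, you can list the collected reports from the
            host; the exact file names depend on the model tag.

            .. code-block:: shell

               # inspect the collected latency and throughput reports
               ls ~/MAD/reports_{{model.precision}}/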

            {% if model.tunableop %}

            .. note::

               For improved performance, consider enabling :ref:`PyTorch TunableOp <mi300x-tunableop>`.
               TunableOp automatically explores different implementations and configurations of certain PyTorch
               operators to find the fastest one for your hardware.

               By default, ``{{model.mad_tag}}`` runs with TunableOp disabled
               (see `<https://github.com/ROCm/MAD/blob/develop/models.json>`__). To
               enable it, edit the default run behavior in the ``models.json``
               configuration before running inference -- update the model's run
               ``args`` by changing ``--tunableop off`` to ``--tunableop on``, as
               sketched below.

               Enabling TunableOp triggers a two-pass run -- a warm-up followed by the performance-collection run.
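
               As a minimal sketch, assuming the model's run ``args`` contain the
               literal string ``--tunableop off``, you can flip the flag in place:

               .. code-block:: shell

                  # enable TunableOp in the model's default run args
                  sed -i 's/--tunableop off/--tunableop on/' models.json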

            {% endif %}

         .. tab-item:: Standalone benchmarking

            Run the vLLM benchmark tool independently by starting the
            `Docker container <{{ unified_docker.docker_hub_url }}>`_
            as shown in the following snippet.

            .. code-block:: shell

               docker pull {{ unified_docker.pull_tag }}
               docker run -it --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 16G --security-opt seccomp=unconfined --security-opt apparmor=unconfined --cap-add=SYS_PTRACE -v $(pwd):/workspace --env HUGGINGFACE_HUB_CACHE=/workspace --name test {{ unified_docker.pull_tag }}

            In the Docker container, clone the ROCm MAD repository and navigate to the
            benchmark scripts directory at ``~/MAD/scripts/vllm``.

            .. code-block:: shell

               git clone https://github.com/ROCm/MAD
               cd MAD/scripts/vllm

            To start the benchmark, use the following command with the appropriate options.

            .. code-block:: shell

               ./vllm_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d {{model.precision}}

            .. list-table::
               :header-rows: 1
               :align: center

               * - Name
                 - Options
                 - Description

               * - ``$test_option``
                 - ``latency``
                 - Measure decoding token latency

               * -
                 - ``throughput``
                 - Measure token generation throughput

               * -
                 - ``all``
                 - Measure both throughput and latency

               * - ``$num_gpu``
                 - ``1`` or ``8``
                 - Number of GPUs

               * - ``$datatype``
                 - ``float16`` or ``float8``
                 - Data type (passed with ``-d``)

            .. note::

               The input sequence length, output sequence length, and tensor parallel (TP)
               size are already configured. You don't need to specify them with this script.

            .. note::

               If you encounter the following error, pass your access-authorized Hugging
               Face token to the gated models.

               .. code-block:: shell

                  OSError: You are trying to access a gated repo.

                  # pass your HF_TOKEN
                  export HF_TOKEN=$your_personal_hf_token

            Here are some examples of running the benchmark with various options.

            * Latency benchmark

              Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision.

              .. code-block:: shell

                 ./vllm_benchmark_report.sh -s latency -m {{model.model_repo}} -g 8 -d {{model.precision}}

              Find the latency report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``.

            * Throughput benchmark

              Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with :literal:`{{model.precision}}` precision.

              .. code-block:: shell

                 ./vllm_benchmark_report.sh -s throughput -m {{model.model_repo}} -g 8 -d {{model.precision}}

              Find the throughput report at ``./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``.
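
              Either summary CSV can be inspected directly in the shell; for
              example, assuming a report matching the pattern above exists:

              .. code-block:: shell

                 # render the comma-separated summary as an aligned table
                 column -s, -t < ./reports_{{model.precision}}_vllm_rocm{{unified_docker.rocm_version}}/summary/*_throughput_report.csv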

            .. raw:: html

               <style>
                  mjx-container[jax="CHTML"][display="true"] {
                     text-align: left;
                     margin: 0;
                  }
               </style>

            .. note::

               Throughput is calculated as:

               - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time

               - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time
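
               For example, with hypothetical values of 1000 requests, 128-token
               input and output lengths, and 60 seconds of elapsed time:

               .. code-block:: shell

                  # throughput_tot = 1000 * (128 + 128) / 60 ~= 4266.7 tokens/s
                  python3 -c 'print(1000 * (128 + 128) / 60)'
                  # throughput_gen = 1000 * 128 / 60 ~= 2133.3 tokens/s
                  python3 -c 'print(1000 * 128 / 60)'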

   {% endfor %}
   {% endfor %}

   Further reading
   ===============

   - To learn more about the options for latency and throughput benchmark scripts,
     see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.

   - To learn more about system settings and management practices to configure your system for
     MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.

   - For application performance optimization strategies for HPC and AI workloads,
     including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

   - To learn how to run community models from Hugging Face on AMD GPUs, see
     :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

   - To learn how to fine-tune LLMs and optimize inference, see
     :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference>`.

   - For a list of other ready-made Docker images for AI with ROCm, see
     `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

   Previous versions
   =================

   See :doc:`vllm-history` to find documentation for previous releases
   of the ``ROCm/vllm`` Docker image.