Add latest rocm/vllm Docker details in vLLM inference benchmark guide (#4824)

* update rocm/vllm Docker details to latest release

* Add previous vLLM version

* fix 'further reading' xrefs

* improve model grouping names

* fix links

* update model picker text
Peter Park
2025-05-28 14:20:18 -04:00
committed by GitHub
parent 0acb457389
commit cebf0f5975
4 changed files with 42 additions and 24 deletions


@@ -1,14 +1,14 @@
vllm_benchmark:
unified_docker:
latest:
- pull_tag: rocm/vllm:rocm6.3.1_instinct_vllm0.8.3_20250415
- docker_hub_url: https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_instinct_vllm0.8.3_20250415/images/sha256-ad9062dea3483d59dedb17c67f7c49f30eebd6eb37c3fac0a171fb19696cc845
+ pull_tag: rocm/vllm:rocm6.3.1_vllm_0.8.5_20250513
+ docker_hub_url: https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_vllm_0.8.5_20250513/images/sha256-5c8b4436dd0464119d9df2b44c745fadf81512f18ffb2f4b5dc235c71ebe26b4
rocm_version: 6.3.1
- vllm_version: 0.8.3
- pytorch_version: 2.7.0 (dev nightly)
- hipblaslt_version: 0.13
+ vllm_version: 0.8.5
+ pytorch_version: 2.7.0+gitf717b2a
+ hipblaslt_version: 0.15
model_groups:
- - group: Llama
+ - group: Meta Llama
tag: llama
models:
- model: Llama 3.1 8B
@@ -56,7 +56,7 @@ vllm_benchmark:
model_repo: amd/Llama-3.1-405B-Instruct-FP8-KV
url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
precision: float8
- - group: Mistral
+ - group: Mistral AI
tag: mistral
models:
- model: Mixtral MoE 8x7B
@@ -108,7 +108,7 @@ vllm_benchmark:
url: https://huggingface.co/Qwen/QwQ-32B
precision: float16
tunableop: true
- - group: DBRX
+ - group: Databricks DBRX
tag: dbrx
models:
- model: DBRX Instruct
@@ -121,7 +121,7 @@ vllm_benchmark:
model_repo: amd/dbrx-instruct-FP8-KV
url: https://huggingface.co/amd/dbrx-instruct-FP8-KV
precision: float8
- - group: Gemma
+ - group: Google Gemma
tag: gemma
models:
- model: Gemma 2 27B
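For context, the ``unified_docker.latest`` fields above are the values the benchmarking guide surfaces in its pull and setup instructions. A minimal sketch of consuming them follows, assuming the data file is saved locally as ``vllm-benchmark.yaml`` (a placeholder name) and PyYAML is installed:

.. code-block:: python

   # Sketch: read unified_docker.latest and print the pull command plus the
   # component versions it records. The file name and local use are assumptions.
   import yaml

   with open("vllm-benchmark.yaml") as f:
       data = yaml.safe_load(f)

   latest = data["vllm_benchmark"]["unified_docker"]["latest"]
   print(f"docker pull {latest['pull_tag']}")
   print(
       f"ROCm {latest['rocm_version']}, vLLM {latest['vllm_version']}, "
       f"PyTorch {latest['pytorch_version']}, hipBLASLt {latest['hipblaslt_version']}"
   )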


@@ -1,6 +1,6 @@
megatron-lm_benchmark:
model_groups:
- - group: Llama
+ - group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
@@ -20,7 +20,7 @@ megatron-lm_benchmark:
mad_tag: pyt_megatron_lm_train_deepseek-v3-proxy
- model: DeepSeek-V2-Lite
mad_tag: pyt_megatron_lm_train_deepseek-v2-lite-16b
- - group: Mistral
+ - group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
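Both data files share the same ``model_groups`` schema: a display ``group`` name, a machine-readable ``tag``, and a list of ``models``. A short sketch of walking that schema to confirm the renamed groups keep their original tags (the file name is a placeholder; only keys visible in the diff are used):

.. code-block:: python

   # Sketch: list each model group with its tag and member models.
   import yaml

   with open("megatron-lm-benchmark.yaml") as f:  # placeholder file name
       groups = yaml.safe_load(f)["megatron-lm_benchmark"]["model_groups"]

   for group in groups:
       names = ", ".join(m["model"] for m in group["models"])
       print(f"{group['group']} [tag: {group['tag']}]: {names}")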


@@ -24,11 +24,15 @@ PyTorch inference performance testing
Supported models
================
The following models are supported for inference performance benchmarking
with PyTorch and ROCm. Some instructions, commands, and recommendations in this
documentation might vary by model -- select one to get started.
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row">
<div class="col-2 me-2 model-param-head">Model</div>
<div class="col-2 me-2 model-param-head">Model group</div>
<div class="row col-10">
{% for model_group in model_groups %}
<div class="col-6 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
@@ -37,7 +41,7 @@ PyTorch inference performance testing
</div>
<div class="row mt-1" style="display: none;">
<div class="col-2 me-2 model-param-head">Model variant</div>
<div class="col-2 me-2 model-param-head">Model</div>
<div class="row col-10">
{% for model_group in model_groups %}
{% set models = model_group.models %}
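The picker markup above is a Jinja loop: each ``model_groups`` entry becomes one selectable tile, with ``group`` as the visible label and ``tag`` as the ``data-param-v`` value the page script presumably filters on. A standalone sketch of that loop, using sample data and a direct ``jinja2.Template`` call (the docs build supplies the real context):

.. code-block:: python

   # Sketch: render model-group tiles the way the template loop above does.
   from jinja2 import Template

   TILE_LOOP = (
       '{% for model_group in model_groups %}'
       '<div class="col-6 model-param" data-param-k="model-group" '
       'data-param-v="{{ model_group.tag }}" tabindex="0">'
       '{{ model_group.group }}</div>'
       '{% endfor %}'
   )

   sample_groups = [
       {"group": "Meta Llama", "tag": "llama"},
       {"group": "Mistral AI", "tag": "mistral"},
   ]

   print(Template(TILE_LOOP).render(model_groups=sample_groups))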
@@ -162,11 +166,14 @@ Further reading
- To learn more about system settings and management practices to configure your system for
MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`../../inference-optimization/workload`.
- To learn how to run LLM models from Hugging Face or your model, see
- :doc:`Running models from Hugging Face <hugging-face-models>`.
+ :doc:`Running models from Hugging Face <../hugging-face-models>`.
- To learn how to optimize inference on LLMs, see
- :doc:`Inference optimization <../inference-optimization/index>`.
+ :doc:`Inference optimization <../../inference-optimization/index>`.
- To learn how to fine-tune LLMs, see
- :doc:`Fine-tuning LLMs <../fine-tuning/index>`.
+ :doc:`Fine-tuning LLMs <../../fine-tuning/index>`.


@@ -37,11 +37,15 @@ vLLM inference performance testing
Supported models
================
The following models are supported for inference performance benchmarking
with vLLM and ROCm. Some instructions, commands, and recommendations in this
documentation might vary by model -- select one to get started.
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row">
<div class="col-2 me-2 model-param-head">Model</div>
<div class="col-2 me-2 model-param-head">Model group</div>
<div class="row col-10">
{% for model_group in model_groups %}
<div class="col-3 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
@@ -50,7 +54,7 @@ vLLM inference performance testing
</div>
<div class="row mt-1">
<div class="col-2 me-2 model-param-head">Model variant</div>
<div class="col-2 me-2 model-param-head">Model</div>
<div class="row col-10">
{% for model_group in model_groups %}
{% set models = model_group.models %}
@@ -318,23 +322,23 @@ vLLM inference performance testing
Further reading
===============
- - For application performance optimization strategies for HPC and AI workloads,
- including inference with vLLM, see :doc:`../inference-optimization/workload`.
- To learn more about the options for latency and throughput benchmark scripts,
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
+ - For application performance optimization strategies for HPC and AI workloads,
+ including inference with vLLM, see :doc:`../../inference-optimization/workload`.
- To learn how to run LLM models from Hugging Face or your own model, see
- :doc:`Running models from Hugging Face <hugging-face-models>`.
+ :doc:`Running models from Hugging Face <../hugging-face-models>`.
- To learn how to optimize inference on LLMs, see
- :doc:`Inference optimization <../inference-optimization/index>`.
+ :doc:`Inference optimization <../../inference-optimization/index>`.
- To learn how to fine-tune LLMs, see
- :doc:`Fine-tuning LLMs <../fine-tuning/index>`.
+ :doc:`Fine-tuning LLMs <../../fine-tuning/index>`.
Previous versions
=================
@@ -352,6 +356,13 @@ for benchmarking, see the version-specific documentation.
- PyTorch version
- Resources
+ * - 6.3.1
+ - 0.8.3
+ - 2.7.0
+ -
+ * `Documentation <https://rocm.docs.amd.com/en/docs-6.4.0/how-to/rocm-for-ai/inference/vllm-benchmark.html>`_
+ * `Docker Hub <https://hub.docker.com/layers/rocm/vllm/rocm6.3.1_instinct_vllm0.8.3_20250415/images/sha256-ad9062dea3483d59dedb17c67f7c49f30eebd6eb37c3fac0a171fb19696cc845>`_
* - 6.3.1
- 0.7.3
- 2.7.0