mirror of
https://github.com/ROCm/ROCm.git
synced 2026-01-08 22:28:06 -05:00
docs: Add SGLang disaggregated P/D inference w/ Mooncake guide (#5335)
* add main content
* Update content and format add clarification update update data
* fix fix fix
* fix: deepseek v3
* add ki
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
@@ -0,0 +1,32 @@
dockers:
  - pull_tag: lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
    docker_hub_url: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729
    components:
      ROCm: 7.0.0
      SGLang: v0.5.2rc1
      pytorch-triton-rocm: 3.4.0+rocm7.0.0.gitf9e5bf54
model_groups:
  - group: Dense models
    tag: dense-models
    models:
      - model: Llama 3.1 8B Instruct
        model_repo: Llama-3.1-8B-Instruct
        url: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
      - model: Llama 3.1 405B FP8 KV
        model_repo: Llama-3.1-405B-Instruct-FP8-KV
        url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
      - model: Llama 3.3 70B FP8 KV
        model_repo: amd-Llama-3.3-70B-Instruct-FP8-KV
        url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV
      - model: Qwen3 32B
        model_repo: Qwen3-32B
        url: https://huggingface.co/Qwen/Qwen3-32B
  - group: Small experts models
    tag: small-experts-models
    models:
      - model: DeepSeek V3
        model_repo: DeepSeek-V3
        url: https://huggingface.co/deepseek-ai/DeepSeek-V3
      - model: Mixtral 8x7B v0.1
        model_repo: Mixtral-8x7B-v0.1
        url: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
@@ -0,0 +1,257 @@
.. meta::
   :description: SGLang multi-node disaggregated distributed inference using Mooncake
   :keywords: model, sglang, mooncake, disagg, disaggregated, distributed, multi-node, docker

******************************************
SGLang distributed inference with Mooncake
******************************************

As LLM inference increasingly demands handling massive models and dynamic workloads, efficient
distributed inference becomes essential. Traditional co-located architectures face bottlenecks due
to tightly coupled memory and compute resources, which limits scalability and flexibility.
Disaggregated inference refers to the process of splitting the inference of LLMs into distinct
phases. This architecture, facilitated by libraries like Mooncake, uses high-bandwidth
RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes.
This allows for independent resource scaling and optimization, resulting in
improved efficiency and throughput.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set docker = data.dockers[0] %}

   `SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
   serving engine for large language models (LLMs) and vision models. The
   ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__
   bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series
   accelerators. It includes the following software components:

   .. list-table::
      :header-rows: 1

      * - Software component
        - Version

      {% for component_name, component_version in docker.components.items() %}
      * - {{ component_name }}
        - {{ component_version }}
      {% endfor %}

The following guide describes how to set up and run SGLang and Mooncake for disaggregated
distributed inference on a Slurm cluster using AMD Instinct MI300X series accelerators backed by
Mellanox CX-7 NICs.

Prerequisites
=============

Before starting, ensure you have:

* A Slurm cluster with at least three nodes: one for the proxy, one for prefill (``xP``), and one for decode (``yD``).

  ``Nodes -> xP + yD + 1``

* A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See :ref:`sglang-disagg-inf-build-docker-image` for instructions.

* A shared filesystem for storing models, scripts, and logs (cluster-specific).

Supported models
================

The following models are supported for SGLang disaggregated prefill/decode
inference. Some instructions, commands, and recommendations in this
documentation might vary by selected model.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   .. raw:: html

      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
        <div class="row gx-0">
          <div class="col-2 me-1 px-2 model-param-head">Model type</div>
          <div class="row col-10 pe-0">
      {% for model_group in model_groups %}
            <div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
      {% endfor %}
          </div>
        </div>

        <div class="row gx-0 pt-1">
          <div class="col-2 me-1 px-2 model-param-head">Model</div>
          <div class="row col-10 pe-0">
      {% for model_group in model_groups %}
        {% set models = model_group.models %}
        {% for model in models %}
          {% if models|length % 3 == 0 %}
            <div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.model_repo | lower }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
          {% else %}
            <div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.model_repo | lower }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
          {% endif %}
        {% endfor %}
      {% endfor %}
          </div>
        </div>
      </div>

   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{ model.model_repo }}

      .. note::

         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`__ to learn more about this model.
         Some models require access authorization prior to use through an external license agreement with a third party.

   {% endfor %}
   {% endfor %}

.. _sglang-disagg-inf-build-docker-image:

Build the Docker image
----------------------

Get the Dockerfile from
`<https://github.com/ROCm/MAD/blob/develop/docker/sglang_dissag_inference.ubuntu.amd.Dockerfile>`__.
It uses `lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
<https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__
as the base Docker image and installs the components required for Mooncake, etcd, and the Mellanox network
drivers.

.. code-block:: shell

   git clone https://github.com/ROCm/MAD.git
   cd MAD/docker
   docker build \
     -t sglang_dissag_pd_image \
     -f sglang_dissag_inference.ubuntu.amd.Dockerfile .

Benchmarking
============

The `<https://github.com/ROCm/MAD/tree/develop/scripts/sglang_dissag>`__
directory of the MAD repository contains scripts to launch SGLang inference with
prefill/decode disaggregation via Mooncake for supported models.

* `scripts/sglang_dissag/run_xPyD_models.slurm <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/run_xPyD_models.slurm>`__
  -- the main Slurm batch script to launch Docker containers on all nodes using ``sbatch`` or ``salloc``.

* `scripts/sglang_dissag/sglang_disagg_server.sh <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/sglang_disagg_server.sh>`__
  -- the entrypoint script that runs inside each container to start the correct service: proxy, prefill, or decode.

* `scripts/sglang_dissag/benchmark_xPyD.sh <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/benchmark_xPyD.sh>`__
  -- the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.

* `scripts/sglang_dissag/benchmark_parser.py <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/benchmark_parser.py>`__
  -- the log parser script to run on the concurrency benchmark log file to generate tabulated data.

Launch the service
------------------

The service is deployed using a Slurm batch script that orchestrates the containers across the
allocated nodes.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{ model.model_repo }}

      .. code-block:: shell

         # Clone the MAD repo if you haven't already and
         # navigate to the scripts directory
         git clone https://github.com/ROCm/MAD.git
         cd MAD/scripts/sglang_dissag/

         # Slurm sbatch run command
         export DOCKER_IMAGE_NAME=sglang_dissag_pd_image
         export xP=<num_prefill_nodes>
         export yD=<num_decode_nodes>
         export MODEL_NAME={{ model.model_repo }}
         # num_nodes = xP + yD + 1
         sbatch -N <num_nodes> -n <num_nodes> --nodelist=<Nodes> run_xPyD_models.slurm

   {% endfor %}
   {% endfor %}

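As a rough sketch of the node arithmetic used in the launch command, the following
snippet derives the node count from example values of ``xP`` and ``yD`` and prints
the resulting ``sbatch`` command for review instead of submitting it. The values
``2`` and ``1`` are illustrative assumptions, and ``<Nodes>`` remains a placeholder
for your cluster-specific node list.

```shell
#!/usr/bin/env bash
# Example values only (assumptions); choose counts that match your cluster.
xP=2   # number of prefill nodes
yD=1   # number of decode nodes

# One extra node hosts the proxy, so the job needs xP + yD + 1 nodes.
NUM_NODES=$(( xP + yD + 1 ))

# Assemble and print the command rather than running it.
# <Nodes> is intentionally left as a placeholder.
SBATCH_CMD="sbatch -N ${NUM_NODES} -n ${NUM_NODES} --nodelist=<Nodes> run_xPyD_models.slurm"
echo "${SBATCH_CMD}"
```

With these example values the job spans four nodes: two prefill, one decode, and one proxy.
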
Post-run logs and testing
-------------------------

Logs are stored in your shared filesystem in the directory specified by the ``LOG_PATH`` variable in the Slurm script.
A new directory named after the Slurm job ID is created for each run.

Inside that directory, you can access the following logs:

* ``pd_sglang_bench_serving.sh_NODE<...>.log`` -- the main log for each server node.

* ``etcd_NODE<...>.log`` -- logs for etcd services.

* ``prefill_NODE<...>.log`` -- logs for the prefill services.

* ``decode_NODE<...>.log`` -- logs for the decode services.

To tabulate the benchmark results, run the benchmark parser script on a concurrency benchmark log file.

.. code-block:: shell

   python3 benchmark_parser.py <log_path/benchmark_XXX_CONCURRENCY.log>

To verify that the service is responsive, send a ``curl`` request to the launched
server from the Docker container on the proxy node. For example:

.. code-block:: shell

   curl -X POST http://127.0.0.1:30000/generate \
     -H "Content-Type: application/json" \
     -d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }'

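If you send several test prompts, it can help to build the request body in one
place. The following is a minimal sketch, not part of the MAD scripts:
``build_payload`` and ``send_request`` are hypothetical helper names, the endpoint
and JSON fields are the ones used in the ``curl`` example, and the prompt is
assumed to contain no double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for the /generate request.
# $1: prompt text (assumed free of double quotes and backslashes)
# $2: sampling temperature
build_payload() {
  printf '{ "text": "%s", "sampling_params": { "temperature": %s } }' "$1" "$2"
}

PAYLOAD=$(build_payload "Let me tell you a story " 0.3)
echo "${PAYLOAD}"

# Hypothetical helper: send the payload, failing on HTTP errors or after 60 seconds.
send_request() {
  curl --fail --silent --show-error --max-time 60 \
    -X POST http://127.0.0.1:30000/generate \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Uncomment to send the request from the proxy node's container:
# send_request "${PAYLOAD}"
```

The ``--fail`` and ``--max-time`` options make an unresponsive or erroring server show up as a non-zero exit status rather than a hung command.
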
Known issues
============

When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at
higher concurrency levels (512+), the following error might occur:

.. code-block:: shell-session

   <TransferEncodingError: 400, message:
     Not enough data to satisfy transfer length header.

   The above exception was the direct cause of the following exception:

   Traceback (most recent call last):
   ...

This leads to dropped requests and lower throughput.

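To check whether a run was affected, you can search its log directory for the
error string. The following is a minimal sketch: ``count_transfer_errors`` is a
hypothetical helper, it assumes the run's logs are under ``${LOG_PATH}/<slurm_job_id>``,
and it searches every file because this guide does not state which log records the
error. The demonstration uses a temporary directory with made-up log content.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines mentioning the error in all files under a directory.
# $1: log directory for one run (assumed to be ${LOG_PATH}/<slurm_job_id>)
count_transfer_errors() {
  grep -rF "TransferEncodingError" "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Self-contained demonstration on a throwaway directory with made-up log lines.
DEMO_DIR=$(mktemp -d)
printf '%s\n' "ok" "<TransferEncodingError: 400, message:" > "${DEMO_DIR}/demo_a.log"
printf '%s\n' "<TransferEncodingError: 400, message:" > "${DEMO_DIR}/demo_b.log"

ERROR_COUNT=$(count_transfer_errors "${DEMO_DIR}")
echo "TransferEncodingError lines: ${ERROR_COUNT}"

rm -rf "${DEMO_DIR}"
```

A non-zero count at a given concurrency level suggests that the throughput figures for that level include dropped requests.
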
Further reading
===============

- To learn about Mooncake, see `Welcome to Mooncake <https://kvcache-ai.github.io/Mooncake/>`__.

- To learn more about the options for latency and throughput benchmark scripts,
  see `<https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2>`__.

- See the base upstream Docker image on `Docker Hub <https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__.

- To learn more about system settings and management practices to configure your system for
  MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.

- For application performance optimization strategies for HPC and AI workloads,
  including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

- To learn how to run community models from Hugging Face on AMD GPUs, see
  :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

- To learn how to fine-tune LLMs and optimize inference, see
  :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference>`.

- For a list of other ready-made Docker images for AI with ROCm, see
  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

Previous versions
=================

See :doc:`previous-versions/sglang-history` to find documentation for previous releases
of SGLang inference performance testing.
@@ -106,6 +106,8 @@ subtrees:
  title: PyTorch inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
  title: SGLang inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
  title: SGLang distributed inference with Mooncake
- file: how-to/rocm-for-ai/inference/deploy-your-model.rst
  title: Deploy your model
