From d5101532f7b7965d914012c731b0ea94d8897342 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Tue, 16 Sep 2025 11:33:58 -0400
Subject: [PATCH] docs: Add SGLang disaggregated P/D inference w/ Mooncake
 guide (#5335)

* add main content

* Update content and format

add clarification

update

update data

* fix

fix

fix

* fix: deepseek v3

* add ki

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
---
 .../sglang-distributed-benchmark-models.yaml  |  32 +++
 .../benchmark-docker/sglang-distributed.rst   | 257 ++++++++++++++++++
 docs/sphinx/_toc.yml.in                       |   2 +
 3 files changed, 291 insertions(+)
 create mode 100644 docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
 create mode 100644 docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

diff --git a/docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml b/docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
new file mode 100644
index 000000000..b0ba6a549
--- /dev/null
+++ b/docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
@@ -0,0 +1,32 @@
+dockers:
+  - pull_tag: lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
+    docker_hub_url: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729
+    components:
+      ROCm: 7.0.0
+      SGLang: v0.5.2rc1
+      pytorch-triton-rocm: 3.4.0+rocm7.0.0.gitf9e5bf54
+model_groups:
+  - group: Dense models
+    tag: dense-models
+    models:
+      - model: Llama 3.1 8B Instruct
+        model_repo: Llama-3.1-8B-Instruct
+        url: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+      - model: Llama 3.1 405B FP8 KV
+        model_repo: Llama-3.1-405B-Instruct-FP8-KV
+        url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
+      - model: Llama 3.3 70B FP8 KV
+        model_repo: amd-Llama-3.3-70B-Instruct-FP8-KV
+        url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV
+      - model: Qwen3 32B
+        model_repo: Qwen3-32B
+        url: https://huggingface.co/Qwen/Qwen3-32B
+  - group: Small experts models
+    tag: small-experts-models
+    models:
+      - model: DeepSeek V3
+        model_repo: DeepSeek-V3
+        url: https://huggingface.co/deepseek-ai/DeepSeek-V3
+      - model: Mixtral 8x7B v0.1
+        model_repo: Mixtral-8x7B-v0.1
+        url: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
new file mode 100644
index 000000000..799b330db
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
@@ -0,0 +1,257 @@
+.. meta::
+   :description: SGLang multi-node disaggregated distributed inference using Mooncake
+   :keywords: model, sglang, mooncake, disagg, disaggregated, distributed, multi-node, docker
+
+******************************************
+SGLang distributed inference with Mooncake
+******************************************
+
+As LLM inference increasingly demands massive models and dynamic workloads, efficient
+distributed inference becomes essential. Traditional co-located architectures face bottlenecks due
+to tightly coupled memory and compute resources, which limits scalability and flexibility.
+Disaggregated inference splits LLM inference into its distinct prefill and decode phases and runs
+each phase on separate nodes. This architecture, facilitated by libraries like Mooncake, uses
+high-bandwidth RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes.
+This allows each phase to be scaled and optimized independently, resulting in
+improved efficiency and throughput.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set docker = data.dockers[0] %}
+
+   `SGLang <https://github.com/sgl-project/sglang>`__ is a high-performance inference and
+   serving engine for large language models (LLMs) and vision models. The
+   ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__
+   bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series
+   accelerators. It includes the following software components:
+
+   .. list-table::
+      :header-rows: 1
+
+      * - Software component
+        - Version
+
+      {% for component_name, component_version in docker.components.items() %}
+      * - {{ component_name }}
+        - {{ component_version }}
+      {% endfor %}
+
+This guide covers setting up and running SGLang and Mooncake for disaggregated
+distributed inference on a Slurm cluster using AMD Instinct MI300X series accelerators backed by
+Mellanox CX-7 NICs.
+
+Prerequisites
+=============
+
+Before starting, ensure you have:
+
+* A Slurm cluster with at least three nodes: one for the proxy, one for prefill (``xP``), and one
+  for decode (``yD``) -- see the sizing sketch after this list:
+
+  ``Nodes = xP + yD + 1``
+
+* A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See :ref:`sglang-disagg-inf-build-docker-image` for instructions.
+
+* A shared filesystem for storing models, scripts, and logs (cluster-specific).
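+For example, the following sketch shows how the allocation size follows from the prefill and
+decode node counts for a minimal single-prefill, single-decode (1P1D) run. The values are
+illustrative only; set them to match your cluster.
+
+.. code-block:: shell
+
+   # Example sizing only: one prefill node and one decode node, plus the proxy node.
+   export xP=1   # number of prefill nodes
+   export yD=1   # number of decode nodes
+   num_nodes=$((xP + yD + 1))
+   echo "A 1P1D deployment needs ${num_nodes} Slurm nodes"   # prints 3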
+Supported models
+================
+
+The following models are supported for SGLang disaggregated prefill/decode
+inference. Some instructions, commands, and recommendations in this
+documentation might vary by selected model.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set model_groups = data.model_groups %}
+   .. raw:: html
+
+      <div class="model-doc-selector">
+        <div class="selector-row">
+          <div class="selector-label">Model type</div>
+          <div class="selector-buttons">
+            {% for model_group in model_groups %}
+            <button class="model-group-button" data-group="{{ model_group.tag }}">
+            {{ model_group.group }}
+            </button>
+            {% endfor %}
+          </div>
+        </div>
+        <div class="selector-row">
+          <div class="selector-label">Model</div>
+          <div class="selector-buttons">
+            {% for model_group in model_groups %}
+            {% set models = model_group.models %}
+            {% for model in models %}
+            {% if models|length % 3 == 0 %}
+            <button class="model-button col-4 {{ model_group.tag }}" data-model="{{ model.model_repo }}">
+            {{ model.model }}
+            </button>
+            {% else %}
+            <button class="model-button col-6 {{ model_group.tag }}" data-model="{{ model.model_repo }}">
+            {{ model.model }}
+            </button>
+            {% endif %}
+            {% endfor %}
+            {% endfor %}
+          </div>
+        </div>
+      </div>
+
+   {% for model_group in model_groups %}
+   {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.model_repo }}
+
+      .. note::
+
+         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`__ to learn more about this model.
+         Some models require access authorization prior to use through an external license agreement with a third party.
+
+   {% endfor %}
+   {% endfor %}
+
+.. _sglang-disagg-inf-build-docker-image:
+
+Build the Docker image
+----------------------
+
+Get the Dockerfile ``sglang_dissag_inference.ubuntu.amd.Dockerfile`` from the ``docker`` directory
+of the `ROCm/MAD <https://github.com/ROCm/MAD>`__ repository.
+It uses `lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
+<https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__
+as the base Docker image and installs the necessary components for Mooncake, etcd, and the Mellanox network
+drivers.
+
+.. code-block:: shell
+
+   git clone https://github.com/ROCm/MAD.git
+   cd MAD/docker
+   docker build \
+       -t sglang_dissag_pd_image \
+       -f sglang_dissag_inference.ubuntu.amd.Dockerfile .
+
+Benchmarking
+============
+
+The `ROCm/MAD <https://github.com/ROCm/MAD>`__
+repository contains scripts to launch SGLang inference with prefill/decode
+disaggregation via Mooncake for supported models.
+
+* ``scripts/sglang_dissag/run_xPyD_models.slurm``
+  -- the main Slurm batch script to launch Docker containers on all nodes using ``sbatch`` or ``salloc``.
+
+* ``scripts/sglang_dissag/sglang_disagg_server.sh``
+  -- the entrypoint script that runs inside each container to start the correct service -- proxy, prefill, or decode.
+
+* ``scripts/sglang_dissag/benchmark_xPyD.sh``
+  -- the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.
+
+* ``scripts/sglang_dissag/benchmark_parser.py``
+  -- the log parser script to run on the concurrency benchmark log file to generate tabulated data.
+
+Launch the service
+------------------
+
+The service is deployed using a Slurm batch script that orchestrates the containers across the
+allocated nodes.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set model_groups = data.model_groups %}
+   {% for model_group in model_groups %}
+   {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.model_repo }}
+
+      .. code-block:: shell
+
+         # Clone the MAD repo if you haven't already and
+         # navigate to the scripts directory
+         git clone https://github.com/ROCm/MAD.git
+         cd MAD/scripts/sglang_dissag/
+
+         # Slurm sbatch run command
+         export DOCKER_IMAGE_NAME=sglang_dissag_pd_image
+         export xP=<num_prefill_nodes>
+         export yD=<num_decode_nodes>
+         export MODEL_NAME={{ model.model_repo }}
+         # num_nodes = xP + yD + 1
+         sbatch -N <num_nodes> -n <num_nodes> --nodelist=<node_list> run_xPyD_models.slurm
+
+   {% endfor %}
+   {% endfor %}
+
+Post-run logs and testing
+-------------------------
+
+Logs are stored in your shared filesystem in the directory specified by the ``LOG_PATH`` variable in the Slurm script.
+A new directory named after the Slurm job ID is created for each run.
+
+Inside that directory, you can access various logs:
+
+* ``pd_sglang_bench_serving.sh_NODE<...>.log`` -- the main log for each server node.
+
+* ``etcd_NODE<...>.log`` -- logs for the etcd services.
+
+* ``prefill_NODE<...>.log`` -- logs for the prefill services.
+
+* ``decode_NODE<...>.log`` -- logs for the decode services.
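+While the job is running, you can follow these per-node logs to confirm that the prefill and
+decode services come up before sending test requests. The following is a minimal check, assuming
+the ``LOG_PATH`` directory layout described above and the job ID reported by ``sbatch``; the
+bracketed values are placeholders.
+
+.. code-block:: shell
+
+   # Confirm the job is running, then follow the prefill and decode service logs.
+   # <LOG_PATH> is the path set in the Slurm script; <slurm_job_id> is printed by sbatch.
+   squeue -u $USER
+   cd <LOG_PATH>/<slurm_job_id>
+   tail -f prefill_NODE*.log decode_NODE*.log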
+Run the benchmark parser script on a concurrency benchmark log file to tabulate the data.
+
+.. code-block:: shell
+
+   python3 benchmark_parser.py <path_to_concurrency_log>
+
+To verify the service is responsive, you can try sending a ``curl`` request to test the launched
+server from the Docker container on the proxy node. For example:
+
+.. code-block:: shell
+
+   curl -X POST http://127.0.0.1:30000/generate \
+     -H "Content-Type: application/json" \
+     -d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }'
+
+Known issues
+============
+
+When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at
+higher concurrency levels (512+), runtime errors might occur.
+
+Further reading
+===============
+
+- To learn more about the options for latency and throughput benchmark scripts,
+  see the `SGLang <https://github.com/sgl-project/sglang>`__ repository.
+
+- See the base upstream Docker image on `Docker Hub
+  <https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__.
+
+- To learn more about system settings and management practices to configure your system for
+  MI300X series accelerators, see `AMD Instinct MI300X system optimization
+  <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html>`__.
+
+- For application performance optimization strategies for HPC and AI workloads,
+  including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.
+
+- To learn how to run community models from Hugging Face on AMD GPUs, see
+  :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.
+
+- To learn how to fine-tune LLMs and optimize inference, see
+  :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/index>`.
+
+- For a list of other ready-made Docker images for AI with ROCm, see
+  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html>`_.
+
+Previous versions
+=================
+
+See :doc:`previous-versions/sglang-history` to find documentation for previous releases
+of SGLang inference performance testing.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 4c3e6f0c2..e15c1039f 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -106,6 +106,8 @@ subtrees:
           title: PyTorch inference performance testing
         - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
           title: SGLang inference performance testing
+        - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
+          title: SGLang distributed inference with Mooncake
         - file: how-to/rocm-for-ai/inference/deploy-your-model.rst
           title: Deploy your model