docs: Add SGLang disaggregated P/D inference w/ Mooncake guide (#5335)

* add main content
* Update content and format add clarification update update data
* fix fix fix
* fix: deepseek v3
* add ki
* Update docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2026-01-08 22:28:06 -05:00 · 2025-09-16 11:33:58 -04:00
parent ef4e7ca1fe
commit d5101532f7
3 changed files with 291 additions and 0 deletions
--- a/docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+++ b/docs/data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
@@ -0,0 +1,32 @@
+dockers:
+  - pull_tag: lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
+    docker_hub_url: https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729
+    components:
+      ROCm: 7.0.0
+      SGLang: v0.5.2rc1
+      pytorch-triton-rocm: 3.4.0+rocm7.0.0.gitf9e5bf54
+model_groups:
+  - group: Dense models
+    tag: dense-models
+    models:
+      - model: Llama 3.1 8B Instruct
+        model_repo: Llama-3.1-8B-Instruct
+        url: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
+      - model: Llama 3.1 405B FP8 KV
+        model_repo: Llama-3.1-405B-Instruct-FP8-KV
+        url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
+      - model: Llama 3.3 70B FP8 KV
+        model_repo: amd-Llama-3.3-70B-Instruct-FP8-KV
+        url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV
+      - model: Qwen3 32B
+        model_repo: Qwen3-32B
+        url: https://huggingface.co/Qwen/Qwen3-32B
+  - group: Small experts models
+    tag: small-experts-models
+    models:
+      - model: DeepSeek V3
+        model_repo: DeepSeek-V3
+        url: https://huggingface.co/deepseek-ai/DeepSeek-V3
+      - model: Mixtral 8x7B v0.1
+        model_repo: Mixtral-8x7B-v0.1
+        url: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
--- a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
+++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
@@ -0,0 +1,257 @@
+.. meta::
+   :description: SGLang multi-node disaggregated distributed inference using Mooncake
+   :keywords: model, sglang, mooncake, disagg, disaggregated, distributed, multi-node, docker
+
+******************************************
+SGLang distributed inference with Mooncake
+******************************************
+
+As LLM inference increasingly demands handling massive models and dynamic workloads, efficient
+distributed inference becomes essential. Traditional co-located architectures face bottlenecks due
+to tightly coupled memory and compute resources, which limits scalability and flexibility.
+Disaggregated inference splits LLM inference into two distinct phases, prefill and decode,
+that run on separate nodes. This architecture, facilitated by libraries like Mooncake, uses
+high-bandwidth RDMA to transfer the key-value (KV) cache between prefill and decode nodes.
+Each phase can then be scaled and optimized independently, which improves
+efficiency and throughput.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set docker = data.dockers[0] %}
+
+   `SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
+   serving engine for large language models (LLMs) and vision models. The
+   ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__
+   bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series
+   accelerators. It includes the following software components:
+
+   .. list-table::
+      :header-rows: 1
+
+      * - Software component
+        - Version
+
+      {% for component_name, component_version in docker.components.items() %}
+      * - {{ component_name }}
+        - {{ component_version }}
+      {% endfor %}
+
+The following guide describes how to set up and run SGLang and Mooncake for disaggregated
+distributed inference on a Slurm cluster using AMD Instinct MI300X series accelerators backed by
+Mellanox CX-7 NICs.
+
+Prerequisites
+=============
+
+Before starting, ensure you have:
+
+* A Slurm cluster with at least three nodes: one for the proxy, ``xP`` nodes for prefill (at least one), and ``yD`` nodes for decode (at least one).
+
+  ``Nodes -> xP + yD + 1``
+
+* A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See :ref:`sglang-disagg-inf-build-docker-image` for instructions.
+
+* A shared filesystem for storing models, scripts, and logs (cluster-specific).
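+
+As a quick check of the node count, the rule above can be sketched as a small
+shell helper. The function name ``total_nodes`` is hypothetical and is not part
+of the MAD scripts.
+
+.. code-block:: shell
+
+   # Hypothetical helper: xP prefill nodes + yD decode nodes + 1 proxy node
+   total_nodes() {
+       echo $(( $1 + $2 + 1 ))
+   }
+
+   # Example: 1 prefill node and 2 decode nodes
+   total_nodes 1 2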
+
+Supported models
+================
+
+The following models are supported for SGLang disaggregated prefill/decode
+inference. Some instructions, commands, and recommendations in this
+documentation might vary by selected model.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set model_groups = data.model_groups %}
+   .. raw:: html
+
+      <div id="vllm-benchmark-ud-params-picker" class="container-fluid">
+         <div class="row gx-0">
+            <div class="col-2 me-1 px-2 model-param-head">Model type</div>
+            <div class="row col-10 pe-0">
+      {% for model_group in model_groups %}
+               <div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
+      {% endfor %}
+            </div>
+         </div>
+
+         <div class="row gx-0 pt-1">
+            <div class="col-2 me-1 px-2 model-param-head">Model</div>
+            <div class="row col-10 pe-0">
+      {% for model_group in model_groups %}
+         {% set models = model_group.models %}
+         {% for model in models %}
+            {% if models|length % 3 == 0 %}
+               <div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.model_repo | lower }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
+            {% else %}
+               <div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.model_repo | lower }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
+            {% endif %}
+         {% endfor %}
+      {% endfor %}
+            </div>
+         </div>
+      </div>
+
+   {% for model_group in model_groups %}
+      {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.model_repo }}
+
+      .. note::
+
+         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`__ to learn more about this model.
+         Some models require access authorization prior to use through an external license agreement with a third party.
+
+      {% endfor %}
+   {% endfor %}
+
+.. _sglang-disagg-inf-build-docker-image:
+
+Build the Docker image
+----------------------
+
+Get the Dockerfile from
+`<https://github.com/ROCm/MAD/blob/develop/docker/sglang_dissag_inference.ubuntu.amd.Dockerfile>`__.
+It uses `lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x
+<https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__
+as the base Docker image and installs the necessary components for Mooncake, etcd, and Mellanox network
+drivers.
+
+.. code-block:: shell
+
+   git clone https://github.com/ROCm/MAD.git
+   cd MAD/docker
+   docker build \
+       -t sglang_dissag_pd_image \
+       -f sglang_dissag_inference.ubuntu.amd.Dockerfile .
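+
+To confirm that the build produced the expected tag before submitting a job, a
+check along these lines might help. This is a minimal sketch: the
+``image_exists`` helper is hypothetical and only wraps ``docker image inspect``.
+
+.. code-block:: shell
+
+   # Hypothetical helper: succeeds if the named image is present locally
+   image_exists() {
+       docker image inspect "$1" >/dev/null 2>&1
+   }
+
+   if image_exists sglang_dissag_pd_image; then
+       echo "image found"
+   else
+       echo "image missing: rerun docker build"
+   fi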
+
+Benchmarking
+============
+
+The `<https://github.com/ROCm/MAD/tree/develop/scripts/sglang_dissag>`__
+directory of the MAD repository contains scripts to launch SGLang inference with
+prefill/decode disaggregation via Mooncake for supported models.
+
+* `scripts/sglang_dissag/run_xPyD_models.slurm <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/run_xPyD_models.slurm>`__
+  -- the main Slurm batch script to launch Docker containers on all nodes using ``sbatch`` or ``salloc``.
+
+* `scripts/sglang_dissag/sglang_disagg_server.sh <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/sglang_disagg_server.sh>`__
+  -- the entrypoint script that runs inside each container to start the correct service -- proxy, prefill, or decode.
+
+* `scripts/sglang_dissag/benchmark_xPyD.sh <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/benchmark_xPyD.sh>`__
+  -- the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.
+
+* `scripts/sglang_dissag/benchmark_parser.py <https://github.com/ROCm/MAD/blob/develop/scripts/sglang_dissag/benchmark_parser.py>`__
+  -- the log parser script to be run on the concurrency benchmark log file to generate tabulated data.
+
+Launch the service
+------------------
+
+The service is deployed using a Slurm batch script that orchestrates the containers across the
+allocated nodes.
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml
+
+   {% set model_groups = data.model_groups %}
+   {% for model_group in model_groups %}
+      {% for model in model_group.models %}
+
+   .. container:: model-doc {{ model.model_repo }}
+
+      .. code-block:: shell
+
+         # Clone the MAD repo if you haven't already and
+         # navigate to the scripts directory
+         git clone https://github.com/ROCm/MAD.git
+         cd MAD/scripts/sglang_dissag/
+
+         # Slurm sbatch run command
+         export DOCKER_IMAGE_NAME=sglang_dissag_pd_image
+         export xP=<num_prefill_nodes>
+         export yD=<num_decode_nodes>
+         export MODEL_NAME={{ model.model_repo }}
+         # num_nodes = xP + yD + 1
+         sbatch -N <num_nodes> -n <num_nodes> --nodelist=<Nodes> run_xPyD_models.slurm
+
+      {% endfor %}
+   {% endfor %}
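+
+Once the variables above are exported, the submission can be wrapped so that
+the Slurm job ID is captured, for example to locate the log directory later.
+This is a sketch, not part of the MAD scripts: the ``submit_xpyd`` helper is
+hypothetical, and it relies on the standard ``sbatch --parsable`` option, which
+prints only the job ID (optionally followed by ``;`` and the cluster name).
+
+.. code-block:: shell
+
+   # Hypothetical wrapper: submit the job on <num_nodes> nodes and print its job ID.
+   # Assumes DOCKER_IMAGE_NAME, xP, yD, and MODEL_NAME are already exported.
+   submit_xpyd() {
+       sbatch --parsable -N "$1" -n "$1" run_xPyD_models.slurm | cut -d';' -f1
+   }
+
+   # Usage, with num_nodes = xP + yD + 1:
+   #   job_id=$(submit_xpyd 4) && squeue -j "$job_id"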
+
+Post-run logs and testing
+-------------------------
+
+Logs are stored in your shared filesystem in the directory specified by the ``LOG_PATH`` variable in the Slurm script.
+A new directory named after the Slurm job ID is created for each run.
+
+Inside that directory, you can access various logs:
+
+* ``pd_sglang_bench_serving.sh_NODE<...>.log`` -- the main log for each server node.
+
+* ``etcd_NODE<...>.log`` -- logs for etcd services.
+
+* ``prefill_NODE<...>.log`` -- logs for the prefill services.
+
+* ``decode_NODE<...>.log`` -- logs for the decode services.
+
+Run the benchmark parser script on a concurrency benchmark log to tabulate its results.
+
+.. code-block:: shell
+
+   python3 benchmark_parser.py <log_path/benchmark_XXX_CONCURRENCY.log>
+
+To verify that the service is responsive, send a ``curl`` request to the launched
+server from inside the Docker container on the proxy node. For example:
+
+.. code-block:: shell
+
+   curl -X POST http://127.0.0.1:30000/generate \
+       -H "Content-Type: application/json" \
+       -d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }'
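+
+Because the servers can take some time to load a model, the request above can
+be retried until it succeeds. The following is a minimal sketch: the
+``wait_for_server`` helper, the number of attempts, and the delay between
+attempts are assumptions, and the request itself is the same as in the example
+above.
+
+.. code-block:: shell
+
+   # Hypothetical helper: retry the /generate request until the proxy answers.
+   # Arguments: URL, maximum attempts, delay in seconds between attempts.
+   wait_for_server() {
+       url="${1:-http://127.0.0.1:30000/generate}"
+       attempts="${2:-30}"
+       delay="${3:-10}"
+       i=0
+       while [ "$i" -lt "$attempts" ]; do
+           # -s silences progress output, -f makes HTTP errors return non-zero
+           if curl -sf -X POST "$url" \
+               -H "Content-Type: application/json" \
+               -d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }' \
+               >/dev/null; then
+               echo "server is up"
+               return 0
+           fi
+           i=$(( i + 1 ))
+           sleep "$delay"
+       done
+       echo "server did not respond" >&2
+       return 1
+   }
+
+   # Usage: wait_for_server http://127.0.0.1:30000/generate 30 10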
+
+Known issues
+============
+
+When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at
+higher concurrency levels (512+), the following error might occur:
+
+.. code-block:: shell-session
+
+   <TransferEncodingError: 400, message:
+    Not enough data to satisfy transfer length header.
+
+   The above exception was the direct cause of the following exception:
+
+   Traceback (most recent call last):
+   ...
+
+This leads to dropped requests and lower throughput.
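+
+To see which logs from a run recorded this error, a search along these lines
+might help. This is a sketch under assumptions: the ``find_transfer_errors``
+helper is hypothetical, and it assumes the error text is written to one of the
+``.log`` files in the run directory described in the previous section.
+
+.. code-block:: shell
+
+   # Hypothetical helper: print the log files in a run directory that
+   # contain the TransferEncodingError message
+   find_transfer_errors() {
+       grep -l "TransferEncodingError" "$1"/*.log
+   }
+
+   # Usage (the job ID is a placeholder):
+   #   find_transfer_errors "$LOG_PATH/<slurm_job_id>"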
+
+Further reading
+===============
+
+- To learn about Mooncake, see `Welcome to Mooncake <https://kvcache-ai.github.io/Mooncake/>`__.
+
+- To learn more about the options for latency and throughput benchmark scripts,
+  see `<https://github.com/sgl-project/sglang/tree/main/benchmark/blog_v0_2>`__.
+
+- See the base upstream Docker image on `Docker Hub <https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__.
+
+- To learn more about system settings and management practices to configure your system for
+  MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
+
+- For application performance optimization strategies for HPC and AI workloads,
+  including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.
+
+- To learn how to run community models from Hugging Face on AMD GPUs, see
+  :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.
+
+- To learn how to fine-tune LLMs and optimize inference, see
+  :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/fine-tuning-and-inference>`.
+
+- For a list of other ready-made Docker images for AI with ROCm, see
+  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`__.
+
+Previous versions
+=================
+
+See :doc:`previous-versions/sglang-history` to find documentation for previous releases
+of SGLang inference performance testing.
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -106,6 +106,8 @@ subtrees:
            title: PyTorch inference performance testing
          - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
            title: SGLang inference performance testing
+          - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
+            title: SGLang distributed inference with Mooncake
          - file: how-to/rocm-for-ai/inference/deploy-your-model.rst
            title: Deploy your model