.. meta::
   :description: SGLang multi-node disaggregated distributed inference using Mooncake
   :keywords: model, sglang, mooncake, disagg, disaggregated, distributed, multi-node, docker

******************************************
SGLang distributed inference with Mooncake
******************************************

As LLM inference increasingly demands handling massive models and dynamic workloads, efficient distributed inference becomes essential. Traditional co-located architectures face bottlenecks because memory and compute resources are tightly coupled, which limits scalability and flexibility.

Disaggregated inference splits LLM inference into distinct phases. This architecture, facilitated by libraries like Mooncake, uses high-bandwidth RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes. This allows for independent resource scaling and optimization, resulting in improved efficiency and throughput.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set docker = data.dockers[0] %}

   SGLang is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__ bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X Series GPUs. It includes the following software components:

   .. list-table::
      :header-rows: 1

      * - Software component
        - Version
      {% for component_name, component_version in docker.components.items() %}
      * - {{ component_name }}
        - {{ component_version }}
      {% endfor %}

This guide walks through setting up and running SGLang and Mooncake for disaggregated distributed inference on a Slurm cluster using AMD Instinct MI300X Series GPUs backed by Mellanox CX-7 NICs.

Prerequisites
=============

Before starting, ensure you have:

* A Slurm cluster with at least three nodes: one for the proxy, at least one for prefill (``xP``), and at least one for decode (``yD``) -- ``Nodes -> xP + yD + 1`` (see the sanity-check sketch after this list).

* A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See :ref:`sglang-disagg-inf-build-docker-image` for instructions.

* A shared filesystem for storing models, scripts, and logs (cluster-specific).
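Before submitting any jobs, it can help to confirm that the allocation and the RDMA fabric match what the scripts expect. The following is a minimal sanity-check sketch, not part of the MAD scripts: the partition name ``mi300x`` is only an example, so substitute your cluster's partition name and run the device checks on a compute node.

.. code-block:: shell

   # Check that the partition has enough nodes available (needs xP + yD + 1 in total)
   sinfo -p mi300x -o "%D %N %t"

   # On a compute node: confirm the MI300X GPUs are visible
   rocm-smi --showproductname

   # On a compute node: confirm the RDMA devices (Mellanox CX-7 NICs) are present and active
   ibv_devinfo | grep -E "hca_id|state"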
Supported models
================

The following models are supported for SGLang disaggregated prefill/decode inference. Some instructions, commands, and recommendations in this documentation might vary by selected model.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set model_groups = data.model_groups %}

   .. list-table::
      :header-rows: 1

      * - Model type
        - Models
      {% for model_group in model_groups %}
      * - {{ model_group.group }}
        - {% for model in model_group.models %}{{ model.model }}{% if not loop.last %}, {% endif %}{% endfor %}
      {% endfor %}

   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{ model.model_repo }}

      .. note::

         See the `{{ model.model }} model card on Hugging Face <{{ model.url }}>`__ to learn more about this model. Some models require access authorization prior to use through an external license agreement with a third party.

   {% endfor %}
   {% endfor %}
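As the model notes above indicate, some of these models are gated on Hugging Face. Whether a token is needed, and how it is picked up, depends on how your containers fetch model weights; the sketch below only shows the standard Hugging Face environment variable and CLI, and the repository and path values are placeholders.

.. code-block:: shell

   # For gated models, request access on the Hugging Face model card first,
   # then export a read token so downloads can authenticate.
   export HF_TOKEN=<your_hugging_face_access_token>

   # Optional: pre-download weights to the shared filesystem instead of
   # pulling them at service startup.
   pip install -U "huggingface_hub[cli]"
   huggingface-cli download <model_repo> --local-dir <shared_filesystem_path>/<model_repo>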
.. _sglang-disagg-inf-build-docker-image:

Build the Docker image
----------------------

Get the Dockerfile ``sglang_disagg_inference.ubuntu.amd.Dockerfile`` from the ``docker`` directory of the `MAD repository <https://github.com/ROCm/MAD>`__. It uses ``lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x`` as the base Docker image and installs the necessary components for Mooncake, etcd, and Mellanox network drivers.

.. code-block:: shell

   git clone https://github.com/ROCm/MAD.git
   cd MAD/docker
   docker build \
       -t sglang_disagg_pd_image \
       -f sglang_disagg_inference.ubuntu.amd.Dockerfile .

Benchmarking
============

The `MAD <https://github.com/ROCm/MAD>`__ repository contains scripts to launch SGLang inference with prefill/decode disaggregation via Mooncake for the supported models.

* ``scripts/sglang_disagg/run_xPyD_models.slurm`` -- the main Slurm batch script to launch Docker containers on all nodes using ``sbatch`` or ``salloc``.

* ``scripts/sglang_disagg/sglang_disagg_server.sh`` -- the entrypoint script that runs inside each container to start the correct service -- proxy, prefill, or decode.

* ``scripts/sglang_disagg/benchmark_xPyD.sh`` -- the benchmark script that runs the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.

* ``scripts/sglang_disagg/benchmark_parser.py`` -- the log parser script that runs on the concurrency benchmark log file to generate tabulated data.

Launch the service
------------------

The service is deployed using a Slurm batch script that orchestrates the containers across the allocated nodes. A quick way to monitor the submitted job is shown after the launch command.

.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-distributed-benchmark-models.yaml

   {% set model_groups = data.model_groups %}
   {% for model_group in model_groups %}
   {% for model in model_group.models %}

   .. container:: model-doc {{ model.model_repo }}

      .. code-block:: shell

         # Clone the MAD repo if you haven't already and
         # navigate to the scripts directory
         git clone https://github.com/ROCm/MAD.git
         cd MAD/scripts/sglang_disagg/

         # Slurm sbatch run command
         export DOCKER_IMAGE_NAME=sglang_disagg_pd_image
         export xP=<number_of_prefill_nodes>
         export yD=<number_of_decode_nodes>
         export MODEL_NAME={{ model.model_repo }}

         # num_nodes = xP + yD + 1
         sbatch -N <num_nodes> -n <num_nodes> --nodelist=<node_list> run_xPyD_models.slurm

   {% endfor %}
   {% endfor %}
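Once the job is submitted, you can check that Slurm scheduled it and that the services are starting before running any benchmarks. The following is a minimal sketch; replace ``$LOG_PATH`` with the path you configured in the Slurm script, and see the next section for the log file names.

.. code-block:: shell

   # Confirm the job is running and see which nodes it was placed on
   squeue -u $USER -o "%A %T %D %R"

   # Logs for the run land under LOG_PATH in a directory named after the job ID
   ls $LOG_PATH/<job_id>/
   tail -f $LOG_PATH/<job_id>/prefill_*.log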
Post-run logs and testing
-------------------------

Logs are stored in your shared filesystem in the directory specified by the ``LOG_PATH`` variable in the Slurm script. A new directory named after the Slurm job ID is created for each run. Inside that directory, you can access various logs:

* ``pd_sglang_bench_serving.sh_NODE<...>.log`` -- the main log for each server node.
* ``etcd_NODE<...>.log`` -- logs for the etcd services.
* ``prefill_NODE<...>.log`` -- logs for the prefill services.
* ``decode_NODE<...>.log`` -- logs for the decode services.

Use the benchmark parser script on the concurrency benchmark log file to tabulate the results.

.. code-block:: shell

   python3 benchmark_parser.py <path_to_concurrency_benchmark_log>

To verify the service is responsive, send a ``curl`` request to the launched server from the Docker container on the proxy node. For example:

.. code-block:: shell

   curl -X POST http://127.0.0.1:30000/generate \
     -H "Content-Type: application/json" \
     -d '{
       "text": "Let me tell you a story ",
       "sampling_params": {
         "temperature": 0.3
       }
     }'

Known issues
============

When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at higher concurrency levels (512+), errors might occur.

Further reading
===============

- To learn more about the options for the latency and throughput benchmark scripts, see the SGLang benchmarking documentation.
- See the base upstream SGLang Docker image on Docker Hub.
- To learn more about system settings and management practices to configure your system for MI300X Series GPUs, see the AMD Instinct MI300X system optimization guide.
- For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.
- To learn how to run community models from Hugging Face on AMD GPUs, see the Running models from Hugging Face guide.
- To learn how to fine-tune LLMs and optimize inference, see the Fine-tuning LLMs and inference optimization guide.
- For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.

Previous versions
=================

See :doc:`previous-versions/sglang-history` to find documentation for previous releases of SGLang inference performance testing.