diff --git a/.wordlist.txt b/.wordlist.txt
index ea414acdc..d88864e90 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -408,6 +408,7 @@ SDMA
 SDPA
 SDRAM
 SENDMSG
+SGLang
 SGPR
 SGPRs
 SHA
@@ -863,6 +864,7 @@ seealso
 sendmsg
 seqs
 serializers
+sglang
 shader
 sharding
 sigmoid
diff --git a/docs/data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml b/docs/data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml
new file mode 100644
index 000000000..cc832dffb
--- /dev/null
+++ b/docs/data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml
@@ -0,0 +1,17 @@
+sglang_benchmark:
+  unified_docker:
+    latest:
+      pull_tag: lmsysorg/sglang:v0.4.5-rocm630
+      docker_hub_url: https://hub.docker.com/layers/lmsysorg/sglang/v0.4.5-rocm630/images/sha256-63d2cb760a237125daf6612464cfe2f395c0784e21e8b0ea37d551cd10d3c951
+      rocm_version: 6.3.0
+      sglang_version: 0.4.5 (0.4.5-rocm)
+      pytorch_version: 2.6.0a0+git8d4926e
+  model_groups:
+    - group: DeepSeek
+      tag: deepseek
+      models:
+        - model: DeepSeek-R1-Distill-Qwen-32B
+          mad_tag: pyt_sglang_deepseek-r1-distill-qwen-32b
+          model_repo: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
+          url: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
+          precision: bfloat16
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst
new file mode 100644
index 000000000..f2c993377
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history.rst
@@ -0,0 +1,25 @@
+:orphan:
+
+****************************************************
+SGLang inference performance testing version history
+****************************************************
+
+This table lists previous versions of the ROCm SGLang inference performance
+testing environment. For detailed information about available models for
+benchmarking, see the version-specific documentation.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Docker image tag
+     - Components
+     - Resources
+
+   * - ``lmsysorg/sglang:v0.4.5-rocm630``
+     -
+       * ROCm 6.3.0
+       * SGLang 0.4.5
+       * PyTorch 2.6.0
+     -
+       * :doc:`Documentation <../sglang>`
+       * `Docker Hub <https://hub.docker.com/layers/lmsysorg/sglang/v0.4.5-rocm630/images/sha256-63d2cb760a237125daf6612464cfe2f395c0784e21e8b0ea37d551cd10d3c951>`__
diff --git a/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
new file mode 100644
index 000000000..340ef975e
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
@@ -0,0 +1,280 @@
+.. meta::
+   :description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and SGLang
+   :keywords: model, MAD, automation, dashboarding, validate
+
+************************************
+SGLang inference performance testing
+************************************
+
+.. _sglang-benchmark-unified-docker:
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml
+
+   {% set unified_docker = data.sglang_benchmark.unified_docker.latest %}
+
+   `SGLang <https://github.com/sgl-project/sglang>`__ is a high-performance inference and
+   serving engine for large language models (LLMs) and vision models. The
+   ROCm-enabled `SGLang Docker image <{{ unified_docker.docker_hub_url }}>`__
+   bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series
+   accelerators. It includes the following software components:
+
+   .. list-table::
+      :header-rows: 1
+
+      * - Software component
+        - Version
+
+      * - `ROCm <https://github.com/ROCm/ROCm>`__
+        - {{ unified_docker.rocm_version }}
+
+      * - `SGLang <https://github.com/sgl-project/sglang>`__
+        - {{ unified_docker.sglang_version }}
+
+      * - `PyTorch <https://github.com/pytorch/pytorch>`__
+        - {{ unified_docker.pytorch_version }}
+
+System validation
+=================
+
+Before running AI workloads, it's important to validate that your AMD hardware is configured
+correctly and performing optimally.
+
+If you have already validated your system settings, including aspects like NUMA auto-balancing, you
+can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
+optimization ` guide to properly configure your system settings before running benchmarks.
+
+To test for optimal performance, consult the recommended :ref:`System health benchmarks `.
+This suite of tests will help you verify and fine-tune your system's configuration.
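+As a quick spot check, you can confirm the two settings most likely to skew
+benchmark results. The following is a minimal sketch, assuming the ROCm
+utilities are installed on the host; it doesn't replace the full validation
+procedures referenced above.
+
+.. code-block:: shell
+
+   # List the detected accelerators; all MI300X devices should appear.
+   rocm-smi --showproductname
+
+   # NUMA auto-balancing should be disabled for benchmarking; expect 0.
+   cat /proc/sys/kernel/numa_balancing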
+
+.. datatemplate:yaml:: /data/how-to/rocm-for-ai/inference/sglang-benchmark-models.yaml
+
+   {% set unified_docker = data.sglang_benchmark.unified_docker.latest %}
+   {% set model_groups = data.sglang_benchmark.model_groups %}
+
+   Pull the Docker image
+   =====================
+
+   Download the `SGLang Docker image <{{ unified_docker.docker_hub_url }}>`__
+   from Docker Hub using the following command.
+
+   .. code-block:: shell
+
+      docker pull {{ unified_docker.pull_tag }}
+
+   Benchmarking
+   ============
+
+   Once the setup is complete, choose one of the following methods to benchmark inference performance with
+   `DeepSeek-R1-Distill-Qwen-32B <https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B>`__.
+
+   .. _sglang-benchmark-mad:
+
+   {% for model_group in model_groups %}
+   {% for model in model_group.models %}
+
+   .. container:: model-doc {{model.mad_tag}}
+
+      .. tab-set::
+
+         .. tab-item:: MAD-integrated benchmarking
+
+            1. Clone the ROCm Model Automation and Dashboarding
+               (`MAD <https://github.com/ROCm/MAD>`__) repository to a local
+               directory and install the required packages on the host machine.
+
+               .. code-block:: shell
+
+                  git clone https://github.com/ROCm/MAD
+                  cd MAD
+                  pip install -r requirements.txt
+
+            2. Use this command to run the performance benchmark test on the `{{model.model}} <{{ model.url }}>`_ model
+               using one GPU with the ``{{model.precision}}`` data type on the host machine.
+
+               .. code-block:: shell
+
+                  export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
+                  madengine run \
+                     --tags {{model.mad_tag}} \
+                     --keep-model-dir \
+                     --live-output \
+                     --timeout 28800
+
+               MAD launches a Docker container with the name
+               ``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the
+               model are collected at the following path: ``~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv``.
+
+               Although the DeepSeek-R1-Distill-Qwen-32B model is preconfigured
+               to collect latency and throughput performance data, you can also change the benchmarking
+               parameters. See the standalone benchmarking tab for more information.
+
+         .. tab-item:: Standalone benchmarking
+
+            .. rubric:: Download the Docker image and required scripts
+
+            1. Run the SGLang benchmark script independently by starting the
+               `Docker container <{{ unified_docker.docker_hub_url }}>`__
+               as shown in the following snippet.
+
+               .. code-block:: shell
+
+                  docker pull {{ unified_docker.pull_tag }}
+                  docker run -it \
+                     --device=/dev/kfd \
+                     --device=/dev/dri \
+                     --group-add video \
+                     --shm-size 16G \
+                     --security-opt seccomp=unconfined \
+                     --security-opt apparmor=unconfined \
+                     --cap-add=SYS_PTRACE \
+                     -v $(pwd):/workspace \
+                     --env HUGGINGFACE_HUB_CACHE=/workspace \
+                     --name test \
+                     {{ unified_docker.pull_tag }}
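+
+               Before running the benchmark, you can optionally confirm that the
+               accelerators are visible from inside the container. This is a
+               minimal sketch, assuming ``rocm-smi`` is available in the image
+               (it ships with ROCm).
+
+               .. code-block:: shell
+
+                  # Inside the container: each MI300X device should be listed.
+                  rocm-smi --showproductname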
+
+            2. In the Docker container, clone the ROCm MAD repository and navigate to the
+               benchmark scripts directory at ``~/MAD/scripts/sglang``.
+
+               .. code-block:: shell
+
+                  git clone https://github.com/ROCm/MAD
+                  cd MAD/scripts/sglang
+
+            3. To start the benchmark, use the following command with the appropriate options.
+
+               .. dropdown:: Benchmark options
+                  :open:
+
+                  .. list-table::
+                     :header-rows: 1
+                     :align: center
+
+                     * - Name
+                       - Options
+                       - Description
+
+                     * - ``$test_option``
+                       - latency
+                       - Measure token decoding latency
+
+                     * -
+                       - throughput
+                       - Measure token generation throughput
+
+                     * -
+                       - all
+                       - Measure both throughput and latency
+
+                     * - ``$num_gpu``
+                       - 8
+                       - Number of GPUs
+
+                     * - ``$datatype``
+                       - ``bfloat16``
+                       - Data type
+
+                     * - ``$dataset``
+                       - random
+                       - Dataset
+
+               The input sequence length, output sequence length, and tensor parallelism (TP) degree are
+               already configured. You don't need to specify them with this script.
+
+               Command:
+
+               .. code-block:: shell
+
+                  ./sglang_benchmark_report.sh -s $test_option -m {{model.model_repo}} -g $num_gpu -d $datatype [-a $dataset]
+
+               .. note::
+
+                  If you encounter the following error, pass a Hugging Face token
+                  that has been granted access to the gated model.
+
+                  .. code-block:: shell-session
+
+                     OSError: You are trying to access a gated repo.
+
+                     # pass your HF_TOKEN
+                     export HF_TOKEN=$your_personal_hf_token
+
+            .. rubric:: Benchmarking examples
+
+            Here are some examples of running the benchmark with various options:
+
+            * Latency benchmark
+
+              Use this command to benchmark the latency of the {{model.model}} model on eight GPUs with ``{{model.precision}}`` precision.
+
+              .. code-block:: shell
+
+                 ./sglang_benchmark_report.sh \
+                    -s latency \
+                    -m {{model.model_repo}} \
+                    -g 8 \
+                    -d {{model.precision}}
+
+              Find the latency report at ``./reports_{{model.precision}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_latency_report.csv``.
+
+            * Throughput benchmark
+
+              Use this command to benchmark the throughput of the {{model.model}} model on eight GPUs with ``{{model.precision}}`` precision.
+
+              .. code-block:: shell
+
+                 ./sglang_benchmark_report.sh \
+                    -s throughput \
+                    -m {{model.model_repo}} \
+                    -g 8 \
+                    -d {{model.precision}} \
+                    -a random
+
+              Find the throughput report at ``./reports_{{model.precision}}/summary/{{model.model_repo.split('/', 1)[1] if '/' in model.model_repo else model.model_repo}}_throughput_report.csv``.
+
+            .. note::
+
+               Throughput is calculated as:
+
+               - .. math:: throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time
+
+               - .. math:: throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time
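+
+               For example, with hypothetical numbers — 100 requests, input
+               length 128, output length 128, completing in 60 seconds — the
+               two metrics work out to:
+
+               - .. math:: throughput\_tot = 100 \times (128 + 128) / 60 \approx 426.7 \; \mathsf{\text{tokens/s}}
+
+               - .. math:: throughput\_gen = 100 \times 128 / 60 \approx 213.3 \; \mathsf{\text{tokens/s}}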
+
+   {% endfor %}
+   {% endfor %}
+
+Further reading
+===============
+
+- To learn more about the options for the latency and throughput benchmark scripts,
+  see the scripts and documentation under ``scripts/sglang`` in the
+  `ROCm MAD repository <https://github.com/ROCm/MAD>`__.
+
+- To learn more about MAD and the ``madengine`` CLI, see the
+  `MAD usage guide <https://github.com/ROCm/MAD>`__.
+
+- To learn more about system settings and management practices to configure your system for
+  MI300X series accelerators, see `AMD Instinct MI300X system optimization
+  <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html>`__.
+
+- For application performance optimization strategies for HPC and AI workloads,
+  including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.
+
+- To learn how to run community models from Hugging Face on AMD GPUs, see
+  :doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.
+
+- To learn how to fine-tune LLMs and optimize inference, see
+  :doc:`Fine-tuning LLMs and inference optimization </how-to/rocm-for-ai/fine-tuning/index>`.
+
+- For a list of other ready-made Docker images for AI with ROCm, see
+  `AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html>`_.
+
+Previous versions
+=================
+
+See :doc:`previous-versions/sglang-history` to find documentation for previous releases
+of SGLang inference performance testing.
diff --git a/docs/how-to/rocm-for-ai/inference/index.rst b/docs/how-to/rocm-for-ai/inference/index.rst
index c0bd22de1..3c211882b 100644
--- a/docs/how-to/rocm-for-ai/inference/index.rst
+++ b/docs/how-to/rocm-for-ai/inference/index.rst
@@ -24,4 +24,6 @@ training, fine-tuning, and inference. It leverages popular machine learning fram
 
 - :doc:`PyTorch inference performance testing <benchmark-docker/pytorch-inference>`
 
+- :doc:`SGLang inference performance testing <benchmark-docker/sglang>`
+
 - :doc:`Deploying your model <deploy-your-model>`
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 5c0a18fad..8560f0c68 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -82,6 +82,8 @@ subtrees:
         - file: how-to/rocm-for-ai/inference/benchmark-docker/vllm-benchmark
           title: vLLM inference performance testing
         - file: how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference.rst
          title: PyTorch inference performance testing
+        - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
+          title: SGLang inference performance testing
         - file: how-to/rocm-for-ai/inference/deploy-your-model.rst
           title: Deploy your model