diff --git a/.wordlist.txt b/.wordlist.txt index c541a2e84..a990f031d 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -34,6 +34,7 @@ Autocast BARs BLAS BMC +BabelStream Blit Blockwise Bluefield @@ -138,6 +139,7 @@ GDR GDS GEMM GEMMs +GFLOPS GFortran GFXIP Gemma @@ -641,6 +643,7 @@ hipSPARSELt hipTensor hipamd hipblas +hipcc hipcub hipfft hipfort diff --git a/docs/conf.py b/docs/conf.py index c4530a95c..ed126dc05 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -51,6 +51,8 @@ article_pages = [ {"file": "how-to/deep-learning-rocm", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/index", "os": ["linux"]}, + {"file": "how-to/rocm-for-ai/install", "os": ["linux"]}, + {"file": "how-to/rocm-for-ai/system-health-check", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/train-a-model", "os": ["linux"]}, @@ -67,7 +69,6 @@ article_pages = [ {"file": "how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/index", "os": ["linux"]}, - {"file": "how-to/rocm-for-ai/inference/install", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/hugging-face-models", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/llm-inference-frameworks", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/vllm-benchmark", "os": ["linux"]}, diff --git a/docs/how-to/rocm-for-ai/inference/pytorch-inference-benchmark.rst b/docs/how-to/rocm-for-ai/inference/pytorch-inference-benchmark.rst index 66e429a8a..7ed9bb005 100644 --- a/docs/how-to/rocm-for-ai/inference/pytorch-inference-benchmark.rst +++ b/docs/how-to/rocm-for-ai/inference/pytorch-inference-benchmark.rst @@ -62,47 +62,52 @@ PyTorch inference performance testing {% endfor %} {% endfor %} - Getting started - =============== + System validation + ================= - Use the following procedures to reproduce the benchmark results on an - MI300X series accelerator with the prebuilt PyTorch Docker 
image. + Before running AI workloads, it's important to validate that your AMD hardware is configured + correctly and performing optimally. - .. _pytorch-benchmark-get-started: + To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU + might hang until the periodic balancing is finalized. For more information, + see the :ref:`system validation steps `. - 1. Disable NUMA auto-balancing. + .. code-block:: shell - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see :ref:`AMD Instinct MI300X system optimization `. + # disable automatic NUMA balancing + sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' + # check if NUMA balancing is disabled (returns 0 if disabled) + cat /proc/sys/kernel/numa_balancing + 0 - .. code-block:: shell + To test for optimal performance, consult the recommended :ref:`System health benchmarks + `. This suite of tests will help you verify and fine-tune your + system's configuration. - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 + Pull the Docker image + ===================== .. container:: model-doc pyt_chai1_inference - 2. Use the following command to pull the `ROCm PyTorch Docker image `_ from Docker Hub. + Use the following command to pull the `ROCm PyTorch Docker image `_ from Docker Hub. - .. code-block:: shell + .. code-block:: shell - docker pull rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0_triton_llvm_reg_issue + docker pull rocm/pytorch:rocm6.2.3_ubuntu22.04_py3.10_pytorch_release_2.3.0_triton_llvm_reg_issue - .. note:: + .. note:: - The Chai-1 benchmark uses a specifically selected Docker image using ROCm 6.2.3 and PyTorch 2.3.0 to address an accuracy issue. 
+ The Chai-1 benchmark uses a specifically selected Docker image using ROCm 6.2.3 and PyTorch 2.3.0 to address an accuracy issue. .. container:: model-doc pyt_clip_inference - 2. Use the following command to pull the `ROCm PyTorch Docker image `_ from Docker Hub. + Use the following command to pull the `ROCm PyTorch Docker image `_ from Docker Hub. - .. code-block:: shell + .. code-block:: shell - docker pull rocm/pytorch:latest + docker pull rocm/pytorch:latest + + .. _pytorch-benchmark-get-started: Benchmarking ============ diff --git a/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst b/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst index 8d530778f..d080e749d 100644 --- a/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst +++ b/docs/how-to/rocm-for-ai/inference/vllm-benchmark.rst @@ -111,35 +111,37 @@ vLLM inference performance testing For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer's guide at ``__. - Getting started - =============== + System validation + ================= - Use the following procedures to reproduce the benchmark results on an - MI300X accelerator with the prebuilt vLLM Docker image. + Before running AI workloads, it's important to validate that your AMD hardware is configured + correctly and performing optimally. - .. _vllm-benchmark-get-started: + To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU + might hang until the periodic balancing is finalized. For more information, + see the :ref:`system validation steps `. - 1. Disable NUMA auto-balancing. + .. code-block:: shell - To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU - might hang until the periodic balancing is finalized. For more information, - see :ref:`AMD Instinct MI300X system optimization `. 
+ # disable automatic NUMA balancing + sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' + # check if NUMA balancing is disabled (returns 0 if disabled) + cat /proc/sys/kernel/numa_balancing + 0 - .. code-block:: shell + To test for optimal performance, consult the recommended :ref:`System health benchmarks + `. This suite of tests will help you verify and fine-tune your + system's configuration. - # disable automatic NUMA balancing - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - # check if NUMA balancing is disabled (returns 0 if disabled) - cat /proc/sys/kernel/numa_balancing - 0 + Pull the Docker image + ===================== - 2. Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. + Download the `ROCm vLLM Docker image <{{ unified_docker.docker_hub_url }}>`_. + Use the following command to pull the Docker image from Docker Hub. - Use the following command to pull the Docker image from Docker Hub. + .. code-block:: shell - .. code-block:: shell - - docker pull {{ unified_docker.pull_tag }} + docker pull {{ unified_docker.pull_tag }} Benchmarking ============ diff --git a/docs/how-to/rocm-for-ai/inference/install.rst b/docs/how-to/rocm-for-ai/install.rst similarity index 91% rename from docs/how-to/rocm-for-ai/inference/install.rst rename to docs/how-to/rocm-for-ai/install.rst index fb3cba206..ad6ff690c 100644 --- a/docs/how-to/rocm-for-ai/inference/install.rst +++ b/docs/how-to/rocm-for-ai/install.rst @@ -30,7 +30,7 @@ ROCm supports multiple :doc:`installation methods ` -* :ref:`Multi-version installation `. +* :ref:`Multi-version installation ` .. grid:: 1 @@ -59,4 +59,8 @@ images with the framework pre-installed. * :doc:`JAX for ROCm ` -The sections that follow in :doc:`Training a model <../training/train-a-model>` are geared for a ROCm with PyTorch installation. 
+Next steps +========== + +After installing ROCm and your desired ML libraries -- and before running AI workloads -- conduct system health benchmarks +to test the optimal performance of your AMD hardware. See :doc:`system-health-check` to get started. diff --git a/docs/how-to/rocm-for-ai/system-health-check.rst b/docs/how-to/rocm-for-ai/system-health-check.rst new file mode 100644 index 000000000..5e239e815 --- /dev/null +++ b/docs/how-to/rocm-for-ai/system-health-check.rst @@ -0,0 +1,104 @@ +.. meta:: + :description: System health checks with RVS, RCCL tests, BabelStream, and TransferBench to validate AMD hardware performance running AI workloads. + :keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training, inference + +.. _rocm-for-ai-system-health-bench: + +************************ +System health benchmarks +************************ + +Before running AI workloads, it is important to validate that your AMD hardware is configured correctly and is performing optimally. This topic outlines several system health benchmarks you can use to test key aspects like GPU compute capabilities (FLOPS), memory bandwidth, and interconnect performance. Many of these tests are part of the ROCm Validation Suite (RVS). + +ROCm Validation Suite (RVS) tests +================================= + +RVS provides a collection of tests, benchmarks, and qualification tools, each +targeting a specific subsystem of the system under test. It includes tests for +GPU stress and memory bandwidth. + +.. _healthcheck-install-rvs: + +Install ROCm Validation Suite +----------------------------- + +To get started, install RVS. For example, on an Ubuntu system with ROCm already +installed, run the following command: + +.. 
code-block:: shell + + sudo apt update + sudo apt install rocm-validation-suite + +See the `ROCm Validation Suite installation instructions `_, +and `System validation tests `_ +in the Instinct documentation for more detailed instructions. + +Benchmark, stress, and qualification tests +------------------------------------------ + +The GPU stress test runs various GEMM computations as workloads to stress the GPU FLOPS performance and check whether it +meets the configured target GFLOPS. + +Run the benchmark, stress, and qualification tests included with RVS. See the `Benchmark, stress, qualification +`_ +section of the Instinct documentation for usage instructions. + +BabelStream test +---------------- + +BabelStream is a synthetic GPU benchmark based on the STREAM benchmark for +CPUs, measuring memory transfer rates to and from global device memory. +BabelStream tests are included with the RVS package as part of the `BABEL module +`_. + +For more information, see `Performance benchmarking +`_ +in the Instinct documentation. + +RCCL tests +========== + +The ROCm Communication Collectives Library (RCCL) enables efficient multi-GPU +communication. The ``__ suite benchmarks +the performance and verifies the correctness of these collective operations. +This helps ensure optimal scaling for multi-accelerator tasks. + +1. To get started, build RCCL-tests using the official instructions in the README at + ``__ or use the + following commands: + + .. code-block:: shell + + git clone https://github.com/ROCm/rccl-tests.git + cd rccl-tests + make + +2. Run the suggested RCCL tests -- see `RCCL benchmarking + `_ + in the Instinct performance benchmarking documentation for instructions. + +TransferBench test +================== + +TransferBench is a standalone utility for benchmarking simultaneous data +transfer performance between various devices in the system, including +CPU-to-GPU and GPU-to-GPU (peer-to-peer). 
This helps identify potential +bottlenecks in data movement between the host system and the GPUs, or between +GPUs, which can impact end-to-end latency. + +.. _healthcheck-install-transferbench: + +1. To get started, use the instructions in the `TransferBench documentation + `_ + or use the following commands: + + .. code:: shell + + git clone https://github.com/ROCm/TransferBench.git + cd TransferBench + CC=hipcc make + +2. Run the suggested TransferBench tests -- see `TransferBench benchmarking + `_ + in the Instinct performance benchmarking documentation for instructions. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst index 801025141..4338c7c2c 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst @@ -79,11 +79,18 @@ across different input sequences. Support for packed input format is planned for System validation ================= -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system +Before running AI workloads, it's important to validate that your AMD hardware is configured +correctly and performing optimally. + +If you have already validated your system settings, including aspects like NUMA auto-balancing, you +can skip this step. Otherwise, complete the procedures in the :ref:`System validation and +optimization ` guide to properly configure your system settings before starting training. +To test for optimal performance, consult the recommended :ref:`System health benchmarks +`. This suite of tests will help you verify and fine-tune your +system's configuration. + Environment setup ================= @@ -175,8 +182,8 @@ with RDMA, skip ahead to :ref:`amd-maxtext-download-docker`. .. 
_amd-maxtext-download-docker: -Download the Docker image -------------------------- +Pull the Docker image +--------------------- 1. Use the following command to pull the Docker image from Docker Hub. diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst index eacd63ad6..7e082559e 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst @@ -103,11 +103,18 @@ popular AI models. System validation ================= -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system +Before running AI workloads, it's important to validate that your AMD hardware is configured +correctly and performing optimally. + +If you have already validated your system settings, including aspects like NUMA auto-balancing, you +can skip this step. Otherwise, complete the procedures in the :ref:`System validation and +optimization ` guide to properly configure your system settings before starting training. +To test for optimal performance, consult the recommended :ref:`System health benchmarks +`. This suite of tests will help you verify and fine-tune your +system's configuration. + .. _mi300x-amd-megatron-lm-training: Environment setup diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst index 05b23672c..2e9efdcf5 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst @@ -34,11 +34,18 @@ for MPT-30B with access to detailed logs and performance metrics. 
System validation ================= -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system +Before running AI workloads, it's important to validate that your AMD hardware is configured +correctly and performing optimally. + +If you have already validated your system settings, including aspects like NUMA auto-balancing, you +can skip this step. Otherwise, complete the procedures in the :ref:`System validation and +optimization ` guide to properly configure your system settings before starting training. +To test for optimal performance, consult the recommended :ref:`System health benchmarks +`. This suite of tests will help you verify and fine-tune your +system's configuration. + Getting started =============== diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst index 3e251782e..a8e9f0fc3 100644 --- a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst +++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst @@ -77,11 +77,18 @@ popular AI models. System validation ================= -If you have already validated your system settings, including NUMA -auto-balancing, skip this step. Otherwise, complete the :ref:`system validation -and optimization steps ` to set up your system +Before running AI workloads, it's important to validate that your AMD hardware is configured +correctly and performing optimally. + +If you have already validated your system settings, including aspects like NUMA auto-balancing, you +can skip this step. Otherwise, complete the procedures in the :ref:`System validation and +optimization ` guide to properly configure your system settings before starting training. +To test for optimal performance, consult the recommended :ref:`System health benchmarks +`. 
This suite of tests will help you verify and fine-tune your +system's configuration. + This Docker image is optimized for specific model configurations outlined below. Performance can vary for other training workloads, as AMD doesn’t validate configurations and run conditions outside those described. diff --git a/docs/how-to/rocm-for-ai/training/index.rst b/docs/how-to/rocm-for-ai/training/index.rst index 67916a19a..13213c2e9 100644 --- a/docs/how-to/rocm-for-ai/training/index.rst +++ b/docs/how-to/rocm-for-ai/training/index.rst @@ -21,8 +21,12 @@ In this guide, you'll learn about: - Training a model - - :doc:`Train a model with Megatron-LM ` + - :doc:`With Megatron-LM ` - - :doc:`Train a model with PyTorch ` + - :doc:`With PyTorch ` + + - :doc:`With JAX MaxText ` + + - :doc:`With LLM Foundry ` - :doc:`Scaling model training ` diff --git a/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst b/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst index ec2591114..68ce4e493 100644 --- a/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst +++ b/docs/how-to/rocm-for-ai/training/prerequisite-system-validation.rst @@ -5,12 +5,13 @@ :keywords: ROCm, AI, LLM, train, megatron, Llama, tutorial, docker, torch, pytorch, jax .. _train-a-model-system-validation: +.. _rocm-for-ai-system-optimization: -********************************************** -Prerequisite system validation before training -********************************************** +********************************************************** +Prerequisite system validation before running AI workloads +********************************************************** -Complete the following system validation and optimization steps to set up your system before starting training. +Complete the following system validation and optimization steps to set up your system before starting training and inference. 
Disable NUMA auto-balancing --------------------------- @@ -26,7 +27,8 @@ the output is ``1``, run the following command to disable NUMA auto-balancing. sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' -See :ref:`mi300x-disable-numa` for more information. +See `Disable NUMA auto-balancing `_ +in the Instinct documentation for more information. Hardware verification with ROCm ------------------------------- @@ -42,7 +44,8 @@ Run the command: rocm-smi --setperfdeterminism 1900 -See :ref:`mi300x-hardware-verification-with-rocm` for more information. +See `Hardware verification for ROCm `_ +in the Instinct documentation for more information. RCCL Bandwidth Test for multi-node setups ----------------------------------------- diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 51961f965..34804835e 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -36,6 +36,10 @@ subtrees: title: Use ROCm for AI subtrees: - entries: + - file: how-to/rocm-for-ai/install.rst + title: Installation + - file: how-to/rocm-for-ai/system-health-check.rst + title: System health benchmarks - file: how-to/rocm-for-ai/training/index.rst title: Training subtrees: @@ -70,8 +74,6 @@ subtrees: title: Inference subtrees: - entries: - - file: how-to/rocm-for-ai/inference/install.rst - title: Installation - file: how-to/rocm-for-ai/inference/hugging-face-models.rst title: Run models from Hugging Face - file: how-to/rocm-for-ai/inference/llm-inference-frameworks.rst
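
The NUMA auto-balancing check that the validation pages in this change describe could be scripted along these lines. This is a minimal sketch and not part of the patch: the helper name `numa_balancing_state` and its optional path argument are hypothetical, and any non-zero value is treated as enabled. Only the `/proc/sys/kernel/numa_balancing` path comes from the documentation itself.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): report the NUMA auto-balancing
# state stored in a file. The default path is the one used in the docs;
# the optional argument exists so the function can be tried on a test file.
numa_balancing_state() {
    file="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$file" ]; then
        # The file is missing or unreadable (for example, inside a container).
        echo "unknown"
        return 2
    fi
    value="$(cat "$file")"
    case "$value" in
        0) echo "disabled"; return 0 ;;
        # Assumption: every non-zero value counts as enabled.
        *) echo "enabled"; return 1 ;;
    esac
}

# Usage: print the current state without changing any setting.
state="$(numa_balancing_state)" || true
echo "NUMA auto-balancing: $state"
```

If the state is reported as enabled, the documented `sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'` command turns it off.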