From 824d760646f6d893e2c8aa1da5c7673b4e213573 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Tue, 15 Apr 2025 18:16:22 -0400
Subject: [PATCH] Update PyTorch training Docker doc for 25.5 (#4638)

* update pytorch-training to 25.5
* remove llama 2
* Revert "remove llama 2"

  This reverts commit dab672fa7bcbd8bff730382c14177df4301a537d.

* add previous version
* fix run cmd
* add link to docker hub
* fix linting issue
* add Llama 3.3 70B
* update

(cherry picked from commit 9ff3c2c885faf74650b3bb77aaca5b82f602aa92)
---
 .wordlist.txt                                 |   1 +
 .../benchmark-docker/pytorch-training.rst     | 455 ++++++++++--------
 2 files changed, 252 insertions(+), 204 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index 518ef820e..ae120d2e9 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -76,6 +76,7 @@ Concretized
 Conda
 ConnectX
 CuPy
+da
 Dashboarding
 DBRX
 DDR
diff --git a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
index 896e63df9..6d5338745 100644
--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
@@ -9,7 +9,8 @@ Training a model with PyTorch for ROCm
 PyTorch is an open-source machine learning framework that is widely used for
 model training with GPU-optimized components for transformer-based models.
 
-The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.4``) image
+The `PyTorch for ROCm training Docker <https://hub.docker.com/r/rocm/pytorch-training>`_
+(``rocm/pytorch-training:v25.5``) image
 provides a prebuilt optimized environment for fine-tuning and pretraining a
 model on AMD Instinct MI325X and MI300X accelerators. It includes the following
 software components to accelerate training workloads:
@@ -17,19 +18,19 @@
 +--------------------------+--------------------------------+
 | Software component       | Version                        |
 +==========================+================================+
-| ROCm                     | 6.3.0                          |
+| ROCm                     | 6.3.4                          |
 +--------------------------+--------------------------------+
 | PyTorch                  | 2.7.0a0+git637433              |
 +--------------------------+--------------------------------+
 | Python                   | 3.10                           |
 +--------------------------+--------------------------------+
-| Transformer Engine       | 1.11                           |
+| Transformer Engine       | 1.12.0.dev0+25a33da            |
 +--------------------------+--------------------------------+
 | Flash Attention          | 3.0.0                          |
 +--------------------------+--------------------------------+
-| hipBLASLt                | git258a2162                    |
+| hipBLASLt                | git53b53bf                     |
 +--------------------------+--------------------------------+
-| Triton                   | 3.1                            |
+| Triton                   | 3.2.0                          |
 +--------------------------+--------------------------------+
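+
+To spot-check that a pulled image matches this table, you can query the
+versions from inside a running container. This is an optional sanity check,
+not part of the benchmarking flow; it assumes the image's default Python
+environment and the standard ROCm install location.
+
+.. code-block:: shell
+
+   # PyTorch and Triton builds bundled in the image
+   python3 -c "import torch; print(torch.__version__)"
+   python3 -c "import triton; print(triton.__version__)"
+   # ROCm release the image was built against
+   cat /opt/rocm/.info/version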
 
 .. _amd-pytorch-training-model-support:
 
@@ -39,6 +40,8 @@ Supported models
 ================
 
 The following models are pre-optimized for performance on the AMD Instinct
 MI325X and MI300X accelerators.
 
+* Llama 3.3 70B
+
 * Llama 3.1 8B
 
 * Llama 3.1 70B
@@ -79,309 +82,346 @@ auto-balancing, skip this step.
 Otherwise, complete the
 :ref:`system validation and optimization steps ` to set up your system
 before starting training.
 
-Environment setup
-=================
-
 This Docker image is optimized for specific model configurations outlined
 below. Performance can vary for other training workloads, as AMD doesn’t
 validate configurations and run conditions outside those described.
 
-Download the Docker image
--------------------------
+Benchmarking
+============
 
-1. Use the following command to pull the Docker image from Docker Hub.
+Once the setup is complete, choose between two options to start benchmarking:
 
-   .. code-block:: shell
+.. tab-set::
 
-      docker pull rocm/pytorch-training:v25.4
+   .. tab-item:: MAD-integrated benchmarking
 
-2. Run the Docker container.
+      Clone the ROCm Model Automation and Dashboarding
+      (`MAD <https://github.com/ROCm/MAD>`__) repository to a local
+      directory and install the required packages on the host machine.
 
-   .. code-block:: shell
+      .. code-block:: shell
 
-      docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.4
+         git clone https://github.com/ROCm/MAD
+         cd MAD
+         pip install -r requirements.txt
 
-3. Use these commands if you exit the ``training_env`` container and need to return to it.
+      For example, use this command to run the performance benchmark test on
+      the Llama 3.1 8B model using one GPU with the float16 data type on the
+      host machine.
 
-   .. code-block:: shell
+      .. code-block:: shell
 
-      docker start training_env
-      docker exec -it training_env bash
+         export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
+         python3 tools/run_models.py --tags pyt_train_llama-3.1-8b --keep-model-dir --live-output --timeout 28800
 
-4. In the Docker container, clone the `MAD <https://github.com/ROCm/MAD>`__
-   repository and navigate to the benchmark scripts directory
-   ``/workspace/MAD/scripts/pytorch_train``.
+      The available models for MAD-integrated benchmarking are:
 
-   .. code-block:: shell
+      * ``pyt_train_llama-3.3-70b``
 
-      git clone https://github.com/ROCm/MAD
-      cd MAD/scripts/pytorch_train
+      * ``pyt_train_llama-3.1-8b``
 
-Prepare training datasets and dependencies
-------------------------------------------
+      * ``pyt_train_llama-3.1-70b``
 
-The following benchmarking examples require downloading models and datasets
-from Hugging Face. To ensure successful access to gated repos, set your
-``HF_TOKEN``.
+      * ``pyt_train_flux``
 
+      MAD launches a Docker container with the name
+      ``container_ci-pyt_train_llama-3.1-8b``, for example. The latency and
+      throughput reports of the model are collected in ``~/MAD/perf.csv``.
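+
+      ``perf.csv`` is a plain CSV file, so you can inspect the collected
+      numbers directly on the host once a run finishes. A minimal sketch
+      (``column`` is a standard util-linux tool; any CSV viewer works):
+
+      .. code-block:: shell
+
+         column -s, -t ~/MAD/perf.csv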
 
-.. code-block:: shell
+   .. tab-item:: Standalone benchmarking
 
-   export HF_TOKEN=$your_personal_hugging_face_access_token
+      .. rubric:: Download the Docker image and required packages
 
-Run the setup script to install libraries and datasets needed for benchmarking.
+      Use the following command to pull the Docker image from Docker Hub.
 
-.. code-block:: shell
+      .. code-block:: shell
 
-   ./pytorch_benchmark_setup.sh
+         docker pull rocm/pytorch-training:v25.5
 
-``pytorch_benchmark_setup.sh`` installs the following libraries:
+      Run the Docker container.
 
-.. list-table::
-   :header-rows: 1
+      .. code-block:: shell
 
-   * - Library
-     - Benchmark model
-     - Reference
+         docker run -it --device /dev/dri --device /dev/kfd --network host --ipc host --group-add video --cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged -v $HOME:$HOME -v $HOME/.ssh:/root/.ssh --shm-size 64G --name training_env rocm/pytorch-training:v25.5
 
-   * - ``accelerate``
-     - Llama 3.1 8B, FLUX
-     - `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
+      Use these commands if you exit the ``training_env`` container and need
+      to return to it.
 
-   * - ``datasets``
-     - Llama 3.1 8B, 70B, FLUX
-     - `Hugging Face Datasets <https://huggingface.co/docs/datasets/en/index>`_ 3.2.0
+      .. code-block:: shell
 
-   * - ``torchdata``
-     - Llama 3.1 70B
-     - `TorchData <https://pytorch.org/data/>`_
+         docker start training_env
+         docker exec -it training_env bash
 
-   * - ``tomli``
-     - Llama 3.1 70B
-     - `Tomli <https://pypi.org/project/tomli/>`_
+      In the Docker container, clone the `MAD <https://github.com/ROCm/MAD>`__
+      repository and navigate to the benchmark scripts directory
+      ``/workspace/MAD/scripts/pytorch_train``.
 
-   * - ``tiktoken``
-     - Llama 3.1 70B
-     - `tiktoken <https://github.com/openai/tiktoken>`_
+      .. code-block:: shell
 
-   * - ``blobfile``
-     - Llama 3.1 70B
-     - `blobfile <https://pypi.org/project/blobfile/>`_
+         git clone https://github.com/ROCm/MAD
+         cd MAD/scripts/pytorch_train
 
-   * - ``tabulate``
-     - Llama 3.1 70B
-     - `tabulate <https://pypi.org/project/tabulate/>`_
+      .. rubric:: Prepare training datasets and dependencies
 
-   * - ``wandb``
-     - Llama 3.1 70B
-     - `Weights & Biases <https://wandb.ai/site>`_
+      The following benchmarking examples require downloading models and
+      datasets from Hugging Face. To ensure successful access to gated repos,
+      set your ``HF_TOKEN``.
 
-   * - ``sentencepiece``
-     - Llama 3.1 70B, FLUX
-     - `SentencePiece <https://github.com/google/sentencepiece>`_ 0.2.0
+      .. code-block:: shell
 
-   * - ``tensorboard``
-     - Llama 3.1 70 B, FLUX
-     - `TensorBoard <https://www.tensorflow.org/tensorboard>`_ 2.18.0
+         export HF_TOKEN=$your_personal_hugging_face_access_token
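+
+      If you want to confirm that the token grants access before starting a
+      long run, a quick identity check does the trick. This is an optional
+      step and assumes the ``huggingface_hub`` CLI is available in the
+      container:
+
+      .. code-block:: shell
+
+         huggingface-cli whoami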
 
-   * - ``csvkit``
-     - FLUX
-     - `csvkit <https://csvkit.readthedocs.io/>`_ 2.0.1
+      Run the setup script to install libraries and datasets needed for
+      benchmarking.
 
-   * - ``deepspeed``
-     - FLUX
-     - `DeepSpeed <https://www.deepspeed.ai/>`_ 0.16.2
+      .. code-block:: shell
 
-   * - ``diffusers``
-     - FLUX
-     - `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`_ 0.31.0
+         ./pytorch_benchmark_setup.sh
 
-   * - ``GitPython``
-     - FLUX
-     - `GitPython <https://github.com/gitpython-developers/GitPython>`_ 3.1.44
+      ``pytorch_benchmark_setup.sh`` installs the following libraries:
 
-   * - ``opencv-python-headless``
-     - FLUX
-     - `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`_ 4.10.0.84
+      .. list-table::
+         :header-rows: 1
 
-   * - ``peft``
-     - FLUX
-     - `PEFT <https://huggingface.co/docs/peft/en/index>`_ 0.14.0
+         * - Library
+           - Benchmark model
+           - Reference
 
-   * - ``protobuf``
-     - FLUX
-     - `Protocol Buffers <https://protobuf.dev/>`_ 5.29.2
+         * - ``accelerate``
+           - Llama 3.1 8B, FLUX
+           - `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
 
-   * - ``pytest``
-     - FLUX
-     - `PyTest <https://docs.pytest.org/>`_ 8.3.4
+         * - ``datasets``
+           - Llama 3.1 8B, 70B, FLUX
+           - `Hugging Face Datasets <https://huggingface.co/docs/datasets/en/index>`_ 3.2.0
 
-   * - ``python-dotenv``
-     - FLUX
-     - `python-dotenv <https://pypi.org/project/python-dotenv/>`_ 1.0.1
+         * - ``torchdata``
+           - Llama 3.1 70B
+           - `TorchData <https://pytorch.org/data/>`_
 
-   * - ``seaborn``
-     - FLUX
-     - `Seaborn <https://seaborn.pydata.org/>`_ 0.13.2
+         * - ``tomli``
+           - Llama 3.1 70B
+           - `Tomli <https://pypi.org/project/tomli/>`_
 
-   * - ``transformers``
-     - FLUX
-     - `Transformers <https://huggingface.co/docs/transformers/en/index>`_ 4.47.0
+         * - ``tiktoken``
+           - Llama 3.1 70B
+           - `tiktoken <https://github.com/openai/tiktoken>`_
 
-``pytorch_benchmark_setup.sh`` downloads the following models from Hugging Face:
+         * - ``blobfile``
+           - Llama 3.1 70B
+           - `blobfile <https://pypi.org/project/blobfile/>`_
 
-* `meta-llama/Llama-3.1-70B-Instruct <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
+         * - ``tabulate``
+           - Llama 3.1 70B
+           - `tabulate <https://pypi.org/project/tabulate/>`_
 
-* `black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
+         * - ``wandb``
+           - Llama 3.1 70B
+           - `Weights & Biases <https://wandb.ai/site>`_
 
-Along with the following datasets:
+         * - ``sentencepiece``
+           - Llama 3.1 70B, FLUX
+           - `SentencePiece <https://github.com/google/sentencepiece>`_ 0.2.0
 
-* `WikiText <https://huggingface.co/datasets/Salesforce/wikitext>`_
+         * - ``tensorboard``
+           - Llama 3.1 70B, FLUX
+           - `TensorBoard <https://www.tensorflow.org/tensorboard>`_ 2.18.0
 
-* `UltraChat 200k <https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k>`_
+         * - ``csvkit``
+           - FLUX
+           - `csvkit <https://csvkit.readthedocs.io/>`_ 2.0.1
 
-* `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`_
+         * - ``deepspeed``
+           - FLUX
+           - `DeepSpeed <https://www.deepspeed.ai/>`_ 0.16.2
 
-Getting started
-===============
+         * - ``diffusers``
+           - FLUX
+           - `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`_ 0.31.0
 
-The prebuilt PyTorch with ROCm training environment allows users to quickly validate
-system performance, conduct training benchmarks, and achieve superior
-performance for models like Llama 3.1 and Llama 2. This container should not be
-expected to provide generalized performance across all training workloads. You
-can expect the container to perform in the model configurations described in
-the following section, but other configurations are not validated by AMD.
+         * - ``GitPython``
+           - FLUX
+           - `GitPython <https://github.com/gitpython-developers/GitPython>`_ 3.1.44
 
-Use the following instructions to set up the environment, configure the script
-to train models, and reproduce the benchmark results on MI325X and MI300X
-accelerators with the AMD PyTorch training Docker image.
+         * - ``opencv-python-headless``
+           - FLUX
+           - `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`_ 4.10.0.84
 
-Once your environment is set up, use the following commands and examples to start benchmarking.
+         * - ``peft``
+           - FLUX
+           - `PEFT <https://huggingface.co/docs/peft/en/index>`_ 0.14.0
 
-Pretraining
------------
+         * - ``protobuf``
+           - FLUX
+           - `Protocol Buffers <https://protobuf.dev/>`_ 5.29.2
 
-To start the pretraining benchmark, use the following command with the
-appropriate options. See the following list of options and their descriptions.
+         * - ``pytest``
+           - FLUX
+           - `PyTest <https://docs.pytest.org/>`_ 8.3.4
 
-.. code-block:: shell
+         * - ``python-dotenv``
+           - FLUX
+           - `python-dotenv <https://pypi.org/project/python-dotenv/>`_ 1.0.1
 
-   ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length
+         * - ``seaborn``
+           - FLUX
+           - `Seaborn <https://seaborn.pydata.org/>`_ 0.13.2
 
-Options and available models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+         * - ``transformers``
+           - FLUX
+           - `Transformers <https://huggingface.co/docs/transformers/en/index>`_ 4.47.0
 
-.. list-table::
-   :header-rows: 1
+      ``pytorch_benchmark_setup.sh`` downloads the following models from
+      Hugging Face:
 
-   * - Name
-     - Options
-     - Description
+      * `meta-llama/Llama-3.1-70B-Instruct <https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct>`_
 
-   * - ``$training_mode``
-     - ``pretrain``
-     - Benchmark pretraining
+      * `black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
 
-   * -
-     - ``finetune_fw``
-     - Benchmark full weight fine-tuning (Llama 3.1 70B with BF16)
+      It also downloads the following datasets:
 
-   * -
-     - ``finetune_lora``
-     - Benchmark LoRA fine-tuning (Llama 3.1 70B with BF16)
+      * `WikiText <https://huggingface.co/datasets/Salesforce/wikitext>`_
 
-   * -
-     - ``HF_finetune_lora``
-     - Benchmark LoRA fine-tuning with Hugging Face PEFT (Llama 2 70B with BF16)
+      * `UltraChat 200k <https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k>`_
 
-   * - ``$datatype``
-     - ``FP8`` or ``BF16``
-     - Only Llama 3.1 8B supports FP8 precision.
+      * `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`_
 
-   * - ``$model_repo``
-     - ``Llama-3.1-8B``
-     - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
+      .. rubric:: Pretraining
 
-   * -
-     - ``Llama-3.1-70B``
-     - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B>`_
+      To start the pretraining benchmark, use the following command with the
+      appropriate options. See the following list of options and their
+      descriptions.
 
-   * -
-     - ``Llama-2-70B``
-     - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-hf>`_
+      .. code-block:: shell
 
-   * -
-     - ``Flux``
-     - `FLUX.1 [dev] <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
+         ./pytorch_benchmark_report.sh -t $training_mode -m $model_repo -p $datatype -s $sequence_length
 
-   * - ``$sequence_length``
-     - Sequence length for the language model.
-     - Between 2048 and 8192. 8192 by default.
+      .. list-table::
+         :header-rows: 1
 
-.. note::
+         * - Name
+           - Options
+           - Description
 
-   Occasionally, downloading the Flux dataset might fail. In the event of this
-   error, manually download it from Hugging Face at
-   `black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
-   and save it to `/workspace/FluxBenchmark`. This ensures that the test script can access
-   the required dataset.
+         * - ``$training_mode``
+           - ``pretrain``
+           - Benchmark pretraining
 
-Fine-tuning
------------
+         * -
+           - ``finetune_fw``
+           - Benchmark full weight fine-tuning (Llama 3.1 or 3.3 70B with BF16)
+
+         * -
+           - ``finetune_lora``
+           - Benchmark LoRA fine-tuning (Llama 3.1 or 3.3 70B with BF16)
+
+         * -
+           - ``finetune_qlora``
+           - Benchmark QLoRA fine-tuning (Llama 3.3 70B with BF16)
+
+         * -
+           - ``HF_finetune_lora``
+           - Benchmark LoRA fine-tuning with Hugging Face PEFT (Llama 2 70B with BF16)
+
+         * - ``$datatype``
+           - ``FP8`` or ``BF16``
+           - Only Llama 3.1 8B supports FP8 precision.
+
+         * - ``$model_repo``
+           - ``Llama-3.3-70B``
+           - `Llama 3.3 70B <https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct>`_
+
+         * -
+           - ``Llama-3.1-8B``
+           - `Llama 3.1 8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
+
+         * -
+           - ``Llama-3.1-70B``
+           - `Llama 3.1 70B <https://huggingface.co/meta-llama/Llama-3.1-70B>`_
+
+         * -
+           - ``Llama-2-70B``
+           - `Llama 2 70B <https://huggingface.co/meta-llama/Llama-2-70b-hf>`_
+
+         * -
+           - ``Flux``
+           - `FLUX.1 [dev] <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
+
+         * - ``$sequence_length``
+           - Between 2048 and 8192
+           - Sequence length for the language model. The default is 8192.
+
+      .. note::
+
+         Occasionally, downloading the Flux dataset might fail. In the event
+         of this error, manually download it from Hugging Face at
+         `black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
+         and save it to ``/workspace/FluxBenchmark``. This ensures that the
+         test script can access the required dataset. A manual download
+         sketch follows this note.
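+
+      One way to fetch the dataset manually is with the ``huggingface_hub``
+      CLI. This is a sketch, not part of the validated flow; it assumes
+      ``huggingface-cli`` is available in the container and that ``HF_TOKEN``
+      is set, since the repository is gated.
+
+      .. code-block:: shell
+
+         # download the gated FLUX.1-dev repository to the path the test script expects
+         huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir /workspace/FluxBenchmark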
 
-To start the fine-tuning benchmark, use the following command. It will run the benchmarking example of Llama 3.1 70B
-with the WikiText dataset using the AMD fork of `torchtune `_.
+      .. rubric:: Fine-tuning
 
-.. code-block:: shell
+      To start the fine-tuning benchmark, use the following command. It runs
+      the benchmarking example of Llama 3.1 70B with the WikiText dataset
+      using the AMD fork of `torchtune `_.
 
-   ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B
+      .. code-block:: shell
 
-Use the following command to run the benchmarking example of Llama 2 70B with the UltraChat 200k dataset using
-`Hugging Face PEFT <https://huggingface.co/docs/peft/en/index>`_.
+         ./pytorch_benchmark_report.sh -t {finetune_fw, finetune_lora} -p BF16 -m Llama-3.1-70B
 
-.. code-block:: shell
+      Use the following command to run the benchmarking example of Llama 2
+      70B with the UltraChat 200k dataset using
+      `Hugging Face PEFT <https://huggingface.co/docs/peft/en/index>`_.
 
-   ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B
+      .. code-block:: shell
 
-Benchmarking examples
----------------------
-
-Here are some examples of how to use the command.
-
-* Example 1: Llama 3.1 70B with BF16 precision with `torchtitan <https://github.com/pytorch/torchtitan>`_.
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192
-
-* Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerator.
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-70B -s 8192
-
-* Example 3: FLUX.1-dev with BF16 precision with FluxBenchmark.
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux
-
-* Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B
-
-* Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B
-
-* Example 6: Hugging Face PEFT LoRA fine-tuning with Llama 2 70B
-
-  .. code-block:: shell
-
-     ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B
+         ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B
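+
+      While any of these benchmarks runs, you can watch accelerator
+      utilization from a second shell on the host to confirm the workload
+      landed on the GPUs. ``rocm-smi`` ships with ROCm; the two-second
+      refresh interval is arbitrary, and this assumes ``watch`` is available
+      in the image.
+
+      .. code-block:: shell
+
+         docker exec -it training_env watch -n 2 rocm-smi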
 
+      .. rubric:: Benchmarking examples
+
+      Here are some example commands to get started with pretraining and
+      fine-tuning for various model configurations.
+
+      * Example 1: Llama 3.1 70B with BF16 precision using `torchtitan <https://github.com/pytorch/torchtitan>`_.
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Llama-3.1-70B -s 8192
+
+      * Example 2: Llama 3.1 8B with FP8 precision using Transformer Engine (TE) and Hugging Face Accelerate.
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t pretrain -p FP8 -m Llama-3.1-8B -s 8192
+
+      * Example 3: FLUX.1-dev with BF16 precision using FluxBenchmark.
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t pretrain -p BF16 -m Flux
+
+      * Example 4: Torchtune full weight fine-tuning with Llama 3.1 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.1-70B
+
+      * Example 5: Torchtune LoRA fine-tuning with Llama 3.1 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.1-70B
+
+      * Example 6: Torchtune full weight fine-tuning with Llama 3.3 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t finetune_fw -p BF16 -m Llama-3.3-70B
+
+      * Example 7: Torchtune LoRA fine-tuning with Llama 3.3 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t finetune_lora -p BF16 -m Llama-3.3-70B
+
+      * Example 8: Torchtune QLoRA fine-tuning with Llama 3.3 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t finetune_qlora -p BF16 -m Llama-3.3-70B
+
+      * Example 9: Hugging Face PEFT LoRA fine-tuning with Llama 2 70B
+
+        .. code-block:: shell
+
+           ./pytorch_benchmark_report.sh -t HF_finetune_lora -p BF16 -m Llama-2-70B
 
 Previous versions
 =================
@@ -399,6 +439,13 @@ benchmarking, see the version-specific documentation.
     - PyTorch version
     - Resources
 
+  * - v25.4
+    - 6.3.0
+    - 2.7.0a0+git637433
+    -
+      * `Documentation `_
+      * `Docker Hub `_
+
   * - v25.3
     - 6.3.0
     - 2.7.0a0+git637433