Add MPT-30B + LLM Foundry doc (#4704)

* add mpt-30b doc
* add tunableop note
* update MPT doc
* add section
* update wordlist
* fix flash attention version
* update "applies to"
* address review feedback
* Update docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst

  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst

  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst

  Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* update docker details to pytorch-training-v25.5
* update

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2026-02-12 15:34:58 -05:00 · 2025-05-02 11:13:20 -05:00
parent a7c158a14f
commit d44ea40a0d
4 changed files with 175 additions and 0 deletions
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -57,6 +57,7 @@ article_pages = [
    {"file": "how-to/rocm-for-ai/training/prerequisite-system-validation", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/megatron-lm", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]},
    {"file": "how-to/rocm-for-ai/training/scale-model-training", "os": ["linux"]},

    {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
--- a/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst
+++ b/docs/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.rst
@@ -0,0 +1,168 @@
+.. meta::
+   :description: How to train a model using LLM Foundry for ROCm.
+   :keywords: ROCm, AI, LLM, train, PyTorch, torch, MPT, LLM Foundry, tutorial, docker
+
+******************************************
+Training MPT-30B with LLM Foundry and ROCm
+******************************************
+
+MPT-30B is a 30-billion-parameter, decoder-style transformer model from
+the Mosaic Pretrained Transformer (MPT) family. Learn more about it in
+MosaicML's research blog `MPT-30B: Raising the bar for open-source foundation
+models <https://www.databricks.com/blog/mpt-30b>`_.
+
+ROCm and `<https://github.com/ROCm/MAD>`__ provide a pre-configured training
+environment for the MPT-30B model using the ``rocm/pytorch-training:v25.5``
+base `Docker image <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`_
+and the `LLM Foundry <https://github.com/mosaicml/llm-foundry>`_ framework.
+This environment packages the following software components to train
+on AMD Instinct MI300X series accelerators:
+
++--------------------------+--------------------------------+
+| Software component       | Version                        |
++==========================+================================+
+| ROCm                     | 6.3.4                          |
++--------------------------+--------------------------------+
+| PyTorch                  | 2.7.0a0+git6374332             |
++--------------------------+--------------------------------+
+| Flash Attention          | 3.0.0.post1                    |
++--------------------------+--------------------------------+
+
+Using this image, you can build, run, and test the training process
+for MPT-30B with access to detailed logs and performance metrics.
+
+System validation
+=================
+
+If you have already validated your system settings, including NUMA
+auto-balancing, skip this step. Otherwise, complete the :ref:`system validation
+and optimization steps <train-a-model-system-validation>` to set up your system
+before starting training.
+
+Getting started
+===============
+
+The following procedures help you set up the training environment in a
+reproducible Docker container. This environment is tailored for training
+MPT-30B using LLM Foundry with the model configurations outlined here.
+Configurations and run conditions other than those described in this
+document are not validated.
+
+.. tab-set::
+
+   .. tab-item:: MAD-integrated benchmarking
+
+      On your host machine, clone the ROCm Model Automation and Dashboarding
+      (`<https://github.com/ROCm/MAD>`__) repository to a local directory and
+      install the required packages.
+
+      .. code-block:: shell
+
+         git clone https://github.com/ROCm/MAD
+         cd MAD
+         pip install -r requirements.txt
+
+      Use this command to initiate the MPT-30B training benchmark.
+
+      .. code-block:: shell
+
+         python3 tools/run_models.py --tags pyt_mpt30b_training --keep-model-dir --live-output --clean-docker-cache
+
+      .. tip::
+
+         If you experience data download failures, set the
+         ``MAD_SECRETS_HFTOKEN`` variable to your Hugging Face access token. See
+         `User access tokens <https://huggingface.co/docs/hub/security-tokens>`_
+         for details.
+
+         .. code-block:: shell
+
+            export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
+
+      .. note::
+
+         For improved performance (training throughput), consider enabling TunableOp.
+         By default, ``pyt_mpt30b_training`` runs with TunableOp disabled. To enable it,
+         run ``tools/run_models.py`` with the ``--tunableop on`` argument or edit the
+         ``models.json`` configuration before running training.
+
+         Although tuning might increase the initial training time, it can result in higher training throughput.
+
+   .. tab-item:: Standalone benchmarking
+
+      To set up the training environment, clone the
+      `<https://github.com/ROCm/MAD>`__ repo and build the Docker image. In
+      this snippet, the image is named ``mosaic_mpt30_image``.
+
+      .. code-block:: shell
+
+         git clone https://github.com/ROCm/MAD
+         cd MAD
+
+         docker build --build-arg MAD_SYSTEM_GPU_ARCHITECTURE=gfx942 -f docker/pyt_mpt30b_training.ubuntu.amd.Dockerfile -t mosaic_mpt30_image .
+
+      Start a ``mosaic_mpt30_image`` container using the following command.
+
+      .. code-block:: shell
+
+         docker run -it --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --shm-size=8G mosaic_mpt30_image
+
+      In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
+      repository and navigate to the benchmark scripts directory at
+      ``/workspace/MAD/scripts/pyt_mpt30b_training``.
+
+      .. code-block:: shell
+
+         git clone https://github.com/ROCm/MAD
+         cd MAD/scripts/pyt_mpt30b_training
+
+      To initiate the training process, use the following command. This script uses the hyperparameters defined in
+      ``mpt-30b-instruct.yaml``.
+
+      .. code-block:: shell
+
+         source run.sh
+
+      .. note::
+
+         For improved performance (training throughput), consider enabling TunableOp.
+         To enable it, add the ``--tunableop on`` flag.
+
+         .. code-block:: shell
+
+            source run.sh --tunableop on
+
+         Although tuning might increase the initial training time, it can result in higher training throughput.
+
+Interpreting the output
+=======================
+
+The training output is displayed in the terminal and simultaneously saved
+to the ``output.txt`` file in the current directory. Key performance metrics are
+also extracted and appended to the ``perf_pyt_mpt30b_training.csv`` file.
+
+The output includes the following:
+
+- Training logs: Real-time display of loss metrics, accuracy, and training progress.
+
+- Model checkpoints: Periodically saved model snapshots for potential resume or evaluation.
+
+- Performance metrics: Detailed summaries of training speed and training loss metrics.
+
+  - Performance (``throughput/samples_per_sec``)
+
+    Overall throughput, measured as the total number of samples processed per second. Higher values indicate better hardware utilization.
+
+  - Performance per device (``throughput/device/samples_per_sec``)
+
+    Throughput on a per-device basis, showing how each GPU or CPU is performing.
+
+  - Language Cross Entropy (``metrics/train/LanguageCrossEntropy``)
+
+    Measures how well the model predicts the next token. Lower cross entropy suggests the model’s output is closer to the expected distribution.
+
+  - Training loss (``loss/train/total``)
+
+    Overall training loss. A decreasing trend indicates the model is learning effectively.
+
+
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -46,6 +46,8 @@ subtrees:
            title: Train a model with PyTorch
          - file: how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext
            title: Train a model with JAX MaxText
+          - file: how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry
+            title: Train a model with LLM Foundry
          - file: how-to/rocm-for-ai/training/scale-model-training.rst
            title: Scale model training