From f4f096b44e134f0c62654cac3856d6adcbe5d145 Mon Sep 17 00:00:00 2001
From: anisha-amd
Date: Tue, 15 Jul 2025 16:23:50 -0400
Subject: [PATCH] Stanford Megatron-LM Compatibility

* Create stanford-megatron-lm-compatibility.rst
* toc and wordlist
* Update deep-learning-rocm.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* fixes and adding to main compat matrix
* formatting fix
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst
* Update stanford-megatron-lm-compatibility.rst

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
---
 .wordlist.txt                               |   2 +
 docs/compatibility/compatibility-matrix.rst |   1 +
 .../stanford-megatron-lm-compatibility.rst  | 100 ++++++++++++++++++
 docs/how-to/deep-learning-rocm.rst          |   2 +
 4 files changed, 105 insertions(+)
 create mode 100644 docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst

diff --git a/.wordlist.txt b/.wordlist.txt
index 57f58ab9f..0cb68a4aa 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -194,6 +194,7 @@ Higgs
 Hyperparameters
 Huggingface
 ICD
+ICT
 ICV
 IDE
 IDEs
@@ -368,6 +369,7 @@ RDC's
 RDMA
 RDNA
 README
+Recomputation
 RHEL
 RMW
 RNN
diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst
index 5587c368d..b6e12c349 100644
--- a/docs/compatibility/compatibility-matrix.rst
+++ b/docs/compatibility/compatibility-matrix.rst
@@ -55,6 +55,7 @@ compatibility and system requirements.
    :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 2.1, 2.0, 1.13"
    :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1"
    :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.4.35,0.4.35,0.4.31
+   :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`,N/A,N/A,`85f95ae `_
    :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>`,2.4.0,2.4.0,N/A
    `ONNX Runtime `_,1.2,1.2,1.17.3
    ,,,
diff --git a/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst b/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
new file mode 100644
index 000000000..e8f1b4195
--- /dev/null
+++ b/docs/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
@@ -0,0 +1,100 @@
+:orphan:
+
+.. meta::
+   :description: Stanford Megatron-LM compatibility
+   :keywords: Stanford, Megatron-LM, compatibility
+
+.. version-set:: rocm_version latest
+
+********************************************************************************
+Stanford Megatron-LM compatibility
+********************************************************************************
+
+Stanford Megatron-LM is based on Megatron-LM, the large-scale language model training framework developed by NVIDIA at `https://github.com/NVIDIA/Megatron-LM `_. It is
+designed to train massive transformer-based language models efficiently through model and data parallelism.
+
+* ROCm support for Stanford Megatron-LM is hosted in the official `https://github.com/ROCm/Stanford-Megatron-LM `_ repository.
+* Due to independent compatibility considerations, this location differs from the `https://github.com/stanford-futuredata/Megatron-LM `_ upstream repository.
+* Use the prebuilt :ref:`Docker image ` with ROCm, PyTorch, and Megatron-LM preinstalled.
+* See the :doc:`ROCm Stanford Megatron-LM installation guide ` to install and get started.
+
+.. note::
+
+   Stanford Megatron-LM is supported on ROCm 6.3.0.
+
+
+Supported devices
+================================================================================
+
+- **Officially Supported**: AMD Instinct MI300X
+- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210
+
+
+Supported models and features
+================================================================================
+
+This section details the models and features supported by the ROCm version of Stanford Megatron-LM.
+
+Models:
+
+* BERT
+* GPT
+* T5
+* ICT
+
+Features:
+
+* Distributed Pre-training
+* Activation Checkpointing and Recomputation
+* Distributed Optimizer
+* Mixture-of-Experts
+
+.. _megatron-lm-recommendations:
+
+Use cases and recommendations
+================================================================================
+
+See the `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs blog `_ post
+to learn how to use the ROCm platform to pre-process datasets and pre-train models with the Stanford Megatron-LM framework on AMD GPUs.
+Coverage includes:
+
+   * Single-GPU pre-training
+   * Multi-GPU pre-training
+
+
+.. _megatron-lm-docker-compat:
+
+Docker image compatibility
+================================================================================
+
+.. |docker-icon| raw:: html
+
+
+AMD validates and publishes `Stanford Megatron-LM images `_
+with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated
+inventories represent the latest Stanford Megatron-LM version available on Docker Hub.
+The Docker images have been validated for `ROCm 6.3.0 `_.
+Click |docker-icon| to view the image on Docker Hub.
+
+.. list-table::
+   :header-rows: 1
+   :class: docker-image-compatibility
+
+   * - Docker image
+     - Stanford Megatron-LM
+     - PyTorch
+     - Ubuntu
+     - Python
+
+   * - .. raw:: html
+
+
+
+     - `85f95ae `_
+     - `2.4.0 `_
+     - 24.04
+     - `3.12.9 `_
+
+
+
diff --git a/docs/how-to/deep-learning-rocm.rst b/docs/how-to/deep-learning-rocm.rst
index e9b9881e8..5886647f7 100644
--- a/docs/how-to/deep-learning-rocm.rst
+++ b/docs/how-to/deep-learning-rocm.rst
@@ -17,6 +17,7 @@ features for these ROCm-enabled deep learning frameworks.
 * :doc:`PyTorch compatibility <../compatibility/ml-compatibility/pytorch-compatibility>`
 * :doc:`TensorFlow compatibility <../compatibility/ml-compatibility/tensorflow-compatibility>`
 * :doc:`JAX compatibility <../compatibility/ml-compatibility/jax-compatibility>`
+* :doc:`Stanford Megatron-LM compatibility <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`
 * :doc:`DGL compatibility <../compatibility/ml-compatibility/dgl-compatibility>`
 
 This chart steps through typical installation workflows for installing deep learning frameworks for ROCm.
@@ -30,6 +31,7 @@ See the installation instructions to get started.
 * :doc:`PyTorch for ROCm `
 * :doc:`TensorFlow for ROCm `
 * :doc:`JAX for ROCm `
+* :doc:`Stanford Megatron-LM for ROCm `
 * :doc:`DGL for ROCm `
 
 .. note::
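The Docker image compatibility table in the new page lists PyTorch 2.4.0 and Python 3.12.9 for Stanford Megatron-LM commit 85f95ae, validated for ROCm 6.3.0. Below is a minimal sketch, not part of any ROCm tooling, for checking that a running environment (for example, inside the container) reports those versions. It assumes that any local build suffix after `+` in the PyTorch version string can be ignored; the helper names `base_version`, `collect_environment`, and `mismatches` are hypothetical.

```python
import platform

# Versions from the Docker image compatibility table (commit 85f95ae, ROCm 6.3.0).
EXPECTED = {"PyTorch": "2.4.0", "Python": "3.12.9"}


def base_version(version: str) -> str:
    """Drop a local build suffix (anything after '+') from a version string.

    Assumption: the suffix is not relevant when comparing against the table.
    """
    return version.split("+", 1)[0]


def collect_environment() -> dict:
    """Gather the versions visible from this interpreter."""
    found = {"Python": platform.python_version()}
    try:
        import torch  # present only in a PyTorch environment

        found["PyTorch"] = str(torch.__version__)
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        found["HIP"] = getattr(torch.version, "hip", None)
    except (ImportError, OSError):
        # OSError covers a torch install whose native libraries fail to load.
        found["PyTorch"] = None
    return found


def mismatches(found: dict, expected: dict = EXPECTED) -> list:
    """Return readable differences between detected and expected versions."""
    problems = []
    for name, want in expected.items():
        have = found.get(name)
        if have is None:
            problems.append(f"{name}: not found (expected {want})")
        elif base_version(str(have)) != want:
            problems.append(f"{name}: found {have}, expected {want}")
    return problems


if __name__ == "__main__":
    env = collect_environment()
    print("Detected:", env)
    problems = mismatches(env)
    print("\n".join(problems) if problems else "Matches the validated versions.")
```

Only the versions named in the table are compared; the HIP version is reported for information because the table does not pin it.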