Mirror of https://github.com/ROCm/ROCm.git
Compare commits: rocm-7.2.1...docs_7.2.1 (12 commits)
Commits:

- 4e5eac9127
- cf5d3b2e99
- 0b972fe327
- 6a0abd09a7
- 0dcf4b0261
- ba1add5662
- 0e3b546159
- da180ce262
- 7ab479613e
- ac6e6b5301
- a997f5135f
- 1b69ea280c
RELEASE.md
@@ -130,7 +130,6 @@ GPU and baseboard firmware versioning might differ across GPU families.
<tr>
<td>MI325X<a href="#footnote1"><sup>[1]</sup></a></td>
<td>
01.25.06.05<br>
01.25.04.02
</td>
<td>30.30.1<br>
@@ -260,7 +259,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
<th rowspan="9">Machine learning and computer vision</th>
<td><a href="https://rocm.docs.amd.com/projects/composable_kernel/en/docs-7.2.1/index.html">Composable Kernel</a></td>
<td>1.2.0</a></td>
<td><a href="https://github.com/ROCm/composable_kernel"><i class="fab fa-github fa-lg"></i></a></td>
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/projects/composablekernel"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs-7.2.1/index.html">MIGraphX</a></td>
@@ -397,7 +396,7 @@ Click {fab}`github` to go to the component's source code on GitHub.
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/Tensile/en/docs-7.2.1/src/index.html">Tensile</a></td>
<td>4.44.0</td>
<td>4.45.0</td>
<td><a href="https://github.com/ROCm/rocm-libraries/tree/develop/shared/tensile"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
@@ -674,11 +673,15 @@ Affected GEMM configurations:
* 8192 × 8192 × 1 × 16384
Due to this issue, you might also observe a slight increase in the test or inference time. This issue is resolved in the {fab}`github`[hipBLASLt `develop` branch](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt) and will be part of a future ROCm release.
Due to this issue, you might also observe a slight increase in the test or inference time. This issue is resolved in the {fab}`github`[hipBLASLt develop branch](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt) and will be part of a future ROCm release. See [GitHub issue #6065](https://github.com/ROCm/ROCm/issues/6065).
### Longer runtime for hipBLASLt GEMM operations on Instinct MI300X GPUs in partitioned mode
GEMM operations using hipBLASLt might result in longer runtime on AMD Instinct MI300X GPUs configured in CPX or NPS4 partition mode (38 control units or CUs). This issue occurs when hipBLASLt fails to find applicable pre-tuned kernels. As a result, it performs an extensive kernel search, which increases both search time and the overall operation runtime. This issue is resolved in the {fab}`github`[hipBLASLt `develop` branch](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt) and will be part of a future ROCm release.
GEMM operations using hipBLASLt might result in longer runtime on AMD Instinct MI300X GPUs configured in CPX or NPS4 partition mode (38 control units or CUs). This issue occurs when hipBLASLt fails to find applicable pre-tuned kernels. As a result, it performs an extensive kernel search, which increases both search time and the overall operation runtime. This issue is resolved in the {fab}`github`[hipBLASLt develop branch](https://github.com/ROCm/rocm-libraries/tree/develop/projects/hipblaslt) and will be part of a future ROCm release. See [GitHub issue #6066](https://github.com/ROCm/ROCm/issues/6066).
### ROCTracer might fail to report kernel operations
Applications that use [ROCTracer](https://rocm.docs.amd.com/projects/roctracer/en/latest/index.html) might fail to receive some or all kernel operation events due to a ROCTracer reporting failure. ROCTracer is already deprecated and is scheduled to reach end of support (EoS) by the end of 2026 Q2. For more details on ROCTracer deprecation, see [ROCm upcoming changes](#roctracer-rocprofiler-rocprof-and-rocprofv2-deprecation). This issue will be resolved in a future PyTorch on ROCm release that replaces ROCTracer with [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/). See [GitHub issue #6102](https://github.com/ROCm/ROCm/issues/6102).
## ROCm resolved issues
@@ -33,13 +33,7 @@ ROCm Version,7.2.1,7.2.0,7.1.1,7.1.0,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.9.1, 2.8.0, 2.7.1","2.9.1, 2.8.0, 2.7.1","2.9, 2.8, 2.7","2.8, 2.7, 2.6","2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.8.2,0.8.0,0.7.1,0.7.1,0.6.0,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
:doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,2.4.0,2.4.0,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,b6652,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.23.2,1.23.2,1.23.1,1.22.0,1.22.0,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
@@ -58,7 +58,6 @@ compatibility and system requirements.
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.20.0, 2.19.1, 2.18.1","2.20.0, 2.19.1, 2.18.1","2.18.1, 2.17.1, 2.16.2"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.8.2,0.8.0,0.4.35
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,2.4.0
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat]_,N/A,N/A,b5997
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.23.2,1.23.2,1.20.0
,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix:,,
@@ -159,7 +158,6 @@ compatibility and system requirements.
.. [#os-compatibility] Some operating systems are supported on specific GPUs. For detailed information about operating systems supported on ROCm 7.2.1, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.2.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
.. [#gpu-compatibility] Some GPUs have limited operating system support. For detailed information about GPUs supporting ROCm 7.2.1, see the latest :ref:`supported_GPUs`. For version specific information, see `ROCm 7.2.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/reference/system-requirements.html#supported-gpus>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-gpus>`__.
.. [#dgl_compat] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3, and ROCm 6.4.0.
.. [#llama-cpp_compat] llama.cpp is supported only on ROCm 7.0.0 and ROCm 6.4.x.
.. [#mi325x_KVM] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
.. [#driver_patch] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
@@ -205,13 +203,7 @@ Expand for full historical view of:
.. [#os-compatibility-past-60] Some operating systems are supported on specific GPUs. For detailed information, see :ref:`supported_distributions` and select the required ROCm version for version specific support.
.. [#gpu-compatibility-past-60] Some GPUs have limited operating system support. For detailed information, see :ref:`supported_GPUs` and select the required ROCm version for version specific support.
.. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead.
.. [#verl_compat-past-60] verl is supported only on ROCm 6.2.0.
.. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is supported only on ROCm 6.3.0.
.. [#dgl_compat-past-60] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3, and ROCm 6.4.0.
.. [#megablocks_compat-past-60] Megablocks is supported only on ROCm 6.3.0.
.. [#ray_compat-past-60] Ray is supported only on ROCm 6.4.1.
.. [#llama-cpp_compat-past-60] llama.cpp is supported only on ROCm 7.0.0 and 6.4.x.
.. [#flashinfer_compat-past-60] FlashInfer is supported only on ROCm 6.4.1.
.. [#mi325x_KVM-past-60] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
.. [#driver_patch-past-60] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support-past-60] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
@@ -1,113 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: FlashInfer compatibility
|
||||
:keywords: GPU, LLM, FlashInfer, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
********************************************************************************
|
||||
FlashInfer compatibility
|
||||
********************************************************************************
|
||||
|
||||
`FlashInfer <https://docs.flashinfer.ai/index.html>`__ is a library and kernel generator
for Large Language Models (LLMs) that provides high-performance graphics processing unit (GPU)
kernels. FlashInfer focuses on LLM serving and inference and delivers strong performance
across diverse scenarios.

FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized
techniques, while supporting customized attention variants. It is compatible with ``torch.compile`` and
offers high-performance LLM-specific operators that integrate easily through PyTorch and C++ APIs.
|
||||
|
||||
.. note::
|
||||
|
||||
The ROCm port of FlashInfer is under active development, and some features are not yet available.
|
||||
For the latest feature compatibility matrix, refer to the ``README`` of the
|
||||
`https://github.com/ROCm/flashinfer <https://github.com/ROCm/flashinfer>`__ repository.
|
||||
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of FlashInfer is maintained in the official `https://github.com/ROCm/flashinfer
|
||||
<https://github.com/ROCm/flashinfer>`__ repository, which differs from the
|
||||
`https://github.com/flashinfer-ai/flashinfer <https://github.com/flashinfer-ai/flashinfer>`__
|
||||
upstream repository.
|
||||
|
||||
- To get started and install FlashInfer on ROCm, use the prebuilt :ref:`Docker images <flashinfer-docker-compat>`,
|
||||
which include ROCm, FlashInfer, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm FlashInfer installation guide <rocm-install-on-linux:install/3rd-party/flashinfer-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `Installation guide <https://docs.flashinfer.ai/installation.html>`__
|
||||
for additional context.
|
||||
|
||||
.. _flashinfer-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `FlashInfer images <https://hub.docker.com/r/rocm/flashinfer/tags>`__
|
||||
with ROCm backends on Docker Hub. The following Docker image tag and associated
|
||||
inventories represent the latest available FlashInfer version from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Docker image
|
||||
- ROCm
|
||||
- FlashInfer
|
||||
- PyTorch
|
||||
- Ubuntu
|
||||
- Python
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/flashinfer/flashinfer-0.2.5.amd2_rocm7.1.1_ubuntu24.04_py3.12_pytorch2.8/images/sha256-9ab6426750a11dbab9bcddeaccaf492683bfd96a1d60b21dd9fc3a609a98175b"><i class="fab fa-docker fa-lg"></i> rocm/flashinfer</a>
|
||||
- `7.1.1 <https://repo.radeon.com/rocm/apt/7.1.1/>`__
|
||||
- `v0.2.5 <https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.5>`__
|
||||
- `2.8.0 <https://github.com/ROCm/pytorch/releases/tag/v2.8.0>`__
|
||||
- 24.04
|
||||
- `3.12 <https://www.python.org/downloads/release/python-3129/>`__
|
||||
- MI325X, MI300X
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/flashinfer/flashinfer-0.2.5_rocm6.4_ubuntu24.04_py3.12_pytorch2.7/images/sha256-558914838821c88c557fb6d42cfbc1bdb67d79d19759f37c764a9ee801f93313"><i class="fab fa-docker fa-lg"></i> rocm/flashinfer</a>
|
||||
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
|
||||
- `v0.2.5 <https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.5>`__
|
||||
- `2.7.1 <https://github.com/ROCm/pytorch/releases/tag/v2.7.1>`__
|
||||
- 24.04
|
||||
- `3.12 <https://www.python.org/downloads/release/python-3129/>`__
|
||||
- MI300X
|
||||
|
||||
.. _flashinfer-recommendations:
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
FlashInfer on ROCm enables you to perform LLM inference for both prefill and decode:
|
||||
during prefill, your model efficiently processes input prompts to build KV caches
|
||||
and internal activations; during decode, it generates tokens sequentially based on
|
||||
prior outputs and context. Use the attention mode supported upstream (Multi-Head
|
||||
Attention, Grouped-Query Attention, or Multi-Query Attention) that matches your
|
||||
model configuration.
|
||||
|
||||
FlashInfer on ROCm also includes capabilities such as load balancing,
|
||||
sparse and dense attention optimizations, and single and batch decode, alongside
|
||||
prefill for high‑performance execution on MI300X GPUs.
|
||||
|
||||
For currently supported use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/search.html?q=flashinfer>`__,
|
||||
where you can search for examples and best practices to optimize your workloads on AMD GPUs.
|
||||
|
||||
Previous versions
|
||||
===============================================================================
|
||||
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/flashinfer-history` to find documentation for previous releases
|
||||
of the ``ROCm/flashinfer`` Docker image.
|
||||
@@ -1,275 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: llama.cpp compatibility
|
||||
:keywords: GPU, GGML, llama.cpp, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
********************************************************************************
|
||||
llama.cpp compatibility
|
||||
********************************************************************************
|
||||
|
||||
`llama.cpp <https://github.com/ggml-org/llama.cpp>`__ is an open-source framework
|
||||
for Large Language Model (LLM) inference that runs on both central processing units
|
||||
(CPUs) and graphics processing units (GPUs). It is written in plain C/C++, providing
|
||||
a simple, dependency-free setup.
|
||||
|
||||
The framework supports multiple quantization options, from 1.5-bit to 8-bit integers,
|
||||
to accelerate inference and reduce memory usage. Originally built as a CPU-first library,
|
||||
llama.cpp is easy to integrate with other programming environments and is widely
|
||||
adopted across diverse platforms, including consumer devices.
|
||||
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of llama.cpp is maintained in the official `https://github.com/ROCm/llama.cpp
|
||||
<https://github.com/ROCm/llama.cpp>`__ repository, which differs from the
|
||||
`https://github.com/ggml-org/llama.cpp <https://github.com/ggml-org/llama.cpp>`__ upstream repository.
|
||||
|
||||
- To get started and install llama.cpp on ROCm, use the prebuilt :ref:`Docker images <llama-cpp-docker-compat>`,
|
||||
which include ROCm, llama.cpp, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm llama.cpp installation guide <rocm-install-on-linux:install/3rd-party/llama-cpp-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `Installation guide <https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md>`__
|
||||
for additional context.
|
||||
|
||||
.. _llama-cpp-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `llama.cpp images <https://hub.docker.com/r/rocm/llama.cpp/tags>`__
|
||||
with ROCm backends on Docker Hub. The following Docker image tags and associated
|
||||
inventories represent the latest available llama.cpp versions from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. important::
|
||||
|
||||
Tag endings of ``_full``, ``_server``, and ``_light`` serve different purposes for entrypoints as follows:

- Full: This image includes both the main executable file and the tools to convert ``LLaMA`` models into ``ggml`` format and quantize them to 4-bit.
- Server: This image only includes the server executable file.
- Light: This image only includes the main executable file.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Full Docker
|
||||
- Server Docker
|
||||
- Light Docker
|
||||
- llama.cpp
|
||||
- ROCm
|
||||
- Ubuntu
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_full/images/sha256-a94f0c7a598cc6504ff9e8371c016d7a2f93e69bf54a36c870f9522567201f10g"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_server/images/sha256-be175932c3c96e882dfbc7e20e0e834f58c89c2925f48b222837ee929dfc47ee"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu24.04_light/images/sha256-d8ba0c70603da502c879b1f8010b439c8e7fa9f6cbdac8bbbbbba97cb41ebc9e"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
|
||||
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
|
||||
- 24.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu22.04_full/images/sha256-37582168984f25dce636cc7288298e06d94472ea35f65346b3541e6422b678ee"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu22.04_server/images/sha256-7e70578e6c3530c6591cc2c26da24a9ee68a20d318e12241de93c83224f83720"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6652.amd0_rocm7.0.0_ubuntu22.04_light/images/sha256-9a5231acf88b4a229677bc2c636ea3fe78a7a80f558bd80910b919855de93ad5"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
|
||||
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
|
||||
- 22.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu24.04_full/images/sha256-5960fc850024a8a76451f9eaadd89b7e59981ae9f393b407310c1ddf18892577"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu24.04_server/images/sha256-1b79775d9f546065a6aaf9ca426e1dd4ed4de0b8f6ee83687758cc05af6538e6"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu24.04_light/images/sha256-8f863c4c2857ae42bebd64e4f1a0a1e7cc3ec4503f243e32b4a4dcad070ec361"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
|
||||
- 24.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu22.04_full/images/sha256-888879b3ee208f9247076d7984524b8d1701ac72611689e89854a1588bec9867"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu22.04_server/images/sha256-90e4ff99a66743e33fd00728cd71a768588e5f5ef355aaa196669fe65ac70672"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.3_ubuntu22.04_light/images/sha256-bd447a049939cb99054f8fbf3f2352870fe906a75e2dc3339c845c08b9c53f9b"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
|
||||
- 22.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu24.04_full/images/sha256-5b3a1bc4889c1fcade434b937fbf9cc1c22ff7dc0317c130339b0c9238bc88c4"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu24.04_server/images/sha256-5228ff99d0f627a9032d668f4381b2e80dc1e301adc3e0821f26d8354b175271"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu24.04_light/images/sha256-b12723b332a826a89b7252dddf868cbe4d1a869562fc4aa4032f59e1a683b968"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
|
||||
- 24.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu22.04_full/images/sha256-cd6e21a6a73f59b35dd5309b09dd77654a94d783bf13a55c14eb8dbf8e9c2615"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu22.04_server/images/sha256-c2b4689ab2c47e6626e8fea22d7a63eb03d47c0fde9f5ef8c9f158d15c423e58"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.2_ubuntu22.04_light/images/sha256-1acc28f29ed87db9cbda629cb29e1989b8219884afe05f9105522be929e94da4"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
|
||||
- 22.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu24.04_full/images/sha256-2f8ae8a44510d96d52dea6cb398b224f7edeb7802df7ec488c6f63d206b3cdc9"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu24.04_server/images/sha256-fece497ff9f4a28b12f645de52766941da8ead8471aa1ea84b61d4b4568e51f2"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu24.04_light/images/sha256-3e14352fa6f8c6128b23cf9342531c20dbfb522550b626e09d83b260a1947022"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
|
||||
- 24.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu22.04_full/images/sha256-80763062ef0bec15038c35fd01267f1fc99a5dd171d4b48583cc668b15efad69"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu22.04_server/images/sha256-db2a6c957555ed83b819bbc54aea884a93192da0fb512dae63d32e0dc4e8ab8f"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b6356_rocm6.4.1_ubuntu22.04_light/images/sha256-c6dbb07cc655fb079d5216e4b77451cb64a9daa0585d23b6fb8b32cb22021197"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
|
||||
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
|
||||
- 22.04
|
||||
- MI325X, MI300X, MI210
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b5997_rocm6.4.0_ubuntu24.04_full/images/sha256-f78f6c81ab2f8e957469415fe2370a1334fe969c381d1fe46050c85effaee9d5"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b5997_rocm6.4.0_ubuntu24.04_server/images/sha256-275ad9e18f292c26a00a2de840c37917e98737a88a3520bdc35fd3fc5c9a6a9b"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/llama.cpp/llama.cpp-b5997_rocm6.4.0_ubuntu24.04_light/images/sha256-cc324e6faeedf0e400011f07b49d2dc41a16bae257b2b7befa0f4e2e97231320"><i class="fab fa-docker fa-lg"></i> rocm/llama.cpp</a>
|
||||
- `b5997 <https://github.com/ROCm/llama.cpp/tree/release/b5997>`__
|
||||
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__
|
||||
- 24.04
|
||||
- MI300X, MI210
|
||||
|
||||
.. _llama-cpp-key-rocm-libraries:
|
||||
|
||||
Key ROCm libraries for llama.cpp
|
||||
================================================================================
|
||||
|
||||
llama.cpp functionality on ROCm is determined by its underlying library
|
||||
dependencies. These ROCm components affect the capabilities, performance, and
|
||||
feature set available to developers. Ensure you have the required libraries for
|
||||
your corresponding ROCm version.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - ROCm library
|
||||
- ROCm 7.0.0 version
|
||||
- ROCm 6.4.x version
|
||||
- Purpose
|
||||
- Usage
|
||||
* - `hipBLAS <https://github.com/ROCm/hipBLAS>`__
|
||||
- 3.0.0
|
||||
- 2.4.0
|
||||
- Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for
|
||||
matrix and vector operations.
|
||||
- Supports operations such as matrix multiplication, matrix-vector
|
||||
products, and tensor contractions. Utilized in both dense and batched
|
||||
linear algebra operations.
|
||||
* - `hipBLASLt <https://github.com/ROCm/hipBLASLt>`__
|
||||
- 1.0.0
|
||||
- 0.12.0
|
||||
- hipBLASLt is an extension of the hipBLAS library, providing additional
|
||||
features like epilogues fused into the matrix multiplication kernel or
|
||||
use of integer tensor cores.
|
||||
- By setting the ``ROCBLAS_USE_HIPBLASLT`` environment variable, you can dispatch hipBLASLt
kernels where possible (see the sketch after this table).
|
||||
* - `rocWMMA <https://github.com/ROCm/rocWMMA>`__
|
||||
- 2.0.0
|
||||
- 1.7.0
|
||||
- Accelerates warp-level matrix-multiply and matrix-accumulate to speed up matrix
|
||||
multiplication (GEMM) and accumulation operations with mixed precision
|
||||
support.
|
||||
- Can be used to enhance the flash attention performance on AMD compute, by enabling
|
||||
the flag during compile time.
|
||||
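A minimal sketch of exercising the ``ROCBLAS_USE_HIPBLASLT`` dispatch variable from Python,
assuming a ROCm build of PyTorch is installed; llama.cpp users would instead export the variable
in their shell environment before launching the binaries. This is an illustration only, not part
of the official llama.cpp documentation.

.. code-block:: python

   import os

   # The variable must be set before the BLAS backend is initialized,
   # so set it before importing any library that loads rocBLAS.
   os.environ["ROCBLAS_USE_HIPBLASLT"] = "1"

   import torch  # ROCm build of PyTorch (assumption)

   a = torch.randn(1024, 1024, device="cuda")  # "cuda" maps to the AMD GPU on ROCm
   b = torch.randn(1024, 1024, device="cuda")
   c = a @ b  # GEMM that rocBLAS may now dispatch to hipBLASLt kernels
   print(c.shape)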
|
||||
.. _llama-cpp-uses-recommendations:
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
|
||||
|
||||
- Plain C/C++ implementation with no external dependencies
|
||||
- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
|
||||
- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units)
|
||||
- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory); see the sketch after this list
|
||||
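A minimal sketch of CPU + GPU hybrid inference, using the third-party ``llama-cpp-python`` bindings
rather than the ``llama-cli`` binary shipped in the Docker images; the package, model path, and layer
count below are illustrative assumptions, not values taken from this documentation.

.. code-block:: python

   from llama_cpp import Llama  # pip package llama-cpp-python, built with HIP support (assumption)

   # Offload only part of the model: layers that fit in VRAM run on the GPU,
   # while the remaining layers stay on the CPU.
   llm = Llama(
       model_path="/models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local GGUF file
       n_gpu_layers=24,                              # number of layers to place on the GPU
   )

   out = llm("Explain mixture-of-experts in one sentence.", max_tokens=64)
   print(out["choices"][0]["text"])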
|
||||
llama.cpp is also used in a range of real-world applications, including:
|
||||
|
||||
- Games such as `Lucy's Labyrinth <https://github.com/MorganRO8/Lucys_Labyrinth>`__:
|
||||
A simple maze game where AI-controlled agents attempt to trick the player.
|
||||
- Tools such as `Styled Lines <https://marketplace.unity.com/packages/tools/ai-ml-integration/style-text-webgl-ios-stand-alone-llm-llama-cpp-wrapper-292902>`__:
|
||||
A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
|
||||
- Various other AI applications use llama.cpp as their inference engine;
|
||||
for a detailed list, see the `user interfaces (UIs) section <https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#description>`__.
|
||||
|
||||
For more use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
|
||||
where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
|
||||
|
||||
- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration <https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html>`__
|
||||
blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``,
|
||||
server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for
|
||||
AMD Instinct GPUs within the ROCm ecosystem.
|
||||
|
||||
|
||||
Previous versions
|
||||
===============================================================================
|
||||
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/llama-cpp-history` to find documentation for previous releases
|
||||
of the ``ROCm/llama.cpp`` Docker image.
|
||||
@@ -1,104 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: Megablocks compatibility
|
||||
:keywords: GPU, megablocks, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
********************************************************************************
|
||||
Megablocks compatibility
|
||||
********************************************************************************
|
||||
|
||||
`Megablocks <https://github.com/databricks/megablocks>`__ is a lightweight library
|
||||
for mixture-of-experts `(MoE) <https://huggingface.co/blog/moe>`__ training.
|
||||
The core of the system is efficient "dropless-MoE" and standard MoE layers.
|
||||
Megablocks is integrated with `https://github.com/stanford-futuredata/Megatron-LM
|
||||
<https://github.com/stanford-futuredata/Megatron-LM>`__,
|
||||
where data and pipeline parallel training of MoEs is supported.
|
||||
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of Megablocks is maintained in the official `https://github.com/ROCm/megablocks
<https://github.com/ROCm/megablocks>`__ repository, which differs from the
`https://github.com/databricks/megablocks <https://github.com/databricks/megablocks>`__ upstream repository.
|
||||
|
||||
- To get started and install Megablocks on ROCm, use the prebuilt :ref:`Docker image <megablocks-docker-compat>`,
|
||||
which includes ROCm, Megablocks, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm Megablocks installation guide <rocm-install-on-linux:install/3rd-party/megablocks-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `Installation guide <https://github.com/databricks/megablocks>`__
|
||||
for additional context.
|
||||
|
||||
.. _megablocks-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
|
||||
with ROCm backends on Docker Hub. The following Docker image tag and associated
|
||||
inventories represent the latest available Megablocks version from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Docker image
|
||||
- ROCm
|
||||
- Megablocks
|
||||
- PyTorch
|
||||
- Ubuntu
|
||||
- Python
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/megablocks/megablocks-0.7.0_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-372ff89b96599019b8f5f9db469c84add2529b713456781fa62eb9a148659ab4"><i class="fab fa-docker fa-lg"></i> rocm/megablocks</a>
|
||||
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
|
||||
- `0.7.0 <https://github.com/databricks/megablocks/releases/tag/v0.7.0>`_
|
||||
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
|
||||
- 24.04
|
||||
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
|
||||
- MI300X
|
||||
|
||||
Supported models and features with ROCm 6.3.0
|
||||
================================================================================
|
||||
|
||||
This section summarizes the Megablocks features supported by ROCm.
|
||||
|
||||
* Distributed Pre-training
|
||||
* Activation Checkpointing and Recomputation
|
||||
* Distributed Optimizer
|
||||
* Mixture-of-Experts
|
||||
* dropless-Mixture-of-Experts
|
||||
|
||||
.. _megablocks-recommendations:
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
* The `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs
|
||||
<https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`__
|
||||
blog post guides how to leverage the ROCm platform for pre-training using the
|
||||
Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts
|
||||
(MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it
|
||||
demonstrates how block-sparse computations can enhance scalability and efficiency in MoE
|
||||
training. The guide provides step-by-step instructions for setting up the environment,
|
||||
including cloning the repository, building the Docker image, and running the training container.
|
||||
Additionally, it offers insights into utilizing the ``oscar-1GB.json`` dataset for pre-training
|
||||
language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE
|
||||
training workflows for large-scale transformer models.
|
||||
|
||||
It features how to pre-process datasets and how to begin pre-training on AMD GPUs through:
|
||||
|
||||
* Single-GPU pre-training
|
||||
* Multi-GPU pre-training
|
||||
|
||||
@@ -1,114 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: Ray compatibility
|
||||
:keywords: GPU, Ray, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
*******************************************************************************
|
||||
Ray compatibility
|
||||
*******************************************************************************
|
||||
|
||||
Ray is a unified framework for scaling AI and Python applications from your laptop
|
||||
to a full cluster, without changing your code. Ray consists of `a core distributed
|
||||
runtime <https://docs.ray.io/en/latest/ray-core/walkthrough.html>`__ and a set of
|
||||
`AI libraries <https://docs.ray.io/en/latest/ray-air/getting-started.html>`__ for
|
||||
simplifying machine learning computations.
|
||||
|
||||
Ray is a general-purpose framework that runs many types of workloads efficiently.
|
||||
Any Python application can be scaled with Ray, without extra infrastructure.
|
||||
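As a hedged illustration of that claim, the following minimal sketch requests one GPU per task
through Ray's standard resource API; inside the ROCm Ray containers listed below, the GPU resource
maps to an AMD Instinct device. The task body is an illustrative placeholder.

.. code-block:: python

   import ray

   ray.init()  # start or connect to a local Ray runtime

   @ray.remote(num_gpus=1)
   def gpu_task(x: int) -> int:
       # Placeholder for GPU work (for example, a PyTorch kernel on the device).
       return x * x

   # Requires at least one visible GPU; tasks are scheduled one per available GPU.
   futures = [gpu_task.remote(i) for i in range(4)]
   print(ray.get(futures))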
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of Ray is maintained in the official `https://github.com/ROCm/ray
|
||||
<https://github.com/ROCm/ray>`__ repository, which differs from the
|
||||
`https://github.com/ray-project/ray <https://github.com/ray-project/ray>`__ upstream repository.
|
||||
|
||||
- To get started and install Ray on ROCm, use the prebuilt :ref:`Docker image <ray-docker-compat>`,
|
||||
which includes ROCm, Ray, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `Installation guide <https://docs.ray.io/en/latest/ray-overview/installation.html>`__
|
||||
for additional context.
|
||||
|
||||
.. _ray-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
|
||||
with ROCm backends on Docker Hub. The following Docker image tags and
|
||||
associated inventories represent the latest Ray version from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Docker image
|
||||
- ROCm
|
||||
- Ray
|
||||
- PyTorch
|
||||
- Ubuntu
|
||||
- Python
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.51.1_rocm7.0.0_ubuntu22.04_py3.12_pytorch2.9.0/images/sha256-a02f6766b4ba406f88fd7e85707ec86c04b569834d869a08043ec9bcbd672168"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
|
||||
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
|
||||
- `2.51.1 <https://github.com/ROCm/ray/tree/release/2.51.1>`__
|
||||
- 2.9.0a0+git1c57644
|
||||
- 22.04
|
||||
- `3.12.12 <https://www.python.org/downloads/release/python-31212/>`__
|
||||
- MI300X
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
|
||||
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
|
||||
- `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`__
|
||||
- 2.6.0+git684f6f2
|
||||
- 24.04
|
||||
- `3.12.10 <https://www.python.org/downloads/release/python-31210/>`__
|
||||
- MI300X, MI210
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
* The `Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm
|
||||
Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__
|
||||
blog provides an overview of Volcano Engine Reinforcement Learning (verl)
|
||||
for large language models (LLMs) and discusses its benefits in large-scale
|
||||
reinforcement learning from human feedback (RLHF). It uses Ray as part of a
|
||||
hybrid orchestration engine to schedule and coordinate training and inference
|
||||
tasks in parallel, enabling optimized resource utilization and potential overlap
|
||||
between these phases. This dynamic resource allocation strategy significantly
|
||||
improves overall system efficiency. The blog presents verl’s performance results,
|
||||
focusing on throughput and convergence accuracy achieved on AMD Instinct™ MI300X
|
||||
GPUs. Follow this guide to get started with verl on AMD Instinct GPUs and
|
||||
accelerate your RLHF training with ROCm-optimized performance.
|
||||
|
||||
* The `Exploring Use Cases for Scalable AI: Implementing Ray with ROCm Support for Efficient ML Workflows
|
||||
<https://rocm.blogs.amd.com/artificial-intelligence/rocm-ray/README.html>`__
|
||||
blog post describes key use cases such as training and inference for large language models (LLMs),
|
||||
model serving, hyperparameter tuning, reinforcement learning, and the orchestration of large-scale
|
||||
workloads using Ray in the ROCm environment.
|
||||
|
||||
For more use cases and recommendations, see the AMD GPU tabs in the `Accelerator Support
|
||||
topic <https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#accelerator-support>`__
|
||||
of the Ray core documentation and refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
|
||||
where you can search for Ray examples and best practices to optimize your workloads on AMD GPUs.
|
||||
|
||||
Previous versions
|
||||
===============================================================================
|
||||
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/ray-history` to find documentation for previous releases
|
||||
of the ``ROCm/ray`` Docker image.
|
||||
@@ -1,116 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: Stanford Megatron-LM compatibility
|
||||
:keywords: Stanford, Megatron-LM, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
********************************************************************************
|
||||
Stanford Megatron-LM compatibility
|
||||
********************************************************************************
|
||||
|
||||
Stanford Megatron-LM is based on Megatron-LM, a large-scale language model training framework
developed by NVIDIA at `https://github.com/NVIDIA/Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_.
It is designed to train massive transformer-based language models efficiently through model
and data parallelism.
|
||||
|
||||
It provides efficient tensor, pipeline, and sequence-based model parallelism for
|
||||
pre-training transformer-based language models such as GPT (Decoder Only), BERT
|
||||
(Encoder Only), and T5 (Encoder-Decoder).
|
||||
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of Stanford Megatron-LM is maintained in the official `https://github.com/ROCm/Stanford-Megatron-LM
|
||||
<https://github.com/ROCm/Stanford-Megatron-LM>`__ repository, which differs from the
|
||||
`https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`__ upstream repository.
|
||||
|
||||
- To get started and install Stanford Megatron-LM on ROCm, use the prebuilt :ref:`Docker image <megatron-lm-docker-compat>`,
|
||||
which includes ROCm, Stanford Megatron-LM, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm Stanford Megatron-LM installation guide <rocm-install-on-linux:install/3rd-party/stanford-megatron-lm-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `Installation guide <https://github.com/NVIDIA/Megatron-LM>`__
|
||||
for additional context.
|
||||
|
||||
.. _megatron-lm-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
|
||||
with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated
|
||||
inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Docker image
|
||||
- ROCm
|
||||
- Stanford Megatron-LM
|
||||
- PyTorch
|
||||
- Ubuntu
|
||||
- Python
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i> rocm/stanford-megatron-lm</a>
|
||||
|
||||
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
|
||||
- `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
|
||||
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
|
||||
- 24.04
|
||||
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
|
||||
- MI300X
|
||||
|
||||
Supported models and features with ROCm 6.3.0
|
||||
================================================================================
|
||||
|
||||
This section details the models and features supported by Stanford Megatron-LM on ROCm.
|
||||
|
||||
Models:
|
||||
|
||||
* BERT
|
||||
* GPT
|
||||
* T5
|
||||
* ICT
|
||||
|
||||
Features:
|
||||
|
||||
* Distributed Pre-training
|
||||
* Activation Checkpointing and Recomputation
|
||||
* Distributed Optimizer
|
||||
* Mixture-of-Experts
|
||||
|
||||
.. _megatron-lm-recommendations:
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
The following blog post mentions Megablocks, but you can run Stanford Megatron-LM with the same steps to pre-process datasets on AMD GPUs:
|
||||
|
||||
* The `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs
|
||||
<https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`__
|
||||
blog post guides how to leverage the ROCm platform for pre-training using the
|
||||
Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts
|
||||
(MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it
|
||||
demonstrates how block-sparse computations can enhance scalability and efficiency in MoE
|
||||
training. The guide provides step-by-step instructions for setting up the environment,
|
||||
including cloning the repository, building the Docker image, and running the training container.
|
||||
Additionally, it offers insights into utilizing the ``oscar-1GB.json`` dataset for pre-training
|
||||
language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE
|
||||
training workflows for large-scale transformer models.
|
||||
|
||||
It features how to pre-process datasets and how to begin pre-training on AMD GPUs through:
|
||||
|
||||
* Single-GPU pre-training
|
||||
* Multi-GPU pre-training
|
||||
@@ -1,118 +0,0 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: verl compatibility
|
||||
:keywords: GPU, verl, deep learning, framework compatibility
|
||||
|
||||
.. version-set:: rocm_version latest
|
||||
|
||||
*******************************************************************************
|
||||
verl compatibility
|
||||
*******************************************************************************
|
||||
|
||||
Volcano Engine Reinforcement Learning for LLMs (`verl <https://verl.readthedocs.io/en/latest/>`__)
|
||||
is a reinforcement learning framework designed for large language models (LLMs).
|
||||
verl offers a scalable, open-source fine-tuning solution by using a hybrid programming model
|
||||
that makes it easy to define and run complex post-training dataflows efficiently.
|
||||
|
||||
Its modular APIs separate computation from data, allowing smooth integration with other frameworks.
|
||||
It also supports flexible model placement across GPUs for efficient scaling on different cluster sizes.
|
||||
verl achieves high training and generation throughput by building on existing LLM frameworks.
|
||||
Its 3D-HybridEngine reduces memory use and communication overhead when switching between training
|
||||
and inference, improving overall performance.
|
||||
|
||||
Support overview
|
||||
================================================================================
|
||||
|
||||
- The ROCm-supported version of verl is maintained in the official `https://github.com/ROCm/verl
|
||||
<https://github.com/ROCm/verl>`__ repository, which differs from the
|
||||
`https://github.com/volcengine/verl <https://github.com/volcengine/verl>`__ upstream repository.
|
||||
|
||||
- To get started and install verl on ROCm, use the prebuilt :ref:`Docker image <verl-docker-compat>`,
|
||||
which includes ROCm, verl, and all required dependencies.
|
||||
|
||||
- See the :doc:`ROCm verl installation guide <rocm-install-on-linux:install/3rd-party/verl-install>`
|
||||
for installation and setup instructions.
|
||||
|
||||
- You can also consult the upstream `verl documentation <https://verl.readthedocs.io/en/latest/>`__
|
||||
for additional context.
|
||||
|
||||
.. _verl-docker-compat:
|
||||
|
||||
Compatibility matrix
|
||||
================================================================================
|
||||
|
||||
.. |docker-icon| raw:: html
|
||||
|
||||
<i class="fab fa-docker"></i>
|
||||
|
||||
AMD validates and publishes `verl Docker images <https://hub.docker.com/r/rocm/verl/tags>`_
|
||||
with ROCm backends on Docker Hub. The following Docker image tag and associated inventories
|
||||
represent the latest verl version from the official Docker Hub.
|
||||
Click |docker-icon| to view the image on Docker Hub.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
:class: docker-image-compatibility
|
||||
|
||||
* - Docker image
|
||||
- ROCm
|
||||
- verl
|
||||
- Ubuntu
|
||||
- PyTorch
|
||||
- Python
|
||||
- vllm
|
||||
- GPU
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/verl/verl-0.6.0.amd0_rocm7.0_vllm0.11.0.dev/images/sha256-f70a3ebc94c1f66de42a2fcc3f8a6a8d6d0881eb0e65b6958d7d6d24b3eecb0d"><i class="fab fa-docker fa-lg"></i> rocm/verl</a>
|
||||
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
|
||||
- `0.6.0 <https://github.com/volcengine/verl/releases/tag/v0.6.0>`__
|
||||
- 22.04
|
||||
- `2.9.0 <https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x>`__
|
||||
- `3.12.11 <https://www.python.org/downloads/release/python-31211/>`__
|
||||
- `0.11.0 <https://github.com/vllm-project/vllm/releases/tag/v0.11.0>`__
|
||||
- MI300X
|
||||
|
||||
* - .. raw:: html
|
||||
|
||||
<a href="https://hub.docker.com/layers/rocm/verl/verl-0.3.0.post0_rocm6.2_vllm0.6.3/images/sha256-cbe423803fd7850448b22444176bee06f4dcf22cd3c94c27732752d3a39b04b2"><i class="fab fa-docker fa-lg"></i> rocm/verl</a>
|
||||
- `6.2.0 <https://repo.radeon.com/rocm/apt/6.2/>`__
|
||||
- `0.3.0.post0 <https://github.com/volcengine/verl/releases/tag/v0.3.0.post0>`__
|
||||
- 20.04
|
||||
- `2.5.0 <https://github.com/ROCm/pytorch/tree/release/2.5>`__
|
||||
- `3.9.19 <https://www.python.org/downloads/release/python-3919/>`__
|
||||
- `0.6.3 <https://github.com/vllm-project/vllm/releases/tag/v0.6.3>`__
|
||||
- MI300X
|
||||
|
||||
.. _verl-supported_features:
|
||||
|
||||
Supported modules with verl on ROCm
|
||||
===============================================================================
|
||||
|
||||
The following GPU-accelerated modules are supported with verl on ROCm:
|
||||
|
||||
- ``FSDP``: Training engine
|
||||
- ``vllm``: Inference engine
|
||||
|
||||
.. _verl-recommendations:
|
||||
|
||||
Use cases and recommendations
|
||||
================================================================================
|
||||
|
||||
* The benefits of verl in large-scale reinforcement learning from human feedback
|
||||
(RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD
|
||||
GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__
|
||||
blog. The blog post outlines how the Volcano Engine Reinforcement Learning
|
||||
(verl) framework integrates with the AMD ROCm platform to optimize training on
|
||||
AMD Instinct™ GPUs. The guide details the process of building a Docker image,
|
||||
setting up single-node and multi-node training environments, and highlights
|
||||
performance benchmarks demonstrating improved throughput and convergence accuracy.
|
||||
This resource serves as a comprehensive starting point for deploying verl on AMD GPUs,
|
||||
facilitating efficient RLHF training workflows.
|
||||
|
||||
Previous versions
|
||||
===============================================================================
|
||||
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/verl-history` to find documentation for previous releases
|
||||
of the ``ROCm/verl`` Docker image.
|
||||
@@ -107,13 +107,7 @@ article_pages = [
|
||||
{"file": "compatibility/ml-compatibility/pytorch-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/tensorflow-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/jax-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/verl-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/stanford-megatron-lm-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/dgl-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/megablocks-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/ray-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/llama-cpp-compatibility", "os": ["linux"]},
|
||||
{"file": "compatibility/ml-compatibility/flashinfer-compatibility", "os": ["linux"]},
|
||||
{"file": "how-to/deep-learning-rocm", "os": ["linux"]},
|
||||
|
||||
{"file": "how-to/rocm-for-ai/index", "os": ["linux"]},
|
||||
|
||||
@@ -1,14 +1,14 @@
|
||||
docker:
|
||||
pull_tag: rocm/primus:v26.1
|
||||
pull_tag: rocm/primus:v26.2
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
|
||||
components:
|
||||
ROCm: 7.1.0
|
||||
PyTorch: 2.10.0.dev20251112+rocm7.1
|
||||
Python: "3.10"
|
||||
Transformer Engine: 2.6.0.dev0+f141f34b
|
||||
ROCm: 7.2.0
|
||||
PyTorch: 2.10.0+git94c6e04
|
||||
Python: "3.12.3"
|
||||
Transformer Engine: 2.8.0.dev0+51f74fa7
|
||||
Flash Attention: 2.8.3
|
||||
hipBLASLt: 34459f66ea
|
||||
Triton: 3.4.0
|
||||
Triton: 3.5.0
|
||||
RCCL: 2.27.7
|
||||
model_groups:
|
||||
- group: Meta Llama
|
||||
|
||||
@@ -0,0 +1,58 @@
docker:
  pull_tag: rocm/primus:v26.1
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.6.0.dev0+f141f34b
  Flash Attention: 2.8.3
  hipBLASLt: 34459f66ea
  Triton: 3.4.0
  RCCL: 2.27.7
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 3.3 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
        config_name: llama3.3_70B-pretrain.yaml
      - model: Llama 3.1 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
        config_name: llama3.1_70B-pretrain.yaml
      - model: Llama 3.1 8B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
        config_name: llama3.1_8B-pretrain.yaml
      - model: Llama 2 7B
        mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
        config_name: llama2_7B-pretrain.yaml
      - model: Llama 2 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
        config_name: llama2_70B-pretrain.yaml
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek-V3 (proxy)
        mad_tag: primus_pyt_megatron_lm_train_deepseek-v3-proxy
        config_name: deepseek_v3-pretrain.yaml
      - model: DeepSeek-V2-Lite
        mad_tag: primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
        config_name: deepseek_v2_lite-pretrain.yaml
  - group: Mistral AI
    tag: mistral
    models:
      - model: Mixtral 8x7B
        mad_tag: primus_pyt_megatron_lm_train_mixtral-8x7b
        config_name: mixtral_8x7B_v0.1-pretrain.yaml
      - model: Mixtral 8x22B (proxy)
        mad_tag: primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
        config_name: mixtral_8x22B_v0.1-pretrain.yaml
  - group: Qwen
    tag: qwen
    models:
      - model: Qwen 2.5 7B
        mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
        config_name: primus_qwen2.5_7B-pretrain.yaml
      - model: Qwen 2.5 72B
        mad_tag: primus_pyt_megatron_lm_train_qwen2.5-72b
        config_name: qwen2.5_72B-pretrain.yaml
@@ -0,0 +1,32 @@
docker:
  pull_tag: rocm/primus:v26.1
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.6.0.dev0+f141f34b
  Flash Attention: 2.8.3
  hipBLASLt: 34459f66ea
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 3.1 8B
        mad_tag: primus_pyt_train_llama-3.1-8b
        model_repo: Llama-3.1-8B
        url: https://huggingface.co/meta-llama/Llama-3.1-8B
        precision: BF16
      - model: Llama 3.1 70B
        mad_tag: primus_pyt_train_llama-3.1-70b
        model_repo: Llama-3.1-70B
        url: https://huggingface.co/meta-llama/Llama-3.1-70B
        precision: BF16
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek V3 16B
        mad_tag: primus_pyt_train_deepseek-v3-16b
        model_repo: DeepSeek-V3
        url: https://huggingface.co/deepseek-ai/DeepSeek-V3
        precision: BF16
@@ -1,14 +1,14 @@
|
||||
docker:
|
||||
pull_tag: rocm/primus:v26.1
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
|
||||
pull_tag: rocm/primus:v26.2
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.2/images/sha256-9148d1bfcd579bf92f44bd89090e0d8c958f149c134b4b34b9674ab559244585
|
||||
components:
|
||||
ROCm: 7.1.0
|
||||
PyTorch: 2.10.0.dev20251112+rocm7.1
|
||||
Python: "3.10"
|
||||
Transformer Engine: 2.6.0.dev0+f141f34b
|
||||
ROCm: 7.2.0
|
||||
PyTorch: 2.10.0a0+git449b176
|
||||
Python: "3.12.3"
|
||||
Transformer Engine: 2.8.0.dev0+51f74fa7
|
||||
Flash Attention: 2.8.3
|
||||
hipBLASLt: 34459f66ea
|
||||
Triton: 3.4.0
|
||||
hipBLASLt: 1.2.0-de5c1aebb6
|
||||
Triton: 3.6.0
|
||||
RCCL: 2.27.7
|
||||
model_groups:
|
||||
- group: Meta Llama
|
||||
@@ -17,18 +17,30 @@ model_groups:
|
||||
- model: Llama 3.3 70B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
|
||||
config_name: llama3.3_70B-pretrain.yaml
|
||||
- model: Llama 3.1 70B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
|
||||
config_name: llama3.1_70B-pretrain.yaml
|
||||
- model: Llama 3.1 8B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
|
||||
config_name: llama3.1_8B-pretrain.yaml
|
||||
- model: Llama 3.1 70B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
|
||||
config_name: llama3.1_70B-pretrain.yaml
|
||||
- model: Llama 2 7B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
|
||||
config_name: llama2_7B-pretrain.yaml
|
||||
- model: Llama 2 70B
|
||||
mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
|
||||
config_name: llama2_70B-pretrain.yaml
|
||||
- group: AMD Zebra-Llama
|
||||
tag: zebra-llama
|
||||
models:
|
||||
- model: Zebra-Llama 1B
|
||||
mad_tag: primus_pyt_megatron_lm_train_zebra-llama-1b
|
||||
config_name: zebra_llama_1b-pretrain.yaml
|
||||
- model: Zebra-Llama 3B
|
||||
mad_tag: primus_pyt_megatron_lm_train_zebra-llama-3b
|
||||
config_name: zebra_llama_3b-pretrain.yaml
|
||||
- model: Zebra-Llama 8B
|
||||
mad_tag: primus_pyt_megatron_lm_train_zebra-llama-8b
|
||||
config_name: zebra_llama_8b-pretrain.yaml
|
||||
- group: DeepSeek
|
||||
tag: deepseek
|
||||
models:
|
||||
@@ -50,6 +62,11 @@ model_groups:
|
||||
- group: Qwen
|
||||
tag: qwen
|
||||
models:
|
||||
- model: Qwen 3 32B SFT
|
||||
mad_tag: primus_pyt_megatron_lm_train_qwen3-32b-sft
|
||||
- model: Qwen 3 32B LoRA
|
||||
mad_tag: primus_pyt_megatron_lm_train_qwen3-32b-lora
|
||||
config_name: primus_qwen2.5_7B-pretrain.yaml
|
||||
- model: Qwen 2.5 7B
|
||||
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
|
||||
config_name: primus_qwen2.5_7B-pretrain.yaml
|
||||
|
||||
@@ -1,13 +1,15 @@
|
||||
docker:
|
||||
pull_tag: rocm/primus:v26.1
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
|
||||
pull_tag: rocm/primus:v26.2
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.2/images/sha256-9148d1bfcd579bf92f44bd89090e0d8c958f149c134b4b34b9674ab559244585
|
||||
components:
|
||||
ROCm: 7.1.0
|
||||
PyTorch: 2.10.0.dev20251112+rocm7.1
|
||||
Python: "3.10"
|
||||
Transformer Engine: 2.6.0.dev0+f141f34b
|
||||
ROCm: 7.2.0
|
||||
PyTorch: 2.10.0a0+git449b176
|
||||
Python: "3.12.3"
|
||||
Transformer Engine: 2.8.0.dev0+51f74fa7
|
||||
Flash Attention: 2.8.3
|
||||
hipBLASLt: 34459f66ea
|
||||
hipBLASLt: 1.2.0-de5c1aebb6
|
||||
Triton: 3.6.0
|
||||
RCCL: 2.27.7
|
||||
model_groups:
|
||||
- group: Meta Llama
|
||||
tag: llama
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
docker:
|
||||
pull_tag: rocm/primus:v26.1
|
||||
pull_tag: rocm/primus:v26.2
|
||||
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
|
||||
components:
|
||||
ROCm: 7.1.0
|
||||
PyTorch: 2.10.0.dev20251112+rocm7.1
|
||||
Python: "3.10"
|
||||
Transformer Engine: 2.6.0.dev0+f141f34b
|
||||
ROCm: 7.2.0
|
||||
PyTorch: 2.10.0+git94c6e04
|
||||
Python: "3.12.3"
|
||||
Transformer Engine: 2.8.0.dev0+51f74fa7
|
||||
Flash Attention: 2.8.3
|
||||
hipBLASLt: 34459f66ea
|
||||
model_groups:
|
||||
|
||||
@@ -52,22 +52,6 @@ The table below summarizes information about ROCm-enabled deep learning framewor
|
||||
|
||||
<a href="https://github.com/ROCm/jax"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/verl-install>`
|
||||
-
|
||||
- Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/verl"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/stanford-megatron-lm-install>`
|
||||
-
|
||||
- Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/Stanford-Megatron-LM"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/dgl-install>`
|
||||
-
|
||||
@@ -76,42 +60,6 @@ The table below summarizes information about ROCm-enabled deep learning framewor
|
||||
|
||||
<a href="https://github.com/ROCm/dgl"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/megablocks-install>`
|
||||
-
|
||||
- Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/megablocks"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/ray-install>`
|
||||
-
|
||||
- Docker image
|
||||
- Wheels package
|
||||
- ROCm Base Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/ray"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/llama-cpp-install>`
|
||||
-
|
||||
- Docker image
|
||||
- ROCm Base Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/llama.cpp"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
* - :doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>`
|
||||
- :doc:`link <rocm-install-on-linux:install/3rd-party/flashinfer-install>`
|
||||
-
|
||||
- Docker image
|
||||
- ROCm Base Docker image
|
||||
- .. raw:: html
|
||||
|
||||
<a href="https://github.com/ROCm/flashinfer"><i class="fab fa-github fa-lg"></i></a>
|
||||
|
||||
Learn how to use your ROCm deep learning environment for training, fine-tuning, inference, and performance optimization
|
||||
through the following guides.
|
||||
|
||||
|
||||
@@ -7,7 +7,7 @@ Megatron-LM training performance testing version history
|
||||
This table lists previous versions of the ROCm Megatron-LM training Docker image for
|
||||
training performance testing. For detailed information about available models
|
||||
for benchmarking, see the version-specific documentation. You can find tagged
|
||||
previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https://hub.docker.com/r/rocm/megatron-lm/tags>`__.
|
||||
previous releases of the ``ROCm/primus`` Docker image on `Docker Hub <https://hub.docker.com/r/rocm/megatron-lm/tags>`__.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
@@ -16,13 +16,20 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
|
||||
- Components
|
||||
- Resources
|
||||
|
||||
* - v26.1 (latest)
|
||||
* - v26.2 (latest)
|
||||
-
|
||||
* ROCm 7.2.0
|
||||
* PyTorch 2.10.0+git94c6e04
|
||||
-
|
||||
* :doc:`Primus Megatron documentation <../primus-megatron>`
|
||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.2/images/sha256-9148d1bfcd579bf92f44bd89090e0d8c958f149c134b4b34b9674ab559244585>`__
|
||||
|
||||
* - v26.1
|
||||
-
|
||||
* ROCm 7.1.0
|
||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||
-
|
||||
* :doc:`Primus Megatron documentation <../primus-megatron>`
|
||||
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
|
||||
* :doc:`Primus Megatron documentation <primus-megatron-v26.1>`
|
||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
|
||||
|
||||
* - v25.11
|
||||
@@ -31,7 +38,7 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
|
||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||
-
|
||||
* :doc:`Primus Megatron documentation <primus-megatron-v25.11>`
|
||||
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.10>`
|
||||
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.11>`
|
||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
|
||||
|
||||
* - v25.10
|
||||
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,457 @@
|
||||
:orphan:
|
||||
|
||||
.. meta::
|
||||
:description: How to train a model using PyTorch for ROCm.
|
||||
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
|
||||
|
||||
****************************************
|
||||
Training a model with Primus and PyTorch
|
||||
****************************************
|
||||
|
||||
.. caution::
|
||||
|
||||
This documentation does not reflect the latest version of ROCm Primus PyTorch training
|
||||
performance benchmark documentation. See :doc:`../primus-pytorch` for the latest version.
|
||||
|
||||
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
|
||||
LLM training framework. It streamlines LLM
|
||||
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
|
||||
Primus now supports the PyTorch torchtitan backend.
|
||||
|
||||
.. note::
|
||||
|
||||
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
|
||||
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
|
||||
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
||||
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
||||
including torchtitan and :doc:`Megatron-LM <primus-megatron>`.
|
||||
|
||||
Primus with the PyTorch torchtitan backend is designed to replace the
|
||||
:doc:`ROCm PyTorch training <pytorch-training>` workflow. See
|
||||
:doc:`pytorch-training` to see steps to run workloads without Primus.
|
||||
|
||||
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
|
||||
MI300X GPUs containing essential components for Primus and PyTorch training
|
||||
with Primus Turbo optimizations.
|
||||
|
||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v26.1-benchmark-models.yaml
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: {{ data.docker.pull_tag }}
|
||||
:sync: {{ data.docker.pull_tag }}
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
|
||||
* - Software component
|
||||
- Version
|
||||
|
||||
{% for component_name, component_version in data.docker.components.items() %}
|
||||
* - {{ component_name }}
|
||||
- {{ component_version }}
|
||||
{% endfor %}
|
||||
|
||||
.. _amd-primus-pytorch-model-support-v26.01:
|
||||
|
||||
Supported models
|
||||
================
|
||||
|
||||
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
|
||||
Some instructions, commands, and training recommendations in this documentation might
|
||||
vary by model -- select one to get started.
|
||||
|
||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v26.1-benchmark-models.yaml
|
||||
|
||||
{% set model_groups = data.model_groups %}
|
||||
.. raw:: html
|
||||
|
||||
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
|
||||
<div class="row gx-0">
|
||||
<div class="col-2 me-1 px-2 model-param-head">Model</div>
|
||||
<div class="row col-10 pe-0">
|
||||
{% for model_group in model_groups %}
|
||||
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
|
||||
{% endfor %}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="row gx-0 pt-1">
|
||||
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
|
||||
<div class="row col-10 pe-0">
|
||||
{% for model_group in model_groups %}
|
||||
{% set models = model_group.models %}
|
||||
{% for model in models %}
|
||||
{% if models|length % 3 == 0 %}
|
||||
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||
{% else %}
|
||||
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
{% endfor %}
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
.. seealso::
|
||||
|
||||
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
|
||||
see :doc:`pytorch-training` (without Primus).
|
||||
|
||||
.. _amd-primus-pytorch-performance-measurements-v26.01:
|
||||
|
||||
System validation
|
||||
=================
|
||||
|
||||
Before running AI workloads, it's important to validate that your AMD hardware is configured
|
||||
correctly and performing optimally.
|
||||
|
||||
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
|
||||
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
|
||||
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
|
||||
before starting training.
|
||||
|
||||
To test for optimal performance, consult the recommended :ref:`System health benchmarks
|
||||
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||
system's configuration.
|
||||
|
||||
This Docker image is optimized for specific model configurations outlined
|
||||
below. Performance can vary for other training workloads, as AMD
|
||||
doesn’t test configurations and run conditions outside those described.
|
||||
|
||||
Pull the Docker image
|
||||
=====================
|
||||
|
||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v26.1-benchmark-models.yaml
|
||||
|
||||
Use the following command to pull the Docker image from Docker Hub.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
docker pull {{ data.docker.pull_tag }}
|
||||
|
||||
Run training
|
||||
============
|
||||
|
||||
Once the setup is complete, choose between the following two workflows to start benchmarking training.
|
||||
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus).
|
||||
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
|
||||
tweak some configurations (such as batch sizes).
|
||||
|
||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v26.1-benchmark-models.yaml
|
||||
|
||||
{% set docker = data.docker %}
|
||||
{% set model_groups = data.model_groups %}
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: Primus benchmarking
|
||||
|
||||
{% for model_group in model_groups %}
|
||||
{% for model in model_group.models %}
|
||||
|
||||
.. container:: model-doc {{ model.mad_tag }}
|
||||
|
||||
The following run commands are tailored to {{ model.model }}.
|
||||
See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
|
||||
|
||||
.. rubric:: Download the Docker image and required packages
|
||||
|
||||
1. Pull the ``{{ docker.pull_tag }}`` Docker image from Docker Hub.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
docker pull {{ docker.pull_tag }}
|
||||
|
||||
2. Run the Docker container.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
docker run -it \
|
||||
--device /dev/dri \
|
||||
--device /dev/kfd \
|
||||
--network host \
|
||||
--ipc host \
|
||||
--group-add video \
|
||||
--cap-add SYS_PTRACE \
|
||||
--security-opt seccomp=unconfined \
|
||||
--privileged \
|
||||
-v $HOME:$HOME \
|
||||
-v $HOME/.ssh:/root/.ssh \
|
||||
--shm-size 64G \
|
||||
--name training_env \
|
||||
{{ docker.pull_tag }}
|
||||
|
||||
Use these commands if you exit the ``training_env`` container and need to return to it.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
docker start training_env
|
||||
docker exec -it training_env bash
|
||||
|
||||
The Docker container hosts verified commit ``9c529cd4`` of the `Primus
|
||||
<https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/>`__ repository.
|
||||
|
||||
.. rubric:: Prepare training datasets and dependencies
|
||||
|
||||
The following benchmarking examples require downloading models and datasets
|
||||
from Hugging Face. To ensure successful access to gated repos, set your
|
||||
``HF_TOKEN``.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export HF_TOKEN=$your_personal_hugging_face_access_token
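Before launching a long run, you can optionally confirm that the token is valid and grants access to the gated repositories. This is a minimal check and assumes the ``huggingface_hub`` Python package is available in the container (install it with ``pip install huggingface_hub`` if it is not):

.. code-block:: shell

   # Prints the account name associated with HF_TOKEN if the token is valid
   python -c "from huggingface_hub import whoami; print(whoami()['name'])"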
|
||||
|
||||
.. rubric:: Pretraining
|
||||
|
||||
To get started, navigate to the ``Primus`` directory in your container.
|
||||
|
||||
.. code-block::
|
||||
|
||||
cd /workspace/Primus
|
||||
|
||||
Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
|
||||
included with Primus with the appropriate options.
|
||||
|
||||
.. rubric:: Benchmarking examples
|
||||
|
||||
.. container:: model-doc primus_pyt_train_llama-3.1-8b
|
||||
|
||||
Use the following command to train Llama 3.1 8B with BF16 precision using Primus torchtitan.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI325X
|
||||
:sync: MI325X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
|
||||
--training.local_batch_size 6
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
|
||||
|
||||
To train Llama 3.1 8B with FP8 precision, use the following command.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI325X
|
||||
:sync: MI325X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
|
||||
--training.local_batch_size 7
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_8B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_train_llama-3.1-70b
|
||||
|
||||
Use the following command to train Llama 3.1 70B with BF16 precision using Primus torchtitan.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI325X
|
||||
:sync: MI325X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
|
||||
--training.local_batch_size 6
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
|
||||
|
||||
To train Llama 3.1 70B with FP8 precision, use the following command.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI325X
|
||||
:sync: MI325X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
|
||||
--training.local_batch_size 5
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B_fp8.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_train_deepseek-v3-16b
|
||||
|
||||
Use the following command to train DeepSeek V3 16B with BF16 precision using Primus torchtitan.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_deepseek_v3_16b.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI325X
|
||||
:sync: MI325X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_deepseek_v3_16b.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
|
||||
--training.local_batch_size 10
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_deepseek_v3_16b.log \
|
||||
-- train pretrain \
|
||||
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml
|
||||
{% endfor %}
|
||||
{% endfor %}
|
||||
|
||||
.. tab-item:: MAD-integrated benchmarking
|
||||
|
||||
{% for model_group in model_groups %}
|
||||
{% for model in model_group.models %}
|
||||
|
||||
.. container:: model-doc {{ model.mad_tag }}
|
||||
|
||||
The following run command is tailored to {{ model.model }}.
|
||||
See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
|
||||
|
||||
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
|
||||
directory and install the required packages on the host machine.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
git clone https://github.com/ROCm/MAD
|
||||
cd MAD
|
||||
pip install -r requirements.txt
|
||||
|
||||
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
|
||||
using one node with the {{ model.precision }} data type on the host machine.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
|
||||
madengine run \
|
||||
--tags {{ model.mad_tag }} \
|
||||
--keep-model-dir \
|
||||
--live-output \
|
||||
--timeout 28800
|
||||
|
||||
MAD launches a Docker container with the name
|
||||
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
|
||||
model are collected in ``~/MAD/perf.csv``.
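To skim the collected results from the command line, you can pretty-print the report. This is a minimal sketch and assumes ``perf.csv`` remains a plain comma-separated file:

.. code-block:: shell

   # Align the CSV columns and page through the results
   column -s, -t < ~/MAD/perf.csv | less -S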
|
||||
|
||||
{% endfor %}
|
||||
{% endfor %}
|
||||
|
||||
Further reading
|
||||
===============
|
||||
|
||||
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
|
||||
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
|
||||
|
||||
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
|
||||
|
||||
- To learn more about system settings and management practices to configure your system for
|
||||
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
|
||||
|
||||
- For a list of other ready-made Docker images for AI with ROCm, see
|
||||
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
|
||||
|
||||
Previous versions
|
||||
=================
|
||||
|
||||
See :doc:`pytorch-training-history` to find documentation for previous releases
|
||||
of the ``ROCm/pytorch-training`` Docker image.
|
||||
@@ -7,7 +7,7 @@ PyTorch training performance testing version history
|
||||
This table lists previous versions of the ROCm PyTorch training Docker image for
|
||||
training performance testing. For detailed information about available models
|
||||
for benchmarking, see the version-specific documentation. You can find tagged
|
||||
previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <https://hub.docker.com/r/rocm/pytorch-training/tags>`_.
|
||||
previous releases of the ``ROCm/primus`` Docker image on `Docker Hub <https://hub.docker.com/r/rocm/pytorch-training/tags>`_.
|
||||
|
||||
.. list-table::
|
||||
:header-rows: 1
|
||||
@@ -16,13 +16,20 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
|
||||
- Components
|
||||
- Resources
|
||||
|
||||
* - v26.1 (latest)
|
||||
* - v26.2 (latest)
|
||||
-
|
||||
* ROCm 7.2.0
|
||||
* PyTorch 2.10.0+git94c6e04
|
||||
-
|
||||
* :doc:`Primus PyTorch training documentation <../primus-pytorch>`
|
||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.2/images/sha256-9148d1bfcd579bf92f44bd89090e0d8c958f149c134b4b34b9674ab559244585>`__
|
||||
|
||||
* - v26.1
|
||||
-
|
||||
* ROCm 7.1.0
|
||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||
-
|
||||
* :doc:`Primus PyTorch training documentation <../primus-megatron>`
|
||||
* :doc:`PyTorch training (legacy) documentation <../megatron-lm>`
|
||||
* :doc:`Primus PyTorch training documentation <primus-pytorch-v26.1>`
|
||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
|
||||
|
||||
* - v25.11
|
||||
|
||||
@@ -47,7 +47,7 @@ Megatron-LM.
|
||||
- {{ component_version }}
|
||||
{% endfor %}
|
||||
|
||||
.. _amd-primus-megatron-lm-model-support-v26.01:
|
||||
.. _amd-primus-megatron-lm-model-support-v26.2:
|
||||
|
||||
Supported models
|
||||
================
|
||||
@@ -65,9 +65,21 @@ might vary by model -- select one to get started.
|
||||
<div class="row gx-0">
|
||||
<div class="col-2 me-1 px-2 model-param-head">Model</div>
|
||||
<div class="row col-10 pe-0">
|
||||
{% for model_group in model_groups %}
|
||||
<div class="col-3 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
|
||||
{% endfor %}
|
||||
{% set tag = "llama" %}
|
||||
{% set group = "Meta Llama" %}
|
||||
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ tag }}" tabindex="0">{{ group }}</div>
|
||||
{% set tag = "zebra-llama" %}
|
||||
{% set group = "AMD Zebra-Llama" %}
|
||||
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ tag }}" tabindex="0">{{ group }}</div>
|
||||
{% set tag = "deepseek" %}
|
||||
{% set group = "DeepSeek" %}
|
||||
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ tag }}" tabindex="0">{{ group }}</div>
|
||||
{% set tag = "mistral" %}
|
||||
{% set group = "Mistral AI" %}
|
||||
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ tag }}" tabindex="0">{{ group }}</div>
|
||||
{% set tag = "qwen" %}
|
||||
{% set group = "Qwen" %}
|
||||
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ tag }}" tabindex="0">{{ group }}</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
@@ -108,7 +120,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
|
||||
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||
system's configuration.
|
||||
|
||||
.. _mi300x-amd-primus-megatron-lm-training-v26.01:
|
||||
.. _mi300x-amd-primus-megatron-lm-training-v26.2:
|
||||
|
||||
Environment setup
|
||||
=================
|
||||
@@ -118,7 +130,7 @@ Environment setup
|
||||
Use the following instructions to set up the environment, configure the script to train models, and
|
||||
reproduce the benchmark results on AMD Instinct GPUs.
|
||||
|
||||
.. _amd-primus-megatron-lm-requirements-v26.01:
|
||||
.. _amd-primus-megatron-lm-requirements-v26.2:
|
||||
|
||||
Pull the Docker image
|
||||
|
||||
@@ -160,7 +172,7 @@ Pull the Docker image
|
||||
The Docker container hosts verified commit ``9c529cd4`` of the `Primus
|
||||
<https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167>`__ repository.
|
||||
|
||||
.. _amd-primus-megatron-lm-environment-setup-v26.01:
|
||||
.. _amd-primus-megatron-lm-environment-setup-v26.2:
|
||||
|
||||
Configuration
|
||||
=============
|
||||
@@ -207,7 +219,7 @@ You can use either mock data or real data for training.
|
||||
|
||||
Ensure that the files are accessible inside the Docker container.
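One way to make host-side data visible inside the container is to copy it in after the container is started. The host path and container name below are hypothetical placeholders; adjust them to your setup:

.. code-block:: shell

   # Copies the dataset directory into /workspace inside the running container
   docker cp /data/my-dataset primus_training_env:/workspace/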
|
||||
|
||||
.. _amd-primus-megatron-lm-tokenizer-v26.01:
|
||||
.. _amd-primus-megatron-lm-tokenizer-v26.2:
|
||||
|
||||
Tokenizer
|
||||
---------
|
||||
@@ -220,7 +232,7 @@ right permissions to access the tokenizer for each model.
|
||||
# Export your HF_TOKEN in the workspace
|
||||
export HF_TOKEN=<your_hftoken>
|
||||
|
||||
.. _amd-primus-megatron-lm-run-training-v26.01:
|
||||
.. _amd-primus-megatron-lm-run-training-v26.2:
|
||||
|
||||
Run training
|
||||
============
|
||||
@@ -237,14 +249,12 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
.. code-block:: shell
|
||||
|
||||
pip install -r requirements.txt
|
||||
export HSA_NO_SCRATCH_RECLAIM=1
|
||||
export NVTE_CK_USES_BWD_V3=1
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 3.3 70B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run pre-training for Llama 3.3 70B BF16, run:
|
||||
|
||||
@@ -279,7 +289,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 3.1 8B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run pre-training for Llama 3.1 8B FP8, run:
|
||||
|
||||
@@ -343,7 +353,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 3.1 70B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run pre-training for Llama 3.1 70B BF16, run:
|
||||
|
||||
@@ -357,7 +367,9 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_llama3.1_70B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
|
||||
--config examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
|
||||
--micro_batch_size 8 \
|
||||
--global_batch_size 128
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
@@ -417,7 +429,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 2 7B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run pre-training for Llama 2 7B FP8, run:
|
||||
|
||||
@@ -481,7 +493,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 2 70B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run pre-training for Llama 2 70B BF16, run:
|
||||
|
||||
@@ -516,7 +528,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to DeepSeek-V3.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy,
|
||||
use the following command:
|
||||
@@ -536,7 +548,9 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
--moe_layer_freq 1 \
|
||||
--train_iters 50 \
|
||||
--micro_batch_size 8 \
|
||||
--global_batch_size 64
|
||||
--global_batch_size 64 \
|
||||
--moe_use_fused_router_with_aux_score True \
|
||||
--moe_permute_fusion True
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
@@ -562,7 +576,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to DeepSeek-V2-Lite.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
|
||||
use the following command:
|
||||
@@ -577,7 +591,11 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_deepseek_v2_lite.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs//MI355X/deepseek_v2_lite-BF16-pretrain.yaml
|
||||
--config examples/megatron/configs//MI355X/deepseek_v2_lite-BF16-pretrain.yaml \
|
||||
--use_turbo_grouped_mlp False \
|
||||
--moe_use_legacy_grouped_gemm True \
|
||||
--moe_use_fused_router_with_aux_score True \
|
||||
--moe_permute_fusion True
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
@@ -598,7 +616,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Mixtral 8x7B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
|
||||
use the following command:
|
||||
@@ -634,7 +652,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Mixtral 8x22B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy,
|
||||
use the following command:
|
||||
@@ -671,11 +689,83 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
--global_batch_size 16 \
|
||||
--train_iters 50
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_qwen3-32b-lora
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to post-training Qwen 3 32B (LoRA).
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for Qwen 3 32B BF16 (LoRA), use the following
|
||||
command:
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI350X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_qwen3_32b.log \
|
||||
-- train posttrain \
|
||||
--config examples/megatron_bridge/configs/MI355X/qwen3_32b_lora_posttrain.yaml
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Set the variables for better performance
|
||||
# only on MI325X and MI300X
|
||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_qwen3_32b.log \
|
||||
-- train posttrain \
|
||||
--config examples/megatron_bridge/configs/MI300X/qwen3_32b_lora_posttrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_qwen3-32b-sft
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to post-training Qwen 3 32B (SFT).
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for Qwen 3 32B BF16 (SFT), use the following
|
||||
command:
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI350X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_qwen3_32b_sft.log \
|
||||
-- train posttrain \
|
||||
--config examples/megatron_bridge/configs/MI355X/qwen3_32b_sft_posttrain.yaml
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Set the variables for better performance
|
||||
# only on MI325X and MI300X
|
||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||
|
||||
bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_qwen3_32b_sft.log \
|
||||
-- train posttrain \
|
||||
--config examples/megatron_bridge/configs/MI300X/qwen3_32b_sft_posttrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Qwen 2.5 7B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run training on a single node for Qwen 2.5 7B BF16, use the following
|
||||
command:
|
||||
@@ -740,7 +830,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Qwen 2.5 72B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run the training on a single node for Qwen 2.5 72B BF16, use the following command.
|
||||
|
||||
@@ -771,7 +861,112 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml
|
||||
|
||||
.. _amd-primus-megatron-multi-node-examples-v26.01:
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_zebra-llama-1b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Zebra-Llama 1B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run the training on a single node for AMD Zebra-Llama 1B BF16, use the following command.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI350X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_1B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI355X/zebra_llama_1B-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Set the variables for better performance
|
||||
# only on MI325X and MI300X
|
||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_1B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI300X/zebra_llama_1B-pretrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_zebra-llama-3b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Zebra-Llama 3B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run the training on a single node for AMD Zebra-Llama 3B BF16, use the following command.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI350X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_3B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI355X/zebra_llama_3B-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Set the variables for better performance
|
||||
# only on MI325X and MI300X
|
||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_3B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI300X/zebra_llama_3B-pretrain.yaml
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_zebra-llama-8b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Zebra-Llama 8B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To run the training on a single node for AMD Zebra-Llama 8B BF16, use the following command.
|
||||
|
||||
.. tab-set::
|
||||
|
||||
.. tab-item:: MI355X and MI350X
|
||||
:sync: MI355X and MI350X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_8B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI355X/zebra_llama_8B-pretrain.yaml
|
||||
|
||||
.. tab-item:: MI300X
|
||||
:sync: MI325X and MI300X
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
# Set the variables for better performance
|
||||
# only on MI325X and MI300X
|
||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||
|
||||
PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
|
||||
--log_file /tmp/primus_zebra_llama_8B.log \
|
||||
-- train pretrain \
|
||||
--config examples/megatron/configs/MI300X/zebra_llama_8B-pretrain.yaml
|
||||
|
||||
.. _amd-primus-megatron-multi-node-examples-v26.2:
|
||||
|
||||
Multi-node training examples
|
||||
----------------------------
|
||||
@@ -789,14 +984,11 @@ to launch the multi-node workload. Use the following steps to setup your environ
|
||||
.. code-block:: shell
|
||||
|
||||
git clone --recurse-submodules https://github.com/AMD-AGI/Primus.git
|
||||
cd Primus
|
||||
git checkout c4c083de64ba3e8f19ccc9629411267108931f9e
|
||||
cd Primus/
|
||||
git checkout 44f780d
|
||||
git submodule update --init --recursive
|
||||
|
||||
export DOCKER_IMAGE={{ docker.pull_tag }}
|
||||
export HF_TOKEN=<your_HF_token>
|
||||
export HSA_NO_SCRATCH_RECLAIM=1
|
||||
export NVTE_CK_USES_BWD_V3=1
|
||||
export NCCL_IB_HCA=<your_NCCL_IB_HCA> # specify which RDMA interfaces to use for communication
|
||||
export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME> # your Network Interface
|
||||
export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME> # your Network Interface
|
||||
@@ -813,13 +1005,13 @@ to launch the multi-node workload. Use the following steps to setup your environ
|
||||
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus tries to auto-detect them. However, because NICs vary across clusters, it is best to explicitly export the NCCL parameters for your cluster.
|
||||
* To find your network interface, you can use ``ip a``.
|
||||
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices (see the sketch after this list).
|
||||
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v26.01`) as appropriate.
|
||||
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v26.2`) as appropriate.
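The following is a minimal sketch of how the discovery commands and exports fit together. The interface and HCA names are hypothetical placeholders; replace them with the values reported on your own nodes:

.. code-block:: shell

   ip -brief addr show   # pick the interface that carries your cluster network
   ibv_devices           # list RDMA/IB devices, for example mlx5_0 ... mlx5_7

   # Hypothetical values -- substitute the names found above
   export NCCL_SOCKET_IFNAME=ens51np0
   export GLOO_SOCKET_IFNAME=ens51np0
   export NCCL_IB_HCA=mlx5_0,mlx5_1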
|
||||
|
||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 3.1 8B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To train Llama 3.1 8B FP8 on 8 nodes, run:
|
||||
|
||||
@@ -836,7 +1028,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
|
||||
|
||||
Once setup is complete, run the appropriate training command.
|
||||
The following run commands are tailored to Llama 2 7B.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||
See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.
|
||||
|
||||
To train Llama 2 7B FP8 on 8 nodes, run:
|
||||
|
||||
@@ -853,7 +1045,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.1 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.

   To train Llama 3.1 70B FP8 on 8 nodes, run:
@@ -883,7 +1075,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 2 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.

   To train Llama 2 70B FP8 on 8 nodes, run:
@@ -913,7 +1105,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Llama 3.3 70B.
   See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.

   To train Llama 3.3 70B FP8 on 8 nodes, run:
@@ -943,7 +1135,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Mixtral 8x7B.
   See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.

   To train Mixtral 8x7B BF16 on 8 nodes, run:
@@ -961,7 +1153,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

   Once setup is complete, run the appropriate training command.
   The following run commands are tailored to Qwen2.5 72B.
   See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-megatron-lm-model-support-v26.2` to switch to another available model.

   To train Qwen2.5 72B FP8 on 8 nodes, run:
@@ -976,7 +1168,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

      --global_batch_size 512 \
      --recompute_num_layers 80 \

.. _amd-primus-megatron-lm-benchmark-test-vars-v26.01:
.. _amd-primus-megatron-lm-benchmark-test-vars-v26.2:

Key options
-----------
@@ -1018,14 +1210,6 @@ recompute_granularity

num_layers
   For using a reduced number of layers, as with proxy models.
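
As a rough sketch of how these options can be passed, overrides follow the same ``--<option> <value>`` pattern used in the earlier examples. ``--global_batch_size`` and ``--recompute_num_layers`` appear verbatim in this guide; ``--num_layers`` and the config path are shown here only as assumptions derived from the option name, so verify them against your Primus version:

.. code-block:: shell

   PRIMUS_TRAIN_RUNTIME=legacy bash runner/primus-cli direct \
     --log_file /tmp/primus_pretrain.log \
     -- train pretrain \
     --config examples/megatron/configs/MI300X/<model>-pretrain.yaml \
     --global_batch_size 512 \
     --recompute_num_layers 80 \
     --num_layers 4   # assumed flag name; reduces depth for a proxy-model run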

Known issues
============

The DeepSeekV3 and Mixtral 8x22B proxy models may exit with an error due to an
issue when freeing memory. However, this does not impact training runs. All
iterations (50 in this case) should have completed before the exit, and the
results should still be available at the end.

Further reading
===============
@@ -45,7 +45,7 @@ with Primus Turbo optimizations.

     - {{ component_version }}
{% endfor %}

.. _amd-primus-pytorch-model-support-v26.01:
.. _amd-primus-pytorch-model-support-v26.2:

Supported models
================

@@ -91,7 +91,7 @@ vary by model -- select one to get started.

For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the :doc:`pytorch-training` documentation (without Primus).

.. _amd-primus-pytorch-performance-measurements-v26.01:
.. _amd-primus-pytorch-performance-measurements-v26.2:

System validation
=================
@@ -146,7 +146,7 @@ tweak some configurations (such as batch sizes).

.. container:: model-doc {{ model.mad_tag }}

   The following run commands are tailored to {{ model.model }}.
   See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-pytorch-model-support-v26.2` to switch to another available model.

   .. rubric:: Download the Docker image and required packages
@@ -224,17 +224,6 @@ tweak some configurations (such as batch sizes).

        -- train pretrain \
        --config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml

.. tab-item:: MI325X
   :sync: MI325X

   .. code-block:: shell

      bash runner/primus-cli direct \
        --log_file /tmp/primus_llama3.1_8B.log \
        -- train pretrain \
        --config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
        --training.local_batch_size 6

.. tab-item:: MI300X
   :sync: MI300X
@@ -259,17 +248,6 @@ tweak some configurations (such as batch sizes).

        -- train pretrain \
        --config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml

.. tab-item:: MI325X
   :sync: MI325X

   .. code-block:: shell

      bash runner/primus-cli direct \
        --log_file /tmp/primus_llama3.1_8B_fp8.log \
        -- train pretrain \
        --config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
        --training.local_batch_size 7

.. tab-item:: MI300X
   :sync: MI300X
@@ -296,17 +274,6 @@ tweak some configurations (such as batch sizes).

        -- train pretrain \
        --config examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml

.. tab-item:: MI325X
   :sync: MI325X

   .. code-block:: shell

      bash runner/primus-cli direct \
        --log_file /tmp/primus_llama3.1_70B.log \
        -- train pretrain \
        --config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
        --training.local_batch_size 6

.. tab-item:: MI300X
   :sync: MI300X
@@ -331,17 +298,6 @@ tweak some configurations (such as batch sizes).

        -- train pretrain \
        --config examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml

.. tab-item:: MI325X
   :sync: MI325X

   .. code-block:: shell

      bash runner/primus-cli direct \
        --log_file /tmp/primus_llama3.1_70B_fp8.log \
        -- train pretrain \
        --config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
        --training.local_batch_size 5

.. tab-item:: MI300X
   :sync: MI300X
@@ -368,17 +324,6 @@ tweak some configurations (such as batch sizes).

        -- train pretrain \
        --config examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml

.. tab-item:: MI325X
   :sync: MI325X

   .. code-block:: shell

      bash runner/primus-cli direct \
        --log_file /tmp/primus_deepseek_v3_16b.log \
        -- train pretrain \
        --config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
        --training.local_batch_size 10

.. tab-item:: MI300X
   :sync: MI300X
@@ -399,7 +344,7 @@ tweak some configurations (such as batch sizes).

.. container:: model-doc {{ model.mad_tag }}

   The following run command is tailored to {{ model.model }}.
   See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
   See :ref:`amd-primus-pytorch-model-support-v26.2` to switch to another available model.

   1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
      directory and install the required packages on the host machine.
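
      A minimal sketch of this step (the ``requirements.txt`` file name inside the MAD repository is an assumption here; use whatever package list the repository actually provides):

      .. code-block:: shell

         # Clone the MAD repository to a local directory
         git clone https://github.com/ROCm/MAD
         cd MAD

         # Install the host-side Python packages (assumed file name)
         pip install -r requirements.txt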
@@ -13,11 +13,16 @@ compatibility with industry software frameworks. For more information, see

[What is ROCm?](./what-is-rocm.rst)

ROCm supports multiple programming languages and programming interfaces such as
{doc}`HIP (Heterogeneous-Compute Interface for Portability)<hip:index>`, OpenCL,
and OpenMP, as explained in the [Programming guide](./how-to/programming_guide.rst).
{doc}`HIP <hip:index>`, OpenCL, and OpenMP, as explained in the [Programming guide](./how-to/programming_guide.rst).

If you're using AMD Radeon GPUs or Ryzen APUs in a workstation setting with a display connected, review {doc}`ROCm on Radeon and Ryzen documentation<radeon:index>`.

```{note}
The [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/latest/)
presents key ROCm concepts in a structured, book-style format, a helpful
starting point for those new to GPU programming.
```

ROCm documentation is organized into the following categories:

::::{grid} 1 2 2 2
@@ -35,20 +35,8 @@ subtrees:

  title: TensorFlow compatibility
- file: compatibility/ml-compatibility/jax-compatibility.rst
  title: JAX compatibility
- file: compatibility/ml-compatibility/verl-compatibility.rst
  title: verl compatibility
- file: compatibility/ml-compatibility/stanford-megatron-lm-compatibility.rst
  title: Stanford Megatron-LM compatibility
- file: compatibility/ml-compatibility/dgl-compatibility.rst
  title: DGL compatibility
- file: compatibility/ml-compatibility/megablocks-compatibility.rst
  title: Megablocks compatibility
- file: compatibility/ml-compatibility/ray-compatibility.rst
  title: Ray compatibility
- file: compatibility/ml-compatibility/llama-cpp-compatibility.rst
  title: llama.cpp compatibility
- file: compatibility/ml-compatibility/flashinfer-compatibility.rst
  title: FlashInfer compatibility
- file: how-to/build-rocm.rst
  title: Build ROCm from source
@@ -77,12 +65,12 @@ subtrees:

  title: Train a model with Primus and Megatron-LM
  entries:
  - file: how-to/rocm-for-ai/training/benchmark-docker/megatron-lm.rst
    title: Train a model with Megatron-LM
    title: Train a model with Megatron-LM (legacy)
- file: how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch.rst
  title: Train a model with Primus and PyTorch
  entries:
  - file: how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.rst
    title: Train a model with PyTorch
    title: Train a model with PyTorch (legacy)
- file: how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.rst
  title: Train a model with Primus and JAX MaxText
- file: how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry
@@ -10,12 +10,12 @@ alabaster==1.0.0

    # via sphinx
asttokens==3.0.1
    # via stack-data
attrs==25.4.0
attrs==26.1.0
    # via
    #   jsonschema
    #   jupyter-cache
    #   referencing
babel==2.17.0
babel==2.18.0
    # via
    #   pydata-sphinx-theme
    #   sphinx

@@ -23,13 +23,13 @@ beautifulsoup4==4.14.3

    # via pydata-sphinx-theme
breathe==4.36.0
    # via rocm-docs-core
certifi==2026.1.4
certifi==2026.2.25
    # via requests
cffi==2.0.0
    # via
    #   cryptography
    #   pynacl
charset-normalizer==3.4.4
charset-normalizer==3.4.6
    # via requests
click==8.3.1
    # via

@@ -39,7 +39,7 @@ comm==0.2.3

    # via ipykernel
cryptography==46.0.5
    # via pyjwt
debugpy==1.8.19
debugpy==1.8.20
    # via ipykernel
decorator==5.2.1
    # via ipython

@@ -62,17 +62,17 @@ gitdb==4.0.12

    # via gitpython
gitpython==3.1.46
    # via rocm-docs-core
greenlet==3.3.0
greenlet==3.3.2
    # via sqlalchemy
idna==3.11
    # via requests
imagesize==1.4.1
imagesize==2.0.0
    # via sphinx
importlib-metadata==8.7.1
importlib-metadata==9.0.0
    # via
    #   jupyter-cache
    #   myst-nb
ipykernel==7.1.0
ipykernel==7.2.0
    # via myst-nb
ipython==8.38.0
    # via

@@ -114,7 +114,7 @@ mdit-py-plugins==0.5.0

    # via myst-parser
mdurl==0.1.2
    # via markdown-it-py
myst-nb==1.3.0
myst-nb==1.4.0
    # via rocm-docs-core
myst-parser==4.0.1
    # via myst-nb

@@ -129,32 +129,32 @@ nbformat==5.10.4

    #   nbclient
nest-asyncio==1.6.0
    # via ipykernel
packaging==25.0
packaging==26.0
    # via
    #   ipykernel
    #   pydata-sphinx-theme
    #   sphinx
parso==0.8.5
parso==0.8.6
    # via jedi
pexpect==4.9.0
    # via ipython
platformdirs==4.5.1
platformdirs==4.9.4
    # via jupyter-core
prompt-toolkit==3.0.52
    # via ipython
psutil==7.2.1
psutil==7.2.2
    # via ipykernel
ptyprocess==0.7.0
    # via pexpect
pure-eval==0.2.3
    # via stack-data
pycparser==2.23
pycparser==3.0
    # via cffi
pydata-sphinx-theme==0.15.4
    # via
    #   rocm-docs-core
    #   sphinx-book-theme
pygithub==2.8.1
pygithub==2.9.0
    # via rocm-docs-core
pygments==2.19.2
    # via

@@ -162,7 +162,7 @@ pygments==2.19.2

    #   ipython
    #   pydata-sphinx-theme
    #   sphinx
pyjwt[crypto]==2.10.1
pyjwt[crypto]==2.12.1
    # via pygithub
pynacl==1.6.2
    # via pygithub

@@ -196,11 +196,11 @@ rpds-py==0.30.0

    #   referencing
six==1.17.0
    # via python-dateutil
smmap==5.0.2
smmap==5.0.3
    # via gitdb
snowballstemmer==3.0.1
    # via sphinx
soupsieve==2.8.1
soupsieve==2.8.3
    # via beautifulsoup4
sphinx==8.1.3
    # via

@@ -253,15 +253,15 @@ sphinxcontrib-runcmd==0.2.0

    # via sphinxcontrib-datatemplates
sphinxcontrib-serializinghtml==2.0.0
    # via sphinx
sqlalchemy==2.0.45
sqlalchemy==2.0.48
    # via jupyter-cache
stack-data==0.6.3
    # via ipython
tabulate==0.9.0
tabulate==0.10.0
    # via jupyter-cache
tomli==2.4.0
    # via sphinx
tornado==6.5.4
tornado==6.5.5
    # via
    #   ipykernel
    #   jupyter-client

@@ -283,13 +283,14 @@ typing-extensions==4.15.0

    #   myst-nb
    #   pydata-sphinx-theme
    #   pygithub
    #   pyjwt
    #   referencing
    #   sqlalchemy
urllib3==2.6.3
    # via
    #   pygithub
    #   requests
wcwidth==0.2.14
wcwidth==0.6.0
    # via prompt-toolkit
zipp==3.23.0
    # via importlib-metadata