Compare commits

...

17 Commits

Author SHA1 Message Date
peterjunpark
811188dc13 Update Primus docs for 26.1 release (#5911) (#5918)
* archive previous versions

update conf

fix

fix docker hub url

fix

* update history pages

* update docker info

* update configs

* update primus commit

(cherry picked from commit d8b6ee47e3)
2026-01-30 12:54:26 -05:00
peterjunpark
ec36bc9971 Publish vLLM / SGLang + MoRI distributed inference cookbooks (#5912) (#5913)
* add recipes

* clean up

update

clean up

fix

* update sglang docker instructions

docker image tag
add user to docker group

fix

* update pldm/bkc

* update pldm/bkc

* add bkc note

* update bkc notes

* update article info

* update wordlist

* fix linting issues

* fix linting issues

* fix linting

* fix ref

(cherry picked from commit d1165b7359)
2026-01-29 11:42:03 -05:00
Pratik Basyal
cd208e7d74 PLDM Note change 720 (#5894)
* Note change

* Minor change
2026-01-23 10:32:00 -05:00
Pratik Basyal
af8ea73581 720 reference link update and note fixes [Develop] (#5883) (#5884)
* Links updated to 7.2.0

* COmpatibility note fixed
2026-01-22 12:21:46 -05:00
Pratik Basyal
f1c86d7d29 720 Post GA Known Issues update (#5879)
* 7.2.0 Known issues and PLDM table updated (#5877)

* Known issues and PLDM table updated

* JAX workload known issues added

* Minor changes

* Minor update
2026-01-21 17:29:18 -05:00
Alex Xu
370816001e Merge branch 'roc-7.2.x' into docs/7.2.0 2026-01-21 15:29:08 -05:00
Swati Rawat
d5994da509 Merge pull request #5872 from SwRaw/swaraw_cherrypick
Cherrypicking replacement of rocm-smi with amd-smi from ROCm internal
2026-01-21 19:10:51 +05:30
srawat
c02f86c0e7 Update prerequisite-system-validation.rst 2026-01-21 17:43:10 +05:30
srawat
d3523c24d3 replace rocm-smi reference with amd-smi 2026-01-21 17:40:26 +05:30
Swati Rawat
1980239b81 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:31:41 +05:30
Swati Rawat
c75fd6f532 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:31:05 +05:30
Swati Rawat
72cb598190 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:30:33 +05:30
Swati Rawat
9b55b77aaa Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:29:45 +05:30
Swati Rawat
8267303e1d Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:29:04 +05:30
Swati Rawat
86d2c4e891 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-21 17:28:23 +05:30
srawat
2977e35330 Update single-gpu-fine-tuning-and-inference.rst 2026-01-21 17:27:13 +05:30
srawat
e95955f572 Update multi-gpu-fine-tuning-and-inference.rst 2026-01-21 17:27:13 +05:30
33 changed files with 5406 additions and 246 deletions

View File

@@ -39,6 +39,7 @@ autograd
Backported Backported
BARs BARs
BatchNorm BatchNorm
BKC
BLAS BLAS
BMC BMC
BabelStream BabelStream
@@ -53,6 +54,7 @@ CDNA
CGUI CGUI
CHTML CHTML
CIFAR CIFAR
CNP
CLI CLI
CLion CLion
CMake CMake
@@ -96,6 +98,7 @@ Dashboarding
Dataloading Dataloading
dataflows dataflows
DBRX DBRX
DCQCN
DDR DDR
DF DF
DGEMM DGEMM
@@ -110,8 +113,10 @@ DMA
DOMContentLoaded DOMContentLoaded
DNN DNN
DNNL DNNL
DOCA
DPM DPM
DRI DRI
DSCP
DW DW
DWORD DWORD
Dask Dask
@@ -127,7 +132,9 @@ Deprecations
DevCap DevCap
DirectX DirectX
Disaggregated Disaggregated
disagg
disaggregated disaggregated
disaggregation
Dockerfile Dockerfile
Dockerized Dockerized
Doxygen Doxygen
@@ -179,6 +186,8 @@ GFLOPS
GFortran GFortran
GFXIP GFXIP
GGUF GGUF
GID
Gbps
Gemma Gemma
GiB GiB
GIM GIM
@@ -248,6 +257,7 @@ IOP
IOPS IOPS
IOPM IOPM
IOV IOV
IPs
IRQ IRQ
ISA ISA
ISV ISV
@@ -312,6 +322,7 @@ MNIST
MPI MPI
MPT MPT
MSVC MSVC
MTU
mul mul
MVAPICH MVAPICH
MVFFR MVFFR
@@ -334,6 +345,7 @@ MLA
MosaicML MosaicML
MoEs MoEs
Mooncake Mooncake
MoRI
Mpops Mpops
Multicore Multicore
multihost multihost
@@ -403,16 +415,21 @@ PEQT
PIL PIL
PILImage PILImage
PJRT PJRT
PLDM
POR POR
PRNG PRNG
PRs PRs
PSID
PTPC
PaLM PaLM
Pageable Pageable
PeerDirect PeerDirect
Pensando
PerfDb PerfDb
Perfetto Perfetto
PipelineParallel PipelineParallel
PnP PnP
Pollara
PowerEdge PowerEdge
PowerShell PowerShell
Pretrained Pretrained
@@ -424,6 +441,7 @@ Pytest
PyTorch PyTorch
QPS QPS
Qcycles Qcycles
QoS
Qwen Qwen
RAII RAII
RAS RAS
@@ -457,6 +475,7 @@ RPP
RST RST
RW RW
Radeon Radeon
Redfish
RelWithDebInfo RelWithDebInfo
Req Req
Rickle Rickle
@@ -724,6 +743,7 @@ enqueue
env env
epilog epilog
etcetera etcetera
eth
ethernet ethernet
exascale exascale
executables executables
@@ -819,6 +839,7 @@ llvm
lm lm
localscratch localscratch
logits logits
loopback
lossy lossy
macOS macOS
matchers matchers
@@ -844,6 +865,7 @@ nanoGPT
NCS NCS
NOP NOP
NVLink NVLink
netplan
num num
numref numref
ocl ocl
@@ -911,6 +933,7 @@ rc
rccl rccl
rdc rdc
rdma rdma
reachability
reStructuredText reStructuredText
redirections redirections
refactorization refactorization
@@ -980,6 +1003,7 @@ shader
sharding sharding
sigmoid sigmoid
sles sles
slurm
sm sm
smi smi
softmax softmax

View File

@@ -199,7 +199,7 @@ for a complete overview of this release.
* fftw_execute_dft_c2r * fftw_execute_dft_c2r
* fftwf_execute_dft_c2r * fftwf_execute_dft_c2r
### **HIPIFY** (22.2.0) ### **HIPIFY** (22.0.0)
#### Added #### Added
@@ -279,6 +279,10 @@ for a complete overview of this release.
* Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches). * Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches).
#### Upcoming changes
* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
### **MIGraphX** (2.15.0) ### **MIGraphX** (2.15.0)
#### Added #### Added

View File

@@ -139,7 +139,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</td> </td>
</tr> </tr>
<tr> <tr>
<td>MI300X</td> <td>MI300X<a href="#footnote2"><sup>[2]</sup></a></td>
<td>01.25.03.12</td> <td>01.25.03.12</td>
<td rowspan="6" style="vertical-align: middle;"> <td rowspan="6" style="vertical-align: middle;">
30.30.0<br> 30.30.0<br>
@@ -180,6 +180,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
</div> </div>
<p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p> <p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p>
<p id="footnote1">[2]: For AMD Instinct MI300X KVM SR-IOV with Multi-VF (8 VF) support requires a compatible firmware BKC bundle which will be released in coming months.</p>
#### Node power management for multi-GPU nodes added #### Node power management for multi-GPU nodes added
@@ -245,7 +246,7 @@ New Stream Management API `hipStreamCopyAttributes` is implemented for CUDA Pari
The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication. The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication.
In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/latest/install.html). In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/docs-7.2.0/install.html).
### Software-managed plan cache support for hipTensor ### Software-managed plan cache support for hipTensor
@@ -285,7 +286,7 @@ MIGraphX has the following enhancements:
### AMDGPU wavefront size macro removal ### AMDGPU wavefront size macro removal
The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize). The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/hip_cpp_language_extensions.html#warpsize).
For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example: For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example:
``` ```
@@ -349,13 +350,13 @@ The ROCm Offline Installer Creator 7.2.0 includes the following features and imp
* Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation. * Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation.
* Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer. * Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer.
See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information. See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) for more information.
### ROCm Runfile Installer updates ### ROCm Runfile Installer updates
The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues. The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues.
For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html). For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html).
### Expansion of the ROCm examples repository ### Expansion of the ROCm examples repository
@@ -375,7 +376,7 @@ Usage examples are now available for the [ROCgdb](https://github.com/ROCm/rocm-e
ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases. ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.
* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/latest/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html) set continues to provide detailed information, tutorials, and reference content. * The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/docs-7.2.0/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/index.html) set continues to provide detailed information, tutorials, and reference content.
* The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html). * The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html).
@@ -913,7 +914,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
* fftw_execute_dft_c2r * fftw_execute_dft_c2r
* fftwf_execute_dft_c2r * fftwf_execute_dft_c2r
### **HIPIFY** (22.2.0) ### **HIPIFY** (22.0.0)
#### Added #### Added
@@ -995,7 +996,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
#### Upcoming changes #### Upcoming changes
* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang. * As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/docs-7.2.0/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/docs-7.2.0/index.html). It's recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn't any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
### **MIGraphX** (2.15.0) ### **MIGraphX** (2.15.0)
@@ -1432,7 +1433,11 @@ python3 -m pip install --user .
`sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails. `sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails.
``` ```
For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. See [GitHub issue #5875](https://github.com/ROCm/ROCm/issues/5875).
### Intermittent errors when running JAX workloads
You might experience intermittent errors or segmentation faults when running JAX workloads. The issue is currently under investigation and will be addressed in an upcoming ROCm release. See [GitHub issue #5878](https://github.com/ROCm/ROCm/issues/5878).
### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs ### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs
@@ -1477,7 +1482,7 @@ The following changes to the ROCm software stack are anticipated for future rele
### ROCm Offline Installer Creator deprecation ### ROCm Offline Installer Creator deprecation
The ROCm Offline Installer Creator is deprecated with the ROCm 7.2.0 release. Equivalent installation capabilities are available through the ROCm Runfile Installer, a self-extracting installer that is not based on OS package managers. This installer will be removed in a future release. The [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) is deprecated with the ROCm 7.2.0 release and will be removed in a future release. Equivalent installation capabilities are available through the [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html), a self-extracting installer that is not based on OS package managers.
### ROCm SMI deprecation ### ROCm SMI deprecation

View File

@@ -171,6 +171,7 @@ Operating systems, kernel and Glibc versions
********************************************* *********************************************
For detailed information on operating system supported on ROCm 7.2.0 and associated Kernel and Glibc version, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__. For detailed information on operating system supported on ROCm 7.2.0 and associated Kernel and Glibc version, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
.. note:: .. note::
* See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL). * See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL).

View File

@@ -138,12 +138,14 @@ article_pages = [
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-megatron", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-megatron", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3", "os": ["linux"]},
@@ -154,16 +156,17 @@ article_pages = [
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.8", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.8", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.9", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.9", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]},
@@ -193,11 +196,16 @@ article_pages = [
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]},
{"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]}, {"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]},

View File

@@ -1,15 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.10 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
Triton: 3.4.0 Triton: 3.4.0
RCCL: 2.27.7 RCCL: 2.27.7
model_groups: model_groups:

View File

@@ -0,0 +1,47 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
Triton: 3.4.0
RCCL: 2.27.7
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: pyt_megatron_lm_train_llama-3.3-70b
- model: Llama 3.1 8B
mad_tag: pyt_megatron_lm_train_llama-3.1-8b
- model: Llama 3.1 70B
mad_tag: pyt_megatron_lm_train_llama-3.1-70b
- model: Llama 2 7B
mad_tag: pyt_megatron_lm_train_llama-2-7b
- model: Llama 2 70B
mad_tag: pyt_megatron_lm_train_llama-2-70b
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: pyt_megatron_lm_train_deepseek-v3-proxy
- model: DeepSeek-V2-Lite
mad_tag: pyt_megatron_lm_train_deepseek-v2-lite-16b
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: pyt_megatron_lm_train_mixtral-8x7b
- model: Mixtral 8x22B (proxy)
mad_tag: pyt_megatron_lm_train_mixtral-8x22b-proxy
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: pyt_megatron_lm_train_qwen2.5-7b
- model: Qwen 2.5 72B
mad_tag: pyt_megatron_lm_train_qwen2.5-72b

View File

@@ -0,0 +1,58 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
Triton: 3.4.0
RCCL: 2.27.7
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
config_name: llama3.3_70B-pretrain.yaml
- model: Llama 3.1 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
config_name: llama3.1_70B-pretrain.yaml
- model: Llama 3.1 8B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
config_name: llama3.1_8B-pretrain.yaml
- model: Llama 2 7B
mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
config_name: llama2_7B-pretrain.yaml
- model: Llama 2 70B
mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
config_name: llama2_70B-pretrain.yaml
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: primus_pyt_megatron_lm_train_deepseek-v3-proxy
config_name: deepseek_v3-pretrain.yaml
- model: DeepSeek-V2-Lite
mad_tag: primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
config_name: deepseek_v2_lite-pretrain.yaml
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x7b
config_name: mixtral_8x7B_v0.1-pretrain.yaml
- model: Mixtral 8x22B (proxy)
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
config_name: mixtral_8x22B_v0.1-pretrain.yaml
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
config_name: primus_qwen2.5_7B-pretrain.yaml
- model: Qwen 2.5 72B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-72b
config_name: qwen2.5_72B-pretrain.yaml

View File

@@ -0,0 +1,32 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.1 8B
mad_tag: primus_pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
- model: Llama 3.1 70B
mad_tag: primus_pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B
precision: BF16
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek V3 16B
mad_tag: primus_pyt_train_deepseek-v3-16b
model_repo: DeepSeek-V3
url: https://huggingface.co/deepseek-ai/DeepSeek-V3
precision: BF16

View File

@@ -0,0 +1,195 @@
docker:
pull_tag: rocm/primus:v25.11
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4
Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 4 Scout 17B-16E
mad_tag: pyt_train_llama-4-scout-17b-16e
model_repo: Llama-4-17B_16E
url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.3 70B
mad_tag: pyt_train_llama-3.3-70b
model_repo: Llama-3.3-70B
url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 3.2 1B
mad_tag: pyt_train_llama-3.2-1b
model_repo: Llama-3.2-1B
url: https://huggingface.co/meta-llama/Llama-3.2-1B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 3B
mad_tag: pyt_train_llama-3.2-3b
model_repo: Llama-3.2-3B
url: https://huggingface.co/meta-llama/Llama-3.2-3B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 Vision 11B
mad_tag: pyt_train_llama-3.2-vision-11b
model_repo: Llama-3.2-Vision-11B
url: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.2 Vision 90B
mad_tag: pyt_train_llama-3.2-vision-90b
model_repo: Llama-3.2-Vision-90B
url: https://huggingface.co/meta-llama/Llama-3.2-90B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.1 8B
mad_tag: pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora, HF_pretrain]
- model: Llama 3.1 70B
mad_tag: pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora]
- model: Llama 3.1 405B
mad_tag: pyt_train_llama-3.1-405b
model_repo: Llama-3.1-405B
url: https://huggingface.co/meta-llama/Llama-3.1-405B
precision: BF16
training_modes: [finetune_qlora]
- model: Llama 3 8B
mad_tag: pyt_train_llama-3-8b
model_repo: Llama-3-8B
url: https://huggingface.co/meta-llama/Meta-Llama-3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3 70B
mad_tag: pyt_train_llama-3-70b
model_repo: Llama-3-70B
url: https://huggingface.co/meta-llama/Meta-Llama-3-70B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 7B
mad_tag: pyt_train_llama-2-7b
model_repo: Llama-2-7B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 2 13B
mad_tag: pyt_train_llama-2-13b
model_repo: Llama-2-13B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 70B
mad_tag: pyt_train_llama-2-70b
model_repo: Llama-2-70B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_lora, finetune_qlora]
- group: OpenAI
tag: openai
models:
- model: GPT OSS 20B
mad_tag: pyt_train_gpt_oss_20b
model_repo: GPT-OSS-20B
url: https://huggingface.co/openai/gpt-oss-20b
precision: BF16
training_modes: [HF_finetune_lora]
- model: GPT OSS 120B
mad_tag: pyt_train_gpt_oss_120b
model_repo: GPT-OSS-120B
url: https://huggingface.co/openai/gpt-oss-120b
precision: BF16
training_modes: [HF_finetune_lora]
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek V2 16B
mad_tag: primus_pyt_train_deepseek-v2
model_repo: DeepSeek-V2
url: https://huggingface.co/deepseek-ai/DeepSeek-V2
precision: BF16
training_modes: [pretrain]
- group: Qwen
tag: qwen
models:
- model: Qwen 3 8B
mad_tag: pyt_train_qwen3-8b
model_repo: Qwen3-8B
url: https://huggingface.co/Qwen/Qwen3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 3 32B
mad_tag: pyt_train_qwen3-32b
model_repo: Qwen3-32
url: https://huggingface.co/Qwen/Qwen3-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 32B
mad_tag: pyt_train_qwen2.5-32b
model_repo: Qwen2.5-32B
url: https://huggingface.co/Qwen/Qwen2.5-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 72B
mad_tag: pyt_train_qwen2.5-72b
model_repo: Qwen2.5-72B
url: https://huggingface.co/Qwen/Qwen2.5-72B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2 1.5B
mad_tag: pyt_train_qwen2-1.5b
model_repo: Qwen2-1.5B
url: https://huggingface.co/Qwen/Qwen2-1.5B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 2 7B
mad_tag: pyt_train_qwen2-7b
model_repo: Qwen2-7B
url: https://huggingface.co/Qwen/Qwen2-7B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- group: Stable Diffusion
tag: sd
models:
- model: Stable Diffusion XL
mad_tag: pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning
model_repo: SDXL
url: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
precision: BF16
training_modes: [posttrain]
- group: Flux
tag: flux
models:
- model: FLUX.1-dev
mad_tag: pyt_train_flux
model_repo: Flux
url: https://huggingface.co/black-forest-labs/FLUX.1-dev
precision: BF16
training_modes: [posttrain]
- group: NCF
tag: ncf
models:
- model: NCF
mad_tag: pyt_ncf_training
model_repo:
url: https://github.com/ROCm/FluxBenchmark
precision: FP32
- group: DLRM
tag: dlrm
models:
- model: DLRM v2
mad_tag: pyt_train_dlrm
model_repo: DLRM
url: https://github.com/AMD-AGI/DLRMBenchmark
training_modes: [pretrain]

View File

@@ -1,13 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.11 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
Triton: 3.4.0 Triton: 3.4.0
RCCL: 2.27.7 RCCL: 2.27.7
model_groups: model_groups:

View File

@@ -1,13 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.11 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
model_groups: model_groups:
- group: Meta Llama - group: Meta Llama
tag: llama tag: llama

View File

@@ -1,15 +1,13 @@
docker: docker:
pull_tag: rocm/primus:v25.10 pull_tag: rocm/primus:v26.1
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197 docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
components: components:
ROCm: 7.1.0 ROCm: 7.1.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.10.0.dev20251112+rocm7.1 PyTorch: 2.10.0.dev20251112+rocm7.1
Python: "3.10" Python: "3.10"
Transformer Engine: 2.4.0.dev0+32e2d1d4 Transformer Engine: 2.6.0.dev0+f141f34b
Flash Attention: 2.8.3 Flash Attention: 2.8.3
hipBLASLt: 1.2.0-09ab7153e2 hipBLASLt: 34459f66ea
model_groups: model_groups:
- group: Meta Llama - group: Meta Llama
tag: llama tag: llama

View File

@@ -130,7 +130,7 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference torchtune for fine-tuning and inference
============================================= =============================================
`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU `torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs. model fine-tuning and inference with LLMs.
#. Install torchtune using pip. #. Install torchtune using pip.

View File

@@ -25,6 +25,5 @@ In this guide, you'll learn how to use ROCm for AI:
- :doc:`Inference optimization <inference-optimization/index>` - :doc:`Inference optimization <inference-optimization/index>`
To learn about ROCm for HPC applications and scientific computing, see To learn about ROCm for HPC applications and scientific computing, see
:doc:`../rocm-for-hpc/index`. :doc:`../rocm-for-hpc/index`.

View File

@@ -0,0 +1,904 @@
# SGLang distributed inference with MoRI
This document provides a comprehensive guide for deploying a high-performance
SGLang distributed inference serving environment on an AMD Instinct MI355X GPU
cluster, utilizing the [MoRI (Modular RDMA
Interface)](https://github.com/rocm/mori) communication backend for optimized
inter-node collective operations. It also includes systematic instructions for
benchmarking 1P2D (1 prefill 2 decode, 3 nodes) configurations using automated
scripts.
## Prerequisites
The following configuration is required to implement this setup:
* **Nodes:** A minimum of three GPU nodes (virtual or physical machines) for
  wide expert parallelism (EP) evaluation.
* **GPUs:** 8x AMD Instinct MI355X GPUs per node.
* **Networking:** 8x AMD Pensando™ Pollara 400 AI NICs per node, providing
  a dedicated 1:1 mapping between GPUs and network interfaces for optimal
  inter-node communication.
* **Orchestration:** A Slurm cluster with at least three nodes -- one for the
  prefill service and two for the decode services (EP16). A quick way to
  confirm this layout is sketched after this list.
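To confirm the cluster layout up front, you can query Slurm for the node inventory and count the GPUs that ROCm exposes on each node. This is a minimal sketch, assuming the default partition, that `amd-smi` is installed on every node, and that `amd-smi list` prefixes each device entry with `GPU`; adjust node counts and options to match your site.
```bash
# List cluster nodes with their state and GRES (GPU) configuration
sinfo -N -o "%N %T %G"
# Count the GPUs visible to ROCm on each of the three nodes (expect 8 per node)
srun --nodes=3 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(amd-smi list | grep -c "^GPU") GPUs"'
```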
## System configuration
This section outlines the infrastructure setup required to support your AMD
Instinct MI355X cluster. It covers essential procedures for verifying software
baselines and firmware versions, configuring the AMD Pensando Pollara 400 AI
NICs for high-bandwidth networking, and applying thermal and Quality of Service
(QoS) tunings to ensure a stable, lossless RDMA fabric.
(sglang-mori-verify-baseline)=
### Verify baseline software
The following table outlines the validated software stack. Use the provided
shell commands to verify the environment on each node before proceeding.
| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 22.04.5 LTS | `cat /etc/os-release` |
| **Kernel** | 5.15.0-163-generic | `uname -r` |
| **ROCm** | 7.1.1 | `amd-smi version` |
| **PLDM bundle (firmware)** | 01.25.16.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **AI NIC Firmware** | 1.117.5.a.45 | `dkms status` |
| **AI NIC Driver** | 25.11.1.001 | `dkms status` |
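For convenience, the table's verification commands can be run in one pass on each node. This is a minimal sketch; the expected values are the ones listed in the table above.
```bash
# Spot-check one node against the validated baseline
grep PRETTY_NAME /etc/os-release      # expect Ubuntu 22.04.5 LTS
uname -r                              # expect 5.15.0-163-generic
amd-smi version                       # ROCm version should report 7.1.1
dkms status                           # AI NIC driver and firmware modules should match the table
```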
### Verify best known configuration (BKC)
The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain best performance and compatibility.
While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.
To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:
1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
firmware inventory.
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
-u "${AUTH}" -k | json_pp
# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
-u "${AUTH}" -k | json_pp
```
### Run basic system health checks
Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI355X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi355x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.
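A few of these items can be spot-checked from the shell before running the full acceptance procedure. The following is a minimal sketch only; the authoritative expected values are in the linked health-check guide.
```bash
# Kernel boot arguments currently in effect
cat /proc/cmdline
# Total system memory
free -h | awk '/^Mem:/ {print "Total memory: " $2}'
# PCIe link status of AMD devices (vendor ID 0x1002); look for stable 32 GT/s (Gen5) links
sudo lspci -d 1002: -vvv | grep -E "LnkSta:"
```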
### Install AMD Pensando Pollara 400 AI NIC drivers
For detailed instructions on upgrading the firmware and installing drivers for
the AMD Pensando Pollara 400 AI NIC, refer to the [AMD Instinct System
Acceptance
Guide](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/network/nic-installation.html#amd-pensando-pollara-400-ai-nic).
After installation, verify the active firmware version on all NICs to ensure it
matches the software baseline. See [Verify baseline software](#sglang-mori-verify-baseline).
To display the current firmware version for all AI NICs, use the following command.
```bash
sudo nicctl show version firmware
```
### Configure thermal management (fan speed)
For systems equipped with 400G optics, standard fan profiles are often
insufficient for maintaining stable operating temperatures. To prevent thermal
throttling or optics failure, the system fans must be set to `FullSpeed`.
* Requirement: A fan speed of approximately 25,000 RPM is required to maintain
the AI NIC modules at an optimal operating temperature (~50°C).
* Constraint: Default profiles (typically around 4,000 RPM) and "Performance IO"
settings (around 9,000 RPM) do not provide adequate airflow for 400G optical
transceivers.
#### Configure fan speed via Redfish (Supermicro)
Run the following command to set the fan mode to `FullSpeed` through the BMC:
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Set Fan Mode to FullSpeed
curl -X PATCH "https://${BMC_IP}/redfish/v1/Managers/1/Oem/Supermicro/FanMode" \
-k -u "${AUTH}" \
-H "Content-Type: application/json" \
-d '{"Mode": "FullSpeed"}'
```
### Configure your backend network (netplan)
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the node's eight network interface controllers (NICs), one per GPU, are
`benic1p1` through `benic8p1`. Each NIC must have its own subnet that is
disjoint from the others, and each node needs a unique IP address on each
subnet. Use the same final octet across all subnets for a given node. For
example, one node would have the addresses `192.168.1.36`, `192.168.2.36`, and
so on, while another node would have `192.168.1.37`, `192.168.2.37`, and so on.
Ensure the MTU is set to `9000`.
```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```
For example, your `/etc/netplan/70-backend.yaml` should look like the
following:
```yaml
network:
ethernets:
benic8p1:
addresses:
- 192.168.8.38/31
match:
macaddress: 04:90:81:2a:34:08
mtu: 9000
routes:
- table: 108
to: 0.0.0.0/0
via: 192.168.8.39
routing-policy:
- from: 192.168.8.38
table: 108
set-name: benic8p1
benic7p1:
addresses:
- 192.168.7.38/31
match:
macaddress: 04:90:81:2b:82:40
mtu: 9000
routes:
- table: 107
to: 0.0.0.0/0
via: 192.168.7.39
routing-policy:
- from: 192.168.7.38
table: 107
set-name: benic7p1
benic6p1:
addresses:
- 192.168.6.38/31
match:
macaddress: 04:90:81:30:c9:30
mtu: 9000
routes:
- table: 106
to: 0.0.0.0/0
via: 192.168.6.39
routing-policy:
- from: 192.168.6.38
table: 106
set-name: benic6p1
benic5p1:
addresses:
- 192.168.5.38/31
match:
macaddress: 04:90:81:2a:23:40
mtu: 9000
routes:
- table: 105
to: 0.0.0.0/0
via: 192.168.5.39
routing-policy:
- from: 192.168.5.38
table: 105
set-name: benic5p1
benic4p1:
addresses:
- 192.168.4.38/31
match:
macaddress: 04:90:81:2d:69:60
mtu: 9000
routes:
- table: 104
to: 0.0.0.0/0
via: 192.168.4.39
routing-policy:
- from: 192.168.4.38
table: 104
set-name: benic4p1
benic3p1:
addresses:
- 192.168.3.38/31
match:
macaddress: 04:90:81:2a:2c:40
mtu: 9000
routes:
- table: 103
to: 0.0.0.0/0
via: 192.168.3.39
routing-policy:
- from: 192.168.3.38
table: 103
set-name: benic3p1
benic2p1:
addresses:
- 192.168.2.38/31
match:
macaddress: 04:90:81:30:d5:30
mtu: 9000
routes:
- table: 102
to: 0.0.0.0/0
via: 192.168.2.39
routing-policy:
- from: 192.168.2.38
table: 102
set-name: benic2p1
benic1p1:
addresses:
- 192.168.1.38/31
match:
macaddress: 04:90:81:30:e4:00
mtu: 9000
routes:
- table: 101
to: 0.0.0.0/0
via: 192.168.1.39
routing-policy:
- from: 192.168.1.38
table: 101
set-name: benic1p1
```
To apply the configuration, use the following command.
```bash
sudo netplan apply
```
To verify your configuration, use the following command.
```bash
sudo apt install -y net-tools && ip -br a
```
### Configure Quality of Service (QoS) and Congestion Control (DCQCN)
To ensure lossless communication and optimal performance for RDMA traffic, the
network must be configured with specific QoS and Data Center Quantized
Congestion Notification (DCQCN) settings.
The following configuration achieves these goals:
* Enables RX and TX pause frames on the ports.
* Maps DSCP 24 (data) to Q3 and DSCP 46 (CNP) to Q6; all other DSCP values map to Q0.
* Enables PFC for Q3.
* Scheduling: 99% of bandwidth to Q3, 1% to Q0, and strict priority for Q6.
#### Configure DCQCN
Create and run a `/nfsdata/enable_dcqcn.sh` script to initialize congestion
control parameters.
``` bash
#!/bin/bash
TOKEN_BUCKET_SIZE=800000
AI_RATE=160
ALPHA_UPDATE_INTERVAL=1
ALPHA_UPDATE_G=512
INITIAL_ALPHA_VALUE=64
RATE_INCREASE_BYTE_COUNT=431068
HAI_RATE=300
RATE_REDUCE_MONITOR_PERIOD=1
RATE_INCREASE_THRESHOLD=1
RATE_INCREASE_INTERVAL=1
CNP_DSCP=46
ROCE_DEVICES=$(ibv_devices | grep ionic_ | awk '{print $1}' | paste -sd " ")
for roce_dev in $ROCE_DEVICES
do
sudo nicctl update dcqcn -r $roce_dev -i 1 \
--token-bucket-size $TOKEN_BUCKET_SIZE \
--ai-rate $AI_RATE \
--alpha-update-interval $ALPHA_UPDATE_INTERVAL \
--alpha-update-g $ALPHA_UPDATE_G \
--initial-alpha-value $INITIAL_ALPHA_VALUE \
--rate-increase-byte-count $RATE_INCREASE_BYTE_COUNT \
--hai-rate $HAI_RATE \
--rate-reduce-monitor-period $RATE_REDUCE_MONITOR_PERIOD \
--rate-increase-threshold $RATE_INCREASE_THRESHOLD \
--rate-increase-interval $RATE_INCREASE_INTERVAL \
--cnp-dscp $CNP_DSCP
done
```
#### Configure QoS and PFC
Create and run `/nfsdata/qos.sh` to set up traffic classes and scheduling.
``` bash
#!/bin/bash
# qos.sh
# Enable PFC and Auto-negotiation on all ports
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port -p $i --pause-type pfc --rx-pause enable --tx-pause enable; done
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port --port $i --auto-neg enable; done
# Define Priorities
cts_dscp=46
cts_prio=6
data_dscp=24
data_prio=3
default_prio=0
cnp_dscp=46
cnp_prio=6
sudo nicctl update qos pfc --priority 0 --no-drop disable
sudo nicctl update qos dscp-to-purpose --dscp 48 --purpose none
sudo nicctl update qos dscp-to-purpose --dscp 46 --purpose none
sudo nicctl update qos --classification-type pcp
sudo nicctl update qos --classification-type dscp
sudo nicctl update qos dscp-to-priority --dscp 0-63 --priority 0
sudo nicctl update qos dscp-to-priority --dscp 0-23,25-45,47-63 --priority $default_prio
sudo nicctl update qos dscp-to-priority --dscp $cts_dscp --priority $cts_prio
sudo nicctl update qos dscp-to-priority --dscp $data_dscp --priority $data_prio
sudo nicctl update qos dscp-to-priority --dscp $cnp_dscp --priority $cnp_prio
sudo nicctl update qos pfc --priority $data_prio --no-drop enable
sudo nicctl update qos scheduling --priority $data_prio,$default_prio,$cts_prio --dwrr 99,1,0 --rate-limit 0,0,10
```
#### Verify your configuration
Verify the configuration using `nicctl`.
* Verify QoS classification:
``` bash
sudo nicctl show qos
```
Expected QoS output:
``` bash
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
Port : 04908130-a7a0-4242-4242-000011010000
Classification type : DSCP
DSCP-to-priority :
DSCP bitmap : 0xffffbffffeffffff ==> priority : 0
DSCP bitmap : 0x0000000001000000 ==> priority : 3
DSCP bitmap : 0x0000400000000000 ==> priority : 6
DSCP : 0-23, 25-45, 47-63 ==> priority : 0
DSCP : 24 ==> priority : 3
DSCP : 46 ==> priority : 6
```
* Verify DCQCN and scheduling:
``` bash
sudo nicctl show dcqcn
```
Expected DCQCN and scheduling output:
``` bash
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
------------------------------------------------------------------------------------------
Lif id : 43000070-0100-0000-4242-04908130a7a0
ROCE device : ionic_7
DCQCN profile id : 1
Status : Enabled
Rate increase in AI phase : 160
Rate increase byte count : 431068
Alpha update G value : 512
Alpha update interval : 1
Rate increase in HAI phase : 300
Initial alpha value : 64
Rate reduce monitor period : 1
Rate increase threshold : 1
Rate increase interval : 1
Token bucket size : 800000
DSCP value used for CNP : 46
PFC :
PFC priority bitmap : 0x8
PFC no-drop priorities : 3
Scheduling :
--------------------------------------------
Priority Scheduling Bandwidth Rate-limit
Type (in %age) (in Gbps)
--------------------------------------------
0 DWRR 1 N/A
3 DWRR 99 N/A
6 strict N/A 10
```
### Configure your network file system (NFS)
Setting up a shared NFS volume facilitates centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.
```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```
``` bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```
## Software installation
Next, install the core compute stack required to operate the AMD Instinct GPUs.
The following steps guide you through deploying the ROCm software stack and the
necessary kernel-mode drivers to enable hardware acceleration and optimize the
environment for distributed inference workloads.
### Install ROCm
Use the following commands to quickly install ROCm 7.1.1 on Ubuntu 22.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```
For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#rocm-installation).
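To confirm the user-space installation before moving on to the driver, a quick check might look like the following. The `.info/version` file is created by the ROCm packages:
```bash
# Report the installed ROCm release and confirm the core package is present.
cat /opt/rocm/.info/version
dpkg -l | grep rocm-core
```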
### Install AMD GPU Driver (amdgpu)
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.1.1)
on Ubuntu 22.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```
For detailed installation instructions, refer to the [ROCm 7.1.1
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#amdgpu-driver-installation).
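The new kernel module only becomes active after a reboot (or a module reload). A minimal post-install check, once the node is back up, might look like this:
```bash
sudo reboot
# After the node comes back up:
dkms status | grep amdgpu   # The amdgpu module is built for the running kernel
lsmod | grep amdgpu         # The module is loaded
amd-smi list                # The GPUs are enumerated
```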
## Network verification and testing
Before deploying the inference engine, validate the health and performance of
the cluster interconnects.
### Verify network connectivity
Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface, and `benic1p1` through `benic8p1` are the
dedicated RoCE backend interfaces, use the following loop to test reachability
to a remote node (for instance, a target node with host IP suffix `.38`).
```bash
# From node A (host suffix .37), ping node B (host suffix .38) on each RoCE subnet 192.168.1.0/24 through 192.168.8.0/24
for i in {1..8}; do ping -c 1 192.168.${i}.38; done
```
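If you prefer a compact pass/fail summary instead of the full ping output, a small wrapper such as the following can help (same subnets and target suffix as above):
```bash
# Report reachability of the remote node (.38) on each RoCE subnet.
for i in {1..8}; do
  if ping -c 1 -W 1 "192.168.${i}.38" > /dev/null 2>&1; then
    echo "192.168.${i}.0/24: OK"
  else
    echo "192.168.${i}.0/24: UNREACHABLE"
  fi
done
```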
### Validate your RDMA setup
Confirm that all eight RDMA network interfaces are in the `UP` state and
correctly configured with the required MTU and GID settings.
#### Verify link status, MTU, NIC temperature, and NIC speed
```bash
sudo nicctl show port
```
The output should look something like this:
```bash
-------------------------------------------------------------------------------------
NIC : 42424650-4c32-3531-3530-314343000000 (0000:f6:00.0)
Port : 04908132-5d88-4242-4242-000011010000 (eth1/1)
Spec:
Ifindex : 0x11010000
Type : ETH
speed : 400G
Admin state : UP
FEC type : RS
Pause type : PFC
Number of lanes : 4
MTU : 9216
TX pause : enabled
RX pause : enabled
Auto negotiation : enabled
Status:
Physical port : 1
Operational status : UP
Link FSM state : UP
FEC type : RS
Cable type : Fiber
Number of lanes : 4
speed : 400G
Auto negotiation : disabled
MAC ID : 0
MAC channel : 0
MAC address : 04:90:81:32:5d:88
Transceiver type : QSFP_CMIS
Transceiver state : SPROM-READ
Transceiver PID : QSFP-400G-DR4
Transceiver temperature (in C) : 45
Transceiver warning temperature (in C) : 75
Transceiver alarm temperature (in C) : 80
-------------------------------------------------------------------------------------
```
#### Verify GID
Ensure each device has a valid GID mapped to its assigned IP address.
```bash
ibv_devinfo -v | grep GID
```
The output should look something like this:
```bash
GID[ 0]: fe80::690:81ff:fe30:a7a0, RoCE v2
GID[ 1]: ::ffff:192.168.7.36, RoCE v2
```
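To check every RDMA device at once instead of scanning the full `ibv_devinfo` output, you can loop over the device list and print only the RoCE v2 GID entries. A minimal sketch:
```bash
# Print the RoCE v2 GIDs for each RDMA device; each should include its backend IP.
for dev in $(ibv_devices | awk 'NR>2 {print $1}'); do
  echo "== $dev =="
  ibv_devinfo -v -d "$dev" | grep "RoCE v2"
done
```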
### Run RDMA bandwidth benchmarks
Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.
#### Install RDMA performance tools
To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:
```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```
#### Run a bandwidth test (GPU memory)
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts as
a server and the other acts as a client. Replace `<SERVER_IP>` with the
appropriate IP, and replace the `-d` argument with one of your RDMA devices (as
listed by `ibv_devinfo -l`).
```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a
# On Client Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
```
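To exercise all eight NIC/GPU pairs rather than a single device, you can launch one server/client pair per device on distinct control ports. This is a sketch that assumes GPU `i` is paired with RDMA device `ionic_i`; adjust the device names (see `ibv_devinfo -l`) and GPU indices to match your topology:
```bash
# On the server node: one ib_write_bw instance per device, each on its own TCP control port.
for i in {0..7}; do
  ./ib_write_bw --use_rocm=$i -d ionic_$i -p $((18515 + i)) --report_gbits -a &
done; wait
# On the client node: connect each instance to the matching port on the server.
for i in {0..7}; do
  ./ib_write_bw --use_rocm=$i -d ionic_$i -p $((18515 + i)) --report_gbits -a <SERVER_IP> &
done; wait
```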
## SGLang serving and MoRI unit tests
### Install Docker Engine
Install the Docker engine to manage the containerized vLLM and MoRI serving
environments.
```bash
sudo apt update && sudo apt install -y docker.io
sudo usermod -aG docker "$USER"
```
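The group change takes effect on the next login. A quick way to apply and verify it in the current shell:
```bash
# Apply the new group membership without logging out (alternatively, log out and back in).
newgrp docker
# Confirm that the current user can reach the Docker daemon.
docker info --format '{{.ServerVersion}}'
```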
### Launch the serving container
Deploy the SGLang MoRI serving container on each node.
```bash
CONTAINER_NAME=sglang_mori
IMAGE_NAME=rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-0113
docker run -it \
--rm \
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
--shm-size 128G \
--name ${CONTAINER_NAME} \
${IMAGE_NAME} /bin/bash
```
### Run MoRI inter-node unit tests
Before starting the vLLM service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.
The MoRI unit test uses two nodes as a minimal validation step before running
the full 1P2D (three-node) benchmark.
The key configuration variables are:
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `eth2`.
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```
```bash
# Set up environment inside the container
cd /app/mori  # Assumes the MoRI repo lives at /app/mori, matching the PYTHONPATH below
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
```
## End-to-end 1P2D performance testing
This section guides you through running distributed inference benchmarks using
the SGLang disagg recipe. For detailed implementation details, refer to the
[SGLang Disaggregation
Recipe](https://github.com/billishyahao/sglang_disagg/blob/9n_cluster/README.md).
### Download the model and setup your run environment
This performance test supports the following models:
* [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)
* [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
* [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
To set up your environment and download the models using the Hugging Face CLI,
use the following commands. Modify the `huggingface-cli download` command
to download the desired model.
```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub
# Download the model to the shared NFS mount point
# Replace 'deepseek-ai/DeepSeek-R1-0528' with your desired model
huggingface-cli download --token <your_hf_token> \
deepseek-ai/DeepSeek-R1-0528 \
--local-dir /mount/point/models/DeepSeek-R1
```
### Clone the SGLang disaggregation recipe
Clone the SGLang disaggregation repository to the shared file system and switch
to the appropriate branch:
```bash
git clone https://github.com/billishyahao/sglang_disagg.git
cd sglang_disagg
git checkout 9n_cluster
```
```{note}
In the 1P2D configuration, the prefill service and benchmark process run on the
same node, while remaining nodes handle decode services.
```
### Configure InfiniBand devices
Identify and configure the available InfiniBand devices.
1. List available devices using the following command.
```bash
ibv_devinfo -l
```
Example output:
```bash
8 HCAs found:
ionic_0
ionic_1
ionic_2
ionic_3
ionic_4
ionic_5
ionic_6
ionic_7
```
2. Update environment variables. Edit `set_env_vars.sh` and add the
comma-separated list of your system's IB devices. For example:
```bash
export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
```
### Configure the script and submit the job
1. To set the required configuration parameters, update the following
environment variables in `run_submit_disagg.sh` to match your cluster setup:
```bash
# SLURM Job Configuration
export SLURM_ACCOUNT="amd" # The account name for SLURM job accounting and resource allocation
export SLURM_PARTITION="compute" # The specific cluster partition (queue) to submit the job to
export TIME_LIMIT="24:00:00" # Maximum wall time for the job (Hours:Minutes:Seconds)
# Model Configuration
export MODEL_PATH="/nfsdata" # Base directory where the model weights are stored
export MODEL_NAME="DeepSeek-R1" # Specific model directory name (joined with MODEL_PATH)
export CONTAINER_IMAGE="rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-1224" # Docker image to use for the environment
# Cluster Topology (Disaggregation Setup)
export PREFILL_NODES=1 # Number of prefill nodes
export PREFILL_WORKERS=1 # Number of prefill workers
export DECODE_NODES=2 # Number of decode nodes
export DECODE_WORKERS=2 # Number of decode workers
# Benchmark/Workload Parameters
export ISL=1024 # Input Sequence Length (number of tokens in the prompt)
export OSL=1024 # Output Sequence Length (number of tokens to generate)
export CONCURRENCIES="2048" # Total number of concurrent requests to simulate in the benchmark. Can be a comma-separated list, such as "32,64,128"
export REQUEST_RATE="inf" # Request per second rate. "inf" means send all requests immediately
# Parallelism Strategies
export PREFILL_ENABLE_EP=true # Enable Expert Parallelism (EP) for the prefill phase
export PREFILL_ENABLE_DP=true # Enable Data Parallelism (DP) for the prefill phase
export DECODE_ENABLE_EP=true # Enable Expert Parallelism (EP) for the decode phase
export DECODE_ENABLE_DP=true # Enable Data Parallelism (DP) for the decode phase
```
2. Submit the batch job to the Slurm cluster:
```bash
bash ./run_submit_disagg.sh
```
### Log file analysis
1. After submission, retrieve the SLURM job ID:
```bash
squeue
```
Example output:
```bash
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123 compute 1p2d alice R 00:10:32 4 node[01-04]
```
2. A directory named `slurm_job-$SLURM_JOB_ID` is created in `/tmp` on each
participating node. The directory contains:
| Log File | Description |
| :--------| :-----------|
| `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` | Main service log per node |
| `decode_NODE${NODE_RANK}.log` | SGLang decode service details |
| `prefill_NODE${NODE_RANK}.log` | SGLang prefill service details |
3. The benchmark results are displayed in
   `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log`. Key metrics include the following (a quick log-extraction command follows this list):
```{note}
The following benchmark utility output is provided for reference only and
should not be used to compare performance. See the
[InferenceMAX](https://inferencemax.semianalysis.com/) website for validated
performance results.
```
``` bash
============ Serving Benchmark Result ============
Successful requests: 20480
Benchmark duration (s): 1194.25
Total input tokens: 20971520
Total generated tokens: 20971520
Request throughput (req/s): 17.15
Output token throughput (tok/s): 17560.38
Total Token throughput (tok/s): 35120.76
---------------Time to First Token----------------
Mean TTFT (ms): 21601.77
Median TTFT (ms): 24525.21
P99 TTFT (ms): 85417.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 92.41
Median TPOT (ms): 85.46
P99 TPOT (ms): 138.67
---------------Inter-token Latency----------------
Mean ITL (ms): 92.41
Median ITL (ms): 74.76
P99 ITL (ms): 263.07
----------------End-to-end Latency----------------
Mean E2EL (ms): 116133.48
Median E2EL (ms): 110349.39
P99 E2EL (ms): 227243.97
==================================================
```
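To pull just the headline metrics out of the log without scrolling through the full service output, a quick `grep` such as the following can help. Replace `<JOB_ID>` with your SLURM job ID; `NODE0` assumes the benchmark ran on node rank 0:
```bash
grep -E "Request throughput|token throughput|Mean TTFT|Mean TPOT|Mean ITL|Mean E2EL" \
  /tmp/slurm_job-<JOB_ID>/pd_sglang_bench_serving.sh_NODE0.log
```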
## Troubleshooting
The following section outlines common issues and their solutions.
### Bandwidth test fails with error
1. Use the ROCm-optimized `rdma-perftest` build, not the generic `perftest`:
```bash
which ib_write_bw
```
2. Confirm that `<SERVER_IP>` is reachable:
```bash
ping <SERVER_IP>
```
3. Check the system logs. Use `dmesg` to look for kernel-level errors:
``` bash
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
```
### Slurm job fails
Common causes and solutions for Slurm job submission failures include:
1. Shared storage access:
* Verify that both `sglang_disagg` and model directories are located in a shared NFS mount accessible to all compute nodes.
* Ensure proper permissions: `chmod -R 755 /shared/path/sglang_disagg /shared/path/models`
2. Log analysis:
* Examine `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` on each participating node for detailed error messages.
* Check for common issues like missing dependencies, GPU allocation failures, or network connectivity problems.
3. Configuration validation:
* Verify SLURM parameters in `run_submit_disagg.sh`:
* `SLURM_ACCOUNT`: Ensure your account has access to the cluster
* `SLURM_PARTITION`: Confirm the partition exists and is accessible
* `MODEL_PATH`: Check that the path is correct and accessible from compute nodes
* `MODEL_NAME`: Verify the model subdirectory exists within `MODEL_PATH`
* Use `sinfo` to check partition and node availability.

View File

@@ -0,0 +1,627 @@
# vLLM distributed inference with MoRI
This document provides a comprehensive guide for setting up a high-performance
vLLM serving environment on an AMD Instinct MI300X or MI325X GPU cluster using
the [MoRI (Modular RDMA Interface)](https://github.com/rocm/mori) communication
backend. It also includes detailed instructions on how to reproduce the
benchmark results published in the AMD ROCm blog [Practical, Fault-Robust
Distributed Inference for DeepSeek on AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
## Prerequisites
The following hardware configuration is required to implement this setup:
* **Nodes**: A minimum of two GPU nodes (virtual machines or physical machines)
for wide expert parallelism (EP) evaluation.
* **GPUs**: 8x AMD Instinct MI300X/MI325X GPU cards per node.
* **Networking**: 8x NVIDIA Mellanox ConnectX-7 (CX7) NICs per node, providing
a dedicated 1:1 mapping between GPUs and network interfaces for optimal
inter-node communication.
## System configuration
This section outlines infrastructure steps required to prepare your cluster for
high-performance AI workloads. It covers validating your system's software
baselines and firmware versions, configuring high-bandwidth backend networking
for inter-node communication, and establishing shared storage to ensure
a synchronized distributed computing environment.
### Verify baseline software
This setup has been validated using the **AI/ML Ready Image (ROCm 7-based)** on
Digital Ocean AMD GPU Droplets. The following table outlines the software
stack versions and appropriate shell commands for verification:
| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 24.04.3 LTS | `cat /etc/os-release` |
| **Kernel** | 6.8.0-87-generic | `uname -r` |
| **ROCm** | 7.0.2 | `amd-smi version` |
| **PLDM bundle (firmware) for MI300X** | 01.25.03.12 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **PLDM bundle (firmware) for MI325X** | 01.25.03.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **CX7 Firmware** | 28.46.3048 | `ethtool -i <IB device>` |
| **CX7 Driver** | 24.10-3.2.5 | `dkms status` |
| **DOCA** | 2.9.3 | `dpkg -l \| grep doca` |
### Verify best known configuration (BKC)
The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain best performance and compatibility.
While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.
To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:
1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
   firmware inventory; an optional `jq` shortcut for extracting just the version strings follows this list.
``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"
# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
-u "${AUTH}" -k | json_pp
# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
-u "${AUTH}" -k | json_pp
```
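If `jq` is installed, you can extract just the version strings instead of inspecting the full JSON payloads. This assumes the standard Redfish `Version` property on the firmware inventory resources:
```bash
# Print only the active BKC bundle and IFWI versions.
curl -s -k -u "${AUTH}" "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" | jq -r '.Version'
curl -s -k -u "${AUTH}" "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" | jq -r '.Version'
```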
### Run basic system health checks
Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI300X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi300x.html#basic-health-checks)
or [MI325X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi325x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.
### Configure your backend network (netplan)
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
the GPUs' eight network interface controllers (NICs) are `eth2` through `eth9`. Each NIC
must have its own subnet that is disjoint from the others. For example, `eth2`
could use `192.168.50.0/24`, `eth3` could use `192.168.51.0/24`, and so on.
Each node needs a unique IP address on each subnet. You should use the same
final octet in each subnet for a given node. For example, one node would have
the addresses `192.168.50.2`, `192.168.51.2`, and so on. Another node might
have `192.168.50.3`, `192.168.51.3`, and so on. Ensure MTU is set to `4200`.
```{note}
Ensure you identify the correct interface names for your system using `ip link`
before applying this configuration.
```
For example, your `/etc/netplan/50-backend.yaml` might include something like
the following:
```yaml
eth2:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.50.2/24
mtu: 4200
eth3:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.51.2/24
mtu: 4200
eth4:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.52.2/24
mtu: 4200
eth5:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.53.2/24
mtu: 4200
eth6:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.54.2/24
mtu: 4200
eth7:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.55.2/24
mtu: 4200
eth8:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.56.2/24
mtu: 4200
eth9:
dhcp4: false
dhcp6: false
link-local: []
addresses:
- 192.168.57.2/24
mtu: 4200
```
To apply the configuration, use the following command.
```bash
sudo netplan apply
```
To verify your configuration, use the following command.
```bash
sudo apt install -y net-tools && ip -br a
```
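You can also confirm that each backend interface reports the expected state and MTU. A minimal sketch, assuming the interfaces are `eth2` through `eth9`:
```bash
# Print the interface name, MTU, and operational state for each backend NIC.
for i in $(seq 2 9); do
  ip link show "eth$i" | awk 'NR==1 {print $2, "mtu", $5, "state", $9}'
done
```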
### Configure your network file system (NFS)
Setting up a shared NFS volume facilitates centralized storage for models,
recipes, and logs across the cluster. Use the following commands to install the
necessary client tools and mount the remote directory.
```{important}
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
server details and desired local mount path.
```
``` bash
sudo apt update && sudo apt install -y nfs-common
sudo mkdir -p /mount/point
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
```
### Configure static hostname resolution for backend initialization (optional)
If the high-speed RDMA/IB interfaces are used for the initial distributed
coordination (such as `MASTER_ADDR`), you must configure static hostname
resolution. This ensures that cluster host names resolve to the backend network
IPs rather than the management or local loopback addresses.
Follow these steps to configure static hostname resolution:
1. Edit `/etc/hosts` on all nodes: for example, using `sudo vim /etc/hosts`.
2. Add the backend IP and hostname mappings.
3. Comment out any default local mappings (such as `127.0.1.1`) for the current
hostname to avoid resolution conflicts.
For example, your `/etc/hosts` entries might look like:
```text
# Map host names to backend network IPs
192.168.50.2 mori_test_01
192.168.50.3 mori_test_02
# Comment out the default entry to ensure resolution via the backend IP
# 127.0.1.1 mori_test_01 mori_test_01
```
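To confirm the entries take effect, resolve the host names and check that they map to the backend IPs rather than the loopback address (host names as in the example above):
```bash
# Both names should resolve to their 192.168.50.x backend addresses.
getent hosts mori_test_01 mori_test_02
# Optionally confirm reachability over the backend network.
ping -c 1 mori_test_02
```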
## Software installation
Next, install the essential software stack required to operate the AMD Instinct
GPUs and high-speed networking components. Follow these steps to deploy the
NVIDIA DOCA drivers for Mellanox ConnectX-7 NICs, the ROCm software stack, and
the necessary kernel modules to enable hardware acceleration.
### Install CX7 driver and firmware
1. Download and install the `DOCA 2.9.3` driver following the instructions in
[NVIDIA DOCA 2.9.3
Downloads](https://developer.nvidia.com/doca-2-9-3-download-archive?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-all&Distribution=Ubuntu&version=24.04&installer_type=deb_local).
2. Download the appropriate firmware for your hardware PSID from the [NVIDIA
official website](https://network.nvidia.com/support/firmware/connectx7/)
and flash the device.
3. To verify driver and firmware versions, use the following command. Replace
`IB Device` with your specific backend interface.
```bash
ethtool -i <IB Device>
```
### Install ROCm
Use the following commands to quickly install ROCm 7.0.2 on Ubuntu 24.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install python3-setuptools python3-wheel
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
sudo apt install rocm
```
For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#rocm-installation).
### Install AMD GPU Driver (amdgpu)
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.0.2) on Ubuntu 24.04:
``` bash
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo apt install amdgpu-dkms
```
For detailed installation instructions, refer to the [ROCm 7.0.2
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#amdgpu-driver-installation).
## Network verification and testing
Before deploying the inference engine, validate the health and performance of
the cluster interconnects.
### Verify network connectivity
Verify that all network interfaces are reachable across the cluster nodes.
Assuming `eth0` is the management interface, `eth1` is for the VPC, and `eth2`
through `eth9` are the dedicated RoCE backend interfaces, use the following
loop to test reachability to a remote node (for instance, a target node with
host IP suffix `.3`).
```bash
# Test connectivity for RoCE subnets 192.168.50.x through 192.168.57.x
for i in {0..7}; do ping -c 1 192.168.5${i}.3; done
```
### Validate your RDMA setup
Confirm that all eight RDMA network interfaces are in the `UP` state. Verify the MTU
setting of `4096` and ensure each device has a valid GID mapped to its assigned
IP address.
``` bash
ibv_devinfo -v
```
The output should look something like this:
``` bash
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.46.3048
...
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
...
GID[ 0]: fe80:0000:0000:0000:d894:24ff:fe4a:96e2, RoCE v1
GID[ 1]: fe80::d894:24ff:fe4a:96e2, RoCE v2
GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:3903, RoCE v1
GID[ 3]: ::ffff:192.168.57.3, RoCE v2
```
### Run RDMA bandwidth benchmarks
Verify the inter-node RDMA performance to ensure the network fabric can
saturate the link bandwidth.
#### Install RDMA performance tools
To get started, build the ROCm-optimized `rdma-perftest` test suite from
source:
```bash
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
git clone https://github.com/ROCm/rdma-perftest
cd rdma-perftest/
./autogen.sh
./configure --enable-rocm --with-rocm=/opt/rocm
make -j$(nproc)
sudo make install
```
#### Run a bandwidth test (GPU memory)
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts
as a server and the other acts as a client. For 400G interfaces, the expected
peak throughput is approximately 390 Gbps. Replace `<SERVER_IP>` with the
appropriate IP.
```bash
# On Server Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a
# On Client Node
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
```
## vLLM serving and MoRI unit tests
### Install Docker Engine
Install the Docker engine to manage the containerized vLLM and MoRI serving
environments.
```bash
sudo apt update && sudo apt install -y docker.io
```
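Optionally, add your user to the `docker` group so the serving container can be launched without `sudo` (log out and back in, or run `newgrp docker`, for the change to take effect):
```bash
sudo usermod -aG docker "$USER"
```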
### Download the DeepSeek PTPC model
This guide uses the
[DeepSeek-R1-FP8-Dynamic](https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic)
model optimized for PTPC. Use the following commands to install the Hugging
Face CLI and download the model to your shared NFS directory:
```bash
# Set up a virtual environment and install the Hugging Face CLI
sudo apt update && sudo apt install -y python3-venv
python3 -m venv ~/venvs/hf
source ~/venvs/hf/bin/activate
pip install huggingface_hub
# Download the model to the shared NFS mount point
huggingface-cli download --token <your_hf_token> \
EmbeddedLLM/deepseek-r1-FP8-Dynamic \
--local-dir /mount/point/models/EmbeddedLLM/deepseek-r1-FP8-Dynamic
```
### Launch the serving container
Deploy the vLLM MoRI serving Docker container on each node.
```bash
CONTAINER_NAME=vllm_mori
IMAGE_NAME=aigmkt/vllm:mori_rocm6.4.1_20251105
docker run -it \
--rm \
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v /mount/point/models:/models \
--shm-size 128G \
--name ${CONTAINER_NAME} \
${IMAGE_NAME} /bin/bash
```
### Run MoRI inter-node unit tests
Before starting the vLLM service, run the MoRI unit test to verify that the
inter-node communication backend is correctly configured.
The key configuration variables are:
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `eth2`.
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
```{note}
You can find reference performance data in the [ROCm/MoRI
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
```
```bash
# Set up environment inside the container
cd /app/mori
export PYTHONPATH=/app/mori:$PYTHONPATH
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
# Node 0 (Primary)
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
# Node 1 (Secondary)
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
--master_addr="<MASTER_IP>" --master_port=1234 \
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
--cmd bench --kernel-type v1
```
### Deploy and serve the model
To deploy DeepSeek-R1 (PTPC) with Expert Parallelism 16 (EP16) across two
nodes, use the following serving scripts.
#### Create serving scripts
Create the following scripts inside the container on each node.
* Node 0 (master node): `ep16_node0.sh`
```bash
#!/bin/bash
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_LOGGING_LEVEL=INFO
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ALL2ALL_BACKEND=mori
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
-dp 16 \
--enable-expert-parallel \
--data-parallel-size-local 8 \
--data-parallel-address ${IP} \
--data-parallel-rpc-port 1212 \
--served-model-name deepseek \
--port 8777 \
--block-size 1 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--max-num-seqs 4096 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--trust-remote-code 2>&1 | tee serving_node0_ep16.log
```
* Node 1: `ep16_node1.sh`
```bash
#!/bin/bash
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_LOGGING_LEVEL=INFO
export VLLM_USE_V1=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ALL2ALL_BACKEND=mori
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
-dp 16 \
--enable-expert-parallel \
--headless \
--data-parallel-size-local 8 \
--data-parallel-start-rank 8 \
--data-parallel-address ${IP} \
--data-parallel-rpc-port 1212 \
--served-model-name deepseek \
--port 8777 \
--block-size 1 \
--distributed-executor-backend mp \
--gpu-memory-utilization 0.8 \
--max-model-len 8192 \
--max-num-batched-tokens 4096 \
--max-num-seqs 4096 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--trust-remote-code 2>&1 | tee serving_node1_ep16.log
```
#### Run the serving scripts
Run the scripts on each node to launch the distributed serving instance.
Replace `<MASTER_IP>` with the backend network IP of Node 0.
```bash
# On Node 0 (Primary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node0.sh
# On Node 1 (Secondary)
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
IP=<MASTER_IP> bash ep16_node1.sh
```
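Once both nodes finish loading the model, you can sanity-check the endpoint. This is a minimal sketch assuming the OpenAI-compatible server is reachable on port `8777` with the served model name `deepseek`, as configured in the scripts above:
```bash
# List the served models to confirm the API server is up.
curl http://<MASTER_IP>:8777/v1/models
# Send a small completion request against the served model name.
curl http://<MASTER_IP>:8777/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek", "prompt": "Hello, my name is", "max_tokens": 16}'
```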
## Reproducing performance
This section details how to reproduce the performance metrics published in the
AMD ROCm Blog: [Practical, Fault-Robust Distributed Inference for DeepSeek on
AMD
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
### Configuration for EP16 (16 GPUs)
To achieve the reported throughput, expert parallelism 16 (EP16) is used across
the decode nodes.
#### Benchmark target
* Decode throughput: ~12.4k output tokens/s per node.
### Performance reproduction commands
Use the following configurations to reproduce published performance metrics.
#### Decode benchmark
To reproduce the 12.4k output tokens/s, use the following configuration:
```bash
#!/bin/bash
MAX_CONCURRENCY=${1:-3072}
TIMES=2
NUM_PROMPTS=$((MAX_CONCURRENCY*TIMES))
vllm bench serve \
--max-concurrency $MAX_CONCURRENCY \
--num-prompts $NUM_PROMPTS \
--model /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
--served-model-name deepseek \
--port 8777 \
--ignore-eos \
--trust-remote-code \
--dataset-name random \
--seed 2025 \
--random-input-len 2048 \
--random-output-len 1024 2>&1 | tee bench_decode_${MAX_CONCURRENCY}_isl_2k_osl_1k.log
```
To calculate the per-node throughput for comparison with the blog data, take
the reported **Peak output token throughput (tok/s)** from the benchmark
results and divide it by the total number of nodes in the cluster.
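For example, assuming a two-node EP16 deployment and that your benchmark log contains a `Peak output token throughput (tok/s)` line, a quick (hypothetical) helper might look like this, using the log file name from the `tee` target above:
```bash
# Divide the peak output token throughput by the number of nodes (2 in this setup).
NODES=2
grep "Peak output token throughput" bench_decode_3072_isl_2k_osl_1k.log | \
  awk -F':' -v n="$NODES" '{gsub(/[^0-9.]/, "", $2); printf "Per-node throughput: %.1f tok/s\n", $2/n}'
```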
## Troubleshooting
The following section outlines common issues and their solutions.
### Bandwidth test fails with error
1. Use ROCm-optimized `rdma-perftest`, not the generic `perftest`.
``` bash
which ib_write_bw
```
2. Confirm the `SERVER_IP` is accessible.
``` bash
ping <SERVER_IP>
```
3. Check the system logs. Use `dmesg` to look for kernel-level errors.
``` bash
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
```
### vLLM EP 16 with MoRI backend fails to launch
1. If you see the error `Waiting for init message from front-end.`, check connectivity to the address set in `IP`. Disable the firewall/SELinux or allow traffic on port `1212`.
2. Verify server name resolution. Ensure server names are correctly mapped in `/etc/hosts`.
3. Confirm that the environment variable `GLOO_SOCKET_IFNAME` is set before running the vLLM serving script.

View File

@@ -26,6 +26,12 @@ training, fine-tuning, and inference. It leverages popular machine learning fram
- :doc:`SGLang inference performance testing <benchmark-docker/sglang>` - :doc:`SGLang inference performance testing <benchmark-docker/sglang>`
- :doc:`vLLM distributed inference with MoRI <benchmark-docker/vllm-mori-distributed>`
- :doc:`SGLang distributed inference with MoRI <benchmark-docker/sglang-mori-distributed>`
- :doc:`SGLang distributed inference with Mooncake <benchmark-docker/sglang-distributed>`
- :doc:`xDiT diffusion inference <xdit-diffusion-inference>` - :doc:`xDiT diffusion inference <xdit-diffusion-inference>`
- :doc:`Deploying your model <deploy-your-model>` - :doc:`Deploying your model <deploy-your-model>`

View File

@@ -52,7 +52,7 @@ accelerate training workloads:
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-megatron-lm-model-support-v25.11: .. _amd-megatron-lm-model-support-v26.01:
Supported models Supported models
================ ================
@@ -97,7 +97,7 @@ accelerate training workloads:
Some models, such as Llama, require an external license agreement through Some models, such as Llama, require an external license agreement through
a third party (for example, Meta). a third party (for example, Meta).
.. _amd-megatron-lm-performance-measurements-v25.11: .. _amd-megatron-lm-performance-measurements-v26.01:
Performance measurements Performance measurements
======================== ========================
@@ -129,7 +129,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your <rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration. system's configuration.
.. _mi300x-amd-megatron-lm-training-v25.11: .. _mi300x-amd-megatron-lm-training-v26.01:
Environment setup Environment setup
================= =================
@@ -138,7 +138,7 @@ Use the following instructions to set up the environment, configure the script t
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image. image.
.. _amd-megatron-lm-requirements-v25.11: .. _amd-megatron-lm-requirements-v26.01:
Download the Docker image Download the Docker image
------------------------- -------------------------
@@ -190,7 +190,7 @@ Download the Docker image
The Docker container hosts a verified commit of The Docker container hosts a verified commit of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__.
.. _amd-megatron-lm-environment-setup-v25.11: .. _amd-megatron-lm-environment-setup-v26.01:
Configuration Configuration
============= =============
@@ -200,39 +200,39 @@ Configuration
Update the ``train_llama3.sh`` configuration script in the ``examples/llama`` Update the ``train_llama3.sh`` configuration script in the ``examples/llama``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b .. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b
Update the ``train_llama2.sh`` configuration script in the ``examples/llama`` Update the ``train_llama2.sh`` configuration script in the ``examples/llama``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy .. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy
Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3`` Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b .. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b
Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2`` Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy .. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy
Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral`` Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral``
directory of directory of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run. `<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run.
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`. Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
.. note:: .. note::
See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v25.11>` for more information on configuration options. See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v26.01>` for more information on configuration options.
Multi-node configuration Multi-node configuration
------------------------ ------------------------
@@ -240,7 +240,7 @@ Multi-node configuration
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands. training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands.
.. _amd-megatron-lm-tokenizer-v25.11: .. _amd-megatron-lm-tokenizer-v26.01:
Tokenizer Tokenizer
--------- ---------
@@ -377,7 +377,7 @@ Download the dataset
``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer. ``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer.
Remember to either pre-download the tokenizer or setup Hugging Face access Remember to either pre-download the tokenizer or setup Hugging Face access
otherwise when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v25.11>` section. otherwise when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v26.01>` section.
.. note:: .. note::
@@ -479,13 +479,13 @@ Download the dataset
Ensure that the files are accessible inside the Docker container. Ensure that the files are accessible inside the Docker container.
.. _amd-megatron-lm-run-training-v25.11: .. _amd-megatron-lm-run-training-v26.01:
Run training Run training
============ ============
Use the following example commands to set up the environment, configure Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v25.11>`, and run training on :ref:`key options <amd-megatron-lm-benchmark-test-vars-v26.01>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment. MI300X Series GPUs with the AMD Megatron-LM environment.
Before starting training, export the following environment variables. Before starting training, export the following environment variables.
@@ -920,7 +920,7 @@ Single node training
RECOMPUTE_ACTIVATIONS=full \ RECOMPUTE_ACTIVATIONS=full \
CKPT_FORMAT=torch_dist CKPT_FORMAT=torch_dist
.. _amd-megatron-lm-multi-node-examples-v25.11: .. _amd-megatron-lm-multi-node-examples-v26.01:
Multi-node training examples Multi-node training examples
---------------------------- ----------------------------
@@ -971,7 +971,7 @@ training on 16 nodes, try the following command:
sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh
.. _amd-megatron-lm-benchmark-test-vars-v25.11: .. _amd-megatron-lm-benchmark-test-vars-v26.01:
Key options Key options
----------- -----------

View File

@@ -16,14 +16,23 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
- Components - Components
- Resources - Resources
* - v25.11 * - v26.1 (latest)
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus Megatron documentation <../primus-megatron>` * :doc:`Primus Megatron documentation <../primus-megatron>`
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>` * :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
* - v25.11
-
* ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1
-
* :doc:`Primus Megatron documentation <primus-megatron-v25.11>`
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.10>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
* - v25.10 * - v25.10
- -

View File

@@ -37,7 +37,7 @@ GPUs containing essential components, including PyTorch, ROCm libraries, and
Megatron-LM utilities. It contains the following software components to Megatron-LM utilities. It contains the following software components to
accelerate training workloads: accelerate training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -146,7 +146,7 @@ image.
Download the Docker image Download the Docker image
------------------------- -------------------------
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
1. Use the following command to pull the Docker image from Docker Hub. 1. Use the following command to pull the Docker image from Docker Hub.
@@ -811,7 +811,7 @@ Single node training
Note that DeepSeek-V2-Lite is experiencing instability due to GPU memory access fault Note that DeepSeek-V2-Lite is experiencing instability due to GPU memory access fault
for large iterations. for large iterations.
For stability, it's recommended to use Primus for this workload. For stability, it's recommended to use Primus for this workload.
See :doc:`primus-megatron`. See :doc:`../primus-megatron`.
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b .. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b

View File

@@ -25,10 +25,10 @@ model training. Performance acceleration is powered by `Primus Turbo
<https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be <https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__. deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks, The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including Megatron-LM and :doc:`torchtitan <primus-pytorch>`. including Megatron-LM and :doc:`torchtitan <../primus-pytorch>`.
Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM
training <megatron-lm>` workflow. To learn how to migrate workloads from training <../megatron-lm>` workflow. To learn how to migrate workloads from
Megatron-LM to Primus with Megatron, see Megatron-LM to Primus with Megatron, see
:doc:`megatron-lm-primus-migration-guide`. :doc:`megatron-lm-primus-migration-guide`.
@@ -36,7 +36,7 @@ AMD provides a ready-to-use Docker images for MI355X, MI350X,
MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and
Megatron-LM. Megatron-LM.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -63,7 +63,7 @@ The following models are pre-optimized for performance on AMD Instinct GPUs.
Some instructions, commands, and training examples in this documentation Some instructions, commands, and training examples in this documentation
might vary by model -- select one to get started. might vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -120,7 +120,7 @@ system's configuration.
Environment setup Environment setup
================= =================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
Use the following instructions to set up the environment, configure the script to train models, and Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on AMD Instinct GPUs. reproduce the benchmark results on AMD Instinct GPUs.
@@ -129,7 +129,7 @@ Environment setup
Pull the Docker image Pull the Docker image
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
@@ -175,7 +175,7 @@ Configuration
Primus defines a training configuration in YAML for each model in Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__. `examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
{% for model_group in model_groups %} {% for model_group in model_groups %}
@@ -805,7 +805,7 @@ To run training on multiple nodes, you can use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__ `run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
to launch the multi-node workload. Use the following steps to setup your environment: to launch the multi-node workload. Use the following steps to setup your environment:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
.. code-block:: shell .. code-block:: shell

View File

@@ -24,17 +24,17 @@ Primus now supports the PyTorch torchtitan backend.
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be <https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__. deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks, The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <primus-megatron>`. including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
Primus with the PyTorch torchtitan backend is designed to replace the Primus with the PyTorch torchtitan backend is designed to replace the
:doc:`ROCm PyTorch training <pytorch-training>` workflow. See :doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
:doc:`pytorch-training` to see steps to run workloads without Primus. :doc:`../pytorch-training` to see steps to run workloads without Primus.
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations. with Primus Turbo optimizations.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -61,7 +61,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325
Some instructions, commands, and training recommendations in this documentation might Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started. vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -96,7 +96,7 @@ vary by model -- select one to get started.
.. seealso:: .. seealso::
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the documentation :doc:`pytorch-training` (without Primus) see the documentation :doc:`../pytorch-training` (without Primus)
.. _amd-primus-pytorch-performance-measurements-v2510: .. _amd-primus-pytorch-performance-measurements-v2510:
@@ -122,7 +122,7 @@ doesnt test configurations and run conditions outside those described.
Pull the Docker image Pull the Docker image
===================== =====================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
Use the following command to pull the Docker image from Docker Hub. Use the following command to pull the Docker image from Docker Hub.
@@ -134,11 +134,11 @@ Run training
============ ============
Once the setup is complete, choose between the following two workflows to start benchmarking training. Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus). For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes). tweak some configurations (such as batch sizes).
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}

View File

@@ -0,0 +1,422 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
****************************************
Training a model with Primus and PyTorch
****************************************
.. caution::
This documentation does not reflect the latest version of ROCm Primus PyTorch training
performance benchmark documentation. See :doc:`../primus-pytorch` for the latest version.
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework designed to streamline training. It streamlines LLM
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
Primus now supports the PyTorch torchtitan backend.
.. note::
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
Primus with the PyTorch torchtitan backend is designed to replace the
:doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
:doc:`../pytorch-training` to see steps to run workloads without Primus.
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
.. tab-set::
.. tab-item:: {{ data.docker.pull_tag }}
:sync: {{ data.docker.pull_tag }}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in data.docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-primus-pytorch-model-support-v25.11:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. seealso::
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see :doc:`../pytorch-training` (without Primus).
.. _amd-primus-pytorch-performance-measurements-v25.11:
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
Pull the Docker image
=====================
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ data.docker.pull_tag }}
Run training
============
Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes).
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
{% set docker = data.docker %}
{% set model_groups = data.model_groups %}
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
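For a quick look at the collected results on the host, you can pretty-print the CSV --
a minimal sketch, assuming the standard ``column`` utility is available on your host
(the exact column names depend on your MAD version).
.. code-block:: shell
column -s, -t < ~/MAD/perf.csv | less -S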
{% endfor %}
{% endfor %}
.. tab-item:: Primus benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
.. rubric:: Download the Docker image and required packages
1. Pull the ``{{ docker.pull_tag }}`` Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ docker.pull_tag }}
2. Run the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository.
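If you want to confirm which Primus commit your container provides, you can check from
inside the container -- a minimal sketch, assuming ``git`` is available in the image.
.. code-block:: shell
cd /workspace/Primus
git log -1 --oneline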
.. rubric:: Prepare training datasets and dependencies
The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
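Optionally, verify that the token is picked up before starting a long run -- a minimal
sketch, assuming the ``huggingface-cli`` tool is installed in the container.
.. code-block:: shell
huggingface-cli whoami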
.. rubric:: Pretraining
To get started, navigate to the ``Primus`` directory in your container.
.. code-block:: shell
cd /workspace/Primus
Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
included with Primus, passing the appropriate options.
.. rubric:: Benchmarking examples
.. container:: model-doc primus_pyt_train_llama-3.1-8b
Use the following command to train Llama 3.1 8B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 6
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
To train Llama 3.1 8B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 7
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. container:: model-doc primus_pyt_train_llama-3.1-70b
Use the following command to train Llama 3.1 70B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 6
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh
To train Llama 3.1 70B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 5
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. container:: model-doc primus_pyt_train_deepseek-v3-16b
Use the following command to train DeepSeek V3 16B with BF16 precision using Primus with the torchtitan backend.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh --training.local_batch_size 10
.. tab-item:: MI300X
:sync: MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
bash examples/run_pretrain.sh
{% endfor %}
{% endfor %}
Further reading
===============
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.

View File

@@ -16,21 +16,30 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
- Components - Components
- Resources - Resources
* - v26.1 (latest)
-
* ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1
-
* :doc:`Primus PyTorch training documentation <../primus-pytorch>`
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
* - v25.11 * - v25.11
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus PyTorch Training documentation <../primus-pytorch>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.11>`
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.11>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
* - v25.10 * - v25.10
- -
* ROCm 7.1.0 * ROCm 7.1.0
* PyTorch 2.10.0.dev20251112+rocm7.1 * PyTorch 2.10.0.dev20251112+rocm7.1
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.10>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.10>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>`
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
@@ -40,7 +49,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
* Primus 0.3.0 * Primus 0.3.0
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7 * PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.9>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.9>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>`
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__ * `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__ * `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
@@ -50,7 +59,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
* ROCm 6.4.3 * ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406 * PyTorch 2.8.0a0+gitd06a406
- -
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.8>` * :doc:`Primus PyTorch training documentation <primus-pytorch-v25.8>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>` * :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__ * `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__

View File

@@ -30,7 +30,7 @@ environment for fine-tuning and pretraining a model on AMD Instinct MI325X
and MI300X GPUs. It includes the following software components to accelerate and MI300X GPUs. It includes the following software components to accelerate
training workloads: training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
.. tab-set:: .. tab-set::
@@ -58,7 +58,7 @@ MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
training recommendations in this documentation might vary by model -- select training recommendations in this documentation might vary by model -- select
one to get started. one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. raw:: html .. raw:: html
@@ -94,7 +94,7 @@ one to get started.
The following table lists supported training modes per model. The following table lists supported training modes per model.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}
.. dropdown:: Supported training modes .. dropdown:: Supported training modes
@@ -164,7 +164,7 @@ doesnt test configurations and run conditions outside those described.
Run training Run training
============ ============
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
{% set docker = data.docker %} {% set docker = data.docker %}
{% set model_groups = data.model_groups %} {% set model_groups = data.model_groups %}

View File

@@ -0,0 +1,669 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
**************************************
Training a model with PyTorch on ROCm
**************************************
.. caution::
This documentation does not reflect the latest version of ROCm PyTorch training
performance benchmark documentation. See :doc:`../pytorch-training` for the latest version.
.. note::
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
See :doc:`../primus-pytorch` for details.
PyTorch is an open-source machine learning framework that is widely used for
model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker image provides a prebuilt optimized
environment for fine-tuning and pretraining a model on AMD Instinct MI325X
and MI300X GPUs. It includes the following software components to accelerate
training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
.. tab-set::
.. tab-item:: {{ data.docker.pull_tag }}
:sync: {{ data.docker.pull_tag }}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in data.docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-pytorch-training-model-support-v25.11:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct
MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
training recommendations in this documentation might vary by model -- select
one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. _amd-pytorch-training-supported-training-modes-v25.11:
The following table lists supported training modes per model.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. dropdown:: Supported training modes
.. list-table::
:header-rows: 1
* - Model
- Supported training modes
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if model.training_modes %}
* - {{ model.model }}
- ``{{ model.training_modes | join('``, ``') }}``
{% endif %}
{% endfor %}
{% endfor %}
.. note::
Some model and fine-tuning combinations are not listed. This is
because the `upstream torchtune repository <https://github.com/pytorch/torchtune>`__
doesn't provide default YAML configurations for them.
For advanced usage, you can create a custom configuration to enable
unlisted fine-tuning methods by using an existing file in the
``/workspace/torchtune/recipes/configs`` directory as a template.
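As a sketch of that advanced workflow (the config and recipe names below are illustrative --
pick whichever file under ``/workspace/torchtune/recipes/configs`` is closest to your use case),
copy a template, edit it, and launch it with the matching torchtune recipe.
.. code-block:: shell
cd /workspace/torchtune
cp recipes/configs/llama3_1/8B_lora_single_device.yaml my_custom_config.yaml
# Edit my_custom_config.yaml, then run it with the corresponding recipe
tune run lora_finetune_single_device --config ./my_custom_config.yaml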
.. _amd-pytorch-training-performance-measurements-v25.11:
Performance measurements
========================
To evaluate performance, the
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
page provides reference throughput and latency measurements for training
popular AI models.
.. note::
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
Run training
============
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
{% set docker = data.docker %}
{% set model_groups = data.model_groups %}
Once the setup is complete, choose between two options to start benchmarking training:
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
{% endfor %}
{% endfor %}
.. tab-item:: Standalone benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following commands are tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
{% endfor %}
{% endfor %}
.. rubric:: Download the Docker image and required packages
1. Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ docker.pull_tag }}
2. Launch the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
3. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
repository and navigate to the benchmark scripts directory
``/workspace/MAD/scripts/pytorch_train``.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
.. rubric:: Prepare training datasets and dependencies
1. The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
2. Run the setup script to install libraries and datasets needed for benchmarking.
.. code-block:: shell
./pytorch_benchmark_setup.sh
.. container:: model-doc pyt_train_llama-3.1-8b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
.. container:: model-doc pyt_train_llama-3.1-70b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
* - ``torchdata``
- `TorchData <https://meta-pytorch.org/data/beta/index.html#torchdata>`__
* - ``tomli``
- `Tomli <https://pypi.org/project/tomli/>`__
* - ``tiktoken``
- `tiktoken <https://github.com/openai/tiktoken>`__
* - ``blobfile``
- `blobfile <https://pypi.org/project/blobfile/>`__
* - ``tabulate``
- `tabulate <https://pypi.org/project/tabulate/>`__
* - ``wandb``
- `Weights & Biases <https://github.com/wandb/wandb>`__
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
.. container:: model-doc pyt_train_flux
``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`__ 3.2.0
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
* - ``csvkit``
- `csvkit <https://csvkit.readthedocs.io/en/latest/>`__ 2.0.1
* - ``deepspeed``
- `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`__ 0.16.2
* - ``diffusers``
- `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`__ 0.31.0
* - ``GitPython``
- `GitPython <https://github.com/gitpython-developers/GitPython>`__ 3.1.44
* - ``opencv-python-headless``
- `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`__ 4.10.0.84
* - ``peft``
- `PEFT <https://huggingface.co/docs/peft/en/index>`__ 0.14.0
* - ``protobuf``
- `Protocol Buffers <https://github.com/protocolbuffers/protobuf>`__ 5.29.2
* - ``pytest``
- `PyTest <https://docs.pytest.org/en/stable/>`__ 8.3.4
* - ``python-dotenv``
- `python-dotenv <https://pypi.org/project/python-dotenv/>`__ 1.0.1
* - ``seaborn``
- `Seaborn <https://seaborn.pydata.org/>`__ 0.13.2
* - ``transformers``
- `Transformers <https://huggingface.co/docs/transformers/en/index>`__ 4.47.0
``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face:
* `frank-chieng/chinese_architecture_siheyuan <https://huggingface.co/datasets/frank-chieng/chinese_architecture_siheyuan>`__
{% for model_group in model_groups %}
{% for model in model_group.models %}
{% set training_modes = model.training_modes %}
{% set training_mode_descs = {
"pretrain": "Benchmark pre-training.",
"HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision."
} %}
{% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Pretraining
To start the pre-training benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
{% if model.mad_tag == "pyt_train_dlrm" %}
1. Go to the DLRM directory.
.. code-block:: shell
cd /workspace/DLRMBenchmark
2. To run the single node training benchmark for DLRM-v2 with TF32 precision,
run the following script.
.. code-block:: shell
./launch_training_single_node.sh
To run with MAD within the Docker container, use the following command.
.. code-block:: shell
./pytorch_benchmark_report.sh -t pretrain -m DLRM
{% else %}
.. code-block:: shell
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
{% if model.mad_tag == "pyt_train_flux" %}
.. container:: model-doc {{ model.mad_tag }}
.. note::
Currently, FLUX models are not supported out of the box in this Docker image.
To use FLUX, refer to the ``rocm/pytorch-training`` Docker documentation: :doc:`pytorch-training-v25.6`.
Occasionally, downloading the Flux dataset might fail. If this happens,
manually download it from Hugging Face at
`black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
and save it to ``/workspace/FluxBenchmark``. This ensures that the test script can access
the required dataset.
{% endif %}
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
{% endif %}
{% endif %}
{% set training_modes = model.training_modes %}
{% set training_mode_descs = {
"posttrain": "Benchmark post-training.",
} %}
{% set available_modes = training_modes | select("in", ["posttrain"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Post-training
To start the post-training benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
.. code-block:: shell
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
{% endif %}
{% set training_mode_descs = {
"finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).",
"finetune_lora": "LoRA fine-tuning (BF16 supported).",
"finetune_qlora": "QLoRA fine-tuning (BF16 supported).",
"HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.",
} %}
{% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Fine-tuning
To start the fine-tuning benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions,
as well as the :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`.
.. code-block:: shell
./pytorch_benchmark_report.sh -t $training_mode \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %}
- All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %}
* - ``$sequence_length``
- Between 2048 and 16384.
- Sequence length for the language model.
{% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %}
.. note::
For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B),
use the following torchtune commit for compatibility:
.. code-block:: shell
git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e
{% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %}
.. note::
You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of
input tensor should be smaller than max_seq_len (4096)``.
This error indicates that an input sequence is longer than the model's maximum context window.
Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096
tokens in this case). You can resolve this by truncating the input or splitting
it into smaller chunks before passing it to the model.
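When using this benchmark script, you can also avoid the error by requesting a
sequence length within the model's context window -- a minimal sketch (the training
mode follows the options table above; pick any mode supported by the model).
.. code-block:: shell
./pytorch_benchmark_report.sh -t $training_mode \
-m {{ model.model_repo }} \
-p BF16 \
-s 2048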
Note on reproducibility: The results in this guide are based on
commit ``b4c98ac`` from the upstream
`<https://github.com/pytorch/torchtune>`__ repository. For the
latest updates, you can use the main branch.
{% endif %}
{% endif %}
{% endfor %}
{% endfor %}
.. rubric:: Benchmarking examples
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
.. _amd-pytorch-training-multinode-examples-v25.11:
Multi-node training
-------------------
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training. See :ref:`rocm-for-ai-multi-node-setup-pyt-train-example` for example Slurm run commands.
Pre-training
~~~~~~~~~~~~
Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B.
To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository.
.. code-block:: shell
# In the MAD repository
cd scripts/pytorch_train
sbatch run_slurm_train.sh
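After submitting the job, you can monitor it with standard Slurm commands -- for
example (adjust to your cluster's conventions).
.. code-block:: shell
squeue -u $USER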
Fine-tuning
~~~~~~~~~~~
Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B.
To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository.
.. code-block:: shell
huggingface-cli login # Get access to HF Llama model space
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally
# In the MAD repository
cd scripts/pytorch_train
sbatch Torchtune_Multinode.sh
.. note::
Information regarding benchmark setup:
* By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``.
* You can adjust the torchtune `YAML configuration file
<https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml>`__
if you're using a different model.
* The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``.
* Set the ``mounting_paths`` inside the SLURM script.
Once the run is finished, you can find the log files in the ``result_torchtune/`` directory.
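For example, to inspect the most recent run -- a minimal sketch (log file names depend
on your job configuration).
.. code-block:: shell
ls -lt result_torchtune/
tail -n 50 "result_torchtune/$(ls -t result_torchtune/ | head -n 1)"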
Further reading
===============
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.

View File

@@ -47,7 +47,7 @@ Megatron-LM.
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-primus-megatron-lm-model-support-v25.11: .. _amd-primus-megatron-lm-model-support-v26.01:
Supported models Supported models
================ ================
@@ -108,7 +108,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your <rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration. system's configuration.
.. _mi300x-amd-primus-megatron-lm-training-v25.11: .. _mi300x-amd-primus-megatron-lm-training-v26.01:
Environment setup Environment setup
================= =================
@@ -118,7 +118,7 @@ Environment setup
Use the following instructions to set up the environment, configure the script to train models, and Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on AMD Instinct GPUs. reproduce the benchmark results on AMD Instinct GPUs.
.. _amd-primus-megatron-lm-requirements-v25.11: .. _amd-primus-megatron-lm-requirements-v26.01:
Pull the Docker image Pull the Docker image
@@ -157,16 +157,16 @@ Pull the Docker image
docker start primus_training_env docker start primus_training_env
docker exec -it primus_training_env bash docker exec -it primus_training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus The Docker container hosts verified commit ``9c529cd4`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository. <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167>`__ repository.
.. _amd-primus-megatron-lm-environment-setup-v25.11: .. _amd-primus-megatron-lm-environment-setup-v26.01:
Configuration Configuration
============= =============
Primus defines a training configuration in YAML for each model in Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/examples/megatron/configs>`__. `examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/examples/megatron/configs>`__.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
@@ -207,7 +207,7 @@ You can use either mock data or real data for training.
Ensure that the files are accessible inside the Docker container. Ensure that the files are accessible inside the Docker container.
.. _amd-primus-megatron-lm-tokenizer-v25.11: .. _amd-primus-megatron-lm-tokenizer-v26.01:
Tokenizer Tokenizer
--------- ---------
@@ -220,15 +220,7 @@ right permissions to access the tokenizer for each model.
# Export your HF_TOKEN in the workspace # Export your HF_TOKEN in the workspace
export HF_TOKEN=<your_hftoken> export HF_TOKEN=<your_hftoken>
.. note:: .. _amd-primus-megatron-lm-run-training-v26.01:
In Primus, each model uses a tokenizer from Hugging Face. For example, Llama
3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and
``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model
<https://github.com/AMD-AGI/Primus/blob/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs/llama3.1_8B-pretrain.yaml>`__
definition.
.. _amd-primus-megatron-lm-run-training-v25.11:
Run training Run training
============ ============
@@ -252,7 +244,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.3 70B. The following run commands are tailored to Llama 3.3 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.3 70B BF16, run: To run pre-training for Llama 3.3 70B BF16, run:
@@ -263,8 +255,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -276,14 +270,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.3_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 8B. The following run commands are tailored to Llama 3.1 8B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.1 8B FP8, run: To run pre-training for Llama 3.1 8B FP8, run:
@@ -294,8 +290,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -307,8 +305,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
For Llama 3.1 8B BF16, use the following command: For Llama 3.1 8B BF16, use the following command:
@@ -319,8 +319,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -332,14 +334,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 70B. The following run commands are tailored to Llama 3.1 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 3.1 70B BF16, run: To run pre-training for Llama 3.1 70B BF16, run:
@@ -350,8 +354,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -363,8 +369,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
To run the training on a single node for Llama 3.1 70B FP8, use the following command. To run the training on a single node for Llama 3.1 70B FP8, use the following command.
@@ -381,8 +389,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -394,18 +404,20 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh \ --log_file /tmp/primus_llama3.1_70B_fp8_proxy.log \
--train_iters 50 \ -- train pretrain \
--num_layers 40 \ --config examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
--fp8 hybrid \ --train_iters 50 \
--no_fp8_weight_transpose_cache true --num_layers 40 \
--fp8 hybrid \
--no_fp8_weight_transpose_cache true
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b .. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 7B. The following run commands are tailored to Llama 2 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 2 7B FP8, run: To run pre-training for Llama 2 7B FP8, run:
@@ -416,8 +428,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -429,8 +443,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml
To run pre-training for Llama 2 7B BF16, run: To run pre-training for Llama 2 7B BF16, run:
@@ -441,8 +457,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -454,14 +472,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b .. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 70B. The following run commands are tailored to Llama 2 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run pre-training for Llama 2 70B BF16, run: To run pre-training for Llama 2 70B BF16, run:
@@ -472,8 +492,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -485,14 +507,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash ./examples/run_pretrain.sh --log_file /tmp/primus_llama2_70B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy .. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V3. The following run commands are tailored to DeepSeek-V3.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy, To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy,
use the following command: use the following command:
@@ -504,13 +528,15 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_deepseek_v3_proxy.log \
--num_layers 3 \ -- train pretrain \
--moe_layer_freq 1 \ --config examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \
--train_iters 50 \ --num_layers 3 \
--micro_batch_size 8 \ --moe_layer_freq 1 \
--global_batch_size 64 --train_iters 50 \
--micro_batch_size 8 \
--global_batch_size 64
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -522,17 +548,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_deepseek_v3_proxy.log \
--num_layers 3 \ -- train pretrain \
--moe_layer_freq 1 \ --config examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \
--train_iters 50 --num_layers 3 \
--moe_layer_freq 1 \
--micro_batch_size 3 \
--global_batch_size 192 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b .. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V2-Lite. The following run commands are tailored to DeepSeek-V2-Lite.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16, To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
use the following command: use the following command:
@@ -544,8 +574,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v2_lite.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -557,14 +589,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v2_lite.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b .. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x7B. The following run commands are tailored to Mixtral 8x7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Mixtral 8x7B (MoE with expert parallel), To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
use the following command: use the following command:
@@ -576,8 +610,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_mixtral_8x7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -589,15 +625,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_mixtral_8x7B.log \
--train_iters 50 -- train pretrain \
--config examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy .. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x22B. The following run commands are tailored to Mixtral 8x22B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy, To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy,
use the following command: use the following command:
@@ -609,8 +646,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_mixtral_8x22B_proxy.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -622,19 +661,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_mixtral_8x22B_proxy.log \
--train_iters 50 \ -- train pretrain \
--num_layers 4 \ --config examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \
--pipeline_model_parallel_size 1 \ --num_layers 4 \
--micro_batch_size 1 \ --pipeline_model_parallel_size 1 \
--global_batch_size 16 --micro_batch_size 1 \
--global_batch_size 16 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b .. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 7B. The following run commands are tailored to Qwen 2.5 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Qwen 2.5 7B BF16, use the following To run training on a single node for Qwen 2.5 7B BF16, use the following
command: command:
@@ -646,8 +687,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -659,8 +702,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml
For FP8, use the following command. For FP8, use the following command.
@@ -671,8 +716,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -684,14 +731,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_7B_fp8.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b .. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 72B. The following run commands are tailored to Qwen 2.5 72B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To run training on a single node for Qwen 2.5 72B BF16, use the following command. To run training on a single node for Qwen 2.5 72B BF16, use the following command.
@@ -702,11 +751,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
.. code-block:: shell .. code-block:: shell
EXP=examples/megatron/configs/MI355X/qwen2.5_72B-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh \ --log_file /tmp/primus_qwen2.5_72B.log \
--train_iters 50 \ -- train pretrain \
--micro_batch_size 16 \ --config examples/megatron/configs/MI355X/qwen2.5_72B-BF16-pretrain.yaml
--global_batch_size 256
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI325X and MI300X :sync: MI325X and MI300X
@@ -718,10 +766,12 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
export NVTE_CK_IS_V3_ATOMIC_FP32=1 export NVTE_CK_IS_V3_ATOMIC_FP32=1
EXP=examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_qwen2.5_72B.log \
-- train pretrain \
--config examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml
.. _amd-primus-megatron-multi-node-examples-v25.11: .. _amd-primus-megatron-multi-node-examples-v26.01:
Multi-node training examples Multi-node training examples
---------------------------- ----------------------------
@@ -730,7 +780,7 @@ Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure y
training. training.
To run training on multiple nodes, you can use the To run training on multiple nodes, you can use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__ `run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/9c529cd4a934a68a880ede036c3e97b792e38167/examples/run_slurm_pretrain.sh>`__
to launch the multi-node workload. Use the following steps to set up your environment: to launch the multi-node workload. Use the following steps to set up your environment:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml .. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
@@ -763,13 +813,13 @@ to launch the multi-node workload. Use the following steps to setup your environ
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect them. However, since NICs can vary across clusters, it is recommended to explicitly export the NCCL parameters for your cluster (see the example after this list). * If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect them. However, since NICs can vary across clusters, it is recommended to explicitly export the NCCL parameters for your cluster (see the example after this list).
* To find your network interface, you can use ``ip a``. * To find your network interface, you can use ``ip a``.
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices. * To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices.
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v25.11`) as appropriate. * Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v26.01`) as appropriate.
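For reference, a minimal sketch of the exports described in the list above. The interface and HCA names are placeholders, not values from this guide -- substitute what ``ip a`` and ``ibv_devices`` report on your cluster:

.. code-block:: shell

   # Placeholder NIC and RDMA device names -- replace with the values
   # reported by `ip a` and `ibv_devices` on your nodes.
   export NCCL_SOCKET_IFNAME=ens51np0
   export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3

   # Required by the multi-node launcher; use your own image tag and token.
   export DOCKER_IMAGE=<training Docker image>
   export HF_TOKEN=<your Hugging Face token>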
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b .. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 8B. The following run commands are tailored to Llama 3.1 8B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.1 8B FP8 on 8 nodes, run: To train Llama 3.1 8B FP8 on 8 nodes, run:
@@ -786,7 +836,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 7B. The following run commands are tailored to Llama 2 7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 2 7B FP8 on 8 nodes, run: To train Llama 2 7B FP8 on 8 nodes, run:
@@ -803,7 +853,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 70B. The following run commands are tailored to Llama 3.1 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.1 70B FP8 on 8 nodes, run: To train Llama 3.1 70B FP8 on 8 nodes, run:
@@ -833,7 +883,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 70B. The following run commands are tailored to Llama 2 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 2 70B FP8 on 8 nodes, run: To train Llama 2 70B FP8 on 8 nodes, run:
@@ -863,7 +913,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.3 70B. The following run commands are tailored to Llama 3.3 70B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Llama 3.3 70B FP8 on 8 nodes, run: To train Llama 3.3 70B FP8 on 8 nodes, run:
@@ -893,7 +943,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x7B. The following run commands are tailored to Mixtral 8x7B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Mixtral 8x7B BF16 on 8 nodes, run: To train Mixtral 8x7B BF16 on 8 nodes, run:
@@ -911,7 +961,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
Once setup is complete, run the appropriate training command. Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 72B. The following run commands are tailored to Qwen 2.5 72B.
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
To train Qwen2.5 72B FP8 on 8 nodes, run: To train Qwen2.5 72B FP8 on 8 nodes, run:
@@ -926,7 +976,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
--global_batch_size 512 \ --global_batch_size 512 \
--recompute_num_layers 80 \ --recompute_num_layers 80 \
.. _amd-primus-megatron-lm-benchmark-test-vars-v25.11: .. _amd-primus-megatron-lm-benchmark-test-vars-v26.01:
Key options Key options
----------- -----------

View File

@@ -45,7 +45,7 @@ with Primus Turbo optimizations.
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-primus-pytorch-model-support-v25.11: .. _amd-primus-pytorch-model-support-v26.01:
Supported models Supported models
================ ================
@@ -91,7 +91,7 @@ vary by model -- select one to get started.
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the :doc:`pytorch-training` documentation (without Primus). see the :doc:`pytorch-training` documentation (without Primus).
.. _amd-primus-pytorch-performance-measurements-v25.11: .. _amd-primus-pytorch-performance-measurements-v26.01:
System validation System validation
================= =================
@@ -146,7 +146,7 @@ tweak some configurations (such as batch sizes).
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}. The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine. directory and install the required packages on the host machine.
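As a sketch, the clone-and-install step usually looks like the following; the requirements file name is an assumption, so defer to the MAD repository README if it differs:

.. code-block:: shell

   git clone https://github.com/ROCm/MAD
   cd MAD
   pip install -r requirements.txt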
@@ -184,7 +184,7 @@ tweak some configurations (such as batch sizes).
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}. The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model. See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.
.. rubric:: Download the Docker image and required packages .. rubric:: Download the Docker image and required packages
@@ -220,8 +220,8 @@ tweak some configurations (such as batch sizes).
docker start training_env docker start training_env
docker exec -it training_env bash docker exec -it training_env bash
The Docker container hosts verified commit ``c4c083de`` of the `Primus The Docker container hosts verified commit ``9c529cd4`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository. <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/>`__ repository.
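To confirm the Primus commit inside a running container (a quick check, assuming the repository is checked out at ``/workspace/Primus`` as used in the run commands above):

.. code-block:: shell

   cd /workspace/Primus
   git rev-parse HEAD   # expect the hash to start with 9c529cd4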
.. rubric:: Prepare training datasets and dependencies .. rubric:: Prepare training datasets and dependencies
@@ -257,24 +257,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 6 --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
--training.local_batch_size 6
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
To train Llama 3.1 8B with FP8 precision, use the following command. To train Llama 3.1 8B with FP8 precision, use the following command.
@@ -285,24 +292,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 7 --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
--training.local_batch_size 7
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_8B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_train_llama-3.1-70b .. container:: model-doc primus_pyt_train_llama-3.1-70b
@@ -315,24 +329,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 6 --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
--training.local_batch_size 6
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
To train Llama 3.1 70B with FP8 precision, use the following command. To train Llama 3.1 70B with FP8 precision, use the following command.
@@ -343,24 +364,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 5 --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
--training.local_batch_size 5
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_llama3.1_70B_fp8.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml
.. container:: model-doc primus_pyt_train_deepseek-v3-16b .. container:: model-doc primus_pyt_train_deepseek-v3-16b
@@ -373,24 +401,31 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml
.. tab-item:: MI325X .. tab-item:: MI325X
:sync: MI325X :sync: MI325X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --training.local_batch_size 10 --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
--training.local_batch_size 10
.. tab-item:: MI300X .. tab-item:: MI300X
:sync: MI300X :sync: MI300X
.. code-block:: shell .. code-block:: shell
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \ bash runner/primus-cli direct \
bash examples/run_pretrain.sh --log_file /tmp/primus_deepseek_v3_16b.log \
-- train pretrain \
--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml
{% endfor %} {% endfor %}
{% endfor %} {% endfor %}

View File

@@ -43,7 +43,7 @@ training workloads:
- {{ component_version }} - {{ component_version }}
{% endfor %} {% endfor %}
.. _amd-pytorch-training-model-support-v25.11: .. _amd-pytorch-training-model-support-v26.01:
Supported models Supported models
================ ================
@@ -85,7 +85,7 @@ one to get started.
</div> </div>
</div> </div>
.. _amd-pytorch-training-supported-training-modes-v25.11: .. _amd-pytorch-training-supported-training-modes-v26.01:
The following table lists supported training modes per model. The following table lists supported training modes per model.
@@ -120,7 +120,7 @@ The following table lists supported training modes per model.
unlisted fine-tuning methods by using an existing file in the unlisted fine-tuning methods by using an existing file in the
``/workspace/torchtune/recipes/configs`` directory as a template. ``/workspace/torchtune/recipes/configs`` directory as a template.
.. _amd-pytorch-training-performance-measurements-v25.11: .. _amd-pytorch-training-performance-measurements-v26.01:
Performance measurements Performance measurements
======================== ========================
@@ -176,7 +176,7 @@ Run training
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}. The following run command is tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model. See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine. directory and install the required packages on the host machine.
@@ -214,7 +214,7 @@ Run training
.. container:: model-doc {{ model.mad_tag }} .. container:: model-doc {{ model.mad_tag }}
The following commands are tailored to {{ model.model }}. The following commands are tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model. See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.
{% endfor %} {% endfor %}
{% endfor %} {% endfor %}
@@ -409,6 +409,10 @@ Run training
{% if model.mad_tag == "pyt_train_dlrm" %} {% if model.mad_tag == "pyt_train_dlrm" %}
.. note::
DLRM is supported on MI300X, MI325X, MI350X, and MI355X GPUs.
1. Go to the DLRM directory. 1. Go to the DLRM directory.
.. code-block:: shell .. code-block:: shell
@@ -532,7 +536,7 @@ Run training
To start the fine-tuning benchmark, use the following command with the To start the fine-tuning benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions. appropriate options. See the following list of options and their descriptions.
See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`. See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v26.01>`.
.. code-block:: shell .. code-block:: shell
@@ -597,7 +601,7 @@ Run training
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__. For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
.. _amd-pytorch-training-multinode-examples-v25.11: .. _amd-pytorch-training-multinode-examples-v26.01:
Multi-node training Multi-node training
------------------- -------------------

View File

@@ -119,6 +119,10 @@ subtrees:
title: PyTorch inference performance testing title: PyTorch inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
title: SGLang inference performance testing title: SGLang inference performance testing
- file: how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed.md
title: vLLM distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed.md
title: SGLang distributed inference with MoRI
- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
title: SGLang distributed inference with Mooncake title: SGLang distributed inference with Mooncake
- file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst - file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst