Mirror of https://github.com/ROCm/ROCm.git (synced 2026-02-01 01:45:18 -05:00)

Compare commits: `hipdnn...docs/7.2.0` (17 commits)

| SHA1 |
|---|
| 811188dc13 |
| ec36bc9971 |
| cd208e7d74 |
| af8ea73581 |
| f1c86d7d29 |
| 370816001e |
| d5994da509 |
| c02f86c0e7 |
| d3523c24d3 |
| 1980239b81 |
| c75fd6f532 |
| 72cb598190 |
| 9b55b77aaa |
| 8267303e1d |
| 86d2c4e891 |
| 2977e35330 |
| e95955f572 |
```diff
@@ -39,6 +39,7 @@ autograd
 Backported
 BARs
 BatchNorm
+BKC
 BLAS
 BMC
 BabelStream
@@ -53,6 +54,7 @@ CDNA
 CGUI
 CHTML
 CIFAR
+CNP
 CLI
 CLion
 CMake
@@ -96,6 +98,7 @@ Dashboarding
 Dataloading
 dataflows
 DBRX
+DCQCN
 DDR
 DF
 DGEMM
@@ -110,8 +113,10 @@ DMA
 DOMContentLoaded
 DNN
 DNNL
+DOCA
 DPM
 DRI
+DSCP
 DW
 DWORD
 Dask
@@ -127,7 +132,9 @@ Deprecations
 DevCap
 DirectX
 Disaggregated
+disagg
 disaggregated
+disaggregation
 Dockerfile
 Dockerized
 Doxygen
@@ -179,6 +186,8 @@ GFLOPS
 GFortran
 GFXIP
 GGUF
+GID
+Gbps
 Gemma
 GiB
 GIM
@@ -248,6 +257,7 @@ IOP
 IOPS
 IOPM
 IOV
+IPs
 IRQ
 ISA
 ISV
@@ -312,6 +322,7 @@ MNIST
 MPI
 MPT
 MSVC
+MTU
 mul
 MVAPICH
 MVFFR
@@ -334,6 +345,7 @@ MLA
 MosaicML
 MoEs
 Mooncake
+MoRI
 Mpops
 Multicore
 multihost
@@ -403,16 +415,21 @@ PEQT
 PIL
 PILImage
 PJRT
+PLDM
 POR
 PRNG
 PRs
+PSID
+PTPC
 PaLM
 Pageable
 PeerDirect
+Pensando
 PerfDb
 Perfetto
 PipelineParallel
 PnP
+Pollara
 PowerEdge
 PowerShell
 Pretrained
@@ -424,6 +441,7 @@ Pytest
 PyTorch
 QPS
 Qcycles
+QoS
 Qwen
 RAII
 RAS
@@ -457,6 +475,7 @@ RPP
 RST
 RW
 Radeon
+Redfish
 RelWithDebInfo
 Req
 Rickle
@@ -724,6 +743,7 @@ enqueue
 env
 epilog
 etcetera
+eth
 ethernet
 exascale
 executables
@@ -819,6 +839,7 @@ llvm
 lm
 localscratch
 logits
+loopback
 lossy
 macOS
 matchers
@@ -844,6 +865,7 @@ nanoGPT
 NCS
 NOP
 NVLink
+netplan
 num
 numref
 ocl
@@ -911,6 +933,7 @@ rc
 rccl
 rdc
 rdma
+reachability
 reStructuredText
 redirections
 refactorization
@@ -980,6 +1003,7 @@ shader
 sharding
 sigmoid
 sles
+slurm
 sm
 smi
 softmax
```
```diff
@@ -199,7 +199,7 @@ for a complete overview of this release.
 * fftw_execute_dft_c2r
 * fftwf_execute_dft_c2r
 
-### **HIPIFY** (22.2.0)
+### **HIPIFY** (22.0.0)
 
 #### Added
 
@@ -279,6 +279,10 @@ for a complete overview of this release.
 
 * Updated clang/llvm to AMD clang version 22.0.0 (equivalent to LLVM 22.0.0 with additional out-of-tree patches).
 
+#### Upcoming changes
+
+* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
+
 ### **MIGraphX** (2.15.0)
 
 #### Added
```
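The HIPCC deprecation note in the hunk above implies that existing `hipcc` build lines can be switched to the bundled AMD Clang driver. The following is a minimal sketch of such an invocation; the `/opt/rocm` install prefix and the `gfx942` offload target are assumptions for illustration, not part of the release notes.

```bash
# Before: compile a HIP source with the deprecated wrapper.
hipcc -O2 square.hip -o square

# After: invoke AMD Clang directly (assumes ROCm is installed under /opt/rocm
# and a gfx942 GPU; adjust --offload-arch for your hardware).
/opt/rocm/bin/amdclang++ -x hip --offload-arch=gfx942 -O2 square.hip -o square
```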
RELEASE.md (25 changes)
```diff
@@ -139,7 +139,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
 </td>
 </tr>
 <tr>
-<td>MI300X</td>
+<td>MI300X<a href="#footnote2"><sup>[2]</sup></a></td>
 <td>01.25.03.12</td>
 <td rowspan="6" style="vertical-align: middle;">
 30.30.0<br>
@@ -180,6 +180,7 @@ GPU and baseboard firmware versioning might differ across GPU families.
 </div>
 
 <p id="footnote1">[1]: For AMD Instinct MI325X KVM SR-IOV users, don't use AMD GPU driver (amdgpu) 30.20.0.</p>
+<p id="footnote2">[2]: AMD Instinct MI300X KVM SR-IOV with Multi-VF (8 VF) support requires a compatible firmware BKC bundle, which will be released in the coming months.</p>
 
 #### Node power management for multi-GPU nodes added
 
```
```diff
@@ -245,7 +246,7 @@ New Stream Management API `hipStreamCopyAttributes` is implemented for CUDA Pari
 
 The rocSHMEM communications library has added the GDA (GPUDirect Async) intra-node and inter-node communication backend conduit. This new backend enables communication between GPUs within a node or between nodes through a RNIC (RDMA NIC) using device-initiated GPU kernels to communicate with other GPUs. The GPU directly interacts with the RNIC with no host (CPU) involvement in the critical path of communication.
 
-In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/latest/install.html).
+In addition to the already supported GDA NIC types, Mellanox CX-7 and Broadcom Thor2, ROCm 7.2.0 introduces support for AMD Pensando AI NIC installed with the corresponding driver and firmware versions that support GDA functionality. For more information, see [Installing rocSHMEM](https://rocm.docs.amd.com/projects/rocSHMEM/en/docs-7.2.0/install.html).
 
 ### Software-managed plan cache support for hipTensor
 
```
````diff
@@ -285,7 +286,7 @@ MIGraphX has the following enhancements:
 
 ### AMDGPU wavefront size macro removal
 
-The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/latest/how-to/hip_cpp_language_extensions.html#warpsize).
+The `__AMDGCN_WAVEFRONT_SIZE` and `__AMDGCN_WAVEFRONT_SIZE__` macros, which provided a compile-time-constant wavefront size, are removed. Where required, the wavefront size should instead be queried using the warpSize variable in device code, or using `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant. For more information, see [warpSize](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/hip_cpp_language_extensions.html#warpsize).
 For cases where compile-time evaluation of the wavefront size cannot be avoided, uses of `__AMDGCN_WAVEFRONT_SIZE` or `__AMDGCN_WAVEFRONT_SIZE__` can be replaced with a user-defined macro or `constexpr` variable with the wavefront size(s) for the target hardware. For example:
 
 ```
````
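The release notes' own code example is cut off by the diff context above; the following is a minimal sketch of the replacement pattern it describes, assuming the code only targets hardware whose wavefront size is fixed and known at build time (for example, 64 on CDNA GPUs). The macro name `MY_WAVEFRONT_SIZE` is illustrative, not part of the HIP API.

```cpp
#include <hip/hip_runtime.h>

// Build-time constant chosen for the target hardware. This replaces the removed
// __AMDGCN_WAVEFRONT_SIZE macro and must be kept in sync with the GPUs you
// actually compile for.
#ifndef MY_WAVEFRONT_SIZE
#define MY_WAVEFRONT_SIZE 64
#endif

constexpr unsigned kWavefrontSize = MY_WAVEFRONT_SIZE;

__global__ void reduce_block(const float* in, float* out) {
    // Assumes the kernel is launched with blockDim.x == kWavefrontSize.
    // Shared-memory sizing needs a compile-time constant, which is why the
    // runtime warpSize variable cannot be used here.
    __shared__ float scratch[kWavefrontSize];
    scratch[threadIdx.x] = in[blockIdx.x * kWavefrontSize + threadIdx.x];
    __syncthreads();
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned i = 0; i < kWavefrontSize; ++i) sum += scratch[i];
        out[blockIdx.x] = sum;
    }
}
```

At runtime, the actual wavefront size can still be checked against this constant through the `warpSize` field returned by `hipGetDeviceProperties`, so the code can fail fast on unexpected hardware.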
```diff
@@ -349,13 +350,13 @@ The ROCm Offline Installer Creator 7.2.0 includes the following features and imp
 * Fixes for Oracle Linux 10.0 ROCm and driver minimum mode installer creation.
 * Added support for creating an offline installer for Oracle Linux 8, 9, and 10, where the kernel version of the target OS differs from the host OS creating the installer.
 
-See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information.
+See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) for more information.
 
 ### ROCm Runfile Installer updates
 
 The ROCm Runfile Installer 7.2.0 includes fixes for rocm-examples test script build issues.
 
-For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html).
+For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html).
 
 ### Expansion of the ROCm examples repository
 
@@ -375,7 +376,7 @@ Usage examples are now available for the [ROCgdb](https://github.com/ROCm/rocm-e
 
 ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.
 
-* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/latest/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html) set continues to provide detailed information, tutorials, and reference content.
+* The newest resource for ROCm and HIP developers is the [AMD ROCm Programming Guide](https://rocm-handbook.amd.com/projects/amd-rocm-programming-guide/en/docs-7.2.0/). This guide introduces the core concepts, APIs, and best practices for programming with ROCm and the HIP programming language. It provides hands-on guidance for writing GPU kernels, managing memory, optimizing performance, and integrating HIP with the broader AMD ROCm ecosystem of tools and libraries. The [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/index.html) set continues to provide detailed information, tutorials, and reference content.
 
 * The HIP Programming Guide section includes a new topic titled [“Understanding GPU performance”](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/performance_optimization.html). It explains the theoretical foundations of GPU performance on AMD hardware. Understanding these concepts helps you analyze performance characteristics, identify bottlenecks, and make informed optimization decisions. Two other topics in this guide have been enhanced: [Performance guidelines](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/how-to/performance_guidelines.html) and [Hardware implementation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/understand/hardware_implementation.html).
 
```
```diff
@@ -913,7 +914,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
 * fftw_execute_dft_c2r
 * fftwf_execute_dft_c2r
 
-### **HIPIFY** (22.2.0)
+### **HIPIFY** (22.0.0)
 
 #### Added
 
@@ -995,7 +996,7 @@ For a historical overview of ROCm component updates, see the {doc}`ROCm consolid
 
 #### Upcoming changes
 
-* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/latest/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/latest/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
+* As of ROCm 7.2.0, the [HIPCC](https://rocm.docs.amd.com/projects/HIPCC/en/docs-7.2.0/index.html) compiler is deprecated. HIPCC now invokes [AMD Clang](https://rocm.docs.amd.com/projects/llvm-project/en/docs-7.2.0/index.html). It’s recommended that you now invoke AMD Clang directly rather than using HIPCC. There isn’t any expected impact on usability, functionality, or performance when invoking AMD Clang directly. In a future ROCm release, HIPCC will become a symbolic link to AMD Clang.
 
 ### **MIGraphX** (2.15.0)
 
```
````diff
@@ -1432,7 +1433,11 @@ python3 -m pip install --user .
 `sudo` might be required. Use flag `--break-system-packages` if `pip un/installation` fails.
 ```
 
-For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release.
+For detailed instructions, see [Install the Python library for multiple ROCm instances](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html#install-the-python-library-for-multiple-rocm-instances). The issue will be fixed in a future ROCm release. See [GitHub issue #5875](https://github.com/ROCm/ROCm/issues/5875).
 
+### Intermittent errors when running JAX workloads
+
+You might experience intermittent errors or segmentation faults when running JAX workloads. The issue is currently under investigation and will be addressed in an upcoming ROCm release. See [GitHub issue #5878](https://github.com/ROCm/ROCm/issues/5878).
+
 ### hipBLASLt performance variation for a particular FP8 GEMM operation on AMD Instinct MI325X GPUs
 
@@ -1477,7 +1482,7 @@ The following changes to the ROCm software stack are anticipated for future rele
 
 ### ROCm Offline Installer Creator deprecation
 
-The ROCm Offline Installer Creator is deprecated with the ROCm 7.2.0 release. Equivalent installation capabilities are available through the ROCm Runfile Installer, a self-extracting installer that is not based on OS package managers. This installer will be removed in a future release.
+The [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-offline-installer.html) is deprecated with the ROCm 7.2.0 release and will be removed in a future release. Equivalent installation capabilities are available through the [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.2.0/install/rocm-runfile-installer.html), a self-extracting installer that is not based on OS package managers.
 
 ### ROCm SMI deprecation
 
````
```diff
@@ -171,6 +171,7 @@ Operating systems, kernel and Glibc versions
 *********************************************
 
 For detailed information on operating system supported on ROCm 7.2.0 and associated Kernel and Glibc version, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
 
 .. note::
 
 * See `Red Hat Enterprise Linux Release Dates <https://access.redhat.com/articles/3078>`_ to learn about the specific kernel versions supported on Red Hat Enterprise Linux (RHEL).
```
docs/conf.py (10 changes)
```diff
@@ -138,12 +138,14 @@ article_pages = [
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.8", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.9", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.10", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v25.11", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-primus-migration-guide", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-megatron", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.7", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.8", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.9", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.10", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-megatron-v25.11", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/pytorch-training", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-history", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.3", "os": ["linux"]},
@@ -154,16 +156,17 @@ article_pages = [
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.8", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.9", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.10", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/pytorch-training-v25.11", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/primus-pytorch", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.8", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.9", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.10", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/primus-pytorch-v25.11", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-history", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.4", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/previous-versions/jax-maxtext-v25.5", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry", "os": ["linux"]},
-    {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
 
     {"file": "how-to/rocm-for-ai/fine-tuning/index", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/fine-tuning/overview", "os": ["linux"]},
@@ -193,11 +196,16 @@ article_pages = [
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/vllm-0.11.1-20251103", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/sglang-history", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/pytorch-inference", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed", "os": ["linux"]},
+    {"file": "how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/xdit-diffusion-inference", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.10", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.11", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.12", "os": ["linux"]},
     {"file": "how-to/rocm-for-ai/inference/benchmark-docker/previous-versions/xdit-25.13", "os": ["linux"]},
 
     {"file": "how-to/rocm-for-ai/inference/deploy-your-model", "os": ["linux"]},
 
     {"file": "how-to/rocm-for-ai/inference-optimization/index", "os": ["linux"]},
```
```diff
@@ -1,15 +1,13 @@
 docker:
-  pull_tag: rocm/primus:v25.10
+  pull_tag: rocm/primus:v26.1
-  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197
+  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
 components:
   ROCm: 7.1.0
-  Primus: 0.3.0
-  Primus Turbo: 0.1.1
   PyTorch: 2.10.0.dev20251112+rocm7.1
   Python: "3.10"
-  Transformer Engine: 2.4.0.dev0+32e2d1d4
+  Transformer Engine: 2.6.0.dev0+f141f34b
   Flash Attention: 2.8.3
-  hipBLASLt: 1.2.0-09ab7153e2
+  hipBLASLt: 34459f66ea
   Triton: 3.4.0
   RCCL: 2.27.7
 model_groups:
```
@@ -0,0 +1,47 @@ (new file)

```yaml
docker:
  pull_tag: rocm/primus:v25.11
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.4.0.dev0+32e2d1d4
  Flash Attention: 2.8.3
  hipBLASLt: 1.2.0-09ab7153e2
  Triton: 3.4.0
  RCCL: 2.27.7
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 3.3 70B
        mad_tag: pyt_megatron_lm_train_llama-3.3-70b
      - model: Llama 3.1 8B
        mad_tag: pyt_megatron_lm_train_llama-3.1-8b
      - model: Llama 3.1 70B
        mad_tag: pyt_megatron_lm_train_llama-3.1-70b
      - model: Llama 2 7B
        mad_tag: pyt_megatron_lm_train_llama-2-7b
      - model: Llama 2 70B
        mad_tag: pyt_megatron_lm_train_llama-2-70b
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek-V3 (proxy)
        mad_tag: pyt_megatron_lm_train_deepseek-v3-proxy
      - model: DeepSeek-V2-Lite
        mad_tag: pyt_megatron_lm_train_deepseek-v2-lite-16b
  - group: Mistral AI
    tag: mistral
    models:
      - model: Mixtral 8x7B
        mad_tag: pyt_megatron_lm_train_mixtral-8x7b
      - model: Mixtral 8x22B (proxy)
        mad_tag: pyt_megatron_lm_train_mixtral-8x22b-proxy
  - group: Qwen
    tag: qwen
    models:
      - model: Qwen 2.5 7B
        mad_tag: pyt_megatron_lm_train_qwen2.5-7b
      - model: Qwen 2.5 72B
        mad_tag: pyt_megatron_lm_train_qwen2.5-72b
```
@@ -0,0 +1,58 @@ (new file)

```yaml
docker:
  pull_tag: rocm/primus:v25.11
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.4.0.dev0+32e2d1d4
  Flash Attention: 2.8.3
  hipBLASLt: 1.2.0-09ab7153e2
  Triton: 3.4.0
  RCCL: 2.27.7
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 3.3 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
        config_name: llama3.3_70B-pretrain.yaml
      - model: Llama 3.1 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
        config_name: llama3.1_70B-pretrain.yaml
      - model: Llama 3.1 8B
        mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
        config_name: llama3.1_8B-pretrain.yaml
      - model: Llama 2 7B
        mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
        config_name: llama2_7B-pretrain.yaml
      - model: Llama 2 70B
        mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
        config_name: llama2_70B-pretrain.yaml
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek-V3 (proxy)
        mad_tag: primus_pyt_megatron_lm_train_deepseek-v3-proxy
        config_name: deepseek_v3-pretrain.yaml
      - model: DeepSeek-V2-Lite
        mad_tag: primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
        config_name: deepseek_v2_lite-pretrain.yaml
  - group: Mistral AI
    tag: mistral
    models:
      - model: Mixtral 8x7B
        mad_tag: primus_pyt_megatron_lm_train_mixtral-8x7b
        config_name: mixtral_8x7B_v0.1-pretrain.yaml
      - model: Mixtral 8x22B (proxy)
        mad_tag: primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
        config_name: mixtral_8x22B_v0.1-pretrain.yaml
  - group: Qwen
    tag: qwen
    models:
      - model: Qwen 2.5 7B
        mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
        config_name: primus_qwen2.5_7B-pretrain.yaml
      - model: Qwen 2.5 72B
        mad_tag: primus_pyt_megatron_lm_train_qwen2.5-72b
        config_name: qwen2.5_72B-pretrain.yaml
```
@@ -0,0 +1,32 @@ (new file)

```yaml
docker:
  pull_tag: rocm/primus:v25.11
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.4.0.dev0+32e2d1d4
  Flash Attention: 2.8.3
  hipBLASLt: 1.2.0-09ab7153e2
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 3.1 8B
        mad_tag: primus_pyt_train_llama-3.1-8b
        model_repo: Llama-3.1-8B
        url: https://huggingface.co/meta-llama/Llama-3.1-8B
        precision: BF16
      - model: Llama 3.1 70B
        mad_tag: primus_pyt_train_llama-3.1-70b
        model_repo: Llama-3.1-70B
        url: https://huggingface.co/meta-llama/Llama-3.1-70B
        precision: BF16
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek V3 16B
        mad_tag: primus_pyt_train_deepseek-v3-16b
        model_repo: DeepSeek-V3
        url: https://huggingface.co/deepseek-ai/DeepSeek-V3
        precision: BF16
```
@@ -0,0 +1,195 @@ (new file)

```yaml
docker:
  pull_tag: rocm/primus:v25.11
  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3
components:
  ROCm: 7.1.0
  PyTorch: 2.10.0.dev20251112+rocm7.1
  Python: "3.10"
  Transformer Engine: 2.4.0.dev0+32e2d1d4
  Flash Attention: 2.8.3
  hipBLASLt: 1.2.0-09ab7153e2
model_groups:
  - group: Meta Llama
    tag: llama
    models:
      - model: Llama 4 Scout 17B-16E
        mad_tag: pyt_train_llama-4-scout-17b-16e
        model_repo: Llama-4-17B_16E
        url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 3.3 70B
        mad_tag: pyt_train_llama-3.3-70b
        model_repo: Llama-3.3-70B
        url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
        precision: BF16
        training_modes: [finetune_fw, finetune_lora, finetune_qlora]
      - model: Llama 3.2 1B
        mad_tag: pyt_train_llama-3.2-1b
        model_repo: Llama-3.2-1B
        url: https://huggingface.co/meta-llama/Llama-3.2-1B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 3.2 3B
        mad_tag: pyt_train_llama-3.2-3b
        model_repo: Llama-3.2-3B
        url: https://huggingface.co/meta-llama/Llama-3.2-3B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 3.2 Vision 11B
        mad_tag: pyt_train_llama-3.2-vision-11b
        model_repo: Llama-3.2-Vision-11B
        url: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
        precision: BF16
        training_modes: [finetune_fw]
      - model: Llama 3.2 Vision 90B
        mad_tag: pyt_train_llama-3.2-vision-90b
        model_repo: Llama-3.2-Vision-90B
        url: https://huggingface.co/meta-llama/Llama-3.2-90B-Vision
        precision: BF16
        training_modes: [finetune_fw]
      - model: Llama 3.1 8B
        mad_tag: pyt_train_llama-3.1-8b
        model_repo: Llama-3.1-8B
        url: https://huggingface.co/meta-llama/Llama-3.1-8B
        precision: BF16
        training_modes: [pretrain, finetune_fw, finetune_lora, HF_pretrain]
      - model: Llama 3.1 70B
        mad_tag: pyt_train_llama-3.1-70b
        model_repo: Llama-3.1-70B
        url: https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
        precision: BF16
        training_modes: [pretrain, finetune_fw, finetune_lora]
      - model: Llama 3.1 405B
        mad_tag: pyt_train_llama-3.1-405b
        model_repo: Llama-3.1-405B
        url: https://huggingface.co/meta-llama/Llama-3.1-405B
        precision: BF16
        training_modes: [finetune_qlora]
      - model: Llama 3 8B
        mad_tag: pyt_train_llama-3-8b
        model_repo: Llama-3-8B
        url: https://huggingface.co/meta-llama/Meta-Llama-3-8B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 3 70B
        mad_tag: pyt_train_llama-3-70b
        model_repo: Llama-3-70B
        url: https://huggingface.co/meta-llama/Meta-Llama-3-70B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 2 7B
        mad_tag: pyt_train_llama-2-7b
        model_repo: Llama-2-7B
        url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
        precision: BF16
        training_modes: [finetune_fw, finetune_lora, finetune_qlora]
      - model: Llama 2 13B
        mad_tag: pyt_train_llama-2-13b
        model_repo: Llama-2-13B
        url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Llama 2 70B
        mad_tag: pyt_train_llama-2-70b
        model_repo: Llama-2-70B
        url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
        precision: BF16
        training_modes: [finetune_lora, finetune_qlora]
  - group: OpenAI
    tag: openai
    models:
      - model: GPT OSS 20B
        mad_tag: pyt_train_gpt_oss_20b
        model_repo: GPT-OSS-20B
        url: https://huggingface.co/openai/gpt-oss-20b
        precision: BF16
        training_modes: [HF_finetune_lora]
      - model: GPT OSS 120B
        mad_tag: pyt_train_gpt_oss_120b
        model_repo: GPT-OSS-120B
        url: https://huggingface.co/openai/gpt-oss-120b
        precision: BF16
        training_modes: [HF_finetune_lora]
  - group: DeepSeek
    tag: deepseek
    models:
      - model: DeepSeek V2 16B
        mad_tag: primus_pyt_train_deepseek-v2
        model_repo: DeepSeek-V2
        url: https://huggingface.co/deepseek-ai/DeepSeek-V2
        precision: BF16
        training_modes: [pretrain]
  - group: Qwen
    tag: qwen
    models:
      - model: Qwen 3 8B
        mad_tag: pyt_train_qwen3-8b
        model_repo: Qwen3-8B
        url: https://huggingface.co/Qwen/Qwen3-8B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Qwen 3 32B
        mad_tag: pyt_train_qwen3-32b
        model_repo: Qwen3-32
        url: https://huggingface.co/Qwen/Qwen3-32B
        precision: BF16
        training_modes: [finetune_lora]
      - model: Qwen 2.5 32B
        mad_tag: pyt_train_qwen2.5-32b
        model_repo: Qwen2.5-32B
        url: https://huggingface.co/Qwen/Qwen2.5-32B
        precision: BF16
        training_modes: [finetune_lora]
      - model: Qwen 2.5 72B
        mad_tag: pyt_train_qwen2.5-72b
        model_repo: Qwen2.5-72B
        url: https://huggingface.co/Qwen/Qwen2.5-72B
        precision: BF16
        training_modes: [finetune_lora]
      - model: Qwen 2 1.5B
        mad_tag: pyt_train_qwen2-1.5b
        model_repo: Qwen2-1.5B
        url: https://huggingface.co/Qwen/Qwen2-1.5B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
      - model: Qwen 2 7B
        mad_tag: pyt_train_qwen2-7b
        model_repo: Qwen2-7B
        url: https://huggingface.co/Qwen/Qwen2-7B
        precision: BF16
        training_modes: [finetune_fw, finetune_lora]
  - group: Stable Diffusion
    tag: sd
    models:
      - model: Stable Diffusion XL
        mad_tag: pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning
        model_repo: SDXL
        url: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
        precision: BF16
        training_modes: [posttrain]
  - group: Flux
    tag: flux
    models:
      - model: FLUX.1-dev
        mad_tag: pyt_train_flux
        model_repo: Flux
        url: https://huggingface.co/black-forest-labs/FLUX.1-dev
        precision: BF16
        training_modes: [posttrain]
  - group: NCF
    tag: ncf
    models:
      - model: NCF
        mad_tag: pyt_ncf_training
        model_repo:
        url: https://github.com/ROCm/FluxBenchmark
        precision: FP32
  - group: DLRM
    tag: dlrm
    models:
      - model: DLRM v2
        mad_tag: pyt_train_dlrm
        model_repo: DLRM
        url: https://github.com/AMD-AGI/DLRMBenchmark
        training_modes: [pretrain]
```
```diff
@@ -1,13 +1,13 @@
 docker:
-  pull_tag: rocm/primus:v25.11
+  pull_tag: rocm/primus:v26.1
-  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197
+  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
 components:
   ROCm: 7.1.0
   PyTorch: 2.10.0.dev20251112+rocm7.1
   Python: "3.10"
-  Transformer Engine: 2.4.0.dev0+32e2d1d4
+  Transformer Engine: 2.6.0.dev0+f141f34b
   Flash Attention: 2.8.3
-  hipBLASLt: 1.2.0-09ab7153e2
+  hipBLASLt: 34459f66ea
   Triton: 3.4.0
   RCCL: 2.27.7
 model_groups:
```
```diff
@@ -1,13 +1,13 @@
 docker:
-  pull_tag: rocm/primus:v25.11
+  pull_tag: rocm/primus:v26.1
-  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197
+  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
 components:
   ROCm: 7.1.0
   PyTorch: 2.10.0.dev20251112+rocm7.1
   Python: "3.10"
-  Transformer Engine: 2.4.0.dev0+32e2d1d4
+  Transformer Engine: 2.6.0.dev0+f141f34b
   Flash Attention: 2.8.3
-  hipBLASLt: 1.2.0-09ab7153e2
+  hipBLASLt: 34459f66ea
 model_groups:
   - group: Meta Llama
     tag: llama
```
```diff
@@ -1,15 +1,13 @@
 docker:
-  pull_tag: rocm/primus:v25.10
+  pull_tag: rocm/primus:v26.1
-  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197
+  docker_hub_url: https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d
 components:
   ROCm: 7.1.0
-  Primus: 0.3.0
-  Primus Turbo: 0.1.1
   PyTorch: 2.10.0.dev20251112+rocm7.1
   Python: "3.10"
-  Transformer Engine: 2.4.0.dev0+32e2d1d4
+  Transformer Engine: 2.6.0.dev0+f141f34b
   Flash Attention: 2.8.3
-  hipBLASLt: 1.2.0-09ab7153e2
+  hipBLASLt: 34459f66ea
 model_groups:
   - group: Meta Llama
     tag: llama
```
```diff
@@ -130,7 +130,7 @@ After loading the model in this way, the model is fully ready to use the resourc
 torchtune for fine-tuning and inference
 =============================================
 
-`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
+`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
 model fine-tuning and inference with LLMs.
 
 #. Install torchtune using pip.
```
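The installation step shown in the diff context above is cut off before the command itself; the following is a hedged sketch using the package's published name on PyPI and its bundled CLI.

```bash
# Install torchtune into the active Python environment (assumes a ROCm build of
# PyTorch is already installed), then list the bundled recipes to confirm.
pip install torchtune
tune ls
```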
```diff
@@ -25,6 +25,5 @@ In this guide, you'll learn how to use ROCm for AI:
 
 - :doc:`Inference optimization <inference-optimization/index>`
 
-
 To learn about ROCm for HPC applications and scientific computing, see
 :doc:`../rocm-for-hpc/index`.
```
@@ -0,0 +1,904 @@ (new file)

# SGLang distributed inference with MoRI

This document provides a comprehensive guide for deploying a high-performance
SGLang distributed inference serving environment on an AMD Instinct MI355X GPU
cluster, utilizing the [MoRI (Modular RDMA
Interface)](https://github.com/rocm/mori) communication backend for optimized
inter-node collective operations. It also includes systematic instructions for
benchmarking 1P2D (1 prefill 2 decode, 3 nodes) configurations using automated
scripts.

## Prerequisites

The following configuration is required to implement this setup:

* **Nodes:** A minimum of three GPU nodes (virtual machines or physical
  machines) for wide expert parallelism (EP) evaluation.
* **GPUs:** 8x AMD Instinct MI355X GPU cards per node.
* **Networking:** 8x AMD Pensando™ Pollara 400 AI NICs per node, providing
  a dedicated 1:1 mapping between GPUs and network interfaces for optimal
  inter-node communication.
* **Orchestration:** A Slurm cluster with at least three nodes -- one for
  the prefill service and two for decode services (EP16). A quick node check
  is sketched after this list.
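A minimal sketch of confirming the Slurm allocation before continuing; it only uses the standard Slurm query tool and makes no assumptions beyond a working Slurm installation.

```bash
# Confirm the cluster reports at least three healthy, idle MI355X nodes
# before launching the 1P2D benchmark jobs.
sinfo -N -l
```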
## System configuration

This section outlines the infrastructure setup required to support your AMD
Instinct MI355X cluster. It covers essential procedures for verifying software
baselines and firmware versions, configuring the AMD Pensando Pollara 400 AI
NICs for high-bandwidth networking, and applying thermal and Quality of Service
(QoS) tunings to ensure a stable, lossless RDMA fabric.

(sglang-mori-verify-baseline)=

### Verify baseline software

The following table outlines the validated software stack. Use the provided
shell commands to verify the environment on each node before proceeding.

| Component | Version | Verification command |
| :--- | :--- | :--- |
| **OS** | Ubuntu 22.04.5 LTS | `cat /etc/os-release` |
| **Kernel** | 5.15.0-163-generic | `uname -r` |
| **ROCm** | 7.1.1 | `amd-smi version` |
| **PLDM bundle (firmware)** | 01.25.16.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
| **AI NIC Firmware** | 1.117.5.a.45 | `dkms status` |
| **AI NIC Driver** | 25.11.1.001 | `dkms status` |
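The table's verification commands can be run in one pass on each node; this is a minimal sketch that simply prints every value for manual comparison against the table, assuming `amd-smi` and `dkms` are already on the PATH.

```bash
#!/usr/bin/env bash
# Print the baseline software versions checked in the table above.
set -euo pipefail

echo "== OS ==";      grep PRETTY_NAME /etc/os-release
echo "== Kernel ==";  uname -r
echo "== ROCm ==";    amd-smi version
echo "== NIC driver/firmware (DKMS) =="; dkms status
```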
### Verify best known configuration (BKC)

The BKC defines a validated configuration of GPU firmware, baseboard firmware,
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
These components are tested together to attain the best performance and compatibility.

While AMD publishes the AMD GPU driver and ROCm user space components, your
server OEM or infrastructure provider distributes the firmware packages. AMD
supplies those firmware images (PLDM bundles), which the OEM integrates and
distributes.

To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
Redfish API:

1. Prepare credentials: Identify your BMC IP, username, and password.
2. Run Redfish queries: Use the following commands to check the active
   firmware inventory.

``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"

# Query active BKC bundle version
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
  -u "${AUTH}" -k | json_pp

# Query active IFWI (Integrated Firmware Image)
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
  -u "${AUTH}" -k | json_pp
```

### Run basic system health checks

Before proceeding with software deployment, verify that all cluster nodes
comply with the [MI355X Basic Health
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi355x.html#basic-health-checks).
Key requirements include specific kernel boot arguments, minimum system memory
thresholds, PCIe Gen5 link stability, and so on.

### Install AMD Pensando Pollara 400 AI NIC drivers

For detailed instructions on upgrading the firmware and installing drivers for
the AMD Pensando Pollara 400 AI NIC, refer to the [AMD Instinct System
Acceptance
Guide](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/network/nic-installation.html#amd-pensando-pollara-400-ai-nic).
After installation, verify the active firmware version on all NICs to ensure it
matches the software baseline. See [Verify baseline software](#sglang-mori-verify-baseline).

To display the current firmware version for all AI NICs, use the following command.

```bash
sudo nicctl show version firmware
```

### Configure thermal management (fan speed)

For systems equipped with 400G optics, standard fan profiles are often
insufficient for maintaining stable operating temperatures. To prevent thermal
throttling or optics failure, the system fans must be set to `FullSpeed`.

* Requirement: A fan speed of approximately 25,000 RPM is required to maintain
  the AI NIC modules at an optimal operating temperature (~50°C).

* Constraint: Default profiles (typically around 4,000 RPM) and "Performance IO"
  settings (around 9,000 RPM) do not provide adequate airflow for 400G optical
  transceivers.

#### Configure fan speed via Redfish (Supermicro)

Run the following command to set the fan mode to `FullSpeed` through the BMC:

``` bash
# Define BMC connection variables
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"

# Set Fan Mode to FullSpeed
curl -X PATCH "https://${BMC_IP}/redfish/v1/Managers/1/Oem/Supermicro/FanMode" \
  -k -u "${AUTH}" \
  -H "Content-Type: application/json" \
  -d '{"Mode": "FullSpeed"}'
```
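To confirm the setting took effect, a read-back of the same Redfish resource can be used. This is a hedged sketch against the Supermicro-specific endpoint used above; resource paths differ on other BMC vendors.

``` bash
# Read back the current fan mode from the BMC (expects "FullSpeed" after the PATCH).
BMC_IP="<BMC_IP>"
AUTH="<username>:<password>"

curl -X GET "https://${BMC_IP}/redfish/v1/Managers/1/Oem/Supermicro/FanMode" \
  -k -u "${AUTH}" | json_pp
```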
|
### Configure your backend network (netplan)
|
||||||
|
|
||||||
|
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
|
||||||
|
the GPU’s eight network interface controllers (NICs) are `benic1p1` to
|
||||||
|
`benic8p1`. Each NIC must have its own subnet that is disjoint from the others.
|
||||||
|
Each node needs a unique IP address on each subnet. You should use the same
|
||||||
|
final octet in each subnet for a given node. For example, one node would have
|
||||||
|
the addresses `192.168.1.36`, `192.168.2.36`, and so on. Another node would
|
||||||
|
have `192.168.1.37`, `192.168.2.37`, and so on. Ensure MTU is set to `9000`.
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
Ensure you identify the correct interface names for your system using ip link
|
||||||
|
before applying this configuration.
|
||||||
|
```
|
||||||
|
|
||||||
|
For example, your `/etc/netplan/70-backend.yaml` should look like the
|
||||||
|
following:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
network:
|
||||||
|
ethernets:
|
||||||
|
benic8p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.8.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:2a:34:08
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 108
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.8.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.8.38
|
||||||
|
table: 108
|
||||||
|
set-name: benic8p1
|
||||||
|
benic7p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.7.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:2b:82:40
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 107
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.7.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.7.38
|
||||||
|
table: 107
|
||||||
|
set-name: benic7p1
|
||||||
|
benic6p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.6.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:30:c9:30
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 106
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.6.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.6.38
|
||||||
|
table: 106
|
||||||
|
set-name: benic6p1
|
||||||
|
benic5p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.5.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:2a:23:40
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 105
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.5.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.5.38
|
||||||
|
table: 105
|
||||||
|
set-name: benic5p1
|
||||||
|
benic4p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.4.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:2d:69:60
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 104
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.4.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.4.38
|
||||||
|
table: 104
|
||||||
|
set-name: benic4p1
|
||||||
|
benic3p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.3.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:2a:2c:40
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 103
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.3.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.3.38
|
||||||
|
table: 103
|
||||||
|
set-name: benic3p1
|
||||||
|
benic2p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.2.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:30:d5:30
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 102
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.2.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.2.38
|
||||||
|
table: 102
|
||||||
|
set-name: benic2p1
|
||||||
|
benic1p1:
|
||||||
|
addresses:
|
||||||
|
- 192.168.1.38/31
|
||||||
|
match:
|
||||||
|
macaddress: 04:90:81:30:e4:00
|
||||||
|
mtu: 9000
|
||||||
|
routes:
|
||||||
|
- table: 101
|
||||||
|
to: 0.0.0.0/0
|
||||||
|
via: 192.168.1.39
|
||||||
|
routing-policy:
|
||||||
|
- from: 192.168.1.38
|
||||||
|
table: 101
|
||||||
|
set-name: benic1p1
|
||||||
|
```
|
||||||
|
|
||||||
|
To apply the configuration, use the following command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo netplan apply
|
||||||
|
```
|
||||||
|
|
||||||
|
To verify your configuration, use the following command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt install -y net-tools && ip -br a
|
||||||
|
```
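
Because each backend NIC is given its own routing table (`101` through `108` in the example above), it is also worth confirming that the source-based policy rules and per-table default routes exist. For example:

```bash
# List the source-based policy rules created by netplan
ip rule show

# Inspect one of the per-NIC routing tables (table 101 belongs to benic1p1 above)
ip route show table 101
```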
|
||||||
|
|
||||||
|
### Configure Quality of Service (QoS) and Congestion Control (DCQCN)
|
||||||
|
|
||||||
|
To ensure lossless communication and optimal performance for RDMA traffic, the
|
||||||
|
network must be configured with specific QoS and Data Center Quantized
|
||||||
|
Congestion Notification (DCQCN) settings.
|
||||||
|
|
||||||
|
The configuration below does the following:

* Enables RX and TX pause frames on the ports.
* Maps DSCP 24 (data) to Q3 and DSCP 46 (CNP) to Q6, and all other DSCP values to Q0.
* Enables PFC for Q3.
* Applies scheduling of 99% to Q3 and 1% to Q0, with strict priority for Q6.
|
||||||
|
|
||||||
|
#### Configure DCQCN
|
||||||
|
|
||||||
|
Create and run a `/nfsdata/enable_dcqcn.sh` script to initialize congestion
|
||||||
|
control parameters.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
TOKEN_BUCKET_SIZE=800000
|
||||||
|
AI_RATE=160
|
||||||
|
ALPHA_UPDATE_INTERVAL=1
|
||||||
|
ALPHA_UPDATE_G=512
|
||||||
|
INITIAL_ALPHA_VALUE=64
|
||||||
|
RATE_INCREASE_BYTE_COUNT=431068
|
||||||
|
HAI_RATE=300
|
||||||
|
RATE_REDUCE_MONITOR_PERIOD=1
|
||||||
|
RATE_INCREASE_THRESHOLD=1
|
||||||
|
RATE_INCREASE_INTERVAL=1
|
||||||
|
CNP_DSCP=46
|
||||||
|
|
||||||
|
ROCE_DEVICES=$(ibv_devices | grep ionic_ | awk '{print $1}' | paste -sd " ")
|
||||||
|
for roce_dev in $ROCE_DEVICES
|
||||||
|
do
|
||||||
|
sudo nicctl update dcqcn -r $roce_dev -i 1 \
|
||||||
|
--token-bucket-size $TOKEN_BUCKET_SIZE \
|
||||||
|
--ai-rate $AI_RATE \
|
||||||
|
--alpha-update-interval $ALPHA_UPDATE_INTERVAL \
|
||||||
|
--alpha-update-g $ALPHA_UPDATE_G \
|
||||||
|
--initial-alpha-value $INITIAL_ALPHA_VALUE \
|
||||||
|
--rate-increase-byte-count $RATE_INCREASE_BYTE_COUNT \
|
||||||
|
--hai-rate $HAI_RATE \
|
||||||
|
--rate-reduce-monitor-period $RATE_REDUCE_MONITOR_PERIOD \
|
||||||
|
--rate-increase-threshold $RATE_INCREASE_THRESHOLD \
|
||||||
|
--rate-increase-interval $RATE_INCREASE_INTERVAL \
|
||||||
|
--cnp-dscp $CNP_DSCP
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Configure QoS and PFC
|
||||||
|
|
||||||
|
Create and run `/nfsdata/qos.sh` to set up traffic classes and scheduling.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
#!/bin/bash
|
||||||
|
# qos.sh
|
||||||
|
|
||||||
|
# Enable PFC and Auto-negotiation on all ports
|
||||||
|
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port -p $i --pause-type pfc --rx-pause enable --tx-pause enable; done
|
||||||
|
for i in $(sudo nicctl show port | grep Port | awk '{print $3}'); do sudo nicctl update port --port $i --auto-neg enable; done
|
||||||
|
|
||||||
|
# Define Priorities
|
||||||
|
cts_dscp=46
|
||||||
|
cts_prio=6
|
||||||
|
data_dscp=24
|
||||||
|
data_prio=3
|
||||||
|
default_prio=0
|
||||||
|
cnp_dscp=46
|
||||||
|
cnp_prio=6
|
||||||
|
|
||||||
|
sudo nicctl update qos pfc --priority 0 --no-drop disable
|
||||||
|
sudo nicctl update qos dscp-to-purpose --dscp 48 --purpose none
|
||||||
|
sudo nicctl update qos dscp-to-purpose --dscp 46 --purpose none
|
||||||
|
sudo nicctl update qos --classification-type pcp
|
||||||
|
sudo nicctl update qos --classification-type dscp
|
||||||
|
sudo nicctl update qos dscp-to-priority --dscp 0-63 --priority 0
|
||||||
|
sudo nicctl update qos dscp-to-priority --dscp 0-23,25-45,47-63 --priority $default_prio
|
||||||
|
sudo nicctl update qos dscp-to-priority --dscp $cts_dscp --priority $cts_prio
|
||||||
|
sudo nicctl update qos dscp-to-priority --dscp $data_dscp --priority $data_prio
|
||||||
|
sudo nicctl update qos dscp-to-priority --dscp $cnp_dscp --priority $cnp_prio
|
||||||
|
sudo nicctl update qos pfc --priority $data_prio --no-drop enable
|
||||||
|
sudo nicctl update qos scheduling --priority $data_prio,$default_prio,$cts_prio --dwrr 99,1,0 --rate-limit 0,0,10
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Verify your configuration
|
||||||
|
|
||||||
|
Verify the configuration using `nicctl`.
|
||||||
|
|
||||||
|
* Verify QoS classification:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo nicctl show qos
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected QoS output:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
|
||||||
|
|
||||||
|
Port : 04908130-a7a0-4242-4242-000011010000
|
||||||
|
|
||||||
|
Classification type : DSCP
|
||||||
|
|
||||||
|
DSCP-to-priority :
|
||||||
|
DSCP bitmap : 0xffffbffffeffffff ==> priority : 0
|
||||||
|
DSCP bitmap : 0x0000000001000000 ==> priority : 3
|
||||||
|
DSCP bitmap : 0x0000400000000000 ==> priority : 6
|
||||||
|
DSCP : 0-23, 25-45, 47-63 ==> priority : 0
|
||||||
|
DSCP : 24 ==> priority : 3
|
||||||
|
DSCP : 46 ==> priority : 6
|
||||||
|
```
|
||||||
|
|
||||||
|
* Verify DCQCN and scheduling:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo nicctl show dcqcn
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected DCQCN and scheduling output:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
NIC : 42424650-4c32-3531-3230-303443000000 (0000:f6:00.0)
|
||||||
|
------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
Lif id : 43000070-0100-0000-4242-04908130a7a0
|
||||||
|
ROCE device : ionic_7
|
||||||
|
DCQCN profile id : 1
|
||||||
|
Status : Enabled
|
||||||
|
Rate increase in AI phase : 160
|
||||||
|
Rate increase byte count : 431068
|
||||||
|
Alpha update G value : 512
|
||||||
|
Alpha update interval : 1
|
||||||
|
Rate increase in HAI phase : 300
|
||||||
|
Initial alpha value : 64
|
||||||
|
Rate reduce monitor period : 1
|
||||||
|
Rate increase threshold : 1
|
||||||
|
Rate increase interval : 1
|
||||||
|
Token bucket size : 800000
|
||||||
|
DSCP value used for CNP : 46
|
||||||
|
|
||||||
|
|
||||||
|
PFC :
|
||||||
|
PFC priority bitmap : 0x8
|
||||||
|
PFC no-drop priorities : 3
|
||||||
|
|
||||||
|
Scheduling :
|
||||||
|
--------------------------------------------
|
||||||
|
Priority Scheduling Bandwidth Rate-limit
|
||||||
|
Type (in %age) (in Gbps)
|
||||||
|
--------------------------------------------
|
||||||
|
0 DWRR 1 N/A
|
||||||
|
3 DWRR 99 N/A
|
||||||
|
6 strict N/A 10
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure your network file system (NFS)
|
||||||
|
|
||||||
|
Setting up a shared NFS volume facilitates centralized storage for models,
|
||||||
|
recipes, and logs across the cluster. Use the following commands to install the
|
||||||
|
necessary client tools and mount the remote directory.
|
||||||
|
|
||||||
|
```{important}
|
||||||
|
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
|
||||||
|
server details and desired local mount path.
|
||||||
|
```
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo apt update && sudo apt install -y nfs-common
|
||||||
|
sudo mkdir -p /mount/point
|
||||||
|
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
|
||||||
|
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
|
||||||
|
```
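
To confirm that the share is mounted at the expected path, a quick check such as the following is usually sufficient:

```bash
# Verify the NFS mount and its available space
df -hT /mount/point
mount | grep /mount/point
```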
|
||||||
|
|
||||||
|
## Software installation
|
||||||
|
|
||||||
|
Next, install the core compute stack required to operate the AMD Instinct GPUs.
|
||||||
|
The following steps guide you through deploying the ROCm software stack and the
|
||||||
|
necessary kernel-mode drivers to enable hardware acceleration and optimize the
|
||||||
|
environment for distributed inference workloads.
|
||||||
|
|
||||||
|
### Install ROCm
|
||||||
|
|
||||||
|
Use the following commands to quickly install ROCm 7.1.1 on Ubuntu 22.04:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
|
||||||
|
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
|
||||||
|
sudo apt update
|
||||||
|
sudo apt install python3-setuptools python3-wheel
|
||||||
|
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
|
||||||
|
sudo apt install rocm
|
||||||
|
```
|
||||||
|
|
||||||
|
For detailed installation instructions, refer to the [ROCm 7.1.1
|
||||||
|
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#rocm-installation).
|
||||||
|
|
||||||
|
### Install AMD GPU Driver (amdgpu)
|
||||||
|
|
||||||
|
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.1.1)
|
||||||
|
on Ubuntu 22.04:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
wget https://repo.radeon.com/amdgpu-install/7.1.1/ubuntu/jammy/amdgpu-install_7.1.1.70101-1_all.deb
|
||||||
|
sudo apt install ./amdgpu-install_7.1.1.70101-1_all.deb
|
||||||
|
sudo apt update
|
||||||
|
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
|
||||||
|
sudo apt install amdgpu-dkms
|
||||||
|
```
|
||||||
|
|
||||||
|
For detailed installation instructions, refer to the [ROCm 7.1.1
|
||||||
|
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/install/quick-start.html#amdgpu-driver-installation).
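
After installing the driver, reboot the node and run a quick sanity check to confirm that all GPUs are visible to the ROCm runtime. For example:

```bash
# List the GPU agents detected by ROCm
rocminfo | grep -i gfx

# Show GPU temperature, power, and utilization
rocm-smi
```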
|
||||||
|
|
||||||
|
## Network verification and testing
|
||||||
|
|
||||||
|
Before deploying the inference engine, validate the health and performance of
|
||||||
|
the cluster interconnects.
|
||||||
|
|
||||||
|
### Verify network connectivity
|
||||||
|
|
||||||
|
Verify that all network interfaces are reachable across the cluster nodes.
|
||||||
|
Assuming `eth0` is the management interface, and `benic1p1` through `benic8p1` are the
|
||||||
|
dedicated RoCE backend interfaces, use the following loop to test reachability
|
||||||
|
to a remote node (for instance, a target node with host IP suffix `.38`).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ping the remote node's RoCE addresses 192.168.1.38 through 192.168.8.38
|
||||||
|
for i in {1..8}; do ping -c 1 192.168.${i}.38; done
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate your RDMA setup
|
||||||
|
|
||||||
|
Confirm that all eight RDMA network interfaces are in the `UP` state and
|
||||||
|
correctly configured with the required MTU and GID settings.
|
||||||
|
|
||||||
|
#### Verify link status, MTU, NIC temperature, and NIC speed
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo nicctl show port
|
||||||
|
```
|
||||||
|
|
||||||
|
The output should look something like this:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
-------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
NIC : 42424650-4c32-3531-3530-314343000000 (0000:f6:00.0)
|
||||||
|
|
||||||
|
Port : 04908132-5d88-4242-4242-000011010000 (eth1/1)
|
||||||
|
Spec:
|
||||||
|
Ifindex : 0x11010000
|
||||||
|
Type : ETH
|
||||||
|
speed : 400G
|
||||||
|
Admin state : UP
|
||||||
|
FEC type : RS
|
||||||
|
Pause type : PFC
|
||||||
|
Number of lanes : 4
|
||||||
|
MTU : 9216
|
||||||
|
TX pause : enabled
|
||||||
|
RX pause : enabled
|
||||||
|
Auto negotiation : enabled
|
||||||
|
Status:
|
||||||
|
Physical port : 1
|
||||||
|
Operational status : UP
|
||||||
|
Link FSM state : UP
|
||||||
|
FEC type : RS
|
||||||
|
Cable type : Fiber
|
||||||
|
Number of lanes : 4
|
||||||
|
speed : 400G
|
||||||
|
Auto negotiation : disabled
|
||||||
|
MAC ID : 0
|
||||||
|
MAC channel : 0
|
||||||
|
MAC address : 04:90:81:32:5d:88
|
||||||
|
Transceiver type : QSFP_CMIS
|
||||||
|
Transceiver state : SPROM-READ
|
||||||
|
Transceiver PID : QSFP-400G-DR4
|
||||||
|
Transceiver temperature (in C) : 45
|
||||||
|
Transceiver warning temperature (in C) : 75
|
||||||
|
Transceiver alarm temperature (in C) : 80
|
||||||
|
-------------------------------------------------------------------------------------
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Verify GID
|
||||||
|
|
||||||
|
Ensure each device has a valid GID mapped to its assigned IP address.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ibv_devinfo -v | grep GID
|
||||||
|
```
|
||||||
|
|
||||||
|
The output should look something like this:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
GID[ 0]: fe80::690:81ff:fe30:a7a0, RoCE v2
|
||||||
|
GID[ 1]: ::ffff:192.168.7.36, RoCE v2
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run RDMA bandwidth benchmarks
|
||||||
|
|
||||||
|
Verify the inter-node RDMA performance to ensure the network fabric can
|
||||||
|
saturate the link bandwidth.
|
||||||
|
|
||||||
|
#### Install RDMA performance tools
|
||||||
|
|
||||||
|
To get started, build the ROCm-optimized `rdma-perftest` test suite from
|
||||||
|
source:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
|
||||||
|
git clone https://github.com/ROCm/rdma-perftest
|
||||||
|
cd rdma-perftest/
|
||||||
|
./autogen.sh
|
||||||
|
./configure --enable-rocm --with-rocm=/opt/rocm
|
||||||
|
make -j$(nproc)
|
||||||
|
sudo make install
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Run a bandwidth test (GPU memory)
|
||||||
|
|
||||||
|
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts as
|
||||||
|
a server and the other acts as a client. Replace `<SERVER_IP>` with the
|
||||||
|
appropriate IP.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On Server Node
|
||||||
|
./ib_write_bw --use_rocm=0 -d ionic_0 --report_gbits -a
|
||||||
|
|
||||||
|
# On Client Node
|
||||||
|
./ib_write_bw --use_rocm=0 -d ionic_0 --report_gbits -a <SERVER_IP>
|
||||||
|
```
|
||||||
|
|
||||||
|
## SGLang serving and MoRI unit tests
|
||||||
|
|
||||||
|
### Install Docker Engine
|
||||||
|
|
||||||
|
Install the Docker engine to manage the containerized SGLang and MoRI serving
|
||||||
|
environments.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt update && sudo apt install -y docker.io
|
||||||
|
sudo usermod -aG docker "$USER"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Launch the serving container
|
||||||
|
|
||||||
|
Deploy the SGLang MoRI serving container on each node.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
CONTAINER_NAME=sglang_mori
|
||||||
|
IMAGE_NAME=rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-0113
|
||||||
|
|
||||||
|
docker run -it \
|
||||||
|
--rm \
|
||||||
|
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
|
||||||
|
--network host --ipc host \
|
||||||
|
--group-add video \
|
||||||
|
--cap-add SYS_PTRACE \
|
||||||
|
--security-opt seccomp=unconfined \
|
||||||
|
--privileged \
|
||||||
|
--shm-size 128G \
|
||||||
|
--name ${CONTAINER_NAME} \
|
||||||
|
${IMAGE_NAME} /bin/bash
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run MoRI inter-node unit tests
|
||||||
|
|
||||||
|
Before starting the vLLM service, run the MoRI unit test to verify that the
|
||||||
|
inter-node communication backend is correctly configured.
|
||||||
|
|
||||||
|
The MoRI unit test uses two nodes as a minimal validation step before running the full 1P2D (three-node) benchmark.
|
||||||
|
|
||||||
|
The key configuration variables are:
|
||||||
|
|
||||||
|
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization, such as `benic1p1`.
|
||||||
|
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
You can find reference performance data in the [ROCm/MoRI
|
||||||
|
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Set up environment inside the container
|
||||||
|
cd /app/mori
export PYTHONPATH=/app/mori:$PYTHONPATH
|
||||||
|
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
|
||||||
|
# Node 0 (Primary)
|
||||||
|
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
|
||||||
|
--master_addr="<MASTER_IP>" --master_port=1234 \
|
||||||
|
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
|
||||||
|
--cmd bench --kernel-type v1
|
||||||
|
|
||||||
|
# Node 1 (Secondary)
|
||||||
|
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
|
||||||
|
--master_addr="<MASTER_IP>" --master_port=1234 \
|
||||||
|
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
|
||||||
|
--cmd bench --kernel-type v1
|
||||||
|
```
|
||||||
|
|
||||||
|
## End-to-end 1P2D performance testing
|
||||||
|
|
||||||
|
This section guides you through running distributed inference benchmarks using
|
||||||
|
the SGLang disagg recipe. For implementation details, refer to the
|
||||||
|
[SGLang Disaggregation
|
||||||
|
Recipe](https://github.com/billishyahao/sglang_disagg/blob/9n_cluster/README.md).
|
||||||
|
|
||||||
|
### Download the model and set up your run environment
|
||||||
|
|
||||||
|
This performance test supports the following models:
|
||||||
|
|
||||||
|
* [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)
|
||||||
|
* [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
|
||||||
|
* [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)
|
||||||
|
|
||||||
|
To set up your environment and download the models using the Hugging Face CLI,
|
||||||
|
use the following commands. Modify the `huggingface-cli download` command
|
||||||
|
to download the desired model.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Set up a virtual environment and install the Hugging Face CLI
|
||||||
|
sudo apt update && sudo apt install -y python3-venv
|
||||||
|
python3 -m venv ~/venvs/hf
|
||||||
|
source ~/venvs/hf/bin/activate
|
||||||
|
pip install huggingface_hub
|
||||||
|
|
||||||
|
# Download the model to the shared NFS mount point
|
||||||
|
# Replace 'deepseek-ai/DeepSeek-R1-0528' with your desired model
|
||||||
|
huggingface-cli download --token <your_hf_token> \
|
||||||
|
deepseek-ai/DeepSeek-R1-0528 \
|
||||||
|
--local-dir /mount/point/models/DeepSeek-R1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Clone the SGLang disaggregation recipe
|
||||||
|
|
||||||
|
Clone the SGLang disaggregation repository to the shared file system and switch
|
||||||
|
to the appropriate branch:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/billishyahao/sglang_disagg.git
cd sglang_disagg
git checkout 9n_cluster
|
||||||
|
```
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
In the 1P2D configuration, the prefill service and benchmark process run on the
|
||||||
|
same node, while the remaining nodes handle decode services.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure InfiniBand devices
|
||||||
|
|
||||||
|
Identify and configure the available InfiniBand devices.
|
||||||
|
|
||||||
|
1. List available devices using the following command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ibv_devinfo -l
|
||||||
|
```
|
||||||
|
|
||||||
|
Example output:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
8 HCAs found:
|
||||||
|
ionic_0
|
||||||
|
ionic_1
|
||||||
|
ionic_2
|
||||||
|
ionic_3
|
||||||
|
ionic_4
|
||||||
|
ionic_5
|
||||||
|
ionic_6
|
||||||
|
ionic_7
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Update environment variables. Edit `set_env_vars.sh` and add the comma-separated list of your system's IB devices; a one-liner that generates this list automatically follows the example below. For example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export IBDEVICES=ionic_0,ionic_1,ionic_2,ionic_3,ionic_4,ionic_5,ionic_6,ionic_7
|
||||||
|
```
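
If you prefer not to hard-code the list, you can generate it from `ibv_devices`, mirroring the device-discovery pattern used by the DCQCN script earlier in this guide:

```bash
# Build a comma-separated list of all Pollara (ionic) RDMA devices
export IBDEVICES=$(ibv_devices | grep ionic_ | awk '{print $1}' | paste -sd ",")
```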
|
||||||
|
|
||||||
|
### Configure the script and submit the job
|
||||||
|
|
||||||
|
1. To set the required configuration parameters, update the following
|
||||||
|
environment variables in `run_submit_disagg.sh` to match your cluster setup:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# SLURM Job Configuration
|
||||||
|
export SLURM_ACCOUNT="amd" # The account name for SLURM job accounting and resource allocation
|
||||||
|
export SLURM_PARTITION="compute" # The specific cluster partition (queue) to submit the job to
|
||||||
|
export TIME_LIMIT="24:00:00" # Maximum wall time for the job (Hours:Minutes:Seconds)
|
||||||
|
|
||||||
|
# Model Configuration
|
||||||
|
export MODEL_PATH="/nfsdata" # Base directory where the model weights are stored
|
||||||
|
export MODEL_NAME="DeepSeek-R1" # Specific model directory name (joined with MODEL_PATH)
|
||||||
|
export CONTAINER_IMAGE="rocm/sgl-dev:sglang-0.5.6.post1-rocm700-mi35x-mori-1224" # Docker image to use for the environment
|
||||||
|
|
||||||
|
# Cluster Topology (Disaggregation Setup)
|
||||||
|
export PREFILL_NODES=1 # Number of prefill nodes
|
||||||
|
export PREFILL_WORKERS=1 # Number of prefill workers
|
||||||
|
export DECODE_NODES=2 # Number of decode nodes
|
||||||
|
export DECODE_WORKERS=2 # Number of decode workers
|
||||||
|
|
||||||
|
# Benchmark/Workload Parameters
|
||||||
|
export ISL=1024 # Input Sequence Length (number of tokens in the prompt)
|
||||||
|
export OSL=1024 # Output Sequence Length (number of tokens to generate)
|
||||||
|
export CONCURRENCIES="2048" # Total number of concurrent requests to simulate in the benchmark. The value can be "32,64,128"
|
||||||
|
export REQUEST_RATE="inf" # Request per second rate. "inf" means send all requests immediately
|
||||||
|
|
||||||
|
# Parallelism Strategies
|
||||||
|
export PREFILL_ENABLE_EP=true # Enable Expert Parallelism (EP) for the prefill phase
|
||||||
|
export PREFILL_ENABLE_DP=true # Enable Data Parallelism (DP) for the prefill phase
|
||||||
|
export DECODE_ENABLE_EP=true # Enable Expert Parallelism (EP) for the decode phase
|
||||||
|
export DECODE_ENABLE_DP=true # Enable Data Parallelism (DP) for the decode phase
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Then submit the batch job to the Slurm cluster:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
bash ./run_submit_disagg.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Log file analysis
|
||||||
|
|
||||||
|
1. After submission, retrieve the SLURM job ID:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
squeue
|
||||||
|
```
|
||||||
|
|
||||||
|
Example output:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
|
||||||
|
123 compute 1p2d alice R 00:10:32 4 node[01-04]
|
||||||
|
```
|
||||||
|
|
||||||
|
2. A directory named `slurm_job-$SLURM_JOB_ID` is created in `/tmp` on each
|
||||||
|
participating node. The directory contains:
|
||||||
|
|
||||||
|
| Log File | Description |
|
||||||
|
| :--------| :-----------|
|
||||||
|
| `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` | Main service log per node |
|
||||||
|
| `decode_NODE${NODE_RANK}.log` | SGLang decode service details |
|
||||||
|
| `prefill_NODE${NODE_RANK}.log` | SGLang prefill service details |
|
||||||
|
|
||||||
|
3. The benchmark results are written to `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` (a `grep` one-liner for extracting the summary follows the sample output below). Key metrics include:
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
The following benchmark utility output is provided for reference only and
|
||||||
|
should not be used to compare performance. See the
|
||||||
|
[InferenceMAX](https://inferencemax.semianalysis.com/) website for validated
|
||||||
|
performance results.
|
||||||
|
```
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
============ Serving Benchmark Result ============
|
||||||
|
Successful requests: 20480
|
||||||
|
Benchmark duration (s): 1194.25
|
||||||
|
Total input tokens: 20971520
|
||||||
|
Total generated tokens: 20971520
|
||||||
|
Request throughput (req/s): 17.15
|
||||||
|
Output token throughput (tok/s): 17560.38
|
||||||
|
Total Token throughput (tok/s): 35120.76
|
||||||
|
---------------Time to First Token----------------
|
||||||
|
Mean TTFT (ms): 21601.77
|
||||||
|
Median TTFT (ms): 24525.21
|
||||||
|
P99 TTFT (ms): 85417.53
|
||||||
|
-----Time per Output Token (excl. 1st token)------
|
||||||
|
Mean TPOT (ms): 92.41
|
||||||
|
Median TPOT (ms): 85.46
|
||||||
|
P99 TPOT (ms): 138.67
|
||||||
|
---------------Inter-token Latency----------------
|
||||||
|
Mean ITL (ms): 92.41
|
||||||
|
Median ITL (ms): 74.76
|
||||||
|
P99 ITL (ms): 263.07
|
||||||
|
----------------End-to-end Latency----------------
|
||||||
|
Mean E2EL (ms): 116133.48
|
||||||
|
Median E2EL (ms): 110349.39
|
||||||
|
P99 E2EL (ms): 227243.97
|
||||||
|
==================================================
|
||||||
|
```
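
To pull just the summary block out of the per-node log, a simple `grep` is enough; the job ID and node rank below are placeholders:

```bash
# Print the serving benchmark summary from the main service log
grep -A 30 "Serving Benchmark Result" \
  /tmp/slurm_job-<SLURM_JOB_ID>/pd_sglang_bench_serving.sh_NODE0.log
```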
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
The following section outlines common issues and their solutions.
|
||||||
|
|
||||||
|
### Bandwidth test fails with error
|
||||||
|
|
||||||
|
1. Use the ROCm-optimized `rdma-perftest`, not the generic `perftest`.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
which ib_write_bw
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Confirm that `<SERVER_IP>` is accessible.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ping <SERVER_IP>
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Check the system logs; use `dmesg` for kernel-level errors.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Slurm job fails
|
||||||
|
|
||||||
|
Common causes and solutions for Slurm job submission failures include:
|
||||||
|
|
||||||
|
1. Shared storage access:
|
||||||
|
* Verify that both `sglang_disagg` and model directories are located in a shared NFS mount accessible to all compute nodes.
|
||||||
|
* Ensure proper permissions: `chmod -R 755 /shared/path/sglang_disagg /shared/path/models`
|
||||||
|
|
||||||
|
2. Log analysis:
|
||||||
|
* Examine `pd_sglang_bench_serving.sh_NODE${NODE_RANK}.log` on each participating node for detailed error messages.
|
||||||
|
* Check for common issues like missing dependencies, GPU allocation failures, or network connectivity problems.
|
||||||
|
|
||||||
|
3. Configuration validation:
|
||||||
|
* Verify SLURM parameters in `run_submit_disagg.sh`:
|
||||||
|
* `SLURM_ACCOUNT`: Ensure your account has access to the cluster
|
||||||
|
* `SLURM_PARTITION`: Confirm the partition exists and is accessible
|
||||||
|
* `MODEL_PATH`: Check that the path is correct and accessible from compute nodes
|
||||||
|
* `MODEL_NAME`: Verify the model subdirectory exists within `MODEL_PATH`
|
||||||
|
* Use `sinfo` to check partition and node availability.
|
||||||
@@ -0,0 +1,627 @@
|
|||||||
|
# vLLM distributed inference with MoRI
|
||||||
|
|
||||||
|
This document provides a comprehensive guide for setting up a high-performance
|
||||||
|
vLLM serving environment on an AMD Instinct MI300X or MI325X GPU cluster using
|
||||||
|
the [MoRI (Modular RDMA Interface)](https://github.com/rocm/mori) communication
|
||||||
|
backend. It also includes detailed instructions on how to reproduce the
|
||||||
|
benchmark results published in the AMD ROCm blog [Practical, Fault-Robust
|
||||||
|
Distributed Inference for DeepSeek on AMD
|
||||||
|
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
The following hardware configuration is required to implement this setup:
|
||||||
|
|
||||||
|
* **Nodes**: A minimum of two GPU nodes (virtual machines or physical machines)
|
||||||
|
for wide expert parallelism (EP) evaluation.
|
||||||
|
* **GPUs**: 8x AMD Instinct MI300X/MI325X GPU cards per node.
|
||||||
|
* **Networking**: 8x NVIDIA Mellanox ConnectX-7 (CX7) NICs per node, providing
|
||||||
|
a dedicated 1:1 mapping between GPUs and network interfaces for optimal
|
||||||
|
inter-node communication.
|
||||||
|
|
||||||
|
## System configuration
|
||||||
|
|
||||||
|
This section outlines the infrastructure steps required to prepare your cluster for
high-performance AI workloads. It covers validating your system's software
baselines and firmware versions, configuring high-bandwidth backend networking
for inter-node communication, and establishing shared storage to ensure
a synchronized distributed computing environment.
|
||||||
|
|
||||||
|
### Verify baseline software
|
||||||
|
|
||||||
|
This setup has been validated using the **AI/ML Ready Image (ROCm 7-based)** on
|
||||||
|
DigitalOcean AMD GPU Droplets. The following table outlines the software
|
||||||
|
stack versions and appropriate shell commands for verification:
|
||||||
|
|
||||||
|
| Component | Version | Verification command |
|
||||||
|
| :--- | :--- | :--- |
|
||||||
|
| **OS** | Ubuntu 24.04.3 LTS | `cat /etc/os-release` |
|
||||||
|
| **Kernel** | 6.8.0-87-generic | `uname -r` |
|
||||||
|
| **ROCm** | 7.0.2 | `amd-smi version` |
|
||||||
|
| **PLDM bundle (firmware) for MI300X** | 01.25.03.12 | [Verify BKC](#verify-best-known-configuration-bkc) |
|
||||||
|
| **PLDM bundle (firmware) for MI325X** | 01.25.03.03 | [Verify BKC](#verify-best-known-configuration-bkc) |
|
||||||
|
| **CX7 Firmware** | 28.46.3048 | `ethtool -i <interface>` |
|
||||||
|
| **CX7 Driver** | 24.10-3.2.5 | `dkms status` |
|
||||||
|
| **DOCA** | 2.9.3 | `dpkg -l \| grep doca` |
|
||||||
|
|
||||||
|
### Verify best known configuration (BKC)
|
||||||
|
|
||||||
|
The BKC defines a validated configuration of GPU firmware, baseboard firmware,
|
||||||
|
ROCm user space components, the AMD GPU Driver, and virtualization tooling.
|
||||||
|
These components are tested together to ensure the best performance and compatibility.
|
||||||
|
|
||||||
|
While AMD publishes the AMD GPU driver and ROCm user space components, your
|
||||||
|
server OEM or infrastructure provider distributes the firmware packages. AMD
|
||||||
|
supplies those firmware images (PLDM bundles), which the OEM integrates and
|
||||||
|
distributes.
|
||||||
|
|
||||||
|
To verify the active BKC and IFWI (Integrated Firmware Image) versions via the
|
||||||
|
Redfish API:
|
||||||
|
|
||||||
|
1. Prepare credentials: Identify your BMC IP, username, and password.
|
||||||
|
2. Run Redfish queries: Use the following commands to check the active
|
||||||
|
firmware inventory.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
# Define BMC connection variables
|
||||||
|
BMC_IP="<BMC_IP>"
|
||||||
|
AUTH="<username>:<password>"
|
||||||
|
|
||||||
|
# Query active BKC bundle version
|
||||||
|
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" \
|
||||||
|
-u "${AUTH}" -k | json_pp
|
||||||
|
|
||||||
|
# Query active IFWI (Integrated Firmware Image)
|
||||||
|
curl -X GET "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" \
|
||||||
|
-u "${AUTH}" -k | json_pp
|
||||||
|
```
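
If you only need the version strings, and assuming the BMC follows the standard Redfish `SoftwareInventory` schema (which exposes a `Version` property), you can filter the responses with `jq`:

```bash
sudo apt install -y jq

# Extract only the active BKC bundle and IFWI version strings
curl -s -k -u "${AUTH}" \
  "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/bundle_active" | jq -r '.Version'
curl -s -k -u "${AUTH}" \
  "https://${BMC_IP}/redfish/v1/UpdateService/FirmwareInventory/firmware_active" | jq -r '.Version'
```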
|
||||||
|
|
||||||
|
### Run basic system health checks
|
||||||
|
|
||||||
|
Before proceeding with software deployment, verify that all cluster nodes
|
||||||
|
comply with the [MI300X Basic Health
|
||||||
|
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi300x.html#basic-health-checks)
|
||||||
|
or [MI325X Basic Health
|
||||||
|
Checks](https://instinct.docs.amd.com/projects/system-acceptance/en/latest/gpus/mi325x.html#basic-health-checks).
|
||||||
|
Key requirements include specific kernel boot arguments, minimum system memory
|
||||||
|
thresholds, PCIe Gen5 link stability, and so on.
|
||||||
|
|
||||||
|
### Configure your backend network (netplan)
|
||||||
|
|
||||||
|
Configure the backend NICs for high-bandwidth inter-node communication. Suppose
|
||||||
|
the node's eight GPU-facing network interface controllers (NICs) are `eth2` to `eth9`. Each NIC must have its own subnet, disjoint from the others. For example, `eth2`
|
||||||
|
could use `192.168.50.0/24`, `eth3` could use `192.168.51.0/24`, and so on.
|
||||||
|
Each node needs a unique IP address on each subnet. You should use the same
|
||||||
|
final octet in each subnet for a given node. For example, one node would have
|
||||||
|
the addresses `192.168.50.2`, `192.168.51.2`, and so on. Another node might
|
||||||
|
have `192.168.50.3`, `192.168.51.3`, and so on. Ensure MTU is set to `4200`.
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
Ensure you identify the correct interface names for your system using `ip link`
|
||||||
|
before applying this configuration.
|
||||||
|
```
|
||||||
|
|
||||||
|
For example, your `/etc/netplan/50-backend.yaml` might include per-interface entries like the following, nested under the `network:` and `ethernets:` keys:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
eth2:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.50.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth3:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.51.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth4:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.52.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth5:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.53.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth6:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.54.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth7:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.55.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth8:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.56.2/24
|
||||||
|
mtu: 4200
|
||||||
|
eth9:
|
||||||
|
dhcp4: false
|
||||||
|
dhcp6: false
|
||||||
|
link-local: []
|
||||||
|
addresses:
|
||||||
|
- 192.168.57.2/24
|
||||||
|
mtu: 4200
|
||||||
|
```
|
||||||
|
|
||||||
|
To apply the configuration, use the following command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo netplan apply
|
||||||
|
```
|
||||||
|
|
||||||
|
To verify your configuration, use the following command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt install -y net-tools && ip -br a
|
||||||
|
```
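
Because RDMA performance is sensitive to the MTU, it is also worth confirming that every backend interface picked up the `4200` setting; the interface names follow the example above:

```bash
# Print the configured MTU for eth2 through eth9
for i in $(seq 2 9); do ip -o link show "eth${i}" | awk '{print $2, $4, $5}'; done
```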
|
||||||
|
|
||||||
|
### Configure your network file system (NFS)
|
||||||
|
|
||||||
|
Setting up a shared NFS volume facilitates centralized storage for models,
|
||||||
|
recipes, and logs across the cluster. Use the following commands to install the
|
||||||
|
necessary client tools and mount the remote directory.
|
||||||
|
|
||||||
|
```{important}
|
||||||
|
Replace `nfs_server_ip:/shared/folder` and `/mount/point` with your specific
|
||||||
|
server details and desired local mount path.
|
||||||
|
```
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo apt update && sudo apt install -y nfs-common
|
||||||
|
sudo mkdir -p /mount/point
|
||||||
|
sudo mount -t nfs nfs_server_ip:/shared/folder /mount/point
|
||||||
|
echo "nfs_server_ip:/shared/folder /mount/point nfs _netdev,nofail,x-systemd.automount,x-systemd.idle-timeout=600,vers=4.2 0 0" | sudo tee -a /etc/fstab
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure static hostname resolution for backend initialization (optional)
|
||||||
|
|
||||||
|
If the high-speed RDMA/IB interfaces are used for the initial distributed
|
||||||
|
coordination (such as `MASTER_ADDR`), you must configure static hostname
|
||||||
|
resolution. This ensures that cluster host names resolve to the backend network
|
||||||
|
IPs rather than the management or local loopback addresses.
|
||||||
|
|
||||||
|
Follow these steps to configure static hostname resolution:
|
||||||
|
|
||||||
|
1. Edit `/etc/hosts` on all nodes: for example, using `sudo vim /etc/hosts`.
|
||||||
|
2. Add the backend IP and hostname mappings.
|
||||||
|
3. Comment out any default local mappings (such as `127.0.1.1`) for the current
|
||||||
|
hostname to avoid resolution conflicts.
|
||||||
|
|
||||||
|
For example, your `/etc/hosts` entries might look like:
|
||||||
|
|
||||||
|
```text
|
||||||
|
# Map host names to backend network IPs
|
||||||
|
192.168.50.2 mori_test_01
|
||||||
|
192.168.50.3 mori_test_02
|
||||||
|
|
||||||
|
# Comment out the default entry to ensure resolution via the backend IP
|
||||||
|
# 127.0.1.1 mori_test_01 mori_test_01
|
||||||
|
```
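
To confirm that the names now resolve to the backend addresses, query the resolver and ping a peer by name; the host names below follow the example entries above:

```bash
# Both names should resolve to their 192.168.50.x backend addresses
getent hosts mori_test_01 mori_test_02

# Reachability check over the backend network
ping -c 1 mori_test_02
```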
|
||||||
|
|
||||||
|
## Software installation
|
||||||
|
|
||||||
|
Next, install the essential software stack required to operate the AMD Instinct
|
||||||
|
GPUs and high-speed networking components. Follow these steps to deploy the
|
||||||
|
NVIDIA DOCA drivers for Mellanox ConnectX-7 NICs, the ROCm software stack, and
|
||||||
|
the necessary kernel modules to enable hardware acceleration.
|
||||||
|
|
||||||
|
### Install CX7 driver and firmware
|
||||||
|
|
||||||
|
1. Download and install the `DOCA 2.9.3` driver following the instructions in
|
||||||
|
[NVIDIA DOCA 2.9.3
|
||||||
|
Downloads](https://developer.nvidia.com/doca-2-9-3-download-archive?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-all&Distribution=Ubuntu&version=24.04&installer_type=deb_local).
|
||||||
|
|
||||||
|
2. Download the appropriate firmware for your hardware PSID from the [NVIDIA
|
||||||
|
official website](https://network.nvidia.com/support/firmware/connectx7/)
|
||||||
|
and flash the device.
|
||||||
|
|
||||||
|
3. To verify the driver and firmware versions, use the following command, replacing `<IB Device>` with your specific backend interface. The expected output is shown after the command.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ethtool -i <IB Device>
|
||||||
|
```
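
On a correctly installed system, the output should report the DOCA driver and the CX7 firmware versions from the baseline table above. The interface's bus address will differ on your system; this is a representative example:

```bash
driver: mlx5_core
version: 24.10-3.2.5
firmware-version: 28.46.3048 (MT_0000000838)
bus-info: 0000:0c:00.0
```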
|
||||||
|
|
||||||
|
### Install ROCm
|
||||||
|
|
||||||
|
Use the following commands to quickly install ROCm 7.0.2 on Ubuntu 24.04:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
|
||||||
|
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
|
||||||
|
sudo apt update
|
||||||
|
sudo apt install python3-setuptools python3-wheel
|
||||||
|
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
|
||||||
|
sudo apt install rocm
|
||||||
|
```
|
||||||
|
|
||||||
|
For detailed installation instructions, refer to the [ROCm 7.0.2
|
||||||
|
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#rocm-installation).
|
||||||
|
|
||||||
|
### Install AMD GPU Driver (amdgpu)
|
||||||
|
|
||||||
|
Use the following commands to quickly install the AMD GPU Driver (ROCm 7.0.2) on Ubuntu 24.04:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
wget https://repo.radeon.com/amdgpu-install/7.0.2/ubuntu/noble/amdgpu-install_7.0.2.70002-1_all.deb
|
||||||
|
sudo apt install ./amdgpu-install_7.0.2.70002-1_all.deb
|
||||||
|
sudo apt update
|
||||||
|
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
|
||||||
|
sudo apt install amdgpu-dkms
|
||||||
|
```
|
||||||
|
|
||||||
|
For detailed installation instructions, refer to the [ROCm 7.0.2
|
||||||
|
documentation](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.0.2/install/quick-start.html#amdgpu-driver-installation).
|
||||||
|
|
||||||
|
## Network verification and testing
|
||||||
|
|
||||||
|
Before deploying the inference engine, validate the health and performance of
|
||||||
|
the cluster interconnects.
|
||||||
|
|
||||||
|
### Verify network connectivity
|
||||||
|
|
||||||
|
Verify that all network interfaces are reachable across the cluster nodes.
|
||||||
|
Assuming `eth0` is the management interface, `eth1` is for the VPC, and `eth2`
|
||||||
|
through `eth9` are the dedicated RoCE backend interfaces, use the following
|
||||||
|
loop to test reachability to a remote node (for instance, a target node with
|
||||||
|
host IP suffix `.3`).
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test connectivity for RoCE subnets 192.168.50.x through 192.168.57.x
|
||||||
|
for i in {0..7}; do ping -c 1 192.168.5${i}.3; done
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate your RDMA setup
|
||||||
|
|
||||||
|
Confirm that all eight RDMA network interfaces are in the `UP` state. Verify the MTU
|
||||||
|
setting of `4096` and ensure each device has a valid GID mapped to its assigned
|
||||||
|
IP address.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
ibv_devinfo -v
|
||||||
|
```
|
||||||
|
|
||||||
|
The output should look something like this:
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
hca_id: mlx5_0
|
||||||
|
transport: InfiniBand (0)
|
||||||
|
fw_ver: 28.46.3048
|
||||||
|
...
|
||||||
|
board_id: MT_0000000838
|
||||||
|
phys_port_cnt: 1
|
||||||
|
port: 1
|
||||||
|
state: PORT_ACTIVE (4)
|
||||||
|
max_mtu: 4096 (5)
|
||||||
|
active_mtu: 4096 (5)
|
||||||
|
sm_lid: 0
|
||||||
|
port_lid: 0
|
||||||
|
port_lmc: 0x00
|
||||||
|
link_layer: Ethernet
|
||||||
|
...
|
||||||
|
GID[ 0]: fe80:0000:0000:0000:d894:24ff:fe4a:96e2, RoCE v1
|
||||||
|
GID[ 1]: fe80::d894:24ff:fe4a:96e2, RoCE v2
|
||||||
|
GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:3903, RoCE v1
|
||||||
|
GID[ 3]: ::ffff:192.168.57.3, RoCE v2
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run RDMA bandwidth benchmarks
|
||||||
|
|
||||||
|
Verify the inter-node RDMA performance to ensure the network fabric can
|
||||||
|
saturate the link bandwidth.
|
||||||
|
|
||||||
|
#### Install RDMA performance tools
|
||||||
|
|
||||||
|
To get started, build the ROCm-optimized `rdma-perftest` test suite from
|
||||||
|
source:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt install -y libibumad-dev libpci-dev libibverbs-dev librdmacm-dev ibverbs-utils libtool
|
||||||
|
git clone https://github.com/ROCm/rdma-perftest
|
||||||
|
cd rdma-perftest/
|
||||||
|
./autogen.sh
|
||||||
|
./configure --enable-rocm --with-rocm=/opt/rocm
|
||||||
|
make -j$(nproc)
|
||||||
|
sudo make install
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Run a bandwidth test (GPU memory)
|
||||||
|
|
||||||
|
Perform a bandwidth test using ROCm GPU memory between two nodes. One acts
|
||||||
|
as a server and the other acts as a client. For 400G interfaces, the expected
|
||||||
|
peak throughput is approximately 390 Gbps. Replace `<SERVER_IP>` with the
|
||||||
|
appropriate IP.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On Server Node
|
||||||
|
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a
|
||||||
|
|
||||||
|
# On Client Node
|
||||||
|
./ib_write_bw --use_rocm=0 -d mlx5_0 --report_gbits -a <SERVER_IP>
|
||||||
|
```
|
||||||
|
|
||||||
|
## vLLM serving and MoRI unit tests
|
||||||
|
|
||||||
|
### Install Docker Engine
|
||||||
|
|
||||||
|
Install the Docker engine to manage the containerized vLLM and MoRI serving
|
||||||
|
environments.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt update && sudo apt install -y docker.io
|
||||||
|
```
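
Optionally, add your user to the `docker` group so that containers can be launched without `sudo`. Log out and back in for the group change to take effect:

```bash
sudo usermod -aG docker "$USER"
```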
|
||||||
|
|
||||||
|
### Download the DeepSeek PTPC model
|
||||||
|
|
||||||
|
This guide uses the
|
||||||
|
[DeepSeek-R1-FP8-Dynamic](https://huggingface.co/EmbeddedLLM/deepseek-r1-FP8-Dynamic)
|
||||||
|
model optimized for PTPC. Use the following commands to install the Hugging
|
||||||
|
Face CLI and download the model to your shared NFS directory:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Set up a virtual environment and install the Hugging Face CLI
|
||||||
|
sudo apt update && sudo apt install -y python3-venv
|
||||||
|
python3 -m venv ~/venvs/hf
|
||||||
|
source ~/venvs/hf/bin/activate
|
||||||
|
pip install huggingface_hub
|
||||||
|
|
||||||
|
# Download the model to the shared NFS mount point
|
||||||
|
huggingface-cli download --token <your_hf_token> \
|
||||||
|
EmbeddedLLM/deepseek-r1-FP8-Dynamic \
|
||||||
|
--local-dir /mount/point/models/EmbeddedLLM/deepseek-r1-FP8-Dynamic
|
||||||
|
```
|
||||||
|
|
||||||
|
### Launch the serving container
|
||||||
|
|
||||||
|
Deploy the vLLM MoRI serving Docker container on each node.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
CONTAINER_NAME=vllm_mori
|
||||||
|
IMAGE_NAME=aigmkt/vllm:mori_rocm6.4.1_20251105
|
||||||
|
|
||||||
|
docker run -it \
|
||||||
|
--rm \
|
||||||
|
--device /dev/dri --device /dev/kfd --device /dev/infiniband \
|
||||||
|
--network host --ipc host \
|
||||||
|
--group-add video \
|
||||||
|
--cap-add SYS_PTRACE \
|
||||||
|
--security-opt seccomp=unconfined \
|
||||||
|
--privileged \
|
||||||
|
-v /mount/point/models:/models \
|
||||||
|
--shm-size 128G \
|
||||||
|
--name ${CONTAINER_NAME} \
|
||||||
|
${IMAGE_NAME} /bin/bash
|
||||||
|
```
|
||||||
|
|
||||||
|
### Run MoRI inter-node unit tests
|
||||||
|
|
||||||
|
Before starting the vLLM service, run the MoRI unit test to verify that the
|
||||||
|
inter-node communication backend is correctly configured.
|
||||||
|
|
||||||
|
The key configuration variables are:
|
||||||
|
|
||||||
|
* `GLOO_SOCKET_IFNAME`: The network interface used for backend initialization such as `eth2`.
|
||||||
|
* `<MASTER_IP>`: The IP address of the primary node's backend interface.
|
||||||
|
|
||||||
|
```{note}
|
||||||
|
You can find reference performance data in the [ROCm/MoRI
|
||||||
|
repository](https://github.com/ROCm/mori?tab=readme-ov-file#mori-ep).
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Set up environment inside the container
|
||||||
|
cd /app/mori
|
||||||
|
export PYTHONPATH=/app/mori:$PYTHONPATH
|
||||||
|
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
|
||||||
|
# Node 0 (Primary)
|
||||||
|
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=1 \
|
||||||
|
--master_addr="<MASTER_IP>" --master_port=1234 \
|
||||||
|
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
|
||||||
|
--cmd bench --kernel-type v1
|
||||||
|
|
||||||
|
# Node 1 (Secondary)
|
||||||
|
torchrun --nnodes=2 --node_rank=1 --nproc_per_node=1 \
|
||||||
|
--master_addr="<MASTER_IP>" --master_port=1234 \
|
||||||
|
examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
|
||||||
|
--cmd bench --kernel-type v1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy and serve the model
|
||||||
|
|
||||||
|
To deploy DeepSeek-R1 (PTPC) with Expert Parallelism 16 (EP16) across two
|
||||||
|
nodes, use the following serving scripts.
|
||||||
|
|
||||||
|
#### Create serving scripts
|
||||||
|
|
||||||
|
Create the following scripts inside the container on each node.
|
||||||
|
|
||||||
|
* Node 0 (master node): `ep16_node0.sh`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
|
||||||
|
export VLLM_ROCM_USE_AITER=1
|
||||||
|
export VLLM_ROCM_USE_AITER_MOE=1
|
||||||
|
export VLLM_LOGGING_LEVEL=INFO
|
||||||
|
export VLLM_USE_V1=1
|
||||||
|
export VLLM_ROCM_USE_AITER_MLA=1
|
||||||
|
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
|
||||||
|
export VLLM_ALL2ALL_BACKEND=mori
|
||||||
|
|
||||||
|
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
|
||||||
|
-dp 16 \
|
||||||
|
--enable-expert-parallel \
|
||||||
|
--data-parallel-size-local 8 \
|
||||||
|
--data-parallel-address ${IP} \
|
||||||
|
--data-parallel-rpc-port 1212 \
|
||||||
|
--served-model-name deepseek \
|
||||||
|
--port 8777 \
|
||||||
|
--block-size 1 \
|
||||||
|
--distributed-executor-backend mp \
|
||||||
|
--gpu-memory-utilization 0.8 \
|
||||||
|
--max-model-len 8192 \
|
||||||
|
--max-num-batched-tokens 4096 \
|
||||||
|
--max-num-seqs 4096 \
|
||||||
|
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
|
||||||
|
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
|
||||||
|
--kv-cache-dtype fp8 \
|
||||||
|
--no-enable-prefix-caching \
|
||||||
|
--trust-remote-code 2>&1 | tee serving_node0_ep16.log
|
||||||
|
```
|
||||||
|
|
||||||
|
* Node 1: `ep16_node1.sh`
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Add VLLM_ENFORCE_EPLB=1 to enforce EP balance
|
||||||
|
export VLLM_ROCM_USE_AITER=1
|
||||||
|
export VLLM_ROCM_USE_AITER_MOE=1
|
||||||
|
export VLLM_LOGGING_LEVEL=INFO
|
||||||
|
export VLLM_USE_V1=1
|
||||||
|
export VLLM_ROCM_USE_AITER_MLA=1
|
||||||
|
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
|
||||||
|
export VLLM_ALL2ALL_BACKEND=mori
|
||||||
|
|
||||||
|
vllm serve /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
|
||||||
|
-dp 16 \
|
||||||
|
--enable-expert-parallel \
|
||||||
|
--headless \
|
||||||
|
--data-parallel-size-local 8 \
|
||||||
|
--data-parallel-start-rank 8 \
|
||||||
|
--data-parallel-address ${IP} \
|
||||||
|
--data-parallel-rpc-port 1212 \
|
||||||
|
--served-model-name deepseek \
|
||||||
|
--port 8777 \
|
||||||
|
--block-size 1 \
|
||||||
|
--distributed-executor-backend mp \
|
||||||
|
--gpu-memory-utilization 0.8 \
|
||||||
|
--max-model-len 8192 \
|
||||||
|
--max-num-batched-tokens 4096 \
|
||||||
|
--max-num-seqs 4096 \
|
||||||
|
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["+quant_fp8"]}' \
|
||||||
|
--cuda-graph-sizes 1 2 4 8 16 32 64 128 256 \
|
||||||
|
--kv-cache-dtype fp8 \
|
||||||
|
--no-enable-prefix-caching \
|
||||||
|
--trust-remote-code 2>&1 | tee serving_node1_ep16.log
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Run the serving scripts
|
||||||
|
|
||||||
|
Run the scripts on each node to launch the distributed serving instance.
|
||||||
|
Replace `<MASTER_IP>` with the backend network IP of Node 0.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On Node 0 (Primary)
|
||||||
|
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
IP=<MASTER_IP> bash ep16_node0.sh
|
||||||
|
|
||||||
|
# On Node 1 (Secondary)
|
||||||
|
export NCCL_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
export GLOO_SOCKET_IFNAME=<BACKEND_INTERFACE>
|
||||||
|
IP=<MASTER_IP> bash ep16_node1.sh
|
||||||
|
```
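
Once both nodes report that the engine is ready, you can exercise the OpenAI-compatible endpoint exposed by the head node. This sketch assumes the port (`8777`) and served model name (`deepseek`) configured in the scripts above:

```bash
# List the served models
curl http://<MASTER_IP>:8777/v1/models

# Send a short test request
curl http://<MASTER_IP>:8777/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'
```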
|
||||||
|
|
||||||
|
## Reproducing performance
|
||||||
|
|
||||||
|
This section details how to reproduce the performance metrics published in the
|
||||||
|
AMD ROCm Blog: [Practical, Fault-Robust Distributed Inference for DeepSeek on
|
||||||
|
AMD
|
||||||
|
MI300X](https://rocm.blogs.amd.com/software-tools-optimization/wide-ep-deepseek/README.html).
|
||||||
|
|
||||||
|
### Configuration for EP16 (16 GPUs)
|
||||||
|
|
||||||
|
To achieve the reported throughput, expert parallelism 16 (EP16) is used across
|
||||||
|
the decode nodes.
|
||||||
|
|
||||||
|
#### Benchmark target
|
||||||
|
|
||||||
|
* Decode throughput: ~12.4k output tokens/s per node.
|
||||||
|
|
||||||
|
### Performance reproduction commands
|
||||||
|
|
||||||
|
Use the following configurations to reproduce published performance metrics.
|
||||||
|
|
||||||
|
#### Decode benchmark
|
||||||
|
|
||||||
|
To reproduce the 12.4k output tokens/s, use the following configuration:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
MAX_CONCURRENCY=${1:-3072}
|
||||||
|
TIMES=2
|
||||||
|
NUM_PROMPTS=$((MAX_CONCURRENCY*TIMES))
|
||||||
|
vllm bench serve \
|
||||||
|
--max-concurrency $MAX_CONCURRENCY \
|
||||||
|
--num-prompts $NUM_PROMPTS \
|
||||||
|
--model /models/EmbeddedLLM/deepseek-r1-FP8-Dynamic/ \
|
||||||
|
--served-model-name deepseek \
|
||||||
|
--port 8777 \
|
||||||
|
--ignore-eos \
|
||||||
|
--trust-remote-code \
|
||||||
|
--dataset-name random \
|
||||||
|
--seed 2025 \
|
||||||
|
--random-input-len 2048 \
|
||||||
|
--random-output-len 1024 2>&1 | tee bench_decode_${MAX_CONCURRENCY}_isl_2k_osl_1k.log
|
||||||
|
```
|
||||||
|
|
||||||
|
To calculate the per-node throughput for comparison with the blog data, take
|
||||||
|
the reported **Peak output token throughput (tok/s)** from the benchmark
|
||||||
|
results and divide it by the total number of nodes in the cluster.
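
For example, if the two-node EP16 deployment reports a peak output token throughput of roughly 24,800 tok/s, the per-node figure is about 24,800 / 2 ≈ 12,400 tok/s, which corresponds to the decode target above.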
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
The following section outlines common issues and their solutions.
|
||||||
|
|
||||||
|
### Bandwidth test fails with error
|
||||||
|
|
||||||
|
1. Use ROCm-optimized `rdma-perftest`, not the generic `perftest`.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
which ib_write_bw
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Confirm the `SERVER_IP` is accessible.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
ping <SERVER_IP>
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Check the system logs; use `dmesg` for kernel-level errors.
|
||||||
|
|
||||||
|
``` bash
|
||||||
|
sudo dmesg -T | grep -iE 'error|warn|fail|exception'
|
||||||
|
```
|
||||||
|
|
||||||
|
### vLLM EP 16 with MoRI backend fails to launch
|
||||||
|
|
||||||
|
1. Error: `Waiting for init message from front-end.` Check connectivity to the data-parallel address (`IP`). Disable the firewall and SELinux, or allow traffic on port `1212`.
|
||||||
|
|
||||||
|
2. Verify server name resolution. Ensure server names are correctly mapped in `/etc/hosts`.
|
||||||
|
|
||||||
|
3. Confirm that the environment variable `GLOO_SOCKET_IFNAME` is set before running the vLLM serving script.
|
||||||
@@ -26,6 +26,12 @@ training, fine-tuning, and inference. It leverages popular machine learning fram
|
|||||||
|
|
||||||
- :doc:`SGLang inference performance testing <benchmark-docker/sglang>`
|
- :doc:`SGLang inference performance testing <benchmark-docker/sglang>`
|
||||||
|
|
||||||
|
- :doc:`vLLM distributed inference with MoRI <benchmark-docker/vllm-mori-distributed>`
|
||||||
|
|
||||||
|
- :doc:`SGLang distributed inference with MoRI <benchmark-docker/sglang-mori-distributed>`
|
||||||
|
|
||||||
|
- :doc:`SGLang distributed inference with Mooncake <benchmark-docker/sglang-distributed>`
|
||||||
|
|
||||||
- :doc:`xDiT diffusion inference <xdit-diffusion-inference>`
|
- :doc:`xDiT diffusion inference <xdit-diffusion-inference>`
|
||||||
|
|
||||||
- :doc:`Deploying your model <deploy-your-model>`
|
- :doc:`Deploying your model <deploy-your-model>`
|
||||||
|
|||||||
@@ -52,7 +52,7 @@ accelerate training workloads:
|
|||||||
- {{ component_version }}
|
- {{ component_version }}
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
|
|
||||||
.. _amd-megatron-lm-model-support-v25.11:
|
.. _amd-megatron-lm-model-support-v26.01:
|
||||||
|
|
||||||
Supported models
|
Supported models
|
||||||
================
|
================
|
||||||
@@ -97,7 +97,7 @@ accelerate training workloads:
|
|||||||
Some models, such as Llama, require an external license agreement through
|
Some models, such as Llama, require an external license agreement through
|
||||||
a third party (for example, Meta).
|
a third party (for example, Meta).
|
||||||
|
|
||||||
.. _amd-megatron-lm-performance-measurements-v25.11:
|
.. _amd-megatron-lm-performance-measurements-v26.01:
|
||||||
|
|
||||||
Performance measurements
|
Performance measurements
|
||||||
========================
|
========================
|
||||||
@@ -129,7 +129,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
|
|||||||
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||||
system's configuration.
|
system's configuration.
|
||||||
|
|
||||||
.. _mi300x-amd-megatron-lm-training-v25.11:
|
.. _mi300x-amd-megatron-lm-training-v26.01:
|
||||||
|
|
||||||
Environment setup
|
Environment setup
|
||||||
=================
|
=================
|
||||||
@@ -138,7 +138,7 @@ Use the following instructions to set up the environment, configure the script t
|
|||||||
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
|
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
|
||||||
image.
|
image.
|
||||||
|
|
||||||
.. _amd-megatron-lm-requirements-v25.11:
|
.. _amd-megatron-lm-requirements-v26.01:
|
||||||
|
|
||||||
Download the Docker image
|
Download the Docker image
|
||||||
-------------------------
|
-------------------------
|
||||||
@@ -190,7 +190,7 @@ Download the Docker image
|
|||||||
The Docker container hosts a verified commit of
|
The Docker container hosts a verified commit of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__.
|
||||||
|
|
||||||
.. _amd-megatron-lm-environment-setup-v25.11:
|
.. _amd-megatron-lm-environment-setup-v26.01:
|
||||||
|
|
||||||
Configuration
|
Configuration
|
||||||
=============
|
=============
|
||||||
@@ -200,39 +200,39 @@ Configuration
|
|||||||
Update the ``train_llama3.sh`` configuration script in the ``examples/llama``
|
Update the ``train_llama3.sh`` configuration script in the ``examples/llama``
|
||||||
directory of
|
directory of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
|
||||||
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`.
|
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
|
||||||
|
|
||||||
.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b
|
.. container:: model-doc pyt_megatron_lm_train_llama-2-7b pyt_megatron_lm_train_llama-2-70b
|
||||||
|
|
||||||
Update the ``train_llama2.sh`` configuration script in the ``examples/llama``
|
Update the ``train_llama2.sh`` configuration script in the ``examples/llama``
|
||||||
directory of
|
directory of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/llama>`__ to configure your training run.
|
||||||
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`.
|
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
|
||||||
|
|
||||||
.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy
|
.. container:: model-doc pyt_megatron_lm_train_deepseek-v3-proxy
|
||||||
|
|
||||||
Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3``
|
Update the ``train_deepseekv3.sh`` configuration script in the ``examples/deepseek_v3``
|
||||||
directory of
|
directory of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v3>`__ to configure your training run.
|
||||||
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`.
|
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
|
||||||
|
|
||||||
.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b
|
.. container:: model-doc pyt_megatron_lm_train_deepseek-v2-lite-16b
|
||||||
|
|
||||||
Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2``
|
Update the ``train_deepseekv2.sh`` configuration script in the ``examples/deepseek_v2``
|
||||||
directory of
|
directory of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/deepseek_v2>`__ to configure your training run.
|
||||||
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`.
|
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
|
||||||
|
|
||||||
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy
|
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b pyt_megatron_lm_train_mixtral-8x22b-proxy
|
||||||
|
|
||||||
Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral``
|
Update the ``train_mixtral_moe.sh`` configuration script in the ``examples/mixtral``
|
||||||
directory of
|
directory of
|
||||||
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run.
|
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev/examples/mixtral>`__ to configure your training run.
|
||||||
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v25.11>`.
|
Options can also be passed as command line arguments as described in :ref:`Run training <amd-megatron-lm-run-training-v26.01>`.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v25.11>` for more information on configuration options.
|
See :ref:`Key options <amd-megatron-lm-benchmark-test-vars-v26.01>` for more information on configuration options.
|
||||||
|
|
||||||
Multi-node configuration
|
Multi-node configuration
|
||||||
------------------------
|
------------------------
|
||||||
@@ -240,7 +240,7 @@ Multi-node configuration
|
|||||||
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
|
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
|
||||||
training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands.
|
training. See :ref:`amd-megatron-lm-multi-node-examples` for example run commands.
|
||||||
|
|
||||||
.. _amd-megatron-lm-tokenizer-v25.11:
|
.. _amd-megatron-lm-tokenizer-v26.01:
|
||||||
|
|
||||||
Tokenizer
|
Tokenizer
|
||||||
---------
|
---------
|
||||||
@@ -377,7 +377,7 @@ Download the dataset
|
|||||||
|
|
||||||
``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer.
|
``TOKENIZER_MODEL`` can be any accessible Hugging Face tokenizer.
|
||||||
Remember to either pre-download the tokenizer or set up Hugging Face access
|
Remember to either pre-download the tokenizer or set up Hugging Face access
|
||||||
when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v25.11>` section.
|
when needed -- see the :ref:`Tokenizer <amd-megatron-lm-tokenizer-v26.01>` section.
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
@@ -479,13 +479,13 @@ Download the dataset
|
|||||||
|
|
||||||
Ensure that the files are accessible inside the Docker container.
|
Ensure that the files are accessible inside the Docker container.
|
||||||
|
|
||||||
.. _amd-megatron-lm-run-training-v25.11:
|
.. _amd-megatron-lm-run-training-v26.01:
|
||||||
|
|
||||||
Run training
|
Run training
|
||||||
============
|
============
|
||||||
|
|
||||||
Use the following example commands to set up the environment, configure
|
Use the following example commands to set up the environment, configure
|
||||||
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v25.11>`, and run training on
|
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v26.01>`, and run training on
|
||||||
MI300X Series GPUs with the AMD Megatron-LM environment.
|
MI300X Series GPUs with the AMD Megatron-LM environment.
|
||||||
|
|
||||||
Before starting training, export the following environment variables.
|
Before starting training, export the following environment variables.
|
||||||
@@ -920,7 +920,7 @@ Single node training
|
|||||||
RECOMPUTE_ACTIVATIONS=full \
|
RECOMPUTE_ACTIVATIONS=full \
|
||||||
CKPT_FORMAT=torch_dist
|
CKPT_FORMAT=torch_dist
|
||||||
|
|
||||||
.. _amd-megatron-lm-multi-node-examples-v25.11:
|
.. _amd-megatron-lm-multi-node-examples-v26.01:
|
||||||
|
|
||||||
Multi-node training examples
|
Multi-node training examples
|
||||||
----------------------------
|
----------------------------
|
||||||
@@ -971,7 +971,7 @@ training on 16 nodes, try the following command:
|
|||||||
|
|
||||||
sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh
|
sbatch examples/deepseek_v3/train_deepseek_v3_slurm.sh
|
||||||
|
|
||||||
.. _amd-megatron-lm-benchmark-test-vars-v25.11:
|
.. _amd-megatron-lm-benchmark-test-vars-v26.01:
|
||||||
|
|
||||||
Key options
|
Key options
|
||||||
-----------
|
-----------
|
||||||
|
|||||||
@@ -16,14 +16,23 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
|
|||||||
- Components
|
- Components
|
||||||
- Resources
|
- Resources
|
||||||
|
|
||||||
* - v25.11
|
* - v26.1 (latest)
|
||||||
-
|
-
|
||||||
* ROCm 7.1.0
|
* ROCm 7.1.0
|
||||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||||
-
|
-
|
||||||
* :doc:`Primus Megatron documentation <../primus-megatron>`
|
* :doc:`Primus Megatron documentation <../primus-megatron>`
|
||||||
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
|
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
|
||||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
|
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
|
||||||
|
|
||||||
|
* - v25.11
|
||||||
|
-
|
||||||
|
* ROCm 7.1.0
|
||||||
|
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||||
|
-
|
||||||
|
* :doc:`Primus Megatron documentation <primus-megatron-v25.11>`
|
||||||
|
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.10>`
|
||||||
|
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
|
||||||
|
|
||||||
* - v25.10
|
* - v25.10
|
||||||
-
|
-
|
||||||
|
|||||||
@@ -37,7 +37,7 @@ GPUs containing essential components, including PyTorch, ROCm libraries, and
|
|||||||
Megatron-LM utilities. It contains the following software components to
|
Megatron-LM utilities. It contains the following software components to
|
||||||
accelerate training workloads:
|
accelerate training workloads:
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
.. tab-set::
|
.. tab-set::
|
||||||
|
|
||||||
@@ -146,7 +146,7 @@ image.
|
|||||||
Download the Docker image
|
Download the Docker image
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set docker = data.docker %}
|
{% set docker = data.docker %}
|
||||||
1. Use the following command to pull the Docker image from Docker Hub.
|
1. Use the following command to pull the Docker image from Docker Hub.
|
||||||
@@ -811,7 +811,7 @@ Single node training
|
|||||||
Note that DeepSeek-V2-Lite experiences instability due to GPU memory access faults
|
Note that DeepSeek-V2-Lite experiences instability due to GPU memory access faults
|
||||||
at large iteration counts.
|
at large iteration counts.
|
||||||
For stability, it's recommended to use Primus for this workload.
|
For stability, it's recommended to use Primus for this workload.
|
||||||
See :doc:`primus-megatron`.
|
See :doc:`../primus-megatron`.
|
||||||
|
|
||||||
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b
|
.. container:: model-doc pyt_megatron_lm_train_mixtral-8x7b
|
||||||
|
|
||||||
|
|||||||
File diff suppressed because it is too large
@@ -25,10 +25,10 @@ model training. Performance acceleration is powered by `Primus Turbo
|
|||||||
<https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
|
<https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
|
||||||
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
||||||
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
||||||
including Megatron-LM and :doc:`torchtitan <primus-pytorch>`.
|
including Megatron-LM and :doc:`torchtitan <../primus-pytorch>`.
|
||||||
|
|
||||||
Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM
|
Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM
|
||||||
training <megatron-lm>` workflow. To learn how to migrate workloads from
|
training <../megatron-lm>` workflow. To learn how to migrate workloads from
|
||||||
Megatron-LM to Primus with Megatron, see
|
Megatron-LM to Primus with Megatron, see
|
||||||
:doc:`megatron-lm-primus-migration-guide`.
|
:doc:`megatron-lm-primus-migration-guide`.
|
||||||
|
|
||||||
@@ -36,7 +36,7 @@ AMD provides a ready-to-use Docker images for MI355X, MI350X,
|
|||||||
MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and
|
MI325X, and MI300X GPUs containing essential components for Primus, ROCm, and
|
||||||
Megatron-LM.
|
Megatron-LM.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
.. tab-set::
|
.. tab-set::
|
||||||
|
|
||||||
@@ -63,7 +63,7 @@ The following models are pre-optimized for performance on AMD Instinct GPUs.
|
|||||||
Some instructions, commands, and training examples in this documentation
|
Some instructions, commands, and training examples in this documentation
|
||||||
might vary by model -- select one to get started.
|
might vary by model -- select one to get started.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
@@ -120,7 +120,7 @@ system's configuration.
|
|||||||
Environment setup
|
Environment setup
|
||||||
=================
|
=================
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
Use the following instructions to set up the environment, configure the script to train models, and
|
Use the following instructions to set up the environment, configure the script to train models, and
|
||||||
reproduce the benchmark results on AMD Instinct GPUs.
|
reproduce the benchmark results on AMD Instinct GPUs.
|
||||||
@@ -129,7 +129,7 @@ Environment setup
|
|||||||
|
|
||||||
Pull the Docker image
|
Pull the Docker image
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set docker = data.docker %}
|
{% set docker = data.docker %}
|
||||||
|
|
||||||
@@ -175,7 +175,7 @@ Configuration
|
|||||||
Primus defines a training configuration in YAML for each model in
|
Primus defines a training configuration in YAML for each model in
|
||||||
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__.
|
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs>`__.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
{% for model_group in model_groups %}
|
{% for model_group in model_groups %}
|
||||||
@@ -805,7 +805,7 @@ To run training on multiple nodes, you can use the
|
|||||||
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
|
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
|
||||||
to launch the multi-node workload. Use the following steps to set up your environment:
|
to launch the multi-node workload. Use the following steps to set up your environment:
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set docker = data.docker %}
|
{% set docker = data.docker %}
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|||||||
File diff suppressed because it is too large
@@ -24,17 +24,17 @@ Primus now supports the PyTorch torchtitan backend.
|
|||||||
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
|
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
|
||||||
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
||||||
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
||||||
including torchtitan and :doc:`Megatron-LM <primus-megatron>`.
|
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
|
||||||
|
|
||||||
Primus with the PyTorch torchtitan backend is designed to replace the
|
Primus with the PyTorch torchtitan backend is designed to replace the
|
||||||
:doc:`ROCm PyTorch training <pytorch-training>` workflow. See
|
:doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
|
||||||
:doc:`pytorch-training` to see steps to run workloads without Primus.
|
:doc:`../pytorch-training` to see steps to run workloads without Primus.
|
||||||
|
|
||||||
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
|
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
|
||||||
MI300X GPUs containing essential components for Primus and PyTorch training
|
MI300X GPUs containing essential components for Primus and PyTorch training
|
||||||
with Primus Turbo optimizations.
|
with Primus Turbo optimizations.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
.. tab-set::
|
.. tab-set::
|
||||||
|
|
||||||
@@ -61,7 +61,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325
|
|||||||
Some instructions, commands, and training recommendations in this documentation might
|
Some instructions, commands, and training recommendations in this documentation might
|
||||||
vary by model -- select one to get started.
|
vary by model -- select one to get started.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
@@ -96,7 +96,7 @@ vary by model -- select one to get started.
|
|||||||
.. seealso::
|
.. seealso::
|
||||||
|
|
||||||
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
|
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
|
||||||
see the documentation :doc:`pytorch-training` (without Primus)
|
see the documentation :doc:`../pytorch-training` (without Primus)
|
||||||
|
|
||||||
.. _amd-primus-pytorch-performance-measurements-v2510:
|
.. _amd-primus-pytorch-performance-measurements-v2510:
|
||||||
|
|
||||||
@@ -122,7 +122,7 @@ doesn’t test configurations and run conditions outside those described.
|
|||||||
Pull the Docker image
|
Pull the Docker image
|
||||||
=====================
|
=====================
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
Use the following command to pull the Docker image from Docker Hub.
|
Use the following command to pull the Docker image from Docker Hub.
|
||||||
|
|
||||||
@@ -134,11 +134,11 @@ Run training
|
|||||||
============
|
============
|
||||||
|
|
||||||
Once the setup is complete, choose between the following two workflows to start benchmarking training.
|
Once the setup is complete, choose between the following two workflows to start benchmarking training.
|
||||||
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus).
|
For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
|
||||||
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
|
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
|
||||||
tweak some configurations (such as batch sizes).
|
tweak some configurations (such as batch sizes).
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set docker = data.docker %}
|
{% set docker = data.docker %}
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
|
|||||||
@@ -0,0 +1,422 @@
|
|||||||
|
:orphan:
|
||||||
|
|
||||||
|
.. meta::
|
||||||
|
:description: How to train a model using PyTorch for ROCm.
|
||||||
|
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
|
||||||
|
|
||||||
|
****************************************
|
||||||
|
Training a model with Primus and PyTorch
|
||||||
|
****************************************
|
||||||
|
|
||||||
|
.. caution::
|
||||||
|
|
||||||
|
This documentation does not reflect the latest version of ROCm Primus PyTorch training
|
||||||
|
performance benchmark documentation. See :doc:`../primus-pytorch` for the latest version.
|
||||||
|
|
||||||
|
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
|
||||||
|
LLM training framework. It streamlines LLM
|
||||||
|
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
|
||||||
|
Primus now supports the PyTorch torchtitan backend.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
|
||||||
|
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
|
||||||
|
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
||||||
|
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
||||||
|
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
|
||||||
|
|
||||||
|
Primus with the PyTorch torchtitan backend is designed to replace the
|
||||||
|
:doc:`ROCm PyTorch training <../pytorch-training>` workflow. See
|
||||||
|
:doc:`../pytorch-training` to see steps to run workloads without Primus.
|
||||||
|
|
||||||
|
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
|
||||||
|
MI300X GPUs containing essential components for Primus and PyTorch training
|
||||||
|
with Primus Turbo optimizations.
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: {{ data.docker.pull_tag }}
|
||||||
|
:sync: {{ data.docker.pull_tag }}
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Software component
|
||||||
|
- Version
|
||||||
|
|
||||||
|
{% for component_name, component_version in data.docker.components.items() %}
|
||||||
|
* - {{ component_name }}
|
||||||
|
- {{ component_version }}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. _amd-primus-pytorch-model-support-v25.11:
|
||||||
|
|
||||||
|
Supported models
|
||||||
|
================
|
||||||
|
|
||||||
|
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
|
||||||
|
Some instructions, commands, and training recommendations in this documentation might
|
||||||
|
vary by model -- select one to get started.
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
{% set model_groups = data.model_groups %}
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
|
||||||
|
<div class="row gx-0">
|
||||||
|
<div class="col-2 me-1 px-2 model-param-head">Model</div>
|
||||||
|
<div class="row col-10 pe-0">
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
<div class="col-6 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="row gx-0 pt-1">
|
||||||
|
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
|
||||||
|
<div class="row col-10 pe-0">
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% set models = model_group.models %}
|
||||||
|
{% for model in models %}
|
||||||
|
{% if models|length % 3 == 0 %}
|
||||||
|
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||||
|
{% else %}
|
||||||
|
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||||
|
{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
.. seealso::
|
||||||
|
|
||||||
|
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
|
||||||
|
see the documentation :doc:`../pytorch-training` (without Primus)
|
||||||
|
|
||||||
|
.. _amd-primus-pytorch-performance-measurements-v25.11:
|
||||||
|
|
||||||
|
System validation
|
||||||
|
=================
|
||||||
|
|
||||||
|
Before running AI workloads, it's important to validate that your AMD hardware is configured
|
||||||
|
correctly and performing optimally.
|
||||||
|
|
||||||
|
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
|
||||||
|
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
|
||||||
|
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
|
||||||
|
before starting training.
|
||||||
|
|
||||||
|
To test for optimal performance, consult the recommended :ref:`System health benchmarks
|
||||||
|
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||||
|
system's configuration.
|
||||||
|
|
||||||
|
This Docker image is optimized for specific model configurations outlined
|
||||||
|
below. Performance can vary for other training workloads, as AMD
|
||||||
|
doesn’t test configurations and run conditions outside those described.
|
||||||
|
|
||||||
|
Pull the Docker image
|
||||||
|
=====================
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
Use the following command to pull the Docker image from Docker Hub.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker pull {{ data.docker.pull_tag }}
|
||||||
|
|
||||||
|
Run training
|
||||||
|
============
|
||||||
|
|
||||||
|
Once the setup is complete, choose between the following two workflows to start benchmarking training.
|
||||||
|
For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
|
||||||
|
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
|
||||||
|
tweak some configurations (such as batch sizes).
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
{% set docker = data.docker %}
|
||||||
|
{% set model_groups = data.model_groups %}
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MAD-integrated benchmarking
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% for model in model_group.models %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
The following run command is tailored to {{ model.model }}.
|
||||||
|
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
|
||||||
|
|
||||||
|
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
|
||||||
|
directory and install the required packages on the host machine.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
git clone https://github.com/ROCm/MAD
|
||||||
|
cd MAD
|
||||||
|
pip install -r requirements.txt
|
||||||
|
|
||||||
|
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
|
||||||
|
using one node with the {{ model.precision }} data type on the host machine.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
|
||||||
|
madengine run \
|
||||||
|
--tags {{ model.mad_tag }} \
|
||||||
|
--keep-model-dir \
|
||||||
|
--live-output \
|
||||||
|
--timeout 28800
|
||||||
|
|
||||||
|
MAD launches a Docker container with the name
|
||||||
|
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
|
||||||
|
model are collected in ``~/MAD/perf.csv``.
|
||||||
|
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. tab-item:: Primus benchmarking
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% for model in model_group.models %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
The following run commands are tailored to {{ model.model }}.
|
||||||
|
See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
|
||||||
|
|
||||||
|
.. rubric:: Download the Docker image and required packages
|
||||||
|
|
||||||
|
1. Pull the ``{{ docker.pull_tag }}`` Docker image from Docker Hub.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker pull {{ docker.pull_tag }}
|
||||||
|
|
||||||
|
2. Run the Docker container.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker run -it \
|
||||||
|
--device /dev/dri \
|
||||||
|
--device /dev/kfd \
|
||||||
|
--network host \
|
||||||
|
--ipc host \
|
||||||
|
--group-add video \
|
||||||
|
--cap-add SYS_PTRACE \
|
||||||
|
--security-opt seccomp=unconfined \
|
||||||
|
--privileged \
|
||||||
|
-v $HOME:$HOME \
|
||||||
|
-v $HOME/.ssh:/root/.ssh \
|
||||||
|
--shm-size 64G \
|
||||||
|
--name training_env \
|
||||||
|
{{ docker.pull_tag }}
|
||||||
|
|
||||||
|
Use these commands if you exit the ``training_env`` container and need to return to it.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker start training_env
|
||||||
|
docker exec -it training_env bash
|
||||||
|
|
||||||
|
The Docker container hosts verified commit ``c4c083de`` of the `Primus
|
||||||
|
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository.
|
||||||
|
|
||||||
|
.. rubric:: Prepare training datasets and dependencies
|
||||||
|
|
||||||
|
The following benchmarking examples require downloading models and datasets
|
||||||
|
from Hugging Face. To ensure successful access to gated repos, set your
|
||||||
|
``HF_TOKEN``.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
export HF_TOKEN=$your_personal_hugging_face_access_token
|
||||||
|
|
||||||
|
.. rubric:: Pretraining
|
||||||
|
|
||||||
|
To get started, navigate to the ``Primus`` directory in your container.
|
||||||
|
|
||||||
|
.. code-block::
|
||||||
|
|
||||||
|
cd /workspace/Primus
|
||||||
|
|
||||||
|
Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
|
||||||
|
included with Primus with the appropriate options.
|
||||||
|
|
||||||
|
.. rubric:: Benchmarking examples
|
||||||
|
|
||||||
|
.. container:: model-doc primus_pyt_train_llama-3.1-8b
|
||||||
|
|
||||||
|
Use the following command to train Llama 3.1 8B with BF16 precision using Primus torchtitan.
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MI355X and MI350X
|
||||||
|
:sync: MI355X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. tab-item:: MI325X
|
||||||
|
:sync: MI325X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh --training.local_batch_size 6
|
||||||
|
|
||||||
|
.. tab-item:: MI300X
|
||||||
|
:sync: MI300X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
To train Llama 3.1 8B with FP8 precision, use the following command.
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MI355X and MI350X
|
||||||
|
:sync: MI355X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. tab-item:: MI325X
|
||||||
|
:sync: MI325X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh --training.local_batch_size 7
|
||||||
|
|
||||||
|
.. tab-item:: MI300X
|
||||||
|
:sync: MI300X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. container:: model-doc primus_pyt_train_llama-3.1-70b
|
||||||
|
|
||||||
|
Use the following command to train Llama 3.1 70B with BF16 precision using Primus torchtitan.
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MI355X and MI350X
|
||||||
|
:sync: MI355X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. tab-item:: MI325X
|
||||||
|
:sync: MI325X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh --training.local_batch_size 6
|
||||||
|
|
||||||
|
.. tab-item:: MI300X
|
||||||
|
:sync: MI300X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
To train Llama 3.1 70B with FP8 precision, use the following command.
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MI355X and MI350X
|
||||||
|
:sync: MI355X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. tab-item:: MI325X
|
||||||
|
:sync: MI325X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh --training.local_batch_size 5
|
||||||
|
|
||||||
|
.. tab-item:: MI300X
|
||||||
|
:sync: MI300X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. container:: model-doc primus_pyt_train_deepseek-v3-16b
|
||||||
|
|
||||||
|
Use the following command to train DeepSeek V3 16B with BF16 precision using Primus torchtitan.
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MI355X and MI350X
|
||||||
|
:sync: MI355X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
|
||||||
|
.. tab-item:: MI325X
|
||||||
|
:sync: MI325X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh --training.local_batch_size 10
|
||||||
|
|
||||||
|
.. tab-item:: MI300X
|
||||||
|
:sync: MI300X
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
|
||||||
|
bash examples/run_pretrain.sh
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
Further reading
|
||||||
|
===============
|
||||||
|
|
||||||
|
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
|
||||||
|
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
|
||||||
|
|
||||||
|
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
|
||||||
|
|
||||||
|
- To learn more about system settings and management practices to configure your system for
|
||||||
|
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
|
||||||
|
|
||||||
|
- For a list of other ready-made Docker images for AI with ROCm, see
|
||||||
|
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
|
||||||
|
|
||||||
|
Previous versions
|
||||||
|
=================
|
||||||
|
|
||||||
|
See :doc:`pytorch-training-history` to find documentation for previous releases
|
||||||
|
of the ``ROCm/pytorch-training`` Docker image.
|
||||||
@@ -16,21 +16,30 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
|
|||||||
- Components
|
- Components
|
||||||
- Resources
|
- Resources
|
||||||
|
|
||||||
|
* - v26.1 (latest)
|
||||||
|
-
|
||||||
|
* ROCm 7.1.0
|
||||||
|
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||||
|
-
|
||||||
|
* :doc:`Primus PyTorch training documentation <../primus-pytorch>`
|
||||||
|
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>`
|
||||||
|
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v26.1/images/sha256-4fc8808bdb14117c6af7f38d79c809056e6fdbfd530c1fabbb61d097ddaf820d>`__
|
||||||
|
|
||||||
* - v25.11
|
* - v25.11
|
||||||
-
|
-
|
||||||
* ROCm 7.1.0
|
* ROCm 7.1.0
|
||||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||||
-
|
-
|
||||||
* :doc:`Primus PyTorch Training documentation <../primus-pytorch>`
|
* :doc:`Primus PyTorch training documentation <primus-pytorch-v25.11>`
|
||||||
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>`
|
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.11>`
|
||||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
|
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.11/images/sha256-71aa65a9bfc8e9dd18bce5b68c81caff864f223e9afa75dc1b719671a1f4a3c3>`__
|
||||||
|
|
||||||
* - v25.10
|
* - v25.10
|
||||||
-
|
-
|
||||||
* ROCm 7.1.0
|
* ROCm 7.1.0
|
||||||
* PyTorch 2.10.0.dev20251112+rocm7.1
|
* PyTorch 2.10.0.dev20251112+rocm7.1
|
||||||
-
|
-
|
||||||
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.10>`
|
* :doc:`Primus PyTorch training documentation <primus-pytorch-v25.10>`
|
||||||
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>`
|
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.10>`
|
||||||
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
|
* `Docker Hub <https://hub.docker.com/layers/rocm/primus/v25.10/images/sha256-140c37cd2eeeb183759b9622543fc03cc210dc97cbfa18eeefdcbda84420c197>`__
|
||||||
|
|
||||||
@@ -40,7 +49,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
|
|||||||
* Primus 0.3.0
|
* Primus 0.3.0
|
||||||
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
|
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
|
||||||
-
|
-
|
||||||
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.9>`
|
* :doc:`Primus PyTorch training documentation <primus-pytorch-v25.9>`
|
||||||
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>`
|
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.9>`
|
||||||
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
|
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
|
||||||
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
|
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
|
||||||
@@ -50,7 +59,7 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
|
|||||||
* ROCm 6.4.3
|
* ROCm 6.4.3
|
||||||
* PyTorch 2.8.0a0+gitd06a406
|
* PyTorch 2.8.0a0+gitd06a406
|
||||||
-
|
-
|
||||||
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.8>`
|
* :doc:`Primus PyTorch training documentation <primus-pytorch-v25.8>`
|
||||||
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>`
|
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>`
|
||||||
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__
|
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__
|
||||||
|
|
||||||
|
|||||||
@@ -30,7 +30,7 @@ environment for fine-tuning and pretraining a model on AMD Instinct MI325X
|
|||||||
and MI300X GPUs. It includes the following software components to accelerate
|
and MI300X GPUs. It includes the following software components to accelerate
|
||||||
training workloads:
|
training workloads:
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
.. tab-set::
|
.. tab-set::
|
||||||
|
|
||||||
@@ -58,7 +58,7 @@ MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
|
|||||||
training recommendations in this documentation might vary by model -- select
|
training recommendations in this documentation might vary by model -- select
|
||||||
one to get started.
|
one to get started.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
@@ -94,7 +94,7 @@ one to get started.
|
|||||||
|
|
||||||
The following table lists supported training modes per model.
|
The following table lists supported training modes per model.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
.. dropdown:: Supported training modes
|
.. dropdown:: Supported training modes
|
||||||
@@ -164,7 +164,7 @@ doesn’t test configurations and run conditions outside those described.
|
|||||||
Run training
|
Run training
|
||||||
============
|
============
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/pytorch-training-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.10-benchmark-models.yaml
|
||||||
|
|
||||||
{% set docker = data.docker %}
|
{% set docker = data.docker %}
|
||||||
{% set model_groups = data.model_groups %}
|
{% set model_groups = data.model_groups %}
|
||||||
|
|||||||
@@ -0,0 +1,669 @@
|
|||||||
|
:orphan:
|
||||||
|
|
||||||
|
.. meta::
|
||||||
|
:description: How to train a model using PyTorch for ROCm.
|
||||||
|
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
|
||||||
|
|
||||||
|
**************************************
|
||||||
|
Training a model with PyTorch on ROCm
|
||||||
|
**************************************
|
||||||
|
|
||||||
|
.. caution::
|
||||||
|
|
||||||
|
This documentation does not reflect the latest version of ROCm PyTorch training
|
||||||
|
performance benchmark documentation. See :doc:`../pytorch-training` for the latest version.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
|
||||||
|
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
|
||||||
|
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
|
||||||
|
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
|
||||||
|
including torchtitan and :doc:`Megatron-LM <../primus-megatron>`.
|
||||||
|
|
||||||
|
See :doc:`../primus-pytorch` for details.
|
||||||
|
|
||||||
|
PyTorch is an open-source machine learning framework that is widely used for
|
||||||
|
model training with GPU-optimized components for transformer-based models.
|
||||||
|
The PyTorch for ROCm training Docker image provides a prebuilt optimized
|
||||||
|
environment for fine-tuning and pretraining a model on AMD Instinct MI325X
|
||||||
|
and MI300X GPUs. It includes the following software components to accelerate
|
||||||
|
training workloads:
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: {{ data.docker.pull_tag }}
|
||||||
|
:sync: {{ data.docker.pull_tag }}
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Software component
|
||||||
|
- Version
|
||||||
|
|
||||||
|
{% for component_name, component_version in data.docker.components.items() %}
|
||||||
|
* - {{ component_name }}
|
||||||
|
- {{ component_version }}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. _amd-pytorch-training-model-support-v25.11:
|
||||||
|
|
||||||
|
Supported models
|
||||||
|
================
|
||||||
|
|
||||||
|
The following models are pre-optimized for performance on the AMD Instinct
|
||||||
|
MI355X, MI350X, MI325X, and MI300X GPUs. Some instructions, commands, and
|
||||||
|
training recommendations in this documentation might vary by model -- select
|
||||||
|
one to get started.
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
{% set model_groups = data.model_groups %}
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
|
||||||
|
<div class="row gx-0">
|
||||||
|
<div class="col-2 me-1 px-2 model-param-head">Model</div>
|
||||||
|
<div class="row col-10 pe-0">
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="row gx-0 pt-1">
|
||||||
|
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
|
||||||
|
<div class="row col-10 pe-0">
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% set models = model_group.models %}
|
||||||
|
{% for model in models %}
|
||||||
|
{% if models|length % 3 == 0 %}
|
||||||
|
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||||
|
{% else %}
|
||||||
|
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
|
||||||
|
{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
.. _amd-pytorch-training-supported-training-modes-v25.11:
|
||||||
|
|
||||||
|
The following table lists supported training modes per model.
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
{% set model_groups = data.model_groups %}
|
||||||
|
.. dropdown:: Supported training modes
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Model
|
||||||
|
- Supported training modes
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% set models = model_group.models %}
|
||||||
|
{% for model in models %}
|
||||||
|
{% if model.training_modes %}
|
||||||
|
* - {{ model.model }}
|
||||||
|
- ``{{ model.training_modes | join('``, ``') }}``
|
||||||
|
|
||||||
|
{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Some model and fine-tuning combinations are not listed. This is
|
||||||
|
because the `upstream torchtune repository <https://github.com/pytorch/torchtune>`__
|
||||||
|
doesn't provide default YAML configurations for them.
|
||||||
|
For advanced usage, you can create a custom configuration to enable
|
||||||
|
unlisted fine-tuning methods by using an existing file in the
|
||||||
|
``/workspace/torchtune/recipes/configs`` directory as a template.
|
||||||
|
|
||||||
|
.. _amd-pytorch-training-performance-measurements-v25.11:
|
||||||
|
|
||||||
|
Performance measurements
|
||||||
|
========================
|
||||||
|
|
||||||
|
To evaluate performance, the
|
||||||
|
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
|
||||||
|
page provides reference throughput and latency measurements for training
|
||||||
|
popular AI models.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The performance data presented in
|
||||||
|
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
|
||||||
|
should not be interpreted as the peak performance achievable by AMD
|
||||||
|
Instinct MI325X and MI300X GPUs or ROCm software.
|
||||||
|
|
||||||
|
System validation
|
||||||
|
=================
|
||||||
|
|
||||||
|
Before running AI workloads, it's important to validate that your AMD hardware is configured
|
||||||
|
correctly and performing optimally.
|
||||||
|
|
||||||
|
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
|
||||||
|
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
|
||||||
|
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
|
||||||
|
before starting training.
|
||||||
|
|
||||||
|
To test for optimal performance, consult the recommended :ref:`System health benchmarks
|
||||||
|
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||||
|
system's configuration.
|
||||||
|
|
||||||
|
This Docker image is optimized for specific model configurations outlined
|
||||||
|
below. Performance can vary for other training workloads, as AMD
|
||||||
|
doesn’t test configurations and run conditions outside those described.
|
||||||
|
|
||||||
|
Run training
|
||||||
|
============
|
||||||
|
|
||||||
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.11-benchmark-models.yaml
|
||||||
|
|
||||||
|
{% set docker = data.docker %}
|
||||||
|
{% set model_groups = data.model_groups %}
|
||||||
|
|
||||||
|
Once the setup is complete, choose between two options to start benchmarking training:
|
||||||
|
|
||||||
|
.. tab-set::
|
||||||
|
|
||||||
|
.. tab-item:: MAD-integrated benchmarking
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% for model in model_group.models %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
The following run command is tailored to {{ model.model }}.
|
||||||
|
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
|
||||||
|
|
||||||
|
1. Clone the ROCm Model Automation and Dashboarding (MAD) repository from `<https://github.com/ROCm/MAD>`__ to a local
|
||||||
|
directory and install the required packages on the host machine.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
git clone https://github.com/ROCm/MAD
|
||||||
|
cd MAD
|
||||||
|
pip install -r requirements.txt
|
||||||
|
|
||||||
|
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
|
||||||
|
using one node with the {{ model.precision }} data type on the host machine.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
|
||||||
|
madengine run \
|
||||||
|
--tags {{ model.mad_tag }} \
|
||||||
|
--keep-model-dir \
|
||||||
|
--live-output \
|
||||||
|
--timeout 28800
|
||||||
|
|
||||||
|
MAD launches a Docker container with the name
|
||||||
|
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
|
||||||
|
model are collected in ``~/MAD/perf.csv``.
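
To take a quick look at the collected results, you can format the CSV from the
host shell, for example (assuming the default ``~/MAD`` clone location):

.. code-block:: shell

# Render the comma-separated report as an aligned table
column -s, -t < ~/MAD/perf.csv | less -S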
|
||||||
|
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. tab-item:: Standalone benchmarking
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% for model in model_group.models %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
The following commands are tailored to {{ model.model }}.
|
||||||
|
See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
|
||||||
|
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. rubric:: Download the Docker image and required packages
|
||||||
|
|
||||||
|
1. Use the following command to pull the Docker image from Docker Hub.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker pull {{ docker.pull_tag }}
|
||||||
|
|
||||||
|
2. Launch the Docker container.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker run -it \
|
||||||
|
--device /dev/dri \
|
||||||
|
--device /dev/kfd \
|
||||||
|
--network host \
|
||||||
|
--ipc host \
|
||||||
|
--group-add video \
|
||||||
|
--cap-add SYS_PTRACE \
|
||||||
|
--security-opt seccomp=unconfined \
|
||||||
|
--privileged \
|
||||||
|
-v $HOME:$HOME \
|
||||||
|
-v $HOME/.ssh:/root/.ssh \
|
||||||
|
--shm-size 64G \
|
||||||
|
--name training_env \
|
||||||
|
{{ docker.pull_tag }}
|
||||||
|
|
||||||
|
Use these commands if you exit the ``training_env`` container and need to return to it.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
docker start training_env
|
||||||
|
docker exec -it training_env bash
|
||||||
|
|
||||||
|
3. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
|
||||||
|
repository and navigate to the benchmark scripts directory
|
||||||
|
``/workspace/MAD/scripts/pytorch_train``.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
git clone https://github.com/ROCm/MAD
|
||||||
|
cd MAD/scripts/pytorch_train
|
||||||
|
|
||||||
|
.. rubric:: Prepare training datasets and dependencies
|
||||||
|
|
||||||
|
1. The following benchmarking examples require downloading models and datasets
|
||||||
|
from Hugging Face. To ensure successful access to gated repos, set your
|
||||||
|
``HF_TOKEN``.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
export HF_TOKEN=$your_personal_hugging_face_access_token
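# Optionally verify the token is picked up (assumes the Hugging Face CLI is
# installed in the container); this should print your account name.
huggingface-cli whoami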
|
||||||
|
|
||||||
|
2. Run the setup script to install libraries and datasets needed for benchmarking.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./pytorch_benchmark_setup.sh
|
||||||
|
|
||||||
|
.. container:: model-doc pyt_train_llama-3.1-8b
|
||||||
|
|
||||||
|
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B:
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Library
|
||||||
|
- Reference
|
||||||
|
|
||||||
|
* - ``accelerate``
|
||||||
|
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
|
||||||
|
|
||||||
|
* - ``datasets``
|
||||||
|
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
|
||||||
|
|
||||||
|
.. container:: model-doc pyt_train_llama-3.1-70b
|
||||||
|
|
||||||
|
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B:
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Library
|
||||||
|
- Reference
|
||||||
|
|
||||||
|
* - ``datasets``
|
||||||
|
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
|
||||||
|
|
||||||
|
* - ``torchdata``
|
||||||
|
- `TorchData <https://meta-pytorch.org/data/beta/index.html#torchdata>`__
|
||||||
|
|
||||||
|
* - ``tomli``
|
||||||
|
- `Tomli <https://pypi.org/project/tomli/>`__
|
||||||
|
|
||||||
|
* - ``tiktoken``
|
||||||
|
- `tiktoken <https://github.com/openai/tiktoken>`__
|
||||||
|
|
||||||
|
* - ``blobfile``
|
||||||
|
- `blobfile <https://pypi.org/project/blobfile/>`__
|
||||||
|
|
||||||
|
* - ``tabulate``
|
||||||
|
- `tabulate <https://pypi.org/project/tabulate/>`__
|
||||||
|
|
||||||
|
* - ``wandb``
|
||||||
|
- `Weights & Biases <https://github.com/wandb/wandb>`__
|
||||||
|
|
||||||
|
* - ``sentencepiece``
|
||||||
|
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
|
||||||
|
|
||||||
|
* - ``tensorboard``
|
||||||
|
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
|
||||||
|
|
||||||
|
.. container:: model-doc pyt_train_flux
|
||||||
|
|
||||||
|
``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX:
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Library
|
||||||
|
- Reference
|
||||||
|
|
||||||
|
* - ``accelerate``
|
||||||
|
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
|
||||||
|
|
||||||
|
* - ``datasets``
|
||||||
|
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`__ 3.2.0
|
||||||
|
|
||||||
|
* - ``sentencepiece``
|
||||||
|
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
|
||||||
|
|
||||||
|
* - ``tensorboard``
|
||||||
|
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
|
||||||
|
|
||||||
|
* - ``csvkit``
|
||||||
|
- `csvkit <https://csvkit.readthedocs.io/en/latest/>`__ 2.0.1
|
||||||
|
|
||||||
|
* - ``deepspeed``
|
||||||
|
- `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`__ 0.16.2
|
||||||
|
|
||||||
|
* - ``diffusers``
|
||||||
|
- `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`__ 0.31.0
|
||||||
|
|
||||||
|
* - ``GitPython``
|
||||||
|
- `GitPython <https://github.com/gitpython-developers/GitPython>`__ 3.1.44
|
||||||
|
|
||||||
|
* - ``opencv-python-headless``
|
||||||
|
- `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`__ 4.10.0.84
|
||||||
|
|
||||||
|
* - ``peft``
|
||||||
|
- `PEFT <https://huggingface.co/docs/peft/en/index>`__ 0.14.0
|
||||||
|
|
||||||
|
* - ``protobuf``
|
||||||
|
- `Protocol Buffers <https://github.com/protocolbuffers/protobuf>`__ 5.29.2
|
||||||
|
|
||||||
|
* - ``pytest``
|
||||||
|
- `PyTest <https://docs.pytest.org/en/stable/>`__ 8.3.4
|
||||||
|
|
||||||
|
* - ``python-dotenv``
|
||||||
|
- `python-dotenv <https://pypi.org/project/python-dotenv/>`__ 1.0.1
|
||||||
|
|
||||||
|
* - ``seaborn``
|
||||||
|
- `Seaborn <https://seaborn.pydata.org/>`__ 0.13.2
|
||||||
|
|
||||||
|
* - ``transformers``
|
||||||
|
- `Transformers <https://huggingface.co/docs/transformers/en/index>`__ 4.47.0
|
||||||
|
|
||||||
|
``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face:
|
||||||
|
|
||||||
|
* `frank-chieng/chinese_architecture_siheyuan <https://huggingface.co/datasets/frank-chieng/chinese_architecture_siheyuan>`__
|
||||||
|
|
||||||
|
{% for model_group in model_groups %}
|
||||||
|
{% for model in model_group.models %}
|
||||||
|
{% set training_modes = model.training_modes %}
|
||||||
|
{% set training_mode_descs = {
|
||||||
|
"pretrain": "Benchmark pre-training.",
|
||||||
|
"HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision."
|
||||||
|
} %}
|
||||||
|
{% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %}
|
||||||
|
{% if available_modes %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
.. rubric:: Pretraining
|
||||||
|
|
||||||
|
To start the pre-training benchmark, use the following command with the
|
||||||
|
appropriate options, described in the table below.
|
||||||
|
|
||||||
|
{% if model.mad_tag == "pyt_train_dlrm" %}
|
||||||
|
|
||||||
|
1. Go to the DLRM directory.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
cd /workspace/DLRMBenchmark
|
||||||
|
|
||||||
|
2. To run the single-node training benchmark for DLRM-v2 with TF32 precision,
|
||||||
|
run the following script.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./launch_training_single_node.sh
|
||||||
|
|
||||||
|
To run with MAD within the Docker container, use the following command.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./pytorch_benchmark_report.sh -t pretrain -m DLRM
|
||||||
|
|
||||||
|
{% else %}
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
|
||||||
|
-m {{ model.model_repo }} \
|
||||||
|
-p $datatype \
|
||||||
|
-s $sequence_length
|
||||||
|
|
||||||
|
{% if model.mad_tag == "pyt_train_flux" %}
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Currently, FLUX models are not supported out of the box in this Docker image.
|
||||||
|
To use FLUX, refer to the earlier ``rocm/pytorch-training`` Docker image documented in :doc:`pytorch-training-v25.6`.
|
||||||
|
|
||||||
|
Occasionally, downloading the FLUX dataset might fail. If you encounter this
|
||||||
|
error, manually download it from Hugging Face at
|
||||||
|
`black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
|
||||||
|
and save it to ``/workspace/FluxBenchmark``. This ensures that the test script can access
|
||||||
|
the required dataset.
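
For example, one way to fetch it manually is with the Hugging Face CLI (a
sketch only; it assumes the CLI is available in the container and that your
``HF_TOKEN`` has been granted access to the gated repository):

.. code-block:: shell

# Download the gated FLUX.1-dev repository into the benchmark directory
huggingface-cli download black-forest-labs/FLUX.1-dev --local-dir /workspace/FluxBenchmark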
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Name
|
||||||
|
- Options
|
||||||
|
- Description
|
||||||
|
|
||||||
|
{% for mode in available_modes %}
|
||||||
|
* - {% if loop.first %}``$training_mode``{% endif %}
|
||||||
|
- ``{{ mode }}``
|
||||||
|
- {{ training_mode_descs[mode] }}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
* - ``$datatype``
|
||||||
|
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
|
||||||
|
- Only Llama 3.1 8B supports FP8 precision.
|
||||||
|
|
||||||
|
* - ``$sequence_length``
|
||||||
|
- Between 2048 and 8192. 8192 by default.
|
||||||
|
- Sequence length for the language model.
|
||||||
|
{% endif %}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% set training_modes = model.training_modes %}
|
||||||
|
{% set training_mode_descs = {
|
||||||
|
"posttrain": "Benchmark post-training.",
|
||||||
|
} %}
|
||||||
|
{% set available_modes = training_modes | select("in", ["posttrain"]) | list %}
|
||||||
|
{% if available_modes %}
|
||||||
|
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
.. rubric:: Post-training
|
||||||
|
|
||||||
|
To start the post-training benchmark, use the following command with the
|
||||||
|
appropriate options, described in the table below.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
|
||||||
|
-m {{ model.model_repo }} \
|
||||||
|
-p $datatype \
|
||||||
|
-s $sequence_length
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Name
|
||||||
|
- Options
|
||||||
|
- Description
|
||||||
|
|
||||||
|
{% for mode in available_modes %}
|
||||||
|
* - {% if loop.first %}``$training_mode``{% endif %}
|
||||||
|
- ``{{ mode }}``
|
||||||
|
- {{ training_mode_descs[mode] }}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
* - ``$datatype``
|
||||||
|
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
|
||||||
|
- Only Llama 3.1 8B supports FP8 precision.
|
||||||
|
|
||||||
|
* - ``$sequence_length``
|
||||||
|
- Between 2048 and 8192. 8192 by default.
|
||||||
|
- Sequence length for the language model.
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% set training_mode_descs = {
|
||||||
|
"finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).",
|
||||||
|
"finetune_lora": "LoRA fine-tuning (BF16 supported).",
|
||||||
|
"finetune_qlora": "QLoRA fine-tuning (BF16 supported).",
|
||||||
|
"HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.",
|
||||||
|
} %}
|
||||||
|
{% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %}
|
||||||
|
{% if available_modes %}
|
||||||
|
.. container:: model-doc {{ model.mad_tag }}
|
||||||
|
|
||||||
|
.. rubric:: Fine-tuning
|
||||||
|
|
||||||
|
To start the fine-tuning benchmark, use the following command with the
|
||||||
|
appropriate options, described in the table below.
|
||||||
|
See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
./pytorch_benchmark_report.sh -t $training_mode \
|
||||||
|
-m {{ model.model_repo }} \
|
||||||
|
-p $datatype \
|
||||||
|
-s $sequence_length
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - Name
|
||||||
|
- Options
|
||||||
|
- Description
|
||||||
|
|
||||||
|
{% for mode in available_modes %}
|
||||||
|
* - {% if loop.first %}``$training_mode``{% endif %}
|
||||||
|
- ``{{ mode }}``
|
||||||
|
- {{ training_mode_descs[mode] }}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
* - ``$datatype``
|
||||||
|
- ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %}
|
||||||
|
- All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %}
|
||||||
|
|
||||||
|
* - ``$sequence_length``
|
||||||
|
- Between 2048 and 16384.
|
||||||
|
- Sequence length for the language model.
|
||||||
|
|
||||||
|
{% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %}
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B),
|
||||||
|
use the following torchtune commit for compatibility:
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
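# Run the checkout from inside the torchtune source tree; in this image the
# sources are assumed to live under /workspace/torchtune (the same tree that
# holds the recipe configs referenced earlier in this guide).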
git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e
|
||||||
|
|
||||||
|
{% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %}
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of
|
||||||
|
input tensor should be smaller than max_seq_len (4096)``.
|
||||||
|
This error indicates that an input sequence is longer than the model's maximum context window.
|
||||||
|
|
||||||
|
Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096
|
||||||
|
tokens in this case). You can resolve this by truncating the input or splitting
|
||||||
|
it into smaller chunks before passing it to the model.
|
||||||
|
|
||||||
|
Note on reproducibility: The results in this guide are based on
|
||||||
|
commit ``b4c98ac`` from the upstream
|
||||||
|
`<https://github.com/pytorch/torchtune>`__ repository. For the
|
||||||
|
latest updates, you can use the main branch.
|
||||||
|
|
||||||
|
{% endif %}
|
||||||
|
{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
.. rubric:: Benchmarking examples
|
||||||
|
|
||||||
|
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
|
||||||
|
|
||||||
|
.. _amd-pytorch-training-multinode-examples-v25.11:
|
||||||
|
|
||||||
|
Multi-node training
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
|
||||||
|
training. See :ref:`rocm-for-ai-multi-node-setup-pyt-train-example` for example Slurm run commands.
|
||||||
|
|
||||||
|
Pre-training
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Multi-node training with torchtitan is supported. The provided SLURM script is pre-configured for Llama 3 70B.
|
||||||
|
|
||||||
|
To launch the training job on a SLURM cluster for Llama 3 70B, run the following commands from the MAD repository.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
# In the MAD repository
|
||||||
|
cd scripts/pytorch_train
|
||||||
|
sbatch run_slurm_train.sh
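
Once the job is queued, you can monitor it with standard Slurm commands, for
example (output file naming depends on the ``#SBATCH`` directives in the
script; ``slurm-<jobid>.out`` is the Slurm default):

.. code-block:: shell

# Check the job state and follow its output once the job starts
squeue -u $USER
tail -f slurm-<jobid>.out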
|
||||||
|
|
||||||
|
Fine-tuning
|
||||||
|
~~~~~~~~~~~
|
||||||
|
|
||||||
|
Multi-node training with torchtune is supported. The provided SLURM script is pre-configured for Llama 3.3 70B.
|
||||||
|
|
||||||
|
To launch the training job on a SLURM cluster for Llama 3.3 70B, run the following commands from the MAD repository.
|
||||||
|
|
||||||
|
.. code-block:: shell
|
||||||
|
|
||||||
|
huggingface-cli login # Get access to HF Llama model space
|
||||||
|
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally
|
||||||
|
# In the MAD repository
|
||||||
|
cd scripts/pytorch_train
|
||||||
|
sbatch Torchtune_Multinode.sh
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Information regarding benchmark setup:
|
||||||
|
|
||||||
|
* By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``.
|
||||||
|
* You can adjust the torchtune `YAML configuration file
|
||||||
|
<https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml>`__
|
||||||
|
if you're using a different model.
|
||||||
|
* The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``.
|
||||||
|
* Set the ``mounting_paths`` inside the SLURM script.
|
||||||
|
|
||||||
|
Once the run is finished, you can find the log files in the ``result_torchtune/`` directory.
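
For example, to see which logs were produced (file names depend on the run
configuration):

.. code-block:: shell

# List the run logs, newest first
ls -lt result_torchtune/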
|
||||||
|
|
||||||
|
Further reading
|
||||||
|
===============
|
||||||
|
|
||||||
|
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
|
||||||
|
|
||||||
|
- To learn more about system settings and management practices to configure your system for
|
||||||
|
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
|
||||||
|
|
||||||
|
- For a list of other ready-made Docker images for AI with ROCm, see
|
||||||
|
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
|
||||||
|
|
||||||
|
Previous versions
|
||||||
|
=================
|
||||||
|
|
||||||
|
See :doc:`pytorch-training-history` to find documentation for previous releases
|
||||||
|
of the ``ROCm/pytorch-training`` Docker image.
|
||||||
@@ -47,7 +47,7 @@ Megatron-LM.
|
|||||||
- {{ component_version }}
|
- {{ component_version }}
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
|
|
||||||
.. _amd-primus-megatron-lm-model-support-v25.11:
|
.. _amd-primus-megatron-lm-model-support-v26.01:
|
||||||
|
|
||||||
Supported models
|
Supported models
|
||||||
================
|
================
|
||||||
@@ -108,7 +108,7 @@ To test for optimal performance, consult the recommended :ref:`System health ben
|
|||||||
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
|
||||||
system's configuration.
|
system's configuration.
|
||||||
|
|
||||||
.. _mi300x-amd-primus-megatron-lm-training-v25.11:
|
.. _mi300x-amd-primus-megatron-lm-training-v26.01:
|
||||||
|
|
||||||
Environment setup
|
Environment setup
|
||||||
=================
|
=================
|
||||||
@@ -118,7 +118,7 @@ Environment setup
|
|||||||
Use the following instructions to set up the environment, configure the script to train models, and
|
Use the following instructions to set up the environment, configure the script to train models, and
|
||||||
reproduce the benchmark results on AMD Instinct GPUs.
|
reproduce the benchmark results on AMD Instinct GPUs.
|
||||||
|
|
||||||
.. _amd-primus-megatron-lm-requirements-v25.11:
|
.. _amd-primus-megatron-lm-requirements-v26.01:
|
||||||
|
|
||||||
Pull the Docker image
|
Pull the Docker image
|
||||||
|
|
||||||
@@ -157,16 +157,16 @@ Pull the Docker image
|
|||||||
docker start primus_training_env
|
docker start primus_training_env
|
||||||
docker exec -it primus_training_env bash
|
docker exec -it primus_training_env bash
|
||||||
|
|
||||||
The Docker container hosts verified commit ``c4c083de`` of the `Primus
|
The Docker container hosts verified commit ``9c529cd4`` of the `Primus
|
||||||
<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository.
|
<https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167>`__ repository.
|
||||||
|
|
||||||
.. _amd-primus-megatron-lm-environment-setup-v25.11:
|
.. _amd-primus-megatron-lm-environment-setup-v26.01:
|
||||||
|
|
||||||
Configuration
|
Configuration
|
||||||
=============
|
=============
|
||||||
|
|
||||||
Primus defines a training configuration in YAML for each model in
|
Primus defines a training configuration in YAML for each model in
|
||||||
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/examples/megatron/configs>`__.
|
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/examples/megatron/configs>`__.
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
||||||
|
|
||||||
@@ -207,7 +207,7 @@ You can use either mock data or real data for training.
|
|||||||
|
|
||||||
Ensure that the files are accessible inside the Docker container.
|
Ensure that the files are accessible inside the Docker container.
|
||||||
|
|
||||||
.. _amd-primus-megatron-lm-tokenizer-v25.11:
|
.. _amd-primus-megatron-lm-tokenizer-v26.01:
|
||||||
|
|
||||||
Tokenizer
|
Tokenizer
|
||||||
---------
|
---------
|
||||||
@@ -220,15 +220,7 @@ right permissions to access the tokenizer for each model.
|
|||||||
# Export your HF_TOKEN in the workspace
|
# Export your HF_TOKEN in the workspace
|
||||||
export HF_TOKEN=<your_hftoken>
|
export HF_TOKEN=<your_hftoken>
|
||||||
|
|
||||||
.. note::
|
.. _amd-primus-megatron-lm-run-training-v26.01:
|
||||||
|
|
||||||
In Primus, each model uses a tokenizer from Hugging Face. For example, Llama
|
|
||||||
3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and
|
|
||||||
``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model
|
|
||||||
<https://github.com/AMD-AGI/Primus/blob/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs/llama3.1_8B-pretrain.yaml>`__
|
|
||||||
definition.
|
|
||||||
|
|
||||||
.. _amd-primus-megatron-lm-run-training-v25.11:
|
|
||||||
|
|
||||||
Run training
|
Run training
|
||||||
============
|
============
|
||||||
@@ -252,7 +244,7 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 3.3 70B.
|
The following run commands are tailored to Llama 3.3 70B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run pre-training for Llama 3.3 70B BF16, run:
|
To run pre-training for Llama 3.3 70B BF16, run:
|
||||||
|
|
||||||
@@ -263,8 +255,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.3_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama3.3_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -276,14 +270,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.3_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama3.3_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
|
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 3.1 8B.
|
The following run commands are tailored to Llama 3.1 8B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run pre-training for Llama 3.1 8B FP8, run:
|
To run pre-training for Llama 3.1 8B FP8, run:
|
||||||
|
|
||||||
@@ -294,8 +290,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_8B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -307,8 +305,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_8B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml
|
||||||
|
|
||||||
For Llama 3.1 8B BF16, use the following command:
|
For Llama 3.1 8B BF16, use the following command:
|
||||||
|
|
||||||
@@ -319,8 +319,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama3.1_BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_8B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -332,14 +334,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_8B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
|
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 3.1 70B.
|
The following run commands are tailored to Llama 3.1 70B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run pre-training for Llama 3.1 70B BF16, run:
|
To run pre-training for Llama 3.1 70B BF16, run:
|
||||||
|
|
||||||
@@ -350,8 +354,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -363,8 +369,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
To run the training on a single node for Llama 3.1 70B FP8, use the following command.
|
To run the training on a single node for Llama 3.1 70B FP8, use the following command.
|
||||||
|
|
||||||
@@ -381,8 +389,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama3.1_70B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -394,18 +404,20 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh \
|
--log_file /tmp/primus_llama3.1_70B_fp8_proxy.log \
|
||||||
--train_iters 50 \
|
-- train pretrain \
|
||||||
--num_layers 40 \
|
--config examples/megatron/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
|
||||||
--fp8 hybrid \
|
--train_iters 50 \
|
||||||
--no_fp8_weight_transpose_cache true
|
--num_layers 40 \
|
||||||
|
--fp8 hybrid \
|
||||||
|
--no_fp8_weight_transpose_cache true
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
|
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 2 7B.
|
The following run commands are tailored to Llama 2 7B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run pre-training for Llama 2 7B FP8, run:
|
To run pre-training for Llama 2 7B FP8, run:
|
||||||
|
|
||||||
@@ -416,8 +428,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_7B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama2_7B-FP8-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -429,8 +443,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_7B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama2_7B-FP8-pretrain.yaml
|
||||||
|
|
||||||
To run pre-training for Llama 2 7B BF16, run:
|
To run pre-training for Llama 2 7B BF16, run:
|
||||||
|
|
||||||
@@ -441,8 +457,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_7B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama2_7B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -454,14 +472,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_7B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama2_7B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
|
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 2 70B.
|
The following run commands are tailored to Llama 2 70B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run pre-training for Llama 2 70B BF16, run:
|
To run pre-training for Llama 2 70B BF16, run:
|
||||||
|
|
||||||
@@ -472,8 +492,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/llama2_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -485,14 +507,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash ./examples/run_pretrain.sh
|
--log_file /tmp/primus_llama2_70B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/llama2_70B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy
|
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to DeepSeek-V3.
|
The following run commands are tailored to DeepSeek-V3.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy,
|
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) BF16 with 3-layer proxy,
|
||||||
use the following command:
|
use the following command:
|
||||||
@@ -504,13 +528,15 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh \
|
--log_file /tmp/primus_deepseek_v3_proxy.log \
|
||||||
--num_layers 3 \
|
-- train pretrain \
|
||||||
--moe_layer_freq 1 \
|
--config examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml \
|
||||||
--train_iters 50 \
|
--num_layers 3 \
|
||||||
--micro_batch_size 8 \
|
--moe_layer_freq 1 \
|
||||||
--global_batch_size 64
|
--train_iters 50 \
|
||||||
|
--micro_batch_size 8 \
|
||||||
|
--global_batch_size 64
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -522,17 +548,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh \
|
--log_file /tmp/primus_deepseek_v3_proxy.log \
|
||||||
--num_layers 3 \
|
-- train pretrain \
|
||||||
--moe_layer_freq 1 \
|
--config examples/megatron/configs/MI300X/deepseek_v3-BF16-pretrain.yaml \
|
||||||
--train_iters 50
|
--num_layers 3 \
|
||||||
|
--moe_layer_freq 1 \
|
||||||
|
--micro_batch_size 3 \
|
||||||
|
--global_batch_size 192 \
|
||||||
|
--train_iters 50
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
|
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to DeepSeek-V2-Lite.
|
The following run commands are tailored to DeepSeek-V2-Lite.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
|
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel) BF16,
|
||||||
use the following command:
|
use the following command:
|
||||||
@@ -544,8 +574,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_deepseek_v2_lite.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -557,14 +589,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_deepseek_v2_lite.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
|
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Mixtral 8x7B.
|
The following run commands are tailored to Mixtral 8x7B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
|
To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
|
||||||
use the following command:
|
use the following command:
|
||||||
@@ -576,8 +610,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_mixtral_8x7B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -589,15 +625,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh \
|
--log_file /tmp/primus_mixtral_8x7B.log \
|
||||||
--train_iters 50
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
|
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Mixtral 8x22B.
|
The following run commands are tailored to Mixtral 8x22B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy,
|
To run training on a single node for Mixtral 8x22B BF16 (MoE with expert parallel) 4-layer proxy,
|
||||||
use the following command:
|
use the following command:
|
||||||
@@ -609,8 +646,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_mixtral_8x22B_proxy.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/mixtral_8x22B_v0.1-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -622,19 +661,21 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh \
|
--log_file /tmp/primus_mixtral_8x22B_proxy.log \
|
||||||
--train_iters 50 \
|
-- train pretrain \
|
||||||
--num_layers 4 \
|
--config examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-BF16-pretrain.yaml \
|
||||||
--pipeline_model_parallel_size 1 \
|
--num_layers 4 \
|
||||||
--micro_batch_size 1 \
|
--pipeline_model_parallel_size 1 \
|
||||||
--global_batch_size 16
|
--micro_batch_size 1 \
|
||||||
|
--global_batch_size 16 \
|
||||||
|
--train_iters 50
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
|
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Qwen 2.5 7B.
|
The following run commands are tailored to Qwen 2.5 7B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run training on a single node for Qwen 2.5 7B BF16, use the following
|
To run training on a single node for Qwen 2.5 7B BF16, use the following
|
||||||
command:
|
command:
|
||||||
@@ -646,8 +687,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_qwen2.5_7B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/qwen2.5_7B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -659,8 +702,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_qwen2.5_7B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/qwen2.5_7B-BF16-pretrain.yaml
|
||||||
|
|
||||||
For FP8, use the following command.
|
For FP8, use the following command.
|
||||||
|
|
||||||
@@ -671,8 +716,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_qwen2.5_7B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI355X/qwen2.5_7B-FP8-pretrain.yaml
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -684,14 +731,16 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_qwen2.5_7B_fp8.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/qwen2.5_7B-FP8-pretrain.yaml
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
|
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Qwen 2.5 72B.
|
The following run commands are tailored to Qwen 2.5 72B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To run the training on a single node for Qwen 2.5 72B BF16, use the following command.
|
To run the training on a single node for Qwen 2.5 72B BF16, use the following command.
|
||||||
|
|
||||||
@@ -702,11 +751,10 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: shell
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI355X/qwen2.5_72B-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh \
|
--log_file /tmp/primus_qwen2.5_72B.log \
|
||||||
--train_iters 50 \
|
-- train pretrain \
|
||||||
--micro_batch_size 16 \
|
--config examples/megatron/configs/MI355X/qwen2.5_72B-BF16-pretrain.yaml
|
||||||
--global_batch_size 256
|
|
||||||
|
|
||||||
.. tab-item:: MI300X
|
.. tab-item:: MI300X
|
||||||
:sync: MI325X and MI300X
|
:sync: MI325X and MI300X
|
||||||
@@ -718,10 +766,12 @@ To run training on a single node, navigate to ``/workspace/Primus`` and use the
|
|||||||
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
export PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1
|
||||||
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
export NVTE_CK_IS_V3_ATOMIC_FP32=1
|
||||||
|
|
||||||
EXP=examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml \
|
bash runner/primus-cli direct \
|
||||||
bash examples/run_pretrain.sh
|
--log_file /tmp/primus_qwen2.5_72B.log \
|
||||||
|
-- train pretrain \
|
||||||
|
--config examples/megatron/configs/MI300X/qwen2.5_72B-BF16-pretrain.yaml
|
||||||
|
|
||||||
.. _amd-primus-megatron-multi-node-examples-v25.11:
|
.. _amd-primus-megatron-multi-node-examples-v26.01:
|
||||||
|
|
||||||
Multi-node training examples
|
Multi-node training examples
|
||||||
----------------------------
|
----------------------------
|
||||||
@@ -730,7 +780,7 @@ Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure y
|
|||||||
training.
|
training.
|
||||||
|
|
||||||
To run training on multiple nodes, you can use the
|
To run training on multiple nodes, you can use the
|
||||||
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/main/examples/run_slurm_pretrain.sh>`__
|
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/9c529cd4a934a68a880ede036c3e97b792e38167/examples/run_slurm_pretrain.sh>`__
|
||||||
to launch the multi-node workload. Use the following steps to set up your environment:
|
to launch the multi-node workload. Use the following steps to set up your environment:
|
||||||
|
|
||||||
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-megatron-benchmark-models.yaml
|
||||||
@@ -763,13 +813,13 @@ to launch the multi-node workload. Use the following steps to setup your environ
|
|||||||
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect. However, since NICs can vary across different clusters, it is encouraged to explicitly export your NCCL parameters for the cluster.
|
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus will try to auto-detect. However, since NICs can vary across different clusters, it is encouraged to explicitly export your NCCL parameters for the cluster.
|
||||||
* To find your network interface, you can use ``ip a``.
|
* To find your network interface, you can use ``ip a``.
|
||||||
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices.
|
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices.
|
||||||
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v25.11`) as appropriate.
|
* Remember to set ``DOCKER_IMAGE`` and ``HF_TOKEN`` (see :ref:`amd-primus-megatron-lm-tokenizer-v26.01`) as appropriate.
|
||||||
|
|
||||||
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
|
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
|
||||||
|
|
||||||
Once setup is complete, run the appropriate training command.
|
Once setup is complete, run the appropriate training command.
|
||||||
The following run commands are tailored to Llama 3.1 8B.
|
The following run commands are tailored to Llama 3.1 8B.
|
||||||
See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
|
See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.
|
||||||
|
|
||||||
To train Llama 3.1 8B FP8 on 8 nodes, run:
|
To train Llama 3.1 8B FP8 on 8 nodes, run:
|
||||||
|
|
||||||
@@ -786,7 +836,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Llama 2 7B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Llama 2 7B FP8 on 8 nodes, run:

@@ -803,7 +853,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Llama 3.1 70B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Llama 3.1 70B FP8 on 8 nodes, run:

@@ -833,7 +883,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Llama 2 70B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Llama 2 70B FP8 on 8 nodes, run:

@@ -863,7 +913,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Llama 3.3 70B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Llama 3.3 70B FP8 on 8 nodes, run:

@@ -893,7 +943,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Mixtral 8x7B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Mixtral 8x7B BF16 on 8 nodes, run:

@@ -911,7 +961,7 @@ to launch the multi-node workload. Use the following steps to setup your environ

 Once setup is complete, run the appropriate training command.
 The following run commands are tailored to Qwen2.5 72B.
-See :ref:`amd-primus-megatron-lm-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-megatron-lm-model-support-v26.01` to switch to another available model.

 To train Qwen2.5 72B FP8 on 8 nodes, run:

@@ -926,7 +976,7 @@ to launch the multi-node workload. Use the following steps to setup your environ
 --global_batch_size 512 \
 --recompute_num_layers 80 \

-.. _amd-primus-megatron-lm-benchmark-test-vars-v25.11:
+.. _amd-primus-megatron-lm-benchmark-test-vars-v26.01:

 Key options
 -----------
@@ -45,7 +45,7 @@ with Primus Turbo optimizations.
 - {{ component_version }}
 {% endfor %}

-.. _amd-primus-pytorch-model-support-v25.11:
+.. _amd-primus-pytorch-model-support-v26.01:

 Supported models
 ================
@@ -91,7 +91,7 @@ vary by model -- select one to get started.
 For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
 see the :doc:`pytorch-training` documentation (without Primus).

-.. _amd-primus-pytorch-performance-measurements-v25.11:
+.. _amd-primus-pytorch-performance-measurements-v26.01:

 System validation
 =================
@@ -146,7 +146,7 @@ tweak some configurations (such as batch sizes).
 .. container:: model-doc {{ model.mad_tag }}

 The following run command is tailored to {{ model.model }}.
-See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.

 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
 directory and install the required packages on the host machine.
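As a sketch of this step, cloning MAD and installing its requirements typically looks like the following; the ``requirements.txt`` filename is an assumption about the repository layout.

.. code-block:: shell

   git clone https://github.com/ROCm/MAD
   cd MAD
   pip install -r requirements.txt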
@@ -184,7 +184,7 @@ tweak some configurations (such as batch sizes).
 .. container:: model-doc {{ model.mad_tag }}

 The following run commands are tailored to {{ model.model }}.
-See :ref:`amd-primus-pytorch-model-support-v25.11` to switch to another available model.
+See :ref:`amd-primus-pytorch-model-support-v26.01` to switch to another available model.

 .. rubric:: Download the Docker image and required packages

@@ -220,8 +220,8 @@ tweak some configurations (such as batch sizes).
 docker start training_env
 docker exec -it training_env bash

-The Docker container hosts verified commit ``c4c083de`` of the `Primus
-<https://github.com/AMD-AGI/Primus/tree/c4c083de64ba3e8f19ccc9629411267108931f9e/>`__ repository.
+The Docker container hosts verified commit ``9c529cd4`` of the `Primus
+<https://github.com/AMD-AGI/Primus/tree/9c529cd4a934a68a880ede036c3e97b792e38167/>`__ repository.
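To confirm which Primus commit a given container actually provides, one option is to inspect the checkout from inside the running container; the ``/workspace/Primus`` path below is an assumption and may differ in your image.

.. code-block:: shell

   # Path is assumed -- adjust to where Primus is checked out in the image.
   docker exec training_env git -C /workspace/Primus log -1 --format=%H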

 .. rubric:: Prepare training datasets and dependencies

@@ -257,24 +257,31 @@ tweak some configurations (such as batch sizes).

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml

 .. tab-item:: MI325X
 :sync: MI325X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh --training.local_batch_size 6
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
+--training.local_batch_size 6

 .. tab-item:: MI300X
 :sync: MI300X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_8B-BF16-pretrain.yaml

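Across these tabs only the config path, log file name, and optional overrides change; the launcher invocation itself follows one pattern. The placeholders below are illustrative.

.. code-block:: shell

   bash runner/primus-cli direct \
   --log_file /tmp/primus_<model>.log \
   -- train pretrain \
   --config <path/to/pretrain-config.yaml> \
   --training.local_batch_size <N>    # optional override used on some GPUs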
 To train Llama 3.1 8B with FP8 precision, use the following command.

@@ -285,24 +292,31 @@ tweak some configurations (such as batch sizes).

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml

 .. tab-item:: MI325X
 :sync: MI325X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh --training.local_batch_size 7
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
+--training.local_batch_size 7

 .. tab-item:: MI300X
 :sync: MI300X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_8B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_8B-FP8-pretrain.yaml

 .. container:: model-doc primus_pyt_train_llama-3.1-70b

@@ -315,24 +329,31 @@ tweak some configurations (such as batch sizes).

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI355X/llama3.1_70B-BF16-pretrain.yaml

 .. tab-item:: MI325X
 :sync: MI325X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh --training.local_batch_size 6
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
+--training.local_batch_size 6

 .. tab-item:: MI300X
 :sync: MI300X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_70B-BF16-pretrain.yaml

 To train Llama 3.1 70B with FP8 precision, use the following command.

@@ -343,24 +364,31 @@ tweak some configurations (such as batch sizes).

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI355X/llama3.1_70B-FP8-pretrain.yaml

 .. tab-item:: MI325X
 :sync: MI325X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh --training.local_batch_size 5
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
+--training.local_batch_size 5

 .. tab-item:: MI300X
 :sync: MI300X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_llama3.1_70B_fp8.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/llama3.1_70B-FP8-pretrain.yaml

 .. container:: model-doc primus_pyt_train_deepseek-v3-16b

@@ -373,24 +401,31 @@ tweak some configurations (such as batch sizes).

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_deepseek_v3_16b.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI355X/deepseek_v3_16b-pretrain.yaml

 .. tab-item:: MI325X
 :sync: MI325X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
-bash examples/run_pretrain.sh --training.local_batch_size 10
+bash runner/primus-cli direct \
+--log_file /tmp/primus_deepseek_v3_16b.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
+--training.local_batch_size 10

 .. tab-item:: MI300X
 :sync: MI300X

 .. code-block:: shell

-EXP=examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml \
-bash examples/run_pretrain.sh
+bash runner/primus-cli direct \
+--log_file /tmp/primus_deepseek_v3_16b.log \
+-- train pretrain \
+--config examples/torchtitan/configs/MI300X/deepseek_v3_16b-pretrain.yaml
 {% endfor %}
 {% endfor %}

@@ -43,7 +43,7 @@ training workloads:
 - {{ component_version }}
 {% endfor %}

-.. _amd-pytorch-training-model-support-v25.11:
+.. _amd-pytorch-training-model-support-v26.01:

 Supported models
 ================
@@ -85,7 +85,7 @@ one to get started.
 </div>
 </div>

-.. _amd-pytorch-training-supported-training-modes-v25.11:
+.. _amd-pytorch-training-supported-training-modes-v26.01:

 The following table lists supported training modes per model.

@@ -120,7 +120,7 @@ The following table lists supported training modes per model.
 unlisted fine-tuning methods by using an existing file in the
 ``/workspace/torchtune/recipes/configs`` directory as a template.

-.. _amd-pytorch-training-performance-measurements-v25.11:
+.. _amd-pytorch-training-performance-measurements-v26.01:

 Performance measurements
 ========================
@@ -176,7 +176,7 @@ Run training
 .. container:: model-doc {{ model.mad_tag }}

 The following run command is tailored to {{ model.model }}.
-See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
+See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.

 1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
 directory and install the required packages on the host machine.
@@ -214,7 +214,7 @@ Run training
 .. container:: model-doc {{ model.mad_tag }}

 The following commands are tailored to {{ model.model }}.
-See :ref:`amd-pytorch-training-model-support-v25.11` to switch to another available model.
+See :ref:`amd-pytorch-training-model-support-v26.01` to switch to another available model.

 {% endfor %}
 {% endfor %}
@@ -409,6 +409,10 @@ Run training

 {% if model.mad_tag == "pyt_train_dlrm" %}

+.. note::
+
+   DLRM is supported on MI300X, MI325X, MI350X, and MI355X GPUs.
+
 1. Go to the DLRM directory.

 .. code-block:: shell
@@ -532,7 +536,7 @@ Run training

 To start the fine-tuning benchmark, use the following command with the
 appropriate options. See the following list of options and their descriptions.
-See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v25.11>`.
+See :ref:`supported training modes <amd-pytorch-training-supported-training-modes-v26.01>`.

 .. code-block:: shell

@@ -597,7 +601,7 @@ Run training

 For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.

-.. _amd-pytorch-training-multinode-examples-v25.11:
+.. _amd-pytorch-training-multinode-examples-v26.01:

 Multi-node training
 -------------------
@@ -119,6 +119,10 @@ subtrees:
   title: PyTorch inference performance testing
 - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang.rst
   title: SGLang inference performance testing
+- file: how-to/rocm-for-ai/inference/benchmark-docker/vllm-mori-distributed.md
+  title: vLLM distributed inference with MoRI
+- file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-mori-distributed.md
+  title: SGLang distributed inference with MoRI
 - file: how-to/rocm-for-ai/inference/benchmark-docker/sglang-distributed.rst
   title: SGLang distributed inference with Mooncake
 - file: how-to/rocm-for-ai/inference/xdit-diffusion-inference.rst