Compare commits


8 Commits

Author SHA1 Message Date
Peter Park
cef7e945be remove "mi300x" from fp4 inf page (#5385) 2025-09-18 12:29:42 -04:00
Peter Park
bc59cb8bef [docs/7.0-beta] Add pointers to latest documentation 2025-09-18 12:16:54 -04:00
Peter Park
fc8f998f8a [docs/7.0-beta] Add SGLang benchmark doc (#5242) 2025-09-02 09:17:25 -04:00
Peter Park
eb334335d0 Add training/inference user guides for 7.0 beta preview Dockers (#5234)
Update
docs/preview/benchmark-docker/inference-vllm-llama-3.1-405b-fp4.rst

Co-authored-by: dyokelso <dewi.yokelson@amd.com>

Update
docs/preview/benchmark-docker/inference-vllm-llama-3.1-405b-fp4.rst

Co-authored-by: dyokelso <dewi.yokelson@amd.com>

Update
docs/preview/benchmark-docker/inference-sglang-deepseek-r1-fp4.rst

Co-authored-by: dyokelso <dewi.yokelson@amd.com>
2025-08-28 23:24:44 -04:00
Peter Park
e7b032c3dc [docs/7.0-beta] Update preview versions list (#5169) 2025-08-08 10:11:05 -04:00
Peter Park
00a96635ae update wording 2025-07-24 17:42:19 -04:00
Peter Park
f8ce4a9113 [docs/7.0-beta] 7.0 beta preview docs
fix mi210 virtu OSes

update wording

words

improve look

update heading

fix preview versions list url
2025-07-24 16:48:38 -04:00
Peter Park
b5598e581d Remove extra files 2025-07-22 14:35:50 -04:00
23 changed files with 1319 additions and 8354 deletions

View File

@@ -1,9 +1,22 @@
IPC
SPIR
VFS
builtins
crosslane
frontend
summarization
QKV
MLPerf
GovReport
oss
openai
gpt
SGLang
amd
MXFP
subproject
ROCpd
rocpd
STL
XCCs
chiplets
hipRTC
nvRTC
warpSize
Datacenter
GST
IET

File diff suppressed because it is too large.

View File

@@ -1,657 +0,0 @@
<!-- Do not edit this file! -->
<!-- This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->
<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
<!-- markdownlint-disable no-duplicate-header -->
<!-- markdownlint-disable no-blanks-blockquote -->
<!-- markdownlint-disable ul-indent -->
<!-- markdownlint-disable no-trailing-spaces -->
<!-- markdownlint-disable reference-links-images -->
<!-- markdownlint-disable no-missing-space-atx -->
<!-- spellcheck-disable -->
# ROCm 6.4.1 release notes
The release notes provide a summary of notable changes since the previous ROCm release.
- [Release highlights](#release-highlights)
- [Operating system and hardware support changes](#operating-system-and-hardware-support-changes)
- [ROCm components versioning](#rocm-components)
- [Detailed component changes](#detailed-component-changes)
- [ROCm known issues](#rocm-known-issues)
- [ROCm upcoming changes](#rocm-upcoming-changes)
```{note}
If you're using Radeon™ PRO or Radeon GPUs in a workstation setting with a display connected, see the [Use ROCm on Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility/native_linux/native_linux_compatibility.html)
documentation to verify compatibility and system requirements.
```
## Release highlights
The following are notable new features and improvements in ROCm 6.4.1. For changes to individual components, see
[Detailed component changes](#detailed-component-changes).
### Addition of DPX partition mode under NPS2 memory mode
AMD Instinct MI300X now supports DPX partition mode under NPS2 memory mode. For more partitioning information, see the [Deep dive into the MI300 compute and memory partition modes](https://rocm.blogs.amd.com/software-tools-optimization/compute-memory-modes/README.html) blog and [AMD Instinct MI300X system optimization](https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html#change-gpu-partition-modes).
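For illustration only, switching to these modes with the AMD SMI CLI might look like the following sketch. The exact `amd-smi set` flags are assumptions here; confirm them against the linked MI300X system optimization guide.
```shell
# Assumed amd-smi CLI flags -- verify against the MI300X system optimization documentation
sudo amd-smi set --gpu 0 --memory-partition NPS2
sudo amd-smi set --gpu 0 --compute-partition DPX
```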
### Introducing the ROCm Data Science toolkit
The ROCm Data Science toolkit (or ROCm-DS) is an open-source software collection for high-performance data science applications built on the core ROCm platform. You can leverage ROCm-DS to accelerate both new and existing data science workloads, allowing you to execute intensive applications with larger datasets at lightning speed. ROCm-DS is in an early access state. Running production workloads is not recommended. For more information, see [AMD ROCm-DS Documentation](https://rocm.docs.amd.com/projects/rocm-ds/en/latest/index.html).
### ROCm Offline Installer Creator updates
The ROCm Offline Installer Creator 6.4.1 now allows you to use the SPACEBAR or ENTER keys for menu item selection in the GUI. It also adds support for Debian 12 and fixes an issue for “full” mode RHEL offline installer creation, where GDM packages were uninstalled during offline installation. See [ROCm Offline Installer Creator](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-offline-installer.html) for more information.
### ROCm Runfile Installer updates
The ROCm Runfile Installer 6.4.1 adds the following improvements:
- Relaxed version checks for installation on different distributions. Provided the dependencies are not installed by the Runfile Installer, you can target an installation path on a partition or system running a different distribution than the host running the installer. For example, the installer can run on a system using Ubuntu 22.04 and install to a partition/system that is using Ubuntu 24.04.
- Performance improvements for detecting a previous ROCm install.
- Removal of the extra `opt` directory created for the target during the ROCm installation. For example, installing to `target=/home/amd` now installs ROCm to `/home/amd/rocm-6.4.1` and not `/home/amd/opt/rocm-6.4.1`. For installs using `target=/`, the installation will continue to use `/opt/`.
- The Runfile Installer can be used to uninstall any Runfile-based installation of the driver.
- In the CLI interface, the `postrocm` argument can now be run separately from the `rocm` argument. In cases where `postrocm` was missed from the initial ROCm install, `postrocm` can now be run on the same target folder. For example, if you installed ROCm 6.4.1 using `install.run target=/myrocm rocm`, you can run the post-installation separately using the command `install.run target=/myrocm/rocm-6.4.1 postrocm`.
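For example, a minimal sketch of that split flow (the target path is illustrative, and running the installer with `sudo` from the current directory is an assumption):
```shell
# Initial ROCm install to a custom target, without the post-install step
sudo ./install.run target=/myrocm rocm

# Run only the post-install step later, against the installed tree
sudo ./install.run target=/myrocm/rocm-6.4.1 postrocm
```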
For more information, see [ROCm Runfile Installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/rocm-runfile-installer.html).
### ROCm documentation updates
ROCm documentation continues to be updated to provide clearer and more comprehensive guidance for a wider variety of user needs and use cases.
* [Tutorials for AI developers](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/) have been expanded with five new tutorials. These tutorials are Jupyter notebook-based, easy-to-follow documents. They are ideal for AI developers who want to learn about specific topics, including inference, fine-tuning, and training. For more information about the changes, see [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html).
* The [Training a model with LLM Foundry](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/mpt-llm-foundry.html) performance testing guide has been added. This guide describes how to use the preconfigured [ROCm/pytorch-training](https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5) training environment and [https://github.com/ROCm/MAD](https://github.com/ROCm/MAD) to test the training performance of the LLM Foundry framework on AMD Instinct MI325X and MI300X accelerators using the [MPT-30B](https://huggingface.co/mosaicml/mpt-30b) model.
* The [Training a model with PyTorch](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/pytorch-training.html) performance testing guide has been updated to feature the latest [ROCm/pytorch-training](https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5) Docker image (a preconfigured training environment with ROCm and PyTorch). Support for [Llama 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) has been added.
* The [Training a model with JAX MaxText](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/training/benchmark-docker/jax-maxtext.html) performance testing guide has been updated to feature the latest [ROCm/jax-training](https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.5/images/sha256-4e0516358a227cae8f552fb866ec07e2edcf244756f02e7b40212abfbab5217b) Docker image (a preconfigured training environment with ROCm, JAX, and [MaxText](https://github.com/AI-Hypercomputer/maxtext)). Support for [Llama 3.3 70B](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) has been added.
* The [vLLM inference performance testing](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html?model=pyt_vllm_qwq-32b) guide has been updated to feature the latest [ROCm/vLLM](https://hub.docker.com/layers/rocm/vllm/latest/images/sha256-5c8b4436dd0464119d9df2b44c745fadf81512f18ffb2f4b5dc235c71ebe26b4) Docker image (a preconfigured environment for inference with ROCm and [vLLM](https://docs.vllm.ai/en/latest/)). Support for the [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) model has been added.
* The [PyTorch inference performance testing](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/pytorch-inference-benchmark.html?model=pyt_clip_inference) guide has been added, featuring the [ROCm/PyTorch](https://hub.docker.com/layers/rocm/pytorch/latest/images/sha256-ab1d350b818b90123cfda31363019d11c0d41a8f12a19e3cb2cb40cf0261137d) Docker image (a preconfigured inference environment with ROCm and PyTorch) with initial support for the [CLIP](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K) and [Chai-1](https://huggingface.co/chaidiscovery/chai-1) models.
## Operating system and hardware support changes
ROCm 6.4.1 introduces support for the RDNA4 architecture-based [Radeon AI PRO
R9700](https://www.amd.com/en/products/graphics/workstations/radeon-ai-pro/ai-9000-series/amd-radeon-ai-pro-r9700.html),
[Radeon RX 9070](https://www.amd.com/en/products/graphics/desktops/radeon/9000-series/amd-radeon-rx-9070.html),
[Radeon RX 9070 XT](https://www.amd.com/en/products/graphics/desktops/radeon/9000-series/amd-radeon-rx-9070xt.html),
Radeon RX 9070 GRE, and
[Radeon RX 9060 XT](https://www.amd.com/en/products/graphics/desktops/radeon/9000-series/amd-radeon-rx-9060xt.html) GPUs
for compute workloads. It also adds support for RDNA3 architecture-based [Radeon PRO W7700](https://www.amd.com/en/products/graphics/workstations/radeon-pro/w7700.html) and [Radeon RX 7800 XT](https://www.amd.com/en/products/graphics/desktops/radeon/7000-series/amd-radeon-rx-7800-xt.html) GPUs. These GPUs are supported on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.5, and RHEL 9.4.
For details, see the full list of [Supported GPUs
(Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus).
See the [Compatibility
matrix](../../docs/compatibility/compatibility-matrix.rst)
for more information about operating system and hardware compatibility.
## ROCm components
The following table lists the versions of ROCm components for ROCm 6.4.1, including any version
changes from 6.4.0 to 6.4.1. Click the component's updated version to go to a list of its changes.
Click {fab}`github` to go to the component's source code on GitHub.
<div class="pst-scrollable-table-container">
<table id="rocm-rn-components" class="table">
<thead>
<tr>
<th>Category</th>
<th>Group</th>
<th>Name</th>
<th>Version</th>
<th></th>
</tr>
</thead>
<colgroup>
<col span="1">
<col span="1">
</colgroup>
<tbody class="rocm-components-libs rocm-components-ml">
<tr>
<th rowspan="9">Libraries</th>
<th rowspan="9">Machine learning and computer vision</th>
<td><a href="https://rocm.docs.amd.com/projects/composable_kernel/en/docs-6.4.1/index.html">Composable Kernel</a></td>
<td>1.1.0</td>
<td><a href="https://github.com/ROCm/composable_kernel"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/AMDMIGraphX/en/docs-6.4.1/index.html">MIGraphX</a></td>
<td>2.12.0</td>
<td><a href="https://github.com/ROCm/AMDMIGraphX"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/MIOpen/en/docs-6.4.1/index.html">MIOpen</a></td>
<td>3.4.0</td>
<td><a href="https://github.com/ROCm/MIOpen"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/MIVisionX/en/docs-6.4.1/index.html">MIVisionX</a></td>
<td>3.2.0</td>
<td><a href="https://github.com/ROCm/MIVisionX"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocAL/en/docs-6.4.1/index.html">rocAL</a></td>
<td>2.2.0</td>
<td><a href="https://github.com/ROCm/rocAL"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocDecode/en/docs-6.4.1/index.html">rocDecode</a></td>
<td>0.10.0</td>
<td><a href="https://github.com/ROCm/rocDecode"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocJPEG/en/docs-6.4.1/index.html">rocJPEG</a></td>
<td>0.8.0</td>
<td><a href="https://github.com/ROCm/rocJPEG"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocPyDecode/en/docs-6.4.1/index.html">rocPyDecode</a></td>
<td>0.3.1</td>
<td><a href="https://github.com/ROCm/rocPyDecode"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rpp/en/docs-6.4.1/index.html">RPP</a></td>
<td>1.9.10</td>
<td><a href="https://github.com/ROCm/rpp"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-communication tbody-reverse-zebra">
<tr>
<th rowspan="2"></th>
<th rowspan="2">Communication</th>
<td><a href="https://rocm.docs.amd.com/projects/rccl/en/docs-6.4.1/index.html">RCCL</a></td>
<td>2.22.3&nbsp;&Rightarrow;&nbsp;<a href="#rccl-2-22-3">2.22.3</a></td>
<td><a href="https://github.com/ROCm/rccl"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocSHMEM/en/docs-6.4.1/index.html">rocSHMEM</a></td>
<td>2.0.0</td>
<td><a href="https://github.com/ROCm/rocSHMEM"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-math tbody-reverse-zebra">
<tr>
<th rowspan="16"></th>
<th rowspan="16">Math</th>
<td><a href="https://rocm.docs.amd.com/projects/hipBLAS/en/docs-6.4.1/index.html">hipBLAS</a></td>
<td>2.4.0</td>
<td><a href="https://github.com/ROCm/hipBLAS"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.4.1/index.html">hipBLASLt</a></td>
<td>0.12.0&nbsp;&Rightarrow;&nbsp;<a href="#hipblaslt-0-12-1">0.12.1</a></td>
<td><a href="https://github.com/ROCm/hipBLASLt"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipFFT/en/docs-6.4.1/index.html">hipFFT</a></td>
<td>1.0.18</td>
<td><a href="https://github.com/ROCm/hipFFT"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipfort/en/docs-6.4.1/index.html">hipfort</a></td>
<td>0.6.0</td>
<td><a href="https://github.com/ROCm/hipfort"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipRAND/en/docs-6.4.1/index.html">hipRAND</a></td>
<td>2.12.0</td>
<td><a href="https://github.com/ROCm/hipRAND"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSOLVER/en/docs-6.4.1/index.html">hipSOLVER</a></td>
<td>2.4.0</td>
<td><a href="https://github.com/ROCm/hipSOLVER"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSE/en/docs-6.4.1/index.html">hipSPARSE</a></td>
<td>3.2.0</td>
<td><a href="https://github.com/ROCm/hipSPARSE"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipSPARSELt/en/docs-6.4.1/index.html">hipSPARSELt</a></td>
<td>0.2.3</td>
<td><a href="https://github.com/ROCm/hipSPARSELt"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocALUTION/en/docs-6.4.1/index.html">rocALUTION</a></td>
<td>3.2.2&nbsp;&Rightarrow;&nbsp;<a href="#rocalution-3-2-3">3.2.3</a></td>
<td><a href="https://github.com/ROCm/rocALUTION"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocBLAS/en/docs-6.4.1/index.html">rocBLAS</a></td>
<td>4.4.0</td>
<td><a href="https://github.com/ROCm/rocBLAS"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocFFT/en/docs-6.4.1/index.html">rocFFT</a></td>
<td>1.0.32</td>
<td><a href="https://github.com/ROCm/rocFFT"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocRAND/en/docs-6.4.1/index.html">rocRAND</a></td>
<td>3.3.0</td>
<td><a href="https://github.com/ROCm/rocRAND"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocSOLVER/en/docs-6.4.1/index.html">rocSOLVER</a></td>
<td>3.28.0</td>
<td><a href="https://github.com/ROCm/rocSOLVER"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocSPARSE/en/docs-6.4.1/index.html">rocSPARSE</a></td>
<td>3.4.0</td>
<td><a href="https://github.com/ROCm/rocSPARSE"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocWMMA/en/docs-6.4.1/index.html">rocWMMA</a></td>
<td>1.7.0</td>
<td><a href="https://github.com/ROCm/rocWMMA"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/Tensile/en/docs-6.4.1/src/index.html">Tensile</a></td>
<td>4.43.0</td>
<td><a href="https://github.com/ROCm/Tensile"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-libs rocm-components-primitives tbody-reverse-zebra">
<tr>
<th rowspan="4"></th>
<th rowspan="4">Primitives</th>
<td><a href="https://rocm.docs.amd.com/projects/hipCUB/en/docs-6.4.1/index.html">hipCUB</a></td>
<td>3.4.0</td>
<td><a href="https://github.com/ROCm/hipCUB"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/hipTensor/en/docs-6.4.1/index.html">hipTensor</a></td>
<td>1.5.0</td>
<td><a href="https://github.com/ROCm/hipTensor"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocPRIM/en/docs-6.4.1/index.html">rocPRIM</a></td>
<td>3.4.0</td>
<td><a href="https://github.com/ROCm/rocPRIM"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocThrust/en/docs-6.4.1/index.html">rocThrust</a></td>
<td>3.3.0</td>
<td><a href="https://github.com/ROCm/rocThrust"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-system tbody-reverse-zebra">
<tr>
<th rowspan="7">Tools</th>
<th rowspan="7">System management</th>
<td><a href="https://rocm.docs.amd.com/projects/amdsmi/en/docs-6.4.1/index.html">AMD SMI</a></td>
<td>25.3.0&nbsp;&Rightarrow;&nbsp;<a href="#amd-smi-25-4-2">25.4.2</a></td>
<td><a href="https://github.com/ROCm/amdsmi"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rdc/en/docs-6.4.1/index.html">ROCm Data Center Tool</a></td>
<td>0.3.0&nbsp;&Rightarrow;&nbsp;<a href="#rocm-data-center-tool-0-3-0">0.3.0</a></td>
<td><a href="https://github.com/ROCm/rdc"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocminfo/en/docs-6.4.1/index.html">rocminfo</a></td>
<td>1.0.0</td>
<td><a href="https://github.com/ROCm/rocminfo"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocm_smi_lib/en/docs-6.4.1/index.html">ROCm SMI</a></td>
<td>7.5.0&nbsp;&Rightarrow;&nbsp;<a href="#rocm-smi-7-5-0">7.5.0</a></td>
<td><a href="https://github.com/ROCm/rocm_smi_lib"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCmValidationSuite/en/docs-6.4.1/index.html">ROCmValidationSuite</a></td>
<td>1.1.0</td>
<td><a href="https://github.com/ROCm/ROCmValidationSuite"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-perf">
<tr>
<th rowspan="6"></th>
<th rowspan="6">Performance</th>
<td><a href="https://rocm.docs.amd.com/projects/rocm_bandwidth_test/en/docs-6.4.1/index.html">ROCm Bandwidth
Test</a></td>
<td>1.4.0</td>
<td><a href="https://github.com/ROCm/rocm_bandwidth_test/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler-compute/en/docs-6.4.1/index.html">ROCm Compute Profiler</a></td>
<td>3.1.0</td>
<td><a href="https://github.com/ROCm/rocprofiler-compute"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler-systems/en/docs-6.4.1/index.html">ROCm Systems Profiler</a></td>
<td>1.0.0&nbsp;&Rightarrow;&nbsp;<a href="#rocm-systems-profiler-1-0-1">1.0.1</a></td>
<td><a href="https://github.com/ROCm/rocprofiler-systems"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler/en/docs-6.4.1/index.html">ROCProfiler</a></td>
<td>2.0.0</td>
<td><a href="https://github.com/ROCm/ROCProfiler/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/docs-6.4.1/index.html">ROCprofiler-SDK</a></td>
<td>0.6.0</td>
<td><a href="https://github.com/ROCm/rocprofiler-sdk/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr >
<td><a href="https://rocm.docs.amd.com/projects/roctracer/en/docs-6.4.1/index.html">ROCTracer</a></td>
<td>4.1.0</td>
<td><a href="https://github.com/ROCm/ROCTracer/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-tools rocm-components-dev">
<tr>
<th rowspan="5"></th>
<th rowspan="5">Development</th>
<td><a href="https://rocm.docs.amd.com/projects/HIPIFY/en/docs-6.4.1/index.html">HIPIFY</a></td>
<td>19.0.0</td>
<td><a href="https://github.com/ROCm/HIPIFY/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCdbgapi/en/docs-6.4.1/index.html">ROCdbgapi</a></td>
<td>0.77.2</td>
<td><a href="https://github.com/ROCm/ROCdbgapi/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCmCMakeBuildTools/en/docs-6.4.1/index.html">ROCm CMake</a></td>
<td>0.14.0</td>
<td><a href="https://github.com/ROCm/rocm-cmake/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCgdb/en/docs-6.4.1/index.html">ROCm Debugger (ROCgdb)</a>
</td>
<td>15.2</td>
<td><a href="https://github.com/ROCm/ROCgdb/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/rocr_debug_agent/en/docs-6.4.1/index.html">ROCr Debug Agent</a>
</td>
<td>2.0.4</td>
<td><a href="https://github.com/ROCm/rocr_debug_agent/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-compilers tbody-reverse-zebra">
<tr>
<th rowspan="2" colspan="2">Compilers</th>
<td><a href="https://rocm.docs.amd.com/projects/HIPCC/en/docs-6.4.1/index.html">HIPCC</a></td>
<td>1.1.1</td>
<td><a href="https://github.com/ROCm/llvm-project/tree/amd-staging/amd/hipcc"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.1/index.html">llvm-project</a></td>
<td>19.0.0</td>
<td><a href="https://github.com/ROCm/llvm-project/"><i
class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
<tbody class="rocm-components-runtimes tbody-reverse-zebra">
<tr>
<th rowspan="2" colspan="2">Runtimes</th>
<td><a href="https://rocm.docs.amd.com/projects/HIP/en/docs-6.4.1/index.html">HIP</a></td>
<td>6.4.0&nbsp;&Rightarrow;&nbsp;<a href="#hip-6-4-1">6.4.1</a></td>
<td><a href="https://github.com/ROCm/HIP/"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
<tr>
<td><a href="https://rocm.docs.amd.com/projects/ROCR-Runtime/en/docs-6.4.1/index.html">ROCr Runtime</a></td>
<td>1.15.0&nbsp;&Rightarrow;&nbsp;<a href="#rocr-runtime-1-15-0">1.15.0</a></td>
<td><a href="https://github.com/ROCm/ROCR-Runtime/"><i class="fab fa-github fa-lg"></i></a></td>
</tr>
</tbody>
</table>
</div>
## Detailed component changes
The following sections describe key changes to ROCm components.
```{note}
For a historical overview of ROCm component updates, see the {doc}`ROCm consolidated changelog </release/changelog>`.
```
### **AMD SMI** (25.4.2)
#### Added
* Dumping of CPER entries from the RAS tool via `amdsmi_get_gpu_cper_entries()` in the Python and C APIs.
- Dumped CPER entries consist of `amdsmi_cper_hdr_t` structures.
- CPER entry dumping is also available in the CLI through `sudo amd-smi ras --cper`.
* `amdsmi_get_gpu_busy_percent` to the C API.
#### Changed
* Modified the VRAM display for `amd-smi monitor -v`.
#### Optimized
* Improved load times for CLI commands when the GPU has multiple partitions.
#### Resolved issues
* Fixed partition enumeration in `amd-smi list -e`, `amdsmi_get_gpu_enumeration_info()`, `amdsmi_enumeration_info_t`, `drm_card`, and `drm_render` fields.
#### Known issues
* When using the `--follow` flag with `amd-smi ras --cper`, CPER entries are not streamed continuously as intended. This will be fixed in an upcoming ROCm release.
```{note}
See the full [AMD SMI changelog](https://github.com/ROCm/amdsmi/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions.
```
### **HIP** (6.4.1)
#### Added
* New log mask enumeration `LOG_COMGR` enables logging precise code object information.
#### Changed
* HIP runtime uses device bitcode before SPIRV.
* The implementation that prevents `hipLaunchKernel` latency degradation as the number of idle streams grows is now reverted and disabled by default.
#### Optimized
* Improved kernel logging, including demangling of shader names.
* Refined the implementation of the HIP APIs `hipEventRecord` and `hipStreamWaitEvent` to improve performance.
#### Resolved issues
* Fixed stale state during graph capture. The returned error was corrected, and the HIP runtime now always uses the latest dependent nodes during `hipEventRecord` capture.
* Fixed a segmentation fault during kernel execution. The HIP runtime now allows the maximum stack size permitted by the ISA on the GPU device.
### **hipBLASLt** (0.12.1)
#### Resolved issues
* Fixed an accuracy issue for some solutions using an `FP32` or `TF32` data type with a TT transpose.
### **RCCL** (2.22.3)
#### Changed
* MSCCL++ is now disabled by default. To enable it, set `RCCL_MSCCLPP_ENABLE=1`.
#### Resolved issues
* Fixed an issue where early termination, in rare circumstances, could cause the application to stop responding by adding synchronization before destroying a proxy thread.
* Fixed the accuracy issue for the MSCCLPP `allreduce7` kernel in graph mode.
#### Known issues
* When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault. The recommended workaround is to disable MSCCL with `export RCCL_MSCCL_ENABLE=0`.
This issue will be fixed in a future ROCm release.
* Within the RCCL-UnitTests test suite, failures occur in tests ending with the
`.ManagedMem` and `.ManagedMemGraph` suffixes. These failures only affect the
test results and do not affect the RCCL component itself. This issue will be
resolved in a future ROCm release.
### **rocALUTION** (3.2.3)
#### Added
* The `-a` option has been added to the `rmake.py` build script. This option allows you to select specific architectures when building on Microsoft Windows.
#### Resolved issues
* Fixed an issue where the `HIP_PATH` environment variable was being ignored when compiling on Microsoft Windows.
### **ROCm Data Center Tool** (0.3.0)
#### Added
- Support for GPU partitions.
- `RDC_FI_GPU_BUSY_PERCENT` metric.
#### Changed
- Updated `rdc_field` to align with `rdc_bootstrap` for current metrics.
#### Resolved issues
- Fixed [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/docs-6.4.0/index.html) eval metrics and memory leaks.
### **ROCm SMI** (7.5.0)
#### Resolved issues
- Fixed partition enumeration. It now refers to the correct DRM Render and Card paths.
```{note}
See the full [ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/release/rocm-rel-6.4/CHANGELOG.md) for details, examples, and in-depth descriptions.
```
### **ROCm Systems Profiler** (1.0.1)
#### Added
* How-to document for [network performance profiling](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/how-to/nic-profiling.html) for standard Network Interface Cards (NICs).
#### Resolved issues
* Fixed a build issue with Dyninst on GCC 13.
### **ROCr Runtime** (1.15.0)
#### Resolved issues
* Fixed a rare issue on AMD Instinct MI25, MI50, and MI100 GPUs where SDMA copies might start before the dependent kernel finished, which could cause memory corruption.
## ROCm known issues
ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
issues related to individual components, review the [Detailed component changes](#detailed-component-changes).
### Radeon AI PRO R9700 hangs when running Stable Diffusion 2.1 at batch sizes above four
Radeon AI PRO R9700 GPUs might hang when running [Stable Diffusion
2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1) with batch sizes
greater than four. As a workaround, limit batch sizes to four or fewer. This issue
will be addressed in a future ROCm release. See [issue #4770](https://github.com/ROCm/ROCm/issues/4770) on GitHub.
### RCCL MSCCL initialization failure
When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault. The recommended workaround is to disable MSCCL with `export RCCL_MSCCL_ENABLE=0`.
This issue will be fixed in a future ROCm release. See [issue #4769](https://github.com/ROCm/ROCm/issues/4769) on GitHub.
### AMD SMI CLI: CPER entries not dumped continuously when using follow flag
When using the `--follow` flag with `amd-smi ras --cper`, CPER entries are not streamed continuously as intended. This will be fixed in an upcoming ROCm release.
See [issue #4768](https://github.com/ROCm/ROCm/issues/4768) on GitHub.
### ROCm SMI uninstallation issue on RHEL and SLES
`rocm-smi-lib` does not get uninstalled and remains orphaned on RHEL and SLES systems when:
* [Uninstalling ROCm using the AMDGPU installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html#uninstalling-rocm) with `amdgpu-install --uninstall`
* [Uninstalling via package manager](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/install-methods/package-manager/package-manager-rhel.html#uninstall-rocm-packages)
with `dnf remove rocm-core` on RHEL or `zypper remove rocm-core` on SLES.
As a workaround, manually remove the `rocm-smi-lib` package using `sudo dnf remove rocm-smi-lib` or `sudo zypper remove rocm-smi-lib`.
See [issue #4767](https://github.com/ROCm/ROCm/issues/4767) on GitHub.
## ROCm upcoming changes
The following changes to the ROCm software stack are anticipated for future releases.
### ROCm SMI deprecation
[ROCm SMI](https://github.com/ROCm/rocm_smi_lib) will be phased out in an
upcoming ROCm release and will enter maintenance mode. After this transition,
only critical bug fixes will be addressed and no further feature development
will take place.
It's strongly recommended to transition your projects to [AMD
SMI](https://github.com/ROCm/amdsmi), the successor to ROCm SMI. AMD SMI
includes all the features of the ROCm SMI and will continue to receive regular
updates, new functionality, and ongoing support. For more information on AMD
SMI, see the [AMD SMI documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/).
### ROCTracer, ROCProfiler, rocprof, and rocprofv2 deprecation
Development and support for ROCTracer, ROCProfiler, `rocprof`, and `rocprofv2` are being phased out in favor of ROCprofiler-SDK in upcoming ROCm releases. Starting with ROCm 6.4, only critical defect fixes will be addressed for older versions of the profiling tools and libraries. All users are encouraged to upgrade to the latest version of the ROCprofiler-SDK library and the `rocprofv3` tool to ensure continued support and access to new features. ROCprofiler-SDK is still in beta today and will be production-ready in a future ROCm release.
ROCTracer, ROCProfiler, `rocprof`, and `rocprofv2` are anticipated to reach end of life in a future release, targeting Q1 2026.
### AMDGPU wavefront size compiler macro deprecation
Access to the wavefront size as a compile-time constant via the `__AMDGCN_WAVEFRONT_SIZE`
and `__AMDGCN_WAVEFRONT_SIZE__` macros or the `constexpr warpSize` variable is deprecated
and will be disabled in a future release.
* The `__AMDGCN_WAVEFRONT_SIZE__` macro and `__AMDGCN_WAVEFRONT_SIZE` alias will be removed in an upcoming release.
It is recommended to remove any use of this macro. For more information, see
[AMDGPU support](https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.0/LLVM/clang/html/AMDGPUSupport.html).
* `warpSize` will only be available as a non-`constexpr` variable. Where required,
the wavefront size should be queried via the `warpSize` variable in device code,
or via `hipGetDeviceProperties` in host code. Neither of these will result in a compile-time constant.
* For cases where compile-time evaluation of the wavefront size cannot be avoided,
uses of `__AMDGCN_WAVEFRONT_SIZE`, `__AMDGCN_WAVEFRONT_SIZE__`, or `warpSize`
can be replaced with a user-defined macro or `constexpr` variable with the wavefront
size(s) for the target hardware. For example:
```cpp
#if defined(__GFX9__)
#define MY_MACRO_FOR_WAVEFRONT_SIZE 64
#else
#define MY_MACRO_FOR_WAVEFRONT_SIZE 32
#endif
```
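For the runtime query path mentioned above, a minimal host-side sketch (not part of the original release notes) could look like this, using the standard `hipGetDeviceProperties` API:
```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    hipDeviceProp_t props;
    // Query device properties at runtime; warpSize reports the wavefront size (32 or 64).
    if (hipGetDeviceProperties(&props, device) != hipSuccess) {
        std::fprintf(stderr, "hipGetDeviceProperties failed\n");
        return 1;
    }
    std::printf("Wavefront size on device %d: %d\n", device, props.warpSize);
    return 0;
}
```
Note that this value is only known at run time, so it cannot replace the compile-time macro in contexts that require a constant expression.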
### HIPCC Perl scripts deprecation
The HIPCC Perl scripts (`hipcc.pl` and `hipconfig.pl`) will be removed in an upcoming release.
### Changes to ROCm Object Tooling
ROCm Object Tooling tools ``roc-obj-ls``, ``roc-obj-extract``, and ``roc-obj`` are
deprecated in ROCm 6.4 and will be removed in a future release. Functionality
has been added to the ``llvm-objdump --offloading`` option to extract all
clang-offload-bundles found within the input objects or executables into
individual code objects. The ``llvm-objdump --offloading`` option also supports
the ``--arch-name`` option, which limits extraction to code objects for the
specified target architecture. See [llvm-objdump](https://llvm.org/docs/CommandGuide/llvm-objdump.html)
for more information.
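As a rough sketch of the replacement workflow (the binary name and target architecture below are placeholders, not from the original notes):
```shell
# Extract all clang-offload-bundles embedded in a HIP executable into individual code objects
llvm-objdump --offloading ./my_hip_app

# Limit extraction to code objects built for a specific target architecture
llvm-objdump --offloading --arch-name=gfx90a ./my_hip_app
```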
### HIP runtime API changes
Several changes planned for the HIP runtime API in an upcoming major release
are not backward compatible with prior releases. Most of these changes increase
alignment between HIP and CUDA APIs or behavior. Some of the upcoming changes
clean up header files, remove namespace collisions, and establish a clear separation
between `hipRTC` and the HIP runtime.

View File

@@ -27,28 +27,14 @@ project = "ROCm Documentation"
project_path = os.path.abspath(".").replace("\\", "/")
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = "7.0 Alpha 2"
release = "7.0 Alpha 2"
version = "7.0 Beta"
release = "7.0 Beta"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_os = ["linux"]
all_article_info_author = ""
# pages with specific settings
article_pages = [
{"file": "preview/index", "os": ["linux"],},
{"file": "preview/release", "os": ["linux"],},
{"file": "preview/versions", "os": ["linux"],},
{"file": "preview/install/index", "os": ["linux"],},
{"file": "preview/install/instinct-driver", "os": ["linux"],},
{"file": "preview/install/rocm", "os": ["linux"],},
{"file": "preview/benchmark-docker/index", "os": ["linux"],},
{"file": "preview/benchmark-docker/training", "os": ["linux"],},
{"file": "preview/benchmark-docker/pre-training-megatron-lm-llama-3-8b", "os": ["linux"],},
{"file": "preview/benchmark-docker/pre-training-torchtitan-llama-3-70b", "os": ["linux"],},
{"file": "preview/benchmark-docker/fine-tuning-lora-llama-2-70b", "os": ["linux"],},
{"file": "preview/benchmark-docker/inference", "os": ["linux"],},
{"file": "preview/benchmark-docker/inference-vllm-llama-3.1-405b-fp4", "os": ["linux"],},
{"file": "preview/benchmark-docker/inference-sglang-deepseek-r1-fp4", "os": ["linux"],},
{"file": "preview/release", "date": "2025-07-24",},
]
external_toc_path = "./sphinx/_toc.yml"
@@ -73,7 +59,7 @@ html_static_path = ["sphinx/static/css", "sphinx/static/js"]
html_css_files = ["rocm_custom.css", "rocm_rn.css"]
html_js_files = ["preview-version-list.js"]
html_title = "ROCm 7.0 Alpha 2 documentation"
html_title = "ROCm 7.0 Beta documentation"
html_theme_options = {"link_main_doc": False}

View File

@@ -15,27 +15,36 @@ Docker images for AI training and inference
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This page accompanies preview Docker images designed to validate and reproduce
training performance on AMD Instinct™ MI355X and MI350X accelerators. The images provide access to
Alpha versions of the ROCm 7.0 software stack and are targeted at early-access users evaluating
training workloads using next-generation AMD accelerators.
This preview offers hands-on benchmarking using representative large-scale
language and reasoning models with optimized compute precisions and
configurations.
This page accompanies preview Docker images designed to reproduce
training performance on AMD Instinct™ MI355X, MI350X, and MI300X series
accelerators. The images provide access to Beta versions of the ROCm 7.0
software stack and are targeted at early-access users evaluating training and
inference workloads using next-generation AMD accelerators.
.. important::
The following AI workload benchmarks only support the ROCm 7.0 Alpha release on AMD Instinct
MI355X and MI350X accelerators.
The following AI workload benchmarks use the ROCm 7.0 Beta preview on AMD Instinct
MI355X, MI350X, and MI300X series accelerators.
If you're looking for production-level workloads for the MI300X series, see
If you're looking for production-level workloads for MI300X series accelerators, see
`Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html>`_.
.. grid:: 2
.. grid-item-card:: Training
* :doc:`pre-training-megatron-lm-llama-3-8b`
* :doc:`training-megatron-lm-llama-3`
* :doc:`pre-training-torchtitan-llama-3-70b`
* :doc:`training-torchtitan-llama-3`
* :doc:`training-mlperf-fine-tuning-llama-2-70b`
.. grid-item-card:: Inference
* :doc:`inference-vllm-llama-3.1-405b-fp4`
* :doc:`inference-vllm-llama-3.3-70b-fp8`
* :doc:`inference-vllm-gpt-oss-120b`
* :doc:`inference-sglang-deepseek-r1-fp4`

View File

@@ -0,0 +1,106 @@
***********************************************
Benchmark DeepSeek R1 FP4 inference with SGLang
***********************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This section provides instructions to test the inference performance of DeepSeek R1
with FP4 precision via the SGLang serving framework.
The accompanying Docker image integrates the ROCm 7.0 Beta with SGLang, and is
tailored for AMD Instinct MI355X and MI350X accelerators. This
benchmark does not support other accelerators.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_ubuntu_22.04_sglang_0.4.6.post4_mi35X_alpha/images/sha256-3095b0179c31bb892799c3b53e73f202abbd66409903cb990f48d0fdd3b1a1fe>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sglang_0.5.1post2_dsv3-opt_mi35x_beta
Download the model
==================
See model card on Hugging Face at `DeepSeek-R1-MXFP4-Preview
<https://huggingface.co/amd/DeepSeek-R1-MXFP4-Preview>`__. This model uses
microscaling 4-bit floating point (MXFP4) quantization via `AMD Quark
<https://quark.docs.amd.com/latest/>`_ for efficient inference on AMD
accelerators.
.. code-block:: shell
pip install huggingface_hub[cli] hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download amd/DeepSeek-R1-0528-MXFP4-Preview --exclude "original/*"
Run the inference benchmark
===========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--user root \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-w /app/ \
--ipc=host \
--shm-size 64G \
--mount type=bind,src=/data,dst=/data \
--device=/dev/kfd \
--device=/dev/dri \
-e SGLANG_USE_AITER=1 \
rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_sglang_0.5.1post2_dsv3-opt_mi35x_beta
2. Start the server.
.. code-block:: shell
python3 -m sglang.launch_server \
--model-path amd/DeepSeek-R1-0528-MXFP4-Preview \
--host localhost \
--port 8000 \
--tensor-parallel-size 8 \
--trust-remote-code \
--chunked-prefill-size 196608 \
--mem-fraction-static 0.8 --disable-radix-cache \
--num-continuous-decode-steps 4 \
--max-prefill-tokens 196608 \
--cuda-graph-max-bs 128 &
3. Run the benchmark with the following options.
.. code-block:: shell
input_tokens=1024
output_tokens=1024
max_concurrency=64
num_prompts=128
python3 -m sglang.bench_serving \
--host localhost \
--port 8000 \
--model amd/DeepSeek-R1-0528-MXFP4-Preview \
--dataset-name random \
--random-input ${input_tokens} \
--random-output ${output_tokens} \
--random-range-ratio 1.0 \
--max-concurrency ${max_concurrency} \
--num-prompt ${num_prompts}

View File

@@ -0,0 +1,108 @@
******************************************
Benchmark GPT OSS 120B inference with vLLM
******************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This section provides instructions to test the inference performance of OpenAI
GPT OSS 120B on the vLLM inference engine. The accompanying Docker image integrates
the ROCm 7.0 Beta with vLLM, and is tailored for AMD Instinct
MI355X, MI350X, and MI300X series accelerators. This benchmark does not support other
GPUs.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta/images/sha256-ac5bf30a1ce1daf41cc68081ce41b76b1fba6bf44c8fab7ccba01f86d8f619b8>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
Download the model
==================
See the model card on Hugging Face at `openai/gpt-oss-120b
<https://huggingface.co/openai/gpt-oss-120b>`__.
.. code-block:: shell
pip install huggingface_hub[cli] hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download openai/gpt-oss-120b --local-dir /data/gpt-oss-120b
Run the inference benchmark
===========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--network host \
--ipc host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mem \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--shm-size 32G \
-v /data/huggingface-cache:/root/.cache/huggingface/hub/ \
-v "$PWD/.vllm_cache/":/root/.cache/vllm/ \
-v /data/gpt-oss-120b:/data/gpt-oss-120b \
--name vllm-server \
rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
2. Set environment variables and start the server.
.. code-block:: shell
export VLLM_ROCM_USE_AITER=1
export VLLM_DISABLE_COMPILE_CACHE=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_USE_AITER_TRITON_FUSED_SPLIT_QKV_ROPE=1
export VLLM_USE_AITER_TRITON_FUSED_ADD_RMSNORM_PAD=1
export TRITON_HIP_PRESHUFFLE_SCALES=1
export VLLM_USE_AITER_TRITON_GEMM=1
export HSA_NO_SCRATCH_RECLAIM=1
vllm serve /data/gpt-oss-120b/ \
--tensor-parallel 1 \
--no-enable-prefix-caching --disable-log-requests \
--compilation-config '{"compile_sizes": [1, 2, 4, 8, 16, 24, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192], "cudagraph_capture_sizes":[8192,4096,2048,1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1], "full_cuda_graph": true}' \
--block-size 64 \
--swap-space 16 \
--gpu-memory-utilization 0.95 \
--async-scheduling
3. Run the benchmark with the following options.
.. code-block:: shell
vllm bench serve \
--model /data/gpt-oss-120b \
--backend vllm \
--host 0.0.0.0 \
--dataset-name "random" \
--random-input-len 1024 \
--random-output-len 1024 \
--random-prefix-len 0 \
--num-prompts 32 \
--max-concurrency 16 \
--request-rate "inf" \
--ignore-eos

View File

@@ -0,0 +1,139 @@
************************************************
Benchmark Llama 3.1 405B FP4 inference with vLLM
************************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This section provides instructions to test the inference performance of Llama
3.1 405B on the vLLM inference engine. The accompanying Docker image integrates
the ROCm 7.0 Beta with vLLM, and is tailored for AMD Instinct
MI355X and MI350X accelerators. This benchmark does not support other
GPUs.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta/images/sha256-ac5bf30a1ce1daf41cc68081ce41b76b1fba6bf44c8fab7ccba01f86d8f619b8>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
Download the model
==================
See the model card on Hugging Face at
`amd/Llama-3.1-405B-Instruct-MXFP4-Preview
<https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview>`__. This
model uses microscaling 4-bit floating point (MXFP4) quantization via `AMD
Quark <https://quark.docs.amd.com/latest/>`_ for efficient inference on AMD
accelerators.
.. code-block:: shell
pip install huggingface_hub[cli] hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download amd/Llama-3.1-405B-Instruct-MXFP4-Preview --exclude "original/*"
Run the inference benchmark
===========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--ipc=host \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /data:/data \
-e HF_HOME=/data/huggingface-cache \
-e HF_HUB_OFFLINE=1 \
-e VLLM_USE_V1=1 \
-e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
-e AMDGCN_USE_BUFFER_OPS=1 \
-e VLLM_USE_AITER_TRITON_ROPE=1 \
-e TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1 \
-e TRITON_HIP_USE_ASYNC_COPY=1 \
-e TRITON_HIP_USE_BLOCK_PINGPONG=1 \
-e TRITON_HIP_ASYNC_FAST_SWIZZLE=1 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MHA=0 \
-e VLLM_ROCM_USE_AITER_RMSNORM=1 \
-e VLLM_TRITON_FP4_GEMM_USE_ASM=1 \
-e VLLM_TRITON_FP4_GEMM_SPLITK_USE_BF16=1 \
-e TRITON_HIP_PRESHUFFLE_SCALES=0 \
-e VLLM_TRITON_FP4_GEMM_BPRESHUFFLE=0 \
--name vllm-server \
rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
2. Start the server.
.. code-block:: shell
max_model_len=10240
max_num_seqs=1024
max_num_batched_tokens=131072
max_seq_len_to_capture=16384
tensor_parallel_size=8
vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
--host localhost \
--port 8000 \
--swap-space 64 \
--disable-log-requests \
--dtype auto \
--max-model-len ${max_model_len} \
--tensor-parallel-size ${tensor_parallel_size} \
--max-num-seqs ${max_num_seqs} \
--distributed-executor-backend mp \
--trust-remote-code \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92 \
--max-seq-len-to-capture ${max_seq_len_to_capture} \
--max-num-batched-tokens ${max_num_batched_tokens} \
--no-enable-prefix-caching \
--async-scheduling
# Wait for the model to load and for the server to be ready to accept requests
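# Optionally, poll the server's OpenAI-compatible endpoint until it responds
# (a sketch; assumes vLLM's default API on localhost:8000)
curl -s http://localhost:8000/v1/models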
3. Open another terminal on the same machine and run the benchmark with the following options.
.. code-block:: shell
# Connect to server
docker exec -it vllm-server bash
# Run the client benchmark
input_tokens=1024
output_tokens=1024
max_concurrency=4
num_prompts=32
python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --port 8000 \
--model amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
--dataset-name random \
--random-input-len ${input_tokens} \
--random-output-len ${output_tokens} \
--max-concurrency ${max_concurrency} \
--num-prompts ${num_prompts} \
--percentile-metrics ttft,tpot,itl,e2el \
--ignore-eos

View File

@@ -0,0 +1,132 @@
************************************************
Benchmark Llama 3.3 70B FP8 inference with vLLM
************************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This section provides instructions to test the inference performance of Llama
3.3 70B on the vLLM inference engine. The accompanying Docker image integrates
the ROCm 7.0 Beta with vLLM, and is tailored for AMD Instinct
MI355X, MI350X, and MI300X series accelerators. This benchmark does not support other
GPUs.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_ubuntu_22.04_vllm_0.9.1_mi35x_alpha/images/sha256-3ab87887724b75e5d1d2306a04afae853849ec3aabf8f9ee6335d766b3d0eaa0>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
Download the model
==================
See the model card on Hugging Face at
`amd/Llama-3.3-70B-Instruct-FP8-KV
<https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV>`__.
.. code-block:: shell
pip install huggingface_hub[cli] hf_transfer hf_xet
HF_HUB_ENABLE_HF_TRANSFER=1 \
HF_HOME=/data/huggingface-cache \
HF_TOKEN="<HF_TOKEN>" \
huggingface-cli download amd/Llama-3.3-70B-Instruct-FP8-KV --exclude "original/*"
Run the inference benchmark
===========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--ipc=host \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /data:/data \
-e HF_HOME=/data/huggingface-cache \
-e HF_HUB_OFFLINE=1 \
-e VLLM_USE_V1=1 \
-e VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
-e AMDGCN_USE_BUFFER_OPS=1 \
-e VLLM_USE_AITER_TRITON_ROPE=1 \
-e TRITON_HIP_ASYNC_COPY_BYPASS_PERMUTE=1 \
-e TRITON_HIP_USE_ASYNC_COPY=1 \
-e TRITON_HIP_USE_BLOCK_PINGPONG=1 \
-e TRITON_HIP_ASYNC_FAST_SWIZZLE=1 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_ROCM_USE_AITER_MHA=0 \
-e VLLM_ROCM_USE_AITER_RMSNORM=1 \
--name vllm-server \
rocm/7.0-preview:rocm7.0_preview_ubuntu_22.04_vllm_0.10.1_instinct_beta
2. Start the server.
.. code-block:: shell
max_model_len=10240
max_num_seqs=1024
max_num_batched_tokens=131072
max_seq_len_to_capture=16384
tensor_parallel_size=8
vllm serve amd/Llama-3.3-70B-Instruct-FP8-KV \
--host localhost \
--port 8000 \
--swap-space 64 \
--disable-log-requests \
--dtype auto \
--max-model-len ${max_model_len} \
--tensor-parallel-size ${tensor_parallel_size} \
--max-num-seqs ${max_num_seqs} \
--distributed-executor-backend mp \
--trust-remote-code \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92 \
--max-seq-len-to-capture ${max_seq_len_to_capture} \
--max-num-batched-tokens ${max_num_batched_tokens} \
--no-enable-prefix-caching \
--async-scheduling
# Wait for the model to load and for the server to be ready to accept requests
3. Open another terminal on the same machine and run the benchmark with the following options.
.. code-block:: shell
# Connect to server
docker exec -it vllm-server bash
# Run the client benchmark
input_tokens=8192
output_tokens=1024
max_concurrency=4
num_prompts=32
python3 /app/vllm/benchmarks/benchmark_serving.py --host localhost --port 8000 \
--model amd/Llama-3.3-70B-Instruct-FP8-KV \
--dataset-name random \
--random-input-len ${input_tokens} \
--random-output-len ${output_tokens} \
--max-concurrency ${max_concurrency} \
--num-prompts ${num_prompts} \
--percentile-metrics ttft,tpot,itl,e2el \
--ignore-eos

View File

@@ -0,0 +1,28 @@
****************************
Benchmarking model inference
****************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
AMD provides prebuilt, optimized environments for testing the inference
performance of popular models on AMD Instinct™ MI355X and MI350X accelerators.
See the following sections for instructions.
.. grid::
.. grid-item-card:: Inference benchmarking
* :doc:`inference-vllm-llama-3.1-405b-fp4`
* :doc:`inference-vllm-llama-3.3-70b-fp8`
* :doc:`inference-vllm-gpt-oss-120b`
* :doc:`inference-sglang-deepseek-r1-fp4`

View File

@@ -1,96 +0,0 @@
*****************************************************
Benchmarking Llama 3 8B pre-training with Megatron-LM
*****************************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This section details how to benchmark Llama 3 8B pre-training using the
Megatron-LM framework. It includes configurations for both ``FP8`` and
``BF16`` precision to measure throughput.
The accompanying Docker image integrates the ROCm 7.0 Alpha with Megatron-LM, and is
tailored for AMD Instinct MI355X and MI350X accelerators. This
benchmark does not support other accelerators.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
1. Pull the Docker image.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
2. Start the container.
.. code-block:: shell
docker run -it --device /dev/dri --device /dev/kfd \
--network host --ipc host --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
-w /workspace/Megatron-LM \
--name training_benchmark \
rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
.. note::
This containerized environment includes all necessary dependencies and pre-tuned
configurations for the supported models and precision types.
3. Run the training script for Llama 3 8B with the appropriate options for your desired precision.
.. tab-set::
.. tab-item:: FP8 precision
.. code-block:: shell
bash examples/llama/train_llama3.sh \
TEE_OUTPUT=1 \
MBS=4 \
BS=512 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10 \
GEMM_TUNING=0
.. tab-item:: BF16 precision
.. code-block:: shell
bash examples/llama/train_llama3.sh \
TEE_OUTPUT=1 \
MBS=4 \
BS=256 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10
.. note::
The ``train_llama3.sh`` script accepts the following options:
* ``MBS``: Micro-batch size per GPU
* ``BS``: Global batch size
* ``TP``: Tensor parallelism
* ``SEQ_LENGTH``: Maximum input token sequence length
* ``TE_FP8``: Toggle to enable FP8
* ``TOTAL_ITERS``: Number of training iterations to execute

View File

@@ -1,79 +0,0 @@
*****************************************************
Benchmarking Llama 3 70B pre-training with torchtitan
*****************************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This guide provides instructions for benchmarking the pre-training throughput
of the Llama 3 70B model using torchtitan. By following these steps, you will
use a pre-configured Docker container, download the necessary Llama 3 assets,
and run the training script to measure performance in either ``FP8`` or ``BF16``
precision.
The accompanying Docker image integrates the ROCm 7.0 Alpha with torchtitan, and is
tailored for next-generation AMD Instinct MI355X and MI350X accelerators. This
benchmark does not support other accelerators.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
1. Pull the Docker image.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
2. Start the container.
.. code-block:: shell
docker run -it --device /dev/dri --device /dev/kfd \
--network host --ipc host --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined --privileged \
-v $HOME:$HOME -v \
$HOME/.ssh:/root/.ssh \
--shm-size 64G \
-w /workspace/torchtitan \
--name training_benchmark \
rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35X_alpha
.. note::
This containerized environment includes all necessary dependencies and pre-tuned
configurations for the supported models and precision types.
3. Download the Llama 3 tokenizer. Make sure to set ``HF_TOKEN`` using a valid Hugging Face access
token with Llama model permissions.
.. code-block:: shell
export HF_TOKEN= #{your huggingface token with Llama 3 access}
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-70B --tokenizer_path "original" --hf_token=${HF_TOKEN}
4. Run the training script for Llama 3 70B with the appropriate configuration file for your desired
precision.
.. tab-set::
.. tab-item:: FP8 precision
.. code-block:: shell
CONFIG_FILE="./llama3_70b_fsdp_fp8.toml" ./run_train.sh
.. tab-item:: BF16 precision
.. code-block:: shell
CONFIG_FILE="./llama3_70b_fsdp_bf16.toml" ./run_train.sh
.. note::
These configuration files define batch size, FSDP strategy, optimizer settings, and precision
type for each benchmarking run.

View File

@@ -0,0 +1,153 @@
***********************************************
Benchmark Llama 3 pre-training with Megatron-LM
***********************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This page describes how to benchmark Llama 3 8B and 70B pre-training using the
Megatron-LM framework. It includes configurations for both
FP8 and BF16 precision to measure throughput. The accompanying Docker
image integrates a Beta preview build of ROCm 7.0 with Megatron-LM
-- and is tailored for AMD Instinct MI355X and MI350X accelerators.
This benchmark does not support other accelerators.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_pytorch_training_mi35x_beta/images/sha256-d47db310d1913c1de526b25c06ac6bd4c9f53c199a5a04677afc57526f259234>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta
Run the training benchmark
==========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
-w /workspace/Megatron-LM \
--name training_benchmark \
rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta
.. note::
This containerized environment includes all necessary dependencies and pre-tuned
configurations for the supported models and precision types.
2. Run the training script with the following options for your desired precision.
.. tab-set::
.. tab-item:: Llama 3 8B
.. tab-set::
.. tab-item:: BF16
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=4 \
BS=512 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10 \
GEMM_TUNING=1 \
bash examples/llama/train_llama3.sh
.. tab-item:: FP8
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=4 \
BS=512 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10 \
GEMM_TUNING=0 \
bash examples/llama/train_llama3.sh
.. tab-item:: Llama 3 70B
.. tab-set::
.. tab-item:: BF16
.. code-block:: shell
CKPT_FORMAT=torch_dist \
TEE_OUTPUT=1 \
MBS=3 \
BS=24 \
TP=1 \
TE_FP8=0 \
FSDP=1 \
RECOMPUTE=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=70 \
TOTAL_ITERS=10 \
bash examples/llama/train_llama3.sh
.. tab-item:: FP8
.. code-block:: shell
CKPT_FORMAT=torch_dist \
TEE_OUTPUT=1 \
RECOMPUTE=1 \
MBS=3 \
BS=24 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=70 \
FSDP=1 \
TOTAL_ITERS=10 \
NUM_LAYERS=40 \
bash examples/llama/train_llama3.sh
.. rubric:: Options
The ``train_llama3.sh`` script accepts the following options:
* ``MBS``: Micro-batch size per GPU
* ``BS``: Global batch size
* ``TP``: Tensor parallelism
* ``SEQ_LENGTH``: Maximum input token sequence length
* ``TE_FP8``: Toggle to enable FP8
* ``TOTAL_ITERS``: Number of training iterations to execute
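As an illustration of how these options combine, the following is a hedged
example of a short smoke-test run. The values are hypothetical, chosen only to
finish quickly, and are not tuned for benchmarking throughput.
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=1 \
BS=8 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=2048 \
MODEL_SIZE=8 \
TOTAL_ITERS=3 \
bash examples/llama/train_llama3.sh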

View File

@@ -0,0 +1,160 @@
**************************************************
Benchmark Llama 2 70B LoRA fine-tuning with MLPerf
**************************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This guide provides instructions to benchmark LoRA fine-tuning on the Llama 2
70B model. The benchmark follows the MLPerf training submission for
long-document summarization using the GovReport dataset.
The accompanying Docker image integrates an early-access preview of the ROCm
7.0 software stack and is optimized for AMD Instinct™ MI355X, MI350X, and
MI300X series accelerators.
Pull the Docker image
=====================
1. Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta/images/sha256-e75ec355a0501cad57d258bdf055267edfbdce4339b07d203a1ca9cdced2f9c9>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta
2. Copy the benchmark scripts from the container to your host. These scripts
are used to configure the environment and launch the benchmark.
.. code-block:: shell
container_id=$(docker create rocm/7.0-preview:rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta) && \
docker cp $container_id:/workspace/code/runtime_tunables.sh . && \
docker cp $container_id:/workspace/code/run_with_docker.sh . && \
docker cp $container_id:/workspace/code/config_MI355X_1x8x1.sh . && \
docker rm $container_id
.. note::
The ``config_*.sh`` files contain system-specific hyperparameters used in
the :ref:`run step <system-config>`. You will need to copy the one that
matches your hardware configuration.
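If you are unsure which configuration files ship with the image, one way to
list them is shown below. This is a hedged example: it assumes ``/bin/ls`` is
available inside the image; the ``/workspace/code`` path is the same one used
in the copy step above.
.. code-block:: shell
docker run --rm --entrypoint /bin/ls \
rocm/7.0-preview:rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta \
/workspace/code | grep '^config_'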
Prepare the GovReport dataset
=============================
This benchmark uses the Llama 2 70B model with fused QKV and the GovReport dataset.
GovReport is a dataset for long document summarization that consists of
reports written by government research agencies. The dataset hosted on the
MLPerf drive is already tokenized and packed so that each sequence has
length 8192.
1. Download and preprocess the dataset.
Start the Docker container by mounting the volume you want to use for
downloading the data under ``/data`` within the container. This example uses
``/data/mlperf_llama2`` as the host's download directory:
.. code-block:: shell
docker run -it \
--net=host \
--uts=host \
--ipc=host \
--device /dev/dri \
--device /dev/kfd \
--privileged \
--security-opt=seccomp=unconfined \
--volume=/data/mlperf_llama2:/data \
--volume=/data/mlperf_llama2/model:/ckpt \
rocm/7.0-preview:rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta
2. From within the container, run the preparation script. This will download and
preprocess the dataset and model.
.. code-block:: shell
bash ./scripts/prepare_data_and_model.sh
3. Verify the preprocessed files. After the script completes, check for the
following files in the mounted directories.
After preprocessing, you should see the following files in the ``/data/model`` directory:
.. code-block:: shell-session
<hash>_tokenizer.model llama2-70b.nemo
model_config.yaml model_weights
And the following files in ``/data/data``:
.. code-block:: shell-session
train.npy validation.npy
4. Exit the container and return to your host shell.
.. code-block:: shell
exit
Run the benchmark
=================
With the dataset prepared, you can now configure and run the fine-tuning
benchmark from your host machine.
1. Set the environment variables. These variables point to the directories you used
for data, the model, and where the resulting logs should be stored.
.. code-block:: shell
export DATADIR=/data/mlperf_llama2
export LOGDIR=/data/mlperf_llama2/results
export CONT=rocm/7.0-preview:rocm7.0_preview_ubuntu22.04_llama2_70b_training_mlperf_instinct_beta
.. tip::
Ensure the log directory exists and is writable by the container user.
.. code-block:: shell
mkdir -p $LOGDIR
sudo chmod -R 777 $LOGDIR
.. _system-config:
2. Source the system-specific configuration file. The ``config_*.sh`` files
contain optimized hyperparameters for different hardware configurations.
.. code-block:: shell
# Use the appropriate config
source config_MI355X_1x8x1.sh
3. To perform a single training run, use the following command.
.. code-block:: shell
export NEXP=1
bash run_with_docker.sh
Optionally, to perform 10 consecutive training runs:
.. code-block:: shell
export NEXP=10
bash run_with_docker.sh
.. note::
To optimize performance, the ``run_with_docker.sh`` script automatically
executes ``runtime_tunables.sh`` to apply system-level optimizations
before starting the training job.
Upon run completion, the logs will be available under ``$LOGDIR``.
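To take a quick look at the results, you can list the log directory and filter
for MLPerf-style summary lines. This is a hedged example: the ``:::MLLOG``
prefix and the exact file layout are assumptions based on the standard MLPerf
logging format.
.. code-block:: shell
ls -lt $LOGDIR
grep -rh ":::MLLOG" $LOGDIR | tail -n 5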

View File

@@ -0,0 +1,122 @@
**********************************************
Benchmark Llama 3 pre-training with torchtitan
**********************************************
.. note::
For the latest iteration of AI training and inference performance for ROCm
7.0, see `Infinity Hub
<https://www.amd.com/en/developer/resources/infinity-hub.html#q=ROCm%207>`__
and the `ROCm 7.0 AI training and inference performance
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
This page describes how to benchmark Llama 3 8B and 70B pre-training using
torchtitan. The accompanying Docker image integrates a Beta preview build of
ROCm 7.0 with torchtitan -- and is tailored for AMD Instinct MI355X and MI350X
accelerators. This benchmark does not support other accelerators.
Follow these steps to pull the required image, spin up the container with the
appropriate options, download the model, and run the throughput test.
Pull the Docker image
=====================
Use the following command to pull the `Docker image <https://hub.docker.com/layers/rocm/7.0-preview/rocm7.0_preview_pytorch_training_mi35x_beta/images/sha256-d47db310d1913c1de526b25c06ac6bd4c9f53c199a5a04677afc57526f259234>`__.
.. code-block:: shell
docker pull rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta
Run the training benchmark
==========================
1. Start the container using the following command.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
-w /workspace/torchtitan \
--name training_benchmark \
rocm/7.0-preview:rocm7.0_preview_pytorch_training_mi35x_beta
.. note::
This containerized environment includes all necessary dependencies and pre-tuned
configurations for the supported models and precision types.
2. Download the Llama 3 tokenizer. Make sure to set ``HF_TOKEN`` using
a valid Hugging Face access token with Llama model permissions.
.. tab-set::
.. tab-item:: Llama 3 8B
:sync: 8b
.. code-block:: shell
export HF_TOKEN=#{your huggingface token with Llama 3 access}
python3 scripts/download_tokenizer.py \
--repo_id meta-llama/Meta-Llama-3-8B \
--tokenizer_path "original" \
--hf_token=${HF_TOKEN}
.. tab-item:: Llama 3 70B
:sync: 70b
.. code-block:: shell
export HF_TOKEN=#{your huggingface token with Llama 3 access}
python3 scripts/download_tokenizer.py \
--repo_id meta-llama/Meta-Llama-3-70B \
--tokenizer_path "original" \
--hf_token=${HF_TOKEN}
3. Run the training script with the following options for your desired precision.
.. tab-set::
.. tab-item:: Llama 3 8B
:sync: 8b
.. tab-set::
.. tab-item:: BF16
.. code-block:: shell
CONFIG_FILE="./llama3_8b_fsdp_bf16.toml" ./run_train.sh
.. tab-item:: FP8
.. code-block:: shell
CONFIG_FILE="./llama3_8b_fsdp_fp8.toml" ./run_train.sh
.. tab-item:: Llama 3 70B
:sync: 70b
.. tab-set::
.. tab-item:: BF16
.. code-block:: shell
CONFIG_FILE="./llama3_70b_fsdp_bf16.toml" ./run_train.sh
.. tab-item:: FP8
.. code-block:: shell
CONFIG_FILE="./llama3_70b_fsdp_fp8.toml" ./run_train.sh

View File

@@ -11,21 +11,17 @@ Benchmark model training
<https://rocm.docs.amd.com/en/docs-7.0-docker/benchmark-docker/index.html>`__
documentation.
The process of training models is computationally intensive, requiring
specialized hardware like GPUs to accelerate computations and reduce training
time. Training models on AMD GPUs involves leveraging the parallel processing
capabilities of these GPUs to significantly speed up the model training process
in deep learning tasks.
Training models on AMD GPUs with the ROCm software platform allows you to use
the powerful parallel processing capabilities and efficient compute resource
management, significantly improving training time and overall performance in
machine learning applications.
AMD provides ready-to-use Docker images for MI355X and MI350X series
accelerators containing essential software components and optimizations to
accelerate and benchmark training workloads for popular models.
See the following sections for instructions.
.. grid:: 1
.. grid-item-card:: Training benchmarking
* :doc:`pre-training-megatron-lm-llama-3-8b`
* :doc:`training-megatron-lm-llama-3`
* :doc:`pre-training-torchtitan-llama-3-70b`
* :doc:`training-torchtitan-llama-3`
* :doc:`training-mlperf-fine-tuning-llama-2-70b`

View File

@@ -1,18 +1,18 @@
---
myst:
html_meta:
"description": "AMD ROCm 7.0 Alpha 2 documentation"
"description": "AMD ROCm 7.0 Beta documentation"
"keywords": "Radeon, open, compute, platform, install, how, conceptual, reference, home, docs"
---
# AMD ROCm 7.0 Alpha 2 documentation
# AMD ROCm 7.0 Beta documentation
AMD ROCm is an open-source software platform optimized to extract HPC and AI
workload performance from AMD Instinct™ accelerators while maintaining
compatibility with industry software frameworks.
This documentation provides early access information about ROCm 7.0
Alpha 2. This preview release provides access to new
This documentation provides early access information about the ROCm 7.0
Beta. This preview release provides access to new
features under development for testing so users can provide feedback.
It is not recommended for production use.
@@ -23,5 +23,7 @@ For a complete list of ROCm 7.0 preview releases, see the [ROCm 7.0 preview rele
The documentation includes:
- [ROCm 7.0 Alpha 2 release notes](release.rst) with feature details and support matrix
- [Installation instructions](install/index.rst) for the ROCm 7.0 Alpha 2 and the Instinct Driver
- [ROCm 7.0 Beta release notes](release.rst) with feature details and support matrix
- [Installation instructions](install/index.rst) for the ROCm 7.0 Beta and the Instinct Driver
- [Docker images for AI training and inference](benchmark-docker/index.rst) including the ROCm 7.0 Beta and
AI workload benchmarks

View File

@@ -4,24 +4,24 @@
ROCm
****************************************
ROCm 7.0 Alpha installation instructions
ROCm 7.0 Beta installation instructions
****************************************
The ROCm 7.0 Alpha must be installed using your Linux distribution's native
The ROCm 7.0 Beta must be installed using your Linux distribution's native
package manager. This release supports specific hardware and software
configurations -- before installing, see the :ref:`supported OSes and hardware
<alpha-2-system-requirements>` outlined in the Alpha 2 release notes.
<beta-system-requirements>` outlined in the Beta release notes.
.. important::
Upgrades and downgrades are not supported. You must uninstall any existing
ROCm installation before installing the Alpha 2 build.
ROCm installation before installing the Beta build.
.. grid:: 2
.. grid-item-card:: Install ROCm
See :doc:`Install the ROCm 7.0 Alpha 2 via package manager <rocm>`.
See :doc:`Install the ROCm 7.0 Beta via package manager <rocm>`.
.. grid-item-card:: Install Instinct Driver

View File

@@ -81,7 +81,7 @@ Register ROCm repositories
.. code-block:: shell
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu jammy main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_beta/ubuntu jammy main" \
| sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
@@ -104,7 +104,7 @@ Register ROCm repositories
.. code-block:: shell
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu noble main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_beta/ubuntu noble main" \
| sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update
@@ -116,7 +116,7 @@ Register ROCm repositories
sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/30.10_alpha2/rhel/9.6/main/x86_64/
baseurl=https://repo.radeon.com/amdgpu/30.10_beta/rhel/9.6/main/x86_64/
enabled=1
priority=50
gpgcheck=1

View File

@@ -1,8 +1,8 @@
************************************************
Install the ROCm 7.0 Alpha 2 via package manager
Install the ROCm 7.0 Beta via package manager
************************************************
This page describes how to install the ROCm 7.0 Alpha 2 build using ``apt`` on
This page describes how to install AMD ROCm 7.0 Beta build using ``apt`` on
Ubuntu 22.04 or 24.04, or ``dnf`` on Red Hat Enterprise Linux 9.6.
.. important::
@@ -115,10 +115,10 @@ Register ROCm repositories
.. code-block:: shell
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.0_alpha2 jammy main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.0_beta jammy main" \
| sudo tee /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/graphics/7.0_alpha2/ubuntu jammy main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/graphics/7.0_beta/ubuntu jammy main" \
| sudo tee /etc/apt/sources.list.d/rocm-graphics.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
@@ -144,10 +144,10 @@ Register ROCm repositories
.. code-block:: shell
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.0_alpha2 noble main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/7.0_beta noble main" \
| sudo tee /etc/apt/sources.list.d/rocm.list
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/graphics/7.0_alpha2/ubuntu noble main" \
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/graphics/7.0_beta/ubuntu noble main" \
| sudo tee /etc/apt/sources.list.d/rocm-graphics.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' \
@@ -162,7 +162,7 @@ Register ROCm repositories
sudo tee /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-7.0.0]
name=ROCm7.0.0
baseurl=https://repo.radeon.com/rocm/el9/7.0_alpha2/main
baseurl=https://repo.radeon.com/rocm/el9/7.0_beta/main
enabled=1
priority=50
gpgcheck=1
@@ -172,7 +172,7 @@ Register ROCm repositories
sudo tee /etc/yum.repos.d/rocm-graphics.repo <<EOF
[ROCm-7.0.0-Graphics]
name=ROCm7.0.0-Graphics
baseurl=https://repo.radeon.com/graphics/7.0_alpha2/rhel/9/main/x86_64/
baseurl=https://repo.radeon.com/graphics/7.0_beta/rhel/9/main/x86_64/
enabled=1
priority=50
gpgcheck=1

View File

@@ -1,59 +1,95 @@
******************************
ROCm 7.0 Alpha 2 release notes
******************************
***************************
ROCm 7.0 Beta release notes
***************************
The ROCm 7.0 Alpha 2 is a preview of the upcoming ROCm 7.0 release,
which includes functional support for AMD Instinct™ MI355X and MI350X
on bare metal, single-node systems. It also introduces new ROCm features for
MI300X, MI200, and MI100 series accelerators. This is an Alpha-quality release;
expect issues and limitations that will be addressed in upcoming previews.
The AMD ROCm 7.0 Beta is a preview of the upcoming ROCm 7.0 release,
which includes functional support for AMD Instinct™ MI355X and MI350X accelerators
on single-node systems. It also introduces new ROCm features for
MI300X, MI200, and MI100 series accelerators. These include the addition of
KVM-based SR-IOV for GPU virtualization, major improvements to the HIP runtime,
and enhancements to profilers.
As this is a Beta release, expect issues and limitations that will be addressed
in upcoming previews.
.. important::
The Alpha 2 release is not intended for performance evaluation.
For the latest stable release for use in production, see the [ROCm documentation](https://rocm.docs.amd.com/en/latest/).
The Beta release is not intended for performance evaluation.
For the latest stable release for use in production, see the `ROCm documentation <https://rocm.docs.amd.com/en/latest/>`__.
This page provides a high-level summary of key changes added to the Alpha 2
release since `the previous Alpha
<https://rocm.docs.amd.com/en/docs-7.0-alpha/preview/index.html>`_.
This document highlights the key changes in the Beta release since the
`Alpha 2 <https://rocm.docs.amd.com/en/docs-7.0-alpha-2/preview/release.html>`__.
For a complete history, see the :doc:`ROCm 7.0 preview release history <versions>`.
.. _alpha-2-system-requirements:
.. _beta-system-requirements:
Operating system and hardware support
=====================================
Only the accelerators and operating systems listed here are supported. Multi-node systems,
virtualized environments, and GPU partitioning are not supported in the Alpha 2 release.
Only the accelerators and operating systems listed below are supported. Multi-node systems
and GPU partitioning are not supported in the Beta release.
* AMD Instinct accelerator: MI355X, MI350X, MI325X [#mi325x]_, MI300X, MI300A, MI250X, MI250, MI210, MI100
* Operating system: Ubuntu 22.04, Ubuntu 24.04, RHEL 9.6
* System type: Bare metal, single node only
* Partitioning: Not supported
.. list-table::
:stub-columns: 1
* - AMD Instinct accelerator
- MI355X, MI350X, MI325X [#mi325x]_, MI300X, MI300A, MI250X, MI250, MI210, MI100
* - Operating system
- Ubuntu 22.04, Ubuntu 24.04, RHEL 9.6
* - System type
- Single node only
* - GPU partitioning
- Not supported
.. [#mi325x] MI325X is only supported with Ubuntu 22.04.
.. _alpha-2-highlights:
Virtualization support
----------------------
Alpha 2 release highlights
==========================
The Beta introduces support for KVM-based SR-IOV on select accelerators. All
supported configurations require the `GIM SR-IOV driver version 8.3.0K
<https://github.com/amd/MxGPU-Virtualization/releases>`__.
This section highlights key features enabled in the ROCm 7.0 Alpha 2 release.
.. list-table::
:header-rows: 1
* - Accelerator
- Host OS
- Guest OS
* - MI350X
- Ubuntu 24.04
- Ubuntu 24.04
* - MI325X
- Ubuntu 22.04
- Ubuntu 22.04
* - MI300X
- Ubuntu 24.04
- Ubuntu 24.04
* - MI210
- Ubuntu 22.04
- Ubuntu 22.04
.. _beta-highlights:
Beta release highlights
=======================
This section highlights key features enabled in the ROCm 7.0 Beta release.
AI frameworks
-------------
The ROCm 7.0 Alpha 2 release supports PyTorch 2.7, TensorFlow 2.19, and Triton 3.3.0.
Libraries
---------
MIGraphX
~~~~~~~~
Added support for the Open Compute Project (OCP) ``FP8`` data type on MI350X accelerators.
The ROCm 7.0 Beta release supports PyTorch 2.7, TensorFlow 2.19, and Triton 3.3.0.
RCCL support
~~~~~~~~~~~~
------------
RCCL is supported for single-node functional usage only. Multi-node communication capabilities will
be supported in future preview releases.
@@ -61,85 +97,234 @@ be supported in future preview releases.
HIP
---
The HIP runtime includes support for:
Enhancements
~~~~~~~~~~~~
* Added ``constexpr`` operators for ``FP16`` and ``BF16``.
* Added ``hipDeviceGetAttribute``, a new device attribute to query the number
of compute dies (chiplets, XCCs), enabling performance optimizations based on
cache locality.
* Added ``__syncwarp`` operation.
* Extended fine-grained system memory pools.
* The ``_sync()`` versions of crosslane builtins such as ``shfl_sync()`` and
``__reduce_add_sync`` are enabled by default. These can be disabled by
setting the preprocessor macro ``HIP_DISABLE_WARP_SYNC_BUILTINS``.
* To improve API consistency, ``num_threads`` is now an alias for the legacy
``size`` parameter.
In addition, the HIP runtime includes the following functional enhancements which improve runtime
performance and user experience:
Fixes
~~~~~
* The HIP runtime now enables peer-to-peer (P2P) memory copies to use all
available SDMA engines rather than being limited to a single engine, and it
selects the best engine first to deliver optimal bandwidth.
* Fixed an issue where ``hipExtMallocWithFlags()`` did not correctly handle the
``hipDeviceMallocContiguous`` flag. The function now properly enables the
``HSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAG`` for memory pool allocations on the GPU.
* To match CUDA runtime behavior more closely, HIP runtime APIs no longer
validate streams passed as input parameters. If the input stream is invalid,
this causes a segmentation fault instead of returning the error code
``hipErrorContextIsDestroyed``.
* Resolved a compilation failure caused by incorrect vector type alignment. The
HIP runtime has been refactored to use ``__hip_vec_align_v`` for proper
alignment.
The following issues have been resolved:
Backwards-incompatible changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* An issue when retrieving a memory object from an IPC memory handle that
caused failures in some framework test applications.
Backwards-incompatible API changes previously described in `HIP 7.0 Is Coming:
What You Need to Know to Stay Ahead
<https://rocm.blogs.amd.com/ecosystems-and-partners/transition-to-hip-7.0-blog/README.html>`__
are enabled in the Beta. These changes -- aimed at improving GPU code
portability -- include:
* An issue causing the incorrect error ``hipErrorNoDevice`` to be returned when a crash
occurred on a GPU due to an illegal operation or memory violation. The HIP runtime now
handles the failure on the GPU side properly and reports the precise error
code based on the last error seen on the GPU.
Behavior changes
^^^^^^^^^^^^^^^^
See :ref:`HIP compatibility <hip-known-limitation>` for more information about upcoming API changes.
* To align with NVIDIA CUDA behavior, ``hipGetLastError`` now returns the last actual
error code caught in the current thread during application execution -- neither
``hipSuccess`` nor ``hipErrorNotReady`` are considered errors.
``hipExtGetLastError`` retains the previous behavior of ``hipGetLastError``.
Compilers
---------
* Cooperative groups: added stricter input parameter validation to
``hipLaunchCooperativeKernel`` and ``hipLaunchCooperativeKernelMultiDevice``.
The Alpha 2 release introduces the AMD Next-Gen Fortran compiler. ``llvm-flang``
(sometimes called ``new-flang`` or ``flang-18``) is a re-implementation of the
Fortran frontend. It is a strategic replacement for ``classic-flang`` and is
developed in LLVM's upstream repo at `<https://github.com/llvm/llvm-project/tree/main/flang>`__.
* ``hipPointerGetAttributes`` now returns ``hipSuccess`` instead of
``hipErrorInvalidValue`` when a null pointer is passed as an input parameter,
aligning its behavior with ``cudaPointerGetAttributes`` (CUDA 11+).
Key enhancements include:
* ``hipFree`` no longer performs an implicit device-wide wait when freeing
memory allocated with ``hipMallocAsync`` or ``hipMallocFromPoolAsync``. This
matches the behavior of ``cudaFree``.
* Compiler:
hipRTC changes
^^^^^^^^^^^^^^
* Improved memory load and store instructions.
* ``hipRTC`` symbols are now removed from the HIP runtime library.
Any application using hipRTC APIs must link explicitly with the hipRTC
library. This makes usage of the hipRTC library on Linux the same as on
Windows and matches the behavior of CUDA nvRTC.
* Updated clang/llvm to `AMD clang version 20.0.0git` (equivalent to LLVM 20.0.0 with additional out-of-tree patches).
* hipRTC compilation: device code compilation now uses the
``__hip_internal`` namespace instead of the standard ``std`` headers, to
avoid namespace collisions.
* Support added for separate debug file generation for device code.
* Datatype definitions such as ``int64_t``, ``uint64_t``, ``int32_t``,
``uint32_t``, and so on, are removed to avoid any potential conflicts in
some applications as they use their own definitions for these types. HIP
now uses internal datatypes instead, prefixed with ``__hip`` -- for
example, ``__hip_int64_t``.
* Comgr:
HIP header clean up
^^^^^^^^^^^^^^^^^^^
* Added support for an in-memory virtual file system (VFS) for storing temporary files
generated during intermediate compilation steps. This is designed to
improve performance by reducing on-disk file I/O. Currently, VFS is
supported only for the device library link step, with plans for expanded
support in future releases.
* Removed non-essential C++ standard library headers; HIP header files now
only include necessary STL headers.
* SPIR-V:
* The deprecated struct ``HIP_MEMSET_NODE_PARAMS`` is now removed from the
API. Developers can use the definition ``hipMemsetParams`` instead.
* Improved `target-specific extensions <https://github.com/ROCm/llvm-project/blob/c2535466c6e40acd5ecf6ba1676a4e069c6245cc/clang/docs/LanguageExtensions.rst>`_:
API changes
^^^^^^^^^^^
* Added a new target-specific builtin ``__builtin_amdgcn_processor_is`` for late or deferred queries of the current target processor.
* Some APIs' signatures have been adjusted to match corresponding CUDA counterparts. Impacted
APIs are:
* Added a new target-specific builtin ``__builtin_amdgcn_is_invocable``, enabling fine-grained, per-builtin feature availability.
* ``hiprtcCreateProgram``
* HIPIFY now supports NVIDIA CUDA 12.8.0 APIs:
* ``hiprtcCompileProgram``
* Added support for all new device and host APIs, including ``FP4``, ``FP6``, and ``FP128`` -- including support for the corresponding ROCm HIP equivalents.
* ``hipMemcpyHtoD``
* Deprecated features:
* ``hipCtxGetApiVersion``
* ROCm components no longer use the ``__AMDGCN_WAVEFRONT_SIZE`` and
``__AMDGCN_WAVEFRONT_SIZE__`` macros nor HIP's ``warpSize`` variable as
``constexpr``s. These macros and reliance on ``warpSize`` as a ``constexpr`` are
deprecated and will be disabled in a future release. Users are encouraged
to update their code if needed to ensure future compatibility.
* Updated ``hipMemsetParams`` for compatibility with the CUDA equivalent structure.
* HIP vector constructors for ``hipComplex`` initialization now generate
correct values. The affected constructors are small vector types such as
``float2``, ``int4``, and so on.
Stream capture
^^^^^^^^^^^^^^
Stream capture mode is now more restrictive in HIP APIs through the addition
of the ``CHECK_STREAM_CAPTURE_SUPPORTED`` macro.
* HIP now only supports ``hipStreamCaptureModeRelaxed``. Attempts to initiate
stream capture with any other mode will fail and return
``hipErrorStreamCaptureUnsupported``. Consequently, the following APIs are
only permitted in Relaxed mode and will return an error if called during
capture with a now disallowed mode:
* ``hipMallocManaged``
* ``hipMemAdvise``
* The following APIs check the stream capture mode and return error codes, matching
CUDA behavior:
* ``hipLaunchCooperativeKernelMultiDevice``
* ``hipEventQuery``
* ``hipStreamAddCallback``
* During stream capture, the following HIP APIs now return the error
``hipErrorStreamCaptureUnsupported`` on the AMD platform instead of always
returning ``hipSuccess``. This aligns with CUDA's behavior.
* ``hipDeviceSetMemPool``
* ``hipMemPoolCreate``
* ``hipMemPoolDestroy``
* ``hipDeviceSetSharedMemConfig``
* ``hipDeviceSetCacheConfig``
Error codes
^^^^^^^^^^^
Error and value codes returned by HIP APIs have been updated to align with
their CUDA counterparts.
* Module management-related APIs: ``hipModuleLaunchKernel``,
``hipExtModuleLaunchKernel``, ``hipExtLaunchKernel``,
``hipDrvLaunchKernelEx``, ``hipLaunchKernel``, ``hipLaunchKernelExC``,
``hipModuleLaunchCooperativeKernel``, ``hipModuleLoad``
* Texture management-related APIs:
* ``hipTexObjectCreate`` -- now supports zero width and height for 2D images. If
either is zero, the call does not fail.
* ``hipBindTexture2D`` -- now returns ``hipErrorNotFound`` for null texture or device pointers.
* ``hipBindTextureToArray`` -- now returns
``hipErrorInvalidChannelDescriptor`` (instead of ``hipErrorInvalidValue``)
for null inputs.
* ``hipGetTextureAlignmentOffset`` -- now returns ``hipErrorInvalidTexture``
for a null texture reference.
* Cooperative group-related APIs: added stricter validations to ``hipLaunchCooperativeKernelMultiDevice`` and ``hipLaunchCooperativeKernel``
Invalid stream input parameter handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To match the CUDA runtime behavior more closely, HIP APIs that take streams
as input parameters no longer validate the stream. Previously, the HIP
runtime returned the error code ``hipErrorContextIsDestroyed`` if the stream was
invalid. In CUDA version 12 and later, the equivalent behavior is to raise a
segmentation fault. In HIP 7.0, the HIP runtime matches CUDA by causing a
segmentation fault. The APIs impacted by this change are as follows:
* Stream management-related APIs: ``hipStreamGetCaptureInfo``,
``hipStreamGetPriority``, ``hipStreamGetFlags``, ``hipStreamDestroy``,
``hipStreamAddCallback``, ``hipStreamQuery``, ``hipLaunchHostFunc``
* Graph management-related APIs: ``hipGraphUpload``, ``hipGraphLaunch``,
``hipStreamBeginCaptureToGraph``, ``hipStreamBeginCapture``,
``hipStreamIsCapturing``, ``hipStreamGetCaptureInfo``,
``hipGraphInstantiateWithParams``
* Memory management-related APIs: ``hipMemcpyPeerAsync``,
``hipMemcpy2DValidateParams``, ``hipMallocFromPoolAsync``, ``hipFreeAsync``,
``hipMallocAsync``, ``hipMemcpyAsync``, ``hipMemcpyToSymbolAsync``,
``hipStreamAttachMemAsync``, ``hipMemPrefetchAsync``, ``hipDrvMemcpy3D``,
``hipDrvMemcpy3DAsync``, ``hipDrvMemcpy2DUnaligned``, ``hipMemcpyParam2D``,
``hipMemcpyParam2DAsync``, ``hipMemcpy2DArrayToArray``, ``hipMemcpy2D``,
``hipMemcpy2DAsync``, ``hipDrvMemcpy2DUnaligned``, ``hipMemcpy3D``
* Event management-related APIs: ``hipEventRecord``,
``hipEventRecordWithFlags``
warpSize
^^^^^^^^
To align with the CUDA specification, the ``warpSize`` device variable is no longer a
compile-time constant (``constexpr``). This is a backwards-incompatible change for applications
that use ``warpSize`` in a compile-time context.
ROCprofiler-SDK and rocprofv3
-----------------------------
rocpd
~~~~~
Support has been added for the ``rocpd`` (ROCm Profiling Data) output format,
which is now the default format for rocprofv3. A subproject of the
ROCprofiler-SDK, ``rocpd`` enables saving profiling results to a SQLite3
database, providing a structured and efficient foundation for analysis and
post-processing.
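For example, a trace collected with rocprofv3 can land in a rocpd database and
be inspected with standard SQLite tooling. The flags, application name, and
database file name below are illustrative assumptions rather than an exhaustive
reference.
.. code-block:: shell
# Collect a HIP API and kernel trace; with rocpd as the default output
# format, the results are written to a SQLite3 database
rocprofv3 --hip-trace --kernel-trace -- ./my_hip_app
# Inspect the generated database (the actual file name will differ)
sqlite3 results.db ".tables"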
Core SDK enhancements
~~~~~~~~~~~~~~~~~~~~~
* ROCprofiler-SDK is now compatible with the HIP 7.0 API.
* Added stochastic and host-trap PC sampling support for all MI300 series accelerators.
* Added support for tracing KFD events.
rocprofv3 CLI tool enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Added stochastic and host-trap PC sampling support for all MI300 series accelerators.
* HIP streams translate to Queues in Time Traces in Perfetto output.
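For instance, to view HIP streams as queues on a Perfetto timeline, a trace can
be exported in the pftrace format and opened at https://ui.perfetto.dev. The
flags and application name below are assumptions based on typical rocprofv3
usage.
.. code-block:: shell
rocprofv3 --sys-trace --output-format pftrace -- ./my_hip_app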
Instinct Driver / ROCm packaging separation
-------------------------------------------
@@ -152,17 +337,4 @@ Instinct Datacenter GPU Driver
information.
Forward and backward compatibility between the Instinct Driver and ROCm is not supported in the
Alpha 2 release. See the :doc:`installation instructions <install/index>`.
Known limitations
=================
.. _hip-known-limitation:
HIP compatibility
-----------------
HIP runtime APIs in the ROCm 7.0 Alpha 2 don't include the upcoming backward-incompatible changes. See `HIP 7.0 Is
Coming: What You Need to Know to Stay Ahead
<https://rocm.blogs.amd.com/ecosystems-and-partners/transition-to-hip-7.0-blog/README.html>`_ to learn about the
changes expected for HIP.
Beta release. See the :doc:`installation instructions <install/index>`.

View File

@@ -7,7 +7,7 @@ root: preview/index
subtrees:
- entries:
- file: preview/release.rst
title: Alpha 2 release notes
title: Beta release notes
- file: preview/install/index.rst
title: Installation
subtrees:
@@ -16,3 +16,22 @@ subtrees:
title: Install ROCm
- file: preview/install/instinct-driver
title: Install Instinct Driver
- file: preview/benchmark-docker/index.rst
title: Docker images for AI
subtrees:
- entries:
- file: preview/benchmark-docker/training.rst
title: Training benchmarks
subtrees:
- entries:
- file: preview/benchmark-docker/training-megatron-lm-llama-3.rst
- file: preview/benchmark-docker/training-torchtitan-llama-3.rst
- file: preview/benchmark-docker/training-mlperf-fine-tuning-llama-2-70b.rst
- file: preview/benchmark-docker/inference.rst
title: Inference benchmarks
subtrees:
- entries:
- file: preview/benchmark-docker/inference-vllm-llama-3.1-405b-fp4.rst
- file: preview/benchmark-docker/inference-vllm-llama-3.3-70b-fp8.rst
- file: preview/benchmark-docker/inference-vllm-gpt-oss-120b.rst
- file: preview/benchmark-docker/inference-sglang-deepseek-r1-fp4.rst

View File

@@ -11,5 +11,5 @@ ready(() => {
"a.header-all-versions[href='https://rocm.docs.amd.com/en/latest/release/versions.html']",
);
versionListLink.textContent = "Preview versions"
versionListLink.href = "https://rocm.docs.amd.com/en/docs-7.0-alpha-2/preview/versions.html"
versionListLink.href = "https://rocm.docs.amd.com/en/docs-7.0-beta/preview/versions.html"
});