Compare commits

92 Commits

Author SHA1 Message Date
David Dixon
65a6be8968 Update catch2.yml 2025-10-20 11:14:07 -06:00
David Dixon
05ceeadbdd Update catch2.yml 2025-10-20 11:11:07 -06:00
David Dixon
04ec795b25 Update catch2.yml 2025-10-20 09:41:37 -06:00
David Dixon
1d0dbb055b Update catch2.yml 2025-10-20 09:25:01 -06:00
David Dixon
0cee66595d Update catch2.yml 2025-10-20 09:18:48 -06:00
David Dixon
3eb8afa396 Update catch2.yml 2025-10-20 09:14:38 -06:00
David Dixon
eeaf74a1b3 Update catch2.yml 2025-10-20 09:13:14 -06:00
David Dixon
3bd25e56aa Update catch2.yml 2025-10-20 09:12:25 -06:00
David Dixon
afef52f7b4 Update catch2.yml 2025-10-20 09:10:43 -06:00
David Dixon
511b770c6f Update build-cmake.yml 2025-10-20 08:48:58 -06:00
David Dixon
a87f498fc4 Update build-cmake.yml 2025-10-20 08:45:15 -06:00
David Dixon
44e052a91c Update build-cmake.yml 2025-10-20 08:43:28 -06:00
David Dixon
0d4ed51676 Update build-cmake.yml 2025-10-20 08:41:53 -06:00
David Dixon
411cd1eff5 Update build-cmake.yml 2025-10-20 08:39:55 -06:00
David Dixon
abf88de698 Update build-cmake.yml 2025-10-20 08:38:04 -06:00
David Dixon
247491bc3a Update build-cmake.yml 2025-10-20 08:33:05 -06:00
David Dixon
9ac9819d26 Update build-cmake.yml 2025-10-20 08:26:51 -06:00
David Dixon
1e7332eb38 Update build-cmake.yml 2025-10-20 08:20:08 -06:00
David Dixon
00fbcf3c67 Update build-cmake.yml 2025-10-19 21:57:05 -06:00
David Dixon
bc1a0a6753 Update build-cmake.yml 2025-10-19 21:46:19 -06:00
David Dixon
292dc2c6fe Update build-cmake.yml 2025-10-19 21:44:15 -06:00
David Dixon
0153ef56cb Update build-cmake.yml 2025-10-19 21:34:31 -06:00
David Dixon
c4ab8e5ca5 Update build-cmake.yml 2025-10-19 18:03:19 -06:00
David Dixon
65ea838d25 Update build-cmake.yml 2025-10-19 16:03:31 -06:00
David Dixon
b9b0278d7c Update build-cmake.yml 2025-10-19 15:21:58 -06:00
David Dixon
f395cfc6d7 Update build-cmake.yml 2025-10-19 15:21:06 -06:00
David Dixon
ae229eb9cc Update build-cmake.yml 2025-10-19 15:18:03 -06:00
David Dixon
0dab4939e0 Update build-cmake.yml 2025-10-19 15:16:12 -06:00
David Dixon
911df89756 Update build-cmake.yml 2025-10-19 15:15:28 -06:00
David Dixon
191b70bf41 Update build-cmake.yml 2025-10-19 15:14:51 -06:00
David Dixon
de48c7e14c Update build-cmake.yml 2025-10-19 15:14:18 -06:00
David Dixon
a99b1a3f63 Update build-cmake.yml 2025-10-19 15:12:45 -06:00
David Dixon
44044cc706 Update build-cmake.yml 2025-10-19 15:12:02 -06:00
David Dixon
cac275b5a4 Update build-cmake.yml 2025-10-19 15:10:26 -06:00
David Dixon
934d33b466 Update build-cmake.yml 2025-10-19 15:08:26 -06:00
David Dixon
c172c91d4d Update build-cmake.yml 2025-10-19 15:00:09 -06:00
David Dixon
e9d7618f9f Update build-cmake.yml 2025-10-19 14:50:13 -06:00
David Dixon
5eea99eaba Update build-cmake.yml 2025-10-19 14:49:05 -06:00
David Dixon
6ed3f30af3 Update build-cmake.yml 2025-10-19 14:46:37 -06:00
David Dixon
fded8ab72e Update build-cmake.yml 2025-10-19 14:45:45 -06:00
David Dixon
34e1eb325e Update build-cmake.yml 2025-10-19 14:44:29 -06:00
David Dixon
8694bac132 Update build-cmake.yml 2025-10-19 14:21:08 -06:00
David Dixon
f4e1b3a7d2 Update build-cmake.yml 2025-10-19 14:02:03 -06:00
David Dixon
96fe3c01f9 Update catch2.yml 2025-10-19 13:31:38 -06:00
David Dixon
89bf1105ee Update build-cmake.yml 2025-10-19 13:21:20 -06:00
David Dixon
a307e368f2 Update build-cmake.yml 2025-10-19 12:36:49 -06:00
David Dixon
a1b081f20a Update build-cmake.yml 2025-10-19 12:21:01 -06:00
David Dixon
c8dace0145 Update catch2.yml 2025-10-19 12:10:58 -06:00
David Dixon
aad89e4a52 print everything 2025-10-19 11:37:35 -06:00
David Dixon
ead9696b45 Build with verbosity 2025-10-19 11:08:40 -06:00
Pratik Basyal
fd6bbe18a7 PLDM update for MI250 and MI210 [Develop] (#5537)
* PLDM update for MI250 and MI210

* PLDM update
2025-10-17 17:13:42 -04:00
peterjunpark
a613bd6824 JAX Maxtext v25.9 doc update (#5532)
* archive previous version (25.7)

* update docker components list for 25.9

* update template

* update docker pull tag

* update

* fix intro
2025-10-17 11:31:06 -04:00
peterjunpark
14bb59fca9 Update Megatron/PyTorch Primus 25.9 docs (#5528)
* add previous versions

* Fix heading levels in pages using embedded templates (#5468)

* update primus-megatron doc

update megatron-lm doc

update templates

fix tab

update primus-megatron model configs

Update primus-pytorch model configs

fix css class

add posttrain to pytorch-training template

update data sheets

update

update

update

update docker tags

* Add known issue and update Primus/Turbo versions

* add primus ver to histories

* update primus ver to 0.1.1

* fix leftovers from merge conflict
2025-10-16 12:51:30 -04:00
anisha-amd
a98236a4e3 Main Docs: references of accelerator removal and change to GPU (#5495)
* Docs: references of accelerator removal and change to GPU

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
Co-authored-by: Pratik Basyal <pratik.basyal@amd.com>
2025-10-16 11:22:10 -04:00
David Dixon
5cb6bfe151 Add yaml-cpp to dependencies 2025-10-16 07:26:06 -06:00
David Dixon
6e7422ded7 Update cli11.yml for Azure Pipelines (#5523) 2025-10-15 10:47:29 -06:00
Istvan Kiss
7b7ff53985 Update Radeon link (#5453) 2025-10-15 17:25:05 +02:00
David Dixon
019796dc63 [external] Create cli11.yml (#5522) 2025-10-15 09:19:56 -06:00
Pratik Basyal
f21cfe1171 GitHub issue added to 702 known issues (#5520)
* GitHub issue added to 702 known issues

* Added missing RCCL changelog
2025-10-15 09:58:23 -04:00
Jan Stephan
170cb47a4f Merge pull request #5512 from j-stephan/rocm-examples-deps
[Ex CI] Add libtiff-dev, libopencv-dev and rpp
2025-10-15 10:02:46 +02:00
Braden Stefanuk
d19a8e4a83 [superbuild] Add dependencies for hipblaslt and origami (#5487)
* ci: add deps for origami in superbuild

* ci: add rocm path to system path

* build: add pip msgpack dep
2025-10-14 16:05:24 -06:00
amd-hsivasun
3a0b8529ed [Ex CI] Added MIOpen to the test dependencies for rocm-examples (#5517) 2025-10-14 14:56:36 -04:00
Joseph Macaranas
f9d7fc2e6a [External CI] Add libsimde-dev to ROCR pipeline (#5515) 2025-10-14 14:24:45 -04:00
Nilesh M Negi
d424687191 [Ex CI] Increase RCCL build time limit to 120mins (#5516) 2025-10-14 12:59:40 -05:00
Jan Stephan
35e6e50888 [Ex CI] Add libopencv-dev
Signed-off-by: Jan Stephan <jan.stephan@amd.com>
2025-10-13 20:00:25 +02:00
Jan Stephan
91cfe98eb3 [Ex CI] Add libtiff-dev and rpp
Signed-off-by: Jan Stephan <jan.stephan@amd.com>
2025-10-13 17:42:59 +02:00
Pratik Basyal
036aaa2e78 ROCm for HPC topic updated Develop (#5504)
* ROCm for HPC topic updated

* ROCm for HPC topic udpated

* Minor editorial
2025-10-10 22:31:51 -04:00
Pratik Basyal
78258e0f85 702 compatibility Footnote updated (#5502)
* Footnote updated

* Minor update

* Minor update

* Break added

* Line break added

* Line break

* Footnote updated

* Minor correction
2025-10-10 21:23:07 -04:00
amd-hsong
c79d9f74ef Merge pull request #5490 Re-enable device_merge_inplace unit test for rocPRIM 2025-10-10 15:03:23 -06:00
amd-hsivasun
fb1b78c6f0 [Ex CI] Added Component and Module Dependencies (#5489)
* [Ex CI] Added Component and Module Dependencies

* Add registerROCmPackages flag
2025-10-10 16:01:11 -04:00
peterjunpark
3a70d75f5e Fix documented AMD SMI version (ROCm 7.0.2) (#5496) 2025-10-10 15:09:20 -04:00
alexxu-amd
61e1f088a1 Merge pull request #5492 from ROCm/sync-dev-from-internal
Sync dev from internal for 7.0.2 GA
2025-10-10 11:17:32 -04:00
Pratik Basyal
1f6e5c5e04 Update compatibility-matrix.rst 2025-10-10 11:10:48 -04:00
Pratik Basyal
e8a0769842 Update RELEASE.md 2025-10-10 11:07:51 -04:00
Alex Xu
6f9579d052 Merge remote-tracking branch 'internal/develop' into sync-dev-from-internal 2025-10-10 11:02:33 -04:00
Pratik Basyal
245d53a021 Merge pull request #579 from prbasyal-amd/post-rc3-702-update
GPU resiliency highlight updated 702
2025-10-10 11:00:59 -04:00
Alex Xu
35dbbb22bc fix linting 2025-10-10 10:29:13 -04:00
alexxu-amd
03dc8cee00 Merge pull request #584 from ROCm/sync-dev-from-external
Sync dev from external
2025-10-10 10:14:56 -04:00
Alex Xu
323e5fd27a Merge remote-tracking branch 'external/develop' into sync-dev-from-external 2025-10-10 10:13:08 -04:00
alexxu-amd
b11fd7b492 Update versions.md (#583) 2025-10-10 09:31:24 -04:00
srayasam-amd
5e2efa05a6 7.0.2 GA update (#5491)
* 7.0.2 GA update

* Create rocm-7.0.2.xml
2025-10-10 18:47:48 +05:30
Hao Song
29a90f0271 [rocPRIM] Re-enable device_merge_inplace unit test for rocPRIM 2025-10-09 21:48:11 +00:00
randyh62
c06242bb89 Update RELEASE.md (#581)
* Update RELEASE.md

Remove support for rocBlas and hipBlasLt

* Update CHANGELOG.md

Removed from the Changelog as well.
2025-10-09 13:15:08 -07:00
alexxu-amd
978c58d196 Merge pull request #577 from ROCm/sync-develop-from-external
Sync develop from external
2025-10-08 14:25:03 -04:00
alexxu-amd
a366048b64 Merge branch 'develop' into sync-develop-from-external 2025-10-08 14:12:14 -04:00
Pratik Basyal
4c3e33c291 Compatibility matrix and changelog synced for ROCm 7.0.2 (#576)
* Compatibility matrix and changelog synced

* Indentation updated

* OS updated
2025-10-08 14:11:15 -04:00
Alex Xu
89758e67d8 Merge remote-tracking branch 'external/develop' into sync-develop-from-external 2025-10-08 14:03:34 -04:00
Pratik Basyal
5d0f201b4d 7.0.2 review update (#575)
* 7.0.2 review update

* Tensorflow footnote updated

* Wordlist added
2025-10-08 12:35:14 -04:00
Pratik Basyal
e3677d89a6 PLDM bundle info updated for 7.0.2 (#574)
* PLDM bundle info updated

* Driver dependency added to GPU resiliency

* Known issue for Migrpahx added

* Footnote added

* Known issue for OpenCV updated

* Leo's feedback incorporated

* Radeon 9060 updated

* Known issues updated
2025-10-08 11:00:42 -04:00
Pratik Basyal
6f84d50011 ROCm 7.0.2 Post RC3 update (#573)
* Space minimized

* OS support updated

* Minor change
2025-10-06 14:08:01 -04:00
Pratik Basyal
57dd082f28 Post RC2 7.0.2 review feedback updated (#571)
* Known issue updated

* Space optimized

* Changelog updated

* Apply suggestions from code review

Leo's review feedback incorporated

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Highlight changes

* Highlight and OS support updated

* GPU resiliency highlight updated

* Highlights updated

* ROCm-EP deprecation added

* Apply suggestions from code review

leo's feedback incorporated

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* PLDM update

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2025-10-06 12:04:09 -04:00
Pratik Basyal
5c7b993c0c 7.0.2 release changes (#568)
* Initial changes for 7.0.2

* Heading level updated

* Release notes changes

* rocsolver added

* Known issues updated

* Highlights updated

* RN changes

* Release highlights for AI applications updated

* AI developer contents added

* leo's review feedback added

* Compatibility matrix updated

* GPU driver support
2025-09-30 14:02:04 -04:00
130 changed files with 6181 additions and 3698 deletions

View File

@@ -37,6 +37,7 @@ parameters:
- libdrm-dev
- libelf-dev
- libnuma-dev
- libsimde-dev
- ninja-build
- pkg-config
- name: rocmDependencies

View File

@@ -70,7 +70,7 @@ parameters:
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: rccl_build_${{ job.target }}
-timeoutInMinutes: 90
+timeoutInMinutes: 120
variables:
- group: common
- template: /.azuredevops/variables-global.yml

View File

@@ -210,7 +210,7 @@ jobs:
parameters:
componentName: ${{ parameters.componentName }}
testDir: '$(Agent.BuildDirectory)/rocm/bin/rocprim'
-extraTestParameters: '-I ${{ job.shard }},,${{ job.shardCount }} -E device_merge_inplace'
+extraTestParameters: '-I ${{ job.shard }},,${{ job.shardCount }}'
os: ${{ job.os }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/docker-container.yml
parameters:
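
The `extraTestParameters` value in this hunk uses a start/end/stride selector. Assuming the test step hands these flags to `ctest` (the hunk does not name the runner), `-I Start,End,Stride` with an empty `End` selects every `Stride`-th test beginning at `Start`, so the shards run disjoint slices of the suite. A minimal Python sketch of that selection follows; `shard_indices` is a hypothetical helper, and 1-based shard numbering is an assumption.

```python
def shard_indices(num_tests, shard, shard_count):
    """Return the 1-based test numbers picked by `-I shard,,shard_count`.

    Assumption: tests are numbered 1..num_tests and shards are 1-based.
    """
    if shard < 1 or shard_count < 1:
        raise ValueError("shard and shard_count must be positive")
    # Start at `shard`, step by `shard_count`, stop at the last test.
    return list(range(shard, num_tests + 1, shard_count))


if __name__ == "__main__":
    # Show how 10 tests would be divided across 4 shards.
    for shard in range(1, 5):
        print(shard, shard_indices(10, shard, 4))
```

Under these assumptions every test lands in exactly one shard, which is why dropping the `-E device_merge_inplace` exclusion re-enables that test in whichever shard owns its index.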

View File

@@ -14,9 +14,13 @@ parameters:
type: object
default:
- cmake
- libdw-dev
- libglfw3-dev
- libmsgpack-dev
- libopencv-dev
- libtbb-dev
- libtiff-dev
- libva-amdgpu-dev
- ninja-build
- python3-pip
- name: rocmDependencies
@@ -35,6 +39,7 @@ parameters:
- hipSPARSE
- hipTensor
- llvm-project
- MIOpen
- rocBLAS
- rocFFT
- rocJPEG
@@ -47,6 +52,7 @@ parameters:
- rocSPARSE
- rocThrust
- rocWMMA
- rpp
- name: rocmTestDependencies
type: object
default:
@@ -63,6 +69,7 @@ parameters:
- hipSPARSE
- hipTensor
- llvm-project
- MIOpen
- rocBLAS
- rocFFT
- rocminfo
@@ -77,6 +84,7 @@ parameters:
- rocThrust
- roctracer
- rocWMMA
- rpp
- name: jobMatrix
type: object
@@ -105,6 +113,7 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
registerROCmPackages: true
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
parameters:
cmakeVersion: '3.25.0'
@@ -169,6 +178,7 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
registerROCmPackages: true
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
parameters:
cmakeVersion: '3.25.0'

View File

@@ -43,9 +43,14 @@ parameters:
- ninja-build
- python3-pip
- python3-venv
- googletest
- libgtest-dev
- libgmock-dev
- libboost-filesystem-dev
- name: pipModules
type: object
default:
- msgpack
- joblib
- "packaging>=22.0"
- pytest
@@ -147,6 +152,13 @@ jobs:
echo "##vso[task.prependpath]$USER_BASE/bin"
echo "##vso[task.setvariable variable=PytestCmakePath]$USER_BASE/share/Pytest/cmake"
displayName: Set cmake configure paths
- task: Bash@3
displayName: Add ROCm binaries to PATH
inputs:
targetType: inline
script: |
echo "##vso[task.prependpath]$(Agent.BuildDirectory)/rocm/bin"
echo "##vso[task.prependpath]$(Agent.BuildDirectory)/rocm/llvm/bin"
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}

View File

@@ -44,20 +44,42 @@ jobs:
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- task: Bash@3
-displayName: Clone catch2 ${{ parameters.catch2Version }}
+displayName: Clone catch2 ${{ parameters.catch2Version }} and apply patch
inputs:
targetType: inline
script: git clone https://github.com/catchorg/Catch2.git -b ${{ parameters.catch2Version }}
workingDirectory: $(Agent.BuildDirectory)
script: |
git clone https://github.com/catchorg/Catch2.git -b ${{ parameters.catch2Version }}
cd Catch2
cat <<'EOF' > catch2_env_patch.diff
index 9bd7e01d..c967bf82 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,5 +1,9 @@
cmake_minimum_required(VERSION 3.10)
+execute_process(COMMAND ${CMAKE_COMMAND} -E environment
+ OUTPUT_VARIABLE ENV_OUTPUT)
+message(STATUS "Environment variables:\n${ENV_OUTPUT}")
+
# detect if Catch is being bundled,
# disable testsuite in that case
if(NOT DEFINED PROJECT_NAME)
EOF
patch -p1 < catch2_env_patch.diff
workingDirectory: $(Agent.BuildDirectory)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}
cmakeBuildDir: $(Agent.BuildDirectory)/Catch2/build
cmakeSourceDir: $(Agent.BuildDirectory)/Catch2
-useAmdclang: false
+useAmdclang: true
extraBuildFlags: >-
-DCMAKE_VERBOSE_MAKEFILE:BOOL=ON
-DCMAKE_BUILD_TYPE=Release
-GNinja
--trace-expand
--debug-output
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
parameters:
os: ${{ job.os }}

View File

@@ -0,0 +1,63 @@
parameters:
- name: checkoutRepo
type: string
default: 'self'
- name: checkoutRef
type: string
default: ''
- name: cli11Version
type: string
default: ''
- name: aptPackages
type: object
default:
- cmake
- git
- ninja-build
- name: jobMatrix
type: object
default:
buildJobs:
- { os: ubuntu2204, packageManager: apt}
- { os: almalinux8, packageManager: dnf}
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: cli11_${{ job.os }}
variables:
- group: common
- template: /.azuredevops/variables-global.yml
pool:
vmImage: 'ubuntu-22.04'
${{ if eq(job.os, 'almalinux8') }}:
container:
image: rocmexternalcicd.azurecr.io/manylinux228:latest
endpoint: ContainerService3
workspace:
clean: all
steps:
- checkout: none
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- task: Bash@3
displayName: Clone cli11 ${{ parameters.cli11Version }}
inputs:
targetType: inline
script: git clone https://github.com/CLIUtils/CLI11.git -b ${{ parameters.cli11Version }}
workingDirectory: $(Agent.BuildDirectory)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}
cmakeBuildDir: $(Agent.BuildDirectory)/CLI11/build
cmakeSourceDir: $(Agent.BuildDirectory)/CLI11
useAmdclang: false
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
parameters:
os: ${{ job.os }}

View File

@@ -0,0 +1,66 @@
parameters:
- name: checkoutRepo
type: string
default: 'self'
- name: checkoutRef
type: string
default: ''
- name: yamlcppVersion
type: string
default: ''
- name: aptPackages
type: object
default:
- cmake
- git
- ninja-build
- name: jobMatrix
type: object
default:
buildJobs:
- { os: ubuntu2204, packageManager: apt}
- { os: almalinux8, packageManager: dnf}
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: yamlcpp_${{ job.os }}
variables:
- group: common
- template: /.azuredevops/variables-global.yml
pool:
vmImage: 'ubuntu-22.04'
${{ if eq(job.os, 'almalinux8') }}:
container:
image: rocmexternalcicd.azurecr.io/manylinux228:latest
endpoint: ContainerService3
workspace:
clean: all
steps:
- checkout: none
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- task: Bash@3
displayName: Clone yaml-cpp ${{ parameters.yamlcppVersion }}
inputs:
targetType: inline
script: git clone https://github.com/jbeder/yaml-cpp.git -b ${{ parameters.yamlcppVersion }}
workingDirectory: $(Agent.BuildDirectory)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}
cmakeBuildDir: $(Agent.BuildDirectory)/yaml-cpp/build
cmakeSourceDir: $(Agent.BuildDirectory)/yaml-cpp
useAmdclang: false
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DYAML_CPP_BUILD_TOOLS=OFF
-DYAML_BUILD_SHARED_LIBS=OFF
-DYAML_CPP_INSTALL=ON
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
parameters:
os: ${{ job.os }}

View File

@@ -0,0 +1,23 @@
variables:
- group: common
- template: /.azuredevops/variables-global.yml
parameters:
- name: cli11Version
type: string
default: "main"
resources:
repositories:
- repository: pipelines_repo
type: github
endpoint: ROCm
name: ROCm/ROCm
trigger: none
pr: none
jobs:
- template: ${{ variables.CI_DEPENDENCIES_PATH }}/cli11.yml
parameters:
cli11Version: ${{ parameters.cli11Version }}

View File

@@ -0,0 +1,24 @@
variables:
- group: common
- template: /.azuredevops/variables-global.yml
parameters:
- name: yamlcppVersion
type: string
default: "0.8.0"
resources:
repositories:
- repository: pipelines_repo
type: github
endpoint: ROCm
name: ROCm/ROCm
trigger: none
pr: none
jobs:
- template: ${{ variables.CI_DEPENDENCIES_PATH }}/yamlcpp.yml
parameters:
yamlcppVersion: ${{ parameters.yamlcppVersion }}

View File

@@ -57,43 +57,41 @@ parameters:
steps:
# create workingDirectory if it does not exist and change into it
# call cmake from within that directory using $cmakeArgs as its parameters
- ${{ if eq(parameters.os, 'almalinux8') }}:
- script: |
cd $AGENT_BUILDDIRECTORY
pwd
echo "##vso[task.setvariable variable=PATH]$PATH:/opt/rh/gcc-toolset-14/root/usr/bin"
echo "##vso[task.setvariable variable=MANPATH]$MANPATH:/opt/rh/gcc-toolset-14/root/usr/share/man"
echo "##vso[task.setvariable variable=INFOPATH]$INFOPATH:/opt/rh/gcc-toolset-14/root/usr/share/info"
echo "##vso[task.setvariable variable=PCP_DIR]/opt/rh/gcc-toolset-14/root"
echo "##vso[task.setvariable variable=LD_LIBRARY_PATH]$LD_LIBRARY_PATH:/opt/rh/gcc-toolset-14/root/usr/lib64:/opt/rh/gcc-toolset-14/root/usr/lib"
echo "##vso[task.setvariable variable=PKG_CONFIG_PATH]$PKG_CONFIG_PATH:/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig"
echo "##vso[task.setvariable variable=CMAKE_TOOLCHAIN_FILE]$AGENT_BUILDDIRECTORY/ci-toolchain.cmake"
echo "set(CMAKE_SHARED_LINKER_FLAGS_INIT \"-L$AGENT_BUILDDIRECTORY/rocm/lib64 -L/opt/rh/gcc-toolset-14/root/usr/lib/gcc/x86_64-redhat-linux/14/\")" > ci-toolchain.cmake
echo "set(CMAKE_EXE_LINKER_FLAGS_INIT \"-L$AGENT_BUILDDIRECTORY/rocm/lib64 -L/opt/rh/gcc-toolset-14/root/usr/lib/gcc/x86_64-redhat-linux/14/\")" >> ci-toolchain.cmake
echo "set(CMAKE_CXX_FLAGS_INIT \"--gcc-toolchain=/opt/rh/gcc-toolset-14/root\")" >> ci-toolchain.cmake
env:
AGENT_BUILDDIRECTORY: $(Agent.BuildDirectory)
displayName: '${{parameters.componentName }} CMake Env'
- task: CMake@1
displayName: '${{parameters.componentName }} CMake Flags'
${{ if eq(parameters.os, 'almalinux8')}}:
env:
PATH: "/opt/rh/gcc-toolset-14/root/usr/bin:$(PATH)"
MANPATH: "/opt/rh/gcc-toolset-14/root/usr/share/man:$(MANPATH)"
INFOPATH: "/opt/rh/gcc-toolset-14/root/usr/share/info:$(INFOPATH)"
PCP_DIR: "/opt/rh/gcc-toolset-14/root"
LD_LIBRARY_PATH: "/opt/rh/gcc-toolset-14/root/usr/lib64:/opt/rh/gcc-toolset-14/root/usr/lib:$(LD_LIBRARY_PATH)"
PKG_CONFIG_PATH: "/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig:$(PKG_CONFIG_PATH)"
inputs:
workingDirectory: ${{ parameters.cmakeBuildDir }}
cmakeArgs: >-
${{ iif(parameters.customInstallPath, join('', format('-DCMAKE_INSTALL_PREFIX={0}', parameters.installDir)), '') }}
${{ iif(eq(parameters.os, 'almalinux8'), '-DCMAKE_SHARED_LINKER_FLAGS="-L$(Agent.BuildDirectory)/rocm/lib64 -L/opt/rh/gcc-toolset-14/root/usr/lib/gcc/x86_64-redhat-linux/14/"', '') }}
${{ iif(eq(parameters.os, 'almalinux8'), '-DCMAKE_EXE_LINKER_FLAGS="-L$(Agent.BuildDirectory)/rocm/lib64 -L/opt/rh/gcc-toolset-14/root/usr/lib/gcc/x86_64-redhat-linux/14/"', '') }}
-DCMAKE_CXX_FLAGS="${{ parameters.extraCxxFlags }} ${{ iif(and(eq(parameters.os, 'almalinux8'), parameters.useAmdclang), '--gcc-toolchain=/opt/rh/gcc-toolset-14/root', '') }}"
${{ parameters.extraBuildFlags }}
${{ parameters.cmakeSourceDir }}
${{ iif(parameters.customInstallPath, format('-DCMAKE_INSTALL_PREFIX={0}', parameters.installDir), '') }}
${{ parameters.extraBuildFlags }} ${{ parameters.cmakeSourceDir }}
- ${{ if parameters.printDiskSpace }}:
- script: df -h
displayName: Disk space before build
# equivalent to running make $cmakeTargetDir from $cmakeBuildDir
# i.e., cd $cmakeBuildDir; make $cmakeTargetDir
- task: CMake@1
-${{ if and( eq(parameters.os, 'almalinux8'), eq(parameters.consolidateBuildAndInstall , true)) }}:
+${{ if eq(parameters.consolidateBuildAndInstall, true) }}:
displayName: '${{ parameters.componentName }} CMake Build and Install'
${{ else }}:
displayName: '${{ parameters.componentName }} CMake Build'
${{ if eq(parameters.os, 'almalinux8')}}:
env:
PATH: "/opt/rh/gcc-toolset-14/root/usr/bin:$(PATH)"
MANPATH: "/opt/rh/gcc-toolset-14/root/usr/share/man:$(MANPATH)"
INFOPATH: "/opt/rh/gcc-toolset-14/root/usr/share/info:$(INFOPATH)"
PCP_DIR: "/opt/rh/gcc-toolset-14/root"
LD_LIBRARY_PATH: "/opt/rh/gcc-toolset-14/root/usr/lib64:/opt/rh/gcc-toolset-14/root/usr/lib:$(LD_LIBRARY_PATH)"
PKG_CONFIG_PATH: "/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig:$(PKG_CONFIG_PATH)"
inputs:
workingDirectory: ${{ parameters.cmakeBuildDir }}
${{ if eq(parameters.os, 'almalinux8') }}:
@@ -112,7 +110,7 @@ steps:
displayName: Disk space after build
# equivalent to running make $cmakeTarget from $cmakeBuildDir
# e.g., make install
-- ${{ if and(eq(parameters.installEnabled, true), or(ne(parameters.os, 'almalinux8'), eq(parameters.consolidateBuildAndInstall, false))) }}:
+- ${{ if and(eq(parameters.installEnabled, true), eq(parameters.consolidateBuildAndInstall, false)) }}:
- task: CMake@1
displayName: '${{parameters.componentName }} ${{ parameters.cmakeTarget }}'
inputs:
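
The separate install step in this hunk is guarded by two versions of a condition: one that also checks the OS, and one that depends only on `installEnabled` and `consolidateBuildAndInstall`. The sketch below models both as plain Python predicates and lists the inputs where they disagree. Treating the OS-aware form as the old condition is an assumption based on line order, and the function names are hypothetical.

```python
from itertools import product


def old_install_step(install_enabled, os_name, consolidate):
    # and(installEnabled, or(os != 'almalinux8', consolidate == false))
    return install_enabled and (os_name != "almalinux8" or not consolidate)


def new_install_step(install_enabled, os_name, consolidate):
    # and(installEnabled, consolidate == false)
    return install_enabled and not consolidate


def differing_cases():
    """Enumerate every input combination where the two conditions disagree."""
    cases = product([True, False], ["almalinux8", "ubuntu2204"], [True, False])
    return [c for c in cases if old_install_step(*c) != new_install_step(*c)]


if __name__ == "__main__":
    print(differing_cases())
```

Under that reading, the only behavioral change is on non-AlmaLinux jobs that set `consolidateBuildAndInstall`: they previously still ran the separate install step and now skip it.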

View File

@@ -675,6 +675,7 @@ detections
dev
devicelibs
devsel
dgl
dimensionality
disambiguates
distro

View File

@@ -4,6 +4,116 @@ This page is a historical overview of changes made to ROCm components. This
consolidated changelog documents key modifications and improvements across
different versions of the ROCm software stack and its components.
## ROCm 7.0.2
See the [ROCm 7.0.2 release notes](https://rocm.docs.amd.com/en/docs-7.0.2/about/release-notes.html#rocm-7-0-2-release-notes)
for a complete overview of this release.
### **AMD SMI** (26.0.2)
#### Added
* Added `bad_page_threshold_exceeded` field to `amd-smi static --ras`, which compares retired pages count against bad page threshold. This field displays `True` if retired pages exceed the threshold, `False` if within threshold, or `N/A` if threshold data is unavailable. Note that `sudo` is required to have the `bad_page_threshold_exceeded` field populated.
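
The three display values described for this field can be sketched as a small Python function. This only illustrates the stated comparison and is not AMD SMI code; the function and parameter names are hypothetical, and `None` stands in for unavailable threshold data.

```python
def bad_page_threshold_exceeded(retired_pages, threshold):
    """Return the value the field would display, per the description above."""
    if threshold is None:
        # Threshold data is unavailable.
        return "N/A"
    # "True" only when the retired page count exceeds the threshold.
    return "True" if retired_pages > threshold else "False"


if __name__ == "__main__":
    print(bad_page_threshold_exceeded(5, 3))
    print(bad_page_threshold_exceeded(3, 3))
    print(bad_page_threshold_exceeded(1, None))
```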
#### Removed
* Removed gpuboard and baseboard temperatures enums in amdsmi Python Library.
* `AmdSmiTemperatureType` had issues with referencing the correct attribute. As such, the following duplicate enums have been removed:
- `AmdSmiTemperatureType.GPUBOARD_NODE_FIRST`
- `AmdSmiTemperatureType.GPUBOARD_VR_FIRST`
- `AmdSmiTemperatureType.BASEBOARD_FIRST`
#### Resolved issues
* Fixed `attribute error` in `amd-smi monitor` on Linux Guest systems, where the violations argument caused CLI to break.
* Fixed certain output in `amd-smi monitor` when GPUs are partitioned.
* This fix covers commands such as `amd-smi monitor -Vqt`, `amd-smi monitor -g 0 -Vqt -w 1`, and `amd-smi monitor -Vqt --file /tmp/test1`, which now display output normally on partitioned GPUs.
* Fixed an issue where using `amd-smi ras --folder <folder_name>` was forcing the created folder's name to be lowercase. This fix also allows all string input options to be case insensitive.
* Fixed an issue of some processes not being detected by AMD SMI despite making use of KFD resources. This fix, with the addition of KFD Fallback for process detection, ensures that all KFD processes will be detected.
* Multiple CPER issues were fixed.
- Issue of being unable to query for additional CPERs after 20 were generated on a single device.
- Issue where the RAS HBM CRC read was failing due to an incorrect AFID value.
- Issue where RAS injections were not consistently producing related CPERs.
### **HIP** (7.0.2)
#### Added
* Support for the `hipMemAllocationTypeUncached` flag, enabling developers to allocate uncached memory. This flag is now supported in the following APIs:
- `hipMemGetAllocationGranularity` determines the recommended allocation granularity for uncached memory.
- `hipMemCreate` allocates memory with uncached properties.
#### Resolved issues
* A compilation failure affecting applications that compile kernels using `hiprtc` with the compiler option `std=c++11`.
* A permission-related error occurred during the execution of `hipLaunchHostFunc`. This API is now supported and permitted to run during stream capture, aligning its behavior with CUDA.
* A numerical error during graph capture of kernels that rely on a remainder in `globalWorkSize`, in frameworks like MIOpen and PyTorch, where the grid size is not a multiple of the block size. To ensure correct replay behavior, HIP runtime now stores this remainder in `hip::GraphKernelNode` during `hipExtModuleLaunchKernel` capture, enabling accurate execution and preventing corruption.
* A page fault occurred during viewport rendering while running the file undo.blend in Blender. The issue was resolved by the HIP runtime, which reused the same context during image creation.
* Resolved a segmentation fault in `gpu_metrics`, which is used in threshold logic for command submission patches to GPU device(s) during CPU synchronization.
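
The `globalWorkSize` remainder mentioned in the graph-capture fix above arises whenever the total work size is not a multiple of the block size. A one-dimensional arithmetic sketch in Python follows; it is not HIP runtime code, and the function name is hypothetical.

```python
def split_global_work_size(global_work_size, block_size):
    """Split a 1-D work size into full blocks plus a leftover remainder."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(global_work_size, block_size)
    return full_blocks, remainder


if __name__ == "__main__":
    # 1000 work items with 256 per block: 3 full blocks and 232 left over.
    print(split_global_work_size(1000, 256))
```

If the remainder were dropped when the launch was recorded, a replay would cover only the full blocks or round up to a whole extra block, which is the kind of mismatch the fix avoids by storing the remainder with the captured node.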
### **hipBLAS** (3.0.2)
#### Added
* Enabled support for gfx1150, gfx1151, gfx1200, and gfx1201 AMD hardware.
### **RCCL** (2.26.6)
#### Added
* Enabled double-buffering in `reduceCopyPacks` to trigger pipelining, especially to overlap bf16 arithmetic.
* Added `--force-reduce-pipeline` as an option that can be passed to the `install.sh` script. Passing this option enables software-triggered pipelining for `bfloat16` reductions (that is, `all_reduce`, `reduce_scatter`, and `reduce`).
### **rocBLAS** (5.0.2)
#### Added
* Enabled gfx1150 and gfx1151.
* The `ROCBLAS_USE_HIPBLASLT_BATCHED` variable to independently control the batched hipblaslt backend. Set `ROCBLAS_USE_HIPBLASLT_BATCHED=0` to disable batched GEMM use of the hipblaslt backend.
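
One way to apply the `ROCBLAS_USE_HIPBLASLT_BATCHED=0` setting to a single child process, rather than the whole shell session, is to build a modified environment in Python. This is a minimal sketch; the helper name and the application path in the comment are hypothetical.

```python
import os


def env_without_batched_hipblaslt(base_env=None):
    """Copy an environment and disable the batched hipblaslt backend in it."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROCBLAS_USE_HIPBLASLT_BATCHED"] = "0"
    return env


if __name__ == "__main__":
    env = env_without_batched_hipblaslt()
    print(env["ROCBLAS_USE_HIPBLASLT_BATCHED"])
    # Hypothetical usage: pass `env` to a child process, for example
    # subprocess.run(["./my_gemm_app"], env=env, check=True)
```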
#### Resolved issues
* Set the imaginary portion of the main diagonal of the output matrix to zero in syrk and herk.
### **ROCdbgapi** (0.77.4)
#### Added
* ROCdbgapi documentation link in the README.md file.
### **ROCm Systems Profiler** (1.1.1)
#### Resolved issues
* Fixed an issue where ROC-TX ranges were displayed as two separate events instead of a single spanning event.
### **rocPRIM** (4.0.1)
#### Resolved issues
* Fixed compilation issue when using `rocprim::texture_cache_iterator`.
* Fixed a HIP version check used to determine whether `hipStreamLegacy` is supported. This resolves runtime errors that occur when `hipStreamLegacy` is used in ROCm 7.0.0 and later.
### **rocSPARSE** (4.0.3)
#### Resolved issues
* Fixed an issue causing premature deallocation of internal buffers while still in use.
### **rocSOLVER** (3.30.1)
#### Optimized
Improved the performance of:
* LARFT and downstream functions such as GEQRF and ORMTR.
* LARF and downstream functions such as GEQR2.
* ORMTR and downstream functions such as SYEVD.
* GEQR2 and downstream functions such as GEQRF.
## ROCm 7.0.1
ROCm 7.0.1 is a quality release that resolves an existing issue. No components have changed since the previous ROCm 7.0.0 release. See the [ROCm 7.0.1 release notes](https://rocm.docs.amd.com/en/docs-7.0.1/about/release-notes.html#rocm-7-0-1-release-notes) for a complete overview of this release.
@@ -802,11 +912,15 @@ HIP runtime has the following functional improvements which improves runtime per
* Compatibility with NCCL 2.25.1.
* Compatibility with NCCL 2.26.6.
#### Optimized
* Improved the performance of the `FP8` Sum operation by upcasting to `FP16`.
#### Resolved issues
* Resolved an issue when using more than 64 channels when multiple collectives are used in the same `ncclGroup()` call.
* Fixed unit test failures in tests ending with the `ManagedMem` and `ManagedMemGraph` suffixes.
* Fixed a suboptimal algorithmic switching point for AllReduce on the AMD Instinct MI300X.
* Fixed broken functionality within the LL protocol on gfx950 by disabling inlining of LLGenericOp kernels.
* Fixed the known issue "When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault" with a design change to use `comm` instead of `rank` for `mscclStatus`. The global map for `comm` to `mscclStatus` is still not thread safe but should be explicitly handled by mutexes for read-write operations. This is tested for correctness, but there is a plan to use a thread-safe map data structure in an upcoming release.
### **rocAL** (2.3.0)
@@ -4002,7 +4116,7 @@ memory partition modes upon an invalid argument return from memory partition mod
- JSON output plugin for `rocprofv2`. The JSON file matches Google Trace Format making it easy to load on Perfetto, Chrome tracing, or Speedscope. For Speedscope, use `--disable-json-data-flows` option as speedscope doesn't work with data flows.
- `--no-serialization` flag to disable kernel serialization when `rocprofv2` is in counter collection mode. This allows `rocprofv2` to avoid deadlock when profiling certain programs in counter collection mode.
-- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 accelerator
+- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 GPU
- New HIP APIs with struct defined inside union.
- Early checks to confirm the eligibility of ELF file in ATT plugin
- Support for kernel name filtering in `rocprofv2`
@@ -4026,18 +4140,18 @@ memory partition modes upon an invalid argument return from memory partition mod
#### Resolved issues
- Bandwidth measurement in AMD Instinct MI300 accelerator
- Bandwidth measurement in AMD Instinct MI300 GPU
- Perfetto plugin issue of `roctx` trace not getting displayed
- `--help` for counter collection
- Signal management issues in `queue.cpp`
- Perfetto tracks for multi-GPU
- Perfetto plugin usage with `rocsys`
- Incorrect number of columns in the output CSV files for counter collection and kernel tracing
- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 accelerator
- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 GPU
- Build errors thrown during parsing of unions
- The system hang caused while running `--kernel-trace` with Perfetto for certain applications
- Missing profiler records issue caused while running `--trace-period`
- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 accelerator
- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 GPU
- Segmentation fault on Navi32
@@ -5434,7 +5548,7 @@ See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.
intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This
issue will be fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/rocprofiler-compute/issues/347).
- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error:
- Omniperf might not work with AMD Instinct MI300 GPUs out of the box, resulting in the following error:
"*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the
environment variable `export ROCPROF=rocprofv2`.
@@ -5550,7 +5664,7 @@ See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub.
#### Optimized
* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator.
* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 GPU.
#### Removed

2590
RELEASE.md

File diff suppressed because it is too large

@@ -1,7 +1,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
<remote name="rocm-org" fetch="https://github.com/ROCm/" />
<default revision="refs/tags/rocm-7.0.1"
<default revision="refs/tags/rocm-7.0.2"
remote="rocm-org"
sync-c="true"
sync-j="4" />
@@ -41,7 +41,6 @@
<project groups="mathlibs" name="MIVisionX" />
<project groups="mathlibs" name="ROCmValidationSuite" />
<project groups="mathlibs" name="composable_kernel" />
<project groups="mathlibs" name="hipSOLVER" />
<project groups="mathlibs" name="hipTensor" />
<project groups="mathlibs" name="hipfort" />
<project groups="mathlibs" name="rccl" />
@@ -57,7 +56,6 @@
<project groups="mathlibs" name="rocm-libraries" />
<project groups="mathlibs" name="rocPyDecode" />
<project groups="mathlibs" name="rocSHMEM" />
<project groups="mathlibs" name="rocSOLVER" />
<project groups="mathlibs" name="rocWMMA" />
<project groups="mathlibs" name="rocm-cmake" />
<project groups="mathlibs" name="rpp" />

@@ -1,137 +1,137 @@
ROCm Version,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,,
,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2"
,,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5"
,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2"
,RHEL 8.10 [#rhel-700-past-60]_,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8"
,SLES 15 SP7 [#sles-db-700-past-60]_,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4"
,,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9
,"Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,,
,Debian 12 [#sles-db-700-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,,
,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,,,,,,,,,,,,
,Rocky Linux 9 [#rl-700-past-60]_,,,,,,,,,,,,,,,,,,
,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,,,,,,,,,,,,,,,,,,
,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3
,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2
,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA
,RDNA4,RDNA4,RDNA4,RDNA4,,,,,,,,,,,,,,,
,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3
,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2
,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx950 [#mi350x-os-past-60]_,,,,,,,,,,,,,,,,,,
,gfx1201 [#RDNA-OS-700-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1200 [#RDNA-OS-700-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1100 [#RDNA-OS-700-past-60]_,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100
,gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030
,gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_
,gfx90a [#mi200x-os-past-60]_,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a
,gfx908 [#mi100-os-past-60]_,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
,,,,,,,,,,,,,,,,,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
:doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,b6356,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
,,,,,,,,,,,,,,,,,,,
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
Thrust,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
CUB,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
,,,,,,,,,,,,,,,,,,,
DRIVER & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`AMD GPU Driver <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
,,,,,,,,,,,,,,,,,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0
:doc:`MIGraphX <amdmigraphx:index>`,2.13.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.0,2.11.0,2.11.0,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0
:doc:`MIOpen <miopen:index>`,3.5.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`MIVisionX <mivisionx:index>`,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0
:doc:`rocAL <rocal:index>`,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`rocDecode <rocdecode:index>`,1.0.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A
:doc:`rocJPEG <rocjpeg:index>`,1.1.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`rocPyDecode <rocpydecode:index>`,0.6.0,0.3.1,0.3.1,0.3.1,0.3.1,0.2.0,0.2.0,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`RPP <rpp:index>`,2.0.0,1.9.10,1.9.10,1.9.10,1.9.10,1.9.1,1.9.1,1.9.1,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0
,,,,,,,,,,,,,,,,,,,
COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`RCCL <rccl:index>`,2.26.6,2.22.3,2.22.3,2.22.3,2.22.3,2.21.5,2.21.5,2.21.5,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3
:doc:`rocSHMEM <rocshmem:index>`,3.0.0,2.0.1,2.0.1,2.0.0,2.0.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
,,,,,,,,,,,,,,,,,,,
MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
`half <https://github.com/ROCm/half>`_ ,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0
:doc:`hipBLAS <hipblas:index>`,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0
:doc:`hipBLASLt <hipblaslt:index>`,1.0.0,0.12.1,0.12.1,0.12.1,0.12.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0
:doc:`hipFFT <hipfft:index>`,1.0.20,1.0.18,1.0.18,1.0.18,1.0.18,1.0.17,1.0.17,1.0.17,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13
:doc:`hipfort <hipfort:index>`,0.7.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.1,0.5.1,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0
:doc:`hipRAND <hiprand:index>`,3.0.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.1,2.11.1,2.11.1,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16
:doc:`hipSOLVER <hipsolver:index>`,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0
:doc:`hipSPARSE <hipsparse:index>`,4.0.1,3.2.0,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.4,0.2.3,0.2.3,0.2.3,0.2.3,0.2.2,0.2.2,0.2.2,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0
:doc:`rocALUTION <rocalution:index>`,4.0.0,3.2.3,3.2.3,3.2.3,3.2.2,3.2.1,3.2.1,3.2.1,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3
:doc:`rocBLAS <rocblas:index>`,5.0.0,4.4.1,4.4.1,4.4.0,4.4.0,4.3.0,4.3.0,4.3.0,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0
:doc:`rocFFT <rocfft:index>`,1.0.34,1.0.32,1.0.32,1.0.32,1.0.32,1.0.31,1.0.31,1.0.31,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23
:doc:`rocRAND <rocrand:index>`,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17
:doc:`rocSOLVER <rocsolver:index>`,3.30.0,3.28.2,3.28.2,3.28.0,3.28.0,3.27.0,3.27.0,3.27.0,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0
:doc:`rocSPARSE <rocsparse:index>`,4.0.2,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2
:doc:`rocWMMA <rocwmma:index>`,2.0.0,1.7.0,1.7.0,1.7.0,1.7.0,1.6.0,1.6.0,1.6.0,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0
:doc:`Tensile <tensile:src/index>`,4.44.0,4.43.0,4.43.0,4.43.0,4.43.0,4.42.0,4.42.0,4.42.0,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0
,,,,,,,,,,,,,,,,,,,
PRIMITIVES,.. _primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`hipCUB <hipcub:index>`,4.0.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`hipTensor <hiptensor:index>`,2.0.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0
:doc:`rocPRIM <rocprim:index>`,4.0.0,3.4.1,3.4.1,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`rocThrust <rocthrust:index>`,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
,,,,,,,,,,,,,,,,,,,
SUPPORT LIBS,,,,,,,,,,,,,,,,,,,
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245
,,,,,,,,,,,,,,,,,,,
SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`AMD SMI <amdsmi:index>`,26.0.0,25.5.1,25.5.1,25.4.2,25.3.0,24.7.1,24.7.1,24.7.1,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2
:doc:`ROCm Data Center Tool <rdc:index>`,1.1.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0
:doc:`rocminfo <rocminfo:index>`,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.8.0,7.7.0,7.5.0,7.5.0,7.5.0,7.4.0,7.4.0,7.4.0,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.2.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000
,,,,,,,,,,,,,,,,,,,
PERFORMANCE TOOLS,,,,,,,,,,,,,,,,,,,
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,2.6.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.2.3,3.1.1,3.1.1,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,1.1.0,1.0.2,1.0.2,1.0.1,1.0.0,0.1.2,0.1.1,0.1.0,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCProfiler <rocprofiler:index>`,2.0.70000,2.0.60403,2.0.60402,2.0.60401,2.0.60400,2.0.60303,2.0.60302,2.0.60301,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,1.0.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCTracer <roctracer:index>`,4.1.70000,4.1.60403,4.1.60402,4.1.60401,4.1.60400,4.1.60303,4.1.60302,4.1.60301,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000
,,,,,,,,,,,,,,,,,,,
DEVELOPMENT TOOLS,,,,,,,,,,,,,,,,,,,
:doc:`HIPIFY <hipify:index>`,20.0.0,19.0.0,19.0.0,19.0.0,19.0.0,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`ROCm CMake <rocmcmakebuildtools:index>`,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.3,0.77.2,0.77.2,0.77.2,0.77.2,0.77.0,0.77.0,0.77.0,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,16.3.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.1.0,2.0.4,2.0.4,2.0.4,2.0.4,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3
,,,,,,,,,,,,,,,,,,,
COMPILERS,.. _compilers-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
`Flang <https://github.com/ROCm/flang>`_,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`llvm-project <llvm-project:index>`,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
,,,,,,,,,,,,,,,,,,,
RUNTIMES,.. _runtime-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,
:doc:`AMD CLR <hip:understand/amd_clr>`,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
:doc:`HIP <hip:index>`,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.18.0,1.15.0,1.15.0,1.15.0,1.15.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0
ROCm Version,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.5, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6.0.0
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,Ubuntu 24.04.2,"Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04","Ubuntu 24.04.1, 24.04",Ubuntu 24.04,,,,,,
,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5,"Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4","Ubuntu 22.04.5, 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3","Ubuntu 22.04.4, 22.04.3, 22.04.2","Ubuntu 22.04.4, 22.04.3, 22.04.2"
,,,,,,,,,,,,,,,"Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5","Ubuntu 20.04.6, 20.04.5"
,"RHEL 10.0 [#rhel-10-702-past-60]_, 9.6 [#rhel-10-702-past-60]_, 9.4 [#rhel-94-702-past-60]_","RHEL 9.6 [#rhel-10-702-past-60]_, 9.4 [#rhel-94-702-past-60]_","RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.6, 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.5, 9.4","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.4, 9.3, 9.2","RHEL 9.3, 9.2","RHEL 9.3, 9.2"
,RHEL 8.10 [#rhel-700-past-60]_,RHEL 8.10 [#rhel-700-past-60]_,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,RHEL 8.10,"RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.10, 8.9","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8","RHEL 8.9, 8.8"
,SLES 15 SP7 [#sles-db-700-past-60]_,SLES 15 SP7 [#sles-db-700-past-60]_,"SLES 15 SP7, SP6","SLES 15 SP7, SP6",SLES 15 SP6,SLES 15 SP6,"SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP6, SP5","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4","SLES 15 SP5, SP4"
,,,,,,,,,,,,,,,,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9,CentOS 7.9
,"Oracle Linux 10, 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_","Oracle Linux 9, 8 [#mi300x-past-60]_",Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.10 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,Oracle Linux 8.9 [#mi300x-past-60]_,,,
,"Debian 13 [#db-mi300x-past-60]_, 12 [#sles-db-700-past-60]_",Debian 12 [#sles-db-700-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,Debian 12 [#single-node-past-60]_,,,,,,,,,,,
,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,Azure Linux 3.0 [#az-mi300x-630-past-60]_,,,,,,,,,,,,
,Rocky Linux 9 [#rl-700-past-60]_,Rocky Linux 9 [#rl-700-past-60]_,,,,,,,,,,,,,,,,,,
,.. _architecture-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,CDNA4,,,,,,,,,,,,,,,,,,
,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3,CDNA3
,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2,CDNA2
,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA,CDNA
,RDNA4,RDNA4,RDNA4,RDNA4,RDNA4,,,,,,,,,,,,,,,
,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3,RDNA3
,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2,RDNA2
,.. _gpu-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx950 [#mi350x-os-past-60]_,gfx950 [#mi350x-os-past-60]_,,,,,,,,,,,,,,,,,,
,gfx1201 [#RDNA-OS-700-past-60]_,gfx1201 [#RDNA-OS-700-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,gfx1201 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1200 [#RDNA-OS-700-past-60]_,gfx1200 [#RDNA-OS-700-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,gfx1200 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_,gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_,gfx1101 [#RDNA-OS-past-60]_,,,,,,,,,,,,,,,
,gfx1100 [#RDNA-OS-700-past-60]_,gfx1100 [#RDNA-OS-700-past-60]_,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100,gfx1100
,gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_,gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030,gfx1030
,gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_,gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942,gfx942 [#mi300_624-past-60]_,gfx942 [#mi300_622-past-60]_,gfx942 [#mi300_621-past-60]_,gfx942 [#mi300_620-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_612-past-60]_, gfx942 [#mi300_611-past-60]_, gfx942 [#mi300_610-past-60]_, gfx942 [#mi300_602-past-60]_, gfx942 [#mi300_600-past-60]_
,gfx90a [#mi200x-os-past-60]_,gfx90a [#mi200x-os-past-60]_,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a,gfx90a
,gfx908 [#mi100-os-past-60]_,gfx908 [#mi100-os-past-60]_,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908
,,,,,,,,,,,,,,,,,,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.6.0,0.4.35,0.4.35,0.4.35,0.4.35,0.4.31,0.4.31,0.4.31,0.4.31,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26
:doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.3.0.post0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,1.8.0b1,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,N/A,b6356,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1
,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.4.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.3.0,>=1.2.0,>=1.2.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.17.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.15.0,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1,>=1.14.1
,,,,,,,,,,,,,,,,,,,,
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
Thrust,2.6.0,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
CUB,2.6.0,2.6.0,2.5.0,2.5.0,2.5.0,2.5.0,2.3.2,2.3.2,2.3.2,2.3.2,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.1,2.0.1
,,,,,,,,,,,,,,,,,,,,
DRIVER & USER SPACE [#kfd_support-past-60]_,.. _kfd-userspace-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`AMD GPU Driver <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10.2, 30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x","30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x","6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x"
,,,,,,,,,,,,,,,,,,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0
:doc:`MIGraphX <amdmigraphx:index>`,2.13.0,2.13.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.0,2.11.0,2.11.0,2.11.0,2.10.0,2.10.0,2.10.0,2.10.0,2.9.0,2.9.0,2.9.0,2.9.0,2.8.0,2.8.0
:doc:`MIOpen <miopen:index>`,3.5.0,3.5.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`MIVisionX <mivisionx:index>`,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0,2.5.0
:doc:`rocAL <rocal:index>`,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0,2.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`rocDecode <rocdecode:index>`,1.0.0,1.0.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,N/A,N/A
:doc:`rocJPEG <rocjpeg:index>`,1.1.0,1.1.0,0.8.0,0.8.0,0.8.0,0.8.0,0.6.0,0.6.0,0.6.0,0.6.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`rocPyDecode <rocpydecode:index>`,0.6.0,0.6.0,0.3.1,0.3.1,0.3.1,0.3.1,0.2.0,0.2.0,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`RPP <rpp:index>`,2.0.0,2.0.0,1.9.10,1.9.10,1.9.10,1.9.10,1.9.1,1.9.1,1.9.1,1.9.1,1.8.0,1.8.0,1.8.0,1.8.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0
,,,,,,,,,,,,,,,,,,,,
COMMUNICATION,.. _commlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`RCCL <rccl:index>`,2.26.6,2.26.6,2.22.3,2.22.3,2.22.3,2.22.3,2.21.5,2.21.5,2.21.5,2.21.5,2.20.5,2.20.5,2.20.5,2.20.5,2.18.6,2.18.6,2.18.6,2.18.6,2.18.3,2.18.3
:doc:`rocSHMEM <rocshmem:index>`,3.0.0,3.0.0,2.0.1,2.0.1,2.0.0,2.0.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
,,,,,,,,,,,,,,,,,,,,
MATH LIBS,.. _mathlibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
`half <https://github.com/ROCm/half>`_,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0,1.12.0
:doc:`hipBLAS <hipblas:index>`,3.0.2,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.0,2.1.0,2.1.0,2.1.0,2.0.0,2.0.0
:doc:`hipBLASLt <hipblaslt:index>`,1.0.0,1.0.0,0.12.1,0.12.1,0.12.1,0.12.0,0.10.0,0.10.0,0.10.0,0.10.0,0.8.0,0.8.0,0.8.0,0.8.0,0.7.0,0.7.0,0.7.0,0.7.0,0.6.0,0.6.0
:doc:`hipFFT <hipfft:index>`,1.0.20,1.0.20,1.0.18,1.0.18,1.0.18,1.0.18,1.0.17,1.0.17,1.0.17,1.0.17,1.0.16,1.0.15,1.0.15,1.0.14,1.0.14,1.0.14,1.0.14,1.0.14,1.0.13,1.0.13
:doc:`hipfort <hipfort:index>`,0.7.0,0.7.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.1,0.5.1,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0
:doc:`hipRAND <hiprand:index>`,3.0.0,3.0.0,2.12.0,2.12.0,2.12.0,2.12.0,2.11.1,2.11.1,2.11.1,2.11.0,2.11.1,2.11.0,2.11.0,2.11.0,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16,2.10.16
:doc:`hipSOLVER <hipsolver:index>`,3.0.0,3.0.0,2.4.0,2.4.0,2.4.0,2.4.0,2.3.0,2.3.0,2.3.0,2.3.0,2.2.0,2.2.0,2.2.0,2.2.0,2.1.1,2.1.1,2.1.1,2.1.0,2.0.0,2.0.0
:doc:`hipSPARSE <hipsparse:index>`,4.0.1,4.0.1,3.2.0,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.1.1,3.1.1,3.1.1,3.1.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.4,0.2.4,0.2.3,0.2.3,0.2.3,0.2.3,0.2.2,0.2.2,0.2.2,0.2.2,0.2.1,0.2.1,0.2.1,0.2.1,0.2.0,0.2.0,0.1.0,0.1.0,0.1.0,0.1.0
:doc:`rocALUTION <rocalution:index>`,4.0.0,4.0.0,3.2.3,3.2.3,3.2.3,3.2.2,3.2.1,3.2.1,3.2.1,3.2.1,3.2.1,3.2.0,3.2.0,3.2.0,3.1.1,3.1.1,3.1.1,3.1.1,3.0.3,3.0.3
:doc:`rocBLAS <rocblas:index>`,5.0.2,5.0.0,4.4.1,4.4.1,4.4.0,4.4.0,4.3.0,4.3.0,4.3.0,4.3.0,4.2.4,4.2.1,4.2.1,4.2.0,4.1.2,4.1.2,4.1.0,4.1.0,4.0.0,4.0.0
:doc:`rocFFT <rocfft:index>`,1.0.34,1.0.34,1.0.32,1.0.32,1.0.32,1.0.32,1.0.31,1.0.31,1.0.31,1.0.31,1.0.30,1.0.29,1.0.29,1.0.28,1.0.27,1.0.27,1.0.27,1.0.26,1.0.25,1.0.23
:doc:`rocRAND <rocrand:index>`,4.0.0,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.0,3.2.0,3.2.0,3.2.0,3.1.1,3.1.0,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,2.10.17
:doc:`rocSOLVER <rocsolver:index>`,3.30.1,3.30.0,3.28.2,3.28.2,3.28.0,3.28.0,3.27.0,3.27.0,3.27.0,3.27.0,3.26.2,3.26.0,3.26.0,3.26.0,3.25.0,3.25.0,3.25.0,3.25.0,3.24.0,3.24.0
:doc:`rocSPARSE <rocsparse:index>`,4.0.2,4.0.2,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.2,3.1.2,3.1.2,3.1.2,3.0.2,3.0.2
:doc:`rocWMMA <rocwmma:index>`,2.0.0,2.0.0,1.7.0,1.7.0,1.7.0,1.7.0,1.6.0,1.6.0,1.6.0,1.6.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0
:doc:`Tensile <tensile:src/index>`,4.44.0,4.44.0,4.43.0,4.43.0,4.43.0,4.43.0,4.42.0,4.42.0,4.42.0,4.42.0,4.41.0,4.41.0,4.41.0,4.41.0,4.40.0,4.40.0,4.40.0,4.40.0,4.39.0,4.39.0
,,,,,,,,,,,,,,,,,,,,
PRIMITIVES,.. _primitivelibs-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`hipCUB <hipcub:index>`,4.0.0,4.0.0,3.4.0,3.4.0,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.1,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`hipTensor <hiptensor:index>`,2.0.0,2.0.0,1.5.0,1.5.0,1.5.0,1.5.0,1.4.0,1.4.0,1.4.0,1.4.0,1.3.0,1.3.0,1.3.0,1.3.0,1.2.0,1.2.0,1.2.0,1.2.0,1.1.0,1.1.0
:doc:`rocPRIM <rocprim:index>`,4.0.1,4.0.0,3.4.1,3.4.1,3.4.0,3.4.0,3.3.0,3.3.0,3.3.0,3.3.0,3.2.2,3.2.0,3.2.0,3.2.0,3.1.0,3.1.0,3.1.0,3.1.0,3.0.0,3.0.0
:doc:`rocThrust <rocthrust:index>`,4.0.0,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
,,,,,,,,,,,,,,,,,,,,
SUPPORT LIBS,,,,,,,,,,,,,,,,,,,,
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245
,,,,,,,,,,,,,,,,,,,,
SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`AMD SMI <amdsmi:index>`,26.0.2,26.0.0,25.5.1,25.5.1,25.4.2,25.3.0,24.7.1,24.7.1,24.7.1,24.7.1,24.6.3,24.6.3,24.6.3,24.6.2,24.5.1,24.5.1,24.5.1,24.4.1,23.4.2,23.4.2
:doc:`ROCm Data Center Tool <rdc:index>`,1.1.0,1.1.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0,0.3.0
:doc:`rocminfo <rocminfo:index>`,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.8.0,7.8.0,7.7.0,7.5.0,7.5.0,7.5.0,7.4.0,7.4.0,7.4.0,7.4.0,7.3.0,7.3.0,7.3.0,7.3.0,7.2.0,7.2.0,7.0.0,7.0.0,6.0.2,6.0.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.2.0,1.2.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.1.0,1.0.60204,1.0.60202,1.0.60201,1.0.60200,1.0.60105,1.0.60102,1.0.60101,1.0.60100,1.0.60002,1.0.60000
,,,,,,,,,,,,,,,,,,,,
PERFORMANCE TOOLS,,,,,,,,,,,,,,,,,,,,
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,2.6.0,2.6.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.2.3,3.2.3,3.1.1,3.1.1,3.1.0,3.1.0,3.0.0,3.0.0,3.0.0,3.0.0,2.0.1,2.0.1,2.0.1,2.0.1,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,1.1.1,1.1.0,1.0.2,1.0.2,1.0.1,1.0.0,0.1.2,0.1.1,0.1.0,0.1.0,1.11.2,1.11.2,1.11.2,1.11.2,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCProfiler <rocprofiler:index>`,2.0.70002,2.0.70000,2.0.60403,2.0.60402,2.0.60401,2.0.60400,2.0.60303,2.0.60302,2.0.60301,2.0.60300,2.0.60204,2.0.60202,2.0.60201,2.0.60200,2.0.60105,2.0.60102,2.0.60101,2.0.60100,2.0.60002,2.0.60000
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,1.0.0,1.0.0,0.6.0,0.6.0,0.6.0,0.6.0,0.5.0,0.5.0,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`ROCTracer <roctracer:index>`,4.1.70002,4.1.70000,4.1.60403,4.1.60402,4.1.60401,4.1.60400,4.1.60303,4.1.60302,4.1.60301,4.1.60300,4.1.60204,4.1.60202,4.1.60201,4.1.60200,4.1.60105,4.1.60102,4.1.60101,4.1.60100,4.1.60002,4.1.60000
,,,,,,,,,,,,,,,,,,,,
DEVELOPMENT TOOLS,,,,,,,,,,,,,,,,,,,,
:doc:`HIPIFY <hipify:index>`,20.0.0,20.0.0,19.0.0,19.0.0,19.0.0,19.0.0,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`ROCm CMake <rocmcmakebuildtools:index>`,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.14.0,0.13.0,0.13.0,0.13.0,0.13.0,0.12.0,0.12.0,0.12.0,0.12.0,0.11.0,0.11.0
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.4,0.77.3,0.77.2,0.77.2,0.77.2,0.77.2,0.77.0,0.77.0,0.77.0,0.77.0,0.76.0,0.76.0,0.76.0,0.76.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0,0.71.0
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,16.3.0,16.3.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,15.2.0,14.2.0,14.2.0,14.2.0,14.2.0,14.1.0,14.1.0,14.1.0,14.1.0,13.2.0,13.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.5.0,0.5.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.4.0,0.3.0,0.3.0,0.3.0,0.3.0,N/A,N/A
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.1.0,2.1.0,2.0.4,2.0.4,2.0.4,2.0.4,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3,2.0.3
,,,,,,,,,,,,,,,,,,,,
COMPILERS,.. _compilers-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0,0.5.0
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.1.1,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0,1.0.0
`Flang <https://github.com/ROCm/flang>`_,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24455,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
:doc:`llvm-project <llvm-project:index>`,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,20.0.0.25385,20.0.0.25314,19.0.0.25224,19.0.0.25224,19.0.0.25184,19.0.0.25133,18.0.0.25012,18.0.0.25012,18.0.0.24491,18.0.0.24491,18.0.0.24392,18.0.0.24355,18.0.0.24355,18.0.0.24232,17.0.0.24193,17.0.0.24193,17.0.0.24154,17.0.0.24103,17.0.0.24012,17.0.0.23483
,,,,,,,,,,,,,,,,,,,,
RUNTIMES,.. _runtime-support-compatibility-matrix-past-60:,,,,,,,,,,,,,,,,,,,
:doc:`AMD CLR <hip:understand/amd_clr>`,7.0.51831,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
:doc:`HIP <hip:index>`,7.0.51831,7.0.51830,6.4.43484,6.4.43484,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0,2.0.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.18.0,1.18.0,1.15.0,1.15.0,1.15.0,1.15.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.14.0,1.13.0,1.13.0,1.13.0,1.13.0,1.13.0,1.12.0,1.12.0
1 ROCm Version 7.0.2 7.0.1/7.0.0 6.4.3 6.4.2 6.4.1 6.4.0 6.3.3 6.3.2 6.3.1 6.3.0 6.2.4 6.2.2 6.2.1 6.2.0 6.1.5 6.1.2 6.1.1 6.1.0 6.0.2 6.0.0
2 :ref:`Operating systems & kernels <OS-kernel-versions>` Ubuntu 24.04.3 Ubuntu 24.04.3 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.2 Ubuntu 24.04.1, 24.04 Ubuntu 24.04.1, 24.04 Ubuntu 24.04.1, 24.04 Ubuntu 24.04
3 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5 Ubuntu 22.04.5, 22.04.4 Ubuntu 22.04.5, 22.04.4 Ubuntu 22.04.5, 22.04.4 Ubuntu 22.04.5, 22.04.4 Ubuntu 22.04.5, 22.04.4, 22.04.3 Ubuntu 22.04.4, 22.04.3 Ubuntu 22.04.4, 22.04.3 Ubuntu 22.04.4, 22.04.3 Ubuntu 22.04.4, 22.04.3, 22.04.2 Ubuntu 22.04.4, 22.04.3, 22.04.2
4 Ubuntu 20.04.6, 20.04.5 Ubuntu 20.04.6, 20.04.5 Ubuntu 20.04.6, 20.04.5 Ubuntu 20.04.6, 20.04.5 Ubuntu 20.04.6, 20.04.5 Ubuntu 20.04.6, 20.04.5
5 RHEL 10.0 [#rhel-10-702-past-60]_, 9.6 [#rhel-10-702-past-60]_, 9.4 [#rhel-94-702-past-60]_ RHEL 9.6, 9.4 RHEL 9.6 [#rhel-10-702-past-60]_, 9.4 [#rhel-94-702-past-60]_ RHEL 9.6, 9.4 RHEL 9.6, 9.4 RHEL 9.6, 9.5, 9.4 RHEL 9.5, 9.4 RHEL 9.5, 9.4 RHEL 9.5, 9.4 RHEL 9.5, 9.4 RHEL 9.5, 9.4 RHEL 9.4, 9.3 RHEL 9.4, 9.3 RHEL 9.4, 9.3 RHEL 9.4, 9.3 RHEL 9.4, 9.3, 9.2 RHEL 9.4, 9.3, 9.2 RHEL 9.4, 9.3, 9.2 RHEL 9.4, 9.3, 9.2 RHEL 9.3, 9.2 RHEL 9.3, 9.2
6 RHEL 8.10 [#rhel-700-past-60]_ RHEL 8.10 [#rhel-700-past-60]_ RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10 RHEL 8.10, 8.9 RHEL 8.10, 8.9 RHEL 8.10, 8.9 RHEL 8.10, 8.9 RHEL 8.9, 8.8 RHEL 8.9, 8.8 RHEL 8.9, 8.8 RHEL 8.9, 8.8 RHEL 8.9, 8.8 RHEL 8.9, 8.8
7 SLES 15 SP7 [#sles-db-700-past-60]_ SLES 15 SP7 [#sles-db-700-past-60]_ SLES 15 SP7, SP6 SLES 15 SP7, SP6 SLES 15 SP6 SLES 15 SP6 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP6, SP5 SLES 15 SP5, SP4 SLES 15 SP5, SP4 SLES 15 SP5, SP4 SLES 15 SP5, SP4 SLES 15 SP5, SP4 SLES 15 SP5, SP4
8 CentOS 7.9 CentOS 7.9 CentOS 7.9 CentOS 7.9 CentOS 7.9
9 Oracle Linux 10, 9, 8 [#ol-700-mi300x-past-60]_ Oracle Linux 9, 8 [#ol-700-mi300x-past-60]_ Oracle Linux 9, 8 [#mi300x-past-60]_ Oracle Linux 9, 8 [#mi300x-past-60]_ Oracle Linux 9, 8 [#mi300x-past-60]_ Oracle Linux 9, 8 [#mi300x-past-60]_ Oracle Linux 8.10 [#mi300x-past-60]_ Oracle Linux 8.10 [#mi300x-past-60]_ Oracle Linux 8.10 [#mi300x-past-60]_ Oracle Linux 8.10 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_ Oracle Linux 8.9 [#mi300x-past-60]_
10 Debian 13 [#db-mi300x-past-60]_, 12 [#sles-db-700-past-60]_ Debian 12 [#sles-db-700-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_ Debian 12 [#single-node-past-60]_
11 Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-past-60]_ Azure Linux 3.0 [#az-mi300x-630-past-60]_ Azure Linux 3.0 [#az-mi300x-630-past-60]_
12 Rocky Linux 9 [#rl-700-past-60]_ Rocky Linux 9 [#rl-700-past-60]_
13 .. _architecture-support-compatibility-matrix-past-60: .. _architecture-support-compatibility-matrix-past-60:
14 :doc:`Architecture <rocm-install-on-linux:reference/system-requirements>` CDNA4 CDNA4
15 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3 CDNA3
16 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2 CDNA2
17 CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA CDNA
18 RDNA4 RDNA4 RDNA4 RDNA4 RDNA4
19 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3 RDNA3
20 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2 RDNA2
21 .. _gpu-support-compatibility-matrix-past-60: .. _gpu-support-compatibility-matrix-past-60:
22 :doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>` gfx950 [#mi350x-os-past-60]_ gfx950 [#mi350x-os-past-60]_
23 gfx1201 [#RDNA-OS-700-past-60]_ gfx1201 [#RDNA-OS-700-past-60]_ gfx1201 [#RDNA-OS-past-60]_ gfx1201 [#RDNA-OS-past-60]_ gfx1201 [#RDNA-OS-past-60]_
24 gfx1200 [#RDNA-OS-700-past-60]_ gfx1200 [#RDNA-OS-700-past-60]_ gfx1200 [#RDNA-OS-past-60]_ gfx1200 [#RDNA-OS-past-60]_ gfx1200 [#RDNA-OS-past-60]_
25 gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_ gfx1101 [#RDNA-OS-700-past-60]_ [#rd-v710-past-60]_ gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_ gfx1101 [#RDNA-OS-past-60]_ [#7700XT-OS-past-60]_ gfx1101 [#RDNA-OS-past-60]_
26 gfx1100 [#RDNA-OS-700-past-60]_ gfx1100 [#RDNA-OS-700-past-60]_ gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100 gfx1100
27 gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_ gfx1030 [#RDNA-OS-700-past-60]_ [#rd-v620-past-60]_ gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030 gfx1030
28 gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_ gfx942 [#mi325x-os-past-60]_ [#mi300x-os-past-60]_ [#mi300A-os-past-60]_ gfx942 gfx942 gfx942 gfx942 gfx942 gfx942 gfx942 gfx942 gfx942 [#mi300_624-past-60]_ gfx942 [#mi300_622-past-60]_ gfx942 [#mi300_621-past-60]_ gfx942 [#mi300_620-past-60]_ gfx942 [#mi300_612-past-60]_ gfx942 [#mi300_612-past-60]_ gfx942 [#mi300_611-past-60]_ gfx942 [#mi300_610-past-60]_ gfx942 [#mi300_602-past-60]_ gfx942 [#mi300_600-past-60]_
29 gfx90a [#mi200x-os-past-60]_ gfx90a [#mi200x-os-past-60]_ gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a gfx90a
30 gfx908 [#mi100-os-past-60]_ gfx908 [#mi100-os-past-60]_ gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908 gfx908
31
32 FRAMEWORK SUPPORT .. _framework-support-compatibility-matrix-past-60: .. _framework-support-compatibility-matrix-past-60:
33 :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>` 2.8, 2.7, 2.6 2.7, 2.6, 2.5 2.6, 2.5, 2.4, 2.3 2.6, 2.5, 2.4, 2.3 2.6, 2.5, 2.4, 2.3 2.6, 2.5, 2.4, 2.3 2.4, 2.3, 2.2, 1.13 2.4, 2.3, 2.2, 1.13 2.4, 2.3, 2.2, 1.13 2.4, 2.3, 2.2, 2.1, 2.0, 1.13 2.3, 2.2, 2.1, 2.0, 1.13 2.3, 2.2, 2.1, 2.0, 1.13 2.3, 2.2, 2.1, 2.0, 1.13 2.3, 2.2, 2.1, 2.0, 1.13 2.1, 2.0, 1.13 2.1, 2.0, 1.13 2.1, 2.0, 1.13 2.1, 2.0, 1.13 2.1, 2.0, 1.13 2.1, 2.0, 1.13
34 :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>` 2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_ 2.19.1, 2.18.1, 2.17.1 [#tf-mi350-past-60]_ 2.18.1, 2.17.1, 2.16.2 2.18.1, 2.17.1, 2.16.2 2.18.1, 2.17.1, 2.16.2 2.18.1, 2.17.1, 2.16.2 2.17.0, 2.16.2, 2.15.1 2.17.0, 2.16.2, 2.15.1 2.17.0, 2.16.2, 2.15.1 2.17.0, 2.16.2, 2.15.1 2.16.1, 2.15.1, 2.14.1 2.16.1, 2.15.1, 2.14.1 2.16.1, 2.15.1, 2.14.1 2.16.1, 2.15.1, 2.14.1 2.15.0, 2.14.0, 2.13.1 2.15.0, 2.14.0, 2.13.1 2.15.0, 2.14.0, 2.13.1 2.15.0, 2.14.0, 2.13.1 2.14.0, 2.13.1, 2.12.1 2.14.0, 2.13.1, 2.12.1
35 :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>` 0.6.0 0.6.0 0.4.35 0.4.35 0.4.35 0.4.35 0.4.31 0.4.31 0.4.31 0.4.31 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26 0.4.26
36 :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>` [#verl_compat-past-60]_ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A 0.3.0.post0 N/A N/A N/A N/A N/A N/A
37 :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_ N/A N/A N/A N/A N/A N/A N/A N/A N/A 85f95ae N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
38 :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_ N/A N/A N/A N/A N/A 2.4.0 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
39 :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_ N/A N/A N/A N/A N/A N/A N/A N/A N/A 0.7.0 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
40 :doc:`Taichi <../compatibility/ml-compatibility/taichi-compatibility>` [#taichi_compat-past-60]_ N/A N/A N/A N/A N/A N/A N/A 1.8.0b1 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
41 :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_ N/A N/A N/A N/A 2.48.0.post0 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
42 :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_ N/A b6356 b6356 b6356 b6356 b5997 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
43 :doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_ N/A N/A N/A N/A v0.2.5 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
44 `ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_ 1.22.0 1.22.0 1.20.0 1.20.0 1.20.0 1.20.0 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.17.3 1.14.1 1.14.1
45
46
47 THIRD PARTY COMMS .. _thirdpartycomms-support-compatibility-matrix-past-60: .. _thirdpartycomms-support-compatibility-matrix-past-60:
48 `UCC <https://github.com/ROCm/ucc>`_ >=1.4.0 >=1.4.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.3.0 >=1.2.0 >=1.2.0
49 `UCX <https://github.com/ROCm/ucx>`_ >=1.17.0 >=1.17.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.15.0 >=1.14.1 >=1.14.1 >=1.14.1 >=1.14.1 >=1.14.1 >=1.14.1
50
51 THIRD PARTY ALGORITHM .. _thirdpartyalgorithm-support-compatibility-matrix-past-60: .. _thirdpartyalgorithm-support-compatibility-matrix-past-60:
52 Thrust 2.6.0 2.6.0 2.5.0 2.5.0 2.5.0 2.5.0 2.3.2 2.3.2 2.3.2 2.3.2 2.2.0 2.2.0 2.2.0 2.2.0 2.1.0 2.1.0 2.1.0 2.1.0 2.0.1 2.0.1
53 CUB 2.6.0 2.6.0 2.5.0 2.5.0 2.5.0 2.5.0 2.3.2 2.3.2 2.3.2 2.3.2 2.2.0 2.2.0 2.2.0 2.2.0 2.1.0 2.1.0 2.1.0 2.1.0 2.0.1 2.0.1
54
55 DRIVER & USER SPACE [#kfd_support-past-60]_ .. _kfd-userspace-support-compatibility-matrix-past-60: .. _kfd-userspace-support-compatibility-matrix-past-60:
56 :doc:`AMD GPU Driver <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>` 30.10.2, 30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x 30.10.1 [#driver_patch-past-60]_, 30.10, 6.4.x, 6.3.x, 6.2.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x 6.4.x, 6.3.x, 6.2.x, 6.1.x, 6.0.x, 5.7.x 6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x 6.2.x, 6.1.x, 6.0.x, 5.7.x, 5.6.x
57
58 ML & COMPUTER VISION .. _mllibs-support-compatibility-matrix-past-60: .. _mllibs-support-compatibility-matrix-past-60:
59 :doc:`Composable Kernel <composable_kernel:index>` 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0
60 :doc:`MIGraphX <amdmigraphx:index>` 2.13.0 2.13.0 2.12.0 2.12.0 2.12.0 2.12.0 2.11.0 2.11.0 2.11.0 2.11.0 2.10.0 2.10.0 2.10.0 2.10.0 2.9.0 2.9.0 2.9.0 2.9.0 2.8.0 2.8.0
61 :doc:`MIOpen <miopen:index>` 3.5.0 3.5.0 3.4.0 3.4.0 3.4.0 3.4.0 3.3.0 3.3.0 3.3.0 3.3.0 3.2.0 3.2.0 3.2.0 3.2.0 3.1.0 3.1.0 3.1.0 3.1.0 3.0.0 3.0.0
62 :doc:`MIVisionX <mivisionx:index>` 3.3.0 3.3.0 3.2.0 3.2.0 3.2.0 3.2.0 3.1.0 3.1.0 3.1.0 3.1.0 3.0.0 3.0.0 3.0.0 3.0.0 2.5.0 2.5.0 2.5.0 2.5.0 2.5.0 2.5.0
63 :doc:`rocAL <rocal:index>` 2.3.0 2.3.0 2.2.0 2.2.0 2.2.0 2.2.0 2.1.0 2.1.0 2.1.0 2.1.0 2.0.0 2.0.0 2.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0
64 :doc:`rocDecode <rocdecode:index>` 1.0.0 1.0.0 0.10.0 0.10.0 0.10.0 0.10.0 0.8.0 0.8.0 0.8.0 0.8.0 0.6.0 0.6.0 0.6.0 0.6.0 0.6.0 0.6.0 0.5.0 0.5.0 N/A N/A
65 :doc:`rocJPEG <rocjpeg:index>` 1.1.0 1.1.0 0.8.0 0.8.0 0.8.0 0.8.0 0.6.0 0.6.0 0.6.0 0.6.0 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
66 :doc:`rocPyDecode <rocpydecode:index>` 0.6.0 0.6.0 0.3.1 0.3.1 0.3.1 0.3.1 0.2.0 0.2.0 0.2.0 0.2.0 0.1.0 0.1.0 0.1.0 0.1.0 N/A N/A N/A N/A N/A N/A
67 :doc:`RPP <rpp:index>` 2.0.0 2.0.0 1.9.10 1.9.10 1.9.10 1.9.10 1.9.1 1.9.1 1.9.1 1.9.1 1.8.0 1.8.0 1.8.0 1.8.0 1.5.0 1.5.0 1.5.0 1.5.0 1.4.0 1.4.0
68
69 COMMUNICATION .. _commlibs-support-compatibility-matrix-past-60: .. _commlibs-support-compatibility-matrix-past-60:
70 :doc:`RCCL <rccl:index>` 2.26.6 2.26.6 2.22.3 2.22.3 2.22.3 2.22.3 2.21.5 2.21.5 2.21.5 2.21.5 2.20.5 2.20.5 2.20.5 2.20.5 2.18.6 2.18.6 2.18.6 2.18.6 2.18.3 2.18.3
71 :doc:`rocSHMEM <rocshmem:index>` 3.0.0 3.0.0 2.0.1 2.0.1 2.0.0 2.0.0 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
72
73 MATH LIBS .. _mathlibs-support-compatibility-matrix-past-60: .. _mathlibs-support-compatibility-matrix-past-60:
74 `half <https://github.com/ROCm/half>`_ 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0 1.12.0
75 :doc:`hipBLAS <hipblas:index>` 3.0.2 3.0.0 2.4.0 2.4.0 2.4.0 2.4.0 2.3.0 2.3.0 2.3.0 2.3.0 2.2.0 2.2.0 2.2.0 2.2.0 2.1.0 2.1.0 2.1.0 2.1.0 2.0.0 2.0.0
76 :doc:`hipBLASLt <hipblaslt:index>` 1.0.0 1.0.0 0.12.1 0.12.1 0.12.1 0.12.0 0.10.0 0.10.0 0.10.0 0.10.0 0.8.0 0.8.0 0.8.0 0.8.0 0.7.0 0.7.0 0.7.0 0.7.0 0.6.0 0.6.0
77 :doc:`hipFFT <hipfft:index>` 1.0.20 1.0.20 1.0.18 1.0.18 1.0.18 1.0.18 1.0.17 1.0.17 1.0.17 1.0.17 1.0.16 1.0.15 1.0.15 1.0.14 1.0.14 1.0.14 1.0.14 1.0.14 1.0.13 1.0.13
78 :doc:`hipfort <hipfort:index>` 0.7.0 0.7.0 0.6.0 0.6.0 0.6.0 0.6.0 0.5.1 0.5.1 0.5.0 0.5.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0
79 :doc:`hipRAND <hiprand:index>` 3.0.0 3.0.0 2.12.0 2.12.0 2.12.0 2.12.0 2.11.1 2.11.1 2.11.1 2.11.0 2.11.1 2.11.0 2.11.0 2.11.0 2.10.16 2.10.16 2.10.16 2.10.16 2.10.16 2.10.16
80 :doc:`hipSOLVER <hipsolver:index>` 3.0.0 3.0.0 2.4.0 2.4.0 2.4.0 2.4.0 2.3.0 2.3.0 2.3.0 2.3.0 2.2.0 2.2.0 2.2.0 2.2.0 2.1.1 2.1.1 2.1.1 2.1.0 2.0.0 2.0.0
81 :doc:`hipSPARSE <hipsparse:index>` 4.0.1 4.0.1 3.2.0 3.2.0 3.2.0 3.2.0 3.1.2 3.1.2 3.1.2 3.1.2 3.1.1 3.1.1 3.1.1 3.1.1 3.0.1 3.0.1 3.0.1 3.0.1 3.0.0 3.0.0
82 :doc:`hipSPARSELt <hipsparselt:index>` 0.2.4 0.2.4 0.2.3 0.2.3 0.2.3 0.2.3 0.2.2 0.2.2 0.2.2 0.2.2 0.2.1 0.2.1 0.2.1 0.2.1 0.2.0 0.2.0 0.1.0 0.1.0 0.1.0 0.1.0
83 :doc:`rocALUTION <rocalution:index>` 4.0.0 4.0.0 3.2.3 3.2.3 3.2.3 3.2.2 3.2.1 3.2.1 3.2.1 3.2.1 3.2.1 3.2.0 3.2.0 3.2.0 3.1.1 3.1.1 3.1.1 3.1.1 3.0.3 3.0.3
84 :doc:`rocBLAS <rocblas:index>` 5.0.2 5.0.0 4.4.1 4.4.1 4.4.0 4.4.0 4.3.0 4.3.0 4.3.0 4.3.0 4.2.4 4.2.1 4.2.1 4.2.0 4.1.2 4.1.2 4.1.0 4.1.0 4.0.0 4.0.0
85 :doc:`rocFFT <rocfft:index>` 1.0.34 1.0.34 1.0.32 1.0.32 1.0.32 1.0.32 1.0.31 1.0.31 1.0.31 1.0.31 1.0.30 1.0.29 1.0.29 1.0.28 1.0.27 1.0.27 1.0.27 1.0.26 1.0.25 1.0.23
86 :doc:`rocRAND <rocrand:index>` 4.0.0 4.0.0 3.3.0 3.3.0 3.3.0 3.3.0 3.2.0 3.2.0 3.2.0 3.2.0 3.1.1 3.1.0 3.1.0 3.1.0 3.0.1 3.0.1 3.0.1 3.0.1 3.0.0 2.10.17
87 :doc:`rocSOLVER <rocsolver:index>` 3.30.1 3.30.0 3.28.2 3.28.2 3.28.0 3.28.0 3.27.0 3.27.0 3.27.0 3.27.0 3.26.2 3.26.0 3.26.0 3.26.0 3.25.0 3.25.0 3.25.0 3.25.0 3.24.0 3.24.0
88 :doc:`rocSPARSE <rocsparse:index>` 4.0.2 4.0.2 3.4.0 3.4.0 3.4.0 3.4.0 3.3.0 3.3.0 3.3.0 3.3.0 3.2.1 3.2.0 3.2.0 3.2.0 3.1.2 3.1.2 3.1.2 3.1.2 3.0.2 3.0.2
89 :doc:`rocWMMA <rocwmma:index>` 2.0.0 2.0.0 1.7.0 1.7.0 1.7.0 1.7.0 1.6.0 1.6.0 1.6.0 1.6.0 1.5.0 1.5.0 1.5.0 1.5.0 1.4.0 1.4.0 1.4.0 1.4.0 1.3.0 1.3.0
90 :doc:`Tensile <tensile:src/index>` 4.44.0 4.44.0 4.43.0 4.43.0 4.43.0 4.43.0 4.42.0 4.42.0 4.42.0 4.42.0 4.41.0 4.41.0 4.41.0 4.41.0 4.40.0 4.40.0 4.40.0 4.40.0 4.39.0 4.39.0
91
92 PRIMITIVES .. _primitivelibs-support-compatibility-matrix-past-60:
93 :doc:`hipCUB <hipcub:index>` 4.0.0 4.0.0 3.4.0 3.4.0 3.4.0 3.4.0 3.3.0 3.3.0 3.3.0 3.3.0 3.2.1 3.2.0 3.2.0 3.2.0 3.1.0 3.1.0 3.1.0 3.1.0 3.0.0 3.0.0
94 :doc:`hipTensor <hiptensor:index>` 2.0.0 2.0.0 1.5.0 1.5.0 1.5.0 1.5.0 1.4.0 1.4.0 1.4.0 1.4.0 1.3.0 1.3.0 1.3.0 1.3.0 1.2.0 1.2.0 1.2.0 1.2.0 1.1.0 1.1.0
95 :doc:`rocPRIM <rocprim:index>` 4.0.1 4.0.0 3.4.1 3.4.1 3.4.0 3.4.0 3.3.0 3.3.0 3.3.0 3.3.0 3.2.2 3.2.0 3.2.0 3.2.0 3.1.0 3.1.0 3.1.0 3.1.0 3.0.0 3.0.0
96 :doc:`rocThrust <rocthrust:index>` 4.0.0 4.0.0 3.3.0 3.3.0 3.3.0 3.3.0 3.3.0 3.3.0 3.3.0 3.3.0 3.1.1 3.1.0 3.1.0 3.0.1 3.0.1 3.0.1 3.0.1 3.0.1 3.0.0 3.0.0
97
98 SUPPORT LIBS
99 `hipother <https://github.com/ROCm/hipother>`_ 7.0.51830 7.0.51830 6.4.43483 6.4.43483 6.4.43483 6.4.43482 6.3.42134 6.3.42134 6.3.42133 6.3.42131 6.2.41134 6.2.41134 6.2.41134 6.2.41133 6.1.40093 6.1.40093 6.1.40092 6.1.40091 6.1.32831 6.1.32830
100 `rocm-core <https://github.com/ROCm/rocm-core>`_ 7.0.2 7.0.1/7.0.0 6.4.3 6.4.2 6.4.1 6.4.0 6.3.3 6.3.2 6.3.1 6.3.0 6.2.4 6.2.2 6.2.1 6.2.0 6.1.5 6.1.2 6.1.1 6.1.0 6.0.2 6.0.0
101 `ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ N/A [#ROCT-rocr-past-60]_ 20240607.5.7 20240607.5.7 20240607.4.05 20240607.1.4246 20240125.5.08 20240125.5.08 20240125.5.08 20240125.3.30 20231016.2.245 20231016.2.245
102
103 SYSTEM MGMT TOOLS .. _tools-support-compatibility-matrix-past-60:
104 :doc:`AMD SMI <amdsmi:index>` 26.0.2 26.0.0 25.5.1 25.5.1 25.4.2 25.3.0 24.7.1 24.7.1 24.7.1 24.7.1 24.6.3 24.6.3 24.6.3 24.6.2 24.5.1 24.5.1 24.5.1 24.4.1 23.4.2 23.4.2
105 :doc:`ROCm Data Center Tool <rdc:index>` 1.1.0 1.1.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0 0.3.0
106 :doc:`rocminfo <rocminfo:index>` 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0
107 :doc:`ROCm SMI <rocm_smi_lib:index>` 7.8.0 7.8.0 7.7.0 7.5.0 7.5.0 7.5.0 7.4.0 7.4.0 7.4.0 7.4.0 7.3.0 7.3.0 7.3.0 7.3.0 7.2.0 7.2.0 7.0.0 7.0.0 6.0.2 6.0.0
108 :doc:`ROCm Validation Suite <rocmvalidationsuite:index>` 1.2.0 1.2.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.1.0 1.0.60204 1.0.60202 1.0.60201 1.0.60200 1.0.60105 1.0.60102 1.0.60101 1.0.60100 1.0.60002 1.0.60000
109
110 PERFORMANCE TOOLS
111 :doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>` 2.6.0 2.6.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0 1.4.0
112 :doc:`ROCm Compute Profiler <rocprofiler-compute:index>` 3.2.3 3.2.3 3.1.1 3.1.1 3.1.0 3.1.0 3.0.0 3.0.0 3.0.0 3.0.0 2.0.1 2.0.1 2.0.1 2.0.1 N/A N/A N/A N/A N/A N/A
113 :doc:`ROCm Systems Profiler <rocprofiler-systems:index>` 1.1.1 1.1.0 1.0.2 1.0.2 1.0.1 1.0.0 0.1.2 0.1.1 0.1.0 0.1.0 1.11.2 1.11.2 1.11.2 1.11.2 N/A N/A N/A N/A N/A N/A
114 :doc:`ROCProfiler <rocprofiler:index>` 2.0.70002 2.0.70000 2.0.60403 2.0.60402 2.0.60401 2.0.60400 2.0.60303 2.0.60302 2.0.60301 2.0.60300 2.0.60204 2.0.60202 2.0.60201 2.0.60200 2.0.60105 2.0.60102 2.0.60101 2.0.60100 2.0.60002 2.0.60000
115 :doc:`ROCprofiler-SDK <rocprofiler-sdk:index>` 1.0.0 1.0.0 0.6.0 0.6.0 0.6.0 0.6.0 0.5.0 0.5.0 0.5.0 0.5.0 0.4.0 0.4.0 0.4.0 0.4.0 N/A N/A N/A N/A N/A N/A
116 :doc:`ROCTracer <roctracer:index>` 4.1.70002 4.1.70000 4.1.60403 4.1.60402 4.1.60401 4.1.60400 4.1.60303 4.1.60302 4.1.60301 4.1.60300 4.1.60204 4.1.60202 4.1.60201 4.1.60200 4.1.60105 4.1.60102 4.1.60101 4.1.60100 4.1.60002 4.1.60000
117
118 DEVELOPMENT TOOLS
119 :doc:`HIPIFY <hipify:index>` 20.0.0 20.0.0 19.0.0 19.0.0 19.0.0 19.0.0 18.0.0.25012 18.0.0.25012 18.0.0.24491 18.0.0.24455 18.0.0.24392 18.0.0.24355 18.0.0.24355 18.0.0.24232 17.0.0.24193 17.0.0.24193 17.0.0.24154 17.0.0.24103 17.0.0.24012 17.0.0.23483
120 :doc:`ROCm CMake <rocmcmakebuildtools:index>` 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.14.0 0.13.0 0.13.0 0.13.0 0.13.0 0.12.0 0.12.0 0.12.0 0.12.0 0.11.0 0.11.0
121 :doc:`ROCdbgapi <rocdbgapi:index>` 0.77.4 0.77.3 0.77.2 0.77.2 0.77.2 0.77.2 0.77.0 0.77.0 0.77.0 0.77.0 0.76.0 0.76.0 0.76.0 0.76.0 0.71.0 0.71.0 0.71.0 0.71.0 0.71.0 0.71.0
122 :doc:`ROCm Debugger (ROCgdb) <rocgdb:index>` 16.3.0 16.3.0 15.2.0 15.2.0 15.2.0 15.2.0 15.2.0 15.2.0 15.2.0 15.2.0 14.2.0 14.2.0 14.2.0 14.2.0 14.1.0 14.1.0 14.1.0 14.1.0 13.2.0 13.2.0
123 `rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_ 0.5.0 0.5.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.4.0 0.3.0 0.3.0 0.3.0 0.3.0 N/A N/A
124 :doc:`ROCr Debug Agent <rocr_debug_agent:index>` 2.1.0 2.1.0 2.0.4 2.0.4 2.0.4 2.0.4 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3 2.0.3
125
126 COMPILERS .. _compilers-support-compatibility-matrix-past-60:
127 `clang-ocl <https://github.com/ROCm/clang-ocl>`_ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A 0.5.0 0.5.0 0.5.0 0.5.0 0.5.0 0.5.0
128 :doc:`hipCC <hipcc:index>` 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.1.1 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0 1.0.0
129 `Flang <https://github.com/ROCm/flang>`_ 20.0.0.25385 20.0.0.25314 19.0.0.25224 19.0.0.25224 19.0.0.25184 19.0.0.25133 18.0.0.25012 18.0.0.25012 18.0.0.24491 18.0.0.24455 18.0.0.24392 18.0.0.24355 18.0.0.24355 18.0.0.24232 17.0.0.24193 17.0.0.24193 17.0.0.24154 17.0.0.24103 17.0.0.24012 17.0.0.23483
130 :doc:`llvm-project <llvm-project:index>` 20.0.0.25385 20.0.0.25314 19.0.0.25224 19.0.0.25224 19.0.0.25184 19.0.0.25133 18.0.0.25012 18.0.0.25012 18.0.0.24491 18.0.0.24491 18.0.0.24392 18.0.0.24355 18.0.0.24355 18.0.0.24232 17.0.0.24193 17.0.0.24193 17.0.0.24154 17.0.0.24103 17.0.0.24012 17.0.0.23483
131 `OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_ 20.0.0.25385 20.0.0.25314 19.0.0.25224 19.0.0.25224 19.0.0.25184 19.0.0.25133 18.0.0.25012 18.0.0.25012 18.0.0.24491 18.0.0.24491 18.0.0.24392 18.0.0.24355 18.0.0.24355 18.0.0.24232 17.0.0.24193 17.0.0.24193 17.0.0.24154 17.0.0.24103 17.0.0.24012 17.0.0.23483
132
133 RUNTIMES .. _runtime-support-compatibility-matrix-past-60:
134 :doc:`AMD CLR <hip:understand/amd_clr>` 7.0.51831 7.0.51830 6.4.43484 6.4.43484 6.4.43483 6.4.43482 6.3.42134 6.3.42134 6.3.42133 6.3.42131 6.2.41134 6.2.41134 6.2.41134 6.2.41133 6.1.40093 6.1.40093 6.1.40092 6.1.40091 6.1.32831 6.1.32830
135 :doc:`HIP <hip:index>` 7.0.51831 7.0.51830 6.4.43484 6.4.43484 6.4.43483 6.4.43482 6.3.42134 6.3.42134 6.3.42133 6.3.42131 6.2.41134 6.2.41134 6.2.41134 6.2.41133 6.1.40093 6.1.40093 6.1.40092 6.1.40091 6.1.32831 6.1.32830
136 `OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_ 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0 2.0.0
137 :doc:`ROCr Runtime <rocr-runtime:index>` 1.18.0 1.18.0 1.15.0 1.15.0 1.15.0 1.15.0 1.14.0 1.14.0 1.14.0 1.14.0 1.14.0 1.14.0 1.14.0 1.13.0 1.13.0 1.13.0 1.13.0 1.13.0 1.12.0 1.12.0


@@ -10,10 +10,9 @@ Use this matrix to view the ROCm compatibility and system requirements across su
You can also refer to the :ref:`past versions of ROCm compatibility matrix<past-rocm-compatibility-matrix>`.
Accelerators and GPUs listed in the following table support compute workloads (no display
GPUs listed in the following table support compute workloads (no display
information or graphics). If you're using ROCm with AMD Radeon GPUs or Ryzen APUs for graphics
workloads, see the `Use ROCm on Radeon and Ryzen
<https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html>`_ to verify
workloads, see the :docs:`Use ROCm on Radeon and Ryzen <radeon:index.html>` to verify
compatibility and system requirements.
.. |br| raw:: html
@@ -23,20 +22,20 @@ compatibility and system requirements.
.. container:: format-big-table
.. csv-table::
:header: "ROCm Version", "7.0.1/7.0.0", "6.4.3", "6.3.0"
:header: "ROCm Version", "7.0.2", "7.0.1/7.0.0", "6.4.0"
:stub-columns: 1
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.2,Ubuntu 24.04.2
:ref:`Operating systems & kernels <OS-kernel-versions>`,Ubuntu 24.04.3,Ubuntu 24.04.3,Ubuntu 24.04.2
,Ubuntu 22.04.5,Ubuntu 22.04.5,Ubuntu 22.04.5
,"RHEL 9.6, 9.4","RHEL 9.6, 9.4","RHEL 9.5, 9.4"
,RHEL 8.10 [#rhel-700]_,RHEL 8.10,RHEL 8.10
,SLES 15 SP7 [#sles-db-700]_,"SLES 15 SP7, SP6","SLES 15 SP6, SP5"
,"Oracle Linux 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_",Oracle Linux 8.10 [#ol-mi300x]_
,Debian 12 [#sles-db-700]_,Debian 12 [#single-node]_,
,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_,
,Rocky Linux 9 [#rl-700]_,,
,"RHEL 10.0 [#rhel-10-702]_, 9.6 [#rhel-10-702]_, 9.4 [#rhel-94-702]_","RHEL 9.6 [#rhel-10-702]_, 9.4 [#rhel-94-702]_","RHEL 9.5, 9.4"
,RHEL 8.10 [#rhel-700]_,RHEL 8.10 [#rhel-700]_,RHEL 8.10
,SLES 15 SP7 [#sles-db-700]_,SLES 15 SP7 [#sles-db-700]_,SLES 15 SP6
,"Oracle Linux 10, 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-700-mi300x]_","Oracle Linux 9, 8 [#ol-mi300x]_"
,"Debian 13 [#db-mi300x]_, 12 [#sles-db-700]_",Debian 12 [#sles-db-700]_,Debian 12 [#single-node]_
,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_,Azure Linux 3.0 [#az-mi300x]_
,Rocky Linux 9 [#rl-700]_,Rocky Linux 9 [#rl-700]_,
,.. _architecture-support-compatibility-matrix:,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,,
:doc:`Architecture <rocm-install-on-linux:reference/system-requirements>`,CDNA4,CDNA4,
,CDNA3,CDNA3,CDNA3
,CDNA2,CDNA2,CDNA2
,CDNA,CDNA,CDNA
@@ -44,140 +43,140 @@ compatibility and system requirements.
,RDNA3,RDNA3,RDNA3
,RDNA2,RDNA2,RDNA2
,.. _gpu-support-compatibility-matrix:,,
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx950 [#mi350x-os]_,,
,gfx1201 [#RDNA-OS-700]_,gfx1201 [#RDNA-OS]_,
,gfx1200 [#RDNA-OS-700]_,gfx1200 [#RDNA-OS]_,
,gfx1101 [#RDNA-OS-700]_ [#rd-v710]_,gfx1101 [#RDNA-OS]_ [#7700XT-OS]_,
,gfx1100 [#RDNA-OS-700]_,gfx1100,gfx1100
,gfx1030 [#RDNA-OS-700]_ [#rd-v620]_,gfx1030,gfx1030
,gfx942 [#mi325x-os]_ [#mi300x-os]_ [#mi300A-os]_,gfx942,gfx942
,gfx90a [#mi200x-os]_,gfx90a,gfx90a
,gfx908 [#mi100-os]_,gfx908,gfx908
:doc:`GPU / LLVM target <rocm-install-on-linux:reference/system-requirements>`,gfx950 [#mi350x-os]_,gfx950 [#mi350x-os]_,
,gfx1201 [#RDNA-OS-700]_,gfx1201 [#RDNA-OS-700]_,
,gfx1200 [#RDNA-OS-700]_,gfx1200 [#RDNA-OS-700]_,
,gfx1101 [#RDNA-OS-700]_ [#rd-v710]_,gfx1101 [#RDNA-OS-700]_ [#rd-v710]_,
,gfx1100 [#RDNA-OS-700]_,gfx1100 [#RDNA-OS-700]_,gfx1100
,gfx1030 [#RDNA-OS-700]_ [#rd-v620]_,gfx1030 [#RDNA-OS-700]_ [#rd-v620]_,gfx1030
,gfx942 [#mi325x-os]_ [#mi300x-os]_ [#mi300A-os]_,gfx942 [#mi325x-os]_ [#mi300x-os]_ [#mi300A-os]_,gfx942
,gfx90a [#mi200x-os]_,gfx90a [#mi200x-os]_,gfx90a
,gfx908 [#mi100-os]_,gfx908 [#mi100-os]_,gfx908
,,,
FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix:,,
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3","2.4, 2.3, 2.2, 2.1, 2.0, 1.13"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350]_","2.18.1, 2.17.1, 2.16.2","2.17.0, 2.16.2, 2.15.1"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.4.35,0.4.31
:doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat]_,N/A,N/A,85f95ae
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat]_,N/A,N/A,0.7.0
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat]_,b6356,b6356,N/A
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.20.0,1.17.3
:doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`,"2.8, 2.7, 2.6","2.7, 2.6, 2.5","2.6, 2.5, 2.4, 2.3"
:doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`,"2.19.1, 2.18.1, 2.17.1 [#tf-mi350]_","2.19.1, 2.18.1, 2.17.1 [#tf-mi350]_","2.18.1, 2.17.1, 2.16.2"
:doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`,0.6.0,0.6.0,0.4.35
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat]_,N/A,N/A,2.4.0
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat]_,N/A,b6356,b5997
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.22.0,1.22.0,1.20.0
,,,
THIRD PARTY COMMS,.. _thirdpartycomms-support-compatibility-matrix:,,
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.3.0,>=1.3.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.15.0,>=1.15.0
`UCC <https://github.com/ROCm/ucc>`_,>=1.4.0,>=1.4.0,>=1.3.0
`UCX <https://github.com/ROCm/ucx>`_,>=1.17.0,>=1.17.0,>=1.15.0
,,,
THIRD PARTY ALGORITHM,.. _thirdpartyalgorithm-support-compatibility-matrix:,,
Thrust,2.6.0,2.5.0,2.3.2
CUB,2.6.0,2.5.0,2.3.2
Thrust,2.6.0,2.6.0,2.5.0
CUB,2.6.0,2.6.0,2.5.0
,,,
DRIVER & USER SPACE [#kfd_support]_,.. _kfd-userspace-support-compatibility-matrix:,,
:doc:`AMD GPU Driver <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10.1 [#driver_patch]_, 30.10, 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x","6.4.x, 6.3.x, 6.2.x, 6.1.x"
:doc:`AMD GPU Driver <rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`,"30.10.2, 30.10.1 [#driver_patch]_, |br| 30.10, 6.4.x, 6.3.x","30.10.1 [#driver_patch]_, 30.10, |br| 6.4.x, 6.3.x, 6.2.x","6.4.x, 6.3.x, 6.2.x, 6.1.x"
,,,
ML & COMPUTER VISION,.. _mllibs-support-compatibility-matrix:,,
:doc:`Composable Kernel <composable_kernel:index>`,1.1.0,1.1.0,1.1.0
:doc:`MIGraphX <amdmigraphx:index>`,2.13.0,2.12.0,2.11.0
:doc:`MIOpen <miopen:index>`,3.5.0,3.4.0,3.3.0
:doc:`MIVisionX <mivisionx:index>`,3.3.0,3.2.0,3.1.0
:doc:`rocAL <rocal:index>`,2.3.0,2.2.0,2.1.0
:doc:`rocDecode <rocdecode:index>`,1.0.0,0.10.0,0.8.0
:doc:`rocJPEG <rocjpeg:index>`,1.1.0,0.8.0,0.6.0
:doc:`rocPyDecode <rocpydecode:index>`,0.6.0,0.3.1,0.2.0
:doc:`RPP <rpp:index>`,2.0.0,1.9.10,1.9.1
:doc:`MIGraphX <amdmigraphx:index>`,2.13.0,2.13.0,2.12.0
:doc:`MIOpen <miopen:index>`,3.5.0,3.5.0,3.4.0
:doc:`MIVisionX <mivisionx:index>`,3.3.0,3.3.0,3.2.0
:doc:`rocAL <rocal:index>`,2.3.0,2.3.0,2.2.0
:doc:`rocDecode <rocdecode:index>`,1.0.0,1.0.0,0.10.0
:doc:`rocJPEG <rocjpeg:index>`,1.1.0,1.1.0,0.8.0
:doc:`rocPyDecode <rocpydecode:index>`,0.6.0,0.6.0,0.3.1
:doc:`RPP <rpp:index>`,2.0.0,2.0.0,1.9.10
,,,
COMMUNICATION,.. _commlibs-support-compatibility-matrix:,,
:doc:`RCCL <rccl:index>`,2.26.6,2.22.3,2.21.5
:doc:`rocSHMEM <rocshmem:index>`,3.0.0,2.0.1,N/A
:doc:`RCCL <rccl:index>`,2.26.6,2.26.6,2.22.3
:doc:`rocSHMEM <rocshmem:index>`,3.0.0,3.0.0,2.0.0
,,,
MATH LIBS,.. _mathlibs-support-compatibility-matrix:,,
`half <https://github.com/ROCm/half>`_ ,1.12.0,1.12.0,1.12.0
:doc:`hipBLAS <hipblas:index>`,3.0.0,2.4.0,2.3.0
:doc:`hipBLASLt <hipblaslt:index>`,1.0.0,0.12.1,0.10.0
:doc:`hipFFT <hipfft:index>`,1.0.20,1.0.18,1.0.17
:doc:`hipfort <hipfort:index>`,0.7.0,0.6.0,0.5.0
:doc:`hipRAND <hiprand:index>`,3.0.0,2.12.0,2.11.0
:doc:`hipSOLVER <hipsolver:index>`,3.0.0,2.4.0,2.3.0
:doc:`hipSPARSE <hipsparse:index>`,4.0.1,3.2.0,3.1.2
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.4,0.2.3,0.2.2
:doc:`rocALUTION <rocalution:index>`,4.0.0,3.2.3,3.2.1
:doc:`rocBLAS <rocblas:index>`,5.0.0,4.4.1,4.3.0
:doc:`rocFFT <rocfft:index>`,1.0.34,1.0.32,1.0.31
:doc:`rocRAND <rocrand:index>`,4.0.0,3.3.0,3.2.0
:doc:`rocSOLVER <rocsolver:index>`,3.30.0,3.28.2,3.27.0
:doc:`rocSPARSE <rocsparse:index>`,4.0.2,3.4.0,3.3.0
:doc:`rocWMMA <rocwmma:index>`,2.0.0,1.7.0,1.6.0
:doc:`Tensile <tensile:src/index>`,4.44.0,4.43.0,4.42.0
:doc:`hipBLAS <hipblas:index>`,3.0.2,3.0.0,2.4.0
:doc:`hipBLASLt <hipblaslt:index>`,1.0.0,1.0.0,0.12.0
:doc:`hipFFT <hipfft:index>`,1.0.20,1.0.20,1.0.18
:doc:`hipfort <hipfort:index>`,0.7.0,0.7.0,0.6.0
:doc:`hipRAND <hiprand:index>`,3.0.0,3.0.0,2.12.0
:doc:`hipSOLVER <hipsolver:index>`,3.0.0,3.0.0,2.4.0
:doc:`hipSPARSE <hipsparse:index>`,4.0.1,4.0.1,3.2.0
:doc:`hipSPARSELt <hipsparselt:index>`,0.2.4,0.2.4,0.2.3
:doc:`rocALUTION <rocalution:index>`,4.0.0,4.0.0,3.2.2
:doc:`rocBLAS <rocblas:index>`,5.0.2,5.0.0,4.4.0
:doc:`rocFFT <rocfft:index>`,1.0.34,1.0.34,1.0.32
:doc:`rocRAND <rocrand:index>`,4.0.0,4.0.0,3.3.0
:doc:`rocSOLVER <rocsolver:index>`,3.30.1,3.30.0,3.28.0
:doc:`rocSPARSE <rocsparse:index>`,4.0.2,4.0.2,3.4.0
:doc:`rocWMMA <rocwmma:index>`,2.0.0,2.0.0,1.7.0
:doc:`Tensile <tensile:src/index>`,4.44.0,4.44.0,4.43.0
,,,
PRIMITIVES,.. _primitivelibs-support-compatibility-matrix:,,
:doc:`hipCUB <hipcub:index>`,4.0.0,3.4.0,3.3.0
:doc:`hipTensor <hiptensor:index>`,2.0.0,1.5.0,1.4.0
:doc:`rocPRIM <rocprim:index>`,4.0.0,3.4.1,3.3.0
:doc:`rocThrust <rocthrust:index>`,4.0.0,3.3.0,3.3.0
:doc:`hipCUB <hipcub:index>`,4.0.0,4.0.0,3.4.0
:doc:`hipTensor <hiptensor:index>`,2.0.0,2.0.0,1.5.0
:doc:`rocPRIM <rocprim:index>`,4.0.1,4.0.0,3.4.0
:doc:`rocThrust <rocthrust:index>`,4.0.0,4.0.0,3.3.0
,,,
SUPPORT LIBS,,,
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,6.4.43483,6.3.42131
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.1/7.0.0,6.4.3,6.3.0
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,7.0.51830,6.4.43482
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.2,7.0.1/7.0.0,6.4.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_
,,,
SYSTEM MGMT TOOLS,.. _tools-support-compatibility-matrix:,,
:doc:`AMD SMI <amdsmi:index>`,26.0.0,25.5.1,24.7.1
:doc:`ROCm Data Center Tool <rdc:index>`,1.1.0,0.3.0,0.3.0
:doc:`AMD SMI <amdsmi:index>`,26.0.2,26.0.0,25.3.0
:doc:`ROCm Data Center Tool <rdc:index>`,1.1.0,1.1.0,0.3.0
:doc:`rocminfo <rocminfo:index>`,1.0.0,1.0.0,1.0.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.8.0,7.7.0,7.4.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.2.0,1.1.0,1.1.0
:doc:`ROCm SMI <rocm_smi_lib:index>`,7.8.0,7.8.0,7.5.0
:doc:`ROCm Validation Suite <rocmvalidationsuite:index>`,1.2.0,1.2.0,1.1.0
,,,
PERFORMANCE TOOLS,,,
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,2.6.0,1.4.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.2.3,3.1.1,3.0.0
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,1.1.0,1.0.2,0.1.0
:doc:`ROCProfiler <rocprofiler:index>`,2.0.70000,2.0.60403,2.0.60300
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,1.0.0,0.6.0,0.5.0
:doc:`ROCTracer <roctracer:index>`,4.1.70000,4.1.60403,4.1.60300
:doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`,2.6.0,2.6.0,1.4.0
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>`,3.2.3,3.2.3,3.1.0
:doc:`ROCm Systems Profiler <rocprofiler-systems:index>`,1.1.1,1.1.0,1.0.0
:doc:`ROCProfiler <rocprofiler:index>`,2.0.70002,2.0.70000,2.0.60400
:doc:`ROCprofiler-SDK <rocprofiler-sdk:index>`,1.0.0,1.0.0,0.6.0
:doc:`ROCTracer <roctracer:index>`,4.1.70002,4.1.70000,4.1.60400
,,,
DEVELOPMENT TOOLS,,,
:doc:`HIPIFY <hipify:index>`,20.0.0,19.0.0,18.0.0.24455
:doc:`HIPIFY <hipify:index>`,20.0.0,20.0.0,19.0.0
:doc:`ROCm CMake <rocmcmakebuildtools:index>`,0.14.0,0.14.0,0.14.0
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.3,0.77.2,0.77.0
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,16.3.0,15.2.0,15.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.5.0,0.4.0,0.4.0
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.1.0,2.0.4,2.0.3
:doc:`ROCdbgapi <rocdbgapi:index>`,0.77.4,0.77.3,0.77.2
:doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`,16.3.0,16.3.0,15.2.0
`rocprofiler-register <https://github.com/ROCm/rocprofiler-register>`_,0.5.0,0.5.0,0.4.0
:doc:`ROCr Debug Agent <rocr_debug_agent:index>`,2.1.0,2.1.0,2.0.4
,,,
COMPILERS,.. _compilers-support-compatibility-matrix:,,
`clang-ocl <https://github.com/ROCm/clang-ocl>`_,N/A,N/A,N/A
:doc:`hipCC <hipcc:index>`,1.1.1,1.1.1,1.1.1
`Flang <https://github.com/ROCm/flang>`_,20.0.0.25314,19.0.0.25224,18.0.0.24455
:doc:`llvm-project <llvm-project:index>`,20.0.0.25314,19.0.0.25224,18.0.0.24491
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,20.0.0.25314,19.0.0.25224,18.0.0.24491
`Flang <https://github.com/ROCm/flang>`_,20.0.0.25385,20.0.0.25314,19.0.0.25133
:doc:`llvm-project <llvm-project:index>`,20.0.0.25385,20.0.0.25314,19.0.0.25133
`OpenMP <https://github.com/ROCm/llvm-project/tree/amd-staging/openmp>`_,20.0.0.25385,20.0.0.25314,19.0.0.25133
,,,
RUNTIMES,.. _runtime-support-compatibility-matrix:,,
:doc:`AMD CLR <hip:understand/amd_clr>`,7.0.51830,6.4.43484,6.3.42131
:doc:`HIP <hip:index>`,7.0.51830,6.4.43484,6.3.42131
:doc:`AMD CLR <hip:understand/amd_clr>`,7.0.51831,7.0.51830,6.4.43482
:doc:`HIP <hip:index>`,7.0.51831,7.0.51830,6.4.43482
`OpenCL Runtime <https://github.com/ROCm/clr/tree/develop/opencl>`_,2.0.0,2.0.0,2.0.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.18.0,1.15.0,1.14.0
:doc:`ROCr Runtime <rocr-runtime:index>`,1.18.0,1.18.0,1.15.0
.. rubric:: Footnotes
.. [#rhel-700] RHEL 8.10 is only supported on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs.
.. [#ol-700-mi300x] **For ROCm 7.0.x** - Oracle Linux 9 is supported only on AMD Instinct MI355X, MI350X, and MI300X GPUs. Oracle Linux 8 is supported only on AMD Instinct MI300X GPUs.
.. [#rhel-10-702] RHEL 10.0 and RHEL 9.6 are supported on all listed :ref:`supported_GPUs` except AMD Radeon PRO V620 GPU.
.. [#rhel-94-702] RHEL 9.4 is supported on all AMD Instinct GPUs listed under :ref:`supported_GPUs`.
.. [#rhel-700] RHEL 8.10 is supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs.
.. [#ol-700-mi300x] **For ROCm 7.0.x** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI355X, MI350X, and MI300X GPUs. Oracle Linux 8 is supported only on AMD Instinct MI300X GPU.
.. [#ol-mi300x] **Prior ROCm 7.0.0** - Oracle Linux is supported only on AMD Instinct MI300X GPUs.
.. [#sles-db-700] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are only supported on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs.
.. [#az-mi300x] Starting ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710.
.. [#rl-700] Rocky Linux 9 is only supported on AMD Instinct MI300X and MI300A GPUs.
.. [#single-node] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X for single-node functionality.
.. [#mi350x-os] AMD Instinct MI355X (gfx950) and MI350X(gfx950) GPUs are only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and Oracle Linux 9.
.. [#RDNA-OS-700] **For ROCm 7.0.x** - AMD Radeon PRO AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, and RHEL 9.6.
.. [#RDNA-OS] **Prior ROCm 7.0.0** - Radeon AI PRO R9700, Radeon RX 9070 XT (gfx1201), Radeon RX 9060 XT (gfx1200), Radeon PRO W7700 (gfx1101), and Radeon RX 7800 XT (gfx1101) are supported only on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#rd-v710] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and Azure Linux 3.0.
.. [#rd-v620] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) is only supported on Ubuntu 24.04.3 and Ubuntu 22.04.5.
.. [#mi325x-os] **For ROCm 7.0.x** - AMD Instinct MI325X GPU (gfx942) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#mi300x-os] **For ROCm 7.0.x** - AMD Instinct MI300X GPU (gfx942) is supported on all listed :ref:`supported_distributions`.
.. [#mi300A-os] **For ROCm 7.0.x** - AMD Instinct MI300A GPU (gfx942) is supported only on Ubuntu 24.04, Ubuntu 22.04, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9.
.. [#mi200x-os] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04, Ubuntu 22.04, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12.
.. [#mi100-os] **For ROCm 7.0.x** - AMD Instinct MI100 GPU (gfx908) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and RHEL 8.10.
.. [#7700XT-OS] **Prior ROCm 7.0.0** - Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6.
.. [#tf-mi350] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 series GPUs instead.
.. [#stanford-megatron-lm_compat] Stanford Megatron-LM is only supported on ROCm 6.3.0.
.. [#megablocks_compat] Megablocks is only supported on ROCm 6.3.0.
.. [#llama-cpp_compat] llama.cpp is only supported on ROCm 7.0.0 and 6.4.x.
.. [#db-mi300x] **For ROCm 7.0.2** - Debian 13 is supported only on AMD Instinct MI300X GPUs.
.. [#sles-db-700] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs.
.. [#az-mi300x] Starting ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710 GPUs.
.. [#rl-700] Rocky Linux 9 is supported only on AMD Instinct MI300X and MI300A GPUs.
.. [#single-node] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X GPUs for single-node functionality.
.. [#mi350x-os] AMD Instinct MI355X (gfx950) and MI350X(gfx950) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, Oracle Linux 10, and Oracle Linux 9.
.. [#RDNA-OS-700] **For ROCm 7.0.x** - AMD Radeon PRO AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 9060 (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, and RHEL 9.6.
.. [#rd-v710] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, and Azure Linux 3.0.
.. [#rd-v620] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) GPUs are supported only on Ubuntu 24.04.3 and Ubuntu 22.04.5.
.. [#mi325x-os] **For ROCm 7.0.x** - AMD Instinct MI325X GPUs (gfx942) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#mi300x-os] **For ROCm 7.0.x** - AMD Instinct MI300X GPUs (gfx942) are supported on all listed :ref:`supported_distributions`.
.. [#mi300A-os] **For ROCm 7.0.x** - AMD Instinct MI300A GPUs (gfx942) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9.
.. [#mi200x-os] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12.
.. [#mi100-os] **For ROCm 7.0.x** - AMD Instinct MI100 GPUs (gfx908) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, and RHEL 8.10.
.. [#tf-mi350] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead.
.. [#dgl_compat] DGL is supported only on ROCm 6.4.0.
.. [#llama-cpp_compat] llama.cpp is supported only on ROCm 7.0.0 and ROCm 6.4.x.
.. [#driver_patch] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
.. [#ROCT-rocr] Starting from ROCm 6.3.0, the ROCT Thunk Interface is included as part of the ROCr runtime package.
@@ -201,6 +200,8 @@ Use this lookup table to confirm which operating system and kernel versions are
,,
`Ubuntu <https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle>`_, 22.04.5, "5.15 [GA], 6.8 [HWE]", 2.35
,,
`Red Hat Enterprise Linux (RHEL 10) <https://access.redhat.com/articles/3078#RHEL9>`_, 10.0, 6.12.0-55, 2.39
,,
`Red Hat Enterprise Linux (RHEL 9) <https://access.redhat.com/articles/3078#RHEL9>`_, 9.6, 5.14.0-570, 2.34
,9.5, 5.14+, 2.34
,9.4, 5.14.0-427, 2.34
@@ -213,10 +214,12 @@ Use this lookup table to confirm which operating system and kernel versions are
,,
`Rocky Linux <https://wiki.rockylinux.org/rocky/version/>`_, 9, 5.14.0-570, 2.34
,,
`Oracle Linux <https://blogs.oracle.com/scoter/post/oracle-linux-and-unbreakable-enterprise-kernel-uek-releases>`_, 9, 6.12.0 (UEK), 2.34
`Oracle Linux <https://blogs.oracle.com/scoter/post/oracle-linux-and-unbreakable-enterprise-kernel-uek-releases>`_, 10, 6.12.0 (UEK), 2.39
,9, 6.12.0 (UEK), 2.34
,8, 5.15.0 (UEK), 2.28
,,
`Debian <https://www.debian.org/download>`_,12, 6.1.0, 2.36
`Debian <https://www.debian.org/download>`_,13, 6.12, 2.35
,12, 6.1.0, 2.36
,,
`Azure Linux <https://techcommunity.microsoft.com/blog/linuxandopensourceblog/azure-linux-3-0-now-in-preview-on-azure-kubernetes-service-v1-31/4287229>`_,3.0, 6.6.92, 2.38
,,
@@ -251,43 +254,46 @@ Expand for full historical view of:
.. rubric:: Footnotes
.. [#rhel-700-past-60] **For ROCm 7.0.x** - RHEL 8.10 is only supported on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs.
.. [#ol-700-mi300x-past-60] **For ROCm 7.0.x** - Oracle Linux 9 is supported only on AMD Instinct MI300X, MI350X, and MI355X. Oracle Linux 8 is only supported on AMD Instinct MI300X.
.. [#mi300x-past-60] **Prior ROCm 7.0.0** - Oracle Linux is supported only on AMD Instinct MI300X.
.. [#sles-db-700-past-60] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are only supported on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs.
.. [#single-node-past-60] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X for single-node functionality.
.. [#az-mi300x-past-60] Starting from ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710.
.. [#az-mi300x-630-past-60] **Prior ROCm 6.4.0**- Azure Linux 3.0 is supported only on AMD Instinct MI300X.
.. [#rl-700-past-60] Rocky Linux 9 is only supported on AMD Instinct MI300X and MI300A GPUs.
.. [#mi350x-os-past-60] AMD Instinct MI355X (gfx950) and MI350X(gfx950) GPUs are only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and Oracle Linux 9.
.. [#RDNA-OS-700-past-60] **For ROCm 7.0.x** AMD Radeon PRO AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, and RHEL 9.6.
.. [#rhel-10-702-past-60] RHEL 10.0 and RHEL 9.6 are supported on all listed :ref:`supported_GPUs` except AMD Radeon PRO V620 GPU.
.. [#rhel-94-702-past-60] RHEL 9.4 is supported on all AMD Instinct GPUs listed under :ref:`supported_GPUs`.
.. [#rhel-700-past-60] **For ROCm 7.0.x** - RHEL 8.10 is supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, MI210, and MI100 GPUs.
.. [#ol-700-mi300x-past-60] **For ROCm 7.0.x** - Oracle Linux 10 and 9 are supported only on AMD Instinct MI355X, MI350X, and MI300X GPUs. Oracle Linux 8 is supported only on AMD Instinct MI300X GPU.
.. [#mi300x-past-60] **Prior to ROCm 7.0.0** - Oracle Linux is supported only on AMD Instinct MI300X GPUs.
.. [#db-mi300x-past-60] **For ROCm 7.0.2** - Debian 13 is supported only on AMD Instinct MI300X GPUs.
.. [#sles-db-700-past-60] **For ROCm 7.0.x** - SLES 15 SP7 and Debian 12 are supported only on AMD Instinct MI300X, MI300A, MI250X, MI250, and MI210 GPUs.
.. [#single-node-past-60] **Prior to ROCm 7.0.0** - Debian 12 is supported only on AMD Instinct MI300X GPUs for single-node functionality.
.. [#az-mi300x-past-60] Starting from ROCm 6.4.0, Azure Linux 3.0 is supported only on AMD Instinct MI300X and AMD Radeon PRO V710 GPUs.
.. [#az-mi300x-630-past-60] **Prior to ROCm 6.4.0** - Azure Linux 3.0 is supported only on AMD Instinct MI300X GPUs.
.. [#rl-700-past-60] Rocky Linux 9 is supported only on AMD Instinct MI300X and MI300A GPUs.
.. [#mi350x-os-past-60] AMD Instinct MI355X (gfx950) and MI350X (gfx950) GPUs are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and Oracle Linux 9.
.. [#RDNA-OS-700-past-60] **For ROCm 7.0.x** - AMD Radeon AI PRO R9700 (gfx1201), AMD Radeon RX 9070 XT (gfx1201), AMD Radeon RX 9070 GRE (gfx1201), AMD Radeon RX 9070 (gfx1201), AMD Radeon RX 9060 XT (gfx1200), AMD Radeon RX 9060 (gfx1200), AMD Radeon RX 7800 XT (gfx1101), AMD Radeon RX 7700 XT (gfx1101), AMD Radeon PRO W7700 (gfx1101), and AMD Radeon PRO W6800 (gfx1030) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, Oracle Linux 10, and Oracle Linux 9.
.. [#RDNA-OS-past-60] **Prior to ROCm 7.0.0** - Radeon AI PRO R9700, Radeon RX 9070 XT (gfx1201), Radeon RX 9060 XT (gfx1200), Radeon PRO W7700 (gfx1101), and Radeon RX 7800 XT (gfx1101) are supported only on Ubuntu 24.04.2, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#rd-v710-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and Azure Linux 3.0.
.. [#rd-v620-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) is only supported on Ubuntu 24.04.3 and Ubuntu 22.04.5.
.. [#mi325x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI325X GPU (gfx942) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#rd-v710-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V710 (gfx1101) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, and Azure Linux 3.0.
.. [#rd-v620-past-60] **For ROCm 7.0.x** - AMD Radeon PRO V620 (gfx1030) is supported only on Ubuntu 24.04.3 and Ubuntu 22.04.5.
.. [#mi325x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI325X GPU (gfx942) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, and RHEL 9.4.
.. [#mi300x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI300X GPU (gfx942) is supported on all listed :ref:`supported_distributions`.
.. [#mi300A-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI300A GPU (gfx942) is supported only on Ubuntu 24.04, Ubuntu 22.04, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9.
.. [#mi200x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04, Ubuntu 22.04, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12.
.. [#mi100-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI100 GPU (gfx908) is only supported on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 9.6, RHEL 9.4, and RHEL 8.10.
.. [#7700XT-OS-past-60] Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6.
.. [#mi300A-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI300A GPU (gfx942) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, Debian 12, and Rocky Linux 9.
.. [#mi200x-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI200 Series GPUs (gfx90a) are supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, RHEL 8.10, SLES 15 SP7, and Debian 12.
.. [#mi100-os-past-60] **For ROCm 7.0.x** - AMD Instinct MI100 GPU (gfx908) is supported only on Ubuntu 24.04.3, Ubuntu 22.04.5, RHEL 10.0, RHEL 9.6, RHEL 9.4, and RHEL 8.10.
.. [#7700XT-OS-past-60] **Prior to ROCm 7.0.0** - Radeon RX 7700 XT (gfx1101) is supported only on Ubuntu 24.04.2 and RHEL 9.6.
.. [#mi300_624-past-60] **For ROCm 6.2.4** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_622-past-60] **For ROCm 6.2.2** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_621-past-60] **For ROCm 6.2.1** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_620-past-60] **For ROCm 6.2.0** - MI300X (gfx942) is supported on listed operating systems *except* Ubuntu 22.04.5 [6.8 HWE] and Ubuntu 22.04.4 [6.5 HWE].
.. [#mi300_612-past-60] **For ROCm 6.1.2** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.4 and Oracle Linux.
.. [#mi300_611-past-60] **For ROCm 6.1.1** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.4 and Oracle Linux.
.. [#mi300_610-past-60] **For ROCm 6.1.0** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.4.
.. [#mi300_602-past-60] **For ROCm 6.0.2** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.3.
.. [#mi300_600-past-60] **For ROCm 6.0.0** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is only supported on Ubuntu 22.04.3.
.. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 series GPUs instead.
.. [#verl_compat-past-60] verl is only supported on ROCm 6.2.0.
.. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is only supported on ROCm 6.3.0.
.. [#dgl_compat-past-60] DGL is only supported on ROCm 6.4.0.
.. [#megablocks_compat-past-60] Megablocks is only supported on ROCm 6.3.0.
.. [#taichi_compat-past-60] Taichi is only supported on ROCm 6.3.2.
.. [#ray_compat-past-60] Ray is only supported on ROCm 6.4.1.
.. [#llama-cpp_compat-past-60] llama.cpp is only supported on ROCm 7.0.0 and 6.4.x.
.. [#flashinfer_compat-past-60] FlashInfer is only supported on ROCm 6.4.1.
.. [#mi300_612-past-60] **For ROCm 6.1.2** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4 and Oracle Linux.
.. [#mi300_611-past-60] **For ROCm 6.1.1** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4 and Oracle Linux.
.. [#mi300_610-past-60] **For ROCm 6.1.0** - MI300A (gfx942) is supported on Ubuntu 22.04.4, RHEL 9.4, RHEL 9.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.4.
.. [#mi300_602-past-60] **For ROCm 6.0.2** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.3.
.. [#mi300_600-past-60] **For ROCm 6.0.0** - MI300A (gfx942) is supported on Ubuntu 22.04.3, RHEL 8.9, and SLES 15 SP5. MI300X (gfx942) is supported only on Ubuntu 22.04.3.
.. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead.
.. [#verl_compat-past-60] verl is supported only on ROCm 6.2.0.
.. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is supported only on ROCm 6.3.0.
.. [#dgl_compat-past-60] DGL is supported only on ROCm 6.4.0.
.. [#megablocks_compat-past-60] Megablocks is supported only on ROCm 6.3.0.
.. [#taichi_compat-past-60] Taichi is supported only on ROCm 6.3.2.
.. [#ray_compat-past-60] Ray is supported only on ROCm 6.4.1.
.. [#llama-cpp_compat-past-60] llama.cpp is supported only on ROCm 7.0.0 and 6.4.x.
.. [#flashinfer_compat-past-60] FlashInfer is supported only on ROCm 6.4.1.
.. [#driver_patch-past-60] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support-past-60] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
.. [#ROCT-rocr-past-60] Starting from ROCm 6.3.0, the ROCT Thunk Interface is included as part of the ROCr runtime package.

View File

@@ -79,7 +79,7 @@ Use cases and recommendations
* The `MI300X workload optimization guide <https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html>`_
provides detailed guidance on optimizing workloads for the AMD Instinct MI300X
accelerator using ROCm. The page is aimed at helping users achieve optimal
GPU using ROCm. The page is aimed at helping users achieve optimal
performance for deep learning and other high-performance computing tasks on
the MI300X GPU.

View File

@@ -73,9 +73,9 @@ Use cases and recommendations
* The :doc:`Instinct MI300X workload optimization guide </how-to/rocm-for-ai/inference-optimization/workload>`
provides detailed guidance on optimizing workloads for the AMD Instinct MI300X
accelerator using ROCm. This guide helps users achieve optimal performance for
GPU using ROCm. This guide helps users achieve optimal performance for
deep learning and other high-performance computing tasks on the MI300X
accelerator.
GPU.
* The :doc:`Inception with PyTorch documentation </conceptual/ai-pytorch-inception>`
describes how PyTorch integrates with ROCm for AI workloads. It outlines the
@@ -417,7 +417,7 @@ Key features and enhancements for PyTorch 2.7 with ROCm 7.0
- Expanded GPU architecture support: Provides optimized support for newer GPU
architectures, including gfx1200 and gfx1201 with preferred hipBLASLt backend
selection, along with improvements for gfx950 and gfx1100 series GPUs.
selection, along with improvements for gfx950 and gfx1100 Series GPUs.
- Advanced Triton Integration: AOTriton 0.10b introduces official support for
gfx950 and gfx1201, along with experimental support for gfx1101, gfx1151,

View File

@@ -30,8 +30,8 @@ visual effects in film and gaming, and general-purpose computing.
Supported devices and features
===============================================================================
There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210X series GPUs with the exception of Taichi's GPU rendering system, CGUI.
AMD Instinct MI300X series GPUs will be supported by November.
There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210X Series GPUs with the exception of Taichi's GPU rendering system, CGUI.
AMD Instinct MI300X Series GPUs will be supported by November.
.. _taichi-recommendations:

View File

@@ -13,22 +13,22 @@
:gutter: 1
:::{grid-item-card}
**AMD Instinct MI300 series**
**AMD Instinct MI300 Series**
Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3
Review hardware aspects of the AMD Instinct™ MI300 Series GPUs and the CDNA™ 3
architecture.
* [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md)
* [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)
* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf)
* [MI300 performance counters](./gpu-arch/mi300-mi200-performance-counters.rst)
* [MI350 series performance counters](./gpu-arch/mi350-performance-counters.rst)
* [MI350 Series performance counters](./gpu-arch/mi350-performance-counters.rst)
:::
:::{grid-item-card}
**AMD Instinct MI200 series**
**AMD Instinct MI200 Series**
Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2
Review hardware aspects of the AMD Instinct™ MI200 Series GPUs and the CDNA™ 2
architecture.
* [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md)
@@ -41,7 +41,7 @@ architecture.
:::{grid-item-card}
**AMD Instinct MI100**
Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1
Review hardware aspects of the AMD Instinct™ MI100 Series GPUs and the CDNA™ 1
architecture.
* [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md)

View File

@@ -1,14 +1,14 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI100 series architecture."
"description lang=en": "Learn about the AMD Instinct MI100 Series architecture."
"keywords": "Instinct, MI100, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI100 microarchitecture
The following image shows the node-level architecture of a system that
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ GPUs.
The two EPYC processors are connected to each other with the AMD Infinity™
fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links such
that each processor can access the available node memory as a single
@@ -18,29 +18,29 @@ available to connect the processors plus one PCIe Gen 4 x16 link per processor
can attach additional I/O devices such as the host adapters for the network
fabric.
![Structure of a single GCD in the AMD Instinct MI100 accelerator](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.")
![Structure of a single GCD in the AMD Instinct MI100 GPU](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ GPUs.")
In a typical node configuration, each processor can host up to four AMD
Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
Instinct™ GPUs that are attached using PCIe Gen 4 links at 16 GT/sec,
which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive
of four accelerators can participate in a fully connected, coherent AMD
Instinct™ fabric that connects the four accelerators using 23 GT/sec AMD
of four GPUs can participate in a fully connected, coherent AMD
Instinct™ fabric that connects the four GPUs using 23 GT/sec AMD
Infinity fabric links that run at a higher frequency than the inter-processor
links. This inter-GPU link can be established in certified server systems if the
GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity
Fabric™ bridge for the AMD Instinct™ accelerators.
Fabric™ bridge for the AMD Instinct™ GPUs.
## Microarchitecture
The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA
The microarchitecture of the AMD Instinct GPUs is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the world's largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.
![Structure of the AMD Instinct accelerator (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct accelerator (MI100 generation)")
![Structure of the AMD Instinct GPU (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct GPU (MI100 generation)")
The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16
The above image shows the AMD Instinct GPU with its PCIe Gen 4 x16
link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
processor(s). It also shows the three AMD Infinity Fabric ports that provide
high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
@@ -48,7 +48,7 @@ hive.
On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPU memory controller. The MI100 generation of the AMD
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
Instinct GPU offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the
attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
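
The quoted HBM2 figure can be cross-checked with a little arithmetic. The sketch below is a rough check rather than an official formula: it takes the bus width and memory clock from the text and assumes double data rate signaling (two transfers per clock per pin). The function name and constants are illustrative only.

```python
# Rough cross-check of the MI100 HBM2 peak bandwidth quoted in the text.
BUS_WIDTH_BITS = 4096      # memory interface width, from the text
MEMORY_CLOCK_HZ = 1.2e9    # memory clock frequency, from the text
TRANSFERS_PER_CLOCK = 2    # assumption: double data rate signaling


def peak_bandwidth_tb_per_s(bus_width_bits, clock_hz, transfers_per_clock):
    """Return the theoretical peak bandwidth in TB/sec (1 TB = 1e12 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_hz * transfers_per_clock / 1e12


hbm2_peak = peak_bandwidth_tb_per_s(
    BUS_WIDTH_BITS, MEMORY_CLOCK_HZ, TRANSFERS_PER_CLOCK
)
print(f"{hbm2_peak:.4f} TB/sec")
```

Under these assumptions the result is 1.2288 TB/sec, which agrees with the 1.228 TB/sec stated above when that figure is read as truncated to three decimal places.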
@@ -64,7 +64,7 @@ Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
![Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture](../../data/conceptual/gpu-arch/image006.png "An MI100 compute unit with detailed SIMD view of the AMD CDNA architecture")
The preceding image shows the block diagram of a single CU of an AMD Instinct™
MI100 accelerator and summarizes how instructions flow through the execution
MI100 GPU and summarizes how instructions flow through the execution
engines. The CU fetches the instructions via a 32 KB instruction cache and moves
them forward to execution via a dispatcher. The CU can handle up to ten
wavefronts at a time and feed their instructions into the execution unit. The

View File

@@ -1,13 +1,13 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI250 series architecture."
"description lang=en": "Learn about the AMD Instinct MI250 Series architecture."
"keywords": "Instinct, MI250, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI250 microarchitecture
The microarchitecture of the AMD Instinct MI250 accelerators is based on the
The microarchitecture of the AMD Instinct MI250 GPU is based on the
AMD CDNA 2 architecture that targets compute applications such as HPC,
artificial intelligence (AI), and machine learning (ML) that run on
everything from individual servers to the world's largest exascale
@@ -40,7 +40,7 @@ execution units (also called matrix cores), which are geared toward executing
matrix operations like matrix-matrix multiplications. For FP64, the peak
performance of these units amounts to 90.5 TFLOPS.
![Structure of a single GCD in the AMD Instinct MI250 accelerator.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 accelerator.")
![Structure of a single GCD in the AMD Instinct MI250 GPU.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 GPU.")
```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
@@ -84,16 +84,9 @@ performance of these units amounts to 90.5 TFLOPS.
- 362.1
```
The above table summarizes the aggregated peak performance of the AMD
Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute
Platform) and its two GCDs for different data types and execution units. The
middle column lists the peak performance (number of data elements processed in a
single instruction) of a single compute unit if a SIMD (or matrix) instruction
is being retired in each clock cycle. The third column lists the theoretical
peak performance of the OAM module. The theoretical aggregated peak memory
bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).
The above table summarizes the aggregated peak performance of the AMD Instinct MI250 Open Compute Platform (OCP) Open Accelerator Module (OAM) and its two GCDs for different data types and execution units. The middle column lists the peak performance (number of data elements processed in a single instruction) of a single compute unit if a SIMD (or matrix) instruction is being retired in each clock cycle. The third column lists the theoretical peak performance of the OAM. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).
![Dual-GCD architecture of the AMD Instinct MI250 accelerators](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 accelerators")
![Dual-GCD architecture of the AMD Instinct MI250 GPUs](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 GPUs")
The following image shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
@@ -105,18 +98,18 @@ between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of
## Node-level architecture
The following image shows the node-level architecture of a system that is
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host
based on the AMD Instinct MI250 GPU. The MI250 OAMs attach to the host
system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe
x16 link to the host part of the system. Depending on the server platform, the
GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch.
Note that some platforms may offer an x8 interface to the GCDs, which reduces
the available host-to-GPU bandwidth.
![Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor")
![Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor")
The preceding image shows the node-level architecture of a system with AMD
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
accelerators. The MI250 OAMs attach to the host processors system via PCIe Gen 4
GPUs. The MI250 OAMs attach to the host processors system via PCIe Gen 4
x16 links (yellow lines). Depending on the system design, a PCIe switch may
exist to make more PCIe lanes available for additional components like network
interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to

View File

@@ -1,16 +1,16 @@
.. meta::
:description: MI300 and MI200 series performance counters and metrics
:description: MI300 and MI200 Series performance counters and metrics
:keywords: MI300, MI200, performance counters, command processor counters
***************************************************************************************************
MI300 and MI200 series performance counters and metrics
MI300 and MI200 Series performance counters and metrics
***************************************************************************************************
This document lists and describes the hardware performance counters and derived metrics available
for the AMD Instinct™ MI300 and MI200 GPUs. You can also access this information using the
:doc:`ROCprofiler-SDK <rocprofiler-sdk:how-to/using-rocprofv3>`.
MI300 and MI200 series performance counters
MI300 and MI200 Series performance counters
===============================================================
Series performance counters include the following categories:
@@ -27,7 +27,7 @@ The following sections provide additional details for each category.
.. note::
Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with
Preliminary validation of all MI300 and MI200 Series performance counters is in progress. Those with
an asterisk (*) require further evaluation.
.. _command-processor-counters:
@@ -171,7 +171,7 @@ Instruction mix
"``SQ_INSTS_SMEM``", "Instr", "Number of scalar memory instructions issued"
"``SQ_INSTS_SMEM_NORM``", "Instr", "Number of scalar memory instructions normalized to match ``smem_level`` issued"
"``SQ_INSTS_FLAT``", "Instr", "Number of flat instructions issued"
"``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled."
"``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 Series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled."
"``SQ_INSTS_LDS``", "Instr", "Number of LDS instructions issued **(MI200: includes flat; MI300: does not include flat)**"
"``SQ_INSTS_GDS``", "Instr", "Number of global data share instructions issued"
"``SQ_INSTS_EXP_GDS``", "Instr", "Number of EXP and global data share instructions excluding skipped export instructions issued"
@@ -396,9 +396,9 @@ Texture cache per pipe counters
"``TCP_UTCL1_TRANSLATION_MISS[n]``", "Req", "Number of unified translation cache (L1) translation misses", "0-15"
"``TCP_UTCL1_PERMISSION_MISS[n]``", "Req", "Number of unified translation cache (L1) permission misses", "0-15"
"``TCP_TOTAL_CACHE_ACCESSES[n]``", "Req", "Number of vector L1d cache accesses including hits and misses", "0-15"
"``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15"
"``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15"
"``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15"
"``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 Series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15"
"``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15"
"``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15"
"``TCP_TCC_READ_REQ[n]``", "Req", "Number of read requests to L2 cache", "0-15"
"``TCP_TCC_WRITE_REQ[n]``", "Req", "Number of write requests to L2 cache", "0-15"
"``TCP_TCC_ATOMIC_WITH_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache with return", "0-15"
@@ -560,7 +560,7 @@ Note the following:
``TCC_TAG_STALL[n]``, probes can stall the pipeline at a variety of places. There is no single point that
can accurately measure the total stalls.
MI300 and MI200 series derived metrics list
MI300 and MI200 Series derived metrics list
==============================================================
.. csv-table::

View File

@@ -1,21 +1,21 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI300 series architecture."
"description lang=en": "Learn about the AMD Instinct MI300 Series architecture."
"keywords": "Instinct, MI300X, MI300A, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI300 series microarchitecture
# AMD Instinct™ MI300 Series microarchitecture
The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3
The AMD Instinct MI300 Series GPUs are based on the AMD CDNA 3
architecture which was designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine
learning (ML) workloads. The AMD Instinct MI300 series accelerators are well-suited for extreme scalability and compute performance, running
learning (ML) workloads. The AMD Instinct MI300 Series GPUs are well-suited for extreme scalability and compute performance, running
on everything from individual servers to the world's largest exascale supercomputers.
With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the
With the MI300 Series, AMD is introducing the Accelerator Complex Die (XCD), which contains the
GPU computational elements of the processor along with the lower levels of the cache hierarchy.
The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series.
The following image depicts the structure of a single XCD in the AMD Instinct MI300 GPU Series.
```{figure} ../../data/shared/xcd-sys-arch.png
---
@@ -39,7 +39,7 @@ infrastructure) using the AMD Infinity Fabric™ technology as interconnect.
The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine
learning, enhancing throughput of existing data types while adding support for new data types.
CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X
accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a
GPUs, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a
performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32,
while TF32 has a gain of 4 times compared to FP32.
@@ -105,7 +105,7 @@ name: mi300-arch
alt:
align: center
---
MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.
MI300 Series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.
```
## Node-level architecture
@@ -116,11 +116,11 @@ name: mi300-node
align: center
---
MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIEe switches via retimers and HGX connectors.
MI300 Series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via retimers and HGX connectors.
```
The image above shows the node-level architecture of a system with AMD EPYC processors in a
dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the
dual-socket configuration and eight AMD Instinct MI300X GPUs. The MI300X OAMs attach to the
host system via PCIe Gen 5 x16 links (yellow lines). The GPUs are using seven high-bandwidth,
low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system.
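
As a quick sanity check on the link count described above, the following sketch counts the peer links in a fully connected topology. It assumes exactly one direct link per GPU pair; the function name `fully_connected_links` is illustrative only and is not part of any ROCm tool.

```python
from itertools import combinations


def fully_connected_links(num_gpus: int) -> tuple[int, int]:
    """Return (links per GPU, total peer links) for a full mesh.

    Assumption: one direct link between every pair of GPUs.
    """
    if num_gpus < 2:
        return 0, 0
    # Every unordered pair of GPUs shares one direct link.
    pairs = list(combinations(range(num_gpus), 2))
    return num_gpus - 1, len(pairs)


links_per_gpu, total_links = fully_connected_links(8)
print(links_per_gpu, total_links)  # 7 28
```

For the 8-GPU node, each GPU needs seven peer links, which matches the seven Infinity Fabric links per GPU mentioned in the text.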


@@ -1,12 +1,12 @@
.. meta::
:description: MI355 series performance counters and metrics
:description: MI355 Series performance counters and metrics
:keywords: MI355, MI355X, MI3XX
***********************************
MI350 series performance counters
MI350 Series performance counters
***********************************
This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using `ROCprofiler-SDK <https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/index.html>`_ and `ROCm Compute Profiler <https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/>`_.
This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 GPUs. These counters are available for profiling using `ROCprofiler-SDK <https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/index.html>`_ and `ROCm Compute Profiler <https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/>`_.
The following sections list the performance counters based on the IP blocks.


@@ -89,15 +89,15 @@ project = "ROCm Documentation"
project_path = os.path.abspath(".").replace("\\", "/")
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = "7.0.1"
release = "7.0.1"
version = "7.0.2"
release = "7.0.2"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_author = ""
# pages with specific settings
article_pages = [
{"file": "about/release-notes", "os": ["linux"], "date": "2025-09-17"},
{"file": "about/release-notes", "os": ["linux"], "date": "2025-10-10"},
{"file": "release/changelog", "os": ["linux"],},
{"file": "compatibility/compatibility-matrix", "os": ["linux"]},
{"file": "compatibility/ml-compatibility/pytorch-compatibility", "os": ["linux"]},
@@ -234,7 +234,7 @@ suppress_warnings = ["autosectionlabel.*"]
html_context = {
"project_path" : {project_path},
"gpu_type" : [('AMD Instinct accelerators', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ],
"gpu_type" : [('AMD Instinct GPUs', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ],
"atomics_type" : [('HW atomics', 'hw-atomics'), ('CAS emulation', 'cas-atomics')],
"pcie_type" : [('No PCIe atomics', 'nopcie'), ('PCIe atomics', 'pcie')],
"memory_type" : [('Device DRAM', 'device-dram'), ('Migratable Host DRAM', 'migratable-host-dram'), ('Pinned Host DRAM', 'pinned-host-dram')],


@@ -1,47 +1,16 @@
dockers:
- pull_tag: rocm/jax-training:maxtext-v25.7-jax060
- pull_tag: rocm/jax-training:maxtext-v25.9
docker_hub_url: https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7/images/sha256-45f4c727d4019a63fc47313d3a5f5a5105569539294ddfd2d742218212ae9025
components:
ROCm: 6.4.1
JAX: 0.6.0
Python: 3.10.12
Transformer Engine: 2.1.0+90d703dd
hipBLASLt: 1.1.0-499ece1c21
- pull_tag: rocm/jax-training:maxtext-v25.7
docker_hub_url: https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7/images/sha256-45f4c727d4019a63fc47313d3a5f5a5105569539294ddfd2d742218212ae9025
components:
ROCm: 6.4.1
JAX: 0.5.0
Python: 3.10.12
Transformer Engine: 2.1.0+90d703dd
ROCm: 7.0.0
JAX: 0.6.2
Python: 3.10.18
Transformer Engine: 2.2.0.dev0+c91bac54
hipBLASLt: 1.x.x
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: jax_maxtext_train_llama-3.3-70b
model_repo: Llama-3.3-70B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.1 8B
mad_tag: jax_maxtext_train_llama-3.1-8b
model_repo: Llama-3.1-8B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.1 70B
mad_tag: jax_maxtext_train_llama-3.1-70b
model_repo: Llama-3.1-70B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3 8B
mad_tag: jax_maxtext_train_llama-3-8b
multinode_training_script: llama3_8b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 3 70B
mad_tag: jax_maxtext_train_llama-3-70b
multinode_training_script: llama3_70b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 2 7B
mad_tag: jax_maxtext_train_llama-2-7b
model_repo: Llama-2-7B
@@ -54,6 +23,29 @@ model_groups:
precision: bf16
multinode_training_script: llama2_70b_multinode.sh
doc_options: ["single-node", "multi-node"]
- model: Llama 3 8B (multi-node)
mad_tag: jax_maxtext_train_llama-3-8b
multinode_training_script: llama3_8b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 3 70B (multi-node)
mad_tag: jax_maxtext_train_llama-3-70b
multinode_training_script: llama3_70b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 3.1 8B
mad_tag: jax_maxtext_train_llama-3.1-8b
model_repo: Llama-3.1-8B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.1 70B
mad_tag: jax_maxtext_train_llama-3.1-70b
model_repo: Llama-3.1-70B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.3 70B
mad_tag: jax_maxtext_train_llama-3.3-70b
model_repo: Llama-3.3-70B
precision: bf16
doc_options: ["single-node"]
- group: DeepSeek
tag: deepseek
models:


@@ -1,14 +1,21 @@
dockers:
- pull_tag: rocm/megatron-lm:v25.8_py310
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.8_py310/images/sha256-50fc824361054e445e86d5d88d5f58817f61f8ec83ad4a7e43ea38bbc4a142c0
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
MI355X and MI350X:
pull_tag: rocm/megatron-lm:v25.9_gfx950
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6
components: &docker_components
ROCm: 7.0.0
Primus: aab4234
PyTorch: 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
hipBLASLt: d1b517fc7a
Triton: 3.3.0
RCCL: 2.22.3
Flash Attention: 2.8.3
hipBLASLt: 911283acd1
Triton: 3.4.0+rocm7.0.0.git56765e8c
RCCL: 2.26.6
MI325X and MI300X:
pull_tag: rocm/megatron-lm:v25.9_gfx942
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357
components: *docker_components
model_groups:
- group: Meta Llama
tag: llama
@@ -19,8 +26,6 @@ model_groups:
mad_tag: pyt_megatron_lm_train_llama-3.1-8b
- model: Llama 3.1 70B
mad_tag: pyt_megatron_lm_train_llama-3.1-70b
- model: Llama 3.1 70B (proxy)
mad_tag: pyt_megatron_lm_train_llama-3.1-70b-proxy
- model: Llama 2 7B
mad_tag: pyt_megatron_lm_train_llama-2-7b
- model: Llama 2 70B


@@ -0,0 +1,72 @@
dockers:
- pull_tag: rocm/jax-training:maxtext-v25.7-jax060
docker_hub_url: https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7/images/sha256-45f4c727d4019a63fc47313d3a5f5a5105569539294ddfd2d742218212ae9025
components:
ROCm: 6.4.1
JAX: 0.6.0
Python: 3.10.12
Transformer Engine: 2.1.0+90d703dd
hipBLASLt: 1.1.0-499ece1c21
- pull_tag: rocm/jax-training:maxtext-v25.7
docker_hub_url: https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7/images/sha256-45f4c727d4019a63fc47313d3a5f5a5105569539294ddfd2d742218212ae9025
components:
ROCm: 6.4.1
JAX: 0.5.0
Python: 3.10.12
Transformer Engine: 2.1.0+90d703dd
hipBLASLt: 1.x.x
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: jax_maxtext_train_llama-3.3-70b
model_repo: Llama-3.3-70B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.1 8B
mad_tag: jax_maxtext_train_llama-3.1-8b
model_repo: Llama-3.1-8B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3.1 70B
mad_tag: jax_maxtext_train_llama-3.1-70b
model_repo: Llama-3.1-70B
precision: bf16
doc_options: ["single-node"]
- model: Llama 3 8B
mad_tag: jax_maxtext_train_llama-3-8b
multinode_training_script: llama3_8b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 3 70B
mad_tag: jax_maxtext_train_llama-3-70b
multinode_training_script: llama3_70b_multinode.sh
doc_options: ["multi-node"]
- model: Llama 2 7B
mad_tag: jax_maxtext_train_llama-2-7b
model_repo: Llama-2-7B
precision: bf16
multinode_training_script: llama2_7b_multinode.sh
doc_options: ["single-node", "multi-node"]
- model: Llama 2 70B
mad_tag: jax_maxtext_train_llama-2-70b
model_repo: Llama-2-70B
precision: bf16
multinode_training_script: llama2_70b_multinode.sh
doc_options: ["single-node", "multi-node"]
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V2-Lite (16B)
mad_tag: jax_maxtext_train_deepseek-v2-lite-16b
model_repo: DeepSeek-V2-lite
precision: bf16
doc_options: ["single-node"]
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: jax_maxtext_train_mixtral-8x7b
model_repo: Mixtral-8x7B
precision: bf16
doc_options: ["single-node"]


@@ -0,0 +1,48 @@
dockers:
- pull_tag: rocm/megatron-lm:v25.8_py310
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.8_py310/images/sha256-50fc824361054e445e86d5d88d5f58817f61f8ec83ad4a7e43ea38bbc4a142c0
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
hipBLASLt: d1b517fc7a
Triton: 3.3.0
RCCL: 2.22.3
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: pyt_megatron_lm_train_llama-3.3-70b
- model: Llama 3.1 8B
mad_tag: pyt_megatron_lm_train_llama-3.1-8b
- model: Llama 3.1 70B
mad_tag: pyt_megatron_lm_train_llama-3.1-70b
- model: Llama 3.1 70B (proxy)
mad_tag: pyt_megatron_lm_train_llama-3.1-70b-proxy
- model: Llama 2 7B
mad_tag: pyt_megatron_lm_train_llama-2-7b
- model: Llama 2 70B
mad_tag: pyt_megatron_lm_train_llama-2-70b
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: pyt_megatron_lm_train_deepseek-v3-proxy
- model: DeepSeek-V2-Lite
mad_tag: pyt_megatron_lm_train_deepseek-v2-lite-16b
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: pyt_megatron_lm_train_mixtral-8x7b
- model: Mixtral 8x22B (proxy)
mad_tag: pyt_megatron_lm_train_mixtral-8x22b-proxy
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: pyt_megatron_lm_train_qwen2.5-7b
- model: Qwen 2.5 72B
mad_tag: pyt_megatron_lm_train_qwen2.5-72b


@@ -0,0 +1,58 @@
dockers:
- pull_tag: rocm/megatron-lm:v25.8_py310
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.8_py310/images/sha256-50fc824361054e445e86d5d88d5f58817f61f8ec83ad4a7e43ea38bbc4a142c0
components:
ROCm: 6.4.3
Primus: 927a717
PyTorch: 2.8.0a0+gitd06a406
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
hipBLASLt: d1b517fc7a
Triton: 3.3.0
RCCL: 2.22.3
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.3 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.3-70b
config_name: llama3.3_70B-pretrain.yaml
- model: Llama 3.1 70B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-70b
config_name: llama3.1_70B-pretrain.yaml
- model: Llama 3.1 8B
mad_tag: primus_pyt_megatron_lm_train_llama-3.1-8b
config_name: llama3.1_8B-pretrain.yaml
- model: Llama 2 7B
mad_tag: primus_pyt_megatron_lm_train_llama-2-7b
config_name: llama2_7B-pretrain.yaml
- model: Llama 2 70B
mad_tag: primus_pyt_megatron_lm_train_llama-2-70b
config_name: llama2_70B-pretrain.yaml
- group: DeepSeek
tag: deepseek
models:
- model: DeepSeek-V3 (proxy)
mad_tag: primus_pyt_megatron_lm_train_deepseek-v3-proxy
config_name: deepseek_v3-pretrain.yaml
- model: DeepSeek-V2-Lite
mad_tag: primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
config_name: deepseek_v2_lite-pretrain.yaml
- group: Mistral AI
tag: mistral
models:
- model: Mixtral 8x7B
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x7b
config_name: mixtral_8x7B_v0.1-pretrain.yaml
- model: Mixtral 8x22B (proxy)
mad_tag: primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
config_name: mixtral_8x22B_v0.1-pretrain.yaml
- group: Qwen
tag: qwen
models:
- model: Qwen 2.5 7B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-7b
config_name: primus_qwen2.5_7B-pretrain.yaml
- model: Qwen 2.5 72B
mad_tag: primus_pyt_megatron_lm_train_qwen2.5-72b
config_name: qwen2.5_72B-pretrain.yaml


@@ -0,0 +1,24 @@
dockers:
- pull_tag: rocm/pytorch-training:v25.8
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
Python: 3.10.18
Transformer Engine: 2.2.0.dev0+a1e66aae
Flash Attention: 3.0.0.post1
hipBLASLt: 1.1.0-d1b517fc7a
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.1 8B
mad_tag: primus_pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
- model: Llama 3.1 70B
mad_tag: primus_pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B
precision: BF16


@@ -0,0 +1,178 @@
dockers:
- pull_tag: rocm/pytorch-training:v25.8
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
Python: 3.10.18
Transformer Engine: 2.2.0.dev0+a1e66aae
Flash Attention: 3.0.0.post1
hipBLASLt: 1.1.0-d1b517fc7a
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 4 Scout 17B-16E
mad_tag: pyt_train_llama-4-scout-17b-16e
model_repo: Llama-4-17B_16E
url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.3 70B
mad_tag: pyt_train_llama-3.3-70b
model_repo: Llama-3.3-70B
url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 3.2 1B
mad_tag: pyt_train_llama-3.2-1b
model_repo: Llama-3.2-1B
url: https://huggingface.co/meta-llama/Llama-3.2-1B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 3B
mad_tag: pyt_train_llama-3.2-3b
model_repo: Llama-3.2-3B
url: https://huggingface.co/meta-llama/Llama-3.2-3B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3.2 Vision 11B
mad_tag: pyt_train_llama-3.2-vision-11b
model_repo: Llama-3.2-Vision-11B
url: https://huggingface.co/meta-llama/Llama-3.2-11B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.2 Vision 90B
mad_tag: pyt_train_llama-3.2-vision-90b
model_repo: Llama-3.2-Vision-90B
url: https://huggingface.co/meta-llama/Llama-3.2-90B-Vision
precision: BF16
training_modes: [finetune_fw]
- model: Llama 3.1 8B
mad_tag: pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora, HF_pretrain]
- model: Llama 3.1 70B
mad_tag: pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
precision: BF16
training_modes: [pretrain, finetune_fw, finetune_lora]
- model: Llama 3.1 405B
mad_tag: pyt_train_llama-3.1-405b
model_repo: Llama-3.1-405B
url: https://huggingface.co/meta-llama/Llama-3.1-405B
precision: BF16
training_modes: [finetune_qlora]
- model: Llama 3 8B
mad_tag: pyt_train_llama-3-8b
model_repo: Llama-3-8B
url: https://huggingface.co/meta-llama/Meta-Llama-3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 3 70B
mad_tag: pyt_train_llama-3-70b
model_repo: Llama-3-70B
url: https://huggingface.co/meta-llama/Meta-Llama-3-70B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 7B
mad_tag: pyt_train_llama-2-7b
model_repo: Llama-2-7B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora, finetune_qlora]
- model: Llama 2 13B
mad_tag: pyt_train_llama-2-13b
model_repo: Llama-2-13B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Llama 2 70B
mad_tag: pyt_train_llama-2-70b
model_repo: Llama-2-70B
url: https://github.com/meta-llama/llama-models/tree/main/models/llama2
precision: BF16
training_modes: [finetune_lora, finetune_qlora]
- group: OpenAI
tag: openai
models:
- model: GPT OSS 20B
mad_tag: pyt_train_gpt_oss_20b
model_repo: GPT-OSS-20B
url: https://huggingface.co/openai/gpt-oss-20b
precision: BF16
training_modes: [HF_finetune_lora]
- model: GPT OSS 120B
mad_tag: pyt_train_gpt_oss_120b
model_repo: GPT-OSS-120B
url: https://huggingface.co/openai/gpt-oss-120b
precision: BF16
training_modes: [HF_finetune_lora]
- group: Qwen
tag: qwen
models:
- model: Qwen 3 8B
mad_tag: pyt_train_qwen3-8b
model_repo: Qwen3-8B
url: https://huggingface.co/Qwen/Qwen3-8B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 3 32B
mad_tag: pyt_train_qwen3-32b
model_repo: Qwen3-32
url: https://huggingface.co/Qwen/Qwen3-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 32B
mad_tag: pyt_train_qwen2.5-32b
model_repo: Qwen2.5-32B
url: https://huggingface.co/Qwen/Qwen2.5-32B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2.5 72B
mad_tag: pyt_train_qwen2.5-72b
model_repo: Qwen2.5-72B
url: https://huggingface.co/Qwen/Qwen2.5-72B
precision: BF16
training_modes: [finetune_lora]
- model: Qwen 2 1.5B
mad_tag: pyt_train_qwen2-1.5b
model_repo: Qwen2-1.5B
url: https://huggingface.co/Qwen/Qwen2-1.5B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- model: Qwen 2 7B
mad_tag: pyt_train_qwen2-7b
model_repo: Qwen2-7B
url: https://huggingface.co/Qwen/Qwen2-7B
precision: BF16
training_modes: [finetune_fw, finetune_lora]
- group: Stable Diffusion
tag: sd
models:
- model: Stable Diffusion XL
mad_tag: pyt_huggingface_stable_diffusion_xl_2k_lora_finetuning
model_repo: SDXL
url: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
precision: BF16
training_modes: [finetune_lora]
- group: Flux
tag: flux
models:
- model: FLUX.1-dev
mad_tag: pyt_train_flux
model_repo: Flux
url: https://huggingface.co/black-forest-labs/FLUX.1-dev
precision: BF16
training_modes: [pretrain]
- group: NCF
tag: ncf
models:
- model: NCF
mad_tag: pyt_ncf_training
model_repo:
url: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF
precision: FP32


@@ -1,15 +1,22 @@
dockers:
- pull_tag: rocm/megatron-lm:v25.8_py310
docker_hub_url: https://hub.docker.com/layers/rocm/megatron-lm/v25.8_py310/images/sha256-50fc824361054e445e86d5d88d5f58817f61f8ec83ad4a7e43ea38bbc4a142c0
components:
ROCm: 6.4.3
Primus: 927a717
PyTorch: 2.8.0a0+gitd06a406
MI355X and MI350X:
pull_tag: rocm/primus:v25.9_gfx950
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6
components: &docker_components
ROCm: 7.0.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
hipBLASLt: d1b517fc7a
Triton: 3.3.0
RCCL: 2.22.3
Flash Attention: 2.8.3
hipBLASLt: 911283acd1
Triton: 3.4.0+rocm7.0.0.git56765e8c
RCCL: 2.26.6
MI325X and MI300X:
pull_tag: rocm/primus:v25.9_gfx942
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357
components: *docker_components
model_groups:
- group: Meta Llama
tag: llama


@@ -1,24 +1,39 @@
dockers:
- pull_tag: rocm/pytorch-training:v25.8
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
Python: 3.10.18
Transformer Engine: 2.2.0.dev0+a1e66aae
Flash Attention: 3.0.0.post1
hipBLASLt: 1.1.0-d1b517fc7a
MI355X and MI350X:
pull_tag: rocm/primus:v25.9_gfx950
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6
components: &docker_components
ROCm: 7.0.0
Primus: 0.3.0
Primus Turbo: 0.1.1
PyTorch: 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
Flash Attention: 2.8.3
hipBLASLt: 911283acd1
Triton: 3.4.0+rocm7.0.0.git56765e8c
RCCL: 2.26.6
MI325X and MI300X:
pull_tag: rocm/primus:v25.9_gfx942
docker_hub_url: https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357
components: *docker_components
model_groups:
- group: Meta Llama
tag: llama
models:
- model: Llama 3.1 8B
mad_tag: primus_pyt_train_llama-3.1-8b
model_repo: Llama-3.1-8B
model_repo: meta-llama/Llama-3.1-8B
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: BF16
config_file:
bf16: "./llama3_8b_fsdp_bf16.toml"
fp8: "./llama3_8b_fsdp_fp8.toml"
- model: Llama 3.1 70B
mad_tag: primus_pyt_train_llama-3.1-70b
model_repo: Llama-3.1-70B
model_repo: meta-llama/Llama-3.1-70B
url: https://huggingface.co/meta-llama/Llama-3.1-70B
precision: BF16
config_file:
bf16: "./llama3_70b_fsdp_bf16.toml"
fp8: "./llama3_70b_fsdp_fp8.toml"


@@ -1,13 +1,21 @@
dockers:
- pull_tag: rocm/pytorch-training:v25.8
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5
components:
ROCm: 6.4.3
PyTorch: 2.8.0a0+gitd06a406
Python: 3.10.18
Transformer Engine: 2.2.0.dev0+a1e66aae
Flash Attention: 3.0.0.post1
hipBLASLt: 1.1.0-d1b517fc7a
MI355X and MI350X:
pull_tag: rocm/pytorch-training:v25.9_gfx950
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6
components: &docker_components
ROCm: 7.0.0
Primus: aab4234
PyTorch: 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
Python: "3.10"
Transformer Engine: 2.2.0.dev0+54dd2bdc
Flash Attention: 2.8.3
hipBLASLt: 911283acd1
Triton: 3.4.0+rocm7.0.0.git56765e8c
RCCL: 2.26.6
MI325X and MI300X:
pull_tag: rocm/pytorch-training:v25.9_gfx942
docker_hub_url: https://hub.docker.com/layers/rocm/pytorch-training/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357
components: *docker_components
model_groups:
- group: Meta Llama
tag: llama
@@ -158,7 +166,7 @@ model_groups:
model_repo: SDXL
url: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
precision: BF16
training_modes: [finetune_lora]
training_modes: [posttrain-p]
- group: Flux
tag: flux
models:
@@ -167,7 +175,7 @@ model_groups:
model_repo: Flux
url: https://huggingface.co/black-forest-labs/FLUX.1-dev
precision: BF16
training_modes: [pretrain]
training_modes: [posttrain-p]
- group: NCF
tag: ncf
models:


@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS


@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS


@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native


@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native


@@ -10,7 +10,7 @@ Deep learning frameworks provide environments for machine learning, training, fi
ROCm offers a complete ecosystem for developing and running deep learning applications efficiently. It also provides ROCm-compatible versions of popular frameworks and libraries, such as PyTorch, TensorFlow, JAX, and others.
The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs and accelerators.
The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs.
The table below summarizes information about ROCm-enabled deep learning frameworks. It includes details on ROCm compatibility and third-party tool support, installation steps and options, and links to GitHub resources. For a complete list of supported framework versions on ROCm, see the :doc:`Compatibility matrix <../compatibility/compatibility-matrix>` topic.


@@ -1,5 +1,5 @@
.. meta::
:description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
:description: How to configure MI300X GPUs to fully leverage their capabilities and achieve optimal performance.
:keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning
**************************************
@@ -7,11 +7,11 @@ AMD Instinct MI300X performance guides
**************************************
The following performance guides provide essential guidance on the necessary
steps to properly `configure your system for AMD Instinct™ MI300X accelerators
steps to properly `configure your system for AMD Instinct™ MI300X GPUs
<https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
They include detailed instructions on system settings and application
:doc:`workload tuning </how-to/rocm-for-ai/inference-optimization/workload>` to
help you leverage the maximum capabilities of these accelerators and achieve
help you leverage the maximum capabilities of these GPUs and achieve
superior performance.
* `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__
@@ -19,9 +19,9 @@ superior performance.
your AMD Instinct MI300X system for performance.
* :doc:`/how-to/rocm-for-ai/inference-optimization/workload` covers steps to
optimize the performance of AMD Instinct MI300X series accelerators for HPC
optimize the performance of AMD Instinct MI300X Series GPUs for HPC
and deep learning operations.
* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm` introduces a preconfigured
environment for LLM inference, designed to help you test performance with
popular models on AMD Instinct MI300X series accelerators.
popular models on AMD Instinct MI300X Series GPUs.


@@ -25,7 +25,7 @@ execute on AMD GPUs while maintaining compatibility with CUDA-based systems.
OpenCL (Open Computing Language) is an open standard for cross-platform,
parallel programming of diverse processors. ROCm supports OpenCL for developers
who want to use standard frameworks across different hardware platforms,
including CPUs, GPUs, and other accelerators. For more information, see
including CPUs, GPUs, and APUs. For more information, see
`OpenCL <https://www.khronos.org/opencl/>`_.
Python bindings can be found at https://github.com/ROCm/hip-python.


@@ -11,10 +11,10 @@ Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and
ROCm-aware versions of :doc:`deep learning frameworks <../../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.
Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for
Single-accelerator systems, such as a machine equipped with a single GPU, are commonly used for
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately
sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`.
Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are
Multi-accelerator systems, on the other hand, consist of multiple GPUs working in parallel. These systems are
typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of
massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`.


@@ -3,11 +3,11 @@
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune
*****************************************************
Fine-tuning and inference using multiple accelerators
Fine-tuning and inference using multiple GPUs
*****************************************************
This section explains how to fine-tune a model on a multi-accelerator system. See
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single accelerator or GPU setup.
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single GPU setup.
.. _fine-tuning-llms-multi-gpu-env:
@@ -20,7 +20,7 @@ This section was tested using the following hardware and software environment.
:stub-columns: 1
* - Hardware
- 4 AMD Instinct MI300X accelerators
- 4 AMD Instinct MI300X GPUs
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
@@ -40,13 +40,13 @@ Setting up the base implementation environment
:doc:`PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`. For consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCM-capable accelerators using the following command.
#. In the Docker container, check the availability of ROCm-capable GPUs using the following command.
.. code-block:: shell
rocm-smi --showproductname
#. Check that your accelerators are available to PyTorch.
#. Check that your GPUs are available to PyTorch.
.. code-block:: python
@@ -66,7 +66,7 @@ Setting up the base implementation environment
.. tip::
During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal.
This tool shows which accelerators or GPUs are involved.
This tool shows which GPUs are involved.
.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate:
@@ -74,9 +74,9 @@ Setting up the base implementation environment
Hugging Face Accelerate for fine-tuning and inference
===========================================================
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_ is a library that simplifies turning raw
PyTorch code for a single accelerator into code for multiple accelerators for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`_ allowing you to scale your PyTorch
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`__ is a library that simplifies turning raw
PyTorch code for a single GPU into code for multiple GPUs for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`__, so you can scale your PyTorch
code while maintaining performance and flexibility.
As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama
@@ -107,7 +107,7 @@ Now, it's important to adjust how you load the model. Add the ``device_map`` par
(``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``).
It's recommended to set the ``device_map`` parameter to ``"auto"`` to allow Accelerate to automatically and
efficiently allocate the model given the available resources (4 accelerators in this case).
efficiently allocate the model given the available resources (four GPUs in this case).
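The loading step described above can be sketched as follows. This is a minimal illustration, not the guide's own code: ``build_load_kwargs`` and ``load_model`` are hypothetical helper names, and the check only covers the four string strategies listed in the text. The only library call assumed is ``AutoModelForCausalLM.from_pretrained`` from Transformers with its ``device_map`` argument.

```python
# The four placement strategies named in the text above.
DEVICE_MAP_OPTIONS = ("auto", "balanced", "balanced_low_0", "sequential")


def build_load_kwargs(device_map="auto"):
    """Return keyword arguments for from_pretrained with a checked device_map."""
    if device_map not in DEVICE_MAP_OPTIONS:
        raise ValueError(f"unsupported device_map: {device_map!r}")
    return {"device_map": device_map}


def load_model(model_id, device_map="auto"):
    """Load a causal LM and let Accelerate place its layers on the visible GPUs."""
    # Imported lazily so build_load_kwargs works without Transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id, **build_load_kwargs(device_map)
    )
```

Calling ``load_model(model_id)`` with the model identifier used in this guide should return a model whose layers are spread across the available GPUs. After loading, the model's ``hf_device_map`` attribute is one way to inspect which device each layer was assigned to.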
When you have more GPU memory available than the model size, here is the difference between each ``device_map``
option:
@@ -130,8 +130,8 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference
=============================================
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-accelerator or
GPU model fine-tuning and inference with LLMs.
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs.
#. Install torchtune using pip.


@@ -30,7 +30,7 @@ The challenge of fine-tuning models
However, the computational cost of fine-tuning is still high, especially for complex models and large datasets, which
poses distinct challenges related to substantial computational and memory requirements. This might be a barrier for
accelerators or GPUs with low computing power or limited device memory resources.
GPUs with low computing power or limited device memory resources.
For example, suppose we have a language model with 7 billion (7B) parameters, represented by a weight matrix :math:`W`.
During backpropagation, the model needs to learn a :math:`ΔW` matrix, which updates the original weights to minimize the
@@ -84,8 +84,8 @@ Walkthrough
===========
To demonstrate the benefits of LoRA and the ideal compute compatibility of using PEFT and TRL libraries on AMD
ROCm-compatible accelerators and GPUs, let's step through a comprehensive implementation of the fine-tuning process
using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X accelerators.
ROCm-compatible GPUs, let's step through a comprehensive implementation of the fine-tuning process
using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X GPUs.
Before starting, review and understand the key components of this walkthrough:


@@ -3,12 +3,11 @@
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference, SFTTrainer
****************************************************
Fine-tuning and inference using a single accelerator
Fine-tuning and inference using a single GPU
****************************************************
This section explains model fine-tuning and inference techniques on a single-accelerator system. See
:doc:`Multi-accelerator fine-tuning <multi-gpu-fine-tuning-and-inference>` for a setup with multiple accelerators or
GPUs.
:doc:`Multi-accelerator fine-tuning <multi-gpu-fine-tuning-and-inference>` for a setup with multiple GPUs.
.. _fine-tuning-llms-single-gpu-env:
@@ -21,7 +20,7 @@ This section was tested using the following hardware and software environment.
:stub-columns: 1
* - Hardware
- AMD Instinct MI300X accelerator
- AMD Instinct MI300X GPU
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
@@ -41,7 +40,7 @@ Setting up the base implementation environment
:doc:`PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`. For a consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCm-capable accelerators using the following command.
#. In the Docker container, check the availability of ROCm-capable GPUs using the following command.
.. code-block:: shell
@@ -53,14 +52,14 @@ Setting up the base implementation environment
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card series: AMD Instinct MI300X OAM
GPU[0] : Card Series: AMD Instinct MI300X OAM
GPU[0] : Card model: 0x74a1
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: MI3SRIOV
==========================================================================================
================================== End of ROCm SMI Log ===================================
#. Check that your accelerators are available to PyTorch.
#. Check that your GPUs are available to PyTorch.
.. code-block:: python
@@ -502,9 +501,9 @@ Let's look at achieving model inference using these types of models.
# Token generation
print(pipe("What is a large language model?")[0]["generated_text"])
If using multiple accelerators, see
If using multiple GPUs, see
:ref:`Multi-accelerator fine-tuning and inference <fine-tuning-llms-multi-gpu-hugging-face-accelerate>` to explore
popular libraries that simplify fine-tuning and inference in a multi-accelerator system.
popular libraries that simplify fine-tuning and inference in a multiple-GPU system.
Read more about inference frameworks like vLLM and Hugging Face TGI in
:doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`.


@@ -45,7 +45,7 @@ ROCm provides two different implementations of Flash Attention 2 modules. They c
# Install from source
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/
GPU_ARCHS=gfx942 python setup.py install #MI300 series
GPU_ARCHS=gfx942 python setup.py install #MI300 Series
Hugging Face Transformers can easily deploy the CK Flash Attention 2 module by passing an argument
``attn_implementation="flash_attention_2"`` to the ``from_pretrained`` method.
@@ -526,7 +526,7 @@ follow these instructions:
python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py
To run the FBGEMM_GPU ``uvm`` test, use these commands. These tests only support the AMD MI210 and
more recent accelerators.
more recent GPUs.
.. code-block:: shell


@@ -7,7 +7,7 @@ Model quantization techniques
*****************************
Quantization reduces the model size compared to its native full-precision version, making it easier to fit large models
onto accelerators or GPUs with limited memory. This section explains how to perform LLM quantization using AMD Quark, GPTQ,
onto GPUs with limited memory. This section explains how to perform LLM quantization using AMD Quark, GPTQ,
and bitsandbytes on AMD Instinct hardware.
.. _quantize-llms-quark:
@@ -311,7 +311,7 @@ ExLlama-v2 support
ExLlama is a Python/C++/CUDA implementation of the Llama model that is
designed for faster inference with 4-bit GPTQ weights. The ExLlama
kernel is activated by default when users create a ``GPTQConfig`` object. To
boost inference speed even further on Instinct accelerators, use the ExLlama-v2
boost inference speed even further on Instinct GPUs, use the ExLlama-v2
kernels by configuring the ``exllama_config`` parameter as the following.
.. code-block:: python
@@ -332,7 +332,7 @@ The `ROCm-aware bitsandbytes <https://github.com/ROCm/bitsandbytes>`_ library is
a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizer, matrix multiplication, and
8-bit and 4-bit quantization functions. The library includes quantization primitives for 8-bit and 4-bit operations
through ``bitsandbytes.nn.Linear8bitLt`` and ``bitsandbytes.nn.Linear4bit`` and 8-bit optimizers through the
``bitsandbytes.optim`` module. These modules are supported on AMD Instinct accelerators.
``bitsandbytes.optim`` module. These modules are supported on AMD Instinct GPUs.
Installing bitsandbytes
-----------------------


@@ -9,13 +9,13 @@ myst:
The AMD ROCm Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.
This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X accelerators using CK.
This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X GPUs using CK.
## High-level overview: a CK GEMM instance
GEMM is a fundamental block in linear algebra, machine learning, and deep neural networks. It is defined as the operation:
{math}`E = α \times (A \times B) + β \times (D)`, with A and B as matrix inputs, α and β as scalar inputs, and D as a pre-existing matrix.
Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct accelerators for the computation of GEMM. The implementation of the instance contains two phases:
Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct GPUs for the computation of GEMM. The implementation of the instance contains two phases:
- [Template parameter definition](#template-parameter-definition)
- [Instantiating and running the templated kernel](#instantiating-and-running-the-templated-kernel)
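As a reference for the GEMM definition above, the following pure-Python sketch computes `E = α × (A × B) + β × D` on small nested lists, with an optional ReLU to mirror the `03_gemm_bias_relu` example. The function name `gemm_multiple_d` and its `relu` flag are illustrative only and are not part of CK. Here D is a full M×N matrix rather than a broadcast bias vector.

```python
def gemm_multiple_d(a, b, d, alpha=1.0, beta=1.0, relu=False):
    """Compute E = alpha * (A x B) + beta * D, optionally followed by ReLU."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("A must be M x K to match a K x N matrix B")
    if len(d) != m or any(len(row) != n for row in d):
        raise ValueError("D must be M x N")

    e = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Dot product of row i of A with column j of B.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            value = alpha * acc + beta * d[i][j]
            out_row.append(max(0.0, value) if relu else value)
        e.append(out_row)
    return e


activation = [[1, 2], [3, 4]]   # A: input activation
weight = [[1, 0], [0, 1]]       # B: weight (identity here)
bias = [[-10, 1], [1, 1]]       # D: bias
output = gemm_multiple_d(activation, weight, bias, relu=True)
```

With the identity weight, the pre-activation result is `A + D`, so the only negative entry (`1 - 10`) is clamped to zero by the ReLU. A CK instance fuses these steps into a single GPU kernel instead of looping on the host.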
@@ -108,7 +108,7 @@ These parameters include Block Size, M/N/K Per Block, M/N per XDL, AK1, BK1, etc
- Block Size determines the number of threads in the thread block.
- M/N/K Per Block determines the size of tile that each thread block is responsible for calculating.
- M/N Per XDL refers to M/N size for Instinct accelerator Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis.
- M/N Per XDL refers to M/N size for Instinct GPU Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis.
- A/B K1 is related to the data type. It can be any value ranging from 1 to K Per Block. To achieve the optimal load/store performance, 128bit per load is suggested. In addition, the A/B loading parameters must be changed accordingly to match the A/B K1 value; otherwise, it will result in compilation errors.
Conditions for achieving computational load balancing on different hardware platforms can vary.
@@ -133,7 +133,7 @@ Templated kernel launching consists of kernel instantiation, making arguments by
## Developing fused INT8 kernels for SmoothQuant models
[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct accelerators are shown in the following table.
[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct GPUs are shown in the following table.
:::{table} Functionalities used to implement SmoothQuant model inference.
@@ -164,7 +164,7 @@ The CK library contains many fundamental instances that implement different func
Second, consider whether the format of input data meets your actual calculation needs. For SQ models, the 8-bit integer data format (INT8) is applied for matrix calculations.
Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct accelerators after being compiled and cannot run on Radeon-series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances.
Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct GPUs after being compiled and cannot run on Radeon-Series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances.
Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_kernel/tree/develop/example/24_batched_gemm) as the fundamental instance to implement the functionalities in the previous table.
@@ -435,7 +435,7 @@ The implementation architecture of running SmoothQuant models on MI300X GPUs is
### Figure 7
================ -->
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg
The implementation architecture of running SmoothQuant models on AMD MI300X accelerators.
The implementation architecture of running SmoothQuant models on AMD MI300X GPUs.
```
For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-smoothquant), each decoder layer contains three major components: attention calculation, layer normalization, and linear transformation in fully connected layers. The corresponding implementation classes for these components are:
@@ -447,21 +447,21 @@ For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-s
The underlying implementation logic of these classes uses the functions in the previous table. Note that in this example, the `LayerNormQ` module is implemented by the native torch module.
Testing environment:
The hardware platform used for testing is equipped with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X accelerators and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub:
The hardware platform used for testing is equipped with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X GPUs and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub:
[`rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b)
The tested models are the OPT-1.3B, 2.7B, 6.7B, and 13B FP16 models. The corresponding SmoothQuant INT8 OPT models were obtained from Hugging Face.
Note that since the default values were used for the tunable parameters of the fundamental instance, the performance of the INT8 kernel is suboptimal.
Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size.
Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size.
<!--
================
### Figure 8
================ -->
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator.
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU.
```
For accuracy comparisons between the original FP16 and INT8 models, the evaluation is done by using the first 1,000 samples from the LAMBADA dataset's validation set. We employ the same Last Token Prediction Accuracy method introduced in [SmoothQuant Real-INT8 Inference for PyTorch](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb) as our evaluation metric. The comparison results are shown in Table 2.
@@ -482,4 +482,4 @@ CK provides a rich set of template parameters for generating flexible accelerate
CK supports multiple instruction sets of AMD Instinct GPUs, operator fusion and different data precisions. Its composability helps users quickly construct operator performance verification.
With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD accelerator platforms.
With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD GPU platforms.


@@ -1,15 +1,15 @@
.. meta::
:description: Learn about workload tuning on AMD Instinct MI300X accelerators for optimal performance.
:description: Learn about workload tuning on AMD Instinct MI300X GPUs for optimal performance.
:keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL,
MIOpen, accelerator, GPU, resource utilization
MIOpen, GPU, resource utilization
*****************************************
AMD Instinct MI300X workload optimization
*****************************************
This document provides guidelines for optimizing the performance of AMD
Instinct™ MI300X accelerators, with a particular focus on GPU kernel
Instinct™ MI300X GPUs, with a particular focus on GPU kernel
programming, high-performance computing (HPC), and deep learning operations
using PyTorch. It delves into specific workloads such as
:ref:`model inference <mi300x-vllm-optimization>`, offering strategies to
@@ -25,7 +25,7 @@ Workload tuning strategy
By following a structured approach, you can systematically address
performance issues and enhance the efficiency of your workloads on AMD Instinct
MI300X accelerators.
MI300X GPUs.
Measure the current workload
----------------------------
@@ -86,7 +86,7 @@ Optimize model inference with vLLM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vLLM provides tools and techniques specifically designed for efficient model
inference on AMD Instinct MI300X accelerators. See :ref:`fine-tuning-llms-vllm`
inference on AMD Instinct MI300X GPUs. See :ref:`fine-tuning-llms-vllm`
for installation guidance. Optimizing performance with vLLM
involves configuring tensor parallelism, leveraging advanced features, and
ensuring efficient execution. Here's how to optimize vLLM performance:
@@ -239,7 +239,7 @@ benchmarking process.
With AMD's profiling tools, developers are able to gain important insight into how efficiently their application is
using hardware resources and effectively diagnose potential bottlenecks contributing to poor performance. Developers
working with AMD Instinct accelerators have multiple tools depending on their specific profiling needs; these include:
working with AMD Instinct GPUs have multiple tools depending on their specific profiling needs; these include:
* :ref:`ROCProfiler <mi300x-rocprof>`
@@ -257,11 +257,11 @@ metrics, commonly called *performance counters*. These counters quantify the per
showcasing which pieces of the computational pipeline and memory hierarchy are being utilized.
Your ROCm installation contains a script or executable command called ``rocprof`` which provides the ability to list all
available hardware counters for your specific accelerator or GPU, and run applications while collecting counters during
available hardware counters for your specific GPU, and run applications while collecting counters during
their execution.
This ``rocprof`` utility also depends on the :doc:`ROCTracer and ROC-TX libraries <roctracer:index>`, giving it the
ability to collect timeline traces of the accelerator software stack as well as user-annotated code regions.
ability to collect timeline traces of the GPU software stack as well as user-annotated code regions.
.. note::
@@ -276,16 +276,16 @@ ROCm Compute Profiler
^^^^^^^^^^^^^^^^^^^^^
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>` is a system performance profiler for high-performance computing (HPC) and
machine learning (ML) workloads using Instinct accelerators. Under the hood, ROCm Compute Profiler uses
machine learning (ML) workloads using Instinct GPUs. Under the hood, ROCm Compute Profiler uses
:ref:`ROCProfiler <mi300x-rocprof>` to collect hardware performance counters. The ROCm Compute Profiler tool performs
system profiling based on all approved hardware counters for Instinct
accelerator architectures. It provides high level performance analysis features including System Speed-of-Light, IP
GPU architectures. It provides high level performance analysis features including System Speed-of-Light, IP
block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more.
ROCm Compute Profiler takes the guesswork out of profiling by removing the need to provide text input files with lists of counters
to collect and analyze raw CSV output files as is the case with ROCProfiler. Instead, ROCm Compute Profiler automates the collection
of all available hardware counters in one command and provides graphical interfaces to help users understand and
analyze bottlenecks and stressors for their computational workloads on AMD Instinct accelerators.
analyze bottlenecks and stressors for their computational workloads on AMD Instinct GPUs.
.. note::
@@ -411,7 +411,7 @@ for additional performance tips. :ref:`fine-tuning-llms-vllm` describes vLLM
usage with ROCm.
ROCm provides a prebuilt optimized Docker image for validating the performance
of LLM inference with vLLM on MI300X series accelerators. The Docker image includes
of LLM inference with vLLM on MI300X Series GPUs. The Docker image includes
ROCm, vLLM, and PyTorch. For more information, see
:doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`.
@@ -449,7 +449,7 @@ Maximizing vLLM instances on a single node
The general guideline is to maximize per-node throughput by running as many vLLM instances as possible.
However, running too many instances might lead to insufficient memory for the KV-cache, which can affect performance.
The Instinct MI300X accelerator is equipped with 192GB of HBM3 memory.
The Instinct MI300X GPU is equipped with 192 GB of HBM3 memory.
For models that fit in one GPU -- to maximize the accumulated throughput -- you can run as many as eight vLLM instances
simultaneously on one MI300X node (with eight GPUs). To do so, use the GPU isolation environment
variable ``CUDA_VISIBLE_DEVICES``.
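The isolation approach described above can be sketched as follows: each instance gets its own copy of the environment with ``CUDA_VISIBLE_DEVICES`` set to a single GPU index. The helper names ``instance_env`` and ``plan_instances`` are hypothetical, and the ``vllm serve <model> --port <n>`` command line is an assumption about the installed vLLM version; substitute the launch command used in your environment.

```python
import os


def instance_env(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def plan_instances(model, num_gpus=8, first_port=8000):
    """Build one (command, environment) pair per GPU, each on its own port."""
    plans = []
    for gpu in range(num_gpus):
        # Assumed CLI form; adjust to the vLLM entry point you actually use.
        cmd = ["vllm", "serve", model, "--port", str(first_port + gpu)]
        plans.append((cmd, instance_env(gpu)))
    return plans
```

To start the instances, pass each pair to ``subprocess.Popen(cmd, env=env)``. Building the plan separately from launching it makes it easy to print and check the GPU and port assignments before any process starts.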
@@ -468,7 +468,7 @@ The total throughput achieved by running ``N`` instances of vLLM is generally mu
single vLLM instance across ``N`` GPUs simultaneously (that is, configuring ``tensor_parallel_size`` as N or
using the ``-tp`` N option, where ``1 < N ≤ 8``).
vLLM on MI300X accelerators can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on.
vLLM on MI300X GPUs can run a variety of model weights, including Llama 2 (7b, 13b, 70b), Llama 3 (8b, 70b), Qwen2 (7b, 72b), Mixtral-8x7b, Mixtral-8x22b, and so on.
Notable configurations include the Llama2-70b and Llama3-70b models on a single MI300X GPU and the Llama3.1 405b model on a single node with eight MI300X GPUs.
.. _mi300x-vllm-gpu-memory-utilization:
@@ -917,7 +917,7 @@ ROCm library tuning involves optimizing the performance of routine computational
operations (such as ``GEMM``) provided by ROCm libraries like
:ref:`hipBLASLt <mi300x-hipblaslt>`, :ref:`Composable Kernel <mi300x-ck>`,
:ref:`MIOpen <mi300x-miopen>`, and :ref:`RCCL <mi300x-rccl>`. This tuning aims
to maximize efficiency and throughput on Instinct MI300X accelerators to gain
to maximize efficiency and throughput on Instinct MI300X GPUs to gain
improved application performance.
.. _mi300x-library-gemm:
@@ -1451,7 +1451,7 @@ you can only use a fraction of the potential bandwidth on the node.
The following figure shows an
:doc:`MI300X node-level architecture </conceptual/gpu-arch/mi300>` of a
system with AMD EPYC processors in a dual-socket configuration and eight
AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via
AMD Instinct MI300X GPUs. The MI300X OAMs attach to the host system via
PCIe Gen 5 x16 links (yellow lines). The GPUs use seven high-bandwidth,
low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected
8-GPU system.
@@ -1460,7 +1460,7 @@ low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected
.. figure:: ../../../data/shared/mi300-node-level-arch.png
MI300 series node-level architecture showing 8 fully interconnected MI300X
MI300 Series node-level architecture showing 8 fully interconnected MI300X
OAM modules connected to (optional) PCIe switches via re-timers and HGX
connectors.
@@ -1653,7 +1653,7 @@ Auto-tunable kernel configuration involves adjusting memory access and computati
resources assigned to each compute unit. It encompasses the usage of
:ref:`LDS <mi300x-cu-fig>`, register, and task scheduling on a compute unit.
The accelerator or GPU contains global memory, local data share (LDS), and
The GPU contains global memory, local data share (LDS), and
registers. Global memory has high access latency, but is large. LDS access has
much lower latency, but is smaller. It is a fast on-CU software-managed memory
that can be used to efficiently share data between all work items in a block.
@@ -1666,11 +1666,11 @@ Register access is the fastest yet smallest among the three.
Schematic representation of a CU in the CDNA2 or CDNA3 architecture.
The following is a list of kernel arguments used for tuning performance and
resource allocation on AMD accelerators, which helps in optimizing the
resource allocation on AMD GPUs, which helps in optimizing the
efficiency and throughput of various computational kernels.
``num_stages=n``
Adjusts the number of pipeline stages for different types of kernels. On AMD accelerators, set ``num_stages``
Adjusts the number of pipeline stages for different types of kernels. On AMD GPUs, set ``num_stages``
according to the following rules:
* For kernels with a single GEMM, set to ``2``.
@@ -1697,15 +1697,15 @@ efficiency and throughput of various computational kernels.
* The occupancy of the kernel is limited by VGPR usage, and
* The current VGPR usage is only a few above a boundary in
:ref:`Occupancy related to VGPR usage in an Instinct MI300X accelerator <mi300x-occupancy-vgpr-table>`.
:ref:`Occupancy related to VGPR usage in an Instinct MI300X GPU <mi300x-occupancy-vgpr-table>`.
.. _mi300x-occupancy-vgpr-table:
.. figure:: ../../../data/shared/occupancy-vgpr.png
:alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator.
:alt: Occupancy related to VGPR usage in an Instinct MI300X GPU.
:align: center
Occupancy related to VGPRs usage on an Instinct MI300X accelerator
Occupancy related to VGPRs usage on an Instinct MI300X GPU
For example, according to the table, each Execution Unit (EU) has 512 available
VGPRs, which are allocated in blocks of 16. If the current VGPR usage is 170,
@@ -1730,7 +1730,7 @@ VGPR usage so that it might fit 3 waves per EU.
- ``matrix_instr_nonkdim = 32``: ``mfma_32x32`` is used.
For GEMM kernels on an MI300X accelerator, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large
For GEMM kernels on an MI300X GPU, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large
tile/GEMM sizes.
@@ -1749,7 +1749,7 @@ the number of CUs a kernel can distribute its task across.
XCD-level system architecture showing 40 compute units,
each with 32 KB L1 cache, a unified compute system with 4 ACE compute
accelerators, shared 4MB of L2 cache, and a hardware scheduler (HWS).
accelerators, shared 4MB of L2 cache, and a hardware scheduler (HWS).
You can query hardware resources with the command ``rocminfo`` in the
``/opt/rocm/bin`` directory. For instance, query the number of CUs, number of


@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
.. list-table::
:header-rows: 1
@@ -47,7 +47,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-812>` for
MI300X series accelerators.
MI300X Series GPUs.
What's new
==========
@@ -139,7 +139,7 @@ page provides reference throughput and serving measurements for inferencing popu
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -424,7 +424,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -21,8 +21,8 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
inference performance on AMD Instinct™ MI300X Series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
accelerators and includes the following components:
.. list-table::
@@ -38,7 +38,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-909>` for
MI300X series accelerators.
MI300X Series accelerators.
What's new
==========
@@ -430,7 +430,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
a brief introduction to vLLM and optimization strategies.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified
ROCm Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment designed for validating large language model
(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
(LLM) inference performance on the AMD Instinct™ MI300X GPU. This
ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
MI300X accelerator and includes the following components:
MI300X GPU and includes the following components:
* `ROCm 6.2.0 <https://github.com/ROCm/ROCm>`_
@@ -31,7 +31,7 @@ MI300X accelerator and includes the following components:
* Tuning files (in CSV format)
With this Docker image, you can quickly validate the expected inference
performance numbers on the MI300X accelerator. This topic also provides tips on
performance numbers on the MI300X GPU. This topic also provides tips on
optimizing performance with popular AI models.
.. _vllm-benchmark-vllm:
@@ -51,7 +51,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X accelerator with the prebuilt vLLM Docker image.
MI300X GPU with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -267,7 +267,7 @@ Options
.. _vllm-benchmark-run-benchmark-v043:
Running the benchmark on the MI300X accelerator
Running the benchmark on the MI300X GPU
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -328,7 +328,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified
ROCm Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment designed for validating large language model
(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
(LLM) inference performance on the AMD Instinct™ MI300X GPU. This
ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
MI300X accelerator and includes the following components:
MI300X GPU and includes the following components:
* `ROCm 6.2.1 <https://github.com/ROCm/ROCm>`_
@@ -31,7 +31,7 @@ MI300X accelerator and includes the following components:
* Tuning files (in CSV format)
With this Docker image, you can quickly validate the expected inference
performance numbers on the MI300X accelerator. This topic also provides tips on
performance numbers on the MI300X GPU. This topic also provides tips on
optimizing performance with popular AI models.
.. hlist::
@@ -74,7 +74,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X accelerator with the prebuilt vLLM Docker image.
MI300X GPU with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -332,7 +332,7 @@ Options
.. _vllm-benchmark-run-benchmark-v064:
Running the benchmark on the MI300X accelerator
Running the benchmark on the MI300X GPU
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -398,7 +398,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ LLM inference performance validation on AMD Instinct MI300X
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM
inference performance on the AMD Instinct™ MI300X GPU. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for the MI300X
accelerator and includes the following components:
GPU and includes the following components:
* `ROCm 6.3.1 <https://github.com/ROCm/ROCm>`_
@@ -29,7 +29,7 @@ accelerator and includes the following components:
* `PyTorch 2.7.0 (2.7.0a0+git3a58512) <https://github.com/pytorch/pytorch>`_
With this Docker image, you can quickly validate the expected inference
performance numbers for the MI300X accelerator. This topic also provides tips on
performance numbers for the MI300X GPU. This topic also provides tips on
optimizing performance with popular AI models. For more information, see the lists of
:ref:`available models for MAD-integrated benchmarking <vllm-benchmark-mad-v066-models>`
and :ref:`standalone benchmarking <vllm-benchmark-standalone-v066-options>`.
@@ -47,7 +47,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X accelerator with the prebuilt vLLM Docker image.
MI300X GPU with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -377,7 +377,7 @@ Options and available models
.. _vllm-benchmark-run-benchmark-v066:
Running the benchmark on the MI300X accelerator
Running the benchmark on the MI300X GPU
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -443,7 +443,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerator. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPU. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v073>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-v073:
@@ -110,7 +110,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -122,7 +122,7 @@ vLLM inference performance testing
===============
Use the following procedures to reproduce the benchmark results on an
MI300X accelerator with the prebuilt vLLM Docker image.
MI300X GPU with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -311,7 +311,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -32,7 +32,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v083>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-v083:
@@ -105,7 +105,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -327,7 +327,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v085-20250513>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-v085-20250513:
@@ -114,7 +114,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -333,7 +333,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v085-20250521>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-v085-20250521:
@@ -114,7 +114,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X accelerators or ROCm software.
Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -333,7 +333,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v0901-20250605>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-v0901-20250605:
@@ -113,7 +113,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -332,7 +332,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-20250702>` for
MI300X series accelerators.
MI300X Series GPUs.
.. _vllm-benchmark-available-models-20250702:
@@ -113,7 +113,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
Advanced features and known issues
==================================
@@ -332,7 +332,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
.. list-table::
:header-rows: 1
@@ -47,7 +47,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-715>` for
MI300X series accelerators.
MI300X Series GPUs.
What's new
==========
@@ -145,7 +145,7 @@ page provides reference throughput and latency measurements for inferencing popu
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -429,7 +429,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
ROCm PyTorch Docker image.
:keywords: model, MAD, automation, dashboarding, validate, pytorch
@@ -15,7 +15,7 @@ PyTorch inference performance testing
{% set model_groups = data.pytorch_inference_benchmark.model_groups %}
The `ROCm PyTorch Docker <https://hub.docker.com/r/rocm/pytorch/tags>`_ image offers a prebuilt,
optimized environment for testing model inference performance on AMD Instinct™ MI300X series
optimized environment for testing model inference performance on AMD Instinct™ MI300X Series
GPUs. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD)
tool with the ROCm PyTorch container to test inference performance on various models efficiently.
@@ -175,7 +175,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`../../inference-optimization/workload`.

@@ -22,7 +22,7 @@ improved efficiency and throughput.
`SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
serving engine for large language models (LLMs) and vision models. The
ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__
bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series
bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X Series
GPUs. It includes the following software components:
.. list-table::
@@ -37,7 +37,7 @@ improved efficiency and throughput.
{% endfor %}
The following guides on setting up and running SGLang and Mooncake for disaggregated
distributed inference on a Slurm cluster using AMD Instinct MI300X series GPUs backed by
distributed inference on a Slurm cluster using AMD Instinct MI300X Series GPUs backed by
Mellanox CX-7 NICs.
Prerequisites
@@ -236,7 +236,7 @@ Further reading
- See the base upstream Docker image on `Docker Hub <https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__.
- To learn more about system settings and management practices to configure your system for
MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.


@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and SGLang
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and SGLang
:keywords: model, MAD, automation, dashboarding, validate
*****************************************************************
@@ -15,8 +15,8 @@ SGLang inference performance testing DeepSeek-R1-Distill-Qwen-32B
`SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
serving engine for large language models (LLMs) and vision models. The
ROCm-enabled `SGLang Docker image <{{ docker.docker_hub_url }}>`__
bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series
accelerators. It includes the following software components:
bundles SGLang with PyTorch, optimized for AMD Instinct MI300X Series
GPUs. It includes the following software components:
.. list-table::
:header-rows: 1
@@ -255,7 +255,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.


@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the ROCm vLLM Docker image.
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
**********************************
@@ -457,7 +457,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
a brief introduction to vLLM and optimization strategies.


@@ -44,9 +44,9 @@ Validating vLLM performance
---------------------------
ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
on the MI300X GPU. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
format. For more information, see the guide to
`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_
`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X GPU <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_
on the ROCm GitHub repository.
.. _rocm-for-ai-serve-hugging-face-tgi:
@@ -61,7 +61,7 @@ The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-ge
TGI installation
----------------
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct GPUs is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.
TGI walkthrough


@@ -10,7 +10,7 @@ Running models from Hugging Face
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.
This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.
This section describes how to run popular community transformer models from Hugging Face on AMD GPUs.
.. _rocm-for-ai-hugging-face-transformers:
@@ -62,11 +62,11 @@ Using Hugging Face with Optimum-AMD
Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.
For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration.
Hugging Face libraries natively support AMD Instinct accelerators. For other
Hugging Face libraries natively support AMD Instinct GPUs. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.
@@ -139,7 +139,7 @@ To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_, hosted wheels are availabl
pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.
Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable.
.. code-block:: shell


@@ -9,7 +9,7 @@ AI inference is a process of deploying a trained machine learning model to make
Understanding the ROCm™ software platform's architecture and capabilities is vital for running AI inference. By leveraging the ROCm platform's capabilities, you can harness the power of high-performance computing and efficient resource management to run inference workloads, leading to faster predictions and classifications on real-time data.
Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X accelerators and illustrate how to deploy trained models in production environments.
Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X GPUs, and how to deploy trained models in production environments.
The AI Developer Hub contains `AMD ROCm tutorials <https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/>`_ for
training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs.


@@ -60,7 +60,7 @@ Installing vLLM
vllm-rocm \
bash
3. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command.
3. Inside the container, start the API server to run on a single GPU on port 8000 using the following command.
.. code-block:: shell
@@ -113,7 +113,7 @@ Installing vLLM
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 &
4. To run multiple instances of API Servers, specify different ports for each server, and use ``ROCR_VISIBLE_DEVICES`` to
isolate each instance to a different accelerator.
isolate each instance to a different GPU.
For example, to run two API servers, one on port 8000 using GPU 0 and 1, one on port 8001 using GPU 2 and 3, use a
command like the following.
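As a rough sketch of the two-server layout described above (the guide's own command is not shown in this hunk), the following dry run prints one launch line per instance. The model path and flags are copied from the single-instance command earlier in this step; the ``build_server_cmd`` helper is hypothetical, and it assumes the tensor-parallel size should equal the number of device IDs given to each instance.

```shell
#!/bin/sh
# Sketch only: build one launch command per vLLM API server instance.
# Each instance gets its own port and its own GPU set via ROCR_VISIBLE_DEVICES.
# build_server_cmd is a hypothetical helper, not part of vLLM or ROCm.

# build_server_cmd DEVICES PORT -> prints the command line for one instance.
build_server_cmd() (
    devices="$1"
    port="$2"
    # Assumption: tensor-parallel size = number of comma-separated device IDs.
    tp=$(printf '%s\n' "$devices" | awk -F',' '{ print NF }')
    printf 'ROCR_VISIBLE_DEVICES=%s python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp %s --port %s' \
        "$devices" "$tp" "$port"
)

# Dry run: print the two commands instead of starting the servers.
build_server_cmd "0,1" 8000; echo " &"
build_server_cmd "2,3" 8001; echo " &"
```

Printing the commands first makes it easy to confirm that no two instances share a port or a GPU before anything is started.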
@@ -140,7 +140,7 @@ Installing vLLM
See :ref:`mi300x-vllm-optimization` for performance optimization tips.
ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
on the MI300X accelerator. The Docker image includes ROCm, vLLM, and PyTorch.
on the MI300X GPU. The Docker image includes ROCm, vLLM, and PyTorch.
For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`.
.. _fine-tuning-llms-tgi:
@@ -178,7 +178,7 @@ Install TGI
.. tab-item:: TGI on a single-accelerator system
:sync: single
2. Inside the container, launch a model using TGI server on a single accelerator.
2. Inside the container, launch a model using TGI server on a single GPU.
.. code-block:: shell
@@ -199,7 +199,7 @@ Install TGI
.. tab-item:: TGI on a multi-accelerator system
2. Inside the container, launch a model using TGI server on multiple accelerators (4 in this case).
2. Inside the container, launch a model using TGI server on multiple GPUs (four in this case).
.. code-block:: shell


@@ -1,6 +1,6 @@
.. meta::
:description: Multi-node setup for AI training
:keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training
:keywords: gpu, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training
.. _rocm-for-ai-multi-node-setup:
@@ -21,7 +21,7 @@ Before starting, ensure your environment meets the following requirements:
* Multi-node networking: your cluster should have a configured multi-node network. For setup
instructions, see the `Multi-node network configuration for AMD Instinct
accelerators
GPUs
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
guide in the Instinct documentation.
@@ -54,8 +54,8 @@ Compile and install the RoCE library
If you're using Broadcom NICs, you need to compile and install the RoCE (RDMA
over Converged Ethernet) library. See `RoCE cluster network configuration guide
for AMD Instinct accelerators
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/roce-network-config.html#roce-cluster-network-configuration-guide-for-amd-instinct-accelerators>`__
for AMD Instinct GPUs
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/roce-network-config.html>`__
for more information.
See the `Ethernet networking guide for AMD
@@ -315,6 +315,6 @@ Megatron-LM
Further reading
===============
* `Multi-node network configuration for AMD Instinct accelerators <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
* `Multi-node network configuration for AMD Instinct GPUs <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
* `Ethernet networking guide for AMD Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source <https://docs.broadcom.com/doc/957608-AN2XX#page=81>`__


@@ -6,14 +6,8 @@
Training a model with JAX MaxText on ROCm
******************************************
MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X series GPUs.
The MaxText for ROCm training Docker image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
provides a prebuilt environment for training on AMD Instinct MI355X, MI350X, MI325X, and MI300X GPUs,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
@@ -61,15 +55,15 @@ MaxText with on ROCm provides the following key features to train large language
- Multi-node support
- NANOO FP8 quantization support
- NANOO FP8 (for MI300X Series GPUs) and FP8 (for MI355X and MI350X) quantization support
.. _amd-maxtext-model-support-v257:
.. _amd-maxtext-model-support-v259:
Supported models
================
The following models are pre-optimized for performance on AMD Instinct MI300
series GPUs. Some instructions, commands, and available training
The following models are pre-optimized for performance on AMD Instinct
GPUs. Some instructions, commands, and available training
configurations in this documentation might vary by model -- select one to get
started.
@@ -139,22 +133,13 @@ Use the following command to pull the Docker image from Docker Hub.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml
{% set dockers = data.dockers %}
.. tab-set::
{% set docker = data.dockers[0] %}
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. code-block:: shell
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
docker pull {{ docker.pull_tag }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
.. _amd-maxtext-multi-node-setup-v257:
.. _amd-maxtext-multi-node-setup-v259:
Multi-node configuration
------------------------
@@ -162,7 +147,7 @@ Multi-node configuration
See :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your
environment for multi-node training.
.. _amd-maxtext-get-started-v257:
.. _amd-maxtext-get-started-v259:
Benchmarking
============
@@ -174,7 +159,7 @@ benchmark results:
.. _vllm-benchmark-mad:
{% set dockers = data.dockers %}
{% set docker = data.dockers[0] %}
{% set model_groups = data.model_groups %}
{% for model_group in model_groups %}
{% for model in model_group.models %}
@@ -186,6 +171,9 @@ benchmark results:
{% if model.mad_tag and "single-node" in model.doc_options %}
.. tab-item:: MAD-integrated benchmarking
The following run command is tailored to {{ model.model }}.
See :ref:`amd-maxtext-model-support-v259` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
@@ -214,22 +202,19 @@ benchmark results:
.. tab-item:: Standalone benchmarking
The following commands are optimized for {{ model.model }}. See
:ref:`amd-maxtext-model-support-v259` to switch to another
available model. Some instructions and resources might not be
available for all models and configurations.
.. rubric:: Download the Docker image and required scripts
Run the JAX MaxText benchmark tool independently by starting the
Docker container as shown in the following snippet.
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. code-block:: shell
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
docker pull {{ docker.pull_tag }}
{% if model.model_repo and "single-node" in model.doc_options %}
.. rubric:: Single node training
@@ -250,33 +235,25 @@ benchmark results:
2. Launch the Docker container.
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. code-block:: shell
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
.. code-block:: shell
docker run -it \
--device=/dev/dri \
--device=/dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
-v $HF_HOME:/hf_cache \
-e HF_HOME=/hf_cache \
-e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
{% endfor %}
docker run -it \
--device=/dev/dri \
--device=/dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
-v $HF_HOME:/hf_cache \
-e HF_HOME=/hf_cache \
-e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
3. In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ``MAD/scripts/jax-maxtext``.
@@ -299,11 +276,27 @@ benchmark results:
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }}
For quantized training, use the following command:
For quantized training, run the script with the appropriate option for your Instinct GPU.
.. code-block:: shell
.. tab-set::
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} -q nanoo_fp8
.. tab-item:: MI355X and MI350X
For ``fp8`` quantized training on MI355X and MI350X GPUs, use the following command:
.. code-block:: shell
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} -q fp8
{% if model.model_repo not in ["Llama-3.1-70B", "Llama-3.3-70B"] %}
.. tab-item:: MI325X and MI300X
For ``nanoo_fp8`` quantized training on MI300X Series GPUs, use the following command:
.. code-block:: shell
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} -q nanoo_fp8
{% endif %}
{% endif %}
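The per-GPU choice of quantization option above could be scripted along these lines. This is a minimal sketch: ``quant_flag`` is a hypothetical helper, ``<model_repo>`` is a placeholder for the model name, and the mapping simply mirrors the two tabs (``fp8`` for MI355X and MI350X, ``nanoo_fp8`` for MI325X and MI300X). Note that the ``nanoo_fp8`` tab is not shown for Llama-3.1-70B and Llama-3.3-70B.

```shell
#!/bin/sh
# Sketch only: map an Instinct GPU model to the -q value used by
# jax-maxtext_benchmark_report.sh. quant_flag is a hypothetical helper.

quant_flag() (
    case "$1" in
        MI355X|MI350X) echo "fp8" ;;
        MI325X|MI300X) echo "nanoo_fp8" ;;
        *) echo "unknown GPU: $1" >&2; exit 1 ;;
    esac
)

# Hypothetical: set this to the GPU model in your system.
gpu="MI300X"

# Dry run: print the benchmark command rather than running it.
echo "./jax-maxtext_benchmark_report.sh -m <model_repo> -q $(quant_flag "$gpu")"
```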
{% if model.multinode_training_script and "multi-node" in model.doc_options %}
@@ -335,7 +328,7 @@ benchmark results:
{% else %}
.. rubric:: Multi-node training
For multi-node training examples, choose a model from :ref:`amd-maxtext-model-support-v257`
For multi-node training examples, choose a model from :ref:`amd-maxtext-model-support-v259`
with an available `multi-node training script <https://github.com/ROCm/MAD/tree/develop/scripts/jax-maxtext/gpu-rocm>`__.
{% endif %}
{% endfor %}
@@ -347,7 +340,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.


@@ -10,6 +10,12 @@ Training a model with Megatron-LM on ROCm
.. caution::
For a unified training solution on AMD GPUs with ROCm, the `rocm/megatron-lm
<https://hub.docker.com/r/rocm/megatron-lm/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including Megatron-LM, `torchtitan, and torchtune <primus-pytorch>`__.
Primus with Megatron is designed to replace this ROCm Megatron-LM training workflow.
To learn how to migrate workloads from Megatron-LM to Primus with Megatron,
see :doc:`previous-versions/megatron-lm-primus-migration-guide`.
@@ -17,30 +23,25 @@ Training a model with Megatron-LM on ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X series GPUs, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
Instinct™ GPUs, Megatron-LM delivers enhanced scalability, performance, and
resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides ready-to-use Docker images for MI300X series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
.. note::
This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with
Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release <previous-versions/megatron-lm-v25.6>`.
AMD provides ready-to-use Docker images for MI355X, MI350X, MI325X, and MI300X
GPUs containing essential components, including PyTorch, ROCm libraries, and
Megatron-LM utilities. They contain the following software components to
accelerate training workloads:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/megatron-lm-benchmark-models.yaml
{% set dockers = data.dockers %}
.. tab-set::
{% for docker in dockers %}
.. tab-item:: ``{{ docker.pull_tag }}``
:sync: {{ docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. list-table::
:header-rows: 1
@@ -51,17 +52,15 @@ workloads:
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
{% endfor %}
{% endfor %}
.. _amd-megatron-lm-model-support:
Supported models
================
The following models are supported for training performance benchmarking with Megatron-LM and ROCm
on AMD Instinct MI300X series GPUs.
on AMD Instinct MI300X Series GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -138,7 +137,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements:
@@ -151,33 +150,24 @@ Download the Docker image
{% set dockers = data.dockers %}
1. Use the following command to pull the Docker image from Docker Hub.
{% if dockers|length > 1 %}
.. tab-set::
{% for docker in data.dockers %}
.. tab-item:: {{ docker.doc_name }}
:sync: {{ docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
{% elif dockers|length == 1 %}
{% set docker = dockers[0] %}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endif %}
2. Launch the Docker container.
{% if dockers|length > 1 %}
.. tab-set::
{% for docker in dockers %}
.. tab-item:: {{ docker.doc_name }}
:sync: {{ docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
@@ -195,28 +185,7 @@ Download the Docker image
--shm-size 128G \
--name megatron_training_env \
{{ docker.pull_tag }}
{% endfor %}
{% elif dockers|length == 1 %}
{% set docker = dockers[0] %}
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 128G \
--name megatron_training_env \
{{ docker.pull_tag }}
{% endif %}
3. Use these commands if you exit the ``megatron_training_env`` container and need to return to it.
@@ -234,8 +203,8 @@ Download the Docker image
pip uninstall megatron-core
pip install -e .
The Docker container hosts
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__ at verified commit ``e8e9edc``.
The Docker container hosts a verified commit of
`<https://github.com/ROCm/Megatron-LM/tree/rocm_dev>`__.
.. _amd-megatron-lm-environment-setup:
@@ -533,7 +502,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars>`, and run training on
MI300X series GPUs with the AMD Megatron-LM environment.
MI300X Series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------
@@ -572,31 +541,73 @@ Single node training
To run training on a single node for Llama 3.1 8B FP8, navigate to the Megatron-LM folder and use the
following command.
.. code-block:: shell
.. tab-set::
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI350X
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=4 \
BS=512 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10 \
GEMM_TUNING=0 \
bash examples/llama/train_llama3.sh
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
For Llama 3.1 8B BF16, use the following command:
.. code-block:: shell
.. tab-set::
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI350X
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=4 \
BS=512 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=10 \
GEMM_TUNING=1 \
bash examples/llama/train_llama3.sh
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
TEE_OUTPUT=1 \
MBS=2 \
BS=128 \
TP=1 \
TE_FP8=0 \
SEQ_LENGTH=8192 \
MODEL_SIZE=8 \
TOTAL_ITERS=50 \
bash examples/llama/train_llama3.sh
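To see how the ``MBS`` and ``BS`` values in the commands above relate, the following sketch works out the number of gradient accumulation steps per optimizer step. It assumes ``BS`` is the global batch size and ``MBS`` the micro-batch size per GPU, that a node has 8 GPUs, and that pipeline and context parallelism are not used, so the data-parallel size is the GPU count divided by ``TP``. The ``grad_accum_steps`` helper is hypothetical and not part of Megatron-LM.

```shell
#!/bin/sh
# Sketch only: accumulation steps = BS / (MBS * data-parallel size).
# Assumptions: BS is the global batch size, MBS is per-GPU micro-batch size,
# and data-parallel size = GPUS / TP (no pipeline or context parallelism).

# grad_accum_steps BS MBS GPUS TP -> prints accumulation steps per optimizer step.
grad_accum_steps() (
    bs="$1"; mbs="$2"; gpus="$3"; tp="$4"
    dp=$(( gpus / tp ))          # data-parallel replicas
    per_pass=$(( mbs * dp ))     # samples consumed by one forward/backward pass
    if [ $(( bs % per_pass )) -ne 0 ]; then
        echo "BS=$bs is not divisible by MBS*DP=$per_pass" >&2
        exit 1
    fi
    echo $(( bs / per_pass ))
)

grad_accum_steps 128 2 8 1   # MI300X values above: prints 8
grad_accum_steps 512 4 8 1   # MI355X and MI350X values above: prints 16
```

Under these assumptions, the MI355X and MI350X configuration processes four times as many samples per optimizer step as the MI300X configuration, which is one reason ``TOTAL_ITERS`` values are not directly comparable between the tabs.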
.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b
@@ -625,29 +636,60 @@ Single node training
parallelism, MCore's distributed optimizer, gradient accumulation fusion,
or FP16.
.. container:: model-doc pyt_megatron_lm_train_llama-3.1-70b-proxy
To run the training on a single node for Llama 3.1 70B with proxy, use the following command.
.. code-block:: shell
CKPT_FORMAT=torch_dist \
TEE_OUTPUT=1 \
RECOMPUTE=1 \
MBS=3 \
BS=24 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=70 \
FSDP=1 \
TOTAL_ITERS=10 \
NUM_LAYERS=40 \
bash examples/llama/train_llama3.sh
To run the training on a single node for Llama 3.1 70B FP8, use the
following command.
.. note::
Use two or more nodes to run the *full* Llama 70B model with FP8 precision.
The MI300X configuration uses a proxy model. On MI300X GPUs, use two or more nodes
to run the full Llama 3.1 70B model with FP8 precision. MI355X and MI350X GPUs
can support the full 70B model with FP8 precision on a single node.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI350X
.. code-block:: shell
CKPT_FORMAT=torch_dist \
TEE_OUTPUT=1 \
RECOMPUTE=1 \
MBS=3 \
BS=24 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=70 \
FSDP=1 \
TOTAL_ITERS=10 \
bash examples/llama/train_llama3.sh
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
FP8_WEIGHT_TRANSPOSE_CACHE=0 \
CKPT_FORMAT=torch_dist \
TEE_OUTPUT=1 \
RECOMPUTE=1 \
MBS=3 \
BS=24 \
TP=1 \
TE_FP8=1 \
SEQ_LENGTH=8192 \
MODEL_SIZE=70 \
FSDP=1 \
TOTAL_ITERS=10 \
NUM_LAYERS=40 \
bash examples/llama/train_llama3.sh
.. note::
The MI300X configuration uses a proxy model. On MI300X GPUs, use two or more nodes
to run the full Llama 3.1 70B model with FP8 precision. MI355X and MI350X GPUs
can support the full 70B model with FP8 precision on a single node.
.. note::
@@ -987,6 +1029,11 @@ The benchmark tests support the following sets of variables.
``RECOMPUTE_NUM_LAYERS``
Number of layers used for checkpointing recompute.
Known issues
============
PyTorch Profiler may produce inaccurate traces when CPU activity profiling is enabled.
Previous versions
=================


@@ -16,7 +16,7 @@ environment for the MPT-30B model using the ``rocm/pytorch-training:v25.5``
base `Docker image <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`_
and the `LLM Foundry <https://github.com/mosaicml/llm-foundry>`_ framework.
This environment packages the following software components to train
on AMD Instinct MI300X series accelerators:
on AMD Instinct MI300X Series GPUs:
+--------------------------+--------------------------------+
| Software component | Version |
@@ -182,7 +182,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.


@@ -17,27 +17,35 @@ previous releases of the ``ROCm/jax-training`` Docker image on `Docker Hub <http
- Components
- Resources
* - 25.7 (latest)
-
* - 25.9 (latest)
-
* ROCm 7.0.0
* JAX 0.6.2
-
* :doc:`Documentation <../jax-maxtext>`
* `Docker Hub <https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7-jax060/images/sha256-7352212ae033a76dca2b9dceffc23c1b5f1a61a7a560082cf747a9bf1acfc9ce>`__
* - 25.7
-
* ROCm 6.4.1
* JAX 0.6.0, 0.5.0
-
* :doc:`Documentation <../jax-maxtext>`
-
* :doc:`Documentation <jax-maxtext-v25.7>`
* `Docker Hub (JAX 0.6.0) <https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7-jax060/images/sha256-7352212ae033a76dca2b9dceffc23c1b5f1a61a7a560082cf747a9bf1acfc9ce>`__
* `Docker Hub (JAX 0.5.0) <https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.7/images/sha256-45f4c727d4019a63fc47313d3a5f5a5105569539294ddfd2d742218212ae9025>`__
* - 25.5
-
-
* ROCm 6.3.4
* JAX 0.4.35
-
-
* :doc:`Documentation <jax-maxtext-v25.5>`
* `Docker Hub <https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.5/images/sha256-4e0516358a227cae8f552fb866ec07e2edcf244756f02e7b40212abfbab5217b>`__
* - 25.4
-
-
* ROCm 6.3.0
* JAX 0.4.31
-
-
* :doc:`Documentation <jax-maxtext-v25.4>`
* `Docker Hub <https://hub.docker.com/layers/rocm/jax-training/maxtext-v25.4/images/sha256-fb3eb71cd74298a7b3044b7130cf84113f14d518ff05a2cd625c11ea5f6a7b01>`__


@@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X series accelerators.
on AMD MI300X Series GPUs.
The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.4``) image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators,
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
@@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic
.. _amd-maxtext-model-support-v254:
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
* Llama 3.1 8B

View File

@@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X series accelerators.
on AMD MI300X Series GPUs.
The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.5``) image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators,
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
@@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic
.. _amd-maxtext-model-support-v255:
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
* Llama 3.3 70B

View File

@@ -0,0 +1,366 @@
:orphan:
.. meta::
:description: How to train a model using JAX MaxText for ROCm.
:keywords: ROCm, AI, LLM, train, jax, torch, Llama, flux, tutorial, docker
******************************************
Training a model with JAX MaxText on ROCm
******************************************
.. caution::
This documentation does not reflect the latest version of ROCm JAX MaxText
training performance documentation. See :doc:`../jax-maxtext` for the latest version.
MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X series GPUs.
The MaxText for ROCm training Docker image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml
{% set dockers = data.dockers %}
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. tab-item:: ``{{ docker.pull_tag }}``
:sync: {{ docker.pull_tag }}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
{% if jax_version == "0.6.0" %}
.. note::
Shardy is a new config in JAX 0.6.0. You might get related errors if it's
not configured correctly. For now you can turn it off by setting
``shardy=False`` during the training run. You can also follow the `migration
guide <https://docs.jax.dev/en/latest/shardy_jax_migration.html>`__ to enable
it.
{% endif %}
{% endfor %}
MaxText on ROCm provides the following key features to train large language models efficiently:
- Transformer Engine (TE)
- Flash Attention (FA) 3 -- with or without sequence input packing
- GEMM tuning
- Multi-node support
- NANOO FP8 quantization support
.. _amd-maxtext-model-support-v257:
Supported models
================
The following models are pre-optimized for performance on AMD Instinct MI300
series GPUs. Some instructions, commands, and available training
configurations in this documentation might vary by model -- select one to get
started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. note::
Some models, such as Llama 3, require an external license agreement through
a third party (for example, Meta).
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
Environment setup
=================
This Docker image is optimized for specific model configurations outlined
as follows. Performance can vary for other training workloads, as AMD
doesn't validate configurations and run conditions outside those described.
Pull the Docker image
---------------------
Use the following command to pull the Docker image from Docker Hub.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml
{% set dockers = data.dockers %}
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
.. _amd-maxtext-multi-node-setup-v257:
Multi-node configuration
------------------------
See :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your
environment for multi-node training.
.. _amd-maxtext-get-started-v257:
Benchmarking
============
Once the setup is complete, choose between two options to reproduce the
benchmark results:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/jax-maxtext-benchmark-models.yaml
.. _vllm-benchmark-mad:
{% set dockers = data.dockers %}
{% set model_groups = data.model_groups %}
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{model.mad_tag}}
.. tab-set::
{% if model.mad_tag and "single-node" in model.doc_options %}
.. tab-item:: MAD-integrated benchmarking
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. Use this command to run the performance benchmark test on the {{ model.model }} model
using one GPU with the :literal:`{{model.precision}}` data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{model.mad_tag}} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{model.mad_tag}}``. The latency and throughput reports of the
model are collected in the following path: ``~/MAD/perf.csv``.
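After the run finishes, one quick way to look at the collected results is to print the header and the most recent row of the report. The following is a minimal sketch, not part of MAD: the ``show_latest_perf`` helper is hypothetical, and it assumes that the report is a CSV file at ``~/MAD/perf.csv`` and that new runs are appended at the end of the file.

```shell
# Hypothetical helper (not shipped with MAD): print the header line and the
# last row of the report. Assumes the newest run is the last line of the CSV.
show_latest_perf() {
  local csv="${1:-$HOME/MAD/perf.csv}"
  if [ ! -f "$csv" ]; then
    echo "no report found at $csv" >&2
    return 1
  fi
  head -n 1 "$csv"   # column names
  tail -n 1 "$csv"   # most recent row (assumption: rows are appended)
}
```

Call ``show_latest_perf`` with no arguments to read the default location, or pass a path if MAD was cloned somewhere other than the home directory.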
{% endif %}
.. tab-item:: Standalone benchmarking
.. rubric:: Download the Docker image and required scripts
To run the JAX MaxText benchmark tool independently, first pull the
Docker image as shown in the following snippet.
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
{% if model.model_repo and "single-node" in model.doc_options %}
.. rubric:: Single node training
1. Set up environment variables.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
``MAD_SECRETS_HFTOKEN`` is your Hugging Face access token to access models, tokenizers, and data.
See `User access tokens <https://huggingface.co/docs/hub/en/security-tokens>`__.
``HF_HOME`` is where ``huggingface_hub`` will store local data. See `huggingface_hub CLI <https://huggingface.co/docs/huggingface_hub/main/en/guides/cli#huggingface-cli-download>`__.
If you already have downloaded or cached Hugging Face artifacts, set this variable to that path.
Downloaded files typically get cached to ``~/.cache/huggingface``.
2. Launch the Docker container.
.. tab-set::
{% for docker in dockers %}
{% set jax_version = docker.components["JAX"] %}
.. tab-item:: JAX {{ jax_version }}
:sync: {{ docker.pull_tag }}
.. code-block:: shell
docker run -it \
--device=/dev/dri \
--device=/dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
-v $HF_HOME:/hf_cache \
-e HF_HOME=/hf_cache \
-e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
{% endfor %}
3. In the Docker container, clone the ROCm MAD repository and navigate to the
benchmark scripts directory at ``MAD/scripts/jax-maxtext``.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
4. Run the setup script to install the libraries and datasets needed
for benchmarking.
.. code-block:: shell
./jax-maxtext_benchmark_setup.sh -m {{ model.model_repo }}
5. To run the training benchmark without quantization, use the following command:
.. code-block:: shell
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }}
For quantized training, use the following command:
.. code-block:: shell
./jax-maxtext_benchmark_report.sh -m {{ model.model_repo }} -q nanoo_fp8
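Before launching the container in step 2, it can help to confirm that the variables from step 1 are usable, because ``docker run`` mounts ``$HF_HOME`` and forwards ``MAD_SECRETS_HFTOKEN`` into the container. The following is a sketch of a pre-flight check to run on the host; ``check_hf_env`` is a hypothetical helper and is not part of MAD or the Docker image.

```shell
# Hypothetical pre-flight check (not part of MAD): returns 0 only when the
# Hugging Face token is set and HF_HOME points at an existing directory.
check_hf_env() {
  local status=0
  if [ -z "${MAD_SECRETS_HFTOKEN:-}" ]; then
    echo "MAD_SECRETS_HFTOKEN is not set" >&2
    status=1
  fi
  if [ -z "${HF_HOME:-}" ] || [ ! -d "${HF_HOME}" ]; then
    echo "HF_HOME is unset or is not an existing directory" >&2
    status=1
  fi
  return "$status"
}
```

For example, ``check_hf_env && echo "environment looks ready"`` prints the confirmation only when both checks pass. The check does not validate the token itself; it only confirms that a value is present.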
{% endif %}
{% if model.multinode_training_script and "multi-node" in model.doc_options %}
.. rubric:: Multi-node training
The following examples use SLURM to run on multiple nodes.
.. note::
The following scripts will launch the Docker container and run the
benchmark. Run them outside of any Docker container.
1. Make sure ``$HF_HOME`` is set before running the test. See
`ROCm benchmarking <https://github.com/ROCm/MAD/blob/develop/scripts/jax-maxtext/gpu-rocm/readme.md>`__
for more details on downloading the Llama models before running the
benchmark.
2. To run multi-node training for {{ model.model }},
use the
`multi-node training script <https://github.com/ROCm/MAD/blob/develop/scripts/jax-maxtext/gpu-rocm/{{ model.multinode_training_script }}>`__
under the ``scripts/jax-maxtext/gpu-rocm/`` directory.
3. Run the multi-node training benchmark script.
.. code-block:: shell
sbatch -N <num_nodes> {{ model.multinode_training_script }}
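The submission in step 3 can be wrapped so that the prerequisite from step 1 is checked first. The following is a sketch under stated assumptions: ``submit_multinode`` and the ``DRY_RUN`` variable are hypothetical names introduced here for illustration, and only ``sbatch -N`` comes from the steps above. With ``DRY_RUN`` set, the wrapper prints the command instead of submitting it.

```shell
# Hypothetical wrapper (not part of MAD): validate inputs, then submit.
# Usage: submit_multinode <num_nodes> <path-to-training-script>
submit_multinode() {
  local nodes="$1" script="$2"
  case "$nodes" in
    ''|*[!0-9]*) echo "node count must be a number" >&2; return 2 ;;
  esac
  if [ -z "${HF_HOME:-}" ]; then
    echo "HF_HOME is not set" >&2
    return 1
  fi
  if [ ! -f "$script" ]; then
    echo "script not found: $script" >&2
    return 1
  fi
  # When DRY_RUN is non-empty, this expands to: echo sbatch -N ...
  ${DRY_RUN:+echo} sbatch -N "$nodes" "$script"
}
```

Running the wrapper once with ``DRY_RUN=1`` shows the exact ``sbatch`` command before anything is queued.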
{% else %}
.. rubric:: Multi-node training
For multi-node training examples, choose a model from :ref:`amd-maxtext-model-support-v257`
with an available `multi-node training script <https://github.com/ROCm/MAD/tree/develop/scripts/jax-maxtext/gpu-rocm>`__.
{% endif %}
{% endfor %}
{% endfor %}
Further reading
===============
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`jax-maxtext-history` to find documentation for previous releases
of the ``ROCm/jax-training`` Docker image.

View File

@@ -16,62 +16,73 @@ previous releases of the ``ROCm/megatron-lm`` Docker image on `Docker Hub <https
- Components
- Resources
* - v25.8 (latest)
-
* ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406
-
* - v25.9 (latest)
-
* ROCm 7.0.0
* Primus 0.3.0
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
-
* :doc:`Primus Megatron documentation <../primus-megatron>`
* :doc:`Megatron-LM (legacy) documentation <../megatron-lm>`
* `Docker Hub (py310) <https://hub.docker.com/r/rocm/megatron-lm/tags>`__
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
* - v25.8
-
* ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406
-
* :doc:`Primus Megatron documentation <primus-megatron-v25.8>`
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.8>`
* `Docker Hub (py310) <https://hub.docker.com/layers/rocm/megatron-lm/v25.8_py310/images/sha256-0030c4a3dcb233c66dd5f61135821f9f5c4e321cbe0a2cdc74f110752f28c869>`__
* - v25.7
-
-
* ROCm 6.4.2
* PyTorch 2.8.0a0+gitd06a406
-
-
* :doc:`Primus Megatron documentation <primus-megatron-v25.7>`
* :doc:`Megatron-LM (legacy) documentation <megatron-lm-v25.7>`
* `Docker Hub (py310) <https://hub.docker.com/layers/rocm/megatron-lm/v25.7_py310/images/sha256-6189df849feeeee3ae31bb1e97aef5006d69d2b90c134e97708c19632e20ab5a>`__
* - v25.6
-
-
* ROCm 6.4.1
* PyTorch 2.8.0a0+git7d205b2
-
-
* :doc:`Documentation <megatron-lm-v25.6>`
* `Docker Hub (py312) <https://hub.docker.com/layers/rocm/megatron-lm/v25.6_py312/images/sha256-482ff906532285bceabdf2bda629bd32cb6174d2d07f4243a736378001b28df0>`__
* `Docker Hub (py310) <https://hub.docker.com/layers/rocm/megatron-lm/v25.6_py310/images/sha256-9627bd9378684fe26cb1a10c7dd817868f553b33402e49b058355b0f095568d6>`__
* - v25.5
-
-
* ROCm 6.3.4
* PyTorch 2.8.0a0+gite2f9759
-
-
* :doc:`Documentation <megatron-lm-v25.5>`
* `Docker Hub (py312) <https://hub.docker.com/layers/rocm/megatron-lm/v25.5_py312/images/sha256-4506f18ba188d24189c6b1f95130b425f52c528a543bb3f420351824edceadc2>`__
* `Docker Hub (py310) <https://hub.docker.com/layers/rocm/megatron-lm/v25.5_py310/images/sha256-743fbf1ceff7a44c4452f938d783a7abf143737d1c15b2b95f6f8a62e0fd048b>`__
* - v25.4
-
-
* ROCm 6.3.0
* PyTorch 2.7.0a0+git637433
-
* PyTorch 2.7.0a0+git637433
-
* :doc:`Documentation <megatron-lm-v25.4>`
* `Docker Hub <https://hub.docker.com/layers/rocm/megatron-lm/v25.4/images/sha256-941aa5387918ea91c376c13083aa1e6c9cab40bb1875abbbb73bbb65d8736b3f>`__
* - v25.3
-
-
* ROCm 6.3.0
* PyTorch 2.7.0a0+git637433
-
* PyTorch 2.7.0a0+git637433
-
* :doc:`Documentation <megatron-lm-v25.3>`
* `Docker Hub <https://hub.docker.com/layers/rocm/megatron-lm/v25.3/images/sha256-1e6ed9bdc3f4ca397300d5a9907e084ab5e8ad1519815ee1f868faf2af1e04e2>`__
* - v24.12-dev
-
-
* ROCm 6.1.0
* PyTorch 2.4.0
-
-
* :doc:`Documentation <megatron-lm-v24.12-dev>`
* `Docker Hub <https://hub.docker.com/layers/rocm/megatron-lm/24.12-dev/images/sha256-5818c50334ce3d69deeeb8f589d83ec29003817da34158ebc9e2d112b929bf2e>`__

View File

@@ -17,12 +17,12 @@ Training a model with ROCm Megatron-LM
The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to
enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
GPUs, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support-24-12>`
like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
For ease of use, AMD provides a ready-to-use Docker image for MI300X GPUs containing essential
components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
following software to accelerate training workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-24-12:
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
* Llama 2 7B
@@ -208,14 +208,14 @@ Use the following script to run the RCCL test for four MI300X GPU nodes. Modify
.. _mi300x-amd-megatron-lm-training-v2412:
Start training on MI300X accelerators
Start training on MI300X GPUs
=====================================
The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v2412:

View File

@@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm
The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
designed to enable efficient training of large-scale language models on AMD
GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers
GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers
enhanced scalability, performance, and resource utilization for AI workloads.
It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
DeepSeek, enabling developers to train next-generation AI models more
efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
AMD provides a ready-to-use Docker image for MI300X accelerators containing
AMD provides a ready-to-use Docker image for MI300X GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-25-3:
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
* Llama 2 7B
@@ -123,7 +123,7 @@ The pre-built ROCm Megatron-LM environment allows users to quickly validate syst
training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v253:
@@ -334,7 +334,7 @@ Multi-node training
inside a Docker, either install the drivers inside the Docker container or pass the network
drivers from the host while creating the Docker container.
Start training on AMD Instinct accelerators
Start training on AMD Instinct GPUs
===========================================
The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
@@ -345,8 +345,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD Megatron-LM Docker image.
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD Megatron-LM Docker image.
.. tab-set::

View File

@@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm
The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
designed to enable efficient training of large-scale language models on AMD
GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers
GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers
enhanced scalability, performance, and resource utilization for AI workloads.
It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
DeepSeek, enabling developers to train next-generation AI models more
efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
AMD provides a ready-to-use Docker image for MI300X series accelerators containing
AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-25-4:
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
* Llama 3.1 8B
@@ -105,7 +105,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the :doc:`latest version of this training benchmarking environment <../megatron-lm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -124,7 +124,7 @@ The prebuilt ROCm Megatron-LM environment allows users to quickly validate syste
training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v254:
@@ -367,7 +367,7 @@ Multi-node training
# Specify which RDMA interfaces to use for communication
export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
Start training on AMD Instinct accelerators
Start training on AMD Instinct GPUs
===========================================
The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
@@ -378,8 +378,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD Megatron-LM Docker image.
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD Megatron-LM Docker image.
.. tab-set::

View File

@@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides a ready-to-use Docker image for MI300X series accelerators containing
AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-v255:
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.5-benchmark-models.yaml
@@ -131,7 +131,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -154,7 +154,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v255:
@@ -536,7 +536,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v255>`, and run training on
MI300X series accelerators with the AMD Megatron-LM environment.
MI300X Series GPUs with the AMD Megatron-LM environment.
Single node training
^^^^^^^^^^^^^^^^^^^^

View File

@@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`__ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides ready-to-use Docker images for MI300X series accelerators containing
AMD provides ready-to-use Docker images for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -65,7 +65,7 @@ workloads:
.. _amd-megatron-lm-model-support-v256:
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
Supported models
================
@@ -124,7 +124,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -147,7 +147,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v256:
@@ -589,7 +589,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v256>`, and run training on
MI300X series accelerators with the AMD Megatron-LM environment.
MI300X Series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------

View File

@@ -22,13 +22,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides ready-to-use Docker images for MI300X series accelerators containing
AMD provides ready-to-use Docker images for MI300X Series GPUs containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -66,7 +66,7 @@ workloads:
================
The following models are supported for training performance benchmarking with Megatron-LM and ROCm
on AMD Instinct MI300X series accelerators.
on AMD Instinct MI300X Series GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -120,7 +120,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -143,7 +143,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v257:
@@ -592,7 +592,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v257>`, and run training on
MI300X series accelerators with the AMD Megatron-LM environment.
MI300X Series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------


@@ -15,7 +15,7 @@ Training a model with Primus and Megatron-LM
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework designed to streamline training. It streamlines LLM
training on AMD Instinct accelerators using a modular, reproducible configuration paradigm.
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
Primus is backend-agnostic and supports multiple training engines -- including Megatron.
.. note::
@@ -25,7 +25,7 @@ Primus is backend-agnostic and supports multiple training engines -- including M
workloads from Megatron-LM to Primus with Megatron, see
:doc:`megatron-lm-primus-migration-guide`.
For ease of use, AMD provides a ready-to-use Docker image for MI300 series accelerators
For ease of use, AMD provides a ready-to-use Docker image for MI300 Series GPUs
containing essential components for Primus and Megatron-LM.
.. note::
@@ -53,7 +53,7 @@ containing essential components for Primus and Megatron-LM.
Supported models
================
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
Some instructions, commands, and training examples in this documentation might
vary by model -- select one to get started.
@@ -120,7 +120,7 @@ system's configuration.
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the ``{{ docker.pull_tag }}`` image.
reproduce the benchmark results on MI300X Series GPUs with the ``{{ docker.pull_tag }}`` image.
.. _amd-primus-megatron-lm-requirements-v257:
@@ -231,7 +231,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-primus-megatron-lm-benchmark-test-vars>`, and run training on
MI300X series accelerators with the AMD Megatron-LM environment.
MI300X Series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------


@@ -0,0 +1,667 @@
:orphan:
.. meta::
:description: How to train a model using Megatron-LM for ROCm.
:keywords: ROCm, AI, LLM, train, Megatron-LM, megatron, Llama, tutorial, docker, torch
********************************************
Training a model with Primus and Megatron-LM
********************************************
.. caution::
This documentation does not reflect the latest version of ROCm Megatron-LM
training performance documentation. See :doc:`../primus-megatron` for the latest version.
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework. It streamlines LLM
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
Primus is backend-agnostic and supports multiple training engines -- including Megatron.
.. note::
Primus with Megatron is designed to replace the :doc:`ROCm Megatron-LM training <../megatron-lm>` workflow.
To learn how to migrate workloads from Megatron-LM to Primus with Megatron,
see :doc:`megatron-lm-primus-migration-guide`.
For ease of use, AMD provides a ready-to-use Docker image for MI300 series GPUs
containing essential components for Primus and Megatron-LM. This Docker image uses Primus
Turbo optimizations for performance; this release adds Primus Turbo support
with optimized attention and grouped GEMM kernels.
.. note::
This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with
Python 3.12 and Ubuntu 24.04, see the :doc:`previous ROCm Megatron-LM v25.6 Docker release <megatron-lm-v25.6>`.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.8-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-primus-megatron-lm-model-support:
Supported models
================
The following models are pre-optimized for performance on AMD Instinct MI300X series GPUs.
Some instructions, commands, and training examples in this documentation might
vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.8-benchmark-models.yaml
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-3 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. note::
Some models, such as Llama, require an external license agreement through
a third party (for example, Meta).
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
.. _mi300x-amd-primus-megatron-lm-training:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.8-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series GPUs with the ``{{ docker.pull_tag }}`` image.
.. _amd-primus-megatron-lm-requirements:
Download the Docker image
-------------------------
1. Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ docker.pull_tag }}
2. Launch the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
--shm-size 128G \
--name primus_training_env \
{{ docker.pull_tag }}
3. Use these commands if you exit the ``primus_training_env`` container and need to return to it.
.. code-block:: shell
docker start primus_training_env
docker exec -it primus_training_env bash
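The two commands in step 3 can be wrapped in a small shell function. This is a minimal sketch, not something shipped in the Docker image: ``enter_training_env`` is a hypothetical helper name, and it assumes the container was created with ``--name primus_training_env`` as shown in step 2.

```shell
# Hypothetical convenience wrapper (not part of Primus): start the named
# container if it exists, then open an interactive shell in it.
enter_training_env() {
    name="${1:-primus_training_env}"
    # `docker container inspect` exits non-zero when the container is unknown.
    if ! docker container inspect "$name" >/dev/null 2>&1; then
        echo "container $name not found; create it with docker run first" >&2
        return 1
    fi
    # Starting an already running container is harmless.
    docker start "$name" >/dev/null && docker exec -it "$name" bash
}
```

Call ``enter_training_env`` with no arguments to return to the default container, or pass a different container name as the first argument.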
The Docker container hosts verified commit ``927a717`` of the `Primus
<https://github.com/AMD-AGI/Primus/tree/927a71702784347a311ca48fd45f0f308c6ef6dd>`__ repository.
.. _amd-primus-megatron-lm-environment-setup:
Configuration
=============
Primus defines a training configuration in YAML for each model in
`examples/megatron/configs <https://github.com/AMD-AGI/Primus/tree/927a71702784347a311ca48fd45f0f308c6ef6dd/examples/megatron/configs>`__.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.8-benchmark-models.yaml
{% set model_groups = data.model_groups %}
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
To update training parameters for {{ model.model }}, you can update ``examples/megatron/configs/{{ model.config_name }}``.
Note that training configuration YAML files for other models follow this naming convention.
{% endfor %}
{% endfor %}
.. note::
See :ref:`Key options <amd-primus-megatron-lm-benchmark-test-vars>` for more information on configuration options.
Dataset options
---------------
You can use either mock data or real data for training.
* Mock data can be useful for testing and validation. Use the ``mock_data`` field to toggle between mock and real data. The default
value is ``true`` (mock data enabled).
.. code-block:: yaml
mock_data: true
* If you're using a real dataset, update the ``train_data_path`` field to point to the location of your dataset.
.. code-block:: yaml
mock_data: false
train_data_path: /path/to/your/dataset
Ensure that the files are accessible inside the Docker container.
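One way to confirm that the dataset is visible from inside the container is a quick readability check before launching training. The following is a sketch under assumptions: ``check_dataset_path`` is a hypothetical helper, not part of Primus, and it treats ``train_data_path`` as either a file, a directory, or a filename prefix.

```shell
# Hypothetical helper (not part of Primus): succeed if anything readable
# matches the given path or path prefix in the current environment.
check_dataset_path() {
    prefix="$1"
    for candidate in "$prefix"*; do
        if [ -r "$candidate" ]; then
            echo "ok: found $candidate"
            return 0
        fi
    done
    echo "not found: nothing readable matches $prefix" >&2
    return 1
}
```

Run ``check_dataset_path /path/to/your/dataset`` inside the container, replacing the placeholder with your ``train_data_path`` value. If the check fails, review the ``-v`` volume mounts passed to ``docker run``.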
.. _amd-primus-megatron-lm-tokenizer:
Tokenizer
---------
Set the ``HF_TOKEN`` environment variable to a token with the
right permissions to access the tokenizer for each model.
.. code-block:: bash
# Export your HF_TOKEN in the workspace
export HF_TOKEN=<your_hftoken>
.. note::
In Primus, each model uses a tokenizer from Hugging Face. For example, the Llama
3.1 8B model uses ``tokenizer_model: meta-llama/Llama-3.1-8B`` and
``tokenizer_type: Llama3Tokenizer`` defined in the `llama3.1-8B model
<https://github.com/AMD-AGI/Primus/blob/927a71702784347a311ca48fd45f0f308c6ef6dd/examples/megatron/configs/llama3.1_8B-pretrain.yaml>`__
definition.
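Because the tokenizer is fetched from Hugging Face, it can be convenient to fail early when ``HF_TOKEN`` is missing. The check below is a minimal sketch; ``require_hf_token`` is a hypothetical helper name and is not provided by Primus.

```shell
# Hypothetical pre-flight check (not part of Primus): stop early when
# HF_TOKEN is unset or empty, rather than failing later during download.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; export it before running training" >&2
        return 1
    fi
    echo "HF_TOKEN is set"
}
```

For example, chaining ``require_hf_token &&`` in front of a training command skips the launch when the token has not been exported.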
.. _amd-primus-megatron-lm-run-training:
Run training
============
Use the following example commands to set up the environment, configure
:ref:`key options <amd-primus-megatron-lm-benchmark-test-vars>`, and run training on
MI300X series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------
To run training on a single node, navigate to ``/workspace/Primus`` and use the following setup commands:
.. code-block:: shell
pip install -r requirements.txt
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_CK_USES_BWD_V3=1
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.3 70B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run pre-training for Llama 3.3 70B BF16, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 16 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 8B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run pre-training for Llama 3.1 8B FP8, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
For Llama 3.1 8B BF16, use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 3.1 70B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run pre-training for Llama 3.1 70B BF16, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50
To run training on a single node for Llama 3.1 70B FP8 with a 40-layer proxy, use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--num_layers 40 \
--fp8 hybrid
.. note::
Use two or more nodes to run the *full* Llama 70B model with FP8 precision.
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 7B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run pre-training for Llama 2 7B FP8, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
To run pre-training for Llama 2 7B BF16, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Llama 2 70B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run pre-training for Llama 2 70B BF16, run:
.. code-block:: shell
EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v3-proxy
Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V3.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy,
use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \
bash examples/run_pretrain.sh \
--num_layers 3 \
--moe_layer_freq 1 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_deepseek-v2-lite-16b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to DeepSeek-V2-Lite.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel),
use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \
bash examples/run_pretrain.sh \
--global_batch_size 256 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x7B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run training on a single node for Mixtral 8x7B (MoE with expert parallel),
use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x22b-proxy
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Mixtral 8x22B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run training on a single node for Mixtral 8x22B (MoE with expert parallel) with 4-layer proxy,
use the following command:
.. code-block:: shell
EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \
bash examples/run_pretrain.sh \
--num_layers 4 \
--pipeline_model_parallel_size 1 \
--micro_batch_size 1 \
--global_batch_size 16 \
--train_iters 50
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-7b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 7B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run training on a single node for Qwen 2.5 7B BF16, use the following
command:
.. code-block:: shell
EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
For FP8, use the following command.
.. code-block:: shell
EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
bash examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
Once setup is complete, run the appropriate training command.
The following run commands are tailored to Qwen 2.5 72B.
See :ref:`amd-primus-megatron-lm-model-support` to switch to another available model.
To run the training on a single node for Qwen 2.5 72B BF16, use the following command.
.. code-block:: shell
EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
.. _amd-primus-megatron-multi-node-examples:
Multi-node training examples
----------------------------
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training.
To run training on multiple nodes, use the
`run_slurm_pretrain.sh <https://github.com/AMD-AGI/Primus/blob/927a71702784347a311ca48fd45f0f308c6ef6dd/examples/run_slurm_pretrain.sh>`__
script to launch the multi-node workload. Use the following steps to set up your environment:
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-megatron-v25.8-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
.. code-block:: shell
cd /workspace/Primus/
export DOCKER_IMAGE={{ docker.pull_tag }}
export HF_TOKEN=<your_HF_token>
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_CK_USES_BWD_V3=1
export NCCL_IB_HCA=<your_NCCL_IB_HCA> # specify which RDMA interfaces to use for communication
export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME> # your network interface
export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME> # your network interface
export NCCL_IB_GID_INDEX=3 # Set InfiniBand GID index for NCCL communication. Default is 3 for RoCE
.. note::
* Make sure the correct network drivers are installed on the nodes. If you're running inside a Docker container, either install the drivers inside the container or pass the network drivers through from the host when creating the container.
* If ``NCCL_IB_HCA`` and ``NCCL_SOCKET_IFNAME`` are not set, Primus tries to auto-detect them. However, since NICs can vary across clusters, explicitly exporting your NCCL parameters for the cluster is encouraged.
* To find your network interface, you can use ``ip a``.
* To find RDMA interfaces, you can use ``ibv_devices`` to get the list of all the RDMA/IB devices.
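As an illustration of the last two notes, the output of ``ibv_devices`` can be turned into a comma-separated value for ``NCCL_IB_HCA``. This is a sketch under assumptions: ``hca_list_from_ibv`` is a hypothetical helper, and it assumes the usual ``ibv_devices`` layout of a two-line header followed by one device per line.

```shell
# Hypothetical helper: read `ibv_devices` output on stdin and print the
# device names as a comma-separated list. Skips the two header lines.
hca_list_from_ibv() {
    awk 'NR > 2 && NF > 0 { print $1 }' | paste -sd, -
}
```

For example, ``export NCCL_IB_HCA=$(ibv_devices | hca_list_from_ibv)`` selects every listed device. On most clusters you would likely trim the list to the NICs intended for training traffic.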
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.3-70b
To train Llama 3.3 70B FP8 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 3.3 70B BF16 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 12
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-8b
To train Llama 3.1 8B FP8 on 8 nodes, run:
.. code-block:: shell
# Adjust the training parameters. For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case
NNODES=8 EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--global_batch_size 1024 \
--fp8 hybrid
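The comment in the command above scales the global batch size linearly with the node count. As a rough sketch of that arithmetic, assuming a single-node global batch size of 128 (a value inferred here from 1024 / 8, not stated in this guide):

```shell
# Illustrative values; single_node_bs is an inferred assumption.
NNODES=8
single_node_bs=128

# Linear scaling: each node contributes its single-node batch.
global_batch_size=$((NNODES * single_node_bs))
echo "--global_batch_size $global_batch_size"
# prints: --global_batch_size 1024
```

Adjust ``single_node_bs`` to match the value used in your single-node runs.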
.. container:: model-doc primus_pyt_megatron_lm_train_llama-3.1-70b
To train Llama 3.1 70B FP8 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 3.1 70B BF16 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 12
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-7b
To train Llama 2 7B FP8 on 8 nodes, run:
.. code-block:: shell
# Adjust the training parameters. For example, `global_batch_size: 8 * #single_node_bs` for 8 nodes in this case
NNODES=8 EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--global_batch_size 2048 \
--fp8 hybrid
.. container:: model-doc primus_pyt_megatron_lm_train_llama-2-70b
To train Llama 2 70B FP8 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 2 70B BF16 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 1536 \
--recompute_num_layers 12
.. container:: model-doc primus_pyt_megatron_lm_train_mixtral-8x7b
To train Mixtral 8x7B BF16 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 256
.. container:: model-doc primus_pyt_megatron_lm_train_qwen2.5-72b
To train Qwen2.5 72B FP8 on 8 nodes, run:
.. code-block:: shell
NNODES=8 EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 4 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
.. _amd-primus-megatron-lm-benchmark-test-vars:
Key options
-----------
The following are key options to take note of:
fp8
``hybrid`` enables FP8 GEMMs.
use_torch_fsdp2
``use_torch_fsdp2: 1`` enables PyTorch FSDP2. If FSDP is enabled,
set ``use_distributed_optimizer`` and ``overlap_param_gather`` to ``false``.
profile
To enable PyTorch profiling, set these parameters:
.. code-block:: yaml
profile: true
use_pytorch_profiler: true
profile_step_end: 7
profile_step_start: 6
train_iters
The total number of iterations (default: 50).
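mock_data
Uses mock data instead of a real dataset. ``true`` by default.
micro_batch_size
Micro batch size: the number of samples each model replica processes per forward and backward pass.
global_batch_size
Global batch size: the total number of samples processed per training iteration across all data-parallel replicas.
recompute_granularity
For activation checkpointing.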
num_layers
For using a reduced number of layers as with proxy models.
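The batch size options interact with the parallelism layout. As a sketch of the usual Megatron-LM convention, the number of micro-batches accumulated per training step is ``global_batch_size / (micro_batch_size * data_parallel_size)``. The parallelism sizes below are illustrative assumptions, not values taken from the Primus configs.

```shell
# Illustrative values only; the parallelism sizes are assumptions.
world_size=64              # for example, 8 nodes x 8 GPUs
tensor_parallel=8
pipeline_parallel=1
micro_batch_size=1
global_batch_size=256

# GPUs that are not used for model parallelism hold data-parallel replicas.
data_parallel=$((world_size / (tensor_parallel * pipeline_parallel)))
samples_per_pass=$((micro_batch_size * data_parallel))

if [ $((global_batch_size % samples_per_pass)) -ne 0 ]; then
    echo "global_batch_size must be divisible by micro_batch_size * data_parallel" >&2
else
    echo "micro-batches per step: $((global_batch_size / samples_per_pass))"
fi
# prints: micro-batches per step: 32
```

If the division is not exact, adjust ``global_batch_size`` or ``micro_batch_size`` so that the global batch splits evenly across the data-parallel replicas.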
Further reading
===============
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`megatron-lm-history` to find documentation for previous releases
of the ``ROCm/megatron-lm`` Docker image.
This training environment now uses Primus with Megatron as the primary
configuration. Limited support for the legacy ROCm Megatron-LM is still
available; see the :doc:`../megatron-lm` documentation.


@@ -0,0 +1,312 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
****************************************
Training a model with Primus and PyTorch
****************************************
.. caution::
This documentation does not reflect the latest version of ROCm Primus PyTorch training
performance benchmark documentation. See :doc:`../primus-pytorch` for the latest version.
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework. It streamlines LLM
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
Primus now supports the PyTorch torchtitan backend.
.. note::
Primus with the PyTorch torchtitan backend is designed to replace the :doc:`ROCm PyTorch training <../pytorch-training>` workflow.
See :doc:`../pytorch-training` for steps to run workloads without Primus.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.8-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
For ease of use, AMD provides a ready-to-use Docker image -- ``{{
docker.pull_tag }}`` -- for MI300X series GPUs containing essential
components for Primus and PyTorch training with
Primus Turbo optimizations.
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-primus-pytorch-model-support-v258:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.8-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0" style="display: none;">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-3 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. seealso::
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see the :doc:`../pytorch-training` documentation (without Primus).
.. _amd-primus-pytorch-performance-measurements-v258:
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/primus-pytorch-v25.8-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
Pull the Docker image
=====================
Use the following command to pull the `Docker image <{{ unified_docker.docker_hub_url }}>`_ from Docker Hub.
.. code-block:: shell
docker pull {{ unified_docker.pull_tag }}
Run training
============
{% set model_groups = data.model_groups %}
Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`../pytorch-training` (without Primus).
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`MAD <https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
{% endfor %}
{% endfor %}
.. tab-item:: Standalone benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model.
.. rubric:: Download the Docker image and required packages
1. Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ unified_docker.pull_tag }}
2. Run the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ unified_docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
3. In the Docker container, clone the `MAD <https://github.com/ROCm/MAD>`__
repository and navigate to the benchmark scripts directory
``/workspace/MAD/scripts/pytorch_train``.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
.. rubric:: Prepare training datasets and dependencies
1. The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
2. Run the setup script to install libraries and datasets needed for benchmarking.
.. code-block:: shell
./pytorch_benchmark_setup.sh
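Because gated downloads depend on ``HF_TOKEN``, it can help to confirm that the variable is set before running the setup script. The following is a sketch only. ``require_hf_token`` is a hypothetical helper and not part of MAD.

.. code-block:: shell

   # Hypothetical guard (not part of MAD): succeed only when HF_TOKEN
   # is set to a non-empty value in the current shell.
   require_hf_token() {
       if [ -z "${HF_TOKEN:-}" ]; then
           echo "HF_TOKEN is not set; gated downloads may fail." >&2
           return 1
       fi
   }

For example, ``require_hf_token && ./pytorch_benchmark_setup.sh`` runs the setup script only when the token is present.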
.. rubric:: Pretraining
To start the pretraining benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
.. code-block:: shell
./pytorch_benchmark_report.sh -t pretrain \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "primus_pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Currently, only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
.. rubric:: Benchmarking examples
Use the following command to train {{ model.model }} with BF16 precision using Primus torchtitan.
.. code-block:: shell
./pytorch_benchmark_report.sh -m {{ model.model_repo }}
To train {{ model.model }} with FP8 precision, use the following command.
.. code-block:: shell
./pytorch_benchmark_report.sh -m {{ model.model_repo }} -p FP8
{% endfor %}
{% endfor %}
Further reading
===============
- For an introduction to Primus, see `Primus: A Lightweight, Unified Training
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.

@@ -16,51 +16,62 @@ previous releases of the ``ROCm/pytorch-training`` Docker image on `Docker Hub <
- Components
- Resources
* - v25.8 (latest)
-
* ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406
-
* - v25.9 (latest)
-
* ROCm 7.0.0
* Primus 0.3.0
* PyTorch 2.9.0.dev20250821+rocm7.0.0.lw.git125803b7
-
* :doc:`Primus PyTorch Training documentation <../primus-pytorch>`
* :doc:`PyTorch training (legacy) documentation <../pytorch-training>`
* `Docker Hub <https://hub.docker.com/r/rocm/pytorch-training/tags>`__
* `Docker Hub (gfx950) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx950/images/sha256-1a198be32f49efd66d0ff82066b44bd99b3e6b04c8e0e9b36b2c481e13bff7b6>`__
* `Docker Hub (gfx942) <https://hub.docker.com/layers/rocm/primus/v25.9_gfx942/images/sha256-df6ab8f45b4b9ceb100fb24e19b2019a364e351ee3b324dbe54466a1d67f8357>`__
* - v25.8
-
* ROCm 6.4.3
* PyTorch 2.8.0a0+gitd06a406
-
* :doc:`Primus PyTorch Training documentation <primus-pytorch-v25.8>`
* :doc:`PyTorch training (legacy) documentation <pytorch-training-v25.8>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.8/images/sha256-5082ae01d73fec6972b0d84e5dad78c0926820dcf3c19f301d6c8eb892e573c5>`__
* - v25.7
-
-
* ROCm 6.4.2
* PyTorch 2.8.0a0+gitd06a406
-
-
* :doc:`Documentation <pytorch-training-v25.7>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.7/images/sha256-cc6fd840ab89cb81d926fc29eca6d075aee9875a55a522675a4b9231c9a0a712>`__
* - v25.6
-
-
* ROCm 6.3.4
* PyTorch 2.8.0a0+git7d205b2
-
-
* :doc:`Documentation <pytorch-training-v25.6>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.6/images/sha256-a4cea3c493a4a03d199a3e81960ac071d79a4a7a391aa9866add3b30a7842661>`__
* - v25.5
-
-
* ROCm 6.3.4
* PyTorch 2.7.0a0+git637433
-
-
* :doc:`Documentation <pytorch-training-v25.5>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`__
* - v25.4
-
-
* ROCm 6.3.0
* PyTorch 2.7.0a0+git637433
-
-
* :doc:`Documentation <pytorch-training-v25.4>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.4/images/sha256-fa98a9aa69968e654466c06f05aaa12730db79b48b113c1ab4f7a5fe6920a20b>`__
* - v25.3
-
-
* ROCm 6.3.0
* PyTorch 2.7.0a0+git637433
-
-
* :doc:`Documentation <pytorch-training-v25.3>`
* `Docker Hub <https://hub.docker.com/layers/rocm/pytorch-training/v25.3/images/sha256-0ffdde1b590fd2787b1c7adf5686875b100980b0f314090901387c44253e709b>`__

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.3``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
* Llama 3.1 8B
@@ -237,7 +237,7 @@ Along with the following datasets:
* `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`_
Start training on AMD Instinct accelerators
Start training on AMD Instinct GPUs
===========================================
The prebuilt PyTorch with ROCm training environment allows users to quickly validate
@@ -248,8 +248,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD PyTorch training Docker image.
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD PyTorch training Docker image.
Once your environment is set up, use the following commands and examples to start benchmarking.

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.4``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
* Llama 3.1 8B
@@ -76,7 +76,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X accelerators or ROCm software.
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -260,7 +260,7 @@ the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI325X and MI300X
accelerators with the AMD PyTorch training Docker image.
GPUs with the AMD PyTorch training Docker image.
Once your environment is set up, use the following commands and examples to start benchmarking.

@@ -19,7 +19,7 @@ model training with GPU-optimized components for transformer-based models.
The `PyTorch for ROCm training Docker <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`_
(``rocm/pytorch-training:v25.5``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -45,7 +45,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
* Llama 3.3 70B
@@ -79,7 +79,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X accelerators or ROCm software.
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The `PyTorch for ROCm training Docker <https://hub.docker.com/layers/rocm/pytorch-training/v25.6/images/sha256-a4cea3c493a4a03d199a3e81960ac071d79a4a7a391aa9866add3b30a7842661>`_
(``rocm/pytorch-training:v25.6``) image provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate
model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate
training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.6-benchmark-models.yaml
@@ -99,7 +99,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X accelerators or ROCm software.
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -444,7 +444,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

@@ -10,7 +10,7 @@ Training a model with PyTorch for ROCm
.. caution::
This documentation does not reflect the latest version of ROCm vLLM
This documentation does not reflect the latest version of ROCm PyTorch training
performance benchmark documentation. See :doc:`../pytorch-training` for the latest version.
PyTorch is an open-source machine learning framework that is widely used for
@@ -22,7 +22,7 @@ model training with GPU-optimized components for transformer-based models.
{% set docker = dockers[0] %}
The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__
(``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate
model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate
training workloads:
.. list-table::
@@ -41,7 +41,7 @@ model training with GPU-optimized components for transformer-based models.
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -124,7 +124,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X accelerators or ROCm software.
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
@@ -555,7 +555,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

@@ -0,0 +1,588 @@
:orphan:
.. meta::
:description: How to train a model using PyTorch for ROCm.
:keywords: ROCm, AI, LLM, train, PyTorch, torch, Llama, flux, tutorial, docker
**************************************
Training a model with PyTorch on ROCm
**************************************
.. caution::
This documentation covers a previous version of the ROCm PyTorch training
performance benchmark. See :doc:`../pytorch-training` for the latest version.
PyTorch is an open-source machine learning framework that is widely used for
model training with GPU-optimized components for transformer-based models.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.8-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__
(``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate
training workloads:
.. list-table::
:header-rows: 1
* - Software component
- Version
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
.. _amd-pytorch-training-model-support:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.8-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-4 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if models|length % 3 == 0 %}
<div class="col-4 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% else %}
<div class="col-6 px-2 model-param" data-param-k="model" data-param-v="{{ model.mad_tag }}" data-param-group="{{ model_group.tag }}" tabindex="0">{{ model.model }}</div>
{% endif %}
{% endfor %}
{% endfor %}
</div>
</div>
</div>
.. _amd-pytorch-training-supported-training-modes:
The following table lists supported training modes per model.
.. dropdown:: Supported training modes
.. list-table::
:header-rows: 1
* - Model
- Supported training modes
{% for model_group in model_groups %}
{% set models = model_group.models %}
{% for model in models %}
{% if model.training_modes %}
* - {{ model.model }}
- ``{{ model.training_modes | join('``, ``') }}``
{% endif %}
{% endfor %}
{% endfor %}
.. note::
Some model and fine-tuning combinations are not listed. This is
because the `upstream torchtune repository <https://github.com/pytorch/torchtune>`__
doesn't provide default YAML configurations for them.
For advanced usage, you can create a custom configuration to enable
unlisted fine-tuning methods by using an existing file in the
``/workspace/torchtune/recipes/configs`` directory as a template.
.. _amd-pytorch-training-performance-measurements:
Performance measurements
========================
To evaluate performance, the
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
page provides reference throughput and latency measurements for training
popular AI models.
.. note::
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
System validation
=================
Before running AI workloads, it's important to validate that your AMD hardware is configured
correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you
can skip this step. Otherwise, complete the procedures in the :ref:`System validation and
optimization <rocm-for-ai-system-optimization>` guide to properly configure your system settings
before starting training.
To test for optimal performance, consult the recommended :ref:`System health benchmarks
<rocm-for-ai-system-health-bench>`. This suite of tests will help you verify and fine-tune your
system's configuration.
This Docker image is optimized for specific model configurations outlined
below. Performance can vary for other training workloads, as AMD
doesn't test configurations and run conditions outside those described.
Run training
============
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.8-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set model_groups = data.model_groups %}
Once the setup is complete, choose between two options to start benchmarking training:
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
2. For example, use this command to run the performance benchmark test on the {{ model.model }} model
using one node with the {{ model.precision }} data type on the host machine.
.. code-block:: shell
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
--tags {{ model.mad_tag }} \
--keep-model-dir \
--live-output \
--timeout 28800
MAD launches a Docker container with the name
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
{% endfor %}
{% endfor %}
.. tab-item:: Standalone benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following commands are tailored to {{ model.model }}.
See :ref:`amd-pytorch-training-model-support` to switch to another available model.
{% endfor %}
{% endfor %}
.. rubric:: Download the Docker image and required packages
1. Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
docker pull {{ unified_docker.pull_tag }}
2. Run the Docker container.
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ unified_docker.pull_tag }}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
docker start training_env
docker exec -it training_env bash
3. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
repository and navigate to the benchmark scripts directory
``/workspace/MAD/scripts/pytorch_train``.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
.. rubric:: Prepare training datasets and dependencies
1. The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
2. Run the setup script to install libraries and datasets needed for benchmarking.
.. code-block:: shell
./pytorch_benchmark_setup.sh
.. container:: model-doc pyt_train_llama-3.1-8b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 8B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
.. container:: model-doc pyt_train_llama-3.1-70b
``pytorch_benchmark_setup.sh`` installs the following libraries for Llama 3.1 70B:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`_ 3.2.0
* - ``torchdata``
- `TorchData <https://meta-pytorch.org/data/beta/index.html#torchdata>`__
* - ``tomli``
- `Tomli <https://pypi.org/project/tomli/>`__
* - ``tiktoken``
- `tiktoken <https://github.com/openai/tiktoken>`__
* - ``blobfile``
- `blobfile <https://pypi.org/project/blobfile/>`__
* - ``tabulate``
- `tabulate <https://pypi.org/project/tabulate/>`__
* - ``wandb``
- `Weights & Biases <https://github.com/wandb/wandb>`__
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
.. container:: model-doc pyt_train_flux
``pytorch_benchmark_setup.sh`` installs the following libraries for FLUX:
.. list-table::
:header-rows: 1
* - Library
- Reference
* - ``accelerate``
- `Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_
* - ``datasets``
- `Hugging Face Datasets <https://huggingface.co/docs/datasets/v3.2.0/en/index>`__ 3.2.0
* - ``sentencepiece``
- `SentencePiece <https://github.com/google/sentencepiece>`__ 0.2.0
* - ``tensorboard``
- `TensorBoard <https://www.tensorflow.org/tensorboard>`__ 2.18.0
* - ``csvkit``
- `csvkit <https://csvkit.readthedocs.io/en/latest/>`__ 2.0.1
* - ``deepspeed``
- `DeepSpeed <https://github.com/deepspeedai/DeepSpeed>`__ 0.16.2
* - ``diffusers``
- `Hugging Face Diffusers <https://huggingface.co/docs/diffusers/en/index>`__ 0.31.0
* - ``GitPython``
- `GitPython <https://github.com/gitpython-developers/GitPython>`__ 3.1.44
* - ``opencv-python-headless``
- `opencv-python-headless <https://pypi.org/project/opencv-python-headless/>`__ 4.10.0.84
* - ``peft``
- `PEFT <https://huggingface.co/docs/peft/en/index>`__ 0.14.0
* - ``protobuf``
- `Protocol Buffers <https://github.com/protocolbuffers/protobuf>`__ 5.29.2
* - ``pytest``
- `PyTest <https://docs.pytest.org/en/stable/>`__ 8.3.4
* - ``python-dotenv``
- `python-dotenv <https://pypi.org/project/python-dotenv/>`__ 1.0.1
* - ``seaborn``
- `Seaborn <https://seaborn.pydata.org/>`__ 0.13.2
* - ``transformers``
- `Transformers <https://huggingface.co/docs/transformers/en/index>`__ 4.47.0
``pytorch_benchmark_setup.sh`` downloads the following datasets from Hugging Face:
* `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`__
{% for model_group in model_groups %}
{% for model in model_group.models %}
{% set training_modes = model.training_modes %}
{% set training_mode_descs = {
"pretrain": "Benchmark pre-training.",
"HF_pretrain": "Llama 3.1 8B pre-training with FP8 precision."
} %}
{% set available_modes = training_modes | select("in", ["pretrain", "HF_pretrain"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Pre-training
To start the pre-training benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
.. code-block:: shell
./pytorch_benchmark_report.sh -t {% if available_modes | length == 1 %}{{ available_modes[0] }}{% else %}$training_mode{% endif %} \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
{% if model.mad_tag == "pyt_train_flux" %}
.. container:: model-doc {{ model.mad_tag }}
.. note::
Currently, FLUX models are not supported out of the box on {{ unified_docker.pull_tag }}.
To use FLUX, refer to the ``rocm/pytorch-training`` Docker documentation: :doc:`pytorch-training-v25.6`.
Occasionally, downloading the FLUX dataset might fail. If this happens,
manually download it from Hugging Face at
`black-forest-labs/FLUX.1-dev <https://huggingface.co/black-forest-labs/FLUX.1-dev>`_
and save it to ``/workspace/FluxBenchmark``. This ensures that the test script can access
the required dataset.
{% endif %}
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Between 2048 and 8192. 8192 by default.
- Sequence length for the language model.
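The documented range for ``$sequence_length`` can be checked before launching a long benchmark run. The following sketch assumes the 2048 to 8192 range listed above. ``check_seq_len`` is a hypothetical helper and not part of the benchmark scripts.

.. code-block:: shell

   # Hypothetical helper (not part of MAD): accept only whole numbers
   # within the pre-training range documented in the table above.
   check_seq_len() {
       seq_len="$1"
       case "$seq_len" in
           ''|*[!0-9]*)
               echo "Not a whole number: $seq_len" >&2
               return 1
               ;;
       esac
       if [ "$seq_len" -lt 2048 ] || [ "$seq_len" -gt 8192 ]; then
           echo "Sequence length $seq_len is outside 2048-8192" >&2
           return 1
       fi
   }

For example, ``check_seq_len 4096`` succeeds, while ``check_seq_len 16384`` reports an error and returns a non-zero status.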
{% endif %}
{% set training_mode_descs = {
"finetune_fw": "Full weight fine-tuning (BF16 and FP8 supported).",
"finetune_lora": "LoRA fine-tuning (BF16 supported).",
"finetune_qlora": "QLoRA fine-tuning (BF16 supported).",
"HF_finetune_lora": "LoRA fine-tuning with Hugging Face PEFT.",
} %}
{% set available_modes = training_modes | select("in", ["finetune_fw", "finetune_lora", "finetune_qlora", "HF_finetune_lora"]) | list %}
{% if available_modes %}
.. container:: model-doc {{ model.mad_tag }}
.. rubric:: Fine-tuning
To start the fine-tuning benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
See :ref:`supported training modes <amd-pytorch-training-supported-training-modes>`.
.. code-block:: shell
./pytorch_benchmark_report.sh -t $training_mode \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if "finetune_fw" in available_modes %} or ``FP8``{% endif %}
- All models support BF16.{% if "finetune_fw" in available_modes %} FP8 is only available for full weight fine-tuning.{% endif %}
* - ``$sequence_length``
- Between 2048 and 16384.
- Sequence length for the language model.
{% if model.mad_tag in ["pyt_train_llama3.2-vision-11b", "pyt_train_llama-3.2-vision-90b"] %}
.. note::
For LoRA and QLoRA support with vision models (Llama 3.2 11B and 90B),
use the following torchtune commit for compatibility:
.. code-block:: shell
git checkout 48192e23188b1fc524dd6d127725ceb2348e7f0e
{% elif model.mad_tag in ["pyt_train_llama-2-7b", "pyt_train_llama-2-13b", "pyt_train_llama-2-70b"] %}
.. note::
You might encounter the following error with Llama 2: ``ValueError: seq_len (16384) of
input tensor should be smaller than max_seq_len (4096)``.
This error indicates that an input sequence is longer than the model's maximum context window.
Ensure your tokenized input does not exceed the model's ``max_seq_len`` (4096
tokens in this case). You can resolve this by truncating the input or splitting
it into smaller chunks before passing it to the model.
Note on reproducibility: The results in this guide are based on
commit ``b4c98ac`` from the upstream
`<https://github.com/pytorch/torchtune>`__ repository. For the
latest updates, you can use the main branch.
{% endif %}
{% endif %}
{% endfor %}
{% endfor %}
.. rubric:: Benchmarking examples
For examples of benchmarking commands, see `<https://github.com/ROCm/MAD/tree/develop/benchmark/pytorch_train#benchmarking-examples>`__.
.. _amd-pytorch-training-multinode-examples:
Multi-node training
-------------------
Refer to :doc:`/how-to/rocm-for-ai/system-setup/multi-node-setup` to configure your environment for multi-node
training. See :ref:`rocm-for-ai-multi-node-setup-pyt-train-example` for example Slurm run commands.
Pre-training
~~~~~~~~~~~~
Multi-node training with torchtitan is supported. The provided Slurm script is pre-configured for Llama 3 70B.
To launch the training job on a Slurm cluster for Llama 3 70B, run the following commands from the MAD repository.
.. code-block:: shell
# In the MAD repository
cd scripts/pytorch_train
sbatch run_slurm_train.sh
Fine-tuning
~~~~~~~~~~~
Multi-node training with torchtune is supported. The provided Slurm script is pre-configured for Llama 3.3 70B.
To launch the training job on a Slurm cluster for Llama 3.3 70B, run the following commands from the MAD repository.
.. code-block:: shell
huggingface-cli login # Get access to HF Llama model space
huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --local-dir ./models/Llama-3.3-70B-Instruct # Download the Llama 3.3 model locally
# In the MAD repository
cd scripts/pytorch_train
sbatch Torchtune_Multinode.sh
.. note::
Information regarding benchmark setup:
* By default, Llama 3.3 70B is fine-tuned using ``alpaca_dataset``.
* You can adjust the torchtune `YAML configuration file
<https://github.com/pytorch/torchtune/blob/main/recipes/configs/llama3_3/70B_full_multinode.yaml>`__
if you're using a different model.
* The number of nodes and other parameters can be tuned in the SLURM script ``Torchtune_Multinode.sh``.
* Set the ``mounting_paths`` inside the SLURM script.
Once the run is finished, you can find the log files in the ``result_torchtune/`` directory.
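To check on a finished run, you can look up the newest file in the log directory. The following is a minimal sketch: the ``latest_log`` helper is hypothetical and not part of MAD or torchtune, and it assumes nothing about log file names beyond the ``result_torchtune/`` directory mentioned above.

```shell
# Hypothetical helper: print the most recently modified file in a log directory.
# The directory name result_torchtune/ comes from this guide; the file names
# inside it are not documented here, so nothing is assumed about them.
latest_log() {
    local dir="$1"
    local newest
    # ls -t sorts by modification time, newest first.
    newest="$(ls -t "$dir" 2>/dev/null | head -n 1)"
    if [ -z "$newest" ]; then
        echo "no log files found in $dir" >&2
        return 1
    fi
    printf '%s/%s\n' "$dir" "$newest"
}

# Example usage (run from the directory that contains result_torchtune/):
# tail -n 50 "$(latest_log result_torchtune)"
```

The helper returns a non-zero status when the directory is empty or missing, so it can be chained safely in a larger script.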
Further reading
===============
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.
Previous versions
=================
See :doc:`pytorch-training-history` to find documentation for previous releases
of the ``ROCm/pytorch-training`` Docker image.


@@ -13,30 +13,42 @@ Primus now supports the PyTorch torchtitan backend.
.. note::
Primus with the PyTorch torchtitan backend is designed to replace the :doc:`ROCm PyTorch training <pytorch-training>` workflow.
See :doc:`pytorch-training` for steps to run workloads without Primus.
For a unified training solution on AMD GPUs with ROCm, the `rocm/pytorch-training
<https://hub.docker.com/r/rocm/pytorch-training/>`__ Docker Hub registry will be
deprecated soon in favor of `rocm/primus <https://hub.docker.com/r/rocm/primus>`__.
The ``rocm/primus`` Docker containers will cover PyTorch training ecosystem frameworks,
including `Megatron-LM <primus-megatron>`__, torchtitan, and torchtune.
Primus with the PyTorch torchtitan backend is designed to replace the
:doc:`ROCm PyTorch training <pytorch-training>` workflow. See
:doc:`pytorch-training` for steps to run workloads without Primus.
AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and
MI300X GPUs containing essential components for Primus and PyTorch training
with Primus Turbo optimizations.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
{% set dockers = data.dockers %}
{% set docker = dockers[0] %}
For ease of use, AMD provides a ready-to-use Docker image -- ``{{
docker.pull_tag }}`` -- for MI300X series GPUs containing essential
components for Primus and PyTorch training with
Primus Turbo optimizations.
.. tab-set::
.. list-table::
:header-rows: 1
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
* - Software component
- Version
.. list-table::
:header-rows: 1
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
* - Software component
- Version
.. _amd-primus-pytorch-model-support-v258:
{% for component_name, component_version in docker.components.items() %}
* - {{ component_name }}
- {{ component_version }}
{% endfor %}
{% endfor %}
.. _amd-primus-pytorch-model-support-v259:
Supported models
================
@@ -47,22 +59,21 @@ vary by model -- select one to get started.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set model_groups = data.model_groups %}
.. raw:: html
<div id="vllm-benchmark-ud-params-picker" class="container-fluid">
<div class="row gx-0" style="display: none;">
<div class="row gx-0">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
<div class="col-3 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
<div class="col-12 px-2 model-param" data-param-k="model-group" data-param-v="{{ model_group.tag }}" tabindex="0">{{ model_group.group }}</div>
{% endfor %}
</div>
</div>
<div class="row gx-0 pt-1">
<div class="col-2 me-1 px-2 model-param-head">Model</div>
<div class="col-2 me-1 px-2 model-param-head">Variant</div>
<div class="row col-10 pe-0">
{% for model_group in model_groups %}
{% set models = model_group.models %}
@@ -83,7 +94,7 @@ vary by model -- select one to get started.
For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models,
see :doc:`pytorch-training` (without Primus).
.. _amd-primus-pytorch-performance-measurements-v258:
.. _amd-primus-pytorch-performance-measurements-v259:
System validation
=================
@@ -109,25 +120,34 @@ Pull the Docker image
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set dockers = data.dockers %}
Use the following command to pull the `Docker image <{{ unified_docker.docker_hub_url }}>`_ from Docker Hub.
Use the following command to pull the Docker image from Docker Hub.
.. code-block:: shell
.. tab-set::
docker pull {{ unified_docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
Run training
============
Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus).
For best performance on MI325X, MI350X, and MI355X GPUs, you might need to
tweak some configurations (such as batch sizes).
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/primus-pytorch-benchmark-models.yaml
{% set unified_docker = data.dockers[0] %}
{% set dockers = data.dockers %}
{% set model_groups = data.model_groups %}
Once the setup is complete, choose between the following two workflows to start benchmarking training.
For fine-tuning workloads and multi-node training examples, see :doc:`pytorch-training` (without Primus).
.. tab-set::
.. tab-item:: MAD-integrated benchmarking
@@ -138,7 +158,7 @@ Run training
.. container:: model-doc {{ model.mad_tag }}
The following run command is tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model.
See :ref:`amd-primus-pytorch-model-support-v259` to switch to another available model.
1. Clone the ROCm Model Automation and Dashboarding (`<https://github.com/ROCm/MAD>`__) repository to a local
directory and install the required packages on the host machine.
@@ -165,10 +185,17 @@ Run training
``container_ci-{{ model.mad_tag }}``. The latency and throughput reports of the
model are collected in ``~/MAD/perf.csv``.
.. note::
Currently, Primus torchtitan models are run with Primus Turbo
enabled for enhanced performance. To disable Primus Turbo,
modify the respective configuration file
``scripts/primus/pytorch_train/primus_torchtitan_scripts/llama3_[8B|70B]-[BF16|FP8].yaml``.
{% endfor %}
{% endfor %}
.. tab-item:: Standalone benchmarking
.. tab-item:: Primus benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
@@ -176,34 +203,48 @@ Run training
.. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v258` to switch to another available model.
See :ref:`amd-primus-pytorch-model-support-v259` to switch to another available model.
.. rubric:: Download the Docker image and required packages
1. Use the following command to pull the Docker image from Docker Hub.
1. Pull the appropriate Docker image for your AMD GPU architecture from Docker Hub.
.. code-block:: shell
.. tab-set::
docker pull {{ unified_docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
2. Run the Docker container.
.. code-block:: shell
.. tab-set::
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ unified_docker.pull_tag }}
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
{% endfor %}
Use these commands if you exit the ``training_env`` container and need to return to it.
@@ -212,16 +253,249 @@ Run training
docker start training_env
docker exec -it training_env bash
3. In the Docker container, clone the `<https://github.com/ROCm/MAD>`__
repository and navigate to the benchmark scripts directory
``/workspace/MAD/scripts/pytorch_train``.
.. rubric:: Prepare training datasets and dependencies
The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
``HF_TOKEN``.
.. code-block:: shell
export HF_TOKEN=$your_personal_hugging_face_access_token
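Because a missing token typically surfaces only when a gated download fails, it can help to check for it up front. The following is a minimal sketch; ``require_hf_token`` is a hypothetical helper, not part of Primus or MAD, and it only checks that the variable is non-empty, not that the token is valid.

```shell
# Hypothetical pre-flight check: fail early if HF_TOKEN is missing.
# This does not contact Hugging Face, so it cannot detect an invalid token.
require_hf_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated Hugging Face repos will be inaccessible." >&2
        return 1
    fi
    echo "HF_TOKEN is set."
}

# Example usage: only continue when the check passes.
# require_hf_token && echo "ready to download models and datasets"
```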
.. rubric:: Pretraining
To get started, navigate to the ``Primus`` directory in your container.
.. code-block:: shell
cd /workspace/Primus
Now, to start the pretraining benchmark, use the ``run_pretrain.sh`` script
included with Primus with the appropriate options.
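Since the commands in this guide differ mainly in the configuration file and batch size, one way to keep them consistent is to assemble the command from those two values. The following is a sketch only: ``primus_pretrain_cmd`` is a hypothetical helper, while the script path and flags are copied from the examples in this guide. It prints the command instead of running it, so you can review it first.

```shell
# Hypothetical helper: print (not run) a Primus pretraining command.
# The script path and flags are copied from the examples in this guide;
# the function name and its arguments are illustrative only.
primus_pretrain_cmd() {
    local config="$1"       # path to a config file under examples/torchtitan/configs/
    local batch_size="$2"   # value for --training.batch_size; tune for your GPU
    printf 'EXP=%s bash examples/run_pretrain.sh --metrics.enable_tensorboard false --profiling.enable_profiling false --training.batch_size %s\n' \
        "$config" "$batch_size"
}

# Print the command for Llama 3.1 8B with BF16 precision and a batch size of 4.
primus_pretrain_cmd examples/torchtitan/configs/llama3.1_8B-BF16-pretrain.yaml 4
```

After reviewing the printed command, run it from the ``/workspace/Primus`` directory.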
.. rubric:: Benchmarking examples
.. container:: model-doc primus_pyt_train_llama-3.1-8b
Use the following command to train Llama 3.1 8B with BF16 precision using Primus torchtitan.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 5
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 6
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 4
To train Llama 3.1 8B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 8
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 7
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 5
.. container:: model-doc primus_pyt_train_llama-3.1-70b
Use the following command to train Llama 3.1 70B with BF16 precision using Primus torchtitan.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 8
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 6
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-BF16-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 4
To train Llama 3.1 70B with FP8 precision, use the following command.
.. tab-set::
.. tab-item:: MI355X and MI350X
:sync: MI355X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 6
.. tab-item:: MI325X
:sync: MI325X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 5
.. tab-item:: MI300X
:sync: MI325X and MI300X
.. code-block:: shell
EXP=examples/torchtitan/configs/llama3.1_70B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh \
--metrics.enable_tensorboard false \
--profiling.enable_profiling false \
--training.batch_size 3
{% endfor %}
{% endfor %}
.. tab-item:: Standalone torchtitan benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
.. container:: model-doc {{ model.mad_tag }}
The following run commands are tailored to {{ model.model }}.
See :ref:`amd-primus-pytorch-model-support-v259` to switch to another available model.
.. rubric:: Download the Docker image and required packages
1. Pull the appropriate Docker image for your AMD GPU architecture from Docker Hub.
.. tab-set::
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker pull {{ docker.pull_tag }}
{% endfor %}
2. Run the Docker container.
.. tab-set::
{% for supported_gpus, docker in dockers.items() %}
.. tab-item:: {{ supported_gpus }}
:sync: {{ supported_gpus }}
.. code-block:: shell
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--network host \
--ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
-v $HOME/.ssh:/root/.ssh \
--shm-size 64G \
--name training_env \
{{ docker.pull_tag }}
{% endfor %}
Use these commands if you exit the ``training_env`` container and need to return to it.
.. code-block:: shell
git clone https://github.com/ROCm/MAD
cd MAD/scripts/pytorch_train
docker start training_env
docker exec -it training_env bash
.. rubric:: Prepare training datasets and dependencies
3. Navigate to the ``torchtitan`` workspace directory.
.. code-block:: shell
cd /workspace/torchtitan
.. rubric:: Download the tokenizer
1. The following benchmarking examples require downloading models and datasets
from Hugging Face. To ensure successful access to gated repos, set your
@@ -231,62 +505,47 @@ Run training
export HF_TOKEN=$your_personal_hugging_face_access_token
2. Run the setup script to install libraries and datasets needed for benchmarking.
2. Download the tokenizer for your model.
.. container:: model-doc {{ model.mad_tag }}
.. code-block:: shell
python3 scripts/download_tokenizer.py \
--repo_id {{ model.model_repo }} \
--tokenizer_path "original" \
--hf_token=${HF_TOKEN}
.. rubric:: Pretraining examples
Run the training script with the appropriate configuration file.
To train with BF16 precision, use the following command:
.. container:: model-doc {{ model.mad_tag }}
.. code-block:: shell
./pytorch_benchmark_setup.sh
CONFIG_FILE={{ model.config_file.bf16 }} \
./run_train.sh
.. rubric:: Pretraining
To train with FP8 precision, use the following command:
To start the pretraining benchmark, use the following command with the
appropriate options. See the following list of options and their descriptions.
.. container:: model-doc {{ model.mad_tag }}
.. code-block:: shell
.. code-block:: shell
./pytorch_benchmark_report.sh -t pretrain \
-m {{ model.model_repo }} \
-p $datatype \
-s $sequence_length
.. list-table::
:header-rows: 1
* - Name
- Options
- Description
{% for mode in available_modes %}
* - {% if loop.first %}``$training_mode``{% endif %}
- ``{{ mode }}``
- {{ training_mode_descs[mode] }}
{% endfor %}
* - ``$datatype``
- ``BF16``{% if model.mad_tag == "primus_pyt_train_llama-3.1-8b" %} or ``FP8``{% endif %}
- Currently, only Llama 3.1 8B supports FP8 precision.
* - ``$sequence_length``
- Sequence length for the language model.
- Between 2048 and 8192. 8192 by default.
.. rubric:: Benchmarking examples
Use the following command to train {{ model.model }} with BF16 precision using Primus torchtitan.
.. code-block:: shell
./pytorch_benchmark_report.sh -m {{ model.model_repo }}
To train {{ model.model }} with FP8 precision, use the following command.
.. code-block:: shell
./pytorch_benchmark_report.sh -m {{ model.model_repo }} -p FP8
CONFIG_FILE={{ model.config_file.fp8 }} \
./run_train.sh
{% endfor %}
{% endfor %}
Known issues
============
PyTorch Profiler may produce inaccurate traces when CPU activity profiling is enabled.
Further reading
===============
@@ -296,7 +555,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

Some files were not shown because too many files have changed in this diff