Compare commits

29 Commits

Author SHA1 Message Date
Istvan Kiss
bb84806ffe Add JAX Plugin-PJRT support table 2025-11-17 12:08:22 +01:00
Pratik Basyal
52f6f0cbc4 Release date for PLDM updated (#5627) 2025-11-04 18:09:52 -05:00
Pratik Basyal
f71614de00 7.0.2 Broken link, version and known issue update (#5591) (#5592)
* Version and known issue update

* Historical compatibility updated
2025-10-28 15:20:13 -04:00
Istvan Kiss
1b2d5924f2 Remove CentOS Stream mention from PyTorch release notes. (#5583) 2025-10-28 14:14:15 +01:00
peterjunpark
e3704ad70e Revert "Add xdit diffusion docs (#5576) (#5578)" (#5579)
This reverts commit a38b2865f0.
2025-10-27 16:21:10 -04:00
peterjunpark
a38b2865f0 Add xdit diffusion docs (#5576) (#5578)
(cherry picked from commit 4132a2609c)

Co-authored-by: Kristoffer <kristoffer.torp@amd.com>
2025-10-27 15:41:29 -04:00
peterjunpark
dfdff755ef Fix broken links under rocm-for-ai/ (#5564) (#5565)
(cherry picked from commit 35ca027aa4)
2025-10-23 15:18:08 -04:00
peterjunpark
8d2d5abdae add xref to vllm v1 optimization guide in workload.rst (#5560) (#5561)
(cherry picked from commit 90c1d9068f)
2025-10-23 11:51:55 -04:00
peterjunpark
b30b8b43e0 Updates to the vLLM optimization guide for MI300X/MI355X (#5554)
* Expand vLLM optimization guide for MI300X/MI355X with comprehensive AITER coverage. attention backend selection, environment variables (HIP/RCCL/Quick Reduce), parallelism strategies, quantization (FP8/FP4), engine tuning, CUDA graph modes, and multi-node scaling.

Co-authored-by: PinSiang <pinsiang.tan@embeddedllm.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: pinsiangamd <pinsiang.tan@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
(cherry picked from commit cb8d21a0df)
2025-10-22 13:01:57 -04:00
Pratik Basyal
47421ef7cc PLDM version update for MI350 series [Develop] (#5547) (#5548)
* PLDM version update for MI350 series

* Minor update
2025-10-20 14:42:33 -04:00
Pratik Basyal
a4ac854834 PLDM update for MI250 and MI210 [Develop] (#5537) (#5538)
* PLDM update for MI250 and MI210

* PLDM update
2025-10-17 17:32:57 -04:00
peterjunpark
79acda6775 JAX Maxtext v25.9 doc update (#5532) (#5533)
* archive previous version (25.7)

* update docker components list for 25.9

* update template

* update docker pull tag

* update

* fix intro

(cherry picked from commit a613bd6824)
2025-10-17 11:54:39 -04:00
peterjunpark
811fa5c87a Update Megatron/PyTorch Primus 25.9 docs (#5528) (#5529)
* add previous versions

* Fix heading levels in pages using embedded templates (#5468)

* update primus-megatron doc

update megatron-lm doc

update templates

fix tab

update primus-megatron model configs

Update primus-pytorch model configs

fix css class

add posttrain to pytorch-training template

update data sheets

update

update

update

update docker tags

* Add known issue and update Primus/Turbo versions

* add primus ver to histories

* update primus ver to 0.1.1

* fix leftovers from merge conflict

(cherry picked from commit 14bb59fca9)
2025-10-16 13:27:40 -04:00
Pratik Basyal
6796381dd5 GitHub issue added to 702 known issues (#5520) (#5521)
* GitHub issue added to 702 known issues

* Added missing RCCL changelog
2025-10-15 10:03:07 -04:00
Pratik Basyal
0ada3a8fef ROCm for HPC topic updated Develop (#5504) (#5505)
* ROCm for HPC topic updated

* ROCm for HPC topic udpated

* Minor editorial
2025-10-10 22:39:31 -04:00
Pratik Basyal
5c0b5d08da Post GA 702 footnote (#5503)
* Footnote updated

* Merge conflict updated
2025-10-10 21:35:33 -04:00
peterjunpark
c9dbb72806 Merge pull request #5497 from ROCm/develop
[docs/7.0.2] Fix documented AMD SMI version (ROCm 7.0.2) (#5496)
2025-10-10 15:10:39 -04:00
Istvan Kiss
02ac6124de Update PyTorch compatibilty page version number (#5494)
* Update PyTorch compatibilty page version number

* Apply suggestions from code review

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

---------

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2025-10-10 19:11:16 +02:00
Istvan Kiss
c0e317de22 Fix broken environment variable links in release notes (#5493) 2025-10-10 18:23:00 +02:00
Alex Xu
92f950e6ff Merge branch 'roc-7.0.x' into docs/7.0.2 2025-10-10 11:55:35 -04:00
Alex Xu
07bf4643bc Merge branch 'develop' into roc-7.0.x 2025-10-10 11:55:18 -04:00
Alex Xu
4786abd1ac Merge branch 'roc-7.0.x' into docs/7.0.2 2025-10-10 11:49:20 -04:00
Alex Xu
0ba8fdc9fd Merge branch 'develop' into roc-7.0.x 2025-10-10 11:48:32 -04:00
alexxu-amd
d5f13a9290 Sync develop into docs/7.0.2 2025-10-10 09:26:44 -04:00
alexxu-amd
8088f74594 Sync develop into docs/7.0.2 2025-10-09 16:44:29 -04:00
Parag Bhandari
dd95373e54 Merge branch 'develop' into roc-7.0.x 2025-09-17 16:04:13 -04:00
Parag Bhandari
a471b4fc9b Merge branch 'develop' into roc-7.0.x 2025-09-17 15:22:07 -04:00
Parag Bhandari
ec7c3b02e2 Merge branch 'develop' into roc-7.0.x 2025-09-17 14:33:12 -04:00
JeniferC99
fdc822ac3d Merge pull request #5366 from ROCm/JeniferC99/pr-7.0.1-p2
7.0.1 GA update
2025-09-17 13:18:12 -04:00
118 changed files with 723 additions and 1038 deletions

View File

@@ -37,7 +37,6 @@ parameters:
- libdrm-dev
- libelf-dev
- libnuma-dev
- libsimde-dev
- ninja-build
- pkg-config
- name: rocmDependencies

View File

@@ -130,7 +130,7 @@ jobs:
parameters:
componentName: hipTensor
testDir: '$(Agent.BuildDirectory)/rocm/bin/hiptensor'
testParameters: '-E ".*-extended" --extra-verbose --output-on-failure --force-new-ctest-process --output-junit test_output.xml'
testParameters: '-E ".*-extended" --output-on-failure --force-new-ctest-process --output-junit test_output.xml'
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/docker-container.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}

View File

@@ -70,7 +70,7 @@ parameters:
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: rccl_build_${{ job.target }}
timeoutInMinutes: 120
timeoutInMinutes: 90
variables:
- group: common
- template: /.azuredevops/variables-global.yml

View File

@@ -210,7 +210,7 @@ jobs:
parameters:
componentName: ${{ parameters.componentName }}
testDir: '$(Agent.BuildDirectory)/rocm/bin/rocprim'
extraTestParameters: '-I ${{ job.shard }},,${{ job.shardCount }}'
extraTestParameters: '-I ${{ job.shard }},,${{ job.shardCount }} -E device_merge_inplace'
os: ${{ job.os }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/docker-container.yml
parameters:
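
For context on the CTest options appearing in this hunk: `-I <start>,,<stride>` runs every stride-th test beginning at test number start, so each pipeline shard executes a disjoint slice of the suite, and `-E <regex>` excludes tests whose names match the pattern (here `device_merge_inplace`). Below is a minimal Python sketch of how such a sharded invocation could be assembled; the shard numbers, working directory, and the exact way the template combines its default and extra parameters are assumptions for illustration, not taken from the pipeline:

```python
import subprocess

# Illustrative values only: shard 2 of 8, mirroring
# extraTestParameters: '-I ${{ job.shard }},,${{ job.shardCount }} -E device_merge_inplace'
shard, shard_count = 2, 8

cmd = [
    "ctest",
    "-I", f"{shard},,{shard_count}",       # run every shard_count-th test, starting at test `shard`
    "-E", "device_merge_inplace",          # exclude tests whose names match this regex
    "--output-on-failure",                 # print output only for failing tests
    "--force-new-ctest-process",           # run each test in a fresh ctest process
    "--output-junit", "test_output.xml",   # JUnit-style report for the CI system
]
# cwd is a hypothetical test directory standing in for $(Agent.BuildDirectory)/rocm/bin/rocprim.
subprocess.run(cmd, check=True, cwd="/path/to/rocm/bin/rocprim")
```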

View File

@@ -81,7 +81,7 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/test.yml
parameters:
componentName: rocm-cmake
testParameters: '-E "pass-version-parent" --extra-verbose --output-on-failure --force-new-ctest-process --output-junit test_output.xml'
testParameters: '-E "pass-version-parent" --output-on-failure --force-new-ctest-process --output-junit test_output.xml'
os: ${{ job.os }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/manifest.yml
parameters:

View File

@@ -14,17 +14,9 @@ parameters:
type: object
default:
- cmake
- libdw-dev
- libglfw3-dev
- libmsgpack-dev
- libomp-dev
- libopencv-dev
- libtbb-dev
- libtiff-dev
- libva-amdgpu-dev
- libavcodec-dev
- libavformat-dev
- libavutil-dev
- ninja-build
- python3-pip
- name: rocmDependencies
@@ -43,10 +35,7 @@ parameters:
- hipSPARSE
- hipTensor
- llvm-project
- MIOpen
- MIVisionX
- rocBLAS
- rocDecode
- rocFFT
- rocJPEG
- rocPRIM
@@ -58,7 +47,6 @@ parameters:
- rocSPARSE
- rocThrust
- rocWMMA
- rpp
- name: rocmTestDependencies
type: object
default:
@@ -75,10 +63,7 @@ parameters:
- hipSPARSE
- hipTensor
- llvm-project
- MIOpen
- MIVisionX
- rocBLAS
- rocDecode
- rocFFT
- rocminfo
- rocPRIM
@@ -92,7 +77,6 @@ parameters:
- rocThrust
- roctracer
- rocWMMA
- rpp
- name: jobMatrix
type: object
@@ -121,7 +105,6 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
registerROCmPackages: true
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
parameters:
cmakeVersion: '3.25.0'
@@ -186,7 +169,6 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
registerROCmPackages: true
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
parameters:
cmakeVersion: '3.25.0'

View File

@@ -43,14 +43,9 @@ parameters:
- ninja-build
- python3-pip
- python3-venv
- googletest
- libgtest-dev
- libgmock-dev
- libboost-filesystem-dev
- name: pipModules
type: object
default:
- msgpack
- joblib
- "packaging>=22.0"
- pytest
@@ -152,13 +147,6 @@ jobs:
echo "##vso[task.prependpath]$USER_BASE/bin"
echo "##vso[task.setvariable variable=PytestCmakePath]$USER_BASE/share/Pytest/cmake"
displayName: Set cmake configure paths
- task: Bash@3
displayName: Add ROCm binaries to PATH
inputs:
targetType: inline
script: |
echo "##vso[task.prependpath]$(Agent.BuildDirectory)/rocm/bin"
echo "##vso[task.prependpath]$(Agent.BuildDirectory)/rocm/llvm/bin"
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}

View File

@@ -1,63 +0,0 @@
parameters:
- name: checkoutRepo
type: string
default: 'self'
- name: checkoutRef
type: string
default: ''
- name: cli11Version
type: string
default: ''
- name: aptPackages
type: object
default:
- cmake
- git
- ninja-build
- name: jobMatrix
type: object
default:
buildJobs:
- { os: ubuntu2204, packageManager: apt}
- { os: almalinux8, packageManager: dnf}
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: cli11_${{ job.os }}
variables:
- group: common
- template: /.azuredevops/variables-global.yml
pool:
vmImage: 'ubuntu-22.04'
${{ if eq(job.os, 'almalinux8') }}:
container:
image: rocmexternalcicd.azurecr.io/manylinux228:latest
endpoint: ContainerService3
workspace:
clean: all
steps:
- checkout: none
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- task: Bash@3
displayName: Clone cli11 ${{ parameters.cli11Version }}
inputs:
targetType: inline
script: git clone https://github.com/CLIUtils/CLI11.git -b ${{ parameters.cli11Version }}
workingDirectory: $(Agent.BuildDirectory)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}
cmakeBuildDir: $(Agent.BuildDirectory)/CLI11/build
cmakeSourceDir: $(Agent.BuildDirectory)/CLI11
useAmdclang: false
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
parameters:
os: ${{ job.os }}

View File

@@ -1,66 +0,0 @@
parameters:
- name: checkoutRepo
type: string
default: 'self'
- name: checkoutRef
type: string
default: ''
- name: yamlcppVersion
type: string
default: ''
- name: aptPackages
type: object
default:
- cmake
- git
- ninja-build
- name: jobMatrix
type: object
default:
buildJobs:
- { os: ubuntu2204, packageManager: apt}
- { os: almalinux8, packageManager: dnf}
jobs:
- ${{ each job in parameters.jobMatrix.buildJobs }}:
- job: yamlcpp_${{ job.os }}
variables:
- group: common
- template: /.azuredevops/variables-global.yml
pool:
vmImage: 'ubuntu-22.04'
${{ if eq(job.os, 'almalinux8') }}:
container:
image: rocmexternalcicd.azurecr.io/manylinux228:latest
endpoint: ContainerService3
workspace:
clean: all
steps:
- checkout: none
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- task: Bash@3
displayName: Clone yaml-cpp ${{ parameters.yamlcppVersion }}
inputs:
targetType: inline
script: git clone https://github.com/jbeder/yaml-cpp.git -b ${{ parameters.yamlcppVersion }}
workingDirectory: $(Agent.BuildDirectory)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
os: ${{ job.os }}
cmakeBuildDir: $(Agent.BuildDirectory)/yaml-cpp/build
cmakeSourceDir: $(Agent.BuildDirectory)/yaml-cpp
useAmdclang: false
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DYAML_CPP_BUILD_TOOLS=OFF
-DYAML_BUILD_SHARED_LIBS=OFF
-DYAML_CPP_INSTALL=ON
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
parameters:
os: ${{ job.os }}

View File

@@ -1,23 +0,0 @@
variables:
- group: common
- template: /.azuredevops/variables-global.yml
parameters:
- name: cli11Version
type: string
default: "main"
resources:
repositories:
- repository: pipelines_repo
type: github
endpoint: ROCm
name: ROCm/ROCm
trigger: none
pr: none
jobs:
- template: ${{ variables.CI_DEPENDENCIES_PATH }}/cli11.yml
parameters:
cli11Version: ${{ parameters.cli11Version }}

View File

@@ -1,24 +0,0 @@
variables:
- group: common
- template: /.azuredevops/variables-global.yml
parameters:
- name: yamlcppVersion
type: string
default: "0.8.0"
resources:
repositories:
- repository: pipelines_repo
type: github
endpoint: ROCm
name: ROCm/ROCm
trigger: none
pr: none
jobs:
- template: ${{ variables.CI_DEPENDENCIES_PATH }}/yamlcpp.yml
parameters:
yamlcppVersion: ${{ parameters.yamlcppVersion }}

View File

@@ -13,7 +13,7 @@ parameters:
default: ctest
- name: testParameters
type: string
default: --extra-verbose --output-on-failure --force-new-ctest-process --output-junit test_output.xml
default: --output-on-failure --force-new-ctest-process --output-junit test_output.xml
- name: extraTestParameters
type: string
default: ''

View File

@@ -35,7 +35,6 @@ AlexNet
Andrej
Arb
Autocast
autograd
BARs
BatchNorm
BLAS
@@ -88,11 +87,9 @@ Conda
ConnectX
CountOnes
CuPy
customizable
da
Dashboarding
Dataloading
dataflows
DBRX
DDR
DF
@@ -188,7 +185,7 @@ GPU
GPU's
GPUDirect
GPUs
GraphBolt
Graphbolt
GraphSage
GRBM
GRE
@@ -218,7 +215,6 @@ Haswell
Higgs
href
Hyperparameters
HybridEngine
Huggingface
IB
ICD

View File

@@ -4116,7 +4116,7 @@ memory partition modes upon an invalid argument return from memory partition mod
- JSON output plugin for `rocprofv2`. The JSON file matches Google Trace Format making it easy to load on Perfetto, Chrome tracing, or Speedscope. For Speedscope, use `--disable-json-data-flows` option as speedscope doesn't work with data flows.
- `--no-serialization` flag to disable kernel serialization when `rocprofv2` is in counter collection mode. This allows `rocprofv2` to avoid deadlock when profiling certain programs in counter collection mode.
- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 GPU
- `FP64_ACTIVE` and `ENGINE_ACTIVE` metrics to AMD Instinct MI300 accelerator
- New HIP APIs with struct defined inside union.
- Early checks to confirm the eligibility of ELF file in ATT plugin
- Support for kernel name filtering in `rocprofv2`
@@ -4140,18 +4140,18 @@ memory partition modes upon an invalid argument return from memory partition mod
#### Resolved issues
- Bandwidth measurement in AMD Instinct MI300 GPU
- Bandwidth measurement in AMD Instinct MI300 accelerator
- Perfetto plugin issue of `roctx` trace not getting displayed
- `--help` for counter collection
- Signal management issues in `queue.cpp`
- Perfetto tracks for multi-GPU
- Perfetto plugin usage with `rocsys`
- Incorrect number of columns in the output CSV files for counter collection and kernel tracing
- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 GPU
- The ROCProfiler hang issue when running kernel trace, thread trace, or counter collection on Iree benchmark for AMD Instinct MI300 accelerator
- Build errors thrown during parsing of unions
- The system hang caused while running `--kernel-trace` with Perfetto for certain applications
- Missing profiler records issue caused while running `--trace-period`
- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 GPU
- The hang issue of `ProfilerAPITest` of `runFeatureTests` on AMD Instinct MI300 accelerator
- Segmentation fault on Navi32
@@ -5548,7 +5548,7 @@ See [issue #3499](https://github.com/ROCm/ROCm/issues/3499) on GitHub.
intermediary script to call the application with the necessary arguments, then call the script with Omniperf. This
issue is fixed in a future release of Omniperf. See [#347](https://github.com/ROCm/rocprofiler-compute/issues/347).
- Omniperf might not work with AMD Instinct MI300 GPUs out of the box, resulting in the following error:
- Omniperf might not work with AMD Instinct MI300 accelerators out of the box, resulting in the following error:
"*ERROR gfx942 is not enabled rocprofv1. Available profilers include: ['rocprofv2']*". As a workaround, add the
environment variable `export ROCPROF=rocprofv2`.
@@ -5664,7 +5664,7 @@ See [issue #3498](https://github.com/ROCm/ROCm/issues/3498) on GitHub.
#### Optimized
* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 GPU.
* Improved performance of Level 1 `dot_batched` and `dot_strided_batched` for all precisions. Performance enhanced by 6 times for bigger problem sizes, as measured on an Instinct MI210 accelerator.
#### Removed

View File

@@ -164,7 +164,7 @@ firmware, AMD GPU drivers, and the ROCm user space software.
</table>
</div>
<p id="footnote1">[1]: PLDM bundle 01.25.05.00 will be available by October 31, 2025.</p>
<p id="footnote1">[1]: PLDM bundle 01.25.05.00 will be available by November 2025.</p>
#### AMD Instinct MI300X GPU resiliency improvement
@@ -190,7 +190,7 @@ ROCm-LS provides the following tools to build a complete workflow for life scien
* The hipCIM library provides powerful support for GPU-accelerated I/O operations, coupled with an array of computer vision and image processing primitives designed for N-dimensional image data in fields such as biomedical imaging. For more information, see the [hipCIM documentation](https://rocm.docs.amd.com/projects/hipCIM/en/latest/).
* MONAI for AMD ROCm, a ROCm-enabled version of [MONAI](https://monai.io/), is built on top of [PyTorch for AMD ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), helping healthcare and life science innovators to leverage GPU acceleration with AMD Instinct GPUs for high-performance inference and training of medical AI applications. For more information, see the [MONAI for AMD ROCm documentation](https://rocm.docs.amd.com/projects/monai/en/latest/).
* MONAI for AMD ROCm, a ROCm-enabled version of {fab}`github` [MONAI](https://github.com/Project-MONAI/MONAI), is built on top of [PyTorch for AMD ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), helping healthcare and life science innovators to leverage GPU acceleration with AMD Instinct GPUs for high-performance inference and training of medical AI applications. For more information, see the [MONAI for AMD ROCm documentation](https://rocm.docs.amd.com/projects/monai/en/latest/).
### Deep learning and AI framework updates
@@ -241,8 +241,6 @@ ROCm documentation continues to be updated to provide clearer and more comprehen
For more information about the changes, see the [Changelog for the AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/changelog.html).
* ROCm components support a wide range of environment variables that can be used for testing, logging, debugging, experimental features, and more. The [rocBLAS](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-7.0.2/reference/env-variables.html) and [RCCL](https://rocm.docs.amd.com/projects/rccl/en/docs-7.0.2/api-reference/env-variables.html) components have been updated with new environment variable content.
## ROCm components
The following table lists the versions of ROCm components for ROCm 7.0.2, including any version
@@ -712,7 +710,7 @@ The issue will be resolved in a future ROCm release. See [GitHub issue #5500](ht
### Applications using OpenCV might fail due to package incompatibility between the OS
OpenCV packages built on Ubuntu 24.04 are incompatible with Debian 13 due to a version conflict. As a result, applications, tests, and samples that use OpenCV might fail. To avoid the version conflict, rebuild OpenCV with the version corresponding to Debian 13, then rebuild MIVisionX on top of it. As a workaround, rebuild OpenCV from source, followed by the application that uses OpenCV. This issue will be fixed in a future ROCm release. See [GitHub issue #5501](https://github.com/ROCm/ROCm/issues/5501).
OpenCV packages built on Ubuntu 24.04 are incompatible with Debian 13 due to a version conflict. As a result, applications, tests, and samples that use OpenCV might fail. As a workaround, rebuild OpenCV with the version corresponding to Debian 13 from source, followed by the application that uses OpenCV. This issue will be fixed in a future ROCm release. See [GitHub issue #5501](https://github.com/ROCm/ROCm/issues/5501).
## ROCm upcoming changes

View File

@@ -96,7 +96,7 @@ ROCm Version,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6
:doc:`rocThrust <rocthrust:index>`,4.0.0,4.0.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.3.0,3.1.1,3.1.0,3.1.0,3.0.1,3.0.1,3.0.1,3.0.1,3.0.1,3.0.0,3.0.0
,,,,,,,,,,,,,,,,,,,,
SUPPORT LIBS,,,,,,,,,,,,,,,,,,,,
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`hipother <https://github.com/ROCm/hipother>`_,7.0.51831,7.0.51830,6.4.43483,6.4.43483,6.4.43483,6.4.43482,6.3.42134,6.3.42134,6.3.42133,6.3.42131,6.2.41134,6.2.41134,6.2.41134,6.2.41133,6.1.40093,6.1.40093,6.1.40092,6.1.40091,6.1.32831,6.1.32830
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0,6.1.5,6.1.2,6.1.1,6.1.0,6.0.2,6.0.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,N/A [#ROCT-rocr-past-60]_,20240607.5.7,20240607.5.7,20240607.4.05,20240607.1.4246,20240125.5.08,20240125.5.08,20240125.5.08,20240125.3.30,20231016.2.245,20231016.2.245
,,,,,,,,,,,,,,,,,,,,

View File

@@ -10,9 +10,10 @@ Use this matrix to view the ROCm compatibility and system requirements across su
You can also refer to the :ref:`past versions of ROCm compatibility matrix<past-rocm-compatibility-matrix>`.
GPUs listed in the following table support compute workloads (no display
Accelerators and GPUs listed in the following table support compute workloads (no display
information or graphics). If you're using ROCm with AMD Radeon GPUs or Ryzen APUs for graphics
workloads, see the :docs:`Use ROCm on Radeon and Ryzen <radeon:index.html>` to verify
workloads, see the `Use ROCm on Radeon and Ryzen
<https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html>`_ to verify
compatibility and system requirements.
.. |br| raw:: html
@@ -113,7 +114,7 @@ compatibility and system requirements.
:doc:`rocThrust <rocthrust:index>`,4.0.0,4.0.0,3.3.0
,,,
SUPPORT LIBS,,,
`hipother <https://github.com/ROCm/hipother>`_,7.0.51830,7.0.51830,6.4.43482
`hipother <https://github.com/ROCm/hipother>`_,7.0.51831,7.0.51830,6.4.43482
`rocm-core <https://github.com/ROCm/rocm-core>`_,7.0.2,7.0.1/7.0.0,6.4.0
`ROCT-Thunk-Interface <https://github.com/ROCm/ROCT-Thunk-Interface>`_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_,N/A [#ROCT-rocr]_
,,,

View File

@@ -2,7 +2,7 @@
.. meta::
:description: Deep Graph Library (DGL) compatibility
:keywords: GPU, CPU, deep graph library, DGL, deep learning, framework compatibility
:keywords: GPU, DGL compatibility
.. version-set:: rocm_version latest
@@ -10,42 +10,24 @@
DGL compatibility
********************************************************************************
Deep Graph Library (`DGL <https://www.dgl.ai/>`__) is an easy-to-use, high-performance, and scalable
Deep Graph Library `(DGL) <https://www.dgl.ai/>`_ is an easy-to-use, high-performance and scalable
Python package for deep learning on graphs. DGL is framework agnostic, meaning
that if a deep graph model is a component in an end-to-end application, the rest of
if a deep graph model is a component in an end-to-end application, the rest of
the logic is implemented using PyTorch.
DGL provides a high-performance graph object that can reside on either CPUs or GPUs.
It bundles structural data features for better control and provides a variety of functions
for computing with graph objects, including efficient and customizable message passing
primitives for Graph Neural Networks.
* ROCm support for DGL is hosted in the `https://github.com/ROCm/dgl <https://github.com/ROCm/dgl>`_ repository.
* Due to independent compatibility considerations, this location differs from the `https://github.com/dmlc/dgl <https://github.com/dmlc/dgl>`_ upstream repository.
* Use the prebuilt :ref:`Docker images <dgl-docker-compat>` with DGL, PyTorch, and ROCm preinstalled.
* See the :doc:`ROCm DGL installation guide <rocm-install-on-linux:install/3rd-party/dgl-install>`
to install and get started.
Support overview
================================================================================
- The ROCm-supported version of DGL is maintained in the official `https://github.com/ROCm/dgl
<https://github.com/ROCm/dgl>`__ repository, which differs from the
`https://github.com/dmlc/dgl <https://github.com/dmlc/dgl>`__ upstream repository.
- To get started and install DGL on ROCm, use the prebuilt :ref:`Docker images <dgl-docker-compat>`,
which include ROCm, DGL, and all required dependencies.
- See the :doc:`ROCm DGL installation guide <rocm-install-on-linux:install/3rd-party/dgl-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://www.dgl.ai/pages/start.html>`__
for additional context.
Version support
--------------------------------------------------------------------------------
DGL is supported on `ROCm 6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
Supported devices
--------------------------------------------------------------------------------
================================================================================
- **Officially Supported**: TF32 with AMD Instinct MI300X (through hipblaslt)
- **Partially Supported**: TF32 with AMD Instinct MI250X
- **Officially Supported**: AMD Instinct™ MI300X (through `hipBLASlt <https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/index.html>`__)
- **Partially Supported**: AMD Instinct™ MI250X
.. _dgl-recommendations:
@@ -53,7 +35,7 @@ Use cases and recommendations
================================================================================
DGL can be used for Graph Learning, and building popular graph models like
GAT, GCN, and GraphSage. Using these models, a variety of use cases are supported:
GAT, GCN and GraphSage. Using these we can support a variety of use-cases such as:
- Recommender systems
- Network Optimization and Analysis
@@ -80,17 +62,16 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes `DGL images <https://hub.docker.com/r/rocm/dgl/tags>`__
with ROCm backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest available DGL version from the official Docker Hub.
AMD validates and publishes `DGL images <https://hub.docker.com/r/rocm/dgl>`_
with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated
inventories were tested on `ROCm 6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`_.
Click the |docker-icon| to view the image on Docker Hub.
.. list-table:: DGL Docker image components
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
* - Docker
- DGL
- PyTorch
- Ubuntu
@@ -100,106 +81,102 @@ Click the |docker-icon| to view the image on Docker Hub.
<a href="https://hub.docker.com/layers/rocm/dgl/dgl-2.4_rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.6.0/images/sha256-8ce2c3bcfaa137ab94a75f9e2ea711894748980f57417739138402a542dd5564"><i class="fab fa-docker fa-lg"></i></a>
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`__
- `2.6.0 <https://github.com/ROCm/pytorch/tree/release/2.6>`__
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`_
- `2.6.0 <https://github.com/ROCm/pytorch/tree/release/2.6>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/dgl/dgl-2.4_rocm6.4_ubuntu24.04_py3.12_pytorch_release_2.4.1/images/sha256-cf1683283b8eeda867b690229c8091c5bbf1edb9f52e8fb3da437c49a612ebe4"><i class="fab fa-docker fa-lg"></i></a>
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`__
- `2.4.1 <https://github.com/ROCm/pytorch/tree/release/2.4>`__
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`_
- `2.4.1 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/dgl/dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1/images/sha256-4834f178c3614e2d09e89e32041db8984c456d45dfd20286e377ca8635686554"><i class="fab fa-docker fa-lg"></i></a>
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`__
- `2.4.1 <https://github.com/ROCm/pytorch/tree/release/2.4>`__
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`_
- `2.4.1 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 22.04
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`_
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/dgl/dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.3.0/images/sha256-88740a2c8ab4084b42b10c3c6ba984cab33dd3a044f479c6d7618e2b2cb05e69"><i class="fab fa-docker fa-lg"></i></a>
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`__
- `2.3.0 <https://github.com/ROCm/pytorch/tree/release/2.3>`__
- `2.4.0 <https://github.com/dmlc/dgl/releases/tag/v2.4.0>`_
- `2.3.0 <https://github.com/ROCm/pytorch/tree/release/2.3>`_
- 22.04
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`_
Key ROCm libraries for DGL
================================================================================
DGL on ROCm depends on specific libraries that affect its features and performance.
Using the DGL Docker container or building it with the provided Docker file or a ROCm base image is recommended.
Using the DGL Docker container or building it with the provided docker file or a ROCm base image is recommended.
If you prefer to build it yourself, ensure the following dependencies are installed:
.. list-table::
:header-rows: 1
* - ROCm library
- ROCm 6.4.0 Version
- Version
- Purpose
* - `Composable Kernel <https://github.com/ROCm/composable_kernel>`_
- 1.1.0
- :version-ref:`"Composable Kernel" rocm_version`
- Enables faster execution of core operations like matrix multiplication
(GEMM), convolutions and transformations.
* - `hipBLAS <https://github.com/ROCm/hipBLAS>`_
- 2.4.0
- :version-ref:`hipBLAS rocm_version`
- Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for
matrix and vector operations.
* - `hipBLASLt <https://github.com/ROCm/hipBLASLt>`_
- 0.12.0
- :version-ref:`hipBLASLt rocm_version`
- hipBLASLt is an extension of the hipBLAS library, providing additional
features like epilogues fused into the matrix multiplication kernel or
use of integer tensor cores.
* - `hipCUB <https://github.com/ROCm/hipCUB>`_
- 3.4.0
- :version-ref:`hipCUB rocm_version`
- Provides a C++ template library for parallel algorithms for reduction,
scan, sort and select.
* - `hipFFT <https://github.com/ROCm/hipFFT>`_
- 1.0.18
- :version-ref:`hipFFT rocm_version`
- Provides GPU-accelerated Fast Fourier Transform (FFT) operations.
* - `hipRAND <https://github.com/ROCm/hipRAND>`_
- 2.12.0
- :version-ref:`hipRAND rocm_version`
- Provides fast random number generation for GPUs.
* - `hipSOLVER <https://github.com/ROCm/hipSOLVER>`_
- 2.4.0
- :version-ref:`hipSOLVER rocm_version`
- Provides GPU-accelerated solvers for linear systems, eigenvalues, and
singular value decompositions (SVD).
* - `hipSPARSE <https://github.com/ROCm/hipSPARSE>`_
- 3.2.0
- :version-ref:`hipSPARSE rocm_version`
- Accelerates operations on sparse matrices, such as sparse matrix-vector
or matrix-matrix products.
* - `hipSPARSELt <https://github.com/ROCm/hipSPARSELt>`_
- 0.2.3
- :version-ref:`hipSPARSELt rocm_version`
- Accelerates operations on sparse matrices, such as sparse matrix-vector
or matrix-matrix products.
* - `hipTensor <https://github.com/ROCm/hipTensor>`_
- 1.5.0
- :version-ref:`hipTensor rocm_version`
- Optimizes for high-performance tensor operations, such as contractions.
* - `MIOpen <https://github.com/ROCm/MIOpen>`_
- 3.4.0
- :version-ref:`MIOpen rocm_version`
- Optimizes deep learning primitives such as convolutions, pooling,
normalization, and activation functions.
* - `MIGraphX <https://github.com/ROCm/AMDMIGraphX>`_
- 2.12.0
- :version-ref:`MIGraphX rocm_version`
- Adds graph-level optimizations, ONNX models and mixed precision support
and enable Ahead-of-Time (AOT) Compilation.
* - `MIVisionX <https://github.com/ROCm/MIVisionX>`_
- 3.2.0
- :version-ref:`MIVisionX rocm_version`
- Optimizes acceleration for computer vision and AI workloads like
preprocessing, augmentation, and inferencing.
* - `rocAL <https://github.com/ROCm/rocAL>`_
@@ -207,25 +184,25 @@ If you prefer to build it yourself, ensure the following dependencies are instal
- Accelerates the data pipeline by offloading intensive preprocessing and
augmentation tasks. rocAL is part of MIVisionX.
* - `RCCL <https://github.com/ROCm/rccl>`_
- 2.2.0
- :version-ref:`RCCL rocm_version`
- Optimizes for multi-GPU communication for operations like AllReduce and
Broadcast.
* - `rocDecode <https://github.com/ROCm/rocDecode>`_
- 0.10.0
- :version-ref:`rocDecode rocm_version`
- Provides hardware-accelerated data decoding capabilities, particularly
for image, video, and other dataset formats.
* - `rocJPEG <https://github.com/ROCm/rocJPEG>`_
- 0.8.0
- :version-ref:`rocJPEG rocm_version`
- Provides hardware-accelerated JPEG image decoding and encoding.
* - `RPP <https://github.com/ROCm/RPP>`_
- 1.9.10
- :version-ref:`RPP rocm_version`
- Speeds up data augmentation, transformation, and other preprocessing steps.
* - `rocThrust <https://github.com/ROCm/rocThrust>`_
- 3.3.0
- :version-ref:`rocThrust rocm_version`
- Provides a C++ template library for parallel algorithms like sorting,
reduction, and scanning.
* - `rocWMMA <https://github.com/ROCm/rocWMMA>`_
- 1.7.0
- :version-ref:`rocWMMA rocm_version`
- Accelerates warp-level matrix-multiply and matrix-accumulate to speed up matrix
multiplication (GEMM) and accumulation operations with mixed precision
support.
@@ -234,14 +211,14 @@ If you prefer to build it yourself, ensure the following dependencies are instal
Supported features
================================================================================
Many functions and methods available upstream are also supported in DGL on ROCm.
Many functions and methods available in DGL Upstream are also supported in DGL ROCm.
Instead of listing them all, support is grouped into the following categories to provide a general overview.
* DGL Base
* DGL Backend
* DGL Data
* DGL Dataloading
* DGL Graph
* DGL DGLGraph
* DGL Function
* DGL Ops
* DGL Sampling
@@ -258,9 +235,9 @@ Instead of listing them all, support is grouped into the following categories to
Unsupported features
================================================================================
* GraphBolt
* Partial TF32 Support (MI250X only)
* Kineto/ROCTracer integration
* Graphbolt
* Partial TF32 Support (MI250x only)
* Kineto/ ROCTracer integration
Unsupported functions
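
As background for the DGL compatibility excerpt above, which describes DGL's GPU-resident graph object and its message-passing primitives for graph neural networks, here is a minimal upstream-style sketch; the toy edge list, feature width, and device handling are invented for illustration and are not taken from the documentation:

```python
import dgl
import dgl.function as fn
import torch

# Toy graph: edges 0->1, 1->2, 2->3 (illustrative values).
g = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([1, 2, 3])))
g.ndata["feat"] = torch.randn(g.num_nodes(), 8)   # per-node features

# Move the graph object to the GPU (the ROCm backend is exposed as "cuda" in PyTorch).
g = g.to("cuda") if torch.cuda.is_available() else g

# One round of message passing: copy source-node features and sum them at each destination.
g.update_all(fn.copy_u("feat", "m"), fn.sum("m", "h"))
print(g.ndata["h"].shape)
```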

View File

@@ -1,8 +1,8 @@
:orphan:
.. meta::
:description: FlashInfer compatibility
:keywords: GPU, LLM, FlashInfer, deep learning, framework compatibility
:description: FlashInfer deep learning framework compatibility
:keywords: GPU, LLM, FlashInfer, compatibility
.. version-set:: rocm_version latest
@@ -11,7 +11,7 @@ FlashInfer compatibility
********************************************************************************
`FlashInfer <https://docs.flashinfer.ai/index.html>`__ is a library and kernel generator
for Large Language Models (LLMs) that provides a high-performance implementation of graphics
for Large Language Models (LLMs) that provides high-performance implementation of graphics
processing units (GPUs) kernels. FlashInfer focuses on LLM serving and inference, as well
as advanced performance across diverse scenarios.
@@ -25,30 +25,28 @@ offers high-performance LLM-specific operators, with easy integration through Py
For the latest feature compatibility matrix, refer to the ``README`` of the
`https://github.com/ROCm/flashinfer <https://github.com/ROCm/flashinfer>`__ repository.
Support overview
================================================================================
Support for the ROCm port of FlashInfer is available as follows:
- The ROCm-supported version of FlashInfer is maintained in the official `https://github.com/ROCm/flashinfer
<https://github.com/ROCm/flashinfer>`__ repository, which differs from the
`https://github.com/flashinfer-ai/flashinfer <https://github.com/flashinfer-ai/flashinfer>`__
- ROCm support for FlashInfer is hosted in the `https://github.com/ROCm/flashinfer
<https://github.com/ROCm/flashinfer>`__ repository. This location differs from the
`https://github.com/flashinfer-ai/flashinfer <https://github.com/flashinfer-ai/flashinfer>`_
upstream repository.
- To get started and install FlashInfer on ROCm, use the prebuilt :ref:`Docker images <flashinfer-docker-compat>`,
which include ROCm, FlashInfer, and all required dependencies.
- To install FlashInfer, use the prebuilt :ref:`Docker image <flashinfer-docker-compat>`,
which includes ROCm, FlashInfer, and all required dependencies.
- See the :doc:`ROCm FlashInfer installation guide <rocm-install-on-linux:install/3rd-party/flashinfer-install>`
for installation and setup instructions.
to install and get started.
- You can also consult the upstream `Installation guide <https://docs.flashinfer.ai/installation.html>`__
for additional context.
- See the `Installation guide <https://docs.flashinfer.ai/installation.html>`__
in the upstream FlashInfer documentation.
Version support
--------------------------------------------------------------------------------
.. note::
FlashInfer is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
Flashinfer is supported on ROCm 6.4.1.
Supported devices
--------------------------------------------------------------------------------
================================================================================
**Officially Supported**: AMD Instinct™ MI300X
@@ -80,9 +78,10 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes `FlashInfer images <https://hub.docker.com/r/rocm/flashinfer/tags>`__
with ROCm backends on Docker Hub. The following Docker image tag and associated
inventories represent the latest available FlashInfer version from the official Docker Hub.
AMD validates and publishes `ROCm FlashInfer images <https://hub.docker.com/r/rocm/flashinfer/tags>`__
with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated
inventories represent the FlashInfer version from the official Docker Hub.
The Docker images have been validated for `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::

View File

@@ -2,7 +2,7 @@
.. meta::
:description: JAX compatibility
:keywords: GPU, JAX, deep learning, framework compatibility
:keywords: GPU, JAX compatibility
.. version-set:: rocm_version latest
@@ -10,38 +10,59 @@
JAX compatibility
*******************************************************************************
`JAX <https://docs.jax.dev/en/latest/notebooks/thinking_in_jax.html>`__ is a library
for array-oriented numerical computation (similar to NumPy), with automatic differentiation
and just-in-time (JIT) compilation to enable high-performance machine learning research.
JAX provides a NumPy-like API, which combines automatic differentiation and the
Accelerated Linear Algebra (XLA) compiler to achieve high-performance machine
learning at scale.
JAX provides an API that combines automatic differentiation and the
Accelerated Linear Algebra (XLA) compiler to achieve high-performance machine
learning at scale. JAX uses composable transformations of Python and NumPy through
JIT compilation, automatic vectorization, and parallelization.
JAX uses composable transformations of Python and NumPy through just-in-time
(JIT) compilation, automatic vectorization, and parallelization. To learn about
JAX, including profiling and optimizations, see the official `JAX documentation
<https://jax.readthedocs.io/en/latest/notebooks/quickstart.html>`_.
Support overview
ROCm support for JAX is upstreamed, and users can build the official source code
with ROCm support:
- ROCm JAX release:
- Offers AMD-validated and community :ref:`Docker images <jax-docker-compat>`
with ROCm and JAX preinstalled.
- ROCm JAX repository: `ROCm/rocm-jax <https://github.com/ROCm/rocm-jax>`_
- See the :doc:`ROCm JAX installation guide <rocm-install-on-linux:install/3rd-party/jax-install>`
to get started.
- Official JAX release:
- Official JAX repository: `jax-ml/jax <https://github.com/jax-ml/jax>`_
- See the `AMD GPU (Linux) installation section
<https://jax.readthedocs.io/en/latest/installation.html#amd-gpu-linux>`_ in
the JAX documentation.
.. note::
AMD releases official `ROCm JAX Docker images <https://hub.docker.com/r/rocm/jax>`_
quarterly alongside new ROCm releases. These images undergo full AMD testing.
`Community ROCm JAX Docker images <https://hub.docker.com/r/rocm/jax-community>`_
follow upstream JAX releases and use the latest available ROCm version.
JAX Plugin-PJRT with JAX/JAXLIB compatibility
================================================================================
- The ROCm-supported version of JAX is maintained in the official `https://github.com/ROCm/rocm-jax
<https://github.com/ROCm/rocm-jax>`__ repository, which differs from the
`https://github.com/jax-ml/jax <https://github.com/jax-ml/jax>`__ upstream repository.
Portable JIT Runtime (PJRT) is an open, stable interface for device runtime and
compiler. The following table details the ROCm version compatibility matrix
between JAX Plugin-PJRT and JAX/JAXLIB.
- To get started and install JAX on ROCm, use the prebuilt :ref:`Docker images <jax-docker-compat>`,
which include ROCm, JAX, and all required dependencies.
.. list-table::
:header-rows: 1
- See the :doc:`ROCm JAX installation guide <rocm-install-on-linux:install/3rd-party/jax-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://jax.readthedocs.io/en/latest/installation.html#amd-gpu-linux>`__
for additional context.
Version support
--------------------------------------------------------------------------------
AMD releases official `ROCm JAX Docker images <https://hub.docker.com/r/rocm/jax/tags>`_
quarterly alongside new ROCm releases. These images undergo full AMD testing.
`Community ROCm JAX Docker images <https://hub.docker.com/r/rocm/jax-community/tags>`_
follow upstream JAX releases and use the latest available ROCm version.
* - JAX Plugin-PJRT
- JAX/JAXLIB
- ROCm
* - 0.6.0
- 0.6.2, 0.6.0
- 7.0.0, 7.0.1, 7.0.2
Use cases and recommendations
================================================================================
@@ -67,7 +88,7 @@ Use cases and recommendations
* The `Distributed fine-tuning with JAX on AMD GPUs <https://rocm.blogs.amd.com/artificial-intelligence/distributed-sft-jax/README.html>`_
outlines the process of fine-tuning a Bidirectional Encoder Representations
from Transformers (BERT)-based large language model (LLM) using JAX for a text
classification task. The blog post discusses techniques for parallelizing the
classification task. The blog post discuss techniques for parallelizing the
fine-tuning across multiple AMD GPUs and assess the model's performance on a
holdout dataset. During the fine-tuning, a BERT-base-cased transformer model
and the General Language Understanding Evaluation (GLUE) benchmark dataset was
@@ -75,7 +96,7 @@ Use cases and recommendations
* The `MI300X workload optimization guide <https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html>`_
provides detailed guidance on optimizing workloads for the AMD Instinct MI300X
GPU using ROCm. The page is aimed at helping users achieve optimal
accelerator using ROCm. The page is aimed at helping users achieve optimal
performance for deep learning and other high-performance computing tasks on
the MI300X GPU.
@@ -86,9 +107,9 @@ For more use cases and recommendations, see `ROCm JAX blog posts <https://rocm.b
Docker image compatibility
================================================================================
AMD validates and publishes `JAX images <https://hub.docker.com/r/rocm/jax/tags>`__
with ROCm backends on Docker Hub.
AMD provides preconfigured Docker images with JAX and the ROCm backend.
These images are published on `Docker Hub <https://hub.docker.com/r/rocm/jax>`__ and are the
recommended way to get started with deep learning with JAX on ROCm.
For ``jax-community`` images, see `rocm/jax-community
<https://hub.docker.com/r/rocm/jax-community/tags>`__ on Docker Hub.
@@ -230,7 +251,7 @@ The ROCm supported data types in JAX are collected in the following table.
.. note::
JAX data type support is affected by the :ref:`key_rocm_libraries` and it's
JAX data type support is effected by the :ref:`key_rocm_libraries` and it's
collected on :doc:`ROCm data types and precision support <rocm:reference/precision-support>`
page.
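
To make the JAX description excerpted above concrete (a NumPy-like API, composable transformations, JIT compilation through XLA, and accelerators exposed via the PJRT plugin), here is a minimal sketch; the loss function and array shapes are illustrative assumptions only:

```python
import jax
import jax.numpy as jnp

# Devices discovered through the installed PJRT plugin (ROCm GPUs on a ROCm build).
print(jax.devices())

def loss(w, x):
    # NumPy-like array code; shapes here are arbitrary illustrations.
    return jnp.sum((x @ w) ** 2)

# Composable transformations: reverse-mode autodiff wrapped in XLA JIT compilation.
grad_loss = jax.jit(jax.grad(loss))

w = jnp.ones((4, 2))
x = jnp.ones((8, 4))
print(grad_loss(w, x).shape)   # (4, 2), same shape as w
```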

View File

@@ -1,8 +1,8 @@
:orphan:
.. meta::
:description: llama.cpp compatibility
:keywords: GPU, GGML, llama.cpp, deep learning, framework compatibility
:description: llama.cpp deep learning framework compatibility
:keywords: GPU, GGML, llama.cpp compatibility
.. version-set:: rocm_version latest
@@ -20,33 +20,34 @@ to accelerate inference and reduce memory usage. Originally built as a CPU-first
llama.cpp is easy to integrate with other programming environments and is widely
adopted across diverse platforms, including consumer devices.
Support overview
================================================================================
ROCm support for llama.cpp is upstreamed, and you can build the official source code
with ROCm support:
- The ROCm-supported version of llama.cpp is maintained in the official `https://github.com/ROCm/llama.cpp
<https://github.com/ROCm/llama.cpp>`__ repository, which differs from the
`https://github.com/ggml-org/llama.cpp <https://github.com/ggml-org/llama.cpp>`__ upstream repository.
- ROCm support for llama.cpp is hosted in the official `https://github.com/ROCm/llama.cpp
<https://github.com/ROCm/llama.cpp>`_ repository.
- To get started and install llama.cpp on ROCm, use the prebuilt :ref:`Docker images <llama-cpp-docker-compat>`,
which include ROCm, llama.cpp, and all required dependencies.
- Due to independent compatibility considerations, this location differs from the
`https://github.com/ggml-org/llama.cpp <https://github.com/ggml-org/llama.cpp>`_ upstream repository.
- To install llama.cpp, use the prebuilt :ref:`Docker image <llama-cpp-docker-compat>`,
which includes ROCm, llama.cpp, and all required dependencies.
- See the :doc:`ROCm llama.cpp installation guide <rocm-install-on-linux:install/3rd-party/llama-cpp-install>`
for installation and setup instructions.
to install and get started.
- You can also consult the upstream `Installation guide <https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md>`__
for additional context.
- See the `Installation guide <https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#hip>`__
in the upstream llama.cpp documentation.
Version support
--------------------------------------------------------------------------------
.. note::
llama.cpp is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__ and
`ROCm 6.4.x <https://repo.radeon.com/rocm/apt/6.4/>`__.
llama.cpp is supported on ROCm 7.0.0 and ROCm 6.4.x.
Supported devices
--------------------------------------------------------------------------------
================================================================================
**Officially Supported**: AMD Instinct™ MI300X, MI325X, MI210
Use cases and recommendations
================================================================================
@@ -83,9 +84,9 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes `llama.cpp images <https://hub.docker.com/r/rocm/llama.cpp/tags>`__
AMD validates and publishes `ROCm llama.cpp Docker images <https://hub.docker.com/r/rocm/llama.cpp/tags>`__
with ROCm backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest available llama.cpp versions from the official Docker Hub.
inventories represent the available llama.cpp versions from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. important::

View File

@@ -2,7 +2,7 @@
.. meta::
:description: Megablocks compatibility
:keywords: GPU, megablocks, deep learning, framework compatibility
:keywords: GPU, megablocks, compatibility
.. version-set:: rocm_version latest
@@ -10,42 +10,28 @@
Megablocks compatibility
********************************************************************************
`Megablocks <https://github.com/databricks/megablocks>`__ is a lightweight library
for mixture-of-experts `(MoE) <https://huggingface.co/blog/moe>`__ training.
Megablocks is a light-weight library for mixture-of-experts (MoE) training.
The core of the system is efficient "dropless-MoE" and standard MoE layers.
Megablocks is integrated with `https://github.com/stanford-futuredata/Megatron-LM
<https://github.com/stanford-futuredata/Megatron-LM>`__,
Megablocks is integrated with `https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`_,
where data and pipeline parallel training of MoEs is supported.
Support overview
================================================================================
* ROCm support for Megablocks is hosted in the official `https://github.com/ROCm/megablocks <https://github.com/ROCm/megablocks>`_ repository.
* Due to independent compatibility considerations, this location differs from the `https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`_ upstream repository.
* Use the prebuilt :ref:`Docker image <megablocks-docker-compat>` with ROCm, PyTorch, and Megablocks preinstalled.
* See the :doc:`ROCm Megablocks installation guide <rocm-install-on-linux:install/3rd-party/megablocks-install>` to install and get started.
- The ROCm-supported version of Megablocks is maintained in the official `https://github.com/ROCm/megablocks
<https://github.com/ROCm/megablocks>`__ repository, which differs from the
`https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`__ upstream repository.
.. note::
- To get started and install Megablocks on ROCm, use the prebuilt :ref:`Docker image <megablocks-docker-compat>`,
which includes ROCm, Megablocks, and all required dependencies.
- See the :doc:`ROCm Megablocks installation guide <rocm-install-on-linux:install/3rd-party/megablocks-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://github.com/databricks/megablocks>`__
for additional context.
Version support
--------------------------------------------------------------------------------
Megablocks is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
Megablocks is supported on ROCm 6.3.0.
Supported devices
--------------------------------------------------------------------------------
================================================================================
- **Officially Supported**: AMD Instinct MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210
- **Officially Supported**: AMD Instinct MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210
Supported models and features
--------------------------------------------------------------------------------
================================================================================
This section summarizes the Megablocks features supported by ROCm.
@@ -55,28 +41,20 @@ This section summarizes the Megablocks features supported by ROCm.
* Mixture-of-Experts
* dropless-Mixture-of-Experts
.. _megablocks-recommendations:
Use cases and recommendations
================================================================================
* The `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`__
blog post guides how to leverage the ROCm platform for pre-training using the
Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts
(MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it
demonstrates how block-sparse computations can enhance scalability and efficiency in MoE
training. The guide provides step-by-step instructions for setting up the environment,
including cloning the repository, building the Docker image, and running the training container.
Additionally, it offers insights into utilizing the ``oscar-1GB.json`` dataset for pre-training
language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE
training workflows for large-scale transformer models.
The `ROCm Megablocks blog posts <https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`_
guide how to leverage the ROCm platform for pre-training using the Megablocks framework.
It features how to pre-process datasets and how to begin pre-training on AMD GPUs through:
* Single-GPU pre-training
* Multi-GPU pre-training
.. _megablocks-docker-compat:
Docker image compatibility
@@ -86,9 +64,10 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
with ROCm backends on Docker Hub. The following Docker image tag and associated
inventories represent the latest available Megablocks version from the official Docker Hub.
AMD validates and publishes `ROCm Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`_
with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest Megablocks version from the official Docker Hub.
The Docker images have been validated for `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::

View File

@@ -2,7 +2,7 @@
.. meta::
:description: PyTorch compatibility
:keywords: GPU, PyTorch, deep learning, framework compatibility
:keywords: GPU, PyTorch compatibility
.. version-set:: rocm_version latest
@@ -15,42 +15,40 @@ deep learning. PyTorch on ROCm provides mixed-precision and large-scale training
using `MIOpen <https://github.com/ROCm/MIOpen>`__ and
`RCCL <https://github.com/ROCm/rccl>`__ libraries.
PyTorch provides two high-level features:
ROCm support for PyTorch is upstreamed into the official PyTorch repository. Due
to independent compatibility considerations, this results in two distinct
release cycles for PyTorch on ROCm:
- Tensor computation (like NumPy) with strong GPU acceleration
- ROCm PyTorch release:
- Deep neural networks built on a tape-based autograd system (rapid computation
of multiple partial derivatives or gradients)
- Provides the latest version of ROCm but might not necessarily support the
latest stable PyTorch version.
Support overview
================================================================================
- Offers :ref:`Docker images <pytorch-docker-compat>` with ROCm and PyTorch
preinstalled.
ROCm support for PyTorch is upstreamed into the official PyTorch repository.
ROCm development is aligned with the stable release of PyTorch, while upstream
PyTorch testing uses the stable release of ROCm to maintain consistency:
- ROCm PyTorch repository: `<https://github.com/ROCm/pytorch>`__
- The ROCm-supported version of PyTorch is maintained in the official `https://github.com/ROCm/pytorch
<https://github.com/ROCm/pytorch>`__ repository, which differs from the
`https://github.com/pytorch/pytorch <https://github.com/pytorch/pytorch>`__ upstream repository.
- See the :doc:`ROCm PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`
to get started.
- To get started and install PyTorch on ROCm, use the prebuilt :ref:`Docker images <pytorch-docker-compat>`,
which include ROCm, PyTorch, and all required dependencies.
- Official PyTorch release:
- See the :doc:`ROCm PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`
for installation and setup instructions.
- Provides the latest stable version of PyTorch but might not necessarily
support the latest ROCm version.
- You can also consult the upstream `Installation guide <https://pytorch.org/get-started/locally/>`__ or
`Previous versions <https://pytorch.org/get-started/previous-versions/>`__ for additional context.
- Official PyTorch repository: `<https://github.com/pytorch/pytorch>`__
- See the `Nightly and latest stable version installation guide <https://pytorch.org/get-started/locally/>`__
or `Previous versions <https://pytorch.org/get-started/previous-versions/>`__
to get started.
PyTorch includes tooling that generates HIP source code from the CUDA backend.
This approach allows PyTorch to support ROCm without requiring manual code
modifications. For more information, see :doc:`HIPIFY <hipify:index>`.
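As a quick illustration of this portability, the following sketch (assuming a
ROCm-enabled PyTorch installation with at least one supported GPU visible)
shows that code written against the familiar CUDA device API typically runs
unchanged on a ROCm build of PyTorch:

.. code-block:: python

   import torch

   # On ROCm builds, torch.cuda is backed by HIP under the hood.
   print(torch.version.hip)          # HIP/ROCm version string (None on CUDA builds)
   print(torch.cuda.is_available())  # True when an AMD GPU is visible

   # CUDA-style device placement works as-is.
   x = torch.randn(1024, 1024, device="cuda")
   y = x @ x
   print(y.device)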
Version support
--------------------------------------------------------------------------------
AMD releases official `ROCm PyTorch Docker images <https://hub.docker.com/r/rocm/pytorch/tags>`_
quarterly alongside new ROCm releases. These images undergo full AMD testing.
ROCm development is aligned with the stable release of PyTorch, while upstream
PyTorch testing uses the stable release of ROCm to maintain consistency.
.. _pytorch-recommendations:
@@ -75,12 +73,12 @@ Use cases and recommendations
* The :doc:`Instinct MI300X workload optimization guide </how-to/rocm-for-ai/inference-optimization/workload>`
provides detailed guidance on optimizing workloads for the AMD Instinct MI300X
GPU using ROCm. This guide helps users achieve optimal performance for
accelerator using ROCm. This guide helps users achieve optimal performance for
deep learning and other high-performance computing tasks on the MI300X
GPU.
accelerator.
* The :doc:`Inception with PyTorch documentation </conceptual/ai-pytorch-inception>`
describes how PyTorch integrates with ROCm for AI workloads. It outlines the
describes how PyTorch integrates with ROCm for AI workloads. It outlines the
use of PyTorch on the ROCm platform and focuses on efficiently leveraging AMD
GPU hardware for training and inference tasks in AI applications.
@@ -91,8 +89,9 @@ For more use cases and recommendations, see `ROCm PyTorch blog posts <https://ro
Docker image compatibility
================================================================================
AMD validates and publishes `PyTorch images <https://hub.docker.com/r/rocm/pytorch/tags>`__
with ROCm backends on Docker Hub.
AMD provides preconfigured Docker images with PyTorch and the ROCm backend.
These images are published on `Docker Hub <https://hub.docker.com/r/rocm/pytorch>`__ and are the
recommended way to get started with deep learning with PyTorch on ROCm.
To find the right image tag, see the :ref:`PyTorch on ROCm installation
documentation <rocm-install-on-linux:pytorch-docker-support>` for a list of
@@ -408,7 +407,7 @@ with ROCm.
**Note:** Only official release exists.
Key features and enhancements for PyTorch 2.7 with ROCm 7.0
Key features and enhancements for PyTorch 2.7/2.8 with ROCm 7.0
================================================================================
- Enhanced TunableOp framework: Introduces ``tensorfloat32`` support for
@@ -418,7 +417,7 @@ Key features and enhancements for PyTorch 2.7 with ROCm 7.0
- Expanded GPU architecture support: Provides optimized support for newer GPU
architectures, including gfx1200 and gfx1201 with preferred hipBLASLt backend
selection, along with improvements for gfx950 and gfx1100 Series GPUs.
selection, along with improvements for gfx950 and gfx1100 series GPUs.
- Advanced Triton Integration: AOTriton 0.10b introduces official support for
gfx950 and gfx1201, along with experimental support for gfx1101, gfx1151,
@@ -443,10 +442,6 @@ Key features and enhancements for PyTorch 2.7 with ROCm 7.0
ROCm-specific test conditions, and enhanced unit test coverage for Flash
Attention and Memory Efficient operations.
- Build system and infrastructure improvements: Provides updated CentOS Stream 9
support, improved Docker configuration, migration to public MAGMA repository,
and enhanced QA automation scripts for PyTorch unit testing.
- Composable Kernel (CK) updates: Features updated CK submodule integration with
the latest optimizations and performance improvements for core mathematical
operations.
@@ -468,7 +463,7 @@ Key features and enhancements for PyTorch 2.7 with ROCm 7.0
network training or inference. For AMD platforms, ``amdclang++`` has been
validated as the supported compiler for building these extensions.
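For example, a custom C++ extension might be built with ``amdclang++`` by
selecting that compiler before invoking PyTorch's extension loader. The sketch
below is illustrative only: ``my_op.cpp`` is a placeholder source file, and
using the ``CXX`` environment variable to pick the host compiler is an
assumption of this example rather than a documented requirement.

.. code-block:: python

   import os
   from torch.utils.cpp_extension import load

   # Assumption: the extension build honors the CXX environment variable.
   os.environ["CXX"] = "amdclang++"

   # "my_op.cpp" is a hypothetical extension source file.
   my_op = load(name="my_op", sources=["my_op.cpp"], verbose=True)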
Known issues and notes for PyTorch 2.7 with ROCm 7.0
Known issues and notes for PyTorch 2.7/2.8 with ROCm 7.0
================================================================================
- The ``matmul.allow_fp16_reduced_precision_reduction`` and

View File

@@ -1,8 +1,8 @@
:orphan:
.. meta::
:description: Ray compatibility
:keywords: GPU, Ray, deep learning, framework compatibility
:description: Ray deep learning framework compatibility
:keywords: GPU, Ray compatibility
.. version-set:: rocm_version latest
@@ -19,36 +19,37 @@ simplifying machine learning computations.
Ray is a general-purpose framework that runs many types of workloads efficiently.
Any Python application can be scaled with Ray, without extra infrastructure.
Support overview
================================================================================
ROCm support for Ray is upstreamed, and you can build the official source code
with ROCm support:
- The ROCm-supported version of Ray is maintained in the official `https://github.com/ROCm/ray
<https://github.com/ROCm/ray>`__ repository, which differs from the
`https://github.com/ray-project/ray <https://github.com/ray-project/ray>`__ upstream repository.
- ROCm support for Ray is hosted in the official `https://github.com/ROCm/ray
<https://github.com/ROCm/ray>`_ repository.
- To get started and install Ray on ROCm, use the prebuilt :ref:`Docker image <ray-docker-compat>`,
- Due to independent compatibility considerations, this location differs from the
`https://github.com/ray-project/ray <https://github.com/ray-project/ray>`_ upstream repository.
- To install Ray, use the prebuilt :ref:`Docker image <ray-docker-compat>`,
which includes ROCm, Ray, and all required dependencies (see the example below).
- The Docker image provided is based on the upstream Ray `Daily Release (Nightly) wheels
<https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies>`__
- See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
for instructions to get started.
- See the `Installation section <https://docs.ray.io/en/latest/ray-overview/installation.html>`_
in the upstream Ray documentation.
- The Docker image provided is based on the upstream Ray `Daily Release (Nightly) wheels <https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies>`__
corresponding to commit `005c372 <https://github.com/ray-project/ray/commit/005c372262e050d5745f475e22e64305fa07f8b8>`__.
- See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
for installation and setup instructions.
.. note::
- You can also consult the upstream `Installation guide <https://docs.ray.io/en/latest/ray-overview/installation.html>`__
for additional context.
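Once the container is running, a short smoke test can confirm that Ray
schedules work onto the GPU. The sketch below assumes the ROCm Ray Docker
image, which also ships PyTorch, and a single visible GPU.

.. code-block:: python

   import ray

   ray.init()  # start or connect to a local Ray instance

   @ray.remote(num_gpus=1)
   def gpu_task():
       import torch
       # Ray limits device visibility to the GPU assigned to this task.
       return torch.cuda.get_device_name(0)

   print(ray.get(gpu_task.remote()))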
Version support
--------------------------------------------------------------------------------
Ray is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
Ray is supported on ROCm 6.4.1.
Supported devices
--------------------------------------------------------------------------------
================================================================================
**Officially Supported**: AMD Instinct™ MI300X, MI210
Use cases and recommendations
================================================================================
@@ -87,15 +88,15 @@ Docker image compatibility
AMD validates and publishes ready-made `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
with ROCm backends on Docker Hub. The following Docker image tags and
associated inventories represent the latest Ray version from the official Docker Hub.
Click the |docker-icon| icon to view the image on Docker Hub.
associated inventories represent the latest Ray version from the official Docker Hub and are validated for
`ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`_. Click the |docker-icon|
icon to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Ray
- PyTorch
- Ubuntu
@@ -104,7 +105,6 @@ Click the |docker-icon| icon to view the image on Docker Hub.
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
- `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`_
- 2.6.0+git684f6f2
- 24.04

View File

@@ -2,7 +2,7 @@
.. meta::
:description: Stanford Megatron-LM compatibility
:keywords: Stanford, Megatron-LM, deep learning, framework compatibility
:keywords: Stanford, Megatron-LM, compatibility
.. version-set:: rocm_version latest
@@ -10,50 +10,34 @@
Stanford Megatron-LM compatibility
********************************************************************************
Stanford Megatron-LM is a large-scale language model training framework developed
by NVIDIA at `https://github.com/NVIDIA/Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_.
It is designed to train massive transformer-based language models efficiently by model
and data parallelism.
Stanford Megatron-LM is a large-scale language model training framework developed by NVIDIA at `https://github.com/NVIDIA/Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_. It is
designed to train massive transformer-based language models efficiently through model and data parallelism.
It provides efficient tensor, pipeline, and sequence-based model parallelism for
pre-training transformer-based language models such as GPT (Decoder Only), BERT
(Encoder Only), and T5 (Encoder-Decoder).
* ROCm support for Stanford Megatron-LM is hosted in the official `https://github.com/ROCm/Stanford-Megatron-LM <https://github.com/ROCm/Stanford-Megatron-LM>`_ repository.
* Due to independent compatibility considerations, this location differs from the `https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`_ upstream repository.
* Use the prebuilt :ref:`Docker image <megatron-lm-docker-compat>` with ROCm, PyTorch, and Megatron-LM preinstalled.
* See the :doc:`ROCm Stanford Megatron-LM installation guide <rocm-install-on-linux:install/3rd-party/stanford-megatron-lm-install>` to install and get started.
Support overview
.. note::
Stanford Megatron-LM is supported on ROCm 6.3.0.
Supported devices
================================================================================
- The ROCm-supported version of Stanford Megatron-LM is maintained in the official `https://github.com/ROCm/Stanford-Megatron-LM
<https://github.com/ROCm/Stanford-Megatron-LM>`__ repository, which differs from the
`https://github.com/stanford-futuredata/Megatron-LM <https://github.com/stanford-futuredata/Megatron-LM>`__ upstream repository.
- **Officially Supported**: AMD Instinct MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct MI250X, MI210
- To get started and install Stanford Megatron-LM on ROCm, use the prebuilt :ref:`Docker image <megatron-lm-docker-compat>`,
which includes ROCm, Stanford Megatron-LM, and all required dependencies.
- See the :doc:`ROCm Stanford Megatron-LM installation guide <rocm-install-on-linux:install/3rd-party/stanford-megatron-lm-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://github.com/NVIDIA/Megatron-LM>`__
for additional context.
Version support
--------------------------------------------------------------------------------
Stanford Megatron-LM is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
Supported devices
--------------------------------------------------------------------------------
- **Officially Supported**: AMD Instinct™ MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct™ MI250X, MI210
Supported models and features
--------------------------------------------------------------------------------
================================================================================
This section details the models and features supported by Stanford Megatron-LM on ROCm.
Models:
* BERT
* BERT
* GPT
* T5
* ICT
@@ -70,24 +54,13 @@ Features:
Use cases and recommendations
================================================================================
The following blog post mentions Megablocks, but you can run Stanford Megatron-LM with the same steps to pre-process datasets on AMD GPUs:
See the `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs <https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`_ blog post
to learn how to leverage the ROCm platform for pre-training with the Stanford Megatron-LM framework, including pre-processing datasets on AMD GPUs.
Coverage includes:
* The `Efficient MoE training on AMD ROCm: How-to use Megablocks on AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megablocks/README.html>`__
blog post explains how to leverage the ROCm platform for pre-training using the
Megablocks framework. It introduces a streamlined approach for training Mixture-of-Experts
(MoE) models using the Megablocks library on AMD hardware. Focusing on GPT-2, it
demonstrates how block-sparse computations can enhance scalability and efficiency in MoE
training. The guide provides step-by-step instructions for setting up the environment,
including cloning the repository, building the Docker image, and running the training container.
Additionally, it offers insights into utilizing the ``oscar-1GB.json`` dataset for pre-training
language models. By leveraging Megablocks and the ROCm platform, you can optimize your MoE
training workflows for large-scale transformer models.
* Single-GPU pre-training
* Multi-GPU pre-training
It shows how to pre-process datasets and how to begin pre-training on AMD GPUs through:
* Single-GPU pre-training
* Multi-GPU pre-training
.. _megatron-lm-docker-compat:
@@ -98,9 +71,10 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/megatron-lm>`_
with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
inventories represent the latest Megatron-LM version from the official Docker Hub.
The Docker images have been validated for `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
@@ -108,7 +82,6 @@ Click |docker-icon| to view the image on Docker Hub.
:class: docker-image-compatibility
* - Docker image
- ROCm
- Stanford Megatron-LM
- PyTorch
- Ubuntu
@@ -118,7 +91,6 @@ Click |docker-icon| to view the image on Docker Hub.
<a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i></a>
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
- `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04

View File

@@ -2,7 +2,7 @@
.. meta::
:description: Taichi compatibility
:keywords: GPU, Taichi, deep learning, framework compatibility
:keywords: GPU, Taichi compatibility
.. version-set:: rocm_version latest
@@ -19,52 +19,28 @@ Taichi is widely used across various domains, including real-time physical simul
numerical computing, augmented reality, artificial intelligence, computer vision, robotics,
visual effects in film and gaming, and general-purpose computing.
Support overview
================================================================================
* ROCm support for Taichi is hosted in the official `https://github.com/ROCm/taichi <https://github.com/ROCm/taichi>`_ repository.
* Due to independent compatibility considerations, this location differs from the `https://github.com/taichi-dev <https://github.com/taichi-dev>`_ upstream repository.
* Use the prebuilt :ref:`Docker image <taichi-docker-compat>` with ROCm, PyTorch, and Taichi preinstalled.
* See the :doc:`ROCm Taichi installation guide <rocm-install-on-linux:install/3rd-party/taichi-install>` to install and get started.
- The ROCm-supported version of Taichi is maintained in the official `https://github.com/ROCm/taichi
<https://github.com/ROCm/taichi>`__ repository, which differs from the
`https://github.com/taichi-dev/taichi <https://github.com/taichi-dev/taichi>`__ upstream repository.
.. note::
- To get started and install Taichi on ROCm, use the prebuilt :ref:`Docker image <taichi-docker-compat>`,
which includes ROCm, Taichi, and all required dependencies.
Taichi is supported on ROCm 6.3.2.
- See the :doc:`ROCm Taichi installation guide <rocm-install-on-linux:install/3rd-party/taichi-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://github.com/taichi-dev/taichi>`__
for additional context.
Version support
--------------------------------------------------------------------------------
Taichi is supported on `ROCm 6.3.2 <https://repo.radeon.com/rocm/apt/6.3.2/>`__.
Supported devices
--------------------------------------------------------------------------------
- **Officially Supported**: AMD Instinct™ MI250X, MI210 (with the exception of Taichi's GPU rendering system, CGUI)
- **Upcoming Support**: AMD Instinct™ MI300X
Supported devices and features
===============================================================================
There is support through the ROCm software stack for all Taichi GPU features on AMD Instinct MI250X and MI210 GPUs, with the exception of Taichi's GPU rendering system, CGUI.
AMD Instinct MI300X series GPUs will be supported by November.
.. _taichi-recommendations:
Use cases and recommendations
================================================================================
* The `Accelerating Parallel Programming in Python with Taichi Lang on AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/taichi/README.html>`__
blog highlights Taichi as an open-source programming language designed for high-performance
numerical computation, particularly in domains like real-time physical simulation,
artificial intelligence, computer vision, robotics, and visual effects. Taichi
is embedded in Python and uses just-in-time (JIT) compilation frameworks like
LLVM to optimize execution on GPUs and CPUs. The blog emphasizes the versatility
of Taichi in enabling complex simulations and numerical algorithms, making
it ideal for developers working on compute-intensive tasks. Developers are
encouraged to follow recommended coding patterns and utilize Taichi decorators
for performance optimization, with examples available in the `https://github.com/ROCm/taichi_examples
<https://github.com/ROCm/taichi_examples>`_ repository. Prebuilt Docker images
integrating ROCm, PyTorch, and Taichi are provided for simplified installation
and deployment, making it easier to leverage Taichi for advanced computational workloads.
To fully leverage Taichi's performance capabilities in compute-intensive tasks, it is best to adhere to specific coding patterns and utilize Taichi decorators.
A collection of example use cases is available in the `https://github.com/ROCm/taichi_examples <https://github.com/ROCm/taichi_examples>`_ repository,
providing practical insights and foundational knowledge for working with the Taichi programming language.
You can also refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`_ to search for Taichi examples and best practices to optimize your workflows on AMD GPUs.
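As a minimal sketch of those patterns (assuming a working ROCm Taichi
installation), the example below initializes Taichi on the GPU backend and runs
a parallel kernel over a field:

.. code-block:: python

   import taichi as ti

   ti.init(arch=ti.gpu)  # select the available GPU backend

   n = 1024
   x = ti.field(dtype=ti.f32, shape=n)

   @ti.kernel
   def scale():
       for i in x:  # the outermost loop is parallelized on the GPU
           x[i] = i * 2.0

   scale()
   print(x.to_numpy()[:4])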
.. _taichi-docker-compat:
@@ -76,8 +52,9 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes ready-made `ROCm Taichi Docker images <https://hub.docker.com/r/rocm/taichi/tags>`_
with ROCm backends on Docker Hub. The following Docker image tag and associated inventories
with ROCm backends on Docker Hub. The following Docker image tags and associated inventories
represent the latest Taichi version from the official Docker Hub.
The Docker images have been validated for `ROCm 6.3.2 <https://rocm.docs.amd.com/en/docs-6.3.2/about/release-notes.html>`_.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::

View File

@@ -2,7 +2,7 @@
.. meta::
:description: TensorFlow compatibility
:keywords: GPU, TensorFlow, deep learning, framework compatibility
:keywords: GPU, TensorFlow compatibility
.. version-set:: rocm_version latest
@@ -12,33 +12,37 @@ TensorFlow compatibility
`TensorFlow <https://www.tensorflow.org/>`__ is an open-source library for
solving machine learning, deep learning, and AI problems. It can solve many
problems across different sectors and industries, but primarily focuses on
neural network training and inference. It is one of the most popular deep
learning frameworks and is very active in open-source development.
Support overview
================================================================================
- The ROCm-supported version of TensorFlow is maintained in the official `https://github.com/ROCm/tensorflow-upstream
<https://github.com/ROCm/tensorflow-upstream>`__ repository, which differs from the
`https://github.com/tensorflow/tensorflow <https://github.com/tensorflow/tensorflow>`__ upstream repository.
- To get started and install TensorFlow on ROCm, use the prebuilt :ref:`Docker images <tensorflow-docker-compat>`,
which include ROCm, TensorFlow, and all required dependencies.
- See the :doc:`ROCm TensorFlow installation guide <rocm-install-on-linux:install/3rd-party/tensorflow-install>`
for installation and setup instructions.
- You can also consult the `TensorFlow API versions <https://www.tensorflow.org/versions>`__ list
for additional context.
Version support
--------------------------------------------------------------------------------
problems across different sectors and industries but primarily focuses on
neural network training and inference. It is one of the most popular and
in-demand frameworks and is very active in open-source contribution and
development.
The `official TensorFlow repository <http://github.com/tensorflow/tensorflow>`__
includes full ROCm support. AMD maintains a TensorFlow `ROCm repository
<http://github.com/rocm/tensorflow-upstream>`__ in order to quickly add bug
fixes, updates, and support for the latest ROCm versions.
fixes, updates, and support for the latest ROCm versions.
- ROCm TensorFlow release:
- Offers :ref:`Docker images <tensorflow-docker-compat>` with
ROCm and TensorFlow pre-installed.
- ROCm TensorFlow repository: `<https://github.com/ROCm/tensorflow-upstream>`__
- See the :doc:`ROCm TensorFlow installation guide <rocm-install-on-linux:install/3rd-party/tensorflow-install>`
to get started.
- Official TensorFlow release:
- Official TensorFlow repository: `<https://github.com/tensorflow/tensorflow>`__
- See the `TensorFlow API versions <https://www.tensorflow.org/versions>`__ list.
.. note::
The official TensorFlow documentation does not cover ROCm support. Use the
ROCm documentation for installation instructions for TensorFlow on ROCm.
See :doc:`rocm-install-on-linux:install/3rd-party/tensorflow-install`.
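After installation, a short check such as the following sketch (assuming a
ROCm TensorFlow build with a visible GPU) confirms that the GPU is detected and
usable:

.. code-block:: python

   import tensorflow as tf

   # On a ROCm build, AMD GPUs are reported as regular "GPU" devices.
   gpus = tf.config.list_physical_devices("GPU")
   print(gpus)

   if gpus:
       with tf.device("/GPU:0"):
           a = tf.random.normal([1024, 1024])
           b = tf.random.normal([1024, 1024])
           print(tf.reduce_sum(tf.matmul(a, b)).numpy())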
.. _tensorflow-docker-compat:

View File

@@ -2,7 +2,7 @@
.. meta::
:description: verl compatibility
:keywords: GPU, verl, deep learning, framework compatibility
:keywords: GPU, verl compatibility
.. version-set:: rocm_version latest
@@ -10,58 +10,24 @@
verl compatibility
*******************************************************************************
Volcano Engine Reinforcement Learning for LLMs (`verl <https://verl.readthedocs.io/en/latest/>`__)
is a reinforcement learning framework designed for large language models (LLMs).
verl offers a scalable, open-source fine-tuning solution by using a hybrid programming model
that makes it easy to define and run complex post-training dataflows efficiently.
Volcano Engine Reinforcement Learning for LLMs (verl) is a reinforcement learning framework designed for large language models (LLMs).
verl offers a scalable, open-source fine-tuning solution optimized for AMD Instinct GPUs with full ROCm support.
Its modular APIs separate computation from data, allowing smooth integration with other frameworks.
It also supports flexible model placement across GPUs for efficient scaling on different cluster sizes.
verl achieves high training and generation throughput by building on existing LLM frameworks.
Its 3D-HybridEngine reduces memory use and communication overhead when switching between training
and inference, improving overall performance.
* See the `verl documentation <https://verl.readthedocs.io/en/latest/>`_ for more information about verl.
* The official verl GitHub repository is `https://github.com/volcengine/verl <https://github.com/volcengine/verl>`_.
* Use the AMD-validated :ref:`Docker images <verl-docker-compat>` with ROCm and verl preinstalled.
* See the :doc:`ROCm verl installation guide <rocm-install-on-linux:install/3rd-party/verl-install>` to install and get started.
Support overview
================================================================================
.. note::
- The ROCm-supported version of verl is maintained in the official `https://github.com/ROCm/verl
<https://github.com/ROCm/verl>`__ repository, which differs from the
`https://github.com/volcengine/verl <https://github.com/volcengine/verl>`__ upstream repository.
- To get started and install verl on ROCm, use the prebuilt :ref:`Docker image <verl-docker-compat>`,
which includes ROCm, verl, and all required dependencies.
- See the :doc:`ROCm verl installation guide <rocm-install-on-linux:install/3rd-party/verl-install>`
for installation and setup instructions.
- You can also consult the upstream `verl documentation <https://verl.readthedocs.io/en/latest/>`__
for additional context.
Version support
--------------------------------------------------------------------------------
verl is supported on `ROCm 6.2.0 <https://repo.radeon.com/rocm/apt/6.2/>`__.
Supported devices
--------------------------------------------------------------------------------
**Officially Supported**: AMD Instinct™ MI300X
verl is supported on ROCm 6.2.0.
.. _verl-recommendations:
Use cases and recommendations
================================================================================
* The benefits of verl in large-scale reinforcement learning from human feedback
(RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD
GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__
blog. The blog post outlines how the Volcano Engine Reinforcement Learning
(verl) framework integrates with the AMD ROCm platform to optimize training on
Instinct™ MI300X GPUs. The guide details the process of building a Docker image,
setting up single-node and multi-node training environments, and highlights
performance benchmarks demonstrating improved throughput and convergence accuracy.
This resource serves as a comprehensive starting point for deploying verl on AMD GPUs,
facilitating efficient RLHF training workflows.
The benefits of verl in large-scale reinforcement learning from human feedback (RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`_ blog.
.. _verl-supported_features:
@@ -95,10 +61,8 @@ Docker image compatibility
<i class="fab fa-docker"></i>
AMD validates and publishes ready-made `verl Docker images <https://hub.docker.com/r/rocm/verl/tags>`_
with ROCm backends on Docker Hub. The following Docker image tag and associated inventories
represent the latest verl version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
AMD validates and publishes ready-made `ROCm verl Docker images <https://hub.docker.com/r/rocm/verl/tags>`_
with ROCm backends on Docker Hub. The following Docker image tags and associated inventories represent the available verl versions from the official Docker Hub.
.. list-table::
:header-rows: 1

View File

@@ -13,22 +13,22 @@
:gutter: 1
:::{grid-item-card}
**AMD Instinct MI300 Series**
**AMD Instinct MI300 series**
Review hardware aspects of the AMD Instinct™ MI300 Series GPUs and the CDNA™ 3
Review hardware aspects of the AMD Instinct™ MI300 series of GPU accelerators and the CDNA™ 3
architecture.
* [AMD Instinct™ MI300 microarchitecture](./gpu-arch/mi300.md)
* [AMD Instinct MI300/CDNA3 ISA](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)
* [White paper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf)
* [MI300 performance counters](./gpu-arch/mi300-mi200-performance-counters.rst)
* [MI350 Series performance counters](./gpu-arch/mi350-performance-counters.rst)
* [MI350 series performance counters](./gpu-arch/mi350-performance-counters.rst)
:::
:::{grid-item-card}
**AMD Instinct MI200 Series**
**AMD Instinct MI200 series**
Review hardware aspects of the AMD Instinct™ MI200 Series GPUs and the CDNA™ 2
Review hardware aspects of the AMD Instinct™ MI200 series of GPU accelerators and the CDNA™ 2
architecture.
* [AMD Instinct™ MI250 microarchitecture](./gpu-arch/mi250.md)
@@ -41,7 +41,7 @@ architecture.
:::{grid-item-card}
**AMD Instinct MI100**
Review hardware aspects of the AMD Instinct™ MI100 Series GPUs and the CDNA™ 1
Review hardware aspects of the AMD Instinct™ MI100 series of GPU accelerators and the CDNA™ 1
architecture.
* [AMD Instinct™ MI100 microarchitecture](./gpu-arch/mi100.md)

View File

@@ -1,14 +1,14 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI100 Series architecture."
"description lang=en": "Learn about the AMD Instinct MI100 series architecture."
"keywords": "Instinct, MI100, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI100 microarchitecture
The following image shows the node-level architecture of a system that
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ GPUs.
comprises two AMD EPYC™ processors and (up to) eight AMD Instinct™ accelerators.
The two EPYC processors are connected to each other with the AMD Infinity™
fabric, which provides high-bandwidth (up to 18 GT/sec), coherent links such
that each processor can access the available node memory as a single
@@ -18,29 +18,29 @@ available to connect the processors plus one PCIe Gen 4 x16 link per processor
can attach additional I/O devices such as the host adapters for the network
fabric.
![Structure of a single GCD in the AMD Instinct MI100 GPU](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ GPUs.")
![Structure of a single GCD in the AMD Instinct MI100 accelerator](../../data/conceptual/gpu-arch/image004.png "Node-level system architecture with two AMD EPYC™ processors and eight AMD Instinct™ accelerators.")
In a typical node configuration, each processor can host up to four AMD
Instinct™ GPUs that are attached using PCIe Gen 4 links at 16 GT/sec,
Instinct™ accelerators that are attached using PCIe Gen 4 links at 16 GT/sec,
which corresponds to a peak bidirectional link bandwidth of 32 GB/sec. Each hive
of four GPUs can participate in a fully connected, coherent AMD
Instinct™ fabric that connects the four GPUs using 23 GT/sec AMD
of four accelerators can participate in a fully connected, coherent AMD
Instinct™ fabric that connects the four accelerators using 23 GT/sec AMD
Infinity fabric links that run at a higher frequency than the inter-processor
links. This inter-GPU link can be established in certified server systems if the
GPUs are mounted in neighboring PCIe slots by installing the AMD Infinity
Fabric™ bridge for the AMD Instinct™ GPUs.
Fabric™ bridge for the AMD Instinct™ accelerators.
## Microarchitecture
The microarchitecture of the AMD Instinct GPUs is based on the AMD CDNA
The microarchitecture of the AMD Instinct accelerators is based on the AMD CDNA
architecture, which targets compute applications such as high-performance
computing (HPC) and AI & machine learning (ML) that run on everything from
individual servers to the world's largest exascale supercomputers. The overall
system architecture is designed for extreme scalability and compute performance.
![Structure of the AMD Instinct GPU (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct GPU (MI100 generation)")
![Structure of the AMD Instinct accelerator (MI100 generation)](../../data/conceptual/gpu-arch/image005.png "Structure of the AMD Instinct accelerator (MI100 generation)")
The above image shows the AMD Instinct GPU with its PCIe Gen 4 x16
The above image shows the AMD Instinct accelerator with its PCIe Gen 4 x16
link (16 GT/sec, at the bottom) that connects the GPU to (one of) the host
processor(s). It also shows the three AMD Infinity Fabric ports that provide
high-speed links (23 GT/sec, also at the bottom) to the other GPUs of the local
@@ -48,7 +48,7 @@ hive.
On the left and right of the floor plan, the High Bandwidth Memory (HBM)
attaches via the GPU memory controller. The MI100 generation of the AMD
Instinct GPU offers four stacks of HBM generation 2 (HBM2) for a total
Instinct accelerator offers four stacks of HBM generation 2 (HBM2) for a total
of 32 GB with a 4,096-bit-wide memory interface. The peak memory bandwidth of the
attached HBM2 is 1.228 TB/sec at a memory clock frequency of 1.2 GHz.
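The bandwidth figure follows from the interface width and the memory clock. The
short calculation below reproduces it, assuming HBM2 transfers data on both
clock edges (double data rate):

```python
interface_bits = 4096        # memory interface width stated above
mem_clock_hz = 1.2e9         # memory clock frequency stated above
ddr_factor = 2               # assumption: HBM2 is double data rate

bytes_per_sec = interface_bits / 8 * mem_clock_hz * ddr_factor
print(bytes_per_sec / 1e12)  # ~1.23 TB/sec, consistent with the 1.228 TB/sec above
```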
@@ -64,7 +64,7 @@ Therefore, the theoretical maximum FP64 peak performance is 11.5 TFLOPS
![Block diagram of an MI100 compute unit with detailed SIMD view of the AMD CDNA architecture](../../data/conceptual/gpu-arch/image006.png "An MI100 compute unit with detailed SIMD view of the AMD CDNA architecture")
The preceding image shows the block diagram of a single CU of an AMD Instinct™
MI100 GPU and summarizes how instructions flow through the execution
MI100 accelerator and summarizes how instructions flow through the execution
engines. The CU fetches the instructions via a 32KB instruction cache and moves
them forward to execution via a dispatcher. The CU can handle up to ten
wavefronts at a time and feed their instructions into the execution unit. The

View File

@@ -1,13 +1,13 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI250 Series architecture."
"description lang=en": "Learn about the AMD Instinct MI250 series architecture."
"keywords": "Instinct, MI250, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI250 microarchitecture
The microarchitecture of the AMD Instinct MI250 GPU is based on the
The microarchitecture of the AMD Instinct MI250 accelerators is based on the
AMD CDNA 2 architecture that targets compute applications such as HPC,
artificial intelligence (AI), and machine learning (ML) that run on
everything from individual servers to the world's largest exascale
@@ -40,7 +40,7 @@ execution units (also called matrix cores), which are geared toward executing
matrix operations like matrix-matrix multiplications. For FP64, the peak
performance of these units amounts to 90.5 TFLOPS.
![Structure of a single GCD in the AMD Instinct MI250 GPU.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 GPU.")
![Structure of a single GCD in the AMD Instinct MI250 accelerator.](../../data/conceptual/gpu-arch/image001.png "Structure of a single GCD in the AMD Instinct MI250 accelerator.")
```{list-table} Peak-performance capabilities of the MI250 OAM for different data types.
:header-rows: 1
@@ -84,9 +84,16 @@ performance of these units amounts to 90.5 TFLOPS.
- 362.1
```
The above table summarizes the aggregated peak performance of the AMD Instinct MI250 Open Compute Platform (OCP) Open Accelerator Modules (OAMs) and its two GCDs for different data types and execution units. The middle column lists the peak performance (number of data elements processed in a single instruction) of a single compute unit if a SIMD (or matrix) instruction is being retired in each clock cycle. The third column lists the theoretical peak performance of the OAM module. The theoretical aggregated peak memory bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).
The above table summarizes the aggregated peak performance of the AMD
Instinct MI250 OCP Open Accelerator Modules (OAM, OCP is short for Open Compute
Platform) and its two GCDs for different data types and execution units. The
middle column lists the peak performance (number of data elements processed in a
single instruction) of a single compute unit if a SIMD (or matrix) instruction
is being retired in each clock cycle. The third column lists the theoretical
peak performance of the OAM module. The theoretical aggregated peak memory
bandwidth of the GPU is 3.2 TB/sec (1.6 TB/sec per GCD).
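As an illustration of how the per-CU and OAM-level numbers relate, the sketch
below recovers the FP64 matrix figure quoted earlier; the compute-unit count,
per-CU matrix rate, and peak engine clock used here are assumptions of this
example rather than values taken from the table above:

```python
cus_per_gcd = 104               # assumed compute units per GCD
gcds_per_oam = 2                # an OAM package consists of two GCDs
fp64_matrix_flops_per_cu = 256  # assumed peak FP64 matrix FLOPs per CU per cycle
clock_hz = 1.7e9                # assumed peak engine clock

peak = cus_per_gcd * gcds_per_oam * fp64_matrix_flops_per_cu * clock_hz
print(peak / 1e12)              # ~90.5 TFLOPS, matching the FP64 matrix figure above
```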
![Dual-GCD architecture of the AMD Instinct MI250 GPUs](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 GPUs")
![Dual-GCD architecture of the AMD Instinct MI250 accelerators](../../data/conceptual/gpu-arch/image002.png "Dual-GCD architecture of the AMD Instinct MI250 accelerators")
The following image shows the block diagram of an OAM package that consists
of two GCDs, each of which constitutes one GPU device in the system. The two
@@ -98,18 +105,18 @@ between the two GCDs of an OAM, or a bidirectional peak transfer bandwidth of
## Node-level architecture
The following image shows the node-level architecture of a system that is
based on the AMD Instinct MI250 GPU. The MI250 OAMs attach to the host
based on the AMD Instinct MI250 accelerator. The MI250 OAMs attach to the host
system via PCIe Gen 4 x16 links (yellow lines). Each GCD maintains its own PCIe
x16 link to the host part of the system. Depending on the server platform, the
GCD can attach to the AMD EPYC processor directly or via an optional PCIe switch
. Note that some platforms may offer an x8 interface to the GCDs, which reduces
the available host-to-GPU bandwidth.
![Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 GPUs with 3rd Generation AMD EPYC processor")
![Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor](../../data/conceptual/gpu-arch/image003.png "Block diagram of AMD Instinct MI250 Accelerators with 3rd Generation AMD EPYC processor")
The preceding image shows the node-level architecture of a system with AMD
EPYC processors in a dual-socket configuration and four AMD Instinct MI250
GPUs. The MI250 OAMs attach to the host system via PCIe Gen 4
accelerators. The MI250 OAMs attach to the host system via PCIe Gen 4
x16 links (yellow lines). Depending on the system design, a PCIe switch may
exist to make more PCIe lanes available for additional components like network
interfaces and/or storage devices. Each GCD maintains its own PCIe x16 link to

View File

@@ -1,16 +1,16 @@
.. meta::
:description: MI300 and MI200 Series performance counters and metrics
:description: MI300 and MI200 series performance counters and metrics
:keywords: MI300, MI200, performance counters, command processor counters
***************************************************************************************************
MI300 and MI200 Series performance counters and metrics
MI300 and MI200 series performance counters and metrics
***************************************************************************************************
This document lists and describes the hardware performance counters and derived metrics available
for the AMD Instinct™ MI300 and MI200 GPUs. You can also access this information using the
:doc:`ROCprofiler-SDK <rocprofiler-sdk:how-to/using-rocprofv3>`.
MI300 and MI200 Series performance counters
MI300 and MI200 series performance counters
===============================================================
Series performance counters include the following categories:
@@ -27,7 +27,7 @@ The following sections provide additional details for each category.
.. note::
Preliminary validation of all MI300 and MI200 Series performance counters is in progress. Those with
Preliminary validation of all MI300 and MI200 series performance counters is in progress. Those with
an asterisk (*) require further evaluation.
.. _command-processor-counters:
@@ -171,7 +171,7 @@ Instruction mix
"``SQ_INSTS_SMEM``", "Instr", "Number of scalar memory instructions issued"
"``SQ_INSTS_SMEM_NORM``", "Instr", "Number of scalar memory instructions normalized to match ``smem_level`` issued"
"``SQ_INSTS_FLAT``", "Instr", "Number of flat instructions issued"
"``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 Series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled."
"``SQ_INSTS_FLAT_LDS_ONLY``", "Instr", "**MI200 series only** Number of FLAT instructions that read/write only from/to LDS issued. Works only if ``EARLY_TA_DONE`` is enabled."
"``SQ_INSTS_LDS``", "Instr", "Number of LDS instructions issued **(MI200: includes flat; MI300: does not include flat)**"
"``SQ_INSTS_GDS``", "Instr", "Number of global data share instructions issued"
"``SQ_INSTS_EXP_GDS``", "Instr", "Number of EXP and global data share instructions excluding skipped export instructions issued"
@@ -396,9 +396,9 @@ Texture cache per pipe counters
"``TCP_UTCL1_TRANSLATION_MISS[n]``", "Req", "Number of unified translation cache (L1) translation misses", "0-15"
"``TCP_UTCL1_PERMISSION_MISS[n]``", "Req", "Number of unified translation cache (L1) permission misses", "0-15"
"``TCP_TOTAL_CACHE_ACCESSES[n]``", "Req", "Number of vector L1d cache accesses including hits and misses", "0-15"
"``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 Series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15"
"``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15"
"``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 Series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15"
"``TCP_TCP_LATENCY[n]``", "Cycles", "**MI200 series only** Accumulated wave access latency to vL1D over all wavefronts", "0-15"
"``TCP_TCC_READ_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for reads and atomics with return", "0-15"
"``TCP_TCC_WRITE_REQ_LATENCY[n]``", "Cycles", "**MI200 series only** Total vL1D to L2 request latency over all wavefronts for writes and atomics without return", "0-15"
"``TCP_TCC_READ_REQ[n]``", "Req", "Number of read requests to L2 cache", "0-15"
"``TCP_TCC_WRITE_REQ[n]``", "Req", "Number of write requests to L2 cache", "0-15"
"``TCP_TCC_ATOMIC_WITH_RET_REQ[n]``", "Req", "Number of atomic requests to L2 cache with return", "0-15"
@@ -560,7 +560,7 @@ Note the following:
``TCC_TAG_STALL[n]``, probes can stall the pipeline at a variety of places. There is no single point that
can accurately measure the total stalls
MI300 and MI200 Series derived metrics list
MI300 and MI200 series derived metrics list
==============================================================
.. csv-table::

View File

@@ -1,21 +1,21 @@
---
myst:
html_meta:
"description lang=en": "Learn about the AMD Instinct MI300 Series architecture."
"description lang=en": "Learn about the AMD Instinct MI300 series architecture."
"keywords": "Instinct, MI300X, MI300A, microarchitecture, AMD, ROCm"
---
# AMD Instinct™ MI300 Series microarchitecture
# AMD Instinct™ MI300 series microarchitecture
The AMD Instinct MI300 Series GPUs are based on the AMD CDNA 3
The AMD Instinct MI300 series accelerators are based on the AMD CDNA 3
architecture which was designed to deliver leadership performance for HPC, artificial intelligence (AI), and machine
learning (ML) workloads. The AMD Instinct MI300 Series GPUs are well-suited for extreme scalability and compute performance, running
learning (ML) workloads. The AMD Instinct MI300 series accelerators are well-suited for extreme scalability and compute performance, running
on everything from individual servers to the world's largest exascale supercomputers.
With the MI300 Series, AMD is introducing the Accelerator Complex Die (XCD), which contains the
With the MI300 series, AMD is introducing the Accelerator Complex Die (XCD), which contains the
GPU computational elements of the processor along with the lower levels of the cache hierarchy.
The following image depicts the structure of a single XCD in the AMD Instinct MI300 GPU Series.
The following image depicts the structure of a single XCD in the AMD Instinct MI300 accelerator series.
```{figure} ../../data/shared/xcd-sys-arch.png
---
@@ -39,7 +39,7 @@ infrastructure) using the AMD Infinity Fabric™ technology as interconnect.
The Matrix Cores inside the CDNA 3 CUs have significant improvements, emphasizing AI and machine
learning, enhancing throughput of existing data types while adding support for new data types.
CDNA 2 Matrix Cores support FP16 and BF16, while offering INT8 for inference. Compared to MI250X
GPUs, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a
accelerators, CDNA 3 Matrix Cores triple the performance for FP16 and BF16, while providing a
performance gain of 6.8 times for INT8. FP8 has a performance gain of 16 times compared to FP32,
while TF32 has a gain of 4 times compared to FP32.
@@ -105,7 +105,7 @@ name: mi300-arch
alt:
align: center
---
MI300 Series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.
MI300 series system architecture showing MI300A (left) with 6 XCDs and 3 CCDs, while the MI300X (right) has 8 XCDs.
```
## Node-level architecture
@@ -116,11 +116,11 @@ name: mi300-node
align: center
---
MI300 Series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via retimers and HGX connectors.
MI300 series node-level architecture showing 8 fully interconnected MI300X OAM modules connected to (optional) PCIe switches via retimers and HGX connectors.
```
The image above shows the node-level architecture of a system with AMD EPYC processors in a
dual-socket configuration and eight AMD Instinct MI300X GPUs. The MI300X OAMs attach to the
dual-socket configuration and eight AMD Instinct MI300X accelerators. The MI300X OAMs attach to the
host system via PCIe Gen 5 x16 links (yellow lines). The GPUs are using seven high-bandwidth,
low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected 8-GPU system.

View File

@@ -1,12 +1,12 @@
.. meta::
:description: MI355 Series performance counters and metrics
:description: MI355 series performance counters and metrics
:keywords: MI355, MI355X, MI3XX
***********************************
MI350 Series performance counters
MI350 series performance counters
***********************************
This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 GPUs. These counters are available for profiling using `ROCprofiler-SDK <https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/index.html>`_ and `ROCm Compute Profiler <https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/>`_.
This topic lists and describes the hardware performance counters and derived metrics available on the AMD Instinct MI350 and MI355 accelerators. These counters are available for profiling using `ROCprofiler-SDK <https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/index.html>`_ and `ROCm Compute Profiler <https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/>`_.
The following sections list the performance counters based on the IP blocks.

View File

@@ -234,7 +234,7 @@ suppress_warnings = ["autosectionlabel.*"]
html_context = {
"project_path" : {project_path},
"gpu_type" : [('AMD Instinct GPUs', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ],
"gpu_type" : [('AMD Instinct accelerators', 'intrinsic'), ('AMD gfx families', 'gfx'), ('NVIDIA families', 'nvidia') ],
"atomics_type" : [('HW atomics', 'hw-atomics'), ('CAS emulation', 'cas-atomics')],
"pcie_type" : [('No PCIe atomics', 'nopcie'), ('PCIe atomics', 'pcie')],
"memory_type" : [('Device DRAM', 'device-dram'), ('Migratable Host DRAM', 'migratable-host-dram'), ('Pinned Host DRAM', 'pinned-host-dram')],

View File

@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS

View File

@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
32 bit atomicAdd,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicSub,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS
32 bit atomicMin,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS,✅ CAS

View File

@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native

View File

@@ -1,4 +1,4 @@
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X Series,MI300A,MI350X Series
Atomic,MI100,MI200 PCIe,MI200 A+A,MI300X series,MI300A,MI350X series
32 bit atomicAdd,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicSub,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native
32 bit atomicMin,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native,✅ Native

View File

@@ -10,7 +10,7 @@ Deep learning frameworks provide environments for machine learning, training, fi
ROCm offers a complete ecosystem for developing and running deep learning applications efficiently. It also provides ROCm-compatible versions of popular frameworks and libraries, such as PyTorch, TensorFlow, JAX, and others.
The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs.
The AMD ROCm organization actively contributes to open-source development and collaborates closely with framework organizations. This collaboration ensures that framework-specific optimizations effectively leverage AMD GPUs and accelerators.
The table below summarizes information about ROCm-enabled deep learning frameworks. It includes details on ROCm compatibility and third-party tool support, installation steps and options, and links to GitHub resources. For a complete list of supported framework versions on ROCm, see the :doc:`Compatibility matrix <../compatibility/compatibility-matrix>` topic.

View File

@@ -1,5 +1,5 @@
.. meta::
:description: How to configure MI300X GPUs to fully leverage their capabilities and achieve optimal performance.
:description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
:keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning
**************************************
@@ -7,11 +7,11 @@ AMD Instinct MI300X performance guides
**************************************
The following performance guides provide essential guidance on the necessary
steps to properly `configure your system for AMD Instinct™ MI300X GPUs
steps to properly `configure your system for AMD Instinct™ MI300X accelerators
<https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
They include detailed instructions on system settings and application
:doc:`workload tuning </how-to/rocm-for-ai/inference-optimization/workload>` to
help you leverage the maximum capabilities of these GPUs and achieve
help you leverage the maximum capabilities of these accelerators and achieve
superior performance.
* `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__
@@ -19,9 +19,9 @@ superior performance.
your AMD Instinct MI300X system for performance.
* :doc:`/how-to/rocm-for-ai/inference-optimization/workload` covers steps to
optimize the performance of AMD Instinct MI300X Series GPUs for HPC
optimize the performance of AMD Instinct MI300X series accelerators for HPC
and deep learning operations.
* :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm` introduces a preconfigured
environment for LLM inference, designed to help you test performance with
popular models on AMD Instinct MI300X Series GPUs.
popular models on AMD Instinct MI300X series accelerators.

View File

@@ -25,7 +25,7 @@ execute on AMD GPUs while maintaining compatibility with CUDA-based systems.
OpenCL (Open Computing Language) is an open standard for cross-platform,
parallel programming of diverse processors. ROCm supports OpenCL for developers
who want to use standard frameworks across different hardware platforms,
including CPUs, GPUs, and APUs. For more information, see
including CPUs, GPUs, and other accelerators. For more information, see
`OpenCL <https://www.khronos.org/opencl/>`_.
Python bindings can be found at https://github.com/ROCm/hip-python.

View File

@@ -11,10 +11,10 @@ Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and
ROCm-aware versions of :doc:`deep learning frameworks <../../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.
Single-accelerator systems, such as a machine equipped with a single GPU, are commonly used for
Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately
sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`.
Multi-accelerator systems, on the other hand, consist of multiple GPUs working in parallel. These systems are
Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are
typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of
massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`.

View File

@@ -3,11 +3,11 @@
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference, accelerators, PyTorch, HuggingFace, torchtune
*****************************************************
Fine-tuning and inference using multiple GPUs
Fine-tuning and inference using multiple accelerators
*****************************************************
This section explains how to fine-tune a model on a multi-accelerator system. See
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single GPU setup.
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single accelerator or GPU setup.
.. _fine-tuning-llms-multi-gpu-env:
@@ -20,7 +20,7 @@ This section was tested using the following hardware and software environment.
:stub-columns: 1
* - Hardware
- 4 AMD Instinct MI300X GPUs
- 4 AMD Instinct MI300X accelerators
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
@@ -40,13 +40,13 @@ Setting up the base implementation environment
:doc:`PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`. For consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCm-capable GPUs using the following command.
#. In the Docker container, check the availability of ROCm-capable accelerators using the following command.
.. code-block:: shell
rocm-smi --showproductname
#. Check that your GPUs are available to PyTorch.
#. Check that your accelerators are available to PyTorch.
.. code-block:: python
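   # A minimal sketch (not the original snippet) of confirming that PyTorch can see
   # the accelerators inside the container; ROCm builds of PyTorch expose them
   # through the torch.cuda API.
   import torch

   print(torch.cuda.is_available())    # True when at least one accelerator is visible
   print(torch.cuda.device_count())    # for example, 4 on the system described above
   for i in range(torch.cuda.device_count()):
       print(torch.cuda.get_device_name(i))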
@@ -66,7 +66,7 @@ Setting up the base implementation environment
.. tip::
During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal.
This tool helps you see shows which GPUs are involved.
This tool helps you see which accelerators or GPUs are involved.
.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate:
@@ -74,9 +74,9 @@ Setting up the base implementation environment
Hugging Face Accelerate for fine-tuning and inference
===========================================================
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`__ is a library that simplifies turning raw
PyTorch code for a single GPU into code for multiple GPUs for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`__, so you can scale your PyTorch
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_ is a library that simplifies turning raw
PyTorch code for a single accelerator into code for multiple accelerators for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`_, allowing you to scale your PyTorch
code while maintaining performance and flexibility.
As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama
@@ -107,7 +107,7 @@ Now, it's important to adjust how you load the model. Add the ``device_map`` par
(``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``).
It's recommended to set the ``device_map`` parameter to ``"auto"`` to allow Accelerate to automatically and
efficiently allocate the model given the available resources (four GPUs in this case).
efficiently allocate the model given the available resources (4 accelerators in this case).
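As a brief, hedged illustration of this step (the checkpoint name and dtype below are placeholders rather than values prescribed by this guide), loading the model with ``device_map="auto"`` looks like the following.

.. code-block:: python

   # Sketch only: the model ID and dtype are illustrative assumptions.
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(
       model_id,
       torch_dtype=torch.float16,
       device_map="auto",  # let Accelerate place layers across the available accelerators
   )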
When you have more GPU memory available than the model size, here is the difference between each ``device_map``
option:
@@ -130,8 +130,8 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference
=============================================
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs.
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-accelerator or
GPU model fine-tuning and inference with LLMs.
#. Install torchtune using pip.

View File

@@ -30,7 +30,7 @@ The challenge of fine-tuning models
However, the computational cost of fine-tuning is still high, especially for complex models and large datasets, which
poses distinct challenges related to substantial computational and memory requirements. This might be a barrier for
GPUs with low computing power or limited device memory resources.
accelerators or GPUs with low computing power or limited device memory resources.
For example, suppose we have a language model with 7 billion (7B) parameters, represented by a weight matrix :math:`W`.
During backpropagation, the model needs to learn a :math:`ΔW` matrix, which updates the original weights to minimize the
@@ -84,8 +84,8 @@ Walkthrough
===========
To demonstrate the benefits of LoRA and the ideal compute compatibility of using PEFT and TRL libraries on AMD
ROCm-compatible GPUs, let's step through a comprehensive implementation of the fine-tuning process
using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X GPUs.
ROCm-compatible accelerators and GPUs, let's step through a comprehensive implementation of the fine-tuning process
using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X accelerators.
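As a rough sketch of how such an adapter is typically attached with the PEFT library (the rank, alpha, and target module names below are illustrative assumptions, not the walkthrough's settings):

.. code-block:: python

   # Sketch only: hyperparameters and target modules are assumed for illustration.
   from peft import LoraConfig, get_peft_model
   from transformers import AutoModelForCausalLM

   base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
   lora_config = LoraConfig(
       r=8,                                   # rank of the low-rank update
       lora_alpha=16,                         # scaling applied to the update
       target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
       lora_dropout=0.05,
       task_type="CAUSAL_LM",
   )
   model = get_peft_model(base_model, lora_config)
   model.print_trainable_parameters()  # only the small adapter matrices remain trainable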
Before starting, review and understand the key components of this walkthrough:

View File

@@ -3,11 +3,12 @@
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference, SFTTrainer
****************************************************
Fine-tuning and inference using a single GPU
Fine-tuning and inference using a single accelerator
****************************************************
This section explains model fine-tuning and inference techniques on a single-accelerator system. See
:doc:`Multi-accelerator fine-tuning <multi-gpu-fine-tuning-and-inference>` for a setup with multiple GPUs.
:doc:`Multi-accelerator fine-tuning <multi-gpu-fine-tuning-and-inference>` for a setup with multiple accelerators or
GPUs.
.. _fine-tuning-llms-single-gpu-env:
@@ -20,7 +21,7 @@ This section was tested using the following hardware and software environment.
:stub-columns: 1
* - Hardware
- AMD Instinct MI300X GPU
- AMD Instinct MI300X accelerator
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
@@ -40,7 +41,7 @@ Setting up the base implementation environment
:doc:`PyTorch installation guide <rocm-install-on-linux:install/3rd-party/pytorch-install>`. For a consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCm-capable GPUs using the following command.
#. In the Docker container, check the availability of ROCm-capable accelerators using the following command.
.. code-block:: shell
@@ -52,14 +53,14 @@ Setting up the base implementation environment
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card Series: AMD Instinct MI300X OAM
GPU[0] : Card series: AMD Instinct MI300X OAM
GPU[0] : Card model: 0x74a1
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: MI3SRIOV
==========================================================================================
================================== End of ROCm SMI Log ===================================
#. Check that your GPUs are available to PyTorch.
#. Check that your accelerators are available to PyTorch.
.. code-block:: python
@@ -501,9 +502,9 @@ Let's look at achieving model inference using these types of models.
# Token generation
print(pipe("What is a large language model?")[0]["generated_text"])
If using multiple GPUs, see
If using multiple accelerators, see
:ref:`Multi-accelerator fine-tuning and inference <fine-tuning-llms-multi-gpu-hugging-face-accelerate>` to explore
popular libraries that simplify fine-tuning and inference in a multiple-GPU system.
popular libraries that simplify fine-tuning and inference in a multi-accelerator system.
Read more about inference frameworks like vLLM and Hugging Face TGI in
:doc:`LLM inference frameworks <../inference/llm-inference-frameworks>`.

View File

@@ -45,7 +45,7 @@ ROCm provides two different implementations of Flash Attention 2 modules. They c
# Install from source
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/
GPU_ARCHS=gfx942 python setup.py install #MI300 Series
GPU_ARCHS=gfx942 python setup.py install #MI300 series
Hugging Face Transformers can easily deploy the CK Flash Attention 2 module by passing an argument
``attn_implementation="flash_attention_2"`` in the ``from_pretrained`` class.
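A minimal, hedged sketch of that argument in use (the checkpoint name is a placeholder):

.. code-block:: python

   # Sketch only: enabling the Flash Attention 2 backend at load time.
   import torch
   from transformers import AutoModelForCausalLM

   model = AutoModelForCausalLM.from_pretrained(
       "meta-llama/Llama-2-7b-hf",               # placeholder checkpoint
       torch_dtype=torch.float16,
       attn_implementation="flash_attention_2",  # use the Flash Attention 2 module
   )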
@@ -526,7 +526,7 @@ follow these instructions:
python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py
To run the FBGEMM_GPU ``uvm`` test, use these commands. These tests only support the AMD MI210 and
more recent GPUs.
more recent accelerators.
.. code-block:: shell

View File

@@ -7,7 +7,7 @@ Model quantization techniques
*****************************
Quantization reduces the model size compared to its native full-precision version, making it easier to fit large models
onto GPUs with limited memory usage. This section explains how to perform LLM quantization using AMD Quark, GPTQ
onto accelerators or GPUs with limited memory usage. This section explains how to perform LLM quantization using AMD Quark, GPTQ
and bitsandbytes on AMD Instinct hardware.
.. _quantize-llms-quark:
@@ -311,7 +311,7 @@ ExLlama-v2 support
ExLlama is a Python/C++/CUDA implementation of the Llama model that is
designed for faster inference with 4-bit GPTQ weights. The ExLlama
kernel is activated by default when users create a ``GPTQConfig`` object. To
boost inference speed even further on Instinct GPUs, use the ExLlama-v2
boost inference speed even further on Instinct accelerators, use the ExLlama-v2
kernels by configuring the ``exllama_config`` parameter as follows.
.. code-block:: python
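   # Sketch only (not the original snippet): enabling the ExLlama-v2 kernels through
   # the exllama_config parameter described above; the checkpoint name is a placeholder.
   from transformers import AutoModelForCausalLM, GPTQConfig

   gptq_config = GPTQConfig(bits=4, exllama_config={"exllama_version": 2})
   model = AutoModelForCausalLM.from_pretrained(
       "TheBloke/Llama-2-7B-GPTQ",       # placeholder 4-bit GPTQ checkpoint
       quantization_config=gptq_config,
       device_map="auto",
   )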
@@ -332,7 +332,7 @@ The `ROCm-aware bitsandbytes <https://github.com/ROCm/bitsandbytes>`_ library is
a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizer, matrix multiplication, and
8-bit and 4-bit quantization functions. The library includes quantization primitives for 8-bit and 4-bit operations
through ``bitsandbytes.nn.Linear8bitLt`` and ``bitsandbytes.nn.Linear4bit`` and 8-bit optimizers through the
``bitsandbytes.optim`` module. These modules are supported on AMD Instinct GPUs.
``bitsandbytes.optim`` module. These modules are supported on AMD Instinct accelerators.
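As a small, hedged sketch of the modules named above (the layer dimensions and learning rate are arbitrary assumptions):

.. code-block:: python

   # Sketch only: a toy 4-bit linear layer and an 8-bit optimizer.
   import bitsandbytes as bnb

   layer = bnb.nn.Linear4bit(1024, 1024)   # weights are quantized when moved to the accelerator
   optimizer = bnb.optim.Adam8bit(layer.parameters(), lr=1e-4)  # 8-bit optimizer state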
Installing bitsandbytes
-----------------------

View File

@@ -9,13 +9,13 @@ myst:
The AMD ROCm Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.
This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X GPUs using CK.
This article gives a high-level overview of the CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct and run the kernel. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X accelerators using CK.
## High-level overview: a CK GEMM instance
GEMM is a fundamental block in linear algebra, machine learning, and deep neural networks. It is defined as the operation:
{math}`E = α \times (A \times B) + β \times (D)`, with A and B as matrix inputs, α and β as scalar inputs, and D as a pre-existing matrix.
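As a quick numeric sketch of this definition (plain PyTorch rather than CK code; the shapes are arbitrary assumptions):

```python
# Sketch only: verifying E = alpha * (A @ B) + beta * D with arbitrary shapes.
import torch

alpha, beta = 1.0, 1.0
A = torch.randn(128, 64)   # input activation
B = torch.randn(64, 256)   # weight
D = torch.randn(128, 256)  # bias / pre-existing matrix
E = alpha * (A @ B) + beta * D
print(E.shape)  # torch.Size([128, 256])
```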
Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct GPUs for the computation of GEMM. The implementation of the instance contains two phases:
Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from the CK library as the fundamental instance to explore the compute capability of AMD Instinct accelerators for the computation of GEMM. The implementation of the instance contains two phases:
- [Template parameter definition](#template-parameter-definition)
- [Instantiating and running the templated kernel](#instantiating-and-running-the-templated-kernel)
@@ -108,7 +108,7 @@ These parameters include Block Size, M/N/K Per Block, M/N per XDL, AK1, BK1, etc
- Block Size determines the number of threads in the thread block.
- M/N/K Per Block determines the size of tile that each thread block is responsible for calculating.
- M/N Per XDL refers to M/N size for Instinct GPU Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis.
- M/N Per XDL refers to M/N size for Instinct accelerator Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis.
- A/B K1 is related to the data type. It can be any value ranging from 1 to K Per Block. To achieve optimal load/store performance, 128-bit per load is suggested. In addition, the A/B loading parameters must be changed accordingly to match the A/B K1 value; otherwise, it will result in compilation errors.
Conditions for achieving computational load balancing on different hardware platforms can vary.
@@ -133,7 +133,7 @@ Templated kernel launching consists of kernel instantiation, making arguments by
## Developing fused INT8 kernels for SmoothQuant models
[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct GPUs are shown in the following table.
[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLM. The required GPU kernel functionalities used to accelerate the inference of SQ models on Instinct accelerators are shown in the following table.
:::{table} Functionalities used to implement SmoothQuant model inference.
@@ -164,7 +164,7 @@ The CK library contains many fundamental instances that implement different func
Second, consider whether the format of input data meets your actual calculation needs. For SQ models, the 8-bit integer data format (INT8) is applied for matrix calculations.
Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct GPUs after being compiled and cannot run on Radeon-Series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances.
Third, consider the platform for implementing CK instances. The instances suffixed with `xdl` only run on AMD Instinct accelerators after being compiled and cannot run on Radeon-series GPUs. This is due to the underlying device-specific instruction sets for implementing these basic instances.
Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_kernel/tree/develop/example/24_batched_gemm) as the fundamental instance to implement the functionalities in the previous table.
@@ -435,7 +435,7 @@ The implementation architecture of running SmoothQuant models on MI300X GPUs is
### Figure 7
================ -->
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-inference_flow.jpg
The implementation architecture of running SmoothQuant models on AMD MI300X GPUs.
The implementation architecture of running SmoothQuant models on AMD MI300X accelerators.
```
For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-smoothquant), each decoder layer contains three major components: attention calculation, layer normalization, and linear transformation in fully connected layers. The corresponding implementation classes for these components are:
@@ -447,21 +447,21 @@ For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-s
These classes' underlying implementation logic will harness the functions in the previous table. Note that for the example, the `LayerNormQ` module is implemented by the native torch module.
Testing environment:
The hardware platform used for testing equips with 256 AMD EPYC 9534 64-Core Processor, 8 AMD Instinct MI300X GPUs and 1.5T memory. The testing was done in a publicly available Docker image from Docker Hub:
The hardware platform used for testing is equipped with 256 AMD EPYC 9534 64-Core Processor cores, 8 AMD Instinct MI300X accelerators, and 1.5 TB of memory. The testing was done in a publicly available Docker image from Docker Hub:
[`rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b)
The tested models are the OPT-1.3B, 2.7B, 6.7B, and 13B FP16 models, and the corresponding SmoothQuant INT8 OPT models were obtained from Hugging Face.
Note that since the default values were used for the tunable parameters of the fundamental instance, the performance of the INT8 kernel is suboptimal.
Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size.
Figure 8 shows the performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. The GPU memory footprints of SmoothQuant-quantized models are significantly reduced. It also indicates the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with an increase in model size.
<!--
================
### Figure 8
================ -->
```{figure} ../../../data/how-to/llm-fine-tuning-optimization/ck-comparisons.jpg
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X GPU.
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator.
```
For accuracy comparisons between the original FP16 and INT8 models, the evaluation is done by using the first 1,000 samples from the LAMBADA dataset's validation set. We employ the same Last Token Prediction Accuracy method introduced in [SmoothQuant Real-INT8 Inference for PyTorch](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb) as our evaluation metric. The comparison results are shown in Table 2.
@@ -482,4 +482,4 @@ CK provides a rich set of template parameters for generating flexible accelerate
CK supports multiple instruction sets of AMD Instinct GPUs, operator fusion and different data precisions. Its composability helps users quickly construct operator performance verification.
With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD GPU platforms.
With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD accelerator platforms.

View File

@@ -1,15 +1,15 @@
.. meta::
:description: Learn about workload tuning on AMD Instinct MI300X GPUs for optimal performance.
:description: Learn about workload tuning on AMD Instinct MI300X accelerators for optimal performance.
:keywords: AMD, Instinct, MI300X, HPC, tuning, BIOS settings, NBIO, ROCm,
environment variable, performance, HIP, Triton, PyTorch TunableOp, vLLM, RCCL,
MIOpen, GPU, resource utilization
MIOpen, accelerator, GPU, resource utilization
*****************************************
AMD Instinct MI300X workload optimization
*****************************************
This document provides guidelines for optimizing the performance of AMD
Instinct™ MI300X GPUs, with a particular focus on GPU kernel
Instinct™ MI300X accelerators, with a particular focus on GPU kernel
programming, high-performance computing (HPC), and deep learning operations
using PyTorch. It delves into specific workloads such as
:ref:`model inference <mi300x-vllm-optimization>`, offering strategies to
@@ -24,7 +24,7 @@ Workload tuning strategy
By following a structured approach, you can systematically address
performance issues and enhance the efficiency of your workloads on AMD Instinct
MI300X GPUs.
MI300X accelerators.
Measure the current workload
----------------------------
@@ -238,7 +238,7 @@ benchmarking process.
With AMD's profiling tools, developers are able to gain important insight into how efficiently their application is
using hardware resources and effectively diagnose potential bottlenecks contributing to poor performance. Developers
working with AMD Instinct GPUs have multiple tools depending on their specific profiling needs; these include:
working with AMD Instinct accelerators have multiple tools depending on their specific profiling needs; these include:
* :ref:`ROCProfiler <mi300x-rocprof>`
@@ -256,11 +256,11 @@ metrics, commonly called *performance counters*. These counters quantify the per
showcasing which pieces of the computational pipeline and memory hierarchy are being utilized.
Your ROCm installation contains a script or executable command called ``rocprof`` which provides the ability to list all
available hardware counters for your specific GPU, and run applications while collecting counters during
available hardware counters for your specific accelerator or GPU, and run applications while collecting counters during
their execution.
This ``rocprof`` utility also depends on the :doc:`ROCTracer and ROC-TX libraries <roctracer:index>`, giving it the
ability to collect timeline traces of the GPU software stack as well as user-annotated code regions.
ability to collect timeline traces of the accelerator software stack as well as user-annotated code regions.
.. note::
@@ -275,16 +275,16 @@ ROCm Compute Profiler
^^^^^^^^^^^^^^^^^^^^^
:doc:`ROCm Compute Profiler <rocprofiler-compute:index>` is a system performance profiler for high-performance computing (HPC) and
machine learning (ML) workloads using Instinct GPUs. Under the hood, ROCm Compute Profiler uses
machine learning (ML) workloads using Instinct accelerators. Under the hood, ROCm Compute Profiler uses
:ref:`ROCProfiler <mi300x-rocprof>` to collect hardware performance counters. The ROCm Compute Profiler tool performs
system profiling based on all approved hardware counters for Instinct
GPU architectures. It provides high level performance analysis features including System Speed-of-Light, IP
accelerator architectures. It provides high-level performance analysis features including System Speed-of-Light, IP
block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more.
ROCm Compute Profiler takes the guesswork out of profiling by removing the need to provide text input files with lists of counters
to collect and analyze raw CSV output files as is the case with ROCProfiler. Instead, ROCm Compute Profiler automates the collection
of all available hardware counters in one command and provides graphical interfaces to help users understand and
analyze bottlenecks and stressors for their computational workloads on AMD Instinct GPUs.
analyze bottlenecks and stressors for their computational workloads on AMD Instinct accelerators.
.. note::
@@ -557,7 +557,7 @@ ROCm library tuning involves optimizing the performance of routine computational
operations (such as ``GEMM``) provided by ROCm libraries like
:ref:`hipBLASLt <mi300x-hipblaslt>`, :ref:`Composable Kernel <mi300x-ck>`,
:ref:`MIOpen <mi300x-miopen>`, and :ref:`RCCL <mi300x-rccl>`. This tuning aims
to maximize efficiency and throughput on Instinct MI300X GPUs to gain
to maximize efficiency and throughput on Instinct MI300X accelerators to gain
improved application performance.
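For example, a hedged sketch of letting PyTorch's TunableOp mechanism benchmark hipBLASLt GEMM solutions at runtime (the environment variable names follow PyTorch's TunableOp documentation and should be treated as assumptions here):

.. code-block:: python

   # Sketch only: enable TunableOp before running GEMM-heavy code so the fastest
   # available GEMM solution is selected and cached for later runs.
   import os
   os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
   os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"

   import torch
   a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
   b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
   c = a @ b  # this GEMM is tuned on first use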
.. _mi300x-library-gemm:
@@ -1091,7 +1091,7 @@ you can only use a fraction of the potential bandwidth on the node.
The following figure shows an
:doc:`MI300X node-level architecture </conceptual/gpu-arch/mi300>` of a
system with AMD EPYC processors in a dual-socket configuration and eight
AMD Instinct MI300X GPUs. The MI300X OAMs attach to the host system via
AMD Instinct MI300X accelerators. The MI300X OAMs attach to the host system via
PCIe Gen 5 x16 links (yellow lines). The GPUs use seven high-bandwidth,
low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected
8-GPU system.
@@ -1100,7 +1100,7 @@ low-latency AMD Infinity Fabric™ links (red lines) to form a fully connected
.. figure:: ../../../data/shared/mi300-node-level-arch.png
MI300 Series node-level architecture showing 8 fully interconnected MI300X
MI300 series node-level architecture showing 8 fully interconnected MI300X
OAM modules connected to (optional) PCIe switches via re-timers and HGX
connectors.
@@ -1293,7 +1293,7 @@ Auto-tunable kernel configuration involves adjusting memory access and computati
resources assigned to each compute unit. It encompasses the usage of
:ref:`LDS <mi300x-cu-fig>`, register, and task scheduling on a compute unit.
The GPU contains global memory, local data share (LDS), and
The accelerator or GPU contains global memory, local data share (LDS), and
registers. Global memory has high access latency, but is large. LDS access has
much lower latency, but is smaller. It is a fast on-CU software-managed memory
that can be used to efficiently share data between all work items in a block.
@@ -1306,11 +1306,11 @@ Register access is the fastest yet smallest among the three.
Schematic representation of a CU in the CDNA2 or CDNA3 architecture.
The following is a list of kernel arguments used for tuning performance and
resource allocation on AMD GPUs, which helps in optimizing the
resource allocation on AMD accelerators, which helps in optimizing the
efficiency and throughput of various computational kernels.
``num_stages=n``
Adjusts the number of pipeline stages for different types of kernels. On AMD GPUs, set ``num_stages``
Adjusts the number of pipeline stages for different types of kernels. On AMD accelerators, set ``num_stages``
according to the following rules:
* For kernels with a single GEMM, set to ``2``.
@@ -1337,15 +1337,15 @@ efficiency and throughput of various computational kernels.
* The occupancy of the kernel is limited by VGPR usage, and
* The current VGPR usage is only a few above a boundary in
:ref:`Occupancy related to VGPR usage in an Instinct MI300X GPU <mi300x-occupancy-vgpr-table>`.
:ref:`Occupancy related to VGPR usage in an Instinct MI300X accelerator <mi300x-occupancy-vgpr-table>`.
.. _mi300x-occupancy-vgpr-table:
.. figure:: ../../../data/shared/occupancy-vgpr.png
:alt: Occupancy related to VGPR usage in an Instinct MI300X GPU.
:alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator.
:align: center
Occupancy related to VGPRs usage on an Instinct MI300X GPU
Occupancy related to VGPRs usage on an Instinct MI300X accelerator
For example, according to the table, each Execution Unit (EU) has 512 available
VGPRs, which are allocated in blocks of 16. If the current VGPR usage is 170,
@@ -1370,7 +1370,7 @@ VGPR usage so that it might fit 3 waves per EU.
- ``matrix_instr_nonkdim = 32``: ``mfma_32x32`` is used.
For GEMM kernels on an MI300X GPU, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large
For GEMM kernels on an MI300X accelerator, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large
tile/GEMM sizes.
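To make these arguments concrete, the following hedged sketch shows how they are commonly passed through a Triton autotuning config (the block sizes and values are illustrative assumptions, not tuned recommendations for any specific kernel):

.. code-block:: python

   # Sketch only: illustrative values for the tuning arguments discussed above.
   import triton

   configs = [
       triton.Config(
           {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64,
            "waves_per_eu": 2,             # occupancy hint per execution unit
            "matrix_instr_nonkdim": 16},   # prefer mfma_16x16 instructions
           num_stages=2,                   # kernel contains a single GEMM
           num_warps=8,
       ),
   ]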
@@ -1389,7 +1389,7 @@ the number of CUs a kernel can distribute its task across.
XCD-level system architecture showing 40 compute units,
each with 32 KB L1 cache, a unified compute system with 4 ACE compute
GPUs, shared 4MB of L2 cache, and a hardware scheduler (HWS).
accelerators, shared 4MB of L2 cache, and a hardware scheduler (HWS).
You can query hardware resources with the command ``rocminfo`` in the
``/opt/rocm/bin`` directory. For instance, query the number of CUs, number of

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
.. list-table::
:header-rows: 1
@@ -47,7 +47,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-812>` for
MI300X Series GPUs.
MI300X series accelerators.
What's new
==========
@@ -139,7 +139,7 @@ page provides reference throughput and serving measurements for inferencing popu
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -424,7 +424,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -21,8 +21,8 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
.. list-table::
@@ -38,7 +38,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-909>` for
MI300X Series accelerators.
MI300X series accelerators.
What's new
==========
@@ -430,7 +430,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
a brief introduction to vLLM and optimization strategies.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified
ROCm Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment designed for validating large language model
(LLM) inference performance on the AMD Instinct™ MI300X GPU. This
(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
MI300X GPU and includes the following components:
MI300X accelerator and includes the following components:
* `ROCm 6.2.0 <https://github.com/ROCm/ROCm>`_
@@ -31,7 +31,7 @@ MI300X GPU and includes the following components:
* Tuning files (in CSV format)
With this Docker image, you can quickly validate the expected inference
performance numbers on the MI300X GPU. This topic also provides tips on
performance numbers on the MI300X accelerator. This topic also provides tips on
optimizing performance with popular AI models.
.. _vllm-benchmark-vllm:
@@ -51,7 +51,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X GPU with the prebuilt vLLM Docker image.
MI300X accelerator with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -267,7 +267,7 @@ Options
.. _vllm-benchmark-run-benchmark-v043:
Running the benchmark on the MI300X GPU
Running the benchmark on the MI300X accelerator
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -328,7 +328,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the unified
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the unified
ROCm Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment designed for validating large language model
(LLM) inference performance on the AMD Instinct™ MI300X GPU. This
(LLM) inference performance on the AMD Instinct™ MI300X accelerator. This
ROCm vLLM Docker image integrates vLLM and PyTorch tailored specifically for the
MI300X GPU and includes the following components:
MI300X accelerator and includes the following components:
* `ROCm 6.2.1 <https://github.com/ROCm/ROCm>`_
@@ -31,7 +31,7 @@ MI300X GPU and includes the following components:
* Tuning files (in CSV format)
With this Docker image, you can quickly validate the expected inference
performance numbers on the MI300X GPU. This topic also provides tips on
performance numbers on the MI300X accelerator. This topic also provides tips on
optimizing performance with popular AI models.
.. hlist::
@@ -74,7 +74,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X GPU with the prebuilt vLLM Docker image.
MI300X accelerator with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -332,7 +332,7 @@ Options
.. _vllm-benchmark-run-benchmark-v064:
Running the benchmark on the MI300X GPU
Running the benchmark on the MI300X accelerator
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -398,7 +398,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ LLM inference performance validation on AMD Instinct MI300X
The `ROCm vLLM Docker <https://hub.docker.com/r/rocm/vllm/tags>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on the AMD Instinct™ MI300X GPU. This ROCm vLLM
inference performance on the AMD Instinct™ MI300X accelerator. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for the MI300X
GPU and includes the following components:
accelerator and includes the following components:
* `ROCm 6.3.1 <https://github.com/ROCm/ROCm>`_
@@ -29,7 +29,7 @@ GPU and includes the following components:
* `PyTorch 2.7.0 (2.7.0a0+git3a58512) <https://github.com/pytorch/pytorch>`_
With this Docker image, you can quickly validate the expected inference
performance numbers for the MI300X GPU. This topic also provides tips on
performance numbers for the MI300X accelerator. This topic also provides tips on
optimizing performance with popular AI models. For more information, see the lists of
:ref:`available models for MAD-integrated benchmarking <vllm-benchmark-mad-v066-models>`
and :ref:`standalone benchmarking <vllm-benchmark-standalone-v066-options>`.
@@ -47,7 +47,7 @@ Getting started
===============
Use the following procedures to reproduce the benchmark results on an
MI300X GPU with the prebuilt vLLM Docker image.
MI300X accelerator with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -377,7 +377,7 @@ Options and available models
.. _vllm-benchmark-run-benchmark-v066:
Running the benchmark on the MI300X GPU
Running the benchmark on the MI300X accelerator
-----------------------------------------------
Here are some examples of running the benchmark with various options.
@@ -443,7 +443,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPU. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerator. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v073>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-v073:
@@ -110,7 +110,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -122,7 +122,7 @@ vLLM inference performance testing
===============
Use the following procedures to reproduce the benchmark results on an
MI300X GPU with the prebuilt vLLM Docker image.
MI300X accelerator with the prebuilt vLLM Docker image.
.. _vllm-benchmark-get-started:
@@ -311,7 +311,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -18,9 +18,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -32,7 +32,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v083>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-v083:
@@ -105,7 +105,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -327,7 +327,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- To learn how to run community models from Hugging Face on AMD GPUs, see
:doc:`Running models from Hugging Face </how-to/rocm-for-ai/inference/hugging-face-models>`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v085-20250513>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-v085-20250513:
@@ -114,7 +114,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the :doc:`latest version of this inference benchmarking environment <../vllm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -333,7 +333,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v085-20250521>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-v085-20250521:
@@ -114,7 +114,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -333,7 +333,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-v0901-20250605>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-v0901-20250605:
@@ -113,7 +113,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -332,7 +332,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
* `ROCm {{ unified_docker.rocm_version }} <https://github.com/ROCm/ROCm>`_
@@ -37,7 +37,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-20250702>` for
MI300X Series GPUs.
MI300X series accelerators.
.. _vllm-benchmark-available-models-20250702:
@@ -113,7 +113,7 @@ vLLM inference performance testing
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
Advanced features and known issues
==================================
@@ -332,7 +332,7 @@ Further reading
see `<https://github.com/ROCm/vllm/tree/main/benchmarks>`_.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
@@ -23,9 +23,9 @@ vLLM inference performance testing
The `ROCm vLLM Docker <{{ unified_docker.docker_hub_url }}>`_ image offers
a prebuilt, optimized environment for validating large language model (LLM)
inference performance on AMD Instinct™ MI300X Series GPUs. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X Series
GPUs and includes the following components:
inference performance on AMD Instinct™ MI300X series accelerators. This ROCm vLLM
Docker image integrates vLLM and PyTorch tailored specifically for MI300X series
accelerators and includes the following components:
.. list-table::
:header-rows: 1
@@ -47,7 +47,7 @@ vLLM inference performance testing
With this Docker image, you can quickly test the :ref:`expected
inference performance numbers <vllm-benchmark-performance-measurements-715>` for
MI300X Series GPUs.
MI300X series accelerators.
What's new
==========
@@ -145,7 +145,7 @@ page provides reference throughput and latency measurements for inferencing popu
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`_
only reflects the latest version of this inference benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -429,7 +429,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the
ROCm PyTorch Docker image.
:keywords: model, MAD, automation, dashboarding, validate, pytorch
@@ -15,7 +15,7 @@ PyTorch inference performance testing
{% set model_groups = data.pytorch_inference_benchmark.model_groups %}
The `ROCm PyTorch Docker <https://hub.docker.com/r/rocm/pytorch/tags>`_ image offers a prebuilt,
optimized environment for testing model inference performance on AMD Instinct™ MI300X Series
optimized environment for testing model inference performance on AMD Instinct™ MI300X series
GPUs. This guide demonstrates how to use the AMD Model Automation and Dashboarding (MAD)
tool with the ROCm PyTorch container to test inference performance on various models efficiently.
@@ -175,7 +175,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`../../inference-optimization/workload`.

View File

@@ -22,7 +22,7 @@ improved efficiency and throughput.
`SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
serving engine for large language models (LLMs) and vision models. The
ROCm-enabled `SGLang base Docker image <{{ docker.docker_hub_url }}>`__
bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X Series
bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series
GPUs. It includes the following software components:
.. list-table::
@@ -37,7 +37,7 @@ improved efficiency and throughput.
{% endfor %}
The following guides describe setting up and running SGLang and Mooncake for disaggregated
distributed inference on a Slurm cluster using AMD Instinct MI300X Series GPUs backed by
distributed inference on a Slurm cluster using AMD Instinct MI300X series GPUs backed by
Mellanox CX-7 NICs.
Prerequisites
@@ -236,7 +236,7 @@ Further reading
- See the base upstream Docker image on `Docker Hub <https://hub.docker.com/layers/lmsysorg/sglang/v0.5.2rc1-rocm700-mi30x/images/sha256-10c4ee502ddba44dd8c13325e6e03868bfe7f43d23d0a44780a8ee8b393f4729>`__.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and SGLang
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and SGLang
:keywords: model, MAD, automation, dashboarding, validate
*****************************************************************
@@ -15,8 +15,8 @@ SGLang inference performance testing DeepSeek-R1-Distill-Qwen-32B
`SGLang <https://docs.sglang.ai>`__ is a high-performance inference and
serving engine for large language models (LLMs) and vision models. The
ROCm-enabled `SGLang Docker image <{{ docker.docker_hub_url }}>`__
bundles SGLang with PyTorch, optimized for AMD Instinct MI300X Series
GPUs. It includes the following software components:
bundles SGLang with PyTorch, optimized for AMD Instinct MI300X series
accelerators. It includes the following software components:
.. list-table::
:header-rows: 1
@@ -255,7 +255,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`__.
- For application performance optimization strategies for HPC and AI workloads,
including inference with vLLM, see :doc:`/how-to/rocm-for-ai/inference-optimization/workload`.

View File

@@ -1,5 +1,5 @@
.. meta::
:description: Learn how to validate LLM inference performance on MI300X GPUs using AMD MAD and the ROCm vLLM Docker image.
:description: Learn how to validate LLM inference performance on MI300X accelerators using AMD MAD and the ROCm vLLM Docker image.
:keywords: model, MAD, automation, dashboarding, validate
**********************************
@@ -457,7 +457,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- See :ref:`fine-tuning-llms-vllm` and :ref:`mi300x-vllm-optimization` for
a brief introduction to vLLM and optimization strategies.

View File

@@ -44,9 +44,9 @@ Validating vLLM performance
---------------------------
ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
on the MI300X GPU. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
on the MI300X accelerator. The Docker image includes ROCm, vLLM, PyTorch, and tuning files in the CSV
format. For more information, see the guide to
`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X GPU <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_
`LLM inference performance testing with vLLM on the AMD Instinct™ MI300X accelerator <https://github.com/ROCm/MAD/blob/develop/benchmark/vllm/README.md>`_
on the ROCm GitHub repository.
.. _rocm-for-ai-serve-hugging-face-tgi:
@@ -61,7 +61,7 @@ The `Hugging Face Text Generation Inference <https://huggingface.co/docs/text-ge
TGI installation
----------------
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct GPUs is to use the official Docker image at
The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at
`<https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference>`__.
TGI walkthrough

View File

@@ -10,7 +10,7 @@ Running models from Hugging Face
transformer models. Hugging Face models and tools significantly enhance productivity, performance, and accessibility in
developing and deploying AI solutions.
This section describes how to run popular community transformer models from Hugging Face on AMD GPUs.
This section describes how to run popular community transformer models from Hugging Face on AMD accelerators and GPUs.
.. _rocm-for-ai-hugging-face-transformers:
@@ -62,11 +62,11 @@ Using Hugging Face with Optimum-AMD
Optimum-AMD is the interface between Hugging Face libraries and the ROCm software stack.
For a deeper dive into using Hugging Face libraries on AMD GPUs, refer to the
For a deeper dive into using Hugging Face libraries on AMD accelerators and GPUs, refer to the
`Optimum-AMD <https://huggingface.co/docs/optimum/main/en/amd/amdgpu/overview>`_ page on Hugging Face for guidance on
using Flash Attention 2, GPTQ quantization and the ONNX Runtime integration.
Hugging Face libraries natively support AMD Instinct GPUs. For other
Hugging Face libraries natively support AMD Instinct accelerators. For other
:doc:`ROCm-capable hardware <rocm-install-on-linux:reference/system-requirements>`, support is currently not
validated, but most features are expected to work without issues.
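Because ROCm builds of PyTorch expose Instinct accelerators through the usual ``cuda`` device interface, running a community model typically needs no AMD-specific code. The following is a minimal sketch assuming a working ROCm PyTorch install; the model name and prompt are placeholders.

.. code-block:: python

   # Minimal sketch: run a Hugging Face community model on an AMD accelerator.
   # On ROCm, accelerators are addressed through torch.cuda, so device=0
   # selects the first device. The model name and prompt are examples only.
   import torch
   from transformers import pipeline

   device = 0 if torch.cuda.is_available() else -1  # -1 falls back to CPU
   generator = pipeline("text-generation", model="distilgpt2", device=device)
   print(generator("ROCm lets you", max_new_tokens=20)[0]["generated_text"])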
@@ -139,7 +139,7 @@ To enable `GPTQ <https://arxiv.org/abs/2210.17323>`_, hosted wheels are availabl
pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
Or, to install from source for AMD GPUs supporting ROCm, specify the ``ROCM_VERSION`` environment variable.
Or, to install from source for AMD accelerators supporting ROCm, specify the ``ROCM_VERSION`` environment variable.
.. code-block:: shell

View File

@@ -9,7 +9,7 @@ AI inference is a process of deploying a trained machine learning model to make
Understanding the ROCm™ software platform's architecture and capabilities is vital for running AI inference. By leveraging the ROCm platform's capabilities, you can harness the power of high-performance computing and efficient resource management to run inference workloads, leading to faster predictions and classifications on real-time data.
Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X GPUs and illustrate how to deploy trained models in production environments.
Throughout the following topics, this section provides a comprehensive guide to setting up and deploying AI inference on AMD GPUs. This includes instructions on how to install ROCm, how to use Hugging Face Transformers to manage pre-trained models for natural language processing (NLP) tasks, how to validate vLLM on AMD Instinct™ MI300X accelerators, and how to deploy trained models in production environments.
The AI Developer Hub contains `AMD ROCm tutorials <https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/>`_ for
training, fine-tuning, and inference. It leverages popular machine learning frameworks on AMD GPUs.

View File

@@ -60,7 +60,7 @@ Installing vLLM
vllm-rocm \
bash
3. Inside the container, start the API server to run on a single GPU on port 8000 using the following command.
3. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command.
.. code-block:: shell
@@ -113,7 +113,7 @@ Installing vLLM
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 &
4. To run multiple instances of API Servers, specify different ports for each server, and use ``ROCR_VISIBLE_DEVICES`` to
isolate each instance to a different GPU.
isolate each instance to a different accelerator.
For example, to run two API servers, one on port 8000 using GPUs 0 and 1, and one on port 8001 using GPUs 2 and 3, use
commands like the following.
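The launch commands themselves are shell invocations of the API server shown above. Purely as an illustration, a minimal Python sketch that starts two such instances, each isolated to its own accelerators with ``ROCR_VISIBLE_DEVICES``, might look like the following; the ports, device IDs, and model path mirror the example values, so adjust them for your system.

.. code-block:: python

   # Illustrative sketch: launch two vLLM API servers, each pinned to its own
   # pair of accelerators via ROCR_VISIBLE_DEVICES. Ports, device IDs, and the
   # model path follow the example above.
   import os
   import subprocess

   def launch_server(port, devices):
       env = os.environ.copy()
       env["ROCR_VISIBLE_DEVICES"] = devices  # isolate this instance to the listed devices
       return subprocess.Popen(
           ["python", "-m", "vllm.entrypoints.api_server",
            "--model", "/app/model", "--dtype", "float16",
            "-tp", "2", "--port", str(port)],
           env=env,
       )

   servers = [launch_server(8000, "0,1"), launch_server(8001, "2,3")]
   for server in servers:
       server.wait()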
@@ -140,7 +140,7 @@ Installing vLLM
See :ref:`mi300x-vllm-optimization` for performance optimization tips.
ROCm provides a prebuilt optimized Docker image for validating the performance of LLM inference with vLLM
on the MI300X GPU. The Docker image includes ROCm, vLLM, and PyTorch.
on the MI300X accelerator. The Docker image includes ROCm, vLLM, and PyTorch.
For more information, see :doc:`/how-to/rocm-for-ai/inference/benchmark-docker/vllm`.
.. _fine-tuning-llms-tgi:
@@ -178,7 +178,7 @@ Install TGI
.. tab-item:: TGI on a single-accelerator system
:sync: single
2. Inside the container, launch a model using TGI server on a single GPU.
2. Inside the container, launch a model using TGI server on a single accelerator.
.. code-block:: shell
@@ -199,7 +199,7 @@ Install TGI
.. tab-item:: TGI on a multi-accelerator system
2. Inside the container, launch a model using TGI server on multiple GPUs (four in this case).
2. Inside the container, launch a model using TGI server on multiple accelerators (4 in this case).
.. code-block:: shell

View File

@@ -1,6 +1,6 @@
.. meta::
:description: Multi-node setup for AI training
:keywords: gpu, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training
:keywords: gpu, accelerator, system, health, validation, bench, perf, performance, rvs, rccl, babel, mi300x, mi325x, flops, bandwidth, rbt, training
.. _rocm-for-ai-multi-node-setup:
@@ -21,7 +21,7 @@ Before starting, ensure your environment meets the following requirements:
* Multi-node networking: your cluster should have a configured multi-node network. For setup
instructions, see the `Multi-node network configuration for AMD Instinct
GPUs
accelerators
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
guide in the Instinct documentation.
@@ -54,8 +54,8 @@ Compile and install the RoCE library
If you're using Broadcom NICs, you need to compile and install the RoCE (RDMA
over Converged Ethernet) library. See `RoCE cluster network configuration guide
for AMD Instinct GPUs
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/roce-network-config.html>`__
for AMD Instinct accelerators
<https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/roce-network-config.html#roce-cluster-network-configuration-guide-for-amd-instinct-accelerators>`__
for more information.
See the `Ethernet networking guide for AMD
@@ -315,6 +315,6 @@ Megatron-LM
Further reading
===============
* `Multi-node network configuration for AMD Instinct GPUs <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
* `Multi-node network configuration for AMD Instinct accelerators <https://instinct.docs.amd.com/projects/gpu-cluster-networking/en/latest/how-to/multi-node-config.html>`__
* `Ethernet networking guide for AMD Instinct MI300X GPU clusters: Compiling Broadcom NIC software from source <https://docs.broadcom.com/doc/957608-AN2XX#page=81>`__

View File

@@ -340,7 +340,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -60,7 +60,7 @@ accelerate training workloads:
================
The following models are supported for training performance benchmarking with Megatron-LM and ROCm
on AMD Instinct MI300X Series GPUs.
on AMD Instinct MI300X series GPUs.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -137,7 +137,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X series GPUs with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements:
@@ -502,7 +502,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.
MI300X series GPUs with the AMD Megatron-LM environment.
Single node training
--------------------

View File

@@ -16,7 +16,7 @@ environment for the MPT-30B model using the ``rocm/pytorch-training:v25.5``
base `Docker image <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`_
and the `LLM Foundry <https://github.com/mosaicml/llm-foundry>`_ framework.
This environment packages the following software components to train
on AMD Instinct MI300X Series GPUs:
on AMD Instinct MI300X series accelerators:
+--------------------------+--------------------------------+
| Software component | Version |
@@ -182,7 +182,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X Series GPUs.
on AMD MI300X series accelerators.
The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.4``) image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
@@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic
.. _amd-maxtext-model-support-v254:
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
* Llama 3.1 8B

View File

@@ -17,10 +17,10 @@ MaxText is a high-performance, open-source framework built on the Google JAX
machine learning library to train LLMs at scale. The MaxText framework for
ROCm is an optimized fork of the upstream
`<https://github.com/AI-Hypercomputer/maxtext>`__ enabling efficient AI workloads
on AMD MI300X Series GPUs.
on AMD MI300X series accelerators.
The MaxText for ROCm training Docker (``rocm/jax-training:maxtext-v25.5``) image
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X GPUs,
provides a prebuilt environment for training on AMD Instinct MI300X and MI325X accelerators,
including essential components like JAX, XLA, ROCm libraries, and MaxText utilities.
It includes the following software components:
@@ -53,7 +53,7 @@ MaxText provides the following key features to train large language models effic
.. _amd-maxtext-model-support-v255:
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
* Llama 3.3 70B

View File

@@ -17,12 +17,12 @@ Training a model with ROCm Megatron-LM
The ROCm Megatron-LM framework is a specialized fork of the robust Megatron-LM, designed to
enable efficient training of large-scale language models on AMD GPUs. By leveraging AMD Instinct™ MI300X
GPUs, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
accelerators, AMD Megatron-LM delivers enhanced scalability, performance, and resource utilization for AI
workloads. It is purpose-built to :ref:`support models <amd-megatron-lm-model-support-24-12>`
like Meta's Llama 2, Llama 3, and Llama 3.1, enabling developers to train next-generation AI models with greater
efficiency. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
For ease of use, AMD provides a ready-to-use Docker image for MI300X GPUs containing essential
For ease of use, AMD provides a ready-to-use Docker image for MI300X accelerators containing essential
components, including PyTorch, PyTorch Lightning, ROCm libraries, and Megatron-LM utilities. It contains the
following software to accelerate training workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-24-12:
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
* Llama 2 7B
@@ -208,14 +208,14 @@ Use the following script to run the RCCL test for four MI300X GPU nodes. Modify
.. _mi300x-amd-megatron-lm-training-v2412:
Start training on MI300X GPUs
Start training on MI300X accelerators
=====================================
The pre-built ROCm Megatron-LM environment allows users to quickly validate system performance, conduct
training benchmarks, and achieve superior performance for models like Llama 2 and Llama 3.1.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v2412:

View File

@@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm
The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
designed to enable efficient training of large-scale language models on AMD
GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers
GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers
enhanced scalability, performance, and resource utilization for AI workloads.
It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
DeepSeek, enabling developers to train next-generation AI models more
efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
AMD provides a ready-to-use Docker image for MI300X GPUs containing
AMD provides a ready-to-use Docker image for MI300X accelerators containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-25-3:
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
* Llama 2 7B
@@ -123,7 +123,7 @@ The pre-built ROCm Megatron-LM environment allows users to quickly validate syst
training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on the MI300X GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on the MI300X accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v253:
@@ -334,7 +334,7 @@ Multi-node training
inside a Docker, either install the drivers inside the Docker container or pass the network
drivers from the host while creating the Docker container.
Start training on AMD Instinct GPUs
Start training on AMD Instinct accelerators
===========================================
The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
@@ -345,8 +345,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD Megatron-LM Docker image.
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD Megatron-LM Docker image.
.. tab-set::

View File

@@ -15,13 +15,13 @@ Training a model with Megatron-LM for ROCm
The Megatron-LM framework for ROCm is a specialized fork of the robust Megatron-LM,
designed to enable efficient training of large-scale language models on AMD
GPUs. By leveraging AMD Instinct™ MI300X Series GPUs, Megatron-LM delivers
GPUs. By leveraging AMD Instinct™ MI300X series accelerators, Megatron-LM delivers
enhanced scalability, performance, and resource utilization for AI workloads.
It is purpose-built to support models like Llama 2, Llama 3, Llama 3.1, and
DeepSeek, enabling developers to train next-generation AI models more
efficiently. See the GitHub repository at `<https://github.com/ROCm/Megatron-LM>`__.
AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
AMD provides a ready-to-use Docker image for MI300X series accelerators containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-25-4:
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
* Llama 3.1 8B
@@ -105,7 +105,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the :doc:`latest version of this training benchmarking environment <../megatron-lm>`.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -124,7 +124,7 @@ The prebuilt ROCm Megatron-LM environment allows users to quickly validate syste
training benchmarks, and achieve superior performance for models like Llama 3.1, Llama 2, and DeepSeek V2.
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v254:
@@ -367,7 +367,7 @@ Multi-node training
# Specify which RDMA interfaces to use for communication
export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7
Start training on AMD Instinct GPUs
Start training on AMD Instinct accelerators
===========================================
The prebuilt Megatron-LM with ROCm training environment allows users to quickly validate
@@ -378,8 +378,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD Megatron-LM Docker image.
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD Megatron-LM Docker image.
.. tab-set::

View File

@@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides a ready-to-use Docker image for MI300X Series GPUs containing
AMD provides a ready-to-use Docker image for MI300X series accelerators containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. It contains the following software components to accelerate training
workloads:
@@ -69,7 +69,7 @@ Megatron-LM provides the following key features to train large language models e
.. _amd-megatron-lm-model-support-v255:
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/megatron-lm-v25.5-benchmark-models.yaml
@@ -131,7 +131,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -154,7 +154,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v255:
@@ -536,7 +536,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v255>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.
MI300X series accelerators with the AMD Megatron-LM environment.
Single node training
^^^^^^^^^^^^^^^^^^^^

View File

@@ -16,13 +16,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`__ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides ready-to-use Docker images for MI300X Series GPUs containing
AMD provides ready-to-use Docker images for MI300X series accelerators containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. Each image contains the following software components to accelerate training
workloads:
@@ -65,7 +65,7 @@ workloads:
.. _amd-megatron-lm-model-support-v256:
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
Supported models
================
@@ -124,7 +124,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -147,7 +147,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v256:
@@ -589,7 +589,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v256>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.
MI300X series accelerators with the AMD Megatron-LM environment.
Single node training
--------------------

View File

@@ -22,13 +22,13 @@ Training a model with Megatron-LM for ROCm
The `Megatron-LM framework for ROCm <https://github.com/ROCm/Megatron-LM>`_ is
a specialized fork of the robust Megatron-LM, designed to enable efficient
training of large-scale language models on AMD GPUs. By leveraging AMD
Instinct™ MI300X Series GPUs, Megatron-LM delivers enhanced
Instinct™ MI300X series accelerators, Megatron-LM delivers enhanced
scalability, performance, and resource utilization for AI workloads. It is
purpose-built to support models like Llama, DeepSeek, and Mixtral,
enabling developers to train next-generation AI models more
efficiently.
AMD provides ready-to-use Docker images for MI300X Series GPUs containing
AMD provides ready-to-use Docker images for MI300X series accelerators containing
essential components, including PyTorch, ROCm libraries, and Megatron-LM
utilities. Each image contains the following software components to accelerate training
workloads:
@@ -66,7 +66,7 @@ workloads:
================
The following models are supported for training performance benchmarking with Megatron-LM and ROCm
on AMD Instinct MI300X Series GPUs.
on AMD Instinct MI300X series accelerators.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -120,7 +120,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html>`__
only reflects the latest version of this training benchmarking environment.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X GPUs or ROCm software.
The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -143,7 +143,7 @@ Environment setup
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the AMD Megatron-LM Docker
reproduce the benchmark results on MI300X series accelerators with the AMD Megatron-LM Docker
image.
.. _amd-megatron-lm-requirements-v257:
@@ -592,7 +592,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-megatron-lm-benchmark-test-vars-v257>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.
MI300X series accelerators with the AMD Megatron-LM environment.
Single node training
--------------------

View File

@@ -15,7 +15,7 @@ Training a model with Primus and Megatron-LM
`Primus <https://github.com/AMD-AGI/Primus>`__ is a unified and flexible
LLM training framework. It streamlines LLM
training on AMD Instinct GPUs using a modular, reproducible configuration paradigm.
training on AMD Instinct accelerators using a modular, reproducible configuration paradigm.
Primus is backend-agnostic and supports multiple training engines -- including Megatron.
.. note::
@@ -25,7 +25,7 @@ Primus is backend-agnostic and supports multiple training engines -- including M
workloads from Megatron-LM to Primus with Megatron, see
:doc:`megatron-lm-primus-migration-guide`.
For ease of use, AMD provides a ready-to-use Docker image for MI300 Series GPUs
For ease of use, AMD provides a ready-to-use Docker image for MI300 series accelerators
containing essential components for Primus and Megatron-LM.
.. note::
@@ -53,7 +53,7 @@ containing essential components for Primus and Megatron-LM.
Supported models
================
The following models are pre-optimized for performance on AMD Instinct MI300X Series GPUs.
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators.
Some instructions, commands, and training examples in this documentation might
vary by model -- select one to get started.
@@ -120,7 +120,7 @@ system's configuration.
=================
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X Series GPUs with the ``{{ docker.pull_tag }}`` image.
reproduce the benchmark results on MI300X series accelerators with the ``{{ docker.pull_tag }}`` image.
.. _amd-primus-megatron-lm-requirements-v257:
@@ -231,7 +231,7 @@ Run training
Use the following example commands to set up the environment, configure
:ref:`key options <amd-primus-megatron-lm-benchmark-test-vars>`, and run training on
MI300X Series GPUs with the AMD Megatron-LM environment.
MI300X series accelerators with the AMD Megatron-LM environment.
Single node training
--------------------

View File

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.3``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI300X GPU.
The following models are pre-optimized for performance on the AMD Instinct MI300X accelerator.
* Llama 3.1 8B
@@ -237,7 +237,7 @@ Along with the following datasets:
* `bghira/pseudo-camera-10k <https://huggingface.co/datasets/bghira/pseudo-camera-10k>`_
Start training on AMD Instinct GPUs
Start training on AMD Instinct accelerators
===========================================
The prebuilt PyTorch with ROCm training environment allows users to quickly validate
@@ -248,8 +248,8 @@ can expect the container to perform in the model configurations described in
the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI300X Series
GPUs with the AMD PyTorch training Docker image.
to train models, and reproduce the benchmark results on MI300X series
accelerators with the AMD PyTorch training Docker image.
Once your environment is set up, use the following commands and examples to start benchmarking.

View File

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The PyTorch for ROCm training Docker (``rocm/pytorch-training:v25.4``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
* Llama 3.1 8B
@@ -76,7 +76,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -260,7 +260,7 @@ the following section, but other configurations are not validated by AMD.
Use the following instructions to set up the environment, configure the script
to train models, and reproduce the benchmark results on MI325X and MI300X
GPUs with the AMD PyTorch training Docker image.
accelerators with the AMD PyTorch training Docker image.
Once your environment is set up, use the following commands and examples to start benchmarking.

View File

@@ -19,7 +19,7 @@ model training with GPU-optimized components for transformer-based models.
The `PyTorch for ROCm training Docker <https://hub.docker.com/layers/rocm/pytorch-training/v25.5/images/sha256-d47850a9b25b4a7151f796a8d24d55ea17bba545573f0d50d54d3852f96ecde5>`_
(``rocm/pytorch-training:v25.5``) image
provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following
model on AMD Instinct MI325X and MI300X accelerators. It includes the following
software components to accelerate training workloads:
+--------------------------+--------------------------------+
@@ -45,7 +45,7 @@ software components to accelerate training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
* Llama 3.3 70B
@@ -79,7 +79,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================

View File

@@ -18,7 +18,7 @@ model training with GPU-optimized components for transformer-based models.
The `PyTorch for ROCm training Docker <https://hub.docker.com/layers/rocm/pytorch-training/v25.6/images/sha256-a4cea3c493a4a03d199a3e81960ac071d79a4a7a391aa9866add3b30a7842661>`_
(``rocm/pytorch-training:v25.6``) image provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate
model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate
training workloads:
+--------------------------+--------------------------------+
@@ -44,7 +44,7 @@ training workloads:
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
.. datatemplate:yaml:: /data/how-to/rocm-for-ai/training/previous-versions/pytorch-training-v25.6-benchmark-models.yaml
@@ -99,7 +99,7 @@ The following models are pre-optimized for performance on the AMD Instinct MI325
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -444,7 +444,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -22,7 +22,7 @@ model training with GPU-optimized components for transformer-based models.
{% set docker = dockers[0] %}
The `PyTorch for ROCm training Docker <{{ docker.docker_hub_url }}>`__
(``{{ docker.pull_tag }}``) image provides a prebuilt optimized environment for fine-tuning and pretraining a
model on AMD Instinct MI325X and MI300X GPUs. It includes the following software components to accelerate
model on AMD Instinct MI325X and MI300X accelerators. It includes the following software components to accelerate
training workloads:
.. list-table::
@@ -41,7 +41,7 @@ model training with GPU-optimized components for transformer-based models.
Supported models
================
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs.
The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators.
Some instructions, commands, and training recommendations in this documentation might
vary by model -- select one to get started.
@@ -124,7 +124,7 @@ popular AI models.
The performance data presented in
`Performance results with AMD ROCm software <https://www.amd.com/en/developer/resources/rocm-hub/dev-ai/performance-results.html#tabs-a8deaeb413-item-21cea50186-tab>`_
should not be interpreted as the peak performance achievable by AMD
Instinct MI325X and MI300X GPUs or ROCm software.
Instinct MI325X and MI300X accelerators or ROCm software.
System validation
=================
@@ -555,7 +555,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series accelerators, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -996,7 +996,7 @@ Further reading
Framework for Large Models on AMD GPUs <https://rocm.blogs.amd.com/software-tools-optimization/primus/README.html>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -555,7 +555,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -650,7 +650,7 @@ Further reading
- To learn more about MAD and the ``madengine`` CLI, see the `MAD usage guide <https://github.com/ROCm/MAD?tab=readme-ov-file#usage-guide>`__.
- To learn more about system settings and management practices to configure your system for
AMD Instinct MI300X Series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
AMD Instinct MI300X series GPUs, see `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_.
- For a list of other ready-made Docker images for AI with ROCm, see
`AMD Infinity Hub <https://www.amd.com/en/developer/resources/infinity-hub.html#f-amd_hub_category=AI%20%26%20ML%20Models>`_.

View File

@@ -6,9 +6,9 @@
Scaling model training
**********************
To train a large-scale model like OpenAI GPT-2 or Meta Llama 2 70B, a single GPU cannot store all the
model parameters required for training. This immense scale presents a fundamental challenge: no single GPU
can simultaneously store and process the entire model's parameters during training. PyTorch
A large-scale model like OpenAI GPT-2 or Meta Llama 2 70B has more parameters than a single accelerator or GPU can hold.
This immense scale presents a fundamental challenge: no single accelerator or GPU
can simultaneously store and process the entire model's parameters during training. PyTorch
provides an answer to this computational constraint through its distributed training frameworks.
.. _rocm-for-ai-pytorch-distributed:
@@ -26,9 +26,9 @@ Features in ``torch.distributed`` are categorized into three main components:
- `Collective communication <https://pytorch.org/docs/stable/distributed.html>`_
In this topic, the focus is on the distributed data-parallelism strategy as it's the most popular. To get started with DDP,
you need to first understand how to coordinate the model and its training data across multiple GPUs.
you need to first understand how to coordinate the model and its training data across multiple accelerators or GPUs.
The DDP workflow on multiple GPUs is as follows:
The DDP workflow on multiple accelerators or GPUs is as follows:
#. Split the current global training batch into small local batches on each GPU. For instance, if you have 8 GPUs and
the global batch is set at 32 samples, each of the 8 GPUs will have a local batch size of 4 samples.
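As a concrete illustration of this workflow, the following is a minimal DDP sketch. It is not taken from the guides referenced here; the toy model, batch sizes, and ``nccl`` backend choice (backed by RCCL on ROCm) are placeholder assumptions.

.. code-block:: python

   # Minimal DDP sketch -- launch with: torchrun --nproc_per_node=8 train.py
   import os
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # On ROCm, the "nccl" backend is provided by RCCL.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = torch.nn.Linear(1024, 1024).to(local_rank)
       ddp_model = DDP(model, device_ids=[local_rank])
       optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

       # Each rank processes only its local batch (for example, 4 of a global
       # batch of 32 across 8 devices); DDP averages gradients during backward().
       inputs = torch.randn(4, 1024, device=local_rank)
       targets = torch.randn(4, 1024, device=local_rank)
       loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
       loss.backward()
       optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()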
@@ -82,7 +82,7 @@ the training pillar.
See `Pre-training a large language model with Megatron-DeepSpeed on multiple AMD GPUs
<https://rocm.blogs.amd.com/artificial-intelligence/megatron-deepspeed-pretrain/README.html>`_ for a detailed example of
training with DeepSpeed on an AMD GPU.
training with DeepSpeed on an AMD accelerator or GPU.
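As a complement to that blog post, here is a minimal, hedged sketch of wrapping a model with DeepSpeed ZeRO. The configuration values and toy model are illustrative assumptions, not the blog's setup.

.. code-block:: python

   # Minimal DeepSpeed ZeRO sketch -- launch with: deepspeed train.py
   import torch
   import deepspeed

   ds_config = {
       "train_micro_batch_size_per_gpu": 4,
       "gradient_accumulation_steps": 1,
       "zero_optimization": {"stage": 2},  # shard optimizer state and gradients
       "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
   }

   model = torch.nn.Linear(1024, 1024)  # placeholder model
   model_engine, optimizer, _, _ = deepspeed.initialize(
       model=model,
       model_parameters=model.parameters(),
       config=ds_config,
   )

   inputs = torch.randn(4, 1024, device=model_engine.device)
   targets = torch.randn(4, 1024, device=model_engine.device)
   loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
   model_engine.backward(loss)  # DeepSpeed handles gradient partitioning
   model_engine.step()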
.. _rocm-for-ai-automatic-mixed-precision:
@@ -95,7 +95,7 @@ can take to reduce training time and memory usage through `automatic mixed preci
See `Automatic mixed precision in PyTorch using AMD GPUs — ROCm Blogs
<https://rocm.blogs.amd.com/artificial-intelligence/automatic-mixed-precision/README.html#automatic-mixed-precision-in-pytorch-using-amd-gpus>`_
for more information about running AMP on an AMD Instinct-Series GPU.
for more information about running AMP on an AMD accelerator.
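The typical AMP pattern is shown in the following minimal sketch; the toy model and optimizer are placeholders, and the blog above covers ROCm-specific details.

.. code-block:: python

   # Minimal automatic mixed precision (AMP) sketch on a single device.
   import torch

   device = "cuda"  # on ROCm, AMD devices are exposed through the CUDA device API
   model = torch.nn.Linear(1024, 1024).to(device)
   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
   scaler = torch.cuda.amp.GradScaler()

   inputs = torch.randn(32, 1024, device=device)
   targets = torch.randn(32, 1024, device=device)

   with torch.autocast(device_type="cuda", dtype=torch.float16):
       loss = torch.nn.functional.mse_loss(model(inputs), targets)

   scaler.scale(loss).backward()   # scale the loss to avoid float16 underflow
   scaler.step(optimizer)          # unscale gradients, then step
   scaler.update()
   optimizer.zero_grad()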
.. _rocm-for-ai-fine-tune:
@@ -107,7 +107,7 @@ example, LoRA, QLoRA, PEFT, and FSDP.
Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning/index`.
The following developer blogs showcase examples of fine-tuning a model on an AMD GPU.
The following developer blogs showcase examples of fine-tuning a model on an AMD accelerator or GPU.
* Fine-tuning Llama2 with LoRA
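In the same spirit as these blog posts, the following is a minimal parameter-efficient fine-tuning sketch using Hugging Face PEFT. The model name, target modules, and hyperparameters are illustrative assumptions rather than values taken from the blogs.

.. code-block:: python

   # Minimal LoRA sketch with Hugging Face PEFT (model and settings are placeholders).
   from transformers import AutoModelForCausalLM, AutoTokenizer
   from peft import LoraConfig, get_peft_model

   base_model = "meta-llama/Llama-2-7b-hf"  # assumed example checkpoint
   tokenizer = AutoTokenizer.from_pretrained(base_model)
   model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

   lora_config = LoraConfig(
       r=8,                                  # rank of the low-rank update matrices
       lora_alpha=16,
       lora_dropout=0.05,
       target_modules=["q_proj", "v_proj"],  # attention projections to adapt
       task_type="CAUSAL_LM",
   )
   model = get_peft_model(model, lora_config)
   model.print_trainable_parameters()  # only the LoRA adapters are trainable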

View File

@@ -7,14 +7,14 @@ Using ROCm for HPC
******************
The ROCm open-source software stack is optimized to extract high-performance
computing (HPC) workload performance from AMD Instinct™ GPUs
computing (HPC) workload performance from AMD Instinct™ accelerators
while maintaining compatibility with industry software frameworks.
ROCm enhances support and access for developers by providing streamlined and
improved tools that significantly increase productivity. Being open-source, ROCm
fosters innovation, differentiation, and collaboration within the developer
community, making it a powerful and accessible solution for leveraging the full
potential of AMD GPUs' capabilities in diverse computational
potential of AMD accelerators in diverse computational
applications.
* For more information, see :doc:`What is ROCm? <../../what-is-rocm>`.
@@ -24,7 +24,7 @@ applications.
and operating system support.
Some of the most popular HPC frameworks are part of the ROCm platform, including
those to help parallelize operations across multiple GPUs and servers,
those to help parallelize operations across multiple accelerators and servers,
handle memory hierarchies, and solve linear systems.
.. image:: ../../data/how-to/rocm-for-hpc/hpc-stack-2024_6_20.png
@@ -38,7 +38,7 @@ science, genomics, geophysics, molecular dynamics, and physics computing.
Refer to the resources in the following table for instructions on building,
running, and deploying these applications on ROCm-capable systems with AMD
Instinct GPUs. Each build container provides parameters to specify
Instinct accelerators. Each build container provides parameters to specify
different source code branches, release versions of ROCm, OpenMPI, UCX, and
Ubuntu versions.
@@ -96,7 +96,7 @@ Ubuntu versions.
* -
- `QUDA <https://github.com/amd/InfinityHub-CI/tree/main/quda>`_
- Library designed for efficient lattice QCD computations on
GPUs. It includes optimized Dirac operators and a variety of
accelerators. It includes optimized Dirac operators and a variety of
fermion solvers and conjugate gradient (CG) implementations, enhancing
performance and accuracy in lattice QCD simulations.

View File

@@ -1,6 +1,6 @@
.. meta::
:description: Learn about AMD hardware optimization for HPC-specific and workstation workloads.
:keywords: high-performance computing, HPC, Instinct GPUs, Radeon,
:keywords: high-performance computing, HPC, Instinct accelerators, Radeon,
tuning, tuning guide, AMD, ROCm
*******************

View File

@@ -1,7 +1,7 @@
:orphan:
.. meta::
:description: How to configure MI300X GPUs to fully leverage their capabilities and achieve optimal performance.
:description: How to configure MI300X accelerators to fully leverage their capabilities and achieve optimal performance.
:keywords: ROCm, AI, machine learning, MI300X, LLM, usage, tutorial, optimization, tuning
************************
@@ -10,9 +10,9 @@ AMD MI300X tuning guides
The tuning guides in this section provide a comprehensive summary of the
necessary steps to properly configure your system for AMD Instinct™ MI300X
GPUs. They include detailed instructions on system settings and
accelerators. They include detailed instructions on system settings and
application tuning suggestions to help you fully leverage the capabilities of
these GPUs, thereby achieving optimal performance.
these accelerators, thereby achieving optimal performance.
* :doc:`/how-to/rocm-for-ai/inference-optimization/workload`
* `AMD Instinct MI300X system optimization <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html>`_

View File

@@ -8,7 +8,7 @@ myst:
# AMD ROCm documentation
ROCm is an open-source software platform optimized to extract HPC and AI workload
performance from AMD Instinct GPUs and AMD Radeon GPUs while maintaining
performance from AMD Instinct accelerators and AMD Radeon GPUs while maintaining
compatibility with industry software frameworks. For more information, see
[What is ROCm?](./what-is-rocm.rst)
@@ -16,7 +16,7 @@ ROCm supports multiple programming languages and programming interfaces such as
{doc}`HIP (Heterogeneous-Compute Interface for Portability)<hip:index>`, OpenCL,
and OpenMP, as explained in the [Programming guide](./how-to/programming_guide.rst).
If you're using AMD Radeon GPUs or Ryzen APUs in a workstation setting with a display connected, review {doc}`ROCm on Radeon and Ryzen documentation<radeon:index>`.
If you're using AMD Radeon GPUs or Ryzen APUs in a workstation setting with a display connected, review [ROCm on Radeon and Ryzen documentation](https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html).
ROCm documentation is organized into the following categories:
@@ -29,7 +29,7 @@ ROCm documentation is organized into the following categories:
* {doc}`ROCm on Linux <rocm-install-on-linux:reference/system-requirements>`
* {doc}`HIP SDK on Windows <rocm-install-on-windows:reference/system-requirements>`
* {doc}`ROCm on Radeon and Ryzen<radeon:index>`
* [ROCm on Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/index.html)
* {doc}`Deep learning frameworks </how-to/deep-learning-rocm>`
* {doc}`Build from source </how-to/build-rocm>`
:::
@@ -64,7 +64,7 @@ ROCm documentation is organized into the following categories:
<!-- markdownlint-disable MD051 -->
* [ROCm libraries](./reference/api-libraries.md)
* [ROCm tools, compilers, and runtimes](./reference/rocm-tools.md)
* [GPU hardware specifications](./reference/gpu-arch-specs.rst)
* [Accelerator and GPU hardware specifications](./reference/gpu-arch-specs.rst)
* [Data types and precision support](./reference/precision-support.rst)
* [Graph safe support](./reference/graph-safe-support.rst)
<!-- markdownlint-enable MD051 -->

View File

@@ -1,17 +1,17 @@
.. meta::
:description: AMD Instinct™ GPU, AMD Radeon PRO™, and AMD Radeon™ GPU architecture information
:description: AMD Instinct™ accelerator, AMD Radeon PRO™, and AMD Radeon™ GPU architecture information
:keywords: Instinct, Radeon, accelerator, GCN, CDNA, RDNA, GPU, architecture, VRAM, Compute Units, Cache, Registers, LDS, Register File
GPU hardware specifications
Accelerator and GPU hardware specifications
===========================================
The following tables provide an overview of the hardware specifications for AMD Instinct™ GPUs, and AMD Radeon™ PRO and Radeon™ GPUs.
The following tables provide an overview of the hardware specifications for AMD Instinct™ accelerators, and AMD Radeon™ PRO and Radeon™ GPUs.
For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix <https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html>`_.
.. tab-set::
.. tab-item:: AMD Instinct GPUs
.. tab-item:: AMD Instinct accelerators
.. list-table::
:header-rows: 1

View File

@@ -1,5 +1,5 @@
.. meta::
:description: AMD Instinct GPU, AMD Radeon PRO, and AMD Radeon GPU
:description: AMD Instinct accelerator, AMD Radeon PRO, and AMD Radeon GPU
atomics operations information
:keywords: Atomics operations, atomic bitwise functions, atomics add, atomics
subtraction, atomics exchange, atomics min, atomics max
@@ -15,8 +15,8 @@ access to the same memory location could lead to incorrect or undefined
behavior.
This topic summarizes the support of atomic read-modify-write
(atomicRMW) operations on AMD GPUs. This includes gfx9, gfx10,
gfx11, and gfx12 targets and the following Instinct™ Series:
(atomicRMW) operations on AMD GPUs and accelerators. This includes gfx9, gfx10,
gfx11, and gfx12 targets and the following Instinct™ series:
- MI100
@@ -79,10 +79,10 @@ Scopes of operations:
Support summary
================================================================================
AMD Instinct GPUs
AMD Instinct accelerators
--------------------------------------------------------------------------------
**MI300 and MI350 Series**
**MI300 and MI350 series**
- All atomicRMW operations are forwarded out to the Infinity Fabric.
- Infinity Fabric supports common integer and bitwise atomics, FP32 atomic add,
@@ -95,7 +95,7 @@ AMD Instinct GPUs
It will seem like atomics to the wave, but the CPU sees it as a non-atomic
load-op-store sequence. This downgrades system-scope atomics to device-scope.
**MI200 Series**
**MI200 series**
- L2 cache and Infinity Fabric both support common integer and bitwise atomics.
- L2 cache supports FP32 atomic add, packed-FP16 atomic add, and FP64 add,
@@ -272,10 +272,10 @@ The tables selectors or options are the following:
- Second-level option:
- "No PCIe atomics" means the system does not support PCIe atomics between
the GPU and peer/host-memory.
the accelerator and peer/host-memory.
- "PCIe atomics" means the system supports PCIe atomics between the
GPU and peer/host-memory.
accelerator and peer/host-memory.
- The third-level option is the memory granularity of the memory target.
@@ -306,11 +306,11 @@ The integer type atomic operations that are supported by different hardware.
- Max
AMD Instinct GPUs
AMD Instinct accelerators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The integer type atomic operations that are supported by different AMD
Instinct GPUs listed in the following table.
Instinct accelerators are listed in the following table.
.. <!-- spellcheck-disable -->
@@ -481,11 +481,11 @@ The bitwise atomic operations that are supported by different hardware.
128-bit bitwise Exchange and CAS are not supported on AMD GPUs
AMD Instinct GPUs
AMD Instinct accelerators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The bitwise atomic operations that are supported by different AMD Instinct
GPUs listed in the following table.
accelerators are listed in the following table.
.. <!-- spellcheck-disable -->
@@ -659,11 +659,11 @@ The float types atomic operations that are supported by different hardware.
- Add
AMD Instinct GPUs
AMD Instinct accelerators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The float type atomic operations that are supported by different AMD Instinct
GPUs listed in the following table.
accelerators are listed in the following table.
.. <!-- spellcheck-disable -->

View File

@@ -9,7 +9,7 @@
Data types and precision support
*************************************************************
This topic summarizes the data types supported on AMD GPUs and
This topic summarizes the data types supported on AMD GPUs and accelerators and
ROCm libraries, along with corresponding :doc:`HIP <hip:index>` data types.
Integral types

View File

@@ -97,9 +97,9 @@ subtrees:
subtrees:
- entries:
- file: how-to/rocm-for-ai/fine-tuning/single-gpu-fine-tuning-and-inference.rst
title: Use a single GPU
title: Use a single accelerator
- file: how-to/rocm-for-ai/fine-tuning/multi-gpu-fine-tuning-and-inference.rst
title: Use multiple GPUs
title: Use multiple accelerators
- file: how-to/rocm-for-ai/inference/index.rst
title: Inference
@@ -182,7 +182,7 @@ subtrees:
- file: conceptual/gpu-arch/mi300-mi200-performance-counters.rst
title: MI300 and MI200 performance counters
- file: conceptual/gpu-arch/mi350-performance-counters.rst
title: MI350 Series performance counters
title: MI350 series performance counters
- file: conceptual/gpu-arch/mi250.md
title: MI250 microarchitecture
subtrees:

Some files were not shown because too many files have changed in this diff.