Compare commits


54 Commits

Author SHA1 Message Date
randyh62
2b83a962a0 Use intersphinx links for deep learning (#5859)
* Use intersphinx links for deep learning

* Update deep-learning-rocm.rst

remove Taichi

* Update deep-learning-rocm.rst

Change Install link to "link"

* Apply suggestion from @randyh62

OK
2026-01-20 09:17:37 -08:00
Jeffrey Novotny
54bf4c0319 Add missing APU entries to GPU hardware specifications (#646) (#5862) (#5863)
* Add missing APU entries to GPU hardware specifications

* Move Ryzen APUs to new tab

* Add new column to Ryzen table and rename column elsewhere

---------

(cherry picked from commit 7ab402a3b3)


(cherry picked from commit 33fbde69db)

Co-authored-by: alexxu-amd <159800977+alexxu-amd@users.noreply.github.com>
2026-01-16 13:02:06 -05:00
peterjunpark
4347a11bc4 Doc update for vLLM refactor #5855 (#5856)
(cherry picked from commit a745e45dcb)
2026-01-15 11:34:02 -05:00
ROCm Docs Automation
2b7fde505f Update rocm-docs-core to 1.31.2 2026-01-14 11:26:11 -05:00
anisha-amd
a98d6a5777 Docs: Ray release 25.12 and compatibility version format standardization (#5845) (#5846) 2026-01-08 12:29:00 -05:00
Swati Rawat
38b271df55 Merge pull request #5843 from SwRaw/sw_cherrypick
Cherrypicking amd-smi updates from ROCm internal
2026-01-08 20:33:14 +05:30
Swati Rawat
4184d1ee1f Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:22 +05:30
Swati Rawat
0786c328c1 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:22 +05:30
Swati Rawat
88ea6072f5 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:22 +05:30
Swati Rawat
b32dcc8570 Update docs/how-to/rocm-for-ai/system-setup/prerequisite-system-validation.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:22 +05:30
Swati Rawat
0faa92e922 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:21 +05:30
Swati Rawat
26ae989602 Update docs/how-to/rocm-for-ai/training/benchmark-docker/previous-versions/megatron-lm-v24.12-dev.rst
Co-authored-by: peterjunpark <git@peterjunpark.com>
2026-01-08 16:46:21 +05:30
srawat
4402dc4147 Update single-gpu-fine-tuning-and-inference.rst 2026-01-08 16:46:21 +05:30
srawat
5eda438e0a Update multi-gpu-fine-tuning-and-inference.rst 2026-01-08 16:46:20 +05:30
srawat
049784e1a7 Update prerequisite-system-validation.rst 2026-01-08 16:42:18 +05:30
srawat
f12169c5b7 replace rocm-smi reference with amd-smi 2026-01-08 16:42:18 +05:30
peterjunpark
b35d1a0627 fix(primus-pytorch.rst): FP8 config instead of BF16 (#5839)
(cherry picked from commit 2dc22ca890)
2026-01-07 13:51:50 -05:00
Pratik Basyal
912618cb08 ROCM-core version fixed (#5827) (#5828) 2026-01-02 16:10:16 -05:00
peterjunpark
7d2feaa8b1 Fix inconsistency in xDiT doc (#5823)
Fix inconsistency in xDiT doc

(cherry picked from commit 172b0f7c08)
2025-12-29 10:29:59 -05:00
peterjunpark
7d0d114994 Merge pull request #5821 from peterjunpark/docs/7.1.1
[docs/7.1.1] Add xDiT and Primus doc updates
2025-12-29 08:49:44 -05:00
peterjunpark
2a65394e32 Update docs for xDiT diffusion inference 25.13 Docker release (#5820)
* archive previous version

* add xdit 25.13

* update history index

* add perf results section

(cherry picked from commit c67fac78bd)
2025-12-29 08:45:29 -05:00
peterjunpark
268c1332c9 Update training docs for Primus/25.11 (#5819)
* update conf and toc.yml.in

* archive previous versions

archive data files

update anchors

* primus pytorch: remove training batch size args

* update primus megatron run cmds

multi-node

* update primus pytorch

update

* update

update

* update docker tag

(cherry picked from commit e0b8ec4dfb)
2025-12-29 08:45:17 -05:00
Pratik Basyal
374e0944dc OS table removed from compatibility table [develop] (#5810) (#5811)
* OS table removed from compatibility table

* Feedback added

* Azure Linux 3.0 and compatibility version update

* Version fix

* Review feedback added

* Minor change
2025-12-23 16:38:03 -05:00
peterjunpark
512e311041 Update xdit diffusion inference history (#5808) (#5809)
* Update xdit diffusion inference history

* fix

(cherry picked from commit 3a43bacdda)
2025-12-22 11:14:57 -05:00
peterjunpark
ad4f486635 fix link to ROCm PyT docker image (#5803) (#5804)
(cherry picked from commit 48d8fe139b)
2025-12-19 15:51:20 -05:00
peterjunpark
485886712b clean up formatting in FA2 page (#5795) (#5796)
(cherry picked from commit 7455fe57b8)
2025-12-19 09:38:20 -05:00
peterjunpark
1cd6a14a22 Update Flash Attention guidance in "Model acceleration libraries" (#5793)
* flash attention update

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

flash attention update

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

flash attention update

Signed-off-by: seungrok.jung <seungrok.jung@amd.com>

sentence-case heading

* Update docs/how-to/rocm-for-ai/inference-optimization/model-acceleration-libraries.rst

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

---------

Co-authored-by: seungrok.jung <seungrok.jung@amd.com>
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
(cherry picked from commit 52c0a47e84)
2025-12-19 09:00:40 -05:00
peterjunpark
a17f04a3b5 Update documentation for JAX training MaxText 25.11 release (#5789) (#5790)
(cherry picked from commit cbab9a465d)
2025-12-18 11:26:42 -05:00
peterjunpark
94de66ef3f [docs/7.1.1] Publish vLLM and xDiT doc updates (#5787)
* vLLM inference benchmark 1210 (#5776)

* Archive previous ver

fix anchors

* Update vllm.rst and data yaml for 20251210

(cherry picked from commit 1b4f25733d)

* xDiT diffusion inference v25.12 documentation update (#5786)

* Add xdit-diffusion ROCm docs page.

* Update template formatting and fix sphinx warnings

* Add System Validation section.

* Add sw component versions/commits.

* Update to use latest v25.10 image instead of v25.9

* Update commands and add FLUX instructions.

* Update Flux instructions. Change image tag. Describe as diffusion inference instead of specifically video.

* git rm xdit-video-diffusion.rst

* Docs for v25.12

* Add hyperlinks to components

* Command fixes

* -Diffusers suffix

* Simplify yaml file and cleanup main rst page.

* Spelling, added 'js'

* fix merge conflict

fix

---------

Co-authored-by: Kristoffer <kristoffer.torp@amd.com>
(cherry picked from commit 459283da3c)

---------

Co-authored-by: Kristoffer <kristoffer.torp@amd.com>
2025-12-17 10:28:30 -05:00
Pratik Basyal
e5cebe7b4e Taichi removed from ROCm docs [Develop] (#5779) (#5781)
* Taichi removed from ROCm docs

* Warnings fixed
2025-12-16 13:24:12 -05:00
Pratik Basyal
7047cfa19c Onnx and rocshmem version updated (#5760) (#5764) 2025-12-11 17:11:05 -05:00
Matt Williams
de71bf5fa7 Merge pull request #5759 from ROCm/cherry-pick-701
Fixing link redirects (#5758)
2025-12-10 11:39:53 -05:00
Matt Williams
0d17c96f7f Fixing link redirects (#5758)
* Update multi-gpu-fine-tuning-and-inference.rst

* Update pytorch-training-v25.6.rst

* Update pytorch-compatibility.rst
2025-12-10 11:31:26 -05:00
anisha-amd
2f8c99f7f0 Docs: update verl compatibility - fix (#5755) 2025-12-09 19:52:12 -05:00
anisha-amd
982927e866 Docs: verl framework - compatibility - 25.11 release (#5752) (#5753) 2025-12-09 12:02:20 -05:00
peterjunpark
8f45b791fe Fix Primus PyTorch doc: training.batch_size -> training.local_batch_size (#5748) (#5749)
(cherry picked from commit bf74351e5a)
2025-12-08 13:59:00 -05:00
yugang-amd
f7c7587b10 xdit-diffusion v25.11 docs (#5743) 2025-12-05 17:08:21 -05:00
Pratik Basyal
96b3c0d4f3 PyTorch 2.7 support added (#5740) (#5741) 2025-12-04 17:00:34 -05:00
peterjunpark
d6d4d2ef92 fix docker hub links for primus:v25.10 (#5738)
(cherry picked from commit 453751a86f)
2025-12-04 09:21:53 -05:00
peterjunpark
8647ebcf76 Update training Docker docs for Primus 25.10 (#5737)
(cherry picked from commit fb644412d5)
2025-12-04 09:21:53 -05:00
Pratik Basyal
48ca38b0dc Conflict resolved (#5735) 2025-12-03 09:02:57 -05:00
Istvan Kiss
acbd671e99 JAX key features and enhancements (#5708)
Co-authored-by: Pratik Basyal <prbasyal@amd.com>
2025-12-01 19:52:07 +01:00
Pratik Basyal
133a97ec18 711 post GA known issue update [docs/711] (#5723)
* 7.1.1 known issues post GA (#5721)

* rocblas known issues added

* Minor change

* Update RELEASE.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Resolved

* Update RELEASE.md

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

* GitHub Issue added

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2025-11-30 00:16:26 -05:00
Pratik Basyal
2d40066f29 Merged cell removed for coloring issue (#5713) (#5714) 2025-11-27 20:03:11 -05:00
ROCm Docs Automation
5d7fdace0e Update rocm-docs-core to 1.30.0 2025-11-26 17:09:50 -05:00
Istvan Kiss
7dbcdc5deb Update release notes links ROCm 7.1.1 (#5705) 2025-11-26 20:02:33 +01:00
Pratik Basyal
a966db29ca Known issue from 7.1.0 removed (#5702) (#5703) 2025-11-26 12:30:28 -05:00
Pratik Basyal
9ea8a48b3a Link and PyTorch version updated (#5700) (#5701) 2025-11-26 12:01:12 -05:00
Alex Xu
9956d72614 fix dependency 2025-11-26 11:42:22 -05:00
Alex Xu
305d24f486 Merge branch 'roc-7.1.x' into docs/7.1.1 2025-11-26 11:37:06 -05:00
Alex Xu
26f6b6b3e1 Merge branch 'roc-7.1.x' into docs/7.1.1 2025-11-26 11:29:02 -05:00
Alex Xu
d4cdbd79a3 Merge branch 'develop' into docs/7.1.1 2025-11-26 08:47:19 -05:00
alexxu-amd
26d1ab7d27 Update documentation requirements 2025-11-25 16:30:46 -05:00
alexxu-amd
272c9f6be3 Update documentation requirements 2025-11-25 15:37:04 -05:00
29 changed files with 1222 additions and 832 deletions

View File

@@ -34,7 +34,6 @@ parameters:
default:
- cmake
- libnuma-dev
- libsimde-dev
- mesa-common-dev
- ninja-build
- ocl-icd-libopencl1

View File

@@ -39,7 +39,6 @@ parameters:
- python3
- python3-dev
- python3-pip
- python3-venv
- libgtest-dev
- libboost-filesystem-dev
- libboost-program-options-dev
@@ -47,8 +46,6 @@ parameters:
type: object
default:
- nanobind>=2.0.0
- pytest
- pytest-cov
- name: rocmDependencies
type: object
default:
@@ -75,10 +72,8 @@ parameters:
- { os: ubuntu2204, packageManager: apt }
- { os: almalinux8, packageManager: dnf }
testJobs:
- { os: ubuntu2204, packageManager: apt, target: gfx942 }
- { os: ubuntu2204, packageManager: apt, target: gfx90a }
# - { os: ubuntu2204, packageManager: apt, target: gfx1100 }
# - { os: ubuntu2204, packageManager: apt, target: gfx1151 }
# - { os: ubuntu2204, packageManager: apt, target: gfx1201 }
- name: downstreamComponentMatrix
type: object
default:
@@ -121,11 +116,6 @@ jobs:
parameters:
dependencyList:
- gtest
- ${{ if ne(job.os, 'almalinux8') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-vendor.yml
parameters:
dependencyList:
- catch2
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
@@ -147,7 +137,6 @@ jobs:
-DORIGAMI_BUILD_SHARED_LIBS=ON
-DORIGAMI_ENABLE_PYTHON=ON
-DORIGAMI_BUILD_TESTING=ON
-DORIGAMI_ENABLE_FETCH=ON
-GNinja
- ${{ if ne(job.os, 'almalinux8') }}:
- task: PublishPipelineArtifact@1
@@ -180,6 +169,7 @@ jobs:
dependsOn: origami_build_${{ job.os }}
condition:
and(succeeded(),
eq(variables['ENABLE_${{ upper(job.target) }}_TESTS'], 'true'),
not(containsValue(split(variables['DISABLED_${{ upper(job.target) }}_TESTS'], ','), '${{ parameters.componentName }}')),
eq(${{ parameters.aggregatePipeline }}, False)
)
@@ -190,30 +180,30 @@ jobs:
workspace:
clean: all
steps:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
sparseCheckoutDir: ${{ parameters.sparseCheckoutDir }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
pipModules: ${{ parameters.pipModules }}
packageManager: ${{ job.packageManager }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-custom.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
sparseCheckoutDir: ${{ parameters.sparseCheckoutDir }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-vendor.yml
parameters:
dependencyList:
- gtest
- ${{ if ne(job.os, 'almalinux8') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-vendor.yml
parameters:
dependencyList:
- catch2
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/local-artifact-download.yml
parameters:
preTargetFilter: ${{ parameters.componentName }}
os: ${{ job.os }}
- task: DownloadPipelineArtifact@2
displayName: 'Download Build Directory Artifact'
inputs:
artifact: '${{ parameters.componentName }}_${{ job.os }}_build_dir'
path: '$(Agent.BuildDirectory)/s/build'
- task: DownloadPipelineArtifact@2
displayName: 'Download Python Source Artifact'
inputs:
artifact: '${{ parameters.componentName }}_${{ job.os }}_python_src'
path: '$(Agent.BuildDirectory)/s/python'
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
checkoutRef: ${{ parameters.checkoutRef }}
@@ -222,72 +212,25 @@ jobs:
gpuTarget: ${{ job.target }}
${{ if parameters.triggerDownstreamJobs }}:
downstreamAggregateNames: ${{ parameters.downstreamAggregateNames }}
- task: CMake@1
displayName: 'Origami Test CMake Configuration'
inputs:
cmakeArgs: >-
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/vendor
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DORIGAMI_BUILD_SHARED_LIBS=ON
-DORIGAMI_ENABLE_PYTHON=ON
-DORIGAMI_BUILD_TESTING=ON
-GNinja
$(Agent.BuildDirectory)/s
- task: Bash@3
displayName: 'Build Origami Tests and Python Bindings'
inputs:
targetType: inline
workingDirectory: build
script: |
cmake --build . --target origami-tests origami_python -- -j$(nproc)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/gpu-diagnostics.yml
# Run tests using CTest (discovers and runs both C++ and Python tests)
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/test.yml
parameters:
componentName: ${{ parameters.componentName }}
os: ${{ job.os }}
testDir: 'build'
testParameters: '--output-on-failure --force-new-ctest-process --output-junit test_output.xml'
# Test pip install workflow
# - task: Bash@3
# displayName: 'Test Pip Install'
# inputs:
# targetType: inline
# script: |
# set -e
# echo "==================================================================="
# echo "Testing pip install workflow (pip install -e .)"
# echo "==================================================================="
# # Set environment variables for pip install CMake build
# export ROCM_PATH=$(Agent.BuildDirectory)/rocm
# export CMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm:$(Agent.BuildDirectory)/vendor
# export CMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
# echo "ROCM_PATH: $ROCM_PATH"
# echo "CMAKE_PREFIX_PATH: $CMAKE_PREFIX_PATH"
# echo "CMAKE_CXX_COMPILER: $CMAKE_CXX_COMPILER"
# echo ""
# # Install from source directory
# cd "$(Agent.BuildDirectory)/s/python"
# pip install -e .
# # Verify import works
# echo ""
# echo "Verifying origami can be imported..."
# python3 -c "import origami; print('✓ Successfully imported origami')"
# # Run pytest on installed package
# echo ""
# echo "Running pytest tests..."
# python3 -m pytest tests/ -v -m "not slow" --tb=short
# echo ""
# echo "==================================================================="
# echo "Pip install test completed successfully"
# echo "==================================================================="
testDir: '$(Agent.BuildDirectory)/rocm/bin'
testExecutable: './origami-tests'
testParameters: '--yaml origami-tests.yaml --gtest_output=xml:./test_output.xml --gtest_color=yes'
- script: |
set -e
export PYTHONPATH=$(Agent.BuildDirectory)/s/build/python:$PYTHONPATH
echo "--- Running origami_test.py ---"
python3 $(Agent.BuildDirectory)/s/python/origami_test.py
echo "--- Running origami_grid_test.py ---"
python3 $(Agent.BuildDirectory)/s/python/origami_grid_test.py
displayName: 'Run Python Binding Tests'
condition: succeeded()
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/docker-container.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}

View File

@@ -30,7 +30,6 @@ parameters:
- python3-pip
- protobuf-compiler
- libprotoc-dev
- libopencv-dev
- name: pipModules
type: object
default:
@@ -65,7 +64,6 @@ parameters:
- MIVisionX
- rocm_smi_lib
- rccl
- rocAL
- rocALUTION
- rocBLAS
- rocDecode
@@ -105,7 +103,6 @@ parameters:
- MIVisionX
- rocm_smi_lib
- rccl
- rocAL
- rocALUTION
- rocBLAS
- rocDecode

View File

@@ -36,6 +36,7 @@ Andrej
Arb
Autocast
autograd
Backported
BARs
BatchNorm
BLAS
@@ -203,9 +204,11 @@ GenAI
GenZ
GitHub
Gitpod
hardcoded
HBM
HCA
HGX
HLO
HIPCC
hipDataType
HIPExtension
@@ -333,6 +336,7 @@ MoEs
Mooncake
Mpops
Multicore
multihost
Multithreaded
mx
MXFP
@@ -1027,6 +1031,7 @@ uncacheable
uncorrectable
underoptimized
unhandled
unfused
uninstallation
unmapped
unsqueeze

View File

@@ -270,26 +270,26 @@ The [ROCm examples repository](https://github.com/ROCm/rocm-examples) has been e
:margin: auto 0 auto auto
:::{grid}
:margin: auto 0 auto auto
* [hipBLASLt](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/)
* [hipSPARSE](https://rocm.docs.amd.com/projects/hipSPARSE/en/latest/)
* [hipSPARSELt](https://rocm.docs.amd.com/projects/hipSPARSELt/en/latest/)
* [hipTensor](https://rocm.docs.amd.com/projects/hipTensor/en/latest/)
* [hipBLASLt](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/hipBLASLt)
* [hipSPARSE](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/hipSPARSE)
* [hipSPARSELt](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/hipSPARSELt)
* [hipTensor](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/hipTensor)
:::
:::{grid}
:margin: auto 0 auto auto
* [rocALUTION](https://rocm.docs.amd.com/projects/rocALUTION/en/latest/)
* [ROCprofiler-SDK](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/)
* [rocWMMA](https://rocm.docs.amd.com/projects/rocWMMA/en/latest/)
* [rocALUTION](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/rocALUTION)
* [ROCprofiler-SDK](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/rocProfiler-SDK)
* [rocWMMA](https://github.com/ROCm/rocm-examples/tree/amd-staging/Libraries/rocWMMA)
:::
::::
Usage examples are now available for the following performance analysis tools:
* [ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/index.html)
* [ROCm Systems Profiler](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/index.html)
* [rocprofv3](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocprofv3.html)
* [ROCm Compute Profiler](https://github.com/ROCm/rocm-examples/tree/amd-staging/Tools/rocprof-compute)
* [ROCm Systems Profiler](https://github.com/ROCm/rocm-examples/tree/amd-staging/Tools/rocprof-systems)
* [rocprofv3](https://github.com/ROCm/rocm-examples/tree/amd-staging/Tools/rocprofv3)
The complete source code for the [HIP Graph Tutorial](https://rocm.docs.amd.com/projects/HIP/en/latest/tutorial/graph_api.html) is also available as part of the ROCm examples.
The complete source code for the [HIP Graph Tutorial](https://github.com/ROCm/rocm-examples/tree/amd-staging/HIP-Doc/Tutorials/graph_api) is also available as part of the ROCm examples.
### ROCm documentation updates

View File

@@ -37,7 +37,7 @@ ROCm Version,7.1.1,7.1.0,7.0.2,7.0.1/7.0.0,6.4.3,6.4.2,6.4.1,6.4.0,6.3.3,6.3.2,6
:doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>` [#stanford-megatron-lm_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,85f95ae,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>` [#dgl_compat-past-60]_,N/A,N/A,N/A,2.4.0,2.4.0,N/A,N/A,2.4.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>` [#megablocks_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,0.7.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>` [#ray_compat-past-60]_,N/A,N/A,N/A,2.51.1,N/A,N/A,2.48.0.post0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>` [#llama-cpp_compat-past-60]_,N/A,N/A,N/A,b6652,b6356,b6356,b6356,b5997,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
:doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>` [#flashinfer_compat-past-60]_,N/A,N/A,N/A,N/A,N/A,N/A,v0.2.5,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
`ONNX Runtime <https://onnxruntime.ai/docs/build/eps.html#amd-migraphx>`_,1.23.1,1.22.0,1.22.0,1.22.0,1.20.0,1.20.0,1.20.0,1.20.0,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1

View File

@@ -157,8 +157,8 @@ compatibility and system requirements.
.. [#os-compatibility] Some operating systems are supported on limited GPUs. For detailed information, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
.. [#gpu-compatibility] Some GPUs have limited operating system support. For detailed information, see the latest :ref:`supported_GPUs`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-gpus>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-gpus>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-gpus>`__.
.. [#dgl_compat] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
.. [#llama-cpp_compat] llama.cpp is supported only on ROCm 7.0.0 and ROCm 6.4.x.
.. [#dgl_compat] DGL is only supported on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
.. [#llama-cpp_compat] llama.cpp is only supported on ROCm 7.0.0 and ROCm 6.4.x.
.. [#mi325x_KVM] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
.. [#driver_patch] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.
@@ -204,13 +204,13 @@ Expand for full historical view of:
.. [#os-compatibility-past-60] Some operating systems are supported on limited GPUs. For detailed information, see the latest :ref:`supported_distributions`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-operating-systems>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-operating-systems>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-operating-systems>`__.
.. [#gpu-compatibility-past-60] Some GPUs have limited operating system support. For detailed information, see the latest :ref:`supported_GPUs`. For version specific information, see `ROCm 7.1.1 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.1/reference/system-requirements.html#supported-gpus>`__, `ROCm 7.1.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-7.1.0/reference/system-requirements.html#supported-gpus>`__, and `ROCm 6.4.0 <https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.0/reference/system-requirements.html#supported-gpus>`__.
.. [#tf-mi350-past-60] TensorFlow 2.17.1 is not supported on AMD Instinct MI350 Series GPUs. Use TensorFlow 2.19.1 or 2.18.1 with MI350 Series GPUs instead.
.. [#verl_compat-past-60] verl is supported only on ROCm 7.0.0 and 6.2.0.
.. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is supported only on ROCm 6.3.0.
.. [#dgl_compat-past-60] DGL is supported only on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
.. [#megablocks_compat-past-60] Megablocks is supported only on ROCm 6.3.0.
.. [#ray_compat-past-60] Ray is supported only on ROCm 6.4.1.
.. [#llama-cpp_compat-past-60] llama.cpp is supported only on ROCm 7.0.0 and 6.4.x.
.. [#flashinfer_compat-past-60] FlashInfer is supported only on ROCm 6.4.1.
.. [#verl_compat-past-60] verl is only supported on ROCm 7.0.0 and 6.2.0.
.. [#stanford-megatron-lm_compat-past-60] Stanford Megatron-LM is only supported on ROCm 6.3.0.
.. [#dgl_compat-past-60] DGL is only supported on ROCm 7.0.0, ROCm 6.4.3 and ROCm 6.4.0.
.. [#megablocks_compat-past-60] Megablocks is only supported on ROCm 6.3.0.
.. [#ray_compat-past-60] Ray is only supported on ROCm 7.0.0 and 6.4.1.
.. [#llama-cpp_compat-past-60] llama.cpp is only supported on ROCm 7.0.0 and 6.4.x.
.. [#flashinfer_compat-past-60] FlashInfer is only supported on ROCm 6.4.1.
.. [#mi325x_KVM-past-60] For AMD Instinct MI325X KVM SR-IOV users, do not use AMD GPU Driver (amdgpu) 30.20.0.
.. [#driver_patch-past-60] AMD GPU Driver (amdgpu) 30.10.1 is a quality release that resolves an issue identified in the 30.10 release. There are no other significant changes or feature additions in ROCm 7.0.1 from ROCm 7.0.0. AMD GPU Driver (amdgpu) 30.10.1 is compatible with ROCm 7.0.1 and ROCm 7.0.0.
.. [#kfd_support-past-60] As of ROCm 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart. For earlier ROCm releases, the compatibility is provided for +/- 2 releases. The supported user space versions on this page were accurate as of the time of initial ROCm release. For the most up-to-date information, see the latest version of this information at `User and AMD GPU Driver support matrix <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/user-kernel-space-compat-matrix.html>`_.

View File

@@ -36,63 +36,9 @@ Support overview
- You can also consult the upstream `Installation guide <https://www.dgl.ai/pages/start.html>`__
for additional context.
Version support
--------------------------------------------------------------------------------
DGL is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__,
`ROCm 6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__, and `ROCm 6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__.
Supported devices
--------------------------------------------------------------------------------
**Officially Supported**: AMD Instinct™ MI300X, MI250X
.. _dgl-recommendations:
Use cases and recommendations
================================================================================
DGL can be used for Graph Learning, and building popular graph models like
GAT, GCN, and GraphSage. Using these models, a variety of use cases are supported:
- Recommender systems
- Network Optimization and Analysis
- 1D (Temporal) and 2D (Image) Classification
- Drug Discovery
For use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for DGL examples and best practices to optimize your workloads on AMD GPUs.
* Although multiple use cases of DGL have been tested and verified, a few have been
outlined in the `DGL in the Real World: Running GNNs on Real Use Cases
<https://rocm.blogs.amd.com/artificial-intelligence/dgl_blog2/README.html>`__ blog
post, which walks through four real-world graph neural network (GNN) workloads
implemented with the Deep Graph Library on ROCm. It covers tasks ranging from
heterogeneous e-commerce graphs and multiplex networks (GATNE) to molecular graph
regression (GNN-FiLM) and EEG-based neurological diagnosis (EEG-GCNN). For each use
case, the authors detail: the dataset and task, how DGL is used, and their experience
porting to ROCm. It is shown that DGL codebases often run without modification, with
seamless integration of graph operations, message passing, sampling, and convolution.
* The `Graph Neural Networks (GNNs) at Scale: DGL with ROCm on AMD Hardware
<https://rocm.blogs.amd.com/artificial-intelligence/why-graph-neural/README.html>`__
blog post introduces the Deep Graph Library (DGL) and its enablement on the AMD ROCm platform,
bringing high-performance graph neural network (GNN) training to AMD GPUs. DGL bridges
the gap between dense tensor frameworks and the irregular nature of graph data through a
graph-first, message-passing abstraction. Its design ensures scalability, flexibility, and
interoperability across frameworks like PyTorch and TensorFlow. AMD's ROCm integration
enables DGL to run efficiently on HIP-based GPUs, supported by prebuilt Docker containers
and open-source repositories. This marks a major step in AMD's mission to advance open,
scalable AI ecosystems beyond traditional architectures.
You can pre-process datasets and begin training on AMD GPUs through:
* Single-GPU training/inference
* Multi-GPU training
.. _dgl-docker-compat:
Docker image compatibility
Compatibility matrix
================================================================================
.. |docker-icon| raw:: html
@@ -114,6 +60,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- PyTorch
- Ubuntu
- Python
- GPU
* - .. raw:: html
@@ -124,6 +71,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.8.0 <https://github.com/pytorch/pytorch/releases/tag/v2.8.0>`__
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -134,6 +82,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -144,6 +93,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.7.1 <https://github.com/pytorch/pytorch/releases/tag/v2.7.1>`__
- 22.04
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -154,6 +104,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -164,6 +115,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.6.0 <https://github.com/pytorch/pytorch/releases/tag/v2.6.0>`__
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -174,7 +126,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.4.1 <https://github.com/pytorch/pytorch/releases/tag/v2.4.1>`__
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -185,7 +137,7 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.4.1 <https://github.com/pytorch/pytorch/releases/tag/v2.4.1>`__
- 22.04
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
- MI300X, MI250X
* - .. raw:: html
@@ -196,7 +148,10 @@ Click the |docker-icon| to view the image on Docker Hub.
- `2.3.0 <https://github.com/pytorch/pytorch/releases/tag/v2.3.0>`__
- 22.04
- `3.10.16 <https://www.python.org/downloads/release/python-31016/>`__
- MI300X, MI250X
.. _dgl-key-rocm-libraries:
Key ROCm libraries for DGL
================================================================================
@@ -310,8 +265,9 @@ If you prefer to build it yourself, ensure the following dependencies are instal
multiplication (GEMM) and accumulation operations with mixed precision
support.
.. _dgl-supported-features-latest:
Supported features
Supported features with ROCm 7.0.0
================================================================================
Many functions and methods available upstream are also supported in DGL on ROCm.
@@ -335,14 +291,17 @@ Instead of listing them all, support is grouped into the following categories to
* DGL Sparse
* GraphBolt
Unsupported features
.. _dgl-unsupported-features-latest:
Unsupported features with ROCm 7.0.0
================================================================================
* TF32 Support (only supported for PyTorch 2.7 and above)
* Kineto/ROCTracer integration
.. _dgl-unsupported-functions:
Unsupported functions
Unsupported functions with ROCm 7.0.0
================================================================================
* ``bfs``
@@ -355,6 +314,50 @@ Unsupported functions
* ``sample_labors_noprob``
* ``sparse_admin``
.. _dgl-recommendations:
Use cases and recommendations
================================================================================
DGL can be used for Graph Learning, and building popular graph models like
GAT, GCN, and GraphSage. Using these models, a variety of use cases are supported:
- Recommender systems
- Network Optimization and Analysis
- 1D (Temporal) and 2D (Image) Classification
- Drug Discovery
For use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for DGL examples and best practices to optimize your workloads on AMD GPUs.
* Although multiple use cases of DGL have been tested and verified, a few have been
outlined in the `DGL in the Real World: Running GNNs on Real Use Cases
<https://rocm.blogs.amd.com/artificial-intelligence/dgl_blog2/README.html>`__ blog
post, which walks through four real-world graph neural network (GNN) workloads
implemented with the Deep Graph Library on ROCm. It covers tasks ranging from
heterogeneous e-commerce graphs and multiplex networks (GATNE) to molecular graph
regression (GNN-FiLM) and EEG-based neurological diagnosis (EEG-GCNN). For each use
case, the authors detail: the dataset and task, how DGL is used, and their experience
porting to ROCm. It is shown that DGL codebases often run without modification, with
seamless integration of graph operations, message passing, sampling, and convolution.
* The `Graph Neural Networks (GNNs) at Scale: DGL with ROCm on AMD Hardware
<https://rocm.blogs.amd.com/artificial-intelligence/why-graph-neural/README.html>`__
blog post introduces the Deep Graph Library (DGL) and its enablement on the AMD ROCm platform,
bringing high-performance graph neural network (GNN) training to AMD GPUs. DGL bridges
the gap between dense tensor frameworks and the irregular nature of graph data through a
graph-first, message-passing abstraction. Its design ensures scalability, flexibility, and
interoperability across frameworks like PyTorch and TensorFlow. AMD's ROCm integration
enables DGL to run efficiently on HIP-based GPUs, supported by prebuilt Docker containers
and open-source repositories. This marks a major step in AMD's mission to advance open,
scalable AI ecosystems beyond traditional architectures.
You can pre-process datasets and begin training on AMD GPUs through:
* Single-GPU training/inference
* Multi-GPU training
Previous versions
===============================================================================
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/dgl-history` to find documentation for previous releases
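For orientation, the DGL use cases listed above (graph learning with models such as GAT, GCN, and GraphSage) come down to building a graph and running message-passing layers over node features. The following is a minimal sketch only, assuming the rocm/dgl Docker image with DGL and PyTorch installed; the graph, feature sizes, and the "cuda" device string (which maps to the ROCm HIP backend on these builds) are illustrative assumptions, not part of the documentation above.

# Minimal DGL sketch: one GraphConv layer on a toy graph.
# Assumes the rocm/dgl image (DGL + PyTorch); on ROCm, "cuda" maps to HIP.
import dgl
import torch
from dgl.nn import GraphConv

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy graph with 4 nodes and a ring of directed edges, plus self-loops so
# every node also receives its own features during message passing.
g = dgl.graph((torch.tensor([0, 1, 2, 3]), torch.tensor([1, 2, 3, 0]))).to(device)
g = dgl.add_self_loop(g)

feats = torch.randn(4, 8, device=device)   # 4 nodes, 8 input features each
conv = GraphConv(8, 4).to(device)          # single graph-convolution layer

out = conv(g, feats)                        # message passing + aggregation
print(out.shape)                            # torch.Size([4, 4])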

View File

@@ -42,38 +42,9 @@ Support overview
- You can also consult the upstream `Installation guide <https://docs.flashinfer.ai/installation.html>`__
for additional context.
Version support
--------------------------------------------------------------------------------
FlashInfer is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
Supported devices
--------------------------------------------------------------------------------
**Officially Supported**: AMD Instinct™ MI300X
.. _flashinfer-recommendations:
Use cases and recommendations
================================================================================
This release of FlashInfer on ROCm provides the decode functionality for LLM inferencing.
In the decode phase, tokens are generated sequentially, with the model predicting each new
token based on the previously generated tokens and the input context.
FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense
attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm
also implements cascade attention from upstream to reduce memory usage.
For currently supported use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for examples and best practices to optimize your workloads on AMD GPUs.
.. _flashinfer-docker-compat:
Docker image compatibility
Compatibility matrix
================================================================================
.. |docker-icon| raw:: html
@@ -95,6 +66,7 @@ Click |docker-icon| to view the image on Docker Hub.
- PyTorch
- Ubuntu
- Python
- GPU
* - .. raw:: html
@@ -104,5 +76,23 @@ Click |docker-icon| to view the image on Docker Hub.
- `2.7.1 <https://github.com/ROCm/pytorch/releases/tag/v2.7.1>`__
- 24.04
- `3.12 <https://www.python.org/downloads/release/python-3129/>`__
- MI300X
.. _flashinfer-recommendations:
Use cases and recommendations
================================================================================
The release of FlashInfer on ROCm provides the decode functionality for LLM inferencing.
In the decode phase, tokens are generated sequentially, with the model predicting each new
token based on the previously generated tokens and the input context.
FlashInfer on ROCm brings over upstream features such as load balancing, sparse and dense
attention optimizations, and batching support, enabling efficient execution on AMD Instinct™ MI300X GPUs.
Because large LLMs often require substantial KV caches or long context windows, FlashInfer on ROCm
also implements cascade attention from upstream to reduce memory usage.
For currently supported use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for examples and best practices to optimize your workloads on AMD GPUs.
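The decode-phase description above (each new token predicted from the prompt plus all previously generated tokens) corresponds to the loop sketched below in plain PyTorch. This is an illustration of the concept only, not the FlashInfer API; the model interface and greedy argmax selection are assumptions made to keep the sketch short.

# Sketch of the sequential decode loop described above, in plain PyTorch.
# NOT the FlashInfer API; "model" is assumed to map token ids [1, seq] to
# next-token logits [1, seq, vocab]. Greedy argmax keeps the example short.
import torch

def greedy_decode(model, prompt_ids: torch.Tensor, max_new_tokens: int = 32):
    ids = prompt_ids.clone()                      # [1, prompt_len]
    for _ in range(max_new_tokens):
        logits = model(ids)                       # [1, seq_len, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # append the new token
    return ids

A KV cache, the memory that the cascade attention mentioned above helps reduce, would let each step reuse attention state instead of re-encoding the whole sequence.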

View File

@@ -269,6 +269,33 @@ For a complete and up-to-date list of JAX public modules (for example, ``jax.num
JAX API modules are maintained by the JAX project and are subject to change.
Refer to the official JAX documentation for the most up-to-date information.
Key features and enhancements for ROCm 7.1
===============================================================================
- Enabled compilation of multihost HLO runner Python bindings.
- Backported multihost HLO runner bindings and some related changes to
:code:`FunctionalHloRunner`.
- Added :code:`requirements_lock_3_12` to enable building for Python 3.12.
- Removed the hardcoded NHWC convolution layout for ``fp16`` precision to address performance drops on gfx12xx GPUs.
- ROCprofiler-SDK integration:
- Integrated ROCprofiler-SDK (v3) into XLA to improve profiling of GPU events,
supporting both time-based and step-based profiling.
- Added unit tests for :code:`rocm_collector` and :code:`rocm_tracer`.
- Added the previously unsupported Triton conversion from ``f8E4M3FNUZ`` to ``fp16`` with
rounding mode.
- Introduced :code:`CudnnFusedConvDecomposer` to revert fused convolutions
when :code:`ConvAlgorithmPicker` fails to find a fused algorithm, and removed
unfused fallback paths from :code:`RocmFusedConvRunner`.
Key features and enhancements for ROCm 7.0
===============================================================================

View File

@@ -36,47 +36,9 @@ Support overview
- You can also consult the upstream `Installation guide <https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md>`__
for additional context.
Version support
--------------------------------------------------------------------------------
llama.cpp is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__ and
`ROCm 6.4.x <https://repo.radeon.com/rocm/apt/6.4/>`__.
Supported devices
--------------------------------------------------------------------------------
**Officially Supported**: AMD Instinct™ MI325X, MI300X, MI210
Use cases and recommendations
================================================================================
llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
- Plain C/C++ implementation with no external dependencies
- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units)
- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory)
llama.cpp is also used in a range of real-world applications, including:
- Games such as `Lucy's Labyrinth <https://github.com/MorganRO8/Lucys_Labyrinth>`__:
A simple maze game where AI-controlled agents attempt to trick the player.
- Tools such as `Styled Lines <https://marketplace.unity.com/packages/tools/ai-ml-integration/style-text-webgl-ios-stand-alone-llm-llama-cpp-wrapper-292902>`__:
A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
- Various other AI applications use llama.cpp as their inference engine;
for a detailed list, see the `user interfaces (UIs) section <https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#description>`__.
For more use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration <https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html>`__
blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``,
server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for
AMD Instinct GPUs within the ROCm ecosystem.
.. _llama-cpp-docker-compat:
Docker image compatibility
Compatibility matrix
================================================================================
.. |docker-icon| raw:: html
@@ -106,6 +68,7 @@ Click |docker-icon| to view the image on Docker Hub.
- llama.cpp
- ROCm
- Ubuntu
- GPU
* - .. raw:: html
@@ -119,6 +82,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
- 24.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -132,6 +96,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6652 <https://github.com/ROCm/llama.cpp/tree/release/b6652>`__
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
- 22.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -145,6 +110,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
- 24.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -158,7 +124,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.3 <https://repo.radeon.com/rocm/apt/6.4.3/>`__
- 22.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -172,6 +138,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
- 24.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -185,7 +152,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.2 <https://repo.radeon.com/rocm/apt/6.4.2/>`__
- 22.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -199,6 +166,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
- 24.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -212,6 +180,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `b6356 <https://github.com/ROCm/llama.cpp/tree/release/b6356>`__
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
- 22.04
- MI325X, MI300X, MI210
* - .. raw:: html
@@ -225,7 +194,9 @@ Click |docker-icon| to view the image on Docker Hub.
- `b5997 <https://github.com/ROCm/llama.cpp/tree/release/b5997>`__
- `6.4.0 <https://repo.radeon.com/rocm/apt/6.4/>`__
- 24.04
- MI300X, MI210
.. _llama-cpp-key-rocm-libraries:
Key ROCm libraries for llama.cpp
================================================================================
@@ -268,6 +239,36 @@ your corresponding ROCm version.
- Can be used to enhance the flash attention performance on AMD compute, by enabling
the flag during compile time.
.. _llama-cpp-uses-recommendations:
Use cases and recommendations
================================================================================
llama.cpp can be applied in a variety of scenarios, particularly when you need to meet one or more of the following requirements:
- Plain C/C++ implementation with no external dependencies
- Support for 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
- Custom HIP (Heterogeneous-compute Interface for Portability) kernels for running large language models (LLMs) on AMD GPUs (graphics processing units)
- CPU (central processing unit) + GPU (graphics processing unit) hybrid inference for partially accelerating models larger than the total available VRAM (video random-access memory)
llama.cpp is also used in a range of real-world applications, including:
- Games such as `Lucy's Labyrinth <https://github.com/MorganRO8/Lucys_Labyrinth>`__:
A simple maze game where AI-controlled agents attempt to trick the player.
- Tools such as `Styled Lines <https://marketplace.unity.com/packages/tools/ai-ml-integration/style-text-webgl-ios-stand-alone-llm-llama-cpp-wrapper-292902>`__:
A proprietary, asynchronous inference wrapper for Unity3D game development, including pre-built mobile and web platform wrappers and a model example.
- Various other AI applications use llama.cpp as their inference engine;
for a detailed list, see the `user interfaces (UIs) section <https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#description>`__.
For more use cases and recommendations, refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for llama.cpp examples and best practices to optimize your workloads on AMD GPUs.
- The `Llama.cpp Meets Instinct: A New Era of Open-Source AI Acceleration <https://rocm.blogs.amd.com/ecosystems-and-partners/llama-cpp/README.html>`__
blog post outlines how the open-source llama.cpp framework enables efficient LLM inference—including interactive inference with ``llama-cli``,
server deployment with ``llama-server``, GGUF model preparation and quantization, performance benchmarking, and optimizations tailored for
AMD Instinct GPUs within the ROCm ecosystem.
Previous versions
===============================================================================
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/llama-cpp-history` to find documentation for previous releases
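The CPU + GPU hybrid inference point above is typically exercised through the GPU layer-offload option of ``llama-cli``. Below is a hedged sketch that drives the command from Python; the GGUF model path is a placeholder, and the flag names (-m, -ngl, -n, -p) should be verified against ``llama-cli --help`` for the build shipped in the Docker image.

# Sketch of CPU + GPU hybrid inference with llama-cli, as described above.
# The GGUF path is a placeholder; -ngl sets how many transformer layers are
# offloaded to the GPU, with the remaining layers kept on the CPU.
import subprocess

subprocess.run(
    [
        "llama-cli",
        "-m", "/models/model.gguf",   # placeholder GGUF model path
        "-ngl", "24",                 # offload 24 layers to the AMD GPU
        "-n", "64",                   # generate up to 64 tokens
        "-p", "Explain ROCm in one sentence.",
    ],
    check=True,
)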

View File

@@ -33,19 +33,44 @@ Support overview
- You can also consult the upstream `Installation guide <https://github.com/databricks/megablocks>`__
for additional context.
Version support
--------------------------------------------------------------------------------
.. _megablocks-docker-compat:
Megablocks is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
Compatibility matrix
================================================================================
Supported devices
--------------------------------------------------------------------------------
.. |docker-icon| raw:: html
- **Officially Supported**: AMD Instinct™ MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct™ MI250X, MI210
<i class="fab fa-docker"></i>
Supported models and features
--------------------------------------------------------------------------------
AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
with ROCm backends on Docker Hub. The following Docker image tag and associated
inventories represent the latest available Megablocks version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Megablocks
- PyTorch
- Ubuntu
- Python
- GPU
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/megablocks/megablocks-0.7.0_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-372ff89b96599019b8f5f9db469c84add2529b713456781fa62eb9a148659ab4"><i class="fab fa-docker fa-lg"></i> rocm/megablocks</a>
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
- `0.7.0 <https://github.com/databricks/megablocks/releases/tag/v0.7.0>`_
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
- MI300X
Supported models and features with ROCm 6.3.0
================================================================================
This section summarizes the Megablocks features supported by ROCm.
@@ -77,38 +102,3 @@ It features how to pre-process datasets and how to begin pre-training on AMD GPU
* Single-GPU pre-training
* Multi-GPU pre-training
.. _megablocks-docker-compat:
Docker image compatibility
================================================================================
.. |docker-icon| raw:: html
<i class="fab fa-docker"></i>
AMD validates and publishes `Megablocks images <https://hub.docker.com/r/rocm/megablocks/tags>`__
with ROCm backends on Docker Hub. The following Docker image tag and associated
inventories represent the latest available Megablocks version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Megablocks
- PyTorch
- Ubuntu
- Python
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/megablocks/megablocks-0.7.0_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-372ff89b96599019b8f5f9db469c84add2529b713456781fa62eb9a148659ab4"><i class="fab fa-docker fa-lg"></i> rocm/megablocks</a>
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
- `0.7.0 <https://github.com/databricks/megablocks/releases/tag/v0.7.0>`_
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_

View File

@@ -12,8 +12,8 @@ Ray compatibility
Ray is a unified framework for scaling AI and Python applications from your laptop
to a full cluster, without changing your code. Ray consists of `a core distributed
runtime <https://docs.ray.io/en/latest/ray-core/walkthrough.html>`_ and a set of
`AI libraries <https://docs.ray.io/en/latest/ray-air/getting-started.html>`_ for
runtime <https://docs.ray.io/en/latest/ray-core/walkthrough.html>`__ and a set of
`AI libraries <https://docs.ray.io/en/latest/ray-air/getting-started.html>`__ for
simplifying machine learning computations.
Ray is a general-purpose framework that runs many types of workloads efficiently.
@@ -29,25 +29,57 @@ Support overview
- To get started and install Ray on ROCm, use the prebuilt :ref:`Docker image <ray-docker-compat>`,
which includes ROCm, Ray, and all required dependencies.
- The Docker image provided is based on the upstream Ray `Daily Release (Nightly) wheels
<https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies>`__
corresponding to commit `005c372 <https://github.com/ray-project/ray/commit/005c372262e050d5745f475e22e64305fa07f8b8>`__.
- See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
- See the :doc:`ROCm Ray installation guide <rocm-install-on-linux:install/3rd-party/ray-install>`
for installation and setup instructions.
- You can also consult the upstream `Installation guide <https://docs.ray.io/en/latest/ray-overview/installation.html>`__
for additional context.
Version support
--------------------------------------------------------------------------------
.. _ray-docker-compat:
Ray is supported on `ROCm 6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
Compatibility matrix
================================================================================
Supported devices
--------------------------------------------------------------------------------
.. |docker-icon| raw:: html
**Officially Supported**: AMD Instinct™ MI300X, MI210
<i class="fab fa-docker"></i>
AMD validates and publishes `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
with ROCm backends on Docker Hub. The following Docker image tags and
associated inventories represent the latest Ray version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Ray
- PyTorch
- Ubuntu
- Python
- GPU
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.51.1_rocm7.0.0_ubuntu22.04_py3.12_pytorch2.9.0/images/sha256-a02f6766b4ba406f88fd7e85707ec86c04b569834d869a08043ec9bcbd672168"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
- `7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__
- `2.51.1 <https://github.com/ROCm/ray/tree/release/2.51.1>`__
- 2.9.0a0+git1c57644
- 22.04
- `3.12.12 <https://www.python.org/downloads/release/python-31212/>`__
- MI300X
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__
- `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`__
- 2.6.0+git684f6f2
- 24.04
- `3.12.10 <https://www.python.org/downloads/release/python-31210/>`__
- MI300X, MI210
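As a quick sanity sketch (not an official workflow), you can pull the newest tag from the table and confirm that Ray is importable inside the container; the flags are the standard ROCm Docker options.

.. code-block:: shell

   # Pull and start the latest validated Ray image from the table above.
   docker pull rocm/ray:ray-2.51.1_rocm7.0.0_ubuntu22.04_py3.12_pytorch2.9.0
   docker run -it --rm \
     --device /dev/kfd --device /dev/dri \
     --group-add video --ipc=host \
     rocm/ray:ray-2.51.1_rocm7.0.0_ubuntu22.04_py3.12_pytorch2.9.0

   # Inside the container: verify the Ray version that ships with the image.
   python3 -c "import ray; print(ray.__version__)"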
Use cases and recommendations
================================================================================
@@ -76,36 +108,7 @@ topic <https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#accel
of the Ray core documentation and refer to the `AMD ROCm blog <https://rocm.blogs.amd.com/>`__,
where you can search for Ray examples and best practices to optimize your workloads on AMD GPUs.
.. _ray-docker-compat:
Docker image compatibility
================================================================================
.. |docker-icon| raw:: html
<i class="fab fa-docker"></i>
AMD validates and publishes ready-made `ROCm Ray Docker images <https://hub.docker.com/r/rocm/ray/tags>`__
with ROCm backends on Docker Hub. The following Docker image tags and
associated inventories represent the latest Ray version from the official Docker Hub.
Click the |docker-icon| icon to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Ray
- Pytorch
- Ubuntu
- Python
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/ray/ray-2.48.0.post0_rocm6.4.1_ubuntu24.04_py3.12_pytorch2.6.0/images/sha256-0d166fe6bdced38338c78eedfb96eff92655fb797da3478a62dd636365133cc0"><i class="fab fa-docker fa-lg"></i> rocm/ray</a>
- `6.4.1 <https://repo.radeon.com/rocm/apt/6.4.1/>`__.
- `2.48.0.post0 <https://github.com/ROCm/ray/tree/release/2.48.0.post0>`_
- 2.6.0+git684f6f2
- 24.04
- `3.12.10 <https://www.python.org/downloads/release/python-31210/>`_
Previous versions
===============================================================================
See :doc:`rocm-install-on-linux:install/3rd-party/previous-versions/ray-history` to find documentation for previous releases
of the ``ROCm/ray`` Docker image.

View File

@@ -35,19 +35,45 @@ Support overview
- You can also consult the upstream `Installation guide <https://github.com/NVIDIA/Megatron-LM>`__
for additional context.
Version support
--------------------------------------------------------------------------------
.. _megatron-lm-docker-compat:
Stanford Megatron-LM is supported on `ROCm 6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`__.
Compatibility matrix
================================================================================
Supported devices
--------------------------------------------------------------------------------
.. |docker-icon| raw:: html
- **Officially Supported**: AMD Instinct™ MI300X
- **Partially Supported** (functionality or performance limitations): AMD Instinct™ MI250X, MI210
<i class="fab fa-docker"></i>
Supported models and features
--------------------------------------------------------------------------------
AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
with ROCm and PyTorch backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Stanford Megatron-LM
- PyTorch
- Ubuntu
- Python
- GPU
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i> rocm/stanford-megatron-lm</a>
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
- `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_
- MI300X
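A corresponding quick-start sketch for this image (tag taken from the table above, standard ROCm container flags; adjust to your environment):

.. code-block:: shell

   docker pull rocm/stanford-megatron-lm:stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0
   docker run -it --rm --device /dev/kfd --device /dev/dri \
     --group-add video --ipc=host \
     rocm/stanford-megatron-lm:stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0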
Supported models and features with ROCm 6.3.0
================================================================================
This section details the models and features supported by Stanford Megatron-LM on this ROCm version.
@@ -88,41 +114,3 @@ It features how to pre-process datasets and how to begin pre-training on AMD GPU
* Single-GPU pre-training
* Multi-GPU pre-training
.. _megatron-lm-docker-compat:
Docker image compatibility
================================================================================
.. |docker-icon| raw:: html
<i class="fab fa-docker"></i>
AMD validates and publishes `Stanford Megatron-LM images <https://hub.docker.com/r/rocm/stanford-megatron-lm/tags>`_
with ROCm and Pytorch backends on Docker Hub. The following Docker image tags and associated
inventories represent the latest Stanford Megatron-LM version from the official Docker Hub.
Click |docker-icon| to view the image on Docker Hub.
.. list-table::
:header-rows: 1
:class: docker-image-compatibility
* - Docker image
- ROCm
- Stanford Megatron-LM
- PyTorch
- Ubuntu
- Python
* - .. raw:: html
<a href="https://hub.docker.com/layers/rocm/stanford-megatron-lm/stanford-megatron-lm85f95ae_rocm6.3.0_ubuntu24.04_py3.12_pytorch2.4.0/images/sha256-070556f078be10888a1421a2cb4f48c29f28b02bfeddae02588d1f7fc02a96a6"><i class="fab fa-docker fa-lg"></i></a>
- `6.3.0 <https://repo.radeon.com/rocm/apt/6.3/>`_
- `85f95ae <https://github.com/stanford-futuredata/Megatron-LM/commit/85f95aef3b648075fe6f291c86714fdcbd9cd1f5>`_
- `2.4.0 <https://github.com/ROCm/pytorch/tree/release/2.4>`_
- 24.04
- `3.12.9 <https://www.python.org/downloads/release/python-3129/>`_

View File

@@ -37,67 +37,9 @@ Support overview
- You can also consult the upstream `verl documentation <https://verl.readthedocs.io/en/latest/>`__
for additional context.
Version support
--------------------------------------------------------------------------------
verl is supported on `ROCm 7.0.0 <https://repo.radeon.com/rocm/apt/7.0/>`__ and
`ROCm 6.2.0 <https://repo.radeon.com/rocm/apt/6.2/>`__.
Supported devices
--------------------------------------------------------------------------------
**Officially Supported**: AMD Instinct™ MI300X
.. _verl-recommendations:
Use cases and recommendations
================================================================================
* The benefits of verl in large-scale reinforcement learning from human feedback
(RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD
GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__
blog. The blog post outlines how the Volcano Engine Reinforcement Learning
(verl) framework integrates with the AMD ROCm platform to optimize training on
AMD Instinct™ GPUs. The guide details the process of building a Docker image,
setting up single-node and multi-node training environments, and highlights
performance benchmarks demonstrating improved throughput and convergence accuracy.
This resource serves as a comprehensive starting point for deploying verl on AMD GPUs,
facilitating efficient RLHF training workflows.
.. _verl-supported_features:
Supported features
===============================================================================
The following table shows verl on ROCm support for GPU-accelerated modules.
.. list-table::
:header-rows: 1
* - Module
- Description
- verl version
- ROCm version
* - ``FSDP``
- Training engine
-
* 0.6.0
* 0.3.0.post0
-
* 7.0.0
* 6.2.0
* - ``vllm``
- Inference engine
-
* 0.6.0
* 0.3.0.post0
-
* 7.0.0
* 6.2.0
.. _verl-docker-compat:
Docker image compatibility
Compatibility matrix
================================================================================
.. |docker-icon| raw:: html
@@ -120,6 +62,7 @@ Click |docker-icon| to view the image on Docker Hub.
- PyTorch
- Python
- vllm
- GPU
* - .. raw:: html
@@ -130,6 +73,7 @@ Click |docker-icon| to view the image on Docker Hub.
- `2.9.0 <https://github.com/ROCm/pytorch/tree/release/2.9-rocm7.x-gfx115x>`__
- `3.12.11 <https://www.python.org/downloads/release/python-31211/>`__
- `0.11.0 <https://github.com/vllm-project/vllm/releases/tag/v0.11.0>`__
- MI300X
* - .. raw:: html
@@ -140,7 +84,33 @@ Click |docker-icon| to view the image on Docker Hub.
- `2.5.0 <https://github.com/ROCm/pytorch/tree/release/2.5>`__
- `3.9.19 <https://www.python.org/downloads/release/python-3919/>`__
- `0.6.3 <https://github.com/vllm-project/vllm/releases/tag/v0.6.3>`__
- MI300X
.. _verl-supported_features:
Supported modules with verl on ROCm
===============================================================================
The following GPU-accelerated modules are supported with verl on ROCm:
- ``FSDP``: Training engine
- ``vllm``: Inference engine
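As a rough sketch of how these two engines are combined in practice, the command below launches verl's PPO trainer with FSDP as the training engine and vLLM as the rollout (inference) engine. The module path and Hydra-style option names are assumptions based on upstream verl conventions, not commands taken from this guide; consult the verl documentation for the exact keys in your release.

.. code-block:: shell

   # Hypothetical single-node PPO run: FSDP training engine, vLLM rollout engine.
   # Option names follow upstream verl's Hydra config and may differ between releases.
   python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.actor.strategy=fsdp \
     actor_rollout_ref.rollout.name=vllm \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     trainer.n_gpus_per_node=8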
.. _verl-recommendations:
Use cases and recommendations
================================================================================
* The benefits of verl in large-scale reinforcement learning from human feedback
(RLHF) are discussed in the `Reinforcement Learning from Human Feedback on AMD
GPUs with verl and ROCm Integration <https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html>`__
blog. The blog post outlines how the Volcano Engine Reinforcement Learning
(verl) framework integrates with the AMD ROCm platform to optimize training on
AMD Instinct™ GPUs. The guide details the process of building a Docker image,
setting up single-node and multi-node training environments, and highlights
performance benchmarks demonstrating improved throughput and convergence accuracy.
This resource serves as a comprehensive starting point for deploying verl on AMD GPUs,
facilitating efficient RLHF training workflows.
Previous versions
===============================================================================

View File

@@ -268,6 +268,3 @@ html_context = {
"granularity_type" : [('Coarse-grained', 'coarse-grained'), ('Fine-grained', 'fine-grained')],
"scope_type" : [('Device', 'device'), ('System', 'system')]
}
# Disable figure and table numbering
numfig = False

View File

@@ -8,6 +8,303 @@ dockers:
hipBLASLt: 1.0.0
dockerfile:
commit: 8398684622109c806a35d660647060b0b9910663
configs:
default:
## DeepSeek AITER MLA currently only supports --block-size 1
- &deepseek-r1-serving
benchmark: serving
model: deepseek-ai/DeepSeek-R1-0528
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
extra_args:
async-scheduling: True
block-size: 1
## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
- &gpt-oss-120b-serving
benchmark: serving
model: openai/gpt-oss-120b
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
env:
VLLM_ROCM_USE_AITER_MHA: 0
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION: 1
extra_args:
async-scheduling: True
block-size: 64
compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
- &llama-3-serving
benchmark: serving
model:
meta-llama/Llama-3.1-405B-Instruct
amd/Llama-3.1-405B-Instruct-FP8-KV
meta-llama/Llama-3.3-70B-Instruct
amd/Llama-3.3-70B-Instruct-FP8-KV
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
extra_args:
async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
## Llama 3.x MXFP4 (gfx950 only)
- &llama-3-mxfp4-serving
benchmark: serving
model:
amd/Llama-3.1-405B-Instruct-MXFP4-Preview
amd/Llama-3.3-70B-Instruct-MXFP4-Preview
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
extra_args:
async-scheduling: True
## Llama 4 currently does not support full cudagraph or attn fusion
- &llama-4-fp8-serving
benchmark: serving
model:
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
extra_args:
async-scheduling: True
compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
arch_overrides:
gfx942:
dtype: float16
- &mixtral-8x22b-serving
benchmark: serving
model:
mistralai/Mixtral-8x22B-Instruct-v0.1
amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1 8 32 128
extra_args:
async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
extended:
## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
- &gpt-oss-20b-serving
benchmark: serving
model:
openai/gpt-oss-20b
tp: 1
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
env:
VLLM_ROCM_USE_AITER_MHA: 0
VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION: 1
extra_args:
async-scheduling: True
block-size: 64
compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
- &llama-3-8b-phi-4-qwen3-serving
benchmark: serving
model:
meta-llama/Llama-3.1-8B-Instruct
amd/Llama-3.1-8B-Instruct-FP8-KV
microsoft/phi-4
Qwen/Qwen3-8B
Qwen/Qwen3-32B
Qwen/Qwen3-30B-A3B-Thinking-2507
Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
tp: 1
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
extra_args:
async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
- &llama-2-70b-serving
benchmark: serving
model:
meta-llama/Llama-2-70b-chat-hf
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
extra_args:
async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
## Llama 4 currently does not support full cudagraph or attn fusion
- &llama-4-serving
benchmark: serving
model:
meta-llama/Llama-4-Scout-17B-16E-Instruct
meta-llama/Llama-4-Maverick-17B-128E-Instruct
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
extra_args:
async-scheduling: True
compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
arch_overrides:
gfx942:
dtype: float16
- &mixtral-8x7b-serving
benchmark: serving
model:
mistralai/Mixtral-8x7B-Instruct-v0.1
amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
extra_args:
async-scheduling: True
arch_overrides:
gfx942:
dtype: float16
## Qwen 235B requires --enable-expert-parallel with tp 8
- &qwen3-235b-a22b-serving
benchmark: serving
model:
Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
tp: 8
inp: 1024
out: 1024
dtype: auto
max_concurrency: 1
extra_args:
async-scheduling: True
enable-expert-parallel: True
arch_overrides:
gfx942:
dtype: float16
accuracy:
## DeepSeek AITER MLA currently only supports --block-size 1
- &deepseek-r1-accuracy
benchmark: accuracy
model: deepseek-ai/DeepSeek-R1-0528
tp: 8
dtype: auto
extra_args:
async-scheduling: True
block-size: 1
bench_args:
apply_chat_template: True
## gpt-oss requires AITER unified attention and performs best with block-size 64 and FULL_AND_PIECEWISE cudagraph mode
- &gpt-oss-120b-accuracy
benchmark: accuracy
model: openai/gpt-oss-120b
tp: 8
dtype: auto
env:
VLLM_ROCM_USE_AITER_MHA: 0
VLLM_USE_AITER_UNIFIED_ATTENTION: 1
extra_args:
async-scheduling: True
block-size: 64
compilation-config: '{\"cudagraph_mode\":\"FULL_AND_PIECEWISE\"}'
bench_args:
apply_chat_template: True
## Llama 3.x bf16 and fp8 perform better with --dtype float16 on gfx942
- &llama-3-accuracy
benchmark: accuracy
model:
meta-llama/Llama-3.1-405B-Instruct
amd/Llama-3.1-405B-Instruct-FP8-KV
meta-llama/Llama-3.3-70B-Instruct
amd/Llama-3.3-70B-Instruct-FP8-KV
tp: 8
dtype: auto
extra_args:
async-scheduling: True
bench_args:
apply_chat_template: True
arch_overrides:
gfx942:
dtype: float16
## Llama 3.x MXFP4 (gfx950 only)
- &llama-3-mxfp4-accuracy
benchmark: accuracy
model:
amd/Llama-3.1-405B-Instruct-MXFP4-Preview
amd/Llama-3.3-70B-Instruct-MXFP4-Preview
tp: 8
dtype: auto
extra_args:
async-scheduling: True
bench_args:
apply_chat_template: True
## Llama 4 currently does not support full cudagraph or attn fusion
- &llama-4-fp8-accuracy
benchmark: accuracy
model:
meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
tp: 8
dtype: auto
extra_args:
async-scheduling: True
compilation-config: '{\"cudagraph_mode\":\"PIECEWISE\",\"pass_config\":{\"enable_attn_fusion\":false}}'
bench_args:
apply_chat_template: True
arch_overrides:
gfx942:
dtype: float16
## Mistral models require --tokenizer-mode mistral for correct decoding
- &mixtral-8x22b-accuracy
benchmark: accuracy
model:
mistralai/Mixtral-8x22B-Instruct-v0.1
amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
tp: 8
dtype: auto
extra_args:
async-scheduling: True
bench_args:
apply_chat_template: True
arch_overrides:
gfx942:
dtype: float16
## Qwen 235B requires --enable-expert-parallel with tp 8
- &qwen3-235b-a22b-accuracy
benchmark: accuracy
model:
Qwen/Qwen3-235B-A22B-Thinking-2507
Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
dtype: auto
extra_args:
async-scheduling: True
enable-expert-parallel: True
bench_args:
apply_chat_template: True
arch_overrides:
gfx942:
dtype: float16
model_groups:
- group: Meta Llama
tag: llama
@@ -18,132 +315,139 @@ model_groups:
url: https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 4096
max_model_len: 4096
serving: *llama-2-70b-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 4096
max_model_len: 4096
- model: Llama 3.1 8B
mad_tag: pyt_vllm_llama-3.1-8b
model_repo: meta-llama/Llama-3.1-8B-Instruct
url: https://huggingface.co/meta-llama/Llama-3.1-8B
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.1 8B FP8
mad_tag: pyt_vllm_llama-3.1-8b_fp8
model_repo: amd/Llama-3.1-8B-Instruct-FP8-KV
url: https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV
precision: float8
config:
tp: 1
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.1 405B
mad_tag: pyt_vllm_llama-3.1-405b
model_repo: meta-llama/Llama-3.1-405B-Instruct
url: https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-serving
accuracy: *llama-3-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.1 405B FP8
mad_tag: pyt_vllm_llama-3.1-405b_fp8
model_repo: amd/Llama-3.1-405B-Instruct-FP8-KV
url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-serving
accuracy: *llama-3-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.1 405B MXFP4
mad_tag: pyt_vllm_llama-3.1-405b_fp4
model_repo: amd/Llama-3.1-405B-Instruct-MXFP4-Preview
url: https://huggingface.co/amd/Llama-3.1-405B-Instruct-MXFP4-Preview
precision: float4
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-mxfp4-serving
accuracy: *llama-3-mxfp4-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.3 70B
mad_tag: pyt_vllm_llama-3.3-70b
model_repo: meta-llama/Llama-3.3-70B-Instruct
url: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-serving
accuracy: *llama-3-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.3 70B FP8
mad_tag: pyt_vllm_llama-3.3-70b_fp8
model_repo: amd/Llama-3.3-70B-Instruct-FP8-KV
url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-serving
accuracy: *llama-3-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 3.3 70B MXFP4
mad_tag: pyt_vllm_llama-3.3-70b_fp4
model_repo: amd/Llama-3.3-70B-Instruct-MXFP4-Preview
url: https://huggingface.co/amd/Llama-3.3-70B-Instruct-MXFP4-Preview
precision: float4
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-3-mxfp4-serving
accuracy: *llama-3-mxfp4-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- model: Llama 4 Scout 17Bx16E
mad_tag: pyt_vllm_llama-4-scout-17b-16e
model_repo: meta-llama/Llama-4-Scout-17B-16E-Instruct
url: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
serving: *llama-4-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
- model: Llama 4 Maverick 17Bx128E
mad_tag: pyt_vllm_llama-4-maverick-17b-128e
model_repo: meta-llama/Llama-4-Maverick-17B-128E-Instruct
url: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
serving: *llama-4-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
- model: Llama 4 Maverick 17Bx128E FP8
mad_tag: pyt_vllm_llama-4-maverick-17b-128e_fp8
model_repo: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
url: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *llama-4-fp8-serving
accuracy: *llama-4-fp8-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- group: DeepSeek
tag: deepseek
models:
@@ -153,12 +457,12 @@ model_groups:
url: https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_seqs: 1024
max_num_batched_tokens: 131072
max_model_len: 8192
serving: *deepseek-r1-serving
accuracy: *deepseek-r1-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 131072
max_model_len: 8192
- group: OpenAI GPT OSS
tag: gpt-oss
models:
@@ -168,22 +472,23 @@ model_groups:
url: https://huggingface.co/openai/gpt-oss-20b
precision: bfloat16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 8192
max_model_len: 8192
serving: *gpt-oss-20b-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 8192
max_model_len: 8192
- model: GPT OSS 120B
mad_tag: pyt_vllm_gpt-oss-120b
model_repo: openai/gpt-oss-120b
url: https://huggingface.co/openai/gpt-oss-120b
precision: bfloat16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 8192
max_model_len: 8192
serving: *gpt-oss-120b-serving
accuracy: *gpt-oss-120b-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 8192
max_model_len: 8192
- group: Mistral AI
tag: mistral
models:
@@ -193,44 +498,46 @@ model_groups:
url: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
serving: *mixtral-8x7b-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 32768
max_model_len: 8192
- model: Mixtral MoE 8x7B FP8
mad_tag: pyt_vllm_mixtral-8x7b_fp8
model_repo: amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
url: https://huggingface.co/amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 32768
max_model_len: 8192
serving: *mixtral-8x7b-serving
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 32768
max_model_len: 8192
- model: Mixtral MoE 8x22B
mad_tag: pyt_vllm_mixtral-8x22b
model_repo: mistralai/Mixtral-8x22B-Instruct-v0.1
url: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 65536
max_model_len: 8192
serving: *mixtral-8x22b-serving
accuracy: *mixtral-8x22b-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 65536
max_model_len: 8192
- model: Mixtral MoE 8x22B FP8
mad_tag: pyt_vllm_mixtral-8x22b_fp8
model_repo: amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
url: https://huggingface.co/amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 65536
max_model_len: 8192
serving: *mixtral-8x22b-serving
accuracy: *mixtral-8x22b-accuracy
ex:
kv_cache_dtype: fp8
max_num_batched_tokens: 65536
max_model_len: 8192
- group: Qwen
tag: qwen
models:
@@ -240,66 +547,68 @@ model_groups:
url: https://huggingface.co/Qwen/Qwen3-8B
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 32B
mad_tag: pyt_vllm_qwen3-32b
model_repo: Qwen/Qwen3-32b
model_repo: Qwen/Qwen3-32B
url: https://huggingface.co/Qwen/Qwen3-32B
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 30B A3B
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 30B A3B Thinking
mad_tag: pyt_vllm_qwen3-30b-a3b
model_repo: Qwen/Qwen3-30B-A3B
url: https://huggingface.co/Qwen/Qwen3-30B-A3B
model_repo: Qwen/Qwen3-30B-A3B-Thinking-2507
url: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 30B A3B FP8
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 30B A3B Thinking FP8
mad_tag: pyt_vllm_qwen3-30b-a3b_fp8
model_repo: Qwen/Qwen3-30B-A3B-FP8
url: https://huggingface.co/Qwen/Qwen3-30B-A3B-FP8
model_repo: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
url: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 235B A22B
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 235B A22B Thinking
mad_tag: pyt_vllm_qwen3-235b-a22b
model_repo: Qwen/Qwen3-235B-A22B
url: https://huggingface.co/Qwen/Qwen3-235B-A22B
model_repo: Qwen/Qwen3-235B-A22B-Thinking-2507
url: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
precision: float16
config:
tp: 8
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 235B A22B FP8
serving: *qwen3-235b-a22b-serving
accuracy: *qwen3-235b-a22b-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- model: Qwen3 235B A22B Thinking FP8
mad_tag: pyt_vllm_qwen3-235b-a22b_fp8
model_repo: Qwen/Qwen3-235B-A22B-FP8
url: https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8
model_repo: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
url: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
precision: float8
config:
tp: 8
dtype: auto
kv_cache_dtype: fp8
max_num_batched_tokens: 40960
max_model_len: 8192
serving: *qwen3-235b-a22b-serving
accuracy: *qwen3-235b-a22b-accuracy
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 40960
max_model_len: 8192
- group: Microsoft Phi
tag: phi
models:
@@ -309,8 +618,8 @@ model_groups:
url: https://huggingface.co/microsoft/phi-4
precision: float16
config:
tp: 1
dtype: auto
kv_cache_dtype: auto
max_num_batched_tokens: 16384
max_model_len: 8192
serving: *llama-3-8b-phi-4-qwen3-serving
ex:
kv_cache_dtype: auto
max_num_batched_tokens: 16384
max_model_len: 8192

View File

@@ -19,117 +19,95 @@ The table below summarizes information about ROCm-enabled deep learning framewor
:widths: 5 3 6 3
* - Framework
- Installation
- Installation guide
- Installation options
- GitHub
* - `PyTorch <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/pytorch-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`PyTorch <../compatibility/ml-compatibility/pytorch-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/pytorch-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-a-docker-image-with-pytorch-pre-installed>`__
- `Wheels package <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-a-wheels-package>`__
- `ROCm Base Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-the-pytorch-rocm-base-docker-image>`__
- `Upstream Docker file <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-the-pytorch-upstream-dockerfile>`__
- Docker image
- Wheels package
- ROCm Base Docker image
- Upstream Docker file
- .. raw:: html
<a href="https://github.com/ROCm/pytorch"><i class="fab fa-github fa-lg"></i></a>
* - `TensorFlow <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/tensorflow-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/tensorflow-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`TensorFlow <../compatibility/ml-compatibility/tensorflow-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/tensorflow-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/tensorflow-install.html#using-a-docker-image-with-tensorflow-pre-installed>`__
- `Wheels package <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/tensorflow-install.html#using-a-wheels-package>`__
- Docker image
- Wheels package
- .. raw:: html
<a href="https://github.com/ROCm/tensorflow-upstream"><i class="fab fa-github fa-lg"></i></a>
* - `JAX <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/jax-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/jax-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`JAX <../compatibility/ml-compatibility/jax-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/jax-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/jax-install.html#using-a-prebuilt-docker-image>`__
- Docker image
- .. raw:: html
<a href="https://github.com/ROCm/jax"><i class="fab fa-github fa-lg"></i></a>
* - `verl <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/verl-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/verl-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`verl <../compatibility/ml-compatibility/verl-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/verl-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/verl-install.html#use-a-prebuilt-docker-image-with-verl-pre-installed>`__
- Docker image
- .. raw:: html
<a href="https://github.com/ROCm/verl"><i class="fab fa-github fa-lg"></i></a>
* - `Stanford Megatron-LM <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/stanford-megatron-lm-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/stanford-megatron-lm-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`Stanford Megatron-LM <../compatibility/ml-compatibility/stanford-megatron-lm-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/stanford-megatron-lm-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/stanford-megatron-lm-install.html#use-a-prebuilt-docker-image-with-stanford-megatron-lm-pre-installed>`__
- Docker image
- .. raw:: html
<a href="https://github.com/ROCm/Stanford-Megatron-LM"><i class="fab fa-github fa-lg"></i></a>
* - `DGL <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/dgl-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/dgl-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`DGL <../compatibility/ml-compatibility/dgl-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/dgl-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/dgl-install.html#use-a-prebuilt-docker-image-with-dgl-pre-installed>`__
- `Wheels package <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/dgl-install.html#use-a-wheels-package>`__
- Docker image
- .. raw:: html
<a href="https://github.com/ROCm/dgl"><i class="fab fa-github fa-lg"></i></a>
* - `Megablocks <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/megablocks-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/megablocks-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`Megablocks <../compatibility/ml-compatibility/megablocks-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/megablocks-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/megablocks-install.html#using-a-prebuilt-docker-image-with-megablocks-pre-installed>`__
- Docker image
- .. raw:: html
<a href="https://github.com/ROCm/megablocks"><i class="fab fa-github fa-lg"></i></a>
* - `Ray <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/ray-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/ray-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`Ray <../compatibility/ml-compatibility/ray-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/ray-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/ray-install.html#using-a-prebuilt-docker-image-with-ray-pre-installed>`__
- `Wheels package <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/ray-install.html#install-ray-on-bare-metal-or-a-custom-container>`__
- `ROCm Base Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/ray-install.html#build-your-own-docker-image>`__
- Docker image
- Wheels package
- ROCm Base Docker image
- .. raw:: html
<a href="https://github.com/ROCm/ray"><i class="fab fa-github fa-lg"></i></a>
* - `llama.cpp <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/llama-cpp-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`llama.cpp <../compatibility/ml-compatibility/llama-cpp-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/llama-cpp-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html#use-a-prebuilt-docker-image-with-llama-cpp-pre-installed>`__
- `ROCm Base Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/llama-cpp-install.html#build-your-own-docker-image>`__
- Docker image
- ROCm Base Docker image
- .. raw:: html
<a href="https://github.com/ROCm/llama.cpp"><i class="fab fa-github fa-lg"></i></a>
* - `FlashInfer <https://rocm.docs.amd.com/en/latest/compatibility/ml-compatibility/flashinfer-compatibility.html>`__
- .. raw:: html
<a href="https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/flashinfer-install.html"><i class="fas fa-link fa-lg"></i></a>
* - :doc:`FlashInfer <../compatibility/ml-compatibility/flashinfer-compatibility>`
- :doc:`link <rocm-install-on-linux:install/3rd-party/flashinfer-install>`
-
- `Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/flashinfer-install.html#use-a-prebuilt-docker-image-with-flashinfer-pre-installed>`__
- `ROCm Base Docker image <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/flashinfer-install.html#build-your-own-docker-image>`__
- Docker image
- ROCm Base Docker image
- .. raw:: html
<a href="https://github.com/ROCm/flashinfer"><i class="fab fa-github fa-lg"></i></a>

View File

@@ -44,7 +44,7 @@ Setting up the base implementation environment
.. code-block:: shell
rocm-smi --showproductname
amd-smi static --board
#. Check that your GPUs are available to PyTorch.
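One quick way to do this check (assuming the container's PyTorch build includes ROCm support, which reports GPUs through the ``torch.cuda`` API) is a one-liner from the shell:

.. code-block:: shell

   # Should print True and the number of visible GPUs.
   python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"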
@@ -65,8 +65,8 @@ Setting up the base implementation environment
.. tip::
During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal.
This tool helps you see shows which GPUs are involved.
During training and inference, you can check the memory usage by running the ``amd-smi`` command in your terminal.
This tool helps you see which GPUs are involved.
.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate:
@@ -91,10 +91,10 @@ Now, it's important to adjust how you load the model. Add the ``device_map`` par
...
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Load base model to GPU memory
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
base_model_name,
device_map = "auto",
trust_remote_code = True)
...
@@ -130,7 +130,7 @@ After loading the model in this way, the model is fully ready to use the resourc
torchtune for fine-tuning and inference
=============================================
`torchtune <https://meta-pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-GPU
model fine-tuning and inference with LLMs.
#. Install torchtune using pip.
@@ -139,7 +139,7 @@ model fine-tuning and inference with LLMs.
# Install torchtune with PyTorch release 2.2.2+
pip install torchtune
# To confirm that the package is installed correctly
tune --help
@@ -148,12 +148,12 @@ model fine-tuning and inference with LLMs.
.. code-block:: shell
usage: tune [-h] {download,ls,cp,run,validate} ...
Welcome to the TorchTune CLI!
options:
-h, --help show this help message and exit
subcommands:
{download,ls,cp,run,validate}
@@ -194,11 +194,11 @@ model fine-tuning and inference with LLMs.
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/Llama-2-7b-hf/tokenizer.model
# Dataset and sampler
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
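Once the model and a config like the excerpt above are in place, a LoRA fine-tune is launched through the ``tune run`` subcommand listed earlier. The recipe and config names below are the upstream torchtune defaults and are shown as an assumption; run ``tune ls`` to see what your installation actually provides.

.. code-block:: shell

   # List available recipes/configs, then start a single-device LoRA fine-tune.
   tune ls
   tune run lora_finetune_single_device --config llama2/7B_lora_single_device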

View File

@@ -44,20 +44,19 @@ Setting up the base implementation environment
.. code-block:: shell
rocm-smi --showproductname
amd-smi static --board
Your output should look like this:
.. code-block:: shell
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card Series: AMD Instinct MI300X OAM
GPU[0] : Card model: 0x74a1
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: MI3SRIOV
==========================================================================================
================================== End of ROCm SMI Log ===================================
GPU: 0
BOARD:
MODEL_NUMBER: 102-G39203-0B
PRODUCT_SERIAL: PCB079220-1150
FRU_ID: 113-AMDG392030B04-100-300000097H
PRODUCT_NAME: AMD Instinct MI325 OAM
MANUFACTURER_NAME: AMD
#. Check that your GPUs are available to PyTorch.
@@ -94,13 +93,13 @@ Setting up the base implementation environment
pip install -r requirements-dev.txt
cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
python setup.py install
# To leverage the SFTTrainer in TRL for model fine-tuning.
pip install trl
# To leverage PEFT for efficiently adapting pre-trained language models.
pip install peft
# Install the other dependencies.
pip install transformers datasets huggingface-hub scipy
@@ -132,7 +131,7 @@ Download the base model and fine-tuning dataset
.. note::
You can also use the `NousResearch Llama-2-7b-chat-hf <https://huggingface.co/NousResearch/Llama-2-7b-chat-hf>`_
You can also use the `NousResearch Llama-2-7b-chat-hf <https://huggingface.co/NousResearch/Llama-2-7b-chat-hf>`_
as a substitute. It has the same model weights as the original.
#. Run the following code to load the base model and tokenizer.
@@ -141,14 +140,14 @@ Download the base model and fine-tuning dataset
# Base model and tokenizer names.
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Load base model to GPU memory.
device = "cuda:0"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True).to(device)
# Load tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
base_model_name,
trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
@@ -162,10 +161,10 @@ Download the base model and fine-tuning dataset
# Dataset for fine-tuning.
training_dataset_name = "mlabonne/guanaco-llama2-1k"
training_dataset = load_dataset(training_dataset_name, split = "train")
# Check the data.
print(training_dataset)
# Dataset 11 is a QA sample in English.
print(training_dataset[11])
@@ -252,8 +251,8 @@ Compare the number of trainable parameters and training time under the two diffe
dataset_text_field = "text",
tokenizer = tokenizer,
args = training_arguments
)
)
# Run the trainer.
sft_trainer.train()
@@ -286,7 +285,7 @@ Compare the number of trainable parameters and training time under the two diffe
if param.requires_grad:
trainable_params += param.numel()
print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}")
sft_trainer.peft_config = None
print_trainable_parameters(sft_trainer.model)
@@ -309,8 +308,8 @@ Compare the number of trainable parameters and training time under the two diffe
dataset_text_field = "text",
tokenizer = tokenizer,
args = training_arguments
)
)
# Training.
trainer_full.train()
@@ -349,7 +348,7 @@ store, and load.
# PEFT adapter name.
adapter_name = "llama-2-7b-enhanced-adapter"
# Save PEFT adapter.
sft_trainer.model.save_pretrained(adapter_name)
@@ -359,21 +358,21 @@ store, and load.
# Access adapter directory.
cd llama-2-7b-enhanced-adapter
# List all adapter files.
README.md adapter_config.json adapter_model.safetensors
.. tab-item:: Saving a fully fine-tuned model
:sync: without
If you're not using LoRA and PEFT so there is no PEFT LoRA configuration used for training, use the following code
If you're not using LoRA and PEFT so there is no PEFT LoRA configuration used for training, use the following code
to save your fine-tuned model to your system.
.. code-block:: python
# Fully fine-tuned model name.
new_model_name = "llama-2-7b-enhanced"
# Save the fully fine-tuned model.
full_trainer.model.save_pretrained(new_model_name)
@@ -383,7 +382,7 @@ store, and load.
# Access new model directory.
cd llama-2-7b-enhanced
# List all model files.
config.json model-00002-of-00006.safetensors model-00005-of-00006.safetensors
generation_config.json model-00003-of-00006.safetensors model-00006-of-00006.safetensors
@@ -412,26 +411,26 @@ Let's look at achieving model inference using these types of models.
.. tab-item:: Inference using PEFT adapters
To use PEFT adapters like a normal transformer model, you can run the generation by loading a base model along with PEFT
To use PEFT adapters like a normal transformer model, you can run the generation by loading a base model along with PEFT
adapters as follows.
.. code-block:: python
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Set the path of the model or the name on Hugging face hub
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Set the path of the adapter
adapter_name = "Llama-2-7b-enhanced-adpater"
# Load base model
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Adapt the base model with the adapter
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Then, run generation as the same with a normal model outlined in 2.1
The PEFT library provides a ``merge_and_unload`` method, which merges the adapter layers into the base model. This is
@@ -439,13 +438,13 @@ Let's look at achieving model inference using these types of models.
.. code-block:: python
# Load base model
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Adapt the base model with the adapter
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Merge adapter
# Merge adapter
model = model.merge_and_unload()
# Save the merged model into local
@@ -461,25 +460,25 @@ Let's look at achieving model inference using these types of models.
# Import relevant class for loading model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
# Set the pre-trained model name on Hugging face hub
model_name = "meta-llama/Llama-2-7b-chat-hf"
# Set device type
# Set device type
device = "cuda:0"
# Load model and tokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Input prompt encoding
# Input prompt encoding
query = "What is a large language model?"
inputs = tokenizer.encode(query, return_tensors="pt").to(device)
# Token generation
outputs = model.generate(inputs)
# Outputs decoding
# Token generation
outputs = model.generate(inputs)
# Outputs decoding
print(tokenizer.decode(outputs[0]))
In addition, pipelines from Transformers offer simple APIs to use pre-trained models for different tasks, including
@@ -490,14 +489,14 @@ Let's look at achieving model inference using these types of models.
# Import relevant class for loading model and tokenizer
from transformers import pipeline
# Set the path of your model or the name on Hugging face hub
model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"
# Set pipeline
# Set pipeline
# A positive device value will run the model on the associated CUDA device id
pipe = pipeline("text-generation", model=model_name_or_path, device=0)
# Token generation
print(pipe("What is a large language model?")[0]["generated_text"])

View File

@@ -189,6 +189,10 @@ Benchmarking
{% for model_group in model_groups %}
{% for model in model_group.models %}
{% set serv_config = model.config.serving %}
{% set acc_config = model.config.accuracy %}
{% set ex_config = model.config.ex %}
.. container:: model-doc {{model.mad_tag}}
.. tab-set::
@@ -283,108 +287,173 @@ Benchmarking
--name test \
{{ docker.pull_tag }}
.. rubric:: Throughput command
.. rubric:: Run the inference benchmarks
Use the following command to start the throughput benchmark.
.. tab-set::
.. code-block:: shell
.. tab-item:: Latency command
model={{ model.model_repo }}
tp={{ model.config.tp }}
num_prompts={{ model.config.num_prompts | default(1024) }}
in={{ model.config.in | default(128) }}
out={{ model.config.in | default(128) }}
dtype={{ model.config.dtype | default("auto") }}
kv_cache_dtype={{ model.config.kv_cache_dtype }}
max_num_seqs={{ model.config.max_num_seqs | default(1024) }}
max_num_batched_tokens={{ model.config.max_num_batched_tokens }}
max_model_len={{ model.config.max_model_len }}
Use the following command to start the latency benchmark.
vllm bench throughput --model $model \
-tp $tp \
--num-prompts $num_prompts \
--input-len $in \
--output-len $out \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--trust-remote-code \
--output-json ${model}_throughput.json \
--gpu-memory-utilization {{ model.config.gpu_memory_utilization | default(0.9) }}
.. code-block:: shell
.. rubric:: Serving command
model={{ model.model_repo }}
tp={{ serv_config.tp }}
batch_size=16
in={{ serv_config.inp | default(1024) }}
out={{ serv_config.out | default(1024) }}
dtype={{ serv_config.dtype | default("auto") }}
kv_cache_dtype={{ ex_config.kv_cache_dtype | default("auto") }}
max_num_seqs={{ ex_config.max_num_seqs | default(1024) }}
max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
max_model_len={{ ex_config.max_model_len }}
1. Start the server using the following command:
vllm bench latency --model $model \
-tp $tp \
--batch-size $batch_size \
--input-len $in \
--output-len $out \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--output-json ${model}_throughput.json \
.. code-block:: shell
.. tab-item:: Throughput command
model={{ model.model_repo }}
tp={{ model.config.tp }}
dtype={{ model.config.dtype }}
kv_cache_dtype={{ model.config.kv_cache_dtype }}
max_num_seqs=256
max_num_batched_tokens={{ model.config.max_num_batched_tokens }}
max_model_len={{ model.config.max_model_len }}
Use the following command to start the throughput benchmark.
vllm serve $model \
-tp $tp \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--no-enable-prefix-caching \
--swap-space 16 \
--disable-log-requests \
--trust-remote-code \
--gpu-memory-utilization 0.9
.. code-block:: shell
Wait until the model has loaded and the server is ready to accept requests.
model={{ model.model_repo }}
tp={{ serv_config.tp }}
num_prompts={{ model.config.num_prompts | default(1024) }}
in={{ serv_config.inp | default(1024) }}
out={{ serv_config.out | default(1024) }}
dtype={{ serv_config.dtype | default("auto") }}
kv_cache_dtype={{ ex_config.kv_cache_dtype | default("auto") }}
max_num_seqs={{ ex_config.max_num_seqs | default(1024) }}
max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
max_model_len={{ ex_config.max_model_len }}
2. On another terminal on the same machine, run the benchmark:
vllm bench throughput --model $model \
-tp $tp \
--num-prompts $num_prompts \
--input-len $in \
--output-len $out \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--trust-remote-code \
--output-json ${model}_throughput.json \
--gpu-memory-utilization {{ model.config.gpu_memory_utilization | default(0.9) }}
.. code-block:: shell
.. tab-item:: Serving command
# Connect to the container
docker exec -it test bash
1. Start the server using the following command:
# Wait for the server to start
until curl -s http://localhost:8000/v1/models; do sleep 30; done
.. code-block:: shell
# Run the benchmark
model={{ model.model_repo }}
max_concurrency=1
num_prompts=10
in=128
out=128
vllm bench serve --model $model \
--percentile-metrics "ttft,tpot,itl,e2el" \
--dataset-name random \
--ignore-eos \
--max-concurrency $max_concurrency \
--num-prompts $num_prompts \
--random-input-len $in \
--random-output-len $out \
--trust-remote-code \
--save-result \
--result-filename ${model}_serving.json
model={{ model.model_repo }}
tp={{ serv_config.tp }}
dtype={{ serv_config.dtype }}
kv_cache_dtype={{ ex_config.kv_cache_dtype }}
max_num_seqs=1024
max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
max_model_len={{ ex_config.max_model_len }}
.. note::
vllm serve $model \
-tp $tp \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--no-enable-prefix-caching \
--swap-space 16 \
--disable-log-requests
For improved performance with certain Mixture of Experts models, such as Mixtral 8x22B,
try adding ``export VLLM_ROCM_USE_AITER=1`` to your commands.
Wait until the model has loaded and the server is ready to accept requests.
If you encounter the following error, pass your access-authorized Hugging
Face token to the gated models.
2. On another terminal on the same machine, run the benchmark:
.. code-block::
.. code-block:: shell
OSError: You are trying to access a gated repo.
# Connect to the container
docker exec -it test bash
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
# Wait for the server to start
until curl -s http://localhost:8000/v1/models; do sleep 30; done
# Run the benchmark
model={{ model.model_repo }}
max_concurrency=1
num_prompts=10
in={{ serv_config.inp | default("1024") }}
out={{ serv_config.out | default("1024") }}
vllm bench serve --model $model \
--percentile-metrics "ttft,tpot,itl,e2el" \
--dataset-name random \
--ignore-eos \
--max-concurrency $max_concurrency \
--num-prompts $num_prompts \
--random-input-len $in \
--random-output-len $out \
--trust-remote-code \
--save-result \
--result-filename ${model}_serving.json
{% if acc_config %}
.. tab-item:: Accuracy command
1. Start the server using the following command:
.. code-block:: shell
model={{ model.model_repo }}
tp={{ acc_config.tp }}
dtype={{ acc_config.dtype }}
kv_cache_dtype={{ ex_config.kv_cache_dtype }}
max_num_seqs=1024
max_num_batched_tokens={{ ex_config.max_num_batched_tokens }}
max_model_len={{ ex_config.max_model_len }}
vllm serve $model \
-tp $tp \
--dtype $dtype \
--kv-cache-dtype $kv_cache_dtype \
--max-num-seqs $max_num_seqs \
--max-num-batched-tokens $max_num_batched_tokens \
--max-model-len $max_model_len \
--no-enable-prefix-caching \
--swap-space 16 \
--disable-log-requests
Wait until the model has loaded and the server is ready to accept requests.
2. On another terminal on the same machine, run the benchmark:
.. code-block:: shell
# Connect to the container
docker exec -it test bash
# Wait for the server to start
until curl -s http://localhost:8000/v1/models; do sleep 30; done
# Install lm-eval
pip install lm-eval[api]
# Run the benchmark
model={{ acc_config.model }}
lm_eval --model local-completions \
--model_args model=$model,max_gen_toks=2048,num_concurrent=256,max_retries=10,base_url=http://localhost:8000/v1/completions \
--tasks gsm8k --limit 250 --output_path ./tmp
{% endif %}
.. raw:: html

View File

@@ -31,16 +31,16 @@ in the Instinct documentation for more information.
Hardware verification with ROCm
-------------------------------
Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
Use the command ``amd-smi set --perf-determinism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.
You can restore this setting to its default value with the ``amd-smi reset --clocks`` command.
Run the command:
.. code-block:: shell
rocm-smi --setperfdeterminism 1900
amd-smi set --perf-determinism 1900
See `Hardware verification for ROCm <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html#hardware-verification-with-rocm>`_
in the Instinct documentation for more information.

View File

@@ -108,16 +108,16 @@ for more information.
Hardware verification with ROCm
-------------------------------
Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed up to 1900 MHz
Use the command ``amd-smi set --perf-determinism 1900`` to set the max clock speed up to 1900 MHz
instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable
GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature.
You can restore this setting to its default value with the ``rocm-smi -r`` command.
You can restore this setting to its default value with the ``amd-smi reset --clocks`` command.
Run the command:
.. code-block:: shell
rocm-smi --setperfdeterminism 1900
amd-smi set --perf-determinism 1900
See `Hardware verification with ROCm <https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/system-optimization/mi300x.html#hardware-verification-with-rocm>`_ for more information.
@@ -248,7 +248,7 @@ Download the Docker image and required packages
Checking out this specific commit is recommended for a stable and reproducible environment.
.. code-block:: shell
git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92
Prepare training datasets

@@ -285,7 +285,7 @@ tweak some configurations (such as batch sizes).
.. code-block:: shell
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-BF16-pretrain.yaml \
EXP=examples/torchtitan/configs/MI355X/llama3.1_8B-FP8-pretrain.yaml \
bash examples/run_pretrain.sh
.. tab-item:: MI325X

View File

@@ -5,7 +5,7 @@
GPU hardware specifications
===========================================
The following tables provide an overview of the hardware specifications for AMD Instinct™ GPUs, and AMD Radeon™ PRO and Radeon™ GPUs.
The following tables provide an overview of the hardware specifications for AMD Instinct™ GPUs, AMD Radeon™ PRO and Radeon™ GPUs, and AMD Ryzen™ APUs.
For more information about ROCm hardware compatibility, see the ROCm `Compatibility matrix <https://rocm.docs.amd.com/en/latest/compatibility/compatibility-matrix.html>`_.
@@ -18,7 +18,7 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
:name: instinct-arch-spec-table
*
- Model
- Name
- Architecture
- LLVM target name
- VRAM (GiB)
@@ -297,7 +297,7 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
:name: radeon-pro-arch-spec-table
*
- Model
- Name
- Architecture
- LLVM target name
@@ -539,7 +539,7 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
:name: radeon-arch-spec-table
*
- Model
- Name
- Architecture
- LLVM target name
- VRAM (GiB)
@@ -953,6 +953,127 @@ For more information about ROCm hardware compatibility, see the ROCm `Compatibil
- 9
- 0
.. tab-item:: AMD Ryzen APUs
.. list-table::
:header-rows: 1
:name: ryzen-arch-spec-table
*
- Name
- Graphics model
- Architecture
- LLVM target name
- VRAM (GiB)
- Compute Units
- Wavefront Size
- LDS (KiB)
- Infinity Cache (MiB)
- L2 Cache (MiB)
- Graphics L1 Cache (KiB)
- L0 Vector Cache (KiB)
- L0 Scalar Cache (KiB)
- L0 Instruction Cache (KiB)
- VGPR File (KiB)
- SGPR File (KiB)
- GFXIP Major version
- GFXIP Minor version
*
- AMD Ryzen 7 7840U
- Radeon 780M
- RDNA3
- gfx1103
- Dynamic + carveout
- 12
- 32 or 64
- 128
- N/A
- 2
- 256
- 32
- 16
- 32
- 512
- 32
- 11
- 0
*
- AMD Ryzen 9 270
- Radeon 780M
- RDNA3
- gfx1103
- Dynamic + carveout
- 12
- 32 or 64
- 128
- N/A
- 2
- 256
- 32
- 16
- 32
- 512
- 32
- 11
- 0
*
- AMD Ryzen AI 9 HX 375
- Radeon 890M
- RDNA3.5
- gfx1150
- Dynamic + carveout
- 16
- 32 or 64
- 128
- N/A
- 2
- 256
- 32
- 16
- 32
- 512
- 32
- 11
- 5
*
- AMD Ryzen AI Max+ PRO 395
- Radeon 8060S
- RDNA3.5
- gfx1151
- Dynamic + carveout
- 40
- 32 or 64
- 128
- 32
- 2
- 256
- 32
- 16
- 32
- 768
- 32
- 11
- 5
*
- AMD Ryzen AI 7 350
- Radeon 860M
- RDNA3.5
- gfx1152
- Dynamic + carveout
- 8
- 32 or 64
- 128
- N/A
- 1
- 256
- 32
- 16
- 32
- 512
- 32
- 11
- 5
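To confirm which of the targets listed above an installed GPU or APU actually reports, a quick host-side query can help. This is a minimal sketch assuming a standard ROCm installation with ``rocminfo`` on the ``PATH``; the ``amd-smi static --asic`` query is likewise an assumption about the installed amd-smi version.
.. code-block:: shell
# List the LLVM target (gfx) names of all agents ROCm detects
rocminfo | grep -E "Name:.*gfx"
# amd-smi can report similar static ASIC information
amd-smi static --asic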
Glossary
========

@@ -25,7 +25,7 @@ subtrees:
title: HIP SDK on Windows
- url: https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/index.html
title: ROCm on Radeon and Ryzen
- file: how-to/deep-learning-rocm.md
- file: how-to/deep-learning-rocm
title: Deep learning frameworks
subtrees:
- entries:

@@ -1,4 +1,4 @@
rocm-docs-core==1.31.1
rocm-docs-core==1.31.2
sphinx-reredirects
sphinx-sitemap
sphinxcontrib.datatemplates==0.11.0
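The pinned ``requirements.txt`` that follows is generated from this ``requirements.in``. As a rough sketch of how such pins are typically refreshed, assuming a pip-tools based workflow (the repository may use a different process):
.. code-block:: shell
# Regenerate the fully pinned requirements.txt from requirements.in
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt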

@@ -19,11 +19,11 @@ babel==2.17.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.14.2
beautifulsoup4==4.14.3
# via pydata-sphinx-theme
breathe==4.36.0
# via rocm-docs-core
certifi==2025.11.12
certifi==2026.1.4
# via requests
cffi==2.0.0
# via
@@ -39,7 +39,7 @@ comm==0.2.3
# via ipykernel
cryptography==46.0.3
# via pyjwt
debugpy==1.8.17
debugpy==1.8.19
# via ipykernel
decorator==5.2.1
# via ipython
@@ -60,21 +60,21 @@ fastjsonschema==2.21.2
# rocm-docs-core
gitdb==4.0.12
# via gitpython
gitpython==3.1.45
gitpython==3.1.46
# via rocm-docs-core
greenlet==3.2.4
greenlet==3.3.0
# via sqlalchemy
idna==3.11
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==8.7.0
importlib-metadata==8.7.1
# via
# jupyter-cache
# myst-nb
ipykernel==7.1.0
# via myst-nb
ipython==8.37.0
ipython==8.38.0
# via
# ipykernel
# myst-nb
@@ -84,13 +84,13 @@ jinja2==3.1.6
# via
# myst-parser
# sphinx
jsonschema==4.25.1
jsonschema==4.26.0
# via nbformat
jsonschema-specifications==2025.9.1
# via jsonschema
jupyter-cache==1.0.1
# via myst-nb
jupyter-client==8.6.3
jupyter-client==8.8.0
# via
# ipykernel
# nbclient
@@ -118,7 +118,7 @@ myst-nb==1.3.0
# via rocm-docs-core
myst-parser==4.0.1
# via myst-nb
nbclient==0.10.2
nbclient==0.10.4
# via
# jupyter-cache
# myst-nb
@@ -138,11 +138,11 @@ parso==0.8.5
# via jedi
pexpect==4.9.0
# via ipython
platformdirs==4.5.0
platformdirs==4.5.1
# via jupyter-core
prompt-toolkit==3.0.52
# via ipython
psutil==7.1.3
psutil==7.2.1
# via ipykernel
ptyprocess==0.7.0
# via pexpect
@@ -188,9 +188,9 @@ requests==2.32.5
# via
# pygithub
# sphinx
rocm-docs-core==1.31.1
rocm-docs-core==1.31.2
# via -r requirements.in
rpds-py==0.29.0
rpds-py==0.30.0
# via
# jsonschema
# referencing
@@ -200,7 +200,7 @@ smmap==5.0.2
# via gitdb
snowballstemmer==3.0.1
# via sphinx
soupsieve==2.8
soupsieve==2.8.1
# via beautifulsoup4
sphinx==8.1.3
# via
@@ -250,15 +250,15 @@ sphinxcontrib-runcmd==0.2.0
# via sphinxcontrib-datatemplates
sphinxcontrib-serializinghtml==2.0.0
# via sphinx
sqlalchemy==2.0.44
sqlalchemy==2.0.45
# via jupyter-cache
stack-data==0.6.3
# via ipython
tabulate==0.9.0
# via jupyter-cache
tomli==2.3.0
tomli==2.4.0
# via sphinx
tornado==6.5.2
tornado==6.5.4
# via
# ipykernel
# jupyter-client
@@ -282,7 +282,7 @@ typing-extensions==4.15.0
# pygithub
# referencing
# sqlalchemy
urllib3==2.5.0
urllib3==2.6.3
# via
# pygithub
# requests

@@ -123,7 +123,8 @@ Performance
.. note::
`ROCprof Compute Viewer <https://rocm.docs.amd.com/projects/rocprof-compute-viewer/en/amd-mainline/>`_ is a tool for visualizing and analyzing GPU thread trace data collected using :doc:`rocprofv3 <rocprofiler-sdk:index>`. Note that `ROCprof Compute Viewer <https://rocm.docs.amd.com/projects/rocprof-compute-viewer/en/amd-mainline/>`_ is in an early access state. Running production workloads is not recommended.
`ROCprof Compute Viewer <https://rocm.docs.amd.com/projects/rocprof-compute-viewer/en/amd-mainline/>`_ is a tool for visualizing and analyzing GPU thread trace data collected using :doc:`rocprofv3 <rocprofiler-sdk:index>`.
Note that `ROCprof Compute Viewer <https://rocm.docs.amd.com/projects/rocprof-compute-viewer/en/amd-mainline/>`_ is in an early access state. Running production workloads is not recommended.
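For context, the thread trace data that the viewer consumes is collected with rocprofv3. The sketch below is an assumption based on recent rocprofv3 releases that expose advanced thread trace through an ``--att`` option, with ``./my_app`` standing in for your workload; verify the exact flags with ``rocprofv3 --help`` before relying on them.
.. code-block:: shell
# Collect advanced thread trace (ATT) data for a sample application
# (the --att flag is an assumption; check rocprofv3 --help)
rocprofv3 --att --output-directory ./att_traces -- ./my_app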
Development
^^^^^^^^^^^