Compare commits


109 Commits

Author SHA1 Message Date
Sam Wu
43cd74913b Merge branch 'develop' into roc-6.0.x 2024-01-31 16:04:42 -07:00
Sam Wu
83766203ff Update changelog announcement (#2857)
* Update changelog announcement

* Update phrasing
2024-01-31 16:04:14 -07:00
Sam Wu
e467b13c68 Merge branch 'develop' into roc-6.0.x 2024-01-31 15:04:00 -07:00
Sam Wu
336f88c7c2 Fix typo in changelog (#2856) 2024-01-31 15:03:31 -07:00
Sam Wu
b18eacbdac Merge branch 'develop' into roc-6.0.x 2024-01-31 14:34:08 -07:00
zhang2amd
78bd182403 Update default.xml to version 6.0.2 (#2855) 2024-01-31 14:33:45 -07:00
Lisa
ba9cc4f185 changelog updates (#2792)
* changelog updates

* updates

* changelog updates

* Update CHANGELOG.md

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>

* Update RELEASE.md

* 6.0.1 -> 6.0.2

* 6.0.1 -> 6.0.2

* Update CONTRIBUTING.md (#2791)

* Update CONTRIBUTING.md

* Fixed link to licensing document

Also, changed to use relative links for internal files.

* Create issue_retrieval.yml

I am tasked with adding a GitHub action to process incoming GitHub issues. The AMD GitHub admin team asked me to try out one of their runners and to do so, I need to load in a workflow file.

* changed group to ROCM-Ubuntu

* Added a field to specify project number

This action receives an org name and a project number and uses them to add incoming issues to that project

* Update issue_retrieval.yml

* Update issue_retrieval.yml

* Revert "Update CONTRIBUTING.md" (#2795)

* Text change to direct PRs into default branch, since not all repos have develop branch

* add keywords (#2799)

* Update issue_retrieval.yml

* ci(default.xml): Add hipBLASLt to manifest (#2796)

* Deleting issue_report.yml in favor of a global issue template placed in ROCm/.github (#2803)

* Delete .github/ISSUE_TEMPLATE/issue_report.yml

* Delete .github/ISSUE_TEMPLATE/config.yml

* Delete .github/ISSUE_TEMPLATE directory (#2805)

* docs(conf.py): Update article info for release page (#2806)

* docs(conf.py): Update article info for release page

* Update conf.py

* Fix typo (#2809)

* Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#2807)

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* corrections for Issue #2753 (#2819)

* docs(versions.md): Add 5.6.1 to versions list (#2816)

* Add codeowners for documentation (#2834)

Co-authored-by: samjwu <samjwu@users.noreply.github.com>

* Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#2835)

Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump gitpython from 3.1.30 to 3.1.41 in /docs/sphinx (#2836)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.30 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.30...3.1.41)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* changelog updates

* sync release file with changelog

* remove 6.0.0 duplicates

* update intro

* Update CHANGELOG.md

* Update RELEASE.md

* clean up duplicates

* caps

* minor update

* language update

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: David Galiffi <dgaliffi@amd.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
Co-authored-by: Young Hui <young.hui@amd.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: randyh62 <42045079+randyh62@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
2024-01-31 13:26:27 -07:00
Lisa
df70d90d49 radeon updates (#2818)
* radeon updates

* update link

* update intro

* verbiage

* Update docs/index.md

Co-authored-by: Sam Wu <sam.wu2@amd.com>

* Update docs/what-is-rocm.md

Co-authored-by: Sam Wu <sam.wu2@amd.com>

* Use intersphinx link for radeon

---------

Co-authored-by: Sam Wu <sam.wu2@amd.com>
2024-01-30 13:20:28 -07:00
dependabot[bot]
95fa47e31a Bump rocm-docs-core from 0.31.0 to 0.33.0 in /docs/sphinx (#2851)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.31.0 to 0.33.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.31.0...v0.33.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-29 17:20:35 -07:00
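Dependabot bumps like this one amount to moving a pinned version in the Sphinx requirements file; schematically (a sketch assuming the conventional docs/sphinx/requirements.txt pin rather than the literal diff, which is not shown here):

-rocm-docs-core==0.31.0
+rocm-docs-core==0.33.0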
Spencer Hance
5afa1539ed Fix link to building.md in README (#2843)
Fix broken link to building.md in README.  It was missing `/docs/` in the path.
2024-01-29 17:04:10 -07:00
BrenAMD
0b5cfca1e4 Updated New ROCm meta package section (#2839) 2024-01-25 12:19:34 -07:00
dependabot[bot]
14979045a8 Bump gitpython from 3.1.30 to 3.1.41 in /docs/sphinx (#2836)
Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.30 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](https://github.com/gitpython-developers/GitPython/compare/3.1.30...3.1.41)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-23 09:44:58 -07:00
dependabot[bot]
65b5a383ec Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#2835)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-23 09:44:43 -07:00
Sam Wu
c679235a90 Add codeowners for documentation (#2834)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
2024-01-23 09:29:14 -07:00
Sam Wu
4833ecfa6a docs(versions.md): Add 5.6.1 to versions list (#2816) 2024-01-22 15:16:58 -07:00
randyh62
c9425c6d19 corrections for Issue #2753 (#2819) 2024-01-18 09:31:45 -07:00
dependabot[bot]
c4383d217a Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#2807)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-01-16 11:53:20 -07:00
Sam Wu
0ef9f2d53c Merge branch 'develop' into roc-6.0.x 2024-01-16 11:46:26 -07:00
Sam Wu
44b5d516e8 Merge branch 'docs/6.0.0' into roc-6.0.x 2024-01-16 10:56:03 -07:00
Sam Wu
ad66256e52 Merge develop into roc-6.0.x (#2810)
* Create issue_retrieval.yml

I am tasked with adding a GitHub action to process incoming GitHub issues. The AMD GitHub admin team asked me to try out one of their runners and to do so, I need to load in a workflow file.

* changed group to ROCM-Ubuntu

* Added a field to specify project number

This action receives an org name and a project number and uses them to add incoming issues to that project

* Update issue_retrieval.yml

* Update issue_retrieval.yml

* Generate release notes for 6.0.1 from autotag script (#2790)

* Update CONTRIBUTING.md (#2791)

* Update CONTRIBUTING.md

* Fixed link to licensing document

Also, changed to use relative links for internal files.

* Revert "Update CONTRIBUTING.md" (#2795)

* Text change to direct PRs into default branch, since not all repos have develop branch

* add keywords (#2799)

* Update issue_retrieval.yml

* ci(default.xml): Add hipBLASLt to manifest (#2796)

* Deleting issue_report.yml in favor of a global issue template placed in ROCm/.github (#2803)

* Delete .github/ISSUE_TEMPLATE/issue_report.yml

* Delete .github/ISSUE_TEMPLATE/config.yml

* Delete .github/ISSUE_TEMPLATE directory (#2805)

* docs(conf.py): Update article info for release page (#2806)

* docs(conf.py): Update article info for release page

* Update conf.py

* Fix typo (#2809)

---------

Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
Co-authored-by: David Galiffi <dgaliffi@amd.com>
Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Young Hui <young.hui@amd.com>
Co-authored-by: yhuiYH <145490163+yhuiYH@users.noreply.github.com>
2024-01-16 10:53:28 -07:00
Sam Wu
d509656c6b Fix typo (#2809) 2024-01-16 10:48:21 -07:00
Sam Wu
c2a3626026 docs(conf.py): Update article info for release page (#2806)
* docs(conf.py): Update article info for release page

* Update conf.py
2024-01-12 17:12:56 -07:00
abhimeda
51d5bf015c Delete .github/ISSUE_TEMPLATE directory (#2805) 2024-01-12 16:12:09 -07:00
abhimeda
c6facfb30f Deleting issue_report.yml in favor of a global issue template placed in ROCm/.github (#2803)
* Delete .github/ISSUE_TEMPLATE/issue_report.yml

* Delete .github/ISSUE_TEMPLATE/config.yml
2024-01-12 15:20:15 -07:00
Sam Wu
fce96340f4 ci(default.xml): Add hipBLASLt to manifest (#2796) 2024-01-12 15:19:22 -07:00
abhimeda
8d44e04483 Merge pull request #2800 from ROCm/abhimeda-added-env-variables-to-workflow-file
Added repository secrets to ROCm and pointed the workflow file to use them
2024-01-12 11:46:26 -05:00
abhimeda
dcce85a84a Update issue_retrieval.yml 2024-01-12 10:57:29 -05:00
Lisa
d399b13c88 add keywords (#2799) 2024-01-11 14:07:30 -07:00
yhuiYH
20005e0ef7 Merge pull request #2798 from ROCm/amd/dev/yhui/UpdateTextInContributing
Update Contributing.md to direct PRs to use repo's default branch
2024-01-11 15:08:37 -05:00
Young Hui
d05c1d529e Text change to direct PRs into default branch, since not all repos have develop branch 2024-01-11 14:02:17 -05:00
Lisa
163262643f Revert "Update CONTRIBUTING.md" (#2795) 2024-01-10 11:26:47 -07:00
abhimeda
318126b155 Merge pull request #2772 from ROCm/abhimeda-adding-workflow-file-to-test-github-runner
Abhimeda adding workflow file to create GitHub Action
2024-01-10 10:16:11 -05:00
zhang2amd
221aa04931 Add hipBLASLt in manifest. (#2776) 2024-01-10 07:06:11 -07:00
David Galiffi
2be774fb19 Update CONTRIBUTING.md (#2791)
* Update CONTRIBUTING.md

* Fixed link to licensing document

Also, changed to use relative links for internal files.
2024-01-10 07:04:38 -07:00
Sam Wu
3faa2600eb Generate release notes for 6.0.1 from autotag script (#2790) 2024-01-09 13:39:19 -07:00
Sam Wu
d531936276 Merge roc-6.0.x into docs/6.0.0 (#2784)
* Mi300 info update (#2780)

* docs(gpu-enabled-mpi.rst): Fix links to 3rd party support matrices (#2775)

* docs(gpu-enabled-mpi.rst): Fix links to 3rd party support matrices

* docs: Directly link for RST instead of using intersphinx

---------

Co-authored-by: Istvan Kiss <neon60@gmail.com>
2024-01-09 09:21:24 -07:00
Sam Wu
753d2f9719 Merge branch 'develop' into roc-6.0.x 2024-01-08 16:35:26 -07:00
Sam Wu
7ffc622039 docs(gpu-enabled-mpi.rst): Fix links to 3rd party support matrices (#2775)
* docs(gpu-enabled-mpi.rst): Fix links to 3rd party support matrices

* docs: Directly link for RST instead of using intersphinx
2024-01-08 16:34:45 -07:00
Istvan Kiss
054689be6a Mi300 info update (#2780) 2024-01-08 16:30:41 -07:00
abhimeda
40b5f85af9 Update issue_retrieval.yml 2024-01-04 15:40:05 -05:00
abhimeda
a1372d56f9 Update issue_retrieval.yml 2024-01-03 14:54:10 -05:00
abhimeda
717b09f7eb Added a field to specify project number
This action receives an org name and a project number and uses them to add incoming issues to that project
2024-01-03 14:50:52 -05:00
abhimeda
1cd2b651c4 changed group to ROCM-Ubuntu 2024-01-01 21:55:28 -05:00
abhimeda
587f821194 Create issue_retrieval.yml
I am tasked with adding a GitHub action to process incoming GitHub issues. The AMD GitHub admin team asked me to try out one of their runners and to do so, I need to load in a workflow file.
2024-01-01 21:53:42 -05:00
Sam Wu
147dce6f28 Merge branch 'develop' into roc-6.0.x 2023-12-20 15:54:20 -07:00
Sam Wu
4808c615e6 Merge branch 'develop' into docs/6.0.0 2023-12-20 15:53:12 -07:00
Lisa
f94a8620eb Update CHANGELOG.md (#2762) 2023-12-20 13:40:35 -07:00
Lisa
5f9842db8f link fixes & consistency (#2761) 2023-12-20 12:42:15 -07:00
dependabot[bot]
6fae95aa02 Bump rocm-docs-core from 0.30.2 to 0.30.3 in /docs/sphinx (#2759)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.2 to 0.30.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.30.2...v0.30.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-19 17:13:46 -07:00
Sam Wu
b865ae7796 Merge branch 'roc-6.0.x' into docs/6.0.0 2023-12-19 15:56:57 -07:00
Sam Wu
74a5c1b580 Merge branch 'develop' into roc-6.0.x 2023-12-19 15:56:02 -07:00
Sam Wu
538a44f4d7 docs: Update GPU and OS support for Linux page (#2757) 2023-12-19 15:53:52 -07:00
Sam Wu
6c90336e67 Merge docs/6.0.0 into develop (#2756)
* Marking TransferBench as beta (#2727)

* Known issues (#2731) (#2732)

* rearranging

* edits

* update toc

* link update

* line break

* updates

* Update RELEASE.md

* edits

* Update conf.py

* file cleanup

* Update RELEASE.md

* Update conf.py

* addition

* verbiage

* Update CHANGELOG.md

* edits

* edits

* updates

* edits

* more edits

* Update RELEASE.md

Limited OS to start in 6.0

* Update RELEASE.md

* Update RELEASE.md

Table to reflect support.

* Update RELEASE.md

tweaked language

* Update RELEASE.md

Tweaking language

* edits

* edits

* link

* spelling

* add link

* new section

* Add files via upload (#2701)

* updates

---------

Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>

* docs(library-index.md): Add MIVisionX to library index (#2736)

* Delete docs/about/compatibility/linux-support.md (#2734)

* Delete docs/about/compatibility/linux-support.md

* Update _toc.yml.in

* Update _toc.yml.in

---------

Co-authored-by: Sam Wu <sam.wu2@amd.com>

* Corrected OS version (#2738)

* Corrected OS version 

There is no 22.04.5 release; 22.04.3 is the version that has been tested and is supported.

* Update CHANGELOG.md

* Update _toc.yml.in (#2750)

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
Co-authored-by: pramenku <7664080+pramenku@users.noreply.github.com>
2023-12-19 15:43:04 -07:00
Sam Wu
859f3763c8 Merge branch 'develop' into docs/6.0.0 2023-12-19 15:41:06 -07:00
abhimeda
7f4922d2b2 Abhimeda updating issue template (#2749)
* added ROCm v6, MI300, and set a default component

* Delete .github/ISSUE_TEMPLATE/0_issue_report.yml
2023-12-18 15:06:35 -07:00
Lisa
c8c4b5a034 Update _toc.yml.in (#2750) 2023-12-18 12:27:06 -07:00
Mátyás Aradi
3e1a87a4f1 Remove virtualenv build from dependencies (#2699)
* Remove virtualenv build from dependencies

* Rename ROCM_BUILD_DOCS to BUILD_DOCS
2023-12-18 07:03:55 -07:00
pramenku
3522084990 Corrected OS version (#2738)
* Corrected OS version 

There is no 22.04.5 release; 22.04.3 is the version that has been tested and is supported.

* Update CHANGELOG.md
2023-12-18 07:03:24 -07:00
yhuiYH
eeb96ebb18 Move documentation contributing.md and add Governance.md and Contributing.md (#2690)
* moved contributing.md to new location as it describes contributing to documentation

* Adding Governance.md and high-level Contributing.md

* fix linting errors (asterisk, whitespace and unused links)

* More linting fixes

* merge conflicts

* verbiage

* License link moved out of codeblock, and text fix there. Changed to full name of AMD. Update links to ROCm Org path

* whitespace linting fix

* Reverted back to ROCm is led and managed by AMD. Flows better to me.

---------

Co-authored-by: Lisa Delaney <lisa.delaney@amd.com>
2023-12-15 16:14:13 -07:00
Saad Rahim (AMD)
1c420b4b5c Delete docs/about/compatibility/linux-support.md (#2734)
* Delete docs/about/compatibility/linux-support.md

* Update _toc.yml.in

* Update _toc.yml.in

---------

Co-authored-by: Sam Wu <sam.wu2@amd.com>
2023-12-15 16:09:50 -07:00
Sam Wu
914befefcb docs(library-index.md): Add MIVisionX to library index (#2736) 2023-12-15 15:59:36 -07:00
Sam Wu
6099778813 Merge branch 'develop' into roc-6.0.x 2023-12-15 15:50:14 -07:00
Sam Wu
8a8504246a docs(library-index.md): Add MIVisionX to library index (#2735)
* Add files via upload (#2701)

* Merge Roc 6.0.x into develop (#2733)

* Marking TransferBench as beta (#2727)

* Known issues (#2731)

* rearranging

* edits

* update toc

* link update

* line break

* updates

* Update RELEASE.md

* edits

* Update conf.py

* file cleanup

* Update RELEASE.md

* Update conf.py

* addition

* verbiage

* Update CHANGELOG.md

* edits

* edits

* updates

* edits

* more edits

* Update RELEASE.md

Limited OS to start in 6.0

* Update RELEASE.md

* Update RELEASE.md

Table to reflect support.

* Update RELEASE.md

tweaked language

* Update RELEASE.md

Tweaking language

* edits

* edits

* link

* spelling

* add link

* new section

* Add files via upload (#2701)

* updates

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>

* docs(library-index.md): Add MIVisionX to library index

---------

Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
2023-12-15 15:47:15 -07:00
Sam Wu
82d871c907 Merge Roc 6.0.x into develop (#2733)
* Marking TransferBench as beta (#2727)

* Known issues (#2731)

* rearranging

* edits

* update toc

* link update

* line break

* updates

* Update RELEASE.md

* edits

* Update conf.py

* file cleanup

* Update RELEASE.md

* Update conf.py

* addition

* verbiage

* Update CHANGELOG.md

* edits

* edits

* updates

* edits

* more edits

* Update RELEASE.md

Limited OS to start in 6.0

* Update RELEASE.md

* Update RELEASE.md

Table to reflect support.

* Update RELEASE.md

tweaked language

* Update RELEASE.md

Tweaking language

* edits

* edits

* link

* spelling

* add link

* new section

* Add files via upload (#2701)

* updates

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
2023-12-15 15:06:03 -07:00
Sam Wu
a9099dd36e Known issues (#2731) (#2732)
* rearranging

* edits

* update toc

* link update

* line break

* updates

* Update RELEASE.md

* edits

* Update conf.py

* file cleanup

* Update RELEASE.md

* Update conf.py

* addition

* verbiage

* Update CHANGELOG.md

* edits

* edits

* updates

* edits

* more edits

* Update RELEASE.md

Limited OS to start in 6.0

* Update RELEASE.md

* Update RELEASE.md

Table to reflect support.

* Update RELEASE.md

tweaked language

* Update RELEASE.md

Tweaking language

* edits

* edits

* link

* spelling

* add link

* new section

* Add files via upload (#2701)

* updates

---------

Co-authored-by: Lisa <lisa.delaney@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
2023-12-15 15:05:35 -07:00
Lisa
6ba05d8ab0 Known issues (#2731)
* rearranging

* edits

* update toc

* link update

* line break

* updates

* Update RELEASE.md

* edits

* Update conf.py

* file cleanup

* Update RELEASE.md

* Update conf.py

* addition

* verbiage

* Update CHANGELOG.md

* edits

* edits

* updates

* edits

* more edits

* Update RELEASE.md

Limited OS to start in 6.0

* Update RELEASE.md

* Update RELEASE.md

Table to reflect support.

* Update RELEASE.md

tweaked language

* Update RELEASE.md

Tweaking language

* edits

* edits

* link

* spelling

* add link

* new section

* Add files via upload (#2701)

* updates

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Ronnie Chatterjee <111161280+ronniec91@users.noreply.github.com>
Co-authored-by: abhimeda <138710508+abhimeda@users.noreply.github.com>
2023-12-15 15:01:52 -07:00
Saad Rahim (AMD)
ba69933774 Marking TransferBench as beta (#2727) 2023-12-15 14:48:33 -07:00
abhimeda
5676b16fce Add files via upload (#2701) 2023-12-15 14:42:13 -07:00
Lisa
1828271505 Update library-index.md (#2723)
* Update library-index.md

* Update library-index.md
2023-12-15 14:33:22 -07:00
Sam Wu
5b672af67d build: Update rocm-docs-core to v0.30.2 (#2724)
* build: Update rocm-docs-core to v0.30.2

* docs: Fix doc links in index
2023-12-15 14:32:46 -07:00
Lisa
a121e35aa7 rearranging (#2718) 2023-12-15 14:03:14 -07:00
zhang2amd
2a71de6c93 Update default.xml for ROCm 6.0.0 (#2721) 2023-12-15 13:20:39 -07:00
Saad Rahim (AMD)
8588444a0d Updating release notes (#2712)
* Updating release notes

* Apply suggestions from code review

* Update RELEASE.md

Co-authored-by: Sam Wu <sjwu@ualberta.ca>

* Update RELEASE.md

Co-authored-by: Sam Wu <sjwu@ualberta.ca>

* Update into text

* Update RELEASE.md

* Update RELEASE.md

Co-authored-by: Sam Wu <sjwu@ualberta.ca>

---------

Co-authored-by: Lisa <lisajdelaney@gmail.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
2023-12-14 14:38:42 -07:00
Sam Wu
b8412e17f3 docs(versions.md): Add back docs versions page (#2716)
This is used by the Version List header for the rocm-docs-home theme flavor
2023-12-14 14:21:11 -07:00
Sam Wu
652f72dbdd docs: Manually add ROCgdb release notes (#2714) 2023-12-14 14:20:57 -07:00
Sam Wu
13da03473f Manual update to Release Notes (#2711)
* docs: Manually add rocprofiler release notes

* docs: Manually add HIP release notes

* Update CHANGELOG.md

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>

* docs: HIP 6.0.0

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
2023-12-14 11:42:54 -07:00
Lisa
bcc8603454 update links, remove windows (#2706) 2023-12-14 09:21:50 -07:00
Lisa
5a53b95c7f release updates (#2707)
* release updates

* minor updates

* Update CHANGELOG.md
2023-12-14 09:20:53 -07:00
srawat
7889220f04 Mi200 counters (#2622) 2023-12-12 11:25:57 -07:00
Lisa
19eae6a8eb heading consistency (#2697)
* heading consistency

* update rocrand
2023-12-12 11:16:49 -07:00
srawat
131aa66591 Merge pull request #2700 from SwRaw/rocprofiler_index
Update library-index.md
2023-12-11 11:00:49 +05:30
Sam Wu
c648ca767b fix(tag_script.py): Update organization names for projects used in tagging script (#2698)
Most projects were moved to the ROCm organization
2023-12-08 10:44:26 -07:00
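tools/autotag/tag_script.py itself is not part of this compare view; as a rough sketch of what the organization rename implies (the mapping and helper below are illustrative, not the script's actual code):

```python
# Hypothetical sketch: normalize GitHub URLs after the move of most
# repositories from the old organizations to the ROCm organization.
ORG_RENAMES = {
    "RadeonOpenCompute": "ROCm",
    "ROCmSoftwarePlatform": "ROCm",
}

def canonical_repo_url(url: str) -> str:
    """Rewrite a GitHub URL to use the new organization name."""
    for old, new in ORG_RENAMES.items():
        url = url.replace(f"github.com/{old}/", f"github.com/{new}/")
    return url

print(canonical_repo_url("https://github.com/ROCmSoftwarePlatform/rocBLAS"))
# https://github.com/ROCm/rocBLAS
```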
srawat
4922020441 Update library-index.md 2023-12-08 22:18:41 +05:30
srawat
07a778498c Update library-index.md 2023-12-08 22:11:54 +05:30
srawat
d75a05645f Update library-index.md 2023-12-08 17:37:53 +05:30
Sam Wu
00f7899b03 docs(conf.py): Use rocm-docs-core as extension (#2695)
* docs(conf.py): Use rocm-docs-core as extension

instead of calling and instantiating as object (legacy method)

Also apply the rocm-docs-home theme flavor

* build: Update rocm-docs-core to 0.30.1
2023-12-07 09:39:45 -07:00
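For context, a minimal sketch of what using rocm-docs-core as an extension looks like in a Sphinx conf.py; the extension and theme names follow rocm-docs-core's documented usage, but the exact values here are illustrative rather than this repository's actual conf.py:

```python
# docs/conf.py (sketch): load rocm-docs-core as a Sphinx extension
# instead of importing and instantiating its setup object (legacy method).
project = "ROCm Documentation"

extensions = ["rocm_docs"]                          # provided by rocm-docs-core
html_theme = "rocm_docs_theme"                      # theme shipped with rocm-docs-core
html_theme_options = {"flavor": "rocm-docs-home"}   # flavor named in this commit
```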
Sam Wu
412366ff61 Update Changelog and latest Release notes (#2648)
* docs: Remove extra newline from 5.7.1.md template

* docs: Update the changelog and latest release notes

* docs: Rebuild changelog with updated 6.0.0 edits
2023-12-06 16:27:04 -07:00
dependabot[bot]
be1fed8ca4 Bump rocm-docs-core from 0.29.0 to 0.30.0 in /docs/sphinx (#2684)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.29.0 to 0.30.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.29.0...v0.30.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-12-05 15:07:34 -07:00
Lisa
16a1d355c1 typo (#2687) 2023-12-04 10:03:02 -07:00
Lisa
3aa7072fc2 metadata test (#2656) 2023-11-30 14:37:12 -07:00
Saad Rahim (AMD)
7179884433 Left Navigation further compression for usability (#2677)
* Left Navigation further compression for usability

* Whitespace

* provide feedback
2023-11-30 13:11:17 -07:00
Lisa
3523e9e822 Open MPI updates (#2655) 2023-11-30 09:58:12 -07:00
Nagy-Egri Máté Ferenc
3b9cd77b93 Clarify mixing C++ and HIP sources via CMake (#2618)
* Clarify mixing C++ and HIP sources via CMake

* Designate code blocks

* Simplify lang around host-only use of the HIP API

* Remove superfluous wording.

* Note LINKER_LANGUAGE of mixed sources

* Space after code-block

* Single space in code-block
2023-11-29 07:03:44 -07:00
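The documentation page being clarified is not reproduced in this compare view; as a hedged sketch of the pattern the commit bullets describe (target and file names are illustrative): compile ordinary translation units as C++, mark the HIP ones, and pin the link language where CMake cannot infer it.

```cmake
cmake_minimum_required(VERSION 3.21)    # first release with HIP language support
project(MixedExample LANGUAGES CXX HIP)

find_package(hip REQUIRED)

# kernels.hip is compiled as HIP; main.cpp stays plain C++.
add_executable(mixed main.cpp kernels.hip)

# Host-only C++ code that calls the HIP API needs only the runtime target.
target_link_libraries(mixed PRIVATE hip::host)

# With mixed sources, make the link step explicit and unambiguous.
set_target_properties(mixed PROPERTIES LINKER_LANGUAGE CXX)
```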
Mátyás Aradi
ef1c21ccf7 Add CMake support (#2641)
* Add CMake support

* Update README and CHANGELOG

* Update CHANGELOG

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
2023-11-28 09:40:25 -07:00
Istvan Kiss
35893c4df6 Remove disable spellchecks of cmake-packages.rst (#2678) 2023-11-28 07:03:13 -07:00
Saad Rahim (AMD)
c1ee7d32e0 Removing Linux installation related content (#2673)
* Removing Linux installation related content

* TOC updates

* Removing added files

* Line spacing on code block
2023-11-27 14:03:52 -07:00
Istvan Kiss
f8446befd2 Remove disable spellchecks of cmake-packages.rst (#2676) 2023-11-27 11:17:13 -07:00
dependabot[bot]
f51e1144df Bump rocm-docs-core from 0.28.0 to 0.29.0 in /docs/sphinx (#2674)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.28.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.28.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-27 10:21:27 -07:00
Lisa
4adaff02a6 Left nav updates (#2647)
* update gpu-enabled-mpi

update the documentation to also include libfabric based network interconnects,
not just UCX.

* add some technical terms to wordlist

* shorten left nav

* grid updates

---------

Co-authored-by: Edgar Gabriel <Edgar.Gabriel@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
2023-11-24 07:15:10 -07:00
dependabot[bot]
0d6fc80070 Bump rocm-docs-core from 0.27.0 to 0.28.0 in /docs/sphinx (#2651)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.28.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.27.0...v0.28.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
2023-11-22 15:07:01 -07:00
Lisa
33f110e354 update ROCm name (#2660)
* update ROCm name

* update version history page
2023-11-22 10:30:10 -07:00
Saad Rahim (AMD)
9a9cf073b4 spelling check fix (#2649) 2023-11-20 10:12:39 -07:00
Lisa
1e6951dc55 add tensorflow support link (#2612)
* add tensorflow support link

* Update docs/install/tensorflow-install.md

---------

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
2023-11-15 15:41:36 -07:00
Jithun Nair
135e489e7a Update torchvision version to 0.15.2 for PyTorch2.0.1 (#2635)
The Ubuntu 20.04 entry contains the correct info; this corrects the Ubuntu 22.04 entry.

Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
2023-11-15 15:37:57 -07:00
Lisa
c326a64381 Acronym update (#2637) 2023-11-14 08:54:13 -07:00
Lisa
37c48060f7 update release note files (#2617)
---------

Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: Saad Rahim (AMD) <44449863+saadrahim@users.noreply.github.com>
2023-11-10 15:14:59 -07:00
dependabot[bot]
3f855e386c Bump rocm-docs-core from 0.26.0 to 0.27.0 in /docs/sphinx (#2626)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.26.0 to 0.27.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.26.0...v0.27.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-03 07:08:50 -06:00
Sam Wu
aa5eff25fb docs: Update copyright and release history doc (#2624) 2023-11-02 10:10:34 -06:00
Istvan Kiss
ccdcfbd7e3 Fix warnings (#2623)
* Fix warnings

* Fix file conflict

* Remove duplication in 5.7.1 changelog
2023-11-02 10:00:01 -06:00
92 changed files with 6347 additions and 4958 deletions

.github/CODEOWNERS (vendored; Normal file → Executable file; 4 changed lines)

@@ -1 +1,5 @@
* @saadrahim @Rmalavally @amd-aakash @zhang2amd @jlgreathouse @samjwu @MathiasMagnus @LisaDelaney
# Documentation files
docs/* @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
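Since GitHub resolves CODEOWNERS by the last matching pattern, these added lines route reviews for docs/, Markdown, and RST files to the @ROCm/rocm-documentation team, while the catch-all first line remains the default owner set for everything else.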

.github/ISSUE_TEMPLATE/issue_report.yml (deleted file)

@@ -1,76 +0,0 @@
name: Issue Report
description: File a report for something not working correctly.
title: "[Issue]: "
body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to fill out this report!
        On a Linux system, you can acquire your OS, CPU, GPU, and ROCm version (for filling out this report) with the following commands:
        echo "OS:" && cat /etc/os-release | grep -E "^(NAME=|VERSION=)";
        echo "CPU: " && cat /proc/cpuinfo | grep "model name" | sort --unique;
        echo "GPU:" && /opt/rocm/bin/rocminfo | grep -E "^\s*(Name|Marketing Name)";
        echo "ROCm in /opt:" && ls -1 /opt | grep -E "rocm-";
  - type: textarea
    attributes:
      label: Problem Description
      description: Describe the issue you encountered.
      placeholder: "The steps to reproduce can be included here, or in the dedicated section further below."
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: What is the name and version number of the OS?
      placeholder: "e.g. Ubuntu 22.04.3 LTS (Jammy Jellyfish)"
    validations:
      required: true
  - type: input
    attributes:
      label: CPU
      description: What CPU did you encounter the issue on?
      placeholder: "e.g. AMD Ryzen 9 5900HX with Radeon Graphics"
    validations:
      required: true
  - type: input
    attributes:
      label: GPU
      description: What GPU(s) did you encounter the issue on?
      placeholder: "e.g. MI200"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Version
      description: What version(s) of ROCm did you encounter the issue on?
      placeholder: "e.g. 5.7.0"
    validations:
      required: true
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"
  - type: textarea
    attributes:
      label: Steps to Reproduce
      description: (Optional) Detailed steps to reproduce the issue.
      placeholder: Please also include what you expected to happen, and what actually did, at the failing step(s).
    validations:
      required: false
  - type: textarea
    attributes:
      label: Output of /opt/rocm/bin/rocminfo --support
      description: The output of rocminfo --support will help to better address the problem.
      placeholder: |
        ROCk module is loaded
        =====================
        HSA System Attributes
        =====================
        [...]
    validations:
      required: true

.github/ISSUE_TEMPLATE feature suggestion template (deleted file)

@@ -1,32 +0,0 @@
name: Feature Suggestion
description: Suggest an additional functionality, or new way of handling an existing functionality.
title: "[Feature]: "
body:
  - type: markdown
    attributes:
      value: |
        Thank you for taking the time to make a suggestion!
  - type: textarea
    attributes:
      label: Suggestion Description
      description: Describe your suggestion.
    validations:
      required: true
  - type: input
    attributes:
      label: Operating System
      description: (Optional) If this is for a specific OS, you can mention it here.
      placeholder: "e.g. Ubuntu"
  - type: input
    attributes:
      label: GPU
      description: (Optional) If this is for a specific GPU or GPU family, you can mention it here.
      placeholder: "e.g. MI200"
  - type: input
    attributes:
      label: ROCm Component
      description: (Optional) If this issue relates to a specific ROCm component, it can be mentioned here.
      placeholder: "e.g. rocBLAS"

.github/ISSUE_TEMPLATE/config.yml (deleted file)

@@ -1,5 +0,0 @@
blank_issues_enabled: false
contact_links:
  - name: ROCm Community Discussions
    url: https://github.com/RadeonOpenCompute/ROCm/discussions
    about: Please ask and answer questions here for anything ROCm.

.github/workflows/issue_retrieval.yml (vendored; Normal file; 22 changed lines)

@@ -0,0 +1,22 @@
name: Issue retrieval
on:
  issues:
    types: [opened]
jobs:
  auto-retrieve:
    runs-on: ubuntu-latest
    steps:
      - name: Generate a token
        id: generate_token
        uses: actions/create-github-app-token@v1
        with:
          app_id: ${{ secrets.ACTION_APP_ID }}
          private_key: ${{ secrets.ACTION_PEM }}
      - name: 'Retrieve Issue'
        uses: abhimeda/rocm_issue_management@main
        with:
          authentication-token: ${{ steps.generate_token.outputs.token }}
          github-organization: 'ROCm'
          project-num: '6'
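In this workflow, the first step mints a short-lived GitHub App installation token from the ACTION_APP_ID and ACTION_PEM repository secrets, and the second step passes that token to the issue-management action along with the target organization and project number.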

.wordlist.txt (spelling wordlist)

@@ -23,6 +23,7 @@ ASan
ASIC
ASICs
ASm
ATI
atmi
atomics
autogenerated
@@ -50,6 +51,7 @@ changelog
chiplet
CIFAR
CLI
CLion
CMake
cmake
CMakeLists
@@ -61,6 +63,7 @@ Codespaces
comgr
Commitizen
CommonMark
completers
composable
concretization
Concretized
@@ -81,6 +84,7 @@ CSE
CSn
csn
CSV
CTests
CU
cuBLAS
CUDA
@@ -90,6 +94,7 @@ cuRAND
CUs
cuSOLVER
cuSPARSE
CXX
dataset
datasets
dataspace
@@ -103,7 +108,9 @@ Dependabot
deserializers
detections
dev
DevCap
devicelibs
devsel
DGEMM
disambiguates
distro
@@ -112,7 +119,6 @@ DMA
DNN
DNNL
Dockerfile
DockerHub
Doxygen
DPM
DRI
@@ -151,6 +157,7 @@ GDR
GDS
GEMM
GEMMs
GenZ
gfortran
gfx
GIM
@@ -194,6 +201,7 @@ hipSPARSELt
hipTensor
HPC
HPCG
HPE
HPL
HSA
hsa
@@ -201,6 +209,8 @@ hsakmt
HWE
ib_core
ICV
IDE
IDEs
ImageNet
IMDB
inband
@@ -224,6 +234,7 @@ IOP
IOPM
IOV
ipo
IRQ
ISA
ISV
ISVs
@@ -236,6 +247,7 @@ KVM
LAPACK
LCLK
LDS
libfabric
libjpeg
libs
linearized
@@ -268,6 +280,8 @@ mivisionx
mkdir
mlirmiopen
MMA
MMIO
MMIOH
MNIST
MPI
MSVC
@@ -329,14 +343,17 @@ perl
PIL
PILImage
PowerShell
PnP
pragma
pre
prebuilt
precompiled
prefetch
prefetchable
preprocess
preprocessing
preq
prequantized
prerequisites
PRNG
profiler
@@ -348,6 +365,7 @@ PyPi
PyTorch
Qcycles
quasirandom
queueing
Radeon
RadeonOpenCompute
RCCL
@@ -369,6 +387,7 @@ Rickle
roadmap
roc
ROC
RoCE
rocAL
rocALUTION
rocalution
@@ -385,6 +404,7 @@ rocm
ROCm
ROCmCC
rocminfo
rocMLIR
ROCmSoftwarePlatform
ROCmValidationSuite
rocPRIM
@@ -410,6 +430,7 @@ RST
runtime
runtimes
RW
Ryzen
SALU
SBIOS
SCA
@@ -431,11 +452,13 @@ Shlens
sigmoid
SIGQUIT
SIMD
SIMDs
SKU
SKUs
skylake
sL
SLES
sm
SMEM
SMI
smi
@@ -455,6 +478,7 @@ subexpression
subfolder
subfolders
supercomputing
Supermicro
SWE
Szegedy
tagram
@@ -477,6 +501,7 @@ toolchains
toolset
toolsets
TorchAudio
TorchMIGraphX
TorchScript
TorchServe
TorchVision
@@ -494,6 +519,7 @@ UCX
UIF
Uncached
uncached
uncorrectable
Unhandled
uninstallation
unsqueeze

File diff suppressed because it is too large.

CMakeLists.txt (Normal file; 40 changed lines)

@@ -0,0 +1,40 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
cmake_minimum_required(VERSION 3.18.0)
project(ROCm VERSION 5.7.1 LANGUAGES NONE)
option(BUILD_DOCS "Build ROCm documentation" ON)
include(GNUInstallDirs)
# Adding default path cmake modules
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake/Modules")
# Handle dependencies
include(Dependencies)
# Build docs
if(BUILD_DOCS)
  add_subdirectory(docs)
endif()

CONTRIBUTING.md

@@ -1,229 +1,94 @@
# Contributing to ROCm documentation
AMD values and encourages contributions to our code and documentation. If you choose to
contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
process, to which we are dedicated.
If you have issues when trying to contribute, refer to the
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
repository.
## Folder structure and naming convention
Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
`/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
(`/`) folder.
All images are stored in the `/docs/data` folder. An image's file path mirrors that of the documentation
file where it is used.
Our naming structure uses kebab case; for example, `my-file-name.rst`.
## Supported formats and syntax
Our documentation includes both Markdown and RST files. We are gradually transitioning existing
Markdown to RST in order to more effectively meet our documentation needs. When contributing,
RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.
We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) syntax and compile
our API references using [Doxygen](https://www.doxygen.nl/).
The following table shows some common documentation components and the syntax convention we
use for each:
<table>
<tr>
<th>Component</th>
<th>RST syntax</th>
</tr>
<tr>
<td>Code blocks</td>
<td>
```rst
.. code-block:: language-name
My code block.
```
</td>
</tr>
<tr>
<td>Cross-referencing internal files</td>
<td>
```rst
:doc:`Title <../path/to/file/filename>`
```
</td>
</tr>
<tr>
<td>External links</td>
<td>
```rst
`link name <URL>`_
```
</td>
</tr>
<tr>
<tr>
<td>Headings</td>
<td>
```rst
******************
Chapter title (H1)
******************
Section title (H2)
===============
Subsection title (H3)
---------------------
Sub-subsection title (H4)
^^^^^^^^^^^^^^^^^^^^
```
</td>
</tr>
<tr>
<td>Images</td>
<td>
```rst
.. image:: image1.png
```
</td>
</tr>
<tr>
<td>Internal links</td>
<td>
```rst
1. Add a tag to the section you want to reference:
.. _my-section-tag: section-1
Section 1
==========
2. Link to your tag:
As shown in :ref:`section-1`.
```
</td>
</tr>
<tr>
<tr>
<td>Lists</td>
<td>
```rst
# Ordered (numbered) list item
* Unordered (bulleted) list item
```
</td>
</tr>
<tr>
<tr>
<td>Math (block)</td>
<td>
```rst
.. math::
A = \begin{pmatrix}
0.0 & 1.0 & 1.0 & 3.0 \\
4.0 & 5.0 & 6.0 & 7.0 \\
\end{pmatrix}
```
</td>
</tr>
<tr>
<td>Math (inline)</td>
<td>
```rst
:math:`2 \times 2 `
```
</td>
</tr>
<tr>
<td>Notes</td>
<td>
```rst
.. note::
My note here.
```
</td>
</tr>
<tr>
<td>Tables</td>
<td>
```rst
.. csv-table:: Optional title here
:widths: 30, 70 #optional column widths
:header: "entry1 header", "entry2 header"
"entry1", "entry2"
```
</td>
</tr>
</table>
## Language and style
We use the
[Google developer documentation style guide](https://developers.google.com/style/highlights) to
guide our content.
Font size and type, page layout, white space control, and other formatting
details are controlled via
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
of any formatting issues, create a pull request in our
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.
## Building our documentation
<!-- % TODO: Fix the link to be able to work at every files -->
To learn how to build our documentation, refer to
[Building documentation](./building.md).
<head>
<meta charset="UTF-8">
<meta name="description" content="Contributing to ROCm">
<meta name="keywords" content="ROCm, contributing, contribute, maintainer, contributor">
</head>
# Contribute to ROCm
AMD values and encourages contributions to our code and documentation. If you want to contribute
to our ROCm repositories, first review the following guidance. For documentation-specific information,
see [Contributing to ROCm docs](https://rocm.docs.amd.com/en/latest/contribute/contribute-docs.html).
ROCm is a software stack made up of a collection of drivers, development tools, and APIs that enable
GPU programming from low-level kernel to end-user applications. Because some of our components
are inherited from external projects (such as
[LLVM](https://github.com/ROCm/llvm-project) and
[Kernel driver](https://github.com/ROCm/ROCK-Kernel-Driver)), these use
project-specific contribution guidelines and workflow. Refer to their repositories for more information.
All other ROCm components follow the workflow described in the following sections.
## Development workflow
ROCm uses GitHub to host code, collaborate, and manage version control. We use pull requests (PRs)
for all changes within our repositories. We use
[GitHub issues](https://github.com/ROCm/ROCm/issues) to track known issues, such as
bugs.
### Issue tracking
Before filing a new issue, search the
[existing issues](https://github.com/ROCm/ROCm/issues) to make sure your issue isn't
already listed.
General issue guidelines:
* Use your best judgement for issue creation. If your issue is already listed, upvote the issue and
comment or post to provide additional details, such as how you reproduced this issue.
* If you're not sure if your issue is the same, err on the side of caution and file your issue.
You can add a comment to include the issue number (and link) for the similar issue. If we evaluate
your issue as being the same as the existing issue, we'll close the duplicate.
* If your issue doesn't exist, use the issue template to file a new issue.
* When filing an issue, be sure to provide as much information as possible, including script output so
we can collect information about your configuration. This helps reduce the time required to
reproduce your issue.
* Check your issue regularly, as we may require additional information to successfully reproduce the
issue.
### Pull requests
When you create a pull request, you should target the default branch. Our repositories typically use the **develop** branch as the default integration branch.
When creating a PR, use the following process. Note that each repository may include additional,
project-specific steps. Refer to each repository's PR process for any additional steps.
* Identify the issue you want to fix
* Target the default branch (usually the **develop** branch) for integration
* Ensure your code builds successfully
* Each component has a suite of test cases to run; include the log of the successful test run in your PR
* Do not break existing test cases
* New functionality is only merged with new unit tests
* If your PR includes a new feature, you must provide an application or test so we can ensure that the
feature works and continues to be valid in the future
* Tests must have good code coverage
* Submit your PR and work with the reviewer or maintainer to get your PR approved
* Once approved, the PR is brought onto internal CI systems and may be merged into the component
during our release cycle, as coordinated by the maintainer
* We'll inform you once your change is committed
:::{important}
By creating a PR, you agree to allow your contribution to be licensed under the
terms of the LICENSE.txt file in the corresponding repository. Different repositories may use different
licenses.
:::
You can look up each license on the [ROCm licensing](https://rocm.docs.amd.com/en/latest/about/license.html) page.
### New feature development
Use the [GitHub Discussion forum](https://github.com/ROCm/ROCm/discussions)
(Ideas category) to propose new features. Our maintainers are happy to provide direction and
feedback on feature development.
### Documentation
Submit ROCm documentation changes to our
[documentation repository](https://github.com/ROCm/ROCm). You must update
documentation related to any new feature or API contribution.
Note that each ROCm project uses its own repository for documentation.
## Future development workflow
The current ROCm development workflow is GitHub-based. If, in the future, we change this platform,
the tools and links may change. In this instance, we will update contribution guidelines accordingly.

GOVERNANCE.md (Normal file; 60 changed lines)

@@ -0,0 +1,60 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm governance model">
<meta name="keywords" content="ROCm, governance">
</head>
# Governance model
ROCm is a software stack made up of a collection of drivers, development tools, and APIs that enable
GPU programming from the low-level kernel to end-user applications.
Components of ROCm that are inherited from external projects (such as
[LLVM](https://github.com/ROCm/llvm-project) and
[Kernel driver](https://github.com/ROCm/ROCK-Kernel-Driver)) follow their own
governance model and code of conduct. All other components of ROCm are governed by this
document.
## Governance
ROCm is led and managed by AMD.
We welcome contributions from the community. Our maintainers review all proposed changes to
ROCm.
## Roles
* **Maintainers** are responsible for their designated component and repositories.
* **Contributors** provide input and suggest changes to existing components.
### Maintainers
Maintainers are appointed by AMD. They are able to approve changes and can commit to our
repositories. They must use pull requests (PRs) for all changes.
You can find the list of maintainers in the CODEOWNERS file of each repository. Code owners differ
between repositories.
### Contributors
If you're not a maintainer, you're a contributor. We encourage the ROCm community to contribute in
several ways:
* Help other community members by posting questions or solutions on our
[GitHub discussion forums](https://github.com/ROCm/ROCm/discussions)
* Notify us of bugs by filing an issue report on
[GitHub Issues](https://github.com/ROCm/ROCm/issues)
* Improve our documentation by submitting a PR to our
[repository](https://github.com/ROCm/ROCm/)
* Improve the code base (for smaller or contained changes) by submitting a PR to the component
* Suggest larger features by adding to the *Ideas* category in the
[GitHub discussion forum](https://github.com/ROCm/ROCm/discussions)
For more information, refer to our [contribution guidelines](CONTRIBUTING.md).
## Code of conduct
To engage with any AMD ROCm component that is hosted on GitHub, you must abide by the
[GitHub community guidelines](https://docs.github.com/en/site-policy/github-terms/github-community-guidelines)
and the
[GitHub community code of conduct](https://docs.github.com/en/site-policy/github-terms/github-community-code-of-conduct).

README.md

@@ -1,4 +1,4 @@
# AMD ROCm™ platform
# AMD ROCm Software
ROCm is an open-source stack, composed primarily of open-source software, designed for graphics
processing unit (GPU) computation. ROCm consists of a collection of drivers, development tools, and
@@ -34,7 +34,7 @@ The ROCm documentation homepage is [rocm.docs.amd.com](https://rocm.docs.amd.com)
### Building our documentation
For a quick-start build, use the following code. For more options and detail, refer to
[Building documentation](./contribute/building.md).
[Building documentation](./docs/contribute/building.md).
```bash
cd docs
@@ -44,7 +44,15 @@ pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
Alternatively, a CMake build is supported:
```bash
cmake -B build
cmake --build build --target=doc
```
## Older ROCm releases
For release information for older ROCm releases, refer to
[`CHANGELOG`](./CHANGELOG.md).
For release information for older ROCm releases, refer to the
[CHANGELOG](./CHANGELOG.md).

RELEASE.md

@@ -1,7 +1,4 @@
# Release Notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->
# Release notes
<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
<!-- markdownlint-disable no-duplicate-header -->
@@ -11,65 +8,47 @@
<!-- spellcheck-disable -->
Welcome to the release notes for the ROCm platform.
This page contains the release notes for AMD ROCm Software.
-------------------
## ROCm 5.7.1
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
## ROCm 6.0.2
### What's New in This Release
The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.
### ROCm Libraries
#### rocBLAS
New functionality, *rocblas-gemm-tune*, and an environment variable, *ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH*, were added to rocBLAS in the ROCm 5.7.1 release.
*rocblas-gemm-tune* is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command-line interface that mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used by setting the environment variable ROCBLAS_LAYER=4.
For more information on rocBLAS logging, see Logging in rocBLAS in the [API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).
In the output for an example input file (note that the selected GEMM index may differ), the far-right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be used directly in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code that uses kernels directly via their indices.
If the output is stored in a file, the results can be used to override the default kernel selection with the found kernels by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH to point to the stored file.
For more details, refer to the [rocBLAS Programmer's Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune).
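As a hedged illustration of that tuning workflow (the application name and file paths below are made up; only the tool and environment-variable names come from the text above):

```bash
# Capture rocBLAS profile logs (logging layer 4) for the GEMMs the app runs
ROCBLAS_LAYER=4 ./my_app 2> gemm_problems.yaml

# Ask rocblas-gemm-tune for the best solution_index per logged GEMM problem
rocblas-gemm-tune --yaml gemm_problems.yaml > tuned_solutions.txt

# Re-run with the tuned kernels overriding the default kernel selection
ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH=tuned_solutions.txt ./my_app
```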
#### HIP 5.7.1 (for ROCm 5.7.1)
ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
### Fixed defects
The *hipPointerGetAttributes* API returns the correct HIP memory type as *hipMemoryTypeManaged* for managed memory.
### Library Changes in ROCM 5.7.1
### Library changes in ROCm 6.0.2
| Library | Version |
|---------|---------|
| hipBLAS | [1.1.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-5.7.1) |
| hipCUB | [2.13.1](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-5.7.1) |
| hipFFT | [1.0.12](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-5.7.1) |
| hipSOLVER | 1.8.1 ⇒ [1.8.2](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-5.7.1) |
| hipSPARSE | [2.3.8](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-5.7.1) |
| MIOpen | [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-5.7.1) |
| rocALUTION | [2.1.11](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-5.7.1) |
| rocBLAS | [3.1.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.7.1) |
| rocFFT | [1.0.24](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-5.7.1) |
| rocm-cmake | [0.10.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-5.7.1) |
| rocPRIM | [2.13.1](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-5.7.1) |
| rocRAND | [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-5.7.1) |
| rocSOLVER | [3.23.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-5.7.1) |
| rocSPARSE | [2.5.4](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-5.7.1) |
| rocThrust | [2.18.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-5.7.1) |
| rocWMMA | [1.2.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-5.7.1) |
| Tensile | [4.38.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-5.7.1) |

### Library changes in ROCm 6.0.2

| Library | Version |
|---------|---------|
| AMDMIGraphX | ⇒ [2.8](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.0.2) |
| hipBLAS | ⇒ [2.0.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.0.2) |
| hipBLASLt | ⇒ [0.6.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.0.2) |
| hipCUB | ⇒ [3.0.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.0.2) |
| hipFFT | ⇒ [1.0.13](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.0.2) |
| hipRAND | ⇒ [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.0.2) |
| hipSOLVER | ⇒ [2.0.0](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.0.2) |
| hipSPARSE | ⇒ [3.0.0](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.0.2) |
| hipSPARSELt | ⇒ [0.1.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.0.2) |
| hipTensor | ⇒ [1.1.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.0.2) |
| MIOpen | ⇒ [2.19.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.0.2) |
| rccl | ⇒ [2.15.5](https://github.com/ROCm/rccl/releases/tag/rocm-6.0.2) |
| rocALUTION | ⇒ [3.0.3](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.0.2) |
| rocBLAS | ⇒ [4.0.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.0.2) |
| rocFFT | ⇒ [1.0.25](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.0.2) |
| rocm-cmake | ⇒ [0.11.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.0.2) |
| rocPRIM | ⇒ [3.0.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.0.2) |
| rocRAND | ⇒ [3.0.0](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.0.2) |
| rocSOLVER | ⇒ [3.24.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.0.2) |
| rocSPARSE | ⇒ [3.0.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.0.2) |
| rocThrust | ⇒ [3.0.0](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.0.2) |
| rocWMMA | ⇒ [1.3.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.0.2) |
| Tensile | ⇒ [4.39.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.0.2) |
#### hipSOLVER 1.8.2

hipSOLVER 1.8.2 for ROCm 5.7.1

##### Fixed

- Fixed conflicts between the hipsolver-dev and -asan packages by excluding
  hipsolver_module.f90 from the latter

#### hipFFT 1.0.13

hipFFT 1.0.13 for ROCm 6.0.2

##### Changes

- Removed the Git submodule for shared files between rocFFT and hipFFT; instead, the files are simply copied
  over (this should help simplify downstream builds and packaging)


@@ -0,0 +1,47 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# ###########################
# ROCm dependencies
# ###########################
include(FetchContent)
if(BUILD_DOCS)
  find_package(ROCM 0.11.0 CONFIG QUIET PATHS "${ROCM_PATH}") # First version with Sphinx doc gen improvement
  if(NOT ROCM_FOUND)
    message(STATUS "ROCm CMake not found. Fetching...")
    set(rocm_cmake_tag
        "c044bb52ba85058d28afe2313be98d9fed02e293" # develop@2023.09.12. (move to 6.0 tag when released)
        CACHE STRING "rocm-cmake tag to download")
    FetchContent_Declare(
      rocm-cmake
      GIT_REPOSITORY https://github.com/RadeonOpenCompute/rocm-cmake.git
      GIT_TAG ${rocm_cmake_tag}
      SOURCE_SUBDIR "DISABLE ADDING TO BUILD" # We don't really want to consume the build and test targets of ROCm CMake.
    )
    FetchContent_MakeAvailable(rocm-cmake)
    find_package(ROCM CONFIG REQUIRED NO_DEFAULT_PATH PATHS "${rocm-cmake_SOURCE_DIR}")
  else()
    find_package(ROCM 0.11.0 CONFIG REQUIRED PATHS "${ROCM_PATH}")
  endif()
endif()
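As a usage sketch, a documentation build gated behind this option might be configured as follows; the build directory name and the `ROCM_PATH` value are assumptions:

```bash
cmake -S . -B build -DBUILD_DOCS=ON -DROCM_PATH=/opt/rocm
cmake --build build
```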


@@ -1,22 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
<remote name="roc-github"
fetch="https://github.com/RadeonOpenCompute/" />
<remote name="rocm-devtools"
fetch="https://github.com/ROCm-Developer-Tools/" />
<remote name="rocm-swplat"
fetch="https://github.com/ROCmSoftwarePlatform/" />
<remote name="gpuopen-libs"
fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
<remote name="gpuopen-tools"
fetch="https://github.com/GPUOpen-Tools/" />
<remote name="KhronosGroup"
fetch="https://github.com/KhronosGroup/" />
<default revision="refs/tags/rocm-5.7.1"
remote="roc-github"
<remote name="rocm-org" fetch="https://github.com/ROCm/" />
<remote name="roc-github" fetch="https://github.com/RadeonOpenCompute/" />
<remote name="rocm-devtools" fetch="https://github.com/ROCm-Developer-Tools/" />
<remote name="rocm-swplat" fetch="https://github.com/ROCmSoftwarePlatform/" />
<remote name="gpuopen-libs" fetch="https://github.com/GPUOpen-ProfessionalCompute-Libraries/" />
<remote name="gpuopen-tools" fetch="https://github.com/GPUOpen-Tools/" />
<remote name="KhronosGroup" fetch="https://github.com/KhronosGroup/" />
<default revision="refs/tags/rocm-6.0.2"
remote="rocm-org"
sync-c="true"
sync-j="4" />
<!--list of projects for ROCM-->
<!--list of projects for ROCm-->
<project name="ROCK-Kernel-Driver" />
<project name="ROCT-Thunk-Interface" />
<project name="ROCR-Runtime" />
@@ -26,54 +21,57 @@ fetch="https://github.com/KhronosGroup/" />
<project name="rocm-cmake" />
<project name="rocminfo" />
<project name="rocm_bandwidth_test" />
<project name="rocprofiler" remote="rocm-devtools" />
<project name="roctracer" remote="rocm-devtools" />
<project name="rocprofiler" />
<project name="roctracer" />
<project path="ROCm-OpenCL-Runtime/api/opencl/khronos/icd" name="OpenCL-ICD-Loader" remote="KhronosGroup" revision="6c03f8b58fafd9dd693eaac826749a5cfad515f8" />
<project name="clang-ocl" />
<project name="rdc" />
<!--HIP Projects-->
<project name="HIP" remote="rocm-devtools" />
<project name="HIP-Examples" remote="rocm-devtools" />
<project name="clr" remote="rocm-devtools" />
<project name="HIPIFY" remote="rocm-devtools" />
<project name="HIPCC" remote="rocm-devtools" />
<project name="HIP" />
<project name="HIP-Examples" />
<project name="clr" />
<project name="hipother" />
<project name="HIPIFY" />
<project name="HIPCC" />
<!-- The following projects are all associated with the AMDGPU LLVM compiler -->
<project name="llvm-project" />
<project name="ROCm-Device-Libs" />
<project name="ROCm-CompilerSupport" />
<project name="half" remote="rocm-swplat" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
<project name="half" revision="37742ce15b76b44e4b271c1e66d13d2fa7bd003e" />
<!-- gdb projects -->
<project name="ROCgdb" remote="rocm-devtools" />
<project name="ROCdbgapi" remote="rocm-devtools" />
<project name="rocr_debug_agent" remote="rocm-devtools" />
<project name="ROCgdb" />
<project name="ROCdbgapi" />
<project name="rocr_debug_agent" />
<!-- ROCm Libraries -->
<project groups="mathlibs" name="rocBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="Tensile" remote="rocm-swplat" />
<project groups="mathlibs" name="hipTensor" remote="rocm-swplat" />
<project groups="mathlibs" name="hipBLAS" remote="rocm-swplat" />
<project groups="mathlibs" name="rocFFT" remote="rocm-swplat" />
<project groups="mathlibs" name="hipFFT" remote="rocm-swplat" />
<project groups="mathlibs" name="rocRAND" remote="rocm-swplat" />
<project groups="mathlibs" name="rocSPARSE" remote="rocm-swplat" />
<project groups="mathlibs" name="rocSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSOLVER" remote="rocm-swplat" />
<project groups="mathlibs" name="hipSPARSE" remote="rocm-swplat" />
<project groups="mathlibs" name="rocALUTION" remote="rocm-swplat" />
<project groups="mathlibs" name="rocThrust" remote="rocm-swplat" />
<project groups="mathlibs" name="hipCUB" remote="rocm-swplat" />
<project groups="mathlibs" name="rocPRIM" remote="rocm-swplat" />
<project groups="mathlibs" name="rocWMMA" remote="rocm-swplat" />
<project groups="mathlibs" name="rccl" remote="rocm-swplat" />
<project name="rocMLIR" remote="rocm-swplat" />
<project name="MIOpen" remote="rocm-swplat" />
<project name="composable_kernel" remote="rocm-swplat" />
<project name="MIVisionX" remote="gpuopen-libs" />
<project name="rpp" remote="gpuopen-libs" />
<project name="hipfort" remote="rocm-swplat" />
<project name="AMDMIGraphX" remote="rocm-swplat" />
<project name="ROCmValidationSuite" remote="rocm-devtools" />
<project groups="mathlibs" name="rocBLAS" />
<project groups="mathlibs" name="Tensile" />
<project groups="mathlibs" name="hipTensor" />
<project groups="mathlibs" name="hipBLAS" />
<project groups="mathlibs" name="hipBLASLt" />
<project groups="mathlibs" name="rocFFT" />
<project groups="mathlibs" name="hipFFT" />
<project groups="mathlibs" name="rocRAND" />
<project groups="mathlibs" name="hipRAND" />
<project groups="mathlibs" name="rocSPARSE" />
<project groups="mathlibs" name="hipSPARSELt" />
<project groups="mathlibs" name="rocSOLVER" />
<project groups="mathlibs" name="hipSOLVER" />
<project groups="mathlibs" name="hipSPARSE" />
<project groups="mathlibs" name="rocALUTION" />
<project groups="mathlibs" name="rocThrust" />
<project groups="mathlibs" name="hipCUB" />
<project groups="mathlibs" name="rocPRIM" />
<project groups="mathlibs" name="rocWMMA" />
<project groups="mathlibs" name="rccl" />
<project name="MIOpen" />
<project name="composable_kernel" />
<project name="MIVisionX" />
<project name="rpp" />
<project name="hipfort" />
<project name="AMDMIGraphX" />
<project name="ROCmValidationSuite" />
<!-- Projects for OpenMP-Extras -->
<project name="aomp" path="openmp-extras/aomp" remote="rocm-devtools" />
<project name="aomp-extras" path="openmp-extras/aomp-extras" remote="rocm-devtools" />
<project name="flang" path="openmp-extras/flang" remote="rocm-devtools" />
<project name="aomp" path="openmp-extras/aomp" />
<project name="aomp-extras" path="openmp-extras/aomp-extras" />
<project name="flang" path="openmp-extras/flang" />
</manifest>
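For reference, this manifest is consumed by the `repo` tool. A checkout of the 6.0.2 sources might look like the following sketch; the manifest repository URL is an assumption, while the revision comes from the manifest's `default` element above:

```bash
mkdir ROCm-src && cd ROCm-src
repo init -u https://github.com/ROCm/ROCm.git -b refs/tags/rocm-6.0.2
repo sync -c -j4   # mirrors the manifest's sync-c="true" and sync-j="4" defaults
```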

docs/CMakeLists.txt (new file, 33 lines)

@@ -0,0 +1,33 @@
# MIT License
#
# Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
include(ROCMSphinxDoc)
rocm_add_sphinx_doc(
  "${CMAKE_CURRENT_SOURCE_DIR}"
  OUTPUT_DIR html
  BUILDER html
)
install(
  DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/html"
  DESTINATION "${CMAKE_INSTALL_DOCDIR}")


@@ -1,63 +0,0 @@
# Third party support matrix
ROCm™ supports various third-party libraries and frameworks. Supported versions
are tested and known to work. Non-supported versions may also work, but
aren't tested.
## Deep learning
ROCm releases support the most recent and two prior releases of PyTorch and
TensorFlow.
| ROCm | [PyTorch](https://github.com/pytorch/pytorch/releases/) | [TensorFlow](https://github.com/tensorflow/tensorflow/releases/) |
|:------|:--------------------------:|:--------------------:|
| 5.0.2 | 1.8, 1.9, 1.10 | 2.6, 2.7, 2.8 |
| 5.1.3 | 1.9, 1.10, 1.11 | 2.7, 2.8, 2.9 |
| 5.2.x | 1.10, 1.11, 1.12 | 2.8, 2.9, 2.9 |
| 5.3.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.8, 2.9, 2.10 |
| 5.4.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.8, 2.9, 2.10, 2.11 |
| 5.5.x | 1.10.1, 1.11, 1.12.1, 1.13 | 2.10, 2.11, 2.13 |
| 5.6.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |
| 5.7.x | 1.12.1, 1.13, 2.0 | 2.12, 2.13 |
(communication-libraries)=
## Communication libraries
ROCm supports [OpenUCX](https://openucx.org/), an open-source,
production-grade communication framework for data-centric and high performance
applications.
| UCX version | ROCm 5.4 and older | ROCm 5.5 and newer |
|:----------|:------------------:|:------------------:|
| 1.14.0 and older | COMPATIBLE | INCOMPATIBLE |
| 1.14.1+ | COMPATIBLE | COMPATIBLE |
The Unified Collective Communication ([UCC](https://github.com/openucx/ucc)) library also has
support for ROCm devices.
| UCC version | ROCm 5.5 and older | ROCm 5.6 and newer |
|:----------|:------------------:|:------------------:|
| 1.1.0 and older | COMPATIBLE | INCOMPATIBLE |
| 1.2.0+ | COMPATIBLE | COMPATIBLE |
## Algorithm libraries
ROCm releases provide algorithm libraries with interfaces compatible with
contemporary CUDA / NVIDIA HPC SDK alternatives.
* Thrust → rocThrust
* CUB → hipCUB
| ROCm | Thrust / CUB | HPC SDK |
|:------|:------------:|:-------:|
| 5.0.2 | 1.14 | 21.9 |
| 5.1.3 | 1.15 | 22.1 |
| 5.2.x | 1.15 | 22.2, 22.3 |
| 5.3.x | 1.16 | 22.7 |
| 5.4.x | 1.16 | 22.9 |
| 5.5.x | 1.17 | 22.9 |
| 5.6.x | 1.17.2 | 22.9 |
| 5.7.x | 1.17.2 | 22.9 |
For the latest documentation of these libraries, refer to [API libraries](../../reference/library-index.md).


@@ -1,130 +0,0 @@
******************************************************************
Docker image support matrix
******************************************************************
AMD validates and publishes `PyTorch <https://hub.docker.com/r/rocm/pytorch>`_ and
`TensorFlow <https://hub.docker.com/r/rocm/tensorflow>`_ containers on Docker Hub. The following
tags, and associated inventories, are validated with ROCm 5.7.
.. tab-set::
.. tab-item:: PyTorch
.. tab-set::
.. tab-item:: Ubuntu 22.04
Tag: `rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1/images/sha256-21df283b1712f3d73884b9bc4733919374344ceacb694e8fbc2c50bdd3e767ee>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.10 <https://www.python.org/downloads/release/python-31013/>`_
* `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.15.0 <https://github.com/pytorch/vision/tree/release/0.15>`_
* `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
* `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
* `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
* `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_
.. tab-item:: Ubuntu 20.04
Tag: `rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4)>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
* `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
* `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
* `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
* `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_
Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.12.1/images/sha256-e67db9373c045a7b6defd43cc3d067e7d49fd5d380f3f8582d2fb219c1756e1f>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.12>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.13.1 <https://github.com/pytorch/vision/tree/v0.13.1>`_
* `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
* `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
* `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
* `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_
Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_1.13.1/images/sha256-ed99d159026093d2aaf5c48c1e4b0911508773430377051372733f75c340a4c1>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `Torch 1.12.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/1.13>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.14.0 <https://github.com/pytorch/vision/tree/v0.14.0>`_
* `Tensorboard 2.12.0 <https://github.com/tensorflow/tensorboard/tree/2.12.0>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
* `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
* `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
* `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_
Tag: `Ubuntu rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1 <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1/images/sha256-4dd86046e5f777f53ae40a75ecfc76a5e819f01f3b2d40eacbb2db95c2f971d4>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `Torch 2.0.1 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/release/2.0>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.15.2 <https://github.com/pytorch/vision/tree/release/0.15>`_
* `Tensorboard 2.14.0 <https://github.com/tensorflow/tensorboard/tree/2.14>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
* `UCX 1.10.0 <https://github.com/openucx/ucx/tree/v1.10.0>`_
* `OMPI 4.0.3 <https://github.com/open-mpi/ompi/tree/v4.0.3>`_
* `OFED 5.4.3 <https://content.mellanox.com/ofed/MLNX_OFED-5.3-1.0.5.0/MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64.tgz>`_
.. tab-item:: CentOS 7
Tag: `rocm/pytorch:rocm5.7_centos7_py3.9_pytorch_staging <https://hub.docker.com/layers/rocm/pytorch/rocm5.7_centos7_py3.9_pytorch_staging/images/sha256-92240cdf0b4aa7afa76fc78be995caa19ee9c54b5c9f1683bdcac28cedb58d2b>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/yum/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `Torch 2.1.0 <https://github.com/ROCmSoftwarePlatform/pytorch/tree/rocm5.7_internal_testing>`_
* `Apex 0.1 <https://github.com/ROCmSoftwarePlatform/apex/tree/v0.1>`_
* `Torchvision 0.16.0 <https://github.com/pytorch/vision/tree/release/0.16>`_
* `MAGMA <https://bitbucket.org/icl/magma/src/master/>`_
.. tab-item:: TensorFlow
.. tab-set::
.. tab-item:: Ubuntu 20.04
Tag: `rocm5.7-tf2.12-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.12-dev/images/sha256-e0ac4d49122702e5167175acaeb98a79b9500f585d5e74df18facf6b52ce3e59>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `tensorflow-rocm 2.12.1 <https://pypi.org/project/tensorflow-rocm/2.12.1.570/>`_
* `Tensorboard 2.12.3 <https://github.com/tensorflow/tensorboard/tree/2.12>`_
Tag: `rocm5.7-tf2.13-dev <https://hub.docker.com/layers/rocm/tensorflow/rocm5.7-tf2.13-dev/images/sha256-6f995539eebc062aac2b53db40e2b545192d8b032d0deada8c24c6651a7ac332>`_
* Inventory:
* `ROCm 5.7 <https://repo.radeon.com/rocm/apt/5.7/>`_
* `Python 3.9 <https://www.python.org/downloads/release/python-3918/>`_
* `tensorflow-rocm 2.13.0 <https://pypi.org/project/tensorflow-rocm/2.13.0.570/>`_
* `Tensorboard 2.13.0 <https://github.com/tensorflow/tensorboard/tree/2.13>`_


@@ -1,116 +0,0 @@
# GPU and OS support (Linux)
(linux-support)=
## Supported Linux distributions
AMD ROCm™ Platform supports the following Linux distributions.
::::{tab-set}
:::{tab-item} Supported
| Distribution | Processor Architectures | Validated Kernel | Support |
| :----------- | :---------------------: | :--------------: | ------: |
| RHEL 9.2 | x86-64 | 5.14 (5.14.0-284.11.1.el9_2.x86_64) | ✅ |
| RHEL 9.1 | x86-64 | 5.14.0-284.11.1.el9_2.x86_64 | ✅ |
| RHEL 8.8 | x86-64 | 4.18.0-477.el8.x86_64 | ✅ |
| RHEL 8.7 | x86-64 | 4.18.0-425.10.1.el8_7.x86_64 | ✅ |
| SLES 15 SP5 | x86-64 | 5.14.21-150500.53-default | ✅ |
| SLES 15 SP4 | x86-64 | 5.14.21-150400.24.63-default | ✅ |
| Ubuntu 22.04.2 | x86-64 | 5.19.0-45-generic | ✅ |
| Ubuntu 20.04.5 | x86-64 | 5.15.0-75-generic | ✅ |
:::{versionadded} 5.6
* RHEL 8.8 and 9.2 support is added.
* SLES 15 SP5 support is added.
:::
:::{tab-item} Unsupported
| Distribution | Processor Architectures | Validated Kernel | Support |
| :----------- | :---------------------: | :--------------: | ------: |
| RHEL 9.0 | x86-64 | 5.14 | ❌ |
| RHEL 8.6 | x86-64 | 5.14 | ❌ |
| SLES 15 SP3 | x86-64 | 5.3 | ❌ |
| Ubuntu 22.04.0 | x86-64 | 5.15 LTS, 5.17 OEM | ❌ |
| Ubuntu 20.04.4 | x86-64 | 5.13 HWE, 5.13 OEM | ❌ |
| Ubuntu 22.04.1 | x86-64 | 5.15 LTS | ❌ |
:::
::::
✅: **Supported** - AMD performs full testing of all ROCm components on distro
GA image.
❌: **Unsupported** - AMD no longer performs builds and testing on these
previously supported distro GA images.
## Virtualization support
ROCm supports virtualization for select GPUs only as shown below.
| Hypervisor | Version | GPU | Validated Guest OS (validated kernel) |
|----------------|----------|-------|----------------------------------------------------------------------------------|
| VMWare | ESXi 8 | MI250 | Ubuntu 20.04 (`5.15.0-56-generic`) |
| VMWare | ESXi 8 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
| VMWare | ESXi 7 | MI210 | Ubuntu 20.04 (`5.15.0-56-generic`), SLES 15 SP4 (`5.14.21-150400.24.18-default`) |
## Linux-supported GPUs
The table below shows supported GPUs for Instinct™, Radeon Pro™ and Radeon™
GPUs. Please click the tabs below to switch between GPU product lines. If a GPU
is not listed in this table, it is not officially supported by AMD.
:::::{tab-set}
::::{tab-item} AMD Instinct™
:sync: instinct
| Product Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) |Support |
|:------------:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Instinct™ MI250X | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI250 | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI210 | CDNA2 | gfx90a | ✅ |
| AMD Instinct™ MI100 | CDNA | gfx908 | ✅ |
| AMD Instinct™ MI50 | GCN5.1 | gfx906 | ✅ |
| AMD Instinct™ MI25 | GCN5.0 | gfx900 | ❌ |
::::
::::{tab-item} Radeon Pro™
:sync: radeonpro
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ Pro W7900 | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ Pro W6800 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro V620 | RDNA2 | gfx1030 | ✅ |
| AMD Radeon™ Pro VII | GCN5.1 | gfx906 | ✅ |
::::
::::{tab-item} Radeon™
:sync: radeon
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Support|
|:----:|:---------------:|:--------------------------------------------------------------------:|:-------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ (Ubuntu 22.04 only)|
| AMD Radeon™ VII | GCN5.1 | gfx906 | ✅ |
::::
:::::
### Support status
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.
## CPU support
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs released after the
1st-generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.


@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="OpenMP support in ROCm">
<meta name="keywords" content="OpenMP, LLVM, OpenMP toolchain">
</head>
# OpenMP support in ROCm
## Introduction
@@ -9,7 +15,8 @@ Along with host APIs, the OpenMP compilers support offloading code and data onto
GPU devices. This document briefly describes the installation location of the
OpenMP toolchain, example usage of device offloading, and usage of `rocprof`
with OpenMP applications. The GPUs supported are the same as those supported by
this ROCm release. See the list of supported GPUs for [Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md).
this ROCm release. See the list of supported GPUs for {doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.
The ROCm OpenMP compiler is implemented using LLVM compiler technology.
The following image illustrates the internal steps taken to translate a user's application into an executable that can offload computation to the AMDGPU. The compilation is a two-pass process: Pass 1 compiles the application to generate the CPU code, and Pass 2 links the CPU code to the AMDGPU device code.
@@ -41,10 +48,10 @@ cd $ROCM_PATH/share/openmp-extras/examples/openmp/veccopy
sudo make run
```
```{note}
:::{note}
`sudo` is required since we are building inside the `/opt` directory.
Alternatively, copy the files to your home directory first.
```
:::
The above invocation of Make compiles and runs the program. Note the options
that are required for target offload from an OpenMP program:
@@ -53,13 +60,15 @@ that are required for target offload from an OpenMP program:
-fopenmp --offload-arch=<gpu-arch>
```
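For example, a standalone source file might be compiled and run as in the following sketch, where `veccopy.c` is a placeholder and `gfx90a` stands in for your GPU's architecture:

```bash
# The ROCm-provided Clang lives under $ROCM_PATH/llvm/bin.
$ROCM_PATH/llvm/bin/clang -O2 -fopenmp --offload-arch=gfx90a veccopy.c -o veccopy
./veccopy
```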
```{note}
:::{note}
The compiler also accepts the alternative offloading notation:
```bash
-fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=<gpu-arch>
```
:::
Obtain the value of `gpu-arch` by running the following command:
```bash
@@ -321,10 +330,10 @@ double a = 0.0;
a = a + 1.0;
```
```{note}
:::{note}
`AMD_unsafe_fp_atomics` is an alias for `AMD_fast_fp_atomics`, and
`AMD_safe_fp_atomics` is implemented with a compare-and-swap loop.
```
:::
To disable the generation of fast floating-point atomic instructions at the file
level, build using the option `-msafe-fp-atomics` or use a hint clause on a


@@ -1,24 +0,0 @@
# User/kernel-space support matrix
ROCm™ provides forward and backward compatibility between the Kernel Fusion
Driver (KFD) and its user space software for +/- 2 releases. This table shows
the compatibility combinations that are currently supported.
| KFD | Tested user space versions |
|:------|:--------------------------:|
| 5.0.2 | 5.1.0, 5.2.0 |
| 5.1.0 | 5.0.2 |
| 5.1.3 | 5.2.0, 5.3.0 |
| 5.2.0 | 5.0.2, 5.1.3 |
| 5.2.3 | 5.3.0, 5.4.0 |
| 5.3.0 | 5.1.3, 5.2.3 |
| 5.3.3 | 5.4.0, 5.5.0 |
| 5.4.0 | 5.2.3, 5.3.3 |
| 5.4.3 | 5.5.0, 5.6.0 |
| 5.4.4 | 5.5.0 |
| 5.5.0 | 5.3.3, 5.4.3 |
| 5.5.1 | 5.6.0, 5.7.0 |
| 5.6.0 | 5.4.3, 5.5.1 |
| 5.6.1 | 5.7.0 |
| 5.7.0 | 5.5.0, 5.6.1 |
| 5.7.1 | 5.5.0, 5.6.1 |


@@ -1,80 +0,0 @@
# GPU and OS support (Windows)
(windows-support)=
## Supported SKUs
AMD HIP SDK supports the following Windows variants.
| Distribution |Processor Architectures| Validated update |
|---------------------|-----------------------|--------------------|
| Windows 10 | x86-64 | 22H2 (GA) |
| Windows 11 | x86-64 | 22H2 (GA) |
| Windows Server 2022 | x86-64 | |
## Windows-supported GPUs
The table below shows supported GPUs for Radeon Pro™ and Radeon™ GPUs. Please
click the tabs below to switch between GPU product lines. If a GPU is not listed
in this table, it is not officially supported by AMD.
::::{tab-set}
:::{tab-item} Radeon Pro™
:sync: radeonpro
| Name | Architecture |[LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
| AMD Radeon Pro™ W7900 | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon Pro™ W7800 | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon Pro™ W6800 | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon Pro™ W6600 | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon Pro™ W5500 | RDNA1 | gfx1012 | ❌ | ❌ |
| AMD Radeon Pro™ VII | GCN5.1 | gfx906 | ❌ | ❌ |
:::
:::{tab-item} Radeon™
:sync: radeon
| Name | Architecture | [LLVM Target](https://www.llvm.org/docs/AMDGPUUsage.html#processors) | Runtime | HIP SDK |
|:----:|:------------:|:--------------------------------------------------------------------:|:-------:|:----------------:|
| AMD Radeon™ RX 7900 XTX | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7900 XT | RDNA3 | gfx1100 | ✅ | ✅ |
| AMD Radeon™ RX 7600 | RDNA3 | gfx1102 | ✅ | ✅ |
| AMD Radeon™ RX 6950 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6900 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 XT | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6800 | RDNA2 | gfx1030 | ✅ | ✅ |
| AMD Radeon™ RX 6750 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 XT | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6700 | RDNA2 | gfx1031 | ✅ | ❌ |
| AMD Radeon™ RX 6650 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 XT | RDNA2 | gfx1032 | ✅ | ❌ |
| AMD Radeon™ RX 6600 | RDNA2 | gfx1032 | ✅ | ❌ |
:::
::::
### Component support
ROCm components are described in [What is ROCm?](../../what-is-rocm.md) Support
on Windows is provided with two levels of enablement.
* **Runtime**: Enables the use of the HIP and OpenCL runtimes only.
* **HIP SDK**: Runtime plus additional components, as listed in [Libraries](../../reference/library-index.md).
Note that some math libraries are Linux exclusive.
### Support status
✅: **Supported** - AMD enables these GPUs in our software distributions for
the corresponding ROCm product.
⚠️: **Deprecated** - Support will be removed in a future release.
❌: **Unsupported** - This configuration is not enabled in our software
distributions.
## CPU support
ROCm requires CPUs that support PCIe™ atomics. Modern CPUs released after the
1st-generation AMD Zen CPU and Intel™ Haswell support PCIe atomics.


@@ -1,6 +1,10 @@
# License
> Note: This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that primarily contains documentation. For other licensing information, refer to the [Licensing Terms page](./licensing).
:::{note}
This license applies to the [ROCm repository](https://github.com/RadeonOpenCompute/ROCm) that
primarily contains documentation. For other licensing information, refer to the
[Licensing Terms page](./licensing).
:::
```{include} ../../LICENSE
```


@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm licensing terms">
<meta name="keywords" content="license, licensing terms">
</head>
# ROCm licensing terms
ROCm™ is released by Advanced Micro Devices, Inc. and is licensed per component separately.
@@ -108,7 +114,7 @@ companies.
## Package licensing
```{attention}
:::{attention}
AQL Profiler and AOCC CPU optimization are both provided in binary form, each
subject to the license agreement enclosed in the directory for the binary and is
available here: `/opt/rocm/share/doc/rocm-llvm-alt/EULA`. By using, installing,
@@ -116,7 +122,7 @@ copying or distributing AQL Profiler and/or AOCC CPU Optimizations, you agree to
the terms and conditions of this license agreement. If you do not agree to the
terms of this agreement, do not install, copy or use the AQL Profiler and/or the
AOCC CPU Optimizations.
```
:::
For the rest of the ROCm packages, you can find the licensing information at the
following location: `/opt/rocm/share/doc/<component-name>/`


@@ -1,93 +0,0 @@
# What's new in ROCm?
ROCm is now supported on Windows.
## Windows support
Starting with ROCm 5.5, the HIP SDK brings a subset of ROCm to developers on Windows.
The collection of features enabled on Windows is referred to as the HIP SDK.
These features allow developers to use the HIP runtime, HIP math libraries
and HIP Primitive libraries. The following table shows the differences
between Windows and Linux releases.
|Component|Linux|Windows|
|---------|-----|-------|
|Driver|Radeon Software for Linux |AMD Software Pro Edition|
|Compiler|`hipcc`/`amdclang++`|`hipcc`/`clang++`|
|Debugger|`rocgdb`|no debugger available|
|Profiler|`rocprof`|[Radeon GPU Profiler](https://gpuopen.com/rgp/)|
|Porting Tools|HIPIFY|Coming Soon|
|Runtime|HIP (Open Sourced)|HIP (closed source)|
|Math Libraries|Supported|Supported|
|Primitives Libraries|Supported|Supported|
|Communication Libraries|Supported|Not Available|
|AI Libraries|MIOpen, MIGraphX|Not Available|
|System Management|`rocm-smi-lib`, RDC, `rocminfo`|`amdsmi`, `hipInfo`|
|AI Frameworks|PyTorch, TensorFlow, etc.|Not Available|
|CMake HIP Language|Enabled|Unsupported|
|Visual Studio| Not applicable| Plugin Available|
|HIP Ray Tracing| Supported|Supported|
AMD is continuing to invest in Windows support and plans to release enhanced
features in subsequent revisions.
```{note}
The 5.5 Windows Installer collectively groups the Math and Primitives
libraries.
```
```{note}
GPU support on Windows and Linux may differ. You must refer to
Windows and Linux GPU support tables separately.
```
```{note}
HIP Ray Tracing is not distributed via ROCm in Linux.
```
## ROCm release versioning
Linux OS releases set the canonical version numbers for ROCm. Windows will
follow Linux version numbers as Windows releases are based on Linux ROCm
releases. However, not all Linux ROCm releases will have a corresponding Windows
release. The following table shows the ROCm releases on Windows and Linux. Releases
with both Windows and Linux are referred to as a joint release. Releases with
only Linux support are referred to as a skipped release from the Windows
perspective.
|Release version|Linux|Windows|
|---------------|-----|-------|
|5.5|✅|✅|
|5.6|✅|❌|
ROCm Linux releases are versioned following the Major.Minor.Patch
version number system. Windows releases will only be versioned with Major.Minor.
In general, Windows releases will trail Linux releases. Software developers that
wish to support both Linux and Windows using a single ROCm version should
refrain from upgrading ROCm unless there is a joint release.
## Windows documentation implications
The ROCm documentation website contains both Windows and Linux documentation.
Just below each article title, a convenient article information section states
whether the page applies to Linux only, Windows only or both OSes. To find the
exact Windows documentation for a release of the HIP SDK, please view the ROCm documentation with the same
Major.Minor version number while ignoring the Patch version. The Patch version
only matters for Linux releases. For convenience,
Windows documentation will continue to be included in the overall ROCm
documentation for the skipped Windows releases.
Windows release notes will contain only information pertinent to Windows.
Software developers must read all previous ROCm release notes (including
skipped ROCm versions on Windows) for information on all the changes present in
the Windows release.
## Windows builds from source
Not all source code required to build Windows from source is available under a
permissive open source license. Build instructions for Windows are only provided
for projects that can be built from source on Windows using a toolchain that
has closed-source build prerequisites. The ROCm manifest file is not valid for
Windows. AMD does not release a manifest or tag components on Windows.
Users may use the corresponding Linux tags to build on Windows.


@@ -1,36 +1,61 @@
===========================
How ROCm uses PCIe atomics
===========================
.. meta::
:description: How ROCm uses PCIe atomics
:keywords: PCIe, PCIe atomics, atomics, BAR memory, AMD, ROCm
*****************************************************************************
How ROCm uses PCIe atomics
*****************************************************************************
ROCm PCIe feature and overview of BAR memory
======================================================================
================================================================
ROCm is an extension of HSA platform architecture, so it shares the queuing model, memory model,
signaling and synchronization protocols. Platform atomics are integral to perform queuing and
signaling memory operations where there may be multiple-writers across CPU and GPU agents.
ROCm is an extension of HSA platform architecture, so it shares the queueing model, memory model, signaling and synchronization protocols. Platform atomics are integral to perform queuing and signaling memory operations where there may be multiple-writers across CPU and GPU agents.
The full list of HSA system architecture platform requirements are here:
`HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
The full list of HSA system architecture platform requirements are here: `HSA Sys Arch Features <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-SysArch-1.2.pdf>`_.
AMD ROCm Software uses the new PCI Express 3.0 (Peripheral Component Interconnect Express [PCIe]
3.0) features for atomic read-modify-write transactions which extends inter-processor synchronization
mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling
memory operations.
The ROCm Platform uses the new PCI Express 3.0 (PCIe 3.0) features for Atomic Read-Modify-Write Transactions which extends inter-processor synchronization mechanisms to IO to support the defined set of HSA capabilities needed for queuing and signaling memory operations.
The new PCIe AtomicOps operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``, ``SWAP`` atomics. The AtomicsOps are initiated by the
I/O device which support 32-bit, 64-bit and 128-bit operand which target address have to be naturally aligned to operation sizes.
The new PCIe atomic operations operate as completers for ``CAS`` (Compare and Swap), ``FetchADD``, and
``SWAP`` atomics. The atomic operations are initiated by the I/O device, which supports 32-bit, 64-bit, and
128-bit operands whose target addresses have to be naturally aligned to the operation sizes.
Platform atomics are used in ROCm in the following ways:
* Update HSA queues read_dispatch_id: 64 bit atomic add used by the command processor on the GPU agent to update the packet ID it processed.
* Update HSA queues write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to support multi-writer queue insertions.
* Update HSA Signals 64bit atomic ops are used for CPU & GPU synchronization.
* Update HSA queue's read_dispatch_id: 64 bit atomic add used by the command processor on the
GPU agent to update the packet ID it processed.
* Update HSA queue's write_dispatch_id: 64 bit atomic add used by the CPU and GPU agent to
support multi-writer queue insertions.
* Update HSA Signals -- 64bit atomic ops are used for CPU & GPU synchronization.
The PCIe 3.0 AtomicOp feature allows atomic transactions to be requested by, routed through and completed by PCIe components. Routing and completion does not require software support. Component support for each is detectable via the DEVCAP2 register. Upstream bridges need to have AtomicOp routing enabled or the Atomic Operations will fail even though PCIe endpoint and PCIe I/O devices has the capability to Atomics Operations.
The PCIe 3.0 atomic operations feature allows atomic transactions to be requested by, routed through,
and completed by PCIe components. Routing and completion do not require software support.
Component support for each is detectable via the Device Capabilities 2 (DevCap2) register. Upstream
bridges need to have atomic operation routing enabled, or the atomic operations will fail even though the
PCIe endpoint and PCIe I/O devices have the capability for atomic operations.
To do AtomicOp routing capability between two or more Root Ports, each associated Root Port must indicate that capability via the AtomicOp routing supported bit in the Device Capabilities 2 register.
To enable atomic operation routing between two or more Root Ports, each associated Root Port
must indicate that capability via the atomic operation routing supported bit in the DevCap2 register.
If your system has a PCIe Express Switch it needs to support AtomicsOp routing. AtomicOp requests are permitted only if a components ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE`` field is set. These requests can only be serviced if the upstream components support AtomicOp completion and/or routing to a component which does. AtomicOp Routing Support=1 Routing is supported, AtomicOp Routing Support=0 routing is not supported.
If your system has a PCI Express switch, it needs to support atomic operation routing. Atomic
operation requests are permitted only if a component's ``DEVCTL2.ATOMICOP_REQUESTER_ENABLE``
field is set. These requests can only be serviced if the upstream components support atomic operation
completion and/or routing to a component that does. Atomic operation routing support=1 means routing
is supported; atomic operation routing support=0 means routing is not supported.
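As a practical aside, recent versions of ``lspci`` can decode these capability and control bits; the bus ID below is a placeholder::

   # Replace 03:00.0 with the GPU's bus ID reported by lspci.
   sudo lspci -s 03:00.0 -vvv | grep -i "atomicops"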
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats, there must be a response for Completion containing the result of the operation. Errors associated with the operation (uncorrectable error accessing the target location or carrying out the Atomic operation) are signaled to the requester by setting the Completion Status field in the completion descriptor, they are set to to Completer Abort (CA) or Unsupported Request (UR).
An atomic operation is a non-posted transaction supporting 32-bit and 64-bit address formats; there
must be a response Completion containing the result of the operation. Errors associated with the
operation (an uncorrectable error accessing the target location or carrying out the atomic operation) are
signaled to the requester by setting the Completion Status field in the completion descriptor
to Completer Abort (CA) or Unsupported Request (UR).
To understand more about how PCIe atomic operations work, see `PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
To understand more about how PCIe atomic operations work, see
`PCIe atomics <https://pcisig.com/specifications/pciexpress/specifications/ECN_Atomic_Ops_080417.pdf>`_
`Linux Kernel Patch to pci_enable_atomic_request <https://patchwork.kernel.org/project/linux-pci/patch/1443110390-4080-1-git-send-email-jay@jcornwall.me/>`_
@@ -39,56 +64,60 @@ There are also a number of papers which talk about these new capabilities:
* `Atomic Read Modify Write Primitives by Intel <https://www.intel.es/content/dam/doc/white-paper/atomic-read-modify-write-primitives-i-o-devices-paper.pdf>`_
* `PCI express 3 Accelerator White paper by Intel <https://www.intel.sg/content/dam/doc/white-paper/pci-express3-accelerator-white-paper.pdf>`_
* `Intel PCIe Generation 3 Hotchips Paper <https://www.hotchips.org/wp-content/uploads/hc_archives/hc21/1_sun/HC21.23.1.SystemInterconnectTutorial-Epub/HC21.23.131.Ajanovic-Intel-PCIeGen3.pdf>`_
* `PCIe Generation 4 Base Specification includes Atomics Operation <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
* `PCIe Generation 4 Base Specification includes atomic operations <https://astralvx.com/storage/2020/11/PCI_Express_Base_4.0_Rev0.3_February19-2014.pdf>`_
Other I/O devices with PCIe atomics support
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
* `Mellanox ConnectX-5 InfiniBand Card <http://www.mellanox.com/related-docs/prod_adapter_cards/PB_ConnectX-5_VPI_Card.pdf>`_
* `Cray Aries Interconnect <http://www.hoti.org/hoti20/slides/Bob_Alverson.pdf>`_
* `Xilinx PCIe Ultrascale White paper <https://docs.xilinx.com/v/u/8OZSA2V1b1LLU2rRCDVGQw>`_
* `Xilinx 7 Series Devices <https://docs.xilinx.com/v/u/1nfXeFNnGpA0ywyykvWHWQ>`_
Future bus technology with richer I/O atomic operation support:
* GenZ
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs with PCIe Generation 3.0 support.
New PCIe Endpoints with support beyond AMD Ryzen and EPYC CPU; Intel Haswell or newer CPUs
with PCIe Generation 3.0 support.
* `Mellanox Bluefield SOC <https://docs.nvidia.com/networking/display/BlueFieldSWv25111213/BlueField+Software+Overview>`_
* `Cavium Thunder X2 <https://en.wikichip.org/wiki/cavium/thunderx2>`_
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU originates two writes to two different targets:
In ROCm, we also take advantage of PCIe ID based ordering technology for P2P when the GPU
originates two writes to two different targets:
| 1. write to another GPU memory,
* Write to another GPU memory
* Write to system memory to indicate transfer complete
| 2. then write to system memory to indicate transfer complete.
They are routed off to different ends of the computer but we want to make sure the write to system memory to indicate transfer complete occurs AFTER P2P write to GPU has complete.
They are routed off to different ends of the computer, but we want to make sure the write to system
memory indicating transfer completion occurs AFTER the P2P write to GPU memory has completed.
BAR memory overview
***************************************************************************************************
On a Xeon E5 based system in the BIOS we can turn on above 4GB PCIe addressing, if so he need to set MMIO Base address ( MMIOH Base) and Range ( MMIO High Size) in the BIOS.
----------------------------------------------------------------------------------------------------
On a Xeon E5-based system, above-4GB PCIe addressing can be turned on in the BIOS; if so, the
memory-mapped input/output (MMIO) base address (MMIOH base) and range (MMIO high size) need to be set in the BIOS.
In SuperMicro system in the system bios you need to see the following
In a Supermicro system, you need to set the following in the system BIOS:
* Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
* Advanced->PCIe/PCI/PnP configuration-\> Above 4G Decoding = Enabled
* Advanced->PCIe/PCI/PnP Configuration-\>MMIOH Base = 512G
* Advanced->PCIe/PCI/PnP Configuration-\>MMIO High Size = 256G
* Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
* Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G
When we support Large Bar Capability there is a Large Bar Vbios which also disable the IO bar.
When we support Large BAR capability, there is a Large BAR VBIOS, which also disables the IO BAR.
GFX9 and Vega10 have physical addresses of up to 44 bits and virtual addresses of 48 bits.
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must be placed < 2^44 to support P2P access from other Vega10.
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed < 2^44 to support P2P access from other Vega10.
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed < 4GB.
* BAR0-1 registers: 64bit, prefetchable, GPU memory. 8GB or 16GB depending on Vega10 SKU. Must
be placed < 2^44 to support P2P access from other Vega10.
* BAR2-3 registers: 64bit, prefetchable, Doorbell. Must be placed \< 2^44 to support P2P access from
other Vega10.
* BAR4 register: Optional, not a boot device.
* BAR5 register: 32bit, non-prefetchable, MMIO. Must be placed \< 4GB.
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit Physical Address Limit ::
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c1)
11:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO
Series] (rev c1)
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0b35
@@ -106,40 +135,23 @@ Here is how our base address register (BAR) works on GFX 8 GPUs with 40 bit P
Legend:
1 : GPU Frame Buffer BAR In this example it happens to be 256M, but typically this will be size of the GPU memory (typically 4GB+). This BAR has to be placed < 2^40 to allow peer-to-peer access from other GFX8 AMD GPUs. For GFX9 (Vega GPU) the BAR has to be placed < 2^44 to allow peer-to-peer access from other GFX9 AMD GPUs.
1 : GPU Frame Buffer BAR -- In this example it happens to be 256M, but typically this will be the size of the
GPU memory (typically 4GB+). This BAR has to be placed \< 2^40 to allow peer-to-peer access from
other GFX8 AMD GPUs. For GFX9 (Vega GPU), the BAR has to be placed \< 2^44 to allow peer-to-peer
access from other GFX9 AMD GPUs.
2 : Doorbell BAR The size of the BAR is typically will be < 10MB (currently fixed at 2MB) for this generation GPUs. This BAR has to be placed < 2^40 to allow peer-to-peer access from other current generation AMD GPUs.
2 : Doorbell BAR -- The size of the BAR will typically be \< 10MB (currently fixed at 2MB) for this
generation of GPUs. This BAR has to be placed \< 2^40 to allow peer-to-peer access from other current-
generation AMD GPUs.
3 : IO BAR - This is for legacy VGA and boot device support, but since this the GPUs in this project are not VGA devices (headless), this is not a concern even if the SBIOS does not setup.
3 : IO BAR -- This is for legacy VGA and boot device support; since the GPUs in this project are
not VGA devices (headless), this is not a concern even if the SBIOS does not set it up.
4 : MMIO BAR This is required for the AMD Driver SW to access the configuration registers. Since the reminder of the BAR available is only 1 DWORD (32bit), this is placed < 4GB. This is fixed at 256KB.
4 : MMIO BAR -- This is required for the AMD driver software to access the configuration registers. Since the
remainder of the BAR available is only 1 DWORD (32bit), this is placed \< 4GB. This is fixed at 256KB.
5 : Expansion ROM This is required for the AMD Driver SW to access the GPUs video-bios. This is currently fixed at 128KB.
5 : Expansion ROM -- This is required for the AMD driver software to access the GPU video-bios. This is
currently fixed at 128KB.
Excerpts from 'Overview of Changes to PCI Express 3.0'
================================================================
By Mike Jackson, Senior Staff Architect, MindShare, Inc.
***************************************************************************************************
Atomic operations goal:
***************************************************************************************************
Support SMP-type operations across a PCIe network to allow for things like offloading tasks between CPU cores and accelerators like a GPU. The spec says this enables advanced synchronization mechanisms that are particularly useful with multiple producers or consumers that need to be synchronized in a non-blocking fashion. Three new atomic non-posted requests were added, plus the corresponding completion (the address must be naturally aligned with the operand size or the TLP is malformed):
* Fetch and Add uses one operand as the “add” value. Reads the target location, adds the operand, and then writes the result back to the original location.
* Unconditional Swap uses one operand as the “swap” value. Reads the target location and then writes the swap value to it.
* Compare and Swap uses 2 operands: first data is compare value, second is swap value. Reads the target location, checks it against the compare value and, if equal, writes the swap value to the target location.
* AtomicOpCompletion -- a new completion to return the result of an atomic request and indicate that the atomicity of the transaction has been maintained.
Since atomic operations are not locked they don't have the performance downsides of the PCI locked protocol. Compared to locked cycles, they provide “lower latency, higher scalability, advanced synchronization algorithms, and dramatically lower impact on other PCIe traffic.” The lock mechanism can still be used across a bridge to PCI or PCI-X to achieve the desired operation.
Atomic operations can go from device to device, device to host, or host to device. Each completer indicates whether it supports this capability and guarantees atomic access if it does. The ability to route atomic operations is also indicated in the registers for a given port.
ID-based ordering goal:
***************************************************************************************************
Improve performance by avoiding stalls caused by ordering rules. For example, posted writes are never normally allowed to pass each other in a queue, but if they are requested by different functions, we can have some confidence that the requests are not dependent on each other. The previously reserved Attribute bit [2] is now combined with the RO bit to indicate ID ordering with or without relaxed ordering.
This only has meaning for memory requests, and is reserved for Configuration or IO requests. Completers are not required to copy this bit into a completion, and only use the bit if their enable bit is set for this operation.
To read more on PCIe Gen 3 new options https://www.mindshare.com/files/resources/PCIe%203-0.pdf
For more information, you can review
`Overview of Changes to PCI Express 3.0 <https://www.mindshare.com/files/resources/PCIe%203-0.pdf>`_.


@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Inference optimization with MIGraphX">
<meta name="keywords" content="Inference optimization, MIGraphX, deep-learning, MIGraphX
installation, AMD, ROCm">
</head>
# Inference optimization with MIGraphX
The following sections cover inferencing and introduce [MIGraphX](https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/).
@@ -209,23 +216,23 @@ Follow these steps:
./inception_inference
```
```{note}
:::{note}
Set `LD_LIBRARY_PATH` to `/opt/rocm/lib` if required during the build. Additional examples can be found in the MIGraphX repository under the `/examples/` directory.
```
:::
## Tuning MIGraphX
MIGraphX uses MIOpen kernels to target AMD GPUs. For a model compiled with MIGraphX, tune MIOpen to pick the best possible kernel implementation. MIOpen tuning results in a significant performance boost. Tuning can be done by setting the environment variable `MIOPEN_FIND_ENFORCE=3`.
```{note}
:::{note}
The tuning process can take a long time to finish.
```
:::
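For instance, tuning might be performed once and then reused, as in this sketch (assuming the `inception_inference` binary built above):

```bash
MIOPEN_FIND_ENFORCE=3 ./inception_inference   # first run searches for and records the fastest kernels
./inception_inference                         # subsequent runs reuse the tuned kernels
```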
**Example:** The average inference time of the inception model example shown previously over 100 iterations using untuned kernels is 0.01383ms. After tuning, it reduces to 0.00459ms, which is a 3x improvement. This result is from ROCm v4.5 on a MI100 GPU.
```{note}
:::{note}
The results may vary depending on the system configurations.
```
:::
For reference, the following code snippet shows inference runs for only the first 10 iterations for both tuned and untuned kernels:

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Inception V3 with PyTorch">
<meta name="keywords" content="PyTorch, Inception V3, deep-learning, training data, optimization
algorithm, AMD, ROCm">
</head>
# Deep learning: Inception V3 with PyTorch
## Deep learning training
@@ -36,7 +43,7 @@ Training is different from inference, particularly from the hardware perspective
| Data for training is available on the disk before the training process and is generally significant. The training performance is measured by how fast the data batches can be processed. | Inference data usually arrives stochastically and may be batched to improve performance. Inference performance is generally measured in throughput speed to process the batch of data and the delay in responding to the input (latency). |
:::
Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other datatypes, leading to improvement in performance if a faster datatype can be selected for the corresponding task.
Different quantization data types are typically chosen between training (FP32, BF16) and inference (FP16, INT8). The computation hardware has different specializations from other data types, leading to improvement in performance if a faster data type can be selected for the corresponding task.
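For example, in PyTorch a model trained in FP32 can be cast to FP16 for inference. The following is a minimal sketch (it assumes a GPU is available; whether FP16 is actually faster depends on the hardware):

```py
import torch
import torchvision

device = "cuda"  # in PyTorch, "cuda" is the generic keyword for a GPU
model = torchvision.models.inception_v3(pretrained=True)
model = model.to(device).eval().half()  # cast FP32 weights to FP16

# Inputs must match the parameter data type.
x = torch.randn(1, 3, 299, 299, device=device).half()
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16
```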
## Case studies
@@ -56,7 +63,7 @@ This example is adapted from the PyTorch research hub page on [Inception V3](htt
Follow these steps:
1. Run the PyTorch ROCm-based Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
1. Run the PyTorch ROCm-based Docker image or refer to the section {doc}`Installing PyTorch <rocm-install-on-linux:how-to/3rd-party/pytorch-install>` for setting up a PyTorch environment on ROCm.
```bash
docker run -it -v $HOME:/data --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
@@ -146,7 +153,7 @@ The previous section focused on downloading and using the Inception V3 model for
Follow these steps:
1. Run the PyTorch ROCm Docker image or refer to the section [Installing PyTorch](../install/pytorch-install.md) for setting up a PyTorch environment on ROCm.
1. Run the PyTorch ROCm Docker image or refer to the section {doc}`Installing PyTorch <rocm-install-on-linux:how-to/3rd-party/pytorch-install>` for setting up a PyTorch environment on ROCm.
```bash
docker pull rocm/pytorch:latest
@@ -208,9 +215,9 @@ Follow these steps:
7. Set parameters to guide the training process.
```{note}
:::{note}
The device is set to `"cuda"`. In PyTorch, `"cuda"` is a generic keyword to denote a GPU.
```
:::
```py
device = "cuda"
@@ -270,9 +277,9 @@ Follow these steps:
lr_gamma = 0.1
```
```{note}
:::{note}
One training epoch is when the neural network passes an entire dataset forward and backward.
```
:::
```py
epochs = 90
@@ -333,9 +340,9 @@ Follow these steps:
)
```
```{note}
:::{note}
Use torchvision to obtain the Inception V3 model. Use the pre-trained model weights to speed up training.
```
:::
```py
print("Creating model")
@@ -1155,9 +1162,10 @@ To prepare the data for training, follow these steps:
print("Accuracy: ", accuracy)
```
```{note}
model.fit() returns a History object that contains a dictionary with everything that happened during training.
```
:::{note}
`model.fit()` returns a History object that contains a dictionary with everything that happened during
training.
:::
```py
history_dict = history.history

View File

@@ -1,34 +1,40 @@
***********
.. meta::
:description: Using CMake
:keywords: CMake, dependencies, HIP, C++, AMD, ROCm
*********************************
Using CMake
***********
*********************************
Most components in ROCm support CMake. Projects depending on header-only or
library components typically require CMake 3.5 or higher, whereas those wanting
to make use of CMake's HIP language support will require CMake 3.21 or higher.
to make use of the CMake HIP language support will require CMake 3.21 or higher.
Finding dependencies
====================
.. note::
For a complete
reference on how to deal with dependencies in CMake, refer to the CMake docs
on `find_package
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
`Using Dependencies Guide
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
to get an overview of CMake's related facilities.
For a complete
reference on how to deal with dependencies in CMake, refer to the CMake docs
on `find_package
<https://cmake.org/cmake/help/latest/command/find_package.html>`_ and the
`Using Dependencies Guide
<https://cmake.org/cmake/help/latest/guide/using-dependencies/index.html>`_
to get an overview of CMake related facilities.
In short, CMake supports finding dependencies in two ways:
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to
find the component in typical install locations and layouts. CMake ships a
few dozen such scripts, but users and projects may ship them as well.
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
``<PackageName>Config.cmake`` which describes the installed component in all
regards needed to consume it.
* In Module mode, it consults a file ``Find<PackageName>.cmake`` which tries to find the component
in typical install locations and layouts. CMake ships a few dozen such scripts, but users and projects
may ship them as well.
* In Config mode, it locates a file named ``<packagename>-config.cmake`` or
``<PackageName>Config.cmake`` which describes the installed component in all regards needed to
consume it.
ROCm predominantly relies on Config mode, one notable exception being the Module
driving the compilation of HIP programs on Nvidia runtimes. As such, when
driving the compilation of HIP programs on NVIDIA runtimes. As such, when
dependencies are not found in standard system locations, one either has to
instruct CMake to search for package config files in additional folders using
the ``CMAKE_PREFIX_PATH`` variable (a semi-colon separated list of file system
@@ -40,9 +46,9 @@ it to your CMake configuration command on the command line via
``-D CMAKE_PREFIX_PATH=....`` . AMD packaged ROCm installs can typically be
added to the config file search paths such as:
- Windows: ``-D CMAKE_PREFIX_PATH=${env:HIP_PATH}``
* Windows: ``-D CMAKE_PREFIX_PATH=${env:HIP_PATH}``
- Linux: ``-D CMAKE_PREFIX_PATH=/opt/rocm``
* Linux: ``-D CMAKE_PREFIX_PATH=/opt/rocm``
ROCm provides the respective *config-file* packages, and this enables
``find_package`` to be used directly. ROCm does not require any Find module as
@@ -50,14 +56,16 @@ the *config-file* packages are shipped with the upstream projects, such as
rocPRIM and other ROCm libraries.
For a complete guide on where and how ROCm may be installed on a system, refer
to the installation guides for `Linux <../install/linux/install.html>`_ and
`Windows <../install/windows/install.html>`_.
to the installation guides for
`Linux <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html>`_
and
`Windows <https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html>`_.
Using HIP in CMake
==================
ROCm components providing a C/C++ interface support consumption via any
C/C++ toolchain that CMake knows how to drive. ROCm also supports CMake's HIP
C/C++ toolchain that CMake knows how to drive. ROCm also supports the CMake HIP
language features, allowing users to program using the HIP single-source
programming model. When a program (or translation-unit) uses the HIP API without
compiling any GPU device code, HIP can be treated in CMake as a simple C/C++
@@ -70,22 +78,22 @@ Source code written in the HIP dialect of C++ typically uses the `.hip`
extension. When the HIP CMake language is enabled, it will automatically
associate such source files with the HIP toolchain being used.
::
.. code-block:: cmake
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)
cmake_minimum_required(VERSION 3.21) # HIP language support requires 3.21
cmake_policy(VERSION 3.21.3...3.27)
project(MyProj LANGUAGES HIP)
add_executable(MyApp Main.hip)
Should you have existing CUDA code that is from the source-compatible subset of
HIP, you can tell CMake that despite their `.cu` extension, they're HIP sources.
Do note that this mostly facilitates compiling kernel code-only source files,
as host-side CUDA API won't compile in this fashion.
::
.. code-block:: cmake
add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
add_library(MyLib MyLib.cu)
set_source_files_properties(MyLib.cu PROPERTIES LANGUAGE HIP)
CMake itself only hosts part of the HIP language support, such as defining
HIP-specific properties, while the other half ships with the HIP
@@ -97,6 +105,10 @@ there's a catch-all, last resort variable consulted locating this file,
``-D CMAKE_HIP_COMPILER_ROCM_ROOT:PATH=`` which should be set to the root of the
ROCm installation.
.. note::
Imported targets defined by `hip-lang-config.cmake` are for internal use
only.
If the user doesn't provide a semi-colon delimited list of device architectures
via ``CMAKE_HIP_ARCHITECTURES``, CMake will select a sensible default. It is
advised, though, that if a user knows what devices they wish to target, then set
@@ -110,45 +122,57 @@ Illustrated in the example below is a C++ application using MIOpen from CMake.
It calls ``find_package(miopen)``, which provides the ``MIOpen`` imported
target. This can be linked with ``target_link_libraries``
::
.. code-block:: cmake
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)
cmake_minimum_required(VERSION 3.5) # find_package(miopen) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(miopen)
add_library(MyLib ...)
target_link_libraries(MyLib PUBLIC MIOpen)
.. note::
Most libraries are designed as host-only API, so using a GPU device
compiler is not necessary for downstream projects unless they use GPU device
code.
Most libraries are designed as host-only API, so using a GPU device
compiler is not necessary for downstream projects unless they use GPU device
code.
Consuming the HIP API in C++ code
---------------------------------
Use the HIP API without compiling the GPU device code. As there is no GPU code,
any C or C++ compiler can be used. The ``find_package(hip)`` provides the
``hip::host`` imported target to use HIP in this context.
Consuming the HIP API without compiling single-source GPU device code can be
done using any C++ compiler. The ``find_package(hip)`` provides the
``hip::host`` imported target to use HIP in this scenario.
::
.. code-block:: cmake
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)
cmake_minimum_required(VERSION 3.5) # find_package(hip) requires 3.5
cmake_policy(VERSION 3.5...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_executable(MyApp ...)
target_link_libraries(MyApp PRIVATE hip::host)
When mixing such ``CXX`` sources with ``HIP`` sources holding device-code, link
only to `hip::host`. If HIP sources don't have `.hip` as their extension, use
`set_source_files_properties(<hip_sources>... PROPERTIES LANGUAGE HIP)` on them.
Linking to `hip::host` will set all the necessary flags for the ``CXX`` sources
while ``HIP`` sources inherit all flags from the built-in language support.
Having HIP sources in a target will turn the |LINK_LANG|_ into ``HIP``.
.. |LINK_LANG| replace:: ``LINKER_LANGUAGE``
.. _LINK_LANG: https://cmake.org/cmake/help/latest/prop_tgt/LINKER_LANGUAGE.html
Compiling device code in C++ language mode
------------------------------------------
.. attention::
The workflow detailed here is considered legacy and is shown for
understanding's sake. It pre-dates the existence of HIP language support in
CMake. If source code has HIP device code in it, it is a HIP source file
and should be compiled as such. Only resort to the method below if your
HIP-enabled CMake codepath can't mandate CMake version 3.21.
The workflow detailed here is considered legacy and is shown for
understanding's sake. It pre-dates the existence of HIP language support in
CMake. If source code has HIP device code in it, it is a HIP source file
and should be compiled as such. Only resort to the method below if your
HIP-enabled CMake code path can't mandate CMake version 3.21.
If code uses the HIP API and compiles GPU device code, it requires using a
device compiler. The compiler for CMake can be set using either the
@@ -160,20 +184,21 @@ compiler that supports AMD GPU targets, which is usually Clang.
The ``find_package(hip)`` provides the ``hip::device`` imported target to add
all the flags necessary for device compilation.
::
.. code-block:: cmake
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)
cmake_minimum_required(VERSION 3.8) # cxx_std_11 requires 3.8
cmake_policy(VERSION 3.8...3.27)
project(MyProj LANGUAGES CXX)
find_package(hip REQUIRED)
add_library(MyLib ...)
target_link_libraries(MyLib PRIVATE hip::device)
target_compile_features(MyLib PRIVATE cxx_std_11)
.. note::
Compiling for the GPU device requires at least C++11.
This project can then be configured with for eg.
Compiling for the GPU device requires at least C++11.
This project can then be configured with the following CMake commands.
- Windows: ``cmake -D CMAKE_CXX_COMPILER:PATH=${env:HIP_PATH}\bin\clang++.exe``
@@ -183,11 +208,11 @@ Which use the device compiler provided from the binary packages of
`ROCm HIP SDK <https://www.amd.com/en/developer/rocm-hub.html>`_ and
`repo.radeon.com <https://repo.radeon.com>`_ respectively.
When using the CXX language support to compile HIP device code, selecting the
When using the ``CXX`` language support to compile HIP device code, selecting the
target GPU architectures is done via setting the ``GPU_TARGETS`` variable.
``CMAKE_HIP_ARCHITECTURES`` only exists when the HIP language is enabled. By
default, this is set to some subset of the currently supported architectures of
AMD ROCm. It can be set to eg. ``-D GPU_TARGETS="gfx1032;gfx1035"``.
AMD ROCm. It can be set to the CMake option ``-D GPU_TARGETS="gfx1032;gfx1035"``.
ROCm CMake packages
-------------------
@@ -252,13 +277,12 @@ options.
IDEs supporting CMake (Visual Studio, Visual Studio Code, CLion, etc.) all came
up with their own way to register command-line fragments of different purpose in
a setup'n'forget fashion for quick assembly using graphical front-ends. This is
a setup-and-forget fashion for quick assembly using graphical front-ends. This is
all nice, but configurations aren't portable, nor can they be reused in
Continuous Intergration (CI) pipelines. CMake has condensed existing practice
Continuous Integration (CI) pipelines. CMake has condensed existing practice
into a portable JSON format that works in all IDEs and can be invoked from any
command line. This is
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_
.
`CMake Presets <https://cmake.org/cmake/help/latest/manual/cmake-presets.7.html>`_.
There are two types of preset files: one supplied by the project, called
``CMakePresets.json`` which is meant to be committed to version control,
@@ -275,109 +299,110 @@ Following is an example ``CMakeUserPresets.json`` file which actually compiles
the `amd/rocm-examples <https://github.com/amd/rocm-examples>`_ suite of sample
applications on a typical ROCm installation:
::
.. code-block:: json
{
"version": 3,
"cmakeMinimumRequired": {
"major": 3,
"minor": 21,
"patch": 0
{
"version": 3,
"cmakeMinimumRequired": {
"major": 3,
"minor": 21,
"patch": 0
},
"configurePresets": [
{
"name": "layout",
"hidden": true,
"binaryDir": "${sourceDir}/build/${presetName}",
"installDir": "${sourceDir}/install/${presetName}"
},
"configurePresets": [
{
"name": "layout",
"hidden": true,
"binaryDir": "${sourceDir}/build/${presetName}",
"installDir": "${sourceDir}/install/${presetName}"
},
{
"name": "generator-ninja-multi-config",
"hidden": true,
"generator": "Ninja Multi-Config"
},
{
"name": "toolchain-makefiles-c/c++-amdclang",
"hidden": true,
"cacheVariables": {
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
}
},
{
"name": "clang-strict-iso-high-warn",
"hidden": true,
"cacheVariables": {
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
}
},
{
"name": "ninja-mc-rocm",
"displayName": "Ninja Multi-Config ROCm",
"inherits": [
"layout",
"generator-ninja-multi-config",
"toolchain-makefiles-c/c++-amdclang",
"clang-strict-iso-high-warn"
]
{
"name": "generator-ninja-multi-config",
"hidden": true,
"generator": "Ninja Multi-Config"
},
{
"name": "toolchain-makefiles-c/c++-amdclang",
"hidden": true,
"cacheVariables": {
"CMAKE_C_COMPILER": "/opt/rocm/bin/amdclang",
"CMAKE_CXX_COMPILER": "/opt/rocm/bin/amdclang++",
"CMAKE_HIP_COMPILER": "/opt/rocm/bin/amdclang++"
}
],
"buildPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-debug-verbose",
"displayName": "Debug (verbose)",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "ninja-mc-rocm-release-verbose",
"displayName": "Release (verbose)",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "clang-strict-iso-high-warn",
"hidden": true,
"cacheVariables": {
"CMAKE_C_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_CXX_FLAGS": "-Wall -Wextra -pedantic",
"CMAKE_HIP_FLAGS": "-Wall -Wextra -pedantic"
}
],
"testPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
},
{
"name": "ninja-mc-rocm",
"displayName": "Ninja Multi-Config ROCm",
"inherits": [
"layout",
"generator-ninja-multi-config",
"toolchain-makefiles-c/c++-amdclang",
"clang-strict-iso-high-warn"
]
}
],
"buildPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm"
},
{
"name": "ninja-mc-rocm-debug-verbose",
"displayName": "Debug (verbose)",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"verbose": true
},
{
"name": "ninja-mc-rocm-release-verbose",
"displayName": "Release (verbose)",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"verbose": true
}
],
"testPresets": [
{
"name": "ninja-mc-rocm-debug",
"displayName": "Debug",
"configuration": "Debug",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
]
}
},
{
"name": "ninja-mc-rocm-release",
"displayName": "Release",
"configuration": "Release",
"configurePreset": "ninja-mc-rocm",
"execution": {
"jobs": 0
}
}
]
}
.. note::
Getting presets to work reliably on Windows requires some CMake improvements
and/or support from compiler vendors. (Refer to
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
.)
Getting presets to work reliably on Windows requires some CMake improvements
and/or support from compiler vendors. (Refer to
`Add support to the Visual Studio generators <https://gitlab.kitware.com/cmake/cmake/-/issues/24245>`_
and `Sourcing environment scripts <https://gitlab.kitware.com/cmake/cmake/-/issues/21619>`_
.)

View File

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm compilers disambiguation">
<meta name="keywords" content="compilers, compiler naming, AMD, ROCm">
</head>
# ROCm compilers disambiguation
ROCm ships multiple compilers of varying origins and purposes. This article

View File

@@ -1,8 +1,15 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm Linux Filesystem Hierarchy Standard reorganization">
<meta name="keywords" content="FHS, Linux Filesystem Hierarchy Standard, directory structure,
AMD, ROCm">
</head>
# ROCm Linux Filesystem Hierarchy Standard reorganization
## Introduction
The ROCm platform has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.
The ROCm Software has adopted the Linux Filesystem Hierarchy Standard (FHS) [https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html](https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html) in order to ensure ROCm is consistent with standard open source conventions. The following sections specify how current and future releases of ROCm adhere to FHS, how the previous ROCm file system is supported, and how improved versioning specifications are applied to ROCm.
## Adopting the FHS
@@ -152,7 +159,7 @@ correct header file and use correct search paths.
## Changes in versioning specifications
In order to better manage ROCm dependency specifications and allow smoother releases of ROCm while avoiding dependency conflicts, the ROCm platform shall adhere to the following scheme when numbering and incrementing ROCm file versions:
In order to better manage ROCm dependency specifications and allow smoother releases of ROCm while avoiding dependency conflicts, ROCm software shall adhere to the following scheme when numbering and incrementing ROCm file versions:
rocm-\<ver\>, where \<ver\> = \<x.y.z\>

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="GPU architecture">
<meta name="keywords" content="GPU architecture, architecture support, MI200, MI250, RDNA,
MI100, AMD Instinct">
</head>
# GPU architecture documentation
:::::{grid} 1 1 2 2

View File

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="AMD Instinct MI100 microarchitecture">
<meta name="keywords" content="Instinct, MI100, microarchitecture, AMD, ROCm">
</head>
# AMD Instinct™ MI100 microarchitecture
The following image shows the node-level architecture of a system that

View File

@@ -1,455 +1,578 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="MI200 performance counters and metrics">
<meta name="keywords" content="MI200, performance counters, counters, GRBM counters, GRBM,
CPF counters, CPF, CPC counters, CPC, command processor counters, SPI counters, SPI, AMD, ROCm">
</head>
# MI200 performance counters and metrics
<!-- markdownlint-disable no-duplicate-header -->
This document lists and describes the hardware performance counters and the derived metrics available on the AMD Instinct™ MI200 GPU. All hardware performance monitors and the derived performance metrics are accessible via the AMD ROCm™ Profiler tool.
This document lists and describes the hardware performance counters and derived metrics available on the AMD Instinct™ MI200 GPU. All the basic hardware counters and derived metrics are accessible via the {doc}`ROCProfiler tool <rocprofiler:rocprofv1>`.
## MI200 performance counters list
```{note}
Preliminary validation of all MI200 performance counters is in progress. Those with “[*]” appended to the names require further evaluation.
```
See the category-wise listing of MI200 performance counters in the following tables.
### GRBM
:::{note}
Preliminary validation of all MI200 performance counters is in progress. Those with “*” appended to the names require further evaluation.
:::
#### GRBM counters
### Graphics Register Bus Management (GRBM) counters
| Hardware Counter | Unit | Definition |
|--------------------|--------| ------------------------------------------------------|
| `grbm_count` | Cycles | Free-running GPU clock |
| `grbm_gui_active` | Cycles | GPU active cycles |
| `grbm_cp_busy` | Cycles | Any of the command processor (CPC/CPF) blocks are busy. |
| `grbm_spi_busy` | Cycles | Any of the shader processor input (SPI) are busy in the shader engine(s). |
| `grbm_ta_busy` | Cycles | Any of the texture addressing unit are busy in the shader engine(s). |
| `grbm_tc_busy` | Cycles | Any of the texture cache blocks (TCP/TCI/TCA/TCC) are busy. |
| `grbm_cpc_busy` | Cycles | The command processor - compute (CPC) is busy. |
| `grbm_cpf_busy` | Cycles | The command processor - fetcher (CPF) is busy. |
| `grbm_utcl2_busy` | Cycles | The unified translation cache - level 2 (UTCL2) block is busy. |
| `grbm_ea_busy` | Cycles | The efficiency arbiter (EA) block is busy. |
| Hardware Counter | Unit | Definition |
|:--------------------|:--------|:--------------------------------------------------------------------------|
| `GRBM_COUNT` | Cycles | Number of free-running GPU cycles |
| `GRBM_GUI_ACTIVE` | Cycles | Number of GPU active cycles |
| `GRBM_CP_BUSY` | Cycles | Number of cycles any of the Command Processor (CP) blocks are busy |
| `GRBM_SPI_BUSY` | Cycles | Number of cycles any of the Shader Processor Input (SPI) are busy in the shader engine(s) |
| `GRBM_TA_BUSY` | Cycles | Number of cycles any of the Texture Addressing Unit (TA) are busy in the shader engine(s) |
| `GRBM_TC_BUSY` | Cycles | Number of cycles any of the Texture Cache Blocks (TCP/TCI/TCA/TCC) are busy |
| `GRBM_CPC_BUSY` | Cycles | Number of cycles the Command Processor - Compute (CPC) is busy |
| `GRBM_CPF_BUSY` | Cycles | Number of cycles the Command Processor - Fetcher (CPF) is busy |
| `GRBM_UTCL2_BUSY` | Cycles | Number of cycles the Unified Translation Cache - Level 2 (UTCL2) block is busy |
| `GRBM_EA_BUSY` | Cycles | Number of cycles the Efficiency Arbiter (EA) block is busy |
### Command processor
### Command Processor (CP) counters
The command processor counters are further classified into fetcher and compute.
The CP counters are further classified into CP-Fetcher (CPF) and CP-Compute (CPC).
#### CPF
#### CPF counters
##### CPF counters
| Hardware Counter | Unit | Definition |
|:--------------------------------------|:--------|:-------------------------------------------------------------|
| `CPF_CMP_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the Compute UTCL1s is stalled waiting on translation |
| `CPF_CPF_STAT_BUSY` | Cycles | Number of cycles CPF is busy |
| `CPF_CPF_STAT_IDLE*` | Cycles | Number of cycles CPF is idle |
| `CPF_CPF_STAT_STALL` | Cycles | Number of cycles CPF is stalled |
| `CPF_CPF_TCIU_BUSY` | Cycles | Number of cycles CPF Texture Cache Interface Unit (TCIU) interface is busy |
| `CPF_CPF_TCIU_IDLE` | Cycles | Number of cycles CPF TCIU interface is idle |
| `CPF_CPF_TCIU_STALL*` | Cycles | Number of cycles CPF TCIU interface is stalled waiting on free tags |
| Hardware Counter | Unit | Definition |
|--------------------------------------|--------|--------------------------------------------------------------|
| `cpf_cmp_utcl1_stall_on_translation` | Cycles | One of the compute UTCL1s is stalled waiting on translation. |
| `cpf_cpf_stat_idle[]` | Cycles | CPF idle |
| `cpf_cpf_stat_stall` | Cycles | CPF stall |
| `cpf_cpf_tciu_busy` | Cycles | CPF TCIU interface busy |
| `cpf_cpf_tciu_idle` | Cycles | CPF TCIU interface idle |
| `cpf_cpf_tciu_stall[]` | Cycles | CPF TCIU interface is stalled waiting on free tags. |
#### CPC
##### CPC counters
#### CPC counters
| Hardware Counter | Unit | Definition |
| ---------------------------------| -------| --------------------------------------------------- |
| `cpc_me1_busy_for_packet_decode` | Cycles | CPC ME1 busy decoding packets |
| `cpc_utcl1_stall_on_translation` | Cycles | One of the UTCL1s is stalled waiting on translation |
| `cpc_cpc_stat_busy` | Cycles | CPC busy |
| `cpc_cpc_stat_idle` | Cycles | CPC idle |
| `cpc_cpc_stat_stall` | Cycles | CPC stalled |
| `cpc_cpc_tciu_busy` | Cycles | CPC TCIU interface busy |
| `cpc_cpc_tciu_idle` | Cycles | CPC TCIU interface idle |
| `cpc_cpc_utcl2iu_busy` | Cycles | CPC UTCL2 interface busy |
| `cpc_cpc_utcl2iu_idle` | Cycles | CPC UTCL2 interface idle |
| `cpc_cpc_utcl2iu_stall[]` | Cycles | CPC UTCL2 interface stalled waiting |
| `cpc_me1_dci0_spi_busy` | Cycles | CPC ME1 Processor busy |
|:---------------------------------|:-------|:---------------------------------------------------|
| `CPC_ME1_BUSY_FOR_PACKET_DECODE` | Cycles | Number of cycles CPC Micro Engine (ME1) is busy decoding packets |
| `CPC_UTCL1_STALL_ON_TRANSLATION` | Cycles | Number of cycles one of the UTCL1s is stalled waiting on translation |
| `CPC_CPC_STAT_BUSY` | Cycles | Number of cycles CPC is busy |
| `CPC_CPC_STAT_IDLE` | Cycles | Number of cycles CPC is idle |
| `CPC_CPC_STAT_STALL` | Cycles | Number of cycles CPC is stalled |
| `CPC_CPC_TCIU_BUSY` | Cycles | Number of cycles CPC TCIU interface is busy |
| `CPC_CPC_TCIU_IDLE` | Cycles | Number of cycles CPC TCIU interface is idle |
| `CPC_CPC_UTCL2IU_BUSY` | Cycles | Number of cycles CPC UTCL2 interface is busy |
| `CPC_CPC_UTCL2IU_IDLE` | Cycles | Number of cycles CPC UTCL2 interface is idle |
| `CPC_CPC_UTCL2IU_STALL` | Cycles | Number of cycles CPC UTCL2 interface is stalled |
| `CPC_ME1_DC0_SPI_BUSY` | Cycles | Number of cycles CPC ME1 Processor is busy |
### SPI
#### SPI counters
### Shader Processor Input (SPI) counters
| Hardware Counter | Unit | Definition |
| :----------------------------| :-----------| -----------------------------------------------------------: |
| `spi_csn_busy` | Cycles | Number of clocks with outstanding waves |
| `spi_csn_window_valid` | Cycles | Clock count enabled by perfcounter_start event |
| `spi_csn_num_threadgroups` | Workgroups | Total number of dispatched workgroups |
| `spi_csn_wave` | Wavefronts | Total number of dispatched wavefronts |
| `spi_ra_req_no_alloc` | Cycles | Arb cycles with requests but no allocation (need to multiply this value by 4) |
|`spi_ra_req_no_alloc_csn` | Cycles | Arb cycles with CSn req and no CSn alloc (need to multiply this value by 4) |
| `spi_ra_res_stall_csn` | Cycles | Arb cycles with CSn req and no CSn fits (need to multiply this value by 4) |
| `spi_ra_tmp_stall_csn[]` | Cycles | Cycles where CSn wants to req but does not fit in temp space |
| `spi_ra_wave_simd_full_csn` | SIMD-cycles | Sum of SIMD where WAVE cannot take csn wave when not fits |
| `spi_ra_vgpr_simd_full_csn[]` | SIMD-cycles | Sum of SIMD where VGPR cannot take csn wave when not fits |
| `spi_ra_sgpr_simd_full_csn[]` | SIMD-cycles | Sum of SIMD where SGPR cannot take csn wave when not fits |
| `spi_ra_lds_cu_full_csn` | CUs | Sum of CU where LDS cannot take csn wave when not fits |
| `spi_ra_bar_cu_full_csn[]` | CUs | Sum of CU where BARRIER cannot take csn wave when not fits |
| `spi_ra_bulky_cu_full_csn[]` | CUs | Sum of CU where BULKY cannot take csn wave when not fits |
| `spi_ra_tglim_cu_full_csn[]` | Cycles | Cycles where csn wants to req but all CUs are at tg_limit |
| `spi_ra_wvlim_cu_full_csn[]` | Cycles | Number of clocks csn is stalled due to WAVE LIMIT |
| `spi_vwc_csc_wr` | Cycles | Number of clocks to write CSC waves to VGPRs (need to multiply this value by 4) |
| `spi_swc_csc_wr` | Cycles | Number of clocks to write CSC waves to SGPRs (need to multiply this value by 4) |
|:----------------------------|:-----------|:-----------------------------------------------------------|
| `SPI_CSN_BUSY` | Cycles | Number of cycles with outstanding waves |
| `SPI_CSN_WINDOW_VALID` | Cycles | Number of cycles enabled by `perfcounter_start` event |
| `SPI_CSN_NUM_THREADGROUPS` | Workgroups | Number of dispatched workgroups |
| `SPI_CSN_WAVE` | Wavefronts | Number of dispatched wavefronts |
| `SPI_RA_REQ_NO_ALLOC` | Cycles | Number of Arb cycles with requests but no allocation |
|`SPI_RA_REQ_NO_ALLOC_CSN` | Cycles | Number of Arb cycles with Compute Shader, n-th pipe (CSn) requests but no CSn allocation |
| `SPI_RA_RES_STALL_CSN` | Cycles | Number of Arb stall cycles due to shortage of CSn pipeline slots |
| `SPI_RA_TMP_STALL_CSN*` | Cycles | Number of stall cycles due to shortage of temp space |
| `SPI_RA_WAVE_SIMD_FULL_CSN` | SIMD-cycles | Accumulated number of Single Instruction Multiple Data (SIMDs) per cycle affected by shortage of wave slots for CSn wave dispatch |
| `SPI_RA_VGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of VGPR slots for CSn wave dispatch |
| `SPI_RA_SGPR_SIMD_FULL_CSN*` | SIMD-cycles | Accumulated number of SIMDs per cycle affected by shortage of SGPR slots for CSn wave dispatch |
| `SPI_RA_LDS_CU_FULL_CSN` | CUs | Number of Compute Units (CUs) affected by shortage of LDS space for CSn wave dispatch |
| `SPI_RA_BAR_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting at a BARRIER |
| `SPI_RA_BULKY_CU_FULL_CSN*` | CUs | Number of CUs with CSn waves waiting for BULKY resource |
| `SPI_RA_TGLIM_CU_FULL_CSN*` | Cycles | Number of CSn wave stall cycles due to restriction of `tg_limit` for thread group size |
| `SPI_RA_WVLIM_STALL_CSN*` | Cycles | Number of cycles CSn is stalled due to WAVE_LIMIT |
| `SPI_VWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Vector General Purpose Registers (VGPRs) when launching waves |
| `SPI_SWC_CSC_WR` | Qcycles | Number of quad-cycles taken to initialize Scalar General Purpose Registers (SGPRs) when launching waves |
### Compute unit
### Compute Unit (CU) counters
The compute unit counters are further classified into instruction mix, MFMA operation counters, level counters, wavefront counters, wavefront cycle counters, local data share counters, and others.
The CU counters are further classified into instruction mix, Matrix Fused Multiply Add (MFMA) operation counters, level counters, wavefront counters, wavefront cycle counters and Local Data Share (LDS) counters.
#### Instruction mix
| Hardware Counter | Unit | Definition |
| :-----------------------| :-----:| -----------------------------------------------------------------------: |
| `sq_insts` | Instr | Number of instructions issued |
| `sq_insts_valu` | Instr | Number of VALU instructions issued, including MFMA |
| `sq_insts_valu_add_f16` | Instr | Number of VALU F16 Add instructions issued |
| `sq_insts_valu_mul_f16` | Instr | Number of VALU F16 Multiply instructions issued |
| `sq_insts_valu_fma_f16` | Instr | Number of VALU F16 FMA instructions issued |
| `sq_insts_valu_trans_f16` | Instr | Number of VALU F16 Transcendental instructions issued |
| `sq_insts_valu_add_f32` | Instr | Number of VALU F32 Add instructions issued |
| `sq_insts_valu_mul_f32` | Instr | Number of VALU F32 Multiply instructions issued |
| `sq_insts_valu_fma_f32` | Instr | Number of VALU F32 FMA instructions issued |
| `sq_insts_valu_trans_f32` | Instr | Number of VALU F32 Transcendental instructions issued |
| `sq_insts_valu_add_f64` | Instr | Number of VALU F64 Add instructions issued |
| `sq_insts_valu_mul_f64` | Instr | Number of VALU F64 Multiply instructions issued |
| `sq_insts_valu_fma_f64` | Instr | Number of VALU F64 FMA instructions issued |
| `sq_insts_valu_trans_f64` | Instr | Number of VALU F64 Transcendental instructions issued |
| `sq_insts_valu_int32` | Instr | Number of VALU 32-bit integer instructions issued (signed or unsigned) |
| `sq_insts_valu_int64` | Instr | Number of VALU 64-bit integer instructions issued (signed or unsigned) |
| `sq_insts_valu_cvt` | Instr | Number of VALU Conversion instructions issued |
| `sq_insts_valu_mfma_i8` | Instr | Number of 8-bit Integer MFMA instructions issued |
| `sq_insts_valu_mfma_f16` | Instr | Number of F16 MFMA instructions issued |
| `sq_insts_valu_mfma_bf16` | Instr | Number of BF16 MFMA instructions issued |
| `sq_insts_valu_mfma_f32` | Instr | Number of F32 MFMA instructions issued |
| `sq_insts_valu_mfma_f64` | Instr | Number of F64 MFMA instructions issued |
| `sq_insts_mfma` | Instr | Number of MFMA instructions issued |
| `sq_insts_vmem_wr` | Instr | Number of VMEM write instructions issued |
| `sq_insts_vmem_rd` | Instr | Number of VMEM read instructions issued |
| `sq_insts_vmem` | Instr | Number of VMEM instructions issued, including both FLAT and buffer instructions |
| `sq_insts_salu` | Instr | Number of SALU instructions issued |
| `sq_insts_smem` | Instr | Number of SMEM instructions issued |
| `sq_insts_smem_norm` | Instr | Number of SMEM instructions issued to normalize to match `smem_level`. Used in measuring SMEM latency |
| `sq_insts_flat` | Instr | Number of FLAT instructions issued |
| `sq_insts_flat_lds_only` | Instr | Number of FLAT instructions issued that read/write only from/to LDS |
| `sq_insts_lds` | Instr | Number of LDS instructions issued |
| `sq_insts_gds` | Instr | Number of GDS instructions issued |
| `sq_insts_exp_gds` | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued |
| `sq_insts_branch` | Instr | Number of Branch instructions issued |
| `sq_insts_sendmsg` | Instr | Number of SENDMSG instructions including s_endpgm issued |
| `sq_insts_vskipped[]` | Instr | Number of VSkipped instructions issued |
|:-----------------------|:-----|:-----------------------------------------------------------------------|
| `SQ_INSTS` | Instr | Number of instructions issued. |
| `SQ_INSTS_VALU` | Instr | Number of Vector Arithmetic Logic Unit (VALU) instructions including MFMA issued. |
| `SQ_INSTS_VALU_ADD_F16` | Instr | Number of VALU Half Precision Floating Point (F16) ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F16` | Instr | Number of VALU F16 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F16` | Instr | Number of VALU F16 Fused Multiply Add (FMA)/ Multiply Add (MAD) instructions issued. |
| `SQ_INSTS_VALU_TRANS_F16` | Instr | Number of VALU F16 Transcendental instructions issued. |
| `SQ_INSTS_VALU_ADD_F32` | Instr | Number of VALU Full Precision Floating Point (F32) ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F32` | Instr | Number of VALU F32 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F32` | Instr | Number of VALU F32 FMA/MAD instructions issued. |
| `SQ_INSTS_VALU_TRANS_F32` | Instr | Number of VALU F32 Transcendental instructions issued. |
| `SQ_INSTS_VALU_ADD_F64` | Instr | Number of VALU F64 ADD/SUB instructions issued. |
| `SQ_INSTS_VALU_MUL_F64` | Instr | Number of VALU F64 Multiply instructions issued. |
| `SQ_INSTS_VALU_FMA_F64` | Instr | Number of VALU F64 FMA/MAD instructions issued. |
| `SQ_INSTS_VALU_TRANS_F64` | Instr | Number of VALU F64 Transcendental instructions issued. |
| `SQ_INSTS_VALU_INT32` | Instr | Number of VALU 32-bit integer instructions (signed or unsigned) issued. |
| `SQ_INSTS_VALU_INT64` | Instr | Number of VALU 64-bit integer instructions (signed or unsigned) issued. |
| `SQ_INSTS_VALU_CVT` | Instr | Number of VALU Conversion instructions issued. |
| `SQ_INSTS_VALU_MFMA_I8` | Instr | Number of 8-bit Integer MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F16` | Instr | Number of F16 MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_BF16` | Instr | Number of Brain Floating Point - 16 (BF16) MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F32` | Instr | Number of F32 MFMA instructions issued. |
| `SQ_INSTS_VALU_MFMA_F64` | Instr | Number of F64 MFMA instructions issued. |
| `SQ_INSTS_MFMA` | Instr | Number of MFMA instructions issued. |
| `SQ_INSTS_VMEM_WR` | Instr | Number of Vector Memory (VMEM) Write instructions (including FLAT) issued. |
| `SQ_INSTS_VMEM_RD` | Instr | Number of VMEM Read instructions (including FLAT) issued. |
| `SQ_INSTS_VMEM` | Instr | Number of VMEM instructions issued, including both FLAT and Buffer instructions. |
| `SQ_INSTS_SALU` | Instr | Number of SALU instructions issued. |
| `SQ_INSTS_SMEM` | Instr | Number of Scalar Memory (SMEM) instructions issued. |
| `SQ_INSTS_SMEM_NORM` | Instr | Number of SMEM instructions normalized to match `smem_level` issued. |
| `SQ_INSTS_FLAT` | Instr | Number of FLAT instructions issued. |
| `SQ_INSTS_FLAT_LDS_ONLY` | Instr | Number of FLAT instructions that read/write only from/to LDS issued. Works only if `EARLY_TA_DONE` is enabled. |
| `SQ_INSTS_LDS` | Instr | Number of Local Data Share (LDS) instructions issued (including FLAT). |
| `SQ_INSTS_GDS` | Instr | Number of Global Data Share (GDS) instructions issued. |
| `SQ_INSTS_EXP_GDS` | Instr | Number of EXP and GDS instructions excluding skipped export instructions issued. |
| `SQ_INSTS_BRANCH` | Instr | Number of Branch instructions issued. |
| `SQ_INSTS_SENDMSG` | Instr | Number of `SENDMSG` instructions including `s_endpgm` issued. |
| `SQ_INSTS_VSKIPPED*` | Instr | Number of vector instructions skipped. |
#### MFMA operation counters
| Hardware Counter | Unit | Definition |
| :----------------------------| :-----| ----------------------------------------------: |
| `sq_insts_valu_mfma_mops_I8` | IOP | Number of 8-bit integer MFMA ops in unit of 512 |
| `sq_insts_valu_mfma_mops_F16` | FLOP | Number of F16 floating MFMA ops in unit of 512 |
| `sq_insts_valu_mfma_mops_BF16` | FLOP | Number of BF16 floating MFMA ops in unit of 512 |
| `sq_insts_valu_mfma_mops_F32` | FLOP | Number of F32 floating MFMA ops in unit of 512 |
| `sq_insts_valu_mfma_mops_F64` | FLOP | Number of F64 floating MFMA ops in unit of 512 |
|:----------------------------|:-----|:----------------------------------------------|
| `SQ_INSTS_VALU_MFMA_MOPS_I8` | IOP | Number of 8-bit integer MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F16` | FLOP | Number of F16 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_BF16` | FLOP | Number of BF16 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F32` | FLOP | Number of F32 floating MFMA ops in the unit of 512 |
| `SQ_INSTS_VALU_MFMA_MOPS_F64` | FLOP | Number of F64 floating MFMA ops in the unit of 512 |
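Because these counters report MFMA operations in units of 512, the raw count must be multiplied out to recover total operations. A small sketch with a hypothetical counter reading:

```py
# Hypothetical counter value collected for one kernel dispatch.
SQ_INSTS_VALU_MFMA_MOPS_F32 = 1_000_000

# Each count represents 512 floating-point operations.
total_f32_mfma_flops = SQ_INSTS_VALU_MFMA_MOPS_F32 * 512
print(f"{total_f32_mfma_flops:.3e} F32 MFMA FLOPs")  # 5.120e+08
```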
#### Level counters
:::{note}
All level counters must be followed by `SQ_ACCUM_PREV_HIRES` counter to measure average latency.
:::
| Hardware Counter | Unit | Definition |
| :-------------------| :-----| -------------------------------------: |
| `sq_accum_prev` | Count | Accumulated counter sample value where accumulation takes place once every four cycles |
| `sq_accum_prev_hires` | Count | Accumulated counter sample value where accumulation takes place once every cycle |
| `sq_level_waves` | Waves | Number of inflight waves |
| `sq_insts_level_vmem` | Instr | Number of inflight VMEM instructions |
| `sq_insts_level_smem` | Instr | Number of inflight SMEM instructions |
| `sq_insts_level_lds` | Instr | Number of inflight LDS instructions |
| `sq_ifetch_level` | Instr | Number of inflight instruction fetches |
|:-------------------|:-----|:-------------------------------------|
| `SQ_ACCUM_PREV` | Count | Accumulated counter sample value where accumulation takes place once every four cycles. |
| `SQ_ACCUM_PREV_HIRES` | Count | Accumulated counter sample value where accumulation takes place once every cycle. |
| `SQ_LEVEL_WAVES` | Waves | Number of inflight waves. To calculate the wave latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_WAVES`. |
| `SQ_INST_LEVEL_VMEM` | Instr | Number of inflight VMEM (including FLAT) instructions. To calculate the VMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_VMEM`. |
| `SQ_INST_LEVEL_SMEM` | Instr | Number of inflight SMEM instructions. To calculate the SMEM latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_SMEM_NORM`. |
| `SQ_INST_LEVEL_LDS` | Instr | Number of inflight LDS (including FLAT) instructions. To calculate the LDS latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_INSTS_LDS`. |
| `SQ_IFETCH_LEVEL` | Instr | Number of inflight instruction fetch requests from the cache. To calculate the instruction fetch latency, divide `SQ_ACCUM_PREV_HIRES` by `SQ_IFETCH`. |
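As the definitions above indicate, an average latency is the ratio of the accumulated high-resolution level counter to the matching instruction counter. A sketch with hypothetical readings:

```py
# Hypothetical counter values for one kernel dispatch.
SQ_ACCUM_PREV_HIRES = 12_000_000  # accumulated while following SQ_INST_LEVEL_VMEM
SQ_INSTS_VMEM = 40_000            # VMEM (including FLAT) instructions issued

avg_vmem_latency = SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM
print(f"average VMEM latency: {avg_vmem_latency:.0f} cycles")  # 300 cycles
```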
#### Wavefront counters
| Hardware Counter | Unit | Definition |
| :--------------------| :-----| ----------------------------------------------------------------: |
| `sq_waves` | Waves | Number of wavefronts dispatch to SQs, including both new and restored wavefronts |
| `sq_waves_saved[]` | Waves | Number of context-saved wavefronts |
| `sq_waves_restored[]` | Waves | Number of context-restored wavefronts |
| `sq_waves_eq_64` | Waves | Number of wavefronts with exactly 64 active threads sent to SQs |
| `sq_waves_lt_64` | Waves | Number of wavefronts with less than 64 active threads sent to SQs |
| `sq_waves_lt_48` | Waves | Number of wavefronts with less than 48 active threads sent to SQs |
| `sq_waves_lt_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
| `sq_waves_lt_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |
|:--------------------|:-----|:----------------------------------------------------------------|
| `SQ_WAVES` | Waves | Number of wavefronts dispatched to Sequencers (SQs), including both new and restored wavefronts |
| `SQ_WAVES_SAVED*` | Waves | Number of context-saved waves |
| `SQ_WAVES_RESTORED*` | Waves | Number of context-restored waves sent to SQs |
| `SQ_WAVES_EQ_64` | Waves | Number of wavefronts with exactly 64 active threads sent to SQs |
| `SQ_WAVES_LT_64` | Waves | Number of wavefronts with less than 64 active threads sent to SQs |
| `SQ_WAVES_LT_48` | Waves | Number of wavefronts with less than 48 active threads sent to SQs |
| `SQ_WAVES_LT_32` | Waves | Number of wavefronts with less than 32 active threads sent to SQs |
| `SQ_WAVES_LT_16` | Waves | Number of wavefronts with less than 16 active threads sent to SQs |
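These counters make it straightforward to check what fraction of dispatched wavefronts were fully populated. A sketch with hypothetical values:

```py
# Hypothetical counter values.
SQ_WAVES = 81_920        # total wavefronts dispatched to the SQs
SQ_WAVES_EQ_64 = 73_728  # wavefronts with all 64 threads active

full_wave_fraction = SQ_WAVES_EQ_64 / SQ_WAVES
print(f"{full_wave_fraction:.1%} of wavefronts used all 64 threads")  # 90.0%
```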
#### Wavefront cycle counters
| Hardware Counter | Unit | Definition |
| :------------------------| :-------| --------------------------------------------------------------------: |
| `sq_cycles` | Cycles | Free-running SQ clocks |
| `sq_busy_cycles` | Cycles | Number of cycles while SQ reports it to be busy |
| `sq_busy_cu_cycles` | Qcycles | Number of quad cycles each CU is busy |
| `sq_valu_mfma_busy_cycles` | Cycles | Number of cycles the MFMA ALU is busy |
| `sq_wave_cycles` | Qcycles | Number of quad cycles spent by waves in the CUs |
| `sq_wait_any` | Qcycles | Number of quad cycles spent waiting for anything |
| `sq_wait_inst_any` | Qcycles | Number of quad cycles spent waiting for an issued instruction |
| `sq_active_inst_any` | Qcycles | Number of quad cycles spent by each wave to work on an instruction |
| `sq_active_inst_vmem` | Qcycles | Number of quad cycles spent by each wave to work on a non-FLAT VMEM instruction |
| `sq_active_inst_lds` | Qcycles | Number of quad cycles spent by each wave to work on an LDS instruction |
| `sq_active_inst_valu` | Qcycles | Number of quad cycles spent by each wave to work on a VALU instruction |
| `sq_active_inst_sca` | Qcycles | Number of quad cycles spent by each wave to work on an SCA instruction |
| `sq_active_inst_exp_gds` | Qcycles | Number of quad cycles spent by each wave to work on EXP or GDS instruction |
| `sq_active_inst_misc` | Qcycles | Number of quad cycles spent by each wave to work on an MISC instruction, including branch and sendmsg |
| `sq_active_inst_flat` | Qcycles | Number of quad cycles spent by each wave to work on a FLAT instruction |
| `sq_inst_cycles_vmem_wr` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM write instructions, including both FLAT and buffer |
| `sq_inst_cycles_vmem_rd` | Qcycles | Number of quad cycles spent to send addr and cmd data for VMEM read instructions, including both FLAT and buffer |
| `sq_inst_cycles_smem` | Qcycles | Number of quad cycles spent to execute scalar memory reads |
| `sq_inst_cycles_salu` | Cycles | Number of cycles spent to execute non-memory read scalar operations |
| `sq_thread_cycles_valu` | Cycles | Number of thread cycles spent to execute VALU operations |
|:------------------------|:-------|:--------------------------------------------------------------------|
| `SQ_CYCLES` | Cycles | Clock cycles. |
| `SQ_BUSY_CYCLES` | Cycles | Number of cycles while SQ reports it to be busy. |
| `SQ_BUSY_CU_CYCLES` | Qcycles | Number of quad-cycles each CU is busy. |
| `SQ_VALU_MFMA_BUSY_CYCLES` | Cycles | Number of cycles the MFMA ALU is busy. |
| `SQ_WAVE_CYCLES` | Qcycles | Number of quad-cycles spent by waves in the CUs. |
| `SQ_WAIT_ANY` | Qcycles | Number of quad-cycles spent waiting for anything. |
| `SQ_WAIT_INST_ANY` | Qcycles | Number of quad-cycles spent waiting for any instruction to be issued. |
| `SQ_ACTIVE_INST_ANY` | Qcycles | Number of quad-cycles spent by each wave to work on an instruction. |
| `SQ_ACTIVE_INST_VMEM` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VMEM instruction. |
| `SQ_ACTIVE_INST_LDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an LDS instruction. |
| `SQ_ACTIVE_INST_VALU` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a VALU instruction. |
| `SQ_ACTIVE_INST_SCA` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a SALU or SMEM instruction. |
| `SQ_ACTIVE_INST_EXP_GDS` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on an EXPORT or GDS instruction. |
| `SQ_ACTIVE_INST_MISC` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a BRANCH or `SENDMSG` instruction. |
| `SQ_ACTIVE_INST_FLAT` | Qcycles | Number of quad-cycles spent by the SQ instruction arbiter to work on a FLAT instruction. |
| `SQ_INST_CYCLES_VMEM_WR` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Write instructions. |
| `SQ_INST_CYCLES_VMEM_RD` | Qcycles | Number of quad-cycles spent to send addr and cmd data for VMEM Read instructions. |
| `SQ_INST_CYCLES_SMEM` | Qcycles | Number of quad-cycles spent to execute scalar memory reads. |
| `SQ_INST_CYCLES_SALU` | Qcycles | Number of quad-cycles spent to execute non-memory read scalar operations. |
| `SQ_THREAD_CYCLES_VALU` | Cycles | Number of thread-cycles spent to execute VALU operations. This is similar to `INST_CYCLES_VALU` but multiplied by the number of active threads. |
| `SQ_WAIT_INST_LDS` | Qcycles | Number of quad-cycles spent waiting for LDS instruction to be issued. |
#### Local data share
#### LDS counters
| Hardware Counter | Unit | Definition |
| :--------------------------| :------| --------------------------------------------------------: |
| `sq_lds_atomic_return` | Cycles | Number of atomic return cycles in LDS |
| `sq_lds_bank_conflict` | Cycles | Number of cycles LDS is stalled by bank conflicts |
| `sq_lds_addr_conflict[]` | Cycles | Number of cycles LDS is stalled by address conflicts |
| `sq_lds_unaligned_stalls[]` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
| `sq_lds_mem_violations[]` | Count | Number of threads that have a memory violation in the LDS |
|:--------------------------|:------|:--------------------------------------------------------|
| `SQ_LDS_ATOMIC_RETURN` | Cycles | Number of atomic return cycles in LDS |
| `SQ_LDS_BANK_CONFLICT` | Cycles | Number of cycles LDS is stalled by bank conflicts |
| `SQ_LDS_ADDR_CONFLICT*` | Cycles | Number of cycles LDS is stalled by address conflicts |
| `SQ_LDS_UNALIGNED_STALL*` | Cycles | Number of cycles LDS is stalled processing flat unaligned load/store ops |
| `SQ_LDS_MEM_VIOLATIONS*` | Count | Number of threads that have a memory violation in the LDS |
| `SQ_LDS_IDX_ACTIVE` | Cycles | Number of cycles LDS is used for indexed operations |
#### Miscellaneous
#### Miscellaneous counters
| Hardware Counter | Unit | Definition |
|:--------------------------|:------|:--------------------------------------------------------|
| `SQ_IFETCH` | Count | Number of instruction fetch requests from `L1I` cache, in 32-byte width |
| `SQ_ITEMS` | Threads | Number of valid items per wave |
| Hardware Counter | Unit | Definition |
| :----------------| :-------| --------------------------------------------------------: |
| `sq_ifetch` | Count | Number of fetch requests from L1I cache, in 32-byte width |
| `sq_items` | Threads | Number of valid threads |
### L1I and sL1D caches
#### L1I and sL1D caches
### L1I and sL1D cache counters
| Hardware Counter | Unit | Definition |
| :----------------------------| :------| ----------------------------------------------------------------: |
| `sqc_icache_req` | Req | Number of L1I cache requests |
| `sqc_icache_hits` | Count | Number of L1I cache lookup-hits |
| `sqc_icache_misses` | Count | Number of L1I cache non-duplicate lookup-misses |
| `sqc_icache_misses_duplicate` | Count | Number of L1I cache duplicate lookup-misses whose previous lookup miss on the same cache line is not fulfilled yet |
| `sqc_dcache_req` | Req | Number of sL1D cache requests |
| `sqc_dcache_input_valid_readb` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
| `sqc_dcache_hits` | Count | Number of sL1D cache lookup-hits |
| `sqc_dcache_misses` | Count | Number of sL1D non-duplicate lookup-misses |
| `sqc_dcache_misses_duplicate` | Count | Number of sL1D duplicate lookup-misses |
| `sqc_dcache_req_read_1` | Req | Number of read requests in a single 32-bit data word, DWORD (DW) |
| `sqc_dcache_req_read_2` | Req | Number of read requests in 2 DW |
| `sqc_dcache_req_read_4` | Req | Number of read requests in 4 DW |
| `sqc_dcache_req_read_8` | Req | Number of read requests in 8 DW |
| `sqc_dcache_req_read_16` | Req | Number of read requests in 16 DW |
| `sqc_dcache_atomic[]` | Req | Number of atomic requests |
| `sqc_tc_req` | Req | Number of L2 cache requests that were issued by instruction and constant caches |
| `sqc_tc_inst_req` | Req | Number of instruction cache line requests to L2 cache |
| `sqc_tc_data_read_req` | Req | Number of data read requests to the L2 cache |
| `sqc_tc_data_write_req[]` | Req | Number of data write requests to the L2 cache |
| `sqc_tc_data_atomic_req[]` | Req | Number of data atomic requests to the L2 cache |
| `sqc_tc_stall[]` | Cycles | Number of cycles while the valid requests to L2 cache are stalled |
|:----------------------------|:------|:----------------------------------------------------------------|
| `SQC_ICACHE_REQ` | Req | Number of `L1I` cache requests |
| `SQC_ICACHE_HITS` | Count | Number of `L1I` cache hits |
| `SQC_ICACHE_MISSES` | Count | Number of non-duplicate `L1I` cache misses including uncached requests |
| `SQC_ICACHE_MISSES_DUPLICATE` | Count | Number of duplicate `L1I` cache misses whose previous lookup miss on the same cache line is not fulfilled yet |
| `SQC_DCACHE_REQ` | Req | Number of `sL1D` cache requests |
| `SQC_DCACHE_INPUT_VALID_READYB` | Cycles | Number of cycles while SQ input is valid but sL1D cache is not ready |
| `SQC_DCACHE_HITS` | Count | Number of `sL1D` cache hits |
| `SQC_DCACHE_MISSES` | Count | Number of non-duplicate `sL1D` cache misses including uncached requests |
| `SQC_DCACHE_MISSES_DUPLICATE` | Count | Number of duplicate `sL1D` cache misses |
| `SQC_DCACHE_REQ_READ_1` | Req | Number of constant cache read requests in a single DW |
| `SQC_DCACHE_REQ_READ_2` | Req | Number of constant cache read requests in two DW |
| `SQC_DCACHE_REQ_READ_4` | Req | Number of constant cache read requests in four DW |
| `SQC_DCACHE_REQ_READ_8` | Req | Number of constant cache read requests in eight DW |
| `SQC_DCACHE_REQ_READ_16` | Req | Number of constant cache read requests in 16 DW |
| `SQC_DCACHE_ATOMIC*` | Req | Number of atomic requests |
| `SQC_TC_REQ` | Req | Number of TC requests that were issued by instruction and constant caches |
| `SQC_TC_INST_REQ` | Req | Number of instruction requests to the L2 cache |
| `SQC_TC_DATA_READ_REQ` | Req | Number of data read requests to the L2 cache |
| `SQC_TC_DATA_WRITE_REQ*` | Req | Number of data write requests to the L2 cache |
| `SQC_TC_DATA_ATOMIC_REQ*` | Req | Number of data atomic requests to the L2 cache |
| `SQC_TC_STALL*` | Cycles | Number of cycles while the valid requests to the L2 cache are stalled |
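As a usage sketch, counters like these can be collected with a profiling tool such as `rocprof`. The input-file format below (`pmc:` lines) and the exact counter availability depend on the GPU and ROCm version, so treat this as an illustrative example rather than a guaranteed recipe; `rocprof --list-basic` shows what a given device supports.
```shell
# Hypothetical input file: each pmc line is one collection pass.
cat > sqc_counters.txt << 'EOF'
pmc: SQC_ICACHE_REQ SQC_ICACHE_HITS SQC_ICACHE_MISSES
pmc: SQC_DCACHE_REQ SQC_DCACHE_HITS SQC_DCACHE_MISSES
EOF
# Run the application under the profiler; results land in a CSV,
# one row per kernel dispatch.
rocprof -i sqc_counters.txt -o sqc_results.csv ./my_app
```
The L1I hit rate for a dispatch can then be estimated as `SQC_ICACHE_HITS / SQC_ICACHE_REQ`.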
### Vector L1 cache subsystem
The vector L1 cache subsystem counters are further classified into Texture Addressing Unit (TA), Texture Data Unit (TD), vector L1D cache or Texture Cache per Pipe (TCP), and Texture Cache Arbiter (TCA) counters.
#### TA counters
| Hardware Counter | Unit | Definition |
|:--------------------------------|:------|:------------------------------------------------|
| `TA_TA_BUSY[n]` | Cycles | TA busy cycles. Value range for n: [0-15]. |
| `TA_TOTAL_WAVEFRONTS[n]` | Instr | Number of wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_WAVEFRONTS[n]` | Instr | Number of buffer wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_READ_WAVEFRONTS[n]` | Instr | Number of buffer read wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_WRITE_WAVEFRONTS[n]` | Instr | Number of buffer write wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_ATOMIC_WAVEFRONTS[n]` | Instr | Number of buffer atomic wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_BUFFER_TOTAL_CYCLES[n]` | Cycles | Number of buffer cycles (including read and write) issued to TC. Value range for n: [0-15]. |
| `TA_BUFFER_COALESCED_READ_CYCLES[n]` | Cycles | Number of coalesced buffer read cycles issued to TC. Value range for n: [0-15]. |
| `TA_BUFFER_COALESCED_WRITE_CYCLES[n]` | Cycles | Number of coalesced buffer write cycles issued to TC. Value range for n: [0-15]. |
| `TA_ADDR_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TC. Value range for n: [0-15]. |
| `TA_DATA_STALLED_BY_TC_CYCLES[n]` | Cycles | Number of cycles TA data path is stalled by TC. Value range for n: [0-15]. |
| `TA_ADDR_STALLED_BY_TD_CYCLES[n]` | Cycles | Number of cycles TA address path is stalled by TD. Value range for n: [0-15]. |
| `TA_FLAT_WAVEFRONTS[n]` | Instr | Number of flat opcode wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_READ_WAVEFRONTS[n]` | Instr | Number of flat opcode read wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_WRITE_WAVEFRONTS[n]` | Instr | Number of flat opcode write wavefronts processed by TA. Value range for n: [0-15]. |
| `TA_FLAT_ATOMIC_WAVEFRONTS[n]` | Instr | Number of flat opcode atomic wavefronts processed by TA. Value range for n: [0-15]. |
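Because each `TA_*` counter exists per TA instance (the `[n]` index), profiling sessions usually request the aggregated `*_sum`, `*_max`, or `*_min` forms listed in the derived metrics section rather than individual instances. A minimal sketch, assuming the same hypothetical `rocprof` input format as above:
```shell
# Hypothetical input file: aggregated TA counters over all 16 instances.
cat > ta_counters.txt << 'EOF'
pmc: TA_TOTAL_WAVEFRONTS_sum TA_FLAT_WAVEFRONTS_sum TA_BUSY_max
EOF
rocprof -i ta_counters.txt -o ta_results.csv ./my_app
```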
#### TD counters
| Hardware Counter | Unit | Definition |
|:------------------------|:-----|:---------------------------------------------------|
| `TD_TD_BUSY[n]` | Cycle | TD busy cycles while it is processing or waiting for data. Value range for n: [0-15]. |
| `TD_TC_STALL[n]` | Cycle | Number of cycles TD is stalled waiting for TC data. Value range for n: [0-15]. |
| `TD_SPI_STALL[n]` | Cycle | Number of cycles TD is stalled by SPI. Value range for n: [0-15]. |
| `TD_LOAD_WAVEFRONT[n]` | Instr | Number of wavefront instructions (read/write/atomic). Value range for n: [0-15]. |
| `TD_STORE_WAVEFRONT[n]` | Instr | Number of write wavefront instructions. Value range for n: [0-15]. |
| `TD_ATOMIC_WAVEFRONT[n]` | Instr | Number of atomic wavefront instructions. Value range for n: [0-15]. |
| `TD_COALESCABLE_WAVEFRONT[n]` | Instr | Number of coalescable wavefronts according to TA. Value range for n: [0-15]. |
#### TCP counters
| Hardware Counter | Unit | Definition |
|:-----------------------------------|:------|:----------------------------------------------------------|
| `TCP_GATE_EN1[n]` | Cycles | Number of cycles vL1D interface clocks are turned on. Value range for n: [0-15]. |
| `TCP_GATE_EN2[n]` | Cycles | Number of cycles vL1D core clocks are turned on. Value range for n: [0-15]. |
| `TCP_TD_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TD stalls vL1D. Value range for n: [0-15]. |
| `TCP_TCR_TCP_STALL_CYCLES[n]` | Cycles | Number of cycles TCR stalls vL1D. Value range for n: [0-15]. |
| `TCP_READ_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a read. Value range for n: [0-15]. |
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on a write. Value range for n: [0-15]. |
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES[n]` | Cycles | Number of cycles tagram conflict stalls on an atomic. Value range for n: [0-15]. |
| `TCP_PENDING_STALL_CYCLES[n]` | Cycles | Number of cycles vL1D cache is stalled due to data pending from L2 Cache. Value range for n: [0-15]. |
| `TCP_TCP_TA_DATA_STALL_CYCLES` | Cycles | Number of cycles TCP stalls TA data interface. |
| `TCP_TA_TCP_STATE_READ[n]` | Req | Number of state reads. Value range for n: [0-15]. |
| `TCP_VOLATILE[n]` | Req | Number of L1 volatile pixels/buffers from TA. Value range for n: [0-15]. |
| `TCP_TOTAL_ACCESSES[n]` | Req | Number of vL1D accesses. Equals `TCP_PERF_SEL_TOTAL_READ` + `TCP_PERF_SEL_TOTAL_NONREAD`. Value range for n: [0-15]. |
| `TCP_TOTAL_READ[n]` | Req | Number of vL1D read accesses. Equals `TCP_PERF_SEL_TOTAL_HIT_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_LRU_READ` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_READ`. Value range for n: [0-15]. |
| `TCP_TOTAL_WRITE[n]` | Req | Number of vL1D write accesses. Equals `TCP_PERF_SEL_TOTAL_MISS_LRU_WRITE` + `TCP_PERF_SEL_TOTAL_MISS_EVICT_WRITE`. Value range for n: [0-15]. |
| `TCP_TOTAL_ATOMIC_WITH_RET[n]` | Req | Number of vL1D atomic requests with return. Value range for n: [0-15]. |
| `TCP_TOTAL_ATOMIC_WITHOUT_RET[n]` | Req | Number of vL1D atomic requests without return. Value range for n: [0-15]. |
| `TCP_TOTAL_WRITEBACK_INVALIDATES[n]` | Count | Total number of vL1D writebacks and invalidates. Equals `TCP_PERF_SEL_TOTAL_WBINVL1` + `TCP_PERF_SEL_TOTAL_WBINVL1_VOL` + `TCP_PERF_SEL_CP_TCP_INVALIDATE` + `TCP_PERF_SEL_SQ_TCP_INVALIDATE_VOL`. Value range for n: [0-15]. |
| `TCP_UTCL1_REQUEST[n]` | Req | Number of address translation requests to UTCL1. Value range for n: [0-15]. |
| `TCP_UTCL1_TRANSLATION_HIT[n]` | Req | Number of UTCL1 translation hits. Value range for n: [0-15]. |
| `TCP_UTCL1_TRANSLATION_MISS[n]` | Req | Number of UTCL1 translation misses. Value range for n: [0-15]. |
| `TCP_UTCL1_PERMISSION_MISS[n]` | Req | Number of UTCL1 permission misses. Value range for n: [0-15]. |
| `TCP_TOTAL_CACHE_ACCESSES[n]` | Req | Number of vL1D cache accesses including hits and misses. Value range for n: [0-15]. |
| `TCP_TCP_LATENCY[n]` | Cycles | Accumulated wave access latency to vL1D over all wavefronts. Value range for n: [0-15]. |
| `TCP_TCC_READ_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return. Value range for n: [0-15]. |
| `TCP_TCC_WRITE_REQ_LATENCY[n]` | Cycles | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return. Value range for n: [0-15]. |
| `TCP_TCC_READ_REQ[n]` | Req | Number of read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_WRITE_REQ[n]` | Req | Number of write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_ATOMIC_WITH_RET_REQ[n]` | Req | Number of atomic requests to L2 cache with return. Value range for n: [0-15]. |
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ[n]` | Req | Number of atomic requests to L2 cache without return. Value range for n: [0-15]. |
| `TCP_TCC_NC_READ_REQ[n]` | Req | Number of NC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_READ_REQ[n]` | Req | Number of UC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_READ_REQ[n]` | Req | Number of CC read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_READ_REQ[n]` | Req | Number of RW read requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_NC_WRITE_REQ[n]` | Req | Number of NC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_WRITE_REQ[n]` | Req | Number of UC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_WRITE_REQ[n]` | Req | Number of CC write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_WRITE_REQ[n]` | Req | Number of RW write requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_NC_ATOMIC_REQ[n]` | Req | Number of NC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_UC_ATOMIC_REQ[n]` | Req | Number of UC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_CC_ATOMIC_REQ[n]` | Req | Number of CC atomic requests to L2 cache. Value range for n: [0-15]. |
| `TCP_TCC_RW_ATOMIC_REQ[n]` | Req | Number of RW atomic requests to L2 cache. Value range for n: [0-15]. |
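One common use of the TCP counters is estimating the UTCL1 address-translation hit rate, `TCP_UTCL1_TRANSLATION_HIT / TCP_UTCL1_REQUEST`. The sketch below assumes the aggregated `_sum` counters were collected into a `rocprof`-style CSV whose header row names each counter column; the naive comma splitting also assumes no quoted commas in the kernel names.
```shell
cat > tcp_counters.txt << 'EOF'
pmc: TCP_UTCL1_REQUEST_sum TCP_UTCL1_TRANSLATION_HIT_sum
EOF
rocprof -i tcp_counters.txt -o tcp_results.csv ./my_app
# Map header names to column indices, then print the hit rate per dispatch.
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
{
  req = $col["TCP_UTCL1_REQUEST_sum"]
  hit = $col["TCP_UTCL1_TRANSLATION_HIT_sum"]
  if (req > 0) printf "dispatch %d: UTCL1 hit rate %.1f%%\n", NR - 1, 100 * hit / req
}' tcp_results.csv
```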
#### TCA counters
| Hardware Counter | Unit | Definition |
|:----------------|:------|:------------------------------------------|
| `TCA_CYCLE[n]` | Cycles | Number of TCA cycles. Value range for n: [0-31]. |
| `TCA_BUSY[n]` | Cycles | Number of cycles TCA has a pending request. Value range for n: [0-31]. |
### L2 cache access counters
The L2 cache is also known as the Texture Cache per Channel (TCC).
| Hardware Counter | Unit | Definition |
|:--------------------------------|:------|:-------------------------------------------------------------|
| `TCC_CYCLE[n]` |Cycle | Number of L2 cache free-running clocks. Value range for n: [0-31]. |
| `TCC_BUSY[n]` |Cycle | Number of L2 cache busy cycles. Value range for n: [0-31]. |
| `TCC_REQ[n]` |Req | Number of L2 cache requests of all types. This is measured at the tag block. This may be more than the number of requests arriving at the TCC, but it is a good indication of the total amount of work that needs to be performed. Value range for n: [0-31]. |
| `TCC_STREAMING_REQ[n]` |Req | Number of L2 cache streaming requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_NC_REQ[n]` |Req | Number of NC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_UC_REQ[n]` |Req | Number of UC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_CC_REQ[n]` |Req | Number of CC requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_RW_REQ[n]` |Req | Number of RW requests. This is measured at the tag block. Value range for n: [0-31]. |
| `TCC_PROBE[n]` |Req | Number of probe requests. Value range for n: [0-31]. |
| `TCC_PROBE_ALL[n]` |Req | Number of external probe requests with `EA_TCC_preq_all` == 1. Value range for n: [0-31]. |
| `TCC_READ[n]` |Req | Number of L2 cache read requests. This includes compressed reads but not metadata reads. Value range for n: [0-31]. |
| `TCC_WRITE[n]` |Req | Number of L2 cache write requests. Value range for n: [0-31]. |
| `TCC_ATOMIC[n]` |Req | Number of L2 cache atomic requests of all types. Value range for n: [0-31]. |
| `TCC_HIT[n]` |Req | Number of L2 cache hits. Value range for n: [0-31]. |
| `TCC_MISS[n]` |Req | Number of L2 cache misses. Value range for n: [0-31]. |
| `TCC_WRITEBACK[n]` |Req | Number of lines written back to the main memory, including writebacks of dirty lines and uncached write/atomic requests. Value range for n: [0-31]. |
| `TCC_EA_WRREQ[n]` |Req | Number of 32-byte and 64-byte transactions going over the `TC_EA_wrreq` interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_64B[n]` |Req | Total number of 64-byte transactions (write or `CMPSWAP`) going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
| `TCC_EA_WR_UNCACHED_32B[n]` |Req | Number of 32-byte write/atomic going over the `TC_EA_wrreq` interface due to uncached traffic. Note that CC mtypes can produce uncached requests, and those are included in this. A 64-byte request is counted as 2. Value range for n: [0-31].|
| `TCC_EA_WRREQ_STALL[n]` | Cycles | Number of cycles a write request is stalled. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of IO credits. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of GMI credits. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles an EA write request is stalled due to the interface running out of DRAM credits. Value range for n: [0-31]. |
| `TCC_TOO_MANY_EA_WRREQS_STALL[n]` | Cycles | Number of cycles the L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests. Value range for n: [0-31]. |
| `TCC_EA_WRREQ_LEVEL[n]` | Req | The accumulated number of EA write requests in flight. This is primarily intended to measure average EA write latency. Average write latency = `TCC_PERF_SEL_EA_WRREQ_LEVEL`/`TCC_PERF_SEL_EA_WRREQ`. Value range for n: [0-31]. |
| `TCC_EA_ATOMIC[n]` | Req | Number of 32-byte or 64-byte atomic requests going over the `TC_EA_wrreq` interface. Value range for n: [0-31]. |
| `TCC_EA_ATOMIC_LEVEL[n]` | Req | The accumulated number of EA atomic requests in flight. This is primarily intended to measure average EA atomic latency. Average atomic latency = `TCC_PERF_SEL_EA_WRREQ_ATOMIC_LEVEL`/`TCC_PERF_SEL_EA_WRREQ_ATOMIC`. Value range for n: [0-31]. |
| `TCC_EA_RDREQ[n]` | Req | Number of 32-byte or 64-byte read requests to EA. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_32B[n]` | Req | Number of 32-byte read requests to EA. Value range for n: [0-31]. |
| `TCC_EA_RD_UNCACHED_32B[n]` | Req | Number of 32-byte EA reads due to uncached traffic. A 64-byte request is counted as 2. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_IO_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of IO credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_GMI_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of GMI credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL[n]` | Cycles | Number of cycles there is a stall due to the read request interface running out of DRAM credits. Stalls occur irrespective of the need for a read to be performed. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_LEVEL[n]` | Req | The accumulated number of EA read requests in flight. This is primarily intended to measure average EA read latency. Average read latency = `TCC_PERF_SEL_EA_RDREQ_LEVEL`/`TCC_PERF_SEL_EA_RDREQ`. Value range for n: [0-31]. |
| `TCC_EA_RDREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA read requests to High Bandwidth Memory (HBM). Value range for n: [0-31]. |
| `TCC_EA_WRREQ_DRAM[n]` | Req | Number of 32-byte or 64-byte EA write requests to HBM. Value range for n: [0-31]. |
| `TCC_TAG_STALL[n]` | Cycles | Number of cycles the normal request pipeline in the tag is stalled for any reason. Stalls of this nature are normally measured at exactly one point in the pipeline; for this counter, however, probes can stall the pipeline at a variety of places, and there is no single point that can reasonably measure the total stalls accurately. Value range for n: [0-31]. |
| `TCC_NORMAL_WRITEBACK[n]` | Req | Number of writebacks due to requests that are not writeback requests. Value range for n: [0-31]. |
| `TCC_ALL_TC_OP_WB_WRITEBACK[n]` | Req | Number of writebacks due to all `TC_OP` writeback requests. Value range for n: [0-31]. |
| `TCC_NORMAL_EVICT[n]` | Req | Number of evictions due to requests that are not invalidate or probe requests. Value range for n: [0-31]. |
| `TCC_ALL_TC_OP_INV_EVICT[n]` | Req | Number of evictions due to all `TC_OP` invalidate requests. Value range for n: [0-31]. |
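The `*_LEVEL` counters are accumulators, so dividing them by the matching request counts yields average latencies, as the definitions above note. A sketch under the same assumptions as the previous examples (aggregated `_sum` counters collected into a headered CSV):
```shell
cat > tcc_counters.txt << 'EOF'
pmc: TCC_EA_RDREQ_LEVEL_sum TCC_EA_RDREQ_sum TCC_HIT_sum TCC_MISS_sum
EOF
rocprof -i tcc_counters.txt -o tcc_results.csv ./my_app
# Average EA read latency and L2 hit rate per dispatch.
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
{
  rd  = $col["TCC_EA_RDREQ_sum"]
  acc = $col["TCC_HIT_sum"] + $col["TCC_MISS_sum"]
  if (rd > 0)  printf "dispatch %d: avg EA read latency %.1f cycles\n", NR - 1, $col["TCC_EA_RDREQ_LEVEL_sum"] / rd
  if (acc > 0) printf "dispatch %d: L2 hit rate %.1f%%\n", NR - 1, 100 * $col["TCC_HIT_sum"] / acc
}' tcc_results.csv
```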
## MI200 derived metrics list
| Derived Metric | Description |
|:----------------|:-------------------------------------------------------------------------------------|
| `ALUStalledByLDS` | Percentage of GPU time ALU units are stalled due to the LDS input queue being full or the output queue not being ready. Reduce this by reducing the LDS bank conflicts or the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad). |
| `FetchSize` | Total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
| `FlatLDSInsts` | Average number of FLAT instructions that read from or write to LDS, executed per work item (affected by flow control). |
| `FlatVMemInsts` | Average number of FLAT instructions that read from or write to the video memory, executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch. |
| `GDSInsts` | Average number of GDS read/write instructions executed per work item (affected by flow control). |
| `GPUBusy` | Percentage of time GPU is busy. |
| `L2CacheHit` | Percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal). |
| `LDSBankConflict` | Percentage of GPU time LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
| `LDSInsts` | Average number of LDS read/write instructions executed per work item (affected by flow control). Excludes FLAT instructions that read from or write to LDS. |
| `MemUnitBusy` | Percentage of GPU time the memory unit is active. The result includes the stall time (`MemUnitStalled`). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
| `MemUnitStalled` | Percentage of GPU time the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
| `MemWrites32B` | Total number of effective 32B write transactions to the memory. |
| `SALUBusy` | Percentage of GPU time scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
| `SALUInsts` | Average number of scalar ALU instructions executed per work item (affected by flow control). |
| `SFetchInsts` | Average number of scalar fetch instructions from the video memory executed per work item (affected by flow control). |
| `TA_ADDR_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA address path is stalled by TC, over all TA instances. |
| `TA_ADDR_STALLED_BY_TD_CYCLES_sum` | Total number of cycles TA address path is stalled by TD, over all TA instances. |
| `TA_BUFFER_WAVEFRONTS_sum` | Total number of buffer wavefronts processed by all TA instances. |
| `TA_BUFFER_READ_WAVEFRONTS_sum` | Total number of buffer read wavefronts processed by all TA instances. |
| `TA_BUFFER_WRITE_WAVEFRONTS_sum` | Total number of buffer write wavefronts processed by all TA instances. |
| `TA_BUFFER_ATOMIC_WAVEFRONTS_sum` | Total number of buffer atomic wavefronts processed by all TA instances. |
| `TA_BUFFER_TOTAL_CYCLES_sum` | Total number of buffer cycles (including read and write) issued to TC by all TA instances. |
| `TA_BUFFER_COALESCED_READ_CYCLES_sum` | Total number of coalesced buffer read cycles issued to TC by all TA instances. |
| `TA_BUFFER_COALESCED_WRITE_CYCLES_sum` | Total number of coalesced buffer write cycles issued to TC by all TA instances. |
| `TA_BUSY_avr` | Average number of busy cycles over all TA instances. |
| `TA_BUSY_max` | Maximum number of TA busy cycles over all TA instances. |
| `TA_BUSY_min` | Minimum number of TA busy cycles over all TA instances. |
| `TA_DATA_STALLED_BY_TC_CYCLES_sum` | Total number of cycles TA data path is stalled by TC, over all TA instances. |
| `TA_FLAT_READ_WAVEFRONTS_sum` | Total number of flat opcode read wavefronts processed by all TA instances. |
| `TA_FLAT_WRITE_WAVEFRONTS_sum` | Total number of flat opcode write wavefronts processed by all TA instances. |
| `TA_FLAT_WAVEFRONTS_sum` | Total number of flat opcode wavefronts processed by all TA instances. |
| `TA_FLAT_ATOMIC_WAVEFRONTS_sum` | Total number of flat opcode atomic wavefronts processed by all TA instances. |
| `TA_TA_BUSY_sum` | Total number of TA busy cycles over all TA instances. |
| `TA_TOTAL_WAVEFRONTS_sum` | Total number of wavefronts processed by all TA instances. |
| `TCA_BUSY_sum` | Total number of cycles TCA has a pending request, over all TCA instances. |
| `TCA_CYCLE_sum` | Total number of cycles over all TCA instances. |
| `TCC_ALL_TC_OP_WB_WRITEBACK_sum` | Total number of writebacks due to all TC_OP writeback requests, over all TCC instances. |
| `TCC_ALL_TC_OP_INV_EVICT_sum` | Total number of evictions due to all TC_OP invalidate requests, over all TCC instances. |
| `TCC_ATOMIC_sum` | Total number of L2 cache atomic requests of all types, over all TCC instances. |
| `TCC_BUSY_avr` | Average number of L2 cache busy cycles, over all TCC instances. |
| `TCC_BUSY_sum` | Total number of L2 cache busy cycles, over all TCC instances. |
| `TCC_CC_REQ_sum` | Total number of CC requests over all TCC instances. |
| `TCC_CYCLE_sum` | Total number of L2 cache free running clocks, over all TCC instances. |
| `TCC_EA_WRREQ_sum` | Total number of 32-byte and 64-byte transactions going over the TC_EA_wrreq interface, over all TCC instances. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands. |
| `TCC_EA_WRREQ_64B_sum` | Total number of 64-byte transactions (write or `CMPSWAP`) going over the TC_EA_wrreq interface, over all TCC instances. |
| `TCC_EA_WR_UNCACHED_32B_sum` | Total number of 32-byte write/atomic requests going over the TC_EA_wrreq interface due to uncached traffic, over all TCC instances. Note that CC mtypes can produce uncached requests, which are included here. A 64-byte request is counted as 2. |
| `TCC_EA_WRREQ_STALL_sum` | Total number of cycles a write request is stalled, over all TCC instances. |
| `TCC_EA_WRREQ_IO_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of IO credits, over all TCC instances. |
| `TCC_EA_WRREQ_GMI_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of GMI credits, over all TCC instances. |
| `TCC_EA_WRREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles an EA write request is stalled due to the interface running out of DRAM credits, over all TCC instances. |
| `TCC_EA_WRREQ_LEVEL_sum` | Total number of EA write requests in flight over all TCC instances. |
| `TCC_EA_RDREQ_LEVEL_sum` | Total number of EA read requests in flight over all TCC instances. |
| `TCC_EA_ATOMIC_sum` | Total number of 32-byte or 64-byte atomic requests going over the TC_EA_wrreq interface, over all TCC instances. |
| `TCC_EA_ATOMIC_LEVEL_sum` | Total number of EA atomic requests in flight, over all TCC instances. |
| `TCC_EA_RDREQ_sum` | Total number of 32-byte or 64-byte read requests to EA, over all TCC instances. |
| `TCC_EA_RDREQ_32B_sum` | Total number of 32-byte read requests to EA, over all TCC instances. |
| `TCC_EA_RD_UNCACHED_32B_sum` | Total number of 32-byte EA reads due to uncached traffic, over all TCC instances. |
| `TCC_EA_RDREQ_IO_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of IO credits, over all TCC instances. |
| `TCC_EA_RDREQ_GMI_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of GMI credits, over all TCC instances. |
| `TCC_EA_RDREQ_DRAM_CREDIT_STALL_sum` | Total number of cycles there is a stall due to the read request interface running out of DRAM credits, over all TCC instances. |
| `TCC_EA_RDREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA read requests to HBM, over all TCC instances. |
| `TCC_EA_WRREQ_DRAM_sum` | Total number of 32-byte or 64-byte EA write requests to HBM, over all TCC instances. |
| `TCC_HIT_sum` | Total number of L2 cache hits over all TCC instances. |
| `TCC_MISS_sum` | Total number of L2 cache misses over all TCC instances. |
| `TCC_NC_REQ_sum` | Total number of NC requests over all TCC instances. |
| `TCC_NORMAL_WRITEBACK_sum` | Total number of writebacks due to requests that are not writeback requests, over all TCC instances. |
| `TCC_NORMAL_EVICT_sum` | Total number of evictions due to requests that are not invalidate or probe requests, over all TCC instances. |
| `TCC_PROBE_sum` | Total number of probe requests over all TCC instances. |
| `TCC_PROBE_ALL_sum` | Total number of external probe requests with EA_TCC_preq_all == 1, over all TCC instances. |
| `TCC_READ_sum` | Total number of L2 cache read requests (including compressed reads but not metadata reads) over all TCC instances. |
| `TCC_REQ_sum` | Total number of all types of L2 cache requests over all TCC instances. |
| `TCC_RW_REQ_sum` | Total number of RW requests over all TCC instances. |
| `TCC_STREAMING_REQ_sum` | Total number of L2 cache streaming requests over all TCC instances. |
| `TCC_TAG_STALL_sum` | Total number of cycles the normal request pipeline in the tag is stalled for any reason, over all TCC instances. |
| `TCC_TOO_MANY_EA_WRREQS_STALL_sum` | Total number of cycles L2 cache is unable to send an EA write request due to it reaching its maximum capacity of pending EA write requests, over all TCC instances. |
| `TCC_UC_REQ_sum` | Total number of UC requests over all TCC instances. |
| `TCC_WRITE_sum` | Total number of L2 cache write requests over all TCC instances. |
| `TCC_WRITEBACK_sum` | Total number of lines written back to the main memory including writebacks of dirty lines and uncached write/atomic requests, over all TCC instances. |
| `TCC_WRREQ_STALL_max` | Maximum number of cycles a write request is stalled, over all TCC instances. |
| `TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on an atomic, over all TCP instances. |
| `TCP_GATE_EN1_sum` | Total number of cycles vL1D interface clocks are turned on, over all TCP instances. |
| `TCP_GATE_EN2_sum` | Total number of cycles vL1D core clocks are turned on, over all TCP instances. |
| `TCP_PENDING_STALL_CYCLES_sum` | Total number of cycles vL1D cache is stalled due to data pending from L2 Cache, over all TCP instances. |
| `TCP_READ_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a read, over all TCP instances. |
| `TCP_TA_TCP_STATE_READ_sum` | Total number of state reads by all TCP instances. |
| `TCP_TCC_ATOMIC_WITH_RET_REQ_sum` | Total number of atomic requests to L2 cache with return, over all TCP instances. |
| `TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum` | Total number of atomic requests to L2 cache without return, over all TCP instances. |
| `TCP_TCC_CC_READ_REQ_sum` | Total number of CC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_CC_WRITE_REQ_sum` | Total number of CC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_CC_ATOMIC_REQ_sum` | Total number of CC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_READ_REQ_sum` | Total number of NC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_WRITE_REQ_sum` | Total number of NC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_NC_ATOMIC_REQ_sum` | Total number of NC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_READ_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for reads and atomics with return for all TCP instances. |
| `TCP_TCC_READ_REQ_sum` | Total number of read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_READ_REQ_sum` | Total number of RW read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_WRITE_REQ_sum` | Total number of RW write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_RW_ATOMIC_REQ_sum` | Total number of RW atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_READ_REQ_sum` | Total number of UC read requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_WRITE_REQ_sum` | Total number of UC write requests to L2 cache, over all TCP instances. |
| `TCP_TCC_UC_ATOMIC_REQ_sum` | Total number of UC atomic requests to L2 cache, over all TCP instances. |
| `TCP_TCC_WRITE_REQ_LATENCY_sum` | Total vL1D to L2 request latency over all wavefronts for writes and atomics without return for all TCP instances. |
| `TCP_TCC_WRITE_REQ_sum` | Total number of write requests to L2 cache, over all TCP instances. |
| `TCP_TCP_LATENCY_sum` | Total wave access latency to vL1D over all wavefronts for all TCP instances. |
| `TCP_TCR_TCP_STALL_CYCLES_sum` | Total number of cycles TCR stalls vL1D, over all TCP instances. |
| `TCP_TD_TCP_STALL_CYCLES_sum` | Total number of cycles TD stalls vL1D, over all TCP instances. |
| `TCP_TOTAL_ACCESSES_sum` | Total number of vL1D accesses, over all TCP instances. |
| `TCP_TOTAL_READ_sum` | Total number of vL1D read accesses, over all TCP instances. |
| `TCP_TOTAL_WRITE_sum` | Total number of vL1D write accesses, over all TCP instances. |
| `TCP_TOTAL_ATOMIC_WITH_RET_sum` | Total number of vL1D atomic requests with return, over all TCP instances. |
| `TCP_TOTAL_ATOMIC_WITHOUT_RET_sum` | Total number of vL1D atomic requests without return, over all TCP instances. |
| `TCP_TOTAL_CACHE_ACCESSES_sum` | Total number of vL1D cache accesses (including hits and misses) by all TCP instances. |
| `TCP_TOTAL_WRITEBACK_INVALIDATES_sum` | Total number of vL1D writebacks and invalidates, over all TCP instances. |
| `TCP_UTCL1_PERMISSION_MISS_sum` | Total number of UTCL1 permission misses by all TCP instances. |
| `TCP_UTCL1_REQUEST_sum` | Total number of address translation requests to UTCL1 by all TCP instances. |
| `TCP_UTCL1_TRANSLATION_MISS_sum` | Total number of UTCL1 translation misses by all TCP instances. |
| `TCP_UTCL1_TRANSLATION_HIT_sum` | Total number of UTCL1 translation hits by all TCP instances. |
| `TCP_VOLATILE_sum` | Total number of L1 volatile pixels/buffers from TA, over all TCP instances. |
| `TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum` | Total number of cycles tagram conflict stalls on a write, over all TCP instances. |
| `TD_ATOMIC_WAVEFRONT_sum` | Total number of atomic wavefront instructions, over all TD instances. |
| `TD_COALESCABLE_WAVEFRONT_sum` | Total number of coalescable wavefronts according to TA, over all TD instances. |
| `TD_LOAD_WAVEFRONT_sum` | Total number of wavefront instructions (read/write/atomic), over all TD instances. |
| `TD_SPI_STALL_sum` | Total number of cycles TD is stalled by SPI, over all TD instances. |
| `TD_STORE_WAVEFRONT_sum` | Total number of write wavefront instructions, over all TD instances. |
| `TD_TC_STALL_sum` | Total number of cycles TD is stalled waiting for TC data, over all TD instances. |
| `TD_TD_BUSY_sum` | Total number of TD busy cycles while it is processing or waiting for data, over all TD instances. |
| `VALUBusy` | Percentage of GPU time vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
| `VALUInsts` | Average number of vector ALU instructions executed per work item (affected by flow control). |
| `VALUUtilization` | Percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
| `VFetchInsts` | Average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory. |
| `VWriteInsts` | Average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory. |
| `Wavefronts` | Total wavefronts. |
| `WRITE_REQ_32B` | Total number of 32-byte effective memory writes. |
| `WriteSize` | Total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
| `WriteUnitStalled` | Percentage of GPU time the write unit is stalled. Value range: 0% to 100% (bad). |
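Derived metrics can typically be requested by name, in which case the profiler combines the underlying hardware counters for you. A minimal sketch, assuming the metric names below are defined for the target device (`rocprof --list-derived` shows the supported set):
```shell
cat > derived_metrics.txt << 'EOF'
pmc: VALUUtilization VALUBusy L2CacheHit MemUnitStalled LDSBankConflict
EOF
rocprof -i derived_metrics.txt -o derived_results.csv ./my_app
```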
## Abbreviations
| Abbreviation | Meaning |
|:------------|:--------------------------------------------------------------------------------|
| `ALU` | Arithmetic Logic Unit |
| `Arb` | Arbiter |
| `BF16` | Brain Floating Point - 16 bits |
| `CC` | Coherently Cached |
| `CP` | Command Processor |
| `CPC` | Command Processor - Compute |
| `CPF` | Command Processor - Fetcher |
| `CS` | Compute Shader |
| `CSC` | Compute Shader Controller |
| `CSn` | Compute Shader, the n-th pipe |
| `CU` | Compute Unit |
| `DW` | 32-bit Data Word, DWORD |
| `EA` | Efficiency Arbiter |
| `F16` | Half Precision Floating Point |
| `F32` | Full Precision Floating Point |
| `FLAT` | FLAT instructions allow read/write/atomic access to a generic memory address pointer, which can resolve to any of the following physical memories:<br>• Global Memory<br>• Scratch ("private")<br>• LDS ("shared")<br>• Invalid - MEM_VIOL TrapStatus |
| `FMA` | Fused Multiply Add |
| `GDS` | Global Data Share |
| `GRBM` | Graphics Register Bus Manager |
| `HBM` | High Bandwidth Memory |
| `Instr` | Instructions |
| `IOP` | Integer Operation |
| `L2` | Level-2 Cache |
| `LDS` | Local Data Share |
| `ME1` | Micro Engine, running packet processing firmware on CPC |
| `MFMA` | Matrix Fused Multiply Add |
| `NC` | Noncoherently Cached |
| `RW` | Coherently Cached with Write |
| `SALU` | Scalar ALU |
| `SGPR` | Scalar General Purpose Register |
| `SIMD` | Single Instruction Multiple Data |
| `sL1D` | Scalar Level-1 Data Cache |
| `SMEM` | Scalar Memory |
| `SPI` | Shader Processor Input |
| `SQ` | Sequencer |
| `TA` | Texture Addressing Unit |
| `TC` | Texture Cache |
| `TCA` | Texture Cache Arbiter |
| `TCC` | Texture Cache per Channel, known as L2 Cache |
| `TCIU` | Texture Cache Interface Unit (interface between CP and the memory system) |
| `TCP` | Texture Cache per Pipe, known as vector L1 Cache |
| `TCR` | Texture Cache Router |
| `TD` | Texture Data Unit |
| `UC` | Uncached |
| `UTCL1` | Unified Translation Cache - Level 1 |
| `UTCL2` | Unified Translation Cache - Level 2 |
| `VALU` | Vector ALU |
| `VGPR` | Vector General Purpose Register |
| `vL1D` | Vector Level-1 Data Cache |
| `VMEM` | Vector Memory |

View File

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="AMD Instinct MI250 microarchitecture">
<meta name="keywords" content="Instinct, MI250, microarchitecture, AMD, ROCm">
</head>
# AMD Instinct™ MI250 microarchitecture
The microarchitecture of the AMD Instinct MI250 accelerators is based on the

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="GPU isolation techniques">
<meta name="keywords" content="GPU isolation techniques, UUID, universally unique identifier,
environment variables, virtual machines, AMD, ROCm">
</head>
# GPU isolation techniques
Restricting the access of applications to a subset of GPUs, aka isolating
@@ -22,7 +29,7 @@ A list of device indices or {abbr}`UUID (universally unique identifier)`s
that will be exposed to applications.
Runtime
: ROCm Platform Runtime. Applies to all applications using the user mode ROCm
: ROCm Software Runtime. Applies to all applications using the user mode ROCm
software stack.
```{code-block} shell

View File

@@ -1,9 +1,16 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="GPU memory">
<meta name="keywords" content="GPU memory, VRAM, video random access memory, pageable
memory, pinned memory, managed memory, AMD, ROCm">
</head>
# GPU memory
For the HIP reference documentation, see:
* {doc}`hip:.doxygen/docBin/html/group___memory`
* {doc}`hip:.doxygen/docBin/html/group___memory_m`
* {doc}`hip:doxygen/html/group___memory`
* {doc}`hip:doxygen/html/group___memory_m`
Host memory exists on the host (e.g. CPU) of the machine in random access memory (RAM).
@@ -170,8 +177,8 @@ Fine-grained memory implies that up-to-date data may be made visible to others r
| API | Flag | Coherence |
|-------------------------|------------------------------|----------------|
| `hipExtMallocWithFlags` | `hipHostMallocDefault` | Fine-grained |
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Coarse-grained |
| `hipExtMallocWithFlags` | `hipDeviceMallocDefault` | Coarse-grained |
| `hipExtMallocWithFlags` | `hipDeviceMallocFinegrained` | Fine-grained |
| API | `hipMemAdvise` argument | Coherence |
|-------------------------|------------------------------|----------------|

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Using the LLVM ASan on a GPU">
<meta name="keywords" content="LLVM, ASan, address sanitizer, AddressSanitizer, instrumented
libraries, instrumented applications, AMD, ROCm">
</head>
# Using the LLVM ASan on a GPU (beta release)
The LLVM AddressSanitizer (ASan) provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
@@ -7,7 +14,9 @@ Until now, the LLVM ASan process was only available for traditional purely CPU a
This document provides documentation on using ROCm ASan.
For information about LLVM ASan, see the [LLVM documentation](https://clang.llvm.org/docs/AddressSanitizer.html).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
:::{note}
The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
:::
## Compiling for ASan

View File

@@ -8,15 +8,11 @@ import shutil
import jinja2
import os
from rocm_docs import ROCmDocs
# Environement to process Jinja templates.
# Environment to process Jinja templates.
jinja_env = jinja2.Environment(loader=jinja2.FileSystemLoader("."))
# Jinja templates to render out.
templates = [
]
templates = []
# Render templates and output files without the last extension.
# For example: 'install.md.jinja' becomes 'install.md'.
@@ -42,9 +38,9 @@ latex_elements = {
# configurations for PDF output by Read the Docs
project = "ROCm Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2023 Advanced Micro Devices, Inc. All rights reserved."
version = "5.7.1"
release = "5.7.1"
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
version = "6.0.1"
release = "6.0.1"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_author = ""
@@ -54,7 +50,7 @@ article_pages = [
{
"file":"release",
"os":["linux", "windows"],
"date":"2023-07-27"
"date":"2024-01-09"
},
{"file":"install/windows/install-quick", "os":["windows"]},
@@ -74,9 +70,6 @@ article_pages = [
{"file":"install/windows/cli/index", "os":["windows"]},
{"file":"install/windows/gui/index", "os":["windows"]},
{"file":"about/compatibility/linux-support", "os":["linux"]},
{"file":"about/compatibility/windows-support", "os":["windows"]},
{"file":"about/compatibility/docker-image-support-matrix", "os":["linux"]},
{"file":"about/compatibility/user-kernel-space-compat-matrix", "os":["linux"]},
@@ -89,19 +82,22 @@ article_pages = [
{"file":"rocm-a-z", "os":["linux", "windows"]},
{"file":"about/release-notes", "os":["linux"]},
]
exclude_patterns = ['temp']
external_toc_path = "./sphinx/_toc.yml"
docs_core = ROCmDocs("ROCm Documentation")
docs_core.setup()
extensions = ["rocm_docs"]
external_projects_current_project = "rocm"
for sphinx_var in ROCmDocs.SPHINX_VARS:
globals()[sphinx_var] = getattr(docs_core, sphinx_var)
html_theme = "rocm_docs_theme"
html_theme_options = {"flavor": "rocm-docs-home"}
html_title = "ROCm Documentation"
html_theme_options = {
"link_main_doc": False
}

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Building ROCm documentation">
<meta name="keywords" content="documentation, Visual Studio Code, GitHub, command line,
AMD, ROCm">
</head>
# Building documentation
You can build our documentation via GitHub (in a pull request) or locally (using the command line or

View File

@@ -0,0 +1,229 @@
# Contributing to ROCm documentation
AMD values and encourages contributions to our code and documentation. If you choose to
contribute, we encourage you to be polite and respectful. Improving documentation is a long-term
process, to which we are dedicated.
If you have issues when trying to contribute, refer to the
[discussions](https://github.com/RadeonOpenCompute/ROCm/discussions) page in our GitHub
repository.
## Folder structure and naming convention
Our documentation follows the Pitchfork folder structure. Most documentation files are stored in the
`/docs` folder. Some special files (such as release, contributing, and changelog) are stored in the root
(`/`) folder.
All images are stored in the `/docs/data` folder. An image's file path mirrors that of the documentation
file where it is used.
Our naming structure uses kebab case; for example, `my-file-name.rst`.
## Supported formats and syntax
Our documentation includes both Markdown and RST files. We are gradually transitioning existing
Markdown to RST in order to more effectively meet our documentation needs. When contributing,
RST is preferred; if you must use Markdown, use GitHub-flavored Markdown.
We use [Sphinx Design](https://sphinx-design.readthedocs.io/en/latest/index.html) syntax and compile
our API references using [Doxygen](https://www.doxygen.nl/).
The following table shows some common documentation components and the syntax convention we
use for each:
<table>
<tr>
<th>Component</th>
<th>RST syntax</th>
</tr>
<tr>
<td>Code blocks</td>
<td>
```rst
.. code-block:: language-name
My code block.
```
</td>
</tr>
<tr>
<td>Cross-referencing internal files</td>
<td>
```rst
:doc:`Title <../path/to/file/filename>`
```
</td>
</tr>
<tr>
<td>External links</td>
<td>
```rst
`link name <URL>`_
```
</td>
</tr>
<tr>
<tr>
<td>Headings</td>
<td>
```rst
******************
Chapter title (H1)
******************
Section title (H2)
==================
Subsection title (H3)
---------------------
Sub-subsection title (H4)
^^^^^^^^^^^^^^^^^^^^^^^^^
```
</td>
</tr>
<tr>
<td>Images</td>
<td>
```rst
.. image:: image1.png
```
</td>
</tr>
<tr>
<td>Internal links</td>
<td>
```rst
1. Add a tag to the section you want to reference:
.. _section-1:
Section 1
==========
2. Link to your tag:
As shown in :ref:`section-1`.
```
</td>
</tr>
<tr>
<td>Lists</td>
<td>
```rst
#. Ordered (numbered) list item
* Unordered (bulleted) list item
```
</td>
</tr>
<tr>
<td>Math (block)</td>
<td>
```rst
.. math::

   A = \begin{pmatrix}
   0.0 & 1.0 & 1.0 & 3.0 \\
   4.0 & 5.0 & 6.0 & 7.0 \\
   \end{pmatrix}
```
</td>
</tr>
<tr>
<td>Math (inline)</td>
<td>
```rst
:math:`2 \times 2`
```
</td>
</tr>
<tr>
<td>Notes</td>
<td>
```rst
.. note::

   My note here.
```
</td>
</tr>
<tr>
<td>Tables</td>
<td>
```rst
.. csv-table:: Optional title here
   :widths: 30, 70
   :header: "entry1 header", "entry2 header"

   "entry1", "entry2"
```
</td>
</tr>
</table>
## Language and style
We use the
[Google developer documentation style guide](https://developers.google.com/style/highlights) to
guide our content.
Font size and type, page layout, white space control, and other formatting
details are controlled via
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core). If you want to notify us
of any formatting issues, create a pull request in our
[rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) GitHub repository.
## Building our documentation
<!-- TODO: Fix the link so it works from every file -->
To learn how to build our documentation, refer to
[Building documentation](./building.md).

View File

@@ -1,4 +1,10 @@
# How to provide feedback for ROCm documentation
<head>
<meta charset="UTF-8">
<meta name="description" content="Providing feedback for ROCm documentation">
<meta name="keywords" content="documentation, pull request, GitHub, AMD, ROCm">
</head>
# Providing feedback for ROCm documentation
There are four standard ways to provide feedback for this repository.

View File

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm documentation toolchain">
<meta name="keywords" content="documentation, toolchain, Sphinx, Doxygen, MyST, AMD, ROCm">
</head>
# ROCm documentation toolchain
Our documentation relies on several open source toolchains and sites.

View File

@@ -1,15 +1,22 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Deep learning using ROCm">
<meta name="keywords" content="deep learning, frameworks, installation, PyTorch, TensorFlow,
MAGMA, AMD, ROCm">
</head>
# Deep learning guide
The following sections cover the different framework installations for ROCm and
deep-learning applications. The following image shows
the sequential flow for using each framework. For each framework's most current
release notes, refer to
[Third party support](../about/compatibility/3rd-party-support-matrix.md).
{doc}`Third-party support<rocm-install-on-linux:reference/3rd-party-support-matrix>`.
![ROCm Compatible Frameworks Flowchart](../data/install/magma-install/magma005.png "ROCm Compatible Frameworks")
## Frameworks installation
* [Installing PyTorch](../install/pytorch-install.md)
* [Installing TensorFlow](../install/tensorflow-install.md)
* [Installing MAGMA](../install/magma-install.md)
* {doc}`PyTorch for ROCm<rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`

View File

@@ -1,189 +0,0 @@
# GPU-enabled MPI
The Message Passing Interface ([MPI](https://www.mpi-forum.org)) is a standard
API for distributed and parallel application development that can scale to
multi-node clusters. To facilitate the porting of applications to clusters with
GPUs, ROCm enables various technologies. These technologies allow users to
directly use GPU pointers in MPI calls and enable ROCm-aware MPI libraries to
deliver optimal performance for both intra-node and inter-node GPU-to-GPU
communication.
The AMD kernel driver exposes Remote Direct Memory Access (RDMA) through the
*PeerDirect* interfaces to allow Host Channel Adapters (HCA, a type of
Network Interface Card or NIC) to directly read and write to the GPU device
memory with RDMA capabilities. These interfaces are currently registered as a
*peer_memory_client* with Mellanox's OpenFabrics Enterprise Distribution (OFED)
`ib_core` kernel module to allow high-speed DMA transfers between GPU and HCA.
These interfaces are used to optimize inter-node MPI message communication.
This chapter shows how to set up Open MPI with the ROCm platform. The Open
MPI project is an open source implementation of the MPI standard, developed and maintained by a consortium of academic, research,
and industry partners.
Several MPI implementations can be made ROCm-aware by compiling them with
[Unified Communication Framework](https://www.openucx.org/) (UCX) support. One
notable exception is MVAPICH2: It directly supports AMD GPUs without using UCX,
and you can download it [here](http://mvapich.cse.ohio-state.edu/downloads/).
Use the latest version of the MVAPICH2-GDR package.
The Unified Communication Framework is an open source, cross-platform framework
whose goal is to provide a common set of communication interfaces targeting a
broad set of network programming models and interfaces. UCX is ROCm-aware, and
ROCm technologies are used directly to implement various network operation
primitives. For more details on the UCX design, refer to its
[documentation](https://www.openucx.org/documentation).
## Building UCX
The following section describes how to set up UCX so it can be used to compile
Open MPI. Set the following environment variables so that all software
components are installed in the same base directory (this example installs
them in your home directory; for other locations, adjust the environment
variables accordingly and make sure you have write permission for that
location):
```shell
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
```
```{note}
The following sequences of build commands assume that either the ROCmCC or the AOMP
compiler is active in the environment in which the commands are executed.
```
## Install UCX
The next step is to set up UCX by compiling its source code and installing it:
```shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.14.1
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
```
The [communication libraries tables](../reference/library-index.md)
document the compatibility of UCX versions with ROCm versions.
## Install Open MPI
These are the steps to build Open MPI:
```shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
```
## ROCm-enabled OSU
The OSU Micro Benchmarks v5.9 (OMB) can be used to evaluate the performance of
various primitives with an AMD GPU device and ROCm support. This functionality
is exposed when configured with the `--enable-rocm` option. Use the following
steps to compile OMB:
```shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.9.tar.gz
tar xfz osu-micro-benchmarks-5.9.tar.gz
cd osu-micro-benchmarks-5.9
./configure --prefix=$INSTALL_DIR/osu --enable-rocm \
--with-rocm=/opt/rocm \
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)
```
## Intra-node run
Before running an Open MPI job, it is essential to set some environment variables to
ensure that the correct versions of Open MPI and UCX are being used.
```shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
```
The following command runs the OSU bandwidth benchmark between the first two GPU
devices (i.e., GPU 0 and GPU 1, same OAM) by default inside the same node. It
measures the unidirectional bandwidth from the first device to the other.
```shell
$OMPI_DIR/bin/mpirun -np 2 \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx mpi/pt2pt/osu_bw -d rocm D D
```
To select different devices, for example 2 and 3, and to force using a copy kernel
instead of a DMA engine for the data transfer, set the following environment variables:
```shell
export HIP_VISIBLE_DEVICES=2,3
export HSA_ENABLE_SDMA=0
```
The following output shows the effective transfer bandwidth measured for
inter-die data transfer between GPU device 2 and 3 (same OAM). For messages
larger than 67 MB, an effective utilization of about 150 GB/sec is achieved, which
corresponds to 75% of the peak transfer bandwidth of 200 GB/sec for that
connection:
![OSU execution showing transfer bandwidth increasing alongside payload increase](../data/how-to/gpu-enabled-mpi-1.png "Inter-GPU bandwidth with various payload sizes")
## Collective operations
Collective operations on GPU buffers are best handled through the
Unified Collective Communication (UCC) library component in Open MPI.
For this, the UCC library has to be configured and compiled with ROCm
support.
For UCC version compatibility with the various ROCm versions, refer to the
compatibility tables in the [communication libraries](../reference/library-index.md).
An example of configuring UCC and Open MPI with ROCm support
is shown below:
```shell
export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git
cd ucc
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--prefix=$UCC_DIR
make -j && make install
# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd ompi
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--with-ucc=$UCC_DIR \
--prefix=$OMPI_DIR
```
Using the UCC component with an MPI application requires setting some
additional parameters:
```shell
mpirun --mca pml ucx --mca osc ucx \
--mca coll_ucc_enable 1 \
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
```

View File

@@ -0,0 +1,264 @@
.. meta::
:description: GPU-enabled Message Passing Interface
:keywords: Message Passing Interface, MPI, AMD, ROCm
***************************************************************************************************
GPU-enabled Message Passing Interface
***************************************************************************************************
The Message Passing Interface (`MPI <https://www.mpi-forum.org>`_) is a standard API for distributed
and parallel application development that can scale to multi-node clusters. To facilitate the porting of
applications to clusters with GPUs, ROCm enables various technologies. You can use these
technologies to add GPU pointers to MPI calls and enable ROCm-aware MPI libraries to deliver optimal
performance for both intra-node and inter-node GPU-to-GPU communication.
The AMD kernel driver exposes remote direct memory access (RDMA) through *PeerDirect* interfaces.
This allows network interface cards (NICs) to directly read and write to RDMA-capable GPU device
memory, resulting in high-speed direct memory access (DMA) transfers between GPU and NIC. These
interfaces are used to optimize inter-node MPI message communication.
The Open MPI project is an open source implementation of the MPI standard. It's developed and maintained by
a consortium of academic, research, and industry partners. To compile Open MPI with ROCm support,
refer to the following sections:
* :ref:`open-mpi-ucx`
* :ref:`open-mpi-libfabric`
.. _open-mpi-ucx:
ROCm-aware Open MPI on InfiniBand and RoCE networks using UCX
================================================================
The `Unified Communication Framework <https://www.openucx.org/documentation>`_ (UCX) is an
open source, cross-platform framework designed to provide a common set of communication
interfaces for various network programming models and interfaces. UCX uses ROCm technologies to
implement various network operation primitives. UCX is the standard communication library for
InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects. To optimize data
transfer operations, many MPI libraries, including Open MPI, can leverage UCX internally.
UCX and Open MPI have a compile option to enable ROCm support. To install and configure UCX to compile Open MPI for ROCm, use the following instructions.
1. Set environment variables to install all software components in the same base directory. We use the
home directory in our example, but you can specify a different location if you want.
.. code-block:: shell
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
2. Install UCX. To view UCX and ROCm version compatibility, refer to the
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.
.. code-block:: shell
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
3. Install Open MPI.
.. code-block:: shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make install
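As an optional sanity check (not part of the original instructions), you can confirm that the
UCX build detected ROCm by listing the transports and memory domains it provides:

.. code-block:: shell

   # A ROCm-aware UCX build should list ROCm memory domains and transports
   $UCX_DIR/bin/ucx_info -d | grep -i rocm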
.. _rocm-enabled-osu:
ROCm-enabled OSU benchmarks
---------------------------------------------------------------------------------------------------------------
You can use OSU Micro Benchmarks (OMB) to evaluate the performance of various primitives on
ROCm-supported AMD GPUs. The ``--enable-rocm`` option exposes this functionality.
.. code-block:: shell
export OSU_DIR=$INSTALL_DIR/osu
cd $BUILD_DIR
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.2.tar.gz
tar xfz osu-micro-benchmarks-7.2.tar.gz
cd osu-micro-benchmarks-7.2
./configure --enable-rocm \
--with-rocm=/opt/rocm \
CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx \
LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ \
$(hipconfig -C) -lamdhip64" CXXFLAGS="-std=c++11"
make -j $(nproc)
Intra-node run
----------------------------------------------------------------------------------------------------------------
Before running an Open MPI job, you must set the following environment variables to ensure that
you're using the correct versions of Open MPI and UCX.
.. code-block:: shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
To run the OSU bandwidth benchmark between the first two GPU devices (``GPU 0`` and ``GPU 1``)
inside the same node, use the following code.
.. code-block:: shell
$OMPI_DIR/bin/mpirun -np 2 \
-x UCX_TLS=sm,self,rocm \
--mca pml ucx \
./c/mpi/pt2pt/standard/osu_bw D D
This measures the unidirectional bandwidth from the first device (``GPU 0``) to the second device
(``GPU 1``). To select specific devices, for example ``GPU 2`` and ``GPU 3``, include the following
command:
.. code-block:: shell
export HIP_VISIBLE_DEVICES=2,3
To force using a copy kernel instead of a DMA engine for the data transfer, use the following
command:
.. code-block:: shell
export HSA_ENABLE_SDMA=0
The following output shows the effective transfer bandwidth measured for inter-die data transfer
between ``GPU 2`` and ``GPU 3`` on a system with MI250 GPUs. For messages larger than 67 MB, an effective
utilization of about 150 GB/sec is achieved:
.. image:: ../data/how-to/gpu-enabled-mpi-1.png
:width: 400
:alt: Inter-GPU bandwidth for various payload sizes
Collective operations
----------------------------------------------------------------------------------------------------------------
Collective operations on GPU buffers are best handled through the Unified Collective Communication
(UCC) library component in Open MPI. To accomplish this, you must configure and compile the UCC
library with ROCm support.
.. note::
You can verify UCC and ROCm version compatibility using the
`communication libraries tables <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/3rd-party-support-matrix.html>`_.
.. code-block:: shell
export UCC_DIR=$INSTALL_DIR/ucc
git clone https://github.com/openucx/ucc.git -b v1.2.x
cd ucc
./autogen.sh
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--prefix=$UCC_DIR
make -j && make install
# Configure and compile Open MPI with UCX, UCC, and ROCm support
cd ompi
./configure --with-rocm=/opt/rocm \
--with-ucx=$UCX_DIR \
--with-ucc=$UCC_DIR \
--prefix=$OMPI_DIR
To use the UCC component with an MPI application, you must set additional parameters:
.. code-block:: shell
mpirun --mca pml ucx --mca osc ucx \
--mca coll_ucc_enable 1 \
--mca coll_ucc_priority 100 -np 64 ./my_mpi_app
.. _open-mpi-libfabric:
ROCm-aware Open MPI using libfabric
================================================================
For network interconnects that are not covered in the previous section, such as HPE Slingshot,
ROCm-aware communication can often be achieved through the libfabric library. For more information,
refer to the `libfabric documentation <https://github.com/ofiwg/libfabric/wiki>`_.
.. note::
When using Open MPI v5.0.x with libfabric support, shared memory communication between
processes on the same node goes through the *ob1/sm* component. This component has
fundamental support for GPU memory, which is accomplished by using a staging host buffer.
Consequently, the performance of device-to-device shared memory communication is lower than
the theoretical peak performance allowed by the GPU-to-GPU interconnect.
1. Install libfabric. Note that libfabric is often pre-installed. To determine if it's already installed, run:
.. code-block:: shell
module avail libfabric
Alternatively, you can download and compile libfabric with ROCm support. Note that not all
components required to support some networks (e.g., HPE Slingshot) are available in the open source
repository. Therefore, using a pre-installed libfabric library is strongly recommended over compiling
libfabric manually.
If a pre-compiled libfabric library is available on your system, you can skip the following step.
2. Compile libfabric with ROCm support.
.. code-block:: shell
export OFI_DIR=$INSTALL_DIR/ofi
cd $BUILD_DIR
git clone https://github.com/ofiwg/libfabric.git -b v1.19.x
cd libfabric
./autogen.sh
./configure --prefix=$OFI_DIR \
--with-rocr=/opt/rocm
make -j $(nproc)
make install
Installing Open MPI with libfabric support
----------------------------------------------------------------------------------------------------------------
To build Open MPI with libfabric, use the following code:
.. code-block:: shell
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git \
-b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ofi=$OFI_DIR \
--with-rocm=/opt/rocm
make -j $(nproc)
make install
ROCm-aware OSU with Open MPI and libfabric
----------------------------------------------------------------------------------------------------------------
Compiling a ROCm-aware version of OSU benchmarks with Open MPI and libfabric uses the same
process described in :ref:`rocm-enabled-osu`.
To run an OSU benchmark using multiple nodes, use the following code:
.. code-block:: shell
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$OFI_DIR/lib64:/opt/rocm/lib
$OMPI_DIR/bin/mpirun -np 2 \
./c/mpi/pt2pt/standard/osu_bw D D

View File

@@ -1,6 +1,13 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="System debugging guide">
<meta name="keywords" content="debug, system-level debug, debug flags, PCIe debug, AMD,
ROCm">
</head>
# System debugging guide
## ROCm language and system level debug, flags, and environment variables
## ROCm language and system-level debug, flags, and environment variables
To avoid the Ethernet port getting renamed every time you change graphics cards, use the kernel options `net.ifnames=0 biosdevname=0`.
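As an illustrative sketch (not from the original guide), on many distributions you can apply these options by adding them to the kernel command line in the GRUB configuration and regenerating it:
```shell
# Append the options to GRUB_CMDLINE_LINUX in /etc/default/grub, for example:
# GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"
sudo update-grub  # Debian/Ubuntu; on RPM-based distros: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```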

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Tuning guides">
<meta name="keywords" content="high-performance computing, HPC, Instinct accelerators,
Radeon, tuning, tuning guide, AMD, ROCm">
</head>
# Tuning guides
Use case-specific system setup and tuning guides.

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="MI100 high-performance computing and tuning guide">
<meta name="keywords" content="MI100, high-performance computing, HPC, tuning, BIOS
settings, NBIO, AMD, ROCm">
</head>
# MI100 high-performance computing and tuning guide
## System settings
@@ -352,15 +359,15 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
[...]
```
Once the system is properly configured, the AMD ROCm platform can be
Once the system is properly configured, ROCm software can be
installed.
## System management
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
[Installing ROCm on Linux](../../install/linux/install.md). For verifying that the
installation was successful, refer to
{ref}`verifying-kernel-mode-driver-installation` and
{doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. To verify that the installation was
successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
[Validation Tools](../../reference/library-index.md). Should verification
fail, consult the [System Debugging Guide](../system-debugging.md).
@@ -405,7 +412,8 @@ SIMD pipelines, memory information, and Instruction Set Architecture:
![rocminfo output fragment on an 8*MI100 system](../../data/how-to/tuning-guides/tuning003.png "rocminfo output fragment on an 8*MI100 system")
For a complete list of architecture (LLVM target) names, refer to
[Linux support](../../about/compatibility/linux-support.md) and [Windows support](../../about/compatibility/windows-support.md).
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>` support.
### Testing inter-device bandwidth

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="MI200 high-performance computing and tuning guide">
<meta name="keywords" content="MI200, high-performance computing, HPC, tuning, BIOS
settings, NBIO, AMD, ROCm">
</head>
# MI200 high-performance computing and tuning guide
## System settings
@@ -27,7 +34,7 @@ Analogous settings for other non-AMI System BIOS providers could be set
similarly. For systems with Intel processors, some settings may not apply or be
available as listed in the following table.
```{list-table} Recommended settings for the system BIOS in a GIGABYTE platform.
```{list-table}
:header-rows: 1
:name: mi200-bios
@@ -337,15 +344,15 @@ If SMT is enabled by setting "CCD/Core/Thread Enablement > SMT Control" to
[...]
```
Once the system is properly configured, the AMD ROCm platform can be
Once the system is properly configured, ROCm software can be
installed.
## System management
For a complete guide on how to install/manage/uninstall ROCm on Linux, refer to
[Installing ROCm on Linux](../../install/linux/install.md). For verifying that the
installation was successful, refer to
{ref}`verifying-kernel-mode-driver-installation` and
{doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`. For verifying that the
installation was successful, refer to the
{doc}`post-install instructions<rocm-install-on-linux:how-to/native-install/post-install>` and
[Validation Tools](../../reference/library-index.md). Should verification
fail, consult the [System Debugging Guide](../system-debugging.md).
@@ -390,7 +397,8 @@ Instruction Set Architecture (ISA):
![rocminfo output fragment on an 8*MI200 system](../../data/how-to/tuning-guides/tuning010.png "'rocminfo' output fragment on an 8*MI200 system")
For a complete list of architecture (LLVM target) names, refer to GPU OS Support for
[Linux](../../about/compatibility/linux-support.md) and [Windows](../../about/compatibility/windows-support.md).
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`.
### Testing inter-device bandwidth

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="RDNA2 workstation tuning guide">
<meta name="keywords" content="RDNA2, workstation tuning, BIOS settings, installation, AMD,
ROCm">
</head>
# RDNA2 workstation tuning guide
## System settings
@@ -5,16 +12,16 @@
This chapter reviews system settings that are required to configure the system
for ROCm virtualization on RDNA2-based AMD Radeon™ PRO GPUs. Installing ROCm on
Bare Metal follows the routine ROCm
[installation procedure](../../install/linux/install.md).
{doc}`installation procedure<rocm-install-on-linux:how-to/native-install/index>`.
To enable ROCm virtualization on V620, you must set up Single Root I/O
Virtualization (SR-IOV) in the BIOS using the settings described in
{ref}`bios-settings`. A tested configuration is described in
{ref}`os-settings`.
```{attention}
:::{attention}
SR-IOV is supported on V620 and unsupported on W6800.
```
:::
(bios-settings)=
@@ -160,6 +167,6 @@ First, assign GPU virtual function (VF) to VM using the following steps.
Then start the VM.
Finally, install ROCm on the virtual machine (VM). For detailed instructions,
refer to the [ROCm Installation Guide](../../install/linux/install.md). For any
refer to the {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`. For any
issue encountered during installation, write to us
[here](mailto:CloudGPUsupport@amd.com).

View File

@@ -1,12 +1,22 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="AMD ROCm documentation">
<meta name="keywords" content="documentation, guides, installation, compatibility, support,
reference, ROCm, AMD">
</head>
# AMD ROCm™ documentation
Welcome to the ROCm docs home page! If you're new to ROCm, you can review the following
resources to learn more about our products and what we support:
* [What is ROCm?](./what-is-rocm.md)
* [What's new?](about/whats-new/whats-new)
* [Release notes](./about/release-notes.md)
You can install ROCm on our Radeon™, Radeon Pro™, and Instinct™ GPUs. If you're using Radeon
GPUs, we recommend reading the
{doc}`Radeon-specific ROCm documentation<radeon:index>`.
Our documentation is organized into the following categories:
::::{grid} 1 2 2 2
@@ -20,34 +30,34 @@ Installation guides
^^^
* Linux
* [Quick-start (Linux)](./install/linux/install-quick.md)
* [Linux install guide](./install/linux/install.md)
* [Package manager integration](./install/linux/package-manager-integration.md)
* {doc}`Quick-start (Linux)<rocm-install-on-linux:tutorial/quick-start>`
* {doc}`Linux install guide<rocm-install-on-linux:how-to/native-install/index>`
* {doc}`Package manager integration<rocm-install-on-linux:how-to/native-install/package-manager-integration>`
* Windows
* [Quick-start (Windows)](./install/windows/install-quick.md)
* [Windows install guide](./install/windows/install.md)
* [Application deployment guidelines](./install/windows/windows-app-deployment-guidelines.md)
* [Deploy ROCm Docker containers](./install/docker.md)
* [PyTorch for ROCm](./install/pytorch-install.md)
* [TensorFlow for ROCm](./install/tensorflow-install.md)
* [MAGMA for ROCm](./install/magma-install.md)
* [ROCm & Spack](./install/spack-intro.md)
* {doc}`Windows install guide<rocm-install-on-windows:how-to/install>`
* {doc}`Application deployment guidelines<rocm-install-on-windows:conceptual/deployment-guidelines>`
* {doc}`Install Docker containers<rocm-install-on-linux:how-to/docker>`
* {doc}`PyTorch for ROCm<rocm-install-on-linux:how-to/3rd-party/pytorch-install>`
* {doc}`TensorFlow for ROCm<rocm-install-on-linux:how-to/3rd-party/tensorflow-install>`
* {doc}`MAGMA for ROCm<rocm-install-on-linux:how-to/3rd-party/magma-install>`
* {doc}`ROCm & Spack<rocm-install-on-linux:how-to/spack>`
:::
:::{grid-item-card}
:padding: 2
**Compatibility & Support**
**Compatibility & support**
ROCm compatibility information
^^^
* [Linux (GPU & OS)](./about/compatibility/linux-support.md)
* [Windows (GPU & OS)](./about/compatibility/windows-support.md)
* [Third-party](./about/compatibility/3rd-party-support-matrix.md)
* [User/kernel space](./about/compatibility/user-kernel-space-compat-matrix.md)
* [Docker](./about/compatibility/docker-image-support-matrix.rst)
* {doc}`System requirements (Linux)<rocm-install-on-linux:reference/system-requirements>`
* {doc}`System requirements (Windows)<rocm-install-on-windows:reference/system-requirements>`
* {doc}`Third-party<rocm-install-on-linux:reference/3rd-party-support-matrix>`
* {doc}`User/kernel space<rocm-install-on-linux:reference/user-kernel-space-compat-matrix>`
* {doc}`Docker<rocm-install-on-linux:reference/docker-image-support-matrix>`
* [OpenMP](./about/compatibility/openmp.md)
{doc}`ROCm on Radeon GPUs<radeon:index>`
:::
@@ -63,7 +73,7 @@ Task-oriented walkthroughs
* [MI200](./how-to/tuning-guides/mi200.md)
* [RDNA2](./how-to/tuning-guides/w6000-v620.md)
* [Setting up for deep learning with ROCm](./how-to/deep-learning-rocm.md)
* [GPU-enabled MPI](./how-to/gpu-enabled-mpi.md)
* [GPU-enabled MPI](./how-to/gpu-enabled-mpi.rst)
* [System level debugging](./how-to/system-debugging.md)
* [GitHub examples](https://github.com/amd/rocm-examples)
@@ -95,7 +105,7 @@ Topic overviews & background information
* [Compiler disambiguation](./conceptual/compiler-disambiguation.md)
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
* [LLVN ASan](./conceptual/using-gpu-sanitizer.md)
* [LLVM ASan](./conceptual/using-gpu-sanitizer.md)
* [Using CMake](./conceptual/cmake-packages.rst)
* [ROCm & PCIe atomics](./conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst)
* [Inception v3 with PyTorch](./conceptual/ai-pytorch-inception.md)

View File

@@ -1,90 +0,0 @@
# Deploy ROCm Docker containers
## Prerequisites
Docker containers share the kernel with the host operating system; therefore, the
ROCm kernel-mode driver must be installed on the host. Refer to
{ref}`linux-install-methods` for installing `amdgpu-dkms`. The other
user-space parts of the ROCm stack (like the HIP runtime or math libraries)
are loaded from the container image and don't need to be installed on the host.
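As a minimal sketch, assuming an Ubuntu host with the ROCm package repository already configured, installing just the kernel-mode driver looks like this:
```shell
# Install only the kernel-mode driver on the host; user-space ROCm comes from the container
sudo apt update
sudo apt install amdgpu-dkms
sudo reboot
```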
(docker-access-gpus-in-container)=
## Accessing GPUs in containers
To access GPUs in a container (to run applications using HIP, OpenCL, or
OpenMP offloading), explicit access to the GPUs must be granted.
The ROCm runtimes make use of multiple device files:
* `/dev/kfd`: the main compute interface shared by all GPUs
* `/dev/dri/renderD<node>`: direct rendering interface (DRI) devices for each
GPU. **`<node>`** is a number for each card in the system starting from 128.
Expose these devices to a container by using the
[`--device`](https://docs.docker.com/engine/reference/commandline/run/#device)
option. For example, to allow access to all GPUs, expose `/dev/kfd` and all
`/dev/dri/renderD` devices:
```shell
docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD129 ...
```
More conveniently, instead of listing all devices, the entire `/dev/dri` folder
can be exposed to the new container:
```shell
docker run --device /dev/kfd --device /dev/dri
```
Note that this gives more access than strictly required, as it also exposes the
other device files found in that folder to the container.
(docker-restrict-gpus)=
### Restricting a container to a subset of the GPUs
If a `/dev/dri/renderD` device is not exposed to a container, then the container
cannot use the GPU associated with it; this allows you to restrict a container
to any subset of devices.
For example, to allow the container to access the first and third GPUs, start it
as follows:
```shell
docker run --device /dev/kfd --device /dev/dri/renderD128 --device /dev/dri/renderD130 <image>
```
### Additional options
The performance of an application can vary depending on the assignment of GPUs
and CPUs to the task. Typically, `numactl` is installed as part of many HPC
applications to provide GPU/CPU mappings. The following Docker runtime option
supports memory mapping and can improve performance:
```shell
--security-opt seccomp=unconfined
```
This option is recommended for Docker containers running HPC applications.
```shell
docker run --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined ...
```
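For illustration, a hypothetical `numactl` invocation that pins an application to one NUMA node might look like the following (the right node numbers depend on your system topology):
```shell
# Bind CPU scheduling and memory allocation to NUMA node 0 before launching the application
numactl --cpunodebind=0 --membind=0 ./my_hpc_app
```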
## Docker images in the ROCm ecosystem
### Base images
<https://github.com/RadeonOpenCompute/ROCm-docker> hosts images useful for users
wishing to build their own containers leveraging ROCm. The built images are
available from [Docker Hub](https://hub.docker.com/u/rocm). In particular,
`rocm/rocm-terminal` is a small image with the prerequisites to build HIP
applications, but it does not include any libraries.
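For example, you can pull and start this image with GPU access using the device options shown earlier:
```shell
docker pull rocm/rocm-terminal
docker run -it --device /dev/kfd --device /dev/dri rocm/rocm-terminal
```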
### Applications
AMD provides pre-built images for various GPU-ready applications through its
Infinity Hub at <https://www.amd.com/en/technologies/infinity-hub>.
Examples for invoking each application and suggested parameters used for
benchmarking are also provided there.

View File

@@ -1,64 +0,0 @@
# MAGMA installation for ROCm
## MAGMA for ROCm
Matrix Algebra on GPU and Multicore Architectures (MAGMA) is a
collection of next-generation dense linear algebra libraries designed
for heterogeneous architectures, such as multiple GPUs and multi- or many-core
CPUs.
MAGMA provides implementations for CUDA, HIP, Intel Xeon Phi, and OpenCL™. For
more information, refer to
[https://icl.utk.edu/magma/index.html](https://icl.utk.edu/magma/index.html).
### Using MAGMA for PyTorch
Tensor is fundamental to deep-learning techniques because it provides extensive
representational functionalities and math operations. This data structure is
represented as a multidimensional matrix. MAGMA accelerates tensor operations
with a variety of solutions including driver routines, computational routines,
BLAS routines, auxiliary routines, and utility routines.
### Building MAGMA from source
To build MAGMA from source, follow these steps:
1. If you want to compile only for your microarchitecture (uarch), use:
```bash
export PYTORCH_ROCM_ARCH=<uarch>
```
where `<uarch>` is the architecture reported by the `rocminfo` command.
2. Build and install MAGMA:
```bash
export PYTORCH_ROCM_ARCH=<uarch>
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Fixes memory leaks of MAGMA found while executing linalg UTs
git checkout 5959b8783e45f1809812ed96ae762f38ee701972
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda
popd
mv magma /opt/rocm
```

View File

@@ -1,446 +0,0 @@
# Installing PyTorch for ROCm
[PyTorch](https://pytorch.org/) is an open-source tensor library designed for deep learning. PyTorch on
ROCm provides mixed-precision and large-scale training using our
[MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) and
[RCCL](https://github.com/ROCmSoftwarePlatform/rccl) libraries.
To install [PyTorch for ROCm](https://pytorch.org/blog/pytorch-for-amd-rocm-platform-now-available-as-python-package/), you have the following options:
* [Use a Docker image with PyTorch pre-installed](#using-a-docker-image-with-pytorch-pre-installed)
(recommended)
* [Use a wheels package](#using-a-wheels-package)
* [Use the PyTorch ROCm base Docker image](#using-the-pytorch-rocm-base-docker-image)
* [Use the PyTorch upstream Docker file](#using-the-pytorch-upstream-docker-file)
For hardware, software, and third-party framework compatibility between ROCm and PyTorch, refer to:
* [GPU and OS support (Linux)](../about/compatibility/linux-support.md)
* [Compatibility](../about/compatibility/3rd-party-support-matrix.md)
## Using a Docker image with PyTorch pre-installed
1. Download the latest public PyTorch Docker image
([https://hub.docker.com/r/rocm/pytorch](https://hub.docker.com/r/rocm/pytorch)).
```bash
docker pull rocm/pytorch:latest
```
You can also download a specific and supported configuration with different user-space ROCm
versions, PyTorch versions, and operating systems.
2. Start a Docker container using the image.
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video \
--ipc=host --shm-size 8G rocm/pytorch:latest
```
:::{note}
This will automatically download the image if it does not exist on the host. You can also pass the `-v`
argument to mount any data directories from the host onto the container.
:::
(install_pytorch_wheels)=
## Using a wheels package
PyTorch supports the ROCm platform by providing tested wheels packages. To access this feature, go
to [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/). For the correct
wheels command, you must select 'Linux', 'Python', 'pip', and 'ROCm' in the matrix.
1. Choose one of the following three options:
**Option 1:**
a. Download a base Docker image with the correct user-space ROCm version.
| Base OS | Docker image | Link to Docker image|
|----------------|-----------------------------|----------------|
| Ubuntu 20.04 | `rocm/dev-ubuntu-20.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-20.04](https://hub.docker.com/r/rocm/dev-ubuntu-20.04)
| Ubuntu 22.04 | `rocm/dev-ubuntu-22.04` | [https://hub.docker.com/r/rocm/dev-ubuntu-22.04](https://hub.docker.com/r/rocm/dev-ubuntu-22.04)
| CentOS 7 | `rocm/dev-centos-7` | [https://hub.docker.com/r/rocm/dev-centos-7](https://hub.docker.com/r/rocm/dev-centos-7)
b. Pull the selected image.
```bash
docker pull rocm/dev-ubuntu-20.04:latest
```
c. Start a Docker container using the downloaded image.
```bash
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/dev-ubuntu-20.04:latest
```
**Option 2:**
Select a base OS Docker image (check [OS compatibility](../about/compatibility/linux-support.md)).
Pull the selected base OS image (Ubuntu 20.04, for example):
```docker
docker pull ubuntu:20.04
```
Start a Docker container using the downloaded image:
```docker
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video ubuntu:20.04
```
Install ROCm using the directions in the [Installation section](./linux/install.md).
**Option 3:**
Install on bare metal. Check [OS compatibility](../about/compatibility/linux-support.md) and install ROCm using the
directions in the [Installation section](./linux/install.md).
2. Install the required dependencies for the wheels package.
```bash
sudo apt update
sudo apt install libjpeg-dev python3-dev python3-pip
pip3 install wheel setuptools
```
3. Install `torch`, `torchvision`, and `torchaudio`, as specified in the
[installation matrix](https://pytorch.org/get-started/locally/).
:::{note}
The following command uses the ROCm 5.6 PyTorch wheel. If you want a different version of ROCm,
modify the command accordingly.
:::
```bash
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6/
```
4. (Optional) Use MIOpen kdb files with ROCm PyTorch wheels.
PyTorch uses [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for machine learning
primitives, which are compiled into kernels at runtime. Runtime compilation causes a small warm-up
phase when starting PyTorch, and MIOpen kdb files contain precompiled kernels that can speed up
application warm-up phases. For more information, refer to the
{doc}`MIOpen installation page <miopen:install>`.
MIOpen kdb files can be used with ROCm PyTorch wheels. However, the kdb files need to be placed in
a specific location with respect to the PyTorch installation path. A helper script simplifies this task by
taking the ROCm version and GPU architecture as inputs. This works for Ubuntu and CentOS.
You can download the helper script here:
[install_kdb_files_for_pytorch_wheels.sh](https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh), or use:
`wget https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/files/install_kdb_files_for_pytorch_wheels.sh`
After installing ROCm PyTorch wheels, run the following code:
```bash
#Optional; replace 'gfx90a' with your architecture and 5.6 with your preferred ROCm version
export GFX_ARCH=gfx90a
#Optional
export ROCM_VERSION=5.6
./install_kdb_files_for_pytorch_wheels.sh
```
## Using the PyTorch ROCm base Docker image
The pre-built base Docker image has all dependencies installed, including:
* ROCm
* Torchvision
* Conda packages
* The compiler toolchain
Additionally, a particular environment flag (`BUILD_ENVIRONMENT`) is set, which is used by the build
scripts to determine the configuration of the build environment.
1. Download the Docker image. This is the base image, which does not contain PyTorch.
```bash
docker pull rocm/pytorch:latest-base
```
2. Start a Docker container using the downloaded image.
```bash
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest-base
```
You can also pass the `-v` argument to mount any data directories from the host onto the container.
3. Clone the PyTorch repository.
```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd /pytorch
git submodule update --init --recursive
```
4. Set ROCm architecture (optional). The Docker image tag is `rocm/pytorch:latest-base`.
:::{note}
By default in the `rocm/pytorch:latest-base` image, PyTorch builds simultaneously for the following
architectures:
* gfx900
* gfx906
* gfx908
* gfx90a
* gfx1030
:::
If you want to compile _only_ for your microarchitecture (uarch), run:
```bash
export PYTORCH_ROCM_ARCH=<uarch>
```
Where `<uarch>` is the architecture reported by the `rocminfo` command.
To find your uarch, run:
```bash
rocminfo | grep gfx
```
5. Build PyTorch.
```bash
./.ci/pytorch/build.sh
```
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
To check if your build is successful, run:
```bash
echo $? # should return 0 if success
```
## Using the PyTorch upstream Docker file
If you don't want to use a prebuilt base Docker image, you can build a custom base Docker image
using scripts from the PyTorch repository. This uses a standard Docker image from operating system
maintainers and installs all the required dependencies, including:
* ROCm
* Torchvision
* Conda packages
* The compiler toolchain
1. Clone the PyTorch repository.
```bash
cd ~
git clone https://github.com/pytorch/pytorch.git
cd /pytorch
git submodule update --init --recursive
```
2. Build the PyTorch Docker image.
```bash
cd .ci/docker
./build.sh pytorch-linux-<os-version>-rocm<rocm-version>-py<python-version> -t rocm/pytorch:build_from_dockerfile
```
Where:
* `<os-version>`: `ubuntu20.04` (or `focal`), `ubuntu22.04` (or `jammy`), `centos7.5`, or `centos9`
* `<rocm-version>`: `5.4`, `5.5`, or `5.6`
* `<python-version>`: `3.8`-`3.11`
To verify that your image was successfully created, run:
`docker image ls rocm/pytorch:build_from_dockerfile`
If successful, the output looks like this:
```bash
REPOSITORY TAG IMAGE ID CREATED SIZE
rocm/pytorch build_from_dockerfile 17071499be47 2 minutes ago 32.8GB
```
3. Start a Docker container using the image with the mounted PyTorch folder.
```bash
docker run -it --cap-add=SYS_PTRACE --user root \
--security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri \
--group-add video --ipc=host --shm-size 8G \
-v ~/pytorch:/pytorch rocm/pytorch:build_from_dockerfile
```
You can also pass the `-v` argument to mount any data directories from the host onto the container.
4. Go to the PyTorch directory.
```bash
cd pytorch
```
5. Set ROCm architecture.
To determine your AMD architecture, run:
```bash
rocminfo | grep gfx
```
The result looks like this (for `gfx1030` architecture):
```bash
Name: gfx1030
Name: amdgcn-amd-amdhsa--gfx1030
```
Set the `PYTORCH_ROCM_ARCH` environment variable to specify the architectures you want to
build PyTorch for.
```bash
export PYTORCH_ROCM_ARCH=<uarch>
```
where `<uarch>` is the architecture reported by the `rocminfo` command.
6. Build PyTorch.
```bash
./.ci/pytorch/build.sh
```
This converts PyTorch sources for
[HIP compatibility](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html) and builds the
PyTorch framework.
To check if your build is successful, run:
```bash
echo $? # should return 0 if success
```
## Testing the PyTorch installation
You can use PyTorch unit tests to validate your PyTorch installation. If you used a
**prebuilt PyTorch Docker image from AMD ROCm DockerHub** or installed an
**official wheels package**, validation tests are not necessary.
If you want to manually run unit tests to validate your PyTorch installation fully, follow these steps:
1. Import the torch package in Python to test if PyTorch is installed and accessible.
:::{note}
Do not run the following command in the PyTorch git folder.
:::
```bash
python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'
```
2. Check if the GPU is accessible from PyTorch. In the PyTorch framework, `torch.cuda` is a generic way
to access the GPU. This can only access an AMD GPU if one is available.
```bash
python3 -c 'import torch; print(torch.cuda.is_available())'
```
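Optionally (a quick check that isn't part of the original test sequence), you can also print the name of the detected device:
```bash
python3 -c 'import torch; print(torch.cuda.get_device_name(0))'
```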
3. Run unit tests to validate the PyTorch installation fully.
:::{note}
You must run the following command from the PyTorch home directory.
:::
```bash
PYTORCH_TEST_WITH_ROCM=1 python3 test/run_test.py --verbose \
--include test_nn test_torch test_cuda test_ops \
test_unary_ufuncs test_binary_ufuncs test_autograd
```
This command ensures that the required environment variable is set to skip certain unit tests for
ROCm. This also applies to wheel installs in a non-controlled environment.
:::{note}
Make sure your PyTorch source code corresponds to the PyTorch wheel or the installation in the
Docker image. Incompatible PyTorch source code can give errors when running unit tests.
:::
Some tests may be skipped, as appropriate, based on your system configuration. ROCm doesn't
support all PyTorch features; tests that evaluate unsupported features are skipped. Other tests might
be skipped, depending on the host or GPU memory and the number of available GPUs.
If the compilation and installation are correct, all tests will pass.
4. Run individual unit tests.
```bash
PYTORCH_TEST_WITH_ROCM=1 python3 test/test_nn.py --verbose
```
You can replace `test_nn.py` with any other test set.
## Running a basic PyTorch example
The PyTorch examples repository provides basic examples that exercise the functionality of your
framework.
Two of our favorite testing databases are:
* **MNIST** (Modified National Institute of Standards and Technology): A database of handwritten
digits that can be used to train a Convolutional Neural Network for **handwriting recognition**.
* **ImageNet**: A database of images that can be used to train a network for
**visual object recognition**.
### MNIST PyTorch example
1. Clone the PyTorch examples repository.
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the MNIST example folder.
```bash
cd examples/mnist
```
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash
python3 main.py
```
This generates the following output:
```bash
...
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.010128
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.001348
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.005261
Test set: Average loss: 0.0252, Accuracy: 9921/10000 (99%)
```
### ImageNet PyTorch example
1. Clone the PyTorch examples repository (if you didn't already do this step in the preceding MNIST example).
```bash
git clone https://github.com/pytorch/examples.git
```
2. Go to the ImageNet example folder.
```bash
cd examples/imagenet
```
3. Follow the instructions in the `README.md` file in this folder to install the requirements. Then run:
```bash
python3 main.py
```

View File

@@ -1,421 +0,0 @@
# Introduction to Spack
Spack is a package management tool designed to support multiple software versions and
configurations on a wide variety of platforms and environments. It was designed for large
supercomputing centers, where many users share common software installations on clusters with
exotic architectures using libraries that do not have a standard ABI. Spack is non-destructive: installing
a new version does not break existing installations, so many configurations can coexist on the same
system.
Most importantly, Spack is *simple*. It offers a simple *spec* syntax, so users can concisely specify
versions and configuration options. Spack is also simple for package authors: package files are written
in pure Python, and specs allow package authors to maintain a single file for many different builds of
the same package. For more information on Spack, see
[https://spack-tutorial.readthedocs.io/en/latest/](https://spack-tutorial.readthedocs.io/en/latest/).
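To illustrate the spec syntax with a hypothetical example (using names that appear later in this guide), a single command line can pin a version, a compiler, and a variant:
```bash
# Install mivisionx 5.3.0, built with gcc 9.4.0, with the hip variant enabled
spack install mivisionx@5.3.0 %gcc@9.4.0 +hip
```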
## ROCm packages in Spack
| **Component** | **Spack Package Name** |
|---------------------------|------------------------|
| **rocm-cmake** | rocm-cmake |
| **thunk** | hsakmt-roct |
| **rocm-smi-lib** | rocm-smi-lib |
| **hsa** | hsa-rocr-dev |
| **lightning** | llvm-amdgpu |
| **devicelibs** | rocm-device-libs |
| **comgr** | comgr |
| **rocclr (vdi)** | hip-rocclr |
| **hipify_clang** | hipify-clang |
| **hip (hip_in_vdi)** | hip |
| **ocl (opencl_on_vdi )** | rocm-opencl |
| **rocminfo** | rocminfo |
| **clang-ocl** | rocm-clang-ocl |
| **rccl** | rccl |
| **atmi** | atmi |
| **rocm_debug_agent** | rocm-debug-agent |
| **rocm_bandwidth_test** | rocm-bandwidth-test |
| **rocprofiler** | rocprofiler-dev |
| **roctracer-dev-api** | roctracer-dev-api |
| **roctracer** | roctracer-dev |
| **dbgapi** | rocm-dbgapi |
| **rocm-gdb** | rocm-gdb |
| **openmp-extras** | rocm-openmp-extras |
| **rocBLAS** | rocblas |
| **hipBLAS** | hipblas |
| **rocFFT** | rocfft |
| **rocRAND** | rocrand |
| **rocSPARSE** | rocsparse |
| **hipSPARSE** | hipsparse |
| **rocALUTION** | rocalution |
| **rocSOLVER** | rocsolver |
| **rocPRIM** | rocprim |
| **rocThrust** | rocthrust |
| **hipCUB** | hipcub |
| **hipfort** | hipfort |
| **ROCmValidationSuite** | rocm-validation-suite |
| **MIOpenGEMM** | miopengemm |
| **MIOpen(Hip variant)** | miopen-hip |
| **MIOpen(opencl)** | miopen-opencl |
| **MIVisionX** | mivisionx |
| **AMDMIGraphX** | migraphx |
| **rocm-tensile** | rocm-tensile |
| **hipfft** | hipfft |
| **RDC** | rdc |
| **hipsolver** | hipsolver |
| **mlirmiopen** | mlirmiopen |
```{note}
You must install all prerequisites before installing Spack.
```
::::{tab-set}
:::{tab-item} Ubuntu
:sync: Ubuntu
```shell
# Install some essential utilities:
apt-get update
apt-get install make patch bash tar gzip unzip bzip2 file gnupg2 git gawk
apt-get update -y
apt-get install -y xz-utils
apt-get install build-essential
apt-get install vim
# Install Python:
apt-get install python3
apt-get install python3-pip
# Install Compilers:
apt-get install gcc
apt-get install gfortran
```
:::
:::{tab-item} SLES
:sync: SLES
```shell
# Install some essential utilities:
zypper update
zypper install make patch bash tar gzip unzip bzip xz file gnupg2 git awk
zypper in -t pattern devel_basis  # assumes the basic development pattern; adjust as needed
zypper install vim
# Install Python:
zypper install python3
zypper install python3-pip
# Install Compilers:
zypper install gcc
zypper install gcc-fortran
zypper install gcc-c++
```
:::
:::{tab-item} CentOS
:sync: CentOS
```shell
# Install some essential utilities:
yum update
yum install make
yum install patch bash tar gzip unzip bzip2 xz file gnupg2 git gawk
yum group install "Development Tools"
yum install vim
# Install Python:
yum install python3
pip3 install --upgrade pip
# Install compilers:
yum install gcc
yum install gcc-gfortran
yum install gcc-c++
```
:::
::::
## Steps to build ROCm components using Spack
1. To use the Spack package manager, clone the Spack project from GitHub.
```bash
git clone https://github.com/spack/spack
```
2. Initialize Spack.
The `setup-env.sh` script initializes the Spack environment.
```bash
cd spack
. share/spack/setup-env.sh
```
Spack commands are available once the above steps are completed. To list the available commands,
use `help`.
```bash
root@ixt-rack-104:/spack# spack help
```
## Using Spack to install ROCm components
1. `rocm-cmake`
Install the default variants and the latest version of `rocm-cmake`.
```bash
spack install rocm-cmake
```
To install a specific version of `rocm-cmake`, use:
```bash
spack install rocm-cmake@<version number>
```
For example, `spack install rocm-cmake@5.2.0`
2. `info`
The `info` command displays basic package information. It shows the preferred, safe, and
deprecated versions, in addition to the available variants. It also shows the dependencies with other
packages.
```bash
spack info mivisionx
```
For example:
```bash
root@ixt-rack-104:/spack# spack info mivisionx
CMakePackage: mivisionx
Description:
MIVisionX toolkit is a set of comprehensive computer vision and machine
intelligence libraries, utilities, and applications bundled into a
single toolkit.
Homepage: <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX>
Preferred version:
5.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.3.0.tar.gz>
Safe versions:
5.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.3.0.tar.gz>
5.2.3 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.3.tar.gz>
5.2.1 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.1.tar.gz>
5.2.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.2.0.tar.gz>
5.1.3 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.1.3.tar.gz>
5.1.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.1.0.tar.gz>
5.0.2 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.0.2.tar.gz>
5.0.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-5.0.0.tar.gz>
4.5.2 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.5.2.tar.gz>
4.5.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.5.0.tar.gz>
Deprecated versions:
4.3.1 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.3.1.tar.gz>
4.3.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.3.0.tar.gz>
4.2.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.2.0.tar.gz>
4.1.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.1.0.tar.gz>
4.0.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-4.0.0.tar.gz>
3.10.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.10.0.tar.gz>
3.9.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.9.0.tar.gz>
3.8.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.8.0.tar.gz>
3.7.0 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/rocm-3.7.0.tar.gz>
1.7 <https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/archive/1.7.tar.gz>
Variants:
Name [Default] When Allowed values Description
==================== ==== ==================== ==================================
build_type [Release] -- Release, Debug, CMake build type
RelWithDebInfo
hip [on] -- on, off Use HIP as backend
ipo [off] -- on, off CMake interprocedural optimization
opencl [off] -- on, off Use OPENCL as the backend
Build Dependencies:
cmake ffmpeg libjpeg-turbo miopen-hip miopen-opencl miopengemm opencv openssl protobuf rocm-cmake rocm-opencl
Link Dependencies:
miopen-hip miopen-opencl miopengemm openssl rocm-opencl
Run Dependencies:
None
root@ixt-rack-104:/spack#
```
## Installing variants for ROCm components
The variants listed above indicate that the `mivisionx` package is built by default with
`build_type=Release` and the `hip` backend, and without the `opencl` backend. The `Debug` and
`RelWithDebInfo` build types are also supported, as is building with `opencl` and without `hip`.
For example:
```bash
spack install mivisionx build_type=Debug          # backend defaults to hip
spack install mivisionx+opencl build_type=Debug   # backend is opencl; hip is disabled by the conflict defined in the recipe
```
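More generally, Spack's spec syntax uses `+` to enable a variant, `~` to disable one, and `name=value` for valued variants. A minimal sketch combining the variants shown in the `spack info` output above:
```bash
# Enable the opencl backend, explicitly disable hip, and build RelWithDebInfo.
spack install mivisionx+opencl~hip build_type=RelWithDebInfo
```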
* `spack spec` command
To display the dependency tree, use the `spack spec` command with the same spec format.
For example:
```bash
root@ixt-rack-104:/spack# spack spec mivisionx
Input spec
--------------------------------
mivisionx
Concretized
--------------------------------
mivisionx@5.3.0%gcc@9.4.0+hip~ipo~opencl build_type=Release arch=linux-ubuntu20.04-skylake_avx512
```
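Before a large install, it can also help to see which dependencies are already present. The flags below are standard `spack spec` options, though the output will differ on your system:
```bash
# -I marks packages that are already installed; -l adds each node's short hash.
spack spec -Il mivisionx
```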
## Creating an environment
You can create an environment containing all the required components for your ROCm version.
1. In the root folder, create a new folder where you can create a `.yaml` file. This file is used to
create an environment.
```bash
mkdir /localscratch
cd /localscratch
vi sample.yaml
```
2. Add all the required components to the `sample.yaml` file:
```yaml
spack:
  concretization: separately
  packages:
    all:
      compiler: [gcc@8.5.0]
  specs:
  - matrix:
    - ['%gcc@8.5.0^cmake@3.19.7']
    - [rocm-cmake@5.3.2, rocm-dbgapi@5.3.2, rocm-debug-agent@5.3.2, rocm-gdb@5.3.2,
       rocminfo@5.3.2, rocm-opencl@5.3.2, rocm-smi-lib@5.3.2, rocm-tensile@5.3.2, rocm-validation-suite@4.3.1,
       rocprim@5.3.2, rocprofiler-dev@5.3.2, rocrand@5.3.2, rocsolver@5.3.2, rocsparse@5.3.2,
       rocthrust@5.3.2, roctracer-dev@5.3.2]
  view: true
```
3. Once you've created the `.yaml` file, you can use it to create an environment.
```bash
spack env create -d /localscratch/MyEnvironment /localscratch/sample.yaml
```
4. Activate the environment.
```bash
spack env activate /localscratch/MyEnvironment
```
5. Verify that the environment lists all the component versions you want.
```bash
spack find
```
The `spack find` command lists all the components in the environment; at this point, none are
installed yet.
6. Install all the components in the `.yaml` file.
```bash
cd /localscratch/MyEnvironment
spack install -j 50
```
7. Check that all components are successfully installed.
```bash
spack find
```
8. If any modification is made to the `.yaml` file, you must deactivate the existing environment
and create a new one in order for the modifications to take effect.
To deactivate, use:
```bash
spack env deactivate
```
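As a minimal sketch of the full cycle after editing the file (the new directory name
`MyEnvironment2` is arbitrary):
```bash
spack env deactivate
# Re-create the environment from the updated file, activate it, and rebuild.
spack env create -d /localscratch/MyEnvironment2 /localscratch/sample.yaml
spack env activate /localscratch/MyEnvironment2
spack install
```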
## Create and apply a patch before installation
Spack installs ROCm packages after pulling the source code from GitHub and building it locally. To
build a component with modifications to its source code, you must generate a patch and apply it
before the build phase.
To generate a patch and build with the changes:
1. Stage the source code. The `spack stage` command pulls the release source (here, the 5.2.0
release of `hip`) and prints the path to the `spack-src` directory that holds the full source tree.
```bash
root@ixt-rack-104:/spack# spack stage hip@5.2.0
==> Fetching https://github.com/ROCm-Developer-Tools/HIP/archive/rocm-5.2.0.tar.gz
==> Fetching https://github.com/ROCm-Developer-Tools/hipamd/archive/rocm-5.2.0.tar.gz
==> Fetching https://github.com/ROCm-Developer-Tools/ROCclr/archive/rocm-5.2.0.tar.gz
==> Moving resource stage
source: /tmp/root/spack-stage/resource-hipamd-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/hipamd
==> Moving resource stage
source: /tmp/root/spack-stage/resource-opencl-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/opencl
==> Moving resource stage
source: /tmp/root/spack-stage/resource-rocclr-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/
destination: /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src/rocclr
==> Staged hip in /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7
```
2. Change directory to `spack-src` inside the staged directory.
```bash
root@ixt-rack-104:/spack# cd /tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7# cd spack-src/
```
3. Create a new Git repository.
```bash
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# git init
```
4. Add the entire directory to the repository.
```bash
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# git add .
```
5. Make the required changes to the source code.
```bash
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# vi hipamd/CMakeLists.txt   # make the required changes to the source code
```
6. Generate the patch using the `git diff` command.
```bash
git diff > /spack/var/spack/repos/builtin/packages/hip/0001-modifications.patch
```
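As an optional sanity check, you can verify that the patch file captures your changes exactly. Because the staged tree already contains the modifications, a reverse-apply dry run should succeed:
```bash
# --check only tests whether the patch applies; --reverse tests it against the
# already-modified tree. Neither option modifies any files.
git apply --check --reverse /spack/var/spack/repos/builtin/packages/hip/0001-modifications.patch
```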
7. Update the recipe with the patch file name and any conditions you want to apply.
```bash
root@ixt-rack-104:/tmp/root/spack-stage/spack-stage-hip-5.2.0-wzo5y6ysvmadyb5mvffr35galb6vjxb7/spack-src# spack edit hip
```
Provide the patch file name and the conditions for applying the patch:
`patch("0001-modifications.patch", when="@5.2.0")`
Spack applies `0001-modifications.patch` to the `5.2.0` release code before starting the `hip` build.
After each modification, you must update the recipe. If the recipe itself needs no change, update its
timestamp instead so Spack picks up the new patch:
`touch /spack/var/spack/repos/builtin/packages/hip/package.py`
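With the recipe updated, a hedged sketch of the rebuild follows; `spack patch` stages the source and applies the declared patches without building, which is a convenient way to confirm the patch phase before committing to a full build:
```bash
spack patch hip@5.2.0    # stage the source and apply declared patches only
spack install hip@5.2.0  # full build with 0001-modifications.patch applied
```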

View File

@@ -1,191 +0,0 @@
# Installing TensorFlow for ROCm
## TensorFlow
TensorFlow is an open-source library for solving machine-learning,
deep-learning, and artificial-intelligence problems. It can be used across
many sectors and industries, but it primarily focuses on training and
inference in neural networks. It is one of the most popular and in-demand
frameworks, with very active open-source contribution and development.
:::{warning}
ROCm 5.6 and 5.7 deviate from the standard practice of supporting the last three
TensorFlow versions. This is due to incompatibilities between earlier TensorFlow
versions and changes introduced in the ROCm 5.6 compiler. Refer to the following
version support matrix:
| ROCm | TensorFlow |
|:-----:|:----------:|
| 5.6.x | 2.12 |
| 5.7.0 | 2.12, 2.13 |
| Post-5.7.0 | Last three versions at ROCm release. |
:::
### Installing TensorFlow
The following sections contain options for installing TensorFlow.
#### Option 1: using a Docker image
To install ROCm on bare metal, follow the
[Linux installation guide](../install/linux/install.md). The recommended way to
set up a TensorFlow environment, however, is through Docker.
Using Docker provides portability and access to a prebuilt Docker container that
has been rigorously tested within AMD. This can also save compilation time, and the
container should perform as tested, without potential installation issues.
Follow these steps:
1. Pull the latest public TensorFlow Docker image.
```bash
docker pull rocm/tensorflow:latest
```
2. Once you have pulled the image, run it by using the command below:
```bash
docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined rocm/tensorflow:latest
```
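Once inside the container, a quick way to confirm that TensorFlow can see the GPU is to list the physical devices. This is a sketch using the standard `tf.config` API; output varies by system:
```bash
# Should print one entry per visible AMD GPU, e.g. PhysicalDevice(.../GPU:0).
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
```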
#### Option 2: using a wheels package
To install TensorFlow using the wheels package, follow these steps:
1. Check the Python version.
```bash
python3 --version
```
| If: | Then: |
|:-----------------------------------:|:--------------------------------:|
| The Python version is less than 3.7 | Upgrade Python. |
| The Python version is 3.7 or newer | Skip this step and go to Step 3. |
```{note}
The supported Python versions are:
* 3.7
* 3.8
* 3.9
* 3.10
```
```bash
sudo apt-get install python3.7 # or python3.8, python3.9, or python3.10
```
2. Set up multiple Python versions using update-alternatives.
```bash
update-alternatives --query python3
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python[version] [priority]
```
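For instance, a sketch with illustrative values only (registering Python 3.9 with priority 3; substitute your own version and priority):
```bash
# Register python3.9 as an alternative for python3 with priority 3.
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 3
```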
```{note}
If your default Python version is incompatible, follow Step 2 to register a compatible version, then select it.
```
```bash
sudo update-alternatives --config python3
```
3. Follow the screen prompts, and select the Python version installed in Step 2.
4. Install or upgrade PIP.
To install PIP, use the following:
```bash
sudo apt install python3-pip
```
To upgrade PIP for the Python version installed in Step 2, use:
```bash
/usr/bin/python[version] -m pip install --upgrade pip
```
Alternatively, upgrade the system default:
```bash
sudo pip3 install --upgrade pip
```
5. Install TensorFlow for the Python version as indicated in Step 2.
```bash
/usr/bin/python[version] -m pip install --user tensorflow-rocm==[wheel-version] --upgrade
```
For the valid wheel versions for a given ROCm release, refer to the `tensorflow-rocm`
release-notes link in the note at the end of this section.
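To confirm which wheel actually got installed, a quick check (assuming `pip3` points at the same Python version used above):
```bash
# Shows the installed tensorflow-rocm version, if any.
pip3 list | grep tensorflow-rocm
```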
6. Update `protobuf` to 3.19 or lower.
```bash
/usr/bin/python[version] -m pip install protobuf==3.19.0
```
7. Set the environment variable `PYTHONPATH`.
```bash
export PYTHONPATH="./.local/lib/python[version]/site-packages:$PYTHONPATH" # use the same Python version as in Step 2
```
8. Install libraries.
```bash
sudo apt install rocm-libs rccl
```
9. Test installation.
```bash
python3 -c 'import tensorflow' 2> /dev/null && echo 'Success' || echo 'Failure'
```
```{note}
For details on `tensorflow-rocm` wheels and ROCm version compatibility, see:
[https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md](https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-rocm-release.md)
```
### Test the TensorFlow installation
To test the installation of TensorFlow, run the container image as specified in
the previous section, Installing TensorFlow. Ensure you have access to the Python
shell in the Docker container.
```bash
python3 -c 'import tensorflow' 2> /dev/null && echo Success || echo Failure
```
### Run a basic TensorFlow example
The TensorFlow examples repository provides basic examples that exercise the
framework's functionality. The MNIST database is a collection of handwritten
digits that may be used to train a Convolutional Neural Network for handwriting
recognition.
Follow these steps:
1. Clone the TensorFlow example repository.
```bash
cd ~
git clone https://github.com/tensorflow/models.git
```
2. Install the code's dependencies, and run the example.
```bash
pip3 install -r requirements.txt
python3 mnist_tf.py
```
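If the example repository's layout differs, the following self-contained sketch exercises the same MNIST workflow directly through the Keras API (a hypothetical stand-in for `mnist_tf.py`, trained for one epoch to keep it quick):
```bash
python3 - <<'EOF'
# Minimal MNIST training run: load the data, normalize it, train a small
# dense network for one epoch, and report test accuracy.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=1)
model.evaluate(x_test, y_test)
EOF
```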

View File

@@ -1,140 +0,0 @@
# Windows quick-start installation guide
For a quick summary on installing ROCm (HIP SDK) on Windows, follow the steps listed on this page. If
you want a more in-depth installation guide, see
[Installing ROCm on Windows](./install.md).
## System requirements
The HIP SDK is supported on Windows 10 and 11. The HIP SDK may be installed on a
system without AMD GPUs to use the build toolchains, but running HIP applications
requires a compatible GPU. Please see the supported GPU guide for more details.
## HIP SDK installation
### Download the installer
Download the installer from the
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
### Launch the installer
To launch the AMD HIP SDK Installer, click the **Setup** icon shown in the following image.
![Icon with AMD arrow logo and User Access Control Shield overlay](../../data/install/windows/000-setup-icon.png "Setup Icon")
The installer requires Administrator Privileges, so you may be greeted with a
User Access Control (UAC) pop-up. Click **Yes**.
![User Access Control pop-up](../../data/install/windows/001-uac-dark.png "User Access Control pop-up")
![User Access Control pop-up](../../data/install/windows/001-uac-light.png "User Access Control pop-up")
The installer executable temporarily extracts installer packages to `C:\AMD`,
which it removes after installation completes. This extraction is signified
by the "Initializing install" window in the following image.
![Window with AMD arrow logo, futuristic background and progress counter](../../data/install/windows/002-initializing.png "Installer initialization window")
The installer will then detect your system configuration to determine which installable components
are applicable to your system.
![Window with AMD arrow logo, futuristic background and activity indicator](../../data/install/windows/003-detecting-system-config.png "Installer initialization window")
### Customize the install
When the installer launches, it displays a window that lets the user customize
the installation. By default, all components are selected for installation.
The following image shows an instance where the **Select All** option
is turned on.
![Window with AMD arrow logo, futuristic background and activity indicator](../../data/install/windows/004-installer-window.png "Installer initialization window")
#### HIP SDK installer
The HIP SDK installation options are listed in the following table.
```{table} HIP SDK Components for Installation
:name: hip-sdk-options-win
| **HIP Components** | **Install Type** | **Additional Options** |
|:------------------:|:----------------:|:----------------------:|
| HIP SDK Core | 5.5.0 | Install location |
| HIP Libraries | Full, Partial, None | Runtime, Development (Libs and headers) |
| HIP Runtime Compiler | Full, Partial, None | Runtime, Development (Headers) |
| HIP Ray Tracing | Full, Partial, None | Runtime, Development (Headers) |
| Visual Studio Plugin | Full, Partial, None | Visual Studio 2017, 2019, 2022 Plugin |
```
```{note}
The Select/DeSelect All option only applies to the installation of HIP SDK
components. To install the bundled AMD Display Driver, manually select the
install type.
```
```{tip}
Should you only wish to install a few select components,
DeSelecting All and then picking the individual components may be more
convenient.
```
#### AMD display driver
The HIP SDK installer bundles an AMD Radeon Software PRO 23.10 installer. The
supported install options are summarized in the following table:
```{table} AMD Display Driver Install Options
:name: display-driver-install-win
| **Install Option** | **Description** |
|:------------------:|:---------------:|
| Install Location | Location on disk to store driver files. |
| Install Type | The breadth of components to be installed. |
| Factory Reset (Optional) | A Factory Reset will remove all prior versions of AMD HIP SDK and drivers. You will not be able to roll back to previously installed drivers. |
```
```{table} AMD Display Driver Install Types
:name: display-driver-win-types
| **Install Type** | **Description** |
|:----------------:|:---------------:|
| Full Install | Provides all AMD Software features and controls for gaming, recording, streaming, and tweaking the performance on your graphics hardware. |
| Minimal Install | Provides only the basic controls for AMD Software features and does not include advanced features such as performance tweaking or recording and capturing content. |
| Driver Only | Provides no user interface for AMD Software features. |
```
```{note}
You must perform a system restart for a complete installation of the
Display Driver.
```
### Install components
Please wait for the installation to complete, as shown in the following image.
![Window with AMD arrow logo, futuristic background and progress meter](../../data/install/windows/012-install-progress.png "Installation progress")
### Installation complete
Once the installation is complete, the installer window may prompt you for a
system restart. Click **Restart** at the lower right corner, shown in the following image.
![Window with AMD arrow logo, futuristic background and completion notice](../../data/install/windows/013-install-complete.png "Installation complete")
```{error}
Should the installer terminate due to unexpected circumstances, or should you
forcibly terminate the installer, the temporary directory created under
`C:\AMD` may be safely removed. Installed components will not depend on this
folder (unless you explicitly specify `C:\AMD` as an install folder).
```
## Uninstall
All components, except the Visual Studio plug-in, should be uninstalled through the Windows
Settings app (or through Control Panel > Add/Remove Programs on older systems). Navigate to
**Apps > Installed apps**, click the ellipsis (**...**) on the far right next to the component
you want to uninstall, and click **Uninstall**. To uninstall the Visual Studio extension, refer to
<https://github.com/ROCm-Developer-Tools/HIP-VS/blob/master/README.md>.
![Installed apps section of the settings app showing installed HIP SDK components](../../data/install/windows/014-uninstall-dark.png "Removing the SDK via the settings app")
![Installed apps section of the settings app showing installed HIP SDK components](../../data/install/windows/014-uninstall-light.png "Removing the SDK via the settings app")

View File

@@ -1,258 +0,0 @@
# Install HIP SDK on Windows
To install the HIP SDK on Windows, use the [quick-start guide](./install-quick.md) or follow the
instructions below.
**Topics:**
* [Prerequisites](#prerequisites)
* [Install HIP SDK](#install-hip-sdk)
* [Upgrade HIP SDK](#upgrade-hip-sdk)
* [Uninstall HIP SDK](#uninstall-hip-sdk)
## Prerequisites
Verify that your system meets all the installation requirements. Installation is supported
only on specific host architectures, Windows editions, and update versions.
The HIP SDK is supported on Windows 10 and 11. It can be installed on a
system without AMD GPUs to use the build toolchains, but to run HIP applications, a
compatible GPU is required. Please see the
[supported GPU guide](../../about/compatibility/windows-support.md) for more details.
::::{tab-set}
:::{tab-item} CLI
:sync: cli
1. Type the following command on your system from a PowerShell command-line interface (CLI):
```pwsh
Get-ComputerInfo | Format-Table CsSystemType,OSName,OSDisplayVersion
```
Running this command on a Windows system may result in the following output:
```output
CsSystemType OsName OSDisplayVersion
------------ ------ ----------------
x64-based PC Microsoft Windows 11 Pro 22H2
```
2. Confirm that the obtained information matches that listed in {ref}`windows-support`.
:::
:::{tab-item} GUI
:sync: gui
1. Open the **Settings** app.
![Gear icon of the Windows Settings app](../../data/install/windows/000-settings-light.png "Windows Settings app icon")
2. Navigate to **System > About**.
![Settings app panel showing device and OS information](../../data/install/windows/001-about-light.png "Settings > About page")
3. Confirm that the obtained information matches {ref}`windows-support`.
:::
::::
## Install HIP SDK
::::{tab-set}
:::{tab-item} CLI
:sync: cli
CLI options are listed in the following table:
```{table}
:name: hip-sdk-cli-install
| **Install Option** | **Description** |
|:------------------|:---------------|
| `-install` | Command used to install packages, both driver and applications. No output to the screen. |
| `-install -boot` | Silent install with auto reboot. |
| `-install -log <absolute path>` | Write install result code to the specified log file. The specified log file must be on a local machine. Double quotes are needed if there are spaces in the log file path. |
| `-uninstall` | Command to uninstall all packages installed by this installer on the system. There is no option to specify which packages to uninstall. |
| `-uninstall -boot` | Silent uninstall with auto reboot. |
| `/?` or `/help` | Shows a brief description of all switch commands. |
```
```{note}
Unlike the GUI, the CLI doesn't support selectively installing parts of the SDK bundle.
```
To start the installation, follow these steps:
1. Download the installer from the
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
2. Launch the installer. Note that the installer is a graphical application with a `WinMain` entry
point, even when called on the command line. This means that the application lifetime is tied to a
window, even on headless systems where that window may not be visible.
```pwsh
Start-Process $InstallerExecutable -ArgumentList $InstallerArgs -NoNewWindow -Wait
```
```{important}
Running the installer requires Administrator Privileges.
```
To install all components:
```pwsh
Start-Process ~\Downloads\Setup.exe -ArgumentList '-install','-log',"${env:USERPROFILE}\installer_log.txt" -NoNewWindow -Wait
```
:::
:::{tab-item} GUI
:sync: gui
The HIP SDK installation options are listed in the following table.
```{table}
:name: hip-sdk-options
| **HIP Components** | **Install Type** | **Additional Options** |
|:------------------|:----------------|:----------------------|
| HIP SDK Core | 5.5.0 | Install location |
| HIP Libraries | Full, Partial, None | Runtime, Development (Libs and headers) |
| HIP Runtime Compiler | Full, Partial, None | Runtime, Development (Headers) |
| HIP Ray Tracing | Full, Partial, None | Runtime, Development (Headers) |
| Visual Studio Plugin | Full, Partial, None | Visual Studio 2017, 2019, 2022 Plugin |
```
```{note}
The Select/DeSelect All option only applies to the installation of HIP SDK
components. To install the bundled AMD Display Driver, manually select the
install type.
```
```{tip}
Should you only wish to install a few select components,
DeSelecting All and then picking the individual components may be more
convenient.
```
The HIP SDK installer bundles an AMD Radeon Software PRO 23.10 installer. The
supported install options are summarized in the following table:
```{table}
:name: display-driver-install-options
| **Install Option** | **Description** |
|:------------------|:---------------|
| Install Location | Location on disk to store driver files. |
| Install Type | The breadth of components to be installed. |
| Factory Reset (Optional) | A Factory Reset will remove all prior versions of AMD HIP SDK and drivers. You will not be able to roll back to previously installed drivers. |
```
```{table} AMD Display Driver Install Types
:name:
| **Install Type** | **Description** |
|:----------------|:---------------|
| Full Install | Provides all AMD Software features and controls for gaming, recording, streaming, and tweaking the performance on your graphics hardware. |
| Minimal Install | Provides only the basic controls for AMD Software features and does not include advanced features such as performance tweaking or recording and capturing content. |
| Driver Only | Provides no user interface for AMD Software features. |
```
```{note}
You must perform a system restart for a complete installation of the
Display Driver.
```
To start the installation, follow these steps:
1. Download the installer from the
[HIP-SDK download page](https://www.amd.com/en/developer/rocm-hub/hip-sdk.html).
2. Launch the installer by clicking the **Setup** icon.
![Icon with AMD arrow logo and User Access Control Shield overlay](../../data/install/windows/000-setup-icon.png "Setup Icon")
The installer requires Administrator Privileges, so you may be greeted with a
User Access Control (UAC) pop-up. Click Yes.
![User Access Control pop-up](../../data/install/windows/001-uac-light.png "User Access Control pop-up")
The installer executable temporarily extracts installer packages to `C:\AMD`; it removes these after the
installation completes.
![Window with AMD arrow logo, futuristic background and progress counter](../../data/install/windows/002-initializing.png "Installer initialization window")
The installer detects your system configuration to determine which installable components
are applicable to your system.
![Window with AMD arrow logo, futuristic background and activity indicator](../../data/install/windows/003-detecting-system-config.png "Installer initialization window")
3. Customize your installation. When the installer launches, it displays a window that lets you customize
your installation. By default, all components are selected.
![Window with AMD arrow logo, futuristic background and activity indicator](../../data/install/windows/004-installer-window.png "Installer initialization window")
4. Wait for the installation to complete.
![Window with AMD arrow logo, futuristic background and progress meter](../../data/install/windows/012-install-progress.png "Installation progress")
When installation is complete, the installer window may prompt you for a system restart.
![Window with AMD arrow logo, futuristic background and completion notice](../../data/install/windows/013-install-complete.png "Installation complete")
```{error}
If the installer terminates mid-installation, the temporary directory created under `C:\AMD` can be
safely removed. Installed components don't depend on this folder unless you explicitly choose this
as the install folder.
```
:::
::::
## Upgrade HIP SDK
To upgrade the HIP SDK, you can run the installer for the newer version without uninstalling the
existing version. You can also uninstall the HIP SDK before installing the newest version.
## Uninstall HIP SDK
::::{tab-set}
:::{tab-item} CLI
:sync: cli
Launch the installer. Note that the installer is a graphical application with a `WinMain` entry
point, even when called on the command line. This means that the application lifetime is tied to a
window, even on headless systems where that window may not be visible.
```pwsh
Start-Process $InstallerExecutable -ArgumentList $InstallerArgs -NoNewWindow -Wait
```
```{important}
Running the installer requires Administrator Privileges.
```
To uninstall all components:
```pwsh
Start-Process ~\Downloads\Setup.exe -ArgumentList '-uninstall' -NoNewWindow -Wait
```
:::
:::{tab-item} GUI
:sync: gui
Uninstallation of HIP SDK components can be done through the Windows Settings app. Navigate to
"Apps > Installed apps" and click the ellipsis (...) on the far right next to the component you want to uninstall. Click "Uninstall".
![Installed apps section of the settings app showing installed HIP SDK components](../../data/install/windows/014-uninstall-light.png "Removing the SDK via the settings app")
For Visual Studio extension uninstallation, refer to
<https://github.com/ROCm-Developer-Tools/HIP-VS/blob/master/README.md>.
:::
::::

View File

@@ -1,71 +0,0 @@
# Application deployment guidelines for Windows
ISVs deploying applications built with the HIP SDK depend on the AMD GPU driver, the HIP
runtime library, and the HIP SDK libraries. A compatibility matrix table provides
details on AMD's support model. AMD GPU drivers are distributed with a HIP
runtime included, and each HIP runtime is associated with a HIP compiler version.
Applications built with a particular HIP compiler should document the associated
HIP runtime version and AMD GPU driver as minimum version requirements for their
end users. Applications do not distribute the HIP runtime; instead, end users
use the HIP runtime provided by an AMD GPU driver. AMD provides backward
compatibility for applications dynamically linked to the HIP runtime based on
its driver and HIP support policy. ISV applications using the HIP SDK libraries,
for example hipBLAS, should distribute the HIP SDK library as part of their
installer packages. It is recommended not to require end users to install the
HIP SDK. AMD provides backward compatibility of the AMD driver and HIP runtime for
the HIP SDK libraries based on its support policy. AMD's support policy for Visual
Studio and other third-party compilers is documented here.
## Usage scenario
This guide is intended for Independent Software Vendors (ISVs) and other
software developers who intend to build applications with the HIP SDK for
Windows. The HIP SDK is intended for developer distribution, in contrast to the
AMD GPU driver, which is intended for all end users. The guide discusses how to
use and distribute components from the HIP SDK. The HIP SDK is the collection of
the AMD GPU driver, the HIP runtime, and the HIP libraries. These three parts are
distributed in the HIP SDK installer. The compatibility and versioning relationship
between these three parts is documented here. AMD's support policies for the
developer tools give ISVs the stability to plan their use of a toolchain.
## Recommended library distribution model
The HIP SDK is distributed via a Windows installer. This distribution system is
only intended for software developers and testers. AMD recommends that end users
of a program built against HIP SDK components not be required to install the HIP
SDK. There are two types of ISV applications that use the HIP SDK.
The first group of ISV applications depends on the HIP runtime and
select HIP header-only libraries (rocPRIM, hipCUB, and rocThrust). These
applications need to require that their end users install an AMD GPU driver. Each
AMD GPU driver has a HIP runtime library bundled with it. The ISV application
should require a minimum version of that HIP runtime library. Because the HIP
runtime library does not have semantic versioning, the ISV
application cannot check for compatibility. However, AMD is committed to not
breaking API/ABI compatibility unless the major version number of the HIP
runtime is incremented. ISV applications may run without user warning if the HIP
major version available in the driver is the same as the HIP major version
associated with the compiler the application was built with. The ISV, at its
discretion, may issue a warning if the HIP major version in the driver is higher
than the HIP major version associated with the compiler the application was built with.
The second group of ISV applications depends on the HIP runtime and one
or more dynamically linked HIP libraries, including the HIP RT library. ISV
applications with this dependency need to ensure that the end user installs an AMD
GPU driver, and they should distribute the dynamically linked HIP libraries
in their application's installer package. This allows end users to avoid
installing the HIP SDK. One benefit of this model is that less disk space is
required, because only the needed binaries are distributed with the ISV
application. It also spares the end user from agreeing to licensing agreements
for the entire HIP SDK.
The version checks recommended for ISV applications that include dynamically
linked HIP libraries follow the same requirements as for ISV applications that
have only the HIP runtime and header-only libraries. In addition, each dynamically
linked HIP library has a minimum HIP runtime requirement. Checks for the
minimum HIP version of each dynamically linked HIP library may be added at the
ISV's discretion. Usually, the minimum HIP version check for the HIP runtime is
sufficient if the dynamically linked HIP libraries come from the same SDK package as
the HIP compiler.
Note that AMD does not support static linking to any components distributed in
the HIP SDK.

View File

@@ -1,6 +1,14 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm API libraries & tools">
<meta name="keywords" content="ROCm, API, libraries, tools, artificial intelligence, development,
Communications, C++ primitives, Fast Fourier transforms, FFTs, random number generators, linear
algebra, AMD">
</head>
# ROCm API libraries & tools
::::{grid} 1 2 2 2
::::{grid} 1 3 3 3
:class-container: rocm-doc-grid
:::{grid-item-card}
@@ -10,8 +18,9 @@
^^^
* {doc}`Composable Kernel <composable_kernel:index>`
* {doc}`MIOpen <miopen:index>`
* {doc}`MIGraphX <amdmigraphx:index>`
* {doc}`MIOpen <miopen:index>`
* {doc}`MIVisionX <mivisionx:doxygen/html/index>`
:::
@@ -44,7 +53,6 @@
^^^
* {doc}`hipCC <hipcc:index>`
* {doc}`ROCdbgapi <rocdbgapi:index>`
* [ROCmCC](./rocmcc.md)
* {doc}`ROCm debugger (ROCgdb) <rocgdb:index>`
@@ -99,7 +107,7 @@
^^^
* {doc}`ROCProfiler <rocprofiler:rocprof>`
* {doc}`ROCProfiler <rocprofiler:profiler_home_page>`
* {doc}`ROCTracer <roctracer:index>`
:::
@@ -121,8 +129,9 @@
^^^
* {doc}`AMD SMI <amdsmi:index>`
* {doc}`ROCm Data Center Tool <rdc:index>`
* {doc}`ROCm SMI LIB <rocm_smi_lib:index>`
* {doc}`ROCm SMI <rocm_smi_lib:index>`
* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
* {doc}`TransferBench <transferbench:index>`

View File

@@ -1,3 +1,10 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="Compiler reference guide">
<meta name="keywords" content="compiler, hipCC, Clang, amdclang, optimizations, LLVM,
rocm-llvm, AMD, ROCm">
</head>
# Compiler reference guide
## Introduction to compiler reference guide
@@ -134,12 +141,12 @@ The `-famd-opt` flag is useful when a user wants to build with the proprietary
optimization compiler and not have to depend on setting any of the other
proprietary optimization flags.
```{note}
:::{note}
`-famd-opt` can be used in addition to the other proprietary CPU optimization
flags. The table of optimizations below implicitly enables the invocation of the
AMD proprietary optimizations compiler, whereas the `-famd-opt` flag requires
this to be handled explicitly.
```
:::
#### `-fstruct-layout=[1,2,3,4,5,6,7]`
@@ -255,12 +262,12 @@ loop. The heuristic can be controlled with the following options:
Where, `n` is a positive integer and higher value of `<n>` facilitates more unswitching.
```{note}
:::{note}
These options may facilitate more unswitching under some workloads. Since
loop-unswitching inherently leads to code bloat, facilitating more
unswitching may significantly increase the code size. Hence, it may also lead
to longer compilation times.
```
:::
##### `-enable-strided-vectorization`
@@ -451,11 +458,11 @@ supports ASM statements, their use is not recommended for the following reasons:
* Writing correct ASM statements is often difficult; we strongly recommend
thorough testing of any use of ASM statements.
```{note}
:::{note}
For developers who choose to include ASM statements in the code, AMD is
interested in understanding the use case and appreciates feedback at
[https://github.com/RadeonOpenCompute/ROCm/issues](https://github.com/RadeonOpenCompute/ROCm/issues)
```
:::
### Miscellaneous OpenMP compiler features

View File

@@ -1,23 +1,33 @@
# ROCm release history
| Version | Release Date |
| ------- | ------------ |
| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 |
| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 |
| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 |
| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 |
| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 |
| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 |
| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 |
| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 |
| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 |
| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 |
| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 |
| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 |
| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 |
| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 |
| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 |
| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 |
<head>
<meta charset="UTF-8">
<meta name="description" content="ROCm release history">
<meta name="keywords" content="documentation, release history, ROCm, AMD">
</head>
# ROCm release history
| Version | Release date |
| ------- | ------------ |
| [6.0.0](https://rocm.docs.amd.com/en/docs-6.0.0/) | Dec 15, 2023 |
| [5.7.1](https://rocm.docs.amd.com/en/docs-5.7.1/) | Oct 13, 2023 |
| [5.7.0](https://rocm.docs.amd.com/en/docs-5.7.0/) | Sep 15, 2023 |
| [5.6.1](https://rocm.docs.amd.com/en/docs-5.6.1/) | Aug 29, 2023 |
| [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/) | Jun 28, 2023 |
| [5.5.1](https://rocm.docs.amd.com/en/docs-5.5.1/) | May 24, 2023 |
| [5.5.0](https://rocm.docs.amd.com/en/docs-5.5.0/) | May 1, 2023 |
| [5.4.3](https://rocm.docs.amd.com/en/docs-5.4.3/) | Feb 7, 2023 |
| [5.4.2](https://rocm.docs.amd.com/en/docs-5.4.2/) | Jan 13, 2023 |
| [5.4.1](https://rocm.docs.amd.com/en/docs-5.4.1/) | Dec 15, 2022 |
| [5.4.0](https://rocm.docs.amd.com/en/docs-5.4.0/) | Nov 30, 2022 |
| [5.3.3](https://rocm.docs.amd.com/en/docs-5.3.3/) | Nov 17, 2022 |
| [5.3.2](https://rocm.docs.amd.com/en/docs-5.3.2/) | Nov 9, 2022 |
| [5.3.0](https://rocm.docs.amd.com/en/docs-5.3.0/) | Oct 4, 2022 |
| [5.2.3](https://rocm.docs.amd.com/en/docs-5.2.3/) | Aug 18, 2022 |
| [5.2.1](https://rocm.docs.amd.com/en/docs-5.2.1/) | Jul 21, 2022 |
| [5.2.0](https://rocm.docs.amd.com/en/docs-5.2.0/) | Jun 28, 2022 |
| [5.1.3](https://rocm.docs.amd.com/en/docs-5.1.3/) | May 20, 2022 |
| [5.1.1](https://rocm.docs.amd.com/en/docs-5.1.1/) | Apr 8, 2022 |
| [5.1.0](https://rocm.docs.amd.com/en/docs-5.1.0/) | Mar 30, 2022 |
| [5.0.2](https://rocm.docs.amd.com/en/docs-5.0.2/) | Mar 4, 2022 |
| [5.0.1](https://rocm.docs.amd.com/en/docs-5.0.1/) | Feb 16, 2022 |
| [5.0.0](https://rocm.docs.amd.com/en/docs-5.0.0/) | Feb 9, 2022 |

View File

@@ -7,60 +7,39 @@ root: index
subtrees:
- entries:
- file: what-is-rocm.md
- file: about/whats-new/whats-new.md
- caption: Installation
entries:
- file: install/windows/install-quick.md
title: Quick start (Windows)
- file: install/windows/install.md
title: Windows install guide
subtrees:
- entries:
- file: install/windows/windows-app-deployment-guidelines.md
title: Application deployment guidelines
- file: install/docker.md
title: ROCm Docker containers
- file: install/pytorch-install.md
title: PyTorch for ROCm
- file: install/tensorflow-install.md
title: Tensorflow for ROCm
- file: install/magma-install.md
title: MAGMA for ROCm
- file: install/spack-intro.md
title: ROCm & Spack
- caption: Compatibility & support
entries:
- file: about/compatibility/linux-support.md
title: Linux (GPU & OS)
- file: about/compatibility/windows-support.md
title: Windows (GPU & OS)
- file: about/compatibility/3rd-party-support-matrix.md
title: Third-party
- file: about/compatibility/user-kernel-space-compat-matrix.md
title: User/kernel space support
- file: about/compatibility/docker-image-support-matrix.rst
title: Docker
- file: about/compatibility/openmp.md
title: OpenMP
- caption: Release information
entries:
- file: about/release-notes.md
title: Release notes
- file: about/CHANGELOG.md
title: Changelog
- file: about/release-history.md
title: Release history
subtrees:
- entries:
- file: about/CHANGELOG.md
title: Changelog
- url: https://github.com/RadeonOpenCompute/ROCm/labels/Verified%20Issue
title: Known issues
- caption: Install
entries:
- url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/
title: ROCm on Linux
- url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/
title: HIP SDK on Windows
- caption: Supported configurations
entries:
- url: https://rocm.docs.amd.com/projects/install-on-linux/en/${branch}/reference/system-requirements.html
title: Linux
- url: https://rocm.docs.amd.com/projects/install-on-windows/en/${branch}/reference/system-requirements.html
title: Windows
- caption: Reference
entries:
- file: reference/library-index.md
title: API libraries & tools
- caption: How-to
entries:
- file: how-to/deep-learning-rocm.md
title: Deep learning
- file: how-to/gpu-enabled-mpi.md
- file: how-to/gpu-enabled-mpi.rst
title: Using MPI
- file: how-to/system-debugging.md
title: Debugging
@@ -77,79 +56,6 @@ subtrees:
- url: https://github.com/amd/rocm-examples
title: GitHub examples
- caption: Reference
entries:
- file: reference/library-index.md
title: API libraries & tools
subtrees:
- entries:
- url: ${project:composable_kernel}
title: Composable kernel
- url: ${project:hipblas}
title: hipBLAS
- url: ${project:hipblaslt}
title: hipBLASLt
- url: ${project:hipcc}
title: hipCC
- url: ${project:hipcub}
title: hipCUB
- url: ${project:hipfft}
title: hipFFT
- url: ${project:hipify}
title: HIPIFY
- url: ${project:hiprand}
title: hipRAND
- url: ${project:hip}
title: HIP runtime
- url: ${project:hipsolver}
title: hipSOLVER
- url: ${project:hipsparse}
title: hipSPARSE
- url: ${project:hipsparselt}
title: hipSPARSELt
- url: ${project:hiptensor}
title: hipTensor
- url: ${project:miopen}
title: MIOpen
- url: ${project:amdmigraphx}
title: MIGraphX
- url: ${project:rccl}
title: RCCL
- url: ${project:rocalution}
title: rocALUTION
- url: ${project:rocblas}
title: rocBLAS
- url: ${project:rocdbgapi}
title: ROCdbgapi
- url: ${project:rocfft}
title: rocFFT
- file: reference/rocmcc.md
title: ROCmCC
- url: ${project:rdc}
title: ROCm Data Center Tool
- url: ${project:rocm_smi_lib}
title: ROCm SMI LIB
- url: ${project:rocmvalidationsuite}
title: ROCm validation suite
- url: ${project:rocprim}
title: rocPRIM
- url: ${project:rocprofiler}
title: ROCProfiler
- url: ${project:rocrand}
title: rocRAND
- url: ${project:rocsolver}
title: rocSOLVER
- url: ${project:rocsparse}
title: rocSPARSE
- url: ${project:rocthrust}
title: rocThrust
- url: ${project:roctracer}
title: rocTracer
- url: ${project:rocwmma}
title: rocWMMA
- url: ${project:transferbench}
title: TransferBench
- caption: Conceptual
entries:
- file: conceptual/gpu-arch.md
@@ -178,12 +84,14 @@ subtrees:
title: GPU memory
- file: conceptual/compiler-disambiguation.md
title: Compiler disambiguation
- file: about/compatibility/openmp.md
title: OpenMP
- file: conceptual/file-reorg.md
title: File structure (Linux FHS)
- file: conceptual/gpu-isolation.md
title: GPU isolation techniques
- file: conceptual/using-gpu-sanitizer.md
title: LLVN ASan
title: LLVM ASan
- file: conceptual/cmake-packages.rst
title: Using CMake
- file: conceptual/More-about-how-ROCm-uses-PCIe-Atomics.rst
@@ -196,14 +104,19 @@ subtrees:
- caption: Contribute
entries:
- file: contribute/index.md
title: Contribute to ROCm docs
title: Contribute to ROCm
subtrees:
- entries:
- file: contribute/toolchain.md
title: Documentation tools
- file: contribute/building.md
title: Building documentation
- file: contribute/feedback.md
title: Providing feedback
- file: contribute/contribute-docs.md
title: Contribute to ROCm docs
subtrees:
- entries:
- file: contribute/toolchain.md
title: Documentation tools
- file: contribute/building.md
title: Building documentation
- file: contribute/feedback.md
title: Provide feedback
- file: about/license.md
title: ROCm license

View File

@@ -1 +1 @@
rocm-docs-core==0.26.0
rocm-docs-core==0.33.0

View File

@@ -1,5 +1,5 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements.in
@@ -40,17 +40,17 @@ fastjsonschema==2.16.3
# via rocm-docs-core
gitdb==4.0.10
# via gitpython
gitpython==3.1.30
gitpython==3.1.41
# via rocm-docs-core
idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==7.0.0
# via sphinx
importlib-resources==6.1.0
importlib-resources==6.1.1
# via rocm-docs-core
jinja2==3.1.2
jinja2==3.1.3
# via
# myst-parser
# sphinx
@@ -84,7 +84,9 @@ pygments==2.15.0
# pydata-sphinx-theme
# sphinx
pyjwt[crypto]==2.6.0
# via pygithub
# via
# pygithub
# pyjwt
pynacl==1.5.0
# via pygithub
pytz==2022.7.1
@@ -98,7 +100,7 @@ requests==2.31.0
# via
# pygithub
# sphinx
rocm-docs-core==0.26.0
rocm-docs-core==0.33.0
# via -r requirements.in
smmap==5.0.0
# via gitdb

View File

@@ -18,7 +18,7 @@ Installation of various deep learning frameworks and applications.
:::
:::{grid-item-card}
**[GPU-enabled MPI](./gpu-enabled-mpi.md)**
**[GPU-enabled MPI](./gpu-enabled-mpi.rst)**
This chapter exemplifies how to set up Open MPI with the ROCm platform.

View File

@@ -29,11 +29,11 @@ To implement a workaround, follow these steps:
roc-obj-ls -v $TORCHDIR/lib/libtorch_hip.so # check for gfx target
```
```{note}
:::{note}
Recompile PyTorch with the right gfx target if compiling from the source if
the hardware is not supported. For wheels or Docker installation, contact
ROCm support [^ROCm_issues].
```
:::
**Q: Why am I unable to access Docker or GPU in user accounts?**
@@ -43,7 +43,7 @@ described in the ROCm Installation Guide at {ref}`linux_group_permissions`.
**Q: Can I install PyTorch directly on bare metal?**
Ans: Bare-metal installation of PyTorch is supported through wheels. Refer to
Option 2: Install PyTorch Using Wheels Package. See [Installing PyTorch](../install/pytorch-install.md) for more information.
Option 2: Install PyTorch Using Wheels Package. See {doc}`PyTorch for ROCm<rocm-install-on-linux:pytorch-install>` for more information.
**Q: How do I profile PyTorch workloads?**

View File

@@ -1,3 +1,9 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="What is ROCm">
<meta name="keywords" content="documentation, projects, introduction, ROCm, AMD">
</head>
# What is ROCm?
ROCm is an open-source stack, composed primarily of open-source software, designed for
@@ -19,6 +25,11 @@ ROCm supports programming models, such as OpenMP and OpenCL, and includes all ne
source software compilers, debuggers, and libraries. ROCm is fully integrated into machine learning
(ML) frameworks, such as PyTorch and TensorFlow.
```{tip}
If you're using Radeon GPUs, refer to the
{doc}`Radeon-specific ROCm documentation<radeon:index>`
```
## ROCm projects
ROCm consists of the following drivers, development tools, and APIs.
@@ -70,7 +81,7 @@ ROCm consists of the following drivers, development tools, and APIs.
| [ROCProfiler](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/profiler_home_page.html) | A profiling tool for HIP applications |
| [rocRAND](https://rocm.docs.amd.com/projects/rocRAND/en/latest/) | Provides functions that generate pseudorandom and quasirandom numbers |
| [ROCR-Runtime](https://github.com/RadeonOpenCompute/ROCR-Runtime/) | User-mode API interfaces and libraries necessary for host applications to launch compute kernels on available HSA ROCm kernel agents |
| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on the ROCm platform, implemented in the HIP programming language and optimized for AMDs latest discrete GPUs |
| [rocSOLVER](https://rocm.docs.amd.com/projects/rocSOLVER/en/latest/) | An implementation of LAPACK routines on ROCm software, implemented in the HIP programming language and optimized for AMDs latest discrete GPUs |
| [rocSPARSE](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/) | Exposes a common interface that provides BLAS for sparse computation implemented on ROCm runtime and toolchains (in the HIP programming language) |
| [rocThrust](https://rocm.docs.amd.com/projects/rocThrust/en/latest/) | A parallel algorithm library |
| [ROCT-Thunk-Interface](https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/) | User-mode API interfaces used to interact with the ROCk driver |

View File

@@ -2,6 +2,7 @@
## Pre-requisites
* Python 3.10
* Create a GitHub Personal Access Token.
* Tested with all the read-only permissions, but public_repo, read:project, read:user, and repo:status should be enough.
* Copy the token somewhere safe.
@@ -17,23 +18,16 @@
* Run this for 5.6.0 (change for whatever version you require)
* `GITHUB_ACCESS_TOKEN=my_token_here`
<<<<<<< HEAD
To generate the changelog from 5.0.0 up to and including 5.7.0:
To generate the changelog from 5.0.0 up to and including 6.0.1:
```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.0
=======
To generate the changelog from 5.0.0 up to and including 5.7.1:
```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.1
>>>>>>> roc-5.7.x
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --do-previous --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
```
To generate the changelog only for 5.7.1:
To generate the changelog only for 6.0.1:
```sh
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-5.7 5.7.1
python3 tag_script.py -t $GITHUB_ACCESS_TOKEN --no-release --no-pulls --compile_file ../../CHANGELOG.md --branch release/rocm-rel-6.0 6.0.1
```
### Notes

View File

@@ -227,9 +227,9 @@ def run_tagging():
# Creates a collection of ROCm libraries grouped by release.
release_bundle_factory = ReleaseBundleFactory(
"RadeonOpenCompute/ROCm",
"ROCm/ROCm",
Github(**gh_args), Github(**pr_args),
"RadeonOpenCompute",
"ROCm",
remote_map,
args.branch
)

View File

@@ -1,4 +1,4 @@
# Release Notes
# Release notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->
@@ -16,7 +16,7 @@
<!-- spellcheck-disable -->
The release notes for the ROCm platform.
This page contains the release notes for AMD ROCm Software.
{%- for version, release in releases %}
@@ -27,7 +27,7 @@ The release notes for the ROCm platform.
{%- set rocm_changes = "./rocm_changes/" ~ version ~ ".md" %}
{% include rocm_changes ignore missing %}
### Library Changes in ROCM {{version}}
### Library changes in ROCM {{version}}
| Library | Version |
|---------|---------|

View File

@@ -16,27 +16,27 @@ Refer to the HIP Installation Guide v5.0 for more details.
Managed memory, including the `__managed__` keyword, is now supported in the HIP combined host/device compilation. Through unified memory allocation, managed memory allows data to be shared and accessible to both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux Heterogeneous Memory Management (HMM) mechanism. The user can call managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on a device, and fetch data between the host and device as needed.
> **Note**
>
> In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,
>
> ```cpp
> int managed_memory = 0;
> HIPCHECK(hipDeviceGetAttribute(&managed_memory,
> hipDeviceAttributeManagedMemory,p_gpuDevice));
> if (!managed_memory ) {
> printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
> }
> else {
> HIPCHECK(hipSetDevice(p_gpuDevice));
> HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
> . . .
> }
> ```
:::{note}
In a HIP application, it is recommended to do a capability check before calling the managed memory APIs. For example,
> **Note**
>
> The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
```cpp
int managed_memory = 0;
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
hipDeviceAttributeManagedMemory,p_gpuDevice));
if (!managed_memory ) {
printf ("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
}
else {
HIPCHECK(hipSetDevice(p_gpuDevice));
HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
. . .
}
```
:::
:::{note}
The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc will fall back to using system memory. Other managed memory API calls will, then, have
:::
Refer to the HIP API documentation for more details on managed memory APIs.
@@ -264,13 +264,17 @@ typedef enum hipDeviceAttribute_t {
#### Incorrect dGPU behavior when using AMDVBFlash tool
The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime power management feature, puts the dGPU to a sleep state.
The AMDVBFlash tool, used for flashing the VBIOS image to dGPU, does not communicate with the
ROM Controller specifically when the driver is present. This is because the driver, as part of its runtime
power management feature, puts the dGPU to a sleep state.
As a workaround, users can run amdgpu.runpm=0, which temporarily disables the runtime power management feature from the driver and dynamically changes some power control-related sysfs files.
As a workaround, users can run amdgpu.runpm=0, which temporarily disables the runtime power
management feature from the driver and dynamically changes some power control-related sysfs files.
#### Issue with START timestamp in ROCProfiler
Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple counters. ROCProfiler outputs the following four timestamps for each kernel:
Users may encounter an issue with the enabled timestamp functionality for monitoring one or multiple
counters. ROCProfiler outputs the following four timestamps for each kernel:
* Dispatch
* Start
@@ -279,7 +283,8 @@ Users may encounter an issue with the enabled timestamp functionality for monito
##### Issue
This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than the Dispatch timestamp.
This defect is related to the Start timestamp functionality, which incorrectly shows an earlier time than
the Dispatch timestamp.
To reproduce the issue,
@@ -301,20 +306,22 @@ The correct order is:
Dispatch < Start < End < Complete
Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect timestamp with counter collection enabled.
Users cannot use ROCProfiler to measure the time spent on each kernel because of the incorrect
timestamp with counter collection enabled.
##### Recommended workaround
Users are recommended to collect kernel execution timestamps without monitoring counters, as follows:
Users are recommended to collect kernel execution timestamps without monitoring counters, as
follows:
1. Enable timing using the --timestamp on flag, and run the application.
2. Rerun the application using the -i option with the input filename that contains the name of the counter(s) to monitor, and save this to a different output file using the -o flag.
2. Rerun the application using the -i option with the input filename that contains the name of the
counter(s) to monitor, and save this to a different output file using the -o flag.
3. Check the output result file from step 1.
4. The order of timestamps correctly displays as:
DispatchNS < BeginNS < EndNS < CompleteNS
4. The order of timestamps correctly displays as: DispatchNS < BeginNS < EndNS < CompleteNS
5. Users can find the values of the collected counters in the output file generated in step 2.
@@ -322,17 +329,21 @@ Users are recommended to collect kernel execution timestamps without monitoring
##### No support for SMI and ROCDebugger on SRIOV
System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment on any GPU. For more information, refer to the Systems Management Interface documentation.
System Management Interface (SMI) and ROCDebugger are not supported in the SRIOV environment
on any GPU. For more information, refer to the Systems Management Interface documentation.
### Deprecations and warnings
#### ROCm libraries changes deprecations and deprecation removal
* The hipFFT.h header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get hipFFT.h in the rocFFT package too.
* The `hipFFT.h` header is now provided only by the hipFFT package. Up to ROCm 5.0, users would get
`hipFFT.h` in the rocFFT package too.
* The GlobalPairwiseAMG class is now entirely removed; users should use the PairwiseAMG class instead.
* The rocsparse_spmm signature in 5.0 was changed to match that of rocsparse_spmm_ex. In 5.0, rocsparse_spmm_ex is still present, but deprecated. Signature diff for rocsparse_spmm
rocsparse_spmm in 5.0
In this release, arithmetic operators of HIP complex and vector types are deprecated.
* As alternatives to arithmetic operators of HIP complex types, users can use arithmetic operators of `std::complex` types.
* As alternatives to arithmetic operators of HIP vector types, users can use the operators of the native clang vector type associated with the data member of HIP vector types.
During the deprecation, two macros `_HIP_ENABLE_COMPLEX_OPERATORS` and `_HIP_ENABLE_VECTOR_OPERATORS` are provided to allow users to conditionally enable arithmetic operators of HIP complex or vector types.
Note that the two macros are mutually exclusive and are off by default.
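For illustration, a minimal sketch of opting back in to the deprecated complex-type operators (the kernel and types here are assumptions, not part of the original notes):

```cpp
// Hedged sketch: enable the deprecated HIP complex-type arithmetic operators.
// The macro is assumed to take effect only if defined before the HIP headers.
#define _HIP_ENABLE_COMPLEX_OPERATORS
#include <hip/hip_complex.h>
#include <hip/hip_runtime.h>

__global__ void scale(hipFloatComplex* data, hipFloatComplex s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * s;  // operator* is available only while the macro is set
    }
}
```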
Refer to the HIP API Guide for more information.
#### Warning - compiler-generated code object version 4 deprecation
Support for loading compiler-generated code object version 4 will be deprecated in a future release with no release announcement and replaced with code object 5 as the default version.
The current default is code object version 4.

#### Refactor of HIPCC/HIPCONFIG
In prior ROCm releases, by default, the hipcc/hipconfig Perl scripts were used to identify and set target compiler options, target platform, compiler, and runtime appropriately.
In ROCm v5.0.1, `hipcc.bin` and `hipconfig.bin` have been added as the compiled binary implementations of hipcc and hipconfig. These new binaries are currently a work in progress and are marked as experimental. ROCm plans to fully transition to `hipcc.bin` and `hipconfig.bin` in a future ROCm release. The existing hipcc and hipconfig Perl scripts are renamed to `hipcc.pl` and `hipconfig.pl`, respectively. New top-level hipcc and hipconfig Perl scripts have been created, which can switch between the Perl script and the compiled binary based on the environment variable `HIPCC_USE_PERL_SCRIPT`.
In ROCm 5.0.1, by default, this environment variable is set to use hipcc and hipconfig through the Perl scripts.
The Perl scripts will no longer be available in ROCm in a future release.

<!-- markdownlint-disable first-line-h1 -->
### Defect fixes
The following defects are fixed in the ROCm v5.0.2 release.
#### Issue with hostcall facility in HIP runtime
In ROCm v5.0, when using the `assert()` call in a HIP kernel, the compiler may sometimes fail to emit kernel metadata related to the hostcall facility, which results in incomplete initialization of the hostcall facility in the HIP runtime. This can cause the HIP kernel to crash when it attempts to execute the `assert()` call.
The root cause was an incorrect check in the compiler to determine whether the hostcall facility is required by the kernel. This is fixed in the ROCm v5.0.2 release.
The resolution includes a compiler change, which emits the required metadata by default, unless the compiler can prove that the hostcall facility is not required by the kernel. This ensures that the `assert()` call never fails.
:::{note}
This fix may lead to breakage in some OpenMP offload use cases, which use print inside a target region and result in an abort in device code. The issue will be fixed in a future release.
:::
The compatibility matrix in the [Deep-learning guide](./how-to/deep-learning-rocm.md) is updated for ROCm v5.0.2.

The ROCm v5.1 release consists of the following HIP enhancements.
##### HIP installation guide updates
The HIP installation guide now includes information on installing and building HIP from source on AMD and NVIDIA platforms.
Refer to the HIP Installation Guide v5.1 for more details.
ROCm v5.1 extends support for HIP Graph.
###### Separation of hiprtc (libhiprtc) library from hip runtime (amdhip64)
On ROCm/Linux, to maintain backward compatibility, the HIP runtime library (amdhip64) will continue to include hiprtc symbols in future releases. The backward-compatible support may be discontinued by removing hiprtc symbols from the amdhip64 library in the next major release.
###### hipDeviceProp_t structure enhancements
Changes to the hipDeviceProp_t structure in the next major release may result in backward incompatibility. More details on these changes will be provided in subsequent releases.
#### ROCDebugger enhancements
The compiler now generates source-level variable and function argument debug information.
The accuracy is guaranteed if the compiler options `-g -O0` are used and apply only to HIP.
This enhancement enables ROCDebugger users to interact with the HIP source-level variables and function arguments.
:::{note}
The newly suggested compiler `-g` option must be used instead of the previously suggested `-ggdb` option. Although the effect of these two options is currently equivalent, this is not guaranteed for the future, as changes might be made by the upstream LLVM community.
:::
##### Machine interface lanes support
ROCDebugger Machine Interface (MI) extends support to lanes, which includes the following enhancements:
* Added a new -lane-info command, listing the current thread's lanes.
```
-thread-select -l LANE THREAD
```
* The =thread-selected notification gained a lane-id attribute. This enables the frontend to know which lane of the thread was selected.
* The *stopped asynchronous record gained lane-id and hit-lanes attributes. The former indicates which lane is selected, and the latter indicates which lanes explain the stop.
* MI commands now accept a global --lane option, similar to the global --thread and --frame options.
* MI varobjs are now lane-aware.
For more information, refer to the ROC Debugger User Guide at {doc}`ROCgdb <rocgdb:index>`.
##### Enhanced - clone-inferior command
The clone-inferior command now ensures that the TTY, CMD, ARGS, and AMDGPU PRECISE-MEMORY settings are copied from the original inferior to the new one. All modifications to the environment variables done using the 'set environment' or 'unset environment' commands are also copied to the new inferior.
#### MIOpen support for RDNA GPUs
This release includes support for AMD Radeon™ Pro W6800, in addition to other bug fixes and performance improvements as listed below:
* MIOpen now supports RDNA GPUs!! (via MIOpen PRs 973, 780, 764, 740, 739, 677, 660, 653, 493, 498)
For more information, see {doc}`Documentation <miopen:index>`.
#### Checkpoint restore support with CRIU
The new Checkpoint Restore in Userspace (CRIU) functionality is implemented to support AMD GPU and ROCm applications.
CRIU is a userspace tool to Checkpoint and Restore an application.
CRIU previously lacked support for checkpointing and restoring applications that used device files such as a GPU. With this ROCm release, CRIU is enhanced with a new plugin to support AMD GPUs, which includes:
* Single and Multi GPU systems (Gfx9)
* Checkpoint / Restore on a different system
* TensorFlow
* Using CRIU Image Streamer
For more information, refer to <https://github.com/checkpoint-restore/criu/tree/criu-dev/plugins/amdgpu>
:::{note}
The CRIU plugin (amdgpu_plugin) is merged upstream with the CRIU repository. The KFD kernel patches are also available upstream with the amd-staging-drm-next branch (public) and the ROCm 5.1 release branch.
:::
:::{note}
This is a Beta release of the Checkpoint and Restore functionality, and some features are not available in this release.
:::
For more information, refer to the following websites:
* <https://criu.org/Main_Page>
### Defect fixes
The following defects are fixed in this release.
The issue with the driver failing to load after ROCm installation is now fixed.
The driver installs successfully, and the server reboots with working rocminfo and clinfo.
#### ROCDebugger defect fixes
##### Breakpoints in GPU kernel code before kernel is loaded
Previously, setting a breakpoint in device code by line number before the device code was loaded into the program resulted in ROCgdb incorrectly moving the breakpoint to the first following line that contains host code.
Now, the breakpoint is left pending. When the GPU kernel gets loaded, the breakpoint resolves to a location in the kernel.
##### Registers invalidated after write
Previously, the stale just-written value was presented as a current value.
ROCgdb now invalidates the cached values of registers whose content might differ after being written. For example, registers with read-only bits.
ROCgdb also invalidates all volatile registers when a volatile register is written. For example, writing VCC invalidates the content of STATUS as STATUS.VCCZ may change.
##### Scheduler-locking and GPU wavefronts
When scheduler-locking is in effect (for example, via the "set scheduler-locking" command), new wavefronts created by a resumed thread (CPU or GPU wavefront) are held in the halt state.
##### ROCDebugger fails before completion of kernel execution
It was possible (although erroneous) for a debugger to load GPU code in memory, send it to the device, start executing a kernel on the device, and dispose of the original code before the kernel had finished execution. If a breakpoint was hit after this point, the debugger failed with an internal error while trying to access the debug information.
This issue is now fixed by ensuring that the debugger keeps a local copy of the original code and debug information.
### Known issues
#### Random memory access fault errors observed while running math libraries unit tests
**Issue:** Random memory access fault issues are observed while running math libraries unit tests. This issue is encountered in ROCm v5.0, ROCm v5.0.1, and ROCm v5.0.2.
Note, the faults only occur in the SRIOV environment.
#### CU masking causes application to freeze
Using CU Masking results in an application freeze or exceptionally slow performance. This issue is observed only in the GFX10 suite of products.
This issue is under active investigation at this time.
#### Failed checkpoint in Docker containers
A defect with Ubuntu images kernel-5.13-30-generic and kernel-5.13-35-generic with Overlay FS results in incorrect reporting of the mount ID.
This issue with Ubuntu causes CRIU checkpointing to fail in Docker containers.
As a workaround, use an older version of the kernel. For example, Ubuntu 5.11.0-
#### Issue with restoring workloads using cooperative groups feature
Workloads that use the cooperative groups function to ensure all waves can be resident at the same time may fail to restore correctly. This issue is under investigation and will be fixed in a future release.
#### Radeon Pro V620 and W6800 workstation GPUs

The ROCm v5.2 release consists of the following HIP enhancements:
##### HIP installation guide updates
The HIP Installation Guide is updated to include building HIP tests from source on the AMD and NVIDIA platforms.
For more details, refer to the HIP Installation Guide v5.2.
##### Support for device-side malloc on HIP-Clang
HIP-Clang now supports device-side malloc. This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallocHeapSize, value)`, nor does it respect any such setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.
The test codes at the following link show how to implement applications using malloc and free functions in device kernels:
<https://github.com/ROCm-Developer-Tools/HIP/blob/develop/tests/src/deviceLib/hipDeviceMalloc.cpp>
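As a minimal sketch of the feature (the kernel and sizes are illustrative, not taken from the linked test code):

```cpp
// Hedged sketch: device-side malloc/free inside a HIP kernel under HIP-Clang.
#include <hip/hip_runtime.h>

__global__ void scratchKernel(int n) {
    // Each thread takes a scratch buffer from the dynamic device heap.
    int* buf = static_cast<int*>(malloc(n * sizeof(int)));
    if (buf != nullptr) {
        for (int i = 0; i < n; ++i) {
            buf[i] = i;
        }
        free(buf);  // return the buffer to the device heap
    }
}
```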
##### New HIP APIs in this release
The following new HIP APIs are available in the ROCm v5.2 release. Note that this is a pre-official version (beta) release of the new APIs:
###### Device management HIP APIs
The new device management HIP APIs are as follows:

```cpp
hipError_t hipDeviceGetUuid(hipUUID* uuid, hipDevice_t device);
```
Note that this new API corresponds to the following CUDA API:
```cpp
CUresult cuDeviceGetUuid(CUuuid* uuid, CUdevice dev);
```
* Gets default memory pool of the specified device
###### New HIP runtime APIs in memory management
The new Stream Ordered Memory Allocator functions of HIP runtime APIs in memory management are:
* Allocates memory with stream ordered semantics
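As a hedged sketch of the stream-ordered semantics (assuming `hipMallocAsync`/`hipFreeAsync` are the corresponding entry points; error handling omitted):

```cpp
// Hedged sketch: allocation and free ordered on a stream.
#include <hip/hip_runtime.h>

int main() {
    hipStream_t stream;
    hipStreamCreate(&stream);

    void* buf = nullptr;
    hipMallocAsync(&buf, 1 << 20, stream);  // usable by later work on this stream
    // ... enqueue kernels on `stream` that use buf ...
    hipFreeAsync(buf, stream);              // freed after prior stream work completes

    hipStreamSynchronize(stream);
    hipStreamDestroy(stream);
    return 0;
}
```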
The new HIP Graph Management APIs are as follows:
* Gets a node attribute
```cpp
hipError_t hipGraphKernelNodeGetAttribute(hipGraphNode_t hNode, hipKernelNodeAttrID attr, hipKernelNodeAttrValue* value);
```
###### Support for virtual memory management APIs
The new APIs for virtual memory management are as follows:
* Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays
```cpp
hipError_t hipMemMapArrayAsync(hipArrayMapInfo* mapInfoList, unsigned int count, hipStream_t stream);
```
* Releases a memory handle representing a memory allocation that was previously allocated through hipMemCreate
For more information, refer to the HIP API documentation at
{doc}`hip:doxygen/html/modules`.
##### Planned HIP changes in future releases
Changes to `hipDeviceProp_t`, `HIPMEMCPY_3D`, and `hipArray` structures (and related HIP APIs) are planned in the next major release. These changes may impact backward compatibility.
Refer to the release notes in subsequent releases for more information.
#### ROCm math and communication libraries

In this release, ROCm math and communication libraries consist of the following enhancements and fixes:

* New rocWMMA for matrix multiplication and accumulation operations acceleration

This release introduces a new ROCm C++ library for accelerating mixed-precision matrix multiplication and accumulation (MFMA) operations leveraging specialized GPU matrix cores. rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as the primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
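For a flavor of the API, a minimal sketch of a 16x16x16 fragment pipeline follows (the block size, data types, and layouts are assumptions, not prescriptions from these notes):

```cpp
// Hedged sketch: one wavefront computing a 16x16x16 block with rocWMMA fragments.
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

__global__ void blockGemm(const float16_t* a, const float16_t* b, float32_t* c) {
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);
    rocwmma::load_matrix_sync(fragA, a, 16);  // leading dimension 16
    rocwmma::load_matrix_sync(fragB, b, 16);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    rocwmma::store_matrix_sync(c, fragAcc, 16, rocwmma::mem_row_major);
}
```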
For more information, refer to [Communication Libraries](../reference/library-index.md).
#### OpenMP enhancements in this release
##### OMPT target support
The OpenMP runtime in ROCm implements a subset of the OMPT device APIs, as described in the OpenMP specification document. These are APIs that allow first-party tools to examine the profile and traces for kernels that execute on a device. A tool may register callbacks for data transfer and kernel dispatch entry points. A tool may use APIs to start and stop tracing for device-related activities, such as data transfer and kernel dispatch timings and associated metadata. If device tracing is enabled, trace records for device activities are collected during program execution and returned to the tool using the APIs described in the specification.
Following is an example demonstrating how a tool would use the supported OMPT target APIs. The README in `/opt/rocm/llvm/examples/tools/ompt` outlines the steps to follow, and you can run the provided example as indicated below:
```sh
cd /opt/rocm/llvm/examples/tools/ompt/veccopy-ompt-target-tracing
make run
```
The file `veccopy-ompt-target-tracing.c` simulates how a tool would initiate device activity tracing. The file `callbacks.h` shows the callbacks that may be registered and implemented by the tool.
### Deprecations and warnings
#### Linux file system hierarchy standard for ROCm
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
##### New file system hierarchy
The following is the new file system hierarchy:
:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release.
:::
For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.
##### Backward compatibility with older file systems
ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic-link and wrapper header files in their old locations for backward compatibility.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::
##### Wrapper header files
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
```cpp
// Code snippet from hip_runtime.h
```
##### Library files
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Example:
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
##### CMake config files
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
Example:
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
#### Planned deprecation of hip-rocclr and hip-base packages
In the ROCm v5.2 release, hip-rocclr and hip-base packages (Debian and RPM) are planned for deprecation and will be removed in a future release. hip-runtime-amd and hip-dev(el) will replace these packages respectively. Users of hip-rocclr must install two packages, hip-runtime-amd and hip-dev, to get the same set of packages installed by hip-rocclr previously.
Currently, both package names hip-rocclr (or) hip-runtime-amd and hip-base (or) hip-dev(el) are supported.
#### Deprecation of integrated HIP directed tests

The integrated HIP directed tests, which are currently built by default, are deprecated in this release. The default building and execution support through CMake will be removed in a future release.
### Defect fixes

| Defect | Fix |
|--------|------|
| ROCmInfo does not list GPUs | Code fix |
| Hang observed while restoring cooperative group samples | Code fix |
| ROCM-SMI over SRIOV: Unsupported commands do not return proper error message | Code fix |
### Known issues
This section consists of known issues in this release.
##### Issue
A compiler error occurs when using the `-O0` flag to compile code for gfx1030 that calls `atomicAddNoRet`, which is defined in `amd_hip_atomic.h`. The compiler generates an illegal instruction for gfx1030.
##### Workaround
The workaround is not to use the `-O0` flag for this case. For higher optimization levels, the compiler does not generate an invalid instruction.
#### System freeze observed during CUDA memtest checkpoint
##### Issue
Checkpoint/Restore in Userspace (CRIU) requires approximately 20 MB of VRAM to checkpoint and restore. The CRIU process may freeze if the maximum amount of available VRAM is allocated to checkpoint applications.
##### Workaround
To use CRIU to checkpoint and restore your application, limit the amount of VRAM the application uses to ensure at least 20 MB is available.
#### HPC test fails with the “HSA_STATUS_ERROR_MEMORY_FAULT” error
##### Issue
The compiler may incorrectly compile a program that uses the `__shfl_sync(mask, value, srcLane)` function when the "value" parameter to the function is undefined along some path to the function. For most functions, uninitialized inputs cause undefined behavior, but the definition for `__shfl_sync` should allow for undefined values.
##### Workaround
The workaround is to initialize the parameters to `__shfl_sync`.
:::{note}
When the `-Wall` compilation flag is used, the compiler generates a warning indicating the variable is initialized along some path.
:::
Example:
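A minimal sketch of the workaround; the payload type and full-wavefront mask here are assumptions, and only the final shuffle line mirrors the original example:

```cpp
// Hedged sketch: every argument to __shfl_sync is initialized on all paths.
__global__ void broadcastLane0(float* data) {
    unsigned long long mask = 0xffffffffffffffffULL;  // assumed full wavefront
    float res = data[threadIdx.x];                    // initialized before the shuffle
    res = __shfl_sync(mask, res, 0);
    data[threadIdx.x] = res;
}
```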
##### Issue
In recent changes to Clang, insertion of the noundef attribute to all the function arguments has been enabled by default.
In the HIP kernel, the variable `var` in `shfl_sync` may not be initialized, so LLVM IR treats it as undef.
So, the function argument that is potentially undef (because it is not initialized) has always been assumed to be noundef by LLVM IR (since Clang has inserted the noundef attribute). This leads to ambiguous kernel execution.
##### Workaround
* Skip adding `noundef` attribute to functions tagged with convergent attribute. Refer to <https://reviews.llvm.org/D124158> for more information.
* Introduce a shuffle attribute and add it to `__shfl`-like APIs in HIP headers. Clang can skip adding the `noundef` attribute if it finds that the argument is tagged with the shuffle attribute. Refer to <https://reviews.llvm.org/D125378> for more information.
* Introduce clang builtin for `__shfl` to identify it and skip adding `noundef` attribute.
* Introduce `__builtin_freeze` to use on the relevant arguments in library wrappers. The library/header needs to insert freezes on the relevant inputs.
#### Issue with applications triggering oversubscription
There is a known issue with applications that trigger oversubscription. A hardware hang occurs when ROCgdb is used on AMD Instinct™ MI50 and MI100 systems.
This issue is under investigation and will be fixed in a future release.

#### Ubuntu 18.04 end-of-life announcement
Support for Ubuntu 18.04 ends in this release. Future releases of ROCm will not provide prebuilt packages for Ubuntu 18.04.
#### HIP runtime
##### Fixes
* A bug was discovered in the HIP graph capture implementation in the ROCm v5.2.0 release. If the same kernel is called twice (with different argument values) in a graph capture, the implementation only kept the argument values for the second kernel call.
* A bug was introduced in the hiprtc implementation in the ROCm v5.2.0 release. This bug caused the `hiprtcGetLoweredName` call to fail for named expressions with whitespace in it.
Example:
The named expression `my_sqrt<complex<double>>` passed but `my_sqrt<complex<double >>` failed.
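For context, a minimal hiprtc name-expression sketch (the kernel source and template argument here are illustrative, not the original reproducer):

```cpp
// Hedged sketch: resolving the lowered (mangled) name of a templated kernel.
#include <hip/hiprtc.h>

int main() {
    const char* src =
        "template <typename T> __global__ void my_sqrt(T* x) { *x = *x; }\n";
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, src, "prog.cu", 0, nullptr, nullptr);
    // The expression string must match exactly, including any whitespace.
    hiprtcAddNameExpression(prog, "my_sqrt<float>");
    hiprtcCompileProgram(prog, 0, nullptr);
    const char* lowered = nullptr;
    hiprtcGetLoweredName(prog, "my_sqrt<float>", &lowered);
    hiprtcDestroyProgram(&prog);
    return 0;
}
```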
#### RCCL
##### Additions
Compatibility with NCCL 2.12.10
* Added experimental support for using multiple ranks per device
* Requires using a new interface to create a communicator (ncclCommInitRankMulti); refer to the interface documentation for details.
* To avoid potential deadlocks, users might have to set an environment variable increasing the number of hardware queues. For example,
```sh
export GPU_MAX_HW_QUEUES=16
```
* Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
* When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll(for example, >=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the massive port usage issue
* When "Call to bind failed: Address already in use" error happens in large-scale AlltoAll (for example,
\>=64 MI200 nodes), users are suggested to opt-in either one or both of the options to resolve the
massive port usage issue
* Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
##### Removals
* Removed experimental clique-based kernels
#### Development tools
No notable changes in this release for development tools, including the compiler, profiler, and debugger.

#### Deployment and management tools

No notable changes in this release for deployment and management tools.
For release information for older ROCm releases, refer to <https://github.com/RadeonOpenCompute/ROCm/blob/master/CHANGELOG.md>

#### HIP Perl scripts deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
#### Linux file system hierarchy standard for ROCm
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
##### New file system hierarchy
The following is the new file system hierarchy:
:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next major release.
:::
For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.
##### Backward compatibility with older file systems
ROCm has moved header files and libraries to their new locations as indicated in the above structure and included symbolic-link and wrapper header files in their old locations for backward compatibility.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::
##### Wrapper header files
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
```cpp
// Code snippet from hip_runtime.h
```
##### Library files
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Example:
lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64.so
##### CMake config files
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
Example:
```
total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```
### Defect fixes
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in the ROCm v5.3 release.
#### Kernel produces incorrect results with ROCm 5.2
User code did not initialize certain data constructs, leading to a correctness issue. A strict reading of the C++ standard suggests that failing to initialize these data constructs is undefined behavior. However, a special case was added for a specific compiler builtin to handle the uninitialized data in a defined manner.
The compiler fix consists of the following patches:
* A new `noundef` attribute is added. This attribute denotes when a function call argument or return value may never contain uninitialized bits. For more information, see <https://reviews.llvm.org/D81678>
* The application of this attribute was refined such that it was not added to a specific compiler built-in where the compiler knows that inactive lanes do not impact program execution. For more information, see <https://github.com/RadeonOpenCompute/llvm-project/commit/accf36c58409268ca1f216cdf5ad812ba97ceccd>.
### Known issues
This section consists of known issues in this release.
#### Issue with OpenMP-extras package upgrade
The `openmp-extras` package has been split into runtime (`openmp-extras-runtime`) and dev (`openmp-extras-devel`) packages. This change has broken the upgrade support for the `openmp-extras` package in RHEL/SLES.
An available workaround in RHEL is to use the following command for upgrades:
#### AMD Instinct™ MI200 SRIOV virtualization issue
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads, but does not impact Discrete Device Assignment (DDA) or Bare Metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
#### System crash when IOMMU is enabled
If the input-output memory management unit (IOMMU) is enabled in SBIOS and ROCm is installed, the system may report the following failures or errors when running workloads such as bandwidth test, clinfo, and HelloWorld.cl, and may crash.
* IO PAGE FAULT
* IRQ remapping does not support X2APIC mode
* NMI error
Workaround: To avoid the system crash, add `amd_iommu=on iommu=pt` as the kernel bootparam, as indicated in the warning message.

<!-- markdownlint-disable first-line-h1 -->
### Defect fixes
The following known issues in ROCm v5.3.2 are fixed in this release.
#### Peer-to-peer DMA mapping errors with SLES and RHEL
Peer-to-Peer Direct Memory Access (DMA) mapping errors on Dell systems (R7525 and R750XA) with SLES 15 SP3/SP4 and RHEL 9.0 are fixed in this release.
Previously, running `rocminfo` resulted in Peer-to-Peer DMA mapping errors.
#### RCCL tuning table
The RCCL tuning table is updated for supported platforms.
#### SGEMM (F32 GEMM) routines in rocBLAS
Functional correctness failures in SGEMM (F32 GEMM) routines in rocBLAS for certain problem sizes and ranges are fixed in this release.
### Known issues
This section consists of known issues in this release.
#### AMD Instinct™ MI200 SRIOV virtualization issue
There is a known issue in this ROCm v5.3 release with all AMD Instinct™ MI200 devices running within a virtual function (VF) under SRIOV virtualization. This issue will likely impact the functionality of SRIOV-based workloads but does not impact Discrete Device Assignment (DDA) or bare metal.
Until a fix is provided, users should rely on ROCm v5.2.3 to support their SRIOV workloads.
Customers cannot update the Integrated Firmware Image (IFWI) for AMD Instinct™ MI200 accelerators.
An updated firmware maintenance bundle consisting of an installation tool and images specific to AMD Instinct™ MI200 accelerators is under planning and will be available soon.
#### Known issue with rocThrust and rocPRIM libraries
There is a known issue with rocThrust and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases.
* `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and `keys_input2`.
* `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and `keys_input2`.
This issue is currently under investigation and will be resolved in a future release.
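For reference, a minimal sketch of the affected pattern (the iterator choices here are illustrative):

```cpp
// Hedged sketch: thrust::merge with two different key iterator types.
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/merge.h>

int main() {
    thrust::device_vector<int> a(4, 2);   // keys_input1: device_vector iterator
    thrust::counting_iterator<int> b(0);  // keys_input2: counting iterator
    thrust::device_vector<int> out(8);
    thrust::merge(thrust::device, a.begin(), a.end(), b, b + 4, out.begin());
    return 0;
}
```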

<!-- markdownlint-disable first-line-h1 -->
### Defect fixes
#### Issue with rocTHRUST and rocPRIM libraries
There was a known issue with rocTHRUST and rocPRIM libraries supporting iterator and types in ROCm v5.3.x releases.
* `thrust::merge` no longer correctly supports different iterator types for `keys_input1` and `keys_input2`.
* `rocprim::device_merge` no longer correctly supports using different types for `keys_input1` and `keys_input2`.
This issue is resolved with the following fixes to compilation failures:

The ROCm v5.4 release consists of the following HIP enhancements:
##### Support for wall_clock64
A new timer function wall_clock64() is supported, which returns wall clock count at a constant frequency on the device.
A new timer function `wall_clock64()` is supported, which returns the wall clock count at a constant
frequency on the device.
```cpp
long long int wall_clock64();
```
It returns wall clock count at a constant frequency on the device, which can be queried via HIP API with the hipDeviceAttributeWallClockRate attribute of the device in the HIP application code.
It returns the wall clock count at a constant frequency on the device; the frequency can be queried via
the HIP API with the `hipDeviceAttributeWallClockRate` attribute of the device in the HIP application code.
Example:
@@ -25,19 +27,23 @@ int wallClkRate = 0; //in kilohertz
Where `hipDeviceAttributeWallClockRate` is a device attribute.
> **Note**
>
> The wall clock frequency is a per-device attribute.
:::{note}
The wall clock frequency is a per-device attribute.
:::
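For illustration, here is a minimal sketch (the kernel body is a placeholder, and a single device is
assumed) that combines `wall_clock64()` with the queried per-device clock rate:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void timed_kernel(long long int* ticks) {
    long long int start = wall_clock64();
    // ... the work being timed would go here ...
    *ticks = wall_clock64() - start;
}

int main() {
    int wallClkRate = 0; // in kilohertz; a per-device attribute
    (void)hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, 0);

    long long int* ticks = nullptr;
    (void)hipMallocManaged(reinterpret_cast<void**>(&ticks), sizeof(*ticks));
    timed_kernel<<<1, 1>>>(ticks);
    (void)hipDeviceSynchronize();

    // Convert device ticks to milliseconds using the per-device rate (kHz).
    printf("elapsed: %f ms\n", static_cast<double>(*ticks) / wallClkRate);
    (void)hipFree(ticks);
    return 0;
}
```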
##### New registry added for GPU_MAX_HW_QUEUES
The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues allocated per process per device.
The GPU_MAX_HW_QUEUES registry defines the maximum number of independent hardware queues
allocated per process per device.
The environment variable controls how many independent hardware queues HIP runtime can create per process, per device. If the application allocates more HIP streams than this number, then the HIP runtime reuses the same hardware queues for the new streams in a round-robin manner.
The environment variable controls how many independent hardware queues HIP runtime can create
per process, per device. If the application allocates more HIP streams than this number, then the HIP
runtime reuses the same hardware queues for the new streams in a round-robin manner.
> **Note**
>
> This maximum number does not apply to hardware queues created for CU-masked HIP streams or cooperative queues for HIP Cooperative Groups (there is only one queue per device).
:::{note}
This maximum number does not apply to hardware queues created for CU-masked HIP streams or
cooperative queues for HIP Cooperative Groups (there is only one queue per device).
:::
For more details, refer to the HIP Programming Guide.
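As a sketch of the behavior described above (assuming `GPU_MAX_HW_QUEUES=2` was exported before
launching the process; the stream count is arbitrary):

```cpp
#include <hip/hip_runtime.h>

int main() {
    // With GPU_MAX_HW_QUEUES=2 set in the environment, these four streams
    // share two hardware queues; the runtime maps them in round-robin order.
    hipStream_t streams[4];
    for (hipStream_t& s : streams) (void)hipStreamCreate(&s);

    // ... work enqueued on the four streams runs on at most two hardware queues ...

    for (hipStream_t& s : streams) (void)hipStreamDestroy(s);
    return 0;
}
```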
@@ -45,9 +51,9 @@ For more details, refer to the HIP Programming Guide.
The following new HIP APIs are available in the ROCm v5.4 release.
> **Note**
>
> This is a pre-official version (beta) release of the new APIs.
:::{note}
This is a pre-official version (beta) release of the new APIs.
:::
##### Error handling
@@ -81,7 +87,8 @@ This release consists of the following OpenMP enhancements:
* Enable new device RTL in libomptarget as default.
* New flag `-fopenmp-target-fast` to imply `-fopenmp-target-ignore-env-vars -fopenmp-assume-no-thread-state -fopenmp-assume-no-nested-parallelism`.
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is generated.
* Support for the collapse clause and non-unit stride in cases where the no-loop specialized kernel is
generated.
* Initial implementation of optimized cross-team sum reduction for float and double type scalars.
* Pool-based optimization in the OpenMP runtime to reduce locking during data transfer.
@@ -89,15 +96,24 @@ This release consists of the following OpenMP enhancements:
#### HIP Perl scripts deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
##### Linux file system hierarchy standard for ROCm
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
ROCm packages have adopted the Linux Foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.
##### New file system hierarchy
@@ -133,23 +149,27 @@ The following is the new file system hierarchy:
```
> **Note**
>
> ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next
major release.
:::
For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.
##### Backward compatibility with older file systems
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic links and wrapper header files in the old locations for backward compatibility.
> **Note**
>
> ROCm will continue supporting backward compatibility until the next major release.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::
##### Wrapper header files
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:
```cpp
// Code snippet from hip_runtime.h
@@ -166,7 +186,8 @@ The wrapper header files backward compatibility deprecation is as follows:
##### Library files
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Example:
@@ -179,7 +200,9 @@ lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64
##### CMake config files
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) contain
a soft link to the new CMake config.
Example:
@@ -189,37 +212,45 @@ total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```
### Fixed defects
### Defect fixes
The following defects are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
These defects were identified and documented as known issues in previous ROCm releases and are
fixed in this release.
#### Memory allocated using hipHostMalloc() with flags didn't exhibit fine-grain behavior
##### Issue
The test was incorrectly using the `hipDeviceAttributePageableMemoryAccess` device attribute to determine coherent support.
The test was incorrectly using the `hipDeviceAttributePageableMemoryAccess` device attribute to
determine coherent support.
##### Fix
`hipHostMalloc()` allocates memory with fine-grained access by default when the environment variable `HIP_HOST_COHERENT=1` is used.
`hipHostMalloc()` allocates memory with fine-grained access by default when the environment variable
`HIP_HOST_COHERENT=1` is used.
For more information, refer to {doc}`hip:.doxygen/docBin/html/index`.
For more information, refer to {doc}`hip:doxygen/html/index`.
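As a sketch of the fixed behavior (run with `HIP_HOST_COHERENT=1` set in the environment; the buffer
size is arbitrary):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // With HIP_HOST_COHERENT=1 set, this default allocation is fine-grained,
    // so host writes are visible to the device without extra synchronization.
    float* p = nullptr;
    if (hipHostMalloc(reinterpret_cast<void**>(&p), 1024 * sizeof(float),
                      hipHostMallocDefault) == hipSuccess) {
        p[0] = 1.0f;
        printf("fine-grained host allocation at %p\n", static_cast<void*>(p));
        (void)hipHostFree(p);
    }
    return 0;
}
```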
#### SoftHang with `hipStreamWithCUMask` test on AMD Instinct™
##### Issue
On GFX10 GPUs, kernel execution hangs when it is launched on streams created using `hipStreamWithCUMask`.
On GFX10 GPUs, kernel execution hangs when it is launched on streams created using
`hipStreamWithCUMask`.
##### Fix
On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units must be enabled as a pair. The `hipStreamWithCUMask` API unit test cases are updated to set compute unit mask (cuMask) in pairs for GFX10 GPUs.
On GFX10 GPUs, each workgroup processor encompasses two compute units, and the compute units
must be enabled as a pair. The `hipStreamWithCUMask` API unit test cases are updated to set compute
unit mask (cuMask) in pairs for GFX10 GPUs.
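For illustration, here is a sketch of setting the cuMask in pairs (assuming stream creation via
`hipExtStreamCreateWithCUMask`; the mask value is illustrative):

```cpp
#include <cstdint>
#include <hip/hip_runtime.h>

int main() {
    // On GFX10, one workgroup processor spans two compute units, so CU mask
    // bits must be set in adjacent pairs (here: CUs 0 and 1, one WGP).
    uint32_t cuMask = 0x3;
    hipStream_t stream;
    (void)hipExtStreamCreateWithCUMask(&stream, 1, &cuMask);

    // ... kernels launched on `stream` are restricted to the masked CUs ...

    (void)hipStreamDestroy(stream);
    return 0;
}
```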
#### ROCm tools GPU IDs
The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are globally unique and guaranteed to be consistent across APIs and processes.
The HIP language device IDs are not the same as the GPU IDs reported by the tools. GPU IDs are
globally unique and guaranteed to be consistent across APIs and processes.
GPU IDs reported by ROCTracer and ROCProfiler or ROCm Tools are HSA Driver Node ID of that GPU, as it is a unique ID for that device in that particular node.
GPU IDs reported by ROCTracer and ROCProfiler or ROCm tools are the HSA Driver Node ID of that
GPU, as it is a unique ID for that device in that particular node.

View File

@@ -9,9 +9,9 @@ The ROCm v5.4.1 release consists of the following new HIP API:
The following new HIP API is introduced in the ROCm v5.4.1 release.
> **Note**
>
> This is a pre-official version (beta) release of the new APIs.
:::{note}
This is a pre-official version (beta) release of the new APIs.
:::
```cpp
hipError_t hipLaunchHostFunc(hipStream_t stream, hipHostFn_t fn, void* userData);
@@ -25,30 +25,40 @@ This swaps the stream capture mode of a thread.
This function returns `hipSuccess` or `hipErrorInvalidValue`.
For more information, refer to the HIP API documentation at /bundle/HIP_API_Guide/page/modules.html.
For more information, refer to the HIP API documentation at
/bundle/HIP_API_Guide/page/modules.html.
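For illustration, here is a minimal usage sketch (the callback name and message are hypothetical):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Host callback the runtime invokes once prior work in the stream completes.
static void on_stream_done(void* userData) {
    printf("host callback reached: %s\n", static_cast<const char*>(userData));
}

int main() {
    hipStream_t stream;
    (void)hipStreamCreate(&stream);

    static char tag[] = "checkpoint-1";
    // Enqueue the host function; it runs after earlier work on `stream`, and
    // later work submitted to the stream waits until it returns.
    (void)hipLaunchHostFunc(stream, on_stream_done, tag);

    (void)hipStreamSynchronize(stream);
    (void)hipStreamDestroy(stream);
    return 0;
}
```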
### Deprecations and warnings
#### HIP Perl scripts deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
### IFWI fixes
These defects were identified and documented as known issues in previous ROCm releases and are fixed in this release.
AMD Instinct™ MI200 Firmware IFWI Maintenance Update #3
These defects were identified and documented as known issues in previous ROCm releases and are
fixed in this release.
#### AMD Instinct™ MI200 firmware IFWI maintenance update #3
This IFWI release fixes the following issue in AMD Instinct™ MI210/MI250 Accelerators.
After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded way resulting in application failures.
After prolonged periods of operation, certain MI200 Instinct™ Accelerators may perform in a degraded
way, resulting in application failures.
In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware installation tool AMD FW FLASH 1.2.
In this package, AMD delivers a new firmware version for MI200 GPU accelerators and a firmware
installation tool, AMD FW FLASH 1.2.
| GPU | Production Part Number | SKU | IFWI Name |
| GPU | Production part number | SKU | IFWI name |
|-------|------------|--------|---------------|
| MI210 | 113-D673XX | D67302 | D6730200V.110 |
| MI210 | 113-D673XX | D67301 | D6730100V.073 |
@@ -61,4 +71,5 @@ Instructions on how to download and apply MI200 maintenance updates are availabl
#### AMD Instinct™ MI200 SRIOV virtualization support
Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all AMD Instinct™ MI200 devices.
Maintenance update #3, combined with ROCm 5.4.1, now provides SRIOV virtualization support for all
AMD Instinct™ MI200 devices.

View File

@@ -3,23 +3,32 @@
#### HIP Perl scripts deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
#### `hipcc` options deprecation
The following hipcc options are being deprecated and will be removed in a future release:
* The `--amdgpu-target` option is being deprecated, and user must use the `offload-arch` option to specify the GPU architecture.
* The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM option `-mllvm -mcode-object-version` to debug issues related to code object versions.
* The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as the function calls are already supported in production on AMD GPUs.
* The `--amdgpu-target` option is being deprecated, and users must use the `--offload-arch` option to
specify the GPU architecture.
* The `--amdhsa-code-object-version` option is being deprecated. Users can use the Clang/LLVM
option `-mllvm -mcode-object-version` to debug issues related to code object versions.
* The `--hipcc-func-supp`/`--hipcc-no-func-supp` options are being deprecated, as the function calls
are already supported in production on AMD GPUs.
### Known issues
Under certain circumstances typified by high register pressure, users may encounter a compiler abort with one of the following error messages:
Under certain circumstances typified by high register pressure, users may encounter a compiler abort
with one of the following error messages:
* > `error: unhandled SGPR spill to memory`

View File

@@ -3,15 +3,24 @@
#### HIP Perl scripts deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
##### Linux file system hierarchy standard for ROCm
ROCm packages have adopted the Linux foundation file system hierarchy standard in this release to ensure ROCm components follow open source conventions for Linux-based distributions. While moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and backward compatibility.
ROCm packages have adopted the Linux Foundation file system hierarchy standard in this release to
ensure ROCm components follow open source conventions for Linux-based distributions. While
moving to a new file system hierarchy, ROCm ensures backward compatibility with its 5.1 version or
older file system hierarchy. See below for a detailed explanation of the new file system hierarchy and
backward compatibility.
##### New file system hierarchy
@@ -47,23 +56,27 @@ The following is the new file system hierarchy:
```
> **Note**
>
> ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next
major release.
:::
For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.
##### Backward compatibility with older file systems
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic links and wrapper header files in the old locations for backward compatibility.
> **Note**
>
> ROCm will continue supporting backward compatibility until the next major release.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::
##### Wrapper header files
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:
```cpp
// Code snippet from hip_runtime.h
@@ -80,7 +93,8 @@ The wrapper header files backward compatibility deprecation is as follows:
##### Library files
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Example:
@@ -93,7 +107,9 @@ lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64
##### CMake config files
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder. For
backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) contain
a soft link to the new CMake config.
Example:
@@ -103,7 +119,7 @@ total 0
lrwxrwxrwx 1 root root 42 May 10 23:32 hip-config.cmake -> ../../../../lib/cmake/hip/hip-config.cmake
```
### Fixed defects
### Defect fixes
#### Compiler improvements
@@ -117,6 +133,8 @@ In ROCm v5.4.3, improvements to the compiler address errors with the following s
#### Compiler option error at runtime
Some users may encounter a “Cannot find Symbol” error at runtime when using `-save-temps`. While most `-save-temps` use cases work correctly, this error may appear occasionally.
Some users may encounter a “Cannot find Symbol” error at runtime when using `-save-temps`. While
most `-save-temps` use cases work correctly, this error may appear occasionally.
This issue is under investigation, and the known workaround is not to use `-save-temps` when the error appears.
This issue is under investigation, and the known workaround is not to use `-save-temps` when the error
appears.

View File

@@ -15,12 +15,14 @@ Applications that need to update the stack size can use the hipDeviceSetLimit API.
The following hipcc changes are implemented in this release:
* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link time dependence for HIP programs.  Applications that depend on these libraries must explicitly link to them.
* `hipcc` will not implicitly link to `libpthread` and `librt`, as they are no longer a link-time dependency
for HIP programs. Applications that depend on these libraries must explicitly link to them.
* `-use-staticlib` and `-use-sharedlib` options are deprecated.
##### Future changes
* Separation of `hipcc` binaries (Perl scripts) from HIP to `hipcc` project. Users will access separate `hipcc` package for installing `hipcc` binaries in future ROCm releases.
* Separation of `hipcc` binaries (Perl scripts) from HIP to the `hipcc` project. Users will access a
separate `hipcc` package for installing `hipcc` binaries in future ROCm releases.
* In a future ROCm release, the following samples will be removed from the `hip-tests` project.
* `hipBusBandwidth` at <https://github.com/ROCm-Developer-Tools/hip-tests/tree/develop/samples/1_Utils/shipBusBandwidth>
@@ -53,9 +55,9 @@ The following hipcc changes are implemented in this release:
##### New HIP APIs in this release
> **Note**
>
> This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.
:::{note}
This is a pre-official version (beta) release of the new APIs and may contain unresolved issues.
:::
###### Memory management HIP APIs
@@ -71,21 +73,23 @@ The new memory management HIP API is as follows:
The new module management HIP APIs are as follows:
* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed to `kernelParams`, where thread blocks can cooperate and synchronize as they execute.
* Launches kernel $f$ with launch parameters and shared memory on stream with arguments passed
to `kernelParams`, where thread blocks can cooperate and synchronize as they run.
```cpp
hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void** kernelParams);
```
* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute.
* Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they
run.
```cpp
hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams* launchParamsList, unsigned int numDevices, unsigned int flags);
```
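For illustration, here is a sketch of the single-device variant (the code object `kernel.co` and kernel
name `vec_add` are placeholders, and a device supporting cooperative launch is assumed):

```cpp
#include <hip/hip_runtime.h>

int main() {
    hipModule_t module;
    hipFunction_t kernel;
    // "kernel.co" and "vec_add" are placeholder names for this sketch.
    (void)hipModuleLoad(&module, "kernel.co");
    (void)hipModuleGetFunction(&kernel, module, "vec_add");

    int n = 1 << 20;
    void* kernelParams[] = {&n};
    // Thread blocks launched cooperatively may synchronize across the grid.
    (void)hipModuleLaunchCooperativeKernel(kernel,
                                           64, 1, 1,   // grid dimensions
                                           256, 1, 1,  // block dimensions
                                           0,          // shared memory bytes
                                           nullptr,    // default stream
                                           kernelParams);
    (void)hipDeviceSynchronize();
    (void)hipModuleUnload(module);
    return 0;
}
```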
###### HIP Graph Management APIs
###### HIP graph management APIs
The new HIP Graph Management APIs are as follows:
The new HIP graph management APIs are as follows:
* Creates a memory allocation node and adds it to a graph \[BETA]
@@ -136,21 +140,27 @@ The new HIP Graph Management APIs are as follows:
```
##### OpenMP enhancements
This release consists of the following OpenMP enhancements:
* Additional support for OMPT functions `get_device_time` and `get_record_type`.
* Add support for min/max fast fp atomics on AMD GPUs.
Fix the use of the abs function in C device regions.
* Additional support for OMPT functions `get_device_time` and `get_record_type`
* Added support for min/max fast fp atomics on AMD GPUs
* Fixed the use of the abs function in C device regions
### Deprecations and warnings
#### HIP deprecation
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
The `hipcc` and `hipconfig` Perl scripts are deprecated. In a future release, compiled binaries will be
available as `hipcc.bin` and `hipconfig.bin` as replacements for the Perl scripts.
> **Note**
>
> There will be a transition period where the Perl scripts and compiled binaries are available before the scripts are removed. There will be no functional difference between the Perl scripts and their compiled binary counterpart. No user action is required. Once these are available, users can optionally switch to `hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from `hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::{note}
There will be a transition period where the Perl scripts and compiled binaries are available before the
scripts are removed. There will be no functional difference between the Perl scripts and their compiled
binary counterpart. No user action is required. Once these are available, users can optionally switch to
`hipcc.bin` and `hipconfig.bin`. The `hipcc`/`hipconfig` soft link will be assimilated to point from
`hipcc`/`hipconfig` to the respective compiled binaries as the default option.
:::
##### Linux file system hierarchy standard for ROCm
@@ -190,23 +200,26 @@ The following is the new file system hierarchy:
```
> **Note**
>
> ROCm will not support backward compatibility with the v5.1(old) file system hierarchy in its next major release.
:::{note}
ROCm will not support backward compatibility with the v5.1 (old) file system hierarchy in its next
major release.
:::
For more information, refer to <https://refspecs.linuxfoundation.org/fhs.shtml>.
##### Backward compatibility with older file systems
ROCm has moved header files and libraries to its new location as indicated in the above structure and included symbolic-link and wrapper header files in its old location for backward compatibility.
> **Note**
>
> ROCm will continue supporting backward compatibility until the next major release.
ROCm has moved header files and libraries to their new locations as indicated in the above structure
and included symbolic links and wrapper header files in the old locations for backward compatibility.
:::{note}
ROCm will continue supporting backward compatibility until the next major release.
:::
##### Wrapper header files
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the example below:
Wrapper header files are placed in the old location (`/opt/rocm-xxx/<component>/include`) with a
warning message to include files from the new location (`/opt/rocm-xxx/include`) as shown in the
example below:
```cpp
// Code snippet from hip_runtime.h
@@ -223,7 +236,8 @@ The wrapper header files backward compatibility deprecation is as follows:
##### Library files
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Library files are available in the `/opt/rocm-xxx/lib` folder. For backward compatibility, the old library
location (`/opt/rocm-xxx/<component>/lib`) has a soft link to the library at the new location.
Example:
@@ -237,7 +251,8 @@ lrwxrwxrwx 1 root root 24 May 10 23:32 libamdhip64.so -> ../../lib/libamdhip64
##### CMake config files
All CMake configuration files are available in the `/opt/rocm-xxx/lib/cmake/<component>` folder.
For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`) consist of a soft link to the new CMake config.
For backward compatibility, the old CMake locations (`/opt/rocm-xxx/<component>/lib/cmake`)
contain a soft link to the new CMake config.
Example:
@@ -253,7 +268,8 @@ Support for Code Object v3 is deprecated and will be removed in a future release
#### Comgr V3.0 changes
The following APIs and macros have been marked as deprecated. These are expected to be removed in a future ROCm release and coincides with the release of Comgr v3.0.
The following APIs and macros have been marked as deprecated. They are expected to be removed in
a future ROCm release, coinciding with the release of Comgr v3.0.
##### API changes
@@ -265,7 +281,8 @@ The following APIs and macros have been marked as deprecated. These are expected
* `AMD_COMGR_ACTION_ADD_DEVICE_LIBRARIES`
* `AMD_COMGR_ACTION_COMPILE_SOURCE_TO_FATBIN`
For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST APIs`, and the `AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros.
For replacements, see the `AMD_COMGR_ACTION_INFO_GET`/`SET_OPTION_LIST` APIs and the
`AMD_COMGR_ACTION_COMPILE_SOURCE_(WITH_DEVICE_LIBS)_TO_BC` macros.
#### Deprecated environment variables

View File

@@ -5,7 +5,7 @@
#### HIP SDK for Windows
AMD is pleased to announce the availability of the HIP SDK for Windows as part
of the ROCm platform. The
of ROCm software. The
[HIP SDK OS and GPU support page](https://rocm.docs.amd.com/en/docs-5.5.1/release/windows_support.html)
lists the versions of Windows and GPUs validated by AMD. HIP SDK features on
Windows are described in detail in our
@@ -21,4 +21,5 @@ The following HIP API is updated in the ROCm 5.5.1 release:
##### `hipDeviceSetCacheConfig`
* The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to `hipSuccess`
* The return value for `hipDeviceSetCacheConfig` is updated from `hipErrorNotSupported` to
`hipSuccess`, as shown in the sketch below
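As a sketch of the updated behavior (the chosen cache preference is arbitrary; the call now reports
success even where the hint has no effect on the hardware):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    // As of ROCm 5.5.1 this returns hipSuccess instead of hipErrorNotSupported.
    hipError_t err = hipDeviceSetCacheConfig(hipFuncCachePreferShared);
    printf("hipDeviceSetCacheConfig: %s\n", hipGetErrorString(err));
    return err == hipSuccess ? 0 : 1;
}
```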

View File

@@ -3,27 +3,37 @@
<!-- markdownlint-disable header-increment -->
### Release highlights
ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base.A few examples include:
ROCm 5.6 consists of several AI software ecosystem improvements to our fast-growing user base. A
few examples include:
* New documentation portal at https://rocm.docs.amd.com
* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test suite
* Ongoing software enhancements for LLMs, ensuring full compliance with the HuggingFace unit test
suite
* OpenAI Triton, CuPy, HIP Graph support, and many other library performance enhancements
* Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger, profiler, and docker containers
* New pseudorandom generators are available in rocRAND. Added support for half-precision transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in rocSOLVER.
* Improved ROCm deployment and development tools, including CPU-GPU (rocGDB) debugger,
profiler, and docker containers
* New pseudorandom generators are available in rocRAND. Added support for half-precision
transforms in hipFFT/rocFFT. Added LU refactorization and linear system solver for sparse matrices in
rocSOLVER.
### OS and GPU support changes
* SLES15 SP5 support was added this release. SLES15 SP3 support was dropped.
* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs) will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA release date.
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond ROCm 5.7
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024 (End of Maintenance \[EOM])(will be aligned with the closest ROCm release)
* AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively referred to as gfx906 GPUs)
will be entering the maintenance mode starting Q3 2023. This will be aligned with ROCm 5.7 GA
release date.
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond
ROCm 5.7
* Bug fixes / critical security patches will continue to be supported for the gfx906 GPUs till Q2 2024
(EOM will be aligned with the closest ROCm release)
* Bug fixes during the maintenance will be made to the next ROCm point release
* Bug fixes will not be back ported to older ROCm releases for this SKU
* Distro / Operating system updates will continue per the ROCm release cadence for gfx906 GPUs till EOM.
* Distro / Operating system updates will continue per the ROCm release cadence for gfx906 GPUs till
EOM.
### AMDSMI CLI 23.0.0.4
#### Added
#### Additions
* AMDSMI CLI tool enabled for Linux Bare Metal & Guest
@@ -39,7 +49,8 @@ ROCm 5.6 consists of several AI software ecosystem improvements to our fast-grow
#### Fixes
* Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in [Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).
* Stability fix for multi GPU system reproducible via ROCm_Bandwidth_Test as reported in
[Issue 2198](https://github.com/RadeonOpenCompute/ROCm/issues/2198).
### HIP 5.6 (for ROCm 5.6)
@@ -48,7 +59,7 @@ ROCm 5.6 consists of several AI software ecosystem improvements to our fast-grow
* Consolidation of hipamd, rocclr and OpenCL projects in clr
* Optimized lock for graph global capture mode
#### Added
#### Additions
* Added hipRTC support for amd_hip_fp16
* Added hipStreamGetDevice implementation to get the device associated with the stream
@@ -57,14 +68,14 @@ ROCm 5.6 consists of several AI software ecosystem improvements to our fast-grow
* hipArrayGetDescriptor for getting 1D or 2D array descriptor
* hipArray3DGetDescriptor to get 3D array descriptor
#### Changed
#### Changes
* hipMallocAsync to return success for zero size allocation to match hipMalloc
* Separation of hipcc Perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
* Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
* Removed hipBusBandwidth and hipCommander samples from hip-tests
#### Fixed
#### Fixes
* Fixed regression in hipMemCpyParam3D when offset is applied
@@ -98,11 +109,11 @@ ROCm 5.6 consists of several AI software ecosystem improvements to our fast-grow
### ROCgdb-13 (For ROCm 5.6.0)
#### Optimized
#### Optimizations
* Improved performance when handling the end of a process with a large number of threads.
Known Issues
#### Known issues
* On certain configurations, ROCgdb can show the following warning message:
@@ -176,15 +187,15 @@ gcc main.c -I/opt/rocm-5.6.0/include -L/opt/rocm-5.6.0/lib -lrocprofiler64-v2
The resulting `a.out` will depend on
`/opt/rocm-5.6.0/lib/librocprofiler64.so.2`.
#### Optimized
#### Optimizations
* Improved Test Suite
#### Added
#### Additions
* `end_time` needs to be disabled in `roctx_trace.txt`
#### Fixed
#### Fixes
* rocprof in ROCm 5.4.0: the GPU selector is broken.
* rocprof in ROCm 5.4.1 fails to generate kernel info.

View File

@@ -7,9 +7,9 @@ ROCm 5.6.1 is a point release with several bug fixes in the HIP runtime.
#### HIP 5.6.1 (for ROCm 5.6.1)
### Fixed defects
### Defect fixes
* *hipMemcpy* device-to-device (inter-device) is now asynchronous with respect to the host
* `hipMemcpy` device-to-device (inter-device) is now asynchronous with respect to the host
* Fixed a hang when executing HIP catch2 tests with the xnack+ check enabled
* Fixed a memory leak when code object files are loaded/unloaded via the `hipModuleLoad` and
`hipModuleUnload` APIs
* Using *hipGraphAddMemFreeNode* no longer results in a crash
* Using `hipGraphAddMemFreeNode` no longer results in a crash

View File

@@ -3,27 +3,45 @@
### Release highlights for ROCm 5.7
ROCm 5.7.0 includes many new features. These include: a new library (hipTensor), and optimizations for rocRAND and MIVisionX. Address sanitizer for host and device code (GPU) is now available as a beta. Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major release in the ROCm 5 series. This release is Linux-only.
New features include:
Important: The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series. Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
* A new library (hipTensor)
* Optimizations for rocRAND and MIVisionX
* AddressSanitizer for host and device code (GPU) is now available as a beta
Note that ROCm 5.7.0 is EOS for MI50. 5.7 versions of ROCm are the last major releases in the ROCm 5
series. This release is Linux-only.
:::{important}
The next major ROCm release (ROCm 6.0) will not be backward compatible with the ROCm 5 series.
Changes will include: splitting LLVM packages into more manageable sizes, changes to the HIP runtime
API, splitting rocRAND and hipRAND into separate packages, and reorganizing our file structure.
:::
#### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter maintenance mode starting Q3 2023.
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) will enter
maintenance mode starting Q3 2023.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the final release for gfx906 GPUs to be in a fully supported state.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 will be the
final release for gfx906 GPUs to be in a fully supported state.
* ROCm 6.0 release will show MI50s as "under maintenance" for [Linux](../about/compatibility/linux-support.md) and [Windows](../about/compatibility/windows-support.md)
* ROCm 6.0 release will show MI50s as "under maintenance" for
{doc}`Linux<rocm-install-on-linux:reference/system-requirements>` and
{doc}`Windows<rocm-install-on-windows:reference/system-requirements>`
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this major release (ROCm 5.7).
* No new features and performance optimizations will be supported for the gfx906 GPUs beyond this
major release (ROCm 5.7).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2 2024 (EOM (End of Maintenance) will be aligned with the closest ROCm release).
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2
2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
* Bug fixes during the maintenance will be made to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906 GPUs until EOM.
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906
GPUs until EOM.
#### Feature updates
@@ -31,40 +49,62 @@ As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), RO
**Current behavior**
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However, PCle atomics are unavailable in some environments, and, as a result, HIP-printf does not work in those environments. Users may see the following error from runtime (with AMD_LOG_LEVEL 1 and above):
The current version of HIP printf relies on hostcalls, which, in turn, rely on PCIe atomics. However,
PCIe atomics are unavailable in some environments, and, as a result, HIP printf does not work in those
environments. Users may see the following error from the runtime (with AMD_LOG_LEVEL 1 and above):
```
```shell
Pcie atomics not enabled, hostcall not supported
```
**Workaround**
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
The ROCm 5.7 release introduces an alternative to the current hostcall-based implementation that
leverages an older OpenCL-based printf scheme, which does not rely on hostcalls/PCIe atomics.
Note: This option is less robust than hostcall-based implementation and is intended to be a workaround when hostcalls do not work.
:::{note}
This option is less robust than the hostcall-based implementation and is intended as a
workaround when hostcalls do not work.
:::
The printf variant is now controlled via a new compiler option -mprintf-kind=<value>. This is supported only for HIP programs and takes the following values,
The printf variant is now controlled via a new compiler option, `-mprintf-kind=<value>`. This is
supported only for HIP programs and takes the following values:
* “hostcall” This currently available implementation relies on hostcalls, which require the system to support PCIe atomics. It is the default scheme.
* “hostcall”: This currently available implementation relies on hostcalls, which require the system to
support PCIe atomics. It is the default scheme.
* “buffered” This implementation leverages the older printf scheme used by OpenCL; it relies on a memory buffer where printf arguments are stored during the kernel execution, and then the runtime handles the actual printing once the kernel finishes execution.
* “buffered”: This implementation leverages the older printf scheme used by OpenCL; it relies on a
memory buffer where printf arguments are stored during the kernel execution, and then the runtime
handles the actual printing once the kernel finishes execution.
**NOTE**: With the new workaround:
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in additional output.
* The printf buffer is fixed size and non-circular. After the buffer is filled, calls to printf will not result in
additional output.
* The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall scheme that returns the number of characters printed.
* The printf call returns either 0 (on success) or -1 (on failure, due to full buffer), unlike the hostcall
scheme that returns the number of characters printed.
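As a sketch (the kernel is illustrative; the compile line in the comment uses the option introduced
above):

```cpp
// Compile with, for example: hipcc -mprintf-kind=buffered printf_demo.cpp
#include <hip/hip_runtime.h>

__global__ void greet() {
    // Under the buffered scheme this returns 0 on success, or -1 once the
    // fixed-size, non-circular buffer is full (unlike hostcall, which returns
    // the number of characters printed).
    int rc = printf("hello from block %u\n", blockIdx.x);
    (void)rc;
}

int main() {
    greet<<<4, 1>>>();
    // With the buffered scheme, the runtime prints the stored arguments only
    // after the kernel finishes execution.
    (void)hipDeviceSynchronize();
    return 0;
}
```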
##### Beta release of LLVM AddressSanitizer (ASan) with the GPU
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The LLVM ASan provides a process that allows developers to detect runtime addressing errors in applications and libraries. The detection is achieved using a combination of compiler-added instrumentation and runtime techniques, including function interception and replacement.
The ROCm 5.7 release introduces the beta release of LLVM AddressSanitizer (ASan) with the GPU. The
LLVM ASan provides a process that allows developers to detect runtime addressing errors in
applications and libraries. The detection is achieved using a combination of compiler-added
instrumentation and runtime techniques, including function interception and replacement.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However, ROCm has extended this mechanism to additionally allow the detection of some addressing errors on the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Until now, the LLVM ASan process was only available for traditional purely CPU applications. However,
ROCm has extended this mechanism to additionally allow the detection of some addressing errors on
the GPU in heterogeneous applications. Ideally, developers should treat heterogeneous HIP and
OpenMP applications like pure CPU applications. However, this simplicity has not been achieved yet.
Refer to the documentation on LLVM ASan with the GPU at [LLVM AddressSanitizer User Guide](../conceptual/using_gpu_sanitizer.md).
Refer to the documentation on LLVM ASan with the GPU at
[LLVM AddressSanitizer User Guide](../conceptual/using-gpu-sanitizer.md).
**Note**: The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
:::{note}
The beta release of LLVM ASan for ROCm is currently tested and validated on Ubuntu 20.04.
:::
#### Fixed defects
#### Defect fixes
The following defects are fixed in ROCm v5.7:
@@ -80,7 +120,7 @@ The following defects are fixed in ROCm v5.7:
##### Optimizations
##### Added
##### Additions
* Added `meta_group_size`/`rank` for getting the number of tiles and rank of a tile in the partition
@@ -98,14 +138,16 @@ The following defects are fixed in ROCm v5.7:
* `hipMipmappedArrayGetLevel` for getting a mipmapped array on a mipmapped level
##### Changed
##### Changes
##### Fixed
##### Fixes
##### Known issues
* HIP memory type enum values currently don't support equivalent value to `cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* HIP API `hipPointerGetAttributes` could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
* HIP memory type enum values currently don't support an equivalent value to
`cudaMemoryTypeUnregistered`, due to HIP functionality backward compatibility.
* The HIP API `hipPointerGetAttributes` could return an invalid value if the input memory pointer was
not allocated through any HIP API on the device or host.
##### Upcoming changes for HIP in ROCm 6.0 release
@@ -139,16 +181,17 @@ The following defects are fixed in ROCm v5.7:
* Removal of deprecated hip-hcc code from the HIP code tree
* Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
* Correct hipArray usage in HIP APIs such as `hipMemcpyAtoH` and `hipMemcpyHtoA`
* HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
* `HIPMEMCPY_3D` fields correction to avoid truncation of `size_t` to `unsigned int` inside
`hipMemcpy3D()`
* Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
* Renaming of `memoryType` in the `hipPointerAttribute_t` structure to `type`
* Correct hipGetLastError to return the last error instead of last API call's return code
* Correct `hipGetLastError` to return the last error instead of last API call's return code
* Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
* Update `hipExternalSemaphoreHandleDesc` to add `unsigned int reserved[16]`
* Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
* Correct handling of flag values in `hipIpcOpenMemHandle` for `hipIpcMemLazyEnablePeerAccess`
* Remove hiparray* and make it opaque with hipArray_t
* Remove `hiparray*` and make it opaque with `hipArray_t`

View File

@@ -1,50 +1,53 @@
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
### What's New in This Release
### What's new in this release
#### Installing all GPU Address sanitizer packages with a single command
ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
ROCm 5.7.1 simplifies the installation steps for the optional Address Sanitizer (ASan) packages. This release provides the meta package *rocm-ml-sdk-asan* for ease of ASan installation. The following command can be used to install all ASan packages rather than installing each package separately,
#### Installing all GPU AddressSanitizer packages with a single command
ROCm 5.7.1 simplifies the installation steps for the optional AddressSanitizer (ASan) packages. This
release provides the meta package *rocm-ml-sdk-asan* for ease of ASan installation. The following
command can be used to install all ASan packages rather than installing each package separately:
`sudo apt-get install rocm-ml-sdk-asan`
For more detailed information about using the GPU AddressSanitizer, refer to the [user guide](https://rocm.docs.amd.com/en/docs-5.7.1/understand/using_gpu_sanitizer.html)
For more detailed information about using the GPU AddressSanitizer, refer to the
[user guide](https://rocm.docs.amd.com/en/docs-5.7.1/understand/using_gpu_sanitizer.html).
### ROCm Libraries
### ROCm libraries
#### rocBLAS
A new functionality rocblas-gemm-tune and an environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH are added to rocBLAS in the ROCm 5.7.1 release.
A new functionality, `rocblas-gemm-tune`, and an environment variable,
`ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH`, are added to rocBLAS in the ROCm 5.7.1 release.
*rocblas-gemm-tune* is used to find the best-performing GEMM kernel for each GEMM problem set. It has a command line interface, which mimics the --yaml input used by rocblas-bench. To generate the expected --yaml input, profile logging can be used, by setting the environment variable ROCBLAS_LAYER4.
`rocblas-gemm-tune` is used to find the best-performing GEMM kernel for each GEMM problem set. It
has a command line interface, which mimics the `--yaml` input used by `rocblas-bench`. To generate the
expected `--yaml` input, profile logging can be used by setting the environment variable
`ROCBLAS_LAYER=4`.
For more information on rocBLAS logging, see Logging in rocBLAS, in the [API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).
For more information on rocBLAS logging, see Logging in rocBLAS, in the
[API Reference Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/docs-5.7.1/API_Reference_Guide.html#logging-in-rocblas).
An example input file: Expected output (note selected GEMM idx may differ): Where the far right values (solution_index) are the indices of the best-performing kernels for those GEMMs in the rocBLAS kernel library. These indices can be directly used in future GEMM calls. See rocBLAS/samples/example_user_driven_tuning.cpp for sample code of directly using kernels via their indices.
In an example input file and its expected output (the selected GEMM indices may differ), the far-right
values (`solution_index`) are the indices of the best-performing kernels for those GEMMs in the
rocBLAS kernel library. These indices can be directly used in future GEMM calls. See
`rocBLAS/samples/example_user_driven_tuning.cpp` for sample code that uses kernels directly via
their indices.
If the output is stored in a file, the results can be used to override default kernel selection with the kernels found, by setting the environment variable ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH, where points to the stored file.
If the output is stored in a file, the results can be used to override the default kernel selection with the
kernels found, by setting the environment variable `ROCBLAS_TENSILE_GEMM_OVERRIDE_PATH`, which
points to the stored file.
For more details, refer to the [rocBLAS Programmer's Guide.](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune)
For more details, refer to the
[rocBLAS Programmer's Guide](https://rocm.docs.amd.com/projects/rocBLAS/en/latest/Programmers_Guide.html#rocblas-gemm-tune).
#### HIP 5.7.1 (for ROCm 5.7.1)
ROCm 5.7.1 is a point release with several bug fixes in the HIP runtime.
### Fixed defects
The *hipPointerGetAttributes* API returns the correct HIP memory type as *hipMemoryTypeManaged* for managed memory.
### Defect fixes
The `hipPointerGetAttributes` API returns the correct HIP memory type as `hipMemoryTypeManaged`
for managed memory.
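For illustration, here is a sketch verifying the fix (the allocation size is arbitrary; in the 5.7 headers
the field is still named `memoryType`):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    float* p = nullptr;
    (void)hipMallocManaged(reinterpret_cast<void**>(&p), 256 * sizeof(float));

    hipPointerAttribute_t attr;
    if (hipPointerGetAttributes(&attr, p) == hipSuccess) {
        // With the 5.7.1 fix this reports hipMemoryTypeManaged for managed memory.
        printf("memory type: %d (hipMemoryTypeManaged = %d)\n",
               static_cast<int>(attr.memoryType),
               static_cast<int>(hipMemoryTypeManaged));
    }
    (void)hipFree(p);
    return 0;
}
```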

View File

@@ -0,0 +1,891 @@
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable no-duplicate-header -->
ROCm 6.0 is a major release with new performance optimizations, expanded frameworks and library
support, and improved developer experience. This includes initial enablement of the AMD Instinct™
MI300 series. Future releases will further enable and optimize this new platform. Key features include:
* Improved performance in areas like lower precision math and attention layers.
* New hipSPARSELt library to accelerate AI workloads via AMD's sparse matrix core technique.
* Latest upstream support for popular AI frameworks like PyTorch, TensorFlow, and JAX.
* New support for libraries, such as DeepSpeed, ONNX-RT, and CuPy.
* Prepackaged HPC and AI containers on AMD Infinity Hub, with improved documentation and
tutorials on the [AMD ROCm Docs](https://rocm.docs.amd.com) site.
* Consolidated developer resources and training on the new AMD ROCm Developer Hub.
The following sections provide a release overview for ROCm 6.0. For additional details, you can refer to
the [Changelog](https://rocm.docs.amd.com/en/develop/about/CHANGELOG.html).
### OS and GPU support changes
AMD Instinct™ MI300A and MI300X Accelerator support has been enabled for limited operating
systems.
* Ubuntu 22.04.3 (MI300A and MI300X)
* RHEL 8.9 (MI300A)
* SLES 15 SP5 (MI300A)
We've added support for the following operating systems:
* RHEL 9.3
* RHEL 8.9
Note that, as of ROCm 6.2, we've planned for end-of-support (EoS) for the following operating systems:
* Ubuntu 20.04.5
* SLES 15 SP4
* RHEL/CentOS 7.9
### New ROCm meta package
We've added a new ROCm meta package for easy installation of all ROCm core packages, tools, and
libraries. For example, the following command will install the full ROCm package: `apt-get install rocm`
(Ubuntu) or `yum install rocm` (RHEL).
### Filesystem Hierarchy Standard
ROCm 6.0 fully adopts the Filesystem Hierarchy Standard (FHS) reorganization goals. We've removed
the backward compatibility support for old file locations.
### Compiler location change
* The installation path of LLVM has been changed from `/opt/rocm-<rel>/llvm` to
`/opt/rocm-<rel>/lib/llvm`. For backward compatibility, a symbolic link is provided to the old
location and will be removed in a future release.
* The installation path of the device library bitcode has changed from `/opt/rocm-<rel>/amdgcn` to
`/opt/rocm-<rel>/lib/llvm/lib/clang/<ver>/lib/amdgcn`. For backward compatibility, a symbolic link
is provided and will be removed in a future release.
### Documentation
CMake support has been added for documentation in the
[ROCm repository](https://github.com/RadeonOpenCompute/ROCm).
### AMD Instinct™ MI50 end-of-support notice
AMD Instinct MI50, Radeon Pro VII, and Radeon VII products (collectively gfx906 GPUs) enter
maintenance mode in ROCm 6.0.
As outlined in [5.6.0](https://rocm.docs.amd.com/en/docs-5.6.0/release.html), ROCm 5.7 was the
final release for gfx906 GPUs in a fully supported state.
* Henceforth, no new features or performance optimizations will be supported for the gfx906 GPUs.
* Bug fixes and critical security patches will continue to be supported for the gfx906 GPUs until Q2
2024 (end of maintenance \[EOM] will be aligned with the closest ROCm release).
* Bug fixes will be made up to the next ROCm point release.
* Bug fixes will not be backported to older ROCm releases for gfx906.
* Distribution and operating system updates will continue per the ROCm release cadence for gfx906
GPUs until EOM.
### Known issues
* Hang is observed with rocSPARSE tests: [Issue 2726](https://github.com/ROCm/ROCm/issues/2726).
* AddressSanitizer instrumentation is incorrect for device global variables:
[Issue 2551](https://github.com/ROCm/ROCm/issues/2551).
* Dynamically loaded HIP runtime library references incorrect version of `hipDeviceGetProperties`
API: [Issue 2728](https://github.com/ROCm/ROCm/issues/2728).
* Memory access violations when running rocFFT-HMM:
[Issue 2730](https://github.com/ROCm/ROCm/issues/2730).
### Library changes
| Library | Version |
|---------|---------|
| AMDMIGraphX | ⇒ [2.8](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/releases/tag/rocm-6.0.0) |
| HIP | [6.0.0](https://github.com/ROCm/HIP/releases/tag/rocm-6.0.0) |
| hipBLAS | ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipBLAS/releases/tag/rocm-6.0.0) |
| hipCUB | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipCUB/releases/tag/rocm-6.0.0) |
| hipFFT | ⇒ [1.0.13](https://github.com/ROCmSoftwarePlatform/hipFFT/releases/tag/rocm-6.0.0) |
| hipSOLVER | ⇒ [2.0.0](https://github.com/ROCmSoftwarePlatform/hipSOLVER/releases/tag/rocm-6.0.0) |
| hipSPARSE | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-6.0.0) |
| hipTensor | ⇒ [1.1.0](https://github.com/ROCmSoftwarePlatform/hipTensor/releases/tag/rocm-6.0.0) |
| MIOpen | ⇒ [2.19.0](https://github.com/ROCmSoftwarePlatform/MIOpen/releases/tag/rocm-6.0.0) |
| rccl | ⇒ [2.15.5](https://github.com/ROCmSoftwarePlatform/rccl/releases/tag/rocm-6.0.0) |
| rocALUTION | ⇒ [3.0.3](https://github.com/ROCmSoftwarePlatform/rocALUTION/releases/tag/rocm-6.0.0) |
| rocBLAS | ⇒ [4.0.0](https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-6.0.0) |
| rocFFT | ⇒ [1.0.25](https://github.com/ROCmSoftwarePlatform/rocFFT/releases/tag/rocm-6.0.0) |
| ROCgdb | [13.2](https://github.com/ROCm/ROCgdb/releases/tag/rocm-6.0.0) |
| rocm-cmake | ⇒ [0.11.0](https://github.com/RadeonOpenCompute/rocm-cmake/releases/tag/rocm-6.0.0) |
| rocPRIM | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocPRIM/releases/tag/rocm-6.0.0) |
| rocprofiler | [2.0.0](https://github.com/ROCm/rocprofiler/releases/tag/rocm-6.0.0) |
| rocRAND | ⇒ [2.10.17](https://github.com/ROCmSoftwarePlatform/rocRAND/releases/tag/rocm-6.0.0) |
| rocSOLVER | ⇒ [3.24.0](https://github.com/ROCmSoftwarePlatform/rocSOLVER/releases/tag/rocm-6.0.0) |
| rocSPARSE | ⇒ [3.0.2](https://github.com/ROCmSoftwarePlatform/rocSPARSE/releases/tag/rocm-6.0.0) |
| rocThrust | ⇒ [3.0.0](https://github.com/ROCmSoftwarePlatform/rocThrust/releases/tag/rocm-6.0.0) |
| rocWMMA | ⇒ [1.3.0](https://github.com/ROCmSoftwarePlatform/rocWMMA/releases/tag/rocm-6.0.0) |
| Tensile | ⇒ [4.39.0](https://github.com/ROCmSoftwarePlatform/Tensile/releases/tag/rocm-6.0.0) |
#### AMDMIGraphX 2.8
MIGraphX 2.8 for ROCm 6.0.0
##### Additions
* Support for TorchMIGraphX via PyTorch
* Boosted overall performance by integrating rocMLIR
* INT8 support for ONNX Runtime
* Support for ONNX version 1.14.1
* Added new operators: `Qlinearadd`, `QlinearGlobalAveragePool`, `Qlinearconv`, `Shrink`, `CastLike`,
and `RandomUniform`
* Added an error message for when `gpu_targets` is not set during MIGraphX compilation
* Added a parameter to set tolerances with `migraphx-driver verify`
* Added support for MXR files > 4 GB
* Added `MIGRAPHX_TRACE_MLIR` flag
* BETA added capability for using ROCm Composable Kernels via the `MIGRAPHX_ENABLE_CK=1`
environment variable
##### Optimizations
* Improved performance support for INT8
* Improved time precision while benchmarking candidate kernels from CK or MLIR
* Removed contiguous from reshape parsing
* Updated the `ConstantOfShape` operator to support Dynamic Batch
* Simplified dynamic shapes-related operators to their static versions, where possible
* Improved debugging tools for accuracy issues
* Included a print warning about `miopen_fusion` while generating `mxr`
* General reduction in system memory usage during model compilation
* Created additional fusion opportunities during model compilation
* Improved debugging for matchers
* Improved general debug messages
##### Fixes
* Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
* Provided a compile option to improve the accuracy of some models by disabling Fast-Math
* Improved layernorm + pointwise fusion matching to ignore argument order
* Fixed accuracy issue with `ROIAlign` operator
* Fixed computation logic for the `Trilu` operator
* Fixed support for the DETR model
##### Changes
* Changed MIGraphX version to 2.8
* Extracted the test packages into a separate deb file when building MIGraphX from source
##### Removals
* Removed building Python 2.7 bindings
#### AMD SMI
* Integrated the E-SMI library: You can now query CPU-related information directly through AMD SMI.
Metrics include power, energy, performance, and other system details.
* Added support for gfx942 metrics: You can now query MI300 device metrics to get real-time
information. Metrics include power, temperature, energy, and performance.
* Added support for compute and memory partitions
#### HIP 6.0.0
HIP 6.0.0 for ROCm 6.0.0
##### Additions
* New fields and structs for external resource interoperability
* `hipExternalMemoryHandleDesc_st`
* `hipExternalMemoryBufferDesc_st`
* `hipExternalSemaphoreHandleDesc_st`
* `hipExternalSemaphoreSignalParams_st`
* `hipExternalSemaphoreWaitParams_st`
* New enumerations for external resource interoperability
* `hipExternalMemoryHandleType_enum`
* `hipExternalSemaphoreHandleType_enum`
* New environment variable `HIP_LAUNCH_BLOCKING`, for serialization of kernel execution. The default
  value is 0 (disabled): kernels execute normally, as defined in the queue. When this environment
  variable is set to 1 (enabled), the HIP runtime serializes kernel enqueue, behaving the same as
  `AMD_SERIALIZE_KERNEL`.
* More members are added in HIP struct `hipDeviceProp_t`, for new feature capabilities including:
* Texture
* `int maxTexture1DMipmap;`
* `int maxTexture2DMipmap[2];`
* `int maxTexture2DLinear[3];`
* `int maxTexture2DGather[2];`
* `int maxTexture3DAlt[3];`
* `int maxTextureCubemap;`
* `int maxTexture1DLayered[2];`
* `int maxTexture2DLayered[3];`
* `int maxTextureCubemapLayered[2];`
* Surface
* `int maxSurface1D;`
* `int maxSurface2D[2];`
* `int maxSurface3D[3];`
* `int maxSurface1DLayered[2];`
* `int maxSurface2DLayered[3];`
* `int maxSurfaceCubemap;`
* `int maxSurfaceCubemapLayered[2];`
* Device
* `hipUUID uuid;`
* `char luid[8];`: an 8-byte locally unique identifier; only valid on Windows
* `unsigned int luidDeviceNodeMask;`
* LUID (Locally Unique Identifier) is supported for interoperability between devices. In HIP, more
members are added in the struct `hipDeviceProp_t`, as properties to identify each device:
* `char luid[8];`
* `unsigned int luidDeviceNodeMask;`
:::{note}
HIP only supports LUID on Windows OS.
:::
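A minimal sketch querying a few of the newly added members; nothing here goes beyond the standard
`hipGetDeviceProperties` call:

```cpp
// Minimal sketch: reading some of the newly added hipDeviceProp_t members.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipDeviceProp_t prop;
    if (hipGetDeviceProperties(&prop, /*device=*/0) != hipSuccess) return 1;

    std::printf("maxTexture1DMipmap: %d\n", prop.maxTexture1DMipmap);
    std::printf("maxSurfaceCubemap:  %d\n", prop.maxSurfaceCubemap);
    // luid and luidDeviceNodeMask are only meaningful on Windows.
    std::printf("luidDeviceNodeMask: %u\n", prop.luidDeviceNodeMask);
    return 0;
}
```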
##### Changes
* Some OpenGL interop HIP APIs are moved from the `hip_runtime_api.h` header to a new header file, `hip_gl_interop.h`, on the AMD platform (a short include sketch follows this list), as follows:
* `hipGLGetDevices`
* `hipGraphicsGLRegisterBuffer`
* `hipGraphicsGLRegisterImage`
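The corresponding source-level change might look like the following; the platform guard macro is the
usual HIP convention rather than something quoted from these notes:

```cpp
// Sketch: on the AMD platform, the OpenGL interop entry points now come from
// the new header rather than hip_runtime_api.h.
#include <hip/hip_runtime.h>
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_gl_interop.h>  // hipGLGetDevices, hipGraphicsGLRegisterBuffer, ...
#endif
```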
###### Changes impacting backward incompatibility
* Data types for members in `HIP_MEMCPY3D` structure are changed from `unsigned int` to `size_t`.
* The value of the flag `hipIpcMemLazyEnablePeerAccess` is changed to `0x01`; it was previously
  defined as `0`
* Some device property attributes are not currently supported in HIP runtime. In order to maintain
consistency, the following related enumeration names are changed in `hipDeviceAttribute_t`
* `hipDeviceAttributeName` is changed to `hipDeviceAttributeUnused1`
* `hipDeviceAttributeUuid` is changed to `hipDeviceAttributeUnused2`
* `hipDeviceAttributeArch` is changed to `hipDeviceAttributeUnused3`
* `hipDeviceAttributeGcnArch` is changed to `hipDeviceAttributeUnused4`
* `hipDeviceAttributeGcnArchName` is changed to `hipDeviceAttributeUnused5`
* HIP struct `hipArray` is removed from driver type header to comply with CUDA
* `hipArray_t` replaces `hipArray*`, as the pointer to array.
* This allows `hipMemcpyAtoH` and `hipMemcpyHtoA` to have the correct array type which is
equivalent to corresponding CUDA driver APIs.
##### Fixes
* Kernel launch maximum dimension validation is added specifically on gridY and gridZ in the HIP API `hipModuleLaunchKernel`. As a result, when `hipGetDeviceAttribute` is called for the value of `hipDeviceAttributeMaxGridDim`, the behavior on the AMD platform is equivalent to NVIDIA.
* The HIP stream synchronization behavior is changed in internal stream functions: a "wait" flag is added and set when the current stream is a null pointer while executing stream synchronization on other explicitly created streams. This change avoids blocking execution on the null/default stream. It doesn't affect application usage and makes applications behave the same on the AMD platform as on NVIDIA.
* Error handling behavior on unsupported GPUs is fixed: the HIP runtime now logs an error message instead of raising a signal abort, which was invisible to developers while the kernel execution process continued. This covers the case where a developer compiles an application via hipcc with the `--offload-arch` option set to a GPU ID that differs from the one on the system.
* HIP complex vector type multiplication and division operations: on the AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, `hipFloatComplex` and `hipDoubleComplex` are defined as complex data types: `typedef float2 hipFloatComplex; typedef double2 hipDoubleComplex;`. Any application that uses complex multiplication and division operations needs to replace the `*` and `/` operators with the following:
* `hipCmulf()` and `hipCdivf()` for `hipFloatComplex`
* `hipCmul()` and `hipCdiv()` for `hipDoubleComplex`
Note: These complex operations are equivalent to the corresponding types/functions on the NVIDIA platform.
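A minimal sketch of the replacement:

```cpp
// Minimal sketch: replacing operator* and operator/ on HIP complex types
// with the explicit helper functions.
#include <hip/hip_runtime.h>
#include <hip/hip_complex.h>

__host__ __device__
hipFloatComplex mul_then_div(hipFloatComplex a, hipFloatComplex b,
                             hipFloatComplex c) {
    hipFloatComplex prod = hipCmulf(a, b);  // previously: a * b
    return hipCdivf(prod, c);               // previously: prod / c
}
```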
##### Removals
* Deprecated Heterogeneous Compute (HCC) symbols and flags are removed from the HIP source code, including:
* Build options for the obsolete `HCC_OPTIONS` were removed from CMake.
* Macro definitions are removed:
* `HIP_INCLUDE_HIP_HCC_DETAIL_DRIVER_TYPES_H`
* `HIP_INCLUDE_HIP_HCC_DETAIL_HOST_DEFINES_H`
* Compilation flags for the platform definitions
* AMD platform
* `HIP_PLATFORM_HCC`
* `HCC`
* `HIP_ROCclr`
* NVIDIA platform
* `HIP_PLATFORM_NVCC`
* File directories in the clr repository are removed. For more details, see
  https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/hcc_detail and
  https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/include/hip/nvcc_detail
* Deprecated gcnArch is removed from the HIP device struct `hipDeviceProp_t`.
* Deprecated `enum hipMemoryType memoryType;` is removed from HIP struct `hipPointerAttribute_t` union.
#### hipBLAS 2.0.0
hipBLAS 2.0.0 for ROCm 6.0.0
##### Additions
* New option to define `HIPBLAS_USE_HIP_BFLOAT16` to switch the API to use the `hip_bfloat16` type (see the sketch after this list)
* New `hipblasGemmExWithFlags` API
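A sketch of opting in; the define must appear before the header include, and the include path assumes
the ROCm 6.0 layout:

```cpp
// Sketch: switching the hipBLAS API to the hip_bfloat16 type. The define must
// precede the header (or be passed via -DHIPBLAS_USE_HIP_BFLOAT16).
#define HIPBLAS_USE_HIP_BFLOAT16
#include <hipblas/hipblas.h>  // path assumes the ROCm 6.0 layout

// With the define in place, bfloat16 entry points take hip_bfloat16
// instead of the legacy hipblasBfloat16 struct.
```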
##### Deprecations
* `hipblasDatatype_t`; use `hipDataType` instead
* `hipblasComplex`; use `hipComplex` instead
* `hipblasDoubleComplex`; use `hipDoubleComplex` instead
* Use of `hipblasDatatype_t` for `hipblasGemmEx` for compute-type; use `hipblasComputeType_t` instead
##### Removals
* The in-place `hipblasXtrmm` (which calculated B <- alpha * op(A) * B) has been replaced with an
  out-of-place `hipblasXtrmm` (which calculates C <- alpha * op(A) * B)
#### hipCUB 3.0.0
hipCUB 3.0.0 for ROCm 6.0.0
##### Changes
* Removed `DOWNLOAD_ROCPRIM`: you can force rocPRIM to download using
`DEPENDENCIES_FORCE_DOWNLOAD`
#### hipFFT 1.0.13
hipFFT 1.0.13 for ROCm 6.0.0
##### Changes
* `hipfft-rider` has been renamed to `hipfft-bench`; it is controlled by the `BUILD_CLIENTS_BENCH`
CMake option (note that a link for the old file name is installed, and the old `BUILD_CLIENTS_RIDER`
CMake option is accepted for backwards compatibility, but both will be removed in a future release)
* Binaries in debug builds no longer have a `-d` suffix
* The minimum rocFFT required version has been updated to 1.0.21
##### Additions
* `hipfftXtSetGPUs`, `hipfftXtMalloc`, `hipfftXtMemcpy`, `hipfftXtFree`, and `hipfftXtExecDescriptor` APIs
have been implemented to allow FFT computing on multiple devices in a single process
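A speculative sketch of the flow these APIs enable; the names mirror the cuFFT Xt API, and the exact
signatures, descriptor type, and enum values used below are assumptions rather than documented
reference:

```cpp
// Speculative sketch: a single-process FFT spread across two devices.
// Treat signatures and enum values as assumptions; consult the hipFFT docs.
#include <hipfft/hipfftXt.h>

void multi_gpu_fft_sketch() {
    hipfftHandle plan;
    hipfftCreate(&plan);

    int gpus[2] = {0, 1};
    hipfftXtSetGPUs(plan, 2, gpus);  // attach the plan to two devices

    size_t work_sizes[2];
    hipfftMakePlan1d(plan, 1 << 20, HIPFFT_C2C, /*batch=*/1, work_sizes);

    hipLibXtDesc* desc = nullptr;
    hipfftXtMalloc(plan, &desc, HIPFFT_XT_FORMAT_INPLACE);
    // ... hipfftXtMemcpy() data in, hipfftXtExecDescriptor(), hipfftXtMemcpy() out ...
    hipfftXtFree(desc);
    hipfftDestroy(plan);
}
```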
#### hipSOLVER 2.0.0
hipSOLVER 2.0.0 for ROCm 6.0.0
##### Additions
* Added hipBLAS as an optional dependency to `hipsolver-test`
* You can use the `BUILD_HIPBLAS_TESTS` CMake option to test the compatibility between hipSOLVER
and hipBLAS
##### Changes
* The `hipsolverOperation_t` type is now an alias of `hipblasOperation_t`
* The `hipsolverFillMode_t` type is now an alias of `hipblasFillMode_t`
* The `hipsolverSideMode_t` type is now an alias of `hipblasSideMode_t`
##### Fixes
* Tests for hipSOLVER info updates in `ORGBR/UNGBR`, `ORGQR/UNGQR`, `ORGTR/UNGTR`,
`ORMQR/UNMQR`, and `ORMTR/UNMTR`
#### hipSPARSE 3.0.0
hipSPARSE 3.0.0 for ROCm 6.0.0
##### Additions
* Added `hipsparseGetErrorName` and `hipsparseGetErrorString`
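A minimal sketch of the new error-string helpers (the include path assumes the ROCm 6.0 layout):

```cpp
// Minimal sketch: converting a hipSPARSE status code into readable text.
#include <hipsparse/hipsparse.h>
#include <cstdio>

void report(hipsparseStatus_t status) {
    std::printf("%s: %s\n",
                hipsparseGetErrorName(status),     // e.g. "HIPSPARSE_STATUS_SUCCESS"
                hipsparseGetErrorString(status));  // human-readable description
}
```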
##### Changes
* Changed the `hipsparseSpSV_solve()` API function to match the cuSPARSE API
* Changed generic API functions to use const descriptors
* Improved documentation
#### hipTensor 1.1.0
hipTensor 1.1.0 for ROCm 6.0.0
##### Additions
* Architecture support for gfx942
* Client tests configuration parameters now support YAML file input format
##### Changes
* Doxygen now treats warnings as errors
##### Fixes
* Client tests output redirections now behave accordingly
* Removed dependency static library deployment
* Security issues for documentation
* Compile issues in debug mode
* Corrected soft link for ROCm deployment
#### MIOpen 2.19.0
MIOpen 2.19.0 for ROCm 6.0.0
##### Additions
* ROCm 5.5 support for gfx1101 (Navi32)
##### Changes
* Tuning results for MLIR on ROCm 5.5
* Bumped MLIR commit to 5.5.0 release tag
##### Fixes
* 3-D convolution host API bug
* `[HOTFIX][MI200][FP16]`: `ConvHipImplicitGemmBwdXdlops` has been disabled when FP16_ALT is
  required
#### MIVisionX
* Added Comprehensive CTests to aid developers
* Introduced Doxygen support for complete API documentation
* Simplified dependencies for rocAL
#### OpenMP
* MI300:
* Added support for gfx942 targets
* Fixed declare target variable access in unified_shared_memory mode
* Enabled OMPX_APU_MAPS environment variable for MI200 and gfx942
* Handled global pointers in forced USM (`OMPX_APU_MAPS`)
* Nextgen AMDGPU plugin:
* Respect `GPU_MAX_HW_QUEUES` in the AMDGPU Nextgen plugin, which takes precedence over the
standard `LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES` environment variable
* Changed the default for `LIBOMPTARGET_AMDGPU_TEAMS_PER_CU` from 4 to 6
* Fixed the behavior of the `OMPX_FORCE_SYNC_REGIONS` environment variable, which is used to
force synchronous target regions (the default is to use an asynchronous implementation)
* Added support for and enabled default of code object version 5
* Implemented target OMPT callbacks and trace records support in the nextgen plugin
* Specialized kernels:
* Removed redundant copying of arrays when xteam reductions are active but not offloaded
* Tuned the number of teams for BigJumpLoop
* Enabled specialized kernel generation with nested OpenMP pragmas, as long as there is no nested
omp-parallel directive
##### Additions
* `-fopenmp-runtimelib={lib,lib-perf,lib-debug}` to select libs
* Warning if mixed HIP / OpenMP offloading (i.e., if HIP language mode is active, but OpenMP target
directives are encountered)
* Introduced compile-time limit for the number of GPUs supported in a system: 16 GPUs in a single
node is currently the maximum supported
##### Changes
* Correctly compute number of waves when workgroup size is less than the wave size
* Implemented `LIBOMPTARGET_KERNEL_TRACE=3`, which prints DEVID traces and API timings
* ASAN support for openmp release, debug, and perf libraries
* Changed LDS lowering default to hybrid
##### Fixes
* Fixed RUNPATH for gdb plugin
* Fixed hang in OMPT support if flush trace is called when there are no helper threads
#### rccl 2.15.5
RCCL 2.15.5 for ROCm 6.0.0
##### Changes
* Compatibility with NCCL 2.15.5
* Renamed the unit test executable to `rccl-UnitTests`
##### Additions
* HW-topology-aware binary tree implementation
* Experimental support for MSCCL
* New unit tests for hipGraph support
* NPKit integration
##### Fixes
* rocm-smi ID conversion
* Support for `HIP_VISIBLE_DEVICES` for unit tests
* Support for p2p transfers to non (HIP) visible devices
##### Removals
* Removed TransferBench from tools as it exists in standalone repo:
[https://github.com/ROCmSoftwarePlatform/TransferBench](https://github.com/ROCmSoftwarePlatform/TransferBench)
#### rocALUTION 3.0.3
rocALUTION 3.0.3 for ROCm 6.0.0
##### Additions
* Support for 64-bit integer vectors
* Inclusive and exclusive sum functionality for vector classes
* Transpose functionality for `GlobalMatrix` and `LocalMatrix`
* `TripleMatrixProduct` functionality for `LocalMatrix`
* `Sort()` function for `LocalVector` class
* Multiple stream support to the HIP backend
##### Optimizations
* `GlobalMatrix::Apply()` now uses multiple streams to better hide communication
##### Changes
* Matrix dimensions and number of non-zeros are now stored using 64-bit integers
* Improved the ILUT preconditioner
##### Removals
* `LocalVector::GetIndexValues(ValueType*)`
* `LocalVector::SetIndexValues(const ValueType*)`
* `LocalMatrix::RSDirectInterpolation(const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*)`
* `LocalMatrix::RSExtPIInterpolation(const LocalVector&, const LocalVector&, bool, float, LocalMatrix*, LocalMatrix*)`
* `LocalMatrix::RugeStueben()`
* `LocalMatrix::AMGSmoothedAggregation(ValueType, const LocalVector&, const LocalVector&, LocalMatrix*, LocalMatrix*, int)`
* `LocalMatrix::AMGAggregation(const LocalVector&, LocalMatrix*, LocalMatrix*)`
##### Fixes
* Unit tests no longer ignore BCSR block dimension
* Fixed documentation typos
* Bug in multi-coloring for non-symmetric matrix patterns
#### rocBLAS 4.0.0
rocBLAS 4.0.0 for ROCm 6.0.0
##### Additions
* Beta API `rocblas_gemm_batched_ex3` and `rocblas_gemm_strided_batched_ex3`
* Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and
gemv_strided_batched
* Use of `rocblas_status_excluded_from_build` when calling functions that require Tensile (when using
rocBLAS built without Tensile)
* System for asynchronous kernel launches that set a `rocblas_status` failure based on a
`hipPeekAtLastError` discrepancy
##### Optimizations
* TRSM performance for small sizes (m < 32 && n < 32)
##### Deprecations
* Atomic operations will be disabled by default in a future release of rocBLAS (you can enable atomic
operations using the `rocblas_set_atomics_mode` function)
##### Removals
* `rocblas_gemm_ext2` API function
* In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and
out-of-place trmm
* int8x4 support is removed (int8 support is unchanged)
* `#define __STDC_WANT_IEC_60559_TYPES_EXT__` is removed from `rocblas-types.h` (if you want
  ISO/IEC TS 18661-3:2015 functionality, you must define `__STDC_WANT_IEC_60559_TYPES_EXT__`
  before including `float.h`, `math.h`, and `rocblas.h`; see the sketch after this list)
* The default build removes device code for gfx803 architecture from the fat binary
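A sketch of the define now required on the application side (the rocBLAS include path assumes the
current ROCm layout):

```cpp
// Sketch: rocblas-types.h no longer defines this macro for you, so define it
// yourself before the standard headers if you rely on ISO/IEC TS 18661-3 types.
#define __STDC_WANT_IEC_60559_TYPES_EXT__
#include <float.h>
#include <math.h>
#include <rocblas/rocblas.h>
```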
##### Fixes
* Made offset calculations for 64-bit rocBLAS functions safe
* Fixes for very large leading dimension or increment potentially causing overflow:
* Level2: `gbmv`, `gemv`, `hbmv`, `sbmv`, `spmv`, `tbmv`, `tpmv`, `tbsv`, and `tpsv`
* Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files,
based on device architecture
* Guards against no-op kernel launches that result in a potential `hipGetLastError`
##### Changes
* Reduced the default verbosity of `rocblas-test` (you can see all tests by setting the
`GTEST_LISTENER=PASS_LINE_IN_LOG` environment variable)
#### rocFFT 1.0.25
rocFFT 1.0.25 for ROCm 6.0.0
##### Additions
* Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices
in a single process
* `rocfft_field` is a new type that can be added to a plan description to describe the layout of FFT
input or output
* `rocfft_field_add_brick` can be called to describe the brick decomposition of an FFT field, where each
brick can be assigned a different device
These interfaces are still experimental and subject to change. Your feedback is appreciated.
You can raise questions and concerns by opening issues in the
[rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).
Note that multi-device FFTs currently have several limitations (we plan to address these in future
releases):
* Real-complex (forward or inverse) FFTs are not supported
* Planar format fields are not supported
* Batch (the `number_of_transforms` provided to `rocfft_plan_create`) must be 1
* FFT input is gathered to the current device at run time, so all FFT data must fit on that device
##### Optimizations
* Improved the performance of several 2D/3D real FFTs supported by `2D_SINGLE` kernel. Offline
tuning provides more optimization for fx90a
* Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks
##### Changes
* Built kernels in a solution map to the library kernel cache
* Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real
inverse (complex-to-real) input, as this allows for faster performance
* `rocfft-rider` and `dyna-rocfft-rider` have been renamed to `rocfft-bench` and `dyna-rocfft-bench`;
these are controlled by the `BUILD_CLIENTS_BENCH` CMake option
* Links for the former file names are installed, and the former `BUILD_CLIENTS_RIDER` CMake option
is accepted for compatibility, but both will be removed in a future release
* Binaries in debug builds no longer have a `-d` suffix
##### Fixes
* rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit
integers -> 32-bit float)
#### ROCgdb 13.2
ROCgdb 13.2 for ROCm 6.0.0
##### Additions
* Support for watchpoints on scratch memory addresses
* Support for gfx1100, gfx1101, and gfx1102
* Support for gfx942
##### Optimizations
* Improved performance when handling the end of a process with a large number of threads.
##### Known issues
* On certain configurations, ROCgdb can show the following warning message:
`warning: Probes-based dynamic linker interface failed. Reverting to original interface.`
This does not affect ROCgdb's functionality.
* ROCgdb cannot debug a program on an AMDGPU device past a
`s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)` instruction. If an exception is reported after this
instruction has been executed (including asynchronous exceptions), the wave is killed and the
exceptions are only reported by the ROCm runtime.
#### rocm-cmake 0.11.0
rocm-cmake 0.11.0 for ROCm 6.0.0
##### Changes
* Improved validation, documentation, and rocm-docs-core integration for ROCMSphinxDoc
##### Fixes
* Fixed extra `make` flags passed for Clang-Tidy (ROCMClangTidy)
* Fixed issues with ROCMTest when using a module in a subdirectory
#### ROCm Compiler
* On MI300, kernel arguments can be preloaded into SGPRs rather than passed in memory. This
feature is enabled with a compiler option, which also controls the number of arguments to pass in
SGPRs.
* Improved register allocation at `-O0` to avoid compiler crashes ("ran out of registers during register allocation")
* Improved generation of debug information:
* Improved compile time
* Avoided compiler crashes
#### rocPRIM 3.0.0
rocPRIM 3.0.0 for ROCm 6.0.0
##### Additions
* `block_sort::sort()` overload for keys and values with a dynamic size, for all block sort algorithms
* All `block_sort::sort()` overloads with a dynamic size are now supported for
`block_sort_algorithm::merge_sort` and `block_sort_algorithm::bitonic_sort`
* New two-way partition primitive `partition_two_way`, which can write to two separate iterators
##### Optimizations
* Improved `partition` performance
##### Fixes
* Fixed `rocprim::MatchAny` for devices with 64-bit warp size
* Note that `rocprim::MatchAny` is deprecated; use `rocprim::match_any` instead
#### rocprofiler 2.0.0
rocprofiler 2.0.0 for ROCm 6.0.0
##### Additions
* Updated supported GPU architectures in the README, with profiler versions
* Automatic ISA dumping for ATT (see the README)
* CSV mode for ATT (see the README)
* Added an option to control kernel name truncation
* Limited rocprof (v1) script usage to supported architectures only
* Added tool versioning so that rocprofv2 can be run using rocprof (see the README for more information)
* Added plugin versioning in rocprofv2 (see the README for more details)
* Added `--version` in rocprof and rocprofv2 to show the current rocprof/rocprofv2 version along with ROCm version information
#### rocRAND 2.10.17
rocRAND 2.10.17 for ROCm 6.0.0
##### Changes
* Generator classes from `rocrand.hpp` are no longer copyable (in previous versions these copies
would copy internal references to the generators and would lead to double free or memory leak
errors)
* These types should be moved instead of copied; move constructors and operators are now
defined
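A minimal sketch of the new move-only behavior (the include path assumes the current ROCm layout,
and `default_random_engine` is the wrapper's default engine typedef):

```cpp
// Minimal sketch: generator objects now transfer ownership via moves instead
// of copies, which avoids double-free and leak errors from shared internals.
#include <rocrand/rocrand.hpp>  // path assumes the current ROCm layout
#include <utility>

void ownership_demo() {
    rocrand_cpp::default_random_engine engine;

    // auto copy = engine;          // no longer compiles: copying is deleted
    auto owner = std::move(engine);  // OK: ownership transfers cleanly
}
```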
##### Optimizations
* Improved MT19937 initialization and generation performance
##### Removals
* Removed the hipRAND submodule from rocRAND; hipRAND is now only available as a separate
package
* Removed references to, and workarounds for, the deprecated hcc
##### Fixes
* `mt19937_engine` from `rocrand.hpp` is now move-constructible and move-assignable (previously, the
  move constructor and move assignment operator were deleted for this class)
* Various fixes for the C++ wrapper header `rocrand.hpp`
* The name of `mrg31k3p` is now correctly spelled (it was incorrectly named `mrg31k3a` in previous
  versions)
* Added the missing `order` setter method for `threefry4x64`
* Fixed the default ordering parameter for `lfsr113`
* Build error when using Clang++ directly resulting from unsupported `amdgpu-target` references
#### rocSOLVER 3.24.0
rocSOLVER 3.24.0 for ROCm 6.0.0
##### Additions
* Cholesky refactorization for sparse matrices: `CSRRF_REFACTCHOL`
* Added `rocsolver_rfinfo_mode` and the ability to specify the desired refactorization routine (see `rocsolver_set_rfinfo_mode`)
##### Changes
* `CSRRF_ANALYSIS` and `CSRRF_SOLVE` now support sparse Cholesky factorization
#### rocSPARSE 3.0.2
rocSPARSE 3.0.2 for ROCm 6.0.0
##### Changes
* Function arguments for `rocsparse_spmv`
* Function arguments for `rocsparse_xbsrmv` routines
* When using host pointer mode, you must now call `hipStreamSynchronize` following `doti`, `dotci`,
  `spvv`, and `csr2ell` (see the sketch after this list)
* Improved documentation
* Improved verbose output during argument checking on API function calls
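A minimal sketch of the new synchronization requirement for `doti` in host pointer mode; looking up
the stream via `rocsparse_get_stream` is one straightforward way to know what to synchronize:

```cpp
// Minimal sketch: in host pointer mode the dot product result is written
// asynchronously, so it must not be read before the stream has synchronized.
#include <rocsparse/rocsparse.h>
#include <hip/hip_runtime.h>

void host_mode_doti(rocsparse_handle handle, rocsparse_int nnz,
                    const float* x_val, const rocsparse_int* x_ind,
                    const float* y) {
    float result = 0.0f;
    rocsparse_set_pointer_mode(handle, rocsparse_pointer_mode_host);
    rocsparse_sdoti(handle, nnz, x_val, x_ind, y, &result,
                    rocsparse_index_base_zero);

    hipStream_t stream;
    rocsparse_get_stream(handle, &stream);
    hipStreamSynchronize(stream);  // required before reading `result`
}
```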
##### Removals
* Auto stages from `spmv`, `spmm`, `spgemm`, `spsv`, `spsm`, and `spitsv`
* Formerly deprecated `rocsparse_spmm_ex` routine
##### Fixes
* Bug in `rocsparse-bench` where the SpMV algorithm was not taken into account in CSR format
* BSR and GEBSR routines (`bsrmv`, `bsrsv`, `bsrmm`, `bsrgeam`, `gebsrmv`, `gebsrmm`) didn't always
show `block_dim==0` as an invalid size
* Passing `nnz = 0` to `doti` or `dotci` wasn't always returning a dot product of 0
##### Additions
* `rocsparse_inverse_permutation`
* Mixed-precisions for SpVV
* Uniform int8 precision for gather and scatter
#### rocThrust 3.0.0
rocThrust 3.0.0 for ROCm 6.0.0
##### Additions
* Updated to match upstream Thrust 2.0.1
* `NV_IF_TARGET` macro from libcu++ for NVIDIA backend and HIP implementation for HIP backend
##### Changes
* The CMake build system now accepts `GPU_TARGETS` in addition to `AMDGPU_TARGETS` for
setting targeted GPU architectures
* `GPU_TARGETS=all` compiles for all supported architectures
* `AMDGPU_TARGETS` is only provided for backwards compatibility (`GPU_TARGETS` is preferred)
* Removed CUB symlink from the root of the repository
* Removed support for deprecated macros (`THRUST_DEVICE_BACKEND` and
`THRUST_HOST_BACKEND`)
##### Known issues
* The `THRUST_HAS_CUDART` macro, which is no longer used in Thrust (it's provided only for legacy
support) is replaced with `NV_IF_TARGET` and `THRUST_RDC_ENABLED` in the NVIDIA backend. The
HIP backend doesn't have a `THRUST_RDC_ENABLED` macro, so some branches in Thrust code may
be unreachable in the HIP backend.
#### rocWMMA 1.3.0
rocWMMA 1.3.0 for ROCm 6.0.0
##### Additions
* Support for gfx942
* Support for f8, bf8, and xfloat32 data types
* Support for `HIP_NO_HALF`, `__HIP_NO_HALF_CONVERSIONS__`, and
  `__HIP_NO_HALF_OPERATORS__` (e.g., PyTorch environment)
##### Changes
* rocWMMA with hipRTC now supports `bfloat16_t` data type
* gfx11 WMMA now uses lane swap instead of broadcast for layout adjustment
* Updated samples GEMM parameter validation on host arch
##### Fixes
* Disabled GoogleTest static library deployment
* Extended tests now build in large code model
#### Tensile 4.39.0
Tensile 4.39.0 for ROCm 6.0.0
##### Additions
* Added `aquavanjaram` support: gfx942, fp8/bf8 datatype, xf32 datatype, and
stochastic rounding for various datatypes
* Added and updated tuning scripts
* Added `DirectToLds` support for larger data types with 32-bit global load (old parameter `DirectToLds`
is replaced with `DirectToLdsA` and `DirectToLdsB`), and the corresponding test cases
* Added the average of frequency, power consumption, and temperature information for the winner
kernels to the CSV file
* Added asmcap check for MFMA + const src
* Added support for wider local read + pack with v_perm (with `VgprForLocalReadPacking=True`)
* Added a new parameter to increase `miLatencyLeft`
##### Optimizations
* Enabled `InitAccVgprOpt` for `MatrixInstruction` cases
* Implemented local read related parameter calculations with `DirectToVgpr`
* Enabled dedicated vgpr allocation for local read + pack
* Optimized code initialization
* Optimized sgpr allocation
* Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
* Enabled `miLatency` optimization for specific data types, and fixed
instruction scheduling
##### Changes
* Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
* Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
* Removed unused `CustomKernels` and `ReplacementKernels`
* Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
* Removed unused code for DirectToLds
* Updated test cases for DTV + TransposeLDS=False
* Moved the `MinKForGSU` parameter from `globalparameter` to `BenchmarkCommonParameter` to
support smaller K
* Changed how to calculate `latencyForLR` for miLatency
* Set minimum value of `latencyForLRCount` for 1LDSBuffer to avoid getting rejected by
overflowedResources=5 (related to miLatency)
* Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
* Supported multi-gpu for different architectures in lazy library loading
* Enabled dtree library for batch > 1
* Added problem scale feature for dtree selection
* Modified non-lazy load build to skip experimental logic
##### Fixes
* Predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
* Boundary check for mirror dims and re-enable disabled mirror dims test cases
* Merge error affecting i8 with WMMA
* Mismatch issue with DTLds + TSGR + TailLoop
* Bug with `InitAccVgprOpt` + GSU > 1 and a mismatch issue with PGR=0
* Override for unloaded solutions when lazy loading
* Added missing headers
* Boost link for a clean build on Ubuntu 22
* Bug in `forcestoresc1` arch selection
* Compiler directive for gfx942
* Formatting for `DecisionTree_test.cpp`


@@ -0,0 +1,10 @@
The ROCm 6.0.2 point release consists of minor bug fixes to improve the stability of MI300 GPU applications. This release introduces several new driver features for system qualification on our partner server offerings.
#### hipFFT 1.0.13
hipFFT 1.0.13 for ROCm 6.0.2
##### Changes
* Removed the Git submodule for shared files between rocFFT and hipFFT; instead, just copy the files
over (this should help simplify downstream builds and packaging)