Compare commits

...

71 Commits

Author SHA1 Message Date
Wang, Yanyao
66e8ada46c Test ROCm 6.1.2 build 2024-06-05 14:48:12 -07:00
Wang, Yanyao
673f1eed48 Fix Markdown format for the linter check 2024-06-05 10:41:25 -07:00
Wang, Yanyao
5642d4f7b0 Update the branch of ROCm repo after testing 2024-06-05 09:50:20 -07:00
Wang, Yanyao
7f93d1635d Build ROCm from source 2024-06-04 21:20:04 -07:00
Sam Wu
17f12a11e7 Merge pull request #3234 from WBobby/roc-6.1.2-manifest
Update manifest file for ROCm6.1.2
2024-06-04 14:50:14 -06:00
Wang, Yanyao
b2f0f0acdf Update manifest file for ROCm6.1.2 2024-06-04 15:39:16 -05:00
Sam Wu
a11c0512e1 Merge branch 'docs/6.1.2' into roc-6.1.x 2024-06-04 14:38:59 -06:00
Sam Wu
eec71da8dd Merge pull request #3232 from ROCm/develop
Merge develop into roc-6.1.x
2024-06-04 14:36:34 -06:00
Sam Wu
39891fe185 Sync develop branch 2024-06-04 14:32:36 -06:00
Peter Park
14ee171649 Add OS support note (#91) 2024-06-04 14:11:01 -04:00
Peter Park
e7bff21d3e Add final fixes to 6.1.2 release notes and changelog (#90)
* Regenerate changelog

* Add component changelogs and known issue

Fix RELEASE.md headings

Update pub datestamp for 6.1.2

Add AMDSMI and ROCm SMI to 6.1.2 template

Add rccl and rocBLAS

Update intro blurb and headings

Add ROCm SMI fix

Add missed heading to AMDSMI

Update datestamp and release version number

Update version and release number

Add known issue re: MI300X error detection

Words

Add issue link

Rm GitHub issue link

Move known issue down

Update ki wording

Remove "this issue has been investigated ... " from known issue

Fix changelog h1

* Reorg known issue, upcoming changes, remove rocDecode tested configurations

* Add fixes from review

* Add fixed issue link

* Fix heading

* Remove known issue
2024-06-04 12:23:07 -04:00
Peter Park
6abe5b50a2 Merge pull request #3229 from peterjunpark/docs/6.1.2
docs/6.1.2: Update the links for rocminfo and rocm-bandwidth-test (#3213)
2024-06-04 08:12:15 -07:00
amitkumar-amd
df864f8f79 Update the links for rocminfo and rocm-bandwidth-test (#3213)
* Update the links for rocminfo and rocm-bandwidth-test

* Update the links for rocminfo and rocm-bandwidth-test

* Update the links for rocminfo and rocm-bandwidth-test

* Update links to intersphinx links

---------

Co-authored-by: Peter Jun Park <peter.park@amd.com>
2024-06-04 11:00:52 -04:00
amitkumar-amd
7290ce9030 Update the links for rocminfo and rocm-bandwidth-test (#3213)
* Update the links for rocminfo and rocm-bandwidth-test

* Update the links for rocminfo and rocm-bandwidth-test

* Update the links for rocminfo and rocm-bandwidth-test

* Update links to intersphinx links

---------

Co-authored-by: Peter Jun Park <peter.park@amd.com>
2024-06-04 10:59:22 -04:00
Peter Park
d6d18d7cd4 Merge pull request #3226 from peterjunpark/docs/6.1.2
docs/6.1.2: Add "Fine Tuning LLMs" how to guide (#3124)
2024-06-04 07:02:36 -07:00
Peter Park
30f10e0145 Update fine-tuning guide: title, improve readability in code blocks, fix typos (#3222)
* Fix typo

* Add torchtune link

* Add newlines before comments in code blocks for readability

* Update title
2024-06-03 22:15:36 -04:00
Peter Park
1e55e01af3 Add "Fine Tuning LLMs" how to guide (#3124)
* Add Fine Tuning LLMs how to guide

* Reorg and refactor Fine-tuning LLMs with ROCm

Update index and headings

Fix formatting and update toc

Split out content from index to overview.rst

Add metadata

Clean up overview

Add inference sections, fix rst errors, clean up single-gpu-fine-tuning

Combine fine-tuning and inference guides

Fix some links and formatting

Update toc and add formatting fixes

Add ck kernel fusion content

Update toc

Clean up model quantization and acceleration

Add CK images

Clean up profiling

Update triton kernel performance optimization

Update llm inference frameworks guide

Disable automatic number of figures and tables in Sphinx conf

Change tabs to spaces

Change heading to end with -ing

Add link fixes and heading updates

Add rocprof/Omniperf/Omnitrace section

Update profiling and debugging guide

Add formatting fixes

Satisfy spellcheck

Fix words

Delete unused file

Finish overview

Clean up first 4 sections

Multi-gpu fine-tuning guide: slight fixes

Update toc

Remove tabs

Formatting fixes

* Minor wording updates

* Add some clean-up

* Update profiling and debugging guide

* Fix Omnitrace link

* Update ck kernel fusion with latest

* Update CK formatting

* Fix perfetto link syntax

* Fix typos and add blurbs

* Add fixes to Triton optimization doc

* Tabify saving adapters / models section

* Fix linting errors - spellcheck

Fix spelling and grammar

Satisfy linter

Update wording in profiling guide

Add fixes to satisfy linter

More fixes for linting in Triton guide

More linting fixes

Spellcheck in CK guide

* Improve triton guide

Fix linting errors and optics

* Add occupancy / vgpr table

Change some wording

* Re-add tunableop

* Add missing indent in _toc.yml

* Remove ckProfiler references

* Add links to resources

* Add refs in CK optimization guide

* Rename files and fix internal links

* Organize tuning guides

Reorg triton

* Add compute unit diagram

* Remove AutoAWQ

* Add higher res image for Perfetto trace example

* Update link text

* Update fig nums

* Update some formatting

* Update "Inductor"

* Change "Inductor" to TorchInductor

* Add link to official TorchInductor docs
2024-06-03 22:15:13 -04:00
Peter Park
9a347aa168 Update fine-tuning guide: title, improve readability in code blocks, fix typos (#3222)
* Fix typo

* Add torchtune link

* Add newlines before comments in code blocks for readability

* Update title
2024-06-03 22:11:19 -04:00
Peter Park
fed33835a0 Add "Fine Tuning LLMs" how to guide (#3124)
* Add Fine Tuning LLMs how to guide

* Reorg and refactor Fine-tuning LLMs with ROCm

Update index and headings

Fix formatting and update toc

Split out content from index to overview.rst

Add metadata

Clean up overview

Add inference sections, fix rst errors, clean up single-gpu-fine-tuning

Combine fine-tuning and inference guides

Fix some links and formatting

Update toc and add formatting fixes

Add ck kernel fusion content

Update toc

Clean up model quantization and acceleration

Add CK images

Clean up profiling

Update triton kernel performance optimization

Update llm inference frameworks guide

Disable automatic number of figures and tables in Sphinx conf

Change tabs to spaces

Change heading to end with -ing

Add link fixes and heading updates

Add rocprof/Omniperf/Omnitrace section

Update profiling and debugging guide

Add formatting fixes

Satisfy spellcheck

Fix words

Delete unused file

Finish overview

Clean up first 4 sections

Multi-gpu fine-tuning guide: slight fixes

Update toc

Remove tabs

Formatting fixes

* Minor wording updates

* Add some clean-up

* Update profiling and debugging guide

* Fix Omnitrace link

* Update ck kernel fusion with latest

* Update CK formatting

* Fix perfetto link syntax

* Fix typos and add blurbs

* Add fixes to Triton optimization doc

* Tabify saving adapters / models section

* Fix linting errors - spellcheck

Fix spelling and grammar

Satisfy linter

Update wording in profiling guide

Add fixes to satisfy linter

More fixes for linting in Triton guide

More linting fixes

Spellcheck in CK guide

* Improve triton guide

Fix linting errors and optics

* Add occupancy / vgpr table

Change some wording

* Re-add tunableop

* Add missing indent in _toc.yml

* Remove ckProfiler references

* Add links to resources

* Add refs in CK optimization guide

* Rename files and fix internal links

* Organize tuning guides

Reorg triton

* Add compute unit diagram

* Remove AutoAWQ

* Add higher res image for Perfetto trace example

* Update link text

* Update fig nums

* Update some formatting

* Update "Inductor"

* Change "Inductor" to TorchInductor

* Add link to official TorchInductor docs
2024-06-03 14:04:33 -04:00
danielsu-amd
f52bc2bc68 External CI: Add rocBLAS dependency to rocSPARSE (#3216) 2024-06-03 13:41:30 -04:00
danielsu-amd
205790159d External CI: use pipelined rocm-core for rocprofiler (#3215) 2024-06-03 10:52:56 -04:00
Peter Park
9679a84a8b Add components, known issues, and fixed issues to 6.1.2 RN / CL (#87)
* Regenerate changelog

* Add component changelogs and known issue

Fix RELEASE.md headings

Update pub datestamp for 6.1.2

Add AMDSMI and ROCm SMI to 6.1.2 template

Add rccl and rocBLAS

Update intro blurb and headings

Add ROCm SMI fix

Add missed heading to AMDSMI

Update datestamp and release version number

Update version and release number

Add known issue re: MI300X error detection

Words

Add issue link

Rm GitHub issue link

Move known issue down

Update ki wording

Remove "this issue has been investigated ... " from known issue

Fix changelog h1
2024-06-03 08:51:38 -04:00
Sam Wu
d34f7d7777 Merge pull request #3210 from ROCm/dependabot/pip/docs/sphinx/requests-2.32.2
Bump requests from 2.31.0 to 2.32.2 in /docs/sphinx
2024-05-31 17:10:09 -06:00
dependabot[bot]
16fca72626 Bump requests from 2.31.0 to 2.32.2 in /docs/sphinx
Bumps [requests](https://github.com/psf/requests) from 2.31.0 to 2.32.2.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.31.0...v2.32.2)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-31 23:02:26 +00:00
Sam Wu
1a6ce7f6e0 Merge pull request #3212 from ROCm/dependabot/pip/docs/sphinx/rocm-docs-core-1.2.0
Bump rocm-docs-core from 1.1.1 to 1.2.0 in /docs/sphinx
2024-05-31 17:01:03 -06:00
dependabot[bot]
35c17fcce5 Bump rocm-docs-core from 1.1.1 to 1.2.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.1.1 to 1.2.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.1.1...v1.2.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-05-31 22:07:13 +00:00
Sam Wu
bf19dd1dc8 Update RTD config 2024-05-31 15:18:53 -06:00
Sam Wu
5fec2e1ca4 Update documentation requirements 2024-05-31 13:49:14 -06:00
danielsu-amd
1975889da1 External CI: Remove redundant rocm_smi_lib pipeline ID (#3211) 2024-05-31 14:25:09 -04:00
Sam Wu
b9c4490f96 Merge branch 'roc-6.1.x' into docs/6.1.2 2024-05-31 11:59:44 -06:00
Sam Wu
7fcb0f19a9 Merge pull request #3208 from ROCm/develop
Merge develop into roc-6.1.x
2024-05-31 11:49:48 -06:00
Sam Wu
625c18371c Merge branch 'roc-6.1.x' into develop 2024-05-31 11:47:19 -06:00
danielsu-amd
9dd6e42122 External CI: Dockerless + latest source for rocprofiler and rocm_bandwidth_test (#3209) 2024-05-31 13:27:47 -04:00
Joseph Macaranas
9d27863954 MIOpen External CI: Add rocprofiler-register dependency for latest source (#3203) 2024-05-31 11:23:46 -04:00
Joseph Macaranas
04561cc60f External CI: updated cmake dependencies (#3206)
Template with bash commands to update cmake with snap.
Use template for two components that want updated cmake with latest source on their default branches.
2024-05-31 11:16:36 -04:00
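As context for the commit above: the template it refers to is presumably the steps/dependencies-cmake-latest.yml referenced elsewhere in these diffs, whose contents are not shown here. The following is only a minimal sketch of what a snap-based cmake update step could look like; the step layout and exact commands are assumptions based on the commit message, not the real file.

# hypothetical contents of a cmake-update steps template; illustrative only
steps:
- task: Bash@3
  displayName: 'Update cmake via snap'
  inputs:
    targetType: inline
    script: |
      sudo apt-get remove --yes cmake    # drop any distro-packaged cmake first
      sudo snap install cmake --classic  # snap tracks the latest upstream cmake release
      cmake --version                    # sanity-check which cmake is now on PATH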
Joseph Macaranas
14a3e80a1b hipFFT External CI: Missing vmImage syntax for base pool (#3204) 2024-05-31 11:16:17 -04:00
abhimeda
32334fd826 Removing docker from hipBLASLt (#3202) 2024-05-30 21:12:54 -04:00
Peter Park
61d18252ab Remove unused images and add link to usage in Deep Learning install guide (#3196) 2024-05-30 19:28:13 -04:00
Sam Wu
2d8eba0404 Disable pdf builds (#3197) 2024-05-30 19:13:54 -04:00
Joseph Macaranas
cfaa056ae0 Add rocPRIM dependency to rocSOLVER CI build (#3195) 2024-05-30 17:33:02 -04:00
Peter Park
6a5defb825 Add "How to use ROCm for AI" (#3117)
* Add Using ROCm for AI

Add PyTorch Docker installation images

Split doc into subtopics

Add metadata

Clean up index

Clean up hugging face guide

Clean up installation guide

Fix rST formatting

Clean up install and train-a-model

Clean up MAD

Delete unused file

Add ref anchors and clean up MAD doc

Add formatting fixes

Update toc and section index

Format some code blocks

Remove install guide and update toc

Chop installation guide

Clean up deployment and hugging face sections

Change headings to end in -ing

Fix spelling in Training a model

Delete MAD and split out install content

Fix formatting

Change words to satisfy spellcheck linter

* Add review suggestions and add helpful links

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>

Add helpful links and add review suggestions

Remove fine-tuning link and links to D5 and MAGMA

Update docs/how-to/rocm-for-ai/deploy-your-model.rst

Co-authored-by: Young Hui - AMD <145490163+yhuiYH@users.noreply.github.com>

Update DeepSpeed link

Add subheading to ML framework installation and closing blurb to hugging face models guide

* Reorder topics
2024-05-30 16:17:44 -04:00
randyh62
6864f1546e hipcc doc link (#3190)
* hipcc doc link

* Update docs/what-is-rocm.rst

Co-authored-by: Istvan Kiss <neon60@gmail.com>

* Update docs/what-is-rocm.rst

Co-authored-by: Istvan Kiss <neon60@gmail.com>

---------

Co-authored-by: Istvan Kiss <neon60@gmail.com>
2024-05-30 12:52:58 -07:00
Joseph Macaranas
58f543c010 Some new external CI dependencies for latest source on default branches (#3188)
rdc: amdsmi
rocBLAS: AOCL 4.2
rocPRIM: incorrect compiler path
2024-05-30 10:56:52 -04:00
abhimeda
7504e6bc13 removing docker from external ci pipelines (#3177)
* removed docker and pointed ROCm deps to our existing builds

* removed vmImage tag for pool

* added pip to apt list and renamed from rocFFT to hipFFT

* fixed spelling mistakes in rocmDependencies

* added correct apt dep for pip

* removed leading slash in the cmake flags

* changed cxx_compiler to /rocm/bin/hipcc

* added llvm-project, ROCR-Runtime, clr, and rocminfo to rocm deps

* added rocFFT as a rocm dependency

* removed docker and added our builds for components

* removed rocFFT from rocm deps

* Fixed typo in rocFFT value

* added rocprofiler-register to rocFFT and fixed typo in the dependencies-rocm file

* changed cxx compiler to amdclang++

* fixed amdclang++ paths

* moving to faster machine

* added cmake module paths

* switched back to medium build

* added libopm-dev to apt deps

* added libomp-14-dev to apt deps

* added aomp as a rocm dep

* added aomp as a rocm dep

* added hipcc as the cxx_compiler

* reverted back to clang++ as the cxx_compiler

* removed unmentioned rocm deps from the readme

* removed docker

* added python3-pip as an apt dep

* fixed compiler paths

* added hipRAND as a rocm dep

* added print statements to see directory structure

* adding a print statement into /agent/_work/1/s/build/library

* added -Tensile_rocm_assembler as a build flag

* removed a broken script line

* added D to tensile rocm assembler

* added DROCM_PATH to build flags

* fixed typo

* changed build pool from medium to base

* changed build pool from base to low

* added env variables using Joseph's PR

* removed docker from hipBLASLt and added rocm dependencies that point to our builds

* added pip to the apt packages array

* changed cmake_cxx_compiler env var to amdclang++

* changed cmake_cxx_compiler env var to amdclang++

* changed cmake_cxx_compiler env var to hipcc

* changed cmake_cxx_compiler env var to hipcc

* changed clang to amdclang

* changed all refs mentioning hipcc to amdclang

* changed cmake_cxx_compiler back to hipcc

* added a HIP_PATH env var based off Tensile/Source/FindHIP.cmake

* added hipcc to HIP_PATH

* added rocm-cmake to rocm deps

* added rocRAND as a rocm dep

* removed dcmake_module flag

* added libomp-dev as an apt dep

* added aomp as a rocm dep

* added clang as an apt dep

* reverted changes back to how they appear in develop since this branch will be submitted for review

* removed unnecessary flags

* adding -DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++ -DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang back to see if these are vital to a successful build

* removed newline character
2024-05-30 10:55:24 -04:00
Joseph Macaranas
7e1a1bc7c2 Change offload build to a parameter (#3187) 2024-05-29 21:50:02 -04:00
Joseph Macaranas
a2574adc73 Disable aomp offload build for initial external ci-build work (#3186)
* Disable aomp offload build for initial ci-build work

* Missing dependency for medium pool use of rocPRIM

* Latest rocBLAS source needs added ROCm dependencies
2024-05-29 21:45:34 -04:00
Joseph Macaranas
7207d815d1 ci-build scenario adjustments for aomp and rocm_smi_lib (#3185)
* Update rocm_smi_lib.yml

* Change checkout reference for aomp ci-build case
2024-05-29 19:51:06 -04:00
Sam Wu
5930282993 docs(conf.py): Update PDF version to 6.1.1 (#3184) 2024-05-29 15:11:19 -04:00
Sam Wu
e63ff81549 Merge pull request #3169 from ROCm/develop
Merge develop into roc-6.1.x
2024-05-29 12:25:51 -06:00
Sam Wu
cd575e2926 Merge pull request #3172 from perovskikh/patch-1
Update default.xml
2024-05-29 12:23:29 -06:00
Peter Park
3a68f43df7 Reorg 'Deep learning' and 'Tuning guides' docs (#3153)
* Rename 'Tuning guides' to 'Hardware optimization'

* Move deep learning to Install section

* Change 'Hardware' to 'System' to align with index.md

* Satisfy spellcheck linter

* adding new framework install graphic with JAX

* Fix link to ROCm libraries list

* crop framework_install graphic

* Reset .wordlist.txt update

* Prettify deep learning framework installation page

* Change spacing in list of frameworks

---------

Co-authored-by: Young Hui <young.hui@amd.com>
2024-05-29 14:12:43 -04:00
alexxu-amd
a8c7faeae3 Remove docker from multiple external CI pipelines (#3161) 2024-05-29 10:19:02 -04:00
Joseph Macaranas
892c0957b8 Special pipeline for aomp with latest source (#3174)
aomp build is not triggered by changes to the aomp repo, but by updates to llvm-project and ROCR-Runtime, so the trigger definition can remain in this ROCm/ROCm repo (see the aomp pipeline YAML in the first diff below).
2024-05-29 10:12:25 -04:00
abhimeda
82ed9e9ffd Removing docker from hipFFT (#3160) 2024-05-29 10:11:54 -04:00
Joseph Macaranas
32592f436b Change ROCm interdependencies for MIVisionX (#3158)
Instead of using Docker and apt installs of ROCm component dependencies, use tarballs from Azure Pipeline builds to enable updates of ROCm interdependencies without waiting for releases.
2024-05-29 10:09:52 -04:00
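For a sense of the mechanics behind this change (the actual steps/dependencies-rocm.yml template is referenced but not shown in these diffs, so this is a hedged sketch, not the real file): a consumer pipeline declares the producing pipeline as a resource, downloads its artifact, and unpacks the tarball under $(Agent.BuildDirectory)/rocm, which is also why the build flags in the diffs below move from /opt/rocm to $(Agent.BuildDirectory)/rocm.

# hypothetical sketch of consuming one ROCm interdependency as a pipeline artifact
# (resource and artifact names are assumptions)
resources:
  pipelines:
  - pipeline: rocblas_pipeline     # handle used by the download step below
    source: rocBLAS                # Azure pipeline that produced the tarball
steps:
- download: rocblas_pipeline       # lands in $(Pipeline.Workspace)/rocblas_pipeline
  artifact: drop
- task: Bash@3
  displayName: 'Unpack rocBLAS tarball'
  inputs:
    targetType: inline
    script: |
      mkdir -p $(Agent.BuildDirectory)/rocm
      tar -xzf $(Pipeline.Workspace)/rocblas_pipeline/drop/*.tar.gz -C $(Agent.BuildDirectory)/rocm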
Joseph Macaranas
cd5c6768d7 Match case of GDB repo name for external CI (#3171)
* Match case of repo name for external CI
2024-05-29 09:58:04 -04:00
Bence Parajdi
97129c0972 Merge pull request #3062 from StreamHPC/cu
add cu setting page
2024-05-29 09:46:20 +02:00
Kiriti Gowda
885ad0da42 Update MIVisionX.yml (#3173)
Added OpenCV
2024-05-29 02:53:46 -04:00
Vadik
80d7feeebc Update default.xml
The remote name for KhronosGroup is missing from default.xml:

https://github.com/ROCm/ROCm/pull/3098/files#diff-d9b8e4a48f8e111ec5d21480d9d33a893b365dfa7f8550bbc0577e4d42afeac8L4
2024-05-29 10:43:12 +05:00
danielsu-amd
518a2069b3 External CI: Update pipeline interdependencies (#3162)
Remove Docker and update interdependencies for

ROCdbgapi
ROCmValidationSuite
hipCUB
hipSOLVER
hipSPARSE
rocThrust
rocr_debug_agent
rpp
rocALUTION

Fixed roctracer not publishing artifacts
2024-05-28 16:36:43 -04:00
Joseph Macaranas
2160ee6556 Update External CI Interdependencies for more repos (#3154)
* Update External CI Interdependencies for more repos

- composable_kernel
- hipBLAS
- rocBLAS
- rocSOLVER

Cleaned up unused flags from llvm-project

* Remove LD_LIBRARY_PATH change. Should not be needed.
2024-05-28 13:37:25 -04:00
Peter Park
657a27758a Add missed ROCm SMI changelog notes (#3168)
Fix link to rocm_smi_lib changelog

Update RELEASE.md
2024-05-28 12:29:32 -04:00
Bence Parajdi
0ba6bb43ef fix bad file extension referencing setting-cus.rst in index.md 2024-05-28 12:27:02 +02:00
Peter Jun Park
cf53fda864 Add manual changes to 6.1.2 changelog
Move HIPIFY from 6.1.1.md to 6.1.2.md

Regenerate changelog

Fix accidental autoformat in 6.1.1.md

Update 6.1.2.md and regen changelog

Add AMD SMI for ROCm 6.1.2

Regen changelog

Add rocDecode and update RELEASE.md

Update 6.1.2 intro blurb

Fix arrow symbol

Add (tm) to changelog.jinja template

Incorporate Leo's feedback

Intro blurb wording.
Add missed tested ROCm config (rocDecode)
Add OS support

Add version to release notes h1

Update intro blurb again

Make changelog filepath lowercase

Update blurb

Add extra line to 6.1.2 template

Fix heading in RELEASE

Fix amdsmi changelog link

Remove OS support notice

Add rocDecode to table

Add rocDecode to CL

Update rocDecode setup script note for clarity

Update AMD SMI changelog

Apply Leo's feedback

Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2024-05-15 13:12:40 -07:00
Peter Jun Park
aac6898385 Generate changelog 2024-05-15 13:12:40 -07:00
Bence Parajdi
d86c23a847 remove unnecessary comma 2024-05-14 10:08:44 +02:00
Bence Parajdi
06c960aa97 Update docs/conceptual/setting-cus.rst
Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com>
2024-05-13 16:27:05 +02:00
Bence Parajdi
3edc3e9759 add new page to index.md 2024-05-13 16:27:04 +02:00
Bence Parajdi
41da494ef0 fix review comments 2024-05-13 16:26:16 +02:00
Bence Parajdi
c0fbd1ca5b fix typos 2024-05-13 16:25:33 +02:00
Bence Parajdi
7f38465770 add cu setting page 2024-05-13 16:25:31 +02:00
163 changed files with 14943 additions and 951 deletions

View File

@@ -0,0 +1,33 @@
variables:
- group: common
- template: /.azuredevops/variables-global.yml
resources:
repositories:
- repository: release_repo
type: github
endpoint: ROCm
name: ROCm/aomp
ref: aomp-dev
- repository: llvm-project_repo
type: github
endpoint: ROCm
name: ROCm/llvm-project
ref: amd-staging
pipelines:
- pipeline: rocr-runtime_pipeline
source: \ROCR-Runtime
trigger:
branches:
include:
- master
# this job will only be triggered after successful build sequence of llvm-project and ROCR-Runtime
trigger: none
pr: none
jobs:
- template: ${{ variables.CI_COMPONENT_PATH }}/aomp.yml
parameters:
checkoutRepo: release_repo

View File

@@ -9,12 +9,9 @@ parameters:
type: object
default:
- software-properties-common
- python3-pip
- cmake
- ninja-build
- composablekernel-dev
- half
- rocrand
- rocblas
- libsqlite3-dev
- libbz2-dev
- nlohmann-json3-dev
@@ -23,6 +20,16 @@ parameters:
type: object
default:
- rocMLIR
- rocRAND
- rocBLAS
- half
- composable_kernel
- rocm-cmake
- llvm-project
- ROCR-Runtime
- rocprofiler-register
- clr
- rocminfo
jobs:
- job: MIOpen
@@ -30,8 +37,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -70,9 +75,8 @@ jobs:
parameters:
extraBuildFlags: >-
-DMIOPEN_BACKEND=HIP
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH="$(Agent.BuildDirectory)/rocm"
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DMIOPEN_ENABLE_AI_KERNEL_TUNING=OFF
-DMIOPEN_ENABLE_AI_IMMED_MODE_FALLBACK=OFF
-DCMAKE_BUILD_TYPE=Release

View File

@@ -14,10 +14,6 @@ parameters:
- wget
- unzip
- pkg-config
- half
- rocblas-dev
- miopen-hip-dev
- migraphx-dev
- protobuf-compiler
- libprotoc-dev
- ffmpeg
@@ -25,10 +21,6 @@ parameters:
- libavformat-dev
- libavutil-dev
- libswscale-dev
- rpp
- rpp-dev
- rocdecode
- rocdecode-dev
- build-essential
- libgtk2.0-dev
- libavcodec-dev
@@ -41,6 +33,7 @@ parameters:
- libtiff-dev
- libdc1394-dev
- libgmp-dev
- libopencv-dev
- name: pipModules
type: object
default:
@@ -50,6 +43,21 @@ parameters:
- google==3.0.0
- protobuf==3.12.4
- onnx==1.12.0
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- half
- rocBLAS
- MIOpen
- AMDMIGraphX
- rpp
- rocDecode
jobs:
- job: MIVisionX
@@ -58,8 +66,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -71,11 +77,23 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DROCM_PATH=/opt/rocm
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DROCM_DEP_ROCMCORE=ON
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,6 +10,13 @@ parameters:
default:
- cmake
- ninja-build
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- rocminfo
- ROCR-Runtime
jobs:
- job: ROCdbgapi
@@ -18,8 +25,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -30,9 +35,22 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,22 +10,33 @@ parameters:
default:
- cmake
- ninja-build
- rocblas
- libyaml-cpp-dev
- libpci-dev
- libpci3
- googletest
- git
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- rocBLAS
- rocm-cmake
- rocm_smi_lib
- rocminfo
- rocprofiler-register
- ROCR-Runtime
- ROCT-Thunk-Interface
jobs:
- job: ROCmValidationSuite
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -36,11 +47,23 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DROCM_PATH=/opt/rocm
-DCMAKE_PREFIX_PATH=/opt/rocm
-DCPACK_PACKAGING_INSTALL_PREFIX='$(Build.BinariesDirectory)'
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCPACK_PACKAGING_INSTALL_PREFIX=$(Build.BinariesDirectory)
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -5,6 +5,9 @@ parameters:
- name: checkoutRef
type: string
default: ''
- name: offloadEnabled
type: boolean
default: false
- name: aptPackages
type: object
default:
@@ -94,18 +97,18 @@ jobs:
cmakeBuildDir: $(Build.SourcesDirectory)/llvm-project/openmp/build
installDir: $(Build.BinariesDirectory)/llvm
# offload does not exist for recent releases, so use CI conditional
- ${{ if eq(parameters.checkoutRef, '') }}:
- ${{ if eq(parameters.offloadEnabled, true) }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
componentName: llvm-offload
extraBuildFlags: >-
-DOPENMP_ENABLE_LIBOMPTARGET=1
-DOPENMP_TEST_C_COMPILER==$(Agent.BuildDirectory)/rocm/llvm/bin/clang
-DOPENMP_TEST_CXX_COMPILER==$(Agent.BuildDirectory)/rocm/llvm/bin/clang++
-DCMAKE_C_COMPILER==$(Agent.BuildDirectory)/rocm/llvm/bin/clang
-DCMAKE_CXX_COMPILER==$(Agent.BuildDirectory)/rocm/llvm/bin/clang++
-DOPENMP_TEST_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/clang
-DOPENMP_TEST_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/clang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/clang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/clang++
-DLIBOMPTARGET_AMDGCN_GFXLIST=gfx700;gfx701;gfx801;gfx803;gfx900;gfx902;gfx906;gfx908;gfx90a;gfx90c;gfx940;gfx941;gfx942;gfx1030;gfx1031;gfx1035;gfx1036;gfx1100;gfx1101;gfx1102;gfx1103
-DLLVM_DIR==$(Agent.BuildDirectory)/rocm/llvm
-DLLVM_DIR=$(Agent.BuildDirectory)/rocm/llvm
-DLLVM_MAIN_INCLUDE_DIR=$(Build.SourcesDirectory)/llvm-project/llvm/include
-DLIBOMPTARGET_LLVM_INCLUDE_DIRS=$(Build.SourcesDirectory)/llvm-project/llvm/include
-DCUDA_TOOLKIT_ROOT_DIR=OFF

View File

@@ -11,6 +11,16 @@ parameters:
- cmake
- ninja-build
- git
- python3-pip
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
jobs:
- job: composable_kernel
@@ -18,8 +28,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -30,11 +38,24 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DGPU_TARGETS=gfx1030;gfx1100
-GNinja

View File

@@ -8,8 +8,17 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- ninja-build
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
jobs:
- job: half
@@ -18,8 +27,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -30,9 +37,22 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,25 +10,33 @@ parameters:
default:
- cmake
- ninja-build
- rocblas-dev
- rocsparse
- rocsolver-dev
- gfortran
- googletest
- git
- libgtest-dev
- wget
- python3-pip
- libomp-dev
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- rocBLAS
- rocSPARSE
- rocSOLVER
- aomp
jobs:
- job: hipBLAS
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: LD_LIBRARY_PATH
value: '/lib:/usr/lib:/usr/local/lib'
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -51,20 +59,24 @@ jobs:
targetType: inline
script: sudo apt install --yes ./aocl-linux-aocc-4.1.0_1_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- task: Bash@3
displayName: 'ldconfig'
inputs:
targetType: inline
script: sudo ldconfig
workingDirectory: '/usr/local/lib'
- script: 'ls -1R /usr/local'
displayName: 'Artifact listing'
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH=/opt/rocm
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DHIP_PLATFORM=amd
-DBUILD_CLIENTS_TESTS=ON

View File

@@ -8,25 +8,44 @@ parameters:
- name: aptPackages
type: object
default:
- cmake
- ninja-build
- python3-venv
- libmsgpack-dev
- hipblas-dev
- git
- python3-pip
- libdrm-dev
- name: pipModules
type: object
default:
- joblib
- name: rocmDependencies
type: object
default:
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- hipBLAS
jobs:
- job: hipBLASLt
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
- name: TENSILE_ROCM_ASSEMBLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
- name: CMAKE_CXX_COMPILER
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: TENSILE_ROCM_OFFLOAD_BUNDLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang-offload-bundler
- name: TENSILE_ROCM_PATH
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin:$(Agent.BuildDirectory)/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -34,21 +53,36 @@ jobs:
parameters:
aptPackages: ${{ parameters.aptPackages }}
pipModules: ${{ parameters.pipModules }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-latest.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- script: sudo ln -s $(Agent.BuildDirectory)/rocm /opt/rocm
displayName: ROCm symbolic link
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DAMDGPU_TARGETS=gfx90a
-DTensile_LOGIC=
-DTensile_CPU_THREADS=
-DTensile_CODE_OBJECT_VERSION=default
-DTensile_LIBRARY_FORMAT=msgpack
-DCMAKE_PREFIX_PATH="/opt/rocm"
-DCMAKE_PREFIX_PATH="$(Agent.BuildDirectory)/rocm"
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,9 +10,17 @@ parameters:
default:
- cmake
- ninja-build
- rocprim
- googletest
- git
- python3-pip
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- rocminfo
- rocPRIM
- ROCR-Runtime
jobs:
- job: hipCUB
@@ -20,8 +28,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -32,12 +38,24 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_PREFIX_PATH="/opt/rocm"
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DBUILD_TEST=ON
-DAMDGPU_TARGETS=gfx1030;gfx1100
-GNinja

View File

@@ -10,22 +10,32 @@ parameters:
default:
- cmake
- ninja-build
- rocrand
- hiprand
- rocfft
- libboost-program-options-dev
- googletest
- libgtest-dev
- libfftw3-dev
- python3-pip
- libomp-14-dev
# rocm dependencies should match dependencies-rocm.yml
- name: rocmDependencies
type: object
default:
- rocRAND
- hipRAND
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocFFT
- aomp
jobs:
- job: hipFFT
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -36,16 +46,31 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DCMAKE_MODULE_PATH=$(Agent.BuildDirectory)/rocm/lib/cmake/hip
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DUSE_HIP_CLANG=ON
-DHIP_COMPILER=clang
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_BENCH=OFF
-DBUILD_CLIENTS_BENCHMARKS=OFF
-DBUILD_CLIENTS_SAMPLES=OFF
-L
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,15 +10,23 @@ parameters:
default:
- cmake
- ninja-build
- rocblas
- rocsparse
- hipsparse
- rocsolver
- libsuitesparse-dev
- gfortran
- git
- googletest
- libgtest-dev
- name: rocmDependencies
type: object
default:
- clr
- hipSPARSE
- llvm-project
- rocBLAS
- rocm-cmake
- rocminfo
- ROCR-Runtime
- rocSPARSE
- rocSOLVER
jobs:
- job: hipSOLVER
@@ -27,8 +35,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -39,6 +45,18 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
# build external gtest and lapack
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
@@ -52,10 +70,10 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH="/opt/rocm;$(Pipeline.Workspace)/deps-install"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Pipeline.Workspace)/deps-install
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DBUILD_CLIENTS_TESTS=ON
-DUSE_CUDA=OFF

View File

@@ -13,10 +13,18 @@ parameters:
- libboost-program-options-dev
- googletest
- libfftw3-dev
- rocsparse
- git
- gfortran
- libgtest-dev
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- rocminfo
- rocprofiler-register
- ROCR-Runtime
- rocSPARSE
jobs:
- job: hipSPARSE
@@ -25,8 +33,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -37,15 +43,25 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_PREFIX_PATH="/opt/rocm;/opt/rocm/share/rocm/cmake/"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/share/rocm/cmake/
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_SAMPLES=OFF
-DBUILD_CLIENTS_BENCHMARKS=OFF
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -8,25 +8,43 @@ parameters:
- name: aptPackages
type: object
default:
- cmake
- ninja-build
- python3-venv
- libmsgpack-dev
- hipsparse-dev
- git
- python3-pip
- name: pipModules
type: object
default:
- joblib
# rocm dependencies should match dependencies-rocm.yml
- name: rocmDependencies
type: object
default:
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- hipSPARSE
- rocBLAS
jobs:
- job: hipSPARSELt
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
- name: TENSILE_ROCM_ASSEMBLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang
- name: CMAKE_CXX_COMPILER
value: $(Agent.BuildDirectory)/rocm/llvm/bin/hipcc
- name: TENSILE_ROCM_OFFLOAD_BUNDLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang-offload-bundler
- name: PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin:$(Agent.BuildDirectory)/rocm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -34,21 +52,35 @@ jobs:
parameters:
aptPackages: ${{ parameters.aptPackages }}
pipModules: ${{ parameters.pipModules }}
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-cmake-latest.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/preamble.yml
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DAMDGPU_TARGETS=all
-DTensile_LOGIC=
-DTensile_CPU_THREADS=
-DTensile_CODE_OBJECT_VERSION=default
-DTensile_LIBRARY_FORMAT=msgpack
-DCMAKE_PREFIX_PATH="/opt/rocm"
-DCMAKE_PREFIX_PATH="$(Agent.BuildDirectory)/rocm"
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,12 +10,17 @@ parameters:
default:
- cmake
- ninja-build
- composablekernel-dev
- python3-pip
- git
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- composable_kernel
jobs:
- job: hipTensor
@@ -23,8 +28,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -50,9 +53,8 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_PREFIX_PATH="$(Agent.BuildDirectory)/rocm/llvm"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/llvm
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DROCM_PATH="$(Agent.BuildDirectory)/rocm"
-DCMAKE_BUILD_TYPE=Release
-DHIPTENSOR_BUILD_TESTS=ON

View File

@@ -14,6 +14,7 @@ parameters:
- ninja-build
- python-is-python3
- zlib1g-dev
- pkg-config
- name: rocmDependencies
type: object
default:
@@ -68,8 +69,6 @@ jobs:
-DLIBCXXABI_INSTALL_STATIC_LIBRARY=OFF
-DLLVM_BUILD_DOCS=OFF
-DLLVM_ENABLE_SPHINX=OFF
-DSPHINX_WARNINGS_AS_ERRORS=OFF
-DSPHINX_OUTPUT_MAN=OFF
-DLLVM_ENABLE_ASSERTIONS=OFF
-DLLVM_ENABLE_Z3_SOLVER=OFF
-DLLVM_ENABLE_ZLIB=ON
@@ -80,7 +79,6 @@ jobs:
-DPACKAGE_VENDOR=AMD
-DCLANG_LINK_FLANG_LEGACY=ON
-DCMAKE_CXX_STANDARD=17
-DFLANG_INCLUDE_DOCS=OFF
-DROCM_LLVM_BACKWARD_COMPAT_LINK=$(Build.BinariesDirectory)/llvm
-DROCM_LLVM_BACKWARD_COMPAT_LINK_TARGET=./lib/llvm
-GNinja

View File

@@ -8,21 +8,37 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- libboost-program-options-dev
- googletest
- libfftw3-dev
- git
- ninja-build
- libstdc++-12-dev
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocm_smi_lib
- rocprofiler-register
- rocm-core
- HIPIFY
- aomp
- aomp-extras
jobs:
- job: rccl
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -33,14 +49,28 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DHALF_INCLUDE_DIR=$(Agent.BuildDirectory)/rocm/include
-DCMAKE_BUILD_TYPE=Release
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DBUILD_TESTS=ON
-DCMAKE_PREFIX_PATH="/opt/rocm;/opt/rocm/share/rocm/cmake/"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/share/rocm/cmake/
-DAMDGPU_TARGETS=gfx1030;gfx1100
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -8,6 +8,7 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- ninja-build
- git
@@ -17,6 +18,16 @@ parameters:
- autoconf
- libtool
- pkg-config
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocm_smi_lib
- amdsmi
jobs:
- job: rdc
@@ -24,8 +35,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -36,6 +45,18 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
# Build grpc
- task: Bash@3
displayName: 'git clone grpc'
@@ -57,6 +78,7 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DGRPC_ROOT="$(Build.SourcesDirectory)/bin"
-DBUILD_TESTS=ON
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -15,16 +15,29 @@ parameters:
- git
- mpich
- ninja-build
- name: rocmDependencies
type: object
default:
- aomp
- clr
- llvm-project
- rocBLAS
- rocminfo
- rocPRIM
- rocprofiler-register
- ROCR-Runtime
- rocRAND
- rocSPARSE
jobs:
- job: rocALUTION
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -35,13 +48,25 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_PREFIX_PATH="/opt/rocm;/opt/rocm/share/rocm/cmake/"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/share/rocm/cmake/
-DCMAKE_MODULE_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/lib/cmake/hip
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_BENCHMARKS=OFF

View File

@@ -18,19 +18,40 @@ parameters:
- googletest
- libgtest-dev
- wget
- python3-pip
- libdrm-dev
- name: pipModules
type: object
default:
- joblib
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- rocm_smi_lib
- rocm-core
- aomp
- aomp-extras
jobs:
- job: rocBLAS
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
- name: TENSILE_ROCM_ASSEMBLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang
- name: CMAKE_CXX_COMPILER
value: $(Agent.BuildDirectory)/rocm/bin/hipcc
- name: TENSILE_ROCM_OFFLOAD_BUNDLER_PATH
value: $(Agent.BuildDirectory)/rocm/llvm/bin/clang-offload-bundler
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -42,23 +63,60 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- task: Bash@3
displayName: 'Download AOCL'
inputs:
targetType: inline
script: wget -nv https://download.amd.com/developer/eula/aocl/aocl-4-2/aocl-linux-gcc-4.2.0_1_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- task: Bash@3
displayName: 'Install AOCL'
inputs:
targetType: inline
script: sudo apt install --yes ./aocl-linux-gcc-4.2.0_1_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- task: Bash@3
displayName: 'Download AOCL'
inputs:
targetType: inline
script: wget -nv https://download.amd.com/developer/eula/aocl/aocl-4-1/aocl-linux-aocc-4.1.0_1_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- task: Bash@3
displayName: 'Install AOCL'
inputs:
targetType: inline
script: sudo apt install --yes ./aocl-linux-aocc-4.1.0_1_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- script: echo $PATH
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_TOOLCHAIN_FILE=toolchain-linux.cmake
-DCMAKE_PREFIX_PATH="/opt/rocm;$(Pipeline.Workspace)/deps-install"
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm/llvm;$(Agent.BuildDirectory)/rocm;$(Pipeline.Workspace)/deps-install
-DCMAKE_BUILD_TYPE=Release
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DTensile_CODE_OBJECT_VERSION=default
-DTensile_LOGIC=asm_full
-DTensile_SEPARATE_ARCHITECTURES=ON
-DTensile_LAZY_LIBRARY_LOADING=ON
-DTensile_LIBRARY_FORMAT=msgpack
-DTENSILE_VENV_UPGRADE_PIP=ON
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_BENCHMARKS=OFF
-DBUILD_CLIENTS_SAMPLES=OFF
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -8,6 +8,7 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- ninja-build
- pkg-config
@@ -18,6 +19,16 @@ parameters:
- libstdc++-12-dev
- libva-dev
- mesa-amdgpu-va-drivers
- libdrm-dev
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocm-core
jobs:
- job: rocDecode
@@ -26,11 +37,21 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
# Since mesa-amdgpu-multimedia-devel is not directly available from apt, register it
- task: Bash@3
displayName: 'Register ROCm packages'
inputs:
targetType: inline
script: |
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/${{ variables.KEYRING_VERSION }}/ubuntu jammy main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/${{ variables.KEYRING_VERSION }} jammy main" | sudo tee --append /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt update
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-other.yml
parameters:
aptPackages: ${{ parameters.aptPackages }}
@@ -38,10 +59,24 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-L
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,20 +10,31 @@ parameters:
default:
- cmake
- ninja-build
- rocrand
- hiprand
- libboost-program-options-dev
- libgtest-dev
- libfftw3-dev
- python3-pip
# rocm dependencies should match dependencies-rocm.yml
- name: rocmDependencies
type: object
default:
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocprofiler-register
- hipRAND
- rocRAND
- rocm-cmake
- aomp
jobs:
- job: rocFFT
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -34,12 +45,24 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH=/opt/rocm
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DUSE_HIP_CLANG=ON

View File

@@ -12,6 +12,15 @@ parameters:
- ninja-build
- libgtest-dev
- git
- python3-pip
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
jobs:
- job: rocPRIM
@@ -19,8 +28,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -34,12 +41,24 @@ jobs:
# ${{ }} are resolved during compile-time
# so this next step is skipped completely until
# we define explicit aptPackages needed to install
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DBUILD_BENCHMARK=ON
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DBUILD_TEST=ON
-GNinja

View File

@@ -10,15 +10,25 @@ parameters:
default:
- cmake
- ninja-build
- rocblas
- rocsparse
- hipsparse
- libsuitesparse-dev
- gfortran
- libfmt-dev
- git
- googletest
- libgtest-dev
- python3-pip
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocBLAS
- rocPRIM
- rocSPARSE
- hipSPARSE
jobs:
- job: rocSOLVER
@@ -26,8 +36,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -44,6 +52,18 @@ jobs:
targetType: inline
script: git clone --depth 1 --branch v3.9.1 https://github.com/Reference-LAPACK/lapack
workingDirectory: '$(Build.SourcesDirectory)'
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
componentName: lapack
@@ -59,11 +79,10 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_PREFIX_PATH="/opt/rocm;$(Pipeline.Workspace)/deps-install"
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Pipeline.Workspace)/deps-install
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DUSE_CUDA=OFF
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_BENCHMARKS=OFF
-DBUILD_CLIENTS_SAMPLES=OFF

View File

@@ -8,6 +8,7 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- ninja-build
- libboost-program-options-dev
@@ -15,17 +16,28 @@ parameters:
- libfftw3-dev
- git
- gfortran
- rocprim-dev
- libgtest-dev
- libdrm-dev
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocBLAS
- rocminfo
- rocPRIM
- rocprofiler-register
jobs:
- job: rocSPARSE
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Build.BinariesDirectory)/rocm
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -36,16 +48,30 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc
-DCMAKE_C_COMPILER=/opt/rocm/bin/hipcc
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/bin/hipcc
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_BUILD_TYPE=Release
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DBUILD_CLIENTS_SAMPLES=OFF
-DBUILD_CLIENTS_TESTS=ON
-DBUILD_CLIENTS_BENCHMARKS=OFF
-DCMAKE_MODULE_PATH="/opt/rocm/lib/cmake/hip;/opt/rocm/hip/cmake"
-DCMAKE_MODULE_PATH=$(Agent.BuildDirectory)/rocm/lib/cmake/hip;$(Agent.BuildDirectory)/rocm/hip/cmake
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,12 +10,20 @@ parameters:
default:
- cmake
- ninja-build
- hiprand
- rocprim-dev
- libboost-program-options-dev
- googletest
- libfftw3-dev
- git
- python3-pip
- name: rocmDependencies
type: object
default:
- clr
- hipRAND
- llvm-project
- rocminfo
- rocPRIM
- ROCR-Runtime
jobs:
- job: rocThrust
@@ -23,8 +31,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -35,14 +41,25 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-GNinja
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DROCM_PATH=/opt/rocm
-DCMAKE_PREFIX_PATH=/opt/rocm
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DAMDGPU_TARGETS=gfx1030;gfx1100
-DBUILD_TEST=ON
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -8,6 +8,7 @@ parameters:
- name: aptPackages
type: object
default:
- python3-pip
- cmake
- ninja-build
- libboost-program-options-dev
@@ -15,7 +16,18 @@ parameters:
- googletest
- libfftw3-dev
- git
- rocblas
- libomp-dev
- name: rocmDependencies
type: object
default:
- rocm-cmake
- llvm-project
- ROCR-Runtime
- clr
- rocminfo
- rocBLAS
- aomp
- rocm_smi_lib
jobs:
- job: rocWMMA
@@ -23,8 +35,6 @@ jobs:
- group: common
- template: /.azuredevops/variables-global.yml
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -35,11 +45,23 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/bin/amdclang
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_BUILD_TYPE=Release
-DROCWMMA_BUILD_TESTS=ON
-DROCWMMA_BUILD_SAMPLES=OFF

View File

@@ -10,21 +10,33 @@ parameters:
default:
- cmake
- ninja-build
- python3-pip
- name: pipModules
type: object
default:
- CppHeaderParser
- argparse
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- rocminfo
- rocprofiler-register
- ROCR-Runtime
- ROCT-Thunk-Interface
jobs:
- job: rocm_bandwidth_test
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: ROCR_INC_DIR
value: $(Agent.BuildDirectory)/rocm
- name: ROCR_LIB_DIR
value: $(Agent.BuildDirectory)/rocm
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -36,11 +48,23 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=release
-DCMAKE_MODULE_PATH="$(Build.SourcesDirectory)/cmake_modules"
-DCMAKE_PREFIX_PATH=/opt/rocm
-DCMAKE_MODULE_PATH=$(Build.SourcesDirectory)/cmake_modules
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm;$(Agent.BuildDirectory)/rocm/include;$(Agent.BuildDirectory)/rocm/include/hsa
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -24,4 +24,5 @@ jobs:
parameters:
extraBuildFlags: >-
-DBUILD_TESTS=ON
-DROCM_DEP_ROCMCORE=ON
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -10,12 +10,13 @@ parameters:
default:
- cmake
- libgtest-dev
- libdrm-dev
- libdw-dev
- libsystemd-dev
- libelf-dev
- libnuma-dev
- libpciaccess-dev
- rocm-llvm-dev
- python3-pip
- name: pipModules
type: object
default:
@@ -26,15 +27,31 @@ parameters:
- lxml
- barectf
- pandas
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- ROCdbgapi
- rocm-cmake
- rocm-core
- rocm_smi_lib
- rocminfo
- ROCR-Runtime
- rocprofiler-register
- ROCT-Thunk-Interface
- roctracer
jobs:
- job: rocprofiler
variables:
- group: common
- template: /.azuredevops/variables-global.yml
- name: HIP_ROCCLR_HOME
value: $(Agent.BuildDirectory)/rocm
- name: ROCM_PATH
value: $(Agent.BuildDirectory)/rocm
pool: ${{ variables.MEDIUM_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -46,11 +63,46 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# Manually download aqlprofile, hard-coded 6.1.0 version
- task: Bash@3
displayName: 'Download aqlprofile'
inputs:
targetType: inline
script: wget -nv https://repo.radeon.com/rocm/apt/6.1/pool/main/h/hsa-amd-aqlprofile/hsa-amd-aqlprofile_1.0.0.60100.60100-82~22.04_amd64.deb
workingDirectory: '$(Pipeline.Workspace)'
- task: Bash@3
displayName: 'Extract aqlprofile'
inputs:
targetType: inline
script: |
mkdir hsa-amd-aqlprofile
dpkg-deb -R hsa-amd-aqlprofile_1.0.0.60100.60100-82~22.04_amd64.deb hsa-amd-aqlprofile
workingDirectory: '$(Pipeline.Workspace)'
- task: Bash@3
displayName: 'Move aqlprofile'
inputs:
targetType: inline
script: |
mkdir -p $(Agent.BuildDirectory)/rocm
cp -R hsa-amd-aqlprofile/opt/rocm-6.1.0/* $(Agent.BuildDirectory)/rocm
workingDirectory: '$(Pipeline.Workspace)'
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_MODULE_PATH="$(Build.SourcesDirectory)/cmake_modules;/opt/rocm/lib/cmake"
-DCMAKE_PREFIX_PATH="/opt/rocm"
-DCMAKE_MODULE_PATH=$(Build.SourcesDirectory)/cmake_modules;$(Agent.BuildDirectory)/rocm/lib/cmake;$(Agent.BuildDirectory)/rocm/lib/cmake/hip
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DENABLE_LDCONFIG=OFF
-DUSE_PROF_API=1
-DGPU_TARGETS=gfx1030;gfx1100

View File

@@ -12,6 +12,14 @@ parameters:
- ninja-build
- libelf-dev
- libdw-dev
- name: rocmDependencies
type: object
default:
- clr
- llvm-project
- ROCdbgapi
- rocminfo
- ROCR-Runtime
jobs:
- job: rocr_debug_agent
@@ -20,8 +28,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -32,11 +38,23 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_BUILD_TYPE=Release
-DROCM_PATH=/opt/rocm
-DCMAKE_MODULE_PATH=/opt/rocm/lib/cmake
-DCMAKE_MODULE_PATH=$(Agent.BuildDirectory)/rocm/lib/cmake;$(Agent.BuildDirectory)/rocm/lib/cmake/hip
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -67,10 +67,4 @@ jobs:
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DGPU_TARGETS=gfx1030;gfx1100
-GNinja
# - template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml
# - task: Bash@3
# displayName: 'Tests'
# inputs:
# targetType: inline
# script: ./run.sh
# workingDirectory: build
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -9,10 +9,18 @@ parameters:
type: object
default:
- cmake
- libomp-dev # needed to pass flag step
- ninja-build
- clang
- name: rocmDependencies
type: object
default:
- aomp # needed to pass build step
- clr
- half
- libomp-dev
- llvm-project
- rocminfo
- ROCR-Runtime
jobs:
- job: rpp
@@ -21,8 +29,6 @@ jobs:
- template: /.azuredevops/variables-global.yml
pool:
vmImage: ${{ variables.BASE_BUILD_POOL }}
container:
image: ${{ variables.DOCKER_IMAGE_NAME }}:${{ variables.LATEST_DOCKER_VERSION }}
workspace:
clean: all
steps:
@@ -33,13 +39,27 @@ jobs:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/checkout.yml
parameters:
checkoutRepo: ${{ parameters.checkoutRepo }}
# CI case: download latest default branch build
- ${{ if eq(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: staging
# manual build case: triggered by ROCm/ROCm repo
- ${{ if ne(parameters.checkoutRef, '') }}:
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/dependencies-rocm.yml
parameters:
dependencyList: ${{ parameters.rocmDependencies }}
dependencySource: tag-builds
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/build-cmake.yml
parameters:
extraBuildFlags: >-
-DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=/opt/rocm/llvm/bin/amdclang
-DCMAKE_CXX_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang++
-DCMAKE_C_COMPILER=$(Agent.BuildDirectory)/rocm/llvm/bin/amdclang
-DROCM_PATH=$(Agent.BuildDirectory)/rocm
-DCMAKE_PREFIX_PATH=$(Agent.BuildDirectory)/rocm
-DHALF_INCLUDE_DIRS=$(Agent.BuildDirectory)/rocm/include
-DCMAKE_BUILD_TYPE=Release
-DBUILD_CLIENTS=ON
-DAMDGPU_TARGETS=gfx1030;gfx1100
-GNinja
- template: ${{ variables.CI_TEMPLATE_PATH }}/steps/artifact-upload.yml

View File

@@ -23,7 +23,7 @@ trigger: none
pr: none
jobs:
- template: ${{ variables.CI_COMPONENT_PATH }}/rocgdb.yml
- template: ${{ variables.CI_COMPONENT_PATH }}/ROCgdb.yml
parameters:
checkoutRepo: release_repo
checkoutRef: ${{ parameters.checkoutRef }}

View File

@@ -12,6 +12,7 @@ parameters:
- name: defaultBranchList
type: object
default:
amdsmi: develop
aomp: aomp-dev
aomp-extras: aomp-dev
AMDMIGraphX: develop
@@ -25,8 +26,11 @@ parameters:
llvm-project: amd-staging
MIOpen: develop
rocBLAS: develop
ROCdbgapi : amd-master
rocDecode: develop
rocFFT: develop
rocm-cmake: develop
rocm_smi_lib: develop
rocminfo: master
rocMLIR: develop
rocPRIM: develop
@@ -36,6 +40,7 @@ parameters:
rocSOLVER: develop
rocSPARSE: develop
ROCT-Thunk-Interface: master
roctracer: amd-master
rpp: master
- name: componentsFailureOkay
type: object

View File

@@ -0,0 +1,10 @@
# replace cmake from apt install with newest version using snap install
steps:
- task: Bash@3
displayName: update cmake
inputs:
targetType: inline
script: |
sudo apt purge cmake
sudo snap install cmake --classic
hash -r

View File

@@ -23,6 +23,7 @@ parameters:
- name: stagingPipelineIdentifiers
type: object
default:
amdsmi: $(amdsmi-pipeline-id)
aomp: $(aomp-pipeline-id)
aomp-extras: $(aomp-extras-pipeline-id)
AMDMIGraphX: $(amdmigraphx-pipeline-id)
@@ -30,13 +31,18 @@ parameters:
composable_kernel: $(composable-kernel-pipeline-id)
half: $(half-pipeline-id)
hipBLAS: $(hipblas-pipeline-id)
HIPIFY: $(hipify-pipeline-id)
hipRAND: $(hiprand-pipeline-id)
hipSPARSE: $(hipsparse-pipeline-id)
llvm-project: $(llvm-project-pipeline-id)
MIOpen: $(miopen-pipeline-id)
rocBLAS: $(rocblas-pipeline-id)
rocFFT: $(rotfft-pipeline-id)
ROCdbgapi : $(rocdbgapi-pipeline-id)
rocDecode: $(rocdecode-pipeline-id)
rocFFT: $(rocfft-pipeline-id)
rocm-cmake: $(rocm-cmake-pipeline-id)
rocm-core: $(rocm-core-pipeline-id)
rocm_smi_lib: $(rocm-smi-lib-pipeline-id)
rocminfo: $(rocminfo-pipeline-id)
rocMLIR: $(rocmlir-pipeline-id)
rocPRIM: $(rocprim-pipeline-id)
@@ -46,10 +52,12 @@ parameters:
rocSOLVER: $(rocsolver-pipeline-id)
rocSPARSE: $(rocsparse-pipeline-id)
ROCT-Thunk-Interface: $(roct-thunk-interface-pipeline-id)
roctracer: $(roctracer-pipeline-id)
rpp: $(rpp-pipeline-id)
- name: taggedPipelineIdentifiers
type: object
default:
amdsmi: $(amdsmi-tagged-pipeline-id)
aomp: $(aomp-tagged-pipeline-id)
aomp-extras: $(aomp-extras-tagged-pipeline-id)
AMDMIGraphX: $(amdmigraphx-tagged-pipeline-id)
@@ -57,13 +65,18 @@ parameters:
composable_kernel: $(composable-kernel-tagged-pipeline-id)
half: $(half-tagged-pipeline-id)
hipBLAS: $(hipblas-tagged-pipeline-id)
HIPIFY: $(hipify-tagged-pipeline-id)
hipRAND: $(hiprand-tagged-pipeline-id)
hipSPARSE: $(hipsparse-tagged-pipeline-id)
llvm-project: $(llvm-project-tagged-pipeline-id)
MIOpen: $(miopen-tagged-pipeline-id)
rocBLAS: $(rocblas-tagged-pipeline-id)
rocFFT: $(rotfft-tagged-pipeline-id)
ROCdbgapi : $(rocdbgapi-tagged-pipeline-id)
rocDecode: $(rocdecode-tagged-pipeline-id)
rocFFT: $(rocfft-tagged-pipeline-id)
rocm-cmake: $(rocm-cmake-tagged-pipeline-id)
rocm-core: $(rocm-core-tagged-pipeline-id)
rocm_smi_lib: $(rocm-smi-lib-tagged-pipeline-id)
rocminfo: $(rocminfo-tagged-pipeline-id)
rocMLIR: $(rocmlir-tagged-pipeline-id)
rocPRIM: $(rocprim-tagged-pipeline-id)
@@ -73,6 +86,7 @@ parameters:
rocSOLVER: $(rocsolver-tagged-pipeline-id)
rocSPARSE: $(rocsparse-tagged-pipeline-id)
ROCT-Thunk-Interface: $(roct-thunk-interface-tagged-pipeline-id)
roctracer: $(roctracer-tagged-pipeline-id)
rpp: $(rpp-tagged-pipeline-id)
# set to true if you're calling this template file multiple times in the same pipeline
# only leave the last call false to optimize the sequence

View File

@@ -27,3 +27,5 @@ variables:
value: rocm/dev-ubuntu-22.04
- name: LATEST_DOCKER_VERSION
value: 6.1
- name: KEYRING_VERSION
value: 6.1

.gitignore
View File

@@ -16,4 +16,4 @@ _readthedocs/
docs/CHANGELOG.md
docs/contribute/index.md
docs/about/release-notes.md
docs/about/CHANGELOG.md
docs/about/changelog.md

View File

@@ -3,16 +3,20 @@
version: 2
build:
os: ubuntu-22.04
tools:
python: "3.10"
sphinx:
configuration: docs/conf.py
formats: [htmlzip]
python:
install:
- requirements: docs/sphinx/requirements.txt
formats: [htmlzip, epub]
build:
os: ubuntu-22.04
tools:
python: "3.10"
apt_packages:
- "doxygen"
- "gfortran" # For pre-processing fortran sources
- "graphviz" # For dot graphs in doxygen

File diff suppressed because it is too large

View File

@@ -56,12 +56,103 @@ cd ~/ROCm/
**Note:** Using this sample code will cause the repo tool to download the open source code associated with the specified ROCm release. Ensure that you have SSH keys configured on your machine for your GitHub account before downloading, as explained at [Connecting to GitHub with SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh).
### Building the ROCm source code
## Building the ROCm source code
Each ROCm component repository contains directions for building that component, such as the rocSPARSE documentation [Installation and Building for Linux](https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/install/Linux_Install_Guide.html). Refer to the specific component documentation for instructions on building the repository.
Each release of the ROCm software supports specific hardware and software configurations. Refer to [System requirements (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) for the current supported hardware and OS.
## Build ROCm from source
The build will use as many processors as it can find to build in parallel. Some of the compiles can consume as much as 10 GB of RAM, so make sure you have plenty of swap space!
By default, the ROCm build compiles for all supported GPU architectures and takes approximately 500 CPU hours.
The build time drops significantly if you limit the target GPU architectures using the GPU_ARCHS environment variable, as shown below.
```bash
# --------------------------------------
# Step 1: Clone the source code
# --------------------------------------
mkdir -p ~/WORKSPACE/ # Or any other folder name
cd ~/WORKSPACE/
export ROCM_VERSION=6.1.0 # or 6.1.1
~/bin/repo init -u http://github.com/ROCm/ROCm.git -b roc-6.1.x -m rocm-build/rocm-${ROCM_VERSION}.xml
~/bin/repo sync
# --------------------------------------
# Step 2: Prepare build environment
# --------------------------------------
# Option 1: Start a docker container
# Pulling required base docker images:
# Ubuntu20.04 built from ROCm/rocm-build/docker/ubuntu20/Dockerfile
docker pull rocm/rocm-build-ubuntu-20.04:6.1
# Ubuntu22.04 built from ROCm/rocm-build/docker/ubuntu22/Dockerfile
docker pull rocm/rocm-build-ubuntu-22.04:6.1
# Start docker container and mount the source code folder:
docker run -ti \
-e ROCM_VERSION=${ROCM_VERSION} \
-e CCACHE_DIR=$HOME/.ccache \
-e CCACHE_ENABLED=true \
-e DOCK_WORK_FOLD=/src \
-w /src \
-v $PWD:/src \
-v /etc/passwd:/etc/passwd \
-v /etc/shadow:/etc/shadow \
-v ${HOME}/.ccache:${HOME}/.ccache \
-u $(id -u):$(id -g) \
<replace_with_required_ubuntu_base_docker_image> bash
# Option 2: Install required packages into the host machine
# For ubuntu20.04 system
cd ROCm/rocm-build/docker/ubuntu20
bash install-prerequisites.sh
# For ubuntu22.04 system
cd ROCm/rocm-build/docker/ubuntu22
bash install-prerequisites.sh
# --------------------------------------
# Step 3: Run build command line
# --------------------------------------
# Select GPU targets before building:
# When GPU_ARCHS is not set, the default GPU targets supported by ROCm 6.1 are used.
# To build against a subset of GFX architectures, set the environment variable below.
# Supported MI300 targets: gfx940, gfx941, gfx942.
export GPU_ARCHS="gfx942" # Example
export GPU_ARCHS="gfx940;gfx941;gfx942" # Example
# Pick and run build commands in the docker container:
# Build rocm-dev packages
make -f ROCm/rocm-build/ROCm.mk -j ${NPROC:-$(nproc)} rocm-dev
# Build all ROCm packages
make -f ROCm/rocm-build/ROCm.mk -j ${NPROC:-$(nproc)} all
# List all ROCm components to find required components
make -f ROCm/rocm-build/ROCm.mk list_components
# Build a single ROCm package
make -f ROCm/rocm-build/ROCm.mk T_rocblas
# Find built packages in ubuntu20.04:
out/ubuntu-20.04/20.04/deb/
# Find built packages in ubuntu22.04:
out/ubuntu-22.04/22.04/deb/
# Find built logs in ubuntu20.04:
out/ubuntu-20.04/20.04/logs/
# Find built logs in ubuntu22.04:
out/ubuntu-22.04/22.04/logs/
# Logs for failed components end with the .errors extension.
out/ubuntu-22.04/22.04/logs/rocblas.errors # Example
# Logs for components still building end with the .inprogress extension.
out/ubuntu-22.04/22.04/logs/rocblas.inprogress # Example
# Logs for components that built successfully use the component name alone.
out/ubuntu-22.04/22.04/logs/rocblas # Example
```
Note: [Overview for ROCm.mk](rocm-build/README.md)
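For example, to quickly check whether any component failed after a build, search the output folder for `.errors` logs (a minimal sketch based on the log layout shown above):

```bash
# List any failed-component logs left by the build
find out/ -name '*.errors' -exec ls -l {} +
```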
## ROCm documentation
This repository contains the [manifest file](https://gerrit.googlesource.com/git-repo/+/HEAD/docs/manifest-format.md)

View File

@@ -1,4 +1,6 @@
# ROCm 6.1.1 release notes
# ROCm 6.1.2 release notes
<!-- Do not edit this file! This file is autogenerated with -->
<!-- tools/autotag/tag_script.py -->
<!-- Disable lints since this is an auto-generated file. -->
<!-- markdownlint-disable blanks-around-headers -->
@@ -9,153 +11,137 @@
<!-- spellcheck-disable -->
ROCm 6.1.1 introduces minor fixes and improvements to some tools and libraries.
ROCm 6.1.2 includes enhancements to SMI tools and improvements to some libraries.
## OS support
### OS support
ROCm 6.1.1 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).
ROCm 6.1.2 has been tested against a pre-release version of Ubuntu 22.04.5 (kernel: 5.15 [GA], 6.8 [HWE]).
## AMD SMI
### AMD SMI
AMD SMI for ROCm 6.1.1
### Additions
- Added deferred error correctable counts to `amd-smi metric -ecc -ecc-blocks`.
### Changes
- Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks.
- Updated the output of `amd-smi metric --clock` to reflect each engine.
- Updated the output of `amd-smi topology --json` to align with output reported by host and guest systems.
### Fixes
- Fixed `amd-smi metric --clock`'s clock lock and deep sleep status.
- Fixed an issue that would cause an error when resetting non-AMD GPUs.
- Fixed `amd-smi metric --pcie` and `amdsmi_get_pcie_info()` when using RDNA3 (Navi 32 and Navi 31) hardware to prevent "UNKNOWN" reports.
- Fixed the output results of `amd-smi process` when getting processes running on a device.
### Removals
- Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release.
### Known issues
- `amd-smi bad-pages` can result in a `ValueError: Null pointer access` error when using some PMU firmware versions.
```{note}
See the [detailed changelog](https://github.com/ROCm/amdsmi/blob/docs/6.1.1/CHANGELOG.md) with code samples for more information.
```
## HIPCC
HIPCC for ROCm 6.1.1
### Changes
- **Upcoming:** a future release will enable use of compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users. You can continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
- **Upcoming:** a subsequent release will remove high-level Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable. It will rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig` respectively. No action is needed by the users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
- **Upcoming:** a subsequent release will remove `hipcc.pl` and `hipconfig.pl`.
## ROCm SMI
ROCm SMI for ROCm 6.1.1
### Additions
* Added the capability to unlock mutex when a process is dead. Added related debug output.
* Added the `Partition ID` field to the `rocm-smi` CLI.
* Added `NODE`, `GUID`, and `GFX Version` fields to the CLI.
* Documentation now includes C++ and Python tutorials, API guides, and reference material.
### Changes
* Some `rocm-smi` fields now display `N/A` instead of `unknown/unsupported` for consistency.
* Changed stacked ID formatting in the `rocm-smi` CLI to make it easier to spot identifiers.
### Fixes
* Fixed HIP and ROCm SMI mismatch on GPU bus assignments.
* Fixed memory leaks caused by not closing directories and creating maps nodes instead of using `.at()`.
* Fixed initializing calls which reuse `rocmsmi.initializeRsmi()` bindings in the `rocmsmi` Python API.
* Fixed an issue causing `rsmi_dev_activity_metric_get` gfx/memory to not update with GPU activity.
### Known issues
- ROCm SMI reports GPU utilization incorrectly for RDNA3 GPUs in some situations. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112).
```{note}
See the [detailed ROCm SMI changelog](https://github.com/ROCm/rocm_smi_lib/blob/docs/6.1.1/CHANGELOG.md) with code samples for more information.
```
## Library changes in ROCm 6.1.1
| Library | Version |
| ----------- | -------------------------------------------------------------------------- |
| AMDMIGraphX | [2.9](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.1.1) |
| hipBLAS | [2.1.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.1.1) |
| hipBLASLt | [0.7.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.1.1) |
| hipCUB | [3.1.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.1.1) |
| hipFFT | [1.0.14](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.1.1) |
| hipRAND | [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.1.1) |
| hipSOLVER | 2.1.0 ⇒ [2.1.1](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.1.1) |
| hipSPARSE | [3.0.1](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.1.1) |
| hipSPARSELt | [0.2.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.1.1) |
| hipTensor | [1.2.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.1.1) |
| MIOpen | [3.1.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.1.1) |
| MIVisionX | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.1) |
| rccl | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.1) |
| rocALUTION | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.1) |
| rocBLAS | [4.1.0](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.1) |
| rocDecode | [0.5.0](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.1.1) |
| rocFFT | 1.0.26 ⇒ [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.1) |
| rocm-cmake | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.1) |
| rocPRIM | [3.1.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.1.1) |
| rocRAND | [3.0.1](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.1.1) |
| rocSOLVER | [3.25.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.1.1) |
| rocSPARSE | [3.1.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.1.1) |
| rocThrust | [3.0.1](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.1.1) |
| rocWMMA | [1.4.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.1.1) |
| rpp | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.1) |
| Tensile | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.1) |
### hipBLASLt 0.7.0
hipBLASLt 0.7.0 for ROCm 6.1.1
AMD SMI for ROCm 6.1.2
#### Additions
- Added `hipblasltExtSoftmax` extension API.
- Added `hipblasltExtLayerNorm` extension API.
- Added `hipblasltExtAMax` extension API.
- Added `GemmTuning` extension parameter to set split-k by user.
- Added support for mixed-precision datatypes: fp16/fp8 inputs with fp16 output.
* Added process isolation and clean shader APIs and CLI commands.
* `amdsmi_get_gpu_process_isolation()`
* `amdsmi_set_gpu_process_isolation()`
* `amdsmi_set_gpu_clear_sram_data()`
* Added the `MIN_POWER` metric to output provided by `amd-smi static --limit`.
#### Deprecations
#### Optimizations
- **Upcoming**: `algoGetHeuristic()` ext API for GroupGemm will be deprecated in a future release of hipBLASLt.
### hipSOLVER 2.1.1
hipSOLVER 2.1.1 for ROCm 6.1.1
* Updated the `amd-smi monitor --pcie` output to prevent delays with the `monitor` command.
#### Changes
- By default, `BUILD_WITH_SPARSE` is now set to OFF on Microsoft Windows.
* Updated `amdsmi_get_power_cap_info` to return values in uW instead of W.
* Updated Python library return types for `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info`.
* Updated the output of `amd-smi metric --ecc-blocks` to show counters available from blocks.
#### Fixes
- Fixed benchmark client build when `BUILD_WITH_SPARSE` is OFF.
* `amdsmi_get_gpu_board_info()` no longer returns junk character strings.
* `amd-smi metric --power` now correctly details power output for RDNA3, RDNA2, and MI1x devices.
* Fixed the `amdsmitstReadWrite.TestPowerCapReadWrite` test for RDNA3, RDNA2, and MI100 devices.
* Fixed an issue with the `amdsmi_get_gpu_memory_reserved_pages` and `amdsmi_get_gpu_bad_page_info` Python interface calls.
### rocFFT 1.0.27
#### Removals
rocFFT 1.0.27 for ROCm 6.1.1
* Removed the `amdsmi_get_gpu_process_info` API from the Python library. It was removed from the C library in an earlier release.
```{note}
See the AMD SMI [detailed changelog](https://github.com/ROCm/amdsmi/blob/rocm-6.1.x/CHANGELOG.md) with code samples for more information.
```
### ROCm SMI
ROCm SMI for ROCm 6.1.2
#### Additions
- Enable multi-GPU testing on systems without direct GPU-interconnects.
* Added the ring hang event to the `amdsmi_evt_notification_type_t` enum.
#### Fixes
- Fixed kernel launch failure on execute of very large odd-length real-complex transforms.
* Fixed an issue causing ROCm SMI to incorrectly report GPU utilization for RDNA3 GPUs. See the issue on [GitHub](https://github.com/ROCm/ROCm/issues/3112).
* Fixed the parsing of `pp_od_clk_voltage` in `get_od_clk_volt_info` to work better with MI-series hardware.
## Library changes in ROCm 6.1.2
| Library | Version |
|---------|---------|
| AMDMIGraphX | [2.9](https://github.com/ROCm/AMDMIGraphX/releases/tag/rocm-6.1.2) |
| composable_kernel | [0.2.0](https://github.com/ROCm/composable_kernel/releases/tag/rocm-6.1.2) |
| hipBLAS | [2.1.0](https://github.com/ROCm/hipBLAS/releases/tag/rocm-6.1.2) |
| hipBLASLt | [0.7.0](https://github.com/ROCm/hipBLASLt/releases/tag/rocm-6.1.2) |
| hipCUB | [3.1.0](https://github.com/ROCm/hipCUB/releases/tag/rocm-6.1.2) |
| hipFFT | [1.0.14](https://github.com/ROCm/hipFFT/releases/tag/rocm-6.1.2) |
| hipRAND | [2.10.17](https://github.com/ROCm/hipRAND/releases/tag/rocm-6.1.2) |
| hipSOLVER | [2.1.1](https://github.com/ROCm/hipSOLVER/releases/tag/rocm-6.1.2) |
| hipSPARSE | [3.0.1](https://github.com/ROCm/hipSPARSE/releases/tag/rocm-6.1.2) |
| hipSPARSELt | [0.2.0](https://github.com/ROCm/hipSPARSELt/releases/tag/rocm-6.1.2) |
| hipTensor | [1.2.0](https://github.com/ROCm/hipTensor/releases/tag/rocm-6.1.2) |
| MIOpen | [3.1.0](https://github.com/ROCm/MIOpen/releases/tag/rocm-6.1.2) |
| MIVisionX | [2.5.0](https://github.com/ROCm/MIVisionX/releases/tag/rocm-6.1.2) |
| rccl | [2.18.6](https://github.com/ROCm/rccl/releases/tag/rocm-6.1.2) |
| rocALUTION | [3.1.1](https://github.com/ROCm/rocALUTION/releases/tag/rocm-6.1.2) |
| rocBLAS | 4.1.0 ⇒ [4.1.2](https://github.com/ROCm/rocBLAS/releases/tag/rocm-6.1.2) |
| rocDecode | 0.5.0 ⇒ [0.6.0](https://github.com/ROCm/rocDecode/releases/tag/rocm-6.1.2) |
| rocFFT | [1.0.27](https://github.com/ROCm/rocFFT/releases/tag/rocm-6.1.2) |
| rocm-cmake | [0.12.0](https://github.com/ROCm/rocm-cmake/releases/tag/rocm-6.1.2) |
| rocPRIM | [3.1.0](https://github.com/ROCm/rocPRIM/releases/tag/rocm-6.1.2) |
| rocRAND | [3.0.1](https://github.com/ROCm/rocRAND/releases/tag/rocm-6.1.2) |
| rocSOLVER | [3.25.0](https://github.com/ROCm/rocSOLVER/releases/tag/rocm-6.1.2) |
| rocSPARSE | [3.1.2](https://github.com/ROCm/rocSPARSE/releases/tag/rocm-6.1.2) |
| rocThrust | [3.0.1](https://github.com/ROCm/rocThrust/releases/tag/rocm-6.1.2) |
| rocWMMA | [1.4.0](https://github.com/ROCm/rocWMMA/releases/tag/rocm-6.1.2) |
| rpp | [1.5.0](https://github.com/ROCm/rpp/releases/tag/rocm-6.1.2) |
| Tensile | [4.40.0](https://github.com/ROCm/Tensile/releases/tag/rocm-6.1.2) |
### RCCL
RCCL 2.18.6 for ROCm 6.1.2
#### Changes
* Reduced `NCCL_TOPO_MAX_NODES` to limit stack usage and avoid stack overflow.
### rocBLAS
rocBLAS 4.1.2 for ROCm 6.1.2
#### Optimizations
* Tuned BBS TN and TT operations on the CDNA3 architecture.
#### Fixes
* Fixed an issue related to obtaining solutions for BF16 TT operations.
### rocDecode
rocDecode 0.6.0 for ROCm 6.1.2
#### Additions
* Added support for FFmpeg v5.x.
#### Optimizations
* Updated error checking in the `rocDecode-setup.py` script.
#### Changes
* Updated core dependencies.
* Updated to support the use of public LibVA headers.
#### Fixes
* Fixed some package dependencies.
## Upcoming changes
* A future release will enable the use of HIPCC compiled binaries `hipcc.bin` and `hipconfig.bin` by default. No action is needed by users; you may continue calling high-level Perl scripts `hipcc` and `hipconfig`. `hipcc.bin` and `hipconfig.bin` will be invoked by the high-level Perl scripts. To revert to the previous behavior and invoke `hipcc.pl` and `hipconfig.pl`, set the `HIP_USE_PERL_SCRIPTS` environment variable to `1`.
* A subsequent release will remove the high-level HIPCC Perl scripts `hipcc` and `hipconfig`. This release will remove the `HIP_USE_PERL_SCRIPTS` environment variable and rename `hipcc.bin` and `hipconfig.bin` to `hipcc` and `hipconfig`, respectively. No action is needed by users. To revert to the previous behavior, invoke `hipcc.pl` and `hipconfig.pl` explicitly.
* A subsequent release will remove `hipcc.pl` and `hipconfig.pl` from HIPCC.
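For example, to keep using the Perl front ends once the compiled binaries become the default (the opt-out described above):

```bash
# Revert hipcc/hipconfig to the Perl script implementations
export HIP_USE_PERL_SCRIPTS=1
```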

View File

@@ -1,12 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
<remote name="rocm-org" fetch="https://github.com/ROCm/" />
<default revision="refs/tags/rocm-6.1.1"
<default revision="refs/tags/rocm-6.1.2"
remote="rocm-org"
sync-c="true"
sync-j="4" />
<!--list of projects for ROCm-->
<project path="ROCm-OpenCL-Runtime/api/opencl/khronos/icd" name="OpenCL-ICD-Loader" remote="KhronosGroup" />
<project name="ROCK-Kernel-Driver" />
<project name="ROCR-Runtime" />
<project name="ROCT-Thunk-Interface" />

View File

@@ -0,0 +1,47 @@
.. meta::
:description: Setting the number of CUs
:keywords: AMD, ROCm, cu, number of cus
.. _env-variables-reference:
*************************************************************
Setting the number of CUs
*************************************************************
When using GPUs to accelerate compute workloads, it sometimes becomes necessary
to configure how the hardware uses its compute units (CUs). This is an advanced
option, so read this page in full before experimenting.
The GPU driver provides two environment variables to set the number of CUs used:
``HSA_CU_MASK`` and ``ROC_GLOBAL_CU_MASK``. The main difference is that
``ROC_GLOBAL_CU_MASK`` sets the CU mask on queues created by the HIP or OpenCL
runtimes, while ``HSA_CU_MASK`` sets the mask at a lower level of queue creation in
the driver, so it is also applied to queues being profiled.
The environment variables have the following syntax:
::
ID = [0-9][0-9]* ex. base 10 numbers
ID_list = (ID | ID-ID)[, (ID | ID-ID)]* ex. 0,2-4,7
GPU_list = ID_list ex. 0,2-4,7
CU_list = 0x[0-F]* | ID_list ex. 0x337F OR 0,2-4,7
CU_Set = GPU_list : CU_list ex. 0,2-4,7:0-15,32-47 OR 0,2-4,7:0x337F
HSA_CU_MASK = CU_Set [; CU_Set]* ex. 0,2-4,7:0-15,32-47; 3-9:0x337F
The GPU indices are taken after ``ROCR_VISIBLE_DEVICES`` reordering. For the GPUs
listed, the listed or masked CUs are enabled and the rest are disabled. Unlisted GPUs
are unaffected; all of their CUs remain enabled.
Parsing of the variable stops at the first syntax error; the erroneous set and any
sets that follow are ignored. Repeated GPU or CU IDs are a syntax error, as is
specifying a mask with no usable CUs (a CU_list of 0x0). To exclude
GPU devices, use ``ROCR_VISIBLE_DEVICES``.
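As a concrete illustration (a hypothetical setup; adjust GPU and CU IDs to your hardware), the following restricts ROCm queues on GPU 0 to its first 16 CUs while leaving all other GPUs untouched:

::

# Enable only CUs 0-15 on GPU 0; unlisted GPUs keep all CUs enabled.
export HSA_CU_MASK="0:0-15"
# Equivalent hex form: the low 16 bits set.
export HSA_CU_MASK="0:0xFFFF"
./my_app   # hypothetical ROCm application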
These environment variables only affect ROCm software, not graphics applications.
It's important to know that not all CU configurations are valid on all devices. For
instance, on devices where two CUs can be combined into a WGP (for kernels running in
WGP mode), it is not valid to disable only a single CU in a WGP. `This paper
<https://www.cs.unc.edu/~otternes/papers/rtsj2022.pdf>`_ can provide more information
about what to expect when disabling CUs.

View File

@@ -23,7 +23,7 @@ for template in templates:
shutil.copy2('../RELEASE.md','./about/release-notes.md')
# Keep capitalization due to similar linking on GitHub's markdown preview.
shutil.copy2('../CHANGELOG.md','./about/CHANGELOG.md')
shutil.copy2('../CHANGELOG.md','./about/changelog.md')
latex_engine = "xelatex"
latex_elements = {
@@ -38,8 +38,8 @@ latex_elements = {
project = "ROCm Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved."
version = "6.1.1"
release = "6.1.1"
version = "6.1.2"
release = "6.1.2"
setting_all_article_info = True
all_article_info_os = ["linux", "windows"]
all_article_info_author = ""
@@ -49,12 +49,12 @@ article_pages = [
{
"file":"about/release-notes",
"os":["linux", "windows"],
"date":"2024-01-31"
"date":"2024-06-04"
},
{
"file":"about/CHANGELOG",
"file":"about/changelog",
"os":["linux", "windows"],
"date":"2024-01-31"
"date":"2024-06-04"
},
{"file":"install/windows/install-quick", "os":["windows"]},

(Several binary image files added; contents not shown in this view.)
View File

@@ -60,6 +60,9 @@ Find information on version compatibility and framework release notes in :doc:`T
For guidance on installing ROCm itself, refer to :doc:`ROCm installation for Linux <rocm-install-on-linux:index>`.
Learn how to use your ROCm deep learning environment for training, fine-tuning, and inference through the following guides.
Learn how to use your ROCm deep learning environment for training, fine-tuning, inference, and performance optimization
through the following guides.
* :doc:`rocm-for-ai/index`
* :doc:`fine-tuning-llms/index`

View File

@@ -0,0 +1,20 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial
*************************
Fine-tuning and inference
*************************
Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries <rocm:reference/api-libraries>` and
:doc:`tools <rocm:reference/rocm-tools>` to optimize and train deep learning models. ROCm provides a comprehensive
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and
ROCm-aware versions of :doc:`deep learning frameworks <../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX.
Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately
sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`.
Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are
typically used in LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of
massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`.

View File

@@ -0,0 +1,37 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial
*******************************************
Fine-tuning LLMs and inference optimization
*******************************************
ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for
specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks,
models, and tools.
For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_
Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language
model <fine-tuning-llms-concept-challenge>` like Llama 2. Then, it introduces :ref:`common methods of optimizing your
fine-tuning <fine-tuning-llms-concept-optimizations>` using techniques like LoRA with libraries like PEFT. In the
sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning.
- :doc:`Conceptual overview of fine-tuning LLMs <overview>`
- :doc:`Fine-tuning and inference <fine-tuning-and-inference>` using a
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` or
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` system.
- :doc:`Model quantization <model-quantization>`
- :doc:`Model acceleration libraries <model-acceleration-libraries>`
- :doc:`LLM inference frameworks <llm-inference-frameworks>`
- :doc:`Optimizing with Composable Kernel <optimizing-with-composable-kernel>`
- :doc:`Optimizing Triton kernels <optimizing-triton-kernel>`
- :doc:`Profiling and debugging <profiling-and-debugging>`

View File

@@ -0,0 +1,218 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, inference, vLLM, TGI, text generation inference
************************
LLM inference frameworks
************************
This section discusses how to implement `vLLM <https://docs.vllm.ai/en/latest>`_ and `Hugging Face TGI
<https://huggingface.co/docs/text-generation-inference/en/index>`_ using
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` and
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` systems.
.. _fine-tuning-llms-vllm:
vLLM inference
==============
vLLM is renowned for its paged attention algorithm, which can reduce memory consumption and increase throughput thanks to
its paging scheme. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token length of a
model, the paged attention of vLLM allocates GPU HBM dynamically based on actual decoding lengths. This paging is
also effective when multiple requests share the same key and value contents, as with large beam-search widths or
many parallel requests.
vLLM also incorporates many modern LLM acceleration and quantization algorithms, such as Flash Attention, HIP and CUDA
graphs, tensor-parallel multi-GPU execution, GPTQ, AWQ, and token speculation.
Installing vLLM
---------------
1. To install vLLM, run the following commands.
.. code-block:: shell
# Install from source
git clone https://github.com/ROCm/vllm.git
cd vllm
PYTORCH_ROCM_ARCH=gfx942 python setup.py install # MI300 series
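To sanity-check the installation (a quick, optional check; the reported version string depends on the checkout):

.. code-block:: shell

python -c "import vllm; print(vllm.__version__)"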
.. _fine-tuning-llms-vllm-rocm-docker-image:
2. Run the following commands to build a Docker image ``vllm-rocm``.
.. code-block:: shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.rocm -t vllm-rocm .
.. tab-set::
.. tab-item:: vLLM on a single-accelerator system
:sync: single
3. To use vLLM as an API server to serve inference requests, first start a container using the :ref:`vllm-rocm
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`.
.. code-block:: shell
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
bash
4. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command.
.. code-block:: shell
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 --port 8000 &
A log message like the following in your command line indicates that the server is listening for requests.
.. image:: ../../data/how-to/fine-tuning-llms/vllm-single-gpu-log.png
:alt: vLLM API server log message
:align: center
5. To test, send it a curl request containing a prompt.
.. code-block:: shell
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }'
You should receive a response like the following.
.. code-block:: text
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]}
.. tab-item:: vLLM on a multi-accelerator system
:sync: multi
3. To use vLLM as an API server to serve inference requests, first start a container using the :ref:`vllm-rocm
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`.
.. code-block:: shell
docker run -it \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/model>:/app/model \
vllm-rocm \
bash
4. To run the API server on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` parameter. For example, to use two
GPUs, start the API server using the following command.
.. code-block:: shell
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 &
5. To run multiple instances of the API server, specify a different port for each server, and use ``ROCR_VISIBLE_DEVICES`` to
isolate each instance to a different set of accelerators.
For example, to run two API servers, one on port 8000 using GPUs 0 and 1 and one on port 8001 using GPUs 2 and 3, use
a command like the following.
.. code-block:: shell
ROCR_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 -tp 2 --port 8000 &
ROCR_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 -tp 2 --port 8001 &
6. To test, send it a curl request containing a prompt.
.. code-block:: shell
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }'
You should receive a response like the following.
.. code-block:: text
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]}
.. _fine-tuning-llms-tgi:
Hugging Face TGI
================
Text Generation Inference (TGI) is an LLM serving framework from Hugging
Face. It also supports the majority of high-performance LLM
acceleration algorithms, such as Flash Attention, paged attention,
CUDA and HIP graphs, tensor parallel multi-GPU, GPTQ, AWQ, and token
speculation.
.. tip::
In addition to LLM serving capability, TGI also provides the `Text Generation Inference benchmarking tool
<https://github.com/huggingface/text-generation-inference/blob/main/benchmark/README.md>`_.
Install TGI
-----------
1. To install the TGI Docker image, run the following commands.
.. code-block:: shell
# Install from Dockerfile
git clone https://github.com/huggingface/text-generation-inference.git -b mi300-compat
cd text-generation-inference
docker build . -f Dockerfile.rocm
.. tab-set::
.. tab-item:: TGI on a single-accelerator system
:sync: single
2. Launch a model using the TGI server on a single accelerator.
.. code-block:: shell
export ROCM_USE_FLASH_ATTN_V2_TRITON=True
text-generation-launcher --model-id NousResearch/Meta-Llama-3-70B --dtype float16 --port 8000 &
3. To test, send it a curl request containing a prompt.
.. code-block:: shell
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
You should receive a response like the following.
.. code-block:: text
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2822266,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null}
.. tab-item:: TGI on a multi-accelerator system
2. Launch a model using the TGI server on multiple accelerators (4 in this case).
.. code-block:: shell
export ROCM_USE_FLASH_ATTN_V2_TRITON=True
text-generation-launcher --model-id NousResearch/Meta-Llama-3-8B --dtype float16 --port 8000 --num-shard 4 &
3. To test, send it a curl request containing a prompt.
.. code-block:: shell
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json'
You should receive a response like the following.
.. code-block:: text
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2773438,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null}

View File

@@ -0,0 +1,251 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, Flash Attention, Hugging Face, xFormers, vLLM, PyTorch
****************************
Model acceleration libraries
****************************
This section discusses model acceleration techniques and libraries to improve memory efficiency and performance.
Flash Attention 2
=================
Flash Attention is a technique designed to reduce memory movements between GPU SRAM and high-bandwidth memory (HBM). By
using a tiling approach, Flash Attention 2 improves memory locality in the nested loops of query, key, and value
computations within the Attention modules of LLMs. These modules include Multi-Head Attention (MHA), Group-Query
Attention (GQA), and Multi-Query Attention (MQA). This reduction in memory movements significantly decreases the
time-to-first-token (TTFT) latency for large batch sizes and long prompt sequences, thereby enhancing overall
performance.
.. image:: ../../data/how-to/fine-tuning-llms/attention-module.png
:alt: Attention module of a large language module utilizing tiling
:align: center
Installing Flash Attention 2
----------------------------
ROCm provides two different implementations of Flash Attention 2 modules. They can be deployed interchangeably:
* ROCm `Composable Kernel <https://github.com/ROCm/composable_kernel/tree/develop/example/01_gemm>`_
(CK) Flash Attention 2
* `OpenAI Triton <https://triton-lang.org/main/index.html>`_ Flash Attention 2
.. tab-set::
.. tab-item:: CK Flash Attention 2
To install CK Flash Attention 2, use the following commands.
.. code-block:: shell
# Install from source
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention/
GPU_ARCHS=gfx942 python setup.py install #MI300 series
Hugging Face Transformers can easily deploy the CK Flash Attention 2 module by passing an argument
``attn_implementation="flash_attention_2"`` in the ``from_pretrained`` class.
.. code-block:: python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_name = "NousResearch/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, torch_dtype=torch.float16, use_fast=False)
inputs = tokenizer('Today is', return_tensors='pt').to(device)
model_eager = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, attn_implementation="eager").cuda(device)
model_ckFAv2 = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda(device)
print("eager GQA: ", tokenizer.decode(model_eager.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))
print("ckFAv2 GQA: ", tokenizer.decode(model_ckFAv2.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))
# eager GQA: Today is the day of the Lord, and we are the
# ckFAv2 GQA: Today is the day of the Lord, and we are the
.. tab-item:: Triton Flash Attention 2
The Triton Flash Attention 2 module is implemented in Python and uses OpenAI's JIT compiler. This module has been
upstreamed into the vLLM serving toolkit, discussed in :doc:`llm-inference-frameworks`.
1. To install Triton Flash Attention 2 and run the benchmark, use the following commands.
.. code-block:: shell
# Install from the source
pip uninstall pytorch-triton-rocm triton -y
git clone https://github.com/ROCm/triton.git
cd triton/python
GPU_ARCHS=gfx942 python setup.py install #MI300 series
pip install matplotlib pandas
2. To test, run the Triton Flash Attention 2 performance benchmark.
.. code-block:: shell
# Test the Triton FA v2 kernel
python perf-kernels/flash-attention.py
# Sample results
fused-attention-fwd-d128:
BATCH HQ HK N_CTX_Q N_CTX_K TFLOPS
0 16.0 16.0 16.0 1024.0 1024.0 287.528411
1 8.0 16.0 16.0 2048.0 2048.0 287.490806
2 4.0 16.0 16.0 4096.0 4096.0 345.966031
3 2.0 16.0 16.0 8192.0 8192.0 361.369510
4 1.0 16.0 16.0 16384.0 16384.0 356.873720
5 2.0 48.0 48.0 1024.0 1024.0 216.916235
6 2.0 48.0 48.0 2048.0 1024.0 271.027578
7 2.0 48.0 48.0 4096.0 8192.0 337.367372
8 2.0 48.0 48.0 8192.0 4096.0 363.481649
9 2.0 48.0 48.0 16384.0 8192.0 375.013622
10 8.0 16.0 16.0 1989.0 15344.0 321.791333
11 4.0 16.0 16.0 4097.0 163.0 122.104888
12 2.0 16.0 16.0 8122.0 2159.0 337.060283
13 1.0 16.0 16.0 16281.0 7.0 5.234012
14 2.0 48.0 48.0 1021.0 1020.0 214.657425
15 2.0 48.0 48.0 2001.0 2048.0 314.429118
16 2.0 48.0 48.0 3996.0 9639.0 330.411368
17 2.0 48.0 48.0 8181.0 1021.0 324.614980
xFormers
========
xFormers also improves the performance of attention modules. Although xFormers attention performs very
similarly to Flash Attention 2 due to its tiling behavior of query, key, and value, it's widely used for LLMs and
Stable Diffusion models with the Hugging Face Diffusers library.
Installing CK xFormers
----------------------
Use the following commands to install CK xFormers.
.. code-block:: shell
# Install from source
git clone https://github.com/ROCm/xformers.git
cd xformers/
git submodule update --init --recursive
PYTORCH_ROCM_ARCH=gfx942 python setup.py install #Instinct MI300-series
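Once installed, the fused attention operator can be called directly. The following is a minimal sketch, assuming the build above succeeded and a ROCm-capable accelerator is visible to PyTorch; the tensor shapes are illustrative.
.. code-block:: python
# Minimal sketch of xFormers memory-efficient attention.
import torch
import xformers.ops as xops

# Tensors are laid out as (batch, sequence, heads, head_dim).
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Runs the fused attention kernel and returns a tensor of the same shape.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 1024, 8, 64])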
PyTorch built-in acceleration
=============================
`PyTorch compilation
mode <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`__
synthesizes the model into a graph and then lowers it to prime
operators. These operators are compiled using TorchInductor, which uses
OpenAI Triton as a building block for GPU acceleration. One advantage of
PyTorch compilation mode is that its GPU kernels are written in Python,
making modifying and extending them easier. PyTorch compilation mode
often delivers higher performance, as model operations are fused before
runtime, which allows for easy deployment of high-performance kernels.
PyTorch compilation
-------------------
To utilize the PyTorch compilation mode, specific layers of the model
must be explicitly assigned as compilation targets. In the case of LLM,
where autoregressive token decoding generates dynamically changing
key/value sizes, limiting the key/value size to a static dimension,
``max_cache_length``, is necessary to utilize the performance benefits
of the PyTorch compilation.
.. code-block:: python
# Sample script to run LLM with the static key-value cache and PyTorch compilation
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
from typing import Optional
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
model_name = "NousResearch/Meta-Llama-3-8B"
prompts = ["New york city is where "]
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device).eval()
inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
def decode_one_tokens(model, cur_token, input_pos, cache_position):
    logits = model(cur_token, position_ids=input_pos, cache_position=cache_position, return_dict=False, use_cache=True)[0]
    new_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
    return new_token
batch_size, seq_length = inputs["input_ids"].shape
# Static key-value cache
max_cache_length = 1024
max_new_tokens = 10
model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)
cache_position = torch.arange(seq_length, device=device)
generated_ids = torch.zeros(batch_size, seq_length + max_new_tokens + 1, dtype=torch.int, device=device)
generated_ids[:, cache_position] = inputs["input_ids"].to(device).to(torch.int)
logits = model(**inputs, cache_position=cache_position, return_dict=False, use_cache=True)[0]
next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
# torch compilation
decode_one_tokens = torch.compile(decode_one_tokens, mode="max-autotune-no-cudagraphs",fullgraph=True)
generated_ids[:, seq_length] = next_token[:, 0]
cache_position = torch.tensor([seq_length + 1], device=device)
with torch.no_grad():
    for _ in range(1, max_new_tokens):
        with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
            next_token = decode_one_tokens(model, next_token.clone(), None, cache_position)
        generated_ids[:, cache_position] = next_token.int()
        cache_position += 1
.. _fine-tuning-llms-pytorch-tunableop:
PyTorch TunableOp
------------------
ROCm PyTorch (2.2.0 and later) allows users to use high-performance ROCm
GEMM kernel libraries through PyTorch's built-in TunableOp options.
This enables users to automatically pick the best-performing GEMM
kernels from the :doc:`rocBLAS <rocblas:index>` and :doc:`hipBLASLt <hipblaslt:index>` libraries during runtime.
During warm-up runs or offline profiling steps, users can create a GEMM table
that enumerates the kernel information. During the model's run, the best-performing kernel
listed in the GEMM table is substituted for ``torch.nn.functional.linear(input, weight, bias=None)``. The
`TunableOp README on GitHub <https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md>`_
describes the options.
.. code-block:: python
# To turn on TunableOp, set this environment variable before launching Python:
# export PYTORCH_TUNABLEOP_ENABLED=1
import torch
import torch.nn as nn
import torch.nn.functional as F

A = torch.rand(100, 20, device="cuda")
W = torch.rand(200, 20, device="cuda")
Out = F.linear(A, W)
print(Out.size())

# The resulting tunableop_results0.csv looks like the following:
# Validator,PT_VERSION,2.4.0
# Validator,ROCM_VERSION,6.1.0.0-82-5fabb4c
# Validator,HIPBLASLT_VERSION,0.7.0-1549b021
# Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
# Validator,ROCBLAS_VERSION,4.1.0-cefa4a9b-dirty
# GemmTunableOp_float_TN,tn_200_100_20,Gemm_Rocblas_32323,0.00669595
.. image:: ../../data/how-to/fine-tuning-llms/tunableop.png
:alt: GEMM and TunableOp
:align: center
Learn more about optimizing kernels with TunableOp in
:ref:`Optimizing Triton kernels <fine-tuning-llms-triton-tunableop>`.

View File

@@ -0,0 +1,259 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, quantization, GPTQ, transformers, bitsandbytes
*****************************
Model quantization techniques
*****************************
Quantization reduces the model size compared to its native full-precision version, making it easier to fit large models
onto accelerators or GPUs with limited memory usage. This section explains how to perform LLM quantization using GPTQ
and bitsandbytes on AMD Instinct hardware.
.. _fine-tune-llms-gptq:
GPTQ
====
GPTQ is a post-training quantization technique where each row of the weight matrix is quantized independently to find a
version of the weights that minimizes error. These weights are quantized to ``int4`` but are restored to ``fp16`` on the
fly during inference. This can reduce memory usage by about a factor of four; for example, the weights of a 7B-parameter
model that occupy about 14 GB in ``fp16`` shrink to roughly 3.5 GB in ``int4``. A speedup in inference is also expected
because inference on GPTQ models uses a lower bit width, which takes less time to communicate.
Before setting up the GPTQ configuration in Transformers, ensure the `AutoGPTQ <https://github.com/AutoGPTQ/AutoGPTQ>`_ library
is installed.
Installing AutoGPTQ
-------------------
The AutoGPTQ library implements the GPTQ algorithm.
#. Use the following command to install the latest stable release of AutoGPTQ from pip.
.. code-block:: shell
# This will install pre-built wheel for a specific ROCm version.
pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/
Or, install AutoGPTQ from source for the appropriate ROCm version (for example, ROCm 6.1).
.. code-block:: shell
# Clone the source code.
git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ
# Speed up the compilation by specifying PYTORCH_ROCM_ARCH for the target device.
PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install .
#. Run ``pip show auto-gptq`` to print information for the installed ``auto-gptq`` package. Its output should look like
this:
.. code-block:: shell
Name: auto-gptq
Version: 0.8.0.dev0+rocm6.1
...
Using GPTQ with AutoGPTQ
------------------------
#. Run the following code snippet.
.. code-block:: python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
base_model_name = "NousResearch/Llama-2-7b-hf"
quantized_model_name = "llama-2-7b-hf-gptq"
tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_fast=True)
examples = [
tokenizer(
"auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
)
]
print(examples)
The resulting examples should be a list of dictionaries whose keys are ``input_ids`` and ``attention_mask``.
#. Set up the quantization configuration using the following snippet.
.. code-block:: python
quantize_config = BaseQuantizeConfig(
bits=4, # quantize model to 4-bit
group_size=128, # it is recommended to set the value to 128
desc_act=False,
)
#. Load the non-quantized model using the AutoGPTQ class and run the quantization.
.. code-block:: python
# Import auto_gptq class.
from auto_gptq import AutoGPTQForCausalLM
# Load non-quantized model.
base_model = AutoGPTQForCausalLM.from_pretrained(base_model_name, quantize_config, device_map = "auto")
base_model.quantize(examples)
# Save quantized model.
base_model.save_quantized(quantized_model_name)
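To sanity-check the result, you can load the quantized model back with ``from_quantized`` and generate a few tokens. The following is a minimal sketch reusing the names defined above.
.. code-block:: python
# Reload the quantized model for inference and generate a short completion.
quantized_model = AutoGPTQForCausalLM.from_quantized(quantized_model_name, device="cuda:0")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to("cuda:0")
output_ids = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))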
Using GPTQ with Hugging Face Transformers
------------------------------------------
#. To perform a GPTQ quantization using Hugging Face Transformers, create a ``GPTQConfig`` instance and set the
number of bits to quantize to and a dataset to calibrate the weights with.
.. code-block:: python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
base_model_name = " NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
#. Load a model to quantize using ``AutoModelForCausalLM`` and pass the
``gptq_config`` to its ``from_pretrained`` method. Set ``device_map="auto"`` to
automatically offload the model to available GPU resources.
.. code-block:: python
quantized_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
quantization_config=gptq_config)
#. Once the model is quantized, you can push the model and tokenizer to the Hugging Face Hub for easy sharing and access.
.. code-block:: python
quantized_model.push_to_hub("llama-2-7b-hf-gptq")
tokenizer.push_to_hub("llama-2-7b-hf-gptq")
Or, you can save the model locally using the following snippet.
.. code-block:: python
quantized_model.save_pretrained("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
ExLlama-v2 support
------------------
ExLlama is a Python/C++/CUDA implementation of the Llama model
designed for faster inference with 4-bit GPTQ weights. The ExLlama
kernel is activated by default when you create a ``GPTQConfig`` object. To
boost inference speed even further on Instinct accelerators, use the ExLlama-v2
kernels by configuring the ``exllama_config`` parameter as follows.
.. code-block:: python
from transformers import AutoModelForCausalLM, GPTQConfig
base_model_name = "meta-llama/Llama-2-7b"
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
quantized_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    quantization_config=gptq_config)
bitsandbytes
============
The `ROCm-aware bitsandbytes <https://github.com/ROCm/bitsandbytes>`_ library is
a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizer, matrix multiplication, and
8-bit and 4-bit quantization functions. The library includes quantization primitives for 8-bit and 4-bit operations
through ``bitsandbytes.nn.Linear8bitLt`` and ``bitsandbytes.nn.Linear4bit`` and 8-bit optimizers through the
``bitsandbytes.optim`` module. These modules are supported on AMD Instinct accelerators.
Installing bitsandbytes
-----------------------
#. To install bitsandbytes for ROCm 6.0 (and later), use the following commands.
.. code-block:: shell
# Clone the github repo
git clone --recurse https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout rocm_enabled
# Install dependencies
pip install -r requirements-dev.txt
# Use -DBNB_ROCM_ARCH to specify target GPU arch
cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
# Install
python setup.py install
#. Run ``pip show bitsandbytes`` to show the information about the installed bitsandbytes package. Its output should
look like the following.
.. code-block:: shell
Name: bitsandbytes
Version: 0.44.0.dev0
...
Using bitsandbytes primitives
-----------------------------
To get started with bitsandbytes primitives, use the following code as reference.
.. code-block:: python
import bitsandbytes as bnb
# Use Int8 Matrix Multiplication
bnb.matmul(..., threshold=6.0)
# Use bitsandbytes 8-bit Optimizers
adam = bnb.optim.Adam8bit(model.parameters(), lr=0.001, betas=(0.9, 0.995))
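For a runnable end-to-end example, the 8-bit linear layer can replace a standard ``torch.nn.Linear``. The following is a minimal sketch, assuming the bitsandbytes build from the previous step and a ROCm-capable accelerator.
.. code-block:: python
import torch
import bitsandbytes as bnb

# Drop-in 8-bit replacement for torch.nn.Linear; weights are quantized to
# int8 when the module is moved to the accelerator.
linear_fp16 = torch.nn.Linear(1024, 1024)
linear_int8 = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False, threshold=6.0)
linear_int8.load_state_dict(linear_fp16.state_dict())
linear_int8 = linear_int8.cuda()

x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
print(linear_int8(x).shape)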
Using bitsandbytes with Hugging Face Transformers
-------------------------------------------------
To load a Transformers model in 4-bit, set ``load_in_4bit=True`` in ``BitsAndBytesConfig``.
.. code-block:: python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
base_model_name = "NousResearch/Llama-2-7b-hf"
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
bnb_model_4bit = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
quantization_config=quantization_config)
# Check the memory footprint with get_memory_footprint method
print(bnb_model_4bit.get_memory_footprint())
To load a model in 8-bit for inference, use the ``load_in_8bit`` option.
.. code-block:: python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
base_model_name = "NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
bnb_model_8bit = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
quantization_config=quantization_config)
prompt = "What is a large language model?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = bnb_model_8bit.generate(**inputs)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

View File

@@ -0,0 +1,236 @@
.. meta::
:description: Model fine-tuning and inference on a multi-GPU system
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, multi-GPU, distributed, inference
*****************************************************
Fine-tuning and inference using multiple accelerators
*****************************************************
This section explains how to fine-tune a model on a multi-accelerator system. See
:doc:`Single-accelerator fine-tuning <single-gpu-fine-tuning-and-inference>` for a single accelerator or GPU setup.
.. _fine-tuning-llms-multi-gpu-env:
Environment setup
=================
This section was tested using the following hardware and software environment.
.. list-table::
:stub-columns: 1
* - Hardware
- 4 AMD Instinct MI300X accelerators
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
* - Libraries
- ``transformers`` ``datasets`` ``accelerate`` ``huggingface-hub`` ``peft`` ``trl`` ``scipy``
* - Base model
- ``meta-llama/Llama-2-7b-chat-hf``
.. _fine-tuning-llms-multi-gpu-env-setup:
Setting up the base implementation environment
----------------------------------------------
#. Install PyTorch for ROCm. Refer to the
:doc:`PyTorch installation guide <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`. For consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCM-capable accelerators using the following command.
.. code-block:: shell
rocm-smi --showproductname
#. Check that your accelerators are available to PyTorch.
.. code-block:: python
import torch
print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())
If successful, your output should look like this:
.. code-block:: shell
>>> print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
Is a ROCm-GPU detected? True
>>> print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())
How many ROCm-GPUs are detected? 4
.. tip::
During training and inference, you can check the memory usage by running the ``rocm-smi`` command in your terminal.
This tool helps you see which accelerators or GPUs are involved.
.. _fine-tuning-llms-multi-gpu-hugging-face-accelerate:
Hugging Face Accelerate for fine-tuning and inference
===========================================================
`Hugging Face Accelerate <https://huggingface.co/docs/accelerate/en/index>`_ is a library that simplifies turning raw
PyTorch code for a single accelerator into code for multiple accelerators for LLM fine-tuning and inference. It is
integrated with `Transformers <https://huggingface.co/docs/transformers/en/index>`_, allowing you to scale your PyTorch
code while maintaining performance and flexibility.
As a brief example of model fine-tuning and inference using multiple GPUs, let's use Transformers and load in the Llama
2 7B model.
Here, let's reuse the code in :ref:`Single-accelerator fine-tuning <fine-tuning-llms-single-gpu-download-model-dataset>`
to load the base model and tokenizer.
Now, it's important to adjust how you load the model. Add the ``device_map`` parameter to your base model configuration.
.. code-block:: python
...
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Load base model to GPU memory
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    trust_remote_code=True)
...
# Run training
sft_trainer.train()
.. note::
You can let Accelerate handle the device map computation by setting ``device_map`` to one of the supported options
(``"auto"``, ``"balanced"``, ``"balanced_low_0"``, ``"sequential"``).
It's recommended to set the ``device_map`` parameter to ``"auto"`` to allow Accelerate to automatically and
efficiently allocate the model given the available resources (4 accelerators in this case).
When you have more GPU memory available than the model size, here is the difference between each ``device_map``
option:
* ``"auto"`` and ``"balanced"`` evenly split the model on all available GPUs, making it possible for you to use a
batch size greater than 1.
* ``"balanced_low_0"`` evenly splits the model on all GPUs except the first
one, and only puts on GPU 0 what does not fit on the others. This
option is great when you need to use GPU 0 for some processing of the
outputs, like when using the generate function for Transformers
models.
* ``"sequential"`` will fit what it can on GPU 0, then move on GPU 1 and so forth. Not all GPUs might be used.
After loading the model in this way, the model is fully ready to use the resources available to it.
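Putting the pieces together, the following is a minimal end-to-end inference sketch with ``device_map="auto"``, assuming the ``transformers`` and ``accelerate`` packages are installed and that you have access to the Llama 2 weights on the Hugging Face Hub.
.. code-block:: python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Accelerate shards the model across all visible accelerators.
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto")

inputs = tokenizer("What is AMD Instinct?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))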
.. _fine-tuning-llms-multi-gpu-torchtune:
torchtune for fine-tuning and inference
=============================================
`torchtune <https://pytorch.org/torchtune/main/>`_ is a PyTorch-native library for easy single and multi-accelerator or
GPU model fine-tuning and inference with LLMs.
#. Install torchtune using pip.
.. code-block:: shell
# Install torchtune with PyTorch release 2.2.2+
pip install torchtune
# To confirm that the package is installed correctly
tune --help
The output should look like this:
.. code-block:: shell
usage: tune [-h] {download,ls,cp,run,validate} ...
Welcome to the TorchTune CLI!
options:
-h, --help show this help message and exit
subcommands:
{download,ls,cp,run,validate}
#. torchtune recipes are designed around easily composable components and workable training loops, with minimal abstraction
getting in the way of fine-tuning. Run ``tune ls`` to show built-in torchtune configuration recipes.
.. code-block:: shell
RECIPE CONFIG
full_finetune_single_device llama2/7B_full_low_memory
llama3/8B_full_single_device
mistral/7B_full_low_memory
full_finetune_distributed llama2/7B_full
llama2/13B_full
llama3/8B_full
mistral/7B_full
gemma/2B_full
lora_finetune_single_device llama2/7B_lora_single_device
llama2/7B_qlora_single_device
llama3/8B_lora_single_device
llama3/8B_qlora_single_device
llama2/13B_qlora_single_device
mistral/7B_lora_single_device
The ``RECIPE`` column shows the easy-to-use and workable fine-tuning and inference recipes for popular fine-tuning
techniques (such as LoRA). The ``CONFIG`` column lists the YAML configurations for easily configuring training,
evaluation, quantization, or inference recipes.
The snippet shows the architecture of a model's YAML configuration file:
.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama2.lora_llama2_7b
lora_attn_modules: ['q_proj', 'v_proj']
apply_lora_to_mlp: False
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16
tokenizer:
_component_: torchtune.models.llama2.llama2_tokenizer
path: /tmp/Llama-2-7b-hf/tokenizer.model
# Dataset and sampler
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
train_on_input: True
#. This configuration file defines the fine-tuning base model path, dataset, hyperparameters for the optimizer and
scheduler, and training data type. To download the base model for fine-tuning, run the following command:
.. code-block:: shell
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token <HF_TOKEN>
The output directory argument for ``--output-dir`` should match the model path specified in the YAML config file.
#. To launch ``lora_finetune_distributed`` on four devices, run the following
command:
.. code-block:: shell
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config llama2/7B_lora
If successful, you should see something like the following output:
.. code-block:: shell
INFO:torchtune.utils.logging:FSDP is enabled. Instantiating Model on CPU for Rank 0 ...
INFO:torchtune.utils.logging:Model instantiation took 7.32 secs
INFO:torchtune.utils.logging:Memory Stats after model init:
{'peak_memory_active': 9.478172672, 'peak_memory_alloc': 8.953868288, 'peak_memory_reserved': 11.112808448}
INFO:torchtune.utils.logging:Optimizer and loss are initialized.
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
INFO:torchtune.utils.logging:Learning rate scheduler is initialized.
1|111|Loss: 1.5790324211120605: 7%|| 114/1618
Read more about inference frameworks in :doc:`LLM inference frameworks <llm-inference-frameworks>`.

View File

@@ -0,0 +1,388 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, Triton, kernel, performance, optimization
*************************
Optimizing Triton kernels
*************************
This section introduces the general steps for `Triton <https://openai.com/index/triton/>`_ kernel optimization. Broadly,
Triton kernel optimization is similar to HIP and CUDA kernel optimization.
.. _fine-tuning-llms-triton-memory-access-efficiency:
Memory access efficiency
========================
The accelerator or GPU contains global memory, local data share (LDS), and registers. Global memory has high access
latency, but is large. LDS access has much lower latency, but is smaller. Register access is the fastest yet smallest
among the three.
So, data in global memory should be loaded and stored as few times as possible. If different threads in a block
need to access the same data, the data should first be transferred from global memory to LDS and then accessed by the
different threads of the workgroup.
.. _fine-tuning-llms-triton-hardware-resource-utilization:
Hardware resource utilization
=============================
Each accelerator or GPU has multiple compute units (CUs), and the CUs do computation in parallel. So, how many CUs
can a compute kernel allocate its tasks to? For the :doc:`AMD MI300X accelerator <../../reference/gpu-arch-specs>`, the
grid should have at least 1024 thread blocks or workgroups.
.. figure:: ../../data/how-to/fine-tuning-llms/compute-unit.png
Schematic representation of a CU in the CDNA2 or CDNA3 architecture.
To increase hardware utilization and maximize parallelism, it is necessary to design algorithms that can exploit more
parallelism. One approach to achieving this is by using larger split-K techniques for General Matrix Multiply (GEMM)
operations, which can further distribute the computation across more CUs, thereby enhancing performance.
.. tip::
You can query hardware resources with the command ``rocminfo`` (in the ``/opt/rocm/bin`` directory). For instance,
query the number of CUs, number of SIMD, and wavefront size using the following commands.
.. code-block:: shell
rocminfo | grep "Compute Unit"
rocminfo | grep "SIMD"
rocminfo | grep "Wavefront Size"
On an MI300X device, there are 304 CUs, 4 SIMD per CU, and the wavefront size (warp size) is 64. See :doc:`Hardware
specifications <../../reference/gpu-arch-specs>` for a full list of AMD accelerators and GPUs.
.. _fine-tuning-llms-triton-ir-analysis:
IR analysis
===========
In Triton, there are several layouts including *blocked*, *shared*, *sliced*, and *MFMA*.
From the Triton GPU IR (intermediate representation), you can tell in which memory each computation is
performed. The following is a snippet of IR from the Flash Attention decode ``int4`` key-value program. It
dequantizes the key-value data from the ``int4`` data type to ``fp16``.
.. code-block::
%190 = tt.load %189 {cache = 1 : i32, evict = 1 : i32, isVolatile =
false} : tensor<1x64xi32, #blocked6> loc(#loc159)
%266 = arith.andi %190, %cst_28 : tensor<1x64xi32, #blocked6>
loc(#loc250)
%267 = arith.trunci %266 : tensor<1x64xi32, #blocked6> to
tensor<1x64xi16, #blocked6> loc(#loc251)
%268 = tt.bitcast %267 : tensor<1x64xi16, #blocked6> -> tensor<1x64xf16,
#blocked6> loc(#loc252)
%269 = triton_gpu.convert_layout %268 : (tensor<1x64xf16, #blocked6>) ->
tensor<1x64xf16, #shared1> loc(#loc252)
%270 = tt.trans %269 : (tensor<1x64xf16, #shared1>) -> tensor<64x1xf16,
#shared2> loc(#loc194)
%276 = triton_gpu.convert_layout %270 : (tensor<64x1xf16, #shared2>) ->
tensor<64x1xf16, #blocked5> loc(#loc254)
%293 = arith.mulf %276, %cst_30 : tensor<64x1xf16, #blocked5>
loc(#loc254)
%295 = arith.mulf %292, %294 : tensor<64x32xf16, #blocked5> loc(#loc264)
%297 = arith.addf %295, %296 : tensor<64x32xf16, #blocked5> loc(#loc255)
%298 = triton_gpu.convert_layout %297 : (tensor<64x32xf16, #blocked5>)
-> tensor<64x32xf16, #shared1> loc(#loc255)
%299 = tt.trans %298 : (tensor<64x32xf16, #shared1>) ->
tensor<32x64xf16, #shared2> loc(#loc196)
%300 = triton_gpu.convert_layout %299 : (tensor<32x64xf16, #shared2>) ->
tensor<32x64xf16, #triton_gpu.dot_op<{opIdx = 1, parent = #mfma, kWidth
= 4}>> loc(#loc197)
From the IR, you can see that ``i32`` data is loaded from global memory to registers. After a few element-wise
operations in registers, the data is stored in shared memory for the transpose operation, which needs data movement
across different threads. With the transpose done, the data is loaded from LDS to registers again, and, after a few
more element-wise operations, stored in LDS once more. The last step loads from LDS to registers and converts to the
dot-operand layout. In short, the IR uses the LDS twice: once for the transpose, and once to convert the blocked
layout to a dot-operand layout.
Assembly analysis
=================
In the ISA, ensure ``global_load_dwordx4`` is used, especially when the
load happens in a loop.
In most cases, the LDS load and store should use ``_b128`` as well to
minimize the number of LDS access instructions. Note that the upstream backend might not have ``_b128`` LDS read/write,
in which case it uses ``_b64``. In most cases, whether you use the fork or upstream,
the LDS access should have ``_b64`` vector width.
The AMD ISA has the ``s_waitcnt`` instruction to synchronize the dependency
of memory access and computations. The ``s_waitcnt`` instruction can
have two signals, typically in the context of Triton:
* ``lgkmcnt(n)``: ``lgkm`` stands for LDS, GDS, Constant, and Message.
In this context, it is most often related to LDS access. The number ``n`` is how many such accesses may still be
outstanding before execution continues. For example, 0 means all ``lgkm`` accesses must finish before continuing, and
1 means only one ``lgkm`` access may still be running asynchronously before proceeding.
* ``vmcnt(n)``: ``vm`` means vector memory.
This counter applies when vector memory is accessed, for example, when a global load moves data from global memory to
vector memory. Again, ``n`` is how many such accesses may still be outstanding before execution continues.
Generally recommended guidelines are as follows.
* Vectorize memory access as much as possible.
* Ensure synchronization is done efficiently.
* Overlap instructions to hide latency; this requires thoughtful
analysis of the algorithm.
* If you find inefficiencies, you can trace them back to the LLVM IR, TTGIR,
and even TTIR to see where the problem comes from. If you find it
during compiler optimization, activate the MLIR dump and check which
optimization pass caused the problem.
.. _fine-tuning-llms-triton-kernel-occupancy:
Kernel occupancy
================
1. Get the VGPR count: search for ``.vgpr_count`` in the ISA (call this value ``N``).
2. Get the allocated LDS using the following steps (call the result ``L``, in bytes).
a. ``export MLIR_ENABLE_DUMP=1``
b. ``rm -rf ~/.triton/cache``
c. ``python kernel.py | grep "triton_gpu.shared = " | tail -n 1``
d. You should see something like ``triton_gpu.shared = 65536``, indicating 65536 bytes of LDS are allocated for the
kernel.
3. Get the number of waves per workgroup using the following steps (call the result ``nW``).
a. ``export MLIR_ENABLE_DUMP=1``
b. ``rm -rf ~/.triton/cache``
c. ``python kernel.py | grep "triton_gpu.num-warps" | tail -n 1``
d. You should see something like ``"triton_gpu.num-warps" = 8``, indicating 8 waves per workgroup.
4. Compute the occupancy limited by VGPR usage based on ``N``, using the following table. Call the resulting waves per
EU ``occ_vgpr``.
.. _fine-tuning-llms-occupancy-vgpr-table:
.. figure:: ../../data/how-to/fine-tuning-llms/occupancy-vgpr.png
:alt: Occupancy related to VGPR usage in an Instinct MI300X accelerator.
:align: center
5. Compute the occupancy limited by LDS based on ``L``: ``occ_lds = floor(65536 / L)``.
6. Then the occupancy is ``occ = min(floor(occ_vgpr * 4 / nW), occ_lds) * nW / 4``.
a. ``occ_vgpr * 4`` gives the total number of waves on all 4 execution units (SIMDs)
per CU.
b. ``floor(occ_vgpr * 4 / nW)`` gives the occupancy of workgroups per CU
regarding VGPR usage.
c. The true ``occ`` is the minimum of the two.
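As a quick check, the recipe above can be scripted. The following is a small helper; the ``occ_vgpr`` value must still come from the VGPR table below, so the numbers in the example call are illustrative assumptions.
.. code-block:: python
# Helper that follows the occupancy recipe above.
def occupancy(occ_vgpr, lds_bytes, waves_per_workgroup):
    # Step 5: occupancy limited by LDS.
    occ_lds = 65536 // lds_bytes
    # Step 6: workgroups per CU is the minimum of the VGPR- and LDS-limited values.
    occ_wg = min(occ_vgpr * 4 // waves_per_workgroup, occ_lds)
    # Convert workgroups per CU back to waves per EU.
    return occ_wg * waves_per_workgroup / 4

# Example with assumed values: occ_vgpr = 2 from the table, 65536 bytes of
# LDS, and 8 waves per workgroup.
print(occupancy(occ_vgpr=2, lds_bytes=65536, waves_per_workgroup=8))  # 2.0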
.. _fine-tuning-llms-triton-kernel-configs-env-vars:
Auto-tunable kernel configurations and environment variables
============================================================
This section relates to the amount of :ref:`memory access <fine-tuning-llms-triton-memory-access-efficiency>` and
computation assigned to each CU. It is related to the usage of LDS, registers and the scheduling of different tasks on
a CU.
The following is a list of kernel arguments used for tuning.
``num_stages=n``
Adjusts the number of pipeline stages for different types of kernels. On AMD accelerators, set ``num_stages``
according to the following rules:
* For kernels with a single GEMM, set to ``0``.
* For kernels with two GEMMs fused (Flash Attention, or any other kernel
that fuses 2 GEMMs), set to ``1``.
* For kernels that fuse a single GEMM with another non-GEMM operator
(for example ReLU activation), set to ``0``.
* For kernels that have no GEMMs, set to ``1``.
``waves_per_eu=n``
Helps to manage Vector General Purpose Registers (VGPR) usage to achieve desired occupancy levels. This argument
hints to the compiler to reduce VGPR to achieve ``n`` occupancy. See
:ref:`Kernel occupancy <fine-tuning-llms-triton-kernel-occupancy>` for more information about how to compute
occupancy.
This argument is useful if:
* The occupancy of the kernel is limited by VGPR usage.
* The current VGPR usage is only a few above a boundary in
:ref:`Occupancy related to VGPR usage in an Instinct MI300X accelerator <fine-tuning-llms-occupancy-vgpr-table>`.
For example, according to the table, the available VGPR is 512 per execution unit (EU), and VGPR is allocated in units
of 16. If the current VGPR usage is 170, the actual requested VGPR will be 176, so the
occupancy is only 2 waves per EU since :math:`176 \times 3 > 512`. So, if you set
``waves_per_eu`` to 3, the LLVM backend tries to bring VGPR usage down so
that it might fit 3 waves per EU.
``BLOCK_M``, ``BLOCK_N``, ``BLOCK_K``
Tile sizes to be tuned to balance the memory-to-computation ratio. You want tile sizes large enough to
maximize the efficiency of memory-to-computation ratio, but small enough to parallelize the greatest number of
workgroups at the grid level.
``matrix_instr_nonkdim``
Experimental feature for Flash Attention-like kernels that determines the size of the Matrix Fused Multiply-Add
(MFMA) instruction used.
- ``matrix_instr_nonkdim = 16``: ``mfma_16x16`` is used.
- ``matrix_instr_nonkdim = 32``: ``mfma_32x32`` is used.
For GEMM kernels on an AMD MI300X accelerator, ``mfma_16x16`` typically outperforms ``mfma_32x32``, even for large
tile/GEMM sizes.
The following is an environment variable used for tuning.
``OPTIMIZE_EPILOGUE``
Setting this variable to ``1`` can improve performance by removing the ``convert_layout`` operation in the epilogue.
It should be turned on (set to ``1``) in most cases. Setting ``OPTIMIZE_EPILOGUE=1`` stores the MFMA instruction
results in the MFMA layout directly; this comes at the cost of reduced global store efficiency, but the impact on
kernel execution time is usually minimal.
By default (``0``), the results of MFMA instruction are converted to blocked layout, which leads to ``global_store``
with maximum vector length, that is ``global_store_dwordx4``.
This is done implicitly with LDS as the intermediate buffer to achieve
data exchange between threads. Padding is used in LDS to avoid bank
conflicts. This usually leads to extra LDS usage, which might reduce
occupancy.
.. note::
This variable is not turned on by default because it only
works with ``tt.store`` but not ``tt.atomic_add``, which is used in split-k and
stream-k GEMM kernels. In the future, it might be enabled with
``tt.atomic_add`` and turned on by default.
See :ref:`IR analysis <fine-tuning-llms-triton-ir-analysis>`.
TorchInductor with Triton tuning knobs
===========================================
The following are suggestions for optimizing matrix multiplication (GEMM) and convolution (``conv``) operations in PyTorch
using ``inductor``, a part of the PyTorch compilation framework. The goal is to leverage Triton to achieve better
performance.
Learn more about TorchInductor environment variables and usage in
`PyTorch documentation <https://pytorch.org/docs/2.3/torch.compiler_inductor_profiling.html>`_.
To enable a ``gemm``/``conv`` lowering to Triton, use ``inductor``'s ``max_autotune`` mode. This benchmarks a
static list of Triton configurations (``conv`` configurations for max auto-tune and ``matmul`` configurations for max
auto-tune) and uses the fastest for each shape. Note that Triton is not used if regular :doc:`MIOpen <miopen:index>`
or :doc:`rocBLAS <rocblas:index>` is faster for a specific operation.
* Set ``torch._inductor.config.max_autotune = True`` or ``TORCHINDUCTOR_MAX_AUTOTUNE=1``.
* Or, for more fine-grained control:
``torch._inductor.config.max_autotune_pointwise = True``
To enable tuning for ``pointwise``/``reduction`` ops.
``torch._inductor.config.max_autotune_gemm = True``
To enable tuning or lowering of ``mm``/``conv``\s.
``torch._inductor.max_autotune_gemm_backends/TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS``
To select the candidate backends for ``mm`` auto-tuning. Defaults to
``TRITON,ATEN,NV``. This also includes the ``CUTLASS`` tuning option. Limiting this to
``TRITON`` might improve performance by enabling more fused ``mm`` kernels
instead of going to rocBLAS.
* For ``mm`` tuning, enabling coordinate descent tuning might improve performance.
``torch._inductor.config.coordinate_descent_tuning = True`` or ``TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1``
* Inference can see large improvements on AMD GPUs by using
``torch._inductor.config.freezing=True`` or the ``TORCHINDUCTOR_FREEZING=1`` variable, which
inlines weights as constants and enables constant-folding optimizations.
* Enabling ``inductor``'s ``cpp_wrapper`` might reduce overhead. This generates
C++ code that launches Triton binaries directly with
``hipModuleLaunchKernel`` and relies on hipification.
* For NHWC convolution workloads,
``torch._inductor.config.layout_optimization=True`` or ``TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1``
can help by enforcing the channels-last format throughout the graph, avoiding
any additional transposes added by ``inductor``. Note that
``PYTORCH_MIOPEN_SUGGEST_NHWC=1`` is recommended if using this.
* To extract the generated Triton kernels, setting ``TORCH_COMPILE_DEBUG=1`` creates a
``torch_compile_debug/`` directory at the current path; the ``output_code.py`` files within
contain the code strings for the defined Triton kernels. Manual work is
then required to strip out the kernel and set up kernel
compilation and launch via Triton.
* For advanced ``matmul`` or ``conv`` configuration tuning, the ``inductor-gemm-tuner`` can
help. It implements the Triton ``conv``/``mm`` implementations used upstream
and allows specifying the inputs and the configuration tuning search space; if new
tunings are found, they can be added to the auto-tune list.
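The knobs above can also be set programmatically. The following is a minimal sketch, assuming PyTorch 2.x with ``inductor`` on a ROCm build; the toy model is illustrative.
.. code-block:: python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True               # benchmark Triton configs for mm/conv
inductor_config.coordinate_descent_tuning = True  # finer-grained mm tuning
inductor_config.freezing = True                   # inference-only: inline weights as constants

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda().half().eval()

compiled = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    out = compiled(torch.randn(8, 1024, device="cuda", dtype=torch.float16))
print(out.shape)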
Other guidelines
================
* HIP provides an environment variable, ``export HIP_FORCE_DEV_KERNARG=1``,
that can place HIP kernel arguments directly in
device memory to reduce the latency of accessing kernel arguments. It
can save 2 to 3 μs for some performance-critical kernels. Setting this variable for the FA
decode kernels containing ``splitK`` and reduce kernels can reduce the total time
by around 6 μs in the benchmark test.
* Set the clock to deterministic. Use the command ``rocm-smi --setperfdeterminism 1900`` to set the max clock speed to
1900 MHz instead of the default 2100 MHz. This can reduce the chance of a clock speed decrease due to high chip
temperatures by setting a lower cap. You can restore this setting to its default value with ``rocm-smi -r``.
* Set Non-Uniform Memory Access (NUMA) auto-balancing. Run the command ``cat /proc/sys/kernel/numa_balancing`` to check
the current setting. An output of ``0`` indicates auto-balancing is already disabled, as required. If the output is
``1``, run the command ``sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'`` to disable it.
For these settings, the ``env_check.sh`` script automates setting, resetting, and checking such environment
settings. Find the script at `<https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh>`__.
.. _fine-tuning-llms-triton-tunableop:
TunableOp
---------
`TunableOp <https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md>`_
is a feature used to define and optimize kernels that can have tunable parameters. This is useful in
optimizing the performance of custom kernels by exploring different parameter configurations to find the most efficient
setup. See more about PyTorch TunableOp :ref:`Model acceleration libraries <fine-tuning-llms-pytorch-tunableop>`.
You can easily manipulate the behavior of TunableOp through environment variables, though you could also use the C++
interface ``at::cuda::tunable::getTuningContext()``. A Python interface to the ``TuningContext`` does not yet exist.
The tuning duration defaults to ``0``, which means only one iteration is attempted. Remember: there's an overhead to
tuning. To minimize the overhead, only a limited number of iterations of a given operation are attempted. If you set
the duration to ``10``, each solution for a given operation can run as many iterations as possible within 10 ms. There
is a hard-coded upper limit of 100 iterations attempted per solution. This is a tuning parameter; if you want the
tunings to be chosen based on an average over multiple iterations, increase the allowed tuning duration.

View File

@@ -0,0 +1,484 @@
<head>
<meta charset="UTF-8">
<meta name="description" content="SmoothQuant model inference on AMD Instinct MI300X using Composable Kernel">
<meta name="keywords" content="Mixed Precision, Kernel, Inference, Linear Algebra">
</head>
# Optimizing with Composable Kernel
The AMD ROCm&trade; Composable Kernel (CK) library provides a programming model for writing performance-critical kernels for machine learning workloads. It generates a general-purpose kernel during the compilation phase through a C++ template, enabling developers to achieve operation fusions on different data precisions.
This article gives a high-level overview of CK General Matrix Multiplication (GEMM) kernel based on the design example of `03_gemm_bias_relu`. It also outlines the steps to construct the kernel and run it. Moreover, the article provides a detailed implementation of running SmoothQuant quantized INT8 models on AMD Instinct MI300X accelerators using CK.
## High-level overview: a CK GEMM instance
GEMM is a fundamental block in linear algebra, machine learning, and deep neural networks. It is defined as the operation:
{math}`E = α \times (A \times B) + β \times (D)`, with A and B as matrix inputs, α and β as scalar inputs, and D as a pre-existing matrix.
Take the commonly used linear transformation in a fully connected layer as an example. These terms correspond to input activation (A), weight (B), bias (D), and output (E), respectively. The example employs a `DeviceGemmMultipleD_Xdl_CShuffle` struct from CK library as the fundamental instance to explore the compute capability of AMD Instinct accelerators for the computation of GEMM. The implementation of the instance contains two phases:
- [Template parameter definition](#template-parameter-definition)
- [Instantiating and running the templated kernel](#instantiating-and-running-the-templated-kernel)
### Template parameter definition
The template parameters of the instance are grouped into four parameter types:
- [Parameters for determining matrix data precision](matrix-data-precision)
- [Parameters for determining matrix data layout](matrix-data-layout)
- [Parameters for determining extra operations on matrix elements](matrix-element-operation)
- [Performance-oriented tunable parameters](tunable-parameters)
<!--
================
### Figure 2
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-template_parameters.jpg
The template parameters of the selected GEMM kernel are classified into four groups. These template parameter groups should be defined properly before running the instance.
```
(matrix-data-precision)=
#### Matrix data precision
A, B, D, and E are defined as half-precision floating-point datatypes. The multiply-add results of matrix A and B are added with a pre-existing matrix D (half-precision), and the final GEMM results are also half-precision floating-points.
```c++
using ADataType = F16;
using BDataType = F16;
using AccDataType = F32;
using CShuffleDataType = F16;
using DDataType = F16;
using EDataType = F16;
```
`ADataType` and `BDataType` denote the data precision of the A and B input matrices. `AccDataType` determines the data precision used for representing the multiply-add results of A and B elements. These results are stored in a `CShuffle` module in local data share (LDS), a low-latency and high-bandwidth explicitly addressed memory used for synchronization within a workgroup, for later use.
`CShuffleDataType` denotes the data precision of `CShuffle` in LDS.
`DDataType` denotes the data precision of the pre-existing D matrix stored in GPU global memory, while `EDataType` denotes the data precision of the final output. The CK kernel supports a fusion strategy so that `CShuffle` can be added to a single pre-existing matrix in the same GPU kernel for better performance.
(matrix-data-layout)=
#### Matrix data layout
```c++
using ALayout = Row;
using BLayout = Col;
using DLayout = Row;
using ELayout = Row;
```
Following the convention of various linear algebra libraries, CK assumes that the input matrix A is an M x K matrix, meaning the matrix has M rows and K columns. Similarly, matrix B is assumed to be K x N, meaning it has K rows and N columns. In computing, row-major order and column-major order are commonly used ways to store matrices in linear storage. After understanding the matrix storage pattern, the underlying optimized memory access manner can be applied to achieve better performance depending on the storage ordering of these matrices.
(matrix-element-operation)=
#### Matrix element operation
```c++
using AElementOp = PassThrough;
using BElementOp = PassThrough;
using CDEElementOp = AddRelu;
```
CK supports pre-processing of the matrices before calculating GEMM, that is, `C = AElementOp(A) * BElementOp(B)`. It likewise supports post-processing of the GEMM results, that is, `E = CDEElementOp(C, D)`.
`AElementOp` and `BElementOp` determine the operation applied to matrix A and B separately before GEMM, which is achieved by binding the operation with a C++ struct function.
The above `PassThrough` denotes that no operation is performed on the target matrix. `CDEElementOp` determines the operations applied to the `CShuffle` output and matrix D. The following binding struct `AddRelu` shows an example of adding the `CShuffle` output and matrix D, applying a ReLU (Rectified Linear Unit) operation to the addition result, and then passing the result to matrix E.
```c++
struct AddRelu
{
__host__ __device__ void operator()(ck::half_t& e, const ck::half_t& c, const ck::half_t& d) const
{
const ck::half_t x = c + d;
e = x > 0 ? x : 0;
}
};
```
(tunable-parameters)=
#### Tunable parameters
The CK instance includes a series of tunable template parameters that control the parallel granularity of the workload to achieve load balancing on different hardware platforms.
These parameters include Block Size, M/N/K Per Block, M/N Per XDL, AK1, BK1, and so on.
- Block Size determines the number of threads in the thread block.
- M/N/K Per Block determines the size of the tile that each thread block is responsible for calculating.
- M/N Per XDL refers to the M/N size for Instinct accelerator Matrix Fused Multiply Add (MFMA) instructions operating on a per-wavefront basis.
- A/B K1 is related to the data type. It can be any value ranging from 1 to K Per Block. To achieve optimal load/store performance, 128-bit loads are suggested. In addition, the A/B loading parameters must be changed accordingly to match the A/B K1 value; otherwise, compilation errors result.
Conditions for achieving computational load balancing on different hardware platforms can vary.
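As a rough illustration of how the tile-size parameters partition a GEMM workload, the following Python sketch computes the resulting launch dimensions for hypothetical values (these are illustrative numbers, not CK defaults or CK API calls):
```python
# Illustrative tiling arithmetic only; not part of the CK API.
M, N, K = 4096, 4096, 4096                            # hypothetical GEMM problem size
block_size = 256                                      # threads per thread block
m_per_block, n_per_block, k_per_block = 256, 128, 64  # hypothetical tile sizes

# Each thread block computes one m_per_block x n_per_block tile of the output,
# stepping through the K dimension in k_per_block chunks.
thread_blocks = (M // m_per_block) * (N // n_per_block)
k_loop_iterations = K // k_per_block

print(f"thread blocks launched: {thread_blocks}")           # 512
print(f"threads per block: {block_size}")                   # 256
print(f"K-loop iterations per block: {k_loop_iterations}")  # 64
```
Larger tiles increase the work per thread block but reduce the number of blocks available to fill the accelerator's compute units, which is why the optimal values vary across hardware platforms.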
### Instantiating and running the templated kernel
After determining the template parameters, instantiate the kernel with actual arguments. Do one of the following:
- Use `GetDeviceBuffer` from CK's custom struct `DeviceMem` to pass the element values of the matrices that need to be calculated.
- Allocate a device buffer via `hipMalloc`. Ensure the device buffer size can fit the matrix.
- Pass matrix elements through the `data_ptr` method of the `Tensor` object if the matrix to be calculated is of `Tensor` type.
The row, column, and stride information of the input matrices is also passed to the instance. For batched GEMM, you must pass in additional batch count and batch stride values. The extra operations for pre- and post-processing are also passed as actual arguments; for example, α and β for GEMM scaling operations. Afterward, the instantiated kernel is launched by the invoker, as illustrated in Figure 3.
<!--
================
### Figure 3
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-kernel_launch.jpg
Templated kernel launching consists of kernel instantiation, making arguments by passing in actual application parameters, creating an invoker, and running the instance through the invoker.
```
## Developing fused INT8 kernels for SmoothQuant models
[SmoothQuant](https://github.com/mit-han-lab/smoothquant) (SQ) is a quantization algorithm that enables INT8 quantization of both weights and activations for all the matrix multiplications in an LLM. The GPU kernel functionalities required to accelerate the inference of SQ models on Instinct accelerators are shown in the following table.
:::{table} Functionalities used to implement SmoothQuant model inference.
| Functionality descriptions | Corresponding wrappers |
|:-------------------------------------|-----------------------------------------|
| {math}`E = α \times (A \times B) + β \times (D)`, where A, B, D, and E are INT8 2-D tensors | E = Linear_ABDE_I8(A, B, D, {math}`\alpha`, {math}`\beta`) |
| {math}`E = RELU(α \times (A \times B) + β \times (D))`, where A, B, D, and E are INT8 2-D tensors | E = Linear_ReLU_ABDE_I8(A, B, D, {math}`\alpha`, {math}`\beta`) |
| {math}`E = α \times (A \times B) + β \times (D)`, where A and B are INT8 2-D tensors, and D and E are FP32 2-D tensors | E = Linear_AB_I8_DE_F32(A, B, D, {math}`\alpha`, {math}`\beta`) |
| {math}`E = α \times (A \times B)`, where A, B, and E are INT8 3-D tensors | E = BMM_ABE_I8(A, B, {math}`\alpha`) |
| {math}`E = α \times (A \times B)`, where A and B are INT8 3-D tensors, and E is an FP32 3-D tensor | E = BMM_AB_I8_E_F32(A, B, {math}`\alpha`) |
:::
### Operation flow analysis
This section analyzes the operation flow of `Linear_ReLU_ABDE_I8`. The rest of the wrappers in the preceding table can be analyzed similarly.
The first operation in the process is the multiplication of input matrices A and B. The resulting matrix C is then scaled with α to obtain T1. At the same time, the process scales the elements of D with β to obtain T2. Afterward, the process performs matrix addition between T1 and T2, element activation using ReLU, and element rounding, sequentially. The operations that generate E1, E2, and E are encapsulated in a user-defined template function in CK (given in the next subsection). This template function is integrated directly into the fundamental instance during the compilation phase so that all these steps are fused into a single GPU kernel.
<!--
================
### Figure 4
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-operation_flow.jpg
Operation flow.
```
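As a reference for the arithmetic only, the operation flow above can be sketched in NumPy as follows. This illustrates the math that the fused kernel implements, not the CK implementation itself; the rounding and saturation behavior is an assumption based on the description above.
```python
import numpy as np

def linear_relu_abde_i8_reference(A, B, D, alpha, beta):
    """NumPy sketch of the fused operation flow (not the CK kernel)."""
    # Multiply the INT8 inputs in a wider INT32 accumulator (matrix C).
    C = A.astype(np.int32) @ B.astype(np.int32)
    # T1: scale the A*B result with alpha; T2: scale D with beta.
    T1 = alpha * C.astype(np.float32)
    T2 = beta * D.astype(np.float32)
    # Fused epilogue: add, ReLU, round, and saturate to the INT8 range.
    E = np.clip(np.rint(np.maximum(T1 + T2, 0.0)), 0, 127)
    return E.astype(np.int8)
```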
The CK library contains many fundamental instances that implement different functions. First, familiarize yourself with the names of the various CK instances and determine whether they meet the target functional requirements.
Second, consider whether the format of the input data meets your actual calculation needs. For SQ models, the 8-bit integer (INT8) data format is applied for matrix calculations.
Third, consider the platform for implementing CK instances. Instances suffixed with `xdl` run only on AMD Instinct accelerators and cannot run on Radeon-series GPUs, due to the device-specific instruction sets underlying these fundamental instances.
Here, we use [DeviceBatchedGemmMultiD_Xdl](https://github.com/ROCm/composable_kernel/tree/develop/example/24_batched_gemm) as the fundamental instance to implement the functionalities in the previous table.
<!--
================
### Figure 5
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-root_instance.jpg
Use the DeviceBatchedGemmMultiD_Xdl instance as a root.
```
The `DeviceBatchedGemmMultiD_Xdl` instance realizes the batched GEMM `BMM_ABE_I8` and `BMM_AB_I8_E_F32` kernels directly by using the proper input and output data precision types.
Based on the two batched GEMM kernels, the GEMM kernels `Linear_ABDE_I8` and `Linear_AB_I8_DE_F32` can be implemented by expanding their input 2-D tensors to 3-D tensors. The 3-D output tensors produced by the root instance are then squeezed back to 2-D output tensors before returning.
For example, unsqueeze A (M, K) to A (1, M, K) before passing it to the root instance, and squeeze E (1, M, N) to (M, N) after the root instance's calculation returns. `Linear_ReLU_ABDE_I8` is implemented by adding a ReLU operation to the output of `Linear_ABDE_I8`.
### Developing the complete function
Inference of SQ quantized models relies on the PyTorch and Transformers libraries, which represent matrices and vectors as tensors. Therefore, the C++ data types in CK need to be replaced with the `torch::Tensor` type: the input and output matrices should be tensors.
In GEMM, the A and B inputs are two-dimensional matrices, while the required input matrices of the selected fundamental CK instance are three-dimensional. Therefore, we must convert the input 2-D tensors to 3-D tensors using `tensor`'s `unsqueeze()` method before passing these matrices to the instance. For the batched GEMM wrappers in the preceding table, skip this step.
```c++
// Function input and output
torch::Tensor linear_relu_abde_i8(
    torch::Tensor A_,
    torch::Tensor B_,
    torch::Tensor D_,
    float alpha,
    float beta)
{
    // Convert torch::Tensor A_ (M, K) to torch::Tensor A (1, M, K)
    auto A = A_.unsqueeze(0);

    // Convert torch::Tensor B_ (K, N) to torch::Tensor B (1, K, N)
    auto B = B_.unsqueeze(0);
    ...
```
As shown in the following code block, we obtain the M, N, and K values from the input tensor sizes. This size information is used to reshape the input vector D and to allocate the storage space for tensor E. Stride reflects the exact number of contiguous elements in memory, and the strides are passed as important parameters to the fundamental instance for GPU kernel use.
```c++
// Return the batch count from the size of dimension 0
int batch_count = A.size(0);

// Return M, N, and K from the sizes of dimensions 1 and 2
int M = A.size(1);
int N = B.size(1);
int K = A.size(2);

// Initialize the stride sizes for A, B, D, and E
int stride_A = K;
int stride_B = K;
int stride_D0 = N;
int stride_E = N;

// Initialize the batch stride sizes for A, B, D, and E
long long int batch_stride_A = M * K;
long long int batch_stride_B = K * N;
long long int batch_stride_D0 = M * N;
long long int batch_stride_E = M * N;

// Expand the 1-D tensor D to a 2-D (M, N) tensor
auto D = D_.view({1, -1}).repeat({M, 1});

// Allocate memory for E
auto E = torch::empty({batch_count, M, N},
                      torch::dtype(torch::kInt8).device(A.device()));
```
In the following code block, `ADataType`, `BDataType`, and `D0DataType` denote the data precision of the input tensors A, B, and D, respectively. `EDataType` denotes the data precision of the output tensor E. These parameters are set to the `I8` (8-bit integer) data format to meet the kernel's design requirements.
`AccDataType` determines the data precision used to represent the multiply-add results of A and B elements. Generally, a data type with a larger range is applied to store the multiply-add results of A and B to avoid overflow; `I32` is applied in this case. The `CShuffleDataType` `I32` data type indicates that the multiply-add results continue to be stored in LDS in the `I32` data format. All of this is implemented through the following code block.
```c++
// Data precision
using ADataType = I8;
using BDataType = I8;
using AccDataType = I32;
using CShuffleDataType = I32;
using D0DataType = I8;
using DsDataType = ck::Tuple<D0DataType>;
using EDataType = I8;
```
Following the convention of various linear algebra libraries, row-major and column-major orders are used to denote the ways of storing matrices in linear storage. The advantage of specifying matrix B as column-major is that all the relevant matrix elements are stored contiguously in GPU global memory when a row of A is multiplied by a column of B, which helps the GPU achieve coalesced memory access and improves access performance.
```c++
// Specify tensor order
using ALayout = RowMajor;
using BLayout = ColumnMajor;
using D0Layout = RowMajor;
using DsLayout = ck::Tuple<D0Layout>;
using ELayout = RowMajor;
```
In CK, `PassThrough` is a struct denoting that no operation is applied to the tensor it binds to. To fuse the operations between E1, E2, and E introduced in section [Operation flow analysis](#operation-flow-analysis), we define a custom C++ struct, `ScaleScaleAddRelu`, and bind it to `CDEElementOp`. It determines the operations applied to `CShuffle` (the A×B results), tensor D, α, and β.
```c++
// No operations bound to the elements of A and B
using AElementOp = PassThrough;
using BElementOp = PassThrough;
// Operations bound to the elements of C, D and E
using CDEElementOp = ScaleScaleAddRelu;
```
In the binding struct, `operator()` performs an addition operation between `CShuffle` and matrix D, a ReLU operation on the addition result, and a rounding and saturation operation on the output elements. It then returns the result to E.
```c++
struct ScaleScaleAddRelu {
    template <>
    __host__ __device__ constexpr void
    operator()<I8, I32, I8>(I8& e, const I32& c, const I8& d) const
    {
        // Scale the AxB result with alpha
        const F32 c_scale = ck::type_convert<F32>(c) * alpha;

        // Scale D with beta
        const F32 d_scale = ck::type_convert<F32>(d) * beta;

        // Perform the addition operation
        F32 temp = c_scale + d_scale;

        // Perform the ReLU operation
        temp = temp > 0 ? temp : 0;

        // Saturate to the INT8 upper bound
        temp = temp > 127 ? 127 : temp;

        // Return the result to E
        e = ck::type_convert<I8>(temp);
    }

    F32 alpha;
    F32 beta;
};
```
The original input tensors need to be padded to meet the requirements of GPU tile-based parallelism.
```c++
static constexpr auto GemmDefault = ck::tensor_operation::device::GemmSpecialization::MNKPadding;
```
The template parameters of the target fundamental instance are initialized with the parameters above, along with default tunable parameters. For specific tuning methods, see [Tunable parameters](#tunable-parameters).
```c++
using DeviceOpInstance = ck::tensor_operation::device::DeviceBatchedGemmMultiD_Xdl<
    // Tensor layout
    ALayout, BLayout, DsLayout, ELayout,
    // Tensor data type
    ADataType, BDataType, AccDataType, CShuffleDataType, DsDataType, EDataType,
    // Tensor operation
    AElementOp, BElementOp, CDEElementOp,
    // Padding strategy
    GemmDefault,
    // Tunable parameters
    tunable parameters>;
```
Obtain the address of the first element of each tensor:
```c++
auto A_ref = A.data_ptr<ADataType>();
auto B_ref = B.data_ptr<BDataType>();
auto D0_ref = D.data_ptr<D0DataType>();
auto E_ref = E.data_ptr<EDataType>();
```
The fundamental instance is then initialized and run with actual arguments:
```c++
auto device_op = DeviceOpInstance{};
auto invoker = device_op.MakeInvoker();
auto argument = device_op.MakeArgument(
    A_ref, B_ref, {D0_ref}, E_ref,
    M, N, K,
    batch_count,
    stride_A, stride_B, {stride_D0}, stride_E,
    batch_stride_A, batch_stride_B, {batch_stride_D0}, batch_stride_E,
    AElementOp{}, BElementOp{}, CDEElementOp{alpha, beta});

invoker.Run(argument, StreamConfig{nullptr, 0});
```
The output of the fundamental instance is a calculated batched matrix E (batch, M, N). Before returning, it needs to be converted to a 2-D matrix if a normal GEMM result is required.
```c++
// Convert (1, M, N) to (M, N)
return E.squeeze(0);
```
### Binding to Python
Since these functions are written in C++ using the `torch::Tensor` type, you can use `pybind11` to bind them and import them as Python modules. For this example, the binding code that exposes the functions in the table spans only a few lines.
```c++
#include <torch/extension.h>
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    m.def("linear_ab_i8_de_f32", &linear_ab_i8_de_f32);
    m.def("linear_relu_abde_i8", &linear_relu_abde_i8);
    m.def("linear_abde_i8", &linear_abde_i8);
    m.def("bmm_abe_i8", &bmm_abe_i8);
    m.def("bmm_ab_i8_e_f32", &bmm_ab_i8_e_f32);
}
```
Build the C++ extension by writing a `setup.py` script that uses `setuptools` to compile the C++ code. A reference implementation of the `setup.py` script is as follows.
```python
import os
from setuptools import setup, find_packages
from torch.utils import cpp_extension
from torch.utils.cpp_extension import BuildExtension

os.environ["CC"] = "hipcc"
os.environ["CXX"] = "hipcc"

sources = [
    'torch_int/kernels/linear.cpp',
    'torch_int/kernels/bmm.cpp',
    'torch_int/kernels/pybind.cpp',
]

include_dirs = ['torch_int/kernels/include']
extra_link_args = ['libutility.a']
extra_compile_args = ['-O3', '-DNDEBUG', '-std=c++17',
                      '--offload-arch=gfx942', '-DCK_ENABLE_INT8',
                      '-D__HIP_PLATFORM_AMD__=1']

setup(
    name='torch_int',
    ext_modules=[
        cpp_extension.CUDAExtension(
            name='torch_int.rocm',
            sources=sources,
            include_dirs=include_dirs,
            extra_link_args=extra_link_args,
            extra_compile_args=extra_compile_args
        ),
    ],
    cmdclass={
        'build_ext': BuildExtension.with_options(use_ninja=False)
    },
    packages=find_packages(
        exclude=['notebook', 'scripts', 'tests']),
)
```
Run `python setup.py install` to build and install the extension. The output should look something like Figure 6:
<!--
================
### Figure 6
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-compilation.jpg
Compilation and installation of the INT8 kernels.
```
### INT8 model inference and performance
The implementation architecture for running SmoothQuant models on MI300X accelerators is illustrated in Figure 7, where (a) shows the decoder layer composition of the target model, (b) shows the major implementation classes for the decoder layer components, and (c) denotes the underlying GPU kernels implemented with CK instances.
<!--
================
### Figure 7
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-inference_flow.jpg
The implementation architecture of running SmoothQuant models on AMD MI300X accelerators.
```
For the target [SQ quantized model](https://huggingface.co/mit-han-lab/opt-13b-smoothquant), each decoder layer contains three major components: attention calculation, layer normalization, and linear transformation in fully connected layers. The corresponding implementation classes for these components are:
- `Int8OPTAttention`
- `W8A8B8O8LinearReLU`
- `W8A8BF32OF32Linear`
The underlying implementation logic of these classes harnesses the functions in the previous table. Note that in this example, the `LayerNormQ` module is implemented with the native torch module.
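For illustration, once the extension described in the previous section is built, the wrappers can be called from Python roughly as follows. The module name `torch_int.rocm` comes from the build script shown earlier; the shapes and scaling factors here are hypothetical.
```python
import torch
import torch_int.rocm as ck_int8  # extension built by the setup.py above

# Hypothetical shapes and scales for E = ReLU(alpha * (A x B) + beta * D)
M, N, K = 128, 256, 512
A = torch.randint(-128, 128, (M, K), dtype=torch.int8, device="cuda")
B = torch.randint(-128, 128, (K, N), dtype=torch.int8, device="cuda")
D = torch.randint(-128, 128, (N,), dtype=torch.int8, device="cuda")

# Fused INT8 GEMM + scaled addition + ReLU, returning an INT8 (M, N) tensor
E = ck_int8.linear_relu_abde_i8(A, B, D, 0.05, 0.01)
```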
Testing environment:
The hardware platform used for testing was equipped with AMD EPYC 9534 64-core processors (256 CPU cores in total), 8 AMD Instinct MI300X accelerators, and 1.5 TB of memory. Testing was done in a publicly available Docker image from Docker Hub:
[`rocm/pytorch:rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2`](https://hub.docker.com/layers/rocm/pytorch/rocm6.1_ubuntu22.04_py3.10_pytorch_2.1.2/images/sha256-f6ea7cee8aae299c7f6368187df7beed29928850c3929c81e6f24b34271d652b)
The tested models were the OPT-1.3B, 2.7B, 6.7B, and 13B FP16 models; the corresponding SmoothQuant INT8 OPT models were obtained from Hugging Face.
Note that since the default values were used for the tunable parameters of the fundamental instance, the performance of the INT8 kernel is suboptimal.
Figure 8 shows the performance comparisons between the original FP16 models and the SmoothQuant-quantized INT8 models on a single MI300X accelerator. The GPU memory footprints of the SmoothQuant-quantized models are significantly reduced, and the per-sample inference latency is significantly reduced for all SmoothQuant-quantized OPT models (illustrated in (b)). Notably, the performance of the CK instance-based INT8 kernel steadily improves with increasing model size.
<!--
================
### Figure 8
================ -->
```{figure} ../../data/how-to/fine-tuning-llms/ck-comparisons.jpg
Performance comparisons between the original FP16 and the SmoothQuant-quantized INT8 models on a single MI300X accelerator.
```
For accuracy comparisons between the original FP16 and INT8 models, the evaluation uses the first 1,000 samples from the LAMBADA dataset's validation set. We employ the same last token prediction accuracy method introduced in [SmoothQuant Real-INT8 Inference for PyTorch](https://github.com/mit-han-lab/smoothquant/blob/main/examples/smoothquant_opt_real_int8_demo.ipynb) as our evaluation metric. The comparison results are shown in Table 2.
:::{table} The inference accuracy comparisons of SmoothQuant quantized models on Instinct MI300X.
| Models | Hugging Face FP16 model accuracy | SmoothQuant quantized INT8 model accuracy |
|:-----------------|----------------------------------------|---------------------------------------------|
| opt-1.3B | 0.72 | 0.70 |
| opt-2.7B | 0.76 | 0.75 |
| opt-6.7B | 0.80 | 0.79 |
| opt-13B | 0.79 | 0.77 |
:::
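For reference, a minimal sketch of the last token prediction accuracy measurement might look like the following. The dataset field name and the `model` and `tokenizer` variables are assumptions; see the linked SmoothQuant notebook for the exact procedure.
```python
import torch
from datasets import load_dataset

# Assumes `model` and `tokenizer` are already loaded (FP16 or INT8 OPT).
dataset = load_dataset("lambada", split="validation[:1000]")

correct = 0
for sample in dataset:
    ids = tokenizer(sample["text"], return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        # Predict the last token from all preceding tokens.
        logits = model(ids[:, :-1]).logits
    correct += int(logits[0, -1].argmax().item() == ids[0, -1].item())

print(f"Last-token accuracy: {correct / len(dataset):.2f}")
```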
## Conclusion
CK provides a rich set of template parameters for generating flexible accelerated computing kernels for different application scenarios.
CK supports multiple instruction sets of AMD Instinct GPUs, operator fusion, and different data precisions. Its composability helps users quickly construct and verify operator performance.
With CK, you can build more effective AI applications with higher flexibility and better performance on different AMD accelerator platforms.
View File
@@ -0,0 +1,104 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, optimization, LoRA, walkthrough
***************************************
Conceptual overview of fine-tuning LLMs
***************************************
Large language models (LLMs) are trained on massive amounts of text data to generate coherent and fluent text. The
underlying *transformer* architecture is the fundamental building block of all LLMs. Transformers
enable LLMs to understand and generate text by capturing contextual relationships and long-range dependencies. To better
understand the philosophy of the transformer architecture, review the foundational
`Attention is all you need <https://arxiv.org/pdf/1706.03762.pdf>`_ paper.
By further training pre-trained LLMs, the fine-tuned model can gain knowledge related to specific fields or tasks,
thereby significantly improving its performance in that field or task. The core idea of fine-tuning is to use the
parameters of the pre-trained model as the starting point for new tasks and shape it through a small amount of
specific domain or task data, expanding the original model's capability to new tasks or datasets.
Fine-tuning can effectively improve the performance of existing pre-trained models in specific application scenarios.
Continuous training and adjustment of the parameters of the base model in the target domain or task can better capture
the semantic characteristics and patterns in specific scenarios, thereby significantly improving the key indicators of
the model in that domain or task. For example, by fine-tuning the Llama 2 model, its performance in certain applications
can be improved over the base model.
.. _fine-tuning-llms-concept-challenge:
The challenge of fine-tuning models
===================================
However, the computational cost of fine-tuning is still high, especially for complex models and large datasets, which
poses distinct challenges related to substantial computational and memory requirements. This can be a barrier for
accelerators or GPUs with low compute power or limited device memory resources.
For example, suppose we have a language model with 7 billion (7B) parameters, represented by a weight matrix :math:`W`.
During backpropagation, the model needs to learn a :math:`ΔW` matrix, which updates the original weights to minimize the
value of the loss function.
The weight update is as follows: :math:`W_{updated} = W + ΔW`.
If the weight matrix :math:`W` contains 7B parameters, then the weight update matrix :math:`ΔW` should also
contain 7B parameters. Therefore, the :math:`ΔW` calculation is computationally and memory intensive.
.. figure:: ../../data/how-to/fine-tuning-llms/weight-update.png
:alt: Weight update diagram
(a) Weight update in regular fine-tuning. (b) Weight update in LoRA where the product of matrix A (:math:`M\times K`)
and matrix B (:math:`K\times N`) is :math:`ΔW(M\times N)`; dimension K is a hyperparameter. By representing
:math:`ΔW` as the product of two smaller matrices (A and B) with a lower rank K, the number of trainable parameters
is significantly reduced.
.. _fine-tuning-llms-concept-optimizations:
Optimizations for model fine-tuning
===================================
Low-Rank Adaptation (LoRA) is a technique for fast and cost-effective fine-tuning of state-of-the-art LLMs that
overcomes the issue of high memory consumption.
LoRA accelerates the adjustment process and reduces related memory costs. To be precise, LoRA decomposes the weight
changes :math:`ΔW` into high-precision low-rank representations, which avoids calculating the full
:math:`ΔW`. It learns the decomposed representation of :math:`ΔW` during training, as shown in
the :ref:`weight update diagram <fine-tuning-llms-concept-challenge>`. This is how LoRA saves on
computing resources.
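As a concrete illustration, consider a single hypothetical :math:`4096 \times 4096` weight matrix adapted with a LoRA
rank of :math:`K = 8`:

.. math::

   \underbrace{4096 \times 4096}_{\text{full } \Delta W} \approx 16.8\text{M parameters}
   \qquad \text{vs.} \qquad
   \underbrace{4096 \times 8 \; + \; 8 \times 4096}_{A \text{ and } B} = 65{,}536 \text{ parameters}

For this matrix, LoRA trains roughly 256 times fewer parameters than full fine-tuning, and the saving compounds across
every adapted weight matrix in the model.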
LoRA is integrated into the `Hugging Face Parameter-Efficient Fine-Tuning (PEFT)
<https://huggingface.co/docs/peft/en/index>`_ library, as well as other computation and memory efficiency optimization
variants for model fine-tuning such as `AdaLoRA <https://huggingface.co/docs/peft/en/package_reference/adalora>`_. This
library efficiently adapts large pre-trained models to various downstream applications without fine-tuning all model
parameters. PEFT methods only fine-tune a few model parameters, significantly decreasing computational and storage
costs while yielding performance comparable to a fully fine-tuned model. PEFT is integrated with the `Hugging Face
Transformers <https://huggingface.co/docs/transformers/en/index>`_ library, providing a faster and easier way to load,
train, and use large models for inference.
To simplify running a fine-tuning implementation, the `Transformer Reinforcement Learning (TRL)
<https://huggingface.co/docs/trl/en/index>`_ library provides a set of tools to train transformer language models with
reinforcement learning, from the Supervised Fine-Tuning step (SFT), Reward Modeling step (RM), to the Proximal Policy
Optimization (PPO) step. The ``SFTTrainer`` API in TRL encapsulates these PEFT optimizations so you can easily import
your custom training configuration and run the training process.
.. _fine-tuning-llms-walkthrough-desc:
Walkthrough
===========
To demonstrate the benefits of LoRA and the ideal compute compatibility of using PEFT and TRL libraries on AMD
ROCm-compatible accelerators and GPUs, let's step through a comprehensive implementation of the fine-tuning process
using the Llama 2 7B model with LoRA tailored specifically for question-and-answer tasks on AMD MI300X accelerators.
Before starting, review and understand the key components of this walkthrough:
- `Llama 2 <https://huggingface.co/meta-llama>`_: a family of large language models developed and publicly released by
Meta. Its variants range in scale from 7 billion to 70 billion parameters.
- Fine-tuning: a critical process that refines LLMs for specialized tasks and optimizes performance.
- LoRA: a memory-efficient implementation of LLM fine-tuning that significantly reduces the number of trainable
parameters.
- `SFTTrainer <https://huggingface.co/docs/trl/v0.8.6/en/sft_trainer#supervised-fine-tuning-trainer>`_: an optimized
trainer with a simple interface to easily fine-tune pre-trained models with PEFT adapters, for example, LoRA, for
memory efficiency purposes on a custom dataset.
Continue the walkthrough in :doc:`Fine-tuning and inference <fine-tuning-and-inference>`.
View File
@@ -0,0 +1,217 @@
.. meta::
:description: How to fine-tune LLMs with ROCm
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, profiling, debugging, performance, Triton
***********************
Profiling and debugging
***********************
This section discusses profiling and debugging tools and some of their common usage patterns with ROCm applications.
PyTorch Profiler
================
`PyTorch Profiler <https://pytorch.org/docs/stable/profiler.html>`_ can be invoked inside Python scripts, letting you
collect CPU and GPU performance metrics while the script is running. See the `PyTorch Profiler tutorial
<https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html>`_ for more information.
You can then visualize and view these metrics using an open-source profile visualization tool like
`Perfetto UI <https://ui.perfetto.dev>`_.
#. Use the following snippet to invoke PyTorch Profiler in your code.
.. code-block:: python
   import torch
   import torchvision.models as models
   from torch.profiler import profile, record_function, ProfilerActivity

   model = models.resnet18().cuda()
   inputs = torch.randn(2000, 3, 224, 224).cuda()

   with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
       with record_function("model_inference"):
           model(inputs)

   prof.export_chrome_trace("resnet18_profile.json")
#. Profile results in ``resnet18_profile.json`` can be viewed by the Perfetto visualization tool. Go to
`<https://ui.perfetto.dev>`__ and import the file. In the Perfetto visualization, the upper section
shows transactions denoting the CPU activities that launch GPU kernels, while the lower section shows the actual GPU
activities that process the ``resnet18`` inference layer by layer.
.. figure:: ../../data/how-to/fine-tuning-llms/perfetto-trace.svg
Perfetto trace visualization example.
ROCm profiling tools
====================
Heterogeneous systems, where programs run on both CPUs and GPUs, introduce additional complexities. Understanding the
critical path and kernel execution is all the more important, making performance tuning a necessary component of the
benchmarking process.
With AMD's profiling tools, developers can gain important insight into how efficiently their application uses
hardware resources and effectively diagnose potential bottlenecks contributing to poor performance. Developers
working with AMD Instinct accelerators have multiple tools to choose from depending on their specific profiling needs:
* :ref:`ROCProfiler <fine-tuning-llms-profiling-rocprof>`
* :ref:`Omniperf <fine-tuning-llms-profiling-omniperf>`
* :ref:`Omnitrace <fine-tuning-llms-profiling-omnitrace>`
.. _fine-tuning-llms-profiling-rocprof:
ROCProfiler
-----------
:doc:`ROCProfiler <rocprofiler:index>` is primarily a low-level API for accessing and extracting GPU hardware performance
metrics, commonly called *performance counters*. These counters quantify the performance of the underlying architecture,
showcasing which pieces of the computational pipeline and memory hierarchy are being utilized.
Your ROCm installation contains a script or executable command called ``rocprof`` which provides the ability to list all
available hardware counters for your specific accelerator or GPU, and run applications while collecting counters during
their execution.
This ``rocprof`` utility also depends on the :doc:`ROCTracer and ROC-TX libraries <roctracer:index>`, giving it the
ability to collect timeline traces of the accelerator software stack as well as user-annotated code regions.
.. note::
``rocprof`` is a CLI-only utility, so input and output take the format of ``.txt`` and CSV files. These
formats provide a raw view of the data and put the onus on the user to parse and analyze it. Therefore, ``rocprof``
gives the user full access to and control of raw performance profiling data, but requires extra effort to analyze the
collected data.
.. _fine-tuning-llms-profiling-omniperf:
Omniperf
--------
`Omniperf <https://rocm.github.io/omniperf>`_ is a system performance profiler for high-performance computing (HPC) and
machine learning (ML) workloads using Instinct accelerators. Under the hood, Omniperf uses
:ref:`ROCProfiler <fine-tuning-llms-profiling-rocprof>` to collect hardware performance counters. The Omniperf tool performs
system profiling based on all approved hardware counters for Instinct
accelerator architectures. It provides high-level performance analysis features including System Speed-of-Light, IP
block Speed-of-Light, Memory Chart Analysis, Roofline Analysis, Baseline Comparisons, and more.
Omniperf takes the guesswork out of profiling by removing the need to provide text input files with lists of counters
to collect and to analyze raw CSV output files, as is the case with ROCProfiler. Instead, Omniperf automates the collection
of all available hardware counters in one command and provides a graphical interface to help users understand and
analyze bottlenecks and stressors for their computational workloads on AMD Instinct accelerators.
.. note::
Omniperf collects hardware counters in multiple passes, and will therefore re-run the application during each pass
to collect different sets of metrics.
.. figure:: ../../data/how-to/fine-tuning-llms/omniperf-analysis.png
Omniperf memory chart analysis panel.
In brief, Omniperf provides details about hardware activity for a particular GPU kernel. It also supports either
a web-based GUI or a command-line analyzer, depending on your preference.
.. _fine-tuning-llms-profiling-omnitrace:
Omnitrace
---------
`Omnitrace <https://rocm.github.io/omnitrace>`_ is a comprehensive profiling and tracing tool for parallel applications,
including HPC and ML packages, written in C, C++, Fortran, HIP, OpenCL, or Python, that execute on the CPU or on both
the CPU and GPU. It is capable of gathering the performance information of functions through any combination of binary
instrumentation, call-stack sampling, user-defined regions, and Python interpreter hooks.
Omnitrace supports interactive visualization of comprehensive traces in the web browser in addition to high-level
summary profiles with ``mean/min/max/stddev`` statistics. Beyond runtime
information, Omnitrace supports the collection of system-level metrics such as CPU frequency, GPU temperature, and GPU
utilization. Process and thread level metrics such as memory usage, page faults, context switches, and numerous other
hardware counters are also included.
.. tip::
When analyzing the performance of an application, it is best not to assume you know where the performance
bottlenecks are and why they are happening. Omnitrace is the ideal tool for characterizing where optimization would
have the greatest impact on the end-to-end execution of the application and to discover what else is happening on the
system during a performance bottleneck.
.. figure:: ../../data/how-to/fine-tuning-llms/omnitrace-timeline.png
Omnitrace timeline trace example.
For detailed usage and examples of these tools, refer to the
`Introduction to profiling tools for AMD hardware <https://rocm.blogs.amd.com/software-tools-optimization/profilers/README.html>`_
developer blog.
Debugging with ROCm Debug Agent
===============================
ROCm Debug Agent (:doc:`ROCdebug-agent <rocr_debug_agent:index>`) is a library that can be loaded by the ROCm platform
runtime (:doc:`ROCr <rocr-runtime:index>`) to provide the following functionalities for all AMD accelerators and GPUs
supported by the ROCm Debugger API (:doc:`ROCdbgapi <rocdbgapi:index>`).
* Print the state of all AMD accelerator or GPU wavefronts that caused a queue error; for example, causing a memory
violation, executing an ``s_trap2``, or executing an illegal instruction.
* Print the state of all AMD accelerator or GPU wavefronts by sending a ``SIGQUIT`` signal to the process in question;
for example, by pressing ``Ctrl + \`` while the process is executing.
Debugging memory access faults
------------------------------
Identifying a faulting kernel is often enough to triage a memory access fault. To that end, the
`ROCm Debug Agent <https://github.com/ROCm/rocr_debug_agent/>`_ can trap a memory access fault and provide a dump of all
active wavefronts that caused the error as well as the name of the kernel. The
`AMD ROCm Debug Agent Library README <https://github.com/ROCm/rocr_debug_agent/blob/master/README.md>`_ provides full
instructions, but in brief:
* Compiling with ``-ggdb -O0`` is recommended but not required.
* ``HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2 HSA_ENABLE_DEBUG=1 ./my_program``
When the debug agent traps the fault, it will produce an extremely
verbose output of all wavefront registers and memory content.
Importantly, it also prints something like:
.. code-block:: shell
Disassembly for function vector_add_assert_trap(int*, int*, int*):
code object:
file:////rocm-debug-agent/build/test/rocm-debug-agent-test#offset=14309&size=31336
loaded at: [0x7fd4f100c000-0x7fd4f100e070]
The kernel name and the code object file should be listed. In the
example above, the kernel name is ``vector_add_assert_trap``, but this might
also look like:
.. code-block:: shell
Disassembly for function memory:///path/to/codeobject#offset=1234&size=567:
In this case, it is an in-memory kernel that was generated at runtime.
Using the following environment variable, the debug agent will save all code objects to the current directory (use
``--save-code-objects=[DIR]`` to place them in another location). The code objects will be renamed from the URI format
with special characters replaced by ``_``.
.. code-block:: shell
ROCM_DEBUG_AGENT_OPTIONS="--all --save-code-objects"
Use the ``llvm-objdump`` command to disassemble the indicated in-memory
code object that has now been saved to disk. The name of the kernel is
often found inside the disassembled code object.
.. code-block:: shell
llvm-objdump --disassemble-all path/to/code-object.co
Consider turning off memory caching strategies both within the ROCm
stack and PyTorch where possible. This gives the debug agent the
best chance of finding the memory fault where it originates. Otherwise,
it could be masked by writing past the end of a cached block within a
larger allocation.
.. code-block:: shell
PYTORCH_NO_HIP_MEMORY_CACHING=1
HSA_DISABLE_FRAGMENT_ALLOCATOR=1
View File
@@ -0,0 +1,510 @@
.. meta::
:description: Model fine-tuning and inference on a single-GPU system
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, single-GPU, LoRA, PEFT, inference
****************************************************
Fine-tuning and inference using a single accelerator
****************************************************
This section explains model fine-tuning and inference techniques on a single-accelerator system. See
:doc:`Multi-accelerator fine-tuning <multi-gpu-fine-tuning-and-inference>` for a setup with multiple accelerators or
GPUs.
.. _fine-tuning-llms-single-gpu-env:
Environment setup
=================
This section was tested using the following hardware and software environment.
.. list-table::
:stub-columns: 1
* - Hardware
- AMD Instinct MI300X accelerator
* - Software
- ROCm 6.1, Ubuntu 22.04, PyTorch 2.1.2, Python 3.10
* - Libraries
- ``transformers`` ``datasets`` ``huggingface-hub`` ``peft`` ``trl`` ``scipy``
* - Base model
- ``meta-llama/Llama-2-7b-chat-hf``
.. _fine-tuning-llms-single-gpu-env-setup:
Setting up the base implementation environment
----------------------------------------------
#. Install PyTorch for ROCm. Refer to the
:doc:`PyTorch installation guide <rocm-install-on-linux:how-to/3rd-party/pytorch-install>`. For a consistent
installation, it's recommended to use official ROCm prebuilt Docker images with the framework pre-installed.
#. In the Docker container, check the availability of ROCm-capable accelerators using the following command.
.. code-block:: shell
rocm-smi --showproductname
Your output should look like this:
.. code-block:: shell
============================ ROCm System Management Interface ============================
====================================== Product Info ======================================
GPU[0] : Card series: AMD Instinct MI300X OAM
GPU[0] : Card model: 0x74a1
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: MI3SRIOV
==========================================================================================
================================== End of ROCm SMI Log ===================================
#. Check that your accelerators are available to PyTorch.
.. code-block:: python
import torch
print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())
If successful, your output should look like this:
.. code-block:: shell
>>> print("Is a ROCm-GPU detected? ", torch.cuda.is_available())
Is a ROCm-GPU detected? True
>>> print("How many ROCm-GPUs are detected? ", torch.cuda.device_count())
How many ROCm-GPUs are detected? 4
#. Install the required dependencies.
bitsandbytes is a library that facilitates quantization to improve the efficiency of deep learning models. Learn more
about its use in :doc:`model-quantization`.
See the :ref:`Optimizations for model fine-tuning <fine-tuning-llms-concept-optimizations>` for a brief discussion on
PEFT and TRL.
.. code-block:: shell
# Install `bitsandbytes` for ROCm 6.0+.
# Use -DBNB_ROCM_ARCH to target a specific GPU architecture.
git clone --recurse https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DBNB_ROCM_ARCH="gfx942" -DCOMPUTE_BACKEND=hip -S .
python setup.py install
# To leverage the SFTTrainer in TRL for model fine-tuning.
pip install trl
# To leverage PEFT for efficiently adapting pre-trained language models.
pip install peft
# Install the other dependencies.
pip install transformers datasets huggingface-hub scipy
#. Check that the required packages can be imported.
.. code-block:: python
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer
.. _fine-tuning-llms-single-gpu-download-model-dataset:
Download the base model and fine-tuning dataset
-----------------------------------------------
#. Request access to download `Meta's official Llama model <https://huggingface.co/meta-llama>`_ from Hugging
Face. After permission is granted, log in with the following command using your personal access token:
.. code-block:: shell
huggingface-cli login
.. note::
You can also use the `NousResearch Llama-2-7b-chat-hf <https://huggingface.co/NousResearch/Llama-2-7b-chat-hf>`_
as a substitute. It has the same model weights as the original.
#. Run the following code to load the base model and tokenizer.
.. code-block:: python
# Base model and tokenizer names.
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Load base model to GPU memory.
device = "cuda:0"
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True).to(device)
# Load tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
#. Now, let's fine-tune the base model for a question-and-answer task using a small dataset called
`mlabonne/guanaco-llama2-1k <https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k>`_, which is a 1,000-sample
subset of the `timdettmers/openassistant-guanaco <https://huggingface.co/datasets/OpenAssistant/oasst1>`_ dataset.
.. code-block:: python
# Dataset for fine-tuning.
training_dataset_name = "mlabonne/guanaco-llama2-1k"
training_dataset = load_dataset(training_dataset_name, split = "train")
# Check the data.
print(training_dataset)
# Sample 11 is a QA sample in English.
print(training_dataset[11])
#. With the base model and the dataset, let's start fine-tuning!
.. _fine-tuning-llms-single-gpu-configure-params:
Configure fine-tuning parameters
--------------------------------
To set up ``SFTTrainer`` parameters, you can use the following code as reference.
.. code-block:: python
# Training parameters for SFTTrainer.
training_arguments = TrainingArguments(
output_dir = "./results",
num_train_epochs = 1,
per_device_train_batch_size = 4,
gradient_accumulation_steps = 1,
optim = "paged_adamw_32bit",
save_steps = 50,
logging_steps = 50,
learning_rate = 4e-5,
weight_decay = 0.001,
fp16=False,
bf16=False,
max_grad_norm = 0.3,
max_steps = -1,
warmup_ratio = 0.03,
group_by_length = True,
lr_scheduler_type = "constant",
report_to = "tensorboard"
)
.. _fine-tuning-llms-single-gpu-start:
Fine-tuning
===========
In this section, you'll see two ways of training: with the LoRA technique and without. See :ref:`Optimizations for model
fine-tuning <fine-tuning-llms-concept-optimizations>` for an introduction to LoRA. Training with LoRA uses the
``SFTTrainer`` API with its PEFT integration. Training without LoRA forgoes these benefits.
Compare the number of trainable parameters and training time under the two different methodologies.
.. tab-set::
.. tab-item:: Fine-tuning with LoRA and PEFT
:sync: with
1. Configure LoRA using the following code snippet.
.. code-block:: python
peft_config = LoraConfig(
lora_alpha = 16,
lora_dropout = 0.1,
r = 64,
bias = "none",
task_type = "CAUSAL_LM"
)
# View the number of trainable parameters.
from peft import get_peft_model
peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()
The output should look like this. Compare the number of trainable parameters to that when fine-tuning without
LoRA and PEFT.
.. code-block:: shell
trainable params: 33,554,432 || all params: 6,771,970,048 || trainable%: 0.49548996469513035
2. Initialize ``SFTTrainer`` with a PEFT LoRA configuration and run the trainer.
.. code-block:: python
# Initialize an SFT trainer.
sft_trainer = SFTTrainer(
model = base_model,
train_dataset = training_dataset,
peft_config = peft_config,
dataset_text_field = "text",
tokenizer = tokenizer,
args = training_arguments
)
# Run the trainer.
sft_trainer.train()
The output should look like this:
.. code-block:: shell
{'loss': 1.5973, 'grad_norm': 0.25271978974342346, 'learning_rate': 4e-05, 'epoch': 0.16}
{'loss': 2.0519, 'grad_norm': 0.21817368268966675, 'learning_rate': 4e-05, 'epoch': 0.32}
{'loss': 1.6147, 'grad_norm': 0.3046981394290924, 'learning_rate': 4e-05, 'epoch': 0.48}
{'loss': 1.4124, 'grad_norm': 0.11534837633371353, 'learning_rate': 4e-05, 'epoch': 0.64}
{'loss': 1.5627, 'grad_norm': 0.09108350425958633, 'learning_rate': 4e-05, 'epoch': 0.8}
{'loss': 1.417, 'grad_norm': 0.2536439299583435, 'learning_rate': 4e-05, 'epoch': 0.96}
{'train_runtime': 197.4947, 'train_samples_per_second': 5.063, 'train_steps_per_second': 0.633, 'train_loss': 1.6194254455566406, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [03:17<00:00, 1.58s/it]
.. tab-item:: Fine-tuning without LoRA and PEFT
:sync: without
1. Use the following code to get started.
.. code-block:: python
   def print_trainable_parameters(model):
       # Prints the number of trainable parameters in the model.
       trainable_params = 0
       all_param = 0
       for _, param in model.named_parameters():
           all_param += param.numel()
           if param.requires_grad:
               trainable_params += param.numel()
       print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}")

   sft_trainer.peft_config = None
   print_trainable_parameters(sft_trainer.model)
The output should look like this. Compare the number of trainable parameters to that when fine-tuning with LoRA
and PEFT.
.. code-block:: shell
trainable params: 6,738,415,616 || all params: 6,738,415,616 || trainable%: 100.00
2. Run the trainer.
.. code-block:: python
# Trainer without LoRA config.
trainer_full = SFTTrainer(
model = base_model,
train_dataset = training_dataset,
dataset_text_field = "text",
tokenizer = tokenizer,
args = training_arguments
)
# Training.
trainer_full.train()
The output should look like this:
.. code-block:: shell
{'loss': 1.5975, 'grad_norm': 0.25113457441329956, 'learning_rate': 4e-05, 'epoch': 0.16}
{'loss': 2.0524, 'grad_norm': 0.2180655151605606, 'learning_rate': 4e-05, 'epoch': 0.32}
{'loss': 1.6145, 'grad_norm': 0.2949850261211395, 'learning_rate': 4e-05, 'epoch': 0.48}
{'loss': 1.4118, 'grad_norm': 0.11036080121994019, 'learning_rate': 4e-05, 'epoch': 0.64}
{'loss': 1.5595, 'grad_norm': 0.08962831646203995, 'learning_rate': 4e-05, 'epoch': 0.8}
{'loss': 1.4119, 'grad_norm': 0.25422757863998413, 'learning_rate': 4e-05, 'epoch': 0.96}
{'train_runtime': 419.5154, 'train_samples_per_second': 2.384, 'train_steps_per_second': 0.298, 'train_loss': 1.6171623611450194, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [06:59<00:00, 3.36s/it]
.. _fine-tuning-llms-single-gpu-saving:
Saving adapters or fully fine-tuned models
------------------------------------------
PEFT methods freeze the pre-trained model parameters during fine-tuning and add a smaller number of trainable
parameters, namely the adapters, on top of it. The adapters are trained to learn specific task information. The adapters
trained with PEFT are usually an order of magnitude smaller than the full base model, making them convenient to share,
store, and load.
.. tab-set::
.. tab-item:: Saving a PEFT adapter
:sync: with
If you're using LoRA and PEFT, use the following code to save a PEFT adapter to your system once the fine-tuning
is completed.
.. code-block:: python
# PEFT adapter name.
adapter_name = "llama-2-7b-enhanced-adapter"
# Save PEFT adapter.
sft_trainer.model.save_pretrained(adapter_name)
The saved PEFT adapter should look like this on your system:
.. code-block:: shell
# Access adapter directory.
cd llama-2-7b-enhanced-adapter
# List all adapter files.
README.md adapter_config.json adapter_model.safetensors
.. tab-item:: Saving a fully fine-tuned model
:sync: without
If you're not using LoRA and PEFT, so no PEFT LoRA configuration was used for training, use the following code
to save your fully fine-tuned model to your system.
.. code-block:: python
# Fully fine-tuned model name.
new_model_name = "llama-2-7b-enhanced"
# Save the fully fine-tuned model.
trainer_full.model.save_pretrained(new_model_name)
The saved new full model should look like this on your system:
.. code-block:: shell
# Access new model directory.
cd llama-2-7b-enhanced
# List all model files.
config.json model-00002-of-00006.safetensors model-00005-of-00006.safetensors
generation_config.json model-00003-of-00006.safetensors model-00006-of-00006.safetensors
model-00001-of-00006.safetensors model-00004-of-00006.safetensors model.safetensors.index.json
.. note::
PEFT adapters can't be loaded by ``AutoModelForCausalLM`` from the Transformers library because they don't contain
full model parameters and model configuration files, for example, ``config.json``. To use an adapter as a normal
Transformers model, merge it into the base model.
Basic model inference
=====================
A trained model can be classified into one of three types:
* A PEFT adapter
* A pre-trained language model in Hugging Face
* A fully fine-tuned model not using PEFT
Let's look at achieving model inference using these types of models.
.. tab-set::
.. tab-item:: Inference using PEFT adapters
To use a PEFT adapter like a normal Transformers model, load the base model along with the adapter and run
generation as follows.
.. code-block:: python
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Set the path of the base model or its name on the Hugging Face Hub
base_model_name = "meta-llama/Llama-2-7b-chat-hf"
# Set the path of the adapter
adapter_name = "llama-2-7b-enhanced-adapter"
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Then, run generation the same way as with a normal model (see the next tab)
The PEFT library provides a ``merge_and_unload`` method, which merges the adapter layers into the base model. This is
needed if you want to save the adapted model to local storage and use it as a normal standalone model.
.. code-block:: python
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Merge the adapter into the base model
model = new_model.merge_and_unload()
# Save the merged model to local storage
model.save_pretrained("merged_adapters")
.. tab-item:: Inference using pre-trained or fully fine-tuned models
If you have a fully fine-tuned model not using PEFT, you can load it like any other pre-trained language model in
`Hugging Face Hub <https://huggingface.co/docs/hub/en/index>`_ using the `Transformers
<https://huggingface.co/docs/transformers/en/index>`_ library.
.. code-block:: python
# Import relevant class for loading model and tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
# Set the pre-trained model name on the Hugging Face Hub
model_name = "meta-llama/Llama-2-7b-chat-hf"
# Set device type
device = "cuda:0"
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Input prompt encoding
query = "What is a large language model?"
inputs = tokenizer.encode(query, return_tensors="pt").to(device)
# Token generation
outputs = model.generate(inputs)
# Outputs decoding
print(tokenizer.decode(outputs[0]))
In addition, pipelines from Transformers offer simple APIs to use pre-trained models for different tasks, including
sentiment analysis, feature extraction, question answering and so on. You can use the pipeline abstraction to achieve
model inference easily.
.. code-block:: python
# Import relevant class for loading model and tokenizer
from transformers import pipeline
# Set the path of your model or its name on the Hugging Face Hub
model_name_or_path = "meta-llama/Llama-2-7b-chat-hf"
# Set pipeline
# A positive device value runs the model on the associated CUDA device ID
pipe = pipeline("text-generation", model=model_name_or_path, device=0)
# Token generation
print(pipe("What is a large language model?")[0]["generated_text"])
If using multiple accelerators, see
:ref:`Multi-accelerator fine-tuning and inference <fine-tuning-llms-multi-gpu-hugging-face-accelerate>` to explore
popular libraries that simplify fine-tuning and inference in a multi-accelerator system.
Read more about inference frameworks like vLLM and Hugging Face TGI in
:doc:`LLM inference frameworks <llm-inference-frameworks>`.
View File
@@ -107,7 +107,10 @@ for more information about running AMP on an AMD accelerator.
Fine-tuning your model
======================
ROCm supports multiple fine-tuning techniques, for example, LoRA, QLoRA, PEFT, and FSDP.
ROCm supports multiple techniques for :ref:`optimizing fine-tuning <fine-tuning-llms-concept-optimizations>`, for
example, LoRA, QLoRA, PEFT, and FSDP.
Learn more about challenges and solutions for model fine-tuning in :doc:`../fine-tuning-llms/index`.
The following developer blogs showcase examples of how to fine-tune a model on an AMD accelerator or GPU.
View File
@@ -92,6 +92,7 @@ Our documentation is organized into the following categories:
:padding: 2
* [Using ROCm for AI](./how-to/rocm-for-ai/index.rst)
* [Fine-tuning LLMs and inference optimization](./how-to/fine-tuning-llms/index.rst)
* [System tuning for various architectures](./how-to/tuning-guides.md)
* [MI100](./how-to/tuning-guides/mi100.md)
* [MI200](./how-to/tuning-guides/mi200.md)
@@ -116,6 +117,7 @@ Our documentation is organized into the following categories:
* [MI250](./conceptual/gpu-arch/mi250.md)
* [MI300](./conceptual/gpu-arch/mi300.md)
* [GPU memory](./conceptual/gpu-memory.md)
* [Setting the number of CUs](./conceptual/setting-cus)
* [File structure (Linux FHS)](./conceptual/file-reorg.md)
* [GPU isolation techniques](./conceptual/gpu-isolation.md)
* [Using CMake](./conceptual/cmake-packages.rst)
View File
@@ -34,7 +34,7 @@
:img-alt: Performance tools
:padding: 2
* [RocBandwidthTest](https://github.com/ROCm/rocm_bandwidth_test)
* {doc}`ROCm Bandwidth Test <rocm_bandwidth_test:index>`
* {doc}`ROCProfiler <rocprofiler:profiler_home_page>`
* [rocprofiler-register](https://github.com/ROCm/rocprofiler-register)
* {doc}`ROCTracer <roctracer:index>`
@@ -49,8 +49,8 @@
:padding: 2
* {doc}`AMD SMI <amdsmi:index>`
* {doc}`rocminfo <rocminfo:index>`
* {doc}`ROCm Data Center Tool <rdc:index>`
* [ROCm Info](https://github.com/ROCm/rocminfo)
* {doc}`ROCm SMI <rocm_smi_lib:index>`
* {doc}`ROCm Validation Suite <rocmvalidationsuite:index>`
* {doc}`TransferBench <transferbench:index>`
View File
@@ -11,7 +11,7 @@ subtrees:
title: Release notes
subtrees:
- entries:
- file: about/CHANGELOG.md
- file: about/changelog.md
title: Changelog
- url: https://github.com/ROCm/ROCm/labels/Verified%20Issue
title: Known issues
@@ -58,6 +58,27 @@ subtrees:
- file: how-to/rocm-for-ai/train-a-model.rst
- file: how-to/rocm-for-ai/hugging-face-models.rst
- file: how-to/rocm-for-ai/deploy-your-model.rst
- file: how-to/fine-tuning-llms/index.rst
title: Fine-tuning LLMs and inference optimization
subtrees:
- entries:
- file: how-to/fine-tuning-llms/overview.rst
title: Conceptual overview
- file: how-to/fine-tuning-llms/fine-tuning-and-inference.rst
subtrees:
- entries:
- file: how-to/fine-tuning-llms/single-gpu-fine-tuning-and-inference.rst
title: Using a single accelerator
- file: how-to/fine-tuning-llms/multi-gpu-fine-tuning-and-inference.rst
title: Using multiple accelerators
- file: how-to/fine-tuning-llms/model-quantization.rst
- file: how-to/fine-tuning-llms/model-acceleration-libraries.rst
- file: how-to/fine-tuning-llms/llm-inference-frameworks.rst
- file: how-to/fine-tuning-llms/optimizing-with-composable-kernel.md
title: Optimizing with Composable Kernel
- file: how-to/fine-tuning-llms/optimizing-triton-kernel.rst
title: Optimizing Triton kernels
- file: how-to/fine-tuning-llms/profiling-and-debugging.rst
- file: how-to/tuning-guides.md
title: System optimization
subtrees:
@@ -119,6 +140,8 @@ subtrees:
title: White paper
- file: conceptual/gpu-memory.md
title: GPU memory
- file: conceptual/setting-cus
title: Setting the number of CUs
- file: conceptual/file-reorg.md
title: File structure (Linux FHS)
- file: conceptual/gpu-isolation.md
View File
@@ -1,2 +1,2 @@
rocm-docs-core==1.1.1
sphinx-reredirects==0.1.3
rocm-docs-core==1.2.0
sphinx-reredirects
View File
@@ -4,22 +4,16 @@
#
# pip-compile requirements.in
#
--extra-index-url https://test.pypi.org/simple/
accessible-pygments==0.0.4
accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==0.7.16
# via sphinx
autodoc==0.5.0
# via -r requirements.in
babel==2.14.0
babel==2.15.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.12.3
# via
# pydata-sphinx-theme
# webtest
# via pydata-sphinx-theme
breathe==4.35.0
# via rocm-docs-core
certifi==2024.2.2
@@ -31,43 +25,23 @@ cffi==1.16.0
charset-normalizer==3.3.2
# via requests
click==8.1.7
# via
# click-log
# doxysphinx
# sphinx-external-toc
click-log==0.4.0
# via doxysphinx
cryptography==42.0.5
# via sphinx-external-toc
cryptography==42.0.7
# via pyjwt
decorator==5.1.1
# via autodoc
defusedxml==0.7.1
# via sphinxcontrib-datatemplates
deprecated==1.2.14
# via pygithub
docutils==0.21.2
# via
# breathe
# myst-parser
# pybtex-docutils
# pydata-sphinx-theme
# sphinx
# sphinx-substitution-extensions
# sphinxcontrib-bibtex
doxysphinx==3.3.7
# via rocm-docs-core
fastjsonschema==2.19.1
# via rocm-docs-core
gitdb==4.0.11
# via gitpython
gitpython==3.1.43
# via rocm-docs-core
hip-python==6.1.0.470.16
# via
# -r requirements.in
# hip-python-as-cuda
hip-python-as-cuda==6.1.0.470.16
# via -r requirements.in
idna==3.7
# via requests
imagesize==1.4.1
@@ -76,75 +50,50 @@ jinja2==3.1.4
# via
# myst-parser
# sphinx
latexcodec==3.0.0
# via pybtex
libsass==0.22.0
# via doxysphinx
lxml==4.9.4
# via doxysphinx
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==2.1.5
# via jinja2
mdit-py-plugins==0.4.0
mdit-py-plugins==0.4.1
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
mpire==2.10.1
# via doxysphinx
myst-parser==3.0.0
myst-parser==3.0.1
# via rocm-docs-core
numpy==1.26.4
# via -r requirements.in
packaging==24.0
# via
# pydata-sphinx-theme
# sphinx
pybtex==0.24.0
# via
# pybtex-docutils
# sphinxcontrib-bibtex
pybtex-docutils==1.0.3
# via sphinxcontrib-bibtex
pycparser==2.22
# via cffi
pydata-sphinx-theme==0.15.2
pydata-sphinx-theme==0.15.3
# via
# rocm-docs-core
# sphinx-book-theme
pygithub==2.3.0
# via rocm-docs-core
pygments==2.17.2
pygments==2.18.0
# via
# accessible-pygments
# mpire
# pydata-sphinx-theme
# sphinx
pyjson5==1.6.6
# via doxysphinx
pyjwt[crypto]==2.8.0
# via pygithub
pynacl==1.5.0
# via pygithub
pyparsing==3.1.2
# via doxysphinx
pyyaml==6.0.1
# via
# myst-parser
# pybtex
# rocm-docs-core
# sphinx-external-toc
# sphinxcontrib-datatemplates
requests==2.31.0
requests==2.32.3
# via
# pygithub
# sphinx
rocm-docs-core==1.1.1
rocm-docs-core==1.2.0
# via -r requirements.in
six==1.16.0
# via pybtex
smmap==5.0.1
# via gitdb
snowballstemmer==2.2.0
@@ -158,56 +107,44 @@ sphinx==7.3.7
# pydata-sphinx-theme
# rocm-docs-core
# sphinx-book-theme
# sphinx-collapse
# sphinx-copybutton
# sphinx-design
# sphinx-external-toc
# sphinx-notfound-page
# sphinx-reredirects
# sphinx-substitution-extensions
# sphinxcontrib-bibtex
# sphinxcontrib-datatemplates
# sphinxcontrib-moderncmakedomain
# sphinxcontrib-runcmd
sphinx-book-theme==1.1.2
# via rocm-docs-core
sphinx-collapse==0.1.3
# via -r requirements.in
sphinx-copybutton==0.5.2
# via rocm-docs-core
sphinx-design==0.5.0
sphinx-design==0.6.0
# via rocm-docs-core
sphinx-external-toc==1.0.1
# via rocm-docs-core
sphinx-notfound-page==1.0.0
sphinx-notfound-page==1.0.2
# via rocm-docs-core
sphinx-reredirects==0.1.3
# via -r requirements.in
sphinx-substitution-extensions==2022.2.16
# via -r requirements.in
sphinxcontrib-applehelp==1.0.8
# via sphinx
sphinxcontrib-bibtex==2.6.2
# via -r requirements.in
sphinxcontrib-datatemplates==0.11.0
# via -r requirements.in
sphinxcontrib-devhelp==1.0.6
# via sphinx
sphinxcontrib-htmlhelp==2.0.5
# via sphinx
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-moderncmakedomain==3.27.0
# via -r requirements.in
sphinxcontrib-qthelp==1.0.7
# via sphinx
sphinxcontrib-runcmd==0.2.0
# via sphinxcontrib-datatemplates
sphinxcontrib-serializinghtml==1.1.10
# via sphinx
typing-extensions==4.5.0
# via pydata-sphinx-theme
urllib3==1.26.18
# via requests
wrapt==1.14.1
tomli==2.0.1
# via sphinx
typing-extensions==4.12.0
# via
# pydata-sphinx-theme
# pygithub
urllib3==2.2.1
# via
# pygithub
# requests
wrapt==1.16.0
# via deprecated


@@ -99,14 +99,14 @@ Tools
":doc:`AMD SMI <amdsmi:index>`", "C library for Linux that provides a user space interface for applications to monitor and control AMD devices"
":doc:`HIPIFY <hipify:index>`", "Translates CUDA source code into portable HIP C++"
"`RocBandwidthTest <https://github.com/ROCm/rocm_bandwidth_test/>`_ ", "Captures the performance characteristics of buffer copying and kernel read/write operations"
":doc:`ROCm Bandwidth Test <rocm_bandwidth_test:index>`", "Captures the performance characteristics of buffer copying and kernel read/write operations"
":doc:`ROCmCC <./reference/rocmcc>`", "Clang/LLVM-based compiler"
":doc:`ROCm CMake <rocmcmakebuildtools:index>`", "Collection of CMake modules for common build and development tasks"
":doc:`ROCm Data Center Tool <rdc:index>`", "Simplifies administration and addresses key infrastructure challenges in AMD GPUs in cluster and data-center environments"
"`ROCm Debug Agent (ROCdebug-agent) <https://github.com/ROCm/rocr_debug_agent/>`_ ", "Prints the state of all AMD GPU wavefronts that caused a queue error by sending a SIGQUIT signal to the process while the program is running"
":doc:`ROCm Debugger (ROCgdb) <rocgdb:index>`", "Source-level debugger for Linux, based on the GNU Debugger (GDB)"
":doc:`ROCdbgapi <rocdbgapi:index>`", "ROCm debugger API library"
"`rocminfo <https://github.com/ROCm/rocminfo/>`_ ", "Reports system information"
":doc:`rocminfo <rocminfo:index>`", "Reports system information"
":doc:`ROCm SMI <rocm_smi_lib:index>`", "C library for Linux that provides a user space interface for applications to monitor and control GPU applications"
":doc:`ROCm Validation Suite <rocmvalidationsuite:index>`", "Detects and troubleshoots common problems affecting AMD GPUs running in a high-performance computing environment"
":doc:`ROCProfiler <rocprofiler:profiler_home_page>`", "Profiling tool for HIP applications"
@@ -121,7 +121,7 @@ Compilers
"`AOMP <https://github.com/ROCm/aomp/>`_", "Scripted build of `LLVM <https://github.com/ROCm/llvm-project>`_ and supporting software"
"`FLANG <https://github.com/ROCm/flang/>`_", "An out-of-tree Fortran compiler targeting LLVM"
"`hipCC <https://github.com/ROCm/HIPCC>`_ ", "Compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure"
":doc:`hipCC <hipcc:index>`", "Compiler driver utility that calls Clang or NVCC and passes the appropriate include and library options for the target compiler and HIP infrastructure"
"`LLVM (amdclang) <https://github.com/ROCm/llvm-project>`_ ", "Toolkit for the construction of highly optimized compilers, optimizers, and runtime environments"
Runtimes

rocm-build/README.md Normal file

@@ -0,0 +1,37 @@
# Overview for ROCm.mk
This Makefile builds the various projects that make up ROCm in the correct order.
It is expected to be run in an environment with the tooling already set up; an easy
way to do this is to use Docker.
## Targets
* all (default)
* rocm-dev (a subset of all)
* clean
* list_components
* help
* T_foo
* C_foo
## Makefile Variables
* PEVAL set to 1 to enable some Makefile debugging code.
* RELEASE\_FLAG set to "" to avoid passing "-r" to builds; the effect is package-defined.
* NOBUILD="foo bar" to avoid adding foo and bar to the dependencies of top-level targets. They may still be
built if they are needed as dependencies of other top-level targets.
* toplevel can be predefined (for example via a second makefile passed with "-f") to override how the per-component targets are generated; see the toplevel macro in ROCm.mk.
## Makefile assumptions
### Requirements for package "foo"
#### Program build\_foo.sh
* Should take option "-c" to clean
* Should take option "-r" to do a normal "RelWithDebInfo" build
### For package "foo" we define some targets
* T\_foo - The main build target, calls "build\_foo.sh -r"
* C\_foo - Clean target, calls "build\_foo.sh -c"
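
The targets and variables above imply invocations like the following (a sketch; the makefile lives in rocm-build, and the component name rocblas is just an example taken from ROCm.mk):

make -f ROCm.mk -j"$(nproc)" all              # build every component
make -f ROCm.mk rocm-dev                      # build the smaller rocm-dev subset
make -f ROCm.mk T_rocblas                     # build one component plus its dependencies
make -f ROCm.mk C_rocblas T_rocblas           # force a rebuild of rocblas
NOBUILD="rdc rocm-gdb" make -f ROCm.mk all    # do not make rdc and rocm-gdb direct goals
PEVAL=1 make -f ROCm.mk all                   # print the generated makefile snippets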

rocm-build/ROCm.mk Normal file

@@ -0,0 +1,248 @@
# Traditional first make target
all:
# Use bash as a shell
# On Ubuntu sh is 'dash'
SHELL:=bash
# Allow RELEASE_FLAG to be overwritten
RELEASE_FLAG?=-r
# Set SANITIZER_FLAG for sanitizer
ASAN_DEP:=
ifeq (${ENABLE_ADDRESS_SANITIZER},true)
ASAN_DEP=lightning
SANITIZER_FLAG=-a
endif
export INFRA_REPO:=ROCm/rocm-build
OUT_DIR:=$(shell . ${INFRA_REPO}/envsetup.sh >/dev/null 2>&1 ; echo $${OUT_DIR})
ROCM_INSTALL_PATH:=$(shell . ${INFRA_REPO}/envsetup.sh >/dev/null 2>&1 ; echo $${ROCM_INSTALL_PATH})
$(info OUT_DIR=${OUT_DIR})
$(info ROCM_INSTALL_PATH=${ROCM_INSTALL_PATH})
# -------------------------------------------------------------------------
# Internal stuff. Could be put in a different file to hide it.
# Internal macros, they need to be defined before being used.
# The internal "eval" allows parts of the Makefile to be generated.
# Whilst it is possible to dump the effective Makefile, it can be
# hard to see where parts come from. Set up the "peval" macro which
# optionally prints out the generated makefile snippet and evaluate it.
# Use "make PEVAL=1 all" to see the things being evaluated.
ifeq (,${PEVAL})
define peval =
$(eval $1)
endef
else
define peval =
$(eval $(info $1)$1)
endef
endif
# macro to add dependencies. Saves having to put all the OUT_DIR/logs in
# The outer strip is to work around a gnu make 4.1 and earlier bug
# It should not be needed.
define adddep =
$(strip $(call peval,components+= $(1) $(2))
$(foreach comp,$(strip $2),$(call peval,${OUT_DIR}/logs/${1}: ${OUT_DIR}/logs/${comp}))
)
endef
# End of internal stuff that is needed at the start of the file
# -------------------------------------------------------------------------
# Dependencies. These can be updated. Anything that is mentioned in
# either of the args to the adddep macro will be added to components. As
# an example there is no need for the adddep of lightning, as it
# depends on nothing and at least one other component includes it.
# Syntax. Up to the first comma everything is fixed. The "call" is a
# keyword to gnu make. The "adddep" is the name of the variable containing
# the macro.
# The second comma-delimited argument is the target.
# The third comma-delimited arg is what the target depends on.
# It is a space-separated list with zero or more elements.
$(call adddep,amd_smi_lib,${ASAN_DEP})
$(call adddep,aqlprofile,${ASAN_DEP} hsa)
$(call adddep,clang-ocl,lightning rocm-cmake)
$(call adddep,comgr,lightning devicelibs)
$(call adddep,dbgapi,hsa comgr)
$(call adddep,devicelibs,lightning)
$(call adddep,hip_on_rocclr,${ASAN_DEP} rocclr rocprofiler-register)
$(call adddep,hipcc,)
$(call adddep,hipify_clang,hip_on_rocclr lightning)
$(call adddep,hsa,${ASAN_DEP} thunk lightning devicelibs rocprofiler-register)
$(call adddep,lightning,)
$(call adddep,opencl_on_rocclr,${ASAN_DEP} rocclr)
$(call adddep,openmp_extras,thunk lightning devicelibs hsa)
$(call adddep,rdc,${ASAN_DEP} rocm_smi_lib hsa rocprofiler)
$(call adddep,rocclr,${ASAN_DEP} hsa comgr hipcc rocprofiler-register)
$(call adddep,rocm_bandwidth_test,${ASAN_DEP} hsa)
$(call adddep,rocm_smi_lib,${ASAN_DEP})
$(call adddep,rocm-cmake,${ASAN_DEP})
$(call adddep,rocm-core,${ASAN_DEP})
$(call adddep,rocm-gdb,dbgapi)
$(call adddep,rocminfo,${ASAN_DEP} hsa)
$(call adddep,rocprofiler-register,${ASAN_DEP})
$(call adddep,rocprofiler,${ASAN_DEP} hsa roctracer aqlprofile opencl_on_rocclr hip_on_rocclr comgr dbgapi rocm_smi_lib)
$(call adddep,rocr_debug_agent,${ASAN_DEP} hip_on_rocclr hsa dbgapi)
$(call adddep,roctracer,${ASAN_DEP} hsa hip_on_rocclr)
$(call adddep,thunk,${ASAN_DEP})
# rocm-dev points to all possible last finish components of Stage1 build.
rocm-dev-components :=rdc hipify_clang openmp_extras \
rocm-core amd_smi_lib hipcc clang-ocl \
rocm_bandwidth_test rocr_debug_agent rocm-gdb
$(call adddep,rocm-dev,$(filter-out ${NOBUILD},${rocm-dev-components}))
$(call adddep,amdmigraphx,hip_on_rocclr half rocblas miopen-hip lightning hipcc)
$(call adddep,composable_kernel,lightning hipcc hip_on_rocclr rocm-cmake)
$(call adddep,half,rocm-cmake)
$(call adddep,hipblas,hip_on_rocclr rocblas rocsolver lightning hipcc)
$(call adddep,hipblaslt,hip_on_rocclr openmp_extras hipblas lightning hipcc)
$(call adddep,hipcub,hip_on_rocclr rocprim lightning hipcc)
$(call adddep,hipfft,hip_on_rocclr openmp_extras rocfft rocrand hiprand lightning hipcc)
$(call adddep,hipfort,rocblas hipblas rocsparse hipsparse rocfft hipfft rocrand hiprand rocsolver hipsolver lightning hipcc)
$(call adddep,hiprand,hip_on_rocclr rocrand lightning hipcc)
$(call adddep,hipsolver,hip_on_rocclr rocblas rocsolver rocsparse lightning hipcc hipsparse)
$(call adddep,hipsparse,hip_on_rocclr rocsparse lightning hipcc)
$(call adddep,hipsparselt,hip_on_rocclr hipsparse lightning hipcc openmp_extras)
$(call adddep,hiptensor,hip_on_rocclr composable_kernel lightning hipcc)
$(call adddep,miopen-deps,lightning hipcc)
$(call adddep,miopen-hip,composable_kernel half hip_on_rocclr miopen-deps rocblas roctracer lightning hipcc)
$(call adddep,mivisionx,amdmigraphx miopen-hip rpp lightning hipcc)
$(call adddep,rccl,hip_on_rocclr hsa lightning hipcc rocm_smi_lib hipify_clang)
$(call adddep,rocalution,rocblas rocsparse rocrand lightning hipcc)
$(call adddep,rocblas,hip_on_rocclr openmp_extras lightning hipcc)
$(call adddep,rocdecode,hip_on_rocclr lightning hipcc)
$(call adddep,rocfft,hip_on_rocclr rocrand hiprand lightning hipcc openmp_extras)
$(call adddep,rocmvalidationsuite,hip_on_rocclr hsa rocblas rocm-core lightning hipcc rocm_smi_lib)
$(call adddep,rocprim,hip_on_rocclr lightning hipcc)
$(call adddep,rocrand,hip_on_rocclr lightning hipcc)
$(call adddep,rocsolver,hip_on_rocclr rocblas rocsparse lightning hipcc)
$(call adddep,rocsparse,hip_on_rocclr rocprim lightning hipcc)
$(call adddep,rocthrust,hip_on_rocclr rocprim lightning hipcc)
$(call adddep,rocwmma,hip_on_rocclr rocblas lightning hipcc rocm-cmake rocm_smi_lib)
$(call adddep,rpp,half lightning hipcc openmp_extras)
# -------------------------------------------------------------------------
# The rest of the file is internal
# Do not pass jobserver params if -n build
ifneq (,$(findstring n,${MAKEFLAGS}))
RMAKE:=
else
RMAKE := +
endif
# disable the builtin rules
.SUFFIXES:
# Linear
# include moredeps
# A macro to define a toplevel target, add it to the 'all' target
# Make it depend on the generated log. Generate the log of the build.
# See if the macro is already defined, if so don't touch it.
# As GNU make allows more than one makefile to be specified with "-f"
# one could put an alternative definition of "toplevel" in a different
# file or even the environment, and use the data in this file for other
# purposes. Uses might include generating output in "dot" format for
# showing the dependency graph, or having a wrapper script that runs
# code-quality tools over each build.
ifeq (${toplevel},)
# { Start of test to see if toplevel is defined
define toplevel =
# The "target" make, this builds the package if it is out of date
T_$1: ${OUT_DIR}/logs/$1 FRC
: $1 built
# The "upload" for $1, it uploads the packages for $1 to the central storage
U_$1: T_$1 FRC
source $${INFRA_REPO}/envsetup.sh && $${INFRA_REPO}/upload_packages.sh "$1"
: $1 uploaded
# The "clean" for $1, it just marks the target as not existing so it will be built
# in the future.
C_$1: FRC
rm -f ${OUT_DIR}/logs/$1 ${OUT_DIR}/logs/$1.repackaged
# parallel build {
${OUT_DIR}/logs/$1: | ${OUT_DIR}/logs
ifneq ($(wildcard ${OUT_DIR}/logs/$1.repackaged),)
@echo Skipping build of $1 as it has already been repackaged
cat $$@.repackaged > $$@
rm -f $$@.repackaged
else # } {
@echo $1 started due to $$? | sed "s:${OUT_DIR}/logs/::g"
# Build in a subshell so we get the time output
# Pass in jobserver info using the RMAKE variable
${RMAKE}@( if set -x && source $${INFRA_REPO}/envsetup.sh && \
rm -f $$@.errors $$@ $$@.repackaged && \
$${INFRA_REPO}/build_$1.sh -c && source $${INFRA_REPO}/ccache-env-mathlib.sh && \
time bash -x $${INFRA_REPO}/build_$1.sh $${RELEASE_FLAG} $${SANITIZER_FLAG} && $${INFRA_REPO}/post_inst_pkg.sh "$1" ; \
then mv $$@.inprogress $$@ ; \
else mv $$@.inprogress $$@.errors ; echo Error in $1 >&2 ; exit 1 ;\
fi ) > $$@.inprogress 2>&1
endif # }
# end of toplevel macro
endef
# } End of test to see if toplevel is defined
endif
components:=$(sort $(components))
# Create all the T_xxxx and C_xxxx targets
$(call peval,$(foreach dep,$(strip ${components}),$(call toplevel,${dep})))
# Add all the T_xxxx targets to "all" except those listed in NOBUILD
# Note this does not prohibit them from being built, it just means that
# a build of "all" will not force them to be built directly
# example command
# make -f jenkins-utils/scripts/Stage1.mk -j60
##help all: Build everything
all: $(addprefix T_,$(filter-out ${NOBUILD},${components}))
@echo All ROCm components built
# Do not document this target
upload: $(addprefix U_,${components})
@echo All ROCm components built and uploaded
##help rocm-dev: Build a subset of ROCm
rocm-dev: T_rocm-dev
@echo rocm-dev built
${OUT_DIR}/logs:
sudo mkdir -p -m 775 "${ROCM_INSTALL_PATH}" && \
sudo chown -R "$(shell id -u):$(shell id -g)" "${ROCM_INSTALL_PATH}"
sudo chown -R "$(shell id -u):$(shell id -g)" "/home/$(shell id -un)"
mkdir -p "${@}"
mkdir -p ${HOME}/.ccache
##help clean: remove the output directory and recreate it
clean:
[ -n "${OUT_DIR}" ] && rm -rf "${OUT_DIR}"
mkdir -p ${OUT_DIR}/logs
.SECONDARY: ${components:%=${OUT_DIR}/logs/%}
.PHONY: all clean repack help list_components
##help list_components: output the list of components
##help : Hint make list_components | paste - - - | column -t
list_components:
@echo "${components}" | sed 'y/ /\n/'
##help help: show this text
help:
@sed -n 's/^##help //p' ${MAKEFILE_LIST} | \
if type -t column > /dev/null ; then column -s: -t ; else cat ; fi
FRC:
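
The T_/C_ targets above are driven by per-component stamp files under ${OUT_DIR}/logs; a sketch of inspecting that mechanism directly (the envsetup.sh path is assumed relative to the build workspace, and comgr is just an example component):

source ROCm/rocm-build/envsetup.sh   # sets OUT_DIR, ROCM_INSTALL_PATH, ...
ls "$OUT_DIR/logs"                   # one stamp (the build log) per completed component
rm -f "$OUT_DIR/logs/comgr"          # what C_comgr does: mark comgr as not built
cat "$OUT_DIR/logs/comgr.errors"     # a failed build leaves a .errors log instead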

rocm-build/build_amd_smi_lib.sh Executable file

@@ -0,0 +1,145 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [-c|-r|-h] [makeopts]"
echo
echo "Options:"
echo " -c, --clean Removes all amd_smi build artifacts"
echo " -r, --release Build non-debug version amd_smi (default is debug)"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -s, --static Build static lib (.a). build instead of dynamic/shared(.so) "
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of type referred to by pkg_type"
echo " -p, --package <type> Specify packaging format"
echo " -h, --help Prints this help"
echo "Possible values for <type>:"
echo " deb -> Debian format (default)"
echo " rpm -> RPM format"
echo
return 0
}
PACKAGE_ROOT="$(getPackageRoot)"
TARGET="build"
PACKAGE_LIB=$(getLibPath)
PACKAGE_INCLUDE="$(getIncludePath)"
AMDSMI_BUILD_DIR=$(getBuildPath amdsmi)
AMDSMI_PACKAGE_DEB_DIR="$(getPackageRoot)/deb/amdsmi"
AMDSMI_PACKAGE_RPM_DIR="$(getPackageRoot)/rpm/amdsmi"
AMDSMI_BUILD_TYPE="debug"
BUILD_TYPE="Debug"
MAKETARGET="deb"
MAKEARG="$DASH_JAY O=$AMDSMI_BUILD_DIR"
AMDSMI_MAKE_OPTS="$DASH_JAY O=$AMDSMI_BUILD_DIR -C $AMDSMI_BUILD_DIR"
AMDSMI_PKG_NAME="amd-smi-lib"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
PKGTYPE="deb"
VALID_STR=`getopt -o hcraso:p: --long help,clean,release,static,address_sanitizer,outdir:,package: -- "$@"`
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ; shift ;;
(-r | --release)
BUILD_TYPE="RelWithDebInfo" ; shift ;;
(-a | --address_sanitizer)
set_asan_env_vars
set_address_sanitizer_on
# TODO - support standard option of passing cmake environment vars - CFLAGS,CXXFLAGS etc., to enable address sanitizer
ADDRESS_SANITIZER=true ; shift ;;
(-s | --static)
SHARED_LIBS="OFF" ; shift ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 2 ;;
(-p | --package)
MAKETARGET="$2" ; shift 2;;
--) shift; break;; # end delimiter
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
done
RET_CONFLICT=1
check_conflicting_options $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
if [ $RET_CONFLICT -ge 30 ]; then
print_vars $API_NAME $TARGET $BUILD_TYPE $SHARED_LIBS $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
exit $RET_CONFLICT
fi
clean_amdsmi() {
rm -rf "$AMDSMI_BUILD_DIR"
rm -rf "$AMDSMI_PACKAGE_DEB_DIR"
rm -rf "$AMDSMI_PACKAGE_RPM_DIR"
rm -rf "$PACKAGE_ROOT/amd_smi"
rm -rf "$PACKAGE_INCLUDE/amd_smi"
rm -f $PACKAGE_LIB/libamd_smi.*
return 0
}
build_amdsmi() {
echo "Building AMDSMI"
echo "AMDSMI_BUILD_DIR: ${AMDSMI_BUILD_DIR}"
if [ ! -d "$AMDSMI_BUILD_DIR" ]; then
mkdir -p $AMDSMI_BUILD_DIR
pushd $AMDSMI_BUILD_DIR
print_lib_type $SHARED_LIBS
cmake \
-DBUILD_SHARED_LIBS=$SHARED_LIBS \
$(rocm_common_cmake_params) \
$(rocm_cmake_params) \
-DENABLE_LDCONFIG=OFF \
-DAMD_SMI_PACKAGE="${AMDSMI_PKG_NAME}" \
-DCPACK_PACKAGE_VERSION_MAJOR="1" \
-DCPACK_PACKAGE_VERSION_MINOR="$ROCM_LIBPATCH_VERSION" \
-DCPACK_PACKAGE_VERSION_PATCH="0" \
-DADDRESS_SANITIZER="$ADDRESS_SANITIZER" \
-DBUILD_TESTS=ON \
"$AMD_SMI_LIB_ROOT"
popd
fi
echo "Making amd_smi package:"
cmake --build "$AMDSMI_BUILD_DIR" -- $AMDSMI_MAKE_OPTS
cmake --build "$AMDSMI_BUILD_DIR" -- $AMDSMI_MAKE_OPTS install
cmake --build "$AMDSMI_BUILD_DIR" -- $AMDSMI_MAKE_OPTS package
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$AMDSMI_PACKAGE_DEB_DIR" $AMDSMI_BUILD_DIR/*.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$AMDSMI_PACKAGE_RPM_DIR" $AMDSMI_BUILD_DIR/*.rpm
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${AMDSMI_PACKAGE_DEB_DIR};;
("rpm")
echo ${AMDSMI_PACKAGE_RPM_DIR};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
verifyEnvSetup
case $TARGET in
(clean) clean_amdsmi ;;
(build) build_amdsmi ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"
exit 0
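
Typical invocations of this script, given the option table above (a sketch):

./build_amd_smi_lib.sh -c          # remove all amd_smi build artifacts
./build_amd_smi_lib.sh -r          # RelWithDebInfo build, DEB packaging by default
./build_amd_smi_lib.sh -r -p rpm   # release build, RPM packaging
./build_amd_smi_lib.sh -o deb      # print the directory that holds the .deb output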

rocm-build/build_amdmigraphx.sh Executable file

@@ -0,0 +1,54 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src AMDMIGraphX
build_amdmigraphx() {
echo "Start build"
cd $COMPONENT_SRC
pip3 install https://github.com/RadeonOpenCompute/rbuild/archive/master.tar.gz
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
fi
if [ -n "$GPU_ARCHS" ]; then
GPU_TARGETS="$GPU_ARCHS"
else
GPU_TARGETS="gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101"
fi
mkdir -p ${BUILD_DIR} && rm -rf ${BUILD_DIR}/* && mkdir -p ${HOME}/amdmigraphx && rm -rf ${HOME}/amdmigraphx/*
rbuild package -d "${HOME}/amdmigraphx" -B "${BUILD_DIR}" \
--cxx="${ROCM_PATH}/llvm/bin/clang++" \
--cc="${ROCM_PATH}/llvm/bin/clang" \
$(rocm_common_cmake_params) \
-DCMAKE_MODULE_LINKER_FLAGS="-Wl,--enable-new-dtags -Wl,--rpath,$ROCM_LIB_RPATH" \
-DGPU_TARGETS="${GPU_TARGETS}" \
-DCMAKE_INSTALL_RPATH=""
mkdir -p $PACKAGE_DIR && cp ${BUILD_DIR}/*.${PKGTYPE} $PACKAGE_DIR
cd $BUILD_DIR && cmake --build . -- install -j${PROC}
show_build_cache_stats
}
clean_amdmigraphx() {
echo "Cleaning AMDMIGraphX build directory: ${BUILD_DIR} ${DEPS_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$DEPS_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_amdmigraphx ;;
outdir) print_output_directory ;;
clean) clean_amdmigraphx ;;
*) die "Invalid target $TARGET" ;;
esac

rocm-build/build_aqlprofile.sh Executable file

@@ -0,0 +1,150 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: ${BASH_SOURCE##*/} [options ...]"
echo
echo "Options:"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -p, --package <type> Specify packaging format"
echo " -r, --release Make a release build instead of a debug build"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -h, --help Prints this help"
echo
echo "Possible values for <type>:"
echo " deb -> Debian format (default)"
echo " rpm -> RPM format"
echo
return 0
}
API_NAME="hsa-amd-aqlprofile"
PACKAGE_DEB="$(getPackageRoot)/deb/$API_NAME"
PROJ_NAME="$API_NAME"
TARGET="build"
BUILD_TYPE="Debug"
MAKETARGET="deb"
MAKE_OPTS="$DASH_JAY -C $BUILD_DIR"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0
MAKETARGET="deb"
PKGTYPE="deb"
VALID_STR=$(getopt -o hcro:p: --long help,clean,release,outdir:,package: -- "$@")
eval set -- "$VALID_STR"
while true; do
case "$1" in
-h | --help)
printUsage
exit 0
;;
-c | --clean)
TARGET="clean"
((CLEAN_OR_OUT |= 1))
shift
;;
-r | --release)
BUILD_TYPE="Release"
shift
;;
-o | --outdir)
TARGET="outdir"
PKGTYPE=$2
OUT_DIR_SPECIFIED=1
((CLEAN_OR_OUT |= 2))
shift 2
;;
--)
shift
break
;;
*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] " >&2
exit 20
;;
esac
done
copy_pkg_files_to_rocm() {
local comp_folder=$1
local comp_pkg_name=$2
cd "${OUT_DIR}/${PKGTYPE}/${comp_folder}"|| exit 2
if [ "${PKGTYPE}" = 'deb' ]; then
dpkg-deb -x ${comp_pkg_name}_*.deb pkg/
else
mkdir pkg && pushd pkg/ || exit 2
if [[ "${comp_pkg_name}" != *-dev* ]]; then
rpm2cpio ../${comp_pkg_name}-*.rpm | cpio -idmv
else
rpm2cpio ../${comp_pkg_name}el-*.rpm | cpio -idmv
fi
popd || exit 2
fi
ls ./pkg -alt
sudo cp -r ./pkg/*/rocm*/* "${ROCM_PATH}" || exit 2
rm -rf pkg/
}
clean() {
echo "Cleaning $PROJ_NAME package"
rm -rf "$PACKAGE_DEB"
}
build() {
echo "Downloading $PROJ_NAME" package
if [ "$DISTRO_NAME" = ubuntu ]; then
mkdir -p "$PACKAGE_DEB"
local rocm_ver=${ROCM_VERSION}
if [ ${ROCM_VERSION##*.} = 0 ]; then
rocm_ver=${ROCM_VERSION%.*}
fi
local url="https://repo.radeon.com/rocm/apt/${rocm_ver}/pool/main/h/${API_NAME}/"
local package
package=$(curl -s "$url" | grep -Po 'href="\K[^"]*' | grep "${DISTRO_RELEASE}" | head -n 1)
if [ -z "$package" ]; then
echo "No package found for Ubuntu version $DISTRO_RELEASE"
exit 1
fi
wget -t3 -P "$PACKAGE_DEB" "${url}${package}"
copy_pkg_files_to_rocm ${API_NAME} ${API_NAME}
else
echo "$DISTRO_ID is not supported..."
exit 2
fi
echo "Installing $PROJ_NAME" package
}
print_output_directory() {
case ${PKGTYPE} in
"deb")
echo ${PACKAGE_DEB}
;;
"rpm")
echo ${PACKAGE_RPM}
;;
*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2
exit 1
;;
esac
exit
}
case "$TARGET" in
clean) clean ;;
build) build ;;
outdir) print_output_directory ;;
*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"

rocm-build/build_clang-ocl.sh Executable file

@@ -0,0 +1,136 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [-c|-r|-h] [makeopts]"
echo
echo "Options:"
echo " -c, --clean Removes all clang-ocl build artifacts"
echo " -r, --release Build non-debug version clang-ocl (default is debug)"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -h, --help Prints this help"
echo " -s, --static Supports static CI by accepting this param & not bailing out. No effect of the param though"
echo
return 0
}
TARGET="build"
CLANG_OCL_DEST="$(getBinPath)"
CLANG_OCL_SRC_ROOT="$CLANG_OCL_ROOT"
CLANG_OCL_BUILD_DIR="$(getBuildPath clang-ocl)"
MAKEARG="$DASH_JAY"
PACKAGE_ROOT="$(getPackageRoot)"
PACKAGE_UTILS="$(getUtilsPath)"
CLANG_OCL_PACKAGE_DEB="$PACKAGE_ROOT/deb/clang-ocl"
CLANG_OCL_PACKAGE_RPM="$PACKAGE_ROOT/rpm/clang-ocl"
BUILD_TYPE="Debug"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
MAKETARGET="deb"
PKGTYPE="deb"
VALID_STR=`getopt -o hcraso:g: --long help,clean,release,static,address_sanitizer,outdir:,gpu_list: -- "$@"`
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ; shift ;;
(-r | --release)
MAKEARG="$MAKEARG BUILD_TYPE=rel" ; BUILD_TYPE="Release" ; shift ;;
(-a | --address_sanitizer)
set_asan_env_vars
set_address_sanitizer_on ; shift ;;
(-s | --static)
SHARED_LIBS="OFF" ; shift ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 2 ;;
(-g | --gpu_list )
GPU_LIST=$2; shift 2 ;;
--) shift; break;;
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
done
RET_CONFLICT=1
check_conflicting_options $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
if [ $RET_CONFLICT -ge 30 ]; then
print_vars $API_NAME $TARGET $BUILD_TYPE $SHARED_LIBS $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
exit $RET_CONFLICT
fi
clean_clang-ocl() {
echo "Removing clang-ocl"
rm -rf $CLANG_OCL_DEST/clang-ocl
rm -rf $CLANG_OCL_BUILD_DIR
rm -rf $CLANG_OCL_PACKAGE_DEB
rm -rf $CLANG_OCL_PACKAGE_RPM
}
build_clang-ocl() {
if [ ! -d "$CLANG_OCL_BUILD_DIR" ]; then
mkdir -p $CLANG_OCL_BUILD_DIR
pushd $CLANG_OCL_BUILD_DIR
if [ -e $PACKAGE_ROOT/lib/bitcode/opencl.amdgcn.bc ]; then
BC_DIR="$ROCM_INSTALL_PATH/lib"
else
BC_DIR="$ROCM_INSTALL_PATH/amdgcn/bitcode"
fi
cmake \
$(rocm_cmake_params) \
-DDISABLE_CHECKS="ON" \
-DCLANG_BIN="$ROCM_INSTALL_PATH/llvm/bin" \
-DBITCODE_DIR="$BC_DIR" \
$(rocm_common_cmake_params) \
-DCPACK_SET_DESTDIR="OFF" \
$CLANG_OCL_SRC_ROOT
echo "Making clang-ocl:"
cmake --build . -- $MAKEARG
cmake --build . -- $MAKEARG install
cmake --build . -- $MAKEARG package
popd
fi
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$CLANG_OCL_PACKAGE_DEB" $CLANG_OCL_BUILD_DIR/rocm-clang-ocl*.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$CLANG_OCL_PACKAGE_RPM" $CLANG_OCL_BUILD_DIR/rocm-clang-ocl*.rpm
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${CLANG_OCL_PACKAGE_DEB};;
("rpm")
echo ${CLANG_OCL_PACKAGE_RPM};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
case $TARGET in
(clean) clean_clang-ocl ;;
(build) build_clang-ocl ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"
exit 0

rocm-build/build_comgr.sh Executable file

@@ -0,0 +1,151 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [options ...]"
echo
echo "Options:"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -p, --package <type> Specify packaging format"
echo " -r, --release Make a release build instead of a debug build"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -h, --help Prints this help"
echo " -s, --static Build static lib (.a). build instead of dynamic/shared(.so) "
echo " -l, --link_llvm_static Link to LLVM statically. Default is to dynamically link to LLVM; requires that LLVM dylibs are created."
echo
echo "Possible values for <type>:"
echo " deb -> Debian format (default)"
echo " rpm -> RPM format"
echo
return 0
}
API_NAME=amd_comgr
PROJ_NAME=$API_NAME
LIB_NAME=lib${API_NAME}
TARGET=build
MAKETARGET=deb
PACKAGE_ROOT=$(getPackageRoot)
PACKAGE_LIB=$(getLibPath)
PACKAGE_INCLUDE=$(getIncludePath)
BUILD_DIR=$(getBuildPath $API_NAME)
PACKAGE_DEB=$(getPackageRoot)/deb/$API_NAME
PACKAGE_RPM=$(getPackageRoot)/rpm/$API_NAME
PACKAGE_PREFIX=$ROCM_INSTALL_PATH
BUILD_TYPE=Debug
MAKE_OPTS="$DASH_JAY CTEST_OUTPUT_ON_FAILURE=1 -C $BUILD_DIR"
VERBOSE_OPTS="AMD_COMGR_EMIT_VERBOSE_LOGS=1 AMD_COMGR_REDIRECT_LOGS=stdout"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
PKGTYPE="deb"
MAKETARGET="deb"
LINK_LLVM_DYLIB="OFF"
VALID_STR=`getopt -o hcraslo:p: --long help,clean,release,address_sanitizer,static,link_llvm_static,outdir:,package: -- "$@"`
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ; shift ;;
(-r | --release)
BUILD_TYPE="RelWithDebInfo" ; shift ;;
(-a | --address_sanitizer)
set_asan_env_vars
set_address_sanitizer_on ; shift ;;
(-s | --static)
SHARED_LIBS="OFF" ; shift ;;
(-l | --link_llvm_static)
LINK_LLVM_DYLIB="OFF"; shift ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 2 ;;
(-p | --package)
MAKETARGET="$2" ; shift 2;;
--) shift; break;; # end delimiter
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
done
RET_CONFLICT=1
check_conflicting_options $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
if [ $RET_CONFLICT -ge 30 ]; then
print_vars $API_NAME $TARGET $BUILD_TYPE $SHARED_LIBS $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
exit $RET_CONFLICT
fi
clean() {
echo "Cleaning $PROJ_NAME"
rm -rf $BUILD_DIR
rm -rf $PACKAGE_DEB
rm -rf $PACKAGE_RPM
rm -rf $PACKAGE_ROOT/${PROJ_NAME}
rm -rf $PACKAGE_LIB/${LIB_NAME}*
}
build() {
echo "Building $PROJ_NAME"
mkdir -p "$BUILD_DIR"
pushd "$BUILD_DIR"
if [ "$SHARED_LIBS" == "OFF" ]
then
echo " Building Archive "
else
echo " Building Shared Object "
fi
cmake \
$(rocm_cmake_params) \
-DBUILD_SHARED_LIBS=$SHARED_LIBS \
$(rocm_common_cmake_params) \
-DCMAKE_DISABLE_FIND_PACKAGE_hip=TRUE \
-DCMAKE_CTEST_ARGUMENTS="--exclude-regex;multithread_test" \
-DLLVM_LINK_LLVM_DYLIB=$LINK_LLVM_DYLIB \
-DLLVM_ENABLE_LIBCXX=$LINK_LLVM_DYLIB \
$COMGR_ROOT
popd
cmake --build "$BUILD_DIR" -- $MAKE_OPTS
cmake --build "$BUILD_DIR" -- $MAKE_OPTS $VERBOSE_OPTS test
cmake --build "$BUILD_DIR" -- $MAKE_OPTS install
cmake --build "$BUILD_DIR" -- $MAKE_OPTS package
mkdir -p $PACKAGE_LIB
cp -R $BUILD_DIR/${LIB_NAME}* $PACKAGE_LIB
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" $BUILD_DIR/comgr*.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" $BUILD_DIR/comgr*.rpm
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${PACKAGE_DEB};;
("rpm")
echo ${PACKAGE_RPM};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
verifyEnvSetup
case $TARGET in
(clean) clean ;;
(build) build ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"


@@ -0,0 +1,185 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src composable_kernel
build_miopen_ck() {
echo "Start Building Composable Kernel"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
fi
cd $COMPONENT_SRC
mkdir "$BUILD_DIR" && cd "$BUILD_DIR"
if [ -n "$GPU_ARCHS" ]; then
GPU_TARGETS="$GPU_ARCHS"
else
GPU_TARGETS="gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101"
fi
if [ "${ASAN_CMAKE_PARAMS}" == "true" ] ; then
cmake -DBUILD_DEV=OFF \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE:-'RelWithDebInfo'} \
-DCMAKE_CXX_COMPILER=${ROCM_PATH}/llvm/bin/clang++ \
-DCMAKE_CXX_FLAGS=" -O3 " \
-DCMAKE_PREFIX_PATH="${ROCM_PATH%-*}/lib/cmake;${ROCM_PATH%-*}/$ASAN_LIBDIR;${ROCM_PATH%-*}/llvm;${ROCM_PATH%-*}" \
-DCMAKE_SHARED_LINKER_FLAGS_INIT="-Wl,--enable-new-dtags,--rpath,$ROCM_ASAN_LIB_RPATH" \
-DCMAKE_EXE_LINKER_FLAGS_INIT="-Wl,--enable-new-dtags,--rpath,$ROCM_ASAN_EXE_RPATH" \
-DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE \
-DCMAKE_INSTALL_PREFIX=${ROCM_PATH} \
-DCMAKE_PACKAGING_INSTALL_PREFIX=${ROCM_PATH} \
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF \
-DROCM_SYMLINK_LIBS=OFF \
-DCPACK_PACKAGING_INSTALL_PREFIX=${ROCM_PATH} \
-DROCM_DISABLE_LDCONFIG=ON \
-DROCM_PATH=${ROCM_PATH} \
-DCPACK_GENERATOR="${PKGTYPE^^}" \
${LAUNCHER_FLAGS} \
-DINSTANCES_ONLY=ON \
-DENABLE_ASAN_PACKAGING=true \
-DAMDGPU_TARGETS=${GPU_TARGETS} \
"$COMPONENT_SRC"
else
cmake -DBUILD_DEV=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_COMPILER=${ROCM_PATH}/llvm/bin/clang++ \
-DCMAKE_CXX_FLAGS=" -O3 " \
-DCMAKE_PREFIX_PATH=${ROCM_PATH%-*} \
-DCMAKE_SHARED_LINKER_FLAGS_INIT='-Wl,--enable-new-dtags,--rpath,$ORIGIN' \
-DCMAKE_EXE_LINKER_FLAGS_INIT='-Wl,--enable-new-dtags,--rpath,$ORIGIN/../lib' \
-DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE \
-DCMAKE_INSTALL_PREFIX=${ROCM_PATH} \
-DCMAKE_PACKAGING_INSTALL_PREFIX=${ROCM_PATH} \
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF \
-DROCM_SYMLINK_LIBS=OFF \
-DCPACK_PACKAGING_INSTALL_PREFIX=${ROCM_PATH} \
-DROCM_DISABLE_LDCONFIG=ON \
-DROCM_PATH=${ROCM_PATH} \
-DCPACK_GENERATOR="${PKGTYPE^^}" \
${LAUNCHER_FLAGS} \
-DINSTANCES_ONLY=ON \
-DAMDGPU_TARGETS=${GPU_TARGETS} \
"$COMPONENT_SRC"
fi
cmake --build . -- -j${PROC} package
cmake --build "$BUILD_DIR" -- install
mkdir -p $PACKAGE_DIR && cp ./*.${PKGTYPE} $PACKAGE_DIR
rm -rf *
}
unset_asan_env_vars() {
ASAN_CMAKE_PARAMS="false"
export ADDRESS_SANITIZER="OFF"
export LD_LIBRARY_PATH=""
export ASAN_OPTIONS=""
}
set_address_sanitizer_off() {
export CFLAGS=""
export CXXFLAGS=""
export LDFLAGS=""
}
build_miopen_ckProf() {
ENABLE_ADDRESS_SANITIZER=false
echo "Start Building Composable Kernel Profiler"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
else
unset_asan_env_vars
set_address_sanitizer_off
fi
cd $COMPONENT_SRC
cd "$BUILD_DIR"
rm -rf *
if [ -n "$GPU_ARCHS" ]; then
architectures=$(echo ${GPU_ARCHS} | awk -F';' '{for(i=1;i<=NF;i++) a[substr($i,1,5)]} END{for(i in a) printf i" "}')
else
architectures='gfx10 gfx11 gfx90 gfx94'
fi
for arch in ${architectures}
do
if [ "${ASAN_CMAKE_PARAMS}" == "true" ] ; then
cmake -DBUILD_DEV=OFF \
-DCMAKE_PREFIX_PATH="${ROCM_PATH%-*}/lib/cmake;${ROCM_PATH%-*}/$ASAN_LIBDIR;${ROCM_PATH%-*}/llvm;${ROCM_PATH%-*}" \
-DCMAKE_BUILD_TYPE=${BUILD_TYPE:-'RelWithDebInfo'} \
-DCMAKE_SHARED_LINKER_FLAGS_INIT="-Wl,--enable-new-dtags,--rpath,$ROCM_ASAN_LIB_RPATH" \
-DCMAKE_EXE_LINKER_FLAGS_INIT="-Wl,--enable-new-dtags,--rpath,$ROCM_ASAN_EXE_RPATH" \
-DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE \
-DCMAKE_INSTALL_PREFIX="${ROCM_PATH}" \
-DCMAKE_PACKAGING_INSTALL_PREFIX="${ROCM_PATH}" \
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF \
-DROCM_SYMLINK_LIBS=OFF \
-DCPACK_PACKAGING_INSTALL_PREFIX="${ROCM_PATH}" \
-DROCM_DISABLE_LDCONFIG=ON \
-DROCM_PATH="${ROCM_PATH}" \
-DCPACK_GENERATOR="${PKGTYPE^^}" \
-DCMAKE_CXX_COMPILER="${ROCM_PATH}/llvm/bin/clang++" \
-DCMAKE_C_COMPILER="${ROCM_PATH}/llvm/bin/clang" \
${LAUNCHER_FLAGS} \
-DPROFILER_ONLY=ON \
-DENABLE_ASAN_PACKAGING=true \
-DGPU_ARCH="${arch}" \
"$COMPONENT_SRC"
else
cmake -DBUILD_DEV=OFF \
-DCMAKE_PREFIX_PATH="${ROCM_PATH%-*}" \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_SHARED_LINKER_FLAGS_INIT='-Wl,--enable-new-dtags,--rpath,$ORIGIN' \
-DCMAKE_EXE_LINKER_FLAGS_INIT='-Wl,--enable-new-dtags,--rpath,$ORIGIN/../lib' \
-DCMAKE_VERBOSE_MAKEFILE=1 \
-DCMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE \
-DCMAKE_INSTALL_PREFIX="${ROCM_PATH}" \
-DCMAKE_PACKAGING_INSTALL_PREFIX="${ROCM_PATH}" \
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF \
-DROCM_SYMLINK_LIBS=OFF \
-DCPACK_PACKAGING_INSTALL_PREFIX="${ROCM_PATH}" \
-DROCM_DISABLE_LDCONFIG=ON \
-DROCM_PATH="${ROCM_PATH}" \
-DCPACK_GENERATOR="${PKGTYPE^^}" \
-DCMAKE_CXX_COMPILER="${ROCM_PATH}/llvm/bin/clang++" \
-DCMAKE_C_COMPILER="${ROCM_PATH}/llvm/bin/clang" \
${LAUNCHER_FLAGS} \
-DPROFILER_ONLY=ON \
-DGPU_ARCH="${arch}" \
"$COMPONENT_SRC"
fi
cmake --build . -- -j${PROC} package
cp ./*ckprofiler*.${PKGTYPE} $PACKAGE_DIR
rm -rf *
done
rm -rf _CPack_Packages/ && find -name '*.o' -delete
echo "Finished building Composable Kernel"
show_build_cache_stats
}
clean_miopen_ck() {
echo "Cleaning MIOpen-CK build directory: ${BUILD_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_miopen_ck; build_miopen_ckProf;;
outdir) print_output_directory ;;
clean) clean_miopen_ck ;;
*) die "Invalid target $TARGET" ;;
esac
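
The awk one-liner above collapses a ';'-separated arch list into its distinct gfx families by keying an array on the first five characters of each entry; a sketch with example input:

GPU_ARCHS='gfx90a;gfx942;gfx1100;gfx1101'
echo ${GPU_ARCHS} | awk -F';' '{for(i=1;i<=NF;i++) a[substr($i,1,5)]} END{for(i in a) printf i" "}'
# prints the distinct families, e.g. "gfx90 gfx94 gfx11 " (array order is unspecified)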

rocm-build/build_dbgapi.sh Executable file

@@ -0,0 +1,159 @@
#!/bin/bash
source "${BASH_SOURCE%/*}/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE[0]}") [options ...]"
echo
echo "Options:"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -p, --package <type> Specify packaging format"
echo " -r, --release Make a release build instead of a debug build"
echo " --enable-assertions Enable assertions"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -h, --help Prints this help"
echo " -M, --skip_man_pages Do not build the 'docs' target"
echo " -s, --static Supports static CI by accepting this param & not bailing out. No effect of the param though"
echo
echo "Possible values for <type>:"
echo " deb -> Debian format (default)"
echo " rpm -> RPM format"
echo
return 0
}
API_NAME=rocm-dbgapi
AMD_DBGAPI_NAME=amd-dbgapi
MAKEINSTALL_MANIFEST=makeinstall_manifest.txt
PROJ_NAME=$API_NAME
LIB_NAME=lib${API_NAME}.so
TARGET=build
MAKETARGET=deb
PACKAGE_ROOT=$(getPackageRoot)
PACKAGE_LIB=$(getLibPath)
PACKAGE_INCLUDE=$(getIncludePath)
BUILD_DIR=$(getBuildPath $API_NAME)
PACKAGE_DEB=$(getPackageRoot)/deb/$API_NAME
PACKAGE_RPM=$(getPackageRoot)/rpm/$API_NAME
BUILD_TYPE=Debug
MAKE_OPTS=($DASH_JAY -C "$BUILD_DIR")
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
MAKETARGET="deb"
PKGTYPE="deb"
DODOCSBUILD=true
VALID_STR=$(getopt -o hcraso:p:M --long help,clean,release,enable-assertions,static,address_sanitizer,outdir:,package:,skip_man_pages -- "$@")
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ;;
(-r | --release)
ENABLE_ASSERTIONS=${ENABLE_ASSERTIONS:-"Off"} ;
BUILD_TYPE="RelWithDebInfo" ;;
( --enable-assertions)
ENABLE_ASSERTIONS="On" ;;
(-a | --address_sanitizer)
set_asan_env_vars
set_address_sanitizer_on ;;
(-s | --static)
SHARED_LIBS="OFF" ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; ((CLEAN_OR_OUT|=2)) ; shift 1 ;;
(-M | --skip_man_pages) DODOCSBUILD=false;;
(-p | --package)
MAKETARGET="$2" ; shift 1;;
--) shift; break;;
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
shift
done
check_conflicting_options $CLEAN_OR_OUT "$PKGTYPE" "$MAKETARGET"
if [ "$RET_CONFLICT" -ge 30 ]; then
print_vars "$API_NAME" "$TARGET" "$BUILD_TYPE" "$SHARED_LIBS" "$CLEAN_OR_OUT" "$PKGTYPE" "$MAKETARGET"
exit "$RET_CONFLICT"
fi
clean() {
echo "Cleaning $PROJ_NAME"
if [ -e "$BUILD_DIR/$MAKEINSTALL_MANIFEST" ] ; then
xargs rm -f < "$BUILD_DIR/$MAKEINSTALL_MANIFEST"
fi
rm -rf "$BUILD_DIR"
rm -rf "$PACKAGE_DEB"
rm -rf "$PACKAGE_RPM"
rm -rf "${PACKAGE_ROOT:?}/${PROJ_NAME}"
rm -rf "${PACKAGE_LIB:?}/${LIB_NAME}"*
rm -rf "${PACKAGE_LIB:?}/cmake/${AMD_DBGAPI_NAME}"
rm -rf "${PACKAGE_INCLUDE:?}/${AMD_DBGAPI_NAME}"
}
build() {
if [ ! -e "$ROCM_DBGAPI_ROOT/CMakeLists.txt" ]
then
echo " No $ROCM_DBGAPI_ROOT/CMakeLists.txt file, skipping rocm-dbgapi" >&2
echo " No $ROCM_DBGAPI_ROOT/CMakeLists.txt file, skipping rocm-dbgapi"
exit 0
fi
echo "Building $PROJ_NAME"
mkdir -p "$BUILD_DIR"
pushd "$BUILD_DIR" || exit 99
cmake \
$(rocm_cmake_params) \
$(rocm_common_cmake_params) \
-DENABLE_ASSERTIONS=${ENABLE_ASSERTIONS:-"On"} \
"$ROCM_DBGAPI_ROOT"
popd || exit 99
cmake --build "$BUILD_DIR" -- "${MAKE_OPTS[@]}"
"$DODOCSBUILD" && cmake --build "$BUILD_DIR" -- "${MAKE_OPTS[@]}" doc
cmake --build "$BUILD_DIR" -- "${MAKE_OPTS[@]}" install
mv "$BUILD_DIR/install_manifest.txt" "$BUILD_DIR/$MAKEINSTALL_MANIFEST"
cmake --build "$BUILD_DIR" -- "${MAKE_OPTS[@]}" package
mkdir -p "$PACKAGE_LIB"
(
shopt -s nullglob
cp -R "$BUILD_DIR/lib/${LIB_NAME}"* "$BUILD_DIR/${LIB_NAME}"* "$PACKAGE_LIB"
)
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" "$BUILD_DIR/${API_NAME}"*.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" "$BUILD_DIR/${API_NAME}"*.rpm
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${PACKAGE_DEB};;
("rpm")
echo ${PACKAGE_RPM};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
verifyEnvSetup
case $TARGET in
(clean) clean ;;
(build) build ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"

rocm-build/build_devicelibs.sh Executable file

@@ -0,0 +1,140 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [options ...]"
echo
echo "Options:"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -r, --release Build a release version of the package"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -s, --static Supports static CI by accepting this param & not bailing out. No effect of the param though"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -h, --help Prints this help"
echo
echo
return 0
}
PACKAGE_ROOT="$(getPackageRoot)"
PACKAGE_BIN="$(getBinPath)"
PACKAGE_LIB="$(getLibPath)"
BUILD_PATH="$(getBuildPath devicelibs)"
INSTALL_PATH="$(getPackageRoot)"
LIGHTNING_BUILD_PATH="$(getBuildPath lightning)"
DEB_PATH="$(getDebPath devicelibs)"
RPM_PATH="$(getRpmPath devicelibs)"
AMDGCN_LIB_PATH="$PACKAGE_ROOT/amdgcn/bitcode/"
TARGET="build"
MAKEOPTS="$DASH_JAY"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
PKGTYPE="deb"
MAKETARGET="deb"
VALID_STR=`getopt -o hcraso: --long help,clean,release,static,address_sanitizer,outdir: -- "$@"`
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ; shift ;;
(-r | --release)
BUILD_TYPE="Release" ; shift ;;
(-a | --address_sanitizer)
ASAN_CMAKE_PARAMS="true"
ack_and_ignore_asan ; shift ;;
(-s | --static)
SHARED_LIBS="OFF" ; shift ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 2 ;;
--) shift; break;; # end delimiter
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
done
RET_CONFLICT=1
check_conflicting_options $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
if [ $RET_CONFLICT -ge 30 ]; then
print_vars $API_NAME $TARGET $BUILD_TYPE $SHARED_LIBS $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
exit $RET_CONFLICT
fi
clean_devicelibs() {
rm -rf "$BUILD_PATH"
rm -f $PACKAGE_LIB/hc*.bc
rm -f $PACKAGE_LIB/irif*.bc
rm -f $PACKAGE_LIB/ockl*.bc
rm -f $PACKAGE_LIB/oclc*.bc
rm -f $PACKAGE_LIB/ocml*.bc
rm -f $PACKAGE_LIB/opencl*.bc
rm -f $PACKAGE_LIB/openmp*.bc
rm -rf $PACKAGE_ROOT/amdgcn
rm -rf "$DEB_PATH"
rm -rf "$RPM_PATH"
}
build_devicelibs() {
mkdir -p "$BUILD_PATH"
pushd "$BUILD_PATH"
local clangResourceDir="$($LIGHTNING_BUILD_PATH/bin/clang -print-resource-dir)"
local clangResourceVer=${clangResourceDir#*lib/clang/}
local bitcodeInstallLoc="lib/llvm/lib/clang/${clangResourceVer}/lib"
export LLVM_BUILD="$LIGHTNING_BUILD_PATH"
if [ ! -e Makefile ]; then
cmake $(rocm_cmake_params) \
$(rocm_common_cmake_params) \
-DROCM_DEVICE_LIBS_BITCODE_INSTALL_LOC_NEW="$bitcodeInstallLoc/amdgcn" \
-DROCM_DEVICE_LIBS_BITCODE_INSTALL_LOC_OLD="amdgcn" \
"$DEVICELIBS_ROOT"
echo "CMake complete"
fi
echo "Building device-libs"
cmake --build . -- $MAKEOPTS
cmake --build . -- $MAKEOPTS install
popd
}
package_devicelibs() {
mkdir -p "$DEB_PATH"
mkdir -p "$RPM_PATH"
pushd "$BUILD_PATH"
cpack
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$DEB_PATH" *.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$RPM_PATH" *.rpm
popd
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${DEB_PATH};;
("rpm")
echo ${RPM_PATH};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
case $TARGET in
(clean) clean_devicelibs ;;
(build) build_devicelibs; package_devicelibs ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"

rocm-build/build_half.sh Executable file

@@ -0,0 +1,46 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src half
build_half() {
echo "Start build"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
ASAN_CMAKE_PARAMS="false"
fi
mkdir -p "$BUILD_DIR" && cd "$BUILD_DIR"
cmake \
-DCMAKE_INSTALL_PREFIX="$ROCM_PATH" \
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF \
"$COMPONENT_SRC"
cmake --build "$BUILD_DIR" -- -j${PROC}
cmake --build "$BUILD_DIR" -- package
cmake --build "$BUILD_DIR" -- install
rm -rf _CPack_Packages/ && find -name '*.o' -delete
mkdir -p $PACKAGE_DIR && cp ${BUILD_DIR}/*.${PKGTYPE} $PACKAGE_DIR
show_build_cache_stats
}
clean_half() {
echo "Cleaning half build directory: ${BUILD_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_half ;;
outdir) print_output_directory ;;
clean) clean_half ;;
*) die "Invalid target $TARGET" ;;
esac

rocm-build/build_hip_on_rocclr.sh Executable file

@@ -0,0 +1,295 @@
#!/bin/bash
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [options ...]"
echo
echo "Options:"
echo " -h, --help Prints this help"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -r, --release Make a release build instead of a debug build"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -s, --static Build static lib (.a). build instead of dynamic/shared(.so) "
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of type referred to by pkg_type"
echo " -t, --offload-arch=<arch> Specify arch for catch tests ex: --offload-arch=gfx1030 --offload-arch=gfx1100"
echo " -p, --package <type> Specify packaging format"
echo
echo "Possible values for <type>:"
echo " deb -> Debian format (default)"
echo " rpm -> RPM format"
echo
return 0
}
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
MAKEOPTS="$DASH_JAY"
BUILD_PATH="$(getBuildPath hip-on-rocclr)"
TARGET="build"
PACKAGE_ROOT="$(getPackageRoot)"
PACKAGE_SRC="$(getSrcPath)"
PACKAGE_DEB="$PACKAGE_ROOT/deb/hip-on-rocclr"
PACKAGE_RPM="$PACKAGE_ROOT/rpm/hip-on-rocclr"
PREFIX_PATH="$PACKAGE_ROOT"
CORE_BUILD_DIR="$(getBuildPath hsa-core)"
ROCclr_BUILD_DIR="$(getBuildPath rocclr)"
HIPCC_BUILD_DIR="$(getBuildPath hipcc)"
CATCH_BUILD_DIR="$(getBuildPath catch)"
CATCH_SRC="$HIP_CATCH_TESTS_ROOT/catch"
SAMPLES_SRC="$HIP_CATCH_TESTS_ROOT/samples"
SAMPLES_BUILD_DIR="$(getBuildPath samples)"
if [ ! -e "$CATCH_SRC/CMakeLists.txt" ]; then
echo "Using catch source from hip project" >&2
CATCH_SRC="$HIP_ON_ROCclr_ROOT/tests/catch"
fi
BUILD_TYPE="Debug"
SHARED_LIBS="ON"
CLEAN_OR_OUT=0;
MAKETARGET="deb"
PKGTYPE="deb"
OFFLOAD_ARCH=()
DEFAULT_OFFLOAD_ARCH=(gfx900 gfx906 gfx908 gfx90a gfx940 gfx941 gfx942 gfx1030 gfx1031 gfx1033 gfx1034 gfx1035 gfx1100 gfx1101 gfx1102 gfx1103)
VALID_STR=`getopt -o hcrast:o: --long help,clean,release,address_sanitizer,static,offload-arch=:,outdir: -- "$@"`
eval set -- "$VALID_STR"
while true ;
do
case "$1" in
(-h | --help)
printUsage ; exit 0;;
(-c | --clean)
TARGET="clean" ; ((CLEAN_OR_OUT|=1)) ; shift ;;
(-r | --release)
BUILD_TYPE="RelWithDebInfo" ; shift ;;
(-a | --address_sanitizer)
set_asan_env_vars
set_address_sanitizer_on ; shift ;;
(-s | --static)
SHARED_LIBS="OFF" ; shift ;;
(-t | --offload-arch=)
OFFLOAD_ARCH+=( "$2" ); ((CLEAN_OR_OUT|=2)); shift 2 ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 2 ;;
--) shift; break;;
(*)
echo " This should never come but just incase : UNEXPECTED ERROR Parm : [$1] ">&2 ; exit 20;;
esac
done
if [ ${#OFFLOAD_ARCH[@]} = 0 ] ; then
OFFLOAD_ARCH=( "${DEFAULT_OFFLOAD_ARCH[@]}" )
else
echo "Using user defined offload archs ${OFFLOAD_ARCH[@]} for catch tests";
fi
printf -v OFFLOAD_ARCH_STR -- '--offload-arch=%q ' "${OFFLOAD_ARCH[@]}"
RET_CONFLICT=1
check_conflicting_options $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
if [ $RET_CONFLICT -ge 30 ]; then
print_vars $API_NAME $TARGET $BUILD_TYPE $SHARED_LIBS $CLEAN_OR_OUT $PKGTYPE $MAKETARGET
exit $RET_CONFLICT
fi
clean_hip_on_rocclr() {
rm -rf "$BUILD_PATH"
rm -rf "$PACKAGE_DEB"
rm -rf "$PACKAGE_RPM"
rm -rf "$OUT_DIR/hip"
}
build_hip_on_rocclr() {
if [ -e "$CLR_ROOT/CMakeLists.txt" ]; then
_HIP_CMAKELIST_DIR="$CLR_ROOT"
_HIP_CMAKELIST_OPT="-DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF"
if [ -e "$HIPOTHER_ROOT/hipnv" ]; then
_HIP_CMAKELIST_OPT="$_HIP_CMAKELIST_OPT -DHIPNV_DIR=$HIPOTHER_ROOT/hipnv"
fi
elif [ ! -e "$HIPAMD_ROOT/CMakeLists.txt" ]; then
echo "No $HIPAMD_ROOT/CMakeLists.txt file, skipping hip on rocclr" >&2
echo "No $HIPAMD_ROOT/CMakeLists.txt file, skipping hip on rocclr"
exit 0
else
_HIP_CMAKELIST_DIR="$HIPAMD_ROOT"
_HIP_CMAKELIST_OPT=""
fi
echo "$_HIP_CMAKELIST_DIR"
mkdir -p "$BUILD_PATH"
pushd "$BUILD_PATH"
if [ ! -e Makefile ]; then
echo "Building HIP-On-ROCclr CMake environment"
print_lib_type $SHARED_LIBS
cmake $(rocm_cmake_params) \
-DBUILD_SHARED_LIBS=$SHARED_LIBS \
-DHIP_COMPILER=clang \
-DHIP_PLATFORM=amd \
-DHIP_COMMON_DIR="$HIP_ON_ROCclr_ROOT" \
$(rocm_common_cmake_params) \
-DCMAKE_HIP_ARCHITECTURES=OFF \
-DHSA_PATH="$ROCM_INSTALL_PATH" \
-DCMAKE_SKIP_BUILD_RPATH=TRUE \
-DCPACK_INSTALL_PREFIX="$ROCM_INSTALL_PATH" \
-DROCM_PATH="$ROCM_INSTALL_PATH" \
-DHIPCC_BIN_DIR="$HIPCC_BUILD_DIR" \
-DHIP_CATCH_TEST=1 \
$_HIP_CMAKELIST_OPT \
"$_HIP_CMAKELIST_DIR"
echo "CMake complete"
fi
echo "Build and Install HIP"
cmake --build . -- $MAKEOPTS install "VERBOSE=1"
popd
}
build_catch_tests() {
WORKSPACE=`pwd`
echo "Build catch2 tests independently"
if [ ! -e "$CATCH_SRC/CMakeLists.txt" ]; then
echo "catch source not found: $CATCH_SRC" >&2
exit
fi
# build catch
rm -rf "$CATCH_BUILD_DIR"
mkdir -p "$CATCH_BUILD_DIR"
pushd "$CATCH_BUILD_DIR"
export HIP_PATH="$ROCM_INSTALL_PATH"
export ROCM_PATH="$ROCM_INSTALL_PATH"
cmake \
-DCMAKE_BUILD_TYPE="${BUILD_TYPE}" \
-DHIP_PLATFORM=amd \
-DROCM_PATH="$ROCM_INSTALL_PATH" \
-DOFFLOAD_ARCH_STR="$OFFLOAD_ARCH_STR" \
$(rocm_common_cmake_params) \
-DCPACK_RPM_DEBUGINFO_PACKAGE=FALSE \
-DCPACK_DEBIAN_DEBUGINFO_PACKAGE=FALSE \
-DCPACK_INSTALL_PREFIX="$ROCM_INSTALL_PATH" \
"$CATCH_SRC"
make $MAKEOPTS build_tests
echo "Packaging catch tests"
make $MAKEOPTS package_test
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" *.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" *.rpm
popd
}
package_samples() {
if [ "$ASAN_CMAKE_PARAMS" == "true" ] ; then
echo "Disable the packaging of HIP samples" >&2
return
fi
WORKSPACE=`pwd`
if [ ! -e "$SAMPLES_SRC/CMakeLists.txt" ]; then
echo "HIP samples source not found at: $SAMPLES_SRC" >&2
echo "Using samples package from hip project: $BUILD_PATH" >&2
return
fi
rm -rf "$SAMPLES_BUILD_DIR"
mkdir -p "$SAMPLES_BUILD_DIR"
pushd "$SAMPLES_BUILD_DIR"
local CMAKE_PATH="$(getCmakePath)"
export HIP_PATH="$ROCM_INSTALL_PATH"
export ROCM_PATH="$ROCM_INSTALL_PATH"
cmake \
-DROCM_PATH="$ROCM_INSTALL_PATH" \
$(rocm_common_cmake_params) \
-DCMAKE_MODULE_PATH="$CMAKE_PATH/hip" \
-DCPACK_INSTALL_PREFIX="$ROCM_INSTALL_PATH" \
"$SAMPLES_SRC"
echo "Packaging hip samples from hip-tests project"
make $MAKEOPTS package_samples
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" *.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" *.rpm
popd
}
clean_hip_tests(){
rm -rf "$CATCH_BUILD_DIR"
rm -rf "$PACKAGE_SRC/hip-on-rocclr"
rm -rf "$PACKAGE_SRC/hipamd"
rm -rf "$PACKAGE_SRC/rocclr"
rm -rf "$PACKAGE_SRC/opencl-on-rocclr"
rm -rf "$PACKAGE_SRC/clr"
rm -rf "$PACKAGE_SRC/hip-tests"
rm -rf "$PACKAGE_SRC/hipother"
}
copy_hip_tests() {
clean_hip_tests
echo "Copy HIP & ROCclr Source and tests"
mkdir -p "$PACKAGE_SRC/hip-on-rocclr"
echo "Copying hip-on-rocclr"
progressCopy "$HIP_ON_ROCclr_ROOT" "$PACKAGE_SRC/hip-on-rocclr"
if [ -e "$CLR_ROOT/CMakeLists.txt" ]; then
mkdir -p "$PACKAGE_SRC/clr"
echo "Copying clr"
progressCopy "$CLR_ROOT" "$PACKAGE_SRC/clr"
else
mkdir -p "$PACKAGE_SRC/hipamd"
mkdir -p "$PACKAGE_SRC/rocclr"
mkdir -p "$PACKAGE_SRC/opencl-on-rocclr"
echo "Copying hipamd"
progressCopy "$HIPAMD_ROOT" "$PACKAGE_SRC/hipamd"
echo "Copying rocclr"
progressCopy "$ROCclr_ROOT" "$PACKAGE_SRC/rocclr"
echo "Copying opencl-on-rocclr"
progressCopy "$OPENCL_ON_ROCclr_ROOT" "$PACKAGE_SRC/opencl-on-rocclr"
fi
if [ -e "$HIPOTHER_ROOT/hipnv" ]; then
mkdir -p "$PACKAGE_SRC/hipother"
echo "Copying hipother"
progressCopy "$HIPOTHER_ROOT" "$PACKAGE_SRC/hipother"
fi
mkdir -p "$PACKAGE_SRC/hip-tests"
echo "Copying hip-tests"
progressCopy "$HIP_CATCH_TESTS_ROOT" "$PACKAGE_SRC/hip-tests"
}
package_hip_on_rocclr()
{
echo "Packagin HIP-on-ROCclr"
pushd "$BUILD_PATH"
cmake --build . -- $MAKEOPTS package
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" *.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" *.rpm
popd
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${PACKAGE_DEB};;
("rpm")
echo ${PACKAGE_RPM};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
esac
exit
}
case $TARGET in
(clean) clean_hip_on_rocclr; clean_hip_tests ;;
(build) build_hip_on_rocclr; build_catch_tests; package_hip_on_rocclr; package_samples; copy_hip_tests;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"

rocm-build/build_hipblas.sh Executable file

@@ -0,0 +1,65 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src hipBLAS
build_hipblas() {
echo "Start build"
CXX="g++"
CLIENTS_SAMPLES="ON"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
CLIENTS_SAMPLES="OFF"
fi
echo "C compiler: $CC"
echo "CXX compiler: $CXX"
echo "FC compiler: $FC"
cd $COMPONENT_SRC
mkdir -p "$BUILD_DIR" && cd "$BUILD_DIR"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
rebuild_lapack
fi
cmake \
${LAUNCHER_FLAGS} \
$(rocm_common_cmake_params) \
-DUSE_CUDA=OFF \
-DBUILD_CLIENTS_TESTS=ON \
-DBUILD_CLIENTS_BENCHMARKS=ON \
-DBUILD_CLIENTS_SAMPLES="${CLIENTS_SAMPLES}" \
-DCPACK_SET_DESTDIR=OFF \
-DBUILD_ADDRESS_SANITIZER="${ADDRESS_SANITIZER}" \
"$COMPONENT_SRC"
cmake --build "$BUILD_DIR" -- -j${PROC}
cmake --build "$BUILD_DIR" -- install
cmake --build "$BUILD_DIR" -- package
rm -rf _CPack_Packages/ && find -name '*.o' -delete
mkdir -p $PACKAGE_DIR && cp ${BUILD_DIR}/*.${PKGTYPE} $PACKAGE_DIR
show_build_cache_stats
}
clean_hipblas() {
echo "Cleaning hipBLAS build directory: ${BUILD_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_hipblas ;;
outdir) print_output_directory ;;
clean) clean_hipblas ;;
*) die "Invalid target $TARGET" ;;
esac
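build_hipblas.sh and the other stage-2 scripts here lean on set_component_src and stage2_command_args from compute_helper.sh, which is outside this diff. A minimal sketch of what set_component_src is assumed to establish follows; every path root below is an inferred placeholder, not taken from the source:
# Hypothetical sketch only; the real definition is in compute_helper.sh.
set_component_src() {
    local component="$1"
    COMPONENT_SRC="${ROCM_SRC_ROOT:-$HOME/rocm-src}/${component}"   # assumed root
    BUILD_DIR="${ROCM_BUILD_ROOT:-$HOME/rocm-build}/${component}"   # assumed root
    PACKAGE_DIR="${ROCM_PKG_ROOT:-$HOME/rocm-pkg}/${component}"     # assumed root
}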

69
rocm-build/build_hipblaslt.sh Executable file

@@ -0,0 +1,69 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src hipBLASLt
build_hipblaslt() {
echo "Start build"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
fi
cd $COMPONENT_SRC
mkdir -p "$BUILD_DIR" && cd "$BUILD_DIR"
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
rebuild_lapack
fi
if [ -n "$GPU_ARCHS" ]; then
GPU_TARGETS="$GPU_ARCHS"
else
# default "all" covers the supported targets (e.g. gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942)
GPU_TARGETS=all
fi
CXX=$(set_build_variables CXX) \
cmake \
-DAMDGPU_TARGETS=${GPU_TARGETS} \
${LAUNCHER_FLAGS} \
$(rocm_common_cmake_params) \
-DTensile_LOGIC= \
-DTensile_CODE_OBJECT_VERSION=default \
-DTensile_CPU_THREADS= \
-DTensile_LIBRARY_FORMAT=msgpack \
-DBUILD_CLIENTS_SAMPLES=ON \
-DBUILD_CLIENTS_TESTS=ON \
-DBUILD_CLIENTS_BENCHMARKS=ON \
-DCPACK_SET_DESTDIR=OFF \
-DBUILD_ADDRESS_SANITIZER="${ADDRESS_SANITIZER}" \
"$COMPONENT_SRC"
cmake --build "$BUILD_DIR" -- -j${PROC}
cmake --build "$BUILD_DIR" -- install
cmake --build "$BUILD_DIR" -- package
rm -rf _CPack_Packages/ && find . -name '*.o' -delete
mkdir -p $PACKAGE_DIR && cp ${BUILD_DIR}/*.${PKGTYPE} $PACKAGE_DIR
show_build_cache_stats
}
clean_hipblaslt() {
echo "Cleaning hipBLASLt build directory: ${BUILD_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_hipblaslt ;;
outdir) print_output_directory ;;
clean) clean_hipblaslt ;;
*) die "Invalid target $TARGET" ;;
esac
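Like build_hipcub.sh below, this script honors a GPU_ARCHS override from the environment. A hedged usage example; the architecture list is illustrative, and any flags parsed by stage2_command_args are not shown in this diff:
# Restrict the hipBLASLt build to specific GPU targets (list is illustrative).
export GPU_ARCHS="gfx90a:xnack-;gfx942"
./rocm-build/build_hipblaslt.sh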

118
rocm-build/build_hipcc.sh Executable file

@@ -0,0 +1,118 @@
#!/bin/bash
source "$(dirname "${BASH_SOURCE}")/compute_utils.sh"
printUsage() {
echo
echo "Usage: $(basename "${BASH_SOURCE}") [options ...]"
echo
echo "Options:"
echo " -a, --address_sanitizer Enable address sanitizer"
echo " -c, --clean Clean output and delete all intermediate work"
echo " -h, --help Prints this help"
echo " -o, --outdir <pkg_type> Print path of output directory containing packages of
type referred to by pkg_type"
echo " -r, --release Makes a release build"
echo
return 0
}
API_NAME=hipcc
PROJ_NAME=$API_NAME
TARGET="build"
MAKEOPTS="$DASH_JAY"
BUILD_TYPE="Debug"
BUILD_DIR=$(getBuildPath $API_NAME)
PACKAGE_DEB=$(getPackageRoot)/deb/$API_NAME
PACKAGE_RPM=$(getPackageRoot)/rpm/$API_NAME
PACKAGE_SRC="$(getSrcPath)"
while [ "$1" != "" ];
do
case $1 in
(-a | --address_sanitizer)
ack_and_ignore_asan ;;
(-c | --clean)
TARGET="clean" ;;
(-o | --outdir)
TARGET="outdir"; PKGTYPE=$2 ; OUT_DIR_SPECIFIED=1 ; ((CLEAN_OR_OUT|=2)) ; shift 1 ;;
(-r | --release)
BUILD_TYPE="RelWithDebInfo" ;;
(-h | --help)
printUsage ; exit 0 ;;
(*)
echo "Invalid option [$1]" >&2; printUsage; exit 1 ;;
esac
shift 1
done
clean() {
echo "Cleaning hipcc"
rm -rf $BUILD_DIR
echo "Cleaning up hipcc DEB and RPM packages"
rm -rf $PACKAGE_DEB
rm -rf $PACKAGE_RPM
}
copy_hipcc_sources() {
echo "Clean up hipcc build folder"
rm -rf "$PACKAGE_SRC/hipcc"
echo "Copy hipcc sources"
mkdir -p "$PACKAGE_SRC/hipcc"
progressCopy "$HIPCC_ROOT" "$PACKAGE_SRC/hipcc"
}
build() {
echo "Build hipcc binary"
mkdir -p "$BUILD_DIR"
pushd "$BUILD_DIR"
if ! [ -e "$HIPCC_ROOT/CMakeLists.txt" ] ; then
echo "No source for hipcc, exiting. this is not an error" >&2
exit 0
fi
cmake \
$(rocm_cmake_params) \
$(rocm_common_cmake_params) \
-DHIPCC_BACKWARD_COMPATIBILITY=OFF \
-DCMAKE_INSTALL_PREFIX="$OUT_DIR" \
$HIPCC_ROOT
popd
cmake --build "$BUILD_DIR" -- $MAKEOPTS
echo "Installing and Packaging hipcc"
cmake --build "$BUILD_DIR" -- $MAKEOPTS install
cmake --build "$BUILD_DIR" -- $MAKEOPTS package
copy_if DEB "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_DEB" $BUILD_DIR/hipcc*.deb
copy_if RPM "${CPACKGEN:-"DEB;RPM"}" "$PACKAGE_RPM" $BUILD_DIR/hipcc*.rpm
}
print_output_directory() {
case ${PKGTYPE} in
("deb")
echo ${PACKAGE_DEB};;
("rpm")
echo ${PACKAGE_RPM};;
(*)
echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1 ;;
esac
exit
}
case $TARGET in
(clean) clean ;;
(build) build ; copy_hipcc_sources ;;
(outdir) print_output_directory ;;
(*) die "Invalid target $TARGET" ;;
esac
echo "Operation complete"

60
rocm-build/build_hipcub.sh Executable file

@@ -0,0 +1,60 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/compute_helper.sh"
set_component_src hipCUB
build_hipcub() {
echo "Start build"
cd $COMPONENT_SRC
if [ "${ENABLE_ADDRESS_SANITIZER}" == "true" ]; then
set_asan_env_vars
set_address_sanitizer_on
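# Presumably keeps rocm_common_cmake_params from injecting the ASAN CMake
# parameters a second time; the helper that reads this flag is not in this diff.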
ASAN_CMAKE_PARAMS="false"
fi
mkdir -p "$BUILD_DIR" && cd "$BUILD_DIR"
if [ -n "$GPU_ARCHS" ]; then
GPU_TARGETS="$GPU_ARCHS"
else
GPU_TARGETS="gfx908:xnack-;gfx90a:xnack-;gfx90a:xnack+;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101"
fi
CXX=$(set_build_variables CXX) \
cmake \
${LAUNCHER_FLAGS} \
$(rocm_common_cmake_params) \
-DCMAKE_MODULE_PATH="${ROCM_PATH}/lib/cmake/hip;${ROCM_PATH}/hip/cmake" \
-Drocprim_DIR="${ROCM_PATH}/rocprim" \
-DBUILD_TEST=ON \
-DAMDGPU_TARGETS=${GPU_TARGETS} \
"$COMPONENT_SRC"
cmake --build "$BUILD_DIR" -- -j${PROC}
cmake --build "$BUILD_DIR" -- install
cmake --build "$BUILD_DIR" -- package
rm -rf _CPack_Packages/ && find . -name '*.o' -delete
mkdir -p $PACKAGE_DIR && cp ${BUILD_DIR}/*.${PKGTYPE} $PACKAGE_DIR
show_build_cache_stats
}
clean_hipcub() {
echo "Cleaning hipCUB build directory: ${BUILD_DIR} ${PACKAGE_DIR}"
rm -rf "$BUILD_DIR" "$PACKAGE_DIR"
echo "Done!"
}
stage2_command_args "$@"
case $TARGET in
build) build_hipcub ;;
outdir) print_output_directory ;;
clean) clean_hipcub ;;
*) die "Invalid target $TARGET" ;;
esac
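Note that the stage-2 scripts (build_hipblas.sh, build_hipblaslt.sh, build_hipcub.sh) dispatch to print_output_directory without defining it, so it presumably also comes from compute_helper.sh. A sketch consistent with the standalone definition in build_hipcc.sh above, with the single PACKAGE_DIR substituted for the per-format paths:
# Hypothetical helper-provided version; the real one lives in compute_helper.sh.
print_output_directory() {
    case ${PKGTYPE} in
        (deb|rpm)
            echo ${PACKAGE_DIR};;
        (*)
            echo "Invalid package type \"${PKGTYPE}\" provided for -o" >&2; exit 1;;
    esac
    exit
}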

Some files were not shown because too many files have changed in this diff.