ROCm/docs/how-to/system-debugging.md
Sam Wu ad66256e52 Merge develop into roc-6.0.x (#2810)
2024-01-16 10:53:28 -07:00


System debugging guide

ROCm language and system-level debug, flags, and environment variables

Kernel options that keep the Ethernet port from being renamed every time you change graphics cards: net.ifnames=0 biosdevname=0
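As a quick check before rebooting into a new card, the sketch below reports which of the two options are not yet on the kernel command line. The function name `missing_ifname_options` is hypothetical; it reads `/proc/cmdline` unless a string is passed in.

```shell
# Report which of the two naming options are missing from a kernel command line.
# Reads /proc/cmdline by default; pass a string to check something else.
# (missing_ifname_options is a hypothetical helper, not part of ROCm.)
missing_ifname_options() {
    cmdline="${1-$(cat /proc/cmdline)}"
    missing=""
    for opt in net.ifnames=0 biosdevname=0; do
        case " $cmdline " in
            *" $opt "*) ;;                    # option already present
            *) missing="$missing $opt" ;;     # option still needs to be added
        esac
    done
    echo "${missing# }"                       # empty output means both are set
}
```

On GRUB-based distributions the missing options are usually appended to `GRUB_CMDLINE_LINUX` in `/etc/default/grub`, after which the GRUB configuration is regenerated. That workflow is an assumption about the bootloader, not something this guide specifies.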

ROCr error codes

  • 2 Invalid Dimension
  • 4 Invalid Group Memory
  • 8 Invalid (or Null) Code
  • 32 Invalid Format
  • 64 Group is too large
  • 128 Out of VGPRs
  • 0x80000000 Debug Options
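The values in this list are all distinct powers of two, which suggests they can be combined as a bitmask, although the guide does not say so. Under that assumption, a small decoder might look like the sketch below; `decode_rocr_error` is a hypothetical helper name.

```shell
# Decode a ROCr error code into the names from the list above.
# Assumption: the value is a bitmask, so several names may apply at once.
# (decode_rocr_error is a hypothetical helper, not part of ROCm.)
decode_rocr_error() {
    code=$(( $1 ))                  # accepts decimal or 0x-prefixed hex
    out=""
    while read -r bit name; do
        if [ $(( code & $bit )) -ne 0 ]; then
            out="${out:+$out, }$name"
        fi
    done <<'EOF'
2 Invalid Dimension
4 Invalid Group Memory
8 Invalid (or Null) Code
32 Invalid Format
64 Group is too large
128 Out of VGPRs
0x80000000 Debug Options
EOF
    echo "${out:-no known bits set}"
}
```

For example, `decode_rocr_error 192` lists both "Group is too large" and "Out of VGPRs", because 192 is 64 + 128.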

Commands to dump the firmware version and get the Linux kernel version

sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

uname -a
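The `1` in the debugfs path is the DRM device index, which can differ between systems. A minimal sketch that walks every index instead of hard-coding one is shown below; `dump_amdgpu_firmware` is a hypothetical helper, and it has to run from a root shell because debugfs is normally readable only by root.

```shell
# Print amdgpu firmware info for every DRM debugfs node that exposes it.
# The debugfs root is a parameter so the loop can be pointed elsewhere.
# (dump_amdgpu_firmware is a hypothetical helper, not part of ROCm.)
dump_amdgpu_firmware() {
    root="${1:-/sys/kernel/debug/dri}"
    for f in "$root"/*/amdgpu_firmware_info; do
        [ -r "$f" ] || continue     # skip an unmatched glob or unreadable file
        echo "== $f"
        cat "$f"
    done
}
```

Calling `dump_amdgpu_firmware` with no argument uses the default debugfs location and prints nothing if no amdgpu device is found there.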

Debug flags

These debug messages are useful when developing or debugging the base ROCm driver. You can enable printing from libhsakmt.so by setting the HSAKMT_DEBUG_LEVEL environment variable. Available debug levels are 3 through 7; the higher the level, the more messages are printed.

  • export HSAKMT_DEBUG_LEVEL=3 : Only pr_err() prints.

  • export HSAKMT_DEBUG_LEVEL=4 : pr_err() and pr_warn() print.

  • export HSAKMT_DEBUG_LEVEL=5 : “notice” is not currently implemented, so setting 5 is the same as setting 4.

  • export HSAKMT_DEBUG_LEVEL=6 : pr_err(), pr_warn(), and pr_info() print.

  • export HSAKMT_DEBUG_LEVEL=7 : Everything, including pr_debug(), prints.
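Using `export` changes the level for everything started from that shell afterwards. To limit the setting to a single run, a wrapper along these lines could be used; `with_hsakmt_debug` is a hypothetical helper, and only the variable name and the 3-7 range come from this guide.

```shell
# Run one command with libhsakmt debug printing enabled, leaving the
# calling shell's environment untouched.
# Usage: with_hsakmt_debug LEVEL COMMAND [ARGS...]
# (with_hsakmt_debug is a hypothetical helper, not part of ROCm.)
with_hsakmt_debug() {
    level="$1"; shift
    case "$level" in
        3|4|5|6|7) ;;                                   # documented levels
        *) echo "level must be 3-7, got '$level'" >&2; return 2 ;;
    esac
    env HSAKMT_DEBUG_LEVEL="$level" "$@"
}
```

A call such as `with_hsakmt_debug 7 ./my_app 2> hsakmt.log` would capture the messages in a file, assuming they are written to standard error; `./my_app` stands in for your own binary.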

ROCr-level environment variables for debugging

HSA_ENABLE_SDMA=0

HSA_ENABLE_INTERRUPT=0

HSA_SVM_GUARD_PAGES=0

HSA_DISABLE_CACHE=1
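One way to use these variables, offered here as a suggestion rather than a documented procedure, is to apply them one at a time and see which one changes the behavior of a failing program. The sketch below does that and prints the exit status of each run; `try_rocr_settings` is a hypothetical helper name.

```shell
# Run a command once per ROCr debug setting and report each exit status.
# Usage: try_rocr_settings COMMAND [ARGS...]
# (try_rocr_settings is a hypothetical helper, not part of ROCm.)
try_rocr_settings() {
    for setting in HSA_ENABLE_SDMA=0 HSA_ENABLE_INTERRUPT=0 \
                   HSA_SVM_GUARD_PAGES=0 HSA_DISABLE_CACHE=1; do
        status=0
        # Output of the command is discarded; only the exit status is reported.
        env "$setting" "$@" >/dev/null 2>&1 || status=$?
        echo "$setting -> exit status $status"
    done
}
```

If a program fails by default but exits with status 0 under exactly one setting, that setting is a reasonable place to start investigating.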

Turn off page retry on GFX9/Vega devices

sudo -s

echo 1 > /sys/module/amdkfd/parameters/noretry
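The root shell is needed because the redirection, not just `echo`, must run as root. A common alternative that avoids a root shell is `echo 1 | sudo tee /sys/module/amdkfd/parameters/noretry`. The sketch below wraps the write and reads the value back to confirm it; `set_noretry` is a hypothetical helper, and the path argument exists only so the function can be tried on a scratch file.

```shell
# Write a value to the amdkfd noretry parameter and print the value read back.
# Usage (from a root shell): set_noretry [VALUE] [PATH]
# (set_noretry is a hypothetical helper, not part of ROCm.)
set_noretry() {
    value="${1:-1}"
    param="${2:-/sys/module/amdkfd/parameters/noretry}"
    echo "$value" > "$param" || return 1    # fails without write permission
    cat "$param"                            # show what is now stored
}
```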

HIP environment variables 3.x

OpenCL debug flags

AMD_OCL_WAIT_COMMAND=1 (0 = off, 1 = on)

PCIe-debug

For information on how to debug and profile HIP applications, see {doc}hip:how_to_guides/debugging