Files
ROCm/docs/how-to/system-debugging.md
Peter Park d534f755e4 Add metadata to docs (#3688)
* add missing metadata

add metadata to mi300 arch doc

add metadata to contributing guide

add metadata to mi300x tuning guides

* update meta to yaml frontmatter

* update to md metadata to myst frontmatter

* remove extra file

* fix spelling
2025-01-14 08:55:45 -05:00

69 lines
1.8 KiB
Markdown

---
myst:
html_meta:
"description": "Learn more about common system-level debugging measures for ROCm."
"keywords": "env, var, sys, PCIe, troubleshooting, admin, error"
---
# System debugging
## ROCm language and system-level debug, flags, and environment variables
Kernel options to avoid: the Ethernet port getting renamed every time you change graphics cards, `net.ifnames=0 biosdevname=0`
## ROCr error code
* 2 Invalid Dimension
* 4 Invalid Group Memory
* 8 Invalid (or Null) Code
* 32 Invalid Format
* 64 Group is too large
* 128 Out of VGPRs
* 0x80000000 Debug Options
## Command to dump firmware version and get Linux kernel version
`sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info`
`uname -a`
## Debug flags
Debug messages when developing/debugging base ROCm driver. You could enable the printing from `libhsakmt.so` by setting an environment variable, `HSAKMT_DEBUG_LEVEL`. Available debug levels are 3-7. The higher level you set, the more messages will print.
* `export HSAKMT_DEBUG_LEVEL=3` : Only pr_err() prints.
* `export HSAKMT_DEBUG_LEVEL=4` : pr_err() and pr_warn() print.
* `export HSAKMT_DEBUG_LEVEL=5` : We currently do not implement “notice”. Setting to 5 is same as setting to 4.
* `export HSAKMT_DEBUG_LEVEL=6` : pr_err(), pr_warn(), and pr_info print.
* `export HSAKMT_DEBUG_LEVEL=7` : Everything including pr_debug prints.
## ROCr level environment variables for debug
`HSA_ENABLE_SDMA=0`
`HSA_ENABLE_INTERRUPT=0`
`HSA_SVM_GUARD_PAGES=0`
`HSA_DISABLE_CACHE=1`
## Turn off page retry on GFX9/Vega devices
`sudo -s`
`echo 1 > /sys/module/amdkfd/parameters/noretry`
## HIP environment variables 3.x
### OpenCL debug flags
`AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)`
## PCIe-debug
For information on how to debug and profile HIP applications, see {doc}`hip:how-to/debugging`