mirror of https://github.com/ROCm/ROCm.git synced 2026-02-04 03:15:28 -05:00

Files

Peter Park 7b883f3af4 Add MI300X tuning guides (#3448 )

* Add MI300X tuning guides

Add mi300x doc (pandoc conversion)

fix headings

add metadata

move images to shared/

move images to shared/

convert tuning-guides.md to rst using pandoc

add mi300x to tuning-guides.rst landing page

update h1s, toc, and landing page

fix spelling

fix fmt

format code blocks

add tensilelite imgs

fix formatting

fix formatting some more

fix formatting

more formatting

spelling

remove --enforce-eager note

satisfy spellcheck linter

more spelling

add fixes from hongxia

fix env var in D5

add fixes to PyTorch inductor section

fix

fix

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update docs/how-to/tuning-guides/mi300x.rst

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update 'torch_compile_debug' suggestion based on Hongxia's feedback

fix PyTorch inductor env vars

minor formatting fixes

Apply suggestions from code review

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

Update vllm path

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

disable numfig in Sphinx configuration

fix formatting and capitalization

add words to wordlist

update index

update wordlist

update optimizing-triton-kernel

convert cards to table

fix link in index.md

add @lpaoletti's feedback

Add system tuning guide

add images

add system section

add os settings and sys management

remove pcie=noats recommendation

reorg

add blurb to developer section

impr formatting

remove windows os from tuning guides pages in conf.py

add suggestions from review

fix typo and link

remove os windows from relevant pages in conf

mi300x

add suggestions from review

fix toc

fix index links

reorg

update vLLM vars

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

update vLLM vars

Co-authored-by: Hongxia Yang
<62075498+hongxiayang@users.noreply.github.com>

reorganize

add warnings

add text to system tuning

add filler text on index pages

reorg tuning pages

fix links

fix vars

* rm old pages

fix toc

* add suggestions from review

small change

add more suggestions

rewrite intro

* add 'workload tuning philosophy'

* refactor

* fix broken links

* black format conf.py

* simplify cmd and update doc structure

* add higher-level heading for consistency (mi300x.rst)

* add fixes from review

fix url

add fixes

fix formatting

fix fmt

fix hipBLASLt section

change words

fix tensilelite section

fix

fix

fix fmt

* style guide

* fix some formatting

* satisfy spellcheck linter

* update wordlist

* fix bad conflict resolution

2024-07-22 17:24:14 -04:00

1.8 KiB

Raw Blame History

System debugging

ROCm language and system-level debug, flags, and environment variables

Kernel options to avoid: the Ethernet port getting renamed every time you change graphics cards, net.ifnames=0 biosdevname=0

ROCr error code

2 Invalid Dimension
4 Invalid Group Memory
8 Invalid (or Null) Code
32 Invalid Format
64 Group is too large
128 Out of VGPRs
0x80000000 Debug Options

Command to dump firmware version and get Linux kernel version

sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info

uname -a

Debug flags

Debug messages when developing/debugging base ROCm driver. You could enable the printing from libhsakmt.so by setting an environment variable, HSAKMT_DEBUG_LEVEL. Available debug levels are 3-7. The higher level you set, the more messages will print.

export HSAKMT_DEBUG_LEVEL=3 : Only pr_err() prints.
export HSAKMT_DEBUG_LEVEL=4 : pr_err() and pr_warn() print.
export HSAKMT_DEBUG_LEVEL=5 : We currently do not implement “notice”. Setting to 5 is same as setting to 4.
export HSAKMT_DEBUG_LEVEL=6 : pr_err(), pr_warn(), and pr_info print.
export HSAKMT_DEBUG_LEVEL=7 : Everything including pr_debug prints.

ROCr level environment variables for debug

HSA_ENABLE_SDMA=0

HSA_ENABLE_INTERRUPT=0

HSA_SVM_GUARD_PAGES=0

HSA_DISABLE_CACHE=1

Turn off page retry on GFX9/Vega devices

sudo -s

echo 1 > /sys/module/amdkfd/parameters/noretry

HIP environment variables 3.x

OpenCL debug flags

AMD_OCL_WAIT_COMMAND=1 (0 = OFF, 1 = On)

PCIe-debug

For information on how to debug and profile HIP applications, see {doc}hip:how-to/debugging

1.8 KiB Raw Blame History