From c81d0f3b0a3620305b11de8745686c86b060b006 Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Wed, 18 Sep 2024 14:22:14 -0400
Subject: [PATCH] add MAD page

---
 .../model-automation-and-dashboarding.rst | 159 ++++++++++++++++++
 docs/sphinx/_toc.yml.in                   |   1 +
 2 files changed, 160 insertions(+)
 create mode 100644 docs/how-to/rocm-for-ai/model-automation-and-dashboarding.rst

diff --git a/docs/how-to/rocm-for-ai/model-automation-and-dashboarding.rst b/docs/how-to/rocm-for-ai/model-automation-and-dashboarding.rst
new file mode 100644
index 000000000..51f4483fc
--- /dev/null
+++ b/docs/how-to/rocm-for-ai/model-automation-and-dashboarding.rst
@@ -0,0 +1,159 @@
+.. meta::
+   :description: Discover and run deep learning models with AMD MAD -- the Model Automation and Dashboarding tool.
+   :keywords: AI, LLM, machine, dashboarding, zoo
+
+************************
+Running models using MAD
+************************
+
+The AMD Model Automation and Dashboarding (MAD) tool combines an AI model zoo with automated model execution across
+various GPU architectures. It supports performance tracking by maintaining historical performance data and generating
+dashboards for analysis. MAD's source code repository and full documentation are located at
+`<https://github.com/ROCm/MAD>`__.
+
+MAD pulls models from their upstream repositories and tests their performance inside ROCm Docker images. It also serves
+as an index of deep learning models tuned for the best reproducible accuracy and performance with AMD’s ROCm software
+stack running on AMD GPUs.
+
+Use MAD to:
+
+* Try new models.
+
+* Compare performance across patches or architectures.
+
+* Track functionality and performance over time.
+
+Getting started with MAD
+========================
+
+First, set up your host machine with :doc:`ROCm ` by following the detailed
+:doc:`installation instructions ` for Linux-based platforms.
+
+ROCm Docker images
+------------------
+
+You can find ROCm Docker images for PyTorch and TensorFlow on Docker Hub at
+:fab:`docker` `rocm/pytorch <https://hub.docker.com/r/rocm/pytorch>`_ and
+:fab:`docker` `rocm/tensorflow <https://hub.docker.com/r/rocm/tensorflow>`_.
+
+AMD also publishes a unified Docker image at :fab:`docker` `rocm/vllm <https://hub.docker.com/r/rocm/vllm>`_ that
+packages vLLM and PyTorch together for the AMD Instinct™ MI300X accelerator, so you can quickly validate the expected
+inference performance numbers on the MI300X. This Docker image includes:
+
+- ROCm
+
+- vLLM
+
+- PyTorch
+
+- Tuning files (in ``.csv`` format)
+
+See ``__ for more information.
+
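+As a quick illustration (MAD itself builds and runs the containers for you), pulling one of these images and starting an
+interactive container might look like the following sketch. The image tag, mount path, and runtime flags shown here are
+examples only and are not taken from the MAD documentation; adjust them for your system.
+
+.. code-block:: shell
+
+   # Pull the unified vLLM + PyTorch image (substitute a specific tag from Docker Hub if needed).
+   docker pull rocm/vllm
+
+   # Start an interactive container with access to the host's AMD GPUs.
+   docker run -it --rm \
+       --device=/dev/kfd --device=/dev/dri \
+       --group-add video \
+       --ipc=host \
+       --security-opt seccomp=unconfined \
+       -v $(pwd):/workspace \
+       rocm/vllm
+
+The ``--device`` and ``--group-add video`` options expose the host's AMD GPU devices to the container; the bind mount is
+only needed if you want to share files between the host and the container.
+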
+.. _mad-run-locally:
+
+Using MAD to run models locally
+===============================
+
+The following steps describe MAD's basic workflow.
+
+1. Clone the `MAD repository <https://github.com/ROCm/MAD>`_ to a local directory and install the required packages
+   on the host machine. For example:
+
+   .. code-block:: shell
+
+      git clone https://github.com/ROCm/MAD
+      cd MAD
+      pip3 install -r requirements.txt
+
+2. Use the ``tools/run_models.py`` script to run models and collect performance results locally on a Docker host. The
+   available models are defined in ``models.json``.
+
+   ``run_models.py`` is MAD's main command line interface for running models locally. Although the tool has many
+   options, running a single model is straightforward: look up its name or tag in ``models.json`` and pass it to
+   ``run_models.py``. The general form is:
+
+   .. code-block:: shell
+
+      tools/run_models.py [-h] [--tags TAGS] [--model_name MODEL_NAME] [--timeout TIMEOUT] [--live_output] [--clean_docker_cache] [--keep_alive] [--keep_model_dir] [-o OUTPUT] [--log_level LOG_LEVEL]
+
+   See :ref:`mad-run-args` for the list of options and their descriptions.
+
+For each model in ``models.json``, the script:
+
+* Builds the Docker image associated with the model. The image is named
+  ``ci-$(model_name)`` and is not removed after the script completes.
+
+* Starts the Docker container, named ``container_$(model_name)``.
+  The container is automatically stopped and removed when
+  the script exits.
+
+* Clones the model's git ``url`` and runs its ``scripts``.
+
+* Compiles the final ``perf.csv`` and ``perf.html`` reports.
+
+.. _mad-run-args:
+
+Arguments
+---------
+
+--help, -h
+    Show this help message and exit.
+
+--tags TAGS
+    Tags to run (multiple tags can be specified). Overrides ``tags.json``. See :ref:`mad-run-tags`.
+
+--model_name MODEL_NAME
+    Name of the model to run.
+
+--timeout TIMEOUT
+    Timeout, in seconds, for running a model. The default is 7200 (2 hours).
+
+--live_output
+    Print output to STDOUT in real time.
+
+--clean_docker_cache
+    Rebuild the Docker image without using the cache.
+
+--keep_alive
+    Keep the container alive after the application finishes running.
+
+--keep_model_dir
+    Keep the model directory after the application finishes running.
+
+--output, -o OUTPUT
+    Output file for the results.
+
+--log_level LOG_LEVEL
+    Log level for the logger.
+
+.. _mad-run-tags:
+
+Tags
+----
+
+Tags let you select a subset of the models to run. Specify tags either in ``tags.json`` or with the ``--tags`` argument.
+If multiple tags are specified, every model that matches at least one tag is selected.
+
+.. note::
+
+   Each model name in ``models.json`` is automatically a tag that can be used to run that model. Multiple tags can also
+   be passed in comma-separated form.
+
+For example, to run the ``pyt_huggingface_bert`` model, use:
+
+.. code-block:: shell
+
+   python3 tools/run_models.py --tags pyt_huggingface_bert
+
+Or, to run all PyTorch models, use:
+
+.. code-block:: shell
+
+   python3 tools/run_models.py --tags pyt
+
+.. note::
+
+   Learn more about MAD's options by visiting ``__.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 2fa739e0c..0dcebddd7 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -32,6 +32,7 @@ subtrees:
       - file: how-to/rocm-for-ai/train-a-model.rst
       - file: how-to/rocm-for-ai/hugging-face-models.rst
       - file: how-to/rocm-for-ai/deploy-your-model.rst
+      - file: how-to/rocm-for-ai/model-automation-and-dashboarding.rst
   - file: how-to/rocm-for-hpc/index.rst
     title: Using ROCm for HPC
   - file: how-to/llm-fine-tuning-optimization/index.rst