diff --git a/.wordlist.txt b/.wordlist.txt index b9bb76fb4..805c364ee 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -97,6 +97,7 @@ ENDPGM EPYC ESXi EoS +FBGEMM FFT FFTs FFmpeg @@ -110,6 +111,7 @@ Flang Fortran Fuyu GALB +GCC GCD GCDs GCN @@ -175,6 +177,7 @@ Interop Intersphinx Intra Ioffe +Jinja JSON Jupyter KFD @@ -221,6 +224,7 @@ Megatron Mellanox Mellanox's Meta's +Miniconda MirroredStrategy Multicore Multithreaded @@ -620,6 +624,7 @@ performant perl pragma pre +prebuild prebuilt precompiled preconditioner @@ -711,8 +716,10 @@ subexpression subfolder subfolders submodule +submodules supercomputing symlink +symlinks td tensorfloat th diff --git a/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst b/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst index bd8a6e865..38e2f8f5d 100644 --- a/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst +++ b/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst @@ -251,3 +251,287 @@ page describes the options. Learn more about optimizing kernels with TunableOp in :ref:`Optimizing Triton kernels `. + + +FBGEMM and FBGEMM_GPU +===================== + +FBGEMM (Facebook General Matrix Multiplication) is a low-precision, high-performance CPU kernel library +for matrix-matrix multiplications and convolutions. It is used for server-side inference +and as a back end for PyTorch quantized operators. FBGEMM offers optimized on-CPU performance for reduced precision calculations, +strong performance on native tensor formats, and the ability to generate +high-performance shape- and size-specific kernels at runtime. + +FBGEMM_GPU collects several high-performance PyTorch GPU operator libraries +for use in training and inference. It provides efficient table-batched embedding functionality, +data layout transformation, and quantization support. 
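Conceptually, the reduced-precision kernels in FBGEMM operate on int8 values tied to a scale and a zero point. The sketch below illustrates that affine quantization scheme in plain Python; it is not FBGEMM's API, and the ``quantize``/``dequantize`` helper names are hypothetical.

```python
# Illustrative sketch of the affine int8 quantization scheme that
# low-precision GEMM back ends such as FBGEMM build on.
# This is NOT FBGEMM's API; the helper names are hypothetical.

def quantize(values, scale, zero_point):
    """Map floats to int8: q = clamp(round(x / scale) + zero_point, -128, 127)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]

x = [0.0, 0.25, -0.5, 1.0]
qx = quantize(x, scale=0.01, zero_point=0)
print(qx)  # [0, 25, -50, 100]
```

Quantizing weights and activations this way lets the matrix multiplication itself run in int8, which is where FBGEMM's runtime-generated, shape-specific kernels earn their speedup.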
+
+For more information about FBGEMM and FBGEMM_GPU, see the `PyTorch FBGEMM GitHub `_
+and the `PyTorch FBGEMM documentation `_.
+The `Meta blog post about FBGEMM `_ provides
+additional background about the library.
+
+Installing FBGEMM_GPU
+----------------------
+
+Installing FBGEMM_GPU consists of the following steps:
+
+* Set up an isolated Miniconda environment
+* Install ROCm using Docker or the :doc:`package manager `
+* Install the nightly `PyTorch `_ build
+* Complete the prebuild and build tasks
+
+.. note::
+
+   FBGEMM_GPU doesn't require the installation of FBGEMM. To optionally install
+   FBGEMM, see the `FBGEMM install instructions `_.
+
+Set up the Miniconda environment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To install Miniconda, use the following commands.
+
+#. Install a `Miniconda environment `_ for reproducible builds.
+   All subsequent commands run inside this environment.
+
+   .. code-block:: shell
+
+      export PLATFORM_NAME="$(uname -s)-$(uname -m)"
+
+      # Set the Miniconda prefix directory
+      miniconda_prefix=$HOME/miniconda
+
+      # Download the Miniconda installer
+      wget -q "https://repo.anaconda.com/miniconda/Miniconda3-latest-${PLATFORM_NAME}.sh" -O miniconda.sh
+
+      # Run the installer
+      bash miniconda.sh -b -p "$miniconda_prefix" -u
+
+      # Load the shortcuts
+      . ~/.bashrc
+
+      # Run updates
+      conda update -n base -c defaults -y conda
+
+#. Create a Miniconda environment with Python 3.12:
+
+   .. code-block:: shell
+
+      env_name=
+      python_version=3.12
+
+      # Create the environment
+      conda create -y --name ${env_name} python="${python_version}"
+
+      # Upgrade pip and the pyOpenSSL package
+      conda run -n ${env_name} pip install --upgrade pip
+      conda run -n ${env_name} python -m pip install "pyOpenSSL>22.1.0"
+
+#. Install additional build tools:
+
+   .. code-block:: shell
+
+      conda install -n ${env_name} -y \
+          click \
+          cmake \
+          hypothesis \
+          jinja2 \
+          make \
+          ncurses \
+          ninja \
+          numpy \
+          scikit-build \
+          wheel
+
+Install the ROCm components
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+FBGEMM_GPU can run in a ROCm Docker container or in conjunction with a full ROCm installation.
+The Docker method is recommended because it requires fewer steps and provides a stable environment.
+
+To run FBGEMM_GPU in a Docker container, pull the `Minimal Docker image for ROCm `_.
+This image includes all preinstalled ROCm packages required to integrate FBGEMM. To pull
+and run the ROCm Docker image, use this command:
+
+.. code-block:: shell
+
+   # Run for ROCm 6.2.0
+   docker run -it --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri --group-add video \
+       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host rocm/rocm-terminal:6.2 /bin/bash
+
+.. note::
+
+   The `Full Docker image for ROCm `_, which includes all
+   ROCm packages, can also be used. However, it results in a very large container, so the minimal
+   Docker image is recommended.
+
+You can also install ROCm using the package manager. In this case, FBGEMM_GPU requires the full
+ROCm package. For more information, see :doc:`the ROCm installation guide `.
+ROCm also requires the :doc:`MIOpen ` component as a dependency.
+To install MIOpen, use the ``apt install`` command:
+
+.. code-block:: shell
+
+   apt install hipify-clang miopen-hip miopen-hip-dev
+
+Install PyTorch
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Install `PyTorch `_ using ``pip`` for the most reliable and consistent results.
+
+#. Install the nightly PyTorch build using ``pip``:
+
+   .. code-block:: shell
+
+      # Install the latest nightly, ROCm variant
+      conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
+
+#. Ensure PyTorch loads correctly. Verify the version and variant of the installation using an ``import`` test.
+
+   .. code-block:: shell
+
+      # Ensure that the package loads properly
+      conda run -n ${env_name} python -c "import torch.distributed"
+
+      # Verify the version and variant of the installation
+      conda run -n ${env_name} python -c "import torch; print(torch.__version__)"
+
+Perform the prebuild and build
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Clone the FBGEMM repository and the relevant submodules. Use ``pip`` to install the
+   components in ``requirements.txt``. Run the following commands inside the Miniconda environment.
+
+   .. code-block:: shell
+
+      # Select a version tag
+      FBGEMM_VERSION=v0.8.0
+
+      # Clone the repo along with its submodules
+      git clone https://github.com/pytorch/FBGEMM.git --branch=${FBGEMM_VERSION} --recursive fbgemm_${FBGEMM_VERSION}
+
+      # Install additional required packages for building and testing
+      cd fbgemm_${FBGEMM_VERSION}/fbgemm_gpu
+      pip install -r requirements.txt
+
+#. Clear the build cache to remove stale build information.
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      python setup.py clean
+
+#. Set the wheel build variables, including the package name, Python version tag, and Python platform name.
+
+   .. code-block:: shell
+
+      # Set the package name depending on the build variant
+      export package_name=fbgemm_gpu_rocm
+
+      # Set the Python version tag. It should follow the convention `py`,
+      # for example, Python 3.12 --> py312
+      export python_tag=py312
+
+      # Determine the processor architecture
+      export ARCH=$(uname -m)
+
+      # Set the Python platform name for the Linux case
+      export python_plat_name="manylinux2014_${ARCH}"
+
+#. Build FBGEMM_GPU for the ROCm platform. Set ``ROCM_PATH`` to the path to your ROCm installation.
+   Run these commands from the ``fbgemm_gpu/`` directory inside the Miniconda environment.
+
+   .. code-block:: shell
+
+      # !! Run in the fbgemm_gpu/ directory inside the Conda environment !!
+
+      export ROCM_PATH=
+
+      # Build for the target architecture of the ROCm device installed on the machine (for example, 'gfx942;gfx90a')
+      # See the Linux system requirements page (../../reference/system-requirements) for a list of supported GPUs
+      export PYTORCH_ROCM_ARCH=$(${ROCM_PATH}/bin/rocminfo | grep -o -m 1 'gfx.*')
+
+      # Build the wheel artifact only
+      python setup.py bdist_wheel \
+          --package_variant=rocm \
+          --python-tag="${python_tag}" \
+          --plat-name="${python_plat_name}" \
+          -DHIP_ROOT_DIR="${ROCM_PATH}" \
+          -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
+          -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
+
+      # Build and install the library into the Conda environment
+      python setup.py install \
+          --package_variant=rocm \
+          -DHIP_ROOT_DIR="${ROCM_PATH}" \
+          -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
+          -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
+
+Post-build validation
+----------------------
+
+After building FBGEMM_GPU, run some verification checks to ensure the build is correct. Continue
+to run all commands in the ``fbgemm_gpu/`` directory, inside the Miniconda environment.
+
+#. The build process generates many build artifacts and C++ templates, so
+   it is important to confirm that no undefined symbols remain.
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      # Locate the built .SO file
+      fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+
+      # Check that the undefined symbols don't include fbgemm_gpu-defined functions
+      nm -gDCu "${fbgemm_gpu_lib_path}" | sort
+
+#. Verify the referenced version number of ``GLIBCXX`` and the presence of certain function symbols:
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      # Locate the built .SO file
+      fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+
+      # Note the versions of GLIBCXX referenced by the .SO
+      # The libstdc++.so.6 available on the install target must support these versions
+      objdump -TC "${fbgemm_gpu_lib_path}" | grep GLIBCXX | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat
+
+      # Test for the existence of a given function symbol in the .SO
+      nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::merge_pooled_embeddings("
+      nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::jagged_2d_to_dense("
+
+Testing FBGEMM
+----------------------
+
+FBGEMM includes tests and benchmarks to validate correctness and performance. Running the tests
+requires ROCm 5.7 or a more recent version on both the host and the container. To run the tests,
+use these commands:
+
+.. code-block:: shell
+
+   # !! Run inside the Conda environment !!
+
+   # From the fbgemm_gpu/ directory
+   cd test
+
+   export FBGEMM_TEST_WITH_ROCM=1
+   # Enable for debugging failed kernel executions
+   export HIP_LAUNCH_BLOCKING=1
+
+   # Run the test
+   python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py
+
+To run the FBGEMM_GPU ``uvm`` tests, use the following commands. These tests are only supported on
+the AMD Instinct MI210 and more recent accelerators.
+
+.. code-block:: shell
+
+   # Run this inside the Conda environment from the fbgemm_gpu/ directory
+   export HSA_XNACK=1
+   cd test
+
+   python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning ./uvm/uvm_test.py
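
The environment flags used in the test runs above can be gathered in one place. The helper below is an illustrative sketch (the ``fbgemm_test_command`` name is not part of FBGEMM) showing how the environment variables and the ``pytest`` invocation fit together:

```python
import os

# Illustrative helper (NOT part of FBGEMM): assemble the environment and
# pytest command line used by the FBGEMM_GPU test invocations above.
def fbgemm_test_command(test_file, debug_kernels=False, uvm=False):
    env = dict(os.environ)
    env["FBGEMM_TEST_WITH_ROCM"] = "1"
    if debug_kernels:
        # Serialize kernel launches so failures surface at the offending kernel
        env["HIP_LAUNCH_BLOCKING"] = "1"
    if uvm:
        # The uvm tests require XNACK (GPU demand paging) to be enabled
        env["HSA_XNACK"] = "1"
    cmd = ["python", "-m", "pytest", "-v", "-rsx", "-s",
           "-W", "ignore::pytest.PytestCollectionWarning", test_file]
    return cmd, env

cmd, env = fbgemm_test_command("./uvm/uvm_test.py", uvm=True)
print(" ".join(cmd))
```

The returned pair could then be passed to ``subprocess.run(cmd, env=env)`` from the ``fbgemm_gpu/test`` directory; keeping the flags in one helper avoids forgetting ``HSA_XNACK=1`` when switching between the standard and ``uvm`` suites.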