diff --git a/.wordlist.txt b/.wordlist.txt index b9bb76fb4..805c364ee 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -97,6 +97,7 @@ ENDPGM EPYC ESXi EoS +FBGEMM FFT FFTs FFmpeg @@ -110,6 +111,7 @@ Flang Fortran Fuyu GALB +GCC GCD GCDs GCN @@ -175,6 +177,7 @@ Interop Intersphinx Intra Ioffe +Jinja JSON Jupyter KFD @@ -221,6 +224,7 @@ Megatron Mellanox Mellanox's Meta's +Miniconda MirroredStrategy Multicore Multithreaded @@ -620,6 +624,7 @@ performant perl pragma pre +prebuild prebuilt precompiled preconditioner @@ -711,8 +716,10 @@ subexpression subfolder subfolders submodule +submodules supercomputing symlink +symlinks td tensorfloat th diff --git a/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst b/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst index bd8a6e865..38e2f8f5d 100644 --- a/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst +++ b/docs/how-to/llm-fine-tuning-optimization/model-acceleration-libraries.rst @@ -251,3 +251,287 @@ page describes the options. Learn more about optimizing kernels with TunableOp in :ref:`Optimizing Triton kernels `. + + +FBGEMM and FBGEMM_GPU +===================== + +FBGEMM (Facebook General Matrix Multiplication) is a low-precision, high-performance CPU kernel library +for matrix-matrix multiplications and convolutions. It is used for server-side inference +and as a back end for PyTorch quantized operators. FBGEMM offers optimized on-CPU performance for reduced precision calculations, +strong performance on native tensor formats, and the ability to generate +high-performance shape- and size-specific kernels at runtime. + +FBGEMM_GPU collects several high-performance PyTorch GPU operator libraries +for use in training and inference. It provides efficient table-batched embedding functionality, +data layout transformation, and quantization support. 
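Conceptually, the reduced-precision kernels in FBGEMM operate on int8 values tied to a scale and a zero point. The sketch below illustrates that affine quantization scheme in plain Python; it is not FBGEMM's API, and the ``quantize``/``dequantize`` helper names are hypothetical.

```python
# Illustrative sketch of the affine int8 quantization scheme that
# low-precision GEMM back ends such as FBGEMM build on.
# This is NOT FBGEMM's API; the helper names are hypothetical.

def quantize(values, scale, zero_point):
    """Map floats to int8: q = clamp(round(x / scale) + zero_point, -128, 127)."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]

x = [0.0, 0.25, -0.5, 1.0]
qx = quantize(x, scale=0.01, zero_point=0)
print(qx)  # [0, 25, -50, 100]
```

Quantizing weights and activations this way lets the matrix multiplication itself run in int8, which is where FBGEMM's runtime-generated, shape-specific kernels earn their speedup.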
+
+For more information about FBGEMM and FBGEMM_GPU, see the `PyTorch FBGEMM GitHub `_
+and the `PyTorch FBGEMM documentation `_.
+The `Meta blog post about FBGEMM `_ provides
+additional background about the library.
+
+Installing FBGEMM_GPU
+----------------------
+
+Installing FBGEMM_GPU consists of the following steps:
+
+* Set up an isolated Miniconda environment
+* Install ROCm using Docker or the :doc:`package manager `
+* Install the nightly `PyTorch `_ build
+* Complete the prebuild and build tasks
+
+.. note::
+
+   FBGEMM_GPU doesn't require the installation of FBGEMM. To optionally install
+   FBGEMM, see the `FBGEMM install instructions `_.
+
+Set up the Miniconda environment
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To install Miniconda, use the following commands.
+
+#. Install a `Miniconda environment `_ for reproducible builds.
+   All subsequent commands run inside this environment.
+
+   .. code-block:: shell
+
+      export PLATFORM_NAME="$(uname -s)-$(uname -m)"
+
+      # Set the Miniconda prefix directory
+      miniconda_prefix=$HOME/miniconda
+
+      # Download the Miniconda installer
+      wget -q "https://repo.anaconda.com/miniconda/Miniconda3-latest-${PLATFORM_NAME}.sh" -O miniconda.sh
+
+      # Run the installer
+      bash miniconda.sh -b -p "$miniconda_prefix" -u
+
+      # Load the shortcuts
+      . ~/.bashrc
+
+      # Run updates
+      conda update -n base -c defaults -y conda
+
+#. Create a Miniconda environment with Python 3.12:
+
+   .. code-block:: shell
+
+      env_name=
+      python_version=3.12
+
+      # Create the environment
+      conda create -y --name ${env_name} python="${python_version}"
+
+      # Upgrade pip and the pyOpenSSL package
+      conda run -n ${env_name} pip install --upgrade pip
+      conda run -n ${env_name} python -m pip install "pyOpenSSL>22.1.0"
+
+#. Install additional build tools:
+
+   .. code-block:: shell
+
+      conda install -n ${env_name} -y \
+          click \
+          cmake \
+          hypothesis \
+          jinja2 \
+          make \
+          ncurses \
+          ninja \
+          numpy \
+          scikit-build \
+          wheel
+
+Install the ROCm components
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+FBGEMM_GPU can run in a ROCm Docker container or in conjunction with a full ROCm installation.
+The Docker method is recommended because it requires fewer steps and provides a stable environment.
+
+To run FBGEMM_GPU in a Docker container, pull the `Minimal Docker image for ROCm `_.
+This image includes all preinstalled ROCm packages required to integrate FBGEMM. To pull
+and run the ROCm Docker image, use this command:
+
+.. code-block:: shell
+
+   # Run for ROCm 6.2.0
+   docker run -it --network=host --shm-size 16G --device=/dev/kfd --device=/dev/dri --group-add video \
+       --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host rocm/rocm-terminal:6.2 /bin/bash
+
+.. note::
+
+   The `Full Docker image for ROCm `_, which includes all
+   ROCm packages, can also be used. However, it results in a very large container, so the minimal
+   Docker image is recommended.
+
+You can also install ROCm using the package manager. In this case, FBGEMM_GPU requires the full
+ROCm package. For more information, see :doc:`the ROCm installation guide `.
+ROCm also requires the :doc:`MIOpen ` component as a dependency.
+To install MIOpen, use the ``apt install`` command:
+
+.. code-block:: shell
+
+   apt install hipify-clang miopen-hip miopen-hip-dev
+
+Install PyTorch
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Install `PyTorch `_ using ``pip`` for the most reliable and consistent results.
+
+#. Install the nightly PyTorch build using ``pip``:
+
+   .. code-block:: shell
+
+      # Install the latest nightly, ROCm variant
+      conda run -n ${env_name} pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.2/
+
+#. Ensure PyTorch loads correctly. Verify the version and variant of the installation using an ``import`` test.
+
+   .. code-block:: shell
+
+      # Ensure that the package loads properly
+      conda run -n ${env_name} python -c "import torch.distributed"
+
+      # Verify the version and variant of the installation
+      conda run -n ${env_name} python -c "import torch; print(torch.__version__)"
+
+Perform the prebuild and build
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Clone the FBGEMM repository and the relevant submodules. Use ``pip`` to install the
+   components in ``requirements.txt``. Run the following commands inside the Miniconda environment.
+
+   .. code-block:: shell
+
+      # Select a version tag
+      FBGEMM_VERSION=v0.8.0
+
+      # Clone the repo along with its submodules
+      git clone https://github.com/pytorch/FBGEMM.git --branch=${FBGEMM_VERSION} --recursive fbgemm_${FBGEMM_VERSION}
+
+      # Install additional required packages for building and testing
+      cd fbgemm_${FBGEMM_VERSION}/fbgemm_gpu
+      pip install -r requirements.txt
+
+#. Clear the build cache to remove stale build information.
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      python setup.py clean
+
+#. Set the wheel build variables, including the package name, Python version tag, and Python platform name.
+
+   .. code-block:: shell
+
+      # Set the package name depending on the build variant
+      export package_name=fbgemm_gpu_rocm
+
+      # Set the Python version tag. It should follow the convention `py`,
+      # for example, Python 3.12 --> py312
+      export python_tag=py312
+
+      # Determine the processor architecture
+      export ARCH=$(uname -m)
+
+      # Set the Python platform name for the Linux case
+      export python_plat_name="manylinux2014_${ARCH}"
+
+#. Build FBGEMM_GPU for the ROCm platform. Set ``ROCM_PATH`` to the path to your ROCm installation.
+   Run these commands from the ``fbgemm_gpu/`` directory inside the Miniconda environment.
+
+   .. code-block:: shell
+
+      # !! Run in the fbgemm_gpu/ directory inside the Conda environment !!
+
+      export ROCM_PATH=
+
+      # Build for the target architecture of the ROCm device installed on the machine (for example, 'gfx942;gfx90a')
+      # See the Linux system requirements page (../../reference/system-requirements) for a list of supported GPUs
+      export PYTORCH_ROCM_ARCH=$(${ROCM_PATH}/bin/rocminfo | grep -o -m 1 'gfx.*')
+
+      # Build the wheel artifact only
+      python setup.py bdist_wheel \
+          --package_variant=rocm \
+          --python-tag="${python_tag}" \
+          --plat-name="${python_plat_name}" \
+          -DHIP_ROOT_DIR="${ROCM_PATH}" \
+          -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
+          -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
+
+      # Build and install the library into the Conda environment
+      python setup.py install \
+          --package_variant=rocm \
+          -DHIP_ROOT_DIR="${ROCM_PATH}" \
+          -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
+          -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
+
+Post-build validation
+----------------------
+
+After building FBGEMM_GPU, run some verification checks to ensure the build is correct. Continue
+to run all commands in the ``fbgemm_gpu/`` directory, inside the Miniconda environment.
+
+#. The build process generates many build artifacts and C++ templates, so
+   it is important to confirm that no undefined symbols remain.
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      # Locate the built .SO file
+      fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+
+      # Check that the undefined symbols don't include fbgemm_gpu-defined functions
+      nm -gDCu "${fbgemm_gpu_lib_path}" | sort
+
+#. Verify the referenced version number of ``GLIBCXX`` and the presence of certain function symbols:
+
+   .. code-block:: shell
+
+      # !! Run in fbgemm_gpu/ directory inside the Conda environment !!
+
+      # Locate the built .SO file
+      fbgemm_gpu_lib_path=$(find . -name fbgemm_gpu_py.so)
+
+      # Note the versions of GLIBCXX referenced by the .SO
+      # The libstdc++.so.6 available on the install target must support these versions
+      objdump -TC "${fbgemm_gpu_lib_path}" | grep GLIBCXX | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat
+
+      # Test for the existence of a given function symbol in the .SO
+      nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::merge_pooled_embeddings("
+      nm -gDC "${fbgemm_gpu_lib_path}" | grep " fbgemm_gpu::jagged_2d_to_dense("
+
+Testing FBGEMM
+----------------------
+
+FBGEMM includes tests and benchmarks to validate correctness and performance. Running the tests
+requires ROCm 5.7 or a more recent version on both the host and the container. To run the tests,
+use these commands:
+
+.. code-block:: shell
+
+   # !! Run inside the Conda environment !!
+
+   # From the fbgemm_gpu/ directory
+   cd test
+
+   export FBGEMM_TEST_WITH_ROCM=1
+   # Enable for debugging failed kernel executions
+   export HIP_LAUNCH_BLOCKING=1
+
+   # Run the test
+   python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning split_table_batched_embeddings_test.py
+
+To run the FBGEMM_GPU ``uvm`` tests, use the following commands. These tests are only supported on
+the AMD Instinct MI210 and more recent accelerators.
+
+.. code-block:: shell
+
+   # Run this inside the Conda environment from the fbgemm_gpu/ directory
+   export HSA_XNACK=1
+   cd test
+
+   python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning ./uvm/uvm_test.py
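
The environment flags used in the test runs above can be gathered in one place. The helper below is an illustrative sketch (the ``fbgemm_test_command`` name is not part of FBGEMM) showing how the environment variables and the ``pytest`` invocation fit together:

```python
import os

# Illustrative helper (NOT part of FBGEMM): assemble the environment and
# pytest command line used by the FBGEMM_GPU test invocations above.
def fbgemm_test_command(test_file, debug_kernels=False, uvm=False):
    env = dict(os.environ)
    env["FBGEMM_TEST_WITH_ROCM"] = "1"
    if debug_kernels:
        # Serialize kernel launches so failures surface at the offending kernel
        env["HIP_LAUNCH_BLOCKING"] = "1"
    if uvm:
        # The uvm tests require XNACK (GPU demand paging) to be enabled
        env["HSA_XNACK"] = "1"
    cmd = ["python", "-m", "pytest", "-v", "-rsx", "-s",
           "-W", "ignore::pytest.PytestCollectionWarning", test_file]
    return cmd, env

cmd, env = fbgemm_test_command("./uvm/uvm_test.py", uvm=True)
print(" ".join(cmd))
```

The returned pair could then be passed to ``subprocess.run(cmd, env=env)`` from the ``fbgemm_gpu/test`` directory; keeping the flags in one helper avoids forgetting ``HSA_XNACK=1`` when switching between the standard and ``uvm`` suites.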