### Enhanced performance tuning on AMD Instinct accelerators
ROCm is pre-tuned for high-performance computing workloads including large language models, generative AI, and scientific computing.
The ROCm documentation provides comprehensive guidance on configuring your system for AMD Instinct accelerators. It includes
detailed instructions on system settings and application tuning suggestions to help you fully leverage the capabilities of these
accelerators for optimal performance. For more information, see the AMD Instinct system optimization and workload tuning guides in the ROCm documentation.
#### Memory savings for bitsandbytes model quantization
ROCm 6.2.0 introduces the following bitsandbytes changes:
- `Int8` matrix multiplication is enabled, and it includes the following functions:
  - `extract-outliers` – extracts rows and columns that have outliers in the inputs. They’re later used for matrix multiplication without quantization.
  - `transform` – row-to-column and column-to-row transformations are enabled, along with transpose operations. These are used before and after `matmul` computation.
  - `igemmlt` – new function for GEMM computation A*B^T. It uses
    [hipblasLtMatMul](https://rocm.docs.amd.com/projects/hipBLASLt/en/docs-6.2.0/api-reference.html#hipblasltmatmul) and performs 8-bit GEMM operations.
  - `dequant_mm` – dequantizes output matrix to original data type using scaling factors from vector-wise quantization.
- Blockwise quantization – input tensors are quantized for a fixed block size.
- 4-bit quantization and dequantization functions – normalized `Float4` quantization, quantile estimation, and quantile quantization functions are enabled (see the usage sketch at the end of this subsection).
These functions are included in bitsandbytes. They are not part of ROCm. However, they rely on ROCm features to run them.
For more information, see [Model quantization techniques](https://rocm.docs.amd.com/en/docs-6.2.0/how-to/llm-fine-tuning-optimization/model-quantization.html).
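To make the list above concrete, here is a minimal sketch, not taken from the release notes, of how these features are typically exercised from Python. The checkpoint name, block size, and threshold value are illustrative assumptions, and the calls shown are standard bitsandbytes and Hugging Face Transformers entry points rather than ROCm-specific APIs.

```python
# Minimal sketch: exercising bitsandbytes blockwise and Int8 quantization.
# Model name and parameter values are illustrative, not from the release notes.
import torch
import bitsandbytes.functional as F
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Blockwise quantization: the input tensor is quantized per fixed-size block.
x = torch.randn(4096, device="cuda")             # ROCm GPUs appear as "cuda"
q, state = F.quantize_blockwise(x, blocksize=256)
x_restored = F.dequantize_blockwise(q, state)    # back to the original dtype

# Int8 matrix multiplication (extract-outliers, transform, igemmlt, and
# dequant_mm run under the hood) via an 8-bit model load through Transformers.
cfg = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",                         # placeholder checkpoint
    quantization_config=cfg,
    device_map="auto",
)
```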
#### Improved vLLM support
ROCm 6.2.0 enhances vLLM support for inference on AMD Instinct accelerators, adding capabilities for `FP16`/`BF16` precision for LLMs, and `FP8` support for Llama.
ROCm 6.2.0 adds support for the following vLLM features:
- MP: Multi-GPU execution. Choose between MP and Ray using a flag. To set it to MP,
  use `--distributed-executor-backend=mp` (see the sketch below). The default depends on the commit, as this behavior is still in flux.
- FP8 KV cache: Enhances computational efficiency and performance by significantly reducing memory usage and bandwidth requirements.
  The QUARK quantizer currently only supports Llama.
- Triton Flash Attention.
- PyTorch TunableOp: Improved optimization and tuning of GEMMs. It requires Docker with PyTorch 2.3 or later.
For more information about enabling these features, see the vLLM inference documentation in the ROCm docs. A Docker image built from the `vllm/Dockerfile.rocm` file covers all the accessible features.
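As a hedged illustration of how these features map onto vLLM's user-facing options, the sketch below assumes a recent vLLM build with ROCm support. The model name is a placeholder, and the argument names (`distributed_executor_backend`, `kv_cache_dtype`) and the `PYTORCH_TUNABLEOP_ENABLED` environment variable should be verified against your installed vLLM and PyTorch versions.

```python
# Illustrative sketch: engine arguments corresponding to the features above.
# The checkpoint is a placeholder; verify argument names against your vLLM build.
import os

# PyTorch TunableOp: enable GEMM tuning (read by PyTorch 2.3+ at startup).
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",      # placeholder checkpoint
    tensor_parallel_size=2,                # multi-GPU execution
    distributed_executor_backend="mp",     # MP instead of Ray
    kv_cache_dtype="fp8",                  # FP8 KV cache
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```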
### ROCm known issues

ROCm known issues are noted on {fab}`github` [GitHub](https://github.com/ROCm/ROCm/labels/Verified%20Issue). For known
issues related to individual components, review the [Detailed component changes](detailed-component-changes).
### Default processor affinity behavior for helper threads
Processor affinity is a critical setting to ensure that ROCm helper threads run on the correct cores.
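As a brief illustration of what processor affinity controls (this snippet is not a ROCm API), the standard Linux scheduling calls can inspect and restrict the cores a process, and therefore the helper threads it spawns, may run on:

```python
# Hypothetical illustration of processor affinity on Linux (not a ROCm API).
import os

allowed = os.sched_getaffinity(0)          # cores this process may run on
print(f"allowed cores: {sorted(allowed)}")

# Threads created after this call inherit the restricted affinity mask.
os.sched_setaffinity(0, set(sorted(allowed)[:4]))
```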
### `rocm-llvm-alt`

Users who rely on functionality provided by the closed-source compiler should transition to the open-source compiler.
Once the `rocm-llvm-alt` package is removed, any compilation requesting functionality provided by
the closed-source compiler will result in a Clang warning: "*[AMD] proprietary optimization compiler
has been removed*".