feat(gpu): update GPU documentation

2026-01-06 21:34:05 -05:00 · 2025-06-16 18:17:34 +02:00
parent b6c21ef1fe
commit 97ce0f6ecf
12 changed files with 291 additions and 182 deletions
--- a/tfhe/docs/SUMMARY.md
+++ b/tfhe/docs/SUMMARY.md
@@ -58,17 +58,20 @@
  * [Generic trait bounds](fhe-computation/tooling/trait_bounds.md)
  * [Debugging](fhe-computation/tooling/debug.md)

-## Configuration
-
-* [Advanced Rust setup](configuration/rust_configuration.md)
+## Hardware acceleration
 * [GPU acceleration](configuration/gpu_acceleration/run_on_gpu.md)
+  * [A simple example](configuration/gpu_acceleration/simple_example.md) 
  * [Operations](configuration/gpu_acceleration/gpu_operations.md)
-  * [Benchmark](configuration/gpu_acceleration/benchmark.md)
  * [Compressing ciphertexts](configuration/gpu_acceleration/compressing_ciphertexts.md)
  * [Array types](configuration/gpu_acceleration/array_type.md)
+  * [ZK-POKs](configuration/gpu_acceleration/zk-pok.md)
  * [Multi-GPU support](configuration/gpu_acceleration/multi_gpu.md)
 * [HPU acceleration](configuration/hpu_acceleration/run_on_hpu.md)
  * [Benchmark](configuration/hpu_acceleration/benchmark.md)
+
+## Configuration
+
+* [Advanced Rust setup](configuration/rust_configuration.md)
 * [Parallelized PBS](configuration/parallelized_pbs.md)

 ## Integration
--- a/tfhe/docs/configuration/gpu_acceleration/array_type.md
+++ b/tfhe/docs/configuration/gpu_acceleration/array_type.md
@@ -1,7 +1,13 @@
 # Array types
 This document explains how to use array types on GPU, just as [on CPU](../../fhe-computation/types/array.md).

-Here is an example:
+Array types perform array and tensor operations on encrypted data, encapsulating the logic for iterating over array elements and for handling array shapes.
+
+## API elements discussed in this document
+
+- [`GpuFheUint32Array`](https://docs.rs/tfhe/latest/tfhe/array/type.GpuFheUint32Array.html): an n-dimensional array of encrypted 32-bit unsigned integers. Variants are available for all supported integer types and booleans.
+
+## Array types example

 ```rust
 use tfhe::{ConfigBuilder, set_server_key, ClearArray, ClientKey, CompressedServerKey};
--- a/tfhe/docs/configuration/gpu_acceleration/benchmark.md
+++ b/tfhe/docs/configuration/gpu_acceleration/benchmark.md
@@ -1,7 +0,0 @@
-# Benchmarks
-
-Please refer to the [GPU benchmarks](../../getting_started/benchmarks/gpu/README.md) for detailed performance benchmark results.
-
-{% hint style="warning" %}
-When measuring GPU times on your own on Linux, set the environment variable `CUDA_MODULE_LOADING=EAGER` to avoid CUDA API overheads during the first kernel execution.
-{% endhint %}
--- a/tfhe/docs/configuration/gpu_acceleration/compressing_ciphertexts.md
+++ b/tfhe/docs/configuration/gpu_acceleration/compressing_ciphertexts.md
@@ -1,13 +1,29 @@
-# Compressing ciphertexts
+# Compressing ciphertexts on the GPU

-This document explains how to compress ciphertexts using the GPU - even after homomorphic computations - just like on the [CPU](../../fhe-computation/data-handling/compress.md#compression-ciphertexts-after-some-homomorphic-computation).
+This document explains how to compress ciphertexts using the GPU. Compression can be applied on freshly encrypted ciphertexts or on ciphertexts that are the result of FHE integer operations. The syntax for ciphertext compression is identical to the one for the [CPU backend](../../fhe-computation/data-handling/compress.md#compression-ciphertexts-after-some-homomorphic-computation), but cryptographic parameters specific to the GPU must be configured for compression.

-Compressing ciphertexts after computation using GPU is very similar to how it's done on the CPU. The following example shows how to compress and decompress a list containing 4 messages:
+## API elements discussed in this document

-* One 32-bits integer
-* One 64-bit integer
-* One Boolean
-* One 2-bit integer
+- [tfhe::shortint::parameters](https://docs.rs/tfhe/latest/tfhe/shortint/parameters/index.html): this module provides the structure containing the cryptographic parameters required for the homomorphic evaluation of integer circuits as well as a list of secure cryptographic parameter sets.
+- [tfhe::ConfigBuilder::with_custom_parameters](https://docs.rs/tfhe/latest/tfhe/struct.ConfigBuilder.html#method.with_custom_parameters): initializes a configuration builder with a user-specified parameter set
+- [tfhe::ConfigBuilder::enable_compression](https://docs.rs/tfhe/latest/tfhe/struct.ConfigBuilder.html#method.enable_compression): enables the compression feature in the configuration builder 
+
+## Cryptographic parameter setting
+
+When using compression, the [`ConfigBuilder`](https://docs.rs/tfhe/latest/tfhe/struct.ConfigBuilder.html) must be configured with a call to `enable_compression`. The caller must therefore set both the PBS cryptographic parameters and the compression cryptographic parameters.
+
+The [`PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS`](https://docs.rs/tfhe/latest/tfhe/shortint/parameters/aliases/constant.PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS.html) parameter set corresponds to the default PBS parameter set instantiated by `ConfigBuilder::default()` when the `"gpu"` feature is enabled. The [`COMP_PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS`](https://docs.rs/tfhe/latest/tfhe/shortint/parameters/aliases/constant.COMP_PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS.html) parameters are the corresponding compression cryptographic parameters.
+
+```Rust
+    let config =
+        tfhe::ConfigBuilder::with_custom_parameters(PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS)
+            .enable_compression(COMP_PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS)
+            .build();
+```
+
+## GPU compression example
+
+The following example shows how to compress and decompress a list containing 4 messages: one 32-bit integer, one 64-bit integer, one Boolean, and one 2-bit integer.

 ```rust
 use tfhe::prelude::*;
--- a/tfhe/docs/configuration/gpu_acceleration/gpu_operations.md
+++ b/tfhe/docs/configuration/gpu_acceleration/gpu_operations.md
@@ -3,36 +3,37 @@ This document outlines the GPU operations supported in TFHE-rs.

 The GPU backend includes the following operations for both signed and unsigned encrypted integers:

-| name                               | symbol                | `Enc`/`Enc`          | `Enc`/ `Int`               |
-|------------------------------------|-----------------------|----------------------|----------------------------|
-| Neg                                | `-`                   | :heavy\_check\_mark: | N/A                        |
-| Add                                | `+`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Sub                                | `-`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Mul                                | `*`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Div                                | `/`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Rem                                | `%`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Not                                | `!`                   | :heavy\_check\_mark: | N/A                        |
-| BitAnd                             | `&`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| BitOr                              | `\|`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| BitXor                             | `^`                   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Shr                                | `>>`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Shl                                | `<<`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Rotate right                       | `rotate_right`        | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Rotate left                        | `rotate_left`         | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Min                                | `min`                 | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Max                                | `max`                 | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Greater than                       | `gt`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Greater or equal than              | `ge`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Lower than                         | `lt`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Lower or equal than                | `le`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Equal                              | `eq`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Not Equal                          | `ne`                  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
-| Cast (into dest type)              | `cast_into`           | :heavy\_check\_mark: | N/A                        |
-| Cast (from src type)               | `cast_from`           | :heavy\_check\_mark: | N/A                        |
-| Ternary operator                   | `select`              | :heavy\_check\_mark: | :heavy\_multiplication\_x: |
-| Integer logarithm                  | `ilog2`               | :heavy\_check\_mark: | N/A                        |
-| Count trailing/leading zeros/ones  | `count_leading_zeros` | :heavy\_check\_mark: | N/A                        |
-| Oblivious Pseudo Random Generation | `oprf`                | :heavy\_check\_mark: | N/A                        |
+| name                                                                                                                              | symbol          | `Enc`/`Enc`          | `Enc`/ `Int`               |
+|-----------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------|----------------------------|
+| [Neg](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.neg-1)                                                           | `-`             | :heavy\_check\_mark: | N/A                        |
+| [Add](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.add-1)                                                           | `+`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Sub](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.sub-1)                                                           | `-`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Mul](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.mul-1)                                                           | `*`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Div](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.div-1)                                                           | `/`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Rem](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.rem-1)                                                           | `%`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Not](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.not-1)                                                           | `!`             | :heavy\_check\_mark: | N/A                        |
+| [BitAnd](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.bitand-1)                                                     | `&`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [BitOr](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.bitor-1)                                                       | `\|`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [BitXor](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.bitxor-1)                                                     | `^`             | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Shr](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.shr-1)                                                           | `>>`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Shl](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.shl-1)                                                           | `<<`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Rotate right](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.rotate_right-3)                                         | `rotate_right`  | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Rotate left](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.rotate_left-3)                                           | `rotate_left`   | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Min](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.min-1)                                                           | `min`           | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Max](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.max-1)                                                           | `max`           | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Greater than](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.gt-2)                                                   | `gt`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Greater or equal than](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.ge-2)                                          | `ge`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Lower than](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.lt-2)                                                     | `lt`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Lower or equal than](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.le-2)                                            | `le`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Equal](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.eq-2)                                                          | `eq`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Not Equal](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.ne-2)                                                      | `ne`            | :heavy\_check\_mark: | :heavy\_check\_mark:       |
+| [Cast (into dest type)](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.cast_into)                                     | `cast_into`     | :heavy\_check\_mark: | N/A                        |
+| [Cast (from src type)](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.cast_from-2)                                    | `cast_from`     | :heavy\_check\_mark: | N/A                        |
+| [Ternary operator](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.select)                                             | `select`        | :heavy\_check\_mark: | :heavy\_multiplication\_x: |
+| [Integer logarithm](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.ilog2)                                             | `ilog2`         | :heavy\_check\_mark: | N/A                        |
+| [Count trailing/leading ones](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.leading_ones)                            | `leading_ones`  | :heavy\_check\_mark: | N/A                        |
+| [Count trailing/leading zeros](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.leading_zeros)                          | `leading_zeros` | :heavy\_check\_mark: | N/A                        |
+| [Oblivious Pseudo Random Generation](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.generate_oblivious_pseudo_random) | `oprf`          | :heavy\_check\_mark: | N/A                        |

 {% hint style="info" %}
 All operations follow the same syntax as the one described in [here](../../fhe-computation/operations/README.md).
--- a/tfhe/docs/configuration/gpu_acceleration/multi_gpu.md
+++ b/tfhe/docs/configuration/gpu_acceleration/multi_gpu.md
@@ -1,31 +1,34 @@
 # Multi-GPU support
-This guide explains the multi GPU support of TFHE-rs, and walks through a practical example of performing a large batch of encrypted 64-bit additions using manual GPU 
+This guide explains the multi-GPU support of TFHE-rs, and walks through a practical example of performing a large batch of encrypted 64-bit additions using manual GPU 
 dispatching to improve the performance.

-## Multi-GPU support overview
+## Multi-GPU programming model

-TFHE-rs supports platforms with multiple GPUs. There is **nothing to change in the code to execute on such platforms**. To keep the API as user-friendly as possible, the configuration is automatically set, i.e., the user has no fine-grained control over the number of GPUs to be used.
-However, you can decide to have operations be executed on a single GPU of your choice.
-In many cases this provides better throughput than using all the available GPUs to perform the operation.
-Indeed, except for integer precisions above 64 bits and for the multiplication, which involves many bootstrap computations in parallel, most operations on up to 64 bits do not necessitate the full power of a GPU.
-You will then be able to maximize throughput on multiple GPUs with TFHE-rs.
+TFHE-rs supports platforms with multiple GPUs. By default, when decompressing a server key with the [`decompress_to_gpu`](https://docs.rs/tfhe/latest/tfhe/struct.CompressedServerKey.html#method.decompress_to_gpu) function, TFHE-rs will assign all available GPUs to the server key. TFHE-rs uses all GPUs assigned to the current server key when executing operations. Depending on the type and number of available GPUs, this automatic mechanism may not achieve optimal throughput.  

-## Improving throughput on multiple-GPUs
+Most integer operations have low GPU intensity: they use few GPU cores and may not fully use the resources of a single GPU. For these operations, manually scheduling work on one or several GPUs, so that several operations are processed in parallel, improves throughput.

-By default, when multiple GPUs are available on the machine, TFHE-rs automatically uses them all
-to perform encrypted operations. Under the hood, it includes a hard-coded logic to dispatch work across all the GPUs and to copy essential data—like the server key—to each GPU.
-This approach is efficient for operations that load the GPU extensively (e.g. the 64-bit multiplication),
-but not so much for smaller operations like the encrypted addition or comparison on 64-bits.
-To address this, TFHE-rs also provides a mechanism to manually select which GPU to operate on.
+Other types of operations run optimally over several GPUs without manual scheduling but may benefit from manual scheduling on different GPUs when more than 4 GPUs are available:
+- operations on operands of 64 bits or more
+- multiplication of operands of 8 bits or more

-### Dispatch operations on the GPUs of your choice
+To improve throughput by increasing GPU core utilization on all available GPUs, you can:
+- optimize the number of GPUs assigned to a decompressed server key using the [`decompress_to_specific_gpu`](https://docs.rs/tfhe/latest/tfhe/struct.CompressedServerKey.html#method.decompress_to_specific_gpu) function.
+- execute several operations in parallel on the same GPU
+
+## API elements discussed in this document
+
+- [`tfhe::CompressedServerKey::decompress_to_specific_gpu`](https://docs.rs/tfhe/latest/tfhe/struct.CompressedServerKey.html#method.decompress_to_specific_gpu): decompresses a server key to one or multiple GPUs
+- [`tfhe::set_server_key`](https://docs.rs/tfhe/latest/tfhe/fn.set_server_key.html): sets the current server key. When this is a GPU key, this function activates execution of integer operations on all GPUs assigned to this key. Moreover, this function will create a new CUDA stream on the current CPU thread.
+
+## Multi-GPU operation scheduling example

 When selecting a specific GPU to execute on, there are two essential requirements that are different from a default GPU execution:
- You must create a GPU server key on each GPU individually.
+- You must create a GPU server key on each GPU, or subset of GPUs, individually.
 - The batch of operations must be distributed on all the GPUs manually.
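
As a minimal sketch of the second requirement, the helper below splits a batch of operations into one contiguous index range per GPU. It uses only the Rust standard library. `ranges_per_gpu` is a hypothetical name, not part of the TFHE-rs API, and a real program would use each range to pick the ciphertexts processed with the server key of the corresponding GPU.

```rust
use std::ops::Range;

/// Splits `batch_len` operation indices into one contiguous range per GPU.
/// Hypothetical helper for illustration; it is not part of TFHE-rs.
fn ranges_per_gpu(batch_len: usize, num_gpus: usize) -> Vec<Range<usize>> {
    assert!(num_gpus > 0, "at least one GPU is required");
    let base = batch_len / num_gpus; // minimum number of operations per GPU
    let extra = batch_len % num_gpus; // the first `extra` GPUs take one more
    let mut start = 0;
    (0..num_gpus)
        .map(|gpu| {
            let len = base + usize::from(gpu < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}

fn main() {
    // Distribute a batch of 10 operations over 4 GPUs.
    for (gpu, range) in ranges_per_gpu(10, 4).into_iter().enumerate() {
        println!("GPU {gpu}: operations {range:?}");
    }
}
```

Contiguous ranges keep the per-GPU slices of the input vectors simple to borrow; a round-robin assignment would balance the load equally well.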

 #### Step 1: Decompress the server key to each GPU
-Instead of a single server key being used across all GPUs automatically, you’ll need specifically decompress the server key to each GPU, so that the key is available in memory.
+Instead of a single server key being used across all GPUs automatically, you’ll need to decompress the server key to each GPU, so that the key is available in memory.
 For example, by default, the GPU server key is decompressed and loaded onto all available GPUs automatically as follows:
 ```rust
 use tfhe::{ConfigBuilder, set_server_key, ClientKey, CompressedServerKey};
--- a/tfhe/docs/configuration/gpu_acceleration/run_on_gpu.md
+++ b/tfhe/docs/configuration/gpu_acceleration/run_on_gpu.md
@@ -1,19 +1,75 @@
 # GPU acceleration

-This guide explains how to update your existing program to leverage GPU acceleration, or to start a new program using GPU.
+**TFHE-rs** has a CUDA GPU backend that performs integer arithmetic operations on encrypted data faster than the default CPU backend. This guide explains how to update your existing program to leverage GPU acceleration, or to start a new program using GPU.

-**TFHE-rs** now supports a GPU backend with CUDA implementation, enabling integer arithmetic operations on encrypted data.
+To explore a simple code example, go to:
+{% content-ref url="./simple_example.md" %} A simple TFHE-rs GPU example {% endcontent-ref %}

-## Prerequisites
+## FHE performance on GPU
+
+The GPU backend is **up to 4.2x faster** than the CPU one. For a comparison between CPU and GPU latencies, see the following page.
+{% content-ref url="../../getting_started/benchmarks/README.md" %} GPU vs CPU benchmarks {% endcontent-ref %}
+
+Different integer operations obtain different speedups. Refer to the [detailed GPU benchmarks of FHE operations](../../getting_started/benchmarks/gpu/README.md) for per-operation figures.
+
+{% hint style="warning" %}
+To reproduce TFHE-rs GPU benchmarks, see [this dedicated page](../../getting_started/benchmarks/gpu/gpu_programmable_bootstrapping.md). To obtain the best performance when running benchmarks, set the environment variable `CUDA_MODULE_LOADING=EAGER` to avoid CUDA API overheads during the first kernel execution. Bear in mind that GPU warmup is necessary before doing performance measurements.
+{% endhint %}
+
+## GPU TFHE-rs features
+
+By default, the GPU backend uses specific cryptographic parameters. When calling the [`tfhe::ConfigBuilder::default()`](https://doc.rust-lang.org/nightly/core/default/trait.Default.html#tymethod.default) function, the cryptographic parameters for PBS will be:
+- PBS parameters: [`PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS`](https://docs.rs/tfhe/latest/tfhe/shortint/parameters/aliases/constant.PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS.html)
+
+These PBS parameters are accompanied by the following compression parameters: 
+- Compression parameters: [`COMP_PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS`](https://docs.rs/tfhe/latest/tfhe/shortint/parameters/aliases/constant.COMP_PARAM_GPU_MULTI_BIT_GROUP_4_MESSAGE_2_CARRY_2_KS_PBS.html)
+
+TFHE-rs uses dedicated parameters for the GPU in order to achieve optimal performance, and the CPU and GPU parameters cannot be mixed to perform computation and compression for security reasons.
+
+The GPU backend is designed to speed up server-side FHE operations and supports the following TFHE-rs features:
+
+- [FHE ciphertext operations](./gpu_operations.md)
+- [Ciphertext compression](./compressing_ciphertexts.md)
+- [Ciphertext arrays](array_type.md)
+- [ZK-POK proof expansion](zk-pok.md)
+- [Noise Squashing](https://docs.rs/tfhe/latest/tfhe/struct.FheInt.html#method.squash_noise)
+- [Multi-GPU for throughput optimization](./multi_gpu.md) 
+
+The following features are not supported:
+
+- Key generation
+- Encryption/decryption
+- ZK-POK proof generation and verification
+- Encrypted strings and operations on encrypted strings
+
+## GPU programming model
+
+The GPU TFHE-rs integer API is mostly identical to the CPU API: both integer datatypes and operations syntax are the same. However, some GPU-specific design principles must be considered:
+* Key generation, encryption, and decryption are performed on the CPU. When used in operations, ciphertexts are automatically copied to or from the first GPU that the user configures for TFHE-rs.
+* GPU syntax for integer FHE operations, key generation, and serialization is identical to the equivalent CPU code.
+* When configured to compile for the GPU, TFHE-rs uses GPU-specific cryptographic parameters that give high performance on the GPU. Ciphertexts and server keys that are generated with CPU parameters can be processed with GPU-enabled TFHE-rs, but performance is considerably degraded.
+* Each server key instance is assigned to a set of GPUs, which are automatically used in parallel. To set the active GPUs for a CPU thread, activate the server key assigned to the GPUs you want to use.
+* GPU integer operations are synchronous to the calling thread. To execute in parallel on several GPUs, use Rust parallel constructs such as `par_iter`.
+
+The key differences between the CPU API and the GPU API are:
+* The GPU backend only supports compressed server keys that must be decompressed on a GPU selected by the user.
+* For ciphertext compression the cryptographic parameters must be chosen by the user from the GPU parameter set.
+* For ciphertext arrays, GPU-specific ciphertext array types must be used instead of CPU ones.
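
Because GPU operations are synchronous to the calling thread, running on several GPUs at once needs one CPU thread per device. The sketch below illustrates that pattern with standard-library scoped threads instead of `par_iter`. It is not TFHE-rs code: `process_on_device` and `run_batch` are hypothetical names, and plain `u64` additions stand in for the step where a real program would activate the server key assigned to the device and run FHE operations.

```rust
use std::thread;

/// Stand-in for the per-device work (hypothetical, not a TFHE-rs function).
/// A real program would activate the server key assigned to `device` on this
/// thread and then run FHE operations on its share of the ciphertexts.
fn process_on_device(device: usize, inputs: &[(u64, u64)]) -> Vec<u64> {
    let _ = device; // unused here: clear-text additions need no device
    inputs.iter().map(|(a, b)| a.wrapping_add(*b)).collect()
}

/// Runs the batch with one CPU thread per device and returns the results
/// in input order.
fn run_batch(inputs: &[(u64, u64)], num_devices: usize) -> Vec<u64> {
    assert!(num_devices > 0, "at least one device is required");
    // Ceiling division, so that at most `num_devices` chunks are produced.
    let chunk_len = inputs.len().div_ceil(num_devices).max(1);
    thread::scope(|s| {
        // Spawn every worker first, then join them in order.
        let handles: Vec<_> = inputs
            .chunks(chunk_len)
            .enumerate()
            .map(|(device, chunk)| s.spawn(move || process_on_device(device, chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let inputs = [(1, 2), (3, 4), (5, 6)];
    println!("{:?}", run_batch(&inputs, 2));
}
```

Each worker blocks only its own thread, so the devices proceed independently while the caller still receives the results in input order.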
+
+## Project configuration
+
+### 1. Prerequisites
+
+To compile and execute GPU TFHE-rs programs, make sure your system has the following software installed.

 * Cuda version >= 10
 * Compute Capability >= 3.0
 * [gcc](https://gcc.gnu.org/) >= 8.0 - check this [page](https://gist.github.com/ax3l/9489132) for more details about nvcc/gcc compatible versions
 * [cmake](https://cmake.org/) >= 3.24
 * libclang, to match Rust bindgen [requirements](https://rust-lang.github.io/rust-bindgen/requirements.html) >= 9.0
-* Rust version - check this [page](../rust_configuration.md)
+* Rust version - see this [page](../rust_configuration.md)

-## Importing to your project
+### 2. Import GPU-enabled TFHE-rs

 To use the **TFHE-rs** GPU backend in your project, add the following dependency in your `Cargo.toml`.

@@ -21,104 +77,18 @@ To use the **TFHE-rs** GPU backend in your project, add the following dependency
 tfhe = { version = "~1.3.0", features = ["boolean", "shortint", "integer", "gpu"] }
 ```

+If none of the supported backends is configured in `Cargo.toml`, the CPU backend is used.
+
 {% hint style="success" %}
 For optimal performance when using **TFHE-rs**, run your code in release mode with the `--release` flag.
 {% endhint %}

-### Supported platforms
+### 3. Supported platforms

-**TFHE-rs** GPU backend is supported on Linux (x86, aarch64).
+The **TFHE-rs** GPU backend is supported on Linux (x86, aarch64). The following table lists compatibility status for other platforms.

-| OS      | x86         | aarch64       |
-| ------- | ----------- | ------------- |
-| Linux   | Supported   | Supported\*   |
-| macOS   | Unsupported | Unsupported\* |
-| Windows | Unsupported | Unsupported   |
-
-## A first example
-
-### Configuring and creating keys.
-
-Comparing to the [CPU example](../../getting_started/quick_start.md), GPU set up differs in the key creation, as detailed [here](run\_on\_gpu.md#setting-the-keys)
-
-Here is a full example (combining the client and server parts):
-
-```rust
-use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
-use tfhe::prelude::*;
-
-fn main() {
-
-    let config = ConfigBuilder::default().build();
-
-    let client_key= ClientKey::generate(config);
-    let compressed_server_key = CompressedServerKey::new(&client_key);
-
-    let gpu_key = compressed_server_key.decompress_to_gpu();
-
-    let clear_a = 27u8;
-    let clear_b = 128u8;
-
-    let a = FheUint8::encrypt(clear_a, &client_key);
-    let b = FheUint8::encrypt(clear_b, &client_key);
-
-    //Server-side
-
-    set_server_key(gpu_key);
-    let result = a + b;
-
-    //Client-side
-    let decrypted_result: u8 = result.decrypt(&client_key);
-
-    let clear_result = clear_a + clear_b;
-
-    assert_eq!(decrypted_result, clear_result);
-}
-```
-
-Beware that when the GPU feature is activated, when calling: `let config = ConfigBuilder::default().build();`, the cryptographic parameters differ from the CPU ones, used when the GPU feature is not activated. Indeed, TFHE-rs uses dedicated parameters for the GPU in order to achieve better performance.
-
-### Setting the keys
-
-The configuration of the key is different from the CPU. More precisely, if both client and server keys are still generated by the client (which is assumed to run on a CPU), the server key has then to be decompressed by the server to be converted into the right format. To do so, the server should run this function: `decompressed_to_gpu()`.
-
-Once decompressed, the operations between CPU and GPU are identical.
-
-### Encryption
-
-On the client-side, the method to encrypt the data is exactly the same than the CPU one, as shown in the following example:
-
-```Rust
-    let clear_a = 27u8;
-    let clear_b = 128u8;
-    
-    let a = FheUint8::encrypt(clear_a, &client_key);
-    let b = FheUint8::encrypt(clear_b, &client_key);
-```
-
-### Computation
-
-The server first need to set up its keys with `set_server_key(gpu_key)`.
-
-Then, homomorphic computations are performed using the same approach as the [CPU operations](../../fhe-computation/operations/README.md).
-
-```Rust
-    //Server-side
-    set_server_key(gpu_key);
-    let result = a + b;
-
-    //Client-side
-    let decrypted_result: u8 = result.decrypt(&client_key);
-
-    let clear_result = clear_a + clear_b;
-
-    assert_eq!(decrypted_result, clear_result);
-```
-
-### Decryption
-
-Finally, the client decrypts the results using:
-
-```Rust
-    let decrypted_result: u8 = result.decrypt(&client_key);
-```
+| OS      | x86 | aarch64 |
+| ------- |-----|---------|
+| Linux   | Yes | Yes     |
+| macOS   | No  | No      |
+| Windows | No  | No      |
--- a/tfhe/docs/configuration/gpu_acceleration/simple_example.md
+++ b/tfhe/docs/configuration/gpu_acceleration/simple_example.md
@@ -0,0 +1,113 @@
+# Simple TFHE-rs program
+
+The example shown in this section computes the sum of two integers using the GPU. It contains code that can be split into a client-side and a server-side part, but for simplicity it is shown as a single snippet. Only the server side benefits from GPU acceleration.
+
+This example shows how to use a single GPU to improve operation latency. It has the following structure:
+
+1. _Client-side_: Generate client keys and GPU server keys. Encrypt two numbers 
+2. _Server-side_: Move server keys to GPU and perform the addition
+3. _Client-side_: Decrypt the result 
+
+This example only performs an addition, but most FHE operations are supported on GPU. For a list see:
+
+{% content-ref url="./gpu_operations.md" %} List of FHE operations on GPU {% endcontent-ref %}
+
+## API elements discussed in this document
+
+- [`tfhe::ConfigBuilder::default()`](https://doc.rust-lang.org/nightly/core/default/trait.Default.html#tymethod.default): instantiates the default cryptographic parameters. When the `"gpu"` feature is activated, the default parameters are GPU-specific, which achieves optimal performance on GPU.
+- [`tfhe::CompressedServerKey::decompress_to_gpu`](https://docs.rs/tfhe/latest/tfhe/struct.CompressedServerKey.html#method.decompress_to_gpu): decompresses a compressed server key and copies it to all available GPUs.
+- [`tfhe::set_server_key`](https://docs.rs/tfhe/latest/tfhe/fn.set_server_key.html): sets the current server key. When this is a GPU key, this function activates execution of integer operations on all GPUs assigned to this key.  
+
+## A simple TFHE-rs program 
+
+```rust
+use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
+use tfhe::prelude::*;
+
+fn main() {
+
+    let config = ConfigBuilder::default().build();
+
+    let client_key = ClientKey::generate(config);
+    let compressed_server_key = CompressedServerKey::new(&client_key);
+
+    let gpu_key = compressed_server_key.decompress_to_gpu();
+
+    let clear_a = 27u8;
+    let clear_b = 128u8;
+
+    let a = FheUint8::encrypt(clear_a, &client_key);
+    let b = FheUint8::encrypt(clear_b, &client_key);
+
+    //Server-side
+
+    set_server_key(gpu_key);
+    let result = a + b;
+
+    //Client-side
+    let decrypted_result: u8 = result.decrypt(&client_key);
+
+    let clear_result = clear_a + clear_b;
+
+    assert_eq!(decrypted_result, clear_result);
+}
+```
+
+When the `"gpu"` feature is activated, calling: `let config = ConfigBuilder::default().build();` instantiates [cryptographic parameters that are different from the CPU ones](run_on_gpu.md#gpu-tfhe-rs-features). 
+
+## Breakdown of the GPU TFHE-rs program
+
+### Key generation
+
+Compared to the [CPU example](../../getting_started/quick_start.md), the server side in the code snippet above
+must call `decompress_to_gpu` to enable GPU execution of the ensuing operations on ciphertexts. This function assigns all available GPUs to the server key.
+```Rust
+    let gpu_key = compressed_server_key.decompress_to_gpu();
+```
+Once the key is decompressed to GPU and set with `set_server_key`, operations on ciphertexts execute on the GPU. In the example above:
+- `compressed_server_key` is a [`CompressedServerKey`](https://docs.rs/tfhe/latest/tfhe/struct.CompressedServerKey.html), stored on CPU. The client-side should ensure this key is generated with [GPU cryptographic parameters](run_on_gpu.md#gpu-tfhe-rs-features).
+- `gpu_key` is the [`CudaServerKey`](https://docs.rs/tfhe/latest/tfhe/struct.CudaServerKey.html) corresponding to `compressed_server_key` and is stored on the GPU assigned to it.
+- [`set_server_key`](https://docs.rs/tfhe/latest/tfhe/fn.set_server_key.html) sets either a CPU or GPU key. In this example, `compressed_server_key` and `gpu_key` have GPU cryptographic parameters. A GPU server key can enable automatic parallelization on multiple GPUs.
+
+### Encryption
+
+On the client-side, the method to encrypt the data is exactly the same as the CPU one, as shown in the following example:
+
+```Rust
+    let clear_a = 27u8;
+    let clear_b = 128u8;
+    
+    let a = FheUint8::encrypt(clear_a, &client_key);
+    let b = FheUint8::encrypt(clear_b, &client_key);
+```
+
+### Server-side computation
+
+The server first needs to set up its keys with `set_server_key(gpu_key)`. Then, homomorphic computations are performed using the same approach as the [CPU operations](../../fhe-computation/operations/README.md).
+
+```Rust
+    //Server-side
+    set_server_key(gpu_key);
+    let result = a + b;
+
+    //Client-side
+    let decrypted_result: u8 = result.decrypt(&client_key);
+
+    let clear_result = clear_a + clear_b;
+
+    assert_eq!(decrypted_result, clear_result);
+```
+
+### Decryption
+
+Finally, the client decrypts the results using:
+
+```Rust
+    let decrypted_result: u8 = result.decrypt(&client_key);
+```
+
+## Optimizing for throughput
+
+To improve operation throughput, you can use multiple GPUs with fine-grained GPU scheduling, as detailed on the following page:
+
+{% content-ref url="./multi_gpu.md" %} Multi-GPU usage {% endcontent-ref %}
--- a/tfhe/docs/configuration/gpu_acceleration/zk-pok.md
+++ b/tfhe/docs/configuration/gpu_acceleration/zk-pok.md
@@ -1,23 +1,27 @@
 # Zero-knowledge proofs

-Zero-knowledge proofs (ZK) are a powerful tool to assert that the encryption of a message is correct, as discussed in [advanced features](../../fhe-computation/advanced-features/zk-pok.md).
-However, computation is not possible on the type of ciphertexts it produces (i.e. `ProvenCompactCiphertextList`). This document explains how to use the GPU to accelerate the
-preprocessing step needed to convert ciphertexts formatted for ZK to ciphertexts in the right format for computation purposes on GPU. This 
-operation is called "expansion".
+Zero-knowledge proofs (ZK) are a powerful tool to assert that the encryption of a message is correctly formed with secure cryptographic parameters, and they help thwart chosen ciphertext attacks (CCA) such as replay attacks.

-## Proven compact ciphertext list
+The CPU implementation is discussed in [advanced features](../../fhe-computation/advanced-features/zk-pok.md). During encryption, ZK proofs can be generated for a single ciphertext or for a list of ciphertexts. To use ciphertexts with proofs for computation, two additional conversion steps are needed: proof expansion and proof verification. Only proof expansion is accelerated on GPU; verification is performed by the CPU.

-A proven compact list of ciphertexts can be seen as a compacted collection of ciphertexts for which encryption can be verified.
-This verification is currently only supported on the CPU, but the expansion can be accelerated using the GPU.
-This way, verification and expansion can be performed in parallel, efficiently using all the available computational resources.
-
-## Supported types
-Encrypted messages can be integers (like FheUint64) or booleans. The GPU backend does not currently support encrypted strings.
+## Configuration

 {% hint style="info" %}
 You can enable this feature using the flag: `--features=zk-pok,gpu` when building **TFHE-rs**.
 {% endhint %}

+## API elements discussed in this document
+
+- [`tfhe::ProvenCompactCiphertextList`](https://docs.rs/tfhe/latest/tfhe/struct.ProvenCompactCiphertextList.html): a list of ciphertexts with accompanying ZK-proofs. The ciphertexts are stored in a compact form and must be expanded for computation.
+- [`tfhe::ProvenCompactCiphertextList::verify_and_expand`](https://docs.rs/tfhe/latest/tfhe/struct.ProvenCompactCiphertextList.html#method.verify_and_expand): verify the proofs for this ciphertext list and expand each ciphertext into a form that is supported for computation.
+
+## Proven compact ciphertext list
+
+A proven compact list of ciphertexts can be seen as a compacted collection of ciphertexts for which encryption can be verified.
+This verification is currently only supported on the CPU, but the expansion can be sped up using the GPU. Verification and expansion can therefore be performed in parallel, efficiently using all the available computational resources.
+
+## Supported types
+Encrypted messages can be integers (like `FheUint64`) or booleans. The GPU backend does not currently support encrypted strings.

 ## Example

--- a/tfhe/docs/getting_started/benchmarks/README.md
+++ b/tfhe/docs/getting_started/benchmarks/README.md
@@ -10,7 +10,7 @@ You can get the parameters used for benchmarks by cloning the repository and che
 make print_doc_bench_parameters
 ```

-### Operation time over FheUint 64
+### Operation latency CPU vs GPU comparison

 {% hint style="info" %}
 Benchmarks in the Table below were launched on: 
--- a/tfhe/docs/getting_started/benchmarks/cpu/cpu_integer_operations.md
+++ b/tfhe/docs/getting_started/benchmarks/cpu/cpu_integer_operations.md
@@ -22,7 +22,7 @@ The next table shows the operation timings on CPU when the left input is encrypt

 All timings are based on parallelized Radix-based integer operations where each block is encrypted using the default parameters `PARAM_MESSAGE_2_CARRY_2_KS_PBS`. To ensure predictable timings, we perform operations in the `default` mode, which ensures that the input and output encoding are similar (i.e., the carries are always emptied).

-You can minimize operational costs by selecting from 'unchecked', 'checked', or 'smart' modes from [the fine-grained APIs](../../../references/fine-grained-apis/quick_start.md), each balancing performance and correctness differently. For more details about parameters, see [here](../../../references/fine-grained-apis/shortint/parameters.md). You can find the benchmark results on GPU for all these operations on GPU [here](../../../configuration/gpu_acceleration/benchmark.md) and on HPU [here](../../../configuration/hpu_acceleration/benchmark.md).
+You can minimize operational costs by selecting from 'unchecked', 'checked', or 'smart' modes from [the fine-grained APIs](../../../references/fine-grained-apis/quick_start.md), each balancing performance and correctness differently. For more details about parameters, see [here](../../../references/fine-grained-apis/shortint/parameters.md). You can find the benchmark results for all these operations on GPU [here](../../../getting_started/benchmarks/gpu/README.md) and on HPU [here](../../../configuration/hpu_acceleration/benchmark.md).

 ## Reproducing TFHE-rs benchmarks

--- a/tfhe/src/test_user_docs.rs
+++ b/tfhe/src/test_user_docs.rs
@@ -235,10 +235,6 @@ mod test_gpu_doc {
        "../docs/configuration/gpu_acceleration/array_type.md",
        configuration_gpu_acceleration_array_type
    );
-    doctest!(
-        "../docs/configuration/gpu_acceleration/benchmark.md",
-        configuration_gpu_acceleration_benchmark
-    );
    doctest!(
        "../docs/configuration/gpu_acceleration/multi_gpu.md",
        configuration_gpu_acceleration_multi_gpu_device_selection
@@ -247,6 +243,10 @@ mod test_gpu_doc {
        "../docs/configuration/gpu_acceleration/zk-pok.md",
        configuration_gpu_acceleration_zk_pok
    );
+    doctest!(
+        "../docs/configuration/gpu_acceleration/simple_example.md",
+        configuration_gpu_simple_example
+    );
 }

 #[cfg(feature = "hpu")]