all tests are passing

fix some bugs
fix fmt
2026-01-11 15:48:20 -05:00 · 2025-03-10 18:49:37 +04:00 · 2025-03-10 15:59:48 +04:00 · 2025-03-07 18:34:41 +04:00 · 2025-03-07 18:31:07 +04:00 · 2025-03-07 17:37:13 +04:00
708 changed files with 52047 additions and 19940 deletions
--- a/.github/actions/gpu_setup/action.yml
+++ b/.github/actions/gpu_setup/action.yml
@@ -0,0 +1,63 @@
+name: Setup Cuda
+description: Set up CUDA on a Hyperstack or GitHub instance
+
+inputs:
+  cuda-version:
+    description: Version of Cuda to use
+    required: true
+  gcc-version:
+    description: Version of GCC to use
+    required: true
+  cmake-version:
+    description: Version of cmake to use
+    default: 3.29.6
+  github-instance:
+    description: Instance is hosted on GitHub
+    default: 'false'
+
+runs:
+  using: "composite"
+  steps:
+    # Mandatory on hyperstack since a bootable volume is not re-usable yet.
+    - name: Install dependencies
+      shell: bash
+      run: |
+        sudo apt update
+        curl -fsSL https://apt.kitware.com/keys/kitware-archive-latest.asc | sudo gpg --dearmour -o /etc/apt/trusted.gpg.d/kitware.gpg
+        sudo chmod 644 /etc/apt/trusted.gpg.d/kitware.gpg
+        echo 'deb [signed-by=/etc/apt/trusted.gpg.d/kitware.gpg] https://apt.kitware.com/ubuntu/ jammy main' | sudo tee /etc/apt/sources.list.d/kitware.list >/dev/null
+        sudo apt update
+        sudo apt install -y cmake cmake-format libclang-dev
+
+    - name: Install CUDA
+      if: inputs.github-instance == 'true'
+      shell: bash
+      run: |
+        TOOLKIT_VERSION="$(echo ${{ inputs.cuda-version }} | sed 's/\(.*\)\.\(.*\)/\1-\2/')"
+        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+        sudo dpkg -i cuda-keyring_1.1-1_all.deb
+        sudo apt update
+        sudo apt -y install cuda-toolkit-${TOOLKIT_VERSION}
+
+    - name: Export CUDA variables
+      shell: bash
+      run: |
+        CUDA_PATH=/usr/local/cuda-${{ inputs.cuda-version }}
+        echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
+        echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
+        echo "LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
+        echo "CUDA_MODULE_LOADER=EAGER" >> "${GITHUB_ENV}"
+
+    # Specify the correct host compilers
+    - name: Export gcc and g++ variables
+      shell: bash
+      run: |
+        {
+          echo "CC=/usr/bin/gcc-${{ inputs.gcc-version }}";
+          echo "CXX=/usr/bin/g++-${{ inputs.gcc-version }}";
+          echo "CUDAHOSTCXX=/usr/bin/g++-${{ inputs.gcc-version }}";
+        } >> "${GITHUB_ENV}"
+
+    - name: Check device is detected
+      shell: bash
+      run: nvidia-smi
--- a/.github/actions/hyperstack_setup/action.yml
+++ b/.github/actions/hyperstack_setup/action.yml
@@ -1,53 +0,0 @@
-name: Setup Cuda
-description: Setup Cuda on Hyperstack instance
-
-inputs:
-  cuda-version:
-    description: Version of Cuda to use
-    required: true
-  gcc-version:
-    description: Version of GCC to use
-    required: true
-  cmake-version:
-    description: Version of cmake to use
-    default: 3.29.6
-
-runs:
-  using: "composite"
-  steps:
-    # Mandatory on hyperstack since a bootable volume is not re-usable yet.
-    - name: Install dependencies
-      shell: bash
-      run: |
-        sudo apt update
-        sudo apt install -y checkinstall zlib1g-dev libssl-dev libclang-dev
-        wget https://github.com/Kitware/CMake/releases/download/v${{ inputs.cmake-version }}/cmake-${{ inputs.cmake-version }}.tar.gz
-        tar -zxvf cmake-${{ inputs.cmake-version }}.tar.gz
-        cd cmake-${{ inputs.cmake-version }}
-        ./bootstrap
-        make -j"$(nproc)"
-        sudo make install
-
-    - name: Export CUDA variables
-      shell: bash
-      run: |
-        CUDA_PATH=/usr/local/cuda-${{ inputs.cuda-version }}
-        echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
-        echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
-        echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
-        echo "CUDACXX=/usr/local/cuda-${{ inputs.cuda-version }}/bin/nvcc" >> "${GITHUB_ENV}"
-
-    # Specify the correct host compilers
-    - name: Export gcc and g++ variables
-      shell: bash
-      run: |
-        {
-          echo "CC=/usr/bin/gcc-${{ inputs.gcc-version }}";
-          echo "CXX=/usr/bin/g++-${{ inputs.gcc-version }}";
-          echo "CUDAHOSTCXX=/usr/bin/g++-${{ inputs.gcc-version }}";
-          echo "HOME=/home/ubuntu";
-        } >> "${GITHUB_ENV}"
-
-    - name: Check device is detected
-      shell: bash
-      run: nvidia-smi
--- a/.github/workflows/aws_tfhe_backward_compat_tests.yml
+++ b/.github/workflows/aws_tfhe_backward_compat_tests.yml
@@ -11,53 +11,26 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_16"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (backward-compat-tests)
-    needs: check-user-permission
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -67,11 +40,18 @@ jobs:
          backend: aws
          profile: cpu-small

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  backward-compat-tests:
    name: Backward compatibility tests
    needs: [ setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -79,22 +59,17 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
          toolchain: stable

-      - name: Install git-lfs
-        run: |
-          sudo apt update && sudo apt -y install git-lfs
-
      - name: Use specific data branch
        if: ${{ contains(github.event.pull_request.labels.*.name, 'data_PR') }}
        env:
-          PR_BRANCH: ${{ github.head_ref || github.ref_name }}
+          PR_BRANCH: ${{ github.ref_name }}
        run: |
          echo "BACKWARD_COMPAT_DATA_BRANCH=${PR_BRANCH}" >> "${GITHUB_ENV}"

@@ -131,8 +106,9 @@ jobs:
    needs: [ setup-instance, backward-compat-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/aws_tfhe_fast_tests.yml
+++ b/.github/workflows/aws_tfhe_fast_tests.yml
@@ -11,26 +11,16 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_64-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -69,8 +59,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -137,34 +126,18 @@ jobs:
        run: |
          echo "any_changed=true" >> "$GITHUB_OUTPUT"

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (fast-tests)
    if: github.event_name == 'workflow_dispatch' ||
      (github.event_name != 'workflow_dispatch' && needs.should-run.outputs.any_file_changed == 'true')
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -174,11 +147,18 @@ jobs:
          backend: aws
          profile: cpu-big

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  fast-tests:
    name: Fast CPU tests
    needs: [ should-run, setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -186,8 +166,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -230,7 +209,7 @@ jobs:

      - name: Node cache restoration
        id: node-cache
-        uses: actions/cache/restore@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/restore@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        with:
          path: |
            ~/.nvm
@@ -243,7 +222,7 @@ jobs:
          make install_node

      - name: Node cache save
-        uses: actions/cache/save@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/save@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        if: steps.node-cache.outputs.cache-hit != 'true'
        with:
          path: |
@@ -286,7 +265,7 @@ jobs:
          make test_zk

      - name: Slack Notification
-        if: ${{ failure() }}
+        if: ${{ failure() && env.SECRETS_AVAILABLE == 'true' }}
        continue-on-error: true
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
@@ -299,8 +278,9 @@ jobs:
    needs: [ setup-instance, fast-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/aws_tfhe_integer_tests.yml
+++ b/.github/workflows/aws_tfhe_integer_tests.yml
@@ -10,31 +10,20 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
  # We clear the cache to reduce memory pressure because of the numerous processes of cargo
  # nextest
  TFHE_RS_CLEAR_IN_MEMORY_KEY_CACHE: "1"
  NO_BIG_PARAMS: FALSE
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_64-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  push:
    branches:
      - main
@@ -43,7 +32,7 @@ jobs:
  should-run:
    if:
      (github.event_name == 'push' && github.repository == 'zama-ai/tfhe-rs') ||
-      (github.event_name == 'pull_request_target' && contains(github.event.label.name, 'approved')) ||
+      (github.event_name == 'pull_request' && contains(github.event.label.name, 'approved')) ||
      github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    permissions:
@@ -57,8 +46,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -75,26 +63,9 @@ jobs:
              - tfhe/src/integer/**
              - .github/workflows/aws_tfhe_integer_tests.yml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (unsigned-integer-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if:
      (github.event_name == 'push' && github.repository == 'zama-ai/tfhe-rs' && needs.should-run.outputs.integer_test == 'true') ||
      (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') ||
@@ -102,10 +73,11 @@ jobs:
      github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -115,11 +87,18 @@ jobs:
          backend: aws
          profile: cpu-big

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  unsigned-integer-tests:
    name: Unsigned integer tests
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -127,8 +106,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: "false"
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -136,7 +114,7 @@ jobs:
          toolchain: stable

      - name: Should skip big parameters set
-        if: github.event_name == 'pull_request_target'
+        if: github.event_name == 'pull_request'
        run: |
          echo "NO_BIG_PARAMS=TRUE" >> "${GITHUB_ENV}"

@@ -170,8 +148,9 @@ jobs:
    needs: [setup-instance, unsigned-integer-tests]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/aws_tfhe_signed_integer_tests.yml
+++ b/.github/workflows/aws_tfhe_signed_integer_tests.yml
@@ -10,31 +10,20 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
  # We clear the cache to reduce memory pressure because of the numerous processes of cargo
  # nextest
  TFHE_RS_CLEAR_IN_MEMORY_KEY_CACHE: "1"
  NO_BIG_PARAMS: FALSE
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_64-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  push:
    branches:
      - main
@@ -44,7 +33,7 @@ jobs:
    if:
      (github.event_name == 'push' && github.repository == 'zama-ai/tfhe-rs') ||
      (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') ||
-      ((github.event_name == 'pull_request_target' || github.event_name == 'pull_request_target') && contains(github.event.label.name, 'approved')) ||
+      (github.event_name == 'pull_request' && contains(github.event.label.name, 'approved')) ||
      github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    permissions:
@@ -58,8 +47,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -76,26 +64,9 @@ jobs:
              - tfhe/src/integer/**
              - .github/workflows/aws_tfhe_signed_integer_tests.yml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (unsigned-integer-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if:
      (github.event_name == 'push' && github.repository == 'zama-ai/tfhe-rs' && needs.should-run.outputs.integer_test == 'true') ||
      (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') ||
@@ -103,10 +74,11 @@ jobs:
      github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -116,11 +88,18 @@ jobs:
          backend: aws
          profile: cpu-big

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  signed-integer-tests:
    name: Signed integer tests
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -128,8 +107,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: "false"
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -137,7 +115,7 @@ jobs:
          toolchain: stable

      - name: Should skip big parameters set
-        if: github.event_name == 'pull_request_target'
+        if: github.event_name == 'pull_request'
        run: |
          echo "NO_BIG_PARAMS=TRUE" >> "${GITHUB_ENV}"

@@ -175,8 +153,9 @@ jobs:
    needs: [setup-instance, signed-integer-tests]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/aws_tfhe_tests.yml
+++ b/.github/workflows/aws_tfhe_tests.yml
@@ -10,28 +10,17 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_64-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  schedule:
    # Nightly tests @ 1AM after each work day
    - cron: "0 1 * * MON-FRI"
@@ -79,8 +68,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -147,34 +135,18 @@ jobs:
        run: |
          echo "any_changed=true" >> "$GITHUB_OUTPUT"

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cpu-tests)
-    if: github.event_name != 'pull_request_target' ||
+    if: github.event_name != 'pull_request' ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.any_file_changed == 'true')
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -184,13 +156,20 @@ jobs:
          backend: aws
          profile: cpu-big

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cpu-tests:
    name: CPU tests
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    needs: [ should-run, setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{github.event_name}}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}_${{github.event_name}}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -198,8 +177,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -282,8 +260,9 @@ jobs:
    needs: [ setup-instance, cpu-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/aws_tfhe_wasm_tests.yml
+++ b/.github/workflows/aws_tfhe_wasm_tests.yml
@@ -10,56 +10,28 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_16"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (wasm-tests)
-    needs: check-user-permission
    if: ${{ github.event_name == 'workflow_dispatch' || contains(github.event.label.name, 'approved') }}
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -69,11 +41,18 @@ jobs:
          backend: aws
          profile: cpu-small

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  wasm-tests:
    name: WASM tests
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -81,8 +60,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -95,7 +73,7 @@ jobs:

      - name: Node cache restoration
        id: node-cache
-        uses: actions/cache/restore@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/restore@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        with:
          path: |
            ~/.nvm
@@ -108,7 +86,7 @@ jobs:
          make install_node

      - name: Node cache save
-        uses: actions/cache/save@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/save@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        if: steps.node-cache.outputs.cache-hit != 'true'
        with:
          path: |
@@ -151,8 +129,9 @@ jobs:
    needs: [ setup-instance, wasm-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/benchmark_boolean.yml
+++ b/.github/workflows/benchmark_boolean.yml
@@ -43,7 +43,7 @@ jobs:
    needs: setup-instance
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    continue-on-error: true
    steps:
@@ -94,7 +94,7 @@ jobs:
          --append-results

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_boolean
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_core_crypto.yml
+++ b/.github/workflows/benchmark_core_crypto.yml
@@ -43,7 +43,7 @@ jobs:
    needs: setup-instance
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    steps:
      - name: Checkout tfhe-rs repo with tags
@@ -68,6 +68,7 @@ jobs:

      - name: Run benchmarks with AVX512
        run: |
+          make bench_ks_pbs
          make bench_pbs
          make bench_pbs128
          make bench_ks
@@ -85,7 +86,7 @@ jobs:
          --walk-subdirs

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_core_crypto
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_erc20.yml
+++ b/.github/workflows/benchmark_erc20.yml
@@ -43,7 +43,7 @@ jobs:
    needs: setup-instance
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    continue-on-error: true
    timeout-minutes: 720  # 12 hours
@@ -99,7 +99,7 @@ jobs:
          --append-results

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_erc20
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_gpu_4090.yml
+++ b/.github/workflows/benchmark_gpu_4090.yml
@@ -11,58 +11,25 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
  FAST_BENCH: TRUE
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  schedule:
    # Weekly benchmarks will be triggered each Friday at 9p.m.
    - cron: "0 21 * * 5"

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  cuda-integer-benchmarks:
    name: Cuda integer benchmarks (RTX 4090)
-    needs: check-user-permission
    if: ${{ github.event_name == 'workflow_dispatch' ||
      github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs' ||
      contains(github.event.label.name, '4090_bench') }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}_cuda_integer_bench
+      group: ${{ github.workflow_ref }}_cuda_integer_bench
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ["self-hosted", "4090-desktop"]
    timeout-minutes: 1440 # 24 hours
@@ -73,7 +40,6 @@ jobs:
          fetch-depth: 0
          persist-credentials: 'false'
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}

      - name: Get benchmark details
        run: |
@@ -114,7 +80,7 @@ jobs:
          --walk-subdirs

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_integer_multi_bit_gpu_default
          path: ${{ env.RESULTS_FILENAME }}
@@ -138,7 +104,7 @@ jobs:
    if: ${{ github.event_name == 'workflow_dispatch' || github.event_name == 'schedule' || contains(github.event.label.name, '4090_bench') }}
    needs: cuda-integer-benchmarks
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}_cuda_core_crypto_bench
+      group: ${{ github.workflow_ref }}_cuda_core_crypto_bench
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ["self-hosted", "4090-desktop"]
    timeout-minutes: 1440 # 24 hours
@@ -191,7 +157,7 @@ jobs:
      

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_core_crypto
          path: ${{ env.RESULTS_FILENAME }}
@@ -220,7 +186,7 @@ jobs:

  remove_github_label:
    name: Remove 4090 bench label
-    if: ${{ always() && github.event_name == 'pull_request_target' }}
+    if: ${{ always() && github.event_name == 'pull_request' }}
    needs: [cuda-integer-benchmarks, cuda-core-crypto-benchmarks]
    runs-on: ubuntu-latest
    steps:
--- a/.github/workflows/benchmark_gpu_core_crypto.yml
+++ b/.github/workflows/benchmark_gpu_core_crypto.yml
@@ -23,10 +23,16 @@ jobs:
    if: github.event_name != 'schedule' ||
      (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs')
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, since the on-demand remote instance label output is set before the end of the start-remote-instance step.
+      # If the latter fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand remote label is set before the failure, the logical OR must be done in this order,
+      # otherwise we would try to run the next job on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -36,6 +42,13 @@ jobs:
          backend: hyperstack
          profile: single-h100

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && failure() }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
  cuda-core-crypto-benchmarks:
    name: Execute GPU core crypto benchmarks
    needs: setup-instance
@@ -57,7 +70,8 @@ jobs:
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
@@ -70,11 +84,6 @@ jobs:
            echo "COMMIT_HASH=$(git describe --tags --dirty)";
          } >> "${GITHUB_ENV}"

-      - name: Set up home
-        # "Install rust" step require root user to have a HOME directory which is not set.
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install rust
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
@@ -82,6 +91,7 @@ jobs:

      - name: Run benchmarks with AVX512
        run: |
+          make bench_ks_pbs_gpu
          make bench_pbs_gpu
          make bench_ks_gpu

@@ -99,7 +109,7 @@ jobs:
          --walk-subdirs

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_core_crypto
          path: ${{ env.RESULTS_FILENAME }}
@@ -133,7 +143,7 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-integer-full-benchmarks)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-core-crypto-benchmarks, slack-notify ]
    runs-on: ubuntu-latest
    steps:
--- a/.github/workflows/benchmark_gpu_erc20_common.yml
+++ b/.github/workflows/benchmark_gpu_erc20_common.yml
@@ -50,10 +50,16 @@ jobs:
    if:  github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs')
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, since the on-demand remote instance label output is set before the end of the start-remote-instance step.
+      # If the latter fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand remote label is set before the failure, the logical OR must be done in this order,
+      # otherwise we would try to run the next job on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -63,6 +69,13 @@ jobs:
          backend: ${{ inputs.backend }}
          profile: ${{ inputs.profile }}

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && failure() && inputs.profile == 'single-h100' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
  cuda-erc20-benchmarks:
    name: Cuda ERC20 benchmarks (${{ inputs.profile }})
    needs: setup-instance
@@ -84,7 +97,8 @@ jobs:
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
@@ -97,11 +111,6 @@ jobs:
            echo "COMMIT_HASH=$(git describe --tags --dirty)";
          } >> "${GITHUB_ENV}"

-      - name: Set up home
-        # "Install rust" step require root user to have a HOME directory which is not set.
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install rust
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
@@ -125,7 +134,7 @@ jobs:
          --name-suffix avx512

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_erc20_${{ inputs.profile }}
          path: ${{ env.RESULTS_FILENAME }}
@@ -159,7 +168,7 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-erc20-${{ inputs.profile }}-benchmarks)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-erc20-benchmarks, slack-notify ]
    runs-on: ubuntu-latest
    steps:
--- a/.github/workflows/benchmark_gpu_integer_common.yml
+++ b/.github/workflows/benchmark_gpu_integer_common.yml
@@ -114,10 +114,16 @@ jobs:
    needs: prepare-matrix
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, since the on-demand remote instance label output is set before the end of the start-remote-instance step.
+      # If the latter fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand remote label is set before the failure, the logical OR must be done in this order,
+      # otherwise we would try to run the next job on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -127,6 +133,13 @@ jobs:
          backend: ${{ inputs.backend }}
          profile: ${{ inputs.profile }}

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && failure() && inputs.profile == 'single-h100' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
  cuda-benchmarks:
    name: Cuda benchmarks (${{ inputs.profile }})
    needs: [ prepare-matrix, setup-instance ]
@@ -154,7 +167,8 @@ jobs:
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
@@ -167,11 +181,6 @@ jobs:
            echo "COMMIT_HASH=$(git describe --tags --dirty)";
          } >> "${GITHUB_ENV}"

-      - name: Set up home
-        # "Install rust" step require root user to have a HOME directory which is not set.
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install rust
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
@@ -201,7 +210,7 @@ jobs:
          --bench-type ${{ matrix.bench_type }}

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}_${{ inputs.profile }}
          path: ${{ env.RESULTS_FILENAME }}
@@ -235,7 +244,7 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-${{ inputs.profile }}-benchmarks)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-benchmarks, slack-notify ]
    runs-on: ubuntu-latest
    steps:
--- a/.github/workflows/benchmark_integer.yml
+++ b/.github/workflows/benchmark_integer.yml
@@ -104,7 +104,7 @@ jobs:
    needs: [ prepare-matrix, setup-instance ]
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    continue-on-error: true
    timeout-minutes: 1440  # 24 hours
@@ -172,7 +172,7 @@ jobs:
          --bench-type ${{ matrix.bench_type }}

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}_${{ matrix.bench_type }}
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_shortint.yml
+++ b/.github/workflows/benchmark_shortint.yml
@@ -70,7 +70,7 @@ jobs:
    needs: [ prepare-matrix, setup-instance ]
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    continue-on-error: true
    strategy:
@@ -138,7 +138,7 @@ jobs:
          --append-results

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_shortint_${{ matrix.op_flavor }}
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_signed_integer.yml
+++ b/.github/workflows/benchmark_signed_integer.yml
@@ -104,7 +104,7 @@ jobs:
    needs: [ prepare-matrix, setup-instance ]
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    continue-on-error: true
    timeout-minutes: 1440  # 24 hours
@@ -166,7 +166,7 @@ jobs:
          --bench-type ${{ matrix.bench_type }}

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}_${{ matrix.bench_type }}
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_tfhe_fft.yml
+++ b/.github/workflows/benchmark_tfhe_fft.yml
@@ -45,7 +45,7 @@ jobs:
    name: Execute FFT benchmarks in EC2
    needs: setup-ec2
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
    steps:
@@ -84,7 +84,7 @@ jobs:
          --name-suffix avx512

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_fft
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_tfhe_ntt.yml
+++ b/.github/workflows/benchmark_tfhe_ntt.yml
@@ -45,7 +45,7 @@ jobs:
    name: Execute NTT benchmarks in EC2
    needs: setup-ec2
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
    steps:
@@ -84,7 +84,7 @@ jobs:
          --name-suffix avx512

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_ntt
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_tfhe_zk_pok.yml
+++ b/.github/workflows/benchmark_tfhe_zk_pok.yml
@@ -3,6 +3,14 @@ name: tfhe-zk-pok benchmarks

 on:
  workflow_dispatch:
+    inputs:
+      bench_type:
+        description: "Benchmarks type"
+        type: choice
+        default: latency
+        options:
+          - latency
+          - throughput
  push:
    branches:
      - main
@@ -20,6 +28,7 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
+  BENCH_TYPE: ${{ inputs.bench_type || 'latency' }}

 jobs:
  should-run:
@@ -71,7 +80,7 @@ jobs:
    if: needs.setup-instance.result != 'skipped'
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{github.event_name}}_${{ github.ref }}${{ github.ref == 'refs/heads/main' && github.sha || '' }}
+      group: ${{ github.workflow_ref }}_${{github.event_name}}${{ github.ref == 'refs/heads/main' && github.sha || '' }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -105,7 +114,7 @@ jobs:

      - name: Run benchmarks
        run: |
-          make bench_tfhe_zk_pok
+          make BENCH_TYPE=${{ env.BENCH_TYPE }} bench_tfhe_zk_pok

      - name: Parse results
        run: |
@@ -119,10 +128,11 @@ jobs:
          --commit-date "${{ env.COMMIT_DATE }}" \
          --bench-date "${{ env.BENCH_DATE }}" \
          --walk-subdirs \
-          --name-suffix avx512
+          --name-suffix avx512 \
+          --bench-type ${{ env.BENCH_TYPE }}

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_tfhe_zk_pok
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_wasm_client.yml
+++ b/.github/workflows/benchmark_wasm_client.yml
@@ -110,7 +110,7 @@ jobs:

      - name: Node cache restoration
        id: node-cache
-        uses: actions/cache/restore@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/restore@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        with:
          path: |
            ~/.nvm
@@ -123,7 +123,7 @@ jobs:
          make install_node

      - name: Node cache save
-        uses: actions/cache/save@1bd1e32a3bdc45362d1e726936510720a7c30a57 #v4.2.0
+        uses: actions/cache/save@d4323d4df104b026a6aa633fdb11d772146be0bf #v4.2.2
        if: steps.node-cache.outputs.cache-hit != 'true'
        with:
          path: |
@@ -167,7 +167,7 @@ jobs:
          --append-results

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_wasm_${{ matrix.browser }}
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/benchmark_zk_pke.yml
+++ b/.github/workflows/benchmark_zk_pke.yml
@@ -118,7 +118,7 @@ jobs:
    if: needs.setup-instance.result != 'skipped'
    needs: [ prepare-matrix, setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{github.event_name}}_${{ github.ref }}${{ github.ref == 'refs/heads/main' && github.sha || '' }}
+      group: ${{ github.workflow_ref }}_${{github.event_name}}${{ github.ref == 'refs/heads/main' && github.sha || '' }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -179,7 +179,7 @@ jobs:
          --append-results

      - name: Upload parsed results artifact
-        uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08
+        uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1
        with:
          name: ${{ github.sha }}_integer_zk
          path: ${{ env.RESULTS_FILENAME }}
--- a/.github/workflows/cargo_test_fft.yml
+++ b/.github/workflows/cargo_test_fft.yml
@@ -3,16 +3,46 @@ name: Cargo Test tfhe-fft

 on:
  pull_request:
+  push:
+    branches:
+      - main

 env:
  CARGO_TERM_COLOR: always
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}

 concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref }}
  cancel-in-progress: true

 jobs:
+  should-run:
+    runs-on: ubuntu-latest
+    permissions:
+      pull-requests: read
+    outputs:
+      fft_test: ${{ env.IS_PULL_REQUEST == 'false' || steps.changed-files.outputs.fft_any_changed }}
+    steps:
+      - name: Checkout tfhe-rs
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
+        with:
+          fetch-depth: 0
+          persist-credentials: 'false'
+
+      - name: Check for file changes
+        id: changed-files
+        uses: tj-actions/changed-files@dcc7a0cba800f454d79fff4b993e8c3555bcc0a8
+        with:
+          files_yaml: |
+            fft:
+              - tfhe/Cargo.toml
+              - Makefile
+              - tfhe-fft/**
+              - '.github/workflows/cargo_test_fft.yml'
+
  cargo-tests-fft:
+    needs: should-run
+    if: needs.should-run.outputs.fft_test == 'true'
    runs-on: ${{ matrix.runner_type }}
    strategy:
      matrix:
@@ -39,6 +69,8 @@ jobs:
          make test_fft_no_std

  cargo-tests-fft-nightly:
+    needs: should-run
+    if: needs.should-run.outputs.fft_test == 'true'
    runs-on: ${{ matrix.runner_type }}
    strategy:
      matrix:
@@ -61,7 +93,9 @@ jobs:
          make test_fft_no_std_nightly

  cargo-tests-fft-node-js:
-    runs-on: "ubuntu-latest"
+    needs: should-run
+    if: needs.should-run.outputs.fft_test == 'true'
+    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683

@@ -69,3 +103,30 @@ jobs:
        run: |
          make install_node
          make test_fft_node_js_ci
+
+  cargo-tests-fft-successful:
+    needs: [ should-run, cargo-tests-fft, cargo-tests-fft-nightly, cargo-tests-fft-node-js ]
+    if: ${{ always() }}
+    runs-on: ubuntu-latest
+    steps:
+      - name: Tests do not need to run
+        if: needs.should-run.outputs.fft_test == 'false'
+        run: |
+          echo "tfhe-fft files haven't changed, tests don't need to run"
+
+      - name: Check all tests passed
+        if: needs.should-run.outputs.fft_test == 'true' &&
+          needs.cargo-tests-fft.result == 'success' &&
+          needs.cargo-tests-fft-nightly.result == 'success' &&
+          needs.cargo-tests-fft-node-js.result == 'success'
+        run: |
+          echo "All tfhe-fft tests passed"
+
+      - name: Check tests failure
+        if: needs.should-run.outputs.fft_test == 'true' &&
+          (needs.cargo-tests-fft.result != 'success' ||
+          needs.cargo-tests-fft-nightly.result != 'success' ||
+          needs.cargo-tests-fft-node-js.result != 'success')
+        run: |
+          echo "Some tfhe-fft tests failed"
+          exit 1
--- a/.github/workflows/cargo_test_ntt.yml
+++ b/.github/workflows/cargo_test_ntt.yml
@@ -3,16 +3,46 @@ name: Cargo Test tfhe-ntt

 on:
  pull_request:
+  push:
+    branches:
+      - main

 env:
  CARGO_TERM_COLOR: always
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}

 concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref }}
  cancel-in-progress: true

 jobs:
+  should-run:
+    runs-on: ubuntu-latest
+    permissions:
+      pull-requests: read
+    outputs:
+      ntt_test: ${{ env.IS_PULL_REQUEST == 'false' || steps.changed-files.outputs.ntt_any_changed }}
+    steps:
+      - name: Checkout tfhe-rs
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
+        with:
+          fetch-depth: 0
+          persist-credentials: 'false'
+
+      - name: Check for file changes
+        id: changed-files
+        uses: tj-actions/changed-files@dcc7a0cba800f454d79fff4b993e8c3555bcc0a8
+        with:
+          files_yaml: |
+            ntt:
+              - tfhe/Cargo.toml
+              - Makefile
+              - tfhe-ntt/**
+              - '.github/workflows/cargo_test_ntt.yml'
+
  cargo-tests-ntt:
+    needs: should-run
+    if: needs.should-run.outputs.ntt_test == 'true'
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
@@ -34,6 +64,8 @@ jobs:
        run: make test_ntt_no_std

  cargo-tests-ntt-nightly:
+    needs: should-run
+    if: needs.should-run.outputs.ntt_test == 'true'
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
@@ -52,3 +84,28 @@ jobs:

      - name: Test no-std nightly
        run: make test_ntt_no_std_nightly
+
+  cargo-tests-ntt-successful:
+    needs: [ should-run, cargo-tests-ntt, cargo-tests-ntt-nightly ]
+    if: ${{ always() }}
+    runs-on: ubuntu-latest
+    steps:
+      - name: Tests do not need to run
+        if: needs.should-run.outputs.ntt_test == 'false'
+        run: |
+          echo "tfhe-ntt files haven't changed, tests don't need to run"
+
+      - name: Check all tests passed
+        if: needs.should-run.outputs.ntt_test == 'true' &&
+          needs.cargo-tests-ntt.result == 'success' &&
+          needs.cargo-tests-ntt-nightly.result == 'success'
+        run: |
+          echo "All tfhe-ntt tests passed"
+
+      - name: Check tests failure
+        if: needs.should-run.outputs.ntt_test == 'true' &&
+          (needs.cargo-tests-ntt.result != 'success' ||
+          needs.cargo-tests-ntt-nightly.result != 'success')
+        run: |
+          echo "Some tfhe-ntt tests failed"
+          exit 1
--- a/.github/workflows/check_actor_permissions.yml
+++ b/.github/workflows/check_actor_permissions.yml
@@ -1,39 +0,0 @@
-# Check if an actor is a collaborator and has write access
-name: Check Actor Permissions
-
-on:
-  workflow_call:
-    inputs:
-      username:
-        type: string
-        default: ${{ github.triggering_actor }}
-    outputs:
-      is_authorized:
-        value: ${{ jobs.check-actor-permission.outputs.actor_authorized }}
-    secrets:
-      TOKEN:
-        required: true
-
-jobs:
-  check-actor-permission:
-    runs-on: ubuntu-latest
-    outputs:
-      actor_authorized: ${{ steps.check-access.outputs.require-result }}
-    steps:
-      - name: Get User Permission
-        id: check-access
-        uses: actions-cool/check-user-permission@7b90a27f92f3961b368376107661682c441f6103 # v2.3.0
-        with:
-          require: write
-          username: ${{ inputs.username }}
-        env:
-          GITHUB_TOKEN: ${{ secrets.TOKEN }}
-
-      - name: Check User Permission
-        if: ${{ !(inputs.username == 'dependabot[bot]' || inputs.username == 'cla-bot[bot]') &&
-          steps.check-access.outputs.require-result == 'false' }}
-        run: |
-          echo "${{ inputs.username }} does not have permissions on this repo."
-          echo "Current permission level is ${{ steps.check-access.outputs.user-permission }}"
-          echo "Job originally triggered by ${{ github.actor }}"
-          exit 1
--- a/.github/workflows/check_ci_files_change.yml
+++ b/.github/workflows/check_ci_files_change.yml
@@ -1,40 +0,0 @@
-# Check if there is any change in CI files since last commit
-name: Check changes in CI files
-
-on:
-  workflow_call:
-    inputs:
-      checkout_ref:
-        type: string
-        required: true
-    outputs:
-      ci_file_changed:
-        value: ${{ jobs.check-changes.outputs.ci_file_changed }}
-    secrets:
-      REPO_CHECKOUT_TOKEN:
-        required: true
-
-jobs:
-  check-changes:
-    runs-on: ubuntu-latest
-    permissions:
-      pull-requests: read
-    outputs:
-      ci_file_changed: ${{ steps.changed-files.outputs.ci_any_changed }}
-    steps:
-      - name: Checkout tfhe-rs
-        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
-        with:
-          fetch-depth: 0
-          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ inputs.checkout_ref }}
-
-      - name: Check for file changes
-        id: changed-files
-        uses: tj-actions/changed-files@dcc7a0cba800f454d79fff4b993e8c3555bcc0a8
-        with:
-          files_yaml: |
-            ci:
-              - .github/**
-              - ci/**
--- a/.github/workflows/check_external_pr.yml
+++ b/.github/workflows/check_external_pr.yml
@@ -1,32 +0,0 @@
-# Check if a pull request fulfill pre-conditions to be accepted
-name: Check PR from fork
-
-on:
-  pull_request_target:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-
-jobs:
-  # Fail if the triggering actor is not part of Zama organization.
-  check-user-permission:
-    name: Check event user permissions
-    uses: ./.github/workflows/check_actor_permissions.yml
-    with:
-      username: ${{ github.event.pull_request.user.login }}
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
-  write-comment:
-    name: Write PR comment
-    if: ${{ always() && needs.check-user-permission.outputs.is_authorized == 'false' }}
-    needs: check-user-permission
-    runs-on: ubuntu-latest
-    permissions:
-      pull-requests: write
-    steps:
-      - name: Write warning
-        uses: thollander/actions-comment-pull-request@24bffb9b452ba05a4f3f77933840a6a841d1b32b
-        with:
-          message: |
-            CI files have changed. Only Zama organization members are authorized to modify these files.
--- a/.github/workflows/ci_lint.yml
+++ b/.github/workflows/ci_lint.yml
@@ -6,6 +6,7 @@ on:

 env:
  ACTIONLINT_VERSION: 1.6.27
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}

 jobs:
  lint-check:
@@ -16,7 +17,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Get actionlint
        run: |
@@ -30,7 +31,7 @@ jobs:
          make lint_workflow

      - name: Ensure SHA pinned actions
-        uses: zgosalvez/github-actions-ensure-sha-pinned-actions@c3a2b64f69b7a1542a68f44d9edbd9ec3fc1455e # v3.0.20
+        uses: zgosalvez/github-actions-ensure-sha-pinned-actions@25ed13d0628a1601b4b44048e63cc4328ed03633 # v3.0.22
        with:
          allowlist: |
            slsa-framework/slsa-github-generator
--- a/.github/workflows/code_coverage.yml
+++ b/.github/workflows/code_coverage.yml
@@ -38,7 +38,7 @@ jobs:
    name: Code coverage tests
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.event_name }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}_${{ github.event_name }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    timeout-minutes: 5760 # 4 days
@@ -83,7 +83,7 @@ jobs:
          make test_shortint_cov

      - name: Upload tfhe coverage to Codecov
-        uses: codecov/codecov-action@13ce06bfc6bbe3ecf90edbbf1bc32fe5978ca1d3
+        uses: codecov/codecov-action@0565863a31f2c772f9f0395002a31e3f06189574
        if: steps.changed-files.outputs.tfhe_any_changed == 'true'
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
@@ -97,7 +97,7 @@ jobs:
          make test_integer_cov

      - name: Upload tfhe coverage to Codecov
-        uses: codecov/codecov-action@13ce06bfc6bbe3ecf90edbbf1bc32fe5978ca1d3
+        uses: codecov/codecov-action@0565863a31f2c772f9f0395002a31e3f06189574
        if: steps.changed-files.outputs.tfhe_any_changed == 'true'
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
--- a/.github/workflows/csprng_randomness_tests.yml
+++ b/.github/workflows/csprng_randomness_tests.yml
@@ -10,56 +10,28 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_16"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (csprng-randomness-tests)
-    needs: check-user-permission
    if: ${{ github.event_name == 'workflow_dispatch' || contains(github.event.label.name, 'approved') }}
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -69,11 +41,18 @@ jobs:
          backend: aws
          profile: cpu-small

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  csprng-randomness-tests:
    name: CSPRNG randomness tests
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    steps:
@@ -81,8 +60,7 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -107,8 +85,9 @@ jobs:
    needs: [ setup-instance, csprng-randomness-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/data_pr_close.yml
+++ b/.github/workflows/data_pr_close.yml
@@ -8,17 +8,13 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  PR_BRANCH: ${{ github.head_ref || github.ref_name }}
+  PR_BRANCH: ${{ github.ref_name }}
  CLOSE_TYPE: ${{ github.event.pull_request.merged && 'merge' || 'close' }}

 # only trigger on pull request closed events
 on:
  pull_request:
    types: [ closed ]
-  pull_request_target:
-    types: [ closed ]

 # The same pattern is used for jobs that use the github api:
 # - save the result of the API call in the env var "GH_API_RES". Since the var is multiline
--- a/.github/workflows/gpu_4090_tests.yml
+++ b/.github/workflows/gpu_4090_tests.yml
@@ -11,67 +11,35 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  schedule:
    # Nightly tests @ 1AM after each work day
    - cron: "0 1 * * MON-FRI"

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  cuda-tests-linux:
    name: CUDA tests (RTX 4090)
-    needs: check-user-permission
    if: github.event_name == 'workflow_dispatch' ||
      contains(github.event.label.name, '4090_test') ||
      (github.event_name == 'schedule' &&  github.repository == 'zama-ai/tfhe-rs')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ["self-hosted", "4090-desktop"]
+    timeout-minutes: 1440 # 24 hours

    steps:
      - name: Checkout tfhe-rs
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -103,7 +71,7 @@ jobs:
          make test_high_level_api_gpu

      - uses: actions-ecosystem/action-remove-labels@2ce5d41b4b6aa8503e285553f75ed56e0a40bae0
-        if: ${{ always() && github.event_name == 'pull_request_target' }}
+        if: ${{ always() && github.event_name == 'pull_request' }}
        with:
          labels: 4090_test
          github_token: ${{ secrets.GITHUB_TOKEN }}
--- a/.github/workflows/gpu_fast_h100_tests.yml
+++ b/.github/workflows/gpu_fast_h100_tests.yml
@@ -11,28 +11,17 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -47,8 +36,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +60,25 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-h100-tests)
-    needs: [ should-run, check-user-permission ]
-    if: github.event_name != 'pull_request_target' ||
+    needs: should-run
+    if: github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, because the on-demand remote instance label output is set before the start-remote-instance step ends.
+      # If that step fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand remote label is already set when the failure occurs, the logical OR must be evaluated in this order,
+      # otherwise the next job would try to run on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +88,27 @@ jobs:
          backend: hyperstack
          profile: single-h100

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && steps.start-remote-instance.outcome == 'failure' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA H100 tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +124,15 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -176,6 +165,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -183,12 +173,13 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-h100-tests)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_fast_tests.yml
+++ b/.github/workflows/gpu_fast_tests.yml
@@ -11,26 +11,16 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -45,8 +35,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -70,34 +59,18 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if: github.event_name == 'workflow_dispatch' ||
      needs.should-run.outputs.gpu_test == 'true'
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -107,13 +80,20 @@ jobs:
          backend: hyperstack
          profile: gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -129,18 +109,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -173,6 +149,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -184,8 +161,9 @@ jobs:
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_full_h100_tests.yml
+++ b/.github/workflows/gpu_full_h100_tests.yml
@@ -20,10 +20,16 @@ jobs:
    name: Setup instance (cuda-h100-tests)
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, because the on-demand remote instance label output is set before the start-remote-instance step ends.
+      # If that step fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand label is already set when the failure occurs, the logical OR must be evaluated in this order;
+      # otherwise the next job would try to run on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -33,11 +39,18 @@ jobs:
          backend: hyperstack
          profile: single-h100

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && steps.start-remote-instance.outcome == 'failure' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA H100 tests
    needs: [ setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -68,15 +81,12 @@ jobs:
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}

-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
@@ -113,6 +123,7 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-h100-tests)
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
--- a/.github/workflows/gpu_full_multi_gpu_tests.yml
+++ b/.github/workflows/gpu_full_multi_gpu_tests.yml
@@ -11,28 +11,17 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -47,8 +36,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +60,19 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-tests-multi-gpu)
-    needs: [ should-run, check-user-permission ]
-    if: github.event_name != 'pull_request_target' ||
+    needs: should-run
+    if: github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +82,20 @@ jobs:
          backend: hyperstack
          profile: multi-gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories.
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA multi-GPU tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +111,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -179,6 +154,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -190,8 +166,9 @@ jobs:
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_integer_long_run_tests.yml
+++ b/.github/workflows/gpu_integer_long_run_tests.yml
@@ -1,4 +1,4 @@
-name: Long Run Tests on GPU
+name: Cuda - Long Run Tests on GPU

 env:
  CARGO_TERM_COLOR: always
@@ -15,8 +15,8 @@ on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
  schedule:
-    # Weekly tests will be triggered each Friday at 9p.m.
-    - cron: "0 21 * * 5"
+    # Nightly tests will be triggered each evening at 8 p.m.
+    - cron: "0 20 * * *"

 jobs:
  setup-instance:
@@ -42,7 +42,7 @@ jobs:
    name: Long run GPU tests
    needs: [ setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{github.event_name}}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}_${{github.event_name}}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -59,15 +59,11 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}

-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
--- a/.github/workflows/gpu_pcc.yml
+++ b/.github/workflows/gpu_pcc.yml
@@ -11,51 +11,24 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "large_ubuntu_16-22.04"

 on:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-pcc)
-    needs: check-user-permission
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -65,11 +38,18 @@ jobs:
          backend: aws
          profile: gpu-build

+      # This instance is spawned specifically for pull requests from forked repositories.
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-pcc:
    name: CUDA post-commit checks
    needs: setup-instance
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -88,12 +68,17 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

-      - name: Set up home
+      - name: Install CUDA
+        if: env.SECRETS_AVAILABLE == 'false'
+        shell: bash
        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          TOOLKIT_VERSION="$(echo ${{ matrix.cuda }} | sed 's/\(.*\)\.\(.*\)/\1-\2/')"
+          wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
+          sudo dpkg -i cuda-keyring_1.1-1_all.deb
+          sudo apt update
+          sudo apt -y install "cuda-toolkit-${TOOLKIT_VERSION}" cmake-format

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -116,7 +101,6 @@ jobs:
            echo "CC=/usr/bin/gcc-${{ matrix.gcc }}";
            echo "CXX=/usr/bin/g++-${{ matrix.gcc }}";
            echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}";
-            echo "HOME=/home/ubuntu";
          } >> "${GITHUB_ENV}"

      - name: Run fmt checks
@@ -128,7 +112,7 @@ jobs:
          make pcc_gpu

      - name: Slack Notification
-        if: ${{ failure() }}
+        if: ${{ failure() && env.SECRETS_AVAILABLE == 'true' }}
        continue-on-error: true
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
@@ -141,8 +125,9 @@ jobs:
    needs: [ setup-instance, cuda-pcc ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_signed_integer_classic_tests.yml
+++ b/.github/workflows/gpu_signed_integer_classic_tests.yml
@@ -11,28 +11,17 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -47,8 +36,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +60,19 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-signed-classic-tests)
-    needs: [ should-run, check-user-permission ]
-    if: github.event_name != 'pull_request_target' ||
+    needs: should-run
+    if: github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +82,20 @@ jobs:
          backend: hyperstack
          profile: gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories.
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA signed integer tests with classical PBS
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +111,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -162,6 +137,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -173,8 +149,9 @@ jobs:
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_signed_integer_h100_tests.yml
+++ b/.github/workflows/gpu_signed_integer_h100_tests.yml
@@ -11,28 +11,18 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
+

 jobs:
  should-run:
@@ -47,8 +37,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +61,25 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-h100-tests)
-    needs: [ should-run, check-user-permission ]
-    if: github.event_name != 'pull_request_target' ||
+    needs: should-run
+    if: github.event_name != 'pull_request' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Use the permanent remote instance label first, because the on-demand remote instance label output is set before the start-remote-instance step ends.
+      # If that step fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand label is already set when the failure occurs, the logical OR must be evaluated in this order;
+      # otherwise the next job would try to run on a non-existent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +89,27 @@ jobs:
          backend: hyperstack
          profile: single-h100

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && steps.start-remote-instance.outcome == 'failure' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
+      # This instance is spawned specifically for pull requests from forked repositories.
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA H100 signed integer tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +125,15 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -162,6 +152,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -169,12 +160,13 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-h100-tests)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_signed_integer_tests.yml
+++ b/.github/workflows/gpu_signed_integer_tests.yml
@@ -11,28 +11,18 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
  FAST_TESTS: TRUE
  NIGHTLY_TESTS: FALSE
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  schedule:
    # Nightly tests @ 1AM after each work day
    - cron: "0 1 * * MON-FRI"
@@ -50,8 +40,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -74,36 +63,19 @@ jobs:
              - '.github/workflows/gpu_signed_integer_tests.yml'
              - scripts/integer-tests.sh
              - ci/slab.toml
-
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-signed-integer-tests)
    runs-on: ubuntu-latest
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if: (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') ||
      github.event_name == 'workflow_dispatch' ||
      needs.should-run.outputs.gpu_test == 'true'
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -113,13 +85,20 @@ jobs:
          backend: hyperstack
          profile: gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-signed-integer-tests:
    name: CUDA signed integer tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -135,18 +114,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -184,8 +159,9 @@ jobs:
    needs: [ setup-instance, cuda-signed-integer-tests ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_unsigned_integer_classic_tests.yml
+++ b/.github/workflows/gpu_unsigned_integer_classic_tests.yml
@@ -11,28 +11,18 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
+

 jobs:
  should-run:
@@ -47,8 +37,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +61,19 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-unsigned-classic-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if: github.event_name == 'workflow_dispatch' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +83,20 @@ jobs:
          backend: hyperstack
          profile: gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA unsigned integer tests with classical PBS
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +112,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -162,6 +138,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -173,8 +150,9 @@ jobs:
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_unsigned_integer_h100_tests.yml
+++ b/.github/workflows/gpu_unsigned_integer_h100_tests.yml
@@ -11,28 +11,17 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
-  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' || github.event_name == 'pull_request_target' }}
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  IS_PULL_REQUEST: ${{ github.event_name == 'pull_request' }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'

 jobs:
  should-run:
@@ -47,8 +36,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -72,35 +60,25 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-h100-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if: github.event_name == 'workflow_dispatch' ||
      (github.event.action != 'labeled' && needs.should-run.outputs.gpu_test == 'true') ||
      (github.event.action == 'labeled' && github.event.label.name == 'approved' && needs.should-run.outputs.gpu_test == 'true')
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      # Check the permanent remote instance label first: the on-demand remote instance label output is set before the start-remote-instance step ends.
+      # If that step fails because the GitHub Actions runner setup failed, we have to fall back on the permanent instance.
+      # Since the on-demand label is already set at that point, the logical OR must be evaluated in this order;
+      # otherwise the next job would try to run on a nonexistent on-demand instance.
+      runner-name: ${{ steps.use-permanent-instance.outputs.runner_group || steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
+      remote-instance-outcome: ${{ steps.start-remote-instance.outcome }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
+        continue-on-error: true
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -110,13 +88,27 @@ jobs:
          backend: hyperstack
          profile: single-h100

+      # This allows falling back on permanent instances running on Hyperstack.
+      - name: Use permanent remote instance
+        id: use-permanent-instance
+        if: ${{ env.SECRETS_AVAILABLE == 'true' && steps.start-remote-instance.outcome == 'failure' }}
+        run: |
+          echo "runner_group=h100x1" >> "$GITHUB_OUTPUT"
+
+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-tests-linux:
    name: CUDA H100 unsigned integer tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -132,18 +124,15 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        if: needs.setup-instance.outputs.remote-instance-outcome == 'success'
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -162,6 +151,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-tests-linux.result }}
@@ -169,12 +159,13 @@ jobs:

  teardown-instance:
    name: Teardown instance (cuda-h100-tests)
-    if: ${{ always() && needs.setup-instance.result == 'success' }}
+    if: ${{ always() && needs.setup-instance.outputs.remote-instance-outcome == 'success' }}
    needs: [ setup-instance, cuda-tests-linux ]
    runs-on: ubuntu-latest
    steps:
-      - name: Stop instance
+      - name: Stop remote instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/gpu_unsigned_integer_tests.yml
+++ b/.github/workflows/gpu_unsigned_integer_tests.yml
@@ -11,29 +11,18 @@ env:
  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-  MSG_MINIMAL: event,action url,commit
-  BRANCH: ${{ github.head_ref || github.ref }}
  FAST_TESTS: TRUE
  NIGHTLY_TESTS: FALSE
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}
+  # Secrets will be available only to zama-ai organization members
+  SECRETS_AVAILABLE: ${{ secrets.JOB_SECRET != '' }}
+  EXTERNAL_CONTRIBUTION_RUNNER: "gpu_ubuntu-22.04"

 on:
  # Allows you to run this workflow manually from the Actions tab as an alternative.
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  schedule:
    # Nightly tests @ 1AM after each work day
    - cron: "0 1 * * MON-FRI"
@@ -51,8 +40,7 @@ jobs:
        with:
          fetch-depth: 0
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Check for file changes
        id: changed-files
@@ -76,35 +64,19 @@ jobs:
              - scripts/integer-tests.sh
              - ci/slab.toml

-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  setup-instance:
    name: Setup instance (cuda-unsigned-integer-tests)
-    needs: [ should-run, check-user-permission ]
+    needs: should-run
    if: (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') ||
      github.event_name == 'workflow_dispatch' ||
      needs.should-run.outputs.gpu_test == 'true'
    runs-on: ubuntu-latest
    outputs:
-      runner-name: ${{ steps.start-instance.outputs.label }}
+      runner-name: ${{ steps.start-remote-instance.outputs.label || steps.start-github-instance.outputs.runner_group }}
    steps:
-      - name: Start instance
-        id: start-instance
+      - name: Start remote instance
+        id: start-remote-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: start
@@ -114,13 +86,20 @@ jobs:
          backend: hyperstack
          profile: gpu-test

+      # This instance is spawned specifically for pull requests from forked repositories
+      - name: Start GitHub instance
+        id: start-github-instance
+        if: env.SECRETS_AVAILABLE == 'false'
+        run: |
+          echo "runner_group=${{ env.EXTERNAL_CONTRIBUTION_RUNNER }}" >> "$GITHUB_OUTPUT"
+
  cuda-unsigned-integer-tests:
    name: CUDA unsigned integer tests
    needs: [ should-run, setup-instance ]
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.setup-instance.result != 'skipped')
+    if: github.event_name != 'pull_request' ||
+      (github.event_name == 'pull_request' && needs.setup-instance.result != 'skipped')
    concurrency:
-      group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+      group: ${{ github.workflow_ref }}
      cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    strategy:
@@ -136,18 +115,14 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: 'false'
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Setup Hyperstack dependencies
-        uses: ./.github/actions/hyperstack_setup
+        uses: ./.github/actions/gpu_setup
        with:
          cuda-version: ${{ matrix.cuda }}
          gcc-version: ${{ matrix.gcc }}
-
-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
+          github-instance: ${{ env.SECRETS_AVAILABLE == 'false' }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -174,6 +149,7 @@ jobs:
    continue-on-error: true
    steps:
      - name: Send message
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990
        env:
          SLACK_COLOR: ${{ needs.cuda-unsigned-integer-tests.result }}
@@ -187,6 +163,7 @@ jobs:
    steps:
      - name: Stop instance
        id: stop-instance
+        if: env.SECRETS_AVAILABLE == 'true'
        uses: zama-ai/slab-github-runner@79939325c3c429837c10d6041e4fd8589d328bac
        with:
          mode: stop
--- a/.github/workflows/integer_long_run_tests.yml
+++ b/.github/workflows/integer_long_run_tests.yml
@@ -42,7 +42,7 @@ jobs:
    name: Long run CPU tests
    needs: [ setup-instance ]
    concurrency:
-      group: ${{ github.workflow }}_${{github.event_name}}_${{ github.ref }}
+      group: ${{ github.workflow_ref }}_${{github.event_name}}
      cancel-in-progress: true
    runs-on: ${{ needs.setup-instance.outputs.runner-name }}
    timeout-minutes: 4320 # 72 hours
--- a/.github/workflows/m1_tests.yml
+++ b/.github/workflows/m1_tests.yml
@@ -2,20 +2,8 @@ name: Tests on M1 CPU

 on:
  workflow_dispatch:
-  # Trigger pull_request event on CI files to be able to test changes before merging to main branch.
-  # Workflow would fail if changes come from a forked repository since secrets are not available with this event.
  pull_request:
    types: [ labeled ]
-    paths:
-      - '.github/**'
-      - 'ci/**'
-  # General entry point for Zama's pull request as well as contribution from forks.
-  pull_request_target:
-    types: [ labeled ]
-    paths:
-      - '**'
-      - '!.github/**'
-      - '!ci/**'
  # Have a nightly build for M1 tests
  schedule:
    # * is a special character in YAML so you have to quote this string
@@ -33,32 +21,14 @@ env:
  # We clear the cache to reduce memory pressure because of the numerous processes of cargo
  # nextest
  TFHE_RS_CLEAR_IN_MEMORY_KEY_CACHE: "1"
-  REF: ${{ github.event.pull_request.head.sha || github.sha }}
+  CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN || secrets.GITHUB_TOKEN }}

 concurrency:
-  group: ${{ github.workflow }}_${{ github.head_ref || github.ref }}
+  group: ${{ github.workflow_ref }}
  cancel-in-progress: true

 jobs:
-  check-ci-files:
-    uses: ./.github/workflows/check_ci_files_change.yml
-    with:
-      checkout_ref: ${{ github.event.pull_request.head.sha || github.sha }}
-    secrets:
-      REPO_CHECKOUT_TOKEN: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-
-  # Fail if the triggering actor is not part of Zama organization.
-  # If pull_request_target is emitted and CI files have changed, skip this job. This would skip following jobs.
-  check-user-permission:
-    needs: check-ci-files
-    if: github.event_name != 'pull_request_target' ||
-      (github.event_name == 'pull_request_target' && needs.check-ci-files.outputs.ci_file_changed == 'false')
-    uses: ./.github/workflows/check_actor_permissions.yml
-    secrets:
-      TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
  cargo-builds-m1:
-    needs: check-user-permission
    if: ${{ (github.event_name == 'schedule' &&  github.repository == 'zama-ai/tfhe-rs') ||
      github.event_name == 'workflow_dispatch' ||
      contains(github.event.label.name, 'm1_test') }}
@@ -70,8 +40,7 @@ jobs:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
        with:
          persist-credentials: "false"
-          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
-          ref: ${{ env.REF }}
+          token: ${{ env.CHECKOUT_TOKEN }}

      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
@@ -213,7 +182,7 @@ jobs:
    if: ${{ always() }}
    steps:
      - uses: actions-ecosystem/action-remove-labels@2ce5d41b4b6aa8503e285553f75ed56e0a40bae0
-        if: ${{ github.event_name == 'pull_request_target' }}
+        if: ${{ github.event_name == 'pull_request' }}
        with:
          labels: m1_test
          github_token: ${{ secrets.GITHUB_TOKEN }}
@@ -230,4 +199,4 @@ jobs:
          SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
          MSG_MINIMAL: event,action url,commit
-          BRANCH: ${{ github.head_ref || github.ref }}
+          BRANCH: ${{ github.ref }}
--- a/.github/workflows/make_release.yml
+++ b/.github/workflows/make_release.yml
@@ -51,7 +51,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate
          path: target/package/*.crate
@@ -62,7 +62,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -78,6 +78,10 @@ jobs:
    name: Publish Release
    needs: [package] # for comparing hashes
    runs-on: ubuntu-latest
+    # For provenance of npmjs publish
+    permissions:
+      contents: read
+      id-token: write
    steps:
      - name: Checkout
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -90,7 +94,7 @@ jobs:
        run: |
          echo "NPM_TAG=latest" >> "${GITHUB_ENV}"
      - name: Download artifact
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@cc203385981b70ca67e1cc392babf9cc229d5806 # v4.1.9
        with:
          name: crate
          path: target/package
--- a/.github/workflows/make_release_cuda.yml
+++ b/.github/workflows/make_release_cuda.yml
@@ -61,13 +61,9 @@ jobs:
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
        with:
          fetch-depth: 0
-          persist-credentials: 'false'
+          persist-credentials: "false"
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}

-      - name: Set up home
-        run: |
-          echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
-
      - name: Install latest stable
        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
        with:
@@ -103,7 +99,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -127,7 +123,35 @@ jobs:
          - os: ubuntu-22.04
            cuda: "12.2"
            gcc: 9
+    env:
+      CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
    steps:
+      - name: Install latest stable
+        uses: dtolnay/rust-toolchain@a54c7afa936fefeb4456b2dd8068152669aa8203
+        with:
+          toolchain: stable
+
+      - name: Export CUDA variables
+        if: ${{ !cancelled() }}
+        run: |
+          echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
+          {
+            echo "CUDA_PATH=$CUDA_PATH";
+            echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH";
+            echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc";
+          } >> "${GITHUB_ENV}"
+
+      # Specify the correct host compilers
+      - name: Export gcc and g++ variables
+        if: ${{ !cancelled() }}
+        run: |
+          {
+            echo "CC=/usr/bin/gcc-${{ matrix.gcc }}";
+            echo "CXX=/usr/bin/g++-${{ matrix.gcc }}";
+            echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}";
+            echo "HOME=/home/ubuntu";
+          } >> "${GITHUB_ENV}"
+
      - name: Publish crate.io package
        env:
          CRATES_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
@@ -162,7 +186,7 @@ jobs:
  teardown-instance:
    name: Teardown instance (publish-release)
    if: ${{ always() && needs.setup-instance.result == 'success' }}
-    needs: [ setup-instance, publish-cuda-release ]
+    needs: [setup-instance, publish-cuda-release]
    runs-on: ubuntu-latest
    steps:
      - name: Stop instance
--- a/.github/workflows/make_release_tfhe_csprng.yml
+++ b/.github/workflows/make_release_tfhe_csprng.yml
@@ -30,7 +30,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe-csprng
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate-tfhe-csprng
          path: target/package/*.crate
@@ -42,7 +42,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -66,7 +66,7 @@ jobs:
          fetch-depth: 0
          token: ${{ secrets.FHE_ACTIONS_TOKEN }}
      - name: Download artifact
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@cc203385981b70ca67e1cc392babf9cc229d5806 # v4.1.9
        with:
          name: crate-tfhe-csprng
          path: target/package
--- a/.github/workflows/make_release_tfhe_fft.yml
+++ b/.github/workflows/make_release_tfhe_fft.yml
@@ -33,7 +33,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe-fft
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate
          path: target/package/*.crate
@@ -44,7 +44,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
--- a/.github/workflows/make_release_tfhe_ntt.yml
+++ b/.github/workflows/make_release_tfhe_ntt.yml
@@ -33,7 +33,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe-ntt
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate
          path: target/package/*.crate
@@ -44,7 +44,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
--- a/.github/workflows/make_release_tfhe_versionable.yml
+++ b/.github/workflows/make_release_tfhe_versionable.yml
@@ -2,14 +2,13 @@ name: Publish tfhe-versionable release

 on:
  workflow_dispatch:
-    inputs:
-      dry_run:
-        description: "Dry-run"
-        type: boolean
-        default: true

 env:
  ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+  SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
+  SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
+  SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
+  SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

 jobs:
  verify_tag:
@@ -19,6 +18,7 @@ jobs:
      READ_ORG_TOKEN: ${{ secrets.READ_ORG_TOKEN }}

  package-derive:
+    name: Package tfhe-versionable-derive Release
    runs-on: ubuntu-latest
    outputs:
      hash: ${{ steps.hash.outputs.hash }}
@@ -30,7 +30,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe-versionable-derive
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate-tfhe-versionable-derive
          path: target/package/*.crate
@@ -40,7 +40,7 @@ jobs:

  provenance-derive:
    needs: [package-derive]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -53,8 +53,8 @@ jobs:
      base64-subjects: ${{ needs.package-derive.outputs.hash }}

  publish_release-derive:
-    name: Publish tfhe-versionable Release
-    needs: [verify_tag, package-derive] # for comparing hashes
+    name: Publish tfhe-versionable-derive Release
+    needs: [ verify_tag, package-derive ] # for comparing hashes
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
@@ -64,7 +64,7 @@ jobs:
          persist-credentials: 'false'
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
      - name: Download artifact
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@cc203385981b70ca67e1cc392babf9cc229d5806 # v4.1.9
        with:
          name: crate-tfhe-versionable-derive
          path: target/package
@@ -72,7 +72,7 @@ jobs:
        env:
          CRATES_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
        run: |
-          cargo publish -p tfhe-versionable-derive --token ${{ env.CRATES_TOKEN }} ${{ env.DRY_RUN }}
+          cargo publish -p tfhe-versionable-derive --token ${{ env.CRATES_TOKEN }}
      - name: Generate hash
        id: published_hash
        run: cd target/package && echo "pub_hash=$(sha256sum ./*.crate | base64 -w0)" >> "${GITHUB_OUTPUT}"
@@ -82,24 +82,18 @@ jobs:
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990 # v2.3.2
        env:
          SLACK_COLOR: failure
-          SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
-          SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
          SLACK_MESSAGE: "SLSA tfhe-versionable-derive - hash comparison failure: (${{ env.ACTION_RUN_URL }})"
-          SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
-          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
      - name: Slack Notification
        if: ${{ failure() }}
        continue-on-error: true
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990 # v2.3.2
        env:
          SLACK_COLOR: ${{ job.status }}
-          SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
-          SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
          SLACK_MESSAGE: "tfhe-versionable-derive release finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
-          SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
-          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

  package:
+    name: Package tfhe-versionable Release
+    needs: publish_release-derive
    runs-on: ubuntu-latest
    outputs:
      hash: ${{ steps.hash.outputs.hash }}
@@ -111,7 +105,7 @@ jobs:
      - name: Prepare package
        run: |
          cargo package -p tfhe-versionable
-      - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+      - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
        with:
          name: crate-tfhe-versionable
          path: target/package/*.crate
@@ -120,8 +114,8 @@ jobs:
        run: cd target/package && echo "hash=$(sha256sum ./*.crate | base64 -w0)" >> "${GITHUB_OUTPUT}"

  provenance:
-    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    needs: package
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -135,7 +129,7 @@ jobs:

  publish_release:
    name: Publish tfhe-versionable Release
-    needs: [package] # for comparing hashes
+    needs: package # for comparing hashes
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
@@ -143,7 +137,7 @@ jobs:
        with:
          fetch-depth: 0
      - name: Download artifact
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@cc203385981b70ca67e1cc392babf9cc229d5806 # v4.1.9
        with:
          name: crate-tfhe-versionable
          path: target/package
@@ -151,32 +145,21 @@ jobs:
        env:
          CRATES_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
        run: |
-          cargo publish -p tfhe-versionable --token ${{ env.CRATES_TOKEN }} ${{ env.DRY_RUN }}
-
+          cargo publish -p tfhe-versionable --token ${{ env.CRATES_TOKEN }}
      - name: Generate hash
        id: published_hash
        run: cd target/package && echo "pub_hash=$(sha256sum ./*.crate | base64 -w0)" >> "${GITHUB_OUTPUT}"
-
      - name: Slack notification (hashes comparison)
        if: ${{ needs.package.outputs.hash != steps.published_hash.outputs.pub_hash }}
        continue-on-error: true
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990 # v2.3.2
        env:
          SLACK_COLOR: failure
-          SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
-          SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
          SLACK_MESSAGE: "SLSA tfhe-versionable - hash comparison failure: (${{ env.ACTION_RUN_URL }})"
-          SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
-          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
-
      - name: Slack Notification
        if: ${{ failure() }}
        continue-on-error: true
        uses: rtCamp/action-slack-notify@c33737706dea87cd7784c687dadc9adf1be59990 # v2.3.2
        env:
          SLACK_COLOR: ${{ job.status }}
-          SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
-          SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
          SLACK_MESSAGE: "tfhe-versionable release finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
-          SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
-          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
--- a/.github/workflows/make_release_zk_pok.yml
+++ b/.github/workflows/make_release_zk_pok.yml
@@ -24,7 +24,7 @@ jobs:
        - name: Prepare package
          run: |
            cargo package -p tfhe-zk-pok
-        - uses: actions/upload-artifact@65c4c4a1ddee5b72f698fdd19549f0f0fb45cf08 # v4.6.0
+        - uses: actions/upload-artifact@4cec3d8aa04e39d1a68397de0c4cd6fb9dce8ec1 # v4.6.1
          with:
            name: crate-zk-pok
            path: target/package/*.crate
@@ -34,7 +34,7 @@ jobs:
  provenance:
    if: ${{ !inputs.dry_run  }}
    needs: [package]
-    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.0.0
+    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_generic_slsa3.yml@v2.1.0
    permissions:
      # Needed to detect the GitHub Actions environment
      actions: read
@@ -64,7 +64,7 @@ jobs:
          persist-credentials: 'false'
          token: ${{ secrets.REPO_CHECKOUT_TOKEN }}
      - name: Download artifact
-        uses: actions/download-artifact@fa0a91b85d4f404e444e00e005971372dc801d16 # v4.1.8
+        uses: actions/download-artifact@cc203385981b70ca67e1cc392babf9cc229d5806 # v4.1.9
        with:
          name: crate-zk-pok
          path: target/package
--- a/.gitignore
+++ b/.gitignore
@@ -32,5 +32,9 @@ web-test-runner/
 node_modules/
 package-lock.json

+# Python .env
+.env
+
 # Dir used for backward compatibility test data
 tests/tfhe-backward-compat-data/
+ci/
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,233 @@
+# Contributing to TFHE-rs
+
+This document provides guidance on how to contribute to **TFHE-rs**.
+
+There are two ways to contribute:
+
+- **Report issues:** Open issues on GitHub to report bugs, suggest improvements, or note typos.
+- **Submit code:** To become an official contributor, you must sign our Contributor License Agreement (CLA). Our CLA-bot will guide you through this process when you open your first pull request.
+
+## 1. Setting up the project
+
+Start by [forking](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) the **TFHE-rs** repository.
+
+{% hint style="info" %}
+- **Rust version**: Ensure that you use a Rust version >= 1.81 to compile **TFHE-rs**.
+- **Incompatibility**: AArch64-based machines running Windows are not yet supported, because the platform currently lacks an entropy source to seed the [CSPRNGs](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator) used in **TFHE-rs**.
+- **Performance**: For optimal performance, it is highly recommended to run **TFHE-rs** code in release mode with cargo's `--release` flag.
+{% endhint %}
+
+To get more details about the library, please refer to the [documentation](https://docs.zama.ai/tfhe-rs).
+
+## 2. Creating a new branch
+
+When creating your branch, make sure to use the following format:
+
+```
+git checkout -b {feat|fix|docs|chore…}/short_description
+```
+
+For example:
+
+```
+git checkout -b feat/new_feature_X
+```
+
+## 3. Before committing
+
+### 3.1 Linting
+
+Each commit to **TFHE-rs** should conform to the standards of the project. In particular, every source code, Docker, or workflow file should be linted to prevent programmatic and stylistic errors.
+
+- Rust source code linters: `clippy`
+- TypeScript/JavaScript source code linters: `eslint`, `prettier`
+
+To apply automatic code formatting, run:
+
+```
+make fmt
+```
+
+You can perform linting of all Cargo targets with:
+
+```
+make clippy_all_targets
+```
+
+### 3.2 Testing
+
+Your contributions must include comprehensive documentation and tests without breaking existing tests. To run pre-commit checks, execute:
+
+```
+make pcc
+```
+
+This command ensures that all the targets in the library build correctly.
+For a faster check, use:
+
+```
+make fpcc
+```
+
+If you're contributing to GPU code, also run:
+
+```
+make pcc_gpu
+```
+
+Unit test suites are heavy and can require a lot of computing power and RAM.
+While tests run automatically in the continuous integration pipeline, you can also run them locally.
+
+All unit tests have a command formatted as:
+
+```
+make test_*
+```
+
+Run `make help` to display a list of all the commands available.
+
+To quickly test your changes locally, follow these steps:
+ 1. Locate where the code has changed.
+ 2. Add (or modify) a Cargo test filter to the corresponding `make` target in the Makefile.
+ 3. Run the target.
+
+{% hint style="success" %}
+`make test_<something>` prints the underlying cargo command to STDOUT. You can quickly test your changes by copying that command and modifying it to suit your needs.
+{% endhint %}
+
+For example, if you made changes in `tfhe/src/integer/*`, you can test them with the following steps:
+ 1. In the `test_integer` target, replace the filter `-- integer::` with `-- my_new_test`.
+ 2. Run `make test_integer`.
+
+## 4. Committing
+
+**TFHE-rs** follows the conventional commit specification to maintain a consistent commit history, essential for Semantic Versioning ([semver.org](https://semver.org/)).
+Commit messages are automatically checked in CI and will be rejected if they do not comply, so make sure that you follow the commit conventions detailed on
+[this page](https://www.conventionalcommits.org/en/v1.0.0/).
+
+## 5. Rebasing
+
+Before creating a pull request, rebase your branch on the repository's `main` branch. Merge commits are not permitted, so rebase instead; this also means fewer conflicts and a smoother PR review process.
+
+## 6. Opening a Pull Request
+
+Once your changes are ready, open a pull request.
+
+For instructions on creating a PR from a fork, refer to GitHub's [official documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork).
+
+## 7. Continuous integration
+
+Before a pull request can be merged, several test suites run automatically. Below is an overview of the CI process:
+
+```mermaid
+---
+title: Continuous Integration Process
+---
+sequenceDiagram
+    autonumber
+
+    participant Contributor
+    participant GitHub
+    participant Reviewer
+    participant CI-pipeline
+
+    Contributor ->> GitHub: Open pull-request
+    GitHub -->> Contributor: Ask for CLA signing (once)
+    loop
+        Reviewer ->> GitHub: Review code
+        Reviewer ->> CI-pipeline: Approve workflows (short-run)
+        CI-pipeline -->> GitHub: Send checks results
+        Contributor ->> GitHub: Make changes
+    end
+    Reviewer ->> GitHub: Pull-request approval
+    Reviewer ->> CI-pipeline: Approve workflows (long-run)
+    CI-pipeline -->> GitHub: Send checks results
+    Reviewer -->> GitHub: Merge if pipeline green
+```
+
+> [!Note]
+>Useful details:
+>* the pipeline is triggered by humans
+>* the review team works in the Paris timezone, so pipeline launches will most likely happen during office hours
+>* direct changes to CI-related files are not allowed for external contributors
+>* run `make pcc` to fix any build errors before pushing commits
+
+## 8. Data versioning
+
+Data serialized with TFHE-rs must remain backward compatible. This is done using the [tfhe-versionable](https://crates.io/crates/tfhe-versionable) crate.
+
+If you modify a type that derives `Versionize` in a backward-incompatible way, an upgrade implementation must be provided.
+
+For example, these changes are data breaking:
+ * Adding a field to a struct.
+ * Changing the order of the fields within a struct or the variants within an enum.
+ * Renaming a field of a struct or a variant of an enum.
+ * Changing the type of a field in a struct or of a variant in an enum.
+
+Conversely, these changes are *not* data breaking:
+ * Renaming a type (unless it implements the `Named` trait).
+ * Adding a variant to the end of an enum.
+
+### Example: adding a field
+
+Suppose you want to add an `i32` field to a type named `MyType`. The original type is defined as:
+```rust
+#[derive(Serialize, Deserialize, Versionize)]
+#[versionize(MyTypeVersions)]
+struct MyType {
+  val: u64,
+}
+```
+And you want to change it to:
+```rust
+#[derive(Serialize, Deserialize, Versionize)]
+#[versionize(MyTypeVersions)]
+struct MyType {
+  val: u64,
+  other_val: i32
+}
+```
+
+Follow these steps:
+
+ 1. Navigate to the definition of the dispatch enum of this type. This is the type inside the `#[versionize(MyTypeVersions)]` macro attribute. In general, this type has the same name as the base type with a `Versions` suffix. You should find something like:
+
+```rust
+#[derive(VersionsDispatch)]
+enum MyTypeVersions {
+  V0(MyTypeV0),
+  V1(MyType)
+}
+```
+
+ 2. Add a new variant to the enum to preserve the previous version of the type. You can simply copy and paste the previous definition of the type and add a version suffix:
+
+```rust
+#[derive(Version)]
+struct MyTypeV1 {
+  val: u64,
+}
+
+#[derive(VersionsDispatch)]
+enum MyTypeVersions {
+  V0(MyTypeV0),
+  V1(MyTypeV1),
+  V2(MyType) // Here this points to your modified type
+}
+```
+
+ 3. Implement the `Upgrade` trait to define how to upgrade from the previous version to the current one:
+```rust
+impl Upgrade<MyType> for MyTypeV1 {
+  type Error = std::convert::Infallible;
+
+  fn upgrade(self) -> Result<MyType, Self::Error> {
+    Ok(MyType {
+      val: self.val,
+      other_val: 0, // default value for data serialized before the field existed
+    })
+  }
+}
+```
+
+ 4. Fix the upgrade target of the previous version. In this example, `impl Upgrade<MyType> for MyTypeV0 {` should simply be changed to `impl Upgrade<MyTypeV1> for MyTypeV0 {`.
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -21,7 +21,7 @@ bytemuck = "1.14.3"
 dyn-stack = { version = "0.11", default-features = false }
 itertools = "0.14"
 num-complex = "0.4"
-pulp = { version = "0.20", default-features = false }
+pulp = { version = "0.21", default-features = false }
 rand = "0.8"
 rayon = "1"
 serde = { version = "1.0", default-features = false }
--- a/2
+++ b/2
@@ -1,6 +1,6 @@
 BSD 3-Clause Clear License

-Copyright © 2024 ZAMA.
+Copyright © 2025 ZAMA.
 All rights reserved.

 Redistribution and use in source and binary forms, with or without modification,
--- a/56
+++ b/56
@@ -18,6 +18,8 @@ FAST_BENCH?=FALSE
 NIGHTLY_TESTS?=FALSE
 BENCH_OP_FLAVOR?=DEFAULT
 BENCH_TYPE?=latency
+BENCH_PARAM_TYPE?=classical
+BENCH_PARAMS_SET?=default
 NODE_VERSION=22.6
 BACKWARD_COMPAT_DATA_URL=https://github.com/zama-ai/tfhe-backward-compat-data.git
 BACKWARD_COMPAT_DATA_BRANCH?=$(shell ./scripts/backward_compat_data_version.py)
@@ -363,7 +365,18 @@ clippy_rustdoc: install_rs_check_toolchain
 	fi && \
 	CLIPPYFLAGS="-D warnings" RUSTDOCFLAGS="--no-run --nocapture --test-builder ./scripts/clippy_driver.sh -Z unstable-options" \
 		cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" test --doc \
-		--features=boolean,shortint,integer,zk-pok,pbs-stats,strings \
+		--features=boolean,shortint,integer,zk-pok,pbs-stats,strings,experimental \
+		-p $(TFHE_SPEC)
+
+.PHONY: clippy_rustdoc_gpu # Run clippy lints on doctests enabling the boolean, shortint, integer, zk-pok and gpu features
+clippy_rustdoc_gpu: install_rs_check_toolchain
+	if [[ "$(OS)" != "Linux" ]]; then \
+		echo "WARNING: skipped clippy_rustdoc_gpu, unsupported OS $(OS)"; \
+		exit 0; \
+	fi && \
+	CLIPPYFLAGS="-D warnings" RUSTDOCFLAGS="--no-run --nocapture --test-builder ./scripts/clippy_driver.sh -Z unstable-options" \
+		cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" test --doc \
+		--features=boolean,shortint,integer,zk-pok,pbs-stats,strings,experimental,gpu \
 		-p $(TFHE_SPEC)

 .PHONY: clippy_c_api # Run clippy lints enabling the boolean, shortint and the C API
@@ -956,6 +969,10 @@ check_intra_md_links: install_mlc
 check_md_links: install_mlc
 	mlc --match-file-extension tfhe/docs

+.PHONY: check_parameter_export_ok # Checks exported "current" shortint parameter module is correct
+check_parameter_export_ok:
+	python3 ./scripts/check_current_param_export.py
+
 .PHONY: check_compile_tests # Build tests in debug without running them
 check_compile_tests: install_rs_build_toolchain
 	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --no-run \
@@ -1156,10 +1173,24 @@ bench_boolean: install_rs_check_toolchain

 .PHONY: bench_pbs # Run benchmarks for PBS
 bench_pbs: install_rs_check_toolchain
-	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
 	--bench pbs-bench \
 	--features=boolean,shortint,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)

+.PHONY: bench_ks_pbs # Run benchmarks for KS-PBS
+bench_ks_pbs: install_rs_check_toolchain
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_PARAM_TYPE=$(BENCH_PARAM_TYPE) __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) \
+	cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	--bench ks-pbs-bench \
+	--features=boolean,shortint,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)
+
+.PHONY: bench_ks_pbs_gpu # Run benchmarks for KS-PBS on GPU backend
+bench_ks_pbs_gpu: install_rs_check_toolchain
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_PARAM_TYPE=$(BENCH_PARAM_TYPE) __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) \
+	cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	--bench ks-pbs-bench \
+	--features=boolean,shortint,gpu,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)
+
 .PHONY: bench_pbs128 # Run benchmarks for PBS using FFT 128 bits
 bench_pbs128: install_rs_check_toolchain
 	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
@@ -1168,19 +1199,20 @@ bench_pbs128: install_rs_check_toolchain

 .PHONY: bench_pbs_gpu # Run benchmarks for PBS on GPU backend
 bench_pbs_gpu: install_rs_check_toolchain
-	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_FAST_BENCH=$(FAST_BENCH) cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_FAST_BENCH=$(FAST_BENCH) __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) \
+	cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
 	--bench pbs-bench \
 	--features=boolean,shortint,gpu,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)

 .PHONY: bench_ks # Run benchmarks for keyswitch
 bench_ks: install_rs_check_toolchain
-	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
 	--bench ks-bench \
 	--features=boolean,shortint,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)

 .PHONY: bench_ks_gpu # Run benchmarks for PBS on GPU backend
 bench_ks_gpu: install_rs_check_toolchain
-	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
+	RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_PARAMS_SET=$(BENCH_PARAMS_SET) cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
 	--bench ks-bench \
 	--features=boolean,shortint,gpu,internal-keycache,nightly-avx512 -p $(TFHE_SPEC)

@@ -1281,7 +1313,7 @@ parse_wasm_benchmarks: install_rs_check_toolchain

 .PHONY: write_params_to_file # Gather all crypto parameters into a file with a Sage readable format.
 write_params_to_file: install_rs_check_toolchain
-	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) run --profile $(CARGO_PROFILE) \
+	RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) run \
 	--example write_params_to_file --features=boolean,shortint,internal-keycache

 .PHONY: clone_backward_compat_data # Clone the data repo needed for backward compatibility tests
@@ -1313,15 +1345,17 @@ sha256_bool: install_rs_check_toolchain
 	--example sha256_bool --features=boolean

 .PHONY: pcc # pcc stands for pre commit checks (except GPU)
-pcc: no_tfhe_typo no_dbg_log check_fmt check_typos lint_doc check_md_docs_are_tested check_intra_md_links \
-clippy_all check_compile_tests test_tfhe_lints tfhe_lints
+pcc: no_tfhe_typo no_dbg_log check_parameter_export_ok check_fmt check_typos lint_doc \
+check_md_docs_are_tested check_intra_md_links clippy_all check_compile_tests test_tfhe_lints \
+tfhe_lints

 .PHONY: pcc_gpu # pcc stands for pre commit checks for GPU compilation
-pcc_gpu: clippy_gpu clippy_cuda_backend check_compile_tests_benches_gpu check_rust_bindings_did_not_change
+pcc_gpu: check_rust_bindings_did_not_change clippy_rustdoc_gpu \
+clippy_gpu clippy_cuda_backend check_compile_tests_benches_gpu

 .PHONY: fpcc # pcc stands for pre commit checks, the f stands for fast
-fpcc: no_tfhe_typo no_dbg_log check_fmt check_typos lint_doc check_md_docs_are_tested clippy_fast \
-check_compile_tests
+fpcc: no_tfhe_typo no_dbg_log check_parameter_export_ok check_fmt check_typos lint_doc \
+check_md_docs_are_tested clippy_fast check_compile_tests

 .PHONY: conformance # Automatically fix problems that can be fixed
 conformance: fix_newline fmt fmt_js
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@
 <hr/>

 <p align="center">
-  <a href="https://docs.zama.ai/tfhe-rs"> 📒 Documentation</a> | <a href="https://zama.ai/community"> 💛 Community support</a> | <a href="https://github.com/zama-ai/awesome-zama"> 📚 FHE resources by Zama</a>
+  <a href="https://github.com/zama-ai/tfhe-rs-handbook/blob/main/tfhe-rs-handbook.pdf"> 📃 Read Handbook</a> | <a href="https://docs.zama.ai/tfhe-rs"> 📒 Documentation</a> | <a href="https://zama.ai/community"> 💛 Community support</a> | <a href="https://github.com/zama-ai/awesome-zama"> 📚 FHE resources by Zama</a>
 </p>


@@ -67,6 +67,9 @@ production-ready library for all the advanced features of TFHE.

 ## Getting started

+> [!Important]
+> **TFHE-rs** released its first stable version v1.0.0 in February 2025, stabilizing the high-level API for the x86 CPU backend.
+
 ### Cargo.toml configuration
 To use the latest version of `TFHE-rs` in your project, you first need to add it as a dependency in your `Cargo.toml`:

@@ -75,13 +78,13 @@ tfhe = { version = "*", features = ["boolean", "shortint", "integer"] }
 ```

 > [!Note]
-> Note: You need to use a Rust version >= 1.81 to compile TFHE-rs.
+> Note: You need to use Rust version >= 1.84 to compile TFHE-rs.

 > [!Note]
-> Note: aarch64-based machines are not yet supported for Windows as it's currently missing an entropy source to be able to seed the [CSPRNGs](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator) used in TFHE-rs.
+> Note: AArch64-based machines are not supported for Windows as it's currently missing an entropy source to be able to seed the [CSPRNGs](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator) used in TFHE-rs.

 <p align="right">
-  <a href="#about" > ↑ Back to top </a> 
+  <a href="#about" > ↑ Back to top </a>
 </p>

 ### A simple example
@@ -138,7 +141,7 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
 }
 ```

-To run this code, use the following command: 
+To run this code, use the following command:
 <p align="center"> <code> cargo run --release </code> </p>

 > [!Note]
@@ -148,12 +151,15 @@ to run in release mode with cargo's `--release` flag to have the best performanc
 *Find an example with more explanations in [this part of the documentation](https://docs.zama.ai/tfhe-rs/get-started/quick_start)*

 <p align="right">
-  <a href="#about" > ↑ Back to top </a> 
+  <a href="#about" > ↑ Back to top </a>
 </p>



-## Resources 
+## Resources
+
+### TFHE-rs Handbook
+A document containing scientific and technical details about the algorithms implemented in the library is available here: [TFHE-rs: A (Practical) Handbook](https://github.com/zama-ai/tfhe-rs-handbook/blob/main/tfhe-rs-handbook.pdf).

 ### TFHE deep dive
 - [TFHE Deep Dive - Part I - Ciphertext types](https://www.zama.ai/post/tfhe-deep-dive-part-1)
@@ -176,7 +182,7 @@ to run in release mode with cargo's `--release` flag to have the best performanc

 Full, comprehensive documentation is available here: [https://docs.zama.ai/tfhe-rs](https://docs.zama.ai/tfhe-rs).
 <p align="right">
-  <a href="#about" > ↑ Back to top </a> 
+  <a href="#about" > ↑ Back to top </a>
 </p>


@@ -194,9 +200,13 @@ When a new update is published in the Lattice Estimator, we update parameters ac

 ### Security model

-The default parameters for the TFHE-rs library are chosen considering the IND-CPA security model, and are selected with a bootstrapping failure probability fixed at p_error = $2^{-64}$. In particular, it is assumed that the results of decrypted computations are not shared by the secret key owner with any third parties, as such an action can lead to leakage of the secret encryption key. If you are designing an application where decryptions must be shared, you will need to craft custom encryption parameters which are chosen in consideration of the IND-CPA^D security model [1]. 
+By default, the parameter sets used in the High-Level API with the x86 CPU backend have a failure probability $\le 2^{-128}$ to securely work in the IND-CPA^D model using the algorithmic techniques provided in our code base [1].
+If you want to work within the IND-CPA security model, which is less strict than the IND-CPA^D model, the parameter sets can easily be changed and would have slightly better performance. More details can be found in the [TFHE-rs documentation](https://docs.zama.ai/tfhe-rs).

-[1] Li, Baiyu, et al. "Securing approximate homomorphic encryption using differential privacy." Annual International Cryptology Conference. Cham: Springer Nature Switzerland, 2022. https://eprint.iacr.org/2022/816.pdf
+The default parameters used in the High-Level API with the GPU backend are chosen considering the IND-CPA security model, and are selected with a bootstrapping failure probability fixed at $p_{error} \le 2^{-64}$. In particular, it is assumed that the results of decrypted computations are not shared by the secret key owner with any third parties, as such an action can lead to leakage of the secret encryption key. If you are designing an application where decryptions must be shared, you will need to craft custom encryption parameters which are chosen in consideration of the IND-CPA^D security model [2].
+
+[1] Bernard, Olivier, et al. "Drifting Towards Better Error Probabilities in Fully Homomorphic Encryption Schemes". https://eprint.iacr.org/2024/1718.pdf
+[2] Li, Baiyu, et al. "Securing approximate homomorphic encryption using differential privacy." Annual International Cryptology Conference. Cham: Springer Nature Switzerland, 2022. https://eprint.iacr.org/2022/816.pdf

 #### Side-channel attacks

@@ -245,7 +255,7 @@ This software is distributed under the **BSD-3-Clause-Clear** license. Read [thi
 >We are open to collaborating and advancing the FHE space with our partners. If you have specific needs, please email us at hello@zama.ai.

 <p align="right">
-  <a href="#about" > ↑ Back to top </a> 
+  <a href="#about" > ↑ Back to top </a>
 </p>


@@ -259,8 +269,8 @@ This software is distributed under the **BSD-3-Clause-Clear** license. Read [thi
 </picture>
 </a>

-🌟 If you find this project helpful or interesting, please consider giving it a star on GitHub! Your support helps to grow the community and motivates further development. 
+🌟 If you find this project helpful or interesting, please consider giving it a star on GitHub! Your support helps to grow the community and motivates further development.

 <p align="right">
-  <a href="#about" > ↑ Back to top </a> 
+  <a href="#about" > ↑ Back to top </a>
 </p>
--- a/_typos.toml
+++ b/_typos.toml
@@ -13,3 +13,9 @@ extend-ignore-identifiers-re = [
    # Example in trivium
    "C9217BA0D762ACA1"
 ]
+
+[files]
+extend-exclude = [
+    "backends/tfhe-cuda-backend/cuda/src/fft128/twiddles.cu",
+    "backends/tfhe-cuda-backend/cuda/src/fft/twiddles.cu",
+]
--- a/apps/trivium/README.md
+++ b/apps/trivium/README.md
@@ -18,102 +18,102 @@ use tfhe::prelude::*;
 use tfhe_trivium::TriviumStream;

 fn get_hexadecimal_string_from_lsb_first_stream(a: Vec<bool>) -> String {
-	assert!(a.len() % 8 == 0);
-	let mut hexadecimal: String = "".to_string();
-	for test in a.chunks(8) {
-		// Encoding is bytes in LSB order
-		match test[4..8] {
-			[false, false, false, false] => hexadecimal.push('0'),
-			[true, false, false, false] => hexadecimal.push('1'),
-			[false, true, false, false] => hexadecimal.push('2'),
-			[true, true, false, false] => hexadecimal.push('3'),
+    assert!(a.len() % 8 == 0);
+    let mut hexadecimal: String = "".to_string();
+    for test in a.chunks(8) {
+        // Encoding is bytes in LSB order
+        match test[4..8] {
+            [false, false, false, false] => hexadecimal.push('0'),
+            [true, false, false, false] => hexadecimal.push('1'),
+            [false, true, false, false] => hexadecimal.push('2'),
+            [true, true, false, false] => hexadecimal.push('3'),

-			[false, false, true, false] => hexadecimal.push('4'),
-			[true, false, true, false] => hexadecimal.push('5'),
-			[false, true, true, false] => hexadecimal.push('6'),
-			[true, true, true, false] => hexadecimal.push('7'),
+            [false, false, true, false] => hexadecimal.push('4'),
+            [true, false, true, false] => hexadecimal.push('5'),
+            [false, true, true, false] => hexadecimal.push('6'),
+            [true, true, true, false] => hexadecimal.push('7'),

-			[false, false, false, true] => hexadecimal.push('8'),
-			[true, false, false, true] => hexadecimal.push('9'),
-			[false, true, false, true] => hexadecimal.push('A'),
-			[true, true, false, true] => hexadecimal.push('B'),
+            [false, false, false, true] => hexadecimal.push('8'),
+            [true, false, false, true] => hexadecimal.push('9'),
+            [false, true, false, true] => hexadecimal.push('A'),
+            [true, true, false, true] => hexadecimal.push('B'),

-			[false, false, true, true] => hexadecimal.push('C'),
-			[true, false, true, true] => hexadecimal.push('D'),
-			[false, true, true, true] => hexadecimal.push('E'),
-			[true, true, true, true] => hexadecimal.push('F'),
-			_ => ()
-		};
-		match test[0..4] {
-			[false, false, false, false] => hexadecimal.push('0'),
-			[true, false, false, false] => hexadecimal.push('1'),
-			[false, true, false, false] => hexadecimal.push('2'),
-			[true, true, false, false] => hexadecimal.push('3'),
+            [false, false, true, true] => hexadecimal.push('C'),
+            [true, false, true, true] => hexadecimal.push('D'),
+            [false, true, true, true] => hexadecimal.push('E'),
+            [true, true, true, true] => hexadecimal.push('F'),
+            _ => ()
+        };
+        match test[0..4] {
+            [false, false, false, false] => hexadecimal.push('0'),
+            [true, false, false, false] => hexadecimal.push('1'),
+            [false, true, false, false] => hexadecimal.push('2'),
+            [true, true, false, false] => hexadecimal.push('3'),

-			[false, false, true, false] => hexadecimal.push('4'),
-			[true, false, true, false] => hexadecimal.push('5'),
-			[false, true, true, false] => hexadecimal.push('6'),
-			[true, true, true, false] => hexadecimal.push('7'),
+            [false, false, true, false] => hexadecimal.push('4'),
+            [true, false, true, false] => hexadecimal.push('5'),
+            [false, true, true, false] => hexadecimal.push('6'),
+            [true, true, true, false] => hexadecimal.push('7'),

-			[false, false, false, true] => hexadecimal.push('8'),
-			[true, false, false, true] => hexadecimal.push('9'),
-			[false, true, false, true] => hexadecimal.push('A'),
-			[true, true, false, true] => hexadecimal.push('B'),
+            [false, false, false, true] => hexadecimal.push('8'),
+            [true, false, false, true] => hexadecimal.push('9'),
+            [false, true, false, true] => hexadecimal.push('A'),
+            [true, true, false, true] => hexadecimal.push('B'),

-			[false, false, true, true] => hexadecimal.push('C'),
-			[true, false, true, true] => hexadecimal.push('D'),
-			[false, true, true, true] => hexadecimal.push('E'),
-			[true, true, true, true] => hexadecimal.push('F'),
-			_ => ()
-		};
-	}
-	return hexadecimal;
+            [false, false, true, true] => hexadecimal.push('C'),
+            [true, false, true, true] => hexadecimal.push('D'),
+            [false, true, true, true] => hexadecimal.push('E'),
+            [true, true, true, true] => hexadecimal.push('F'),
+            _ => ()
+        };
+    }
+    return hexadecimal;
 }

 fn main() {
-	let config = ConfigBuilder::default().build();
-	let (client_key, server_key) = generate_keys(config);
+    let config = ConfigBuilder::default().build();
+    let (client_key, server_key) = generate_keys(config);

-	let key_string = "0053A6F94C9FF24598EB".to_string();
-	let mut key = [false; 80];
+    let key_string = "0053A6F94C9FF24598EB".to_string();
+    let mut key = [false; 80];

-	for i in (0..key_string.len()).step_by(2) {
-		let mut val: u8 = u8::from_str_radix(&key_string[i..i+2], 16).unwrap();
-		for j in 0..8 {
-			key[8*(i>>1) + j] = val % 2 == 1;
-			val >>= 1;
-		}
-	}
+    for i in (0..key_string.len()).step_by(2) {
+        let mut val: u8 = u8::from_str_radix(&key_string[i..i+2], 16).unwrap();
+        for j in 0..8 {
+            key[8*(i>>1) + j] = val % 2 == 1;
+            val >>= 1;
+        }
+    }

-	let iv_string = "0D74DB42A91077DE45AC".to_string();
-	let mut iv = [false; 80];
+    let iv_string = "0D74DB42A91077DE45AC".to_string();
+    let mut iv = [false; 80];

-	for i in (0..iv_string.len()).step_by(2) {
-		let mut val: u8 = u8::from_str_radix(&iv_string[i..i+2], 16).unwrap();
-		for j in 0..8 {
-			iv[8*(i>>1) + j] = val % 2 == 1;
-			val >>= 1;
-		}
-	}
+    for i in (0..iv_string.len()).step_by(2) {
+        let mut val: u8 = u8::from_str_radix(&iv_string[i..i+2], 16).unwrap();
+        for j in 0..8 {
+            iv[8*(i>>1) + j] = val % 2 == 1;
+            val >>= 1;
+        }
+    }

-	let output_0_63    = "F4CD954A717F26A7D6930830C4E7CF0819F80E03F25F342C64ADC66ABA7F8A8E6EAA49F23632AE3CD41A7BD290A0132F81C6D4043B6E397D7388F3A03B5FE358".to_string();
+    let output_0_63    = "F4CD954A717F26A7D6930830C4E7CF0819F80E03F25F342C64ADC66ABA7F8A8E6EAA49F23632AE3CD41A7BD290A0132F81C6D4043B6E397D7388F3A03B5FE358".to_string();

-	let cipher_key = key.map(|x| FheBool::encrypt(x, &client_key));
-	let cipher_iv = iv.map(|x| FheBool::encrypt(x, &client_key));
+    let cipher_key = key.map(|x| FheBool::encrypt(x, &client_key));
+    let cipher_iv = iv.map(|x| FheBool::encrypt(x, &client_key));


-	let mut trivium = TriviumStream::<FheBool>::new(cipher_key, cipher_iv, &server_key);
+    let mut trivium = TriviumStream::<FheBool>::new(cipher_key, cipher_iv, &server_key);

-	let mut vec = Vec::<bool>::with_capacity(64*8);
-	while vec.len() < 64*8 {
-		let cipher_outputs = trivium.next_64();
-		for c in cipher_outputs {
-			vec.push(c.decrypt(&client_key))
-		}
-	}
+    let mut vec = Vec::<bool>::with_capacity(64*8);
+    while vec.len() < 64*8 {
+        let cipher_outputs = trivium.next_64();
+        for c in cipher_outputs {
+            vec.push(c.decrypt(&client_key))
+        }
+    }

-	let hexadecimal = get_hexadecimal_string_from_lsb_first_stream(vec);
-	assert_eq!(output_0_63, hexadecimal[0..64*2]);
+    let hexadecimal = get_hexadecimal_string_from_lsb_first_stream(vec);
+    assert_eq!(output_0_63, hexadecimal[0..64*2]);
 }
 ```
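
Note on the example above: the two sixteen-arm `match` blocks in `get_hexadecimal_string_from_lsb_first_stream` rebuild each byte from eight LSB-first bits and print the high nibble first. A more compact sketch of the same conversion is shown below. The name `hex_from_lsb_first_bits` is hypothetical and not part of the crate; it only illustrates the encoding.

```rust
/// Converts a stream of bits, stored LSB-first within each byte,
/// into an uppercase hexadecimal string (two digits per byte).
fn hex_from_lsb_first_bits(bits: &[bool]) -> String {
    assert!(bits.len() % 8 == 0);
    let mut out = String::with_capacity(bits.len() / 4);
    for chunk in bits.chunks(8) {
        // Bit i of the chunk carries weight 2^i.
        let byte = chunk
            .iter()
            .enumerate()
            .fold(0u8, |acc, (i, &b)| acc | ((b as u8) << i));
        // `{:02X}` prints the high nibble first, like the match-based version.
        out.push_str(&format!("{:02X}", byte));
    }
    out
}
```

Under these assumptions it should produce the same string as the README function for any input whose length is a multiple of 8.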

@@ -129,7 +129,7 @@ Other sizes than 64 bit are expected to be available in the future.

 # FHE shortint Trivium implementation

-The same implementation is also available for generic Ciphertexts representing bits (meant to be used with parameters `V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64`).
+The same implementation is also available for generic Ciphertexts representing bits (meant to be used with parameters `V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128`).
 It uses a lower level API of tfhe-rs, so the syntax is a little bit different. It also implements the `TransCiphering` trait. For optimization purposes, it does not internally run
 on the same cryptographic parameters as the high level API of tfhe-rs. As such, it requires the usage of a casting key, to switch from one parameter space to another, which makes
 its setup a little more intricate.
@@ -137,67 +137,68 @@ its setup a little more intricate.
 Example code:
 ```rust
 use tfhe::shortint::prelude::*;
-use tfhe::shortint::parameters::{
-    V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64,
-    V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64,
+use tfhe::shortint::parameters::v1_0::{
+    V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
 };
 use tfhe::{ConfigBuilder, generate_keys, FheUint64};
 use tfhe::prelude::*;
 use tfhe_trivium::TriviumStreamShortint;

 fn test_shortint() {
-	let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+    let config = ConfigBuilder::default()
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
-	let (hl_client_key, hl_server_key) = generate_keys(config);
+    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

-	let (client_key, server_key): (ClientKey, ServerKey) = gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+    let (client_key, server_key): (ClientKey, ServerKey) = gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);
    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

-	let key_string = "0053A6F94C9FF24598EB".to_string();
-	let mut key = [0; 80];
+    let key_string = "0053A6F94C9FF24598EB".to_string();
+    let mut key = [0; 80];

-	for i in (0..key_string.len()).step_by(2) {
-		let mut val = u64::from_str_radix(&key_string[i..i+2], 16).unwrap();
-		for j in 0..8 {
-			key[8*(i>>1) + j] = val % 2;
-			val >>= 1;
-		}
-	}
+    for i in (0..key_string.len()).step_by(2) {
+        let mut val = u64::from_str_radix(&key_string[i..i+2], 16).unwrap();
+        for j in 0..8 {
+            key[8*(i>>1) + j] = val % 2;
+            val >>= 1;
+        }
+    }

-	let iv_string = "0D74DB42A91077DE45AC".to_string();
-	let mut iv = [0; 80];
+    let iv_string = "0D74DB42A91077DE45AC".to_string();
+    let mut iv = [0; 80];

-	for i in (0..iv_string.len()).step_by(2) {
-		let mut val = u64::from_str_radix(&iv_string[i..i+2], 16).unwrap();
-		for j in 0..8 {
-			iv[8*(i>>1) + j] = val % 2;
-			val >>= 1;
-		}
-	}
-	let output_0_63    = "F4CD954A717F26A7D6930830C4E7CF0819F80E03F25F342C64ADC66ABA7F8A8E6EAA49F23632AE3CD41A7BD290A0132F81C6D4043B6E397D7388F3A03B5FE358".to_string();
+    for i in (0..iv_string.len()).step_by(2) {
+        let mut val = u64::from_str_radix(&iv_string[i..i+2], 16).unwrap();
+        for j in 0..8 {
+            iv[8*(i>>1) + j] = val % 2;
+            val >>= 1;
+        }
+    }
+    let output_0_63    = "F4CD954A717F26A7D6930830C4E7CF0819F80E03F25F342C64ADC66ABA7F8A8E6EAA49F23632AE3CD41A7BD290A0132F81C6D4043B6E397D7388F3A03B5FE358".to_string();

-	let cipher_key = key.map(|x| client_key.encrypt(x));
-	let cipher_iv = iv.map(|x| client_key.encrypt(x));
+    let cipher_key = key.map(|x| client_key.encrypt(x));
+    let cipher_iv = iv.map(|x| client_key.encrypt(x));

-	let mut ciphered_message = vec![FheUint64::try_encrypt(0u64, &hl_client_key).unwrap(); 9];
+    let mut ciphered_message = vec![FheUint64::try_encrypt(0u64, &hl_client_key).unwrap(); 9];

-	let mut trivium = TriviumStreamShortint::new(cipher_key, cipher_iv, &server_key, &ksk);
+    let mut trivium = TriviumStreamShortint::new(cipher_key, cipher_iv, &server_key, &ksk);

-	let mut vec = Vec::<u64>::with_capacity(8);
-	while vec.len() < 8 {
-		let trans_ciphered_message = trivium.trans_encrypt_64(ciphered_message.pop().unwrap(), &hl_server_key);
-		vec.push(trans_ciphered_message.decrypt(&hl_client_key));
-	}
+    let mut vec = Vec::<u64>::with_capacity(8);
+    while vec.len() < 8 {
+        let trans_ciphered_message = trivium.trans_encrypt_64(ciphered_message.pop().unwrap(), &hl_server_key);
+        vec.push(trans_ciphered_message.decrypt(&hl_client_key));
+    }

-	let hexadecimal = get_hexagonal_string_from_u64(vec);
-	assert_eq!(output_0_63, hexadecimal[0..64*2]);
+    let hexadecimal = get_hexagonal_string_from_u64(vec);
+    assert_eq!(output_0_63, hexadecimal[0..64*2]);
 }
 ```
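
Note on the example above: it calls `get_hexagonal_string_from_u64`, which the snippet does not define. Below is a minimal sketch of what such a helper might look like, assuming it renders each decrypted `u64` as 16 uppercase hexadecimal digits and concatenates them in order. The actual helper in the repository may differ, in particular in byte ordering.

```rust
/// Hypothetical helper: concatenates the 16-digit uppercase hexadecimal
/// representation of each value, in the order given.
fn get_hexagonal_string_from_u64(values: Vec<u64>) -> String {
    let mut hexadecimal = String::with_capacity(values.len() * 16);
    for value in values {
        // Zero-padded to 16 digits so every u64 occupies a fixed width.
        hexadecimal.push_str(&format!("{:016X}", value));
    }
    hexadecimal
}
```

With a fixed width per value, eight `u64` outputs yield the 128 characters that the final `assert_eq!` slices with `[0..64*2]`.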

--- a/apps/trivium/benches/kreyvium_shortint.rs
+++ b/apps/trivium/benches/kreyvium_shortint.rs
@@ -1,8 +1,9 @@
 use criterion::Criterion;
 use tfhe::prelude::*;
-use tfhe::shortint::parameters::{
-    V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64,
-    V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64,
+use tfhe::shortint::parameters::v1_0::{
+    V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128,
 };
 use tfhe::shortint::prelude::*;
 use tfhe::{generate_keys, ConfigBuilder, FheUint64};
@@ -10,19 +11,19 @@ use tfhe_trivium::{KreyviumStreamShortint, TransCiphering};

 pub fn kreyvium_shortint_warmup(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
@@ -63,19 +64,19 @@ pub fn kreyvium_shortint_warmup(c: &mut Criterion) {

 pub fn kreyvium_shortint_gen(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
@@ -111,19 +112,19 @@ pub fn kreyvium_shortint_gen(c: &mut Criterion) {

 pub fn kreyvium_shortint_trans(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
--- a/apps/trivium/benches/trivium_shortint.rs
+++ b/apps/trivium/benches/trivium_shortint.rs
@@ -1,8 +1,9 @@
 use criterion::Criterion;
 use tfhe::prelude::*;
-use tfhe::shortint::parameters::{
-    V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64,
-    V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64,
+use tfhe::shortint::parameters::v1_0::{
+    V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128,
 };
 use tfhe::shortint::prelude::*;
 use tfhe::{generate_keys, ConfigBuilder, FheUint64};
@@ -10,19 +11,19 @@ use tfhe_trivium::{TransCiphering, TriviumStreamShortint};

 pub fn trivium_shortint_warmup(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -63,19 +64,19 @@ pub fn trivium_shortint_warmup(c: &mut Criterion) {

 pub fn trivium_shortint_gen(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -111,19 +112,19 @@ pub fn trivium_shortint_gen(c: &mut Criterion) {

 pub fn trivium_shortint_trans(c: &mut Criterion) {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB".to_string();
--- a/apps/trivium/src/kreyvium/test.rs
+++ b/apps/trivium/src/kreyvium/test.rs
@@ -1,8 +1,9 @@
 use crate::{KreyviumStream, KreyviumStreamByte, KreyviumStreamShortint, TransCiphering};
 use tfhe::prelude::*;
-use tfhe::shortint::parameters::{
-    V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64,
-    V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64,
+use tfhe::shortint::parameters::v1_0::{
+    V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128,
 };
 use tfhe::{generate_keys, ConfigBuilder, FheBool, FheUint64, FheUint8};
 // Values for these tests come from the github repo renaud1239/Kreyvium,
@@ -220,19 +221,19 @@ use tfhe::shortint::prelude::*;
 #[test]
 fn kreyvium_test_shortint_long() {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
--- a/apps/trivium/src/trans_ciphering/mod.rs
+++ b/apps/trivium/src/trans_ciphering/mod.rs
@@ -48,6 +48,8 @@ fn transcipher_from_1_1_stream(
 ) -> FheUint64 {
    assert_eq!(stream.len(), 64);

+    let id_lut = internal_server_key.generate_lookup_table(|x| x);
+
    let pairs = (0..32)
        .into_par_iter()
        .map(|i| {
@@ -57,10 +59,11 @@ fn transcipher_from_1_1_stream(
            let b0 = &stream[8 * byte_idx + 2 * pair_idx];
            let b1 = &stream[8 * byte_idx + 2 * pair_idx + 1];

-            casting_key.cast(
-                &internal_server_key
-                    .unchecked_add(b0, &internal_server_key.unchecked_scalar_mul(b1, 2)),
-            )
+            let mut combined = internal_server_key
+                .unchecked_add(b0, &internal_server_key.unchecked_scalar_mul(b1, 2));
+            internal_server_key.apply_lookup_table_assign(&mut combined, &id_lut);
+
+            casting_key.cast(&combined)
        })
        .collect::<Vec<_>>();
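
Note on the hunk above: each pair of 1-bit ciphertexts is combined as `b0 + 2 * b1` into a single 2-bit value, and the new identity lookup table is applied to that value before the cast. The lookup leaves the value unchanged and presumably serves to refresh the noise. A plaintext sketch of the packing step follows. The function names are hypothetical, and the pairs are walked in stream order, which is an assumption because the derivation of `byte_idx` and `pair_idx` is not shown in this hunk.

```rust
/// Plaintext model of the pairing step: two 1-bit values become one 2-bit value.
fn pack_pair(b0: u8, b1: u8) -> u8 {
    debug_assert!(b0 <= 1 && b1 <= 1);
    b0 + 2 * b1
}

/// Packs a 64-element stream (one bit per element) into 32 two-bit values,
/// taking consecutive elements as (b0, b1). The ordering is an assumption.
fn pack_stream(stream: &[u8; 64]) -> Vec<u8> {
    stream
        .chunks(2)
        .map(|pair| pack_pair(pair[0], pair[1]))
        .collect()
}
```

Each packed value lies in `0..=3`, which fits the 2-bit message space of the `MESSAGE_2_CARRY_2` parameters targeted by the cast.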

--- a/apps/trivium/src/trivium/test.rs
+++ b/apps/trivium/src/trivium/test.rs
@@ -1,8 +1,9 @@
 use crate::{TransCiphering, TriviumStream, TriviumStreamByte, TriviumStreamShortint};
 use tfhe::prelude::*;
-use tfhe::shortint::parameters::{
-    V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64,
-    V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64,
+use tfhe::shortint::parameters::v1_0::{
+    V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128,
+    V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128,
 };
 use tfhe::{generate_keys, ConfigBuilder, FheBool, FheUint64, FheUint8};
 // Values for these tests come from the github repo cantora/avr-crypto-lib, commit 2a5b018,
@@ -356,19 +357,19 @@ use tfhe::shortint::prelude::*;
 #[test]
 fn trivium_test_shortint_long() {
    let config = ConfigBuilder::default()
-        .use_custom_parameters(V0_11_PARAM_MESSAGE_2_CARRY_2_PBS_KS_GAUSSIAN_2M64)
+        .use_custom_parameters(V1_0_PARAM_MESSAGE_2_CARRY_2_KS_PBS_GAUSSIAN_2M128)
        .build();
    let (hl_client_key, hl_server_key) = generate_keys(config);
    let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
    let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

    let (client_key, server_key): (ClientKey, ServerKey) =
-        gen_keys(V0_11_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M64);
+        gen_keys(V1_0_PARAM_MESSAGE_1_CARRY_1_KS_PBS_GAUSSIAN_2M128);

    let ksk = KeySwitchingKey::new(
        (&client_key, Some(&server_key)),
        (&underlying_ck, &underlying_sk),
-        V0_11_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS,
+        V1_0_PARAM_KEYSWITCH_1_1_KS_PBS_TO_2_2_KS_PBS_GAUSSIAN_2M128,
    );

    let key_string = "0053A6F94C9FF24598EB".to_string();
--- a/backends/tfhe-cuda-backend/Cargo.toml
+++ b/backends/tfhe-cuda-backend/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tfhe-cuda-backend"
-version = "0.7.0"
+version = "0.8.0"
 edition = "2021"
 authors = ["Zama team"]
 license = "BSD-3-Clause-Clear"
--- a/backends/tfhe-cuda-backend/LICENSE
+++ b/backends/tfhe-cuda-backend/LICENSE
@@ -1,6 +1,6 @@
 BSD 3-Clause Clear License

-Copyright © 2024 ZAMA.
+Copyright © 2025 ZAMA.
 All rights reserved.

 Redistribution and use in source and binary forms, with or without modification,
--- a/backends/tfhe-cuda-backend/README.md
+++ b/backends/tfhe-cuda-backend/README.md
@@ -13,24 +13,26 @@ and forth between the CPU and the GPU, to create and destroy Cuda streams, etc.:
 - `cuda_get_number_of_gpus`
 - `cuda_synchronize_device`
 The cryptographic operations it provides are:
- an amortized implementation of the TFHE programmable bootstrap: `cuda_bootstrap_amortized_lwe_ciphertext_vector_32` and `cuda_bootstrap_amortized_lwe_ciphertext_vector_64`
- a low latency implementation of the TFHE programmable bootstrap: `cuda_bootstrap_low latency_lwe_ciphertext_vector_32` and `cuda_bootstrap_low_latency_lwe_ciphertext_vector_64`
- the keyswitch: `cuda_keyswitch_lwe_ciphertext_vector_32` and `cuda_keyswitch_lwe_ciphertext_vector_64`
- the larger precision programmable bootstrap (wop PBS, which supports up to 16 bits of message while the classical PBS only supports up to 8 bits of message) and its sub-components: `cuda_wop_pbs_64`, `cuda_extract_bits_64`, `cuda_circuit_bootstrap_64`, `cuda_cmux_tree_64`, `cuda_blind_rotation_sample_extraction_64`
- acceleration for leveled operations: `cuda_negate_lwe_ciphertext_vector_64`, `cuda_add_lwe_ciphertext_vector_64`, `cuda_add_lwe_ciphertext_vector_plaintext_vector_64`, `cuda_mult_lwe_ciphertext_vector_cleartext_vector`.
+- an implementation of the classical TFHE programmable bootstrap,
+- an implementation of the multi-bit TFHE programmable bootstrap,
+- the keyswitch,
+- acceleration for leveled operations,
+- acceleration for arithmetic over encrypted integers of arbitrary size,
+- acceleration for integer compression/decompression.

 ## Dependencies

 **Disclaimer**: Compilation on Windows/Mac is not supported yet. Only Nvidia GPUs are supported. 

- nvidia driver - for example, if you're running Ubuntu 20.04 check this [page](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-20-04-focal-fossa-linux) for installation
+- nvidia driver - for example, if you're running Ubuntu 20.04 check this [page](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-20-04-focal-fossa-linux) for installation. You need an Nvidia GPU with Compute Capability >= 3.0
 - [nvcc](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) >= 10.0
 - [gcc](https://gcc.gnu.org/) >= 8.0 - check this [page](https://gist.github.com/ax3l/9489132) for more details about nvcc/gcc compatible versions
 - [cmake](https://cmake.org/) >= 3.24
+- libclang >= 9.0, to match Rust bindgen [requirements](https://rust-lang.github.io/rust-bindgen/requirements.html)

 ## Build

-The Cuda project held in `tfhe-cuda-backend` can be compiled independently from TFHE-rs in the following way:
+The Cuda project held in `tfhe-cuda-backend` can be compiled independently of TFHE-rs in the following way:
 ```
 git clone git@github.com:zama-ai/tfhe-rs
 cd backends/tfhe-cuda-backend/cuda
--- a/backends/tfhe-cuda-backend/build.rs
+++ b/backends/tfhe-cuda-backend/build.rs
@@ -62,6 +62,7 @@ fn main() {
            "cuda/include/integer/integer.h",
            "cuda/include/keyswitch.h",
            "cuda/include/linear_algebra.h",
+            "cuda/include/fft/fft128.h",
            "cuda/include/pbs/programmable_bootstrap.h",
            "cuda/include/pbs/programmable_bootstrap_multibit.h",
        ];
--- a/backends/tfhe-cuda-backend/cuda/include/ciphertext.h
+++ b/backends/tfhe-cuda-backend/cuda/include/ciphertext.h
@@ -18,7 +18,7 @@ void cuda_convert_lwe_ciphertext_vector_to_cpu_64(void *stream,
 void cuda_glwe_sample_extract_64(void *stream, uint32_t gpu_index,
                                 void *lwe_array_out, void const *glwe_array_in,
                                 uint32_t const *nth_array, uint32_t num_nths,
-                                 uint32_t glwe_dimension,
+                                 uint32_t lwe_per_glwe, uint32_t glwe_dimension,
                                 uint32_t polynomial_size);
 }
 #endif
--- a/backends/tfhe-cuda-backend/cuda/include/device.h
+++ b/backends/tfhe-cuda-backend/cuda/include/device.h
@@ -52,7 +52,7 @@ void *cuda_malloc_async(uint64_t size, cudaStream_t stream, uint32_t gpu_index);

 void cuda_check_valid_malloc(uint64_t size, uint32_t gpu_index);

-void cuda_memcpy_async_to_gpu(void *dest, void *src, uint64_t size,
+void cuda_memcpy_async_to_gpu(void *dest, const void *src, uint64_t size,
                              cudaStream_t stream, uint32_t gpu_index);

 void cuda_memcpy_async_gpu_to_gpu(void *dest, void const *src, uint64_t size,
--- a/backends/tfhe-cuda-backend/cuda/include/fft/fft128.h
+++ b/backends/tfhe-cuda-backend/cuda/include/fft/fft128.h
@@ -0,0 +1,17 @@
+#include <stdint.h>
+extern "C" {
+void cuda_fourier_transform_forward_as_torus_f128_async(
+    void *stream, uint32_t gpu_index, void *re0, void *re1, void *im0,
+    void *im1, void const *standard, uint32_t const N,
+    const uint32_t number_of_samples);
+
+void cuda_fourier_transform_forward_as_integer_f128_async(
+    void *stream, uint32_t gpu_index, void *re0, void *re1, void *im0,
+    void *im1, void const *standard, uint32_t const N,
+    const uint32_t number_of_samples);
+
+void cuda_fourier_transform_backward_as_torus_f128_async(
+    void *stream, uint32_t gpu_index, void *standard, void const *re0,
+    void const *re1, void const *im0, void const *im1, uint32_t const N,
+    const uint32_t number_of_samples);
+}
--- a/backends/tfhe-cuda-backend/cuda/include/integer/compression/compression_utilities.h
+++ b/backends/tfhe-cuda-backend/cuda/include/integer/compression/compression_utilities.h
@@ -102,9 +102,7 @@ template <typename Torus> struct int_decompression {
      // Example: in the 2_2 case we are mapping a 2 bits message onto a 4 bits
      // space, we want to keep the original 2 bits value in the 4 bits space,
      // so we apply the identity and the encoding will rescale it for us.
-      auto decompression_rescale_f = [encryption_params](Torus x) -> Torus {
-        return x;
-      };
+      auto decompression_rescale_f = [](Torus x) -> Torus { return x; };

      auto effective_compression_message_modulus =
          encryption_params.carry_modulus;
--- a/backends/tfhe-cuda-backend/cuda/include/integer/integer.h
+++ b/backends/tfhe-cuda-backend/cuda/include/integer/integer.h
@@ -132,10 +132,11 @@ void scratch_cuda_integer_mult_radix_ciphertext_kb_64(

 void cuda_integer_mult_radix_ciphertext_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *radix_lwe_out, void const *radix_lwe_left, bool const is_bool_left,
-    void const *radix_lwe_right, bool const is_bool_right, void *const *bsks,
-    void *const *ksks, int8_t *mem_ptr, uint32_t polynomial_size,
-    uint32_t num_blocks);
+    CudaRadixCiphertextFFI *radix_lwe_out,
+    CudaRadixCiphertextFFI const *radix_lwe_left, bool const is_bool_left,
+    CudaRadixCiphertextFFI const *radix_lwe_right, bool const is_bool_right,
+    void *const *bsks, void *const *ksks, int8_t *mem_ptr,
+    uint32_t polynomial_size, uint32_t num_blocks);

 void cleanup_cuda_integer_mult(void *const *streams,
                               uint32_t const *gpu_indexes, uint32_t gpu_count,
@@ -145,7 +146,7 @@ void cuda_negate_integer_radix_ciphertext_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
    CudaRadixCiphertextFFI *lwe_array_out,
    CudaRadixCiphertextFFI const *lwe_array_in, uint32_t message_modulus,
-    uint32_t carry_modulus);
+    uint32_t carry_modulus, uint32_t num_radix_blocks);

 void cuda_scalar_addition_integer_radix_ciphertext_64_inplace(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
@@ -177,8 +178,8 @@ void scratch_cuda_integer_radix_arithmetic_scalar_shift_kb_64(

 void cuda_integer_radix_arithmetic_scalar_shift_kb_64_inplace(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array, uint32_t shift, int8_t *mem_ptr, void *const *bsks,
-    void *const *ksks, uint32_t num_blocks);
+    CudaRadixCiphertextFFI *lwe_array, uint32_t shift, int8_t *mem_ptr,
+    void *const *bsks, void *const *ksks);

 void cleanup_cuda_integer_radix_logical_scalar_shift(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
@@ -218,15 +219,17 @@ void scratch_cuda_integer_radix_comparison_kb_64(

 void cuda_comparison_integer_radix_ciphertext_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_1, void const *lwe_array_2,
-    int8_t *mem_ptr, void *const *bsks, void *const *ksks,
-    uint32_t lwe_ciphertext_count);
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_1,
+    CudaRadixCiphertextFFI const *lwe_array_2, int8_t *mem_ptr,
+    void *const *bsks, void *const *ksks);

 void cuda_scalar_comparison_integer_radix_ciphertext_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_in, void const *scalar_blocks,
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in, void const *scalar_blocks,
    int8_t *mem_ptr, void *const *bsks, void *const *ksks,
-    uint32_t lwe_ciphertext_count, uint32_t num_scalar_blocks);
+    uint32_t num_scalar_blocks);

 void cleanup_cuda_integer_comparison(void *const *streams,
                                     uint32_t const *gpu_indexes,
@@ -351,9 +354,10 @@ void scratch_cuda_integer_overflowing_sub_kb_64_inplace(

 void cuda_integer_overflowing_sub_kb_64_inplace(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lhs_array, const void *rhs_array, void *overflow_block,
-    const void *input_borrow, int8_t *mem_ptr, void *const *bsks,
-    void *const *ksks, uint32_t num_blocks, uint32_t compute_overflow,
+    CudaRadixCiphertextFFI *lhs_array, const CudaRadixCiphertextFFI *rhs_array,
+    CudaRadixCiphertextFFI *overflow_block,
+    const CudaRadixCiphertextFFI *input_borrow, int8_t *mem_ptr,
+    void *const *bsks, void *const *ksks, uint32_t compute_overflow,
    uint32_t uses_input_borrow);

 void cleanup_cuda_integer_overflowing_sub(void *const *streams,
@@ -372,9 +376,9 @@ void scratch_cuda_integer_radix_partial_sum_ciphertexts_vec_kb_64(

 void cuda_integer_radix_partial_sum_ciphertexts_vec_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *radix_lwe_out, void *radix_lwe_vec, uint32_t num_radix_in_vec,
-    int8_t *mem_ptr, void *const *bsks, void *const *ksks,
-    uint32_t num_blocks_in_radix);
+    CudaRadixCiphertextFFI *radix_lwe_out,
+    CudaRadixCiphertextFFI *radix_lwe_vec, int8_t *mem_ptr, void *const *bsks,
+    void *const *ksks);

 void cleanup_cuda_integer_radix_partial_sum_ciphertexts_vec(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
@@ -390,10 +394,10 @@ void scratch_cuda_integer_scalar_mul_kb_64(

 void cuda_scalar_multiplication_integer_radix_ciphertext_64_inplace(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array, uint64_t const *decomposed_scalar,
+    CudaRadixCiphertextFFI *lwe_array, uint64_t const *decomposed_scalar,
    uint64_t const *has_at_least_one_set, int8_t *mem_ptr, void *const *bsks,
-    void *const *ksks, uint32_t lwe_dimension, uint32_t polynomial_size,
-    uint32_t message_modulus, uint32_t num_blocks, uint32_t num_scalars);
+    void *const *ksks, uint32_t polynomial_size, uint32_t message_modulus,
+    uint32_t num_scalars);

 void cleanup_cuda_integer_radix_scalar_mul(void *const *streams,
                                           uint32_t const *gpu_indexes,
@@ -473,7 +477,8 @@ void scratch_cuda_integer_are_all_comparisons_block_true_kb_64(

 void cuda_integer_are_all_comparisons_block_true_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_in, int8_t *mem_ptr,
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in, int8_t *mem_ptr,
    void *const *bsks, void *const *ksks, uint32_t num_radix_blocks);

 void cleanup_cuda_integer_are_all_comparisons_block_true(
@@ -491,7 +496,8 @@ void scratch_cuda_integer_is_at_least_one_comparisons_block_true_kb_64(

 void cuda_integer_is_at_least_one_comparisons_block_true_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_in, int8_t *mem_ptr,
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in, int8_t *mem_ptr,
    void *const *bsks, void *const *ksks, uint32_t num_radix_blocks);

 void cleanup_cuda_integer_is_at_least_one_comparisons_block_true(
--- a/backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h
+++ b/backends/tfhe-cuda-backend/cuda/include/integer/integer_utilities.h
@@ -6,7 +6,6 @@
 #include "integer/radix_ciphertext.h"
 #include "keyswitch.h"
 #include "pbs/programmable_bootstrap.cuh"
-#include <cassert>
 #include <cmath>
 #include <functional>

@@ -22,7 +21,7 @@ template <typename Torus>
 __global__ void radix_blocks_rotate_right(Torus *dst, Torus *src,
                                          uint32_t value, uint32_t blocks_count,
                                          uint32_t lwe_size);
-void generate_ids_update_degrees(int *terms_degree, size_t *h_lwe_idx_in,
+void generate_ids_update_degrees(uint64_t *terms_degree, size_t *h_lwe_idx_in,
                                 size_t *h_lwe_idx_out,
                                 int32_t *h_smart_copy_in,
                                 int32_t *h_smart_copy_out, size_t ch_amount,
@@ -160,7 +159,7 @@ template <typename Torus> struct int_radix_lut {
  // lwe_trivial_indexes is the intermediary index we need in case
  // lwe_indexes_in != lwe_indexes_out
  Torus *lwe_trivial_indexes;
-  Torus *tmp_lwe_before_ks;
+  CudaRadixCiphertextFFI *tmp_lwe_before_ks;

  /// For multi GPU execution we create vectors of pointers for inputs and
  /// outputs
@@ -270,12 +269,10 @@ template <typename Torus> struct int_radix_lut {
                                 num_radix_blocks);

      // Keyswitch
-      Torus big_size =
-          (params.big_lwe_dimension + 1) * num_radix_blocks * sizeof(Torus);
-      Torus small_size =
-          (params.small_lwe_dimension + 1) * num_radix_blocks * sizeof(Torus);
-      tmp_lwe_before_ks =
-          (Torus *)cuda_malloc_async(big_size, streams[0], gpu_indexes[0]);
+      tmp_lwe_before_ks = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_lwe_before_ks, num_radix_blocks,
+          params.big_lwe_dimension);
    }
    degrees = (uint64_t *)malloc(num_luts * sizeof(uint64_t));
    max_degrees = (uint64_t *)malloc(num_luts * sizeof(uint64_t));
@@ -297,7 +294,8 @@ template <typename Torus> struct int_radix_lut {
    std::memcpy(gpu_indexes, input_gpu_indexes, gpu_count * sizeof(uint32_t));

    // base lut object should have bigger or equal memory than current one
-    assert(num_radix_blocks <= base_lut_object->num_blocks);
+    if (num_radix_blocks > base_lut_object->num_blocks)
+      PANIC("Cuda error: lut does not have enough blocks")
    // pbs
    buffer = base_lut_object->buffer;
    // Keyswitch
@@ -465,12 +463,10 @@ template <typename Torus> struct int_radix_lut {
                                 num_radix_blocks);

      // Keyswitch
-      Torus big_size =
-          (params.big_lwe_dimension + 1) * num_radix_blocks * sizeof(Torus);
-      Torus small_size =
-          (params.small_lwe_dimension + 1) * num_radix_blocks * sizeof(Torus);
-      tmp_lwe_before_ks =
-          (Torus *)cuda_malloc_async(big_size, streams[0], gpu_indexes[0]);
+      tmp_lwe_before_ks = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_lwe_before_ks, num_radix_blocks,
+          params.big_lwe_dimension);
    }
    degrees = (uint64_t *)malloc(num_many_lut * num_luts * sizeof(uint64_t));
    max_degrees = (uint64_t *)malloc(num_luts * sizeof(uint64_t));
@@ -481,7 +477,8 @@ template <typename Torus> struct int_radix_lut {
    auto lut = lut_vec[gpu_index];
    size_t lut_size = (params.glwe_dimension + 1) * params.polynomial_size;

-    assert(lut != nullptr);
+    if (lut == nullptr)
+      PANIC("Cuda error: invalid lut pointer")
    return &lut[idx * lut_size];
  }

@@ -555,7 +552,8 @@ template <typename Torus> struct int_radix_lut {
    free(h_lwe_indexes_out);

    if (!mem_reuse) {
-      cuda_drop_async(tmp_lwe_before_ks, streams[0], gpu_indexes[0]);
+      release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_lwe_before_ks);
+      delete tmp_lwe_before_ks;
      cuda_synchronize_stream(streams[0], gpu_indexes[0]);
      for (int i = 0; i < buffer.size(); i++) {
        switch (params.pbs_type) {
@@ -1163,10 +1161,10 @@ template <typename Torus> struct int_overflowing_sub_memory {
 };

 template <typename Torus> struct int_sum_ciphertexts_vec_memory {
-  Torus *new_blocks;
-  Torus *new_blocks_copy;
-  Torus *old_blocks;
-  Torus *small_lwe_vector;
+  CudaRadixCiphertextFFI *new_blocks;
+  CudaRadixCiphertextFFI *new_blocks_copy;
+  CudaRadixCiphertextFFI *old_blocks;
+  CudaRadixCiphertextFFI *small_lwe_vector;
  int_radix_params params;

  int32_t *d_smart_copy_in;
@@ -1185,34 +1183,22 @@ template <typename Torus> struct int_sum_ciphertexts_vec_memory {
    int max_pbs_count = num_blocks_in_radix * max_num_radix_in_vec;

    // allocate gpu memory for intermediate buffers
-    new_blocks = (Torus *)cuda_malloc_async(
-        max_pbs_count * (params.big_lwe_dimension + 1) * sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    new_blocks_copy = (Torus *)cuda_malloc_async(
-        max_pbs_count * (params.big_lwe_dimension + 1) * sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    old_blocks = (Torus *)cuda_malloc_async(
-        max_pbs_count * (params.big_lwe_dimension + 1) * sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    small_lwe_vector = (Torus *)cuda_malloc_async(
-        max_pbs_count * (params.small_lwe_dimension + 1) * sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    cuda_memset_async(new_blocks, 0,
-                      max_pbs_count * (params.big_lwe_dimension + 1) *
-                          sizeof(Torus),
-                      streams[0], gpu_indexes[0]);
-    cuda_memset_async(new_blocks_copy, 0,
-                      max_pbs_count * (params.big_lwe_dimension + 1) *
-                          sizeof(Torus),
-                      streams[0], gpu_indexes[0]);
-    cuda_memset_async(old_blocks, 0,
-                      max_pbs_count * (params.big_lwe_dimension + 1) *
-                          sizeof(Torus),
-                      streams[0], gpu_indexes[0]);
-    cuda_memset_async(small_lwe_vector, 0,
-                      max_pbs_count * (params.small_lwe_dimension + 1) *
-                          sizeof(Torus),
-                      streams[0], gpu_indexes[0]);
+    new_blocks = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              new_blocks, max_pbs_count,
+                                              params.big_lwe_dimension);
+    new_blocks_copy = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              new_blocks_copy, max_pbs_count,
+                                              params.big_lwe_dimension);
+    old_blocks = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              old_blocks, max_pbs_count,
+                                              params.big_lwe_dimension);
+    small_lwe_vector = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              small_lwe_vector, max_pbs_count,
+                                              params.small_lwe_dimension);

    d_smart_copy_in = (int32_t *)cuda_malloc_async(
        max_pbs_count * sizeof(int32_t), streams[0], gpu_indexes[0]);
@@ -1229,8 +1215,9 @@ template <typename Torus> struct int_sum_ciphertexts_vec_memory {
                                 uint32_t gpu_count, int_radix_params params,
                                 uint32_t num_blocks_in_radix,
                                 uint32_t max_num_radix_in_vec,
-                                 Torus *new_blocks, Torus *old_blocks,
-                                 Torus *small_lwe_vector) {
+                                 CudaRadixCiphertextFFI *new_blocks,
+                                 CudaRadixCiphertextFFI *old_blocks,
+                                 CudaRadixCiphertextFFI *small_lwe_vector) {
    mem_reuse = true;
    this->params = params;

@@ -1240,13 +1227,10 @@ template <typename Torus> struct int_sum_ciphertexts_vec_memory {
    this->new_blocks = new_blocks;
    this->old_blocks = old_blocks;
    this->small_lwe_vector = small_lwe_vector;
-    new_blocks_copy = (Torus *)cuda_malloc_async(
-        max_pbs_count * (params.big_lwe_dimension + 1) * sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    cuda_memset_async(new_blocks_copy, 0,
-                      max_pbs_count * (params.big_lwe_dimension + 1) *
-                          sizeof(Torus),
-                      streams[0], gpu_indexes[0]);
+    new_blocks_copy = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              new_blocks_copy, max_pbs_count,
+                                              params.big_lwe_dimension);

    d_smart_copy_in = (int32_t *)cuda_malloc_async(
        max_pbs_count * sizeof(int32_t), streams[0], gpu_indexes[0]);
@@ -1264,12 +1248,15 @@ template <typename Torus> struct int_sum_ciphertexts_vec_memory {
    cuda_drop_async(d_smart_copy_out, streams[0], gpu_indexes[0]);

    if (!mem_reuse) {
-      cuda_drop_async(new_blocks, streams[0], gpu_indexes[0]);
-      cuda_drop_async(old_blocks, streams[0], gpu_indexes[0]);
-      cuda_drop_async(small_lwe_vector, streams[0], gpu_indexes[0]);
+      release_radix_ciphertext(streams[0], gpu_indexes[0], new_blocks);
+      delete new_blocks;
+      release_radix_ciphertext(streams[0], gpu_indexes[0], old_blocks);
+      delete old_blocks;
+      release_radix_ciphertext(streams[0], gpu_indexes[0], small_lwe_vector);
+      delete small_lwe_vector;
    }
-
-    cuda_drop_async(new_blocks_copy, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], new_blocks_copy);
+    delete new_blocks_copy;
  }
 };
 // For sequential algorithm in group propagation
@@ -2121,9 +2108,9 @@ template <typename Torus> struct int_sc_prop_memory {
 };

 template <typename Torus> struct int_shifted_blocks_and_borrow_states_memory {
-  Torus *shifted_blocks_and_borrow_states;
-  Torus *shifted_blocks;
-  Torus *borrow_states;
+  CudaRadixCiphertextFFI *shifted_blocks_and_borrow_states;
+  CudaRadixCiphertextFFI *shifted_blocks;
+  CudaRadixCiphertextFFI *borrow_states;

  int_radix_lut<Torus> *luts_array_first_step;

@@ -2136,23 +2123,19 @@ template <typename Torus> struct int_shifted_blocks_and_borrow_states_memory {
    auto polynomial_size = params.polynomial_size;
    auto message_modulus = params.message_modulus;
    auto carry_modulus = params.carry_modulus;
-    auto big_lwe_size = (polynomial_size * glwe_dimension + 1);
-    auto big_lwe_size_bytes = big_lwe_size * sizeof(Torus);

-    shifted_blocks_and_borrow_states = (Torus *)cuda_malloc_async(
-        num_many_lut * num_radix_blocks * big_lwe_size_bytes, streams[0],
-        gpu_indexes[0]);
-    cuda_memset_async(shifted_blocks_and_borrow_states, 0,
-                      num_many_lut * num_radix_blocks * big_lwe_size_bytes,
-                      streams[0], gpu_indexes[0]);
-    shifted_blocks = (Torus *)cuda_malloc_async(
-        num_radix_blocks * big_lwe_size_bytes, streams[0], gpu_indexes[0]);
-    cuda_memset_async(shifted_blocks, 0, num_radix_blocks * big_lwe_size_bytes,
-                      streams[0], gpu_indexes[0]);
-    borrow_states = (Torus *)cuda_malloc_async(
-        num_radix_blocks * big_lwe_size_bytes, streams[0], gpu_indexes[0]);
-    cuda_memset_async(borrow_states, 0, num_radix_blocks * big_lwe_size_bytes,
-                      streams[0], gpu_indexes[0]);
+    shifted_blocks_and_borrow_states = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(
+        streams[0], gpu_indexes[0], shifted_blocks_and_borrow_states,
+        num_radix_blocks * num_many_lut, params.big_lwe_dimension);
+    shifted_blocks = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              shifted_blocks, num_radix_blocks,
+                                              params.big_lwe_dimension);
+    borrow_states = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              borrow_states, num_radix_blocks,
+                                              params.big_lwe_dimension);

    uint32_t num_luts_first_step = 2 * grouping_size + 1;

@@ -2302,10 +2285,13 @@ template <typename Torus> struct int_shifted_blocks_and_borrow_states_memory {
  void release(cudaStream_t const *streams, uint32_t const *gpu_indexes,
               uint32_t gpu_count) {

-    cuda_drop_async(shifted_blocks_and_borrow_states, streams[0],
-                    gpu_indexes[0]);
-    cuda_drop_async(shifted_blocks, streams[0], gpu_indexes[0]);
-    cuda_drop_async(borrow_states, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0],
+                             shifted_blocks_and_borrow_states);
+    delete shifted_blocks_and_borrow_states;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], shifted_blocks);
+    delete shifted_blocks;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], borrow_states);
+    delete borrow_states;

    luts_array_first_step->release(streams, gpu_indexes, gpu_count);
    delete luts_array_first_step;
@@ -2318,7 +2304,7 @@ template <typename Torus> struct int_borrow_prop_memory {

  uint32_t group_size;
  uint32_t num_groups;
-  Torus *overflow_block;
+  CudaRadixCiphertextFFI *overflow_block;

  int_radix_lut<Torus> *lut_message_extract;
  int_radix_lut<Torus> *lut_borrow_flag;
@@ -2348,8 +2334,6 @@ template <typename Torus> struct int_borrow_prop_memory {
    auto polynomial_size = params.polynomial_size;
    auto message_modulus = params.message_modulus;
    auto carry_modulus = params.carry_modulus;
-    auto big_lwe_size = (polynomial_size * glwe_dimension + 1);
-    auto big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
    compute_overflow = compute_overflow_in;
    // for compute shifted blocks and block states
    uint32_t block_modulus = message_modulus * carry_modulus;
@@ -2371,10 +2355,10 @@ template <typename Torus> struct int_borrow_prop_memory {
        streams, gpu_indexes, gpu_count, params, num_radix_blocks,
        grouping_size, num_groups, true);

-    overflow_block = (Torus *)cuda_malloc_async(big_lwe_size_bytes, streams[0],
-                                                gpu_indexes[0]);
-    cuda_memset_async(overflow_block, 0, big_lwe_size_bytes, streams[0],
-                      gpu_indexes[0]);
+    overflow_block = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                              overflow_block, 1,
+                                              params.big_lwe_dimension);

    lut_message_extract =
        new int_radix_lut<Torus>(streams, gpu_indexes, gpu_count, params, 1,
@@ -2450,7 +2434,8 @@ template <typename Torus> struct int_borrow_prop_memory {

    shifted_blocks_borrow_state_mem->release(streams, gpu_indexes, gpu_count);
    prop_simu_group_carries_mem->release(streams, gpu_indexes, gpu_count);
-    cuda_drop_async(overflow_block, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], overflow_block);
+    delete overflow_block;

    lut_message_extract->release(streams, gpu_indexes, gpu_count);
    delete lut_message_extract;
@@ -2486,7 +2471,7 @@ template <typename Torus> struct int_zero_out_if_buffer {

  int_radix_params params;

-  Torus *tmp;
+  CudaRadixCiphertextFFI *tmp;

  cudaStream_t *true_streams;
  cudaStream_t *false_streams;
@@ -2499,10 +2484,11 @@ template <typename Torus> struct int_zero_out_if_buffer {
    this->params = params;
    active_gpu_count = get_active_gpu_count(num_radix_blocks, gpu_count);

-    Torus big_size =
-        (params.big_lwe_dimension + 1) * num_radix_blocks * sizeof(Torus);
    if (allocate_gpu_memory) {
-      tmp = (Torus *)cuda_malloc_async(big_size, streams[0], gpu_indexes[0]);
+      tmp = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0], tmp,
+                                                num_radix_blocks,
+                                                params.big_lwe_dimension);
      // We may use a different stream to allow concurrent operation
      true_streams =
          (cudaStream_t *)malloc(active_gpu_count * sizeof(cudaStream_t));
@@ -2516,7 +2502,8 @@ template <typename Torus> struct int_zero_out_if_buffer {
  }
  void release(cudaStream_t const *streams, uint32_t const *gpu_indexes,
               uint32_t gpu_count) {
-    cuda_drop_async(tmp, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp);
+    delete tmp;
    for (uint j = 0; j < active_gpu_count; j++) {
      cuda_destroy_stream(true_streams[j], gpu_indexes[j]);
      cuda_destroy_stream(false_streams[j], gpu_indexes[j]);
@@ -2527,9 +2514,9 @@ template <typename Torus> struct int_zero_out_if_buffer {
 };

 template <typename Torus> struct int_mul_memory {
-  Torus *vector_result_sb;
-  Torus *block_mul_res;
-  Torus *small_lwe_vector;
+  CudaRadixCiphertextFFI *vector_result_sb;
+  CudaRadixCiphertextFFI *block_mul_res;
+  CudaRadixCiphertextFFI *small_lwe_vector;

  int_radix_lut<Torus> *luts_array; // lsb msb
  int_radix_lut<Torus> *zero_out_predicate_lut;
@@ -2578,7 +2565,6 @@ template <typename Torus> struct int_mul_memory {
    auto polynomial_size = params.polynomial_size;
    auto message_modulus = params.message_modulus;
    auto carry_modulus = params.carry_modulus;
-    auto lwe_dimension = params.small_lwe_dimension;

    // 'vector_result_lsb' contains blocks from all possible shifts of
    // radix_lwe_left excluding zero ciphertext blocks
@@ -2591,17 +2577,18 @@ template <typename Torus> struct int_mul_memory {
    int total_block_count = lsb_vector_block_count + msb_vector_block_count;

    // allocate memory for intermediate buffers
-    vector_result_sb = (Torus *)cuda_malloc_async(
-        2 * total_block_count * (polynomial_size * glwe_dimension + 1) *
-            sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    block_mul_res = (Torus *)cuda_malloc_async(
-        2 * total_block_count * (polynomial_size * glwe_dimension + 1) *
-            sizeof(Torus),
-        streams[0], gpu_indexes[0]);
-    small_lwe_vector = (Torus *)cuda_malloc_async(
-        total_block_count * (lwe_dimension + 1) * sizeof(Torus), streams[0],
-        gpu_indexes[0]);
+    vector_result_sb = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(
+        streams[0], gpu_indexes[0], vector_result_sb, 2 * total_block_count,
+        params.big_lwe_dimension);
+    block_mul_res = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(
+        streams[0], gpu_indexes[0], block_mul_res, 2 * total_block_count,
+        params.big_lwe_dimension);
+    small_lwe_vector = new CudaRadixCiphertextFFI;
+    create_zero_radix_ciphertext_async<Torus>(
+        streams[0], gpu_indexes[0], small_lwe_vector, total_block_count,
+        params.small_lwe_dimension);

    // create int_radix_lut objects for lsb, msb, message, carry
    // luts_array -> lut = {lsb_acc, msb_acc}
@@ -2662,9 +2649,12 @@ template <typename Torus> struct int_mul_memory {

      return;
    }
-    cuda_drop_async(vector_result_sb, streams[0], gpu_indexes[0]);
-    cuda_drop_async(block_mul_res, streams[0], gpu_indexes[0]);
-    cuda_drop_async(small_lwe_vector, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], vector_result_sb);
+    delete vector_result_sb;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], block_mul_res);
+    delete block_mul_res;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], small_lwe_vector);
+    delete small_lwe_vector;

    luts_array->release(streams, gpu_indexes, gpu_count);
    sum_ciphertexts_mem->release(streams, gpu_indexes, gpu_count);
@@ -2784,9 +2774,6 @@ template <typename Torus> struct int_logical_scalar_shift_buffer {
    tmp_rotated = pre_allocated_buffer;
    reuse_memory = true;

-    uint32_t max_amount_of_pbs = num_radix_blocks;
-    uint32_t big_lwe_size = params.big_lwe_dimension + 1;
-    uint32_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
    set_zero_radix_ciphertext_slice_async<Torus>(streams[0], gpu_indexes[0],
                                                 tmp_rotated, 0,
                                                 tmp_rotated->num_radix_blocks);
@@ -2883,7 +2870,7 @@ template <typename Torus> struct int_arithmetic_scalar_shift_buffer {

  SHIFT_OR_ROTATE_TYPE shift_type;

-  Torus *tmp_rotated;
+  CudaRadixCiphertextFFI *tmp_rotated;

  cudaStream_t *local_streams_1;
  cudaStream_t *local_streams_2;
@@ -2912,16 +2899,10 @@ template <typename Torus> struct int_arithmetic_scalar_shift_buffer {
    this->params = params;

    if (allocate_gpu_memory) {
-      uint32_t big_lwe_size = params.big_lwe_dimension + 1;
-      uint32_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
-
-      tmp_rotated = (Torus *)cuda_malloc_async((num_radix_blocks + 3) *
-                                                   big_lwe_size_bytes,
-                                               streams[0], gpu_indexes[0]);
-
-      cuda_memset_async(tmp_rotated, 0,
-                        (num_radix_blocks + 3) * big_lwe_size_bytes, streams[0],
-                        gpu_indexes[0]);
+      tmp_rotated = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_rotated, num_radix_blocks + 3,
+          params.big_lwe_dimension);

      uint32_t num_bits_in_block = (uint32_t)std::log2(params.message_modulus);

@@ -3057,7 +3038,8 @@ template <typename Torus> struct int_arithmetic_scalar_shift_buffer {
    lut_buffers_bivariate.clear();
    lut_buffers_univariate.clear();

-    cuda_drop_async(tmp_rotated, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_rotated);
+    delete tmp_rotated;
  }
 };

@@ -3173,8 +3155,8 @@ template <typename Torus> struct int_are_all_block_true_buffer {
  COMPARISON_TYPE op;
  int_radix_params params;

-  Torus *tmp_out;
-  Torus *tmp_block_accumulated;
+  CudaRadixCiphertextFFI *tmp_out;
+  CudaRadixCiphertextFFI *tmp_block_accumulated;

  // This map store LUTs that checks the equality between some input and values
  // of interest in are_all_block_true(), as with max_value (the maximum message
@@ -3194,12 +3176,15 @@ template <typename Torus> struct int_are_all_block_true_buffer {
      uint32_t max_value = (total_modulus - 1) / (params.message_modulus - 1);

      int max_chunks = (num_radix_blocks + max_value - 1) / max_value;
-      tmp_block_accumulated = (Torus *)cuda_malloc_async(
-          (params.big_lwe_dimension + 1) * max_chunks * sizeof(Torus),
-          streams[0], gpu_indexes[0]);
-      tmp_out = (Torus *)cuda_malloc_async((params.big_lwe_dimension + 1) *
-                                               num_radix_blocks * sizeof(Torus),
-                                           streams[0], gpu_indexes[0]);
+      tmp_out = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_out, num_radix_blocks,
+                                                params.big_lwe_dimension);
+      tmp_block_accumulated = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_block_accumulated, max_chunks,
+          params.big_lwe_dimension);
+
      is_max_value =
          new int_radix_lut<Torus>(streams, gpu_indexes, gpu_count, params, 2,
                                   max_chunks, allocate_gpu_memory);
@@ -3222,8 +3207,10 @@ template <typename Torus> struct int_are_all_block_true_buffer {
    is_max_value->release(streams, gpu_indexes, gpu_count);
    delete (is_max_value);

-    cuda_drop_async(tmp_block_accumulated, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_out, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_out);
+    delete tmp_out;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_block_accumulated);
+    delete tmp_block_accumulated;
  }
 };

@@ -3336,8 +3323,8 @@ template <typename Torus> struct int_tree_sign_reduction_buffer {

  int_radix_lut<Torus> *tree_last_leaf_scalar_lut;

-  Torus *tmp_x;
-  Torus *tmp_y;
+  CudaRadixCiphertextFFI *tmp_x;
+  CudaRadixCiphertextFFI *tmp_y;

  int_tree_sign_reduction_buffer(cudaStream_t const *streams,
                                 uint32_t const *gpu_indexes,
@@ -3358,10 +3345,14 @@ template <typename Torus> struct int_tree_sign_reduction_buffer {
    };

    if (allocate_gpu_memory) {
-      tmp_x = (Torus *)cuda_malloc_async(big_size * num_radix_blocks,
-                                         streams[0], gpu_indexes[0]);
-      tmp_y = (Torus *)cuda_malloc_async(big_size * num_radix_blocks,
-                                         streams[0], gpu_indexes[0]);
+      tmp_x = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_x, num_radix_blocks,
+                                                params.big_lwe_dimension);
+      tmp_y = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_y, num_radix_blocks,
+                                                params.big_lwe_dimension);
      // LUTs
      tree_inner_leaf_lut =
          new int_radix_lut<Torus>(streams, gpu_indexes, gpu_count, params, 1,
@@ -3392,8 +3383,10 @@ template <typename Torus> struct int_tree_sign_reduction_buffer {
    tree_last_leaf_scalar_lut->release(streams, gpu_indexes, gpu_count);
    delete tree_last_leaf_scalar_lut;

-    cuda_drop_async(tmp_x, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_y, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_x);
+    delete tmp_x;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_y);
+    delete tmp_y;
  }
 };

@@ -3401,14 +3394,14 @@ template <typename Torus> struct int_comparison_diff_buffer {
  int_radix_params params;
  COMPARISON_TYPE op;

-  Torus *tmp_packed;
+  CudaRadixCiphertextFFI *tmp_packed;

  std::function<Torus(Torus)> operator_f;

  int_tree_sign_reduction_buffer<Torus> *tree_buffer;

-  Torus *tmp_signs_a;
-  Torus *tmp_signs_b;
+  CudaRadixCiphertextFFI *tmp_signs_a;
+  CudaRadixCiphertextFFI *tmp_signs_b;
  int_radix_lut<Torus> *reduce_signs_lut;

  int_comparison_diff_buffer(cudaStream_t const *streams,
@@ -3438,16 +3431,22 @@ template <typename Torus> struct int_comparison_diff_buffer {

      Torus big_size = (params.big_lwe_dimension + 1) * sizeof(Torus);

-      tmp_packed = (Torus *)cuda_malloc_async(big_size * num_radix_blocks,
-                                              streams[0], gpu_indexes[0]);
+      tmp_packed = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_packed, num_radix_blocks,
+                                                params.big_lwe_dimension);

      tree_buffer = new int_tree_sign_reduction_buffer<Torus>(
          streams, gpu_indexes, gpu_count, operator_f, params, num_radix_blocks,
          allocate_gpu_memory);
-      tmp_signs_a = (Torus *)cuda_malloc_async(big_size * num_radix_blocks,
-                                               streams[0], gpu_indexes[0]);
-      tmp_signs_b = (Torus *)cuda_malloc_async(big_size * num_radix_blocks,
-                                               streams[0], gpu_indexes[0]);
+      tmp_signs_a = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_signs_a, num_radix_blocks,
+                                                params.big_lwe_dimension);
+      tmp_signs_b = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                tmp_signs_b, num_radix_blocks,
+                                                params.big_lwe_dimension);
      // LUTs
      reduce_signs_lut =
          new int_radix_lut<Torus>(streams, gpu_indexes, gpu_count, params, 1,
@@ -3462,9 +3461,12 @@ template <typename Torus> struct int_comparison_diff_buffer {
    reduce_signs_lut->release(streams, gpu_indexes, gpu_count);
    delete reduce_signs_lut;

-    cuda_drop_async(tmp_packed, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_signs_a, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_signs_b, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_packed);
+    delete tmp_packed;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_signs_a);
+    delete tmp_signs_a;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_signs_b);
+    delete tmp_signs_b;
  }
 };

@@ -3482,12 +3484,12 @@ template <typename Torus> struct int_comparison_buffer {
  int_comparison_eq_buffer<Torus> *eq_buffer;
  int_comparison_diff_buffer<Torus> *diff_buffer;

-  Torus *tmp_block_comparisons;
-  Torus *tmp_lwe_array_out;
-  Torus *tmp_trivial_sign_block;
+  CudaRadixCiphertextFFI *tmp_block_comparisons;
+  CudaRadixCiphertextFFI *tmp_lwe_array_out;
+  CudaRadixCiphertextFFI *tmp_trivial_sign_block;

  // Scalar EQ / NE
-  Torus *tmp_packed_input;
+  CudaRadixCiphertextFFI *tmp_packed_input;

  // Max Min
  int_cmux_buffer<Torus> *cmux_buffer;
@@ -3515,8 +3517,6 @@ template <typename Torus> struct int_comparison_buffer {

    identity_lut_f = [](Torus x) -> Torus { return x; };

-    auto big_lwe_size = params.big_lwe_dimension + 1;
-
    if (allocate_gpu_memory) {
      lsb_streams =
          (cudaStream_t *)malloc(active_gpu_count * sizeof(cudaStream_t));
@@ -3528,18 +3528,21 @@ template <typename Torus> struct int_comparison_buffer {
      }

      // +1 to have space for signed comparison
-      tmp_lwe_array_out = (Torus *)cuda_malloc_async(
-          big_lwe_size * (num_radix_blocks + 1) * sizeof(Torus), streams[0],
-          gpu_indexes[0]);
+      tmp_lwe_array_out = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_lwe_array_out, num_radix_blocks + 1,
+          params.big_lwe_dimension);

-      tmp_packed_input = (Torus *)cuda_malloc_async(
-          big_lwe_size * 2 * num_radix_blocks * sizeof(Torus), streams[0],
-          gpu_indexes[0]);
+      tmp_packed_input = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_packed_input, 2 * num_radix_blocks,
+          params.big_lwe_dimension);

      // Block comparisons
-      tmp_block_comparisons = (Torus *)cuda_malloc_async(
-          big_lwe_size * num_radix_blocks * sizeof(Torus), streams[0],
-          gpu_indexes[0]);
+      tmp_block_comparisons = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], tmp_block_comparisons, num_radix_blocks,
+          params.big_lwe_dimension);

      // Cleaning LUT
      identity_lut =
@@ -3602,8 +3605,10 @@ template <typename Torus> struct int_comparison_buffer {

      if (is_signed) {

-        tmp_trivial_sign_block = (Torus *)cuda_malloc_async(
-            big_lwe_size * sizeof(Torus), streams[0], gpu_indexes[0]);
+        tmp_trivial_sign_block = new CudaRadixCiphertextFFI;
+        create_zero_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0],
+                                                  tmp_trivial_sign_block, 1,
+                                                  params.big_lwe_dimension);

        signed_lut = new int_radix_lut<Torus>(
            streams, gpu_indexes, gpu_count, params, 1, 1, allocate_gpu_memory);
@@ -3678,12 +3683,17 @@ template <typename Torus> struct int_comparison_buffer {
    delete identity_lut;
    is_zero_lut->release(streams, gpu_indexes, gpu_count);
    delete is_zero_lut;
-    cuda_drop_async(tmp_lwe_array_out, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_block_comparisons, streams[0], gpu_indexes[0]);
-    cuda_drop_async(tmp_packed_input, streams[0], gpu_indexes[0]);
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_lwe_array_out);
+    delete tmp_lwe_array_out;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_block_comparisons);
+    delete tmp_block_comparisons;
+    release_radix_ciphertext(streams[0], gpu_indexes[0], tmp_packed_input);
+    delete tmp_packed_input;

    if (is_signed) {
-      cuda_drop_async(tmp_trivial_sign_block, streams[0], gpu_indexes[0]);
+      release_radix_ciphertext(streams[0], gpu_indexes[0],
+                               tmp_trivial_sign_block);
+      delete tmp_trivial_sign_block;
      signed_lut->release(streams, gpu_indexes, gpu_count);
      delete (signed_lut);
      signed_msb_lut->release(streams, gpu_indexes, gpu_count);
@@ -4419,7 +4429,7 @@ template <typename Torus> struct int_scalar_mul_buffer {
  int_radix_params params;
  int_logical_scalar_shift_buffer<Torus> *logical_scalar_shift_buffer;
  int_sum_ciphertexts_vec_memory<Torus> *sum_ciphertexts_vec_mem;
-  Torus *preshifted_buffer;
+  CudaRadixCiphertextFFI *preshifted_buffer;
  CudaRadixCiphertextFFI *all_shifted_buffer;
  int_sc_prop_memory<Torus> *sc_prop_mem;
  bool anticipated_buffers_drop;
@@ -4434,25 +4444,21 @@ template <typename Torus> struct int_scalar_mul_buffer {

    if (allocate_gpu_memory) {
      uint32_t msg_bits = (uint32_t)std::log2(params.message_modulus);
-      uint32_t lwe_size = params.big_lwe_dimension + 1;
-      uint32_t lwe_size_bytes = lwe_size * sizeof(Torus);
      size_t num_ciphertext_bits = msg_bits * num_radix_blocks;

      //// Contains all shifted values of lhs for shift in range (0..msg_bits)
      //// The idea is that with these we can create all other shift that are
      /// in / range (0..total_bits) for free (block rotation)
-      preshifted_buffer = (Torus *)cuda_malloc_async(
-          num_ciphertext_bits * lwe_size_bytes, streams[0], gpu_indexes[0]);
+      preshifted_buffer = new CudaRadixCiphertextFFI;
+      create_zero_radix_ciphertext_async<Torus>(
+          streams[0], gpu_indexes[0], preshifted_buffer, num_ciphertext_bits,
+          params.big_lwe_dimension);

      all_shifted_buffer = new CudaRadixCiphertextFFI;
      create_zero_radix_ciphertext_async<Torus>(
          streams[0], gpu_indexes[0], all_shifted_buffer,
          num_ciphertext_bits * num_radix_blocks, params.big_lwe_dimension);

-      cuda_memset_async(preshifted_buffer, 0,
-                        num_ciphertext_bits * lwe_size_bytes, streams[0],
-                        gpu_indexes[0]);
-
      if (num_ciphertext_bits * num_radix_blocks >= num_radix_blocks + 2)
        logical_scalar_shift_buffer =
            new int_logical_scalar_shift_buffer<Torus>(
@@ -4484,7 +4490,8 @@ template <typename Torus> struct int_scalar_mul_buffer {
    release_radix_ciphertext(streams[0], gpu_indexes[0], all_shifted_buffer);
    delete all_shifted_buffer;
    if (!anticipated_buffers_drop) {
-      cuda_drop_async(preshifted_buffer, streams[0], gpu_indexes[0]);
+      release_radix_ciphertext(streams[0], gpu_indexes[0], preshifted_buffer);
+      delete preshifted_buffer;
      logical_scalar_shift_buffer->release(streams, gpu_indexes, gpu_count);
      delete (logical_scalar_shift_buffer);
    }
@@ -4741,4 +4748,5 @@ void update_degrees_after_scalar_bitxor(uint64_t *output_degrees,
                                        uint64_t *clear_degrees,
                                        uint64_t *input_degrees,
                                        uint32_t num_clear_blocks);
+std::pair<bool, bool> get_invert_flags(COMPARISON_TYPE compare);
 #endif // CUDA_INTEGER_UTILITIES_H
--- a/backends/tfhe-cuda-backend/cuda/include/pbs/pbs_utilities.h
+++ b/backends/tfhe-cuda-backend/cuda/include/pbs/pbs_utilities.h
@@ -9,20 +9,29 @@
 template <typename Torus>
 uint64_t get_buffer_size_full_sm_programmable_bootstrap_step_one(
    uint32_t polynomial_size) {
-  return sizeof(Torus) * polynomial_size +      // accumulator_rotated
-         sizeof(double2) * polynomial_size / 2; // accumulator fft
+  size_t scalar_size = sizeof(Torus);
+  size_t split_count = (scalar_size == 16) ? 2 : 1;
+  return scalar_size * polynomial_size + // accumulator_rotated
+         sizeof(double) * 2 * split_count * polynomial_size /
+             2; // accumulator fft
 }
 template <typename Torus>
 uint64_t get_buffer_size_full_sm_programmable_bootstrap_step_two(
    uint32_t polynomial_size) {
-  return sizeof(Torus) * polynomial_size +      // accumulator
-         sizeof(double2) * polynomial_size / 2; // accumulator fft
+  size_t scalar_size = sizeof(Torus);
+  size_t split_count = (scalar_size == 16) ? 2 : 1;
+  return scalar_size * polynomial_size + // accumulator
+         sizeof(double) * 2 * split_count * polynomial_size /
+             2; // accumulator fft
 }

 template <typename Torus>
 uint64_t
 get_buffer_size_partial_sm_programmable_bootstrap(uint32_t polynomial_size) {
-  return sizeof(double2) * polynomial_size / 2; // accumulator fft
+  size_t scalar_size = sizeof(Torus);
+  size_t split_count = (scalar_size == 16) ? 2 : 1;
+  return sizeof(double) * 2 * split_count * polynomial_size /
+         2; // accumulator fft
 }

 template <typename Torus>
@@ -215,6 +224,158 @@ template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::CLASSICAL> {
  }
 };

+template <PBS_TYPE pbs_type> struct pbs_buffer_128;
+
+template <> struct pbs_buffer_128<PBS_TYPE::CLASSICAL> {
+  int8_t *d_mem;
+
+  __uint128_t *global_accumulator;
+  double *global_join_buffer;
+
+  PBS_VARIANT pbs_variant;
+
+  pbs_buffer_128(cudaStream_t stream, uint32_t gpu_index,
+                 uint32_t glwe_dimension, uint32_t polynomial_size,
+                 uint32_t level_count, uint32_t input_lwe_ciphertext_count,
+                 PBS_VARIANT pbs_variant, bool allocate_gpu_memory) {
+    cuda_set_device(gpu_index);
+    this->pbs_variant = pbs_variant;
+
+    auto max_shared_memory = cuda_get_max_shared_memory(gpu_index);
+
+    if (allocate_gpu_memory) {
+      switch (pbs_variant) {
+      case PBS_VARIANT::DEFAULT: {
+        uint64_t full_sm_step_one =
+            get_buffer_size_full_sm_programmable_bootstrap_step_one<
+                __uint128_t>(polynomial_size);
+        uint64_t full_sm_step_two =
+            get_buffer_size_full_sm_programmable_bootstrap_step_two<
+                __uint128_t>(polynomial_size);
+        uint64_t partial_sm =
+            get_buffer_size_partial_sm_programmable_bootstrap<__uint128_t>(
+                polynomial_size);
+
+        uint64_t partial_dm_step_one = full_sm_step_one - partial_sm;
+        uint64_t partial_dm_step_two = full_sm_step_two - partial_sm;
+        uint64_t full_dm = full_sm_step_one;
+
+        uint64_t device_mem = 0;
+        if (max_shared_memory < partial_sm) {
+          device_mem = full_dm * input_lwe_ciphertext_count * level_count *
+                       (glwe_dimension + 1);
+        } else if (max_shared_memory < full_sm_step_two) {
+          device_mem =
+              (partial_dm_step_two + partial_dm_step_one * level_count) *
+              input_lwe_ciphertext_count * (glwe_dimension + 1);
+        } else if (max_shared_memory < full_sm_step_one) {
+          device_mem = partial_dm_step_one * input_lwe_ciphertext_count *
+                       level_count * (glwe_dimension + 1);
+        }
+        // Otherwise, both kernels run all in shared memory
+        d_mem = (int8_t *)cuda_malloc_async(device_mem, stream, gpu_index);
+
+        global_join_buffer = (double *)cuda_malloc_async(
+            (glwe_dimension + 1) * level_count * input_lwe_ciphertext_count *
+                (polynomial_size / 2) * sizeof(double) * 4,
+            stream, gpu_index);
+
+        global_accumulator = (__uint128_t *)cuda_malloc_async(
+            (glwe_dimension + 1) * input_lwe_ciphertext_count *
+                polynomial_size * sizeof(__uint128_t),
+            stream, gpu_index);
+      } break;
+      case PBS_VARIANT::CG: {
+        uint64_t full_sm =
+            get_buffer_size_full_sm_programmable_bootstrap_cg<__uint128_t>(
+                polynomial_size);
+        uint64_t partial_sm =
+            get_buffer_size_partial_sm_programmable_bootstrap_cg<__uint128_t>(
+                polynomial_size);
+
+        uint64_t partial_dm = full_sm - partial_sm;
+        uint64_t full_dm = full_sm;
+        uint64_t device_mem = 0;
+
+        if (max_shared_memory < partial_sm) {
+          device_mem = full_dm * input_lwe_ciphertext_count * level_count *
+                       (glwe_dimension + 1);
+        } else if (max_shared_memory < full_sm) {
+          device_mem = partial_dm * input_lwe_ciphertext_count * level_count *
+                       (glwe_dimension + 1);
+        }
+
+        // Otherwise, both kernels run all in shared memory
+        d_mem = (int8_t *)cuda_malloc_async(device_mem, stream, gpu_index);
+
+        global_join_buffer = (double *)cuda_malloc_async(
+            (glwe_dimension + 1) * level_count * input_lwe_ciphertext_count *
+                polynomial_size / 2 * sizeof(double) * 4,
+            stream, gpu_index);
+      } break;
+#if CUDA_ARCH >= 900
+      case PBS_VARIANT::TBC: {
+
+        bool supports_dsm =
+            supports_distributed_shared_memory_on_classic_programmable_bootstrap<
+                __uint128_t>(polynomial_size, max_shared_memory);
+
+        uint64_t full_sm =
+            get_buffer_size_full_sm_programmable_bootstrap_tbc<__uint128_t>(
+                polynomial_size);
+        uint64_t partial_sm =
+            get_buffer_size_partial_sm_programmable_bootstrap_tbc<__uint128_t>(
+                polynomial_size);
+        uint64_t minimum_sm_tbc = 0;
+        if (supports_dsm)
+          minimum_sm_tbc =
+              get_buffer_size_sm_dsm_plus_tbc_classic_programmable_bootstrap<
+                  __uint128_t>(polynomial_size);
+
+        uint64_t partial_dm = full_sm - partial_sm;
+        uint64_t full_dm = full_sm;
+        uint64_t device_mem = 0;
+
+        // The TBC PBS needs at least minimum_sm_tbc bytes of shared memory.
+        // That much is known to be available, because otherwise an earlier
+        // check would have redirected the computation to another variant.
+        // Without a further partial_sm bytes on top of that, the TBC PBS runs
+        // in NOSM mode. With partial_sm but not full_sm extra bytes, it runs
+        // in PARTIALSM mode. Otherwise, it runs in FULLSM mode.
+        //
+        // NOSM mode itself still requires minimum_sm_tbc shared memory bytes.
+        if (max_shared_memory < partial_sm + minimum_sm_tbc) {
+          device_mem = full_dm * input_lwe_ciphertext_count * level_count *
+                       (glwe_dimension + 1);
+        } else if (max_shared_memory < full_sm + minimum_sm_tbc) {
+          device_mem = partial_dm * input_lwe_ciphertext_count * level_count *
+                       (glwe_dimension + 1);
+        }
+
+        // Otherwise, both kernels run all in shared memory
+        d_mem = (int8_t *)cuda_malloc_async(device_mem, stream, gpu_index);
+
+        global_join_buffer = (double *)cuda_malloc_async(
+            (glwe_dimension + 1) * level_count * input_lwe_ciphertext_count *
+                polynomial_size / 2 * sizeof(double) * 4,
+            stream, gpu_index);
+      } break;
+#endif
+      default:
+        PANIC("Cuda error (PBS): unsupported implementation variant.")
+      }
+    }
+  }
+
+  void release(cudaStream_t stream, uint32_t gpu_index) {
+    cuda_drop_async(d_mem, stream, gpu_index);
+    cuda_drop_async(global_join_buffer, stream, gpu_index);
+
+    if (pbs_variant == DEFAULT)
+      cuda_drop_async(global_accumulator, stream, gpu_index);
+  }
+};
+
 template <typename Torus>
 uint64_t get_buffer_size_programmable_bootstrap_cg(
    uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
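The sizing rule introduced in pbs_utilities.h above can be checked on the host. The following is a minimal sketch, not the backend's code: `fft_accumulator_bytes` and `full_sm_step_bytes` are hypothetical helper names, and `scalar_size` stands in for `sizeof(Torus)`. It assumes, as the diff suggests, that a 16-byte torus doubles the FFT buffer because each value is kept as a split pair of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the "accumulator fft" term.
// scalar_size plays the role of sizeof(Torus); 16 means a 128-bit torus.
inline uint64_t fft_accumulator_bytes(size_t scalar_size,
                                      uint32_t polynomial_size) {
  // Same expression as the diff: one complex value (2 doubles) per pair of
  // coefficients, doubled again when the torus is 128 bits wide.
  size_t split_count = (scalar_size == 16) ? 2 : 1;
  return sizeof(double) * 2 * split_count * polynomial_size / 2;
}

// Hypothetical mirror of the full shared-memory size of one step:
// the torus accumulator plus the FFT accumulator.
inline uint64_t full_sm_step_bytes(size_t scalar_size,
                                   uint32_t polynomial_size) {
  assert(scalar_size == 4 || scalar_size == 8 || scalar_size == 16);
  return scalar_size * polynomial_size +
         fft_accumulator_bytes(scalar_size, polynomial_size);
}
```

With `polynomial_size = 2048`, `full_sm_step_bytes(8, 2048)` evaluates to 32768 and `full_sm_step_bytes(16, 2048)` to 65536, so the 128-bit path needs twice the shared memory per step under this rule.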
--- a/backends/tfhe-cuda-backend/cuda/include/pbs/programmable_bootstrap.h
+++ b/backends/tfhe-cuda-backend/cuda/include/pbs/programmable_bootstrap.h
@@ -20,6 +20,11 @@ void cuda_convert_lwe_programmable_bootstrap_key_64(
    uint32_t input_lwe_dim, uint32_t glwe_dim, uint32_t level_count,
    uint32_t polynomial_size);

+void cuda_convert_lwe_programmable_bootstrap_key_128(
+    void *stream, uint32_t gpu_index, void *dest, void const *src,
+    uint32_t input_lwe_dim, uint32_t glwe_dim, uint32_t level_count,
+    uint32_t polynomial_size);
+
 void scratch_cuda_programmable_bootstrap_amortized_32(
    void *stream, uint32_t gpu_index, int8_t **pbs_buffer,
    uint32_t glwe_dimension, uint32_t polynomial_size,
@@ -62,6 +67,11 @@ void scratch_cuda_programmable_bootstrap_64(
    uint32_t polynomial_size, uint32_t level_count,
    uint32_t input_lwe_ciphertext_count, bool allocate_gpu_memory);

+void scratch_cuda_programmable_bootstrap_128(
+    void *stream, uint32_t gpu_index, int8_t **buffer, uint32_t glwe_dimension,
+    uint32_t polynomial_size, uint32_t level_count,
+    uint32_t input_lwe_ciphertext_count, bool allocate_gpu_memory);
+
 void cuda_programmable_bootstrap_lwe_ciphertext_vector_32(
    void *stream, uint32_t gpu_index, void *lwe_array_out,
    void const *lwe_output_indexes, void const *lut_vector,
@@ -80,7 +90,19 @@ void cuda_programmable_bootstrap_lwe_ciphertext_vector_64(
    uint32_t polynomial_size, uint32_t base_log, uint32_t level_count,
    uint32_t num_samples, uint32_t num_many_lut, uint32_t lut_stride);

+void cuda_programmable_bootstrap_lwe_ciphertext_vector_128(
+    void *stream, uint32_t gpu_index, void *lwe_array_out,
+    void const *lwe_output_indexes, void const *lut_vector,
+    void const *lut_vector_indexes, void const *lwe_array_in,
+    void const *lwe_input_indexes, void const *bootstrapping_key,
+    int8_t *buffer, uint32_t lwe_dimension, uint32_t glwe_dimension,
+    uint32_t polynomial_size, uint32_t base_log, uint32_t level_count,
+    uint32_t num_samples, uint32_t num_many_lut, uint32_t lut_stride);
+
 void cleanup_cuda_programmable_bootstrap(void *stream, uint32_t gpu_index,
                                         int8_t **pbs_buffer);
+
+void cleanup_cuda_programmable_bootstrap_128(void *stream, uint32_t gpu_index,
+                                             int8_t **pbs_buffer);
 }
 #endif // CUDA_BOOTSTRAP_H
--- a/backends/tfhe-cuda-backend/cuda/src/crypto/ciphertext.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/crypto/ciphertext.cu
@@ -24,7 +24,7 @@ void cuda_convert_lwe_ciphertext_vector_to_cpu_64(void *stream,
 void cuda_glwe_sample_extract_64(void *stream, uint32_t gpu_index,
                                 void *lwe_array_out, void const *glwe_array_in,
                                 uint32_t const *nth_array, uint32_t num_nths,
-                                 uint32_t glwe_dimension,
+                                 uint32_t lwe_per_glwe, uint32_t glwe_dimension,
                                 uint32_t polynomial_size) {

  switch (polynomial_size) {
@@ -32,43 +32,43 @@ void cuda_glwe_sample_extract_64(void *stream, uint32_t gpu_index,
    host_sample_extract<uint64_t, AmortizedDegree<256>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 512:
    host_sample_extract<uint64_t, AmortizedDegree<512>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 1024:
    host_sample_extract<uint64_t, AmortizedDegree<1024>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 2048:
    host_sample_extract<uint64_t, AmortizedDegree<2048>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 4096:
    host_sample_extract<uint64_t, AmortizedDegree<4096>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 8192:
    host_sample_extract<uint64_t, AmortizedDegree<8192>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  case 16384:
    host_sample_extract<uint64_t, AmortizedDegree<16384>>(
        static_cast<cudaStream_t>(stream), gpu_index, (uint64_t *)lwe_array_out,
        (uint64_t const *)glwe_array_in, (uint32_t const *)nth_array, num_nths,
-        glwe_dimension);
+        lwe_per_glwe, glwe_dimension);
    break;
  default:
    PANIC("Cuda error: unsupported polynomial size. Supported "
--- a/backends/tfhe-cuda-backend/cuda/src/crypto/ciphertext.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/crypto/ciphertext.cuh
@@ -28,7 +28,7 @@ void cuda_convert_lwe_ciphertext_vector_to_cpu(cudaStream_t stream,

 template <typename Torus, class params>
 __global__ void sample_extract(Torus *lwe_array_out, Torus const *glwe_array_in,
-                               uint32_t const *nth_array,
+                               uint32_t const *nth_array, uint32_t lwe_per_glwe,
                               uint32_t glwe_dimension) {

  const int input_id = blockIdx.x;
@@ -39,28 +39,28 @@ __global__ void sample_extract(Torus *lwe_array_out, Torus const *glwe_array_in,
  auto lwe_out = lwe_array_out + input_id * lwe_output_size;

  // We assume each GLWE will store the first polynomial_size inputs
-  uint32_t lwe_per_glwe = params::degree;
  auto glwe_in = glwe_array_in + (input_id / lwe_per_glwe) * glwe_input_size;

-  // nth is ensured to be in [0, lwe_per_glwe)
-  auto nth = nth_array[input_id] % lwe_per_glwe;
+  // The modulo below keeps nth in [0, params::degree)
+  auto nth = nth_array[input_id] % params::degree;

  sample_extract_mask<Torus, params>(lwe_out, glwe_in, glwe_dimension, nth);
  sample_extract_body<Torus, params>(lwe_out, glwe_in, glwe_dimension, nth);
 }

+// lwe_per_glwe LWEs are extracted from each GLWE ciphertext, so nth_array must
+// hold enough indexes
 template <typename Torus, class params>
-__host__ void host_sample_extract(cudaStream_t stream, uint32_t gpu_index,
-                                  Torus *lwe_array_out,
-                                  Torus const *glwe_array_in,
-                                  uint32_t const *nth_array, uint32_t num_nths,
-                                  uint32_t glwe_dimension) {
+__host__ void
+host_sample_extract(cudaStream_t stream, uint32_t gpu_index,
+                    Torus *lwe_array_out, Torus const *glwe_array_in,
+                    uint32_t const *nth_array, uint32_t num_nths,
+                    uint32_t lwe_per_glwe, uint32_t glwe_dimension) {
  cuda_set_device(gpu_index);
-
  dim3 grid(num_nths);
  dim3 thds(params::degree / params::opt);
  sample_extract<Torus, params><<<grid, thds, 0, stream>>>(
-      lwe_array_out, glwe_array_in, nth_array, glwe_dimension);
+      lwe_array_out, glwe_array_in, nth_array, lwe_per_glwe, glwe_dimension);
  check_cuda_error(cudaGetLastError());
 }

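The index arithmetic in the updated `sample_extract` kernel can be mirrored on the host. This is only an illustrative sketch: `ExtractSource` and `locate_sample` are hypothetical names, and `polynomial_size` stands in for `params::degree`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of where one output LWE comes from.
struct ExtractSource {
  uint32_t glwe_index; // which input GLWE the block reads
  uint32_t nth;        // coefficient extracted from that GLWE
};

// Mirrors the kernel: the GLWE is chosen by input_id / lwe_per_glwe, and
// the requested coefficient is reduced modulo the polynomial size.
inline ExtractSource locate_sample(uint32_t input_id, uint32_t nth_value,
                                   uint32_t lwe_per_glwe,
                                   uint32_t polynomial_size) {
  assert(lwe_per_glwe > 0 && polynomial_size > 0);
  return {input_id / lwe_per_glwe, nth_value % polynomial_size};
}
```

For example, with `lwe_per_glwe = 4` and a polynomial size of 256, output 5 reads GLWE 1. With `lwe_per_glwe = 256` (the value the kernel hard-coded before this change for that polynomial size), the same output reads GLWE 0.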
--- a/backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/crypto/fast_packing_keyswitch.cuh
@@ -2,7 +2,6 @@
 #define CNCRT_FAST_KS_CUH

 #undef NDEBUG
-#include <assert.h>

 #include "device.h"
 #include "gadget.cuh"
--- a/backends/tfhe-cuda-backend/cuda/src/crypto/gadget.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/crypto/gadget.cuh
@@ -3,6 +3,7 @@

 #include "crypto/torus.cuh"
 #include "device.h"
+#include "fft128/f128.cuh"
 #include <cstdint>

 /**
@@ -42,6 +43,13 @@ public:
    }
  }

+  __device__ void decompose_and_compress_next_128(double *result) {
+    for (int j = 0; j < num_poly; j++) {
+      auto result_slice = result + j * params::degree / 2 * 4;
+      decompose_and_compress_next_polynomial_128(result_slice, j);
+    }
+  }
+
  // Decomposes a single polynomial
  __device__ void decompose_and_compress_next_polynomial(double2 *result,
                                                         int j) {
@@ -75,10 +83,58 @@ public:
    synchronize_threads_in_block();
  }

+  // Decomposes a single polynomial into split (hi, lo) double components
+  __device__ void decompose_and_compress_next_polynomial_128(double *result,
+                                                             int j) {
+    uint32_t tid = threadIdx.x;
+    auto state_slice = &state[j * params::degree];
+    for (int i = 0; i < params::opt / 2; i++) {
+      auto input1 = &state_slice[tid];
+      auto input2 = &state_slice[tid + params::degree / 2];
+      T res_re = *input1 & mask_mod_b;
+      T res_im = *input2 & mask_mod_b;
+
+      *input1 >>= base_log; // Update state
+      *input2 >>= base_log; // Update state
+
+      T carry_re = ((res_re - 1ll) | *input1) & res_re;
+      T carry_im = ((res_im - 1ll) | *input2) & res_im;
+      carry_re >>= (base_log - 1);
+      carry_im >>= (base_log - 1);
+
+      *input1 += carry_re; // Update state
+      *input2 += carry_im; // Update state
+
+      res_re -= carry_re << base_log;
+      res_im -= carry_im << base_log;
+
+      auto out_re = u128_to_signed_to_f128(res_re);
+      auto out_im = u128_to_signed_to_f128(res_im);
+
+      auto out_re_hi = result + 0LL * params::degree / 2;
+      auto out_re_lo = result + 1LL * params::degree / 2;
+      auto out_im_hi = result + 2LL * params::degree / 2;
+      auto out_im_lo = result + 3LL * params::degree / 2;
+
+      out_re_hi[tid] = out_re.hi;
+      out_re_lo[tid] = out_re.lo;
+      out_im_hi[tid] = out_im.hi;
+      out_im_lo[tid] = out_im.lo;
+
+      tid += params::degree / params::opt;
+    }
+    synchronize_threads_in_block();
+  }
+
  __device__ void decompose_and_compress_level(double2 *result, int level) {
    for (int i = 0; i < level_count - level; i++)
      decompose_and_compress_next(result);
  }
+
+  __device__ void decompose_and_compress_level_128(double *result, int level) {
+    for (int i = 0; i < level_count - level; i++)
+      decompose_and_compress_next_128(result);
+  }
 };

 template <typename Torus>
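The carry logic in `decompose_and_compress_next_polynomial_128` is a signed (balanced) digit decomposition. The sketch below reproduces the same steps on a 64-bit host integer rather than on `__uint128_t`, and omits the conversion to f128. `decompose` and `recompose` are hypothetical helpers written for illustration, not functions of the backend.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits `state` into level_count signed digits in base 2^base_log, least
// significant digit first, using the same mask/shift/carry steps as the
// kernel.
inline std::vector<int64_t> decompose(uint64_t state, uint32_t base_log,
                                      uint32_t level_count) {
  assert(base_log >= 1 && base_log < 63);
  const uint64_t mask_mod_b = (uint64_t{1} << base_log) - 1;
  std::vector<int64_t> digits;
  for (uint32_t i = 0; i < level_count; i++) {
    uint64_t res = state & mask_mod_b;
    state >>= base_log; // update state
    uint64_t carry = ((res - 1) | state) & res;
    carry >>= (base_log - 1); // 0 or 1
    state += carry;           // propagate the carry to the next digit
    digits.push_back(static_cast<int64_t>(res) -
                     static_cast<int64_t>(carry << base_log));
  }
  return digits;
}

// Recombines digits as sum(d_i * 2^(i * base_log)), reduced modulo
// 2^(base_log * digits.size()).
inline uint64_t recompose(const std::vector<int64_t> &digits,
                          uint32_t base_log) {
  uint64_t acc = 0;
  for (size_t i = 0; i < digits.size(); i++)
    acc += static_cast<uint64_t>(digits[i]) << (i * base_log);
  const uint32_t bits = base_log * static_cast<uint32_t>(digits.size());
  return bits >= 64 ? acc : acc & ((uint64_t{1} << bits) - 1);
}
```

Each digit lands in [-B/2, B/2] with B = 2^base_log, and recomposing the digits returns the original value modulo B^level_count. For `decompose(0xABC, 4, 3)` the digits are -4, -4, -5.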
--- a/backends/tfhe-cuda-backend/cuda/src/crypto/keyswitch.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/crypto/keyswitch.cuh
@@ -45,19 +45,19 @@ keyswitch(Torus *lwe_array_out, const Torus *__restrict__ lwe_output_indexes,
          const Torus *__restrict__ lwe_input_indexes,
          const Torus *__restrict__ ksk, uint32_t lwe_dimension_in,
          uint32_t lwe_dimension_out, uint32_t base_log, uint32_t level_count) {
-  const int tid = threadIdx.x + blockIdx.x * blockDim.x;
+  const int tid = threadIdx.x + blockIdx.y * blockDim.x;
  const int shmem_index = threadIdx.x + threadIdx.y * blockDim.x;

  extern __shared__ int8_t sharedmem[];
  Torus *lwe_acc_out = (Torus *)sharedmem;
  auto block_lwe_array_out = get_chunk(
-      lwe_array_out, lwe_output_indexes[blockIdx.y], lwe_dimension_out + 1);
+      lwe_array_out, lwe_output_indexes[blockIdx.x], lwe_dimension_out + 1);

  if (tid <= lwe_dimension_out) {

    Torus local_lwe_out = 0;
    auto block_lwe_array_in = get_chunk(
-        lwe_array_in, lwe_input_indexes[blockIdx.y], lwe_dimension_in + 1);
+        lwe_array_in, lwe_input_indexes[blockIdx.x], lwe_dimension_in + 1);

    if (tid == lwe_dimension_out && threadIdx.y == 0) {
      local_lwe_out = block_lwe_array_in[lwe_dimension_in];
@@ -108,13 +108,19 @@ __host__ void host_keyswitch_lwe_ciphertext_vector(
  cuda_set_device(gpu_index);

  constexpr int num_threads_y = 32;
-  int num_blocks, num_threads_x;
+  int num_blocks_per_sample, num_threads_x;

  getNumBlocksAndThreads2D(lwe_dimension_out + 1, 512, num_threads_y,
-                           num_blocks, num_threads_x);
+                           num_blocks_per_sample, num_threads_x);

  int shared_mem = sizeof(Torus) * num_threads_y * num_threads_x;
-  dim3 grid(num_blocks, num_samples, 1);
+  if (num_blocks_per_sample > 65535)
+    PANIC("Cuda error (Keyswitch): number of blocks per sample is too large");
+
+  // In multiplication of large integers (512, 1024, 2048), the number of
+  // samples can exceed 65535 (the limit of gridDim.y), so we need to set it
+  // in the first dimension of the grid
+  dim3 grid(num_samples, num_blocks_per_sample, 1);
  dim3 threads(num_threads_x, num_threads_y, 1);

  keyswitch<Torus><<<grid, threads, shared_mem, stream>>>(
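Review note: the grid swap in the hunk above relies on the CUDA launch limits for compute capability 3.0 and later, where gridDim.x may be as large as 2^31 - 1 while gridDim.y and gridDim.z are capped at 65535. The host-only sketch below illustrates that check in isolation. `GridDims` and `make_sample_major_grid` are hypothetical names introduced for this note and are not part of the backend.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for dim3, so the sketch compiles without CUDA headers.
struct GridDims {
  uint32_t x;
  uint32_t y;
  uint32_t z;
};

// Launch limits for compute capability >= 3.0.
constexpr uint64_t kMaxGridDimX = 2147483647ull; // 2^31 - 1
constexpr uint64_t kMaxGridDimY = 65535ull;

// Put the (potentially huge) sample count in x and the small per-sample block
// count in y, rejecting anything the hardware cannot launch.
inline GridDims make_sample_major_grid(uint64_t num_samples,
                                       uint64_t num_blocks_per_sample) {
  if (num_samples == 0 || num_samples > kMaxGridDimX)
    throw std::invalid_argument("number of samples does not fit in gridDim.x");
  if (num_blocks_per_sample == 0 || num_blocks_per_sample > kMaxGridDimY)
    throw std::invalid_argument("blocks per sample do not fit in gridDim.y");
  return GridDims{static_cast<uint32_t>(num_samples),
                  static_cast<uint32_t>(num_blocks_per_sample), 1u};
}
```

With this layout a kernel indexes the sample with blockIdx.x and the block within a sample with blockIdx.y, which matches the index changes made in the keyswitch kernel.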
--- a/backends/tfhe-cuda-backend/cuda/src/device.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/device.cu
@@ -135,7 +135,7 @@ bool cuda_check_support_thread_block_clusters() {
 }

 /// Copy memory to the GPU asynchronously
-void cuda_memcpy_async_to_gpu(void *dest, void *src, uint64_t size,
+void cuda_memcpy_async_to_gpu(void *dest, const void *src, uint64_t size,
                              cudaStream_t stream, uint32_t gpu_index) {
  if (size == 0)
    return;
--- a/backends/tfhe-cuda-backend/cuda/src/fft/twiddles.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/fft/twiddles.cu
--- a/backends/tfhe-cuda-backend/cuda/src/fft128/f128.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/fft128/f128.cuh
@@ -0,0 +1,411 @@
+
+#ifndef CUDA_FFT128_F128_CUH
+#define CUDA_FFT128_F128_CUH
+
+#include <cstdint>
+#include <utility>
+
+struct alignas(16) f128 {
+  double hi;
+  double lo;
+
+  // Default and parameterized constructors
+  __host__ __device__ f128() : hi(0.0), lo(0.0) {}
+  __host__ __device__ f128(double high, double low) : hi(high), lo(low) {}
+
+  // Quick two-sum
+  __host__ __device__ __forceinline__ static f128 quick_two_sum(double a,
+                                                                double b) {
+#ifdef __CUDA_ARCH__
+    double s = __dadd_rn(a, b);
+    return f128(s, __dsub_rn(b, __dsub_rn(s, a)));
+#else
+    double s = a + b;
+    return f128(s, b - (s - a));
+#endif
+  }
+
+  // Two-sum
+  __host__ __device__ __forceinline__ static f128 two_sum(double a, double b) {
+#ifdef __CUDA_ARCH__
+    double s = __dadd_rn(a, b);
+    double bb = __dsub_rn(s, a);
+    return f128(s, __dadd_rn(__dsub_rn(a, __dsub_rn(s, bb)), __dsub_rn(b, bb)));
+#else
+    double s = a + b;
+    double bb = s - a;
+    return f128(s, (a - (s - bb)) + (b - bb));
+#endif
+  }
+
+  // Two-product
+  __host__ __device__ __forceinline__ static f128 two_prod(double a, double b) {
+
+#ifdef __CUDA_ARCH__
+    double p = __dmul_rn(a, b);
+    double p2 = __fma_rn(a, b, -p);
+#else
+    double p = a * b;
+    double p2 = fma(a, b, -p);
+#endif
+    return f128(p, p2);
+  }
+
+  __host__ __device__ __forceinline__ static f128 two_diff(double a, double b) {
+#ifdef __CUDA_ARCH__
+    double s = __dsub_rn(a, b);
+    double bb = __dsub_rn(s, a);
+    return f128(s, __dsub_rn(__dsub_rn(a, __dsub_rn(s, bb)), __dadd_rn(b, bb)));
+#else
+    double s = a - b;
+    double bb = s - a;
+    return f128(s, (a - (s - bb)) - (b + bb));
+#endif
+  }
+
+  // Addition
+  __host__ __device__ static f128 add(const f128 &a, const f128 &b) {
+    auto s = two_sum(a.hi, b.hi);
+    auto t = two_sum(a.lo, b.lo);
+
+    double hi = s.hi;
+    double lo = s.lo + t.hi;
+    hi = hi + lo;
+    lo = lo - (hi - s.hi);
+
+    return f128(hi, lo + t.lo);
+  }
+
+  // Addition with estimate
+  __host__ __device__ static f128 add_estimate(const f128 &a, const f128 &b) {
+    auto se = two_sum(a.hi, b.hi);
+#ifdef __CUDA_ARCH__
+    se.lo = __dadd_rn(se.lo, __dadd_rn(a.lo, b.lo));
+#else
+    se.lo += (a.lo + b.lo);
+#endif
+
+    return quick_two_sum(se.hi, se.lo);
+  }
+
+  // Subtraction with estimate
+  __host__ __device__ static f128 sub_estimate(const f128 &a, const f128 &b) {
+    f128 se = two_diff(a.hi, b.hi);
+#ifdef __CUDA_ARCH__
+    se.lo = __dadd_rn(se.lo, a.lo);
+    se.lo = __dsub_rn(se.lo, b.lo);
+#else
+    se.lo += a.lo;
+    se.lo -= b.lo;
+#endif
+    return quick_two_sum(se.hi, se.lo);
+  }
+
+  // Subtraction
+  __host__ __device__ static f128 sub(const f128 &a, const f128 &b) {
+    auto s = two_diff(a.hi, b.hi);
+    auto t = two_diff(a.lo, b.lo);
+    s = quick_two_sum(s.hi, s.lo + t.hi);
+    return quick_two_sum(s.hi, s.lo + t.lo);
+  }
+
+  // Multiplication
+  __host__ __device__ static f128 mul(const f128 &a, const f128 &b) {
+    auto p = two_prod(a.hi, b.hi);
+#ifdef __CUDA_ARCH__
+    double a_0_x_b_1 = __dmul_rn(a.hi, b.lo);
+    double a_1_x_b_0 = __dmul_rn(a.lo, b.hi);
+    p.lo = __dadd_rn(p.lo, __dadd_rn(a_0_x_b_1, a_1_x_b_0));
+#else
+    p.lo += (a.hi * b.lo + a.lo * b.hi);
+#endif
+    p = quick_two_sum(p.hi, p.lo);
+    return p;
+  }
+
+  __host__ __device__ static f128 add_f64_f64(const double a, const double b) {
+    return two_sum(a, b);
+  }
+
+  __host__ __device__ static f128 f128_floor(const f128 &x) {
+    double x0_floor = floor(x.hi);
+    if (x0_floor == x.hi) {
+      return add_f64_f64(x0_floor, floor(x.lo));
+    }
+
+    return f128(x0_floor, 0.0);
+  }
+
+  __host__ __device__ static void
+  cplx_f128_mul_assign(f128 &c_re, f128 &c_im, const f128 &a_re,
+                       const f128 &a_im, const f128 &b_re, const f128 &b_im) {
+    auto a_re_x_b_re = mul(a_re, b_re);
+    auto a_re_x_b_im = mul(a_re, b_im);
+    auto a_im_x_b_re = mul(a_im, b_re);
+    auto a_im_x_b_im = mul(a_im, b_im);
+
+    c_re = sub_estimate(a_re_x_b_re, a_im_x_b_im);
+    c_im = add_estimate(a_im_x_b_re, a_re_x_b_im);
+  }
+
+  __host__ __device__ static void
+  cplx_f128_sub_assign(f128 &c_re, f128 &c_im, const f128 &a_re,
+                       const f128 &a_im, const f128 &b_re, const f128 &b_im) {
+    c_re = sub_estimate(a_re, b_re);
+    c_im = sub_estimate(a_im, b_im);
+  }
+  __host__ __device__ static void
+  cplx_f128_add_assign(f128 &c_re, f128 &c_im, const f128 &a_re,
+                       const f128 &a_im, const f128 &b_re, const f128 &b_im) {
+    c_re = add_estimate(a_re, b_re);
+    c_im = add_estimate(a_im, b_im);
+  }
+};
+
+struct f128x2 {
+  f128 re;
+  f128 im;
+
+  __host__ __device__ f128x2() : re(), im() {}
+
+  __host__ __device__ f128x2(const f128 &real, const f128 &imag)
+      : re(real), im(imag) {}
+
+  __host__ __device__ f128x2(double real, double imag)
+      : re(real, 0.0), im(imag, 0.0) {}
+
+  __host__ __device__ explicit f128x2(double real)
+      : re(real, 0.0), im(0.0, 0.0) {}
+
+  __host__ __device__ f128x2(const f128x2 &other)
+      : re(other.re), im(other.im) {}
+
+  __host__ __device__ f128x2(f128x2 &&other) noexcept
+      : re(std::move(other.re)), im(std::move(other.im)) {}
+
+  __host__ __device__ f128x2 &operator=(const f128x2 &other) {
+    if (this != &other) {
+      re = other.re;
+      im = other.im;
+    }
+    return *this;
+  }
+
+  __host__ __device__ f128x2 &operator=(f128x2 &&other) noexcept {
+    if (this != &other) {
+      re = std::move(other.re);
+      im = std::move(other.im);
+    }
+    return *this;
+  }
+
+  __host__ __device__ f128x2 conjugate() const {
+    return f128x2(re, f128(-im.hi, -im.lo));
+  }
+
+  __host__ __device__ f128 norm_squared() const {
+    return f128::add(f128::mul(re, re), f128::mul(im, im));
+  }
+
+  __host__ __device__ void zero() {
+    re = f128(0.0, 0.0);
+    im = f128(0.0, 0.0);
+  }
+
+  // Addition
+  __host__ __device__ friend f128x2 operator+(const f128x2 &a,
+                                              const f128x2 &b) {
+    return f128x2(f128::add(a.re, b.re), f128::add(a.im, b.im));
+  }
+
+  // Subtraction
+  __host__ __device__ friend f128x2 operator-(const f128x2 &a,
+                                              const f128x2 &b) {
+    return f128x2(f128::add(a.re, f128(-b.re.hi, -b.re.lo)),
+                  f128::add(a.im, f128(-b.im.hi, -b.im.lo)));
+  }
+
+  // Multiplication (complex multiplication)
+  __host__ __device__ friend f128x2 operator*(const f128x2 &a,
+                                              const f128x2 &b) {
+    f128 real_part =
+        f128::add(f128::mul(a.re, b.re),
+                  f128(-f128::mul(a.im, b.im).hi, -f128::mul(a.im, b.im).lo));
+    f128 imag_part = f128::add(f128::mul(a.re, b.im), f128::mul(a.im, b.re));
+    return f128x2(real_part, imag_part);
+  }
+
+  // Addition-assignment operator
+  __host__ __device__ f128x2 &operator+=(const f128x2 &other) {
+    re = f128::add(re, other.re);
+    im = f128::add(im, other.im);
+    return *this;
+  }
+
+  // Subtraction-assignment operator
+  __host__ __device__ f128x2 &operator-=(const f128x2 &other) {
+    re = f128::add(re, f128(-other.re.hi, -other.re.lo));
+    im = f128::add(im, f128(-other.im.hi, -other.im.lo));
+    return *this;
+  }
+
+  // Multiplication-assignment operator
+  __host__ __device__ f128x2 &operator*=(const f128x2 &other) {
+    f128 new_re =
+        f128::add(f128::mul(re, other.re), f128(-f128::mul(im, other.im).hi,
+                                                -f128::mul(im, other.im).lo));
+    f128 new_im = f128::add(f128::mul(re, other.im), f128::mul(im, other.re));
+    re = new_re;
+    im = new_im;
+    return *this;
+  }
+};
+
+__host__ __device__ inline uint64_t double_to_bits(double d) {
+  uint64_t bits = *reinterpret_cast<uint64_t *>(&d);
+  return bits;
+}
+
+__host__ __device__ inline double bits_to_double(uint64_t bits) {
+  double d = *reinterpret_cast<double *>(&bits);
+  return d;
+}
+
+__host__ __device__ inline double u128_to_f64(__uint128_t x) {
+  const __uint128_t ONE = 1;
+  const double A = ONE << 52;
+  const double B = ONE << 104;
+  const double C = ONE << 76;
+  const double D = 340282366920938500000000000000000000000.;
+
+  const __uint128_t threshold = (ONE << 104);
+
+  if (x < threshold) {
+    uint64_t A_bits = double_to_bits(A);
+
+    __uint128_t shifted = (x << 12);
+    uint64_t lower64 = static_cast<uint64_t>(shifted);
+    lower64 >>= 12;
+
+    uint64_t bits_l = A_bits | lower64;
+    double l_temp = bits_to_double(bits_l);
+    double l = l_temp - A;
+
+    uint64_t B_bits = double_to_bits(B);
+    uint64_t top64 = static_cast<uint64_t>(x >> 52);
+    uint64_t bits_h = B_bits | top64;
+    double h_temp = bits_to_double(bits_h);
+    double h = h_temp - B;
+
+    return (l + h);
+
+  } else {
+    uint64_t C_bits = double_to_bits(C);
+
+    __uint128_t shifted = (x >> 12);
+    uint64_t lower64 = static_cast<uint64_t>(shifted);
+    lower64 >>= 12;
+
+    uint64_t x_lo = static_cast<uint64_t>(x);
+    uint64_t mask_part = (x_lo & 0xFFFFFFULL);
+
+    uint64_t bits_l = C_bits | lower64 | mask_part;
+    double l_temp = bits_to_double(bits_l);
+    double l = l_temp - C;
+
+    uint64_t D_bits = double_to_bits(D);
+    uint64_t top64 = static_cast<uint64_t>(x >> 76);
+    uint64_t bits_h = D_bits | top64;
+    double h_temp = bits_to_double(bits_h);
+    double h = h_temp - D;
+
+    return (l + h);
+  }
+}
+
+__host__ __device__ inline __uint128_t f64_to_u128(const double f) {
+  const __uint128_t ONE = 1;
+  const uint64_t f_bits = double_to_bits(f);
+  if (f_bits < 1023ull << 52) {
+    return 0;
+  } else {
+    const __uint128_t m = ONE << 127 | (__uint128_t)f_bits << 75;
+    const uint64_t s = 1150 - (f_bits >> 52);
+    if (s >= 128) {
+      return 0;
+    } else {
+      return m >> s;
+    }
+  }
+}
+
+__host__ __device__ inline __uint128_t f64_to_i128(const double f) {
+  // Get raw bits of the double
+  const uint64_t f_bits = double_to_bits(f);
+
+  // Remove sign bit (equivalent to Rust's !0 >> 1 mask)
+  const uint64_t a = f_bits & 0x7FFFFFFFFFFFFFFFull;
+
+  // Check if value is in [0, 1) range
+  if (a < (1023ull << 52)) {
+    return 0;
+  }
+
+  // Reconstruct mantissa with implicit leading 1
+  const __uint128_t m =
+      (__uint128_t{1} << 127) | (static_cast<__uint128_t>(a) << 75);
+
+  // Calculate shift amount based on exponent
+  const uint64_t exponent = a >> 52;
+  const uint64_t s = 1150 - exponent;
+
+  // Perform unsigned right shift
+  const __uint128_t u = m >> s;
+
+  // Apply sign (check original sign bit)
+  const __int128_t result = static_cast<__int128_t>(u);
+  return (f_bits >> 63) ? -result : result;
+}
+
+__host__ __device__ inline double i128_to_f64(__int128_t const x) {
+  uint64_t sign = static_cast<uint64_t>(x >> 64) & (1ULL << 63);
+  __uint128_t abs =
+      (x < 0) ? static_cast<__uint128_t>(-x) : static_cast<__uint128_t>(x);
+
+  return bits_to_double(double_to_bits(u128_to_f64(abs)) | sign);
+}
+__host__ __device__ inline f128 u128_to_signed_to_f128(__uint128_t x) {
+  const double first_approx = i128_to_f64(x);
+  const uint64_t sign_bit = double_to_bits(first_approx) & (1ull << 63);
+  const __uint128_t first_approx_roundtrip =
+      f64_to_u128((first_approx < 0) ? -first_approx : first_approx);
+  const __uint128_t first_approx_roundtrip_signed =
+      (sign_bit == (1ull << 63)) ? -first_approx_roundtrip
+                                 : first_approx_roundtrip;
+
+  double correction = i128_to_f64(x - first_approx_roundtrip_signed);
+
+  return f128(first_approx, correction);
+}
+
+__host__ __device__ inline __uint128_t u128_from_torus_f128(const f128 &a) {
+  auto x = f128::sub_estimate(a, f128::f128_floor(a));
+  const double normalization = 340282366920938500000000000000000000000.;
+#ifdef __CUDA_ARCH__
+  x.hi = __dmul_rn(x.hi, normalization);
+  x.lo = __dmul_rn(x.lo, normalization);
+#else
+  x.hi *= normalization;
+  x.lo *= normalization;
+#endif
+
+  // TODO: this should round to nearest instead of flooring
+  x = f128::f128_floor(x);
+
+  __uint128_t x0 = f64_to_u128(x.hi);
+  __int128_t x1 = f64_to_i128(x.lo);
+
+  return x0 + x1;
+}
+
+#endif
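Review note: the f128 type above is a double-double representation, built on error-free transformations: two_sum and two_prod return the rounded result in hi and the exact rounding error in lo. The host-only sketch below mirrors the non-CUDA branches of those two helpers so the property can be checked on a CPU. `dd` and `dd_example` are hypothetical names used only for this note, and the sketch assumes IEEE-754 doubles compiled without fast-math style reassociation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-only mirror of f128: value = hi + lo, with |lo| tiny.
struct dd {
  double hi;
  double lo;
};

// Knuth two-sum: hi + lo equals a + b exactly, for any ordering of a and b.
inline dd two_sum(double a, double b) {
  double s = a + b;
  double bb = s - a;
  return dd{s, (a - (s - bb)) + (b - bb)};
}

// Two-product via FMA: hi + lo equals a * b exactly (barring over/underflow).
inline dd two_prod(double a, double b) {
  double p = a * b;
  return dd{p, std::fma(a, b, -p)};
}

// Small self-check showing that the low word recovers what rounding dropped.
inline void dd_example() {
  // 1e-20 is far below half an ulp of 1.0, so the rounded sum is exactly 1.0
  // and the whole addend ends up in the low word.
  dd s = two_sum(1.0, 1e-20);
  assert(s.hi == 1.0);
  assert(s.lo == 1e-20);

  // (1 + 2^-30)^2 = 1 + 2^-29 + 2^-60; the 2^-60 term does not fit in one
  // double and is returned as the error term.
  double a = 1.0 + std::ldexp(1.0, -30);
  dd p = two_prod(a, a);
  assert(p.hi == 1.0 + std::ldexp(1.0, -29));
  assert(p.lo == std::ldexp(1.0, -60));
}
```

The device branches in the diff use the `__dadd_rn`, `__dsub_rn`, `__dmul_rn` and `__fma_rn` intrinsics for the same sequences, which prevents the compiler from fusing or reordering the operations that these transformations depend on.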
--- a/backends/tfhe-cuda-backend/cuda/src/fft128/fft128.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/fft128/fft128.cu
@@ -0,0 +1,163 @@
+#include "fft128.cuh"
+
+void cuda_fourier_transform_forward_as_integer_f128_async(
+    void *stream, uint32_t gpu_index, void *re0, void *re1, void *im0,
+    void *im1, void const *standard, const uint32_t N,
+    const uint32_t number_of_samples) {
+  switch (N) {
+  case 64:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<64>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 128:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<128>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 256:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<256>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 512:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<512>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 1024:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<1024>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 2048:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<2048>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 4096:
+    host_fourier_transform_forward_as_integer_f128<AmortizedDegree<4096>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  default:
+    PANIC("Cuda error (f128 fft): unsupported polynomial size. Supported "
+          "N's are powers of two"
+          " in the interval [64..4096].")
+  }
+}
+
+void cuda_fourier_transform_forward_as_torus_f128_async(
+    void *stream, uint32_t gpu_index, void *re0, void *re1, void *im0,
+    void *im1, void const *standard, const uint32_t N,
+    const uint32_t number_of_samples) {
+  switch (N) {
+  case 64:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<64>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 128:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<128>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 256:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<256>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 512:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<512>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 1024:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<1024>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 2048:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<2048>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  case 4096:
+    host_fourier_transform_forward_as_torus_f128<AmortizedDegree<4096>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (double *)re0,
+        (double *)re1, (double *)im0, (double *)im1,
+        (__uint128_t const *)standard, N, number_of_samples);
+    break;
+  default:
+    PANIC("Cuda error (f128 fft): unsupported polynomial size. Supported "
+          "N's are powers of two"
+          " in the interval [64..4096].")
+  }
+}
+
+void cuda_fourier_transform_backward_as_torus_f128_async(
+    void *stream, uint32_t gpu_index, void *standard, void const *re0,
+    void const *re1, void const *im0, void const *im1, const uint32_t N,
+    const uint32_t number_of_samples) {
+  switch (N) {
+  case 64:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<64>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 128:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<128>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 256:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<256>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 512:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<512>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 1024:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<1024>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 2048:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<2048>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  case 4096:
+    host_fourier_transform_backward_as_torus_f128<AmortizedDegree<4096>>(
+        static_cast<cudaStream_t>(stream), gpu_index, (__uint128_t *)standard,
+        (double const *)re0, (double const *)re1, (double const *)im0,
+        (double const *)im1, N, number_of_samples);
+    break;
+  default:
+    PANIC("Cuda error (f128 ifft): unsupported polynomial size. Supported "
+          "N's are powers of two"
+          " in the interval [64..4096].")
+  }
+}
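Review note: each of the three entry points above maps a runtime polynomial size N onto a compile-time template parameter through a switch. The sketch below shows that pattern in plain C++ with a stub in place of the real host functions. `Degree`, `run_stub` and `dispatch_by_degree` are hypothetical names for this note and only stand in for `AmortizedDegree` and the `host_fourier_transform_*` templates.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for AmortizedDegree<N>: carries N as a constant.
template <uint32_t N> struct Degree {
  static constexpr uint32_t degree = N;
};

// Stub for a templated host function; returns the compressed polynomial
// size (N / 2) so the dispatch result can be observed.
template <class params> uint32_t run_stub() { return params::degree / 2; }

// Runtime-to-compile-time dispatch over the supported powers of two.
inline uint32_t dispatch_by_degree(uint32_t n) {
  switch (n) {
  case 64:
    return run_stub<Degree<64>>();
  case 128:
    return run_stub<Degree<128>>();
  case 256:
    return run_stub<Degree<256>>();
  case 512:
    return run_stub<Degree<512>>();
  case 1024:
    return run_stub<Degree<1024>>();
  case 2048:
    return run_stub<Degree<2048>>();
  case 4096:
    return run_stub<Degree<4096>>();
  default:
    throw std::invalid_argument("unsupported polynomial size");
  }
}
```

Since the three switches in the diff differ only in the callee and argument order, a shared dispatcher of this kind could reduce the repetition, although the explicit form keeps each entry point easy to read.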
--- a/backends/tfhe-cuda-backend/cuda/src/fft128/fft128.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/fft128/fft128.cuh
@@ -0,0 +1,662 @@
+#ifndef CUDA_FFT128_CUH
+#define CUDA_FFT128_CUH
+
+#include "f128.cuh"
+#include "fft/fft128.h"
+#include "polynomial/functions.cuh"
+#include "polynomial/parameters.cuh"
+#include "twiddles.cuh"
+#include "types/complex/operations.cuh"
+#include <iostream>
+
+using Index = unsigned;
+
+#define NEG_TWID(i)                                                            \
+  f128x2(f128(neg_twiddles_re_hi[(i)], neg_twiddles_re_lo[(i)]),               \
+         f128(neg_twiddles_im_hi[(i)], neg_twiddles_im_lo[(i)]))
+
+#define F64x4_TO_F128x2(f128x2_reg, ind)                                       \
+  f128x2_reg.re.hi = dt_re_hi[ind];                                            \
+  f128x2_reg.re.lo = dt_re_lo[ind];                                            \
+  f128x2_reg.im.hi = dt_im_hi[ind];                                            \
+  f128x2_reg.im.lo = dt_im_lo[ind]
+
+#define F128x2_TO_F64x4(f128x2_reg, ind)                                       \
+  dt_re_hi[ind] = f128x2_reg.re.hi;                                            \
+  dt_re_lo[ind] = f128x2_reg.re.lo;                                            \
+  dt_im_hi[ind] = f128x2_reg.im.hi;                                            \
+  dt_im_lo[ind] = f128x2_reg.im.lo
+
+template <class params>
+__device__ void negacyclic_forward_fft_f128(double *dt_re_hi, double *dt_re_lo,
+                                            double *dt_im_hi,
+                                            double *dt_im_lo) {
+
+  __syncthreads();
+  constexpr Index BUTTERFLY_DEPTH = params::opt >> 1;
+  constexpr Index LOG2_DEGREE = params::log2_degree;
+  constexpr Index HALF_DEGREE = params::degree >> 1;
+  constexpr Index STRIDE = params::degree / params::opt;
+
+  f128x2 u[BUTTERFLY_DEPTH], v[BUTTERFLY_DEPTH], w;
+
+  Index tid = threadIdx.x;
+
+  // load into registers
+#pragma unroll
+  for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+    F64x4_TO_F128x2(u[i], tid);
+    F64x4_TO_F128x2(v[i], tid + HALF_DEGREE);
+    tid += STRIDE;
+  }
+
+  // level 1
+  // level 1 uses a single twiddle whose real and imaginary parts are equal,
+  // so the multiplication could be done with simpler operations; for now a
+  // full complex multiplication is used
+#pragma unroll
+  for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+    auto ww = NEG_TWID(1);
+    f128::cplx_f128_mul_assign(w.re, w.im, v[i].re, v[i].im, NEG_TWID(1).re,
+                               NEG_TWID(1).im);
+    f128::cplx_f128_sub_assign(v[i].re, v[i].im, u[i].re, u[i].im, w.re, w.im);
+    f128::cplx_f128_add_assign(u[i].re, u[i].im, u[i].re, u[i].im, w.re, w.im);
+  }
+
+  Index twiddle_shift = 1;
+  for (Index l = LOG2_DEGREE - 1; l >= 1; --l) {
+    Index lane_mask = 1 << (l - 1);
+    Index thread_mask = (1 << l) - 1;
+    twiddle_shift <<= 1;
+
+    tid = threadIdx.x;
+    __syncthreads();
+#pragma unroll
+    for (Index i = 0; i < BUTTERFLY_DEPTH; i++) {
+      Index rank = tid & thread_mask;
+      bool u_stays_in_register = rank < lane_mask;
+      F128x2_TO_F64x4(((u_stays_in_register) ? v[i] : u[i]), tid);
+      tid = tid + STRIDE;
+    }
+    __syncthreads();
+
+    tid = threadIdx.x;
+#pragma unroll
+    for (Index i = 0; i < BUTTERFLY_DEPTH; i++) {
+      Index rank = tid & thread_mask;
+      bool u_stays_in_register = rank < lane_mask;
+      F64x4_TO_F128x2(w, tid ^ lane_mask);
+      u[i] = (u_stays_in_register) ? u[i] : w;
+      v[i] = (u_stays_in_register) ? w : v[i];
+      w = NEG_TWID(tid / lane_mask + twiddle_shift);
+      f128::cplx_f128_mul_assign(w.re, w.im, v[i].re, v[i].im, w.re, w.im);
+      f128::cplx_f128_sub_assign(v[i].re, v[i].im, u[i].re, u[i].im, w.re,
+                                 w.im);
+      f128::cplx_f128_add_assign(u[i].re, u[i].im, u[i].re, u[i].im, w.re,
+                                 w.im);
+      tid = tid + STRIDE;
+    }
+  }
+  __syncthreads();
+
+  // store registers in SM
+  tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < BUTTERFLY_DEPTH; i++) {
+    F128x2_TO_F64x4(u[i], tid * 2);
+    F128x2_TO_F64x4(v[i], (tid * 2 + 1));
+    tid = tid + STRIDE;
+  }
+  __syncthreads();
+}
+
+template <class params>
+__device__ void negacyclic_backward_fft_f128(double *dt_re_hi, double *dt_re_lo,
+                                             double *dt_im_hi,
+                                             double *dt_im_lo) {
+  __syncthreads();
+  constexpr Index BUTTERFLY_DEPTH = params::opt >> 1;
+  constexpr Index LOG2_DEGREE = params::log2_degree;
+  constexpr Index DEGREE = params::degree;
+  constexpr Index HALF_DEGREE = params::degree >> 1;
+  constexpr Index STRIDE = params::degree / params::opt;
+
+  size_t tid = threadIdx.x;
+  f128x2 u[BUTTERFLY_DEPTH], v[BUTTERFLY_DEPTH], w;
+
+  // load into registers (no scaling here; see convert_f128_to_u128_as_torus)
+#pragma unroll
+  for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+    F64x4_TO_F128x2(u[i], 2 * tid);
+    F64x4_TO_F128x2(v[i], 2 * tid + 1);
+    tid += STRIDE;
+  }
+
+  Index twiddle_shift = DEGREE;
+  for (Index l = 1; l <= LOG2_DEGREE - 1; ++l) {
+    Index lane_mask = 1 << (l - 1);
+    Index thread_mask = (1 << l) - 1;
+    tid = threadIdx.x;
+    twiddle_shift >>= 1;
+
+    // at this point registers are ready for the butterfly
+    tid = threadIdx.x;
+    __syncthreads();
+#pragma unroll
+    for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+      w = (u[i] - v[i]);
+      u[i] += v[i];
+      v[i] = w * NEG_TWID(tid / lane_mask + twiddle_shift).conjugate();
+
+      // keep one register for the next iteration, store the other in SM
+      Index rank = tid & thread_mask;
+      bool u_stays_in_register = rank < lane_mask;
+      F128x2_TO_F64x4(((u_stays_in_register) ? v[i] : u[i]), tid);
+
+      tid = tid + STRIDE;
+    }
+    __syncthreads();
+
+    // prepare registers for next butterfly iteration
+    tid = threadIdx.x;
+#pragma unroll
+    for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+      Index rank = tid & thread_mask;
+      bool u_stays_in_register = rank < lane_mask;
+      F64x4_TO_F128x2(w, tid ^ lane_mask);
+
+      u[i] = (u_stays_in_register) ? u[i] : w;
+      v[i] = (u_stays_in_register) ? w : v[i];
+
+      tid = tid + STRIDE;
+    }
+  }
+
+  // last iteration
+  for (Index i = 0; i < BUTTERFLY_DEPTH; ++i) {
+    w = (u[i] - v[i]);
+    u[i] = u[i] + v[i];
+    v[i] = w * NEG_TWID(1).conjugate();
+  }
+  __syncthreads();
+  // store registers in SM
+  tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < BUTTERFLY_DEPTH; i++) {
+    F128x2_TO_F64x4(u[i], tid);
+    F128x2_TO_F64x4(v[i], tid + HALF_DEGREE);
+
+    tid = tid + STRIDE;
+  }
+  __syncthreads();
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__device__ void convert_u128_to_f128_as_integer(
+    double *out_re_hi, double *out_re_lo, double *out_im_hi, double *out_im_lo,
+    const __uint128_t *in_re, const __uint128_t *in_im) {
+
+  Index tid = threadIdx.x;
+  // #pragma unroll
+  for (Index i = 0; i < params::opt / 2; i++) {
+    auto out_re = u128_to_signed_to_f128(in_re[tid]);
+    auto out_im = u128_to_signed_to_f128(in_im[tid]);
+
+    out_re_hi[tid] = out_re.hi;
+    out_re_lo[tid] = out_re.lo;
+    out_im_hi[tid] = out_im.hi;
+    out_im_lo[tid] = out_im.lo;
+
+    tid += params::degree / params::opt;
+  }
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__device__ void convert_u128_to_f128_as_torus(
+    double *out_re_hi, double *out_re_lo, double *out_im_hi, double *out_im_lo,
+    const __uint128_t *in_re, const __uint128_t *in_im) {
+
+  const double normalization = pow(2., -128.);
+  Index tid = threadIdx.x;
+  // #pragma unroll
+  for (Index i = 0; i < params::opt / 2; i++) {
+    auto out_re = u128_to_signed_to_f128(in_re[tid]);
+    auto out_im = u128_to_signed_to_f128(in_im[tid]);
+
+    out_re_hi[tid] = out_re.hi * normalization;
+    out_re_lo[tid] = out_re.lo * normalization;
+    out_im_hi[tid] = out_im.hi * normalization;
+    out_im_lo[tid] = out_im.lo * normalization;
+
+    tid += params::degree / params::opt;
+  }
+}
+
+template <class params>
+__device__ void
+convert_f128_to_u128_as_torus(__uint128_t *out_re, __uint128_t *out_im,
+                              const double *in_re_hi, const double *in_re_lo,
+                              const double *in_im_hi, const double *in_im_lo) {
+
+  const double normalization = 1. / (params::degree / 2);
+  Index tid = threadIdx.x;
+  // #pragma unroll
+  for (Index i = 0; i < params::opt / 2; i++) {
+
+    f128 in_re(in_re_hi[tid] * normalization, in_re_lo[tid] * normalization);
+    f128 in_im(in_im_hi[tid] * normalization, in_im_lo[tid] * normalization);
+
+    out_re[tid] = u128_from_torus_f128(in_re);
+    out_im[tid] = u128_from_torus_f128(in_im);
+
+    tid += params::degree / params::opt;
+  }
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__global__ void
+batch_convert_u128_to_f128_as_integer(double *out_re_hi, double *out_re_lo,
+                                      double *out_im_hi, double *out_im_lo,
+                                      const __uint128_t *in) {
+
+  convert_u128_to_f128_as_integer<params>(
+      &out_re_hi[blockIdx.x * params::degree / 2],
+      &out_re_lo[blockIdx.x * params::degree / 2],
+      &out_im_hi[blockIdx.x * params::degree / 2],
+      &out_im_lo[blockIdx.x * params::degree / 2],
+      &in[blockIdx.x * params::degree],
+      &in[blockIdx.x * params::degree + params::degree / 2]);
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__global__ void
+batch_convert_u128_to_f128_as_torus(double *out_re_hi, double *out_re_lo,
+                                    double *out_im_hi, double *out_im_lo,
+                                    const __uint128_t *in) {
+
+  convert_u128_to_f128_as_torus<params>(
+      &out_re_hi[blockIdx.x * params::degree / 2],
+      &out_re_lo[blockIdx.x * params::degree / 2],
+      &out_im_hi[blockIdx.x * params::degree / 2],
+      &out_im_lo[blockIdx.x * params::degree / 2],
+      &in[blockIdx.x * params::degree],
+      &in[blockIdx.x * params::degree + params::degree / 2]);
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__global__ void
+batch_convert_u128_to_f128_strided_as_torus(double *d_out,
+                                            const __uint128_t *d_in) {
+
+  constexpr size_t chunk_size = params::degree / 2 * 4;
+  double *chunk = &d_out[blockIdx.x * chunk_size];
+  double *out_re_hi = &chunk[0ULL * params::degree / 2];
+  double *out_re_lo = &chunk[1ULL * params::degree / 2];
+  double *out_im_hi = &chunk[2ULL * params::degree / 2];
+  double *out_im_lo = &chunk[3ULL * params::degree / 2];
+
+  convert_u128_to_f128_as_torus<params>(
+      out_re_hi, out_re_lo, out_im_hi, out_im_lo,
+      &d_in[blockIdx.x * params::degree],
+      &d_in[blockIdx.x * params::degree + params::degree / 2]);
+}
+
+// params is expected to be full degree not half degree
+template <class params>
+__global__ void batch_convert_f128_to_u128_as_torus(__uint128_t *out,
+                                                    const double *in_re_hi,
+                                                    const double *in_re_lo,
+                                                    const double *in_im_hi,
+                                                    const double *in_im_lo) {
+
+  convert_f128_to_u128_as_torus<params>(
+      &out[blockIdx.x * params::degree],
+      &out[blockIdx.x * params::degree + params::degree / 2],
+      &in_re_hi[blockIdx.x * params::degree / 2],
+      &in_re_lo[blockIdx.x * params::degree / 2],
+      &in_im_hi[blockIdx.x * params::degree / 2],
+      &in_im_lo[blockIdx.x * params::degree / 2]);
+}
+
+template <class params, sharedMemDegree SMD>
+__global__ void
+batch_NSMFFT_128(double *in_re_hi, double *in_re_lo, double *in_im_hi,
+                 double *in_im_lo, double *out_re_hi, double *out_re_lo,
+                 double *out_im_hi, double *out_im_lo, double *buffer) {
+  extern __shared__ double sharedMemoryFFT128[];
+  double *re_hi, *re_lo, *im_hi, *im_lo;
+
+  if (SMD == NOSM) {
+    re_hi =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 0];
+    re_lo =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 1];
+    im_hi =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 2];
+    im_lo =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 3];
+  } else {
+    re_hi = &sharedMemoryFFT128[params::degree / 2 * 0];
+    re_lo = &sharedMemoryFFT128[params::degree / 2 * 1];
+    im_hi = &sharedMemoryFFT128[params::degree / 2 * 2];
+    im_lo = &sharedMemoryFFT128[params::degree / 2 * 3];
+  }
+
+  Index tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < params::opt / 2; ++i) {
+    re_hi[tid] = in_re_hi[blockIdx.x * (params::degree / 2) + tid];
+    re_lo[tid] = in_re_lo[blockIdx.x * (params::degree / 2) + tid];
+    im_hi[tid] = in_im_hi[blockIdx.x * (params::degree / 2) + tid];
+    im_lo[tid] = in_im_lo[blockIdx.x * (params::degree / 2) + tid];
+    tid += params::degree / params::opt;
+  }
+  __syncthreads();
+  if constexpr (params::fft_direction == 1) {
+    negacyclic_backward_fft_f128<HalfDegree<params>>(re_hi, re_lo, im_hi,
+                                                     im_lo);
+  } else {
+    negacyclic_forward_fft_f128<HalfDegree<params>>(re_hi, re_lo, im_hi, im_lo);
+  }
+  __syncthreads();
+  tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < params::opt / 2; ++i) {
+    out_re_hi[blockIdx.x * (params::degree / 2) + tid] = re_hi[tid];
+    out_re_lo[blockIdx.x * (params::degree / 2) + tid] = re_lo[tid];
+    out_im_hi[blockIdx.x * (params::degree / 2) + tid] = im_hi[tid];
+    out_im_lo[blockIdx.x * (params::degree / 2) + tid] = im_lo[tid];
+    tid += params::degree / params::opt;
+  }
+}
+
+template <class params, sharedMemDegree SMD>
+__global__ void batch_NSMFFT_strided_128(double *d_in, double *d_out,
+                                         double *buffer) {
+  extern __shared__ double sharedMemoryFFT128[];
+  double *re_hi, *re_lo, *im_hi, *im_lo;
+
+  if (SMD == NOSM) {
+    re_hi =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 0];
+    re_lo =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 1];
+    im_hi =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 2];
+    im_lo =
+        &buffer[blockIdx.x * params::degree / 2 * 4 + params::degree / 2 * 3];
+  } else {
+    re_hi = &sharedMemoryFFT128[params::degree / 2 * 0];
+    re_lo = &sharedMemoryFFT128[params::degree / 2 * 1];
+    im_hi = &sharedMemoryFFT128[params::degree / 2 * 2];
+    im_lo = &sharedMemoryFFT128[params::degree / 2 * 3];
+  }
+
+  constexpr size_t chunk_size = params::degree / 2 * 4;
+  double *chunk = &d_in[blockIdx.x * chunk_size];
+  double *tmp_re_hi = &chunk[0ULL * params::degree / 2];
+  double *tmp_re_lo = &chunk[1ULL * params::degree / 2];
+  double *tmp_im_hi = &chunk[2ULL * params::degree / 2];
+  double *tmp_im_lo = &chunk[3ULL * params::degree / 2];
+
+  Index tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < params::opt / 2; ++i) {
+    re_hi[tid] = tmp_re_hi[tid];
+    re_lo[tid] = tmp_re_lo[tid];
+    im_hi[tid] = tmp_im_hi[tid];
+    im_lo[tid] = tmp_im_lo[tid];
+    tid += params::degree / params::opt;
+  }
+  __syncthreads();
+  if constexpr (params::fft_direction == 1) {
+    negacyclic_backward_fft_f128<HalfDegree<params>>(re_hi, re_lo, im_hi,
+                                                     im_lo);
+  } else {
+    negacyclic_forward_fft_f128<HalfDegree<params>>(re_hi, re_lo, im_hi, im_lo);
+  }
+  __syncthreads();
+
+  chunk = &d_out[blockIdx.x * chunk_size];
+  tmp_re_hi = &chunk[0ULL * params::degree / 2];
+  tmp_re_lo = &chunk[1ULL * params::degree / 2];
+  tmp_im_hi = &chunk[2ULL * params::degree / 2];
+  tmp_im_lo = &chunk[3ULL * params::degree / 2];
+
+  tid = threadIdx.x;
+#pragma unroll
+  for (Index i = 0; i < params::opt / 2; ++i) {
+    tmp_re_hi[tid] = re_hi[tid];
+    tmp_re_lo[tid] = re_lo[tid];
+    tmp_im_hi[tid] = im_hi[tid];
+    tmp_im_lo[tid] = im_lo[tid];
+    tid += params::degree / params::opt;
+  }
+}
+
+template <class params>
+__host__ void host_fourier_transform_forward_as_integer_f128(
+    cudaStream_t stream, uint32_t gpu_index, double *re0, double *re1,
+    double *im0, double *im1, const __uint128_t *standard, const uint32_t N,
+    const uint32_t number_of_samples) {
+
+  // allocate device buffers
+  double *d_re0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_re1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  __uint128_t *d_standard = (__uint128_t *)cuda_malloc_async(
+      N * sizeof(__uint128_t), stream, gpu_index);
+
+  // copy input into device
+  cuda_memcpy_async_to_gpu(d_standard, standard, N * sizeof(__uint128_t),
+                           stream, gpu_index);
+
+  // setup launch parameters
+  size_t required_shared_memory_size = sizeof(double) * N / 2 * 4;
+  int grid_size = number_of_samples;
+  int block_size = params::degree / params::opt;
+  bool full_sm =
+      (required_shared_memory_size <= cuda_get_max_shared_memory(gpu_index));
+  size_t buffer_size =
+      full_sm ? 0 : (size_t)number_of_samples * required_shared_memory_size;
+  size_t shared_memory_size = full_sm ? required_shared_memory_size : 0;
+  double *buffer = (double *)cuda_malloc_async(buffer_size, stream, gpu_index);
+
+  // configure shared memory for batch fft kernel
+  if (full_sm) {
+    check_cuda_error(cudaFuncSetAttribute(
+        batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>,
+        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
+    check_cuda_error(cudaFuncSetCacheConfig(
+        batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>,
+        cudaFuncCachePreferShared));
+  }
+
+  // convert u128 into 4 x double
+  batch_convert_u128_to_f128_as_integer<params>
+      <<<grid_size, block_size, 0, stream>>>(d_re0, d_re1, d_im0, d_im1,
+                                             d_standard);
+
+  // call negacyclic 128 bit forward fft.
+  if (full_sm) {
+    batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  } else {
+    batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, NOSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  }
+
+  cuda_memcpy_async_to_cpu(re0, d_re0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(re1, d_re1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(im0, d_im0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(im1, d_im1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+
+  cuda_drop_async(d_standard, stream, gpu_index);
+  cuda_drop_async(d_re0, stream, gpu_index);
+  cuda_drop_async(d_re1, stream, gpu_index);
+  cuda_drop_async(d_im0, stream, gpu_index);
+  cuda_drop_async(d_im1, stream, gpu_index);
+}
+
+template <class params>
+__host__ void host_fourier_transform_forward_as_torus_f128(
+    cudaStream_t stream, uint32_t gpu_index, double *re0, double *re1,
+    double *im0, double *im1, const __uint128_t *standard, const uint32_t N,
+    const uint32_t number_of_samples) {
+
+  // allocate device buffers
+  double *d_re0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_re1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  __uint128_t *d_standard = (__uint128_t *)cuda_malloc_async(
+      N * sizeof(__uint128_t), stream, gpu_index);
+
+  // copy input into device
+  cuda_memcpy_async_to_gpu(d_standard, standard, N * sizeof(__uint128_t),
+                           stream, gpu_index);
+
+  // setup launch parameters
+  size_t required_shared_memory_size = sizeof(double) * N / 2 * 4;
+  int grid_size = number_of_samples;
+  int block_size = params::degree / params::opt;
+  bool full_sm =
+      (required_shared_memory_size <= cuda_get_max_shared_memory(gpu_index));
+  size_t buffer_size =
+      full_sm ? 0 : (size_t)number_of_samples * required_shared_memory_size;
+  size_t shared_memory_size = full_sm ? required_shared_memory_size : 0;
+  double *buffer = (double *)cuda_malloc_async(buffer_size, stream, gpu_index);
+
+  // configure shared memory for batch fft kernel
+  if (full_sm) {
+    check_cuda_error(cudaFuncSetAttribute(
+        batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>,
+        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
+    check_cuda_error(cudaFuncSetCacheConfig(
+        batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>,
+        cudaFuncCachePreferShared));
+  }
+
+  // convert u128 into 4 x double
+  batch_convert_u128_to_f128_as_torus<params>
+      <<<grid_size, block_size, 0, stream>>>(d_re0, d_re1, d_im0, d_im1,
+                                             d_standard);
+
+  // call negacyclic 128 bit forward fft.
+  if (full_sm) {
+    batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, FULLSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  } else {
+    batch_NSMFFT_128<FFTDegree<params, ForwardFFT>, NOSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  }
+
+  cuda_memcpy_async_to_cpu(re0, d_re0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(re1, d_re1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(im0, d_im0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_cpu(im1, d_im1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+
+  cuda_drop_async(d_standard, stream, gpu_index);
+  cuda_drop_async(d_re0, stream, gpu_index);
+  cuda_drop_async(d_re1, stream, gpu_index);
+  cuda_drop_async(d_im0, stream, gpu_index);
+  cuda_drop_async(d_im1, stream, gpu_index);
+}
+
+template <class params>
+__host__ void host_fourier_transform_backward_as_torus_f128(
+    cudaStream_t stream, uint32_t gpu_index, __uint128_t *standard,
+    double const *re0, double const *re1, double const *im0, double const *im1,
+    const uint32_t N, const uint32_t number_of_samples) {
+
+  // allocate device buffers
+  double *d_re0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_re1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im0 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  double *d_im1 =
+      (double *)cuda_malloc_async(N / 2 * sizeof(double), stream, gpu_index);
+  __uint128_t *d_standard = (__uint128_t *)cuda_malloc_async(
+      N * sizeof(__uint128_t), stream, gpu_index);
+
+  // copy input into device
+  cuda_memcpy_async_to_gpu(d_re0, re0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_gpu(d_re1, re1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_gpu(d_im0, im0, N / 2 * sizeof(double), stream,
+                           gpu_index);
+  cuda_memcpy_async_to_gpu(d_im1, im1, N / 2 * sizeof(double), stream,
+                           gpu_index);
+
+  // setup launch parameters
+  size_t required_shared_memory_size = sizeof(double) * N / 2 * 4;
+  int grid_size = number_of_samples;
+  int block_size = params::degree / params::opt;
+  bool full_sm =
+      (required_shared_memory_size <= cuda_get_max_shared_memory(gpu_index));
+  size_t buffer_size =
+      full_sm ? 0 : (size_t)number_of_samples * required_shared_memory_size;
+  size_t shared_memory_size = full_sm ? required_shared_memory_size : 0;
+  double *buffer = (double *)cuda_malloc_async(buffer_size, stream, gpu_index);
+
+  // configure shared memory for batch fft kernel
+  if (full_sm) {
+    check_cuda_error(cudaFuncSetAttribute(
+        batch_NSMFFT_128<FFTDegree<params, BackwardFFT>, FULLSM>,
+        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
+    check_cuda_error(cudaFuncSetCacheConfig(
+        batch_NSMFFT_128<FFTDegree<params, BackwardFFT>, FULLSM>,
+        cudaFuncCachePreferShared));
+    batch_NSMFFT_128<FFTDegree<params, BackwardFFT>, FULLSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  } else {
+    batch_NSMFFT_128<FFTDegree<params, BackwardFFT>, NOSM>
+        <<<grid_size, block_size, shared_memory_size, stream>>>(
+            d_re0, d_re1, d_im0, d_im1, d_re0, d_re1, d_im0, d_im1, buffer);
+  }
+
+  batch_convert_f128_to_u128_as_torus<params>
+      <<<grid_size, block_size, 0, stream>>>(d_standard, d_re0, d_re1, d_im0,
+                                             d_im1);
+
+  cuda_memcpy_async_to_cpu(standard, d_standard, N * sizeof(__uint128_t),
+                           stream, gpu_index);
+  cuda_drop_async(d_standard, stream, gpu_index);
+  cuda_drop_async(d_re0, stream, gpu_index);
+  cuda_drop_async(d_re1, stream, gpu_index);
+  cuda_drop_async(d_im0, stream, gpu_index);
+  cuda_drop_async(d_im1, stream, gpu_index);
+}
+
+#undef NEG_TWID
+#undef F64x4_TO_F128x2
+#undef F128x2_TO_F64x4
+
+#endif // TFHE_RS_BACKENDS_TFHE_CUDA_BACKEND_CUDA_SRC_FFT128_FFT128_CUH_
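Note on the sizing logic in fft128.cuh above: each sample is stored as four planes of N / 2 doubles (re_hi, re_lo, im_hi, im_lo), and the host functions fall back to a global-memory scratch buffer when those four planes do not fit in shared memory. The following is a minimal host-only C++ sketch of that arithmetic. The names `Fft128Sizes`, `fft128_sizes`, `strided_plane_offset` and `check_fft128_sizes` are hypothetical and do not exist in the backend, and the 49152-byte shared-memory limit in the checks is an assumed example value, not something queried from a device.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of the backend): byte sizes for one launch.
struct Fft128Sizes {
  std::size_t per_sample_bytes; // 4 planes of N / 2 doubles
  std::size_t shared_bytes;     // dynamic shared memory requested per block
  std::size_t global_bytes;     // fallback scratch buffer in global memory
  bool full_sm;                 // true when one sample fits in shared memory
};

constexpr Fft128Sizes fft128_sizes(std::size_t N, std::size_t number_of_samples,
                                   std::size_t max_shared_bytes) {
  const std::size_t per_sample = sizeof(double) * (N / 2) * 4;
  const bool full_sm = per_sample <= max_shared_bytes;
  return Fft128Sizes{per_sample,
                     full_sm ? per_sample : 0,
                     full_sm ? 0 : number_of_samples * per_sample,
                     full_sm};
}

// Offset, in doubles, of one plane of one sample in the strided layout.
// plane: 0 = re_hi, 1 = re_lo, 2 = im_hi, 3 = im_lo.
constexpr std::size_t strided_plane_offset(std::size_t N, std::size_t sample,
                                           std::size_t plane) {
  const std::size_t chunk_size = N / 2 * 4; // doubles per sample
  return sample * chunk_size + plane * (N / 2);
}

// Runtime checks for two polynomial sizes, assuming a 49152-byte limit.
inline void check_fft128_sizes() {
  const Fft128Sizes small = fft128_sizes(2048, 3, 49152);
  assert(small.full_sm && small.shared_bytes == 32768);
  const Fft128Sizes large = fft128_sizes(4096, 3, 49152);
  assert(!large.full_sm && large.global_bytes == 3 * 65536);
  assert(strided_plane_offset(2048, 1, 2) == 4096 + 2048);
}
```

The fallback size is expressed in bytes here because the allocation helpers in the diff are called with byte counts elsewhere (for example `N / 2 * sizeof(double)`).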
--- a/backends/tfhe-cuda-backend/cuda/src/fft128/twiddles.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/fft128/twiddles.cu
--- a/backends/tfhe-cuda-backend/cuda/src/fft128/twiddles.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/fft128/twiddles.cuh
@@ -0,0 +1,11 @@
+#ifndef CUDA_FFT128_TWIDDLES_CUH
+#define CUDA_FFT128_TWIDDLES_CUH
+
+/*
+ * 'negtwiddles' are stored in device memory to benefit from caching
+ */
+extern __device__ double neg_twiddles_re_hi[4096];
+extern __device__ double neg_twiddles_re_lo[4096];
+extern __device__ double neg_twiddles_im_hi[4096];
+extern __device__ double neg_twiddles_im_lo[4096];
+#endif
--- a/backends/tfhe-cuda-backend/cuda/src/integer/abs.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/abs.cuh
@@ -48,7 +48,7 @@ __host__ void legacy_host_integer_abs_kb_async(
  cuda_memcpy_async_gpu_to_gpu(mask, ct, num_blocks * big_lwe_size_bytes,
                               streams[0], gpu_indexes[0]);

-  host_integer_radix_arithmetic_scalar_shift_kb_inplace<Torus>(
+  legacy_host_integer_radix_arithmetic_scalar_shift_kb_inplace<Torus>(
      streams, gpu_indexes, gpu_count, mask, num_bits_in_ciphertext - 1,
      mem_ptr->arithmetic_scalar_shift_mem, bsks, ksks, num_blocks);
  legacy_host_addition<Torus>(streams[0], gpu_indexes[0], ct, mask, ct,
@@ -84,9 +84,8 @@ host_integer_abs_kb(cudaStream_t const *streams, uint32_t const *gpu_indexes,
  copy_radix_ciphertext_async<Torus>(streams[0], gpu_indexes[0], mask, ct);

  host_integer_radix_arithmetic_scalar_shift_kb_inplace<Torus>(
-      streams, gpu_indexes, gpu_count, (Torus *)(mask->ptr),
-      num_bits_in_ciphertext - 1, mem_ptr->arithmetic_scalar_shift_mem, bsks,
-      ksks, ct->num_radix_blocks);
+      streams, gpu_indexes, gpu_count, mask, num_bits_in_ciphertext - 1,
+      mem_ptr->arithmetic_scalar_shift_mem, bsks, ksks);
  host_addition<Torus>(streams[0], gpu_indexes[0], ct, mask, ct,
                       ct->num_radix_blocks);

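Note on the abs.cuh hunks above: the mask is obtained by an arithmetic right shift of the input by `num_bits_in_ciphertext - 1` and is then added to the ciphertext. A plaintext analogue of this pattern is the classic branch-free absolute value, sketched below in C++. The final XOR with the mask is not visible in these hunks and is assumed here from the standard trick; `abs_via_mask` and `check_abs_via_mask` are hypothetical names, and the sketch works on 32-bit cleartext integers rather than on radix ciphertexts.

```cpp
#include <cassert>
#include <cstdint>

// Plaintext analogue (assumption, not backend code): branch-free |x|.
// Unsigned arithmetic is used so that the wrap-around is well defined.
constexpr std::uint32_t abs_via_mask(std::int32_t x) {
  const std::uint32_t ux = static_cast<std::uint32_t>(x);
  // Same value as an arithmetic shift by 31: all ones if x < 0, else zero.
  const std::uint32_t mask = 0u - (ux >> 31);
  // Adding the mask subtracts 1 for negative x; the XOR then flips all bits,
  // which together amounts to a two's complement negation.
  return (ux + mask) ^ mask;
}

inline void check_abs_via_mask() {
  assert(abs_via_mask(-5) == 5u);
  assert(abs_via_mask(7) == 7u);
  assert(abs_via_mask(0) == 0u);
}
```

For the most negative input the result wraps to 2^31, as it would for any fixed-width two's complement absolute value.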
--- a/backends/tfhe-cuda-backend/cuda/src/integer/cmux.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/cmux.cuh
@@ -7,26 +7,34 @@
 template <typename Torus>
 __host__ void zero_out_if(cudaStream_t const *streams,
                          uint32_t const *gpu_indexes, uint32_t gpu_count,
-                          Torus *lwe_array_out, Torus const *lwe_array_input,
-                          Torus const *lwe_condition,
+                          CudaRadixCiphertextFFI *lwe_array_out,
+                          CudaRadixCiphertextFFI const *lwe_array_input,
+                          CudaRadixCiphertextFFI const *lwe_condition,
                          int_zero_out_if_buffer<Torus> *mem_ptr,
                          int_radix_lut<Torus> *predicate, void *const *bsks,
                          Torus *const *ksks, uint32_t num_radix_blocks) {
+  if (lwe_array_out->num_radix_blocks < num_radix_blocks ||
+      lwe_array_input->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: input or output radix ciphertexts do not have enough "
+          "blocks")
+  if (lwe_array_out->lwe_dimension != lwe_array_input->lwe_dimension ||
+      lwe_array_input->lwe_dimension != lwe_condition->lwe_dimension)
+    PANIC("Cuda error: input and output radix ciphertexts must have the same "
+          "lwe dimension")
  cuda_set_device(gpu_indexes[0]);
  auto params = mem_ptr->params;

  // We can't use integer_radix_apply_bivariate_lookup_table_kb since the
  // second operand is not an array
  auto tmp_lwe_array_input = mem_ptr->tmp;
-  pack_bivariate_blocks_with_single_block<Torus>(
+  host_pack_bivariate_blocks_with_single_block<Torus>(
      streams, gpu_indexes, gpu_count, tmp_lwe_array_input,
      predicate->lwe_indexes_in, lwe_array_input, lwe_condition,
-      predicate->lwe_indexes_in, params.big_lwe_dimension,
-      params.message_modulus, num_radix_blocks);
+      predicate->lwe_indexes_in, params.message_modulus, num_radix_blocks);

-  legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
+  integer_radix_apply_univariate_lookup_table_kb<Torus>(
      streams, gpu_indexes, gpu_count, lwe_array_out, tmp_lwe_array_input, bsks,
-      ksks, num_radix_blocks, predicate);
+      ksks, predicate, num_radix_blocks);
 }

 template <typename Torus>
--- a/backends/tfhe-cuda-backend/cuda/src/integer/comparison.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/comparison.cu
@@ -38,21 +38,26 @@ void scratch_cuda_integer_radix_comparison_kb_64(

 void cuda_comparison_integer_radix_ciphertext_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_1, void const *lwe_array_2,
-    int8_t *mem_ptr, void *const *bsks, void *const *ksks,
-    uint32_t num_radix_blocks) {
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_1,
+    CudaRadixCiphertextFFI const *lwe_array_2, int8_t *mem_ptr,
+    void *const *bsks, void *const *ksks) {

+  if (lwe_array_1->num_radix_blocks != lwe_array_2->num_radix_blocks)
+    PANIC("Cuda error: input num radix blocks must be the same")
+  // The output ciphertext might be a boolean block or a radix ciphertext
+  // depending on the case (eq/gt vs max/min), so the number of blocks to
+  // consider for the computation is that of the inputs
+  auto num_radix_blocks = lwe_array_1->num_radix_blocks;
  int_comparison_buffer<uint64_t> *buffer =
      (int_comparison_buffer<uint64_t> *)mem_ptr;
  switch (buffer->op) {
  case EQ:
  case NE:
    host_integer_radix_equality_check_kb<uint64_t>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(lwe_array_out),
-        static_cast<const uint64_t *>(lwe_array_1),
-        static_cast<const uint64_t *>(lwe_array_2), buffer, bsks,
-        (uint64_t **)(ksks), num_radix_blocks);
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, lwe_array_out,
+        lwe_array_1, lwe_array_2, buffer, bsks, (uint64_t **)(ksks),
+        num_radix_blocks);
    break;
  case GT:
  case GE:
@@ -62,23 +67,18 @@ void cuda_comparison_integer_radix_ciphertext_kb_64(
      PANIC("Cuda error (comparisons): the number of radix blocks has to be "
            "even.")
    host_integer_radix_difference_check_kb<uint64_t>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(lwe_array_out),
-        static_cast<const uint64_t *>(lwe_array_1),
-        static_cast<const uint64_t *>(lwe_array_2), buffer,
-        buffer->diff_buffer->operator_f, bsks, (uint64_t **)(ksks),
-        num_radix_blocks);
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, lwe_array_out,
+        lwe_array_1, lwe_array_2, buffer, buffer->diff_buffer->operator_f, bsks,
+        (uint64_t **)(ksks), num_radix_blocks);
    break;
  case MAX:
  case MIN:
    if (num_radix_blocks % 2 != 0)
      PANIC("Cuda error (max/min): the number of radix blocks has to be even.")
    host_integer_radix_maxmin_kb<uint64_t>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(lwe_array_out),
-        static_cast<const uint64_t *>(lwe_array_1),
-        static_cast<const uint64_t *>(lwe_array_2), buffer, bsks,
-        (uint64_t **)(ksks), num_radix_blocks);
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, lwe_array_out,
+        lwe_array_1, lwe_array_2, buffer, bsks, (uint64_t **)(ksks),
+        num_radix_blocks);
    break;
  default:
    PANIC("Cuda error: integer operation not supported")
@@ -117,17 +117,16 @@ void scratch_cuda_integer_are_all_comparisons_block_true_kb_64(

 void cuda_integer_are_all_comparisons_block_true_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_in, int8_t *mem_ptr,
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in, int8_t *mem_ptr,
    void *const *bsks, void *const *ksks, uint32_t num_radix_blocks) {

  int_comparison_buffer<uint64_t> *buffer =
      (int_comparison_buffer<uint64_t> *)mem_ptr;

  host_integer_are_all_comparisons_block_true_kb<uint64_t>(
-      (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-      static_cast<uint64_t *>(lwe_array_out),
-      static_cast<const uint64_t *>(lwe_array_in), buffer, bsks,
-      (uint64_t **)(ksks), num_radix_blocks);
+      (cudaStream_t *)(streams), gpu_indexes, gpu_count, lwe_array_out,
+      lwe_array_in, buffer, bsks, (uint64_t **)(ksks), num_radix_blocks);
 }

 void cleanup_cuda_integer_are_all_comparisons_block_true(
@@ -161,17 +160,16 @@ void scratch_cuda_integer_is_at_least_one_comparisons_block_true_kb_64(

 void cuda_integer_is_at_least_one_comparisons_block_true_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lwe_array_out, void const *lwe_array_in, int8_t *mem_ptr,
+    CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in, int8_t *mem_ptr,
    void *const *bsks, void *const *ksks, uint32_t num_radix_blocks) {

  int_comparison_buffer<uint64_t> *buffer =
      (int_comparison_buffer<uint64_t> *)mem_ptr;

  host_integer_is_at_least_one_comparisons_block_true_kb<uint64_t>(
-      (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-      static_cast<uint64_t *>(lwe_array_out),
-      static_cast<const uint64_t *>(lwe_array_in), buffer, bsks,
-      (uint64_t **)(ksks), num_radix_blocks);
+      (cudaStream_t *)(streams), gpu_indexes, gpu_count, lwe_array_out,
+      lwe_array_in, buffer, bsks, (uint64_t **)(ksks), num_radix_blocks);
 }

 void cleanup_cuda_integer_is_at_least_one_comparisons_block_true(
--- a/backends/tfhe-cuda-backend/cuda/src/integer/comparison.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/comparison.cuh
@@ -8,6 +8,7 @@
 #include "integer/integer_utilities.h"
 #include "integer/negation.cuh"
 #include "integer/scalar_addition.cuh"
+#include "integer/subtraction.cuh"
 #include "pbs/programmable_bootstrap_classic.cuh"
 #include "pbs/programmable_bootstrap_multibit.cuh"
 #include "types/complex/operations.cuh"
@@ -48,15 +49,8 @@ __host__ void accumulate_all_blocks(cudaStream_t stream, uint32_t gpu_index,
  check_cuda_error(cudaGetLastError());
 }

-/* This takes an array of lwe ciphertexts, where each is an encryption of
- * either 0 or 1.
- *
- * It writes in lwe_array_out a single lwe ciphertext encrypting 1 if all input
- * blocks are 1 otherwise the block encrypts 0
- *
- */
 template <typename Torus>
-__host__ void are_all_comparisons_block_true(
+__host__ void legacy_are_all_comparisons_block_true(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_in,
    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
@@ -71,7 +65,7 @@ __host__ void are_all_comparisons_block_true(

  auto are_all_block_true_buffer =
      mem_ptr->eq_buffer->are_all_block_true_buffer;
-  auto tmp_out = are_all_block_true_buffer->tmp_out;
+  auto tmp_out = (Torus *)are_all_block_true_buffer->tmp_out->ptr;

  uint32_t total_modulus = message_modulus * carry_modulus;
  uint32_t max_value = (total_modulus - 1) / (message_modulus - 1);
@@ -90,7 +84,8 @@ __host__ void are_all_comparisons_block_true(
    // Since all blocks encrypt either 0 or 1, we can sum max_value of them
    // as in the worst case we will be adding `max_value` ones
    auto input_blocks = tmp_out;
-    auto accumulator = are_all_block_true_buffer->tmp_block_accumulated;
+    auto accumulator_ptr =
+        (Torus *)are_all_block_true_buffer->tmp_block_accumulated->ptr;
    auto is_max_value_lut = are_all_block_true_buffer->is_max_value;
    uint32_t chunk_lengths[num_chunks];
    auto begin_remaining_blocks = remaining_blocks;
@@ -98,15 +93,16 @@ __host__ void are_all_comparisons_block_true(
      uint32_t chunk_length =
          std::min(max_value, begin_remaining_blocks - i * max_value);
      chunk_lengths[i] = chunk_length;
-      accumulate_all_blocks<Torus>(streams[0], gpu_indexes[0], accumulator,
+      accumulate_all_blocks<Torus>(streams[0], gpu_indexes[0], accumulator_ptr,
                                   input_blocks, big_lwe_dimension,
                                   chunk_length);

-      accumulator += (big_lwe_dimension + 1);
+      accumulator_ptr += (big_lwe_dimension + 1);
      remaining_blocks -= (chunk_length - 1);
      input_blocks += (big_lwe_dimension + 1) * chunk_length;
    }
-    accumulator = are_all_block_true_buffer->tmp_block_accumulated;
+    auto accumulator =
+        (Torus *)are_all_block_true_buffer->tmp_block_accumulated->ptr;

    // Selects a LUT
    int_radix_lut<Torus> *lut;
@@ -163,11 +159,124 @@ __host__ void are_all_comparisons_block_true(
 /* This takes an array of lwe ciphertexts, where each is an encryption of
 * either 0 or 1.
 *
- * It writes in lwe_array_out a single lwe ciphertext encrypting 1 if at least
- * one input ciphertext encrypts 1 otherwise encrypts 0
+ * It writes in lwe_array_out a single lwe ciphertext encrypting 1 if all input
+ * blocks are 1 otherwise the block encrypts 0
+ *
 */
 template <typename Torus>
-__host__ void is_at_least_one_comparisons_block_true(
+__host__ void are_all_comparisons_block_true(
+    cudaStream_t const *streams, uint32_t const *gpu_indexes,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, uint32_t num_radix_blocks) {
+
+  if (lwe_array_out->lwe_dimension != lwe_array_in->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+  if (lwe_array_in->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: input num radix blocks should not be lower "
+          "than the number of blocks to operate on")
+
+  auto params = mem_ptr->params;
+  auto big_lwe_dimension = params.big_lwe_dimension;
+  auto glwe_dimension = params.glwe_dimension;
+  auto polynomial_size = params.polynomial_size;
+  auto message_modulus = params.message_modulus;
+  auto carry_modulus = params.carry_modulus;
+
+  auto are_all_block_true_buffer =
+      mem_ptr->eq_buffer->are_all_block_true_buffer;
+  auto tmp_out = are_all_block_true_buffer->tmp_out;
+
+  uint32_t total_modulus = message_modulus * carry_modulus;
+  uint32_t max_value = (total_modulus - 1) / (message_modulus - 1);
+
+  copy_radix_ciphertext_slice_async<Torus>(streams[0], gpu_indexes[0], tmp_out,
+                                           0, num_radix_blocks, lwe_array_in, 0,
+                                           num_radix_blocks);
+
+  uint32_t remaining_blocks = num_radix_blocks;
+
+  while (remaining_blocks > 0) {
+    // Split in max_value chunks
+    int num_chunks = (remaining_blocks + max_value - 1) / max_value;
+
+    // Since all blocks encrypt either 0 or 1, we can sum max_value of them
+    // as in the worst case we will be adding `max_value` ones
+    auto input_blocks = (Torus *)tmp_out->ptr;
+    auto accumulator_ptr =
+        (Torus *)are_all_block_true_buffer->tmp_block_accumulated->ptr;
+    auto is_max_value_lut = are_all_block_true_buffer->is_max_value;
+    uint32_t chunk_lengths[num_chunks];
+    auto begin_remaining_blocks = remaining_blocks;
+    for (int i = 0; i < num_chunks; i++) {
+      uint32_t chunk_length =
+          std::min(max_value, begin_remaining_blocks - i * max_value);
+      chunk_lengths[i] = chunk_length;
+      accumulate_all_blocks<Torus>(streams[0], gpu_indexes[0], accumulator_ptr,
+                                   input_blocks, big_lwe_dimension,
+                                   chunk_length);
+
+      accumulator_ptr += (big_lwe_dimension + 1);
+      remaining_blocks -= (chunk_length - 1);
+      input_blocks += (big_lwe_dimension + 1) * chunk_length;
+    }
+    auto accumulator = are_all_block_true_buffer->tmp_block_accumulated;
+
+    // Selects a LUT
+    int_radix_lut<Torus> *lut;
+    if (are_all_block_true_buffer->op == COMPARISON_TYPE::NE) {
+      // is_non_zero_lut_buffer LUT
+      lut = mem_ptr->eq_buffer->is_non_zero_lut;
+    } else {
+      if (chunk_lengths[num_chunks - 1] != max_value) {
+        // LUT needs to be computed
+        uint32_t chunk_length = chunk_lengths[num_chunks - 1];
+        auto is_equal_to_num_blocks_lut_f = [chunk_length](Torus x) -> Torus {
+          return x == chunk_length;
+        };
+        generate_device_accumulator<Torus>(
+            streams[0], gpu_indexes[0], is_max_value_lut->get_lut(0, 1),
+            is_max_value_lut->get_degree(1),
+            is_max_value_lut->get_max_degree(1), glwe_dimension,
+            polynomial_size, message_modulus, carry_modulus,
+            is_equal_to_num_blocks_lut_f);
+
+        Torus *h_lut_indexes = (Torus *)malloc(num_chunks * sizeof(Torus));
+        for (int index = 0; index < num_chunks; index++) {
+          if (index == num_chunks - 1) {
+            h_lut_indexes[index] = 1;
+          } else {
+            h_lut_indexes[index] = 0;
+          }
+        }
+        cuda_memcpy_async_to_gpu(is_max_value_lut->get_lut_indexes(0, 0),
+                                 h_lut_indexes, num_chunks * sizeof(Torus),
+                                 streams[0], gpu_indexes[0]);
+        is_max_value_lut->broadcast_lut(streams, gpu_indexes, 0);
+        cuda_synchronize_stream(streams[0], gpu_indexes[0]);
+        free(h_lut_indexes);
+      }
+      lut = is_max_value_lut;
+    }
+
+    // Applies the LUT
+    if (remaining_blocks == 1) {
+      // In the last iteration we copy the output to the final address
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, lwe_array_out, accumulator, bsks,
+          ksks, lut, 1);
+      return;
+    } else {
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, tmp_out, accumulator, bsks, ksks,
+          lut, num_chunks);
+    }
+  }
+}
+
+template <typename Torus>
+__host__ void legacy_is_at_least_one_comparisons_block_true(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_in,
    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
@@ -183,10 +292,10 @@ __host__ void is_at_least_one_comparisons_block_true(
  uint32_t total_modulus = message_modulus * carry_modulus;
  uint32_t max_value = (total_modulus - 1) / (message_modulus - 1);

-  cuda_memcpy_async_gpu_to_gpu(mem_ptr->tmp_lwe_array_out, lwe_array_in,
-                               num_radix_blocks * (big_lwe_dimension + 1) *
-                                   sizeof(Torus),
-                               streams[0], gpu_indexes[0]);
+  cuda_memcpy_async_gpu_to_gpu(
+      (Torus *)mem_ptr->tmp_lwe_array_out->ptr, lwe_array_in,
+      num_radix_blocks * (big_lwe_dimension + 1) * sizeof(Torus), streams[0],
+      gpu_indexes[0]);

  uint32_t remaining_blocks = num_radix_blocks;
  while (remaining_blocks > 0) {
@@ -195,8 +304,8 @@ __host__ void is_at_least_one_comparisons_block_true(

    // Since all blocks encrypt either 0 or 1, we can sum max_value of them
    // as in the worst case we will be adding `max_value` ones
-    auto input_blocks = mem_ptr->tmp_lwe_array_out;
-    auto accumulator = buffer->tmp_block_accumulated;
+    auto input_blocks = (Torus *)mem_ptr->tmp_lwe_array_out->ptr;
+    auto accumulator = (Torus *)buffer->tmp_block_accumulated->ptr;
    uint32_t chunk_lengths[num_chunks];
    auto begin_remaining_blocks = remaining_blocks;
    for (int i = 0; i < num_chunks; i++) {
@@ -211,7 +320,7 @@ __host__ void is_at_least_one_comparisons_block_true(
      remaining_blocks -= (chunk_length - 1);
      input_blocks += (big_lwe_dimension + 1) * chunk_length;
    }
-    accumulator = buffer->tmp_block_accumulated;
+    accumulator = (Torus *)buffer->tmp_block_accumulated->ptr;

    // Selects a LUT
    int_radix_lut<Torus> *lut = mem_ptr->eq_buffer->is_non_zero_lut;
@@ -225,33 +334,91 @@ __host__ void is_at_least_one_comparisons_block_true(
      return;
    } else {
      legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-          streams, gpu_indexes, gpu_count, mem_ptr->tmp_lwe_array_out,
-          accumulator, bsks, ksks, num_chunks, lut);
+          streams, gpu_indexes, gpu_count,
+          (Torus *)mem_ptr->tmp_lwe_array_out->ptr, accumulator, bsks, ksks,
+          num_chunks, lut);
    }
  }
 }

-// This takes an input slice of blocks.
-//
-// Each block can encrypt any value as long as its < message_modulus.
-//
-// It will compare blocks with 0, for either equality or difference.
-//
-// This returns a Vec of block, where each block encrypts 1 or 0
-// depending of if all blocks matched with the comparison type with 0.
-//
-// E.g. For ZeroComparisonType::Equality, if all input blocks are zero
-// than all returned block will encrypt 1
-//
-// The returned Vec will have less block than the number of input blocks.
-// The returned blocks potentially needs to be 'reduced' to one block
-// with eg are_all_comparisons_block_true.
-//
-// This function exists because sometimes it is faster to concatenate
-// multiple vec of 'boolean' shortint block before reducing them with
-// are_all_comparisons_block_true
+/* This takes an array of LWE ciphertexts, where each is an encryption of
+ * either 0 or 1.
+ *
+ * It writes to lwe_array_out a single LWE ciphertext that encrypts 1 if at
+ * least one input ciphertext encrypts 1, and 0 otherwise.
+ */
 template <typename Torus>
-__host__ void host_compare_with_zero_equality(
+__host__ void is_at_least_one_comparisons_block_true(
+    cudaStream_t const *streams, uint32_t const *gpu_indexes,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, uint32_t num_radix_blocks) {
+
+  if (lwe_array_out->lwe_dimension != lwe_array_in->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+
+  if (lwe_array_in->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: input num radix blocks should not be lower "
+          "than the number of blocks to operate on")
+  auto params = mem_ptr->params;
+  auto big_lwe_dimension = params.big_lwe_dimension;
+  auto message_modulus = params.message_modulus;
+  auto carry_modulus = params.carry_modulus;
+
+  auto buffer = mem_ptr->eq_buffer->are_all_block_true_buffer;
+
+  uint32_t total_modulus = message_modulus * carry_modulus;
+  uint32_t max_value = (total_modulus - 1) / (message_modulus - 1);
+
+  copy_radix_ciphertext_slice_async<Torus>(
+      streams[0], gpu_indexes[0], mem_ptr->tmp_lwe_array_out, 0,
+      num_radix_blocks, lwe_array_in, 0, num_radix_blocks);
+
+  uint32_t remaining_blocks = num_radix_blocks;
+  while (remaining_blocks > 0) {
+    // Split into chunks of at most max_value blocks
+    int num_chunks = (remaining_blocks + max_value - 1) / max_value;
+
+    // Since all blocks encrypt either 0 or 1, we can sum max_value of them
+    // as in the worst case we will be adding `max_value` ones
+    auto input_blocks = (Torus *)mem_ptr->tmp_lwe_array_out->ptr;
+    auto accumulator = (Torus *)buffer->tmp_block_accumulated->ptr;
+    uint32_t chunk_lengths[num_chunks];
+    auto begin_remaining_blocks = remaining_blocks;
+    for (int i = 0; i < num_chunks; i++) {
+      uint32_t chunk_length =
+          std::min(max_value, begin_remaining_blocks - i * max_value);
+      chunk_lengths[i] = chunk_length;
+      accumulate_all_blocks<Torus>(streams[0], gpu_indexes[0], accumulator,
+                                   input_blocks, big_lwe_dimension,
+                                   chunk_length);
+
+      accumulator += (big_lwe_dimension + 1);
+      remaining_blocks -= (chunk_length - 1);
+      input_blocks += (big_lwe_dimension + 1) * chunk_length;
+    }
+
+    // Selects a LUT
+    int_radix_lut<Torus> *lut = mem_ptr->eq_buffer->is_non_zero_lut;
+
+    // Applies the LUT
+    if (remaining_blocks == 1) {
+      // In the last iteration we copy the output to the final address
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, lwe_array_out,
+          buffer->tmp_block_accumulated, bsks, ksks, lut, 1);
+      return;
+    } else {
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, mem_ptr->tmp_lwe_array_out,
+          buffer->tmp_block_accumulated, bsks, ksks, lut, num_chunks);
+    }
+  }
+}
+
+template <typename Torus>
+__host__ void legacy_host_compare_with_zero_equality(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_in,
    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
@@ -308,6 +475,79 @@ __host__ void host_compare_with_zero_equality(
  legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
      streams, gpu_indexes, gpu_count, sum, sum, bsks, ksks, num_sum_blocks,
      zero_comparison);
+  legacy_are_all_comparisons_block_true<Torus>(streams, gpu_indexes, gpu_count,
+                                               lwe_array_out, sum, mem_ptr,
+                                               bsks, ksks, num_sum_blocks);
+}
+
+// FIXME This function should be improved as it outputs a single LWE ciphertext
+//  but requires the output to have enough blocks allocated to compute
+//  intermediate values
+template <typename Torus>
+__host__ void host_compare_with_zero_equality(
+    cudaStream_t const *streams, uint32_t const *gpu_indexes,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, int32_t num_radix_blocks,
+    int_radix_lut<Torus> *zero_comparison) {
+
+  if (num_radix_blocks == 0)
+    return;
+  if (lwe_array_out->lwe_dimension != lwe_array_in->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+  if (lwe_array_in->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: input num radix blocks should not be lower "
+          "than the number of blocks to operate on")
+
+  auto params = mem_ptr->params;
+  auto big_lwe_dimension = params.big_lwe_dimension;
+  auto message_modulus = params.message_modulus;
+  auto carry_modulus = params.carry_modulus;
+
+  // The idea is that we will sum chunks of blocks until carries are full
+  // then we compare the sum with 0.
+  //
+  // If all blocks were 0, the sum will be zero
+  // If at least one block was not zero, the sum won't be zero
+  uint32_t total_modulus = message_modulus * carry_modulus;
+  uint32_t message_max = message_modulus - 1;
+
+  uint32_t num_elements_to_fill_carry = (total_modulus - 1) / message_max;
+
+  size_t big_lwe_size = big_lwe_dimension + 1;
+  int num_sum_blocks = 0;
+  // Accumulator
+  auto sum = lwe_array_out;
+
+  if (num_radix_blocks == 1) {
+    // Just copy
+    copy_radix_ciphertext_slice_async<Torus>(streams[0], gpu_indexes[0], sum, 0,
+                                             1, lwe_array_in, 0, 1);
+    num_sum_blocks = 1;
+  } else {
+    uint32_t remainder_blocks = num_radix_blocks;
+    auto sum_i = (Torus *)sum->ptr;
+    auto chunk = (Torus *)lwe_array_in->ptr;
+    while (remainder_blocks > 1) {
+      uint32_t chunk_size =
+          std::min(remainder_blocks, num_elements_to_fill_carry);
+
+      accumulate_all_blocks<Torus>(streams[0], gpu_indexes[0], sum_i, chunk,
+                                   big_lwe_dimension, chunk_size);
+
+      num_sum_blocks++;
+      remainder_blocks -= (chunk_size - 1);
+
+      // Update operands
+      chunk += (chunk_size - 1) * big_lwe_size;
+      sum_i += big_lwe_size;
+    }
+  }
+
+  integer_radix_apply_univariate_lookup_table_kb<Torus>(
+      streams, gpu_indexes, gpu_count, sum, sum, bsks, ksks, zero_comparison,
+      num_sum_blocks);
  are_all_comparisons_block_true<Torus>(streams, gpu_indexes, gpu_count,
                                        lwe_array_out, sum, mem_ptr, bsks, ksks,
                                        num_sum_blocks);
@@ -316,17 +556,22 @@ __host__ void host_compare_with_zero_equality(
 template <typename Torus>
 __host__ void host_integer_radix_equality_check_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_1,
-    Torus const *lwe_array_2, int_comparison_buffer<Torus> *mem_ptr,
-    void *const *bsks, Torus *const *ksks, uint32_t num_radix_blocks) {
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_1,
+    CudaRadixCiphertextFFI const *lwe_array_2,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, uint32_t num_radix_blocks) {

+  if (lwe_array_out->lwe_dimension != lwe_array_1->lwe_dimension ||
+      lwe_array_out->lwe_dimension != lwe_array_2->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
  auto eq_buffer = mem_ptr->eq_buffer;

  // Applies the LUT for the comparison operation
  auto comparisons = mem_ptr->tmp_block_comparisons;
-  legacy_integer_radix_apply_bivariate_lookup_table_kb<Torus>(
+  integer_radix_apply_bivariate_lookup_table_kb<Torus>(
      streams, gpu_indexes, gpu_count, comparisons, lwe_array_1, lwe_array_2,
-      bsks, ksks, num_radix_blocks, eq_buffer->operator_lut,
+      bsks, ksks, eq_buffer->operator_lut, num_radix_blocks,
      eq_buffer->operator_lut->params.message_modulus);

  // This takes a Vec of blocks, where each block is either 0 or 1.
@@ -341,10 +586,16 @@ __host__ void host_integer_radix_equality_check_kb(
 template <typename Torus>
 __host__ void compare_radix_blocks_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_left,
-    Torus const *lwe_array_right, int_comparison_buffer<Torus> *mem_ptr,
-    void *const *bsks, Torus *const *ksks, uint32_t num_radix_blocks) {
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_left,
+    CudaRadixCiphertextFFI const *lwe_array_right,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, uint32_t num_radix_blocks) {

+  if (lwe_array_out->lwe_dimension != lwe_array_left->lwe_dimension ||
+      lwe_array_out->lwe_dimension != lwe_array_right->lwe_dimension)
+    PANIC("Cuda error: input and output radix ciphertexts should have the same "
+          "lwe dimension")
  auto params = mem_ptr->params;
  auto big_lwe_dimension = params.big_lwe_dimension;
  auto message_modulus = params.message_modulus;
@@ -364,35 +615,43 @@ __host__ void compare_radix_blocks_kb(
  // space, so (-1) % (4 * 4) = 15 = 1|1111 We then add one and get 0 = 0|0000

  // Subtract
-  // Here we need the true lwe sub, not the one that comes from shortint.
-  host_subtraction<Torus>(streams[0], gpu_indexes[0], lwe_array_out,
-                          lwe_array_left, lwe_array_right, big_lwe_dimension,
-                          num_radix_blocks);
+  host_subtraction<Torus>(
+      streams[0], gpu_indexes[0], (Torus *)lwe_array_out->ptr,
+      (Torus *)lwe_array_left->ptr, (Torus *)lwe_array_right->ptr,
+      big_lwe_dimension, num_radix_blocks);

  // Apply LUT to compare to 0
  auto is_non_zero_lut = mem_ptr->eq_buffer->is_non_zero_lut;
-  legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
+  integer_radix_apply_univariate_lookup_table_kb<Torus>(
      streams, gpu_indexes, gpu_count, lwe_array_out, lwe_array_out, bsks, ksks,
-      num_radix_blocks, is_non_zero_lut);
+      is_non_zero_lut, num_radix_blocks);

  // Add one
  // Here Lhs can have the following values: (-1) % (message modulus * carry
  // modulus), 0, 1 So the output values after the addition will be: 0, 1, 2
  host_integer_radix_add_scalar_one_inplace<Torus>(
-      streams, gpu_indexes, gpu_count, lwe_array_out, big_lwe_dimension,
-      num_radix_blocks, message_modulus, carry_modulus);
+      streams, gpu_indexes, gpu_count, lwe_array_out, message_modulus,
+      carry_modulus);
 }

 // Reduces a vec containing shortint blocks that encrypts a sign
 // (inferior, equal, superior) to one single shortint block containing the
 // final sign
 template <typename Torus>
-__host__ void tree_sign_reduction(
-    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus *lwe_block_comparisons,
-    int_tree_sign_reduction_buffer<Torus> *tree_buffer,
-    std::function<Torus(Torus)> sign_handler_f, void *const *bsks,
-    Torus *const *ksks, uint32_t num_radix_blocks) {
+__host__ void
+tree_sign_reduction(cudaStream_t const *streams, uint32_t const *gpu_indexes,
+                    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+                    CudaRadixCiphertextFFI *lwe_block_comparisons,
+                    int_tree_sign_reduction_buffer<Torus> *tree_buffer,
+                    std::function<Torus(Torus)> sign_handler_f,
+                    void *const *bsks, Torus *const *ksks,
+                    uint32_t num_radix_blocks) {
+
+  if (lwe_array_out->lwe_dimension != lwe_block_comparisons->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+  if (lwe_block_comparisons->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: block comparisons num radix blocks should not be lower "
+          "than the number of blocks to operate on")

  auto params = tree_buffer->params;
  auto big_lwe_dimension = params.big_lwe_dimension;
@@ -405,37 +664,31 @@ __host__ void tree_sign_reduction(
  // Reduces a vec containing shortint blocks that encrypts a sign
  // (inferior, equal, superior) to one single shortint block containing the
  // final sign
-  size_t big_lwe_size = big_lwe_dimension + 1;
-  size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
-
  auto x = tree_buffer->tmp_x;
  auto y = tree_buffer->tmp_y;
  if (x != lwe_block_comparisons)
-    cuda_memcpy_async_gpu_to_gpu(x, lwe_block_comparisons,
-                                 big_lwe_size_bytes * num_radix_blocks,
-                                 streams[0], gpu_indexes[0]);
+    copy_radix_ciphertext_slice_async<Torus>(
+        streams[0], gpu_indexes[0], x, 0, num_radix_blocks,
+        lwe_block_comparisons, 0, num_radix_blocks);

  uint32_t partial_block_count = num_radix_blocks;

  auto inner_tree_leaf = tree_buffer->tree_inner_leaf_lut;
  while (partial_block_count > 2) {
-    pack_blocks<Torus>(streams[0], gpu_indexes[0], y, x, big_lwe_dimension,
-                       partial_block_count, 4);
+    pack_blocks<Torus>(streams[0], gpu_indexes[0], y, x, partial_block_count,
+                       4);

-    legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-        streams, gpu_indexes, gpu_count, x, y, bsks, ksks,
-        partial_block_count >> 1, inner_tree_leaf);
+    integer_radix_apply_univariate_lookup_table_kb<Torus>(
+        streams, gpu_indexes, gpu_count, x, y, bsks, ksks, inner_tree_leaf,
+        partial_block_count >> 1);

    if ((partial_block_count % 2) != 0) {
      partial_block_count >>= 1;
      partial_block_count++;

-      auto last_y_block = y + (partial_block_count - 1) * big_lwe_size;
-      auto last_x_block = x + (partial_block_count - 1) * big_lwe_size;
-
-      cuda_memcpy_async_gpu_to_gpu(last_x_block, last_y_block,
-                                   big_lwe_size_bytes, streams[0],
-                                   gpu_indexes[0]);
+      copy_radix_ciphertext_slice_async<Torus>(
+          streams[0], gpu_indexes[0], x, partial_block_count - 1,
+          partial_block_count, y, partial_block_count - 1, partial_block_count);
    } else {
      partial_block_count >>= 1;
    }
@@ -446,8 +699,8 @@ __host__ void tree_sign_reduction(
  std::function<Torus(Torus)> f;

  if (partial_block_count == 2) {
-    pack_blocks<Torus>(streams[0], gpu_indexes[0], y, x, big_lwe_dimension,
-                       partial_block_count, 4);
+    pack_blocks<Torus>(streams[0], gpu_indexes[0], y, x, partial_block_count,
+                       4);

    f = [block_selector_f, sign_handler_f](Torus x) -> Torus {
      int msb = (x >> 2) & 3;
@@ -468,58 +721,64 @@ __host__ void tree_sign_reduction(
  last_lut->broadcast_lut(streams, gpu_indexes, 0);

  // Last leaf
-  legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-      streams, gpu_indexes, gpu_count, lwe_array_out, y, bsks, ksks, 1,
-      last_lut);
+  integer_radix_apply_univariate_lookup_table_kb<Torus>(
+      streams, gpu_indexes, gpu_count, lwe_array_out, y, bsks, ksks, last_lut,
+      1);
 }

 template <typename Torus>
 __host__ void host_integer_radix_difference_check_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_left,
-    Torus const *lwe_array_right, int_comparison_buffer<Torus> *mem_ptr,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_left,
+    CudaRadixCiphertextFFI const *lwe_array_right,
+    int_comparison_buffer<Torus> *mem_ptr,
    std::function<Torus(Torus)> reduction_lut_f, void *const *bsks,
    Torus *const *ksks, uint32_t num_radix_blocks) {

+  if (lwe_array_out->lwe_dimension != lwe_array_left->lwe_dimension ||
+      lwe_array_out->lwe_dimension != lwe_array_right->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+
  auto diff_buffer = mem_ptr->diff_buffer;

  auto params = mem_ptr->params;
-  auto big_lwe_dimension = params.big_lwe_dimension;
-  auto big_lwe_size = big_lwe_dimension + 1;
  auto message_modulus = params.message_modulus;
  auto carry_modulus = params.carry_modulus;

  uint32_t packed_num_radix_blocks = num_radix_blocks;
-  Torus *lhs = (Torus *)lwe_array_left;
-  Torus *rhs = (Torus *)lwe_array_right;
+  CudaRadixCiphertextFFI lhs;
+  as_radix_ciphertext_slice<Torus>(&lhs, diff_buffer->tmp_packed, 0,
+                                   num_radix_blocks / 2);
+  CudaRadixCiphertextFFI rhs;
+  as_radix_ciphertext_slice<Torus>(&rhs, diff_buffer->tmp_packed,
+                                   num_radix_blocks / 2, num_radix_blocks);
  if (carry_modulus >= message_modulus) {
    // Packing is possible
    // Pack inputs
-    Torus *packed_left = diff_buffer->tmp_packed;
-    Torus *packed_right =
-        diff_buffer->tmp_packed + num_radix_blocks / 2 * big_lwe_size;
    // In case the ciphertext is signed, the sign block and the one before it
    // are handled separately
    if (mem_ptr->is_signed) {
      packed_num_radix_blocks -= 2;
    }
-    pack_blocks<Torus>(streams[0], gpu_indexes[0], packed_left, lwe_array_left,
-                       big_lwe_dimension, packed_num_radix_blocks,
-                       message_modulus);
-    pack_blocks<Torus>(streams[0], gpu_indexes[0], packed_right,
-                       lwe_array_right, big_lwe_dimension,
+    pack_blocks<Torus>(streams[0], gpu_indexes[0], &lhs, lwe_array_left,
+                       packed_num_radix_blocks, message_modulus);
+    pack_blocks<Torus>(streams[0], gpu_indexes[0], &rhs, lwe_array_right,
                       packed_num_radix_blocks, message_modulus);
    // From this point we have half number of blocks
    packed_num_radix_blocks /= 2;

    // Clean noise
    auto identity_lut = mem_ptr->identity_lut;
-    legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-        streams, gpu_indexes, gpu_count, packed_left, packed_left, bsks, ksks,
-        2 * packed_num_radix_blocks, identity_lut);
-
-    lhs = packed_left;
-    rhs = packed_right;
+    integer_radix_apply_univariate_lookup_table_kb<Torus>(
+        streams, gpu_indexes, gpu_count, diff_buffer->tmp_packed,
+        diff_buffer->tmp_packed, bsks, ksks, identity_lut,
+        2 * packed_num_radix_blocks);
+  } else {
+    as_radix_ciphertext_slice<Torus>(&lhs, lwe_array_left, 0,
+                                     lwe_array_left->num_radix_blocks);
+    as_radix_ciphertext_slice<Torus>(&rhs, lwe_array_right, 0,
+                                     lwe_array_right->num_radix_blocks);
  }

  // comparisons will be assigned
@@ -532,7 +791,7 @@ __host__ void host_integer_radix_difference_check_kb(
    // Compare packed blocks, or simply the total number of radix blocks in the
    // inputs
    compare_radix_blocks_kb<Torus>(streams, gpu_indexes, gpu_count, comparisons,
-                                   lhs, rhs, mem_ptr, bsks, ksks,
+                                   &lhs, &rhs, mem_ptr, bsks, ksks,
                                   packed_num_radix_blocks);
    num_comparisons = packed_num_radix_blocks;
  } else {
@@ -540,38 +799,59 @@ __host__ void host_integer_radix_difference_check_kb(
    if (carry_modulus >= message_modulus) {
      // Compare (num_radix_blocks - 2) / 2 packed blocks
      compare_radix_blocks_kb<Torus>(streams, gpu_indexes, gpu_count,
-                                     comparisons, lhs, rhs, mem_ptr, bsks, ksks,
-                                     packed_num_radix_blocks);
+                                     comparisons, &lhs, &rhs, mem_ptr, bsks,
+                                     ksks, packed_num_radix_blocks);

      // Compare the last block before the sign block separately
      auto identity_lut = mem_ptr->identity_lut;
-      Torus *packed_left = diff_buffer->tmp_packed;
-      Torus *packed_right =
-          diff_buffer->tmp_packed + num_radix_blocks / 2 * big_lwe_size;
-      Torus *last_left_block_before_sign_block =
-          packed_left + packed_num_radix_blocks * big_lwe_size;
-      Torus *last_right_block_before_sign_block =
-          packed_right + packed_num_radix_blocks * big_lwe_size;
-      legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-          streams, gpu_indexes, gpu_count, last_left_block_before_sign_block,
-          lwe_array_left + (num_radix_blocks - 2) * big_lwe_size, bsks, ksks, 1,
-          identity_lut);
-      legacy_integer_radix_apply_univariate_lookup_table_kb<Torus>(
-          streams, gpu_indexes, gpu_count, last_right_block_before_sign_block,
-          lwe_array_right + (num_radix_blocks - 2) * big_lwe_size, bsks, ksks,
-          1, identity_lut);
+      CudaRadixCiphertextFFI last_left_block_before_sign_block;
+      as_radix_ciphertext_slice<Torus>(
+          &last_left_block_before_sign_block, diff_buffer->tmp_packed,
+          packed_num_radix_blocks, packed_num_radix_blocks + 1);
+      CudaRadixCiphertextFFI shifted_lwe_array_left;
+      as_radix_ciphertext_slice<Torus>(&shifted_lwe_array_left, lwe_array_left,
+                                       num_radix_blocks - 2,
+                                       num_radix_blocks - 1);
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, &last_left_block_before_sign_block,
+          &shifted_lwe_array_left, bsks, ksks, identity_lut, 1);
+
+      CudaRadixCiphertextFFI last_right_block_before_sign_block;
+      as_radix_ciphertext_slice<Torus>(
+          &last_right_block_before_sign_block, diff_buffer->tmp_packed,
+          num_radix_blocks / 2 + packed_num_radix_blocks,
+          num_radix_blocks / 2 + packed_num_radix_blocks + 1);
+      CudaRadixCiphertextFFI shifted_lwe_array_right;
+      as_radix_ciphertext_slice<Torus>(&shifted_lwe_array_right,
+                                       lwe_array_right, num_radix_blocks - 2,
+                                       num_radix_blocks - 1);
+      integer_radix_apply_univariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, &last_right_block_before_sign_block,
+          &shifted_lwe_array_right, bsks, ksks, identity_lut, 1);
+
+      CudaRadixCiphertextFFI shifted_comparisons;
+      as_radix_ciphertext_slice<Torus>(&shifted_comparisons, comparisons,
+                                       packed_num_radix_blocks,
+                                       packed_num_radix_blocks + 1);
      compare_radix_blocks_kb<Torus>(
-          streams, gpu_indexes, gpu_count,
-          comparisons + packed_num_radix_blocks * big_lwe_size,
-          last_left_block_before_sign_block, last_right_block_before_sign_block,
-          mem_ptr, bsks, ksks, 1);
+          streams, gpu_indexes, gpu_count, &shifted_comparisons,
+          &last_left_block_before_sign_block,
+          &last_right_block_before_sign_block, mem_ptr, bsks, ksks, 1);
+
      // Compare the sign block separately
-      legacy_integer_radix_apply_bivariate_lookup_table_kb<Torus>(
-          streams, gpu_indexes, gpu_count,
-          comparisons + (packed_num_radix_blocks + 1) * big_lwe_size,
-          lwe_array_left + (num_radix_blocks - 1) * big_lwe_size,
-          lwe_array_right + (num_radix_blocks - 1) * big_lwe_size, bsks, ksks,
-          1, mem_ptr->signed_lut, mem_ptr->signed_lut->params.message_modulus);
+      as_radix_ciphertext_slice<Torus>(&shifted_comparisons, comparisons,
+                                       packed_num_radix_blocks + 1,
+                                       packed_num_radix_blocks + 2);
+      CudaRadixCiphertextFFI last_left_block;
+      as_radix_ciphertext_slice<Torus>(&last_left_block, lwe_array_left,
+                                       num_radix_blocks - 1, num_radix_blocks);
+      CudaRadixCiphertextFFI last_right_block;
+      as_radix_ciphertext_slice<Torus>(&last_right_block, lwe_array_right,
+                                       num_radix_blocks - 1, num_radix_blocks);
+      integer_radix_apply_bivariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, &shifted_comparisons,
+          &last_left_block, &last_right_block, bsks, ksks, mem_ptr->signed_lut,
+          1, mem_ptr->signed_lut->params.message_modulus);
      num_comparisons = packed_num_radix_blocks + 2;

    } else {
@@ -579,12 +859,19 @@ __host__ void host_integer_radix_difference_check_kb(
          streams, gpu_indexes, gpu_count, comparisons, lwe_array_left,
          lwe_array_right, mem_ptr, bsks, ksks, num_radix_blocks - 1);
      // Compare the sign block separately
-      legacy_integer_radix_apply_bivariate_lookup_table_kb<Torus>(
-          streams, gpu_indexes, gpu_count,
-          comparisons + (num_radix_blocks - 1) * big_lwe_size,
-          lwe_array_left + (num_radix_blocks - 1) * big_lwe_size,
-          lwe_array_right + (num_radix_blocks - 1) * big_lwe_size, bsks, ksks,
-          1, mem_ptr->signed_lut, mem_ptr->signed_lut->params.message_modulus);
+      CudaRadixCiphertextFFI shifted_comparisons;
+      as_radix_ciphertext_slice<Torus>(&shifted_comparisons, comparisons,
+                                       num_radix_blocks - 1, num_radix_blocks);
+      CudaRadixCiphertextFFI last_left_block;
+      as_radix_ciphertext_slice<Torus>(&last_left_block, lwe_array_left,
+                                       num_radix_blocks - 1, num_radix_blocks);
+      CudaRadixCiphertextFFI last_right_block;
+      as_radix_ciphertext_slice<Torus>(&last_right_block, lwe_array_right,
+                                       num_radix_blocks - 1, num_radix_blocks);
+      integer_radix_apply_bivariate_lookup_table_kb<Torus>(
+          streams, gpu_indexes, gpu_count, &shifted_comparisons,
+          &last_left_block, &last_right_block, bsks, ksks, mem_ptr->signed_lut,
+          1, mem_ptr->signed_lut->params.message_modulus);
      num_comparisons = num_radix_blocks;
    }
  }
@@ -612,32 +899,42 @@ __host__ void scratch_cuda_integer_radix_comparison_check_kb(
 template <typename Torus>
 __host__ void host_integer_radix_maxmin_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_left,
-    Torus const *lwe_array_right, int_comparison_buffer<Torus> *mem_ptr,
-    void *const *bsks, Torus *const *ksks, uint32_t total_num_radix_blocks) {
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_left,
+    CudaRadixCiphertextFFI const *lwe_array_right,
+    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
+    Torus *const *ksks, uint32_t num_radix_blocks) {
+
+  if (lwe_array_out->lwe_dimension != lwe_array_left->lwe_dimension ||
+      lwe_array_out->lwe_dimension != lwe_array_right->lwe_dimension)
+    PANIC("Cuda error: input and output lwe dimensions must be the same")
+  if (lwe_array_out->num_radix_blocks < num_radix_blocks ||
+      lwe_array_left->num_radix_blocks < num_radix_blocks ||
+      lwe_array_right->num_radix_blocks < num_radix_blocks)
+    PANIC("Cuda error: input and output num radix blocks should not be lower "
+          "than the number of blocks to operate on")

  // Compute the sign
  host_integer_radix_difference_check_kb<Torus>(
      streams, gpu_indexes, gpu_count, mem_ptr->tmp_lwe_array_out,
      lwe_array_left, lwe_array_right, mem_ptr, mem_ptr->identity_lut_f, bsks,
-      ksks, total_num_radix_blocks);
+      ksks, num_radix_blocks);

  // Selector
-  legacy_host_integer_radix_cmux_kb<Torus>(
-      streams, gpu_indexes, gpu_count, lwe_array_out,
-      mem_ptr->tmp_lwe_array_out, lwe_array_left, lwe_array_right,
-      mem_ptr->cmux_buffer, bsks, ksks, total_num_radix_blocks);
+  host_integer_radix_cmux_kb<Torus>(streams, gpu_indexes, gpu_count,
+                                    lwe_array_out, mem_ptr->tmp_lwe_array_out,
+                                    lwe_array_left, lwe_array_right,
+                                    mem_ptr->cmux_buffer, bsks, ksks);
 }

 template <typename Torus>
 __host__ void host_integer_are_all_comparisons_block_true_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_in,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in,
    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
    Torus *const *ksks, uint32_t num_radix_blocks) {

-  auto eq_buffer = mem_ptr->eq_buffer;
-
  // It returns a block encrypting 1 if all input blocks are 1
  // otherwise the block encrypts 0
  are_all_comparisons_block_true<Torus>(streams, gpu_indexes, gpu_count,
@@ -648,12 +945,11 @@ __host__ void host_integer_are_all_comparisons_block_true_kb(
 template <typename Torus>
 __host__ void host_integer_is_at_least_one_comparisons_block_true_kb(
    cudaStream_t const *streams, uint32_t const *gpu_indexes,
-    uint32_t gpu_count, Torus *lwe_array_out, Torus const *lwe_array_in,
+    uint32_t gpu_count, CudaRadixCiphertextFFI *lwe_array_out,
+    CudaRadixCiphertextFFI const *lwe_array_in,
    int_comparison_buffer<Torus> *mem_ptr, void *const *bsks,
    Torus *const *ksks, uint32_t num_radix_blocks) {

-  auto eq_buffer = mem_ptr->eq_buffer;
-
  // It returns a block encrypting 1 if all input blocks are 1
  // otherwise the block encrypts 0
  is_at_least_one_comparisons_block_true<Torus>(
--- a/backends/tfhe-cuda-backend/cuda/src/integer/compression/compression.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/compression/compression.cuh
@@ -43,6 +43,9 @@ __global__ void pack(Torus *array_out, Torus *array_in, uint32_t log_modulus,
  }
 }

+/// Packs `num_lwes` LWE ciphertexts contained in `num_glwes` GLWE ciphertexts
+/// into a compressed array. This function follows the naming used in the CPU
+/// implementation.
 template <typename Torus>
 __host__ void host_pack(cudaStream_t stream, uint32_t gpu_index,
                        Torus *array_out, Torus *array_in, uint32_t num_glwes,
@@ -55,26 +58,23 @@ __host__ void host_pack(cudaStream_t stream, uint32_t gpu_index,

  auto log_modulus = mem_ptr->storage_log_modulus;
  // [0..num_glwes-1) GLWEs
-  auto in_len = (compression_params.glwe_dimension + 1) *
-                compression_params.polynomial_size;
+  auto in_len = num_glwes * compression_params.glwe_dimension *
+                    compression_params.polynomial_size +
+                num_lwes;
+
  auto number_bits_to_pack = in_len * log_modulus;
-  auto nbits = sizeof(Torus) * 8;
+
  // number_bits_to_pack.div_ceil(Scalar::BITS)
+  auto nbits = sizeof(Torus) * 8;
  auto out_len = (number_bits_to_pack + nbits - 1) / nbits;

-  // Last GLWE
-  number_bits_to_pack = in_len * log_modulus;
-  auto last_out_len = (number_bits_to_pack + nbits - 1) / nbits;
-
-  auto num_coeffs = (num_glwes - 1) * out_len + last_out_len;
-
  int num_blocks = 0, num_threads = 0;
-  getNumBlocksAndThreads(num_coeffs, 1024, num_blocks, num_threads);
+  getNumBlocksAndThreads(out_len, 1024, num_blocks, num_threads);

  dim3 grid(num_blocks);
  dim3 threads(num_threads);
  pack<Torus><<<grid, threads, 0, stream>>>(array_out, array_in, log_modulus,
-                                            num_coeffs, in_len, out_len);
+                                            out_len, in_len, out_len);
  check_cuda_error(cudaGetLastError());
 }

@@ -99,14 +99,13 @@ host_integer_compress(cudaStream_t const *streams, uint32_t const *gpu_indexes,
  uint32_t lwe_in_size = input_lwe_dimension + 1;
  uint32_t glwe_out_size = (compression_params.glwe_dimension + 1) *
                           compression_params.polynomial_size;
-  uint32_t num_glwes_for_compression =
-      num_radix_blocks / mem_ptr->lwe_per_glwe + 1;
+  uint32_t num_glwes =
+      (num_radix_blocks + mem_ptr->lwe_per_glwe - 1) / mem_ptr->lwe_per_glwe;

  // Keyswitch LWEs to GLWE
  auto tmp_glwe_array_out = mem_ptr->tmp_glwe_array_out;
  cuda_memset_async(tmp_glwe_array_out, 0,
-                    num_glwes_for_compression *
-                        (compression_params.glwe_dimension + 1) *
+                    num_glwes * (compression_params.glwe_dimension + 1) *
                        compression_params.polynomial_size * sizeof(Torus),
                    streams[0], gpu_indexes[0]);
  auto fp_ks_buffer = mem_ptr->fp_ks_buffer;
@@ -131,23 +130,21 @@ host_integer_compress(cudaStream_t const *streams, uint32_t const *gpu_indexes,
  // Modulus switch
  host_modulus_switch_inplace<Torus>(
      streams[0], gpu_indexes[0], tmp_glwe_array_out,
-      num_glwes_for_compression * (compression_params.glwe_dimension + 1) *
-          compression_params.polynomial_size,
+      num_glwes * compression_params.glwe_dimension *
+              compression_params.polynomial_size +
+          num_radix_blocks,
      mem_ptr->storage_log_modulus);

  host_pack<Torus>(streams[0], gpu_indexes[0], glwe_array_out,
-                   tmp_glwe_array_out, num_glwes_for_compression,
-                   num_radix_blocks, mem_ptr);
+                   tmp_glwe_array_out, num_glwes, num_radix_blocks, mem_ptr);
 }

 template <typename Torus>
 __global__ void extract(Torus *glwe_array_out, Torus const *array_in,
-                        uint32_t index, uint32_t log_modulus,
-                        uint32_t input_len, uint32_t initial_out_len) {
+                        uint32_t log_modulus, uint32_t initial_out_len) {
  auto nbits = sizeof(Torus) * 8;

  auto i = threadIdx.x + blockIdx.x * blockDim.x;
-  auto chunk_array_in = array_in + index * input_len;
  if (i < initial_out_len) {
    // Unpack
    Torus mask = ((Torus)1 << log_modulus) - 1;
@@ -161,12 +158,11 @@ __global__ void extract(Torus *glwe_array_out, Torus const *array_in,

    Torus unpacked_i;
    if (start_block == end_block_inclusive) {
-      auto single_part = chunk_array_in[start_block] >> start_remainder;
+      auto single_part = array_in[start_block] >> start_remainder;
      unpacked_i = single_part & mask;
    } else {
-      auto first_part = chunk_array_in[start_block] >> start_remainder;
-      auto second_part = chunk_array_in[start_block + 1]
-                         << (nbits - start_remainder);
+      auto first_part = array_in[start_block] >> start_remainder;
+      auto second_part = array_in[start_block + 1] << (nbits - start_remainder);

      unpacked_i = (first_part | second_part) & mask;
    }
@@ -177,6 +173,7 @@ __global__ void extract(Torus *glwe_array_out, Torus const *array_in,
 }

 /// Extracts the glwe_index-nth GLWE ciphertext
+/// This function follows the naming used in the CPU implementation
 template <typename Torus>
 __host__ void host_extract(cudaStream_t stream, uint32_t gpu_index,
                           Torus *glwe_array_out, Torus const *array_in,
@@ -188,36 +185,51 @@ __host__ void host_extract(cudaStream_t stream, uint32_t gpu_index,
  cuda_set_device(gpu_index);

  auto compression_params = mem_ptr->compression_params;
-
  auto log_modulus = mem_ptr->storage_log_modulus;
+  auto glwe_ciphertext_size = (compression_params.glwe_dimension + 1) *
+                              compression_params.polynomial_size;
+
+  uint32_t body_count = mem_ptr->body_count;
+  auto num_glwes = (body_count + compression_params.polynomial_size - 1) /
+                   compression_params.polynomial_size;
+
+  // Compressed length of the compressed GLWE we want to extract
+  if (mem_ptr->body_count % compression_params.polynomial_size == 0)
+    body_count = compression_params.polynomial_size;
+  else if (glwe_index == num_glwes - 1)
+    body_count = mem_ptr->body_count % compression_params.polynomial_size;
+  else
+    body_count = compression_params.polynomial_size;

-  uint32_t body_count =
-      std::min(mem_ptr->body_count, compression_params.polynomial_size);
  auto initial_out_len =
      compression_params.glwe_dimension * compression_params.polynomial_size +
      body_count;

-  auto compressed_glwe_accumulator_size =
-      (compression_params.glwe_dimension + 1) *
-      compression_params.polynomial_size;
-  auto number_bits_to_unpack = compressed_glwe_accumulator_size * log_modulus;
+  // Calculates how many bits this particular GLWE shall use
+  auto number_bits_to_unpack = initial_out_len * log_modulus;
  auto nbits = sizeof(Torus) * 8;
-  // number_bits_to_unpack.div_ceil(Scalar::BITS)
  auto input_len = (number_bits_to_unpack + nbits - 1) / nbits;

-  // We assure the tail of the glwe is zeroed
-  auto zeroed_slice = glwe_array_out + initial_out_len;
-  cuda_memset_async(zeroed_slice, 0,
-                    (compression_params.polynomial_size - body_count) *
-                        sizeof(Torus),
-                    stream, gpu_index);
+  // Calculates how many bits a full-packed GLWE shall use
+  number_bits_to_unpack = glwe_ciphertext_size * log_modulus;
+  auto len = (number_bits_to_unpack + nbits - 1) / nbits;
+  // Uses that length to set the input pointer
+  auto chunk_array_in = array_in + glwe_index * len;
+
+  // Ensure the tail of the GLWE is zeroed
+  if (initial_out_len < glwe_ciphertext_size) {
+    auto zeroed_slice = glwe_array_out + initial_out_len;
+    cuda_memset_async(zeroed_slice, 0,
+                      (glwe_ciphertext_size - initial_out_len) * sizeof(Torus),
+                      stream, gpu_index);
+  }
+
  int num_blocks = 0, num_threads = 0;
  getNumBlocksAndThreads(initial_out_len, 128, num_blocks, num_threads);
  dim3 grid(num_blocks);
  dim3 threads(num_threads);
-  extract<Torus><<<grid, threads, 0, stream>>>(glwe_array_out, array_in,
-                                               glwe_index, log_modulus,
-                                               input_len, initial_out_len);
+  extract<Torus><<<grid, threads, 0, stream>>>(glwe_array_out, chunk_array_in,
+                                               log_modulus, initial_out_len);
  check_cuda_error(cudaGetLastError());
 }

@@ -235,18 +247,13 @@ __host__ void host_integer_decompress(

  auto compression_params = h_mem_ptr->compression_params;
  auto lwe_per_glwe = compression_params.polynomial_size;
-  if (indexes_array_size > lwe_per_glwe)
-    PANIC("Cuda error: too many LWEs to decompress. The number of LWEs should "
-          "be smaller than "
-          "polynomial_size.")

  auto num_radix_blocks = h_mem_ptr->num_radix_blocks;
  if (num_radix_blocks != indexes_array_size)
    PANIC("Cuda error: wrong number of LWEs in decompress: the number of LWEs "
          "should be the same as indexes_array_size.")

-  // the first element is the last index in h_indexes_array that lies in the
-  // related GLWE
+  // the first element is the number of LWEs that lie in the related GLWE
  std::vector<std::pair<int, Torus *>> glwe_vec;

  // Extract all GLWEs
@@ -257,7 +264,7 @@ __host__ void host_integer_decompress(
  auto extracted_glwe = h_mem_ptr->tmp_extracted_glwe;
  host_extract<Torus>(streams[0], gpu_indexes[0], extracted_glwe,
                      d_packed_glwe_in, current_glwe_index, h_mem_ptr);
-  glwe_vec.push_back(std::make_pair(0, extracted_glwe));
+  glwe_vec.push_back(std::make_pair(1, extracted_glwe));
  for (int i = 1; i < indexes_array_size; i++) {
    auto glwe_index = h_indexes_array[i] / lwe_per_glwe;
    if (glwe_index != current_glwe_index) {
@@ -266,10 +273,10 @@ __host__ void host_integer_decompress(
      // Extracts a new GLWE
      host_extract<Torus>(streams[0], gpu_indexes[0], extracted_glwe,
                          d_packed_glwe_in, glwe_index, h_mem_ptr);
-      glwe_vec.push_back(std::make_pair(i, extracted_glwe));
+      glwe_vec.push_back(std::make_pair(1, extracted_glwe));
    } else {
-      // Updates the index
-      glwe_vec.back().first++;
+      // Updates the quantity
+      ++glwe_vec.back().first;
    }
  }
  // Sample extract all LWEs
@@ -279,17 +286,16 @@ __host__ void host_integer_decompress(
  uint32_t current_idx = 0;
  auto d_indexes_array_chunk = d_indexes_array;
  for (const auto &max_idx_and_glwe : glwe_vec) {
-    uint32_t last_idx = max_idx_and_glwe.first;
+    const auto num_lwes = max_idx_and_glwe.first;
    extracted_glwe = max_idx_and_glwe.second;

-    auto num_lwes = last_idx + 1 - current_idx;
-    cuda_glwe_sample_extract_64(streams[0], gpu_indexes[0], extracted_lwe,
-                                extracted_glwe, d_indexes_array_chunk, num_lwes,
-                                compression_params.glwe_dimension,
-                                compression_params.polynomial_size);
+    cuda_glwe_sample_extract_64(
+        streams[0], gpu_indexes[0], extracted_lwe, extracted_glwe,
+        d_indexes_array_chunk, num_lwes, compression_params.polynomial_size,
+        compression_params.glwe_dimension, compression_params.polynomial_size);
    d_indexes_array_chunk += num_lwes;
    extracted_lwe += num_lwes * lwe_accumulator_size;
-    current_idx = last_idx;
+    current_idx += num_lwes;
  }

  // Reset
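Reviewer note: the size arithmetic introduced in compression.cuh can be checked on the host without a GPU. The sketch below mirrors the formulas visible in this diff (packed length in `host_pack`, per-GLWE body count in `host_extract`, and per-GLWE grouping in `host_integer_decompress`). The function names `div_ceil`, `packed_len`, `bodies_in_glwe` and `group_by_glwe` are hypothetical and do not exist in the backend; the grouping helper returns (glwe_index, count) pairs for readability, whereas the diff stores (count, pointer) pairs.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: ceiling division, as in
// (number_bits_to_pack + nbits - 1) / nbits.
constexpr uint64_t div_ceil(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// Number of Torus words written by the pack kernel, following host_pack:
// every GLWE contributes glwe_dimension * polynomial_size mask coefficients,
// plus one body coefficient per LWE, each stored on log_modulus bits.
uint64_t packed_len(uint64_t num_glwes, uint64_t glwe_dimension,
                    uint64_t polynomial_size, uint64_t num_lwes,
                    uint64_t log_modulus, uint64_t torus_bits) {
  uint64_t in_len = num_glwes * glwe_dimension * polynomial_size + num_lwes;
  return div_ceil(in_len * log_modulus, torus_bits);
}

// Number of bodies stored in GLWE `glwe_index`, following host_extract:
// only the last GLWE may be partially filled.
uint64_t bodies_in_glwe(uint64_t total_bodies, uint64_t polynomial_size,
                        uint64_t glwe_index) {
  uint64_t num_glwes = div_ceil(total_bodies, polynomial_size);
  uint64_t remainder = total_bodies % polynomial_size;
  if (remainder != 0 && glwe_index == num_glwes - 1)
    return remainder;
  return polynomial_size;
}

// Groups consecutive LWE indexes by the GLWE they live in and counts them,
// following the loop in host_integer_decompress.
std::vector<std::pair<uint64_t, uint64_t>>
group_by_glwe(const std::vector<uint64_t> &indexes, uint64_t lwe_per_glwe) {
  std::vector<std::pair<uint64_t, uint64_t>> runs;
  for (uint64_t idx : indexes) {
    uint64_t glwe_index = idx / lwe_per_glwe;
    if (runs.empty() || runs.back().first != glwe_index)
      runs.emplace_back(glwe_index, 1); // a new GLWE starts with one LWE
    else
      ++runs.back().second; // same GLWE: update the quantity
  }
  return runs;
}
```

For instance, with a polynomial size of 4 and 10 bodies in total, the first two GLWEs hold 4 bodies each and the last one holds 2.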
--- a/backends/tfhe-cuda-backend/cuda/src/integer/div_rem.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/div_rem.cuh
@@ -93,7 +93,8 @@ template <typename Torus> struct lwe_ciphertext_list {

  // return block with `index`
  Torus *get_block(size_t index) {
-    assert(index < len);
+    if (index >= len)
+      PANIC("Cuda error: out of bound index")
    return &data[index * big_lwe_size];
  }

@@ -103,15 +104,17 @@ template <typename Torus> struct lwe_ciphertext_list {
  void pop() {
    if (len > 0)
      len--;
-    else
-      assert(len > 0);
+    else if (len <= 0)
+      PANIC("Cuda error: invalid length")
  }

  // insert ciphertext at index `ind`
  void insert(size_t ind, Torus *ciphertext_block, cudaStream_t stream,
              uint32_t gpu_index) {
-    assert(ind <= len);
-    assert(len < max_blocks);
+    if (ind > len)
+      PANIC("Cuda error: invalid index")
+    if (len >= max_blocks)
+      PANIC("Cuda error: invalid length")

    size_t insert_offset = ind * big_lwe_size;

@@ -129,7 +132,8 @@ template <typename Torus> struct lwe_ciphertext_list {

  // push ciphertext at the end of `data`
  void push(Torus *ciphertext_block, cudaStream_t stream, uint32_t gpu_index) {
-    assert(len < max_blocks);
+    if (len >= max_blocks)
+      PANIC("Cuda error: invalid length")

    size_t offset = len * big_lwe_size;
    cuda_memcpy_async_gpu_to_gpu(&data[offset], ciphertext_block,
@@ -140,7 +144,8 @@ template <typename Torus> struct lwe_ciphertext_list {
  // duplicate ciphertext into `number_of_blocks` ciphertexts
  void fill_with_same_ciphertext(Torus *ciphertext, size_t number_of_blocks,
                                 cudaStream_t stream, uint32_t gpu_index) {
-    assert(number_of_blocks <= max_blocks);
+    if (number_of_blocks > max_blocks)
+      PANIC("Cuda error: invalid number of blocks")

    for (size_t i = 0; i < number_of_blocks; i++) {
      Torus *dest = &data[i * big_lwe_size];
@@ -414,7 +419,9 @@ __host__ void host_unsigned_integer_div_rem_kb(
        merged_interesting_remainder, 0, merged_interesting_remainder.len - 1,
        streams[0], gpu_indexes[0]);

-    assert(merged_interesting_remainder.len == interesting_divisor.len);
+    if (merged_interesting_remainder.len != interesting_divisor.len)
+      PANIC("Cuda error: merged interesting remainder and interesting divisor "
+            "should have the same number of blocks")

    // `new_remainder` is not initialized yet, so need to set length
    new_remainder.len = merged_interesting_remainder.len;
@@ -437,7 +444,7 @@ __host__ void host_unsigned_integer_div_rem_kb(
      mem_ptr->overflow_sub_mem->update_lut_indexes(
          streams, gpu_indexes, first_indexes, second_indexes, scalar_indexes,
          merged_interesting_remainder.len);
-      host_integer_overflowing_sub<uint64_t>(
+      legacy_host_integer_overflowing_sub<uint64_t>(
          streams, gpu_indexes, gpu_count, new_remainder.data,
          (uint64_t *)merged_interesting_remainder.data,
          interesting_divisor.data, subtraction_overflowed.data,
@@ -459,7 +466,7 @@ __host__ void host_unsigned_integer_div_rem_kb(
        // We could call unchecked_scalar_ne
        // But we are in the special case where scalar == 0
        // So we can skip some stuff
-        host_compare_with_zero_equality<Torus>(
+        legacy_host_compare_with_zero_equality<Torus>(
            streams, gpu_indexes, gpu_count, tmp_1.data, trivial_blocks.data,
            mem_ptr->comparison_buffer, bsks, ksks, trivial_blocks.len,
            mem_ptr->comparison_buffer->eq_buffer->is_non_zero_lut);
@@ -467,7 +474,7 @@ __host__ void host_unsigned_integer_div_rem_kb(
        tmp_1.len =
            ceil_div(trivial_blocks.len, message_modulus * carry_modulus - 1);

-        is_at_least_one_comparisons_block_true<Torus>(
+        legacy_is_at_least_one_comparisons_block_true<Torus>(
            streams, gpu_indexes, gpu_count,
            at_least_one_upper_block_is_non_zero.data, tmp_1.data,
            mem_ptr->comparison_buffer, bsks, ksks, tmp_1.len);
@@ -575,8 +582,12 @@ __host__ void host_unsigned_integer_div_rem_kb(
      cuda_synchronize_stream(mem_ptr->sub_streams_3[j], gpu_indexes[j]);
    }

-    assert(first_trivial_block - 1 == cleaned_merged_interesting_remainder.len);
-    assert(first_trivial_block - 1 == new_remainder.len);
+    if (first_trivial_block != cleaned_merged_interesting_remainder.len)
+      PANIC("Cuda error: first_trivial_block should be equal to "
+            "clean_merged_interesting_remainder num blocks")
+    if (first_trivial_block != new_remainder.len)
+      PANIC("Cuda error: first_trivial_block should be equal to new_remainder "
+            "num blocks")

    remainder1.copy_from(cleaned_merged_interesting_remainder, 0,
                         first_trivial_block - 1, streams[0], gpu_indexes[0]);
@@ -584,7 +595,9 @@ __host__ void host_unsigned_integer_div_rem_kb(
                         gpu_indexes[0]);
  }

-  assert(remainder1.len == remainder2.len);
+  if (remainder1.len != remainder2.len)
+    PANIC("Cuda error: remainder1 and remainder2 should have the same number "
+          "of blocks")

  // Clean the quotient and remainder
  // as even though they have no carries, they are not at nominal noise level
--- a/backends/tfhe-cuda-backend/cuda/src/integer/integer.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/integer.cu
@@ -131,19 +131,17 @@ void cuda_add_and_propagate_single_carry_kb_64_inplace(

 void cuda_integer_overflowing_sub_kb_64_inplace(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *lhs_array, const void *rhs_array, void *overflow_block,
-    const void *input_borrow, int8_t *mem_ptr, void *const *bsks,
-    void *const *ksks, uint32_t num_blocks, uint32_t compute_overflow,
+    CudaRadixCiphertextFFI *lhs_array, const CudaRadixCiphertextFFI *rhs_array,
+    CudaRadixCiphertextFFI *overflow_block,
+    const CudaRadixCiphertextFFI *input_borrow, int8_t *mem_ptr,
+    void *const *bsks, void *const *ksks, uint32_t compute_overflow,
    uint32_t uses_input_borrow) {

  host_integer_overflowing_sub<uint64_t>(
-      (cudaStream_t const *)streams, gpu_indexes, gpu_count,
-      static_cast<uint64_t *>(lhs_array), static_cast<uint64_t *>(lhs_array),
-      static_cast<const uint64_t *>(rhs_array),
-      static_cast<uint64_t *>(overflow_block),
-      static_cast<const uint64_t *>(input_borrow),
+      (cudaStream_t const *)streams, gpu_indexes, gpu_count, lhs_array,
+      lhs_array, rhs_array, overflow_block, input_borrow,
      (int_borrow_prop_memory<uint64_t> *)mem_ptr, bsks, (uint64_t **)ksks,
-      num_blocks, compute_overflow, uses_input_borrow);
+      compute_overflow, uses_input_borrow);
 }

 void cleanup_cuda_propagate_single_carry(void *const *streams,
--- a/backends/tfhe-cuda-backend/cuda/src/integer/integer.cuh
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/integer.cuh
--- a/backends/tfhe-cuda-backend/cuda/src/integer/multiplication.cu
+++ b/backends/tfhe-cuda-backend/cuda/src/integer/multiplication.cu
@@ -13,7 +13,7 @@
 * as output ids, -1 value as an input id means that zero ciphertext will be
 * copied on output index.
 */
-void generate_ids_update_degrees(int *terms_degree, size_t *h_lwe_idx_in,
+void generate_ids_update_degrees(uint64_t *terms_degree, size_t *h_lwe_idx_in,
                                 size_t *h_lwe_idx_out,
                                 int32_t *h_smart_copy_in,
                                 int32_t *h_smart_copy_out, size_t ch_amount,
@@ -127,66 +127,53 @@ void scratch_cuda_integer_mult_radix_ciphertext_kb_64(
 */
 void cuda_integer_mult_radix_ciphertext_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *radix_lwe_out, void const *radix_lwe_left, bool const is_bool_left,
-    void const *radix_lwe_right, bool const is_bool_right, void *const *bsks,
-    void *const *ksks, int8_t *mem_ptr, uint32_t polynomial_size,
-    uint32_t num_blocks) {
+    CudaRadixCiphertextFFI *radix_lwe_out,
+    CudaRadixCiphertextFFI const *radix_lwe_left, bool const is_bool_left,
+    CudaRadixCiphertextFFI const *radix_lwe_right, bool const is_bool_right,
+    void *const *bsks, void *const *ksks, int8_t *mem_ptr,
+    uint32_t polynomial_size, uint32_t num_blocks) {

  switch (polynomial_size) {
  case 256:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<256>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 512:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<512>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 1024:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<1024>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 2048:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<2048>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 4096:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<4096>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 8192:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<8192>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  case 16384:
    host_integer_mult_radix_kb<uint64_t, AmortizedDegree<16384>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<const uint64_t *>(radix_lwe_left), is_bool_left,
-        static_cast<const uint64_t *>(radix_lwe_right), is_bool_right, bsks,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_left, is_bool_left, radix_lwe_right, is_bool_right, bsks,
        (uint64_t **)(ksks), (int_mul_memory<uint64_t> *)mem_ptr, num_blocks);
    break;
  default:
@@ -226,79 +213,73 @@ void scratch_cuda_integer_radix_partial_sum_ciphertexts_vec_kb_64(

 void cuda_integer_radix_partial_sum_ciphertexts_vec_kb_64(
    void *const *streams, uint32_t const *gpu_indexes, uint32_t gpu_count,
-    void *radix_lwe_out, void *radix_lwe_vec, uint32_t num_radix_in_vec,
-    int8_t *mem_ptr, void *const *bsks, void *const *ksks,
-    uint32_t num_blocks_in_radix) {
+    CudaRadixCiphertextFFI *radix_lwe_out,
+    CudaRadixCiphertextFFI *radix_lwe_vec, int8_t *mem_ptr, void *const *bsks,
+    void *const *ksks) {

  auto mem = (int_sum_ciphertexts_vec_memory<uint64_t> *)mem_ptr;
-
-  int *terms_degree =
-      (int *)malloc(num_blocks_in_radix * num_radix_in_vec * sizeof(int));
-
-  for (int i = 0; i < num_radix_in_vec * num_blocks_in_radix; i++) {
-    terms_degree[i] = mem->params.message_modulus - 1;
-  }
+  if (radix_lwe_vec->num_radix_blocks % radix_lwe_out->num_radix_blocks != 0)
+    PANIC("Cuda error: input vector length should be a multiple of the "
+          "output's number of radix blocks")

  switch (mem->params.polynomial_size) {
  case 512:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<512>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  case 1024:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t,
                                                AmortizedDegree<1024>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  case 2048:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t,
                                                AmortizedDegree<2048>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  case 4096:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t,
                                                AmortizedDegree<4096>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  case 8192:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t,
                                                AmortizedDegree<8192>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  case 16384:
    host_integer_partial_sum_ciphertexts_vec_kb<uint64_t,
                                                AmortizedDegree<16384>>(
-        (cudaStream_t *)(streams), gpu_indexes, gpu_count,
-        static_cast<uint64_t *>(radix_lwe_out),
-        static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsks,
-        (uint64_t **)(ksks), mem, num_blocks_in_radix, num_radix_in_vec,
+        (cudaStream_t *)(streams), gpu_indexes, gpu_count, radix_lwe_out,
+        radix_lwe_vec, bsks, (uint64_t **)(ksks), mem,
+        radix_lwe_out->num_radix_blocks,
+        radix_lwe_vec->num_radix_blocks / radix_lwe_out->num_radix_blocks,
        nullptr);
    break;
  default:
    PANIC("Cuda error (integer multiplication): unsupported polynomial size. "
          "Supported N's are powers of two in the interval [256..16384].")
  }
-
-  free(terms_degree);
 }

 void cleanup_cuda_integer_radix_partial_sum_ciphertexts_vec(
Author	SHA1	Message	Date
Beka Barbakadze	aa3f83d016	all tests are passing	2025-03-10 18:49:37 +04:00
Beka Barbakadze	320310a6b7	fix some bugs	2025-03-10 15:59:48 +04:00
Beka Barbakadze	f704a38814	fix fmt	2025-03-07 18:34:41 +04:00
Beka Barbakadze	1659c07c89	fix test	2025-03-07 18:31:07 +04:00
Beka Barbakadze	6cbe56283e	feat(gpu): add classic default 128 bit pbs	2025-03-07 17:37:13 +04:00
Beka Barbakadze	e90ec935a1	change parameters	2025-03-05 17:37:51 +04:00
Beka Barbakadze	7f3ac17cee	feat(gpu): Implement 128 bit classic pbs	2025-03-05 16:48:48 +04:00
Beka Barbakadze	c1d534efa4	refactor(gpu): refactor double2 operators to use cuda intrinsics	2025-03-03 17:29:39 +01:00
David Testé	47589ea9a7	chore(bench): run core_crypto benchmarks on all parameters p-fail This also add KS-PBS benchmarks.	2025-03-03 16:01:17 +01:00
Agnes Leroy	ce327b7b27	chore(gpu): refactor mul/scalar mul to track noise/degree	2025-03-03 13:51:00 +01:00
Arthur Meyre	877d0234ac	fix: fix the atomic pattern used to cast in trivium and a test in shortint - parameters are optimized for a clean ciphertext, the ciphertext being keyswitched was noisy	2025-03-03 13:10:11 +01:00
dependabot[bot]	f457ac40e5	chore(deps): bump codecov/codecov-action from 5.3.1 to 5.4.0 Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.3.1 to 5.4.0. - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](`13ce06bfc6...0565863a31`) --- updated-dependencies: - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2025-03-03 11:53:07 +01:00
dependabot[bot]	d9feb57b92	chore(deps): bump slsa-framework/slsa-github-generator Bumps [slsa-framework/slsa-github-generator](https://github.com/slsa-framework/slsa-github-generator) from 2.0.0 to 2.1.0. - [Release notes](https://github.com/slsa-framework/slsa-github-generator/releases) - [Changelog](https://github.com/slsa-framework/slsa-github-generator/blob/main/CHANGELOG.md) - [Commits](https://github.com/slsa-framework/slsa-github-generator/compare/v2.0.0...v2.1.0) --- updated-dependencies: - dependency-name: slsa-framework/slsa-github-generator dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2025-03-03 11:52:56 +01:00
dependabot[bot]	fa41fb3ad4	chore(deps): bump actions/cache from 4.2.1 to 4.2.2 Bumps [actions/cache](https://github.com/actions/cache) from 4.2.1 to 4.2.2. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](`0c907a75c2...d4323d4df1`) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2025-03-03 11:52:45 +01:00
dependabot[bot]	375a482d0b	chore(deps): bump actions/download-artifact from 4.1.8 to 4.1.9 Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4.1.8 to 4.1.9. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](`fa0a91b85d...cc20338598`) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2025-03-03 11:52:37 +01:00
Beka Barbakadze	7e941b29c1	refactor(gpu): use hexes to initialize twiddles for 64 bit fft	2025-03-03 14:44:12 +04:00
David Testé	3897137a3f	chore(ci): fallback on permanent h100 instance on shortage When a shortage occurs on n3-H100x1 instances on Hyperstack, we'll fall back on the permanent one registered on GitHub. This can be done by using 'h100x1' as runner label to run a job on it.	2025-03-03 11:38:32 +01:00
Beka Barbakadze	3988c85d6b	feat(gpu): Implement fft128 in cuda backend	2025-03-03 12:27:46 +04:00
Agnes Leroy	c1bf43eac1	feat(gpu): add a function to set a CudaLweList to 0	2025-02-28 16:46:17 +01:00
Agnes Leroy	95863e1e36	chore(gpu): plug in signed gpu tests in the hl api	2025-02-28 13:42:52 +01:00
Pedro Alves	a508f4cadc	fix(gpu): enforce tighter bounds on compression output	2025-02-28 07:12:36 -03:00
Agnes Leroy	dad278cdd3	chore(gpu): fix typo in doc	2025-02-28 11:12:17 +01:00
tmontaigu	699e24f735	docs: rename to README as its needed for link to work	2025-02-28 10:23:46 +01:00
Agnes Leroy	12ed899b34	chore(gpu): trigger long run tests every evening, edit workflow name	2025-02-27 17:22:02 +01:00
David Testé	8565b79a28	chore(ci): switch environment and add fallback for gpu profiles Switch n3-H100-SXM5x8 to US-1 as CANADA is out of stock on this instance. Also L40 instances fallback on n3-RTX-A6000x1 to mitigate resource shortages issues.	2025-02-27 16:59:04 +01:00
Agnes Leroy	1d7f9f1152	chore(gpu): refactor comparisons to track noise/degree	2025-02-27 16:57:24 +01:00
tmontaigu	3ecdd0d1bc	fix(c-api): add missing casts cast_into FheUint{12, 512, 1024, 2048} were missing from the C API	2025-02-27 16:30:51 +01:00
J-B Orfila	14517ca111	docs: add link in the README	2025-02-27 15:09:41 +01:00
Agnes Leroy	a2eceabd82	fix(gpu): fix scalar comparisons with 1 block	2025-02-27 13:11:36 +01:00
Guillermo Oyarzun	968ab31f27	fix(cpu): fix corner case when estimating the num blocks required	2025-02-27 11:38:17 +01:00
Agnes Leroy	74d5a88f1b	chore(gpu): replace asserts with panic	2025-02-27 11:36:59 +01:00
Agnes Leroy	e18ce00f63	chore(gpu): increase 4090 test timeout	2025-02-27 11:27:55 +01:00
tmontaigu	7ec8f901da	docs(js): update JS example The example was still using CompactFheUint32List which as been removed in favor of the more generic CompactCiphertextList	2025-02-27 10:54:08 +01:00
Arthur Meyre	610406ac27	chore: link CONTRIBUTING.md in the documentation	2025-02-26 16:07:44 +01:00
J-B Orfila	4162ff5b64	docs: security disclaimer updated	2025-02-26 16:07:31 +01:00
J-B Orfila	efd06c5b43	docs: correcting parameter section	2025-02-26 16:07:31 +01:00
Nicolas Sarlin	bd2a488f13	chore(doc): add a doc page about parameters	2025-02-26 16:07:31 +01:00
David Testé	9f48db2a90	chore(ci): fix workflow concurrency condition Referencing current branch using github.head_ref is a leftover from handling pull_request_target event. This event being removed, there is no need to be specific and we can instead use 'github.workflow_ref' which is more robust.	2025-02-26 14:11:42 +01:00
Pedro Alves	f962716fa5	feat(gpu): refactor the sample extract entry point so the user can pass how many LWEs should be extracted per GLWE	2025-02-26 11:58:47 +01:00
Arthur Meyre	ec3f3a1b52	chore(docs): use tilde requirements to minimize breakage on users' end	2025-02-25 17:59:23 +01:00
Arthur Meyre	ab36f36116	chore: update README	2025-02-25 17:59:23 +01:00
David Testé	06638c33d7	chore(ci): add contributing guidance	2025-02-25 17:21:42 +01:00
David Testé	e583212e6d	docs: refactor and update benchmarks pages Benchmarks tables are rendered as descriptive SVG images. Sort results by backend to have a clearer view in tree of content. PBS benchmarks now display results for various p-fail and several precisions.	2025-02-25 12:47:12 +01:00
David Testé	486ec9f053	chore(ci): update cpu aws ami and install git-lfs Several network errors occurred while trying to install git-lfs from within backward compatibility tests workflow. Having git-lfs installed directly in the Amazon Machine Image fix this issue.	2025-02-25 12:45:47 +01:00
Arthur Meyre	0216e640bf	test: make the bound on the base variance check a bit looser We have seen failures, we need proper confidence intervals on these tests	2025-02-24 17:47:30 +01:00
David Testé	d00224caa3	chore(ci): add should-run to tfhe-fft and tfhe-ntt tests This is done to avoid testing tfhe-ftt/ntt crates if nothing changes in their source files. However, these tests would be run unconditionally on each push on main branch.	2025-02-24 16:35:31 +01:00
dependabot[bot]	bd06971680	chore(deps): bump actions/cache from 4.2.0 to 4.2.1 Bumps [actions/cache](https://github.com/actions/cache) from 4.2.0 to 4.2.1. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](`1bd1e32a3b...0c907a75c2`) --- updated-dependencies: - dependency-name: actions/cache dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2025-02-24 11:46:53 +01:00
dependabot[bot]	58688cd401	chore(deps): bump actions/upload-artifact from 4.6.0 to 4.6.1 Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.0 to 4.6.1. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](`65c4c4a1dd...4cec3d8aa0`) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2025-02-24 11:46:44 +01:00
Agnes Leroy	2757f7209a	chore(gpu): update backend readme	2025-02-24 11:22:14 +01:00
Mayeul@Zama	b38b119746	chore(docs): add HL strings documentation	2025-02-24 10:58:29 +01:00
Pedro Alves	219c755a77	fix(gpu): fix wrong number of blocks used in cast	2025-02-21 20:09:54 -03:00
Mayeul@Zama	fc4abd5fb1	chore: update toolchain	2025-02-21 15:03:23 +01:00
Guillermo Oyarzun	5de1445cbf	fix(gpu): fix wrong assert in division	2025-02-21 11:27:03 +01:00
Yuxi Zhao	6b21bff1e8	chore(docs): improve nagivation	2025-02-20 17:29:36 +01:00
Arthur Meyre	a1dc260fb2	chore(ci): make md doctest checker a bit more versatile on user errors	2025-02-20 17:29:36 +01:00
David Testé	5d9af12f6e	chore(ci): fix release workflow for tfhe-versionable tfhe-versionable crate depends on tfhe-versionable-derive. Workflow, now ensure that derive crate is published before attempting to package tfhe-versionable. Dry-run option is removed since it cannot be use correctly due the reason aforementioned.	2025-02-20 11:44:58 +01:00
Guillermo Oyarzun	32c93876d7	feat(gpu): enable division in high level api	2025-02-20 10:33:07 +01:00
Guillermo Oyarzun	bede76be82	feat(gpu): enable if then else for boolean ciphertexts in hlapi	2025-02-19 12:50:38 +01:00
Guillermo Oyarzun	508713f926	fix(gpu): enable large integers for the classical pbs flavors	2025-02-19 06:52:49 -03:00
Guillermo Oyarzun	6d7b32dd0a	fix(gpu): enable large integers other multi bit pbs	2025-02-19 06:52:49 -03:00
Pedro Alves	15f7ba20aa	fix(gpu): Remove unnecessary and incorrect bound check for decompression Removed unnecessary bounds check for the number of LWEs against polynomial size.	2025-02-19 06:17:11 -03:00
Arthur Meyre	4fa59cdd6d	chore(ci): fix web packages publish with provenance - re-enabled required permissions, notably write id-token	2025-02-18 16:18:59 +01:00
Arthur Meyre	69d5b7206e	chore(ci): fix packaging job to also have exported env vars	2025-02-18 15:24:24 +01:00
Arthur Meyre	a9cb617fe8	chore(ci): fix cuda release workflow to have rust re-installed for cargo	2025-02-18 14:58:40 +01:00
Arthur Meyre	54962af887	chore: update copyright year to 2025 co-authored-by: wgyt <wgythe@gmail.com>	2025-02-18 13:19:28 +01:00
Arthur Meyre	cb7d77f59a	feat: add 2^-128 parameters	2025-02-18 13:19:28 +01:00
Arthur Meyre	0ecd5e1508	chore: bump tfhe to 1.0.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	dc8b293895	chore: bump tfhe-cuda-backend to 0.8.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	4ca4203c02	chore: bump tfhe-zk-pok to 0.5.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	dfa6b2827a	chore: bump tfhe-fft to 0.8.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	06ae56b389	chore: bump tfhe-ntt to 0.5.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	f0238bab16	chore: bump tfhe-versionable to 0.5.0	2025-02-18 13:19:28 +01:00
Arthur Meyre	e4e03277b5	fix(shortint): fix CompressedModulusSwitchNoiseReductionKey generation - was using the wrong seeded encryption API resulting in garbage values when decompressing	2025-02-18 13:19:28 +01:00
Arthur Meyre	49566cd7cf	refactor(core): rename foot-gunny functions for seeded entities encryption	2025-02-18 13:19:28 +01:00
Guillermo Oyarzun	e0df5364f9	fix(gpu): enable large number of samples in pbs tbc	2025-02-18 07:26:28 -03:00
tmontaigu	4650a5e3e4	chore(hlapi): add FhetTypes::AsciiString	2025-02-18 10:22:00 +01:00
tmontaigu	c86d2616c1	refactor(hlapi)!: introduce HlCompactable trait The purpose of this introducing this trait is to purposefully create a breaking change so that later we have more freedom on refactoring some stuff with less risk of breaking	2025-02-18 10:22:00 +01:00
tmontaigu	a40501691a	feat(hlapi): allow strings in compact/compressed list	2025-02-18 10:22:00 +01:00
David Testé	c7b0fe37ec	chore(ci): enable throughput benchmarks for zk-pok	2025-02-18 09:56:49 +01:00
Guillermo Oyarzun	0f44ffdf30	fix(gpu): enable larger number of samples in the keyswitch	2025-02-17 19:34:26 -03:00
tmontaigu	380bc9b91a	fix: rotations of 1 blocks of 4_4	2025-02-17 17:27:43 +01:00
Nicolas Sarlin	0809eb942f	chore!: homogenize conformance parameters BREAKING CHANGE: renamed some conformance parameters public types	2025-02-17 15:07:09 +01:00
dependabot[bot]	fb730d2953	chore(deps): bump zgosalvez/github-actions-ensure-sha-pinned-actions Bumps [zgosalvez/github-actions-ensure-sha-pinned-actions](https://github.com/zgosalvez/github-actions-ensure-sha-pinned-actions) from 3.0.20 to 3.0.22. - [Release notes](https://github.com/zgosalvez/github-actions-ensure-sha-pinned-actions/releases) - [Commits](`c3a2b64f69...25ed13d062`) --- updated-dependencies: - dependency-name: zgosalvez/github-actions-ensure-sha-pinned-actions dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2025-02-17 14:13:41 +01:00
tmontaigu	1b7f7a5c8f	feat: add strings to integer's compressed list	2025-02-14 16:29:56 +01:00
Guillermo Oyarzun	0d3d23daec	chore(gpu): remove unused variables	2025-02-14 13:07:07 +01:00
Agnes Leroy	c5f44a6581	chore(gpu): refactor overflowing sub to track noise / degree	2025-02-14 12:03:20 +01:00
tmontaigu	cda43fd407	feat(strings): make strings compatible with compact list	2025-02-14 10:14:31 +01:00
Agnes Leroy	bfd3773322	chore(gpu): refactor arithmetic scalar shift	2025-02-13 20:58:12 +01:00
Agnes Leroy	a7c9357a02	fix(gpu): fix memory error in shift and rotate	2025-02-13 20:57:31 +01:00
Agnes Leroy	285fd2437e	chore(gpu): add some checks on radix sizes for carry propagation	2025-02-13 20:57:31 +01:00
David Testé	7ee49387fe	chore(ci): deduplicate parameters set to send to lattice estimator From SageMath point of view some tfhe-rs parameters set are equivalent. We deduplicate those by storing their name in the tag field. Grouping them that way we decrease analysis time dramatically.	2025-02-13 17:10:45 +01:00
Arthur Meyre	8756869fe3	fix: fix compression code for GPU which assumed a CPU data layout - the CPU data layout is truncated to only store relevant bodies (i.e. emtpy bodies are assumed to be 0) but the GPU CUDA code manages full GLWEs only. To fix that we manage the data layout during conversions to have consistent behavior when copying the list to/from CPU/GPU. Compression code has been fixed on the CPU side to have the proper length for the output expected by the CUDA code	2025-02-13 17:06:19 +01:00
Mayeul@Zama	9e4b585468	chore(core): use real modulus in test	2025-02-13 16:21:26 +01:00
tmontaigu	37934e42c1	fix(integer): rotations/shifts < 2 blocks This commit fixes a few bugs * The shift/rotate functions used when blocks encrypt a number of bits that is a power of 2 was causing a panic when working on one block. - Also, when the number of blocks was low (e.g 2 blocks with 2_2 params) a noise cleaning step was wrongly skipped * The function used when blocks encrypt non power of 2 number of bits also had a problem The test have been updated to test with different block sizes and check the noise level Overall these bugs only affected low block counts (e.g FheUint2, FheUint4) ciphertexts	2025-02-13 12:59:26 +01:00
Mayeul@Zama	53a1f35d3b	feat: update noise reduction to take input noise into account	2025-02-13 10:57:28 +01:00
Mayeul@Zama	4305f8d158	chore(core): refactor DispersionParameter	2025-02-13 10:57:28 +01:00
David Testé	eeb6c8a71f	chore(ci): remove pull_request_target for external contributions We use large GitHub hosted runners to run CI pipeline for external contributions. This avoids possible secret exposition due to usage of pull_request_target event. It also removes a layer a complexity to ensure such secrets are not exposed. The flow would be improved since tfhe-rs maintainers won't have to relaunch failed jobs individually, thanks to the "approve and run" button in GitHub user interface.	2025-02-13 08:45:02 +01:00
tmontaigu	16d8af150c	fix(gpu): compressed list gpu <-> cpu Some counts where to copied from the correct source to correct destination. And more importantly, the list on cuda side was stored using a GlweCiphertextList but the data was compressed (so the list was mostly empty). This use of a GlweList instead of a specialized type lead to problems when converting to Cpu	2025-02-12 15:17:23 +01:00
tmontaigu	d0b0fe8edb	fix(gpu): fix wrong degree after decompression For Signed and Unsigned DataKind, the degree was incorrectly set, leading to unneeded carry propagations	2025-02-12 15:17:23 +01:00
David Testé	3df08e9259	chore(ci): install github runner as ubuntu user for gpu workflows	2025-02-12 12:10:53 +01:00