Add FHE based PPD blog post (#563)

This commit is contained in:
tkmct
2025-09-25 21:33:05 +09:00
committed by GitHub
parent f73670a94d
commit 2903637730
2 changed files with 271 additions and 0 deletions

---
authors: ["Takamichi Tsutsumi"] # Add your name or multiple authors in an array
title: "Constant-Depth NTT for FHE-Based Private Proof Delegation" # The title of your article
image: "/articles/const-depth-ntt-for-fhe-based-ppd/cover.webp" # Image used as cover. Keep the image size in mind; where possible use .webp format, ideally under 200-300 KB
tldr: "We benchmarked constant-depth NTT over FHE ciphertexts to assess the feasibility of state-of-the-art FHE-based private proof delegation protocols." # Short summary
date: "2025-09-25" # Publication date in ISO format
tags: ["FHE", "ZKP"] # (Optional) Add relevant tags as an array of strings to categorize the article
projects: ["private-proof-delegation"]
---
Huge thanks to Keewoo and Nam for their sharp feedback, steady guidance, and countless practical suggestions that made this work stronger.
# 1. Introduction
FHE-SNARK is a compelling approach to private proof delegation: outsource zkSNARK proof generation to an untrusted server that homomorphically evaluates the prover's algorithm while the witness stays encrypted. A [recent work](https://eprint.iacr.org/2025/302) presents a cryptographic framework but, with no implementation, reports only performance estimates. In practice, that omission makes it hard to reason about feasibility and bottlenecks across the pipeline, such as data parallelization or RAM/disk I/O.
This post fills one specific gap: a **constant-depth NTT over FHE ciphertexts**. NTTs (finite-field FFTs) underpin polynomial IOPs and commitments, and, mirroring how fast FFT/NTT-style transforms dominate modern FHE bootstrapping pipelines, they are the main subroutine in [recent work](https://eprint.iacr.org/2025/302). Our contribution is to implement an FHE-friendly instantiation, integrate it into an FHE flow, and benchmark it.
## What this post is (and isn't)
* **Is**: a focused, empirical clarification of the constant-depth NTT building block in an FHE-SNARK context.
* **Isn't**: an end-to-end benchmark of the full protocol, or a comparison of unrelated acceleration techniques. We are intentionally isolating one missing measurement.
## Why constant-depth?
Depth drives noise growth and bootstrapping frequency. A standard log-depth NTT stacks multiplicative layers; a **constant-depth NTT** reorganizes butterflies and twiddle application so multiplicative depth is bounded (independent of input size), shifting costs toward data movement and plaintext twiddle loads. In our layout, all index movement is realized by 2D packing and plaintext multiplications; no ciphertext rotations or keyswitches are performed in the measured kernel.
## Our contribution
* **Implementation:** Rust + OpenFHE with Intel HEXL acceleration for NTT primitives.
* **Measurement:** runtime and homomorphic op counts (ct-pt multiplies, ct-ct adds) across sizes from ~1k to ~2.25M and depths 1-5 (focus on 3-4).
* **Positioning:** results interpreted strictly through the lens of FHE-SNARK's polynomial subroutines, to help researchers decide whether to invest effort upstream (witness layout/extension) or downstream (FHE-SNARK optimization).
## Scope & audience
* **Scope:** the NTT kernel over ciphertexts, its constant-depth layout, and empirical behavior on Xeon Ice Lake with AVX-512.
* **Audience:** practitioners building proof delegation systems who need concrete numbers to plan engineering work, and researchers prioritizing optimization targets.
Outline: §2 reviews the FHE-SNARK context; §3 details parameters, packing, and instrumentation; §4 reports results; §5 interprets them.
# 2. Background: FHE-SNARK & the Constant-Depth NTT Gap
**What FHE-SNARK is about.** The FHE-SNARK paper ([ePrint 2025/302](https://eprint.iacr.org/2025/302)) is a state-of-the-art conceptual treatment (no implementation reported): run the SNARK prover *homomorphically* so a single untrusted server can generate a proof while the witness remains encrypted. It formalizes the model and the pipeline for evaluating the prover under HE, and explains why this could be competitive with alternatives. It is primarily foundational, defining the paradigm and spelling out what needs to be efficient for it to matter in practice.
For context, see [ePrint 2023/1609](https://eprint.iacr.org/2023/1609.pdf) and [2023/1949](https://eprint.iacr.org/2023/1949) on verifiable computation over HE and delegation, and [ePrint 2024/1684](https://eprint.iacr.org/2024/1684.pdf) on blind/oblivious SNARK variants.
In the FHE-SNARK paper's Ligero instantiation, the prover's homomorphic workload is dominated by Reed-Solomon encoding, which Ligero uses for its codeword commitments. This encoding reduces to large batched NTT-based evaluation and interpolation over the RS evaluation domain, making the NTT kernel the principal driver of runtime and multiplicative depth over HE ciphertexts. The main levers that govern feasibility here are:
* **Multiplicative depth** (drives noise growth and whether extra maintenance is needed),
* **Data movement** (rotations and packing/unpacking overheads; rotations are not used in our measured kernel), and
* **Operation count** (dominated by ct-pt multiplies and ct-ct adds; any keyswitches/rotations would also be counted here when present).
**What the paper clarifies, and what it doesn't.** The paper clarifies *how* an FHE-evaluated prover can be structured and *why* polynomial ops are central. But it doesn't publish **microbenchmarks** for any one kernel. In particular, we lack numbers for a **constant-depth NTT** over ciphertexts: runtime vs. size, depth usage, **operation counts (including any keyswitches/rotations)**, and the corresponding effect on the noise budget.
**Why constant-depth matters.** A standard log-depth NTT stacks multiplicative layers; under HE, that means stacked noise growth and more frequent maintenance. A **constant-depth NTT** reorganizes the butterflies and twiddle application so the multiplicative depth is **bounded (size-independent)**. As noted in §3.5, our layout handles index movement via 2D packing and plaintext matrix multiplications; rotations and keyswitches are not used by this kernel.
**What we measure (and why it complements the paper).**
To fill this specific gap, we isolate and benchmark **constant-depth NTT over ciphertexts** on **real-world-sized transforms**. Our stack is **Rust + OpenFHE with Intel HEXL** acceleration. We report:
* Runtime vs. size (from ~1k to ~2.25M entries),
* Multiplicative depth exercised (we sweep depths 1-5), and
* Homomorphic op counts (ct-pt multiplies, ct-ct additions).
This is **not** an end-to-end FHE-SNARK benchmark. It's a focused, empirical clarification of a single kernel that the paper identifies as central but does not quantify. The result is a cleaner picture of where the true bottlenecks lie once the NTT's multiplicative depth is capped, so teams can prioritize the right optimizations in the rest of the pipeline.
We now specify parameters, packing, and how we instrument the kernel.
# 3. Methodology
This section fixes **what we built and measured** so others can reproduce or extend the results.
## 3.1 Target protocol & kernels
We benchmark the **constant-depth NTT** that sits in the **Reed-Solomon (RS) layer** of the *Ligero* prover path used in the FHE-SNARK paper. Under homomorphic evaluation, RS **encode/decode** (evaluation/interpolation) is implemented via forward/inverse NTTs; our measurements quantify that kernel in isolation.
## 3.2 Field and NTT domain
* **Prime field:** $p = 2^{32}-2^{20}+1 = \mathbf{4\,293\,918\,721}$ (32-bit prime).
* **Implication:** we can run power-of-two NTTs directly over $\mathbb{F}_p$, and we can fully batch over $X^N+1$ with $N=2^{14}$ (since $2N=2^{15} \mid (p-1)$).
### Rationale: prime choice
- **NTT constraints.** We need power-of-two NTTs for (i) the per-ciphertext sub-transforms and (ii) the top-level NTT over the ciphertext grid. With $p-1 = 2^{20} \cdot 4095$, a primitive $2^k$-th root exists for all $k \le 20$; in particular, $2N=2^{15}$ divides $p-1$, so negacyclic NTTs at $N=2^{14}$ are supported.
- **FHE constraints (packing).** Using BFV with plaintext modulus $t=p$ enables native batching over $X^N+1$ and cheap ct-pt multiplies. A 32-bit $t$ keeps plaintext ops and twiddle tables cache-friendly and aligns well with Intel HEXL's vectorized kernels, improving throughput without changing multiplicative depth.
- **Alignment with FHE-SNARK.** The FHE-SNARK paper sketches ~50-bit fields in its end-to-end setting. Our kernel results use a 32-bit NTT-friendly prime for practicality and because of restrictions in the OpenFHE implementation; production deployments can switch to a 64-bit NTT-friendly prime (e.g., Goldilocks), which preserves the constant-depth layout. The trade-offs are mostly constant-factor timing and memory.
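The root-of-unity claims above are easy to check directly. The following standalone sketch (illustrative only, not part of our benchmark harness) verifies that $2^{20}$ exactly divides $p-1$ and derives an element of order $2^{15}$ by raising a quadratic non-residue to the power $(p-1)/2^{15}$:

```rust
/// Modular exponentiation with u128 intermediates (safe for 32-bit moduli).
fn pow_mod(b: u64, mut e: u64, m: u64) -> u64 {
    let (mut r, mut b, m) = (1u128, b as u128 % m as u128, m as u128);
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r as u64
}

fn main() {
    const P: u64 = (1u64 << 32) - (1 << 20) + 1; // 4_293_918_721
    assert_eq!(P - 1, (1 << 20) * 4095); // 2-adic valuation of p-1 is exactly 20

    // Any quadratic non-residue g has order with full 2-part 2^20, so
    // g^((p-1)/2^15) has order exactly 2^15.
    let g = (2u64..).find(|&g| pow_mod(g, (P - 1) / 2, P) != 1).unwrap();
    let w = pow_mod(g, (P - 1) >> 15, P);
    assert_eq!(pow_mod(w, 1 << 15, P), 1); // w^(2^15) = 1 ...
    assert_ne!(pow_mod(w, 1 << 14, P), 1); // ... and w has no smaller 2-power order
    println!("order-2^15 root mod p: {w}");
}
```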
## 3.3 HE scheme & ring parameters
* **Library / bindings:** Rust 1.91.0-nightly + **OpenFHE** + OpenFHE-rs (thin FFI bindings).
* **Optimization:** **Intel HEXL** enabled (`WITH_INTEL_HEXL=ON`).
* **Scheme:** BFV with plaintext modulus $t=p$.
* **Ring dimension:** $2^{14}=16{,}384$.
* **Batching:** native BFV batching; all slots active.
## 3.4 Circuits, witness generation, and field port
We used **circom** to compile circuits and generate witnesses, ported to the new prime $p$.
* **Circuits:**
* **Semaphore v4** (membership + nullifier).
* **zkTwitter** (handle proof; Poseidon/Merkle path).
* **Field port:** changed circom's field modulus to $p=4\,293\,918\,721$ and **tweaked gadgets** to remove dependencies on BabyJubJub/BN254 arithmetic. Concretely, in **Semaphore v4** we **replaced BabyJubJub + Poseidon ID generation** with a lighter **Poseidon Hash2**-based ID derivation (and removed the related range checks).
* **Resulting sizes (for reference on constraint magnitude):**
* Original **BN128** (semaphore-v4): **15,917 wires**, witness ≈ **509 KB**.
* Current **BN128** (semaphore-np, optimized): **5,550 wires**, witness ≈ **178 KB**.
* The drop comes from removing BabyJubJub arithmetic + range checks and using a lighter Poseidon + Merkle path.
* *Note:* the above counts are for BN128 baselines; our **field-ported** versions keep the same logic after replacing field-specific gadgets. We report these here to indicate the **real-world scale** we target when sizing NTT batches.
> **Security note.** Using a smaller field (32-bit) changes soundness margins for RS-based protocols (code distance, rate, soundness error). Our focus here is *kernel* benchmarking; end-to-end security must be re-established at the protocol layer (e.g., by adjusting evaluation domain sizes/rounds). We flag this so readers don't conflate kernel timing with final system security.
## 3.5 Constant-depth NTT layout
We implement the NTT in a constant-depth, 2D layout. The goal is to keep multiplicative depth fixed (2-3) regardless of input size; we do this with 2D blocking plus plaintext twiddles and depth-1 sub-transforms:
1. Sub-transforms: split the input into smaller subsequences and run NTTs on each at depth 1 (all in parallel).
2. Twiddle & merge (fused): apply plaintext twiddles and run the small group NTTs in a single pass (one multiplication layer).
This way, the whole transform uses one multiplication layer per recursion level (a predetermined constant depth) instead of $\log n$ layers. All index movement is handled via 2D packing and plaintext matrix multiplications; our implementation performs no ciphertext rotations and no keyswitches (counts = 0).
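To make the layout concrete, here is a minimal plaintext sketch of the depth-2 (four-step) factorization over a toy field ($p = 257$; all names are illustrative, not from our codebase). In the HE version, the sub-transforms and twiddles become ct-pt multiplications, and the twiddle layer is fused with the second transform layer; here we keep the steps separate for clarity:

```rust
const P: u64 = 257; // tiny NTT-friendly prime (2^8 + 1) for the demo

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1u64;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

/// Naive O(m^2) NTT with root w (a primitive m-th root of unity mod P).
fn ntt_naive(x: &[u64], w: u64) -> Vec<u64> {
    let m = x.len() as u64;
    (0..m)
        .map(|k| (0..m).fold(0u64, |acc, j| (acc + x[j as usize] * pow_mod(w, j * k, P)) % P))
        .collect()
}

/// Four-step NTT of length n = n1*n2: one layer of n1-point sub-transforms,
/// one plaintext twiddle layer, one layer of n2-point sub-transforms.
/// The multiplicative depth stays fixed no matter how large n grows.
fn ntt_four_step(x: &[u64], n1: usize, n2: usize, w: u64) -> Vec<u64> {
    let (w1, w2) = (pow_mod(w, n2 as u64, P), pow_mod(w, n1 as u64, P));
    // Step 1: n2 sub-transforms of length n1 (over index j1).
    let mut b = vec![vec![0u64; n1]; n2];
    for j2 in 0..n2 {
        let col: Vec<u64> = (0..n1).map(|j1| x[j2 + n2 * j1]).collect();
        b[j2] = ntt_naive(&col, w1);
    }
    // Step 2: plaintext twiddle multiplication by w^(j2*k1).
    for j2 in 0..n2 {
        for k1 in 0..n1 {
            b[j2][k1] = b[j2][k1] * pow_mod(w, (j2 * k1) as u64, P) % P;
        }
    }
    // Step 3: n1 sub-transforms of length n2 (over index j2).
    let mut out = vec![0u64; n1 * n2];
    for k1 in 0..n1 {
        let row: Vec<u64> = (0..n2).map(|j2| b[j2][k1]).collect();
        let t = ntt_naive(&row, w2);
        for k2 in 0..n2 {
            out[k1 + n1 * k2] = t[k2];
        }
    }
    out
}

fn main() {
    let (n1, n2) = (4usize, 4usize);
    let n = (n1 * n2) as u64;
    let w = pow_mod(3, (P - 1) / n, P); // 3 generates F_257^*, so w has order 16
    let x: Vec<u64> = (0..n).map(|i| (i * i + 1) % P).collect();
    assert_eq!(ntt_four_step(&x, n1, n2, w), ntt_naive(&x, w));
    println!("four-step NTT matches the direct transform");
}
```

Recursing on the sub-transforms themselves yields the higher depths (3, 4, 5) swept in §4, trading more but smaller sub-transforms against extra twiddle layers.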
## 3.6 NTT Sizes and Batching
- Field: prime $p = 2^{32} - 2^{20} + 1 = 4,293,918,721$.
- Ring: dimension $2^{14}=16,384$ (BFV).
- Circuit source: witnesses from Circom (Semaphore-v4, zk-Twitter), ported to this field.
- Witness size: from ~1k values up to ~2 million (zk-Twitter).
We pack field elements into ciphertext slots. The packing size is chosen near $\sqrt{M}$ for witness length $M$, rounded to a power of two. Inputs are then padded so the ciphertext count is also a power of two. This keeps the matrix shape balanced for the 2D NTT.
For a witness of length $M$, we use $\text{lanes} \approx 2^{\lfloor \log_2 \sqrt{M} \rfloor}$ per ciphertext and $\#\text{CT} = \lceil M/\text{lanes} \rceil$, pad $\#\text{CT}$ to a power of two, and run an NTT of length $\#\text{CT}$. Cost model: this constant-depth layout uses $O(d\cdot n^{1+1/d})$ ct-pt multiplies (e.g., $O(n^{1.5})$ at $d=2$), trading extra multiplies for reduced depth (vs. $O(n\log n)$ at $\log n$ depth).
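The packing arithmetic is easy to script. This hypothetical helper (not from our repository) reproduces the shapes implied by our benchmark sizes; e.g., $M=5{,}570$ gives 64 lanes and a length-128 ciphertext NTT, whose naive depth-1 variant costs $128^2 = 16{,}384$ ct-pt multiplies, matching the depth-1 row in §4.1:

```rust
/// Choose lanes ≈ 2^floor(log2 sqrt(M)) and pad the ciphertext count to a
/// power of two, as described in §3.6. Returns (lanes, ntt_length).
fn packing_shape(m: u64) -> (u64, u64) {
    let lanes = 1u64 << (m.ilog2() / 2); // power of two near sqrt(M)
    let ct = (m + lanes - 1) / lanes;    // ceil(M / lanes)
    (lanes, ct.next_power_of_two())      // pad #CT to a power of two
}

fn main() {
    for m in [5_570u64, 22_280, 2_250_280] {
        let (lanes, n) = packing_shape(m);
        // A depth-1 (naive matrix) NTT of length n costs n^2 ct-pt multiplies.
        println!("M={m}: lanes={lanes}, NTT length={n}, depth-1 multiplies={}", n * n);
    }
}
```

For all three witness sizes in §4 this yields NTT lengths of 128, 256, and 4,096, and the resulting depth-1 counts (16,384 / 65,536 / 16,777,216) line up with the tables below.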
## 3.7 Metrics & Instrumentation
- Reported metrics: wall-clock runtime, ct-pt multiplies, ct-ct additions.
- Rotations/keyswitches: not used by this kernel (see §3.5), so we omit those columns.
- Noise budget: we did not report a before/after delta for this kernel; adding this is straightforward and left as future work.
- Sanity checks: each run decrypts and compares against a plaintext NTT to confirm correctness (excluded from timing).
## 3.8 Hardware & run controls
* **Machine:** Intel Xeon Platinum 8375C (Ice Lake, AVX-512), 1 socket, 8 cores/16 threads (SMT=2), base 2.90 GHz; L3 54 MiB; 128 GiB RAM.
Appendix A (Reproducibility) lists full toolchain, parameters, and the exact cargo command used to run these benchmarks.
---
This setup lets us answer the narrow question the paper left open, in the exact field and ring parameters we now target: **what does a constant-depth NTT actually cost** (depth, op counts, milliseconds) when it is run the way an FHE-evaluated *Ligero* prover would run it?
# 4. Results
**Headline:** 1.94 s (5.6k @ depth=3), 4.50 s (22k @ depth=4), 121.1 s (2.25M @ depth=4). *Lower-bound kernel timings.* In an end-to-end FHE-SNARK, NTTs are evaluated at a higher ciphertext modulus (i.e., more levels), so wall-clock will be modestly higher.
Reading the tables: lower time is better; counts shown are ctpt multiplies and ctct additions; rotations/keyswitches are zero in this kernel.
| Witness size | Best depth | Time (s) | Throughput |
| -----------: | ---------: | -------: | ---------: |
| 5,570 | 3 | 1.94 | ~2.86k elems/s |
| 22,280 | 4 | 4.50 | ~4.95k elems/s |
| 2,250,280 | 4 | 121.11 | ~18.6k elems/s |
We measure three witness scales, from **\~5.6k** up to **\~2.25M** entries, spanning the tweaked Semaphore v4, its original-sized variant, and a zk-Twitter-scale input.
### 4.1 Semaphore v4 (tweaked to 32-bit field)
Witness entries: **5,570**
| depth | time (s) | ctpt multiplies | ctct additions |
| ----: | ---------: | ---------: | --------: |
| 1 | 11.4158 | 16,384 | 16,256 |
| 2 | 2.6045 | 3,072 | 2,816 |
| 3 | **1.9449** | **2,048** | **1,664** |
| 4 | 2.7946 | 2,816 | 2,304 |
| 5 | 2.5411 | 2,048 | 1,408 |
* **Best:** depth **3** → **1.94 s** (\~**2.86k elems/s**, \~**0.35 ms/elem**).
* **Speedup vs depth-1:** \~**5.9×**.
* **Note:** past depth 3, overhead outweighs the smaller op counts.
### 4.2 Semaphore v4 (original size, same field)
Witness entries: **22,280**
| depth | time (s) | ctpt multiplies | ctct additions |
| ----: | ---------: | ---------: | --------: |
| 1 | 45.4127 | 65,536 | 65,280 |
| 2 | 6.4632 | 8,192 | 7,680 |
| 3 | 5.3701 | 6,144 | 5,376 |
| 4 | **4.4982** | **4,096** | **3,072** |
| 5 | 6.6192 | 6,144 | 4,864 |
* **Best:** depth **4** → **4.50 s** (\~**4.95k elems/s**, \~**0.20 ms/elem**).
* **Speedup vs depth-1:** \~**10.1×**.
* **Observation:** as size grows, the sweet spot shifts from **3 → 4**.
### 4.3 zk-Twitter scale (similar witness size, 32-bit field)
Witness entries: **2,250,280**
| depth | time (s) | ctpt multiplies | ctct additions |
| ----: | -----------: | ----------: | ----------: |
| 1 | 11,729.7076 | 16,777,216 | 16,773,120 |
| 2 | 387.0503 | 524,288 | 516,096 |
| 3 | 160.5035 | 196,608 | 184,320 |
| 4 | **121.1053** | **131,072** | **114,688** |
| 5 | 133.9219 | 131,072 | 110,592 |
* **Best:** depth **4** → **121.11 s** (\~**18.6k elems/s**, \~**53.8 µs/elem**).
* **Speedup vs depth-1:** \~**97×**.
* **Note:** depth 5 trims ops slightly but adds memory traffic and twiddle-load overhead; beyond depth 4 that overhead outweighs the saved multiplies, so **depth 4** wins.
### 4.4 Takeaways
* **Constant-depth works.** Depth 1 (naïve matrix NTT) is impractical at scale; depth **3-4** is **5-97× faster** across our sizes.
* **Size decides the sweet spot.** Small (\~5.6k) prefers **3**; medium/large (22k-2.25M) prefers **4**.
* **Cost shifts to data movement.** After the sweet spot, runtime flattens even as op counts drop; overheads (layout, scheduling, memory, twiddle loads) dominate.
* **Feasible at real scale.** With **depth 4**, a single Ice Lake socket processes **\~2.25M** field elements in **\~2 minutes**.
# 5. Discussion & Conclusion
**What we showed.** Constant-depth NTT over ciphertexts is **practical** at real scales in the FHE-SNARK (Ligero/RS) setting. On a single Intel Xeon Platinum 8375C (Ice Lake, AVX-512) socket and a 32-bit field:
* Depth 1 (naïve matrix) is a non-starter at scale.
* Depth **3-4** delivers **5×-97×** speedups and keeps depth bounded.
* The **sweet spot shifts with size**: \~5.6k entries → **depth 3**; ≥22k up to \~2.25M → **depth 4**.
**What this means for builders.**
* **NTT isn't the blocker.** With a constant-depth layout, the transform fits inside typical BFV depth budgets and runs in minutes even at \~2.25M elements.
* **Optimize for data movement.** Once depth is capped, runtime flattens as op counts fall; **memory traffic and scheduling** take over. Co-design your **packing** (near-square), **stride sets**, and **batch shape** with upstream/downstream steps.
* **Pick depth first, then tune.** Start at **depth 3** (small/medium) or **depth 4** (large), then adjust packing and ring parameters for your throughput/memory envelope.
**On the 32-bit field.** We ported the circuits to $p=4{,}293{,}918{,}721$ to exercise the kernel. That choice is fine for NTT benchmarking, but **protocol soundness** in RS/Ligero must be re-established for smaller moduli (e.g., domain size/rounds). See §3.2 "Rationale: prime choice" for how this prime satisfies the NTT root requirements and aligns with OpenFHE packing, and how to lift to a ~50-bit effective modulus via CRT or a 64-bit NTT-friendly prime. For production fields:
* Use **CRT** across several 32bit primes, or
* Switch to a **64-bit prime** (e.g., Goldilocks) and expect roughly linear cost growth in ct-pt multiplies (constants depend on HEXL paths).
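As a sketch of the CRT route (standard residue-number-system reconstruction; the second prime and both helper names here are illustrative, assuming two coprime 32-bit NTT primes):

```rust
/// Modular exponentiation with u128 intermediates (safe for 32-bit moduli).
fn pow_mod(b: u64, mut e: u64, m: u64) -> u64 {
    let (mut r, mut b, m) = (1u128, b as u128 % m as u128, m as u128);
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r as u64
}

/// Recombine x mod p1 and x mod p2 into x mod p1*p2 (Garner's form).
/// Assumes p1 and p2 are distinct primes.
fn crt2(r1: u64, p1: u64, r2: u64, p2: u64) -> u128 {
    // x = r1 + p1 * t, where t = (r2 - r1) * p1^{-1} mod p2.
    let diff = ((r2 as i128 - r1 as i128).rem_euclid(p2 as i128)) as u64;
    let p1_inv = pow_mod(p1 % p2, p2 - 2, p2); // Fermat inverse (p2 prime)
    let t = (diff as u128 * p1_inv as u128) % p2 as u128;
    r1 as u128 + p1 as u128 * t
}

fn main() {
    let p1: u64 = (1u64 << 32) - (1 << 20) + 1; // the prime used in this post
    let p2: u64 = 3 * (1u64 << 30) + 1;         // another NTT-friendly prime, 3*2^30+1
    let x: u128 = 1_234_567_890_123_456_789;    // a ~60-bit value, x < p1*p2
    let (r1, r2) = ((x % p1 as u128) as u64, (x % p2 as u128) as u64);
    assert_eq!(crt2(r1, p1, r2, p2), x);
    println!("reconstructed {x} from residues mod p1 and p2");
}
```

Each NTT then runs independently per prime (keeping depth and per-op cost as measured here), and results are recombined only where the protocol needs the wide value.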
**Limits of this work.**
* **Kernel only.** We did not measure the full FHE-SNARK pipeline.
* **Metrics coverage.** We reported time and ct-pt/ct-ct counts. Rotations/keyswitches are not used by this kernel (counts = 0), and we did not yet add a simple before/after noise-budget delta.
* **One machine profile.** Results are single-socket Ice Lake; microarchitecture changes will shift constants.
**Where to push next.**
* **R1CS modulus/porting:** R1CS circuits over a smaller prime field are non-standard; existing BN254/BLS12-based gadgets don't carry over as-is. Re-audit soundness and constraints under the new modulus (e.g., range checks, hash/curve gadgets), and update any protocol-level parameters accordingly.
* **Witness extension under HE:** end-to-end proving requires RS witness extension executed under HE; we did not explore this here. Tooling is currently sparse; build generators that perform extension, packing/padding, and correctness checks under HE to integrate with the NTT kernel.
* **Hardware:** explore GPU offload for rotations/keyswitching (if introduced in future variants); widen AVX-512 utilization.
* **End-to-end:** plug this NTT into an E2E prover under FHE, re-tune RS parameters for target soundness, and report wall-clock and communication together.
**Bottom line.** The FHE-SNARK paper left constant-depth NTT unmeasured. We filled that gap with a concrete implementation and numbers across **\~1k → \~2.25M** elements. With **depth 3-4**, NTT is **depth-stable and fast enough**; the next wins will come from **layout and bandwidth** (rotations, if introduced in future variants), not the butterfly.
---
# Appendix A. Reproducibility
- **Repo:** https://github.com/tkmct/fhe-snark
- **HE libs:** OpenFHE v1.2.4 (shared libs on system); Intel HEXL v1.2.5 enabled at OpenFHE build time. If relevant, also record exact commit hashes and build flags.
- **CPU:** Intel Xeon Platinum 8375C (Ice Lake), x86_64, 1 socket, 8 cores/16 threads (SMT=2), base 2.90 GHz; caches: L1d 384 KiB (8×), L1i 256 KiB (8×), L2 10 MiB (8×), L3 54 MiB (1×); NUMA nodes: 1; AVX-512 supported; virtualization: KVM.
- **Memory:** 128GiB
