123 Commits

Author SHA1 Message Date
Arthur Meyre
20a91337c1 chore: prepare v1.5 2025-10-16 15:23:36 +02:00
Thomas Montaigu
498b0e6e5c refactor: use BTreeMap as internals of KVStore
This is to make the order of the key and value lists
deterministic when compressing
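A minimal illustration of the determinism this buys (plain std types here; the real KVStore holds encrypted values):
```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap iterates in sorted key order, so the flattened key and
    // value lists produced when compressing come out the same on every
    // run, unlike HashMap whose iteration order is randomized.
    let mut store = BTreeMap::new();
    store.insert(42u64, "b");
    store.insert(7u64, "a");
    let keys: Vec<u64> = store.keys().copied().collect();
    assert_eq!(keys, vec![7, 42]); // deterministic, run after run
}
```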
2025-10-14 17:04:13 +02:00
Thomas Montaigu
126138a59d chore: only run KVStore benches on CPU
As it's the only backend that supports it
2025-10-08 11:52:14 +02:00
pgardratzama
3073d60f11 fix(hpu): work-around a criterion assert by reducing number of elements on division & modulus throughput bench 2025-10-07 14:23:07 +02:00
pgardratzama
ab25919187 fix(hpu): throughput benchmarks were done 1 IOp per 1 IOp... 2025-10-07 10:14:43 +02:00
Nicolas Sarlin
6a676551d8 chore(shortint): add metaparams for ks32 2025-10-07 09:51:09 +02:00
Enzo Di Maria
f0f3dd76eb feat(gpu): aes 128 2025-10-06 09:31:36 +02:00
Thomas Montaigu
e523fd2cb6 feat: add KVStore to the high level api
* Added the Value type name to the crate::integer::KVStore impl of the
  Named trait, as well as a bool to check that we deserialize the
  correct value type (Radix vs SignedRadix)
* Add KVStore to high_level_api
* Add KVStore hlapi benches
* Remove the specialized `[add,mul,sub]_to_slot` as `map` is now the
  intended API (see the sketch after this list).
    - mul_to_slot was way slower than using `map`
    - add/sub_to_slot were a bit faster (~5% latency-wise), but returned
      less information (no old_value, no new_value, no boolean to check
      if the key matched)
    - Some known improvements can be made to `map`, which should result
      in it being better than add/sub_to_slot
* Add a FheIntegerType trait to make the KVStore generic over
  FheUint/FheInt, which should make GPU integration "easy"
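A plain-Rust analogue of the `map` shape described above (hypothetical signature; the real KVStore operates on encrypted keys and values, and the actual API may differ):
```rust
use std::collections::BTreeMap;

// Unlike a specialized `add_to_slot`, `map` can report the old value,
// the new value, and whether the key matched.
fn map_slot<F>(
    store: &mut BTreeMap<u64, i64>,
    key: u64,
    f: F,
) -> (Option<i64>, Option<i64>, bool)
where
    F: FnOnce(i64) -> i64,
{
    match store.get_mut(&key) {
        Some(v) => {
            let old = *v;
            *v = f(old);
            (Some(old), Some(*v), true)
        }
        None => (None, None, false),
    }
}

fn main() {
    let mut store = BTreeMap::from([(1u64, 10i64)]);
    // Emulates `add_to_slot(1, 5)` while keeping all the information.
    let (old, new, matched) = map_slot(&mut store, 1, |v| v + 5);
    assert_eq!((old, new, matched), (Some(10), Some(15), true));
}
```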
2025-10-03 15:01:23 +02:00
Agnes Leroy
f9e876730a chore(gpu): remove support for drift noise reduction 2025-10-03 09:45:20 +02:00
pgardratzama
39b81a8ded feat(hpu): move to new bitstream at 400Mhz with GRAM_NB 3
- update SIMD_N and min_batch_size to 12, which seems to give better
  latency and ERC20 throughput
- support IOp on several lines in the ami /proc file
- reduce the number of ERC_20_SIMD per batch in the HLAPI bench
2025-10-02 13:20:36 +02:00
pgardratzama
2bf595d0e2 fix(hpu): missing bench numbers for less_than & less_or_equal because lower != less 2025-10-02 13:20:36 +02:00
David Testé
d397ea3a39 chore(bench): handle ks32 atomic pattern in key size measurements 2025-09-23 12:01:33 +02:00
Guillermo Oyarzun
022cb3b18a fix(gpu): avoid out of memory when benchmarking throughput 2025-09-19 14:44:12 +02:00
David Testé
4ba1787e12 chore(bench): add crs size in zk-pke benchmark names
This is done to get more details about the benchmarks when parsing
results.
2025-09-16 16:06:41 +02:00
David Testé
366d359441 chore(bench): measure ciphertext and key sizes at a large scale
Ciphertext sizes are measured at the HLAPI layer with several
parameter sets.
Key sizes are measured at the shortint level.
This benchmark now has its own dedicated GitHub workflow that runs,
at least, on the 24th of each month.
2025-09-16 15:43:36 +02:00
pgardratzama
757c2fc828 chore(hpu): make hpu integer bench fast by default 2025-09-10 22:24:31 +02:00
pgardratzama
4ff0d6cac2 feat(hpu): integer bench update (adds mod, div -> div_mod), erc20_simd simd batch size read from iop prototype 2025-09-10 22:24:31 +02:00
pgardratzama
1530f52c79 feat(hpu): adds support of ERC20 SIMD in hpu ERC20 bench 2025-09-10 22:24:31 +02:00
tmontaigu
e8dc403ebd feat(integer): add flip operation
Add the flip(condition: BooleanBlock, a: T, b: T) -> (T, T)
operation that homomorphically flips/swaps two values if the
given encrypted boolean encrypts true
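The plaintext analogue of the operation, for reference (the homomorphic version does this without revealing the condition):
```rust
// Returns (a, b) unchanged when condition is false, (b, a) when true.
fn flip<T>(condition: bool, a: T, b: T) -> (T, T) {
    if condition { (b, a) } else { (a, b) }
}

fn main() {
    assert_eq!(flip(false, 1, 2), (1, 2));
    assert_eq!(flip(true, 1, 2), (2, 1));
}
```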
2025-09-10 09:44:28 +02:00
Pedro Alves
c78cc2d2e9 chore(gpu): add a benchmark for 128-bit multi-bit noise squashing
- Also, remove the lut indexes concept from the 128-bit multi-bit PBS. The rest of the backend assumes it does not exist (as it does not for the classical PBS), so keeping it here would be error-prone.
2025-09-09 07:51:35 -03:00
David Testé
89b36ebca0 chore(bench): remove 2-bits size for full precision bench on gpu
The GPU backend cannot accept fewer than 2 blocks for integer
benchmarks. Since 2-bit precision benchmarks are run with
*_MESSAGE_2_CARRY_2_* parameters, they create only one block of
ciphertext, making them unsuitable for the GPU backend.
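The block-count arithmetic behind this, as a small check (the helper below is illustrative, not the backend's code):
```rust
// Each *_MESSAGE_2_CARRY_2_* block carries 2 message bits, so a
// 2-bit integer needs ceil(2 / 2) = 1 block, which is below the GPU
// backend's minimum of 2 blocks.
fn num_blocks(precision_bits: u32, message_bits_per_block: u32) -> u32 {
    precision_bits.div_ceil(message_bits_per_block)
}

fn main() {
    assert_eq!(num_blocks(2, 2), 1); // too few blocks for the GPU backend
    assert_eq!(num_blocks(4, 2), 2); // smallest precision the GPU accepts
}
```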
2025-09-08 12:24:24 +02:00
pgardratzama
bd7df4a03b chore(hpu): enable hpu hlapi workflow and throughput bench in integer workflow 2025-09-05 10:42:36 +02:00
pgardratzama
6fe24c6ab3 chore(hpu): update hpu integer bench scalar op names 2025-09-05 10:42:36 +02:00
pgardratzama
c6aa1adbe7 chore(hpu): update benches to run new operations 2025-09-05 10:42:36 +02:00
David Testé
4a0658389e chore(bench): make bits to prove customizable in zk benchmarks
Some applications, like blockchain, may want to prove fewer bits
than the CRS size allows.
2025-09-05 09:03:24 +02:00
David Testé
97574bdae8 chore(bench): add noise squash benchmark with compressions
This new benchmark is extracted from a use case.
Starting from a compressed ciphertext, it measures the decompression, then noise-squashes the result, and finally compresses it again.
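A rough sketch of the measured pipeline (all names below are placeholders standing in for the real API, not actual tfhe-rs identifiers):
```rust
// Placeholder types and functions standing in for the real API.
struct CompressedCt;
struct Ct;
struct SquashedCt;
struct CompressedSquashedCt;

fn decompress(_c: CompressedCt) -> Ct { Ct }
fn squash_noise(_c: Ct) -> SquashedCt { SquashedCt }
fn compress(_c: SquashedCt) -> CompressedSquashedCt { CompressedSquashedCt }

// The benchmark times these three steps end to end.
fn benched_pipeline(input: CompressedCt) -> CompressedSquashedCt {
    let ct = decompress(input);      // 1. decompression
    let squashed = squash_noise(ct); // 2. noise squashing
    compress(squashed)               // 3. re-compression of the result
}

fn main() {
    let _ = benched_pipeline(CompressedCt);
}
```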
2025-09-04 15:13:08 +02:00
Guillermo Oyarzun
c2e816a86c fix(gpu): change minimum number of elements in benches 2025-09-04 11:03:27 +02:00
Pedro Alves
cad4070ebe fix(gpu): fix the decompression function signature in the backend 2025-08-29 21:09:40 +02:00
Pedro Alves
94d24e1f8b feat(gpu): implement the centered modulus switch technique to classical PBS 2025-08-29 11:38:26 -03:00
Pedro Alves
9a1c0f48f4 feat(gpu): implement 128-bit compression and add it to the integer API 2025-08-29 11:26:07 -03:00
Guillermo Oyarzun
a8f391a442 chore(gpu): update 4_1_1 params to match specialized pbs 2025-08-28 17:54:59 +02:00
David Testé
4b6942a0f8 chore(bench): add unbounded oprf integer benchmarks
Also move the Cuda OPRF benchmark into the same file as the CPU implementation
2025-08-22 15:01:53 +02:00
David Testé
1647ec8f21 chore(bench): add 2 bits integer to full benchmarks
This is done to measure execution time of the FheBool equivalent across all operations.
2025-08-19 09:54:03 +02:00
David Testé
b3f1a85e1d chore(bench): write parameters to disk for hlapi operations 2025-08-13 18:34:26 +02:00
Antoniu Pop
9316922e81 fix(benches): fix hlapi dex benchmark transfer function 2025-08-12 17:28:40 +01:00
Nicolas Sarlin
bc5c2f51ff fix(bench): store correct pfail from params 2025-08-12 09:44:37 +02:00
Mayeul@Zama
4d1b917045 feat(shortint): add multibit noise squashing 2025-08-11 16:30:59 +02:00
David Testé
3b42f9873a chore(bench): write params to file for each zk benchmark on gpu
To be parsable, each benchmark criterion ID must have its crypto
details written to a file.
2025-08-07 15:17:33 +02:00
tmontaigu
8c838da209 chore(integer): improve measurements
It seems that in this pattern:
```rust
bench_group.bench_function(&bench_id, |b| {
  // some code
  b.iter(|| {
      // function to bench
  })
});
```
putting code in the '// some code' part affects the measurements:
the slower this code is, the worse the measurements can be.

For many operations the gap is small (a few ms or no gap),
but for the division the gap was around 500ms.

So to reduce this, we move out what we can; moving the keycache
access is the most important part, as it costs around 70ms to 100ms.

A LazyCell is used so that the keycache is only accessed if the bench
is not filtered out. This is the behaviour we had before this commit,
and the behaviour we want to keep, so that running specific benches
via regex selection stays fast.

Also, for clean-input benches, we use `iter` instead of `iter_batched`,
as it makes more sense and should give more accurate results:
`iter_batched`'s timing includes other things than just the timing of
the function.
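A sketch of the resulting pattern (the keycache and the benched operation below are hypothetical stand-ins; the criterion API usage matches the snippet above):
```rust
use std::cell::LazyCell;
use criterion::{criterion_group, criterion_main, Criterion};

// Stand-ins for the keycache access (~70-100 ms) and the benched op.
fn load_keys_from_cache() -> Vec<u64> {
    (0..1024).collect()
}

fn operation_to_bench(keys: &[u64], input: u64) -> u64 {
    keys.iter().fold(input, |acc, k| acc.wrapping_add(*k))
}

fn bench(c: &mut Criterion) {
    let mut group = c.benchmark_group("example");
    // LazyCell defers the keycache access until the bench actually
    // runs, so regex-filtered invocations stay fast.
    let keys = LazyCell::new(load_keys_from_cache);
    group.bench_function("op", |b| {
        let keys = &*keys; // initialized once, outside the timed closure
        b.iter(|| operation_to_bench(keys, 42));
    });
    group.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```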
2025-07-15 12:46:38 +02:00
Agnes Leroy
068cbc0f41 chore(gpu): add hl api noise squash latency and throughput bench 2025-07-11 14:04:32 +01:00
Pedro Alves
1b98312e2c fix(gpu): fix regression on ERC20 throughput
- partially revert changes done in fd79c4f972
- transfers for the GPU case should be measured using sequential
  operations (without rayon!)
2025-07-11 08:57:19 +01:00
Pedro Alves
d3dd010deb fix(gpu): reduces number of elements in the ZK throughput benchmark 2025-07-11 08:57:01 +01:00
Pedro Alves
9960f5e8b6 fix(gpu): Fix expand bench on multi-gpus 2025-07-09 09:17:55 +01:00
Pedro Alves
23ebd42209 fix(gpu): fix compression throughput benchmark 2025-07-07 16:30:24 +01:00
Pedro Alves
8c88678ee8 feat(gpu): implement 128-bit multi-bit PBS 2025-07-03 20:34:32 -03:00
Agnes Leroy
e4d856afdf chore(gpu): update noise squashing parameters 2025-07-03 12:51:19 +01:00
Pedro Alves
22ddba7145 fix(gpu): refactor the (128-bit and regular) classical PBS entry point to remove the num_samples parameter
- fixes the throughput for those PBSs
- also fixes the throughput benchmark for regular PBSs
2025-07-03 08:23:09 -03:00
David Testé
d955696fe0 chore(bench): reduce number of bit sizes to benchmark
This is done to reduce execution time, since 4-bit precision is not useful to measure.
2025-07-03 12:45:02 +02:00
Baptiste Roux
24572edb1c feat(hpu): Add support for centered modswitch.
Add new fields in HpuPBSParameters (log2_pfail and modulus_switch_type).
Also add a new parameter set definition in shortint for benchmark matching.

Remove the use of the use_mean_compensation register; this information is now embedded inside the parameter set definition.
Update psi64.hpu archive with newest bitstream
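A sketch of the parameter additions (field names taken from the commit; the surrounding struct is simplified and the example values are hypothetical):
```rust
// The two new fields described above; other PBS parameters elided.
#[derive(Clone, Copy, Debug)]
enum ModulusSwitchType {
    Standard,
    Centered, // subsumes the old use_mean_compensation register
}

#[derive(Clone, Copy, Debug)]
struct HpuPBSParameters {
    log2_pfail: f64,
    modulus_switch_type: ModulusSwitchType,
}

fn main() {
    let params = HpuPBSParameters {
        log2_pfail: -64.0, // hypothetical value
        modulus_switch_type: ModulusSwitchType::Centered,
    };
    println!("{params:?}");
}
```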
2025-07-02 14:41:41 +02:00
Arthur Meyre
86a40bcea9 chore: move gated import to section with feature gate in HL erc20 bench 2025-07-02 13:14:31 +02:00