* Added the Value type name to the crate::integer::KVStore impl of the
  Named trait, as well as a bool to check that we deserialize the correct
  value type (Radix vs SignedRadix); see the first sketch after this list.
* Add KVStore to high_level_api
* Add KVStore hlapi benches
* Remove the specialized `[add,mul,sub]_to_slot` methods, as `map` is now
  the intended API (see the plaintext sketch after this list).
  - `mul_to_slot` was much slower than using `map`
  - `add/sub_to_slot` were a bit faster (~5% latency-wise), but returned
    less information (no old_value, no new_value, no boolean to check if
    the key matched)
  - Some known improvements can still be made to `map`, which should
    result in it being better than `add/sub_to_slot`
* Add the FheIntegerType trait to make the KVStore generic over
  FheUint/FheInt; it should also make GPU integration "easy"
- update SIMD_N and min_batch_size to 12, which seems to give better
  latency and ERC20 throughput
- support IOp on several lines in the ami /proc file
- reduce the number of ERC_20_SIMD per batch in the HLAPI bench
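As a rough illustration of the deserialization check from the first item,
here is a minimal sketch; the trait shape, the `IS_SIGNED` flag, and
`check_value_type` are illustrative assumptions, not the actual tfhe-rs
definitions:

```rust
/// Minimal sketch of the idea: the serialized header carries the value
/// type's name plus a signedness flag, so a store of signed values cannot
/// be deserialized as a store of unsigned ones (Radix vs SignedRadix).
trait NamedValue {
    const NAME: &'static str;
    const IS_SIGNED: bool;
}

struct Radix;
struct SignedRadix;

impl NamedValue for Radix {
    const NAME: &'static str = "Radix";
    const IS_SIGNED: bool = false;
}

impl NamedValue for SignedRadix {
    const NAME: &'static str = "SignedRadix";
    const IS_SIGNED: bool = true;
}

/// Run at deserialization time against the stored header.
fn check_value_type<V: NamedValue>(
    stored_name: &str,
    stored_signed: bool,
) -> Result<(), String> {
    if stored_name != V::NAME || stored_signed != V::IS_SIGNED {
        return Err(format!(
            "expected value type {} (signed = {}), found {} (signed = {})",
            V::NAME, V::IS_SIGNED, stored_name, stored_signed
        ));
    }
    Ok(())
}
```

And a plaintext analogue of why `map` supersedes the specialized
`*_to_slot` methods (the real KVStore operates on encrypted keys and
values; the names and signatures here are illustrative):

```rust
use std::collections::HashMap;

struct PlainKvStore {
    inner: HashMap<u64, u64>,
}

impl PlainKvStore {
    /// Specialized update: does the addition, but tells the caller nothing.
    fn add_to_slot(&mut self, key: u64, rhs: u64) {
        if let Some(v) = self.inner.get_mut(&key) {
            *v = v.wrapping_add(rhs);
        }
    }

    /// `map`-style update: applies an arbitrary closure and also reports
    /// whether the key matched, along with the old and new values.
    fn map(&mut self, key: u64, f: impl Fn(u64) -> u64) -> Option<(u64, u64)> {
        self.inner.get_mut(&key).map(|v| {
            let old = *v;
            *v = f(old);
            (old, *v)
        })
    }
}
```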
Ciphertext sizes are measured at the HLAPI layer with several parameter
sets. Key sizes are measured at the shortint level.
This benchmark now has its own dedicated GitHub workflow that runs at
least once a month, on the 24th.
Add the flip(condition: BooleanBlock, a: T, b: T) -> (T, T) operation,
which homomorphically flips/swaps the two values if the given encrypted
boolean encrypts true.
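A plaintext model of the intended semantics (the homomorphic version does
the same with an encrypted boolean, so the swap/no-swap decision stays
hidden):

```rust
/// Plaintext analogue of flip: returns the pair unchanged when the
/// condition is false, and swapped when it is true.
fn flip<T>(condition: bool, a: T, b: T) -> (T, T) {
    if condition { (b, a) } else { (a, b) }
}

fn main() {
    assert_eq!(flip(false, 1, 2), (1, 2));
    assert_eq!(flip(true, 1, 2), (2, 1));
}
```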
- Also, remove the lut indexes concept from the 128-bit multi-bit PBS.
  The entire backend assumes it does not exist (as it doesn't for the
  classical PBS), so keeping it here would be error prone.
The GPU backend cannot accept fewer than 2 blocks for integer
benchmarks. Since 2-bit precision benchmarks are run with
*_MESSAGE_2_CARRY_2_* parameters, they create only one block of
ciphertext, making these benchmarks unsuitable for the GPU backend.
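The block count follows from the radix decomposition: with
*_MESSAGE_2_CARRY_2_* parameters each block carries 2 message bits, so
2 bits of precision need only a single block. A quick sanity check:

```rust
/// Number of radix blocks needed for a given precision.
fn num_blocks(precision_bits: u32, message_bits_per_block: u32) -> u32 {
    precision_bits.div_ceil(message_bits_per_block)
}

fn main() {
    // 2-bit precision with MESSAGE_2 parameters: one block, which the
    // GPU backend rejects (it needs at least 2 blocks).
    assert_eq!(num_blocks(2, 2), 1);
    // 4-bit precision: 2 blocks, acceptable for the GPU backend.
    assert_eq!(num_blocks(4, 2), 2);
}
```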
This new benchmark is extracted from a use case: starting from a
compressed ciphertext, it measures the decompression, then the noise
squashing, and finally the re-compression of the result.
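The shape of one iteration, sketched with placeholder traits rather than
the actual tfhe-rs API (which is not pinned down here):

```rust
/// Placeholder traits standing in for the real operations, so the
/// three-step pipeline can be written down without assuming exact APIs.
trait Decompress { type Out; fn decompress(&self) -> Self::Out; }
trait SquashNoise { type Out; fn squash_noise(&self) -> Self::Out; }
trait Compress { type Out; fn compress(&self) -> Self::Out; }

/// One benchmark iteration: decompress, noise-squash, re-compress.
fn bench_iteration<C>(input: &C) -> <<C::Out as SquashNoise>::Out as Compress>::Out
where
    C: Decompress,
    C::Out: SquashNoise,
    <C::Out as SquashNoise>::Out: Compress,
{
    input.decompress().squash_noise().compress()
}
```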
It seems that in
```rust
bench_group.bench_function(&bench_id, |b| {
    // some code
    b.iter(|| {
        // function to bench
    })
});
```
if we put code in the '// some code' part, it affects the measurements:
the slower this code is, the worse the measurements can be.
For many operations the gap is small (a few ms, or no gap at all), but
for the division the gap was around 500ms.
To reduce this, we move out what we can; moving out the keycache access
is the most important change, as it costs around 70ms to 100ms.
A LazyCell is used so that the keycache is only accessed if the bench is
not filtered out. This is the behaviour we had before this commit, and
the behaviour we want to keep, so that running specific benches via
regex selection stays fast.
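A minimal sketch of the pattern; `load_keys`, `operation_under_bench`,
`bench_group`, and `bench_id` are placeholders, not the actual bench code:

```rust
use std::cell::LazyCell;

// Creating the LazyCell is essentially free: the closure only runs on the
// first dereference. `load_keys` stands in for the 70ms-100ms keycache access.
let keys = LazyCell::new(load_keys);

bench_group.bench_function(&bench_id, |b| {
    // The first access here triggers the keycache load, so benches that
    // were filtered out by the regex selection never pay for it.
    let keys = &*keys;
    b.iter(|| operation_under_bench(keys));
});
```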
Also, for clean input benches, we use `iter` instead of `iter_batched`,
as it makes more sense and should give more accurate results:
`iter_batched` timings include other things than just the timing of the
function under bench.
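For reference, the two shapes (`input` and the operations are
placeholders; `BatchSize` and both `Bencher` methods are criterion's):

```rust
use criterion::BatchSize;

// `iter_batched` keeps per-iteration setup out of the measured closure,
// but the batching machinery itself surrounds the measured call:
b.iter_batched(
    || input.clone(),      // setup, not measured
    |x| operation(x),      // measured routine
    BatchSize::SmallInput,
);

// When the operation does not consume or dirty its input, plain `iter`
// measures the call and nothing else:
b.iter(|| operation_by_ref(&input));
```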
Add new fields to HpuPBSParameters (log2_pfail and modulus_switch_type).
Also add a new parameter set definition in shortint for benchmark matching.
Remove the use of the use_mean_compensation register; this information is
now embedded inside the parameter set definition.
Update psi64.hpu archive with newest bitstream
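A hypothetical sketch of the parameter-struct change (the field names come
from the commit message; the real HpuPBSParameters and ModulusSwitchType
definitions in tfhe-rs differ and have more members):

```rust
/// Illustrative only: the real struct has many more fields.
pub struct HpuPBSParameters {
    // ... existing PBS parameters ...
    /// Failure probability of the PBS, stored as log2(p_fail).
    pub log2_pfail: f64,
    /// Replaces the standalone use_mean_compensation register: the modulus
    /// switch behaviour is now part of the parameter set definition itself.
    pub modulus_switch_type: ModulusSwitchType,
}

/// Placeholder for the corresponding enum in shortint parameters.
pub enum ModulusSwitchType {
    Standard,
    CenteredMeanNoiseReduction,
}
```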