- this would not give an average runtime for scalar benchmarks and for
small precisions could give super good timings (for lucky values)
- the timings for other precisions could still be favorable or unfavorable
depending on the value that was drawn
Done to circumvent criterion limitation regarding automatic
truncation of long benchmark ID.
Using a println() call we ensure the complete name is displayed
before benchmark execution to ease manual parsing and debugging.
Ciphertext sizes are measured at HLAPI layer with several
parameters set.
Keys sizes are measured at shortint level.
This benchmark has now its dedicated GitHub workflow that would
run, at least, each 24th of the month.
Add the flip(condition: BooleanBlock, a: T, b: T) -> (T, T)
operation that homomorphically flip/swap two values if the
given encrypted boolean encrypts true
It seems that in
```rust
bench_group.bench_function(&bench_id, |b| {
// some code
b.iter(|| {
// function to bench
})
});
```
If we put code in the '// some code' part, it affects the measurements
the slower this code is the worse the measurements can be.
For many operations the gap is small (a few ms or no gap),
but for the division the gap was around 500ms.
So to reduce this, we move out what we can, moving
the keycache access is the most important aspect as it
cost around 70ms to 100ms.
A LazyCell is used in order only access the keycache is the bench is not
filtered out. Which is the behaviour we had before this commit, and the
behaviour we want to keep so that running specific benches via regex
selection stay fast.
Also, for clean input benches, we use `iter` instead of `iter_batched`
as it makes more sense and should give more accurate results as
iter_batched timing include other things that just the timing of the
function.
- includes documentation about GPU's accelerated expand on the HL API
- rework CudaKeySwitchingKey
- Cloning the key is no longer necessary on the HL API
Heuristic based on PBS count was flawed since a ZK verification operation will eat up to 32 threads on the machine. The previous heuristic could generate an input data vector way bigger than the total of threads divided by 32. This in turn lead to long execution time for benchmark and generate bad results.
This backend abstract communication with Hpu Fpga hardware.
It define it's proper entities to prevent circular dependencies with
tfhe-rs.
Object lifetime is handle through Arc<Mutex<T>> wrapper, and enforce
that all objects currently alive in Hpu Hw are also kept valid on the
host side.
It contains the second version of HPU instruction set (HIS_V2.0):
* DOp have following properties:
+ Template as first class citizen
+ Support of Immediate template
+ Direct parser and conversion between Asm/Hex
+ Replace deku (and it's associated endianess limitation) by
+ bitfield_struct and manual parsing
* IOp have following properties:
+ Support various number of Destination
+ Support various number of Sources
+ Support various number of Immediat values
+ Support of multiple bitwidth (Not implemented yet in the Fpga
firmware)
Details could be view in `backends/tfhe-hpu-backend/Readme.md`