Compare commits

..

313 Commits

Author SHA1 Message Date
Yuxi Zhao
bddc35459d GITBOOK-5: Update TOC 2024-03-05 14:32:28 +00:00
Yuxi Zhao
27c421b359 GITBOOK-4: V2 design details 2024-02-28 15:27:05 +00:00
Yuxi Zhao
2adeff44f3 GITBOOK-3: correct a typo 2024-02-28 14:54:38 +00:00
Yuxi Zhao
d0042aed54 GITBOOK-2: No subject 2024-02-28 14:23:50 +00:00
Yuxi Zhao
5eabdeab55 GITBOOK-1: Remove extra sentences 2024-02-28 14:11:06 +00:00
yuxizama
0152c212af Update SUMMARY.md 2024-02-28 15:07:14 +01:00
yuxizama
9a2c4a3784 Rename what-is-tfhe-rs to what-is-tfhe-rs.md 2024-02-28 15:06:45 +01:00
yuxizama
c14aad5656 V2 resorting 6 2024-02-28 14:57:24 +01:00
yuxizama
702e0ef306 V2 resorting 5 2024-02-28 14:54:21 +01:00
yuxizama
515d2e009f V2 resorting 4 2024-02-28 14:30:49 +01:00
yuxizama
711b5151dc V2 resorting 3 2024-02-28 14:28:58 +01:00
yuxizama
ceaee2f910 V2 resorting 2 2024-02-28 14:28:18 +01:00
yuxizama
41015db7a1 V2 resorting 1 2024-02-28 14:27:56 +01:00
Yuxi Zhao
485b2a7693 GITBOOK-15: V2 change images and adjust wording 2024-02-28 13:13:02 +00:00
Yuxi Zhao
7d903d5f7a GITBOOK-13: update v2 2024-02-27 16:01:36 +00:00
Yuxi Zhao
19ac6eb123 GITBOOK-1: New structure 2024-02-27 16:27:16 +01:00
tmontaigu
5b653864b7 chore(tfhe): bump version to 0.5.2 2024-02-23 10:21:47 +01:00
Arthur Meyre
a1d189b415 chore(ci): update macOS runner for cargo builds 2024-02-23 10:21:47 +01:00
sarah el kazdadi
c59434f183 chore(ci): update toolchain, fix clippy warnings 2024-02-23 10:21:47 +01:00
David Testé
83239e6afa chore(bench): implement integer casting benchmarks 2024-02-23 10:21:47 +01:00
sarah el kazdadi
ef8cb0273f fix(tfhe): update pulp and bytemuck to fix nightly breakage 2024-02-23 10:21:47 +01:00
tmontaigu
9b353bac2d fix(integer): correct degree in small comparisons 2024-02-23 10:21:47 +01:00
tmontaigu
46d65f1f87 fix(capi): add missing function on FheBool
- safe ser/de
- classical ser/de
- comparisons
- scalar binary fn/comparisons
- compact & compressed fhe bool encryption
2024-02-23 10:21:47 +01:00
tmontaigu
a63a2cb725 chore(hlapi): add tests for fhe_bool 2024-02-23 10:21:47 +01:00
tmontaigu
c45af05ec6 fix(integer): make encrypt_bool specify the degree
encrypt_one_block does not leak information
on the message.
BooleanBlocks are meant for when we want to
be explicit that the value is a boolean
and are ok for this to be public.

Thus it needs to correctly set the degree to 1
for other operations to properly take advantage of that
2024-02-23 10:21:47 +01:00
tmontaigu
584eaeb4ed fix(shortint): fix bitwise opts degree
We used `after_bitand/or/xor` on the ct_left
**after** the lut had changed its degree.
So the `after_bit` function computed the
resulting using a wrong degree for the left
ct.
2024-02-23 10:21:47 +01:00
tmontaigu
8d94ed2512 fix(hlapi): bind missing cuda bitnot 2024-02-23 10:21:47 +01:00
tmontaigu
b8d9dbe85b refactor(hlapi): split long files of hlapi
This splits the long base.rs files into multiple ones,
to make it easier to navigate.

There is no code changes appart from moving stuff.
2024-02-23 10:21:47 +01:00
tmontaigu
ad25340c33 feat(capi): add Cuda support
- This adds GPU support in the C API
- Also make ctest (cmake test launcher) print
  test output when it fails
2024-02-23 10:21:47 +01:00
Arthur Meyre
ad1ae0c8c2 chore(ci): update scripts and Makefile for future forward compatibility 2024-01-31 18:22:15 +01:00
Arthur Meyre
ee40906b8b chore(ci): convert some make targets to be semver trick compatible 2024-01-31 18:22:15 +01:00
Arthur Meyre
bf6b4cc541 chore(tfhe): bump version to 0.5.1 2024-01-30 10:51:39 +01:00
Arthur Meyre
24404567a4 chore(tfhe): bump tfhe-cuda-backend version to 0.1.3 2024-01-30 10:51:39 +01:00
tmontaigu
052dd4a60e feat(integer): fuse two PBS in comparisons
In comparisons, we were reducing a vec of orderings
(inferior, equal, superior) into one final ordering,
and then we would do one final PBS to transform that
into a boolean value (0 or 1) depending what was wanted
(<=, <, >, >=).

This fuse the last PBS (ordering -> boolean value) with
the last round of reduction, when there are only two blocks left
to be reduced.

This allows to gain one PBS. Meaning for ciphertext/cipheretxt
comparisons we get back the performance lost introduced by
the fix in f4c220c1. And comparisons between a clear and
ciphertext get an improvement.
2024-01-30 10:51:39 +01:00
tmontaigu
f8d829d076 fix(integer): add noise cleaning pbs in comparisons
In comparisons we were packing blocks to then do a subtraction
between them. However this goes above the noise limit
that would guarentee the advertised error propability.

To fix that we add a pbs to clean the noise. This pbs only needs
to be added in the ciphertext/ciphertext comparisons. Making them slower
by 1 PBS.
2024-01-30 10:51:39 +01:00
dependabot[bot]
d9761ca17e chore(deps): bump codecov/codecov-action from 3.1.4 to 3.1.5
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 3.1.4 to 3.1.5.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](eaaf4bedf3...4fe8c5f003)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
dependabot[bot]
8d2e15347b chore(deps): bump tj-actions/changed-files from 42.0.0 to 42.0.2
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 42.0.0 to 42.0.2.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](ae82ed4ae0...90a06d6ba9)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
dependabot[bot]
a368257bc7 chore(deps): bump actions/upload-artifact from 4.1.0 to 4.3.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.1.0 to 4.3.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4.1.0...26f96dfa697d77e81fd5907df203aa23a56210a8)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
David Testé
76d23d0c91 chore(bench): add ciphertexts sum to integer benchmarks 2024-01-30 10:51:39 +01:00
David Testé
ddc5002232 chore(bench): add pbs benchmarks on gpu 2024-01-30 10:51:39 +01:00
tmontaigu
c08c479616 docs(hlapi): document trivial encryption to debug 2024-01-30 10:51:39 +01:00
tmontaigu
f26afc16de docs(hlapi): document how to use rayon 2024-01-30 10:51:39 +01:00
yuxizama
13f533f6fb chore(docs): update readme links and badges 2024-01-30 10:51:39 +01:00
yuxizama
d9541e472b chore(docs): update README.md
Change support banner
2024-01-30 10:51:39 +01:00
Agnes Leroy
3453e45258 fix(gpu): make all async functions unsafe, fix cuda_drop binding, add missing sync 2024-01-30 10:51:39 +01:00
David Testé
55de96f046 chore(ci): add gpu tests from user documentation 2024-01-30 10:51:39 +01:00
Agnes Leroy
9747c06f6e chore(gpu): fix formatting command 2024-01-30 10:51:39 +01:00
Agnes Leroy
00f72d2c13 chore(gpu): fix compilation when no nvidia gpu is available 2024-01-30 10:51:39 +01:00
tmontaigu
01f5cb9056 fix(integer): is_scalar_out_of_bounds handles bigger ct
Fix a bug where in is_scalar_out_of_bounds, if the scalar was
negative and the ciphertext a signed one with more blocks than
the decomposed scalar, we would do an out of bound access
(i.e a panic).

This fixes that, this will fix doing signed_overflowing_mul on 256 bits
where the bug first appeared
2024-01-30 10:51:39 +01:00
David Testé
d66e313fa4 chore(ci): fix inputs for gpu full benchmark workflow 2024-01-30 10:51:39 +01:00
Arthur Meyre
c9d530e642 fix(core): ignore value in the body when doing LWE encryption 2024-01-30 10:51:39 +01:00
Agnes Leroy
6c2096fe52 chore(gpu): rename "test vector" -> "luts" and "tvi" -> "lut_indexes" 2024-01-30 10:51:39 +01:00
Agnes Leroy
1e94134dda chore(gpu): move around code in integer.h for better readability 2024-01-30 10:51:39 +01:00
tmontaigu
c76a60111c fix(integer): fix cast in scalar_shift/rotate
In scalar_shift/rotate, we get the number of bits to shift/rotate
as a generic type, the can be casted to u64.

We compute the total number of bits the ciphertext has, cast that number
to the same type as the scalar, and do "shift % num_bits".

However, if the number of bits computed exceeds the max value the scalar
type can hold, we could end up doing a remainder with 0.

e.g 256bits ciphertext and scalar type u8 => 256u64 casted to u8 results
in 0.

Fix that by casting the scalar value to u64.
2024-01-30 10:51:39 +01:00
tmontaigu
18ff400df2 chore(hlapi): remove leftover file
This file was not correctly removed during the refactor
2024-01-30 10:51:39 +01:00
David Testé
3d31d09be5 chore(ci): change rust-toolchain action
Github thrid-party Action actions-rs/toolchain is not maintained
anymore. We switch to dtolnay/rust-toolchain.
2024-01-30 10:51:39 +01:00
David Testé
76322606f2 chore(ci): set rustbacktrace var to full to ease debug on failure 2024-01-30 10:51:39 +01:00
dependabot[bot]
bf58a9f0c6 chore(deps): bump actions/upload-artifact from 3.1.2 to 4.2.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 3.1.2 to 4.2.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v3.1.2...694cdabd8bdb0f10b2cea11669e1bf5453eed0a6)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
dependabot[bot]
64461c82b4 chore(deps): bump tj-actions/changed-files from 41.1.1 to 42.0.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 41.1.1 to 42.0.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](62f4729b5d...ae82ed4ae0)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
dependabot[bot]
339c84fbd9 chore(deps): bump actions/checkout from 3.5.3 to 4.1.1
Bumps [actions/checkout](https://github.com/actions/checkout) from 3.5.3 to 4.1.1.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v3.5.3...b4ffde65f46336ab88eb53be808477a3936bae11)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-30 10:51:39 +01:00
Arthur Meyre
bc682a5ffb docs(bench): add scalar benchmarks for integer 2024-01-29 16:42:32 +01:00
Arthur Meyre
2920daf2d9 chore(docs): fix link to 0.4 semver doc 2024-01-23 10:50:25 +01:00
Arthur Meyre
e7352eee8b chore(tfhe): add version for tfhe-cuda-backend dependency 2024-01-22 17:24:51 +01:00
J-B Orfila
33a7e9f3e4 doc(gpu): add how to page about running on GPU 2024-01-22 17:18:02 +01:00
David Testé
96da25ce90 chore(ci): separate clippy and tests in different job steps 2024-01-22 16:04:07 +01:00
Agnes Leroy
548f2e5d05 chore(gpu): fix gpu package for publication 2024-01-22 15:29:26 +01:00
Arthur Meyre
f313b58c8e doc(tfhe): add doc page about data migration which will redirect to 0.4 doc 2024-01-22 13:43:44 +01:00
J-B Orfila
fd038346b7 doc: overflow detection 2024-01-22 13:42:55 +01:00
tmontaigu
f0fcfd517b feat(hlapi): update wasm and c_api 2024-01-22 10:06:49 +01:00
Arthur Meyre
0d2448e9e9 feat(hl_api): add raw parts API for Ciphertext types 2024-01-19 13:05:26 +01:00
Arthur Meyre
68ef237ae6 feat(hl_api): add raw parts API for ServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
fb39864f05 feat(hl_api): add raw parts API for PublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
69a5562aba feat(hl_api): add raw parts API for CompressedServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
d49ffdd26f feat(hl_api): add raw parts API for CompressedPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
c87c362d42 feat(hl_api): add raw part API for CompressedCompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
905ef4ea78 feat(hl_api): add raw parts API for CompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
957dd47295 feat(boolean): add raw parts API for PublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
0a550ac803 feat(boolean): add raw parts API for CompressedPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
8c62155429 feat(boolean): add raw parts API for CompressedServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
3e23631bdc feat(boolean): add raw parts API for ServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
02b2fcf78d feat(boolean): add raw parts API for KeySwitchingKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
d9d222c1b5 feat(integer): add raw parts API for CompressedCompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
f2011cd30d feat(integer): add raw parts API for CompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
1b3d41ec44 feat(integer): add raw parts API for CompressedPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
65816a175a feat(integer): add raw parts API for PublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
2d1cf95900 feat(integer): add raw parts API for KeySwitchingKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
247072a81a feat(integer): add raw parts API for CompressedServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
c65a58c14f feat(integer): add raw parts API for ServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
3a4859553e refactor(integer): add raw parts API for Wopbskey instead of from_shortint 2024-01-19 13:05:26 +01:00
Arthur Meyre
9c3a159ca1 feat(integer): add raw parts API for CompactCiphertextList 2024-01-19 13:05:26 +01:00
Arthur Meyre
e76ddd5a49 feat(shortint): add raw parts API for WopbsKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
fa35b6ef8f feat(boolean): add raw parts API for CompressedCiphertext 2024-01-19 13:05:26 +01:00
Arthur Meyre
a6e835b3f1 feat(shortint): add raw parts API for CompactCiphertext and related list 2024-01-19 13:05:26 +01:00
Arthur Meyre
c586d64fab feat(shortint): add raw parts API for CompressedServerKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
6108f180bf feat(shortint): add raw parts API to PublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
8f0e4f6c99 feat(shortint): add raw parts API for CompressedPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
473b7e0c6a feat(shortint): add raw parts API to CompressedCompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
e9c92bc9a3 feat(shortint): raw parts API for CompactPublicKey 2024-01-19 13:05:26 +01:00
Arthur Meyre
e4c0c4c15f feat(shortint): add function to recompute thread count 2024-01-19 13:05:26 +01:00
Arthur Meyre
ea7c579efc feat(shortint): add raw parts API for ServerKey and KS Key
- add a method to easily get the expected LweDimension of a shortint
Ciphertext for a given ServerKey
2024-01-19 13:05:26 +01:00
Arthur Meyre
c9d19fca19 feat(tfhe): add raw parts API for HL API forward compatibility
- allows to convert HL API ClientKey from 0.4 to 0.5
- add raw parts API for shortint ClientKey
- add raw parts API for integer ClientKey
- add raw parts API for HL API IntegersClientKey
- add raw parts API for HL API ClientKey
2024-01-19 13:05:26 +01:00
Arthur Meyre
f3e6074480 feat(core): backward compatibility types for fft re-exported 2024-01-19 13:05:26 +01:00
tmontaigu
17c110f536 feat(hlapi): add gpu support 2024-01-19 09:27:04 +01:00
Arthur Meyre
45e27d8836 chore(core): remove Serialize and Deserialize on GlweBody
- users are not expected to use those structs for serialization
2024-01-18 17:42:07 +01:00
tmontaigu
ee10508c99 refactor(hlapi): prepare hlapi for different backends/representations
- Split GenericInteger into FheUint and FheInt
- Prepare FheUint and FheBool to support different backends
- Allow the use of 1_1 parameters

BREAKING CHANGE: unset_server_key no longer returns the server key
2024-01-18 16:58:11 +01:00
Beka Barbakadze
9213436b93 fix(gpu): fix mem reuse in multiplication, remove extra variable from lut constructor 2024-01-18 16:06:07 +01:00
Beka Barbakadze
aa2a8e31fe feat(gpu): return different chunk sizes for different number of ciphertexts 2024-01-18 16:06:07 +01:00
Beka Barbakadze
d49b8235bf feat(gpu): implement memory reuse for lut objects 2024-01-18 16:06:07 +01:00
David Testé
cec4a5b60b chore(bench): fix multi-bit parameters selection for multi-bit
Operation flavors for GPU don't end with "_gpu" anymore. So we
can't rely on these string, instead we are using the configuration
feature to select which multi-bit parameters set we use.
The bit size limitation for CPU in multi-bit is also put back in
place.
2024-01-18 15:43:53 +01:00
Mayeul@Zama
15147b4359 feat(integer): add oprf bench 2024-01-18 11:03:37 +01:00
Mayeul@Zama
f2ee360a47 feat(integer): add oblivious pseudo-random function 2024-01-18 11:03:37 +01:00
Mayeul@Zama
6631aae069 feat(shortint): add oblivious pseudo-random function 2024-01-18 11:03:37 +01:00
Mayeul@Zama
fa9cd866e4 feat(shortint): add no encoding functions 2024-01-18 11:03:37 +01:00
Pedro Alves
c632ac1b9a feat(gpu): add tfhe-cuda-backend to the repository 2024-01-18 10:14:36 +01:00
Arthur Meyre
f0e6b4c395 chore(ci): add newline to workflow 2024-01-17 18:38:45 +01:00
David Testé
2cd51ed36d chore(ci): add placeholder workflow for cuda backend 2024-01-17 17:59:44 +01:00
tmontaigu
dc04a5138e feat(integer): add signed_overflowing_mul 2024-01-17 10:54:58 +01:00
tmontaigu
eda338aa29 fix(integer): fix decrypting negative value of N >= 32 BITS
Bug was probably introduced in 48405959a4
2024-01-17 10:54:58 +01:00
Arthur Meyre
df6fa86481 refactor(c_api): use DynamicBuffer as the buffer type between C and Rust
- allows to share data safely between C and Rust and make sure it gets
destroyed with the right destructor
- apply formatting to C tests
- add a script to symlink the dependency lib to a fixed name
- the rust build system adds a fingerprint to the lib name which is not
practical when we are linking our C test executable
- link the most recent artifacts corresponding to the dependency we have
2024-01-17 10:45:02 +01:00
David Testé
6742e150b0 chore(ci): add overflowing unsigned multiplication to benchmarks 2024-01-16 17:52:35 +01:00
tmontaigu
93fac32755 refactor(shortint): make scalar_eq/scalar_not_eq take & not &mut
BREAKING_CHANGE: change mutability of scalar_equal/scalar_not_equal
arguement
2024-01-16 16:21:10 +01:00
Arthur Meyre
9ac57e75c9 feat(boolean): add raw parts methods to the ClientKey
- into_raw_parts allows to deconstruct a ClientKey
- new_from_raw_parts allows to construct a ClientKey
2024-01-16 13:08:27 +01:00
David Testé
a7abee0491 chore(ci): gather common benchmarks groups
Instead of having multiple BENCH_OP_FLAVOR to run for full
benchmarks, criterion groups are now gathered by operation kind
(default, smart, unchecked).
2024-01-15 17:09:00 +01:00
tmontaigu
0228a58cfc feat(integer): add cast_to_signed/unsigned
These functions does the logic of uX/iX as iX and uX/iX as uX
2024-01-15 15:14:54 +01:00
tmontaigu
f98c680e95 feat(integer): add unsigned overflowing mul 2024-01-15 12:24:04 +01:00
dependabot[bot]
dcc3d267e4 chore(deps): bump actions/upload-artifact from 4.0.0 to 4.1.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](c7d193f32e...1eb3cb2b3e)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-15 10:15:31 +01:00
dependabot[bot]
2067092e0a chore(deps): bump tj-actions/changed-files from 41.0.1 to 41.1.1
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 41.0.1 to 41.1.1.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](716b1e1304...62f4729b5d)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-15 10:13:41 +01:00
Arthur Meyre
7c50216f7a chore(core_crypto): remove Serde on some decomposition structs
- those structures are not expected to be serialized by a user
2024-01-10 18:25:18 +01:00
David Testé
77c0532793 chore(ci): add overflowing add and sub to integer benchmarks 2024-01-10 15:26:28 +01:00
Arthur Meyre
00ddfdec8b refactor(boolean): call as_view on LWE secret keys where appropriate
- previous code had to manually create a view from a container, this is
less convoluted and more user friendly
2024-01-08 17:17:00 +01:00
Arthur Meyre
ef9ec13999 refactor(shortint): remove the large_lwe_secret_key from the ClientKey
- accessors are provided to access the large and small LWE secret keys and
make sure they have compatible data types when used in functions behaving
differently depending on the encryption key choice
2024-01-08 17:17:00 +01:00
Arthur Meyre
b53c8aac3f feat(core): add view types and "as_view" functions for (G)LWE secret keys 2024-01-08 17:17:00 +01:00
Arthur Meyre
f052c1f8ba chore(doc): fix docstring for FheIntX types 2024-01-05 17:30:15 +01:00
Mayeul@Zama
fef6d18605 doc(all): add doc for safe_(de)serialization 2024-01-05 10:24:36 +01:00
Mayeul@Zama
e5505ab686 chore(all): update SERIALIZATION_VERSION for 0.5 release 2024-01-05 10:24:27 +01:00
Arthur Meyre
ab2c5f09a8 chore(tfhe): update 2023 to 2024 in license files and other places 2024-01-04 16:26:21 +01:00
Mayeul@Zama
40ae841a15 refactor(all): make paste dependency non optional 2024-01-04 10:33:36 +01:00
Mayeul@Zama
0a317c5f0e refactor(all): remove safe-deserialization feature
bincode dependency is not optional anymore
2024-01-04 10:33:36 +01:00
Arthur Meyre
415a8a2de5 chore(c_api): add the c_api code from the docs as a test 2024-01-03 13:26:05 +01:00
Arthur Meyre
935da25360 doc(c_api): Add an output for the users compiling the C API example 2024-01-03 13:26:05 +01:00
tmontaigu
dbeff4e4b4 docs(capi): fix C API example 2024-01-03 13:26:05 +01:00
Arthur Meyre
1e50d0cdd2 chore(doc): fix latex equation typo preventing formatting on GitHub
refs https://github.com/zama-ai/tfhe-rs/issues/748
2024-01-02 17:05:12 +01:00
dependabot[bot]
7c5551bf45 chore(deps): bump actions/upload-artifact from 3.1.3 to 4.0.0
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 3.1.3 to 4.0.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](a8a3f3ad30...c7d193f32e)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-02 13:17:58 +01:00
dependabot[bot]
95c36d54cb chore(deps): bump tj-actions/changed-files from 40.2.1 to 41.0.1
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40.2.1 to 41.0.1.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](1c938490c8...716b1e1304)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-01-02 13:17:45 +01:00
Benoit Chevallier-Mames
384e15ca5a style(docs): fixing a few typos in the README 2023-12-21 11:51:52 +01:00
tmontaigu
526a53f3d4 feat(integer): add signed_overflowing_scalar_add/sub 2023-12-15 10:57:21 +01:00
Mayeul@Zama
7d17b71740 fix(shortint): fix smart_add/sub 2023-12-13 09:55:20 +01:00
Mayeul@Zama
8ecb85e4dd refactor(shortint): is_possible ops take CtNoiseDegree 2023-12-13 09:55:20 +01:00
Mayeul@Zama
cf30db7a30 fix(shortint): rename smart_apply_lookup_table_bivariate 2023-12-13 09:55:20 +01:00
Mayeul@Zama
2599f7d5ea fix(shortint): fix ciphertexts_can_be_packed_without_exceeding_space_or_noise 2023-12-13 09:55:20 +01:00
Mayeul@Zama
ae88bb3264 fix(shortint): fix smart_evaluate_bivariate_function 2023-12-13 09:55:20 +01:00
Mayeul@Zama
6fb898db66 refactor(shortint): move bivariate_pbs to its own module 2023-12-13 09:55:20 +01:00
Mayeul@Zama
e750d2cd92 refactor(shortint): use smart_evaluate_bivariate_function and evaluate_univariate_function 2023-12-13 09:55:20 +01:00
Arthur Meyre
d8586080da chore(integer): simplify an API to create a ServerKey from a shortint one
- the API required passing the client_key to get access to the message and
carry modulus, as the shortint ServerKey already has that information drop
the requirement to pass the ClientKey

BREAKING CHANGE:
new_radix_server_key_from_shortint, new_crt_server_key_from_shortint APIs
have changed and no longer require a ClienKey, the max degree methods now
only take a MessageModulus and CarryModulus instead of the full parameter
set
2023-12-13 09:41:59 +01:00
tmontaigu
4cc2e85556 feat(integer): add unsigned_overflowing_scalar_sub 2023-12-12 16:08:06 +01:00
tmontaigu
303a65c88d feat(integer): add unsigned_overflowing_scalar_add 2023-12-12 16:08:06 +01:00
Arthur Meyre
18c01e74d6 chore(core_crypto): remove legacy new types 2023-12-12 11:06:06 +01:00
dependabot[bot]
a8b6c72910 chore(deps): bump tj-actions/changed-files from 40.2.0 to 40.2.1
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40.2.0 to 40.2.1.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](da093c1609...1c938490c8)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-12-11 10:21:53 +01:00
Arthur Meyre
c2b21ed709 chore(integer): remove unused RadixDecomposition struct and code 2023-12-08 10:38:21 +01:00
tmontaigu
ad41fdf5a5 feat(integer): expose default and smart ciphertext sum
This expose the ciphertext sum default and smart variants

This also removes the par_seq_op functions as they are less optimal
2023-12-06 10:50:45 +01:00
Arthur Meyre
eeae19f35f chore(core_crypto): disable seeded entities for non power of 2 moduli
- as random uniform generation has rejection sampling for non native moduli
the seeded decompression currently does not work as it allocates just
enough bytes for a native integer and not for the various retries which may
be needed
- follow-up issue: https://github.com/zama-ai/tfhe-rs-internal/issues/358
2023-12-05 15:25:30 +01:00
Mayeul@Zama
798572e58c style(shortint): rename MaxNoiseLevel::valid validate 2023-12-04 18:16:46 +01:00
Mayeul@Zama
36d375943c refactor(shortint): encapsulate Degree and MaxDegree 2023-12-04 18:16:46 +01:00
Mayeul@Zama
a1488b10d5 style(shortint): move MaxDegree 2023-12-04 18:16:46 +01:00
Mayeul@Zama
b153641280 style(shortint): scalar_sub use scalar_add 2023-12-04 18:16:46 +01:00
tmontaigu
48405959a4 feat(integer): add decrypt_trivial
This adds a decrypt_trivial method to all ciphertext types of
shortint and integer.

This functions tries to "decrypt" the ciphertext if it is a
trivial one, otherwise it return an error.

This is meant to be a debugging 'tool':

To debug a function / circuit, users can call the function
on trivial ciphertexts intead of real ciphertext, that way,
computations are faster _and_ they will now be able to see intermediate
values via these decrypt_trivial, to do some print-debugging or use a
debugger.
2023-12-04 15:50:53 +01:00
Arthur Meyre
d39e73be91 chore(core): freshen up native decomposition tests
- the classic native decomposer results are exact and there aren't that
many cases to test so tests are changed to be exhaustive
- non native currently still has some work in progress parts so won't be
made exhaustive right away, additionally the next PR for the prime Q effort
will make some changes to it, so this is why the current code has not been
touched for the non native decomposer
2023-12-04 14:56:17 +01:00
dependabot[bot]
71447d845f chore(deps): bump tj-actions/changed-files from 40.1.1 to 40.2.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40.1.1 to 40.2.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](25ef3926d1...da093c1609)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-12-04 09:22:17 +01:00
Arthur Meyre
837df59b44 fix(shortint): programmable_bootstrapping_native_crt could alter its input
- the alteration is a trick to be able to perform the wopbs in the native
crt mode but it modifies the input ciphertext and leaves it modified
meaning that the input no longer represents the same data/encryption
- applying the functon several times likely would lead to incorrect results
starting from the second call with increasing error/divergence from the
original encrypted data
- clone locally instead, should be negligible given wopbs runtime

BREAKING CHANGE:
programmable_bootstrapping_native_crt signature has changed
2023-12-01 14:46:09 +01:00
Arthur Meyre
303cac2092 chore(ci): re-enable release profile for doctest
- following merge of 17.0.4 in rust stable the bug uncovered by lto on
aarch64 has been fixed https://github.com/rust-lang/rust/issues/116941 so
we remove the hard coded override
2023-12-01 12:34:26 +01:00
Arthur Meyre
bb1a969c34 chore(ci): update nightly toolchain to have fixed LLVM as well
- fix lints linked to latest nightly
2023-12-01 12:34:26 +01:00
tmontaigu
9dac9242be feat(integer): add boolean_ops to work on BooleanBlock
This adds boolean_bitand/or/xor/not to work on BooleanBlock
This is meant to improve API of BooleanBlock.
2023-12-01 11:13:31 +01:00
tmontaigu
e97ac815eb feat(shortint): add trivial pbs
When we detect that the ciphertext to be bootstrapped
is a trivial ciphertext, we can apply the lookup table
in clear thus saving computation time.

This allows shortint,integer,hlapi to have
way faster computations when all inputs are trivial
allowing to more rapidly check and debug a circuit
2023-11-30 22:01:28 +01:00
Arthur Meyre
62feb59722 chore(ci): fix doctest by using parameters with enough precision 2023-11-30 14:57:07 +01:00
Mayeul@Zama
ac8916a30f refactor(shortint): define woppbs on server key instead of engine 2023-11-30 14:54:33 +01:00
Mayeul@Zama
069ea98ad6 style(shortint): fix remaining unused_self 2023-11-30 14:54:33 +01:00
Mayeul@Zama
e587d1835e refactor(shortint): define operations on server key instead of engine 2023-11-30 14:54:33 +01:00
Mayeul@Zama
16121e7487 style(core): fix some unused self 2023-11-30 14:54:33 +01:00
Mayeul@Zama
3ab566de7b style(boolean): fix some unused self 2023-11-30 14:54:33 +01:00
David Testé
2309b07703 test(integer): add smart tests for radix_parallel min/max
This also refactor the code to use macro to parametrize tests to
make code smaller.
2023-11-30 14:44:17 +01:00
David Testé
8755094c38 test(integer): fix tests for unsigned scalar min/max operations 2023-11-30 14:44:17 +01:00
David Testé
4da10e9dd5 chore(ci): add aws ec2 fallback profile for cpu tests
This is done to mitigate resource shortages in our base AWS region
(eu-west-3) due to the high number of instances that are launched
in parallel in our Pull Requests.
2023-11-30 13:02:42 +01:00
Arthur Meyre
cdda260063 chore(ci): .gitignore was ignoring all files/directories names keys
- this was hiding some source files in vscode search and could likely have
been very annoying when commiting stuff
2023-11-29 18:14:33 +01:00
Arthur Meyre
be413fff50 chore(shortint): remove doctest as tests as it is confirmed they fail
- doctest were also failing as tests and so it is not linked to doctests
- still unclear what is causing the issue, the results are sometimes way
off
- the concentration of failed tests can indicate a miscompile as those
tests never fail on the M1 CI, some alignment is causing issues from time
to time
2023-11-29 17:43:25 +01:00
Arthur Meyre
3ed960d255 chore(ci): fix a command naming issue for the CI 2023-11-29 16:07:31 +01:00
Arthur Meyre
bdadd39a34 chore(ci): add some missing spec indicators for cargo commands 2023-11-29 15:29:06 +01:00
Arthur Meyre
f03f2f9c6d chore(ci): fix clippy trivium target which also triggered for tfhe 2023-11-29 15:29:06 +01:00
Arthur Meyre
f03ec9bbed chore(tfhe): re-allow semicolon_if_nothing_returned
- create a section for lints that have been considered to be disallowed but
are kept allowed, either because they bring too little value or because
they don't help with readability in certain circumstances
2023-11-29 15:28:43 +01:00
Arthur Meyre
5137751dd2 chore(ci): fix pedantic lint with missing trailing ; on () return 2023-11-29 15:28:43 +01:00
tmontaigu
edc3449dbf feat(integer): add signed overflowing_add 2023-11-29 13:02:58 +01:00
Arthur Meyre
6068c509de feat(tfhe): add LWE encryption and linalg with non power of 2 moduli 2023-11-29 09:54:55 +01:00
David Testé
30a4348e3a test(integer): use keycache for comparisons and wopbs 2023-11-28 14:11:53 +01:00
Mayeul@Zama
3013e02d90 fix(core): fix typo in comment 2023-11-28 13:10:23 +01:00
Mayeul@Zama
c029917c5c style(all): rename NB_TEST NB_TESTS 2023-11-28 13:10:23 +01:00
Mayeul@Zama
d23d04021b style(core): fix clippy::inconsistent_struct_constructor 2023-11-28 13:10:23 +01:00
Mayeul@Zama
5a5e9e0ac1 style(core): fix clippy::ptr_as_ptr 2023-11-28 13:10:23 +01:00
Mayeul@Zama
cc0a3bad8d style(core): fix clippy::iter_without_into_iter 2023-11-28 13:10:23 +01:00
Mayeul@Zama
1d12f60849 style(core): fix clippy::default_trait_access 2023-11-28 13:10:23 +01:00
Mayeul@Zama
a0db39c86e style(core): fix clippy::redundant_closure_for_method_calls 2023-11-28 13:10:23 +01:00
Mayeul@Zama
ef4558ac13 style(core): fix clippy::trivially_copy_pass_by_ref 2023-11-28 13:10:23 +01:00
Mayeul@Zama
bfb22b4531 style(core): fix clippy::needless_pass_by_value 2023-11-28 13:10:23 +01:00
Mayeul@Zama
88025010e1 style(core): fix clippy::unnecessary_wraps 2023-11-28 13:10:23 +01:00
Mayeul@Zama
b1f4f3b330 style(core): fix clippy::semicolon_if_nothing_returned 2023-11-28 13:10:23 +01:00
Mayeul@Zama
7575a426ab style(core): fix clippy::used_underscore_binding 2023-11-28 13:10:23 +01:00
Mayeul@Zama
1ac57218b1 style(core): fix clippy::manual_let_else 2023-11-28 13:10:23 +01:00
Mayeul@Zama
b7c3f16e24 style(core): fix clippy::implicit_clone 2023-11-28 13:10:23 +01:00
Mayeul@Zama
bf4f9198fb style(core): fix clippy::if_not_else 2023-11-28 13:10:23 +01:00
Mayeul@Zama
e618e1d05d style(core): fix clippy::uninlined_format_args 2023-11-28 13:10:23 +01:00
Mayeul@Zama
000428d688 style(core): replace assert by unwrap 2023-11-28 13:10:23 +01:00
Arthur Meyre
937c90666b chore(core): change the test LUT to apply the identity function
- the identity function more easily detects errors in the PBS as each mega
case contains a different value compared to its neighbours
2023-11-28 11:27:45 +01:00
sarah el kazdadi
b6a6f1b098 feat(core): specialize keyswitch implementation for small scalar values 2023-11-27 09:57:37 +01:00
David Testé
c2d7f1748c chore(ci): add core_crypto layer to code coverage 2023-11-22 10:21:17 +01:00
Mayeul@Zama
e8cd55dee6 feat(shortint): add degree information in CheckError::CarryFull 2023-11-21 19:52:26 +01:00
Mayeul@Zama
95aea9dbe8 feat(shortint): add noise checks 2023-11-21 19:52:26 +01:00
Mayeul@Zama
89f701d307 refactor(shortint): refactor CheckError 2023-11-21 19:52:26 +01:00
Mayeul@Zama
224146686f feat(shortint): add max_noise_level 2023-11-21 19:52:26 +01:00
Mayeul@Zama
b6b5f92220 fix(integer): update noise_level manually in direct calls to core_crypto 2023-11-21 19:52:26 +01:00
David Testé
0fec9e252b chore(ci): change benchmark aws ec2 machine type
This instance type hpc7a.96xlarge yields better performances for
nearly the same hourly cost.
2023-11-21 14:34:08 +01:00
Arthur Meyre
53c9b82824 chore(shortint): fix typo for NoiseLevel variant UNKNOWN 2023-11-20 18:54:14 +01:00
tmontaigu
f670a950d6 fix(integer): fix inner index computation in sum
In the function that sums a vec of ciphertexts,
we track trivial zeros to avoid un-needed PBSes.

One of this tracker is `last_block_where_addition_happened`
however it was not properly computed.
It was initialized to `num_blocks - 1`, and then got applied
a bunch of `max(current, new)` where 0 <= new <= num_block - 1
which means last_block_where_addition_happened was always num_blocks - 1.

The correct initial value is 0.
2023-11-20 16:40:21 +01:00
tmontaigu
a44970a9a3 feat(integer): avoid un-necessary computations in mul params 1_X
When the parameters have 1 bit of message (message modulus == 2)
then the multiplication of 2 blocks does not create a result that can
go into the carry space.

We use that fact to avoid doig un-necessary computations
when multiplying integers encrypted under parameters with 1 bit.
2023-11-20 15:21:59 +01:00
Arthur Meyre
55775b8e02 fix(shortint): fix overflow behavior of NoiseLevel
- we will need to use a MAX/UNKNOWN level for forward compatibility with
old serialized ciphertexts, this patch ensures the add/mul behavior
saturates properly to usize::MAX to force a refresh in operations which
do it automatically
2023-11-17 18:34:02 +01:00
Arthur Meyre
523d561de6 chore(ci): add _ci_run_filter to standalone tests in shortint
- those tests were likely ignored, this is no longer the case
2023-11-17 18:34:02 +01:00
tmontaigu
61a50d0bcc chore(integer): make oveflowing_add/sub return BooleanBlock 2023-11-17 16:22:20 +01:00
Arthur Meyre
ee57f5658b chore(ci): refactor integer script and skip div and rem preferring div_rem 2023-11-17 15:00:50 +01:00
tmontaigu
9362965f50 feat(integer): add accessors to inner shortint sks
Users can access blocks from an integer but they don't have
the ability to use the inner shortint server key to process
individual blocks.

This adds an AsRef impl on integer ServerKey to allow that.

This also adds shortcuts to the integer ServerKey to get
the MessageModulus/CarryModulus (these are shorticuts
because users could do `integer_key.as_ref().message_modulus`.
2023-11-16 16:25:27 +01:00
Arthur Meyre
00fb60451d chore(ci): group signed and unsigned integer for better runtime homogeneity 2023-11-16 14:18:30 +01:00
Arthur Meyre
18b9fd4464 chore(ci): re-enable mistakenly disabled AVX512 for integers 2023-11-16 14:18:30 +01:00
Arthur Meyre
eace0bfb85 chore(ci): spread tests between two CI machines/workflow for faster runtime 2023-11-16 14:18:30 +01:00
Arthur Meyre
af1be5ebca chore(core): fix noise generation which could overflow the custom modulus
- updated some function name (for modulus checking) to be clearer on what
they do and when to use them
2023-11-16 08:58:40 +01:00
tmontaigu
916bd8a09f feat(hlapi): move if_then_else/cmux to FheBool
- This makes FheBool use integer::BooleanBlock internally.
- It makes comparisons (eq, ne, le, etc) return a FheBool instead of
  FheUint/FheInt.
- It also moves the if_then_else and cmux methods to FheBool.
- Adds casting from FheBool to FheUint/FheInt (but not from
  FheUint/FheInt to FheBool as we expect users to do `a.ne(0)`
  as its matches Rust)

BREAKING CHANGE:
    - Comparisons now return FheBool
    - if_then_else/cmux are now methods of FheBool.
2023-11-15 23:22:30 +01:00
tmontaigu
20cb0642ce refactor(hlapi): implement CastFrom for GenericInteger
And add the trait to the prelude so that users can use
it.
2023-11-15 23:22:30 +01:00
Arthur Meyre
151f9f6d82 chore(ci): fix build on main following several big merges 2023-11-15 13:29:08 +01:00
Arthur Meyre
8db8cb49e4 chore(shortint): add some flaky/failing doctests as actual tests
- check that those are actually failing or that they are a doctest bug
- add _ci_run_filter so that we can easily make sure tests run in CI even
if they don't have the "parameter format"
2023-11-15 11:10:44 +01:00
Arthur Meyre
b4583976a2 chore(tfhe): fix .gitignore for key cache
- this was not properly ignoring the keycache if a file had a specific
extension
2023-11-15 11:10:30 +01:00
Arthur Meyre
b450375da1 chore(integer): restore assert after using 3_3 params for CRT doctests
- fix max degree for CRT keys which don't need to propagate carries

BREAKING CHANGE:
pub API removed from pub interface
2023-11-15 11:10:30 +01:00
tmontaigu
f02f1fb297 feat(integer): add unsigned_oveflowing_add 2023-11-14 18:57:09 +01:00
Mayeul@Zama
17642fa703 refactor(shortint): remove unused EngineResult 2023-11-14 16:30:09 +01:00
Mayeul@Zama
23fa9b24bd refactor(shortint): separate lut generation from ShortintEngine 2023-11-14 16:30:09 +01:00
tmontaigu
0453b9bd60 fix(integer): fix signed_overflowing_sub using trivial 0 2023-11-13 15:43:33 +01:00
Arthur Meyre
9b2cf67911 chore(tfhe): fix required features for the generate_test_keys util 2023-11-13 10:05:17 +01:00
dependabot[bot]
36a7656048 chore(deps): bump tj-actions/changed-files from 40.1.0 to 40.1.1
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40.1.0 to 40.1.1.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](18c8a4eceb...25ef3926d1)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-11-13 09:58:27 +01:00
Arthur Meyre
61c8eadd58 chore(ci): update Makefile for semver trick compatibility
- adding the tfhe package as a dependency is currently causing issues with
Cargo because of unified feature resolution it seems, it needs an
additional version specifier to disambiguate which package we are referring
to, an issue exists on their end but I don't think a fix is to be expected
soon https://github.com/rust-lang/cargo/issues/12891
- commiting this to main and then backporting the relevant pieces to 0.4.x
2023-11-10 15:35:38 +01:00
Arthur Meyre
fdd4d9d1cc chore(c_api): add more comments in the build.rs file and cbindgen.toml 2023-11-10 15:35:38 +01:00
Arthur Meyre
62700ab853 chore(tfhe): clarify dependency vs feature selection 2023-11-10 15:35:38 +01:00
Arthur Meyre
27445645e7 chore(c_api): have a way to skip cbindgen in a semver trick setting 2023-11-10 15:35:38 +01:00
tmontaigu
ea0cd26c0b chore(tfhe): fix builds on main 2023-11-10 15:15:31 +01:00
David Testé
ff48582679 test(core_crypto): silence dead code warnings on test utils 2023-11-10 09:35:16 +01:00
tmontaigu
a77c87ff12 refactor(hlapi): make GenericInteger generic over the Id 2023-11-09 20:33:53 +01:00
tmontaigu
6d143f1edc refactor(hlapi): remove unused FromParameters trait 2023-11-09 20:33:53 +01:00
Arthur Meyre
216e6b443a chore(tfhe): fix pedantic lints 2023-11-09 17:12:00 +01:00
Arthur Meyre
1400ae946c test(tfhe): add uniform random test
- use DKW test, it is e.g. used in
https://github.com/wch/r-source/blob/trunk/tests/p-r-random-tests.R

See Wikipedia DKW inequality
2023-11-09 17:12:00 +01:00
Arthur Meyre
c332902a05 feat(core): add support for non power of 2 moduli for random generation
- add convenience function to get truncated f64 value of an integer modulus
- update trait bounds for random generation for clearer diagnostics
2023-11-09 17:12:00 +01:00
Arthur Meyre
cf7a7f132d chore(doc): update a slightly wrong docstring 2023-11-09 14:38:43 +01:00
tmontaigu
6e0a3b9ad7 feat(integer): add BooleanBlock wrapper type
The BooleanBlock wrapper type is meant to convey the fact that
the ciphertext encrypts a 0 or 1.

Since its meant to be a simple wrapper, the goal for is to be flexible
and not add more burden than usefulness.

Hopefully this implementation somehow achieves that

Breaking Changes:
 - This changes the return type of comparisons from a T to
   a BooleanBlock. Requiring existing code to explicitely convert
   using `.into_radix`.
 - This makes the cmux/if_then_else functions take a BooleanValue
   as the input type  Requiring existing code to wrap their condition
   ciphertext in a new BooleanValue
2023-11-08 19:40:21 +01:00
Arthur Meyre
1f825dde08 chore(tfhe): bump version to 0.5.0 2023-11-08 15:55:22 +01:00
tmontaigu
f9222de47c feat(integer): add signed_overflowing_sub 2023-11-08 15:11:05 +01:00
Mayeul@Zama
5732e8dd7a test(hlapi): test base and compressed integer conformance 2023-11-08 09:25:55 +01:00
Mayeul@Zama
9db35c5474 chore(clippy): remove useless #[allow(warning)] 2023-11-07 16:47:04 +01:00
Mayeul@Zama
b69f73e8e6 chore(clippy): fix use_self warnings 2023-11-07 16:47:04 +01:00
Mayeul@Zama
90bdf75147 chore(clippy): enable nursery lints 2023-11-07 16:47:04 +01:00
Mayeul@Zama
233ea17adf chore(clippy): enable pedantic lints 2023-11-07 16:47:04 +01:00
David Testé
df6ee79841 chore(ci): test examples and apps in the ci 2023-11-07 10:58:03 +01:00
Mayeul@Zama
6497fb9a15 feat(shortint): update noise level in operations 2023-11-06 11:33:24 +01:00
Mayeul@Zama
d8894e3b69 feat(shortint): add noise level to ciphertexts 2023-11-06 11:33:24 +01:00
dependabot[bot]
42636bab13 chore(deps): bump tj-actions/changed-files from 40.0.0 to 40.1.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 40.0.0 to 40.1.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](af292f1e84...18c8a4eceb)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-11-06 09:42:04 +01:00
tmontaigu
ec27d3dc6f refactor(hlapi): remove wrapping of booleans
This commit removes the wrapping of the `tfhe::boolean`
that was done in the HLAPI, effectively making the HLAPI
only wrapping `tfhe::integer`.

FheBool is now reused to be a single shortint block
compatible with other type FheUint8,16,etc (previously they were not).

In the future, `tfhe::boolean` could be re-wrapped in hlapi, but
this time, to be used as a base for all integers and not just
FheBool.

BREAKING CHANGE:
- hlapi no longer wraps tfhe::boolean API.
- tfhe::ConfigBuilder::enable_bool/disable_bool/all_disabled/all_enabled
  removed. Now default configuration should be done using
  `tfhe::ConfigBuilder::default()`.
- `tfhe::ConfigBuilder::use_default_small_integer` removed
  use `tfhe::CondifBuilder::default_with_small_encryption()`
- Uninitialied{ClientKey, PublicKey, CompressedPublicKey} error types
  removed as these erros are no longer possible
2023-11-04 00:18:16 +01:00
Mayeul@Zama
5272c95de4 fix(shortint): fix modulus on LUT output in test 2023-11-03 09:45:22 +01:00
Mayeul@Zama
27d7ace3ef feat(shortint): fix keyswitching wrapping behavior 2023-11-03 09:45:22 +01:00
Mayeul@Zama
d80ab231a8 fix(shortint): add LUT generation without carry 2023-11-03 09:45:22 +01:00
tmontaigu
fe3fa531f9 refactor(hlapi): Remove shortint support from HLAPI
This removes the wrapping of shortints from the HLAPI,
the reasons are:

Contrary to integers for which we have different bit size
by combining different number of blocks from the _same_ key.
shortints had different bit size, but also different keys
which lead to:

- Not being able to cast between 2 different shortint type
  and between 1 shortint and 1 integer. Technically these casts
  are possible, but requires a keyswitch (and likely a PBS).
  But the keyswitch requires parameters, which may not always exists.

- Due to each shortint having different keys, the internal code to
  manage that made heavy use of macros to avoid having thousands of
  repeated lines. However, this made the code harder to follow / modify
  especially for people that were not familiar with that.

- In practive to really benefit from shortints, proper management of
  carry space is needed, however the HLAPI completely hides that,
  resulting in less optimal performances. In short, shortints
  are better used as a low level construct.

- Building a FheUint4 with two block of message_2_carry_2
  is likely to be faster the one message_4_carry_4 for most use
  cases.

So removing the wrapping of shortints will simplify the code, and
allow for more simplification later.
Also, it will allow us to expose Fhe{Ui/I}nt{2, 4, 6} types
which are compatible (cast_from/into) with Fhe{Ui/I}nt{8, 16, 32, etc}.

BREAKING CHANGE:
    - FheUint{2,3,4} removed from HLAPI
    - All HLAPI functions thied to shortints are removed
2023-10-31 09:32:05 +01:00
tmontaigu
5c1573c266 fix(integer): fix worst case noise growth in encrypted shifts
In encrypted shifts we pack 3 bits from 3 different blocks into the same
blocks by doing `b0 * 4 + 2 * b1 + b2`, and then do a PBS to simulate a
hardware mux gate.

If the inputs of shift (ie, in lhs << rhs, lhs != rhs, ie we don't do
lhs << lhs) this is fine regarding the norm2 noise.

However if we do things like `a << a` or `a >> a`, which is probably a
very rare thing but not impossible, the norm2 noise would go above the
limit that guarantees our error probability.

To fix that, we extract the bits that tells shift amount, so that they
are already properly aligned to their mux input position.
The packing becomes `b0 + 2 * b1 + b2` and so,
the noise growth is ok even in the worst case of doind `a << a`.
2023-10-30 15:02:02 +01:00
dependabot[bot]
7772e8112d chore(deps): bump tj-actions/changed-files from 39.2.3 to 40.0.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 39.2.3 to 40.0.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](95690f9ece...af292f1e84)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-30 12:07:44 +01:00
dependabot[bot]
5e92cb1475 chore(deps): bump JS-DevTools/npm-publish from 3.0.0 to 3.0.1
Bumps [JS-DevTools/npm-publish](https://github.com/js-devtools/npm-publish) from 3.0.0 to 3.0.1.
- [Release notes](https://github.com/js-devtools/npm-publish/releases)
- [Changelog](https://github.com/JS-DevTools/npm-publish/blob/main/CHANGELOG.md)
- [Commits](6fd3bc8dad...4b07b26a2f)

---
updated-dependencies:
- dependency-name: JS-DevTools/npm-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-30 12:07:24 +01:00
dependabot[bot]
f51e19b071 chore(deps): bump actions/checkout from 4.1.0 to 4.1.1
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.1.0 to 4.1.1.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4.1.0...b4ffde65f46336ab88eb53be808477a3936bae11)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-30 12:06:53 +01:00
tmontaigu
aeb00ae584 chore(integer): use Arc<ServerKey> for executor
The goal is to avoid holding the key twice in memory
when both the executor and the test case needs the key
2023-10-27 18:01:55 +02:00
Arthur Meyre
ce5e9c1bdb chore(integer): more CRT tests and related fixes
- add remaining tests
- fix unchecked scalar mul for small carries
2023-10-27 11:30:00 +02:00
Arthur Meyre
4d4e124e94 chore(integer): add crt 32 bits tests with 5_1 params
- remove buggy unchecked_scalar_add_assign and replace by the proper
implementation which had a different name

BREAKING CHANGE:
removed an API entry point which was not required
2023-10-27 11:30:00 +02:00
Arthur Meyre
ca6d37e06f feat(integer): better handle trivial 0 blocks from LHS
- currently the filter only applied to the RHS but LHS can also benefit
from the filter
2023-10-27 10:31:24 +02:00
Mayeul@Zama
e3143315f3 fix(integer): disable broken assert in smart_crt_sub_assign 2023-10-27 09:43:51 +02:00
Mayeul@Zama
f8636fe814 feat(integer): add asserts in smart ops 2023-10-27 09:43:51 +02:00
tmontaigu
7e72400321 chore(doc): replace some ^ which could be interpreted as xor not pow 2023-10-26 23:42:58 +02:00
tmontaigu
728b409256 chore(integer): move comparator test out of it
Move the comparisons test (eq, ne, ge, gt, etc)
that were in the comparator module out of the comparator module.

This is so that in later commits will create test cases out
of these tests so they can, like other unsigned tests be
used to test other implementations of ServerKey
2023-10-25 10:31:55 +02:00
Arthur Meyre
d91404e567 chore(integer): remove empty where clause 2023-10-25 09:41:37 +02:00
David Testé
e11c3d7b7c chore(ci): add signed integer benchmarks to the CI 2023-10-25 09:14:00 +02:00
David Testé
6f8eeb043c chore(bench): add default ops for singed integers benchmarks 2023-10-25 09:14:00 +02:00
Arthur Meyre
00d55182b4 chore(ci): update examples to have a tmp dir to avoid rights issues in /tmp
- on machines where multiple users can log in, some files used for
serialization doctests would cause rights access issues and crash doctests
2023-10-23 15:03:18 +02:00
dependabot[bot]
6f6ce106c3 chore(deps): bump tj-actions/changed-files from 39.2.2 to 39.2.3
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 39.2.2 to 39.2.3.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](408093d9ff...95690f9ece)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-23 10:28:00 +02:00
dependabot[bot]
68fcbb5280 chore(deps): bump JS-DevTools/npm-publish from 2.2.2 to 3.0.0
Bumps [JS-DevTools/npm-publish](https://github.com/js-devtools/npm-publish) from 2.2.2 to 3.0.0.
- [Release notes](https://github.com/js-devtools/npm-publish/releases)
- [Changelog](https://github.com/JS-DevTools/npm-publish/blob/main/CHANGELOG.md)
- [Commits](fe72237be0...6fd3bc8dad)

---
updated-dependencies:
- dependency-name: JS-DevTools/npm-publish
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-23 10:27:35 +02:00
dependabot[bot]
3f46389cc8 chore(deps): bump actions/checkout from 4.1.0 to 4.1.1
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.1.0 to 4.1.1.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](8ade135a41...b4ffde65f4)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-10-23 10:27:10 +02:00
Arthur Meyre
9e8dd01cb9 chore(ci): enable integer multi bit tests on M1 2023-10-20 17:55:38 +02:00
tmontaigu
0085ceb97b chore(ci): set node version 2023-10-20 10:24:50 +02:00
tmontaigu
be9a4d2d9c chore(wasm): update dependencies of wasm tests 2023-10-20 10:24:50 +02:00
Arthur Meyre
87421e8307 chore(ci): update M1 workflow to not explode the 6h GitHub limit
- run doc tests for CI with LTO off following M1 investigation
- LTO fat may be a cause of the wopbs flaky tests, disabling to check
2023-10-19 14:18:05 +02:00
Arthur Meyre
0c3919628f refactor(core): use avx512 intrinsics when available for data conversions
- we use inline assembly for now as rust does not propose those in the std
or core arch crates at the moment
- add tests for avx512 conversion
2023-10-19 13:21:19 +02:00
Arthur Meyre
f1c21888a7 chore(doc): encourage users to use dedicated keys to Radix or CRT 2023-10-19 09:52:22 +02:00
tmontaigu
2624beb7fa fix(integer): fix unsigned_overflowing_sub on trivials
unsigned_overflowing_sub does an independant subtraction
on each blocks with a correcting term being added to avoid
trashing the padding bit (lhs - rhs + correction).

The correction depended on rhs's degree.
e.g. if rhs's degree was in range 1..(msg_mod-1) -> correction =
     msg_mod

However if rhs's degree was zero (so rhs is a trivial 0), the correction
was also 0, however the borrow propagation rely on that correction to
always be added.
2023-10-18 19:26:01 +02:00
tmontaigu
e44c38a102 chore(ci): tell nvm to use node version 20 in wasm parallel tests 2023-10-18 19:04:27 +02:00
Arthur Meyre
4535230874 refactor(core): rename pbs_modulus_switch to fast_pbs_modulus_switch
- update docstring to reflect the change that has been done

BREAKING CHANGE:
pbs_modulus_switch is currently part of the public API and the rename is
therefore a breaking change
2023-10-17 16:53:19 +02:00
Arthur Meyre
a7b2d9b228 chore(ci): update check toolchain to latest nightly
- no new lints
2023-10-17 16:13:26 +02:00
Arthur Meyre
ab923a3ebc fix(crt): fix mul for non symmetrical parameters
- add non reg test for 32 bits mul with 5_1 parameters
2023-10-17 14:22:00 +02:00
Arthur Meyre
a0e85fb355 feat(core): add more custom moduli primitives to UnsignedInteger
As always for now the objective is to have functional custom modulus
implementations, not efficient ones

- add multiplication
- add leading_zeros
- add neg
2023-10-17 13:31:35 +02:00
Arthur Meyre
ecee305340 chore(core): change prelude algorithms imports 2023-10-17 13:31:35 +02:00
Mayeul@Zama
f08ea8cf85 fix(integer): fix max_degree formula 2023-10-17 11:35:08 +02:00
Mayeul@Zama
096e320b97 fix(crt): use 3_3 parameters for crt tests 2023-10-17 11:35:08 +02:00
Mayeul@Zama
95aac64c1c style(crt): compute modulus from base in tests 2023-10-17 11:35:08 +02:00
Mayeul@Zama
76aaa56691 fix(integer): fix small mul test 2023-10-17 11:35:08 +02:00
Mayeul@Zama
a40489bdd2 style(shortint): do not use assign ops on a cloned input 2023-10-17 11:35:08 +02:00
Mayeul@Zama
4bf617eb10 feat(shortint): cleanup input if necessary in ops 2023-10-17 11:35:08 +02:00
Mayeul@Zama
070073d229 feat(shortint): cleanup input if necessary in apply_lookup_table_bivariate 2023-10-17 11:35:08 +02:00
Arthur Meyre
6c1ca8e32b chore(core): use modular_distance instead of abs_diff in fft tests
- we are doing backwards conversions to the torus, so values could wrap
around near 0 or u64::MAX, take the modular distance which represents the
distance on the torus
2023-10-17 10:29:24 +02:00
Arthur Meyre
6523610ca4 refactor(core): refactor conversion code from f64 to i64
- observed that the subnormal case is already handled by the shift logic so
the special handling was not required
- add test for avx512 conversion
2023-10-17 10:29:24 +02:00
Arthur Meyre
41c20e22f5 chore(ci): enable AVX512 for integer and multi bit integer tests 2023-10-17 10:28:14 +02:00
765 changed files with 74604 additions and 34690 deletions

View File

@@ -5,6 +5,7 @@ env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -51,7 +52,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -61,10 +62,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Run concrete-csprng tests
run: |
@@ -114,10 +114,6 @@ jobs:
run: |
make test_safe_deserialization
- name: Run forward compatibility tests
run: |
make test_forward_compatibility
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true

112
.github/workflows/aws_tfhe_gpu_tests.yml vendored Normal file
View File

@@ -0,0 +1,112 @@
# Compile and test Concrete-cuda on an AWS instance
name: Concrete Cuda - Full tests
env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
workflow_dispatch:
# All the inputs are provided by Slab
inputs:
instance_id:
description: "AWS instance ID"
type: string
instance_image_id:
description: "AWS instance AMI ID"
type: string
instance_type:
description: "AWS instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: 'Slab request ID'
type: string
fork_repo:
description: 'Name of forked repo as user/repo'
type: string
fork_git_sha:
description: 'Git SHA to checkout from fork'
type: string
jobs:
run-cuda-tests-linux:
concurrency:
group: tfhe_cuda_backend_test-${{ github.ref }}
cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
name: Test code in EC2
runs-on: ${{ inputs.runner_name }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
matrix:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
steps:
# Step used for log purpose.
- name: Instance configuration used
run: |
echo "ID: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
echo "Fork repo: ${{ inputs.fork_repo }}"
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc" >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Run clippy checks
run: |
make clippy_gpu
- name: Run all tests
run: |
make test_gpu
- name: Run user docs tests
run: |
make test_user_doc_gpu
- name: Test C API
run: |
make test_c_api_gpu

View File

@@ -1,9 +1,10 @@
name: AWS Integer Tests on CPU
name: AWS Unsigned Integer Tests on CPU
env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -23,13 +24,13 @@ on:
description: "Action runner name"
type: string
request_id:
description: 'Slab request ID'
description: "Slab request ID"
type: string
fork_repo:
description: 'Name of forked repo as user/repo'
description: "Name of forked repo as user/repo"
type: string
fork_git_sha:
description: 'Git SHA to checkout from fork'
description: "Git SHA to checkout from fork"
type: string
jobs:
@@ -50,7 +51,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -60,18 +61,25 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Gen Keys if required
run: |
make GEN_KEY_CACHE_MULTI_BIT_ONLY=TRUE gen_key_cache
- name: Run unsigned integer multi-bit tests
run: |
AVX512_SUPPORT=ON make test_unsigned_integer_multi_bit_ci
- name: Gen Keys if required
run: |
make gen_key_cache
- name: Run integer tests
- name: Run unsigned integer tests
run: |
BIG_TESTS_INSTANCE=TRUE make test_integer_ci
AVX512_SUPPORT=ON BIG_TESTS_INSTANCE=TRUE make test_unsigned_integer_ci
- name: Slack Notification
if: ${{ always() }}

View File

@@ -1,9 +1,10 @@
name: AWS Multi Bit Tests on CPU
name: AWS Signed Integer Tests on CPU
env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -23,13 +24,13 @@ on:
description: "Action runner name"
type: string
request_id:
description: 'Slab request ID'
description: "Slab request ID"
type: string
fork_repo:
description: 'Name of forked repo as user/repo'
description: "Name of forked repo as user/repo"
type: string
fork_git_sha:
description: 'Git SHA to checkout from fork'
description: "Git SHA to checkout from fork"
type: string
jobs:
@@ -50,7 +51,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -60,10 +61,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Gen Keys if required
run: |
@@ -73,9 +73,17 @@ jobs:
run: |
make test_shortint_multi_bit_ci
- name: Run integer multi-bit tests
- name: Run signed integer multi-bit tests
run: |
make test_integer_multi_bit_ci
AVX512_SUPPORT=ON make test_signed_integer_multi_bit_ci
- name: Gen Keys if required
run: |
make gen_key_cache
- name: Run signed integer tests
run: |
AVX512_SUPPORT=ON BIG_TESTS_INSTANCE=TRUE make test_signed_integer_ci
- name: Slack Notification
if: ${{ always() }}

View File

@@ -4,6 +4,7 @@ env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -50,7 +51,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -60,10 +61,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Run concrete-csprng tests
run: |
@@ -81,10 +81,6 @@ jobs:
run: |
make test_c_api
- name: Run C API tests with forward_compatibility
run: |
FORWARD_COMPAT=ON make test_c_api
- name: Run user docs tests
run: |
make test_user_doc
@@ -104,6 +100,12 @@ jobs:
- name: Run example tests
run: |
make test_examples
make dark_market
- name: Run apps tests
run: |
make test_trivium
make test_kreyvium
- name: Slack Notification
if: ${{ always() }}

View File

@@ -4,6 +4,7 @@ env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -50,7 +51,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -60,10 +61,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Run js on wasm API tests
run: |

View File

@@ -32,6 +32,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-boolean-benchmarks:
@@ -51,7 +52,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -61,10 +62,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Run benchmarks with AVX512
run: |
@@ -96,13 +96,13 @@ jobs:
--append-results
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_boolean
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -6,6 +6,7 @@ on:
env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref }}
@@ -17,11 +18,11 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
os: [ubuntu-latest, macos-latest-large, windows-latest]
fail-fast: false
steps:
- uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
- name: Install and run newline linter checks
if: matrix.os == 'ubuntu-latest'

View File

@@ -4,6 +4,7 @@ env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -38,6 +39,7 @@ jobs:
group: ${{ github.workflow }}_${{ github.ref }}_${{ inputs.instance_image_id }}_${{ inputs.instance_type }}
cancel-in-progress: true
runs-on: ${{ inputs.runner_name }}
timeout-minutes: 1080
steps:
# Step used for log purpose.
- name: Instance configuration used
@@ -50,7 +52,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -60,14 +62,13 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Check for file changes
id: changed-files
uses: tj-actions/changed-files@408093d9ff9c134c33b974e0722ce06b9d6e8263
uses: tj-actions/changed-files@90a06d6ba9543371ab4df8eeca0be07ca6054959
with:
files_yaml: |
tfhe:
@@ -79,6 +80,12 @@ jobs:
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
run: |
make GEN_KEY_CACHE_COVERAGE_ONLY=TRUE gen_key_cache
make gen_key_cache_core_crypto
- name: Run coverage for core_crypto
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
run: |
make test_core_crypto_cov AVX512_SUPPORT=ON
- name: Run coverage for boolean
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
@@ -91,13 +98,13 @@ jobs:
make test_shortint_cov
- name: Upload tfhe coverage to Codecov
uses: codecov/codecov-action@eaaf4bedf32dbdc6b720b63067d99c4d77d6047d
uses: codecov/codecov-action@4fe8c5f003fae66aa5ebb77cfd3e7bfbbda0b6b0
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
with:
token: ${{ secrets.CODECOV_TOKEN }}
directory: ./coverage/
fail_ci_if_error: true
files: shortint/cobertura.xml,boolean/cobertura.xml
files: shortint/cobertura.xml,boolean/cobertura.xml,core_crypto/cobertura.xml,core_crypto_avx512/cobertura.xml
- name: Slack Notification
if: ${{ failure() }}

View File

@@ -4,6 +4,7 @@ env:
CARGO_TERM_COLOR: always
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
@@ -42,7 +43,7 @@ jobs:
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -52,10 +53,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Dieharder randomness test suite
run: |

View File

@@ -25,6 +25,7 @@ env:
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
@@ -44,7 +45,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -54,10 +55,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Run benchmarks with AVX512
run: |
@@ -69,7 +69,7 @@ jobs:
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
@@ -90,13 +90,13 @@ jobs:
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -28,6 +28,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
prepare-matrix:
@@ -39,15 +40,12 @@ jobs:
- name: Weekly benchmarks
if: ${{ github.event.inputs.user_inputs == 'weekly_benchmarks' }}
run: |
echo "OP_FLAVOR=[\"default\", \"default_comp\", \"default_scalar\", \"default_scalar_comp\"]" >> ${GITHUB_ENV}
echo "OP_FLAVOR=[\"default\"]" >> ${GITHUB_ENV}
- name: Quarterly benchmarks
if: ${{ github.event.inputs.user_inputs == 'quarterly_benchmarks' }}
run: |
echo "OP_FLAVOR=[\"default\", \"default_comp\", \"default_scalar\", \"default_scalar_comp\", \
\"smart\", \"smart_comp\", \"smart_scalar\", \"smart_parallelized\", \"smart_parallelized_comp\", \"smart_scalar_parallelized\", \"smart_scalar_parallelized_comp\", \
\"unchecked\", \"unchecked_comp\", \"unchecked_scalar\", \"unchecked_scalar_comp\", \
\"misc\"]" >> ${GITHUB_ENV}
echo "OP_FLAVOR=[\"default\", \"smart\", \"unchecked\", \"misc\"]" >> ${GITHUB_ENV}
- name: Set operation flavor output
id: set_op_flavor
@@ -60,6 +58,7 @@ jobs:
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
continue-on-error: true
timeout-minutes: 1440 # 24 hours
strategy:
max-parallel: 1
matrix:
@@ -74,7 +73,7 @@ jobs:
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -90,13 +89,12 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
@@ -120,7 +118,7 @@ jobs:
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}
path: ${{ env.RESULTS_FILENAME }}

View File

@@ -0,0 +1,157 @@
# Run integer benchmarks on an AWS instance with CUDA and return parsed results to Slab CI bot.
name: Integer GPU benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
name: Execute integer benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
matrix:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc" >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON FAST_BENCH=TRUE BENCH_OP_FLAVOR=default bench_integer_gpu
- name: Parse benchmarks to csv
run: |
make PARSE_INTEGER_BENCH_CSV_FILE=${{ env.PARSE_INTEGER_BENCH_CSV_FILE }} \
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
- name: Parse results
run: |
COMMIT_DATE="$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})"
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--backend gpu \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
--commit-date "${COMMIT_DATE}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -0,0 +1,162 @@
# Run all integer benchmarks on an AWS instance with CUDA and return parsed results to Slab CI bot.
name: Integer GPU full benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
# This input is not used in this workflow but still mandatory since a calling workflow could
# use it. If a triggering command include a user_inputs field, then the triggered workflow
# must include this very input, otherwise the workflow won't be called.
# See start_full_benchmarks.yml as example.
user_inputs:
description: "Type of benchmarks to run"
type: string
default: "weekly_benchmarks"
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
integer-benchmarks:
name: Execute integer benchmarks for all operations flavor
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
continue-on-error: true
strategy:
fail-fast: false
max-parallel: 1
matrix:
command: [ integer, integer_multi_bit]
op_flavor: [ default, unchecked ]
# explicit include-based build matrix, of known valid options
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Get benchmark details
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
echo "COMMIT_DATE=$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})" >> "${GITHUB_ENV}"
echo "COMMIT_HASH=$(git describe --tags --dirty)" >> "${GITHUB_ENV}"
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc" >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON BENCH_OP_FLAVOR=${{ matrix.op_flavor }} bench_${{ matrix.command }}_gpu
- name: Parse results
run: |
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--backend gpu \
--project-version "${{ env.COMMIT_HASH }}" \
--branch ${{ github.ref_name }} \
--commit-date "${{ env.COMMIT_DATE }}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}
path: ${{ env.RESULTS_FILENAME }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
slack-notification:
name: Slack Notification
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ failure() }}
needs: integer-benchmarks
steps:
- name: Notify
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -25,6 +25,7 @@ env:
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
@@ -44,7 +45,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -54,10 +55,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Run multi-bit benchmarks with AVX512
run: |
@@ -69,7 +69,7 @@ jobs:
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
@@ -90,13 +90,13 @@ jobs:
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -0,0 +1,158 @@
# Run integer benchmarks with multi-bit cryptographic parameters on an AWS instance and return parsed results to Slab CI bot.
name: Integer Multi-bit benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
name: Execute integer multi-bit benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
matrix:
include:
- os: ubuntu-22.04
cuda: "11.8"
cuda_arch: "70"
gcc: 9
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc" >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Run multi-bit benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON FAST_BENCH=TRUE BENCH_OP_FLAVOR=default bench_integer_multi_bit_gpu
- name: Parse benchmarks to csv
run: |
make PARSE_INTEGER_BENCH_CSV_FILE=${{ env.PARSE_INTEGER_BENCH_CSV_FILE }} \
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
- name: Parse results
run: |
COMMIT_DATE="$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})"
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--backend gpu \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
--commit-date "${COMMIT_DATE}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -14,8 +14,8 @@ on:
env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C target-cpu=native"
RUST_BACKTRACE: "full"
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
CARGO_PROFILE: release_lto_off
FAST_TESTS: "TRUE"
concurrency:
@@ -28,13 +28,12 @@ jobs:
runs-on: ["self-hosted", "m1mac"]
steps:
- uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
- name: Install latest stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: stable
default: true
- name: Run pcc checks
run: |
@@ -111,10 +110,9 @@ jobs:
run: |
make test_shortint_multi_bit_ci
# # These multi bit integer tests are too slow on M1 with low core count and low RAM
# - name: Run integer multi bit tests
# run: |
# make test_integer_multi_bit_ci
- name: Run integer multi bit tests
run: |
make test_integer_multi_bit_ci
remove_label:
name: Remove m1_test label

View File

@@ -30,7 +30,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -49,7 +49,7 @@ jobs:
- name: Publish web package
if: ${{ inputs.push_web_package }}
uses: JS-DevTools/npm-publish@fe72237be0920f7a0cafd6a966c9b929c9466e9b
uses: JS-DevTools/npm-publish@4b07b26a2f6e0a51846e1870223e545bae91c552
with:
token: ${{ secrets.NPM_TOKEN }}
package: tfhe/pkg/package.json
@@ -65,7 +65,7 @@ jobs:
- name: Publish Node package
if: ${{ inputs.push_node_package }}
uses: JS-DevTools/npm-publish@fe72237be0920f7a0cafd6a966c9b929c9466e9b
uses: JS-DevTools/npm-publish@4b07b26a2f6e0a51846e1870223e545bae91c552
with:
token: ${{ secrets.NPM_TOKEN }}
package: tfhe/pkg/package.json

View File

@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0

View File

@@ -17,10 +17,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
- name: Checkout lattice-estimator
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: malb/lattice-estimator
path: lattice_estimator

View File

@@ -32,6 +32,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-pbs-benchmarks:
@@ -51,7 +52,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -61,22 +62,13 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: bench_ac2023_CJP
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON bench_ac2023_CJP
- name: bench_ac2023_stairKS
run: |
make AVX512_SUPPORT=ON bench_ac2023_stairKS
- name: bench_ac2023_fastKS
run: |
make AVX512_SUPPORT=ON bench_ac2023_fastKS
make AVX512_SUPPORT=ON bench_pbs
- name: Parse results
run: |
@@ -94,13 +86,13 @@ jobs:
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_pbs
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

142
.github/workflows/pbs_gpu_benchmark.yml vendored Normal file
View File

@@ -0,0 +1,142 @@
# Run PBS benchmarks on an AWS instance with CUDA and return parsed results to Slab CI bot.
name: PBS GPU benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
# This input is not used in this workflow but still mandatory since a calling workflow could
# use it. If a triggering command include a user_inputs field, then the triggered workflow
# must include this very input, otherwise the workflow won't be called.
# See start_full_benchmarks.yml as example.
user_inputs:
description: "Type of benchmarks to run"
type: string
default: "weekly_benchmarks"
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
jobs:
run-pbs-benchmarks:
name: Execute PBS benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "CUDA_PATH=$CUDA_PATH" >> "${GITHUB_ENV}"
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH" >> "${GITHUB_ENV}"
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc" >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}" >> "${GITHUB_ENV}"
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON bench_pbs_gpu
- name: Parse results
run: |
COMMIT_DATE="$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})"
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--backend gpu \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
--commit-date "${COMMIT_DATE}" \
--bench-date "${{ env.BENCH_DATE }}" \
--name-suffix avx512 \
--walk-subdirs \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_pbs
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on downloaded artifact"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "PBS GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -24,6 +24,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-shortint-benchmarks:
@@ -43,7 +44,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -53,10 +54,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Run benchmarks with AVX512
run: |
@@ -88,13 +88,13 @@ jobs:
--append-results
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_shortint
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -32,6 +32,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
shortint-benchmarks:
@@ -51,7 +52,7 @@ jobs:
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -67,13 +68,12 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
@@ -112,7 +112,7 @@ jobs:
--append-results
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_shortint_${{ matrix.op_flavor }}
path: ${{ env.RESULTS_FILENAME }}

View File

@@ -0,0 +1,129 @@
# Run signed integer benchmarks on an AWS instance and return parsed results to Slab CI bot.
name: Signed Integer benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
name: Execute signed integer benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON FAST_BENCH=TRUE bench_signed_integer
- name: Parse benchmarks to csv
run: |
make PARSE_INTEGER_BENCH_CSV_FILE=${{ env.PARSE_INTEGER_BENCH_CSV_FILE }} \
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
- name: Parse results
run: |
COMMIT_DATE="$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})"
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
--commit-date "${COMMIT_DATE}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -0,0 +1,133 @@
# Run all signed integer benchmarks on an AWS instance and return parsed results to Slab CI bot.
name: Signed Integer full benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
user_inputs:
description: "Type of benchmarks to run"
type: string
default: "weekly_benchmarks"
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
integer-benchmarks:
name: Execute signed integer benchmarks for all operations flavor
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
continue-on-error: true
timeout-minutes: 1440 # 24 hours
strategy:
max-parallel: 1
matrix:
command: [ integer, integer_multi_bit ]
op_flavor: [ default, unchecked ]
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Get benchmark details
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
echo "COMMIT_DATE=$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})" >> "${GITHUB_ENV}"
echo "COMMIT_HASH=$(git describe --tags --dirty)" >> "${GITHUB_ENV}"
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON BENCH_OP_FLAVOR=${{ matrix.op_flavor }} bench_signed_${{ matrix.command }}
- name: Parse results
run: |
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--project-version "${{ env.COMMIT_HASH }}" \
--branch ${{ github.ref_name }} \
--commit-date "${{ env.COMMIT_DATE }}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_${{ matrix.command }}_${{ matrix.op_flavor }}
path: ${{ env.RESULTS_FILENAME }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
slack-notification:
name: Slack Notification
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ failure() }}
needs: integer-benchmarks
steps:
- name: Notify
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -0,0 +1,129 @@
# Run signed integer benchmarks with multi-bit cryptographic parameters on an AWS instance and return parsed results to Slab CI bot.
name: Signed Integer Multi-bit benchmarks
on:
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
PARSE_INTEGER_BENCH_CSV_FILE: tfhe_rs_integer_benches_${{ github.sha }}.csv
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-integer-benchmarks:
name: Execute signed integer multi-bit benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
steps:
- name: Instance configuration used
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Set up home
# "Install rust" step require root user to have a HOME directory which is not set.
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
- name: Run multi-bit benchmarks with AVX512
run: |
make AVX512_SUPPORT=ON FAST_BENCH=TRUE bench_signed_integer_multi_bit
- name: Parse benchmarks to csv
run: |
make PARSE_INTEGER_BENCH_CSV_FILE=${{ env.PARSE_INTEGER_BENCH_CSV_FILE }} \
parse_integer_benches
- name: Upload csv results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_csv_integer
path: ${{ env.PARSE_INTEGER_BENCH_CSV_FILE }}
- name: Parse results
run: |
COMMIT_DATE="$(git --no-pager show -s --format=%cd --date=iso8601-strict ${{ github.sha }})"
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
--commit-date "${COMMIT_DATE}" \
--bench-date "${{ env.BENCH_DATE }}" \
--walk-subdirs \
--name-suffix avx512 \
--throughput
- name: Upload parsed results artifact
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_integer
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
run: |
echo "Computing HMac on results file"
SIGNATURE="$(slab/scripts/hmac_calculator.sh ${{ env.RESULTS_FILENAME }} '${{ secrets.JOB_SECRET }}')"
echo "Sending results to Slab..."
curl -v -k \
-H "Content-Type: application/json" \
-H "X-Slab-Repository: ${{ github.repository }}" \
-H "X-Slab-Command: store_data_v2" \
-H "X-Hub-Signature-256: sha256=${SIGNATURE}" \
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -20,14 +20,26 @@ on:
description: "Run integer benches"
type: boolean
default: true
signed_integer_bench:
description: "Run signed integer benches"
type: boolean
default: true
integer_multi_bit_bench:
description: "Run integer multi bit benches"
type: boolean
default: true
signed_integer_multi_bit_bench:
description: "Run signed integer multi bit benches"
type: boolean
default: true
pbs_bench:
description: "Run PBS benches"
type: boolean
default: true
pbs_gpu_bench:
description: "Run PBS benches on GPU"
type: boolean
default: true
wasm_client_bench:
description: "Run WASM client benches"
type: boolean
@@ -38,17 +50,21 @@ jobs:
if: ${{ (github.event_name == 'push' && github.repository == 'zama-ai/tfhe-rs') || github.event_name == 'workflow_dispatch' }}
strategy:
matrix:
command: [boolean_bench, shortint_bench, integer_bench, integer_multi_bit_bench, pbs_bench, wasm_client_bench]
command: [ boolean_bench, shortint_bench,
integer_bench, integer_multi_bit_bench,
signed_integer_bench, signed_integer_multi_bit_bench,
integer_gpu_bench, integer_multi_bit_gpu_bench,
pbs_bench, pbs_gpu_bench, wasm_client_bench ]
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Check for file changes
id: changed-files
uses: tj-actions/changed-files@408093d9ff9c134c33b974e0722ce06b9d6e8263
uses: tj-actions/changed-files@90a06d6ba9543371ab4df8eeca0be07ca6054959
with:
files_yaml: |
common_benches:
@@ -69,13 +85,23 @@ jobs:
integer_bench:
- tfhe/src/shortint/**
- tfhe/src/integer/**
- tfhe/benches/integer/**
- tfhe/benches/integer/bench.rs
- .github/workflows/integer_benchmark.yml
integer_multi_bit_bench:
- tfhe/src/shortint/**
- tfhe/src/integer/**
- tfhe/benches/integer/**
- .github/workflows/integer_benchmark.yml
- tfhe/benches/integer/bench.rs
- .github/workflows/integer_multi_bit_benchmark.yml
signed_integer_bench:
- tfhe/src/shortint/**
- tfhe/src/integer/**
- tfhe/benches/integer/signed_bench.rs
- .github/workflows/signed_integer_benchmark.yml
signed_integer_multi_bit_bench:
- tfhe/src/shortint/**
- tfhe/src/integer/**
- tfhe/benches/integer/signed_bench.rs
- .github/workflows/signed_integer_multi_bit_benchmark.yml
pbs_bench:
- tfhe/src/core_crypto/**
- tfhe/benches/core_crypto/**
@@ -85,7 +111,7 @@ jobs:
- .github/workflows/wasm_client_benchmark.yml
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -24,16 +24,18 @@ jobs:
if: ${{ (github.event_name == 'schedule' && github.repository == 'zama-ai/tfhe-rs') || github.event_name == 'workflow_dispatch' }}
strategy:
matrix:
command: [ boolean_bench, shortint_full_bench, integer_full_bench, pbs_bench, wasm_client_bench ]
command: [ boolean_bench, shortint_full_bench,
integer_full_bench, signed_integer_full_bench, integer_gpu_full_bench,
pbs_bench, pbs_gpu_bench, wasm_client_bench ]
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

View File

@@ -13,11 +13,11 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
- name: Save repo
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: repo-archive
path: '.'

View File

@@ -29,6 +29,7 @@ jobs:
allow-repeats: true
message: |
@slab-ci cpu_fast_test
@slab-ci gpu_test
- name: Add approved label
uses: actions-ecosystem/action-add-labels@18f1af5e3544586314bbe15c0273249c770b2daf
@@ -48,7 +49,7 @@ jobs:
Pull Request has been approved :tada:
Launching full test suite...
@slab-ci cpu_test
@slab-ci cpu_integer_test
@slab-ci cpu_multi_bit_test
@slab-ci cpu_unsigned_integer_test
@slab-ci cpu_signed_integer_test
@slab-ci cpu_wasm_test
@slab-ci csprng_randomness_testing

View File

@@ -32,6 +32,7 @@ env:
CARGO_TERM_COLOR: always
RESULTS_FILENAME: parsed_benchmark_results_${{ github.sha }}.json
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
RUST_BACKTRACE: "full"
jobs:
run-wasm-client-benchmarks:
@@ -51,7 +52,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
fetch-depth: 0
@@ -61,10 +62,9 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
with:
toolchain: nightly
override: true
- name: Run benchmarks
run: |
@@ -97,13 +97,13 @@ jobs:
--append-results
- name: Upload parsed results artifact
uses: actions/upload-artifact@a8a3f3ad30e3422c9c7b888a15615d19a852ae32
uses: actions/upload-artifact@26f96dfa697d77e81fd5907df203aa23a56210a8
with:
name: ${{ github.sha }}_wasm
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
with:
repository: zama-ai/slab
path: slab

6
.gitignore vendored
View File

@@ -3,9 +3,9 @@ target/
.vscode/
# Path we use for internal-keycache during tests
./keys/
/keys/
# In case of symlinked keys
./keys
/keys
**/Cargo.lock
**/*.bin
@@ -18,4 +18,4 @@ target/
dieharder_run.log
# Coverage reports
./coverage/
/coverage/

View File

@@ -1,6 +1,6 @@
BSD 3-Clause Clear License
Copyright © 2023 ZAMA.
Copyright © 2024 ZAMA.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,

262
Makefile
View File

@@ -18,7 +18,13 @@ FAST_BENCH?=FALSE
BENCH_OP_FLAVOR?=DEFAULT
NODE_VERSION=20
FORWARD_COMPAT?=OFF
TFHE_CURRENT_VERSION:=$(shell grep '^version[[:space:]]*=' tfhe/Cargo.toml | cut -d '=' -f 2 | xargs)
# sed: -n, do not print input stream, -e means a script/expression
# 1,/version/ indicates from the first line, to the line matching version at the start of the line
# p indicates to print, so we keep only the start of the Cargo.toml until we hit the first version
# entry which should be the version of tfhe
TFHE_CURRENT_VERSION:=\
$(shell sed -n -e '1,/^version/p' tfhe/Cargo.toml | \
grep '^version[[:space:]]*=' | cut -d '=' -f 2 | xargs)
# Cargo has a hard time distinguishing between our package from the workspace and a package that
# could be a dependency, so we build an unambiguous spec here
TFHE_SPEC:=tfhe@$(TFHE_CURRENT_VERSION)
@@ -54,6 +60,10 @@ endif
REGEX_STRING?=''
REGEX_PATTERN?=''
# tfhe-cuda-backend
TFHECUDA_SRC="backends/tfhe-cuda-backend/cuda"
TFHECUDA_BUILD=$(TFHECUDA_SRC)/build
# Exclude these files from coverage reports
define COVERAGE_EXCLUDED_FILES
--exclude-files apps/trivium/src/trivium/* \
@@ -132,16 +142,27 @@ install_tarpaulin: install_rs_build_toolchain
.PHONY: check_linelint_installed # Check if linelint newline linter is installed
check_linelint_installed:
@printf "\n" | linelint - > /dev/null 2>&1 || \
( echo "Unable to locate linelint. Try installing it: https://github.com/fernandrone/linelint/releases" && exit 1 )
( echo "Unable to locate linelint. Try installing it: https://github.com/fernandrone/linelint/releases" && exit 1 )
.PHONY: fmt # Format rust code
fmt: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt
.PHONY: fmt_gpu # Format rust and cuda code
fmt_gpu: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt
cd "$(TFHECUDA_SRC)" && ./format_tfhe_cuda_backend.sh
.PHONY: check_fmt # Check rust code format
check_fmt: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt --check
.PHONY: clippy_gpu # Run clippy lints on the gpu backend
clippy_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy \
--features=$(TARGET_ARCH_FEATURE),integer,shortint,gpu \
-p $(TFHE_SPEC) -- --no-deps -D warnings
.PHONY: fix_newline # Fix newline at end of file issues to be UNIX compliant
fix_newline: check_linelint_installed
linelint -a .
@@ -212,15 +233,9 @@ clippy_trivium: install_rs_check_toolchain
-p tfhe-trivium -- --no-deps -D warnings
.PHONY: clippy_all_targets # Run clippy lints on all targets (benches, examples, etc.)
clippy_all_targets: install_rs_check_toolchain
clippy_all_targets:
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy --all-targets \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,safe-deserialization \
-p $(TFHE_SPEC) -- --no-deps -D warnings
.PHONY: clippy_all_targets_forward_compatibility # Run clippy lints on all targets (benches, examples, etc.)
clippy_all_targets_forward_compatibility: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy --all-targets \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,safe-deserialization,forward_compatibility \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache \
-p $(TFHE_SPEC) -- --no-deps -D warnings
.PHONY: clippy_concrete_csprng # Run clippy lints on concrete-csprng
@@ -231,20 +246,12 @@ clippy_concrete_csprng:
.PHONY: clippy_all # Run all clippy targets
clippy_all: clippy clippy_boolean clippy_shortint clippy_integer clippy_all_targets clippy_c_api \
clippy_js_wasm_api clippy_tasks clippy_core clippy_concrete_csprng clippy_trivium \
clippy_all_targets_forward_compatibility
clippy_js_wasm_api clippy_tasks clippy_core clippy_concrete_csprng clippy_trivium
.PHONY: clippy_fast # Run main clippy targets
clippy_fast: clippy clippy_all_targets clippy_c_api clippy_js_wasm_api clippy_tasks clippy_core \
clippy_concrete_csprng
.PHONY: gen_key_cache # Run the script to generate keys and cache them for shortint tests
gen_key_cache: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) run --profile $(CARGO_PROFILE) \
--example generates_test_keys \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,internal-keycache -- \
$(MULTI_BIT_ONLY) $(COVERAGE_ONLY)
.PHONY: build_core # Build core_crypto without experimental features
build_core: install_rs_build_toolchain install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
@@ -292,14 +299,21 @@ symlink_c_libs_without_fingerprint:
.PHONY: build_c_api # Build the C API for boolean, shortint and integer
build_c_api: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,safe-deserialization,$(FORWARD_COMPAT_FEATURE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,$(FORWARD_COMPAT_FEATURE) \
-p $(TFHE_SPEC)
@"$(MAKE)" symlink_c_libs_without_fingerprint
.PHONY: build_c_api_gpu # Build the C API for boolean, shortint and integer
build_c_api_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,gpu \
-p $(TFHE_SPEC)
@"$(MAKE)" symlink_c_libs_without_fingerprint
.PHONY: build_c_api_experimental_deterministic_fft # Build the C API for boolean, shortint and integer with experimental deterministic FFT
build_c_api_experimental_deterministic_fft: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,safe-deserialization,experimental-force_fft_algo_dif4,$(FORWARD_COMPAT_FEATURE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,experimental-force_fft_algo_dif4,$(FORWARD_COMPAT_FEATURE) \
-p $(TFHE_SPEC)
@"$(MAKE)" symlink_c_libs_without_fingerprint
@@ -340,23 +354,37 @@ test_core_crypto: install_rs_build_toolchain install_rs_check_toolchain
--features=$(TARGET_ARCH_FEATURE),experimental,$(AVX512_FEATURE) -p $(TFHE_SPEC) -- core_crypto::; \
fi
.PHONY: test_ccs_2024_stair_ks # Run the tests of the core_crypto module including experimental ones
test_ccs_2024_stair_ks: install_rs_build_toolchain install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental -p $(TFHE_SPEC) -- core_crypto::algorithms::test::lwe_stair_keyswitch
.PHONY: test_core_crypto_cov # Run the tests of the core_crypto module with code coverage
test_core_crypto_cov: install_rs_build_toolchain install_rs_check_toolchain install_tarpaulin
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) tarpaulin --profile $(CARGO_PROFILE) \
--out xml --output-dir coverage/core_crypto --line --engine llvm --timeout 500 \
--implicit-test-threads $(COVERAGE_EXCLUDED_FILES) \
--features=$(TARGET_ARCH_FEATURE),experimental,internal-keycache,__coverage \
-p $(TFHE_SPEC) -- core_crypto::
@if [[ "$(AVX512_SUPPORT)" == "ON" ]]; then \
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental,$(AVX512_FEATURE) -p $(TFHE_SPEC) -- core_crypto::algorithms::test::lwe_stair_keyswitch; \
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) tarpaulin --profile $(CARGO_PROFILE) \
--out xml --output-dir coverage/core_crypto_avx512 --line --engine llvm --timeout 500 \
--implicit-test-threads $(COVERAGE_EXCLUDED_FILES) \
--features=$(TARGET_ARCH_FEATURE),experimental,internal-keycache,__coverage,$(AVX512_FEATURE) \
-p $(TFHE_SPEC) -- core_crypto::; \
fi
.PHONY: test_ccs_2024_fft_shrinking_ks # Run the tests of the core_crypto module including experimental ones
test_ccs_2024_fft_shrinking_ks: install_rs_build_toolchain install_rs_check_toolchain
.PHONY: test_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
test_gpu: test_core_crypto_gpu test_integer_gpu
.PHONY: test_core_crypto_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
test_core_crypto_gpu: install_rs_build_toolchain install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental -p $(TFHE_SPEC) -- core_crypto::algorithms::test::lwe_fast_keyswitch
@if [[ "$(AVX512_SUPPORT)" == "ON" ]]; then \
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental,$(AVX512_FEATURE) -p $(TFHE_SPEC) -- core_crypto::algorithms::test::lwe_fast_keyswitch; \
fi
--features=$(TARGET_ARCH_FEATURE),integer,gpu -p $(TFHE_SPEC) -- core_crypto::gpu::
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --doc --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),integer,gpu -p $(TFHE_SPEC) -- core_crypto::gpu::
.PHONY: test_integer_gpu # Run the tests of the integer module including experimental on the gpu backend
test_integer_gpu: install_rs_build_toolchain install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),integer,gpu -p $(TFHE_SPEC) -- integer::gpu::server_key::
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --doc --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),integer,gpu -p $(TFHE_SPEC) -- integer::gpu::server_key::
.PHONY: test_boolean # Run the tests of the boolean module
test_boolean: install_rs_build_toolchain
@@ -374,17 +402,21 @@ test_boolean_cov: install_rs_check_toolchain install_tarpaulin
.PHONY: test_c_api_rs # Run the rust tests for the C API
test_c_api_rs: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,safe-deserialization \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api \
-p $(TFHE_SPEC) \
c_api
.PHONY: test_c_api_c # Run the C tests for the C API
test_c_api_c: build_c_api
./scripts/c_api_tests.sh --forward-compat "$(FORWARD_COMPAT)"
./scripts/c_api_tests.sh
.PHONY: test_c_api # Run all the tests for the C API
test_c_api: test_c_api_rs test_c_api_c
.PHONY: test_c_api_gpu # Run the C tests for the C API
test_c_api_gpu: build_c_api_gpu
./scripts/c_api_tests.sh --gpu
.PHONY: test_shortint_ci # Run the tests for shortint ci
test_shortint_ci: install_rs_build_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
@@ -413,23 +445,57 @@ test_shortint_cov: install_rs_check_toolchain install_tarpaulin
-p $(TFHE_SPEC) -- shortint::
.PHONY: test_integer_ci # Run the tests for integer ci
test_integer_ci: install_rs_build_toolchain install_cargo_nextest
test_integer_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_BUILD_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --tfhe-package "$(TFHE_SPEC)"
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --avx512-support "$(AVX512_SUPPORT)" \
--tfhe-package "$(TFHE_SPEC)"
.PHONY: test_unsigned_integer_ci # Run the tests for unsigned integer ci
test_unsigned_integer_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --avx512-support "$(AVX512_SUPPORT)" \
--unsigned-only --tfhe-package "$(TFHE_SPEC)"
.PHONY: test_signed_integer_ci # Run the tests for signed integer ci
test_signed_integer_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --avx512-support "$(AVX512_SUPPORT)" \
--signed-only --tfhe-package "$(TFHE_SPEC)"
.PHONY: test_integer_multi_bit_ci # Run the tests for integer ci running only multibit tests
test_integer_multi_bit_ci: install_rs_build_toolchain install_cargo_nextest
test_integer_multi_bit_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_BUILD_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --multi-bit --tfhe-package "$(TFHE_SPEC)"
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --multi-bit --avx512-support "$(AVX512_SUPPORT)" \
--tfhe-package "$(TFHE_SPEC)"
.PHONY: test_unsigned_integer_multi_bit_ci # Run the tests for nsigned integer ci running only multibit tests
test_unsigned_integer_multi_bit_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --multi-bit --avx512-support "$(AVX512_SUPPORT)" \
--unsigned-only --tfhe-package "$(TFHE_SPEC)"
.PHONY: test_signed_integer_multi_bit_ci # Run the tests for nsigned integer ci running only multibit tests
test_signed_integer_multi_bit_ci: install_rs_check_toolchain install_cargo_nextest
BIG_TESTS_INSTANCE="$(BIG_TESTS_INSTANCE)" \
FAST_TESTS="$(FAST_TESTS)" \
./scripts/integer-tests.sh --rust-toolchain $(CARGO_RS_CHECK_TOOLCHAIN) \
--cargo-profile "$(CARGO_PROFILE)" --multi-bit --avx512-support "$(AVX512_SUPPORT)" \
--signed-only --tfhe-package "$(TFHE_SPEC)"
.PHONY: test_safe_deserialization # Run the tests for safe deserialization
test_safe_deserialization: install_rs_build_toolchain install_cargo_nextest
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,safe-deserialization -p $(TFHE_SPEC) -- safe_deserialization::
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache -p $(TFHE_SPEC) -- safe_deserialization::
.PHONY: test_integer # Run all the tests for integer
test_integer: install_rs_build_toolchain
@@ -442,18 +508,18 @@ test_high_level_api: install_rs_build_toolchain
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache -p $(TFHE_SPEC) \
-- high_level_api::
.PHONY: test_forward_compatibility # Run forward compatibility tests
test_forward_compatibility: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --tests --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,forward_compatibility,internal-keycache -p $(TFHE_SPEC) \
-- forward_compatibility::
.PHONY: test_user_doc # Run tests from the .md documentation
test_user_doc: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) --doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache -p $(TFHE_SPEC) \
-- test_user_docs::
.PHONY: test_user_doc_gpu # Run tests for GPU from the .md documentation
test_user_doc_gpu: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) --doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,gpu -p $(TFHE_SPEC) \
-- test_user_docs::
.PHONY: test_regex_engine # Run tests for regex_engine example
test_regex_engine: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
@@ -488,7 +554,7 @@ test_concrete_csprng:
doc: install_rs_check_toolchain
RUSTDOCFLAGS="--html-in-header katex-header.html" \
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer --no-deps
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer --no-deps -p $(TFHE_SPEC)
.PHONY: docs # Build rust doc alias for doc
docs: doc
@@ -497,7 +563,7 @@ docs: doc
lint_doc: install_rs_check_toolchain
RUSTDOCFLAGS="--html-in-header katex-header.html -Dwarnings" \
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer --no-deps
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer -p $(TFHE_SPEC) --no-deps
.PHONY: lint_docs # Build rust doc with linting enabled alias for lint_doc
lint_docs: lint_doc
@@ -514,14 +580,12 @@ format_doc_latex:
.PHONY: check_compile_tests # Build tests in debug without running them
check_compile_tests:
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --no-run \
--features=$(TARGET_ARCH_FEATURE),experimental,boolean,shortint,integer,internal-keycache,safe-deserialization \
--features=$(TARGET_ARCH_FEATURE),experimental,boolean,shortint,integer,internal-keycache \
-p $(TFHE_SPEC)
@if [[ "$(OS)" == "Linux" || "$(OS)" == "Darwin" ]]; then \
"$(MAKE)" build_c_api && \
./scripts/c_api_tests.sh --build-only --forward-compat "$(FORWARD_COMPAT)" && \
FORWARD_COMPAT=ON "$(MAKE)" build_c_api && \
./scripts/c_api_tests.sh --build-only --forward-compat "$(FORWARD_COMPAT)"; \
./scripts/c_api_tests.sh --build-only; \
fi
.PHONY: build_nodejs_test_docker # Build a docker image with tools to run nodejs tests for wasm API
@@ -571,14 +635,28 @@ dieharder_csprng: install_dieharder build_concrete_csprng
# Benchmarks
#
.PHONY: bench_integer # Run benchmarks for integer
.PHONY: bench_integer # Run benchmarks for unsigned integer
bench_integer: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench integer-bench \
--features=$(TARGET_ARCH_FEATURE),integer,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_integer_multi_bit # Run benchmarks for integer using multi-bit parameters
.PHONY: bench_signed_integer # Run benchmarks for signed integer
bench_signed_integer: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench integer-signed-bench \
--features=$(TARGET_ARCH_FEATURE),integer,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_integer_gpu # Run benchmarks for integer on GPU backend
bench_integer_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench integer-bench \
--features=$(TARGET_ARCH_FEATURE),integer,gpu,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_integer_multi_bit # Run benchmarks for unsigned integer using multi-bit parameters
bench_integer_multi_bit: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_TYPE=MULTI_BIT \
__TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
@@ -586,6 +664,22 @@ bench_integer_multi_bit: install_rs_check_toolchain
--bench integer-bench \
--features=$(TARGET_ARCH_FEATURE),integer,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_signed_integer_multi_bit # Run benchmarks for signed integer using multi-bit parameters
bench_signed_integer_multi_bit: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_TYPE=MULTI_BIT \
__TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench integer-signed-bench \
--features=$(TARGET_ARCH_FEATURE),integer,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_integer_multi_bit_gpu # Run benchmarks for integer on GPU backend using multi-bit parameters
bench_integer_multi_bit_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_TYPE=MULTI_BIT \
__TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) __TFHE_RS_FAST_BENCH=$(FAST_BENCH) \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench integer-bench \
--features=$(TARGET_ARCH_FEATURE),integer,gpu,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC) --
.PHONY: bench_shortint # Run benchmarks for shortint
bench_shortint: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_OP_FLAVOR=$(BENCH_OP_FLAVOR) \
@@ -593,6 +687,19 @@ bench_shortint: install_rs_check_toolchain
--bench shortint-bench \
--features=$(TARGET_ARCH_FEATURE),shortint,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_oprf # Run benchmarks for shortint
bench_oprf: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench oprf-shortint-bench \
--features=$(TARGET_ARCH_FEATURE),shortint,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
RUSTFLAGS="$(RUSTFLAGS)" \
cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench oprf-integer-bench \
--features=$(TARGET_ARCH_FEATURE),integer,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_shortint_multi_bit # Run benchmarks for shortint using multi-bit parameters
bench_shortint_multi_bit: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" __TFHE_RS_BENCH_TYPE=MULTI_BIT \
@@ -614,6 +721,12 @@ bench_pbs: install_rs_check_toolchain
--bench pbs-bench \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_pbs_gpu # Run benchmarks for PBS on GPU backend
bench_pbs_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench pbs-bench \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,gpu,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_web_js_api_parallel # Run benchmarks for the web wasm api
bench_web_js_api_parallel: build_web_js_api_parallel
$(MAKE) -C tfhe/web_wasm_parallel_tests bench
@@ -624,27 +737,21 @@ ci_bench_web_js_api_parallel: build_web_js_api_parallel
nvm use node && \
$(MAKE) -C tfhe/web_wasm_parallel_tests bench-ci
.PHONY: bench_ccs_2024_cjp # Run benchmarks for PBS
bench_ccs_2024_cjp: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench ccs-2024-cjp \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_ccs_2024_fft_shrinking_ks # Run benchmarks for PBS
bench_ccs_2024_fft_shrinking_ks: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench ccs-2024-fft-shrinking-ks \
--features=$(TARGET_ARCH_FEATURE),internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
.PHONY: bench_ccs_2024_stair_ks # Run benchmarks for PBS
bench_ccs_2024_stair_ks: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) bench \
--bench ccs-2024-stair-ks \
--features=$(TARGET_ARCH_FEATURE),internal-keycache,$(AVX512_FEATURE) -p $(TFHE_SPEC)
#
# Utility tools
#
.PHONY: gen_key_cache # Run the script to generate keys and cache them for shortint tests
gen_key_cache: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) run --profile $(CARGO_PROFILE) \
--example generates_test_keys \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,internal-keycache -- \
$(MULTI_BIT_ONLY) $(COVERAGE_ONLY)
.PHONY: gen_key_cache_core_crypto # Run function to generate keys and cache them for core_crypto tests
gen_key_cache_core_crypto: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --tests --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental,internal-keycache -p $(TFHE_SPEC) -- --nocapture \
core_crypto::keycache::generate_keys
.PHONY: measure_hlapi_compact_pk_ct_sizes # Measure sizes of public keys and ciphertext for high-level API
measure_hlapi_compact_pk_ct_sizes: install_rs_check_toolchain
@@ -707,9 +814,12 @@ sha256_bool: install_rs_check_toolchain
--example sha256_bool \
--features=$(TARGET_ARCH_FEATURE),boolean
.PHONY: pcc # pcc stands for pre commit checks
.PHONY: pcc # pcc stands for pre commit checks (except GPU)
pcc: no_tfhe_typo no_dbg_log check_fmt lint_doc clippy_all check_compile_tests
.PHONY: pcc_gpu # pcc stands for pre commit checks for GPU compilation
pcc_gpu: pcc clippy_gpu
.PHONY: fpcc # pcc stands for pre commit checks, the f stands for fast
fpcc: no_tfhe_typo no_dbg_log check_fmt lint_doc clippy_fast check_compile_tests

View File

@@ -4,13 +4,17 @@
</p>
<hr/>
<p align="center">
<a href="https://docs.zama.ai/tfhe-rs"> 📒 Read documentation</a> | <a href="https://zama.ai/community"> 💛 Community support</a>
<a href="https://docs.zama.ai/tfhe-rs"> 📒 Read documentation</a> | <a href="https://zama.ai/community"> 💛 Community support</a> | <a href="https://github.com/zama-ai/awesome-zama"> 📚 FHE resources</a>
</p>
<p align="center">
<!-- Version badge using shields.io -->
<a href="https://github.com/zama-ai/tfhe-rs/releases">
<img src="https://img.shields.io/github/v/release/zama-ai/tfhe-rs?style=flat-square">
</a>
<!-- Link to tutorials badge using shields.io -->
<a href="#license">
<img src="https://img.shields.io/badge/License-BSD--3--Clause--Clear-orange?style=flat-square">
</a>
<!-- Zama Bounty Program -->
<a href="https://github.com/zama-ai/bounty-program">
<img src="https://img.shields.io/badge/Contribute-Zama%20Bounty%20Program-yellow?style=flat-square">
@@ -57,7 +61,7 @@ running Windows:
tfhe = { version = "*", features = ["boolean", "shortint", "integer", "x86_64"] }
```
Note: aarch64-based machines are not yet supported for Windows as it's currently missing an entropy source to be able to seed the [CSPRNGs](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator) used in TFHE-rs
Note: aarch64-based machines are not yet supported for Windows as it's currently missing an entropy source to be able to seed the [CSPRNGs](https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator) used in TFHE-rs.
## A simple example
@@ -70,9 +74,7 @@ use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheUint32, FheUint8};
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Basic configuration to use homomorphic integers
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
// Key generation
let (client_key, server_keys) = generate_keys(config);
@@ -120,7 +122,7 @@ To run this code, use the following command:
<p align="center"> <code> cargo run --release </code> </p>
Note that when running code that uses `tfhe-rs`, it is highly recommended
to run in release mode with cargo's `--release` flag to have the best performances possible,
to run in release mode with cargo's `--release` flag to have the best performances possible.
## Contributing
@@ -140,9 +142,11 @@ libraries.
## Need support?
<a target="_blank" href="https://community.zama.ai">
<img src="https://user-images.githubusercontent.com/5758427/231115030-21195b55-2629-4c01-9809-be5059243999.png">
<img src="https://github.com/zama-ai/tfhe-rs/assets/157474013/33d856dc-f25d-454b-a010-af12bff2aa7d">
</a>
## Citing TFHE-rs
To cite TFHE-rs in academic papers, please use the following entry:

View File

@@ -1,84 +0,0 @@
# Description
In what follows, we provide instructions on how to run benchmarks from the paper entitled "New Secret Keys for Enhanced Performance in (T)FHE". In particular, Table 2 and Figure 4 (in the Appendix) can be easily reproduced using this code.
The implementation of the techniques from the aforementioned paper has been integrated into the TFHE-rs library, version 0.4.
Modified or added source files are located in `tfhe/src/core_crypto/`:
- `algorithms/pseudo_ggsw_encryption.rs`
- `algorithms/pseudo_ggsw_conversion.rs`
- `algorithms/lwe_shrinking_keyswitch_key_generation.rs`
- `algorithms/lwe_shrinking_keyswitch.rs`
- `algorithms/lwe_secret_key_generation.rs`
- `algorithms/lwe_partial_secret_key_generation.rs`
- `algorithms/lwe_fast_keyswitch_key_generation.rs`
- `algorithms/lwe_fast_keyswitch.rs`
- `algorithms/glwe_secret_key_generation.rs`
- `algorithms/glwe_partial_secret_key_generation.rs`
- `algorithms/glwe_partial_sample_extraction.rs`
Test files are located in `tfhe/src/core_crypto/algorithms/test`:
- `lwe_stair_keyswitch.rs`
- `lwe_fast_keyswitch.rs`
Benchmarks are located in `tfhe/benches/core_crypto`:
- `ccs_2024_cjp.rs`
- `ccs_2024_fft_shrinking_ks.rs`
- `ccs_2024_stair_ks.rs`
# Dependencies
Tested on Linux and Mac OS with Rust >= 1.75 (see [here](https://www.rust-lang.org/tools/install) a guide to install Rust).
# How to run benchmarks
At the root of the project (i.e., in the TFHE-rs folder), enter the following commands to run the benchmarks:
- `make bench_ccs_2024_cjp`: Returns the timings associated with the CJP-based bootstrapping (Table 2, Line 1 + Figure 4, blue line);
- `make bench_ccs_2024_stair_ks`: Returns the timings associated with the "All+Stair-KS"-based bootstrapping (Table 2, Line 2 + Figure 4, red line);
- `make bench_ccs_2024_fft_shrinking_ks`: Returns the timings associated with the "All+FFT Shrinking-KS"-based bootstrapping (Table 2, Line 3 + Figure 4, green line);
This outputs the timings depending on the input precision.
Since large precision (>= 6 bits) might be long to execute, particularly on a laptop, these are disable by default. To choose which precision to launch, please uncomment lines associated to the parameter names into the `param_vec` variable, inside the `criterion_bench` function inside one of the benchmark files.
For instance, to launch only the precision 7 of the stair-KS benchmark, the correct `param_vec` variable (line 353 of `ccs_2024_stair_ks.rs`) looks like:
```rust
let param_vec = [
// PRECISION_1_STAIR,
// PRECISION_2_STAIR,
// PRECISION_3_STAIR,
// PRECISION_4_STAIR,
// PRECISION_5_STAIR,
// PRECISION_6_STAIR,
PRECISION_7_STAIR
// PRECISION_8_STAIR,
// PRECISION_9_STAIR,
// PRECISION_10_STAIR,
// PRECISION_11_STAIR,
];
```
Running the command `make bench_ccs_2024_stair_ks`will give the correct benchmark.
# How to run the tests
At the root of the project (i.e., in the TFHE-rs folder), enter the following commands to run the tests:
- `make test_ccs_2024_stair_ks`: Returns the timings associated with the "All+Stair-KS"-based bootstrapping (Table 2, Line 2 + Figure 4, red line);
- `make test_ccs_2024_fft_shrinking_ks`: Returns the timings associated with the "All+FFT Shrinking-KS"-based bootstrapping (Table 2, Line 3 + Figure 4, green line);
As for the benchmarks, all precision are not enabled by default. To add precision in the test, associated parameters must be uncommented in the macro `create_parametrized_test!` (located at the end of each test file)
For instance, in the file `lwe_fast_keyswitch.rs`, testing only the precision will look like this:
```rust
create_parametrized_test!(lwe_encrypt_fast_ks_decrypt_custom_mod {
PRECISION_1_FAST_KS,
PRECISION_2_FAST_KS,
PRECISION_3_FAST_KS,
PRECISION_4_FAST_KS,
PRECISION_5_FAST_KS,
PRECISION_6_FAST_KS,
PRECISION_7_FAST_KS,
PRECISION_8_FAST_KS,
// PRECISION_9_FAST_KS,
PRECISION_10_FAST_KS
// PRECISION_11_FAST_KS
});
```
Please note that the last parameter MUST NOT be followed by a comma `,` to correctly compile.

View File

@@ -6,7 +6,7 @@ use tfhe_trivium::KreyviumStream;
use criterion::Criterion;
pub fn kreyvium_bool_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
@@ -41,7 +41,7 @@ pub fn kreyvium_bool_gen(c: &mut Criterion) {
}
pub fn kreyvium_bool_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB000000000000".to_string();

View File

@@ -6,9 +6,8 @@ use tfhe_trivium::{KreyviumStreamByte, TransCiphering};
use criterion::Criterion;
pub fn kreyvium_byte_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.enable_function_evaluation_integers()
let config = ConfigBuilder::default()
.enable_function_evaluation()
.build();
let (client_key, server_key) = generate_keys(config);
@@ -36,9 +35,8 @@ pub fn kreyvium_byte_gen(c: &mut Criterion) {
}
pub fn kreyvium_byte_trans(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.enable_function_evaluation_integers()
let config = ConfigBuilder::default()
.enable_function_evaluation()
.build();
let (client_key, server_key) = generate_keys(config);
@@ -67,9 +65,8 @@ pub fn kreyvium_byte_trans(c: &mut Criterion) {
}
pub fn kreyvium_byte_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.enable_function_evaluation_integers()
let config = ConfigBuilder::default()
.enable_function_evaluation()
.build();
let (client_key, server_key) = generate_keys(config);

View File

@@ -8,9 +8,7 @@ use tfhe_trivium::{KreyviumStreamShortint, TransCiphering};
use criterion::Criterion;
pub fn kreyvium_shortint_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();
@@ -60,9 +58,7 @@ pub fn kreyvium_shortint_warmup(c: &mut Criterion) {
}
pub fn kreyvium_shortint_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();
@@ -107,9 +103,7 @@ pub fn kreyvium_shortint_gen(c: &mut Criterion) {
}
pub fn kreyvium_shortint_trans(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

View File

@@ -6,7 +6,7 @@ use tfhe_trivium::TriviumStream;
use criterion::Criterion;
pub fn trivium_bool_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -41,7 +41,7 @@ pub fn trivium_bool_gen(c: &mut Criterion) {
}
pub fn trivium_bool_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();

View File

@@ -6,9 +6,7 @@ use tfhe_trivium::{TransCiphering, TriviumStreamByte};
use criterion::Criterion;
pub fn trivium_byte_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -35,9 +33,7 @@ pub fn trivium_byte_gen(c: &mut Criterion) {
}
pub fn trivium_byte_trans(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -65,9 +61,7 @@ pub fn trivium_byte_trans(c: &mut Criterion) {
}
pub fn trivium_byte_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();

View File

@@ -8,9 +8,7 @@ use tfhe_trivium::{TransCiphering, TriviumStreamShortint};
use criterion::Criterion;
pub fn trivium_shortint_warmup(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();
@@ -60,9 +58,7 @@ pub fn trivium_shortint_warmup(c: &mut Criterion) {
}
pub fn trivium_shortint_gen(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();
@@ -107,9 +103,7 @@ pub fn trivium_shortint_gen(c: &mut Criterion) {
}
pub fn trivium_shortint_trans(c: &mut Criterion) {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

View File

@@ -170,7 +170,7 @@ fn kreyvium_test_4() {
#[test]
fn kreyvium_test_fhe_long() {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB000000000000".to_string();
@@ -217,9 +217,7 @@ use tfhe::shortint::prelude::*;
#[test]
fn kreyvium_test_shortint_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();
@@ -302,9 +300,8 @@ fn kreyvium_test_clear_byte() {
#[test]
fn kreyvium_test_byte_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.enable_function_evaluation_integers()
let config = ConfigBuilder::default()
.enable_function_evaluation()
.build();
let (client_key, server_key) = generate_keys(config);
@@ -342,9 +339,8 @@ fn kreyvium_test_byte_long() {
#[test]
fn kreyvium_test_fhe_byte_transciphering_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.enable_function_evaluation_integers()
let config = ConfigBuilder::default()
.enable_function_evaluation()
.build();
let (client_key, server_key) = generate_keys(config);

View File

@@ -4,6 +4,7 @@
use crate::{KreyviumStreamByte, KreyviumStreamShortint, TriviumStreamByte, TriviumStreamShortint};
use tfhe::shortint::Ciphertext;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheUint64, FheUint8, ServerKey};
use rayon::prelude::*;

View File

@@ -232,7 +232,7 @@ fn trivium_test_clear_byte() {
#[test]
fn trivium_test_fhe_long() {
let config = ConfigBuilder::all_disabled().enable_default_bool().build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -277,9 +277,7 @@ fn trivium_test_fhe_long() {
#[test]
fn trivium_test_fhe_byte_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -316,9 +314,7 @@ fn trivium_test_fhe_byte_long() {
#[test]
fn trivium_test_fhe_byte_transciphering_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);
let key_string = "0053A6F94C9FF24598EB".to_string();
@@ -357,9 +353,7 @@ use tfhe::shortint::prelude::*;
#[test]
fn trivium_test_shortint_long() {
let config = ConfigBuilder::all_disabled()
.enable_default_integers()
.build();
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);
let underlying_ck: tfhe::shortint::ClientKey = (*hl_client_key.as_ref()).clone().into();
let underlying_sk: tfhe::shortint::ServerKey = (*hl_server_key.as_ref()).clone().into();

View File

@@ -0,0 +1,18 @@
[package]
name = "tfhe-cuda-backend"
version = "0.1.3"
edition = "2021"
authors = ["Zama team"]
license = "BSD-3-Clause-Clear"
description = "Cuda implementation of TFHE-rs primitives."
homepage = "https://www.zama.ai/"
documentation = "https://docs.zama.ai/tfhe-rs"
repository = "https://github.com/zama-ai/tfhe-rs"
readme = "README.md"
keywords = ["fully", "homomorphic", "encryption", "fhe", "cryptography"]
[build-dependencies]
cmake = { version = "0.1" }
[dependencies]
thiserror = "1.0"

View File

@@ -0,0 +1,28 @@
BSD 3-Clause Clear License
Copyright © 2024 ZAMA.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation and/or other
materials provided with the distribution.
3. Neither the name of ZAMA nor the names of its contributors may be used to endorse
or promote products derived from this software without specific prior written permission.
NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY THIS LICENSE.
THIS SOFTWARE IS PROVIDED BY THE ZAMA AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
ZAMA OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -0,0 +1,53 @@
# TFHE Cuda backend
## Introduction
The `tfhe-cuda-backend` holds the code for GPU acceleration of Zama's variant of TFHE.
It implements CUDA/C++ functions to perform homomorphic operations on LWE ciphertexts.
It provides functions to allocate memory on the GPU, to copy data back
and forth between the CPU and the GPU, to create and destroy Cuda streams, etc.:
- `cuda_create_stream`, `cuda_destroy_stream`
- `cuda_malloc`, `cuda_check_valid_malloc`
- `cuda_memcpy_async_to_cpu`, `cuda_memcpy_async_to_gpu`
- `cuda_get_number_of_gpus`
- `cuda_synchronize_device`
The cryptographic operations it provides are:
- an amortized implementation of the TFHE programmable bootstrap: `cuda_bootstrap_amortized_lwe_ciphertext_vector_32` and `cuda_bootstrap_amortized_lwe_ciphertext_vector_64`
- a low latency implementation of the TFHE programmable bootstrap: `cuda_bootstrap_low latency_lwe_ciphertext_vector_32` and `cuda_bootstrap_low_latency_lwe_ciphertext_vector_64`
- the keyswitch: `cuda_keyswitch_lwe_ciphertext_vector_32` and `cuda_keyswitch_lwe_ciphertext_vector_64`
- the larger precision programmable bootstrap (wop PBS, which supports up to 16 bits of message while the classical PBS only supports up to 8 bits of message) and its sub-components: `cuda_wop_pbs_64`, `cuda_extract_bits_64`, `cuda_circuit_bootstrap_64`, `cuda_cmux_tree_64`, `cuda_blind_rotation_sample_extraction_64`
- acceleration for leveled operations: `cuda_negate_lwe_ciphertext_vector_64`, `cuda_add_lwe_ciphertext_vector_64`, `cuda_add_lwe_ciphertext_vector_plaintext_vector_64`, `cuda_mult_lwe_ciphertext_vector_cleartext_vector`.
## Dependencies
**Disclaimer**: Compilation on Windows/Mac is not supported yet. Only Nvidia GPUs are supported.
- nvidia driver - for example, if you're running Ubuntu 20.04 check this [page](https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-20-04-focal-fossa-linux) for installation
- [nvcc](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html) >= 10.0
- [gcc](https://gcc.gnu.org/) >= 8.0 - check this [page](https://gist.github.com/ax3l/9489132) for more details about nvcc/gcc compatible versions
- [cmake](https://cmake.org/) >= 3.24
## Build
The Cuda project held in `tfhe-cuda-backend` can be compiled independently from Concrete in the
following way:
```
git clone git@github.com:zama-ai/tfhe-rs
cd backends/tfhe-cuda-backend/cuda
mkdir build
cd build
cmake ..
make
```
The compute capability is detected automatically (with the first GPU information) and set accordingly.
If your machine does not have an available Nvidia GPU, the compilation will work if you have the nvcc compiler installed. The generated executable will target a 7.0 compute capability (sm_70).
## Links
- [TFHE](https://eprint.iacr.org/2018/421.pdf)
## License
This software is distributed under the BSD-3-Clause-Clear license. If you have any questions,
please contact us at `hello@zama.ai`.

View File

@@ -0,0 +1,28 @@
use std::env;
use std::process::Command;
fn main() {
println!("Build tfhe-cuda-backend");
if env::consts::OS == "linux" {
let output = Command::new("./get_os_name.sh").output().unwrap();
let distribution = String::from_utf8(output.stdout).unwrap();
if distribution != "Ubuntu\n" {
println!(
"cargo:warning=This Linux distribution is not officially supported. \
Only Ubuntu is supported by tfhe-cuda-backend at this time. Build may fail\n"
);
}
let dest = cmake::build("cuda");
println!("cargo:rustc-link-search=native={}", dest.display());
println!("cargo:rustc-link-lib=static=tfhe_cuda_backend");
println!("cargo:rustc-link-search=native=/usr/local/cuda/lib64");
println!("cargo:rustc-link-lib=gomp");
println!("cargo:rustc-link-lib=cudart");
println!("cargo:rustc-link-search=native=/usr/lib/x86_64-linux-gnu/");
println!("cargo:rustc-link-lib=stdc++");
} else {
panic!(
"Error: platform not supported, tfhe-cuda-backend not built (only Linux is supported)"
);
}
}

View File

@@ -0,0 +1,10 @@
# -----------------------------
# Options effecting formatting.
# -----------------------------
with section("format"):
# How wide to allow formatted cmake files
line_width = 120
# How many spaces to tab for indent
tab_size = 2

View File

@@ -0,0 +1,90 @@
cmake_minimum_required(VERSION 3.24 FATAL_ERROR)
project(tfhe_cuda_backend LANGUAGES CXX)
# See if the minimum CUDA version is available. If not, only enable documentation building.
set(MINIMUM_SUPPORTED_CUDA_VERSION 10.0)
include(CheckLanguage)
# See if CUDA is available
check_language(CUDA)
# If so, enable CUDA to check the version.
if(CMAKE_CUDA_COMPILER)
enable_language(CUDA)
endif()
# If CUDA is not available, or the minimum version is too low do not build
if(NOT CMAKE_CUDA_COMPILER)
message(FATAL_ERROR "Cuda compiler not found.")
endif()
if(CMAKE_CUDA_COMPILER_VERSION VERSION_LESS ${MINIMUM_SUPPORTED_CUDA_VERSION})
message(FATAL_ERROR "CUDA ${MINIMUM_SUPPORTED_CUDA_VERSION} or greater is required for compilation.")
endif()
# Get CUDA compute capability
set(OUTPUTFILE ${CMAKE_CURRENT_SOURCE_DIR}/cuda_script) # No suffix required
set(CUDAFILE ${CMAKE_CURRENT_SOURCE_DIR}/check_cuda.cu)
execute_process(COMMAND nvcc -lcuda ${CUDAFILE} -o ${OUTPUTFILE})
execute_process(
COMMAND ${OUTPUTFILE}
RESULT_VARIABLE CUDA_RETURN_CODE
OUTPUT_VARIABLE ARCH)
file(REMOVE ${OUTPUTFILE})
if(${CUDA_RETURN_CODE} EQUAL 0)
set(CUDA_SUCCESS "TRUE")
else()
set(CUDA_SUCCESS "FALSE")
endif()
if(${CUDA_SUCCESS})
message(STATUS "CUDA Architecture: ${ARCH}")
message(STATUS "CUDA Version: ${CUDA_VERSION_STRING}")
message(STATUS "CUDA Path: ${CUDA_TOOLKIT_ROOT_DIR}")
message(STATUS "CUDA Libraries: ${CUDA_LIBRARIES}")
message(STATUS "CUDA Performance Primitives: ${CUDA_npp_LIBRARY}")
else()
message(WARNING ${ARCH})
endif()
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif()
# Add OpenMP support
find_package(OpenMP REQUIRED)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler ${OpenMP_CXX_FLAGS}")
if(${CUDA_SUCCESS})
set(CMAKE_CUDA_ARCHITECTURES native)
else()
set(CMAKE_CUDA_ARCHITECTURES 70)
endif()
# in production, should use -arch=sm_70 --ptxas-options=-v to see register spills -lineinfo for better debugging
set(CMAKE_CUDA_FLAGS
"${CMAKE_CUDA_FLAGS} -ccbin ${CMAKE_CXX_COMPILER} -O3 \
-std=c++17 --no-exceptions --expt-relaxed-constexpr -rdc=true \
--use_fast_math -Xcompiler -fPIC")
set(INCLUDE_DIR include)
add_subdirectory(src)
target_include_directories(tfhe_cuda_backend PRIVATE ${INCLUDE_DIR})
# This is required for rust cargo build
install(TARGETS tfhe_cuda_backend DESTINATION .)
install(TARGETS tfhe_cuda_backend DESTINATION lib)
# Define a function to add a lint target.
find_file(CPPLINT NAMES cpplint cpplint.exe)
if(CPPLINT)
# Add a custom target to lint all child projects. Dependencies are specified in child projects.
add_custom_target(all_lint)
# Don't trigger this target on ALL_BUILD or Visual Studio 'Rebuild Solution'
set_target_properties(all_lint PROPERTIES EXCLUDE_FROM_ALL TRUE)
# set_target_properties(all_lint PROPERTIES EXCLUDE_FROM_DEFAULT_BUILD TRUE)
endif()
enable_testing()

View File

@@ -0,0 +1,3 @@
set noparent
linelength=240
filter=-legal/copyright,-readability/todo,-runtime/references,-build/c++17

View File

@@ -0,0 +1,22 @@
#include <stdio.h>
int main(int argc, char **argv) {
cudaDeviceProp dP;
float min_cc = 3.0;
int rc = cudaGetDeviceProperties(&dP, 0);
if (rc != cudaSuccess) {
cudaError_t error = cudaGetLastError();
printf("CUDA error: %s", cudaGetErrorString(error));
return rc; /* Failure */
}
if ((dP.major + (dP.minor / 10)) < min_cc) {
printf("Min Compute Capability of %2.1f required: %d.%d found\n Not "
"Building CUDA Code",
min_cc, dP.major, dP.minor);
return 1; /* Failure */
} else {
printf("-arch=sm_%d%d", dP.major, dP.minor);
return 0; /* Success */
}
}

View File

@@ -0,0 +1,6 @@
#!/bin/bash
find ./{include,src} -iregex '^.*\.\(cpp\|cu\|h\|cuh\)$' -print | xargs clang-format-15 -i -style='file'
cmake-format -i CMakeLists.txt -c .cmake-format-config.py
find ./{include,src} -type f -name "CMakeLists.txt" | xargs -I % sh -c 'cmake-format -i % -c .cmake-format-config.py'

View File

@@ -0,0 +1,118 @@
#ifndef CUDA_BOOTSTRAP_H
#define CUDA_BOOTSTRAP_H
#include "device.h"
#include <cstdint>
enum PBS_TYPE { MULTI_BIT = 0, LOW_LAT = 1, AMORTIZED = 2 };
extern "C" {
void cuda_fourier_polynomial_mul(void *input1, void *input2, void *output,
cuda_stream_t *stream,
uint32_t polynomial_size,
uint32_t total_polynomials);
void cuda_convert_lwe_bootstrap_key_32(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size);
void cuda_convert_lwe_bootstrap_key_64(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size);
void scratch_cuda_bootstrap_amortized_32(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory);
void scratch_cuda_bootstrap_amortized_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory);
void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cleanup_cuda_bootstrap_amortized(cuda_stream_t *stream,
int8_t **pbs_buffer);
void scratch_cuda_bootstrap_low_latency_32(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
void scratch_cuda_bootstrap_low_latency_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cleanup_cuda_bootstrap_low_latency(cuda_stream_t *stream,
int8_t **pbs_buffer);
uint64_t get_buffer_size_bootstrap_amortized_64(
uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory);
uint64_t get_buffer_size_bootstrap_low_latency_64(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory);
}
#ifdef __CUDACC__
__device__ inline int get_start_ith_ggsw(int i, uint32_t polynomial_size,
int glwe_dimension,
uint32_t level_count);
template <typename T>
__device__ T *get_ith_mask_kth_block(T *ptr, int i, int k, int level,
uint32_t polynomial_size,
int glwe_dimension, uint32_t level_count);
template <typename T>
__device__ T *get_ith_body_kth_block(T *ptr, int i, int k, int level,
uint32_t polynomial_size,
int glwe_dimension, uint32_t level_count);
template <typename T>
__device__ T *get_multi_bit_ith_lwe_gth_group_kth_block(
T *ptr, int g, int i, int k, int level, uint32_t grouping_factor,
uint32_t polynomial_size, uint32_t glwe_dimension, uint32_t level_count);
#endif
#endif // CUDA_BOOTSTRAP_H

View File

@@ -0,0 +1,46 @@
#ifndef CUDA_MULTI_BIT_H
#define CUDA_MULTI_BIT_H
#include <cstdint>
extern "C" {
void cuda_convert_lwe_multi_bit_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size,
uint32_t grouping_factor);
void cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t grouping_factor, uint32_t base_log, uint32_t level_count,
uint32_t num_samples, uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, uint32_t chunk_size = 0);
void scratch_cuda_multi_bit_pbs_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory,
uint32_t chunk_size = 0);
void cleanup_cuda_multi_bit_pbs(cuda_stream_t *stream, int8_t **pbs_buffer);
}
#ifdef __CUDACC__
__host__ uint32_t get_lwe_chunk_size(uint32_t lwe_dimension,
uint32_t level_count,
uint32_t glwe_dimension,
uint32_t num_samples);
__host__ uint32_t get_average_lwe_chunk_size(uint32_t lwe_dimension,
uint32_t level_count,
uint32_t glwe_dimension,
uint32_t ct_count);
__host__ uint64_t get_max_buffer_size_multibit_bootstrap(
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t max_input_lwe_ciphertext_count);
#endif
#endif // CUDA_MULTI_BIT_H

View File

@@ -0,0 +1,18 @@
#ifndef CUDA_CIPHERTEXT_H
#define CUDA_CIPHERTEXT_H
#include <cstdint>
extern "C" {
void cuda_convert_lwe_ciphertext_vector_to_gpu_64(void *dest, void *src,
void *v_stream,
uint32_t gpu_index,
uint32_t number_of_cts,
uint32_t lwe_dimension);
void cuda_convert_lwe_ciphertext_vector_to_cpu_64(void *dest, void *src,
void *v_stream,
uint32_t gpu_index,
uint32_t number_of_cts,
uint32_t lwe_dimension);
};
#endif

View File

@@ -0,0 +1,88 @@
#ifndef DEVICE_H
#define DEVICE_H
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#define synchronize_threads_in_block() __syncthreads()
extern "C" {
struct cuda_stream_t {
cudaStream_t stream;
uint32_t gpu_index;
cuda_stream_t(uint32_t gpu_index) {
this->gpu_index = gpu_index;
cudaStreamCreate(&stream);
}
void release() {
cudaSetDevice(gpu_index);
cudaStreamDestroy(stream);
}
void synchronize() { cudaStreamSynchronize(stream); }
};
cuda_stream_t *cuda_create_stream(uint32_t gpu_index);
int cuda_destroy_stream(cuda_stream_t *stream);
void *cuda_malloc(uint64_t size, uint32_t gpu_index);
void *cuda_malloc_async(uint64_t size, cuda_stream_t *stream);
int cuda_check_valid_malloc(uint64_t size, uint32_t gpu_index);
int cuda_check_support_cooperative_groups();
int cuda_memcpy_to_cpu(void *dest, const void *src, uint64_t size);
int cuda_memcpy_async_to_gpu(void *dest, void *src, uint64_t size,
cuda_stream_t *stream);
int cuda_memcpy_async_gpu_to_gpu(void *dest, void *src, uint64_t size,
cuda_stream_t *stream);
int cuda_memcpy_to_gpu(void *dest, void *src, uint64_t size);
int cuda_memcpy_async_to_cpu(void *dest, const void *src, uint64_t size,
cuda_stream_t *stream);
int cuda_memset_async(void *dest, uint64_t val, uint64_t size,
cuda_stream_t *stream);
int cuda_get_number_of_gpus();
int cuda_synchronize_device(uint32_t gpu_index);
int cuda_drop(void *ptr, uint32_t gpu_index);
int cuda_drop_async(void *ptr, cuda_stream_t *stream);
int cuda_get_max_shared_memory(uint32_t gpu_index);
int cuda_synchronize_stream(cuda_stream_t *stream);
#define check_cuda_error(ans) \
{ cuda_error((ans), __FILE__, __LINE__); }
inline void cuda_error(cudaError_t code, const char *file, int line,
bool abort = true) {
if (code != cudaSuccess) {
fprintf(stderr, "Cuda error: %s %s %d\n", cudaGetErrorString(code), file,
line);
if (abort)
exit(code);
}
}
}
template <typename Torus>
void cuda_set_value_async(cudaStream_t *stream, Torus *d_array, Torus value,
Torus n);
#endif

View File

@@ -0,0 +1,100 @@
#include "cuComplex.h"
#include "thrust/complex.h"
#include <iostream>
#include <string>
#include <type_traits>
#define PRINT_VARS
#ifdef PRINT_VARS
#define PRINT_DEBUG_5(var, begin, end, step, cond) \
_print_debug(var, #var, begin, end, step, cond, "", false)
#define PRINT_DEBUG_6(var, begin, end, step, cond, text) \
_print_debug(var, #var, begin, end, step, cond, text, true)
#define CAT(A, B) A##B
#define PRINT_SELECT(NAME, NUM) CAT(NAME##_, NUM)
#define GET_COUNT(_1, _2, _3, _4, _5, _6, COUNT, ...) COUNT
#define VA_SIZE(...) GET_COUNT(__VA_ARGS__, 6, 5, 4, 3, 2, 1)
#define PRINT_DEBUG(...) \
PRINT_SELECT(PRINT_DEBUG, VA_SIZE(__VA_ARGS__))(__VA_ARGS__)
#else
#define PRINT_DEBUG(...)
#endif
template <typename T>
__device__ typename std::enable_if<std::is_unsigned<T>::value, void>::type
_print_debug(T *var, const char *var_name, int start, int end, int step,
bool cond, const char *text, bool has_text) {
__syncthreads();
if (cond) {
if (has_text)
printf("%s\n", text);
for (int i = start; i < end; i += step) {
printf("%s[%u]: %u\n", var_name, i, var[i]);
}
}
__syncthreads();
}
template <typename T>
__device__ typename std::enable_if<std::is_signed<T>::value, void>::type
_print_debug(T *var, const char *var_name, int start, int end, int step,
bool cond, const char *text, bool has_text) {
__syncthreads();
if (cond) {
if (has_text)
printf("%s\n", text);
for (int i = start; i < end; i += step) {
printf("%s[%u]: %d\n", var_name, i, var[i]);
}
}
__syncthreads();
}
template <typename T>
__device__ typename std::enable_if<std::is_floating_point<T>::value, void>::type
_print_debug(T *var, const char *var_name, int start, int end, int step,
bool cond, const char *text, bool has_text) {
__syncthreads();
if (cond) {
if (has_text)
printf("%s\n", text);
for (int i = start; i < end; i += step) {
printf("%s[%u]: %.15f\n", var_name, i, var[i]);
}
}
__syncthreads();
}
template <typename T>
__device__
typename std::enable_if<std::is_same<T, thrust::complex<double>>::value,
void>::type
_print_debug(T *var, const char *var_name, int start, int end, int step,
bool cond, const char *text, bool has_text) {
__syncthreads();
if (cond) {
if (has_text)
printf("%s\n", text);
for (int i = start; i < end; i += step) {
printf("%s[%u]: %.15f , %.15f\n", var_name, i, var[i].real(),
var[i].imag());
}
}
__syncthreads();
}
template <typename T>
__device__
typename std::enable_if<std::is_same<T, cuDoubleComplex>::value, void>::type
_print_debug(T *var, const char *var_name, int start, int end, int step,
bool cond, const char *text, bool has_text) {
__syncthreads();
if (cond) {
if (has_text)
printf("%s\n", text);
for (int i = start; i < end; i += step) {
printf("%s[%u]: %.15f , %.15f\n", var_name, i, var[i].x, var[i].y);
}
}
__syncthreads();
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,21 @@
#ifndef CNCRT_KS_H_
#define CNCRT_KS_H_
#include <cstdint>
extern "C" {
void cuda_keyswitch_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lwe_array_in, void *lwe_input_indexes, void *ksk,
uint32_t lwe_dimension_in, uint32_t lwe_dimension_out, uint32_t base_log,
uint32_t level_count, uint32_t num_samples);
void cuda_keyswitch_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lwe_array_in, void *lwe_input_indexes, void *ksk,
uint32_t lwe_dimension_in, uint32_t lwe_dimension_out, uint32_t base_log,
uint32_t level_count, uint32_t num_samples);
}
#endif // CNCRT_KS_H_

View File

@@ -0,0 +1,50 @@
#ifndef CUDA_LINALG_H_
#define CUDA_LINALG_H_
#include "bootstrap.h"
#include <cstdint>
#include <device.h>
extern "C" {
void cuda_negate_lwe_ciphertext_vector_32(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_negate_lwe_ciphertext_vector_64(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_add_lwe_ciphertext_vector_32(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in_1,
void *lwe_array_in_2,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_add_lwe_ciphertext_vector_64(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in_1,
void *lwe_array_in_2,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_add_lwe_ciphertext_vector_plaintext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *plaintext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_add_lwe_ciphertext_vector_plaintext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *plaintext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_mult_lwe_ciphertext_vector_cleartext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *cleartext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
void cuda_mult_lwe_ciphertext_vector_cleartext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *cleartext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count);
}
#endif // CUDA_LINALG_H_

View File

@@ -0,0 +1,18 @@
set(SOURCES
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/bit_extraction.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/bitwise_ops.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/bootstrap.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/bootstrap_multibit.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/ciphertext.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/circuit_bootstrap.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/device.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/integer.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/keyswitch.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/linear_algebra.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/shifts.h
${CMAKE_SOURCE_DIR}/${INCLUDE_DIR}/vertical_packing.h)
file(GLOB_RECURSE SOURCES "*.cu")
add_library(tfhe_cuda_backend STATIC ${SOURCES})
set_target_properties(tfhe_cuda_backend PROPERTIES CUDA_SEPARABLE_COMPILATION ON CUDA_RESOLVE_DEVICE_SYMBOLS ON)
target_link_libraries(tfhe_cuda_backend PUBLIC cudart OpenMP::OpenMP_CXX)
target_include_directories(tfhe_cuda_backend PRIVATE .)

View File

@@ -0,0 +1 @@
#include "ciphertext.cuh"

View File

@@ -0,0 +1,44 @@
#ifndef CUDA_CIPHERTEXT_CUH
#define CUDA_CIPHERTEXT_CUH
#include "ciphertext.h"
#include "device.h"
#include <cstdint>
template <typename T>
void cuda_convert_lwe_ciphertext_vector_to_gpu(T *dest, T *src,
cuda_stream_t *stream,
uint32_t number_of_cts,
uint32_t lwe_dimension) {
cudaSetDevice(stream->gpu_index);
uint64_t size = number_of_cts * (lwe_dimension + 1) * sizeof(T);
cuda_memcpy_async_to_gpu(dest, src, size, stream);
}
void cuda_convert_lwe_ciphertext_vector_to_gpu_64(void *dest, void *src,
cuda_stream_t *stream,
uint32_t number_of_cts,
uint32_t lwe_dimension) {
cuda_convert_lwe_ciphertext_vector_to_gpu<uint64_t>(
(uint64_t *)dest, (uint64_t *)src, stream, number_of_cts, lwe_dimension);
}
template <typename T>
void cuda_convert_lwe_ciphertext_vector_to_cpu(T *dest, T *src,
cuda_stream_t *stream,
uint32_t number_of_cts,
uint32_t lwe_dimension) {
cudaSetDevice(stream->gpu_index);
uint64_t size = number_of_cts * (lwe_dimension + 1) * sizeof(T);
cuda_memcpy_async_to_cpu(dest, src, size, stream);
}
void cuda_convert_lwe_ciphertext_vector_to_cpu_64(void *dest, void *src,
cuda_stream_t *stream,
uint32_t number_of_cts,
uint32_t lwe_dimension) {
cuda_convert_lwe_ciphertext_vector_to_cpu<uint64_t>(
(uint64_t *)dest, (uint64_t *)src, stream, number_of_cts, lwe_dimension);
}
#endif

View File

@@ -0,0 +1,162 @@
#ifndef CNCRT_CRYPTO_CUH
#define CNCRT_CRPYTO_CUH
#include "device.h"
#include <cstdint>
/**
* GadgetMatrix implements the iterator design pattern to decompose a set of
* num_poly consecutive polynomials with degree params::degree. A total of
* level_count levels is expected and each call to decompose_and_compress_next()
* writes to the result the next level. It is also possible to advance an
* arbitrary amount of levels by using decompose_and_compress_level().
*
* This class always decomposes the entire set of num_poly polynomials.
* By default, it works on a single polynomial.
*/
#pragma once
template <typename T, class params> class GadgetMatrix {
private:
uint32_t level_count;
uint32_t base_log;
uint32_t mask;
uint32_t halfbg;
uint32_t num_poly;
T offset;
int current_level;
T mask_mod_b;
T *state;
public:
__device__ GadgetMatrix(uint32_t base_log, uint32_t level_count, T *state,
uint32_t num_poly = 1)
: base_log(base_log), level_count(level_count), num_poly(num_poly),
state(state) {
mask_mod_b = (1ll << base_log) - 1ll;
current_level = level_count;
int tid = threadIdx.x;
for (int i = 0; i < num_poly * params::opt; i++) {
state[tid] >>= (sizeof(T) * 8 - base_log * level_count);
tid += params::degree / params::opt;
}
synchronize_threads_in_block();
}
// Decomposes all polynomials at once
__device__ void decompose_and_compress_next(double2 *result) {
for (int j = 0; j < num_poly; j++) {
auto result_slice = result + j * params::degree / 2;
decompose_and_compress_next_polynomial(result_slice, j);
}
}
// Decomposes a single polynomial
__device__ void decompose_and_compress_next_polynomial(double2 *result,
int j) {
if (j == 0)
current_level -= 1;
int tid = threadIdx.x;
auto state_slice = state + j * params::degree;
for (int i = 0; i < params::opt / 2; i++) {
T res_re = state_slice[tid] & mask_mod_b;
T res_im = state_slice[tid + params::degree / 2] & mask_mod_b;
state_slice[tid] >>= base_log;
state_slice[tid + params::degree / 2] >>= base_log;
T carry_re = ((res_re - 1ll) | state_slice[tid]) & res_re;
T carry_im =
((res_im - 1ll) | state_slice[tid + params::degree / 2]) & res_im;
carry_re >>= (base_log - 1);
carry_im >>= (base_log - 1);
state_slice[tid] += carry_re;
state_slice[tid + params::degree / 2] += carry_im;
res_re -= carry_re << base_log;
res_im -= carry_im << base_log;
result[tid].x = (int32_t)res_re;
result[tid].y = (int32_t)res_im;
tid += params::degree / params::opt;
}
synchronize_threads_in_block();
}
// Decomposes a single polynomial
__device__ void
decompose_and_compress_next_polynomial_elements(double2 *result, int j) {
if (j == 0)
current_level -= 1;
int tid = threadIdx.x;
auto state_slice = state + j * params::degree;
for (int i = 0; i < params::opt / 2; i++) {
T res_re = state_slice[tid] & mask_mod_b;
T res_im = state_slice[tid + params::degree / 2] & mask_mod_b;
state_slice[tid] >>= base_log;
state_slice[tid + params::degree / 2] >>= base_log;
T carry_re = ((res_re - 1ll) | state_slice[tid]) & res_re;
T carry_im =
((res_im - 1ll) | state_slice[tid + params::degree / 2]) & res_im;
carry_re >>= (base_log - 1);
carry_im >>= (base_log - 1);
state_slice[tid] += carry_re;
state_slice[tid + params::degree / 2] += carry_im;
res_re -= carry_re << base_log;
res_im -= carry_im << base_log;
result[i].x = (int32_t)res_re;
result[i].y = (int32_t)res_im;
tid += params::degree / params::opt;
}
synchronize_threads_in_block();
}
__device__ void decompose_and_compress_level(double2 *result, int level) {
for (int i = 0; i < level_count - level; i++)
decompose_and_compress_next(result);
}
};
template <typename T> class GadgetMatrixSingle {
private:
uint32_t level_count;
uint32_t base_log;
uint32_t mask;
uint32_t halfbg;
T offset;
public:
__device__ GadgetMatrixSingle(uint32_t base_log, uint32_t level_count)
: base_log(base_log), level_count(level_count) {
uint32_t bg = 1 << base_log;
this->halfbg = bg / 2;
this->mask = bg - 1;
T temp = 0;
for (int i = 0; i < this->level_count; i++) {
temp += 1ULL << (sizeof(T) * 8 - (i + 1) * this->base_log);
}
this->offset = temp * this->halfbg;
}
__device__ T decompose_one_level_single(T element, uint32_t level) {
T s = element + this->offset;
uint32_t decal = (sizeof(T) * 8 - (level + 1) * this->base_log);
T temp1 = (s >> decal) & this->mask;
return (T)(temp1 - this->halfbg);
}
};
template <typename Torus>
__device__ Torus decompose_one(Torus &state, Torus mask_mod_b, int base_log) {
Torus res = state & mask_mod_b;
state >>= base_log;
Torus carry = ((res - 1ll) | state) & res;
carry >>= base_log - 1;
state += carry;
res -= carry << base_log;
return res;
}
#endif // CNCRT_CRPYTO_H

View File

@@ -0,0 +1,74 @@
#ifndef CNCRT_GGSW_CUH
#define CNCRT_GGSW_CUH
#include "device.h"
#include "fft/bnsmfft.cuh"
#include "polynomial/parameters.cuh"
template <typename T, typename ST, class params, sharedMemDegree SMD>
__global__ void device_batch_fft_ggsw_vector(double2 *dest, T *src,
int8_t *device_mem) {
extern __shared__ int8_t sharedmem[];
double2 *selected_memory;
if constexpr (SMD == FULLSM)
selected_memory = (double2 *)sharedmem;
else
selected_memory = (double2 *)device_mem[blockIdx.x * params::degree];
// Compression
int offset = blockIdx.x * blockDim.x;
int tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
ST x = src[(tid) + params::opt * offset];
ST y = src[(tid + params::degree / 2) + params::opt * offset];
selected_memory[tid].x = x / (double)std::numeric_limits<T>::max();
selected_memory[tid].y = y / (double)std::numeric_limits<T>::max();
tid += params::degree / params::opt;
}
synchronize_threads_in_block();
// Switch to the FFT space
NSMFFT_direct<HalfDegree<params>>(selected_memory);
synchronize_threads_in_block();
// Write the output to global memory
tid = threadIdx.x;
#pragma unroll
for (int j = 0; j < params::opt / 2; j++) {
dest[tid + (params::opt >> 1) * offset] = selected_memory[tid];
tid += params::degree / params::opt;
}
}
/**
* Applies the FFT transform on sequence of GGSW ciphertexts already in the
* global memory
*/
template <typename T, typename ST, class params>
void batch_fft_ggsw_vector(cuda_stream_t *stream, double2 *dest, T *src,
int8_t *d_mem, uint32_t r, uint32_t glwe_dim,
uint32_t polynomial_size, uint32_t level_count,
uint32_t gpu_index, uint32_t max_shared_memory) {
cudaSetDevice(stream->gpu_index);
int shared_memory_size = sizeof(double) * polynomial_size;
int gridSize = r * (glwe_dim + 1) * (glwe_dim + 1) * level_count;
int blockSize = polynomial_size / params::opt;
if (max_shared_memory < shared_memory_size) {
device_batch_fft_ggsw_vector<T, ST, params, NOSM>
<<<gridSize, blockSize, 0, stream->stream>>>(dest, src, d_mem);
} else {
device_batch_fft_ggsw_vector<T, ST, params, FULLSM>
<<<gridSize, blockSize, shared_memory_size, stream->stream>>>(dest, src,
d_mem);
}
check_cuda_error(cudaGetLastError());
}
#endif // CNCRT_GGSW_CUH

View File

@@ -0,0 +1,48 @@
#include "keyswitch.cuh"
#include "keyswitch.h"
#include <cstdint>
/* Perform keyswitch on a batch of 32 bits input LWE ciphertexts.
* Head out to the equivalent operation on 64 bits for more details.
*/
void cuda_keyswitch_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lwe_array_in, void *lwe_input_indexes, void *ksk,
uint32_t lwe_dimension_in, uint32_t lwe_dimension_out, uint32_t base_log,
uint32_t level_count, uint32_t num_samples) {
cuda_keyswitch_lwe_ciphertext_vector(
stream, static_cast<uint32_t *>(lwe_array_out),
static_cast<uint32_t *>(lwe_output_indexes),
static_cast<uint32_t *>(lwe_array_in),
static_cast<uint32_t *>(lwe_input_indexes), static_cast<uint32_t *>(ksk),
lwe_dimension_in, lwe_dimension_out, base_log, level_count, num_samples);
}
/* Perform keyswitch on a batch of 64 bits input LWE ciphertexts.
*
* - `v_stream` is a void pointer to the Cuda stream to be used in the kernel
* launch
* - `gpu_index` is the index of the GPU to be used in the kernel launch
* - lwe_array_out: output batch of num_samples keyswitched ciphertexts c =
* (a0,..an-1,b) where n is the output LWE dimension (lwe_dimension_out)
* - lwe_array_in: input batch of num_samples LWE ciphertexts, containing
* lwe_dimension_in mask values + 1 body value
* - ksk: the keyswitch key to be used in the operation
* - base log: the log of the base used in the decomposition (should be the one
* used to create the ksk)
*
* This function calls a wrapper to a device kernel that performs the keyswitch
* - num_samples blocks of threads are launched
*/
void cuda_keyswitch_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lwe_array_in, void *lwe_input_indexes, void *ksk,
uint32_t lwe_dimension_in, uint32_t lwe_dimension_out, uint32_t base_log,
uint32_t level_count, uint32_t num_samples) {
cuda_keyswitch_lwe_ciphertext_vector(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_output_indexes),
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(lwe_input_indexes), static_cast<uint64_t *>(ksk),
lwe_dimension_in, lwe_dimension_out, base_log, level_count, num_samples);
}

View File

@@ -0,0 +1,144 @@
#ifndef CNCRT_KS_CUH
#define CNCRT_KS_CUH
#include "device.h"
#include "gadget.cuh"
#include "polynomial/polynomial_math.cuh"
#include "torus.cuh"
#include <thread>
#include <vector>
template <typename Torus>
__device__ Torus *get_ith_block(Torus *ksk, int i, int level,
uint32_t lwe_dimension_out,
uint32_t level_count) {
int pos = i * level_count * (lwe_dimension_out + 1) +
level * (lwe_dimension_out + 1);
Torus *ptr = &ksk[pos];
return ptr;
}
/*
* keyswitch kernel
* Each thread handles a piece of the following equation:
* $$GLWE_s2(\Delta.m+e) = (0,0,..,0,b) - \sum_{i=0,k-1} <Dec(a_i),
* (GLWE_s2(s1_i q/beta),..,GLWE(s1_i q/beta^l)>$$ where k is the dimension of
* the GLWE ciphertext. If the polynomial dimension in GLWE is > 1, this
* equation is solved for each polynomial coefficient. where Dec denotes the
* decomposition with base beta and l levels and the inner product is done
* between the decomposition of a_i and l GLWE encryptions of s1_i q/\beta^j,
* with j in [1,l] We obtain a GLWE encryption of Delta.m (with Delta the
* scaling factor) under key s2 instead of s1, with an increased noise
*
*/
template <typename Torus>
__global__ void
keyswitch(Torus *lwe_array_out, Torus *lwe_output_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, Torus *ksk, uint32_t lwe_dimension_in,
uint32_t lwe_dimension_out, uint32_t base_log, uint32_t level_count,
int lwe_lower, int lwe_upper, int cutoff) {
int tid = threadIdx.x;
extern __shared__ int8_t sharedmem[];
Torus *local_lwe_array_out = (Torus *)sharedmem;
auto block_lwe_array_in = get_chunk(
lwe_array_in, lwe_input_indexes[blockIdx.x], lwe_dimension_in + 1);
auto block_lwe_array_out = get_chunk(
lwe_array_out, lwe_output_indexes[blockIdx.x], lwe_dimension_out + 1);
auto gadget = GadgetMatrixSingle<Torus>(base_log, level_count);
int lwe_part_per_thd;
if (tid < cutoff) {
lwe_part_per_thd = lwe_upper;
} else {
lwe_part_per_thd = lwe_lower;
}
__syncthreads();
for (int k = 0; k < lwe_part_per_thd; k++) {
int idx = tid + k * blockDim.x;
local_lwe_array_out[idx] = 0;
}
__syncthreads();
if (tid == 0) {
local_lwe_array_out[lwe_dimension_out] =
block_lwe_array_in[lwe_dimension_in];
}
for (int i = 0; i < lwe_dimension_in; i++) {
__syncthreads();
Torus a_i =
round_to_closest_multiple(block_lwe_array_in[i], base_log, level_count);
Torus state = a_i >> (sizeof(Torus) * 8 - base_log * level_count);
Torus mask_mod_b = (1ll << base_log) - 1ll;
for (int j = 0; j < level_count; j++) {
auto ksk_block = get_ith_block(ksk, i, j, lwe_dimension_out, level_count);
Torus decomposed = decompose_one<Torus>(state, mask_mod_b, base_log);
for (int k = 0; k < lwe_part_per_thd; k++) {
int idx = tid + k * blockDim.x;
local_lwe_array_out[idx] -= (Torus)ksk_block[idx] * decomposed;
}
}
}
for (int k = 0; k < lwe_part_per_thd; k++) {
int idx = tid + k * blockDim.x;
block_lwe_array_out[idx] = local_lwe_array_out[idx];
}
}
/// assume lwe_array_in in the gpu
template <typename Torus>
__host__ void cuda_keyswitch_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lwe_array_in, Torus *lwe_input_indexes, Torus *ksk,
uint32_t lwe_dimension_in, uint32_t lwe_dimension_out, uint32_t base_log,
uint32_t level_count, uint32_t num_samples) {
cudaSetDevice(stream->gpu_index);
constexpr int ideal_threads = 128;
int lwe_dim = lwe_dimension_out + 1;
int lwe_lower, lwe_upper, cutoff;
if (lwe_dim % ideal_threads == 0) {
lwe_lower = lwe_dim / ideal_threads;
lwe_upper = lwe_dim / ideal_threads;
cutoff = 0;
} else {
int y =
ceil((double)lwe_dim / (double)ideal_threads) * ideal_threads - lwe_dim;
cutoff = ideal_threads - y;
lwe_lower = lwe_dim / ideal_threads;
lwe_upper = (int)ceil((double)lwe_dim / (double)ideal_threads);
}
int lwe_size_after = (lwe_dimension_out + 1) * num_samples;
int shared_mem = sizeof(Torus) * (lwe_dimension_out + 1);
cuda_memset_async(lwe_array_out, 0, sizeof(Torus) * lwe_size_after, stream);
check_cuda_error(cudaGetLastError());
dim3 grid(num_samples, 1, 1);
dim3 threads(ideal_threads, 1, 1);
// cudaFuncSetAttribute(keyswitch<Torus>,
// cudaFuncAttributeMaxDynamicSharedMemorySize,
// shared_mem);
keyswitch<<<grid, threads, shared_mem, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lwe_array_in, lwe_input_indexes, ksk,
lwe_dimension_in, lwe_dimension_out, base_log, level_count, lwe_lower,
lwe_upper, cutoff);
check_cuda_error(cudaGetLastError());
}
#endif

View File

@@ -0,0 +1,74 @@
#ifndef CNCRT_TORUS_CUH
#define CNCRT_TORUS_CUH
#include "types/int128.cuh"
#include <limits>
template <typename T>
__device__ inline void typecast_double_to_torus(double x, T &r) {
r = T(x);
}
template <>
__device__ inline void typecast_double_to_torus<uint32_t>(double x,
uint32_t &r) {
r = __double2uint_rn(x);
}
template <>
__device__ inline void typecast_double_to_torus<uint64_t>(double x,
uint64_t &r) {
// The ull intrinsic does not behave in the same way on all architectures and
// on some platforms this causes the cmux tree test to fail
// Hence the intrinsic is not used here
uint128 nnnn = make_uint128_from_float(x);
uint64_t lll = nnnn.lo_;
r = lll;
}
template <typename T>
__device__ inline T round_to_closest_multiple(T x, uint32_t base_log,
uint32_t level_count) {
T shift = sizeof(T) * 8 - level_count * base_log;
T mask = 1ll << (shift - 1);
T b = (x & mask) >> (shift - 1);
T res = x >> shift;
res += b;
res <<= shift;
return res;
}
template <typename T>
__device__ __forceinline__ void rescale_torus_element(T element, T &output,
uint32_t log_shift) {
output =
round((double)element / (double(std::numeric_limits<T>::max()) + 1.0) *
(double)log_shift);
}
template <typename T>
__device__ __forceinline__ T rescale_torus_element(T element,
uint32_t log_shift) {
return round((double)element / (double(std::numeric_limits<T>::max()) + 1.0) *
(double)log_shift);
}
template <>
__device__ __forceinline__ void
rescale_torus_element<uint32_t>(uint32_t element, uint32_t &output,
uint32_t log_shift) {
output =
round(__uint2double_rn(element) /
(__uint2double_rn(std::numeric_limits<uint32_t>::max()) + 1.0) *
__uint2double_rn(log_shift));
}
template <>
__device__ __forceinline__ void
rescale_torus_element<uint64_t>(uint64_t element, uint64_t &output,
uint32_t log_shift) {
output = round(__ull2double_rn(element) /
(__ull2double_rn(std::numeric_limits<uint64_t>::max()) + 1.0) *
__uint2double_rn(log_shift));
}
#endif // CNCRT_TORUS_H

View File

@@ -0,0 +1,350 @@
#include "device.h"
#include <cstdint>
#include <cuda_runtime.h>
/// Unsafe function to create a CUDA stream, must check first that GPU exists
cuda_stream_t *cuda_create_stream(uint32_t gpu_index) {
cudaSetDevice(gpu_index);
cuda_stream_t *stream = new cuda_stream_t(gpu_index);
return stream;
}
/// Unsafe function to destroy CUDA stream, must check first the GPU exists
int cuda_destroy_stream(cuda_stream_t *stream) {
stream->release();
return 0;
}
/// Unsafe function that will try to allocate even if gpu_index is invalid
/// or if there's not enough memory. A safe wrapper around it must call
/// cuda_check_valid_malloc() first
void *cuda_malloc(uint64_t size, uint32_t gpu_index) {
cudaSetDevice(gpu_index);
void *ptr;
cudaMalloc((void **)&ptr, size);
check_cuda_error(cudaGetLastError());
return ptr;
}
/// Allocates a size-byte array at the device memory. Tries to do it
/// asynchronously.
void *cuda_malloc_async(uint64_t size, cuda_stream_t *stream) {
cudaSetDevice(stream->gpu_index);
void *ptr;
#ifndef CUDART_VERSION
#error CUDART_VERSION Undefined!
#elif (CUDART_VERSION >= 11020)
int support_async_alloc;
check_cuda_error(cudaDeviceGetAttribute(&support_async_alloc,
cudaDevAttrMemoryPoolsSupported,
stream->gpu_index));
if (support_async_alloc) {
check_cuda_error(cudaMallocAsync((void **)&ptr, size, stream->stream));
} else {
check_cuda_error(cudaMalloc((void **)&ptr, size));
}
#else
check_cuda_error(cudaMalloc((void **)&ptr, size));
#endif
return ptr;
}
/// Checks that allocation is valid
/// 0: valid
/// -1: invalid, not enough memory in device
/// -2: invalid, gpu index doesn't exist
int cuda_check_valid_malloc(uint64_t size, uint32_t gpu_index) {
if (gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaSetDevice(gpu_index);
size_t total_mem, free_mem;
cudaMemGetInfo(&free_mem, &total_mem);
if (size > free_mem) {
// error code: not enough memory
return -1;
}
return 0;
}
/// Returns
/// -> 0 if Cooperative Groups is not supported.
/// -> 1 otherwise
int cuda_check_support_cooperative_groups() {
int cooperative_groups_supported = 0;
cudaDeviceGetAttribute(&cooperative_groups_supported,
cudaDevAttrCooperativeLaunch, 0);
return cooperative_groups_supported > 0;
}
/// Tries to copy memory to the GPU asynchronously
/// 0: success
/// -1: error, invalid device pointer
/// -2: error, gpu index doesn't exist
/// -3: error, zero copy size
int cuda_memcpy_async_to_gpu(void *dest, void *src, uint64_t size,
cuda_stream_t *stream) {
if (size == 0) {
// error code: zero copy size
return -3;
}
if (stream->gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, dest);
if (attr.device != stream->gpu_index && attr.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
cudaSetDevice(stream->gpu_index);
check_cuda_error(
cudaMemcpyAsync(dest, src, size, cudaMemcpyHostToDevice, stream->stream));
return 0;
}
/// Tries to copy memory to the GPU synchronously
/// 0: success
/// -1: error, invalid device pointer
/// -2: error, gpu index doesn't exist
/// -3: error, zero copy size
int cuda_memcpy_to_gpu(void *dest, void *src, uint64_t size) {
if (size == 0) {
// error code: zero copy size
return -3;
}
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, dest);
if (attr.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
check_cuda_error(cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice));
return 0;
}
/// Tries to copy memory to the CPU synchronously
/// 0: success
/// -1: error, invalid device pointer
/// -2: error, gpu index doesn't exist
/// -3: error, zero copy size
int cuda_memcpy_to_cpu(void *dest, void *src, uint64_t size) {
if (size == 0) {
// error code: zero copy size
return -3;
}
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, src);
if (attr.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
check_cuda_error(cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost));
return 0;
}
/// Tries to copy memory within a GPU asynchronously
/// 0: success
/// -1: error, invalid device pointer
/// -2: error, gpu index doesn't exist
/// -3: error, zero copy size
int cuda_memcpy_async_gpu_to_gpu(void *dest, void *src, uint64_t size,
cuda_stream_t *stream) {
if (size == 0) {
// error code: zero copy size
return -3;
}
if (stream->gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaPointerAttributes attr_dest;
cudaPointerGetAttributes(&attr_dest, dest);
if (attr_dest.device != stream->gpu_index &&
attr_dest.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
cudaPointerAttributes attr_src;
cudaPointerGetAttributes(&attr_src, src);
if (attr_src.device != stream->gpu_index &&
attr_src.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
if (attr_src.device != attr_dest.device) {
// error code: different devices
return -1;
}
cudaSetDevice(stream->gpu_index);
check_cuda_error(cudaMemcpyAsync(dest, src, size, cudaMemcpyDeviceToDevice,
stream->stream));
return 0;
}
/// Synchronizes device
/// 0: success
/// -2: error, gpu index doesn't exist
int cuda_synchronize_device(uint32_t gpu_index) {
if (gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaSetDevice(gpu_index);
cudaDeviceSynchronize();
return 0;
}
int cuda_memset_async(void *dest, uint64_t val, uint64_t size,
cuda_stream_t *stream) {
if (size == 0) {
// error code: zero copy size
return -3;
}
if (stream->gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, dest);
if (attr.device != stream->gpu_index && attr.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
cudaSetDevice(stream->gpu_index);
check_cuda_error(cudaMemsetAsync(dest, val, size, stream->stream));
return 0;
}
template <typename Torus>
__global__ void cuda_set_value_kernel(Torus *array, Torus value, Torus n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
array[index] = value;
}
template <typename Torus>
void cuda_set_value_async(cudaStream_t *stream, Torus *d_array, Torus value,
Torus n) {
int block_size = 256;
int num_blocks = (n + block_size - 1) / block_size;
// Launch the kernel
cuda_set_value_kernel<<<num_blocks, block_size, 0, *stream>>>(d_array, value,
n);
}
/// Explicitly instantiate cuda_set_value_async for 32 and 64 bits
template void cuda_set_value_async(cudaStream_t *stream, uint64_t *d_array,
uint64_t value, uint64_t n);
template void cuda_set_value_async(cudaStream_t *stream, uint32_t *d_array,
uint32_t value, uint32_t n);
/// Tries to copy memory to the GPU asynchronously
/// 0: success
/// -1: error, invalid device pointer
/// -2: error, gpu index doesn't exist
/// -3: error, zero copy size
int cuda_memcpy_async_to_cpu(void *dest, const void *src, uint64_t size,
cuda_stream_t *stream) {
if (size == 0) {
// error code: zero copy size
return -3;
}
if (stream->gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, src);
if (attr.device != stream->gpu_index && attr.type != cudaMemoryTypeDevice) {
// error code: invalid device pointer
return -1;
}
cudaSetDevice(stream->gpu_index);
check_cuda_error(
cudaMemcpyAsync(dest, src, size, cudaMemcpyDeviceToHost, stream->stream));
return 0;
}
/// Return number of GPUs available
int cuda_get_number_of_gpus() {
int num_gpus;
cudaGetDeviceCount(&num_gpus);
return num_gpus;
}
/// Drop a cuda array
int cuda_drop(void *ptr, uint32_t gpu_index) {
if (gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaSetDevice(gpu_index);
check_cuda_error(cudaFree(ptr));
return 0;
}
/// Drop a cuda array. Tries to do it asynchronously
int cuda_drop_async(void *ptr, cuda_stream_t *stream) {
cudaSetDevice(stream->gpu_index);
#ifndef CUDART_VERSION
#error CUDART_VERSION Undefined!
#elif (CUDART_VERSION >= 11020)
int support_async_alloc;
check_cuda_error(cudaDeviceGetAttribute(&support_async_alloc,
cudaDevAttrMemoryPoolsSupported,
stream->gpu_index));
if (support_async_alloc) {
check_cuda_error(cudaFreeAsync(ptr, stream->stream));
} else {
check_cuda_error(cudaFree(ptr));
}
#else
check_cuda_error(cudaFree(ptr));
#endif
return 0;
}
/// Get the maximum size for the shared memory
int cuda_get_max_shared_memory(uint32_t gpu_index) {
if (gpu_index >= cuda_get_number_of_gpus()) {
// error code: invalid gpu_index
return -2;
}
cudaSetDevice(gpu_index);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, gpu_index);
int max_shared_memory = 0;
if (prop.major >= 6) {
max_shared_memory = prop.sharedMemPerMultiprocessor;
} else {
max_shared_memory = prop.sharedMemPerBlock;
}
return max_shared_memory;
}
int cuda_synchronize_stream(cuda_stream_t *stream) {
stream->synchronize();
return 0;
}

View File

@@ -0,0 +1,725 @@
#ifndef GPU_BOOTSTRAP_FFT_CUH
#define GPU_BOOTSTRAP_FFT_CUH
#include "polynomial/functions.cuh"
#include "polynomial/parameters.cuh"
#include "twiddles.cuh"
#include "types/complex/operations.cuh"
/*
* Direct negacyclic FFT:
* - before the FFT the N real coefficients are stored into a
* N/2 sized complex with the even coefficients in the real part
* and the odd coefficients in the imaginary part. This is referred to
* as the half-size FFT
* - when calling BNSMFFT_direct for the forward negacyclic FFT of PBS,
* opt is divided by 2 because the butterfly pattern is always applied
* between pairs of coefficients
* - instead of twisting each coefficient A_j before the FFT by
* multiplying by the w^j roots of unity (aka twiddles, w=exp(-i pi /N)),
* the FFT is modified, and for each level k of the FFT the twiddle:
* w_j,k = exp(-i pi j/2^k)
* is replaced with:
* \zeta_j,k = exp(-i pi (2j-1)/2^k)
*/
template <class params> __device__ void NSMFFT_direct(double2 *A) {
/* We don't make bit reverse here, since twiddles are already reversed
* Each thread is always in charge of "opt/2" pairs of coefficients,
* which is why we always loop through N/2 by N/opt strides
* The pragma unroll instruction tells the compiler to unroll the
* full loop, which should increase performance
*/
size_t tid = threadIdx.x;
size_t twid_id;
size_t i1, i2;
double2 u, v, w;
// level 1
// we don't make actual complex multiplication on level1 since we have only
// one twiddle, it's real and image parts are equal, so we can multiply
// it with simpler operations
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
i1 = tid;
i2 = tid + params::degree / 2;
u = A[i1];
v = A[i2] * (double2){0.707106781186547461715008466854,
0.707106781186547461715008466854};
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 2
// from this level there are more than one twiddles and none of them has equal
// real and imag parts, so complete complex multiplication is needed
// for each level params::degree / 2^level represents number of coefficients
// inside divided chunk of specific level
//
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 4);
i1 = 2 * (params::degree / 4) * twid_id + (tid & (params::degree / 4 - 1));
i2 = i1 + params::degree / 4;
w = negtwiddles[twid_id + 2];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 3
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 8);
i1 = 2 * (params::degree / 8) * twid_id + (tid & (params::degree / 8 - 1));
i2 = i1 + params::degree / 8;
w = negtwiddles[twid_id + 4];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 4
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 16);
i1 =
2 * (params::degree / 16) * twid_id + (tid & (params::degree / 16 - 1));
i2 = i1 + params::degree / 16;
w = negtwiddles[twid_id + 8];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 5
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 32);
i1 =
2 * (params::degree / 32) * twid_id + (tid & (params::degree / 32 - 1));
i2 = i1 + params::degree / 32;
w = negtwiddles[twid_id + 16];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 6
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 64);
i1 =
2 * (params::degree / 64) * twid_id + (tid & (params::degree / 64 - 1));
i2 = i1 + params::degree / 64;
w = negtwiddles[twid_id + 32];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// level 7
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 128);
i1 = 2 * (params::degree / 128) * twid_id +
(tid & (params::degree / 128 - 1));
i2 = i1 + params::degree / 128;
w = negtwiddles[twid_id + 64];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
// from level 8, we need to check size of params degree, because we support
// minimum actual polynomial size = 256, when compressed size is halfed and
// minimum supported compressed size is 128, so we always need first 7
// levels of butterfy operation, since butterfly levels are hardcoded
// we need to check if polynomial size is big enough to require specific level
// of butterfly.
if constexpr (params::degree >= 256) {
// level 8
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 256);
i1 = 2 * (params::degree / 256) * twid_id +
(tid & (params::degree / 256 - 1));
i2 = i1 + params::degree / 256;
w = negtwiddles[twid_id + 128];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 512) {
// level 9
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 512);
i1 = 2 * (params::degree / 512) * twid_id +
(tid & (params::degree / 512 - 1));
i2 = i1 + params::degree / 512;
w = negtwiddles[twid_id + 256];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 1024) {
// level 10
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 1024);
i1 = 2 * (params::degree / 1024) * twid_id +
(tid & (params::degree / 1024 - 1));
i2 = i1 + params::degree / 1024;
w = negtwiddles[twid_id + 512];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 2048) {
// level 11
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 2048);
i1 = 2 * (params::degree / 2048) * twid_id +
(tid & (params::degree / 2048 - 1));
i2 = i1 + params::degree / 2048;
w = negtwiddles[twid_id + 1024];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 4096) {
// level 12
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 4096);
i1 = 2 * (params::degree / 4096) * twid_id +
(tid & (params::degree / 4096 - 1));
i2 = i1 + params::degree / 4096;
w = negtwiddles[twid_id + 2048];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
// compressed size = 8192 is actual polynomial size = 16384.
// from this size, twiddles can't fit in constant memory,
// so from here, butterfly operation access device memory.
if constexpr (params::degree >= 8192) {
// level 13
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 8192);
i1 = 2 * (params::degree / 8192) * twid_id +
(tid & (params::degree / 8192 - 1));
i2 = i1 + params::degree / 8192;
w = negtwiddles13[twid_id];
u = A[i1];
v = A[i2] * w;
A[i1] += v;
A[i2] = u - v;
tid += params::degree / params::opt;
}
__syncthreads();
}
}
/*
* negacyclic inverse fft
*/
template <class params> __device__ void NSMFFT_inverse(double2 *A) {
/* We don't make bit reverse here, since twiddles are already reversed
* Each thread is always in charge of "opt/2" pairs of coefficients,
* which is why we always loop through N/2 by N/opt strides
* The pragma unroll instruction tells the compiler to unroll the
* full loop, which should increase performance
*/
size_t tid = threadIdx.x;
size_t twid_id;
size_t i1, i2;
double2 u, w;
// divide input by compressed polynomial size
tid = threadIdx.x;
for (size_t i = 0; i < params::opt; ++i) {
A[tid] /= params::degree;
tid += params::degree / params::opt;
}
__syncthreads();
// none of the twiddles have equal real and imag part, so
// complete complex multiplication has to be done
// here we have more than one twiddle
// mapping in backward fft is reversed
// butterfly operation is started from last level
// compressed size = 8192 is actual polynomial size = 16384.
// twiddles for this size can't fit in constant memory so
// butterfly operation for this level acess device memory to fetch
// twiddles
if constexpr (params::degree >= 8192) {
// level 13
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 8192);
i1 = 2 * (params::degree / 8192) * twid_id +
(tid & (params::degree / 8192 - 1));
i2 = i1 + params::degree / 8192;
w = negtwiddles13[twid_id];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 4096) {
// level 12
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 4096);
i1 = 2 * (params::degree / 4096) * twid_id +
(tid & (params::degree / 4096 - 1));
i2 = i1 + params::degree / 4096;
w = negtwiddles[twid_id + 2048];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 2048) {
// level 11
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 2048);
i1 = 2 * (params::degree / 2048) * twid_id +
(tid & (params::degree / 2048 - 1));
i2 = i1 + params::degree / 2048;
w = negtwiddles[twid_id + 1024];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 1024) {
// level 10
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 1024);
i1 = 2 * (params::degree / 1024) * twid_id +
(tid & (params::degree / 1024 - 1));
i2 = i1 + params::degree / 1024;
w = negtwiddles[twid_id + 512];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 512) {
// level 9
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 512);
i1 = 2 * (params::degree / 512) * twid_id +
(tid & (params::degree / 512 - 1));
i2 = i1 + params::degree / 512;
w = negtwiddles[twid_id + 256];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
if constexpr (params::degree >= 256) {
// level 8
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 256);
i1 = 2 * (params::degree / 256) * twid_id +
(tid & (params::degree / 256 - 1));
i2 = i1 + params::degree / 256;
w = negtwiddles[twid_id + 128];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
// below level 8, we don't need to check size of params degree, because we
// support minimum actual polynomial size = 256, when compressed size is
// halfed and minimum supported compressed size is 128, so we always need
// last 7 levels of butterfy operation, since butterfly levels are hardcoded
// we don't need to check if polynomial size is big enough to require
// specific level of butterfly.
// level 7
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 128);
i1 = 2 * (params::degree / 128) * twid_id +
(tid & (params::degree / 128 - 1));
i2 = i1 + params::degree / 128;
w = negtwiddles[twid_id + 64];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 6
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 64);
i1 =
2 * (params::degree / 64) * twid_id + (tid & (params::degree / 64 - 1));
i2 = i1 + params::degree / 64;
w = negtwiddles[twid_id + 32];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 5
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 32);
i1 =
2 * (params::degree / 32) * twid_id + (tid & (params::degree / 32 - 1));
i2 = i1 + params::degree / 32;
w = negtwiddles[twid_id + 16];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 4
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 16);
i1 =
2 * (params::degree / 16) * twid_id + (tid & (params::degree / 16 - 1));
i2 = i1 + params::degree / 16;
w = negtwiddles[twid_id + 8];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 3
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 8);
i1 = 2 * (params::degree / 8) * twid_id + (tid & (params::degree / 8 - 1));
i2 = i1 + params::degree / 8;
w = negtwiddles[twid_id + 4];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 2
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 4);
i1 = 2 * (params::degree / 4) * twid_id + (tid & (params::degree / 4 - 1));
i2 = i1 + params::degree / 4;
w = negtwiddles[twid_id + 2];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
// level 1
tid = threadIdx.x;
#pragma unroll
for (size_t i = 0; i < params::opt / 2; ++i) {
twid_id = tid / (params::degree / 2);
i1 = 2 * (params::degree / 2) * twid_id + (tid & (params::degree / 2 - 1));
i2 = i1 + params::degree / 2;
w = negtwiddles[twid_id + 1];
u = A[i1] - A[i2];
A[i1] += A[i2];
A[i2] = u * conjugate(w);
tid += params::degree / params::opt;
}
__syncthreads();
}
/*
* global batch fft
* does fft in half size
* unrolling half size fft result in half size + 1 elements
* this function must be called with actual degree
* function takes as input already compressed input
*/
template <class params, sharedMemDegree SMD>
__global__ void batch_NSMFFT(double2 *d_input, double2 *d_output,
double2 *buffer) {
extern __shared__ double2 sharedMemoryFFT[];
double2 *fft = (SMD == NOSM) ? &buffer[blockIdx.x * params::degree / 2]
: sharedMemoryFFT;
int tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
fft[tid] = d_input[blockIdx.x * (params::degree / 2) + tid];
tid = tid + params::degree / params::opt;
}
__syncthreads();
NSMFFT_direct<HalfDegree<params>>(fft);
__syncthreads();
tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
d_output[blockIdx.x * (params::degree / 2) + tid] = fft[tid];
tid = tid + params::degree / params::opt;
}
}
/*
* global batch polynomial multiplication
* only used for fft tests
* d_input1 and d_output must not have the same pointer
* d_input1 can be modified inside the function
*/
template <class params, sharedMemDegree SMD>
__global__ void batch_polynomial_mul(double2 *d_input1, double2 *d_input2,
double2 *d_output, double2 *buffer) {
extern __shared__ double2 sharedMemoryFFT[];
double2 *fft = (SMD == NOSM) ? &buffer[blockIdx.x * params::degree / 2]
: sharedMemoryFFT;
// Move first polynomial into shared memory(if possible otherwise it will
// be moved in device buffer)
int tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
fft[tid] = d_input1[blockIdx.x * (params::degree / 2) + tid];
tid = tid + params::degree / params::opt;
}
// Perform direct negacyclic fourier transform
__syncthreads();
NSMFFT_direct<HalfDegree<params>>(fft);
__syncthreads();
// Put the result of direct fft inside input1
tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
d_input1[blockIdx.x * (params::degree / 2) + tid] = fft[tid];
tid = tid + params::degree / params::opt;
}
__syncthreads();
// Move first polynomial into shared memory(if possible otherwise it will
// be moved in device buffer)
tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
fft[tid] = d_input2[blockIdx.x * (params::degree / 2) + tid];
tid = tid + params::degree / params::opt;
}
// Perform direct negacyclic fourier transform on the second polynomial
__syncthreads();
NSMFFT_direct<HalfDegree<params>>(fft);
__syncthreads();
// calculate pointwise multiplication inside fft buffer
tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
fft[tid] *= d_input1[blockIdx.x * (params::degree / 2) + tid];
tid = tid + params::degree / params::opt;
}
// Perform backward negacyclic fourier transform
__syncthreads();
NSMFFT_inverse<HalfDegree<params>>(fft);
__syncthreads();
// copy results in output buffer
tid = threadIdx.x;
#pragma unroll
for (int i = 0; i < params::opt / 2; i++) {
d_output[blockIdx.x * (params::degree / 2) + tid] = fft[tid];
tid = tid + params::degree / params::opt;
}
}
#endif // GPU_BOOTSTRAP_FFT_CUH

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,13 @@
#ifndef GPU_BOOTSTRAP_TWIDDLES_CUH
#define GPU_BOOTSTRAP_TWIDDLES_CUH
/*
* 'negtwiddles' are stored in constant memory for faster access times
* because of it's limitied size, only twiddles for up to 2^12 polynomial size
* can be stored there, twiddles for 2^13 are stored in device memory
* 'negtwiddles13'
*/
extern __constant__ double2 negtwiddles[4096];
extern __device__ double2 negtwiddles13[4096];
#endif

View File

@@ -0,0 +1,51 @@
#include "integer/bitwise_ops.cuh"
void scratch_cuda_integer_radix_bitop_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, BITOP_TYPE op_type,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_bitop_kb<uint64_t>(
stream, (int_bitop_buffer<uint64_t> **)mem_ptr, lwe_ciphertext_count,
params, op_type, allocate_gpu_memory);
}
void cuda_bitop_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_1,
void *lwe_array_2, int8_t *mem_ptr, void *bsk, void *ksk,
uint32_t lwe_ciphertext_count) {
host_integer_radix_bitop_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2),
(int_bitop_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count);
}
void cuda_bitnot_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
int8_t *mem_ptr, void *bsk, void *ksk, uint32_t lwe_ciphertext_count) {
host_integer_radix_bitnot_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_in),
(int_bitop_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count);
}
void cleanup_cuda_integer_bitop(cuda_stream_t *stream, int8_t **mem_ptr_void) {
int_bitop_buffer<uint64_t> *mem_ptr =
(int_bitop_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,51 @@
#ifndef CUDA_INTEGER_BITWISE_OPS_CUH
#define CUDA_INTEGER_BITWISE_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "polynomial/functions.cuh"
#include "utils/kernel_dimensions.cuh"
#include <omp.h>
template <typename Torus>
__host__ void
host_integer_radix_bitop_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_1, Torus *lwe_array_2,
int_bitop_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto lut = mem_ptr->lut;
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, lwe_array_out, lwe_array_1, lwe_array_2, bsk, ksk,
num_radix_blocks, lut);
}
template <typename Torus>
__host__ void
host_integer_radix_bitnot_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_in,
int_bitop_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto lut = mem_ptr->lut;
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, lwe_array_in, bsk, ksk, num_radix_blocks, lut);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_bitop_kb(
cuda_stream_t *stream, int_bitop_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, BITOP_TYPE op,
bool allocate_gpu_memory) {
*mem_ptr = new int_bitop_buffer<Torus>(stream, op, params, num_radix_blocks,
allocate_gpu_memory);
}
#endif

View File

@@ -0,0 +1,45 @@
#include "integer/cmux.cuh"
void scratch_cuda_integer_radix_cmux_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
std::function<uint64_t(uint64_t)> predicate_lut_f =
[](uint64_t x) -> uint64_t { return x == 1; };
scratch_cuda_integer_radix_cmux_kb(
stream, (int_cmux_buffer<uint64_t> **)mem_ptr, predicate_lut_f,
lwe_ciphertext_count, params, allocate_gpu_memory);
}
void cuda_cmux_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_condition,
void *lwe_array_true, void *lwe_array_false, int8_t *mem_ptr, void *bsk,
void *ksk, uint32_t lwe_ciphertext_count) {
host_integer_radix_cmux_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_condition),
static_cast<uint64_t *>(lwe_array_true),
static_cast<uint64_t *>(lwe_array_false),
(int_cmux_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count);
}
void cleanup_cuda_integer_radix_cmux(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_cmux_buffer<uint64_t> *mem_ptr =
(int_cmux_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,100 @@
#ifndef CUDA_INTEGER_CMUX_CUH
#define CUDA_INTEGER_CMUX_CUH
#include "integer.cuh"
#include <omp.h>
template <typename Torus>
__host__ void zero_out_if(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_input, Torus *lwe_condition,
int_zero_out_if_buffer<Torus> *mem_ptr,
int_radix_lut<Torus> *predicate, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = mem_ptr->params;
int big_lwe_size = params.big_lwe_dimension + 1;
// Left message is shifted
int num_blocks = 0, num_threads = 0;
int num_entries = (params.big_lwe_dimension + 1);
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
// We can't use integer_radix_apply_bivariate_lookup_table_kb since the
// second operand is fixed
auto tmp_lwe_array_input = mem_ptr->tmp;
for (int i = 0; i < num_radix_blocks; i++) {
auto lwe_array_out_block = tmp_lwe_array_input + i * big_lwe_size;
auto lwe_array_input_block = lwe_array_input + i * big_lwe_size;
device_pack_bivariate_blocks<<<num_blocks, num_threads, 0,
stream->stream>>>(
lwe_array_out_block, lwe_array_input_block, lwe_condition,
predicate->lwe_indexes, params.big_lwe_dimension,
params.message_modulus, 1);
check_cuda_error(cudaGetLastError());
}
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, tmp_lwe_array_input, bsk, ksk, num_radix_blocks,
predicate);
}
template <typename Torus>
__host__ void
host_integer_radix_cmux_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_condition, Torus *lwe_array_true,
Torus *lwe_array_false,
int_cmux_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = mem_ptr->params;
// Since our CPU threads will be working on different streams we shall assert
// the work in the main stream is completed
stream->synchronize();
auto true_stream = mem_ptr->zero_if_true_buffer->local_stream;
auto false_stream = mem_ptr->zero_if_false_buffer->local_stream;
#pragma omp parallel sections
{
// Both sections may be executed in parallel
#pragma omp section
{
auto mem_true = mem_ptr->zero_if_true_buffer;
zero_out_if(true_stream, mem_ptr->tmp_true_ct, lwe_array_true,
lwe_condition, mem_true, mem_ptr->inverted_predicate_lut, bsk,
ksk, num_radix_blocks);
}
#pragma omp section
{
auto mem_false = mem_ptr->zero_if_false_buffer;
zero_out_if(false_stream, mem_ptr->tmp_false_ct, lwe_array_false,
lwe_condition, mem_false, mem_ptr->predicate_lut, bsk, ksk,
num_radix_blocks);
}
}
cuda_synchronize_stream(true_stream);
cuda_synchronize_stream(false_stream);
// If the condition was true, true_ct will have kept its value and false_ct
// will be 0 If the condition was false, true_ct will be 0 and false_ct will
// have kept its value
auto added_cts = mem_ptr->tmp_true_ct;
host_addition(stream, added_cts, mem_ptr->tmp_true_ct, mem_ptr->tmp_false_ct,
params.big_lwe_dimension, num_radix_blocks);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, added_cts, bsk, ksk, num_radix_blocks,
mem_ptr->message_extract_lut);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_cmux_kb(
cuda_stream_t *stream, int_cmux_buffer<Torus> **mem_ptr,
std::function<Torus(Torus)> predicate_lut_f, uint32_t num_radix_blocks,
int_radix_params params, bool allocate_gpu_memory) {
*mem_ptr = new int_cmux_buffer<Torus>(stream, predicate_lut_f, params,
num_radix_blocks, allocate_gpu_memory);
}
#endif

View File

@@ -0,0 +1,83 @@
#include "integer/comparison.cuh"
void scratch_cuda_integer_radix_comparison_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, COMPARISON_TYPE op_type,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
switch (op_type) {
case EQ:
case NE:
scratch_cuda_integer_radix_equality_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr,
lwe_ciphertext_count, params, op_type, allocate_gpu_memory);
break;
case GT:
case GE:
case LT:
case LE:
case MAX:
case MIN:
scratch_cuda_integer_radix_difference_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr,
lwe_ciphertext_count, params, op_type, allocate_gpu_memory);
break;
}
}
void cuda_comparison_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_1,
void *lwe_array_2, int8_t *mem_ptr, void *bsk, void *ksk,
uint32_t lwe_ciphertext_count) {
int_comparison_buffer<uint64_t> *buffer =
(int_comparison_buffer<uint64_t> *)mem_ptr;
switch (buffer->op) {
case EQ:
case NE:
host_integer_radix_equality_check_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer, bsk,
static_cast<uint64_t *>(ksk), lwe_ciphertext_count);
break;
case GT:
case GE:
case LT:
case LE:
host_integer_radix_difference_check_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer,
buffer->diff_buffer->operator_f, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count);
break;
case MAX:
case MIN:
host_integer_radix_maxmin_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer, bsk,
static_cast<uint64_t *>(ksk), lwe_ciphertext_count);
break;
default:
printf("Not implemented\n");
}
}
void cleanup_cuda_integer_comparison(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_comparison_buffer<uint64_t> *mem_ptr =
(int_comparison_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,468 @@
#ifndef CUDA_INTEGER_COMPARISON_OPS_CUH
#define CUDA_INTEGER_COMPARISON_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "integer/cmux.cuh"
#include "integer/negation.cuh"
#include "integer/scalar_addition.cuh"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/kernel_dimensions.cuh"
// lwe_dimension + 1 threads
// todo: This kernel MUST be refactored to a binary reduction
template <typename Torus>
__global__ void device_accumulate_all_blocks(Torus *output, Torus *input_block,
uint32_t lwe_dimension,
uint32_t num_blocks) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < lwe_dimension + 1) {
auto block = &input_block[idx];
Torus sum = block[0];
for (int i = 1; i < num_blocks; i++) {
sum += block[i * (lwe_dimension + 1)];
}
output[idx] = sum;
}
}
template <typename Torus>
__host__ void accumulate_all_blocks(cuda_stream_t *stream, Torus *output,
Torus *input, uint32_t lwe_dimension,
uint32_t num_radix_blocks) {
int num_blocks = 0, num_threads = 0;
int num_entries = (lwe_dimension + 1);
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
// Add all blocks and store in sum
device_accumulate_all_blocks<<<num_blocks, num_threads, 0, stream->stream>>>(
output, input, lwe_dimension, num_radix_blocks);
check_cuda_error(cudaGetLastError());
}
template <typename Torus>
__host__ void
are_all_comparisons_block_true(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_in,
int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
auto are_all_block_true_buffer =
mem_ptr->eq_buffer->are_all_block_true_buffer;
uint32_t total_modulus = message_modulus * carry_modulus;
uint32_t max_value = total_modulus - 1;
cuda_memcpy_async_gpu_to_gpu(
lwe_array_out, lwe_array_in,
num_radix_blocks * (big_lwe_dimension + 1) * sizeof(Torus), stream);
int lut_num_blocks = 0;
uint32_t remaining_blocks = num_radix_blocks;
while (remaining_blocks > 1) {
// Split in max_value chunks
uint32_t chunk_length = std::min(max_value, remaining_blocks);
int num_chunks = remaining_blocks / chunk_length;
// Since all blocks encrypt either 0 or 1, we can sum max_value of them
// as in the worst case we will be adding `max_value` ones
auto input_blocks = lwe_array_out;
auto accumulator = are_all_block_true_buffer->tmp_block_accumulated;
for (int i = 0; i < num_chunks; i++) {
accumulate_all_blocks(stream, accumulator, input_blocks,
big_lwe_dimension, chunk_length);
accumulator += (big_lwe_dimension + 1);
remaining_blocks -= (chunk_length - 1);
input_blocks += (big_lwe_dimension + 1) * chunk_length;
}
accumulator = are_all_block_true_buffer->tmp_block_accumulated;
// Selects a LUT
int_radix_lut<Torus> *lut;
if (are_all_block_true_buffer->op == COMPARISON_TYPE::NE) {
// is_non_zero_lut_buffer LUT
lut = mem_ptr->eq_buffer->is_non_zero_lut;
} else if (chunk_length == max_value) {
// is_max_value LUT
lut = are_all_block_true_buffer->is_max_value_lut;
} else {
// is_equal_to_num_blocks LUT
lut = are_all_block_true_buffer->is_equal_to_num_blocks_lut;
if (chunk_length != lut_num_blocks) {
auto is_equal_to_num_blocks_lut_f = [max_value,
chunk_length](Torus x) -> Torus {
return (x & max_value) == chunk_length;
};
generate_device_accumulator<Torus>(
stream, lut->lut, glwe_dimension, polynomial_size, message_modulus,
carry_modulus, is_equal_to_num_blocks_lut_f);
// We don't have to generate this lut again
lut_num_blocks = chunk_length;
}
}
// Applies the LUT
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, accumulator, bsk, ksk, num_chunks, lut);
}
}
// This takes an input slice of blocks.
//
// Each block can encrypt any value as long as its < message_modulus.
//
// It will compare blocks with 0, for either equality or difference.
//
// This returns a Vec of block, where each block encrypts 1 or 0
// depending of if all blocks matched with the comparison type with 0.
//
// E.g. For ZeroComparisonType::Equality, if all input blocks are zero
// than all returned block will encrypt 1
//
// The returned Vec will have less block than the number of input blocks.
// The returned blocks potentially needs to be 'reduced' to one block
// with eg are_all_comparisons_block_true.
//
// This function exists because sometimes it is faster to concatenate
// multiple vec of 'boolean' shortint block before reducing them with
// are_all_comparisons_block_true
template <typename Torus>
__host__ void host_compare_with_zero_equality(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
int_comparison_buffer<Torus> *mem_ptr, void *bsk, Torus *ksk,
int32_t num_radix_blocks) {
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
// The idea is that we will sum chunks of blocks until carries are full
// then we compare the sum with 0.
//
// If all blocks were 0, the sum will be zero
// If at least one bock was not zero, the sum won't be zero
uint32_t total_modulus = message_modulus * carry_modulus;
uint32_t message_max = message_modulus - 1;
uint32_t num_elements_to_fill_carry = (total_modulus - 1) / message_max;
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
int num_sum_blocks = 0;
// Accumulator
auto sum = lwe_array_out;
if (num_radix_blocks == 1) {
// Just copy
cuda_memcpy_async_gpu_to_gpu(sum, lwe_array_in, big_lwe_size_bytes, stream);
num_sum_blocks = 1;
} else {
uint32_t remainder_blocks = num_radix_blocks;
auto sum_i = sum;
auto chunk = lwe_array_in;
while (remainder_blocks > 1) {
uint32_t chunk_size =
std::min(remainder_blocks, num_elements_to_fill_carry);
accumulate_all_blocks(stream, sum_i, chunk, big_lwe_dimension,
chunk_size);
num_sum_blocks++;
remainder_blocks -= (chunk_size - 1);
// Update operands
chunk += chunk_size * big_lwe_size;
sum_i += big_lwe_size;
}
}
auto is_equal_to_zero_lut = mem_ptr->diff_buffer->is_zero_lut;
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, sum, sum, bsk, ksk, num_sum_blocks, is_equal_to_zero_lut);
are_all_comparisons_block_true(stream, lwe_array_out, sum, mem_ptr, bsk, ksk,
num_sum_blocks);
// The result will be in the two first block. Everything else is
// garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (num_radix_blocks - 1), stream);
}
template <typename Torus>
__host__ void host_integer_radix_equality_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_1,
Torus *lwe_array_2, int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto eq_buffer = mem_ptr->eq_buffer;
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
// Applies the LUT for the comparison operation
auto comparisons = mem_ptr->tmp_block_comparisons;
integer_radix_apply_bivariate_lookup_table_kb(
stream, comparisons, lwe_array_1, lwe_array_2, bsk, ksk, num_radix_blocks,
eq_buffer->operator_lut);
// This takes a Vec of blocks, where each block is either 0 or 1.
//
// It return a block encrypting 1 if all input blocks are 1
// otherwise the block encrypts 0
are_all_comparisons_block_true(stream, lwe_array_out, comparisons, mem_ptr,
bsk, ksk, num_radix_blocks);
// Zero all blocks but the first
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (num_radix_blocks - 1), stream);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_equality_check_kb(
cuda_stream_t *stream, int_comparison_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, COMPARISON_TYPE op,
bool allocate_gpu_memory) {
*mem_ptr = new int_comparison_buffer<Torus>(
stream, op, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void
compare_radix_blocks_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_left, Torus *lwe_array_right,
int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
// When rhs > lhs, the subtraction will overflow, and the bit of padding will
// be set to 1
// meaning that the output of the pbs will be the negative (modulo message
// space)
//
// Example:
// lhs: 1, rhs: 3, message modulus: 4, carry modulus 4
// lhs - rhs = -2 % (4 * 4) = 14 = 1|1110 (padding_bit|b4b3b2b1)
// Since there was an overflow the bit of padding is 1 and not 0.
// When applying the LUT for an input value of 14 we would expect 1,
// but since the bit of padding is 1, we will get -1 modulus our message
// space, so (-1) % (4 * 4) = 15 = 1|1111 We then add one and get 0 = 0|0000
// Subtract
// Here we need the true lwe sub, not the one that comes from shortint.
host_subtraction(stream, lwe_array_out, lwe_array_left, lwe_array_right,
big_lwe_dimension, num_radix_blocks);
// Apply LUT to compare to 0
auto is_non_zero_lut = mem_ptr->eq_buffer->is_non_zero_lut;
integer_radix_apply_univariate_lookup_table_kb(
stream, lwe_array_out, lwe_array_out, bsk, ksk, num_radix_blocks,
is_non_zero_lut);
// Add one
// Here Lhs can have the following values: (-1) % (message modulus * carry
// modulus), 0, 1 So the output values after the addition will be: 0, 1, 2
host_integer_radix_add_scalar_one_inplace(stream, lwe_array_out,
big_lwe_dimension, num_radix_blocks,
message_modulus, carry_modulus);
}
// Reduces a vec containing shortint blocks that encrypts a sign
// (inferior, equal, superior) to one single shortint block containing the
// final sign
template <typename Torus>
__host__ void
tree_sign_reduction(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_block_comparisons,
int_tree_sign_reduction_buffer<Torus> *tree_buffer,
std::function<Torus(Torus)> sign_handler_f, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = tree_buffer->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
// Tree reduction
// Reduces a vec containing shortint blocks that encrypts a sign
// (inferior, equal, superior) to one single shortint block containing the
// final sign
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
auto x = tree_buffer->tmp_x;
auto y = tree_buffer->tmp_y;
if (x != lwe_block_comparisons)
cuda_memcpy_async_gpu_to_gpu(x, lwe_block_comparisons,
big_lwe_size_bytes * num_radix_blocks, stream);
uint32_t partial_block_count = num_radix_blocks;
auto inner_tree_leaf = tree_buffer->tree_inner_leaf_lut;
while (partial_block_count > 2) {
pack_blocks(stream, y, x, big_lwe_dimension, partial_block_count, 4);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, x, y, bsk, ksk, partial_block_count >> 1, inner_tree_leaf);
if ((partial_block_count % 2) != 0) {
partial_block_count >>= 1;
partial_block_count++;
auto last_y_block = y + (partial_block_count - 1) * big_lwe_size;
auto last_x_block = x + (partial_block_count - 1) * big_lwe_size;
cuda_memcpy_async_gpu_to_gpu(last_x_block, last_y_block,
big_lwe_size_bytes, stream);
} else {
partial_block_count >>= 1;
}
}
auto last_lut = tree_buffer->tree_last_leaf_lut;
auto block_selector_f = tree_buffer->block_selector_f;
std::function<Torus(Torus)> f;
if (partial_block_count == 2) {
pack_blocks(stream, y, x, big_lwe_dimension, partial_block_count, 4);
f = [block_selector_f, sign_handler_f](Torus x) -> Torus {
int msb = (x >> 2) & 3;
int lsb = x & 3;
int final_sign = block_selector_f(msb, lsb);
return sign_handler_f(final_sign);
};
} else {
// partial_block_count == 1
y = x;
f = sign_handler_f;
}
generate_device_accumulator<Torus>(stream, last_lut->lut, glwe_dimension,
polynomial_size, message_modulus,
carry_modulus, f);
// Last leaf
integer_radix_apply_univariate_lookup_table_kb(stream, lwe_array_out, y, bsk,
ksk, 1, last_lut);
}
template <typename Torus>
__host__ void host_integer_radix_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_left,
Torus *lwe_array_right, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> reduction_lut_f, void *bsk, Torus *ksk,
uint32_t total_num_radix_blocks) {
auto diff_buffer = mem_ptr->diff_buffer;
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
uint32_t num_radix_blocks = total_num_radix_blocks;
auto lhs = lwe_array_left;
auto rhs = lwe_array_right;
if (carry_modulus == message_modulus) {
// Packing is possible
// Pack inputs
Torus *packed_left = diff_buffer->tmp_packed_left;
Torus *packed_right = diff_buffer->tmp_packed_right;
pack_blocks(stream, packed_left, lwe_array_left, big_lwe_dimension,
num_radix_blocks, message_modulus);
pack_blocks(stream, packed_right, lwe_array_right, big_lwe_dimension,
num_radix_blocks, message_modulus);
// From this point we have half number of blocks
num_radix_blocks /= 2;
// Clean noise
auto cleaning_lut = mem_ptr->cleaning_lut;
integer_radix_apply_univariate_lookup_table_kb(
stream, packed_left, packed_left, bsk, ksk, num_radix_blocks,
cleaning_lut);
integer_radix_apply_univariate_lookup_table_kb(
stream, packed_right, packed_right, bsk, ksk, num_radix_blocks,
cleaning_lut);
lhs = packed_left;
rhs = packed_right;
}
// comparisons will be assigned
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto comparisons = mem_ptr->tmp_block_comparisons;
compare_radix_blocks_kb(stream, comparisons, lhs, rhs, mem_ptr, bsk, ksk,
num_radix_blocks);
// Reduces a vec containing radix blocks that encrypts a sign
// (inferior, equal, superior) to one single radix block containing the
// final sign
tree_sign_reduction(stream, lwe_array_out, comparisons,
mem_ptr->diff_buffer->tree_buffer, reduction_lut_f, bsk,
ksk, num_radix_blocks);
// The result will be in the first block. Everything else is garbage.
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes, stream);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_difference_check_kb(
cuda_stream_t *stream, int_comparison_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, COMPARISON_TYPE op,
bool allocate_gpu_memory) {
*mem_ptr = new int_comparison_buffer<Torus>(
stream, op, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void
host_integer_radix_maxmin_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_left, Torus *lwe_array_right,
int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t total_num_radix_blocks) {
// Compute the sign
host_integer_radix_difference_check_kb(
stream, mem_ptr->tmp_lwe_array_out, lwe_array_left, lwe_array_right,
mem_ptr, mem_ptr->cleaning_lut_f, bsk, ksk, total_num_radix_blocks);
// Selector
host_integer_radix_cmux_kb(
stream, lwe_array_out, mem_ptr->tmp_lwe_array_out, lwe_array_left,
lwe_array_right, mem_ptr->cmux_buffer, bsk, ksk, total_num_radix_blocks);
}
#endif

View File

@@ -0,0 +1,127 @@
#include "integer/integer.cuh"
#include <linear_algebra.h>
void cuda_full_propagation_64_inplace(
cuda_stream_t *stream, void *input_blocks, int8_t *mem_ptr, void *ksk,
void *bsk, uint32_t lwe_dimension, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t ks_base_log, uint32_t ks_level,
uint32_t pbs_base_log, uint32_t pbs_level, uint32_t grouping_factor,
uint32_t num_blocks) {
switch (polynomial_size) {
case 256:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<256>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 512:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<512>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 1024:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<1024>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 2048:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<2048>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 4096:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<4096>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 8192:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<8192>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
case 16384:
host_full_propagate_inplace<uint64_t, int64_t, AmortizedDegree<16384>>(
stream, static_cast<uint64_t *>(input_blocks),
(int_fullprop_buffer<uint64_t> *)mem_ptr, static_cast<uint64_t *>(ksk),
bsk, lwe_dimension, glwe_dimension, polynomial_size, ks_base_log,
ks_level, pbs_base_log, pbs_level, grouping_factor, num_blocks);
break;
default:
break;
}
}
void scratch_cuda_full_propagation_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t message_modulus, uint32_t carry_modulus, PBS_TYPE pbs_type,
bool allocate_gpu_memory) {
scratch_cuda_full_propagation<uint64_t>(
stream, (int_fullprop_buffer<uint64_t> **)mem_ptr, lwe_dimension,
glwe_dimension, polynomial_size, level_count, grouping_factor,
input_lwe_ciphertext_count, message_modulus, carry_modulus, pbs_type,
allocate_gpu_memory);
}
void cleanup_cuda_full_propagation(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_fullprop_buffer<uint64_t> *mem_ptr =
(int_fullprop_buffer<uint64_t> *)(*mem_ptr_void);
cuda_drop_async(mem_ptr->lut_buffer, stream);
cuda_drop_async(mem_ptr->lut_indexes, stream);
cuda_drop_async(mem_ptr->pbs_buffer, stream);
cuda_drop_async(mem_ptr->tmp_small_lwe_vector, stream);
cuda_drop_async(mem_ptr->tmp_big_lwe_vector, stream);
}
void scratch_cuda_propagate_single_carry_low_latency_kb_64_inplace(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_propagate_single_carry_low_latency_kb_inplace(
stream, (int_sc_prop_memory<uint64_t> **)mem_ptr, num_blocks, params,
allocate_gpu_memory);
}
void cuda_propagate_single_carry_low_latency_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, int8_t *mem_ptr, void *bsk,
void *ksk, uint32_t num_blocks) {
host_propagate_single_carry_low_latency<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array),
(int_sc_prop_memory<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_propagate_single_carry_low_latency(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_sc_prop_memory<uint64_t> *mem_ptr =
(int_sc_prop_memory<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,677 @@
#ifndef CUDA_INTEGER_CUH
#define CUDA_INTEGER_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.h"
#include "integer/scalar_addition.cuh"
#include "linear_algebra.h"
#include "linearalgebra/addition.cuh"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "polynomial/functions.cuh"
#include "utils/kernel_dimensions.cuh"
#include <functional>
template <typename Torus>
void execute_pbs(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_output_indexes, Torus *lut_vector,
Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, void *bootstrapping_key,
int8_t *pbs_buffer, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, PBS_TYPE pbs_type) {
if (sizeof(Torus) == sizeof(uint32_t)) {
// 32 bits
switch (pbs_type) {
case MULTI_BIT:
printf("multibit\n");
printf("Error: 32-bit multibit PBS is not supported.\n");
break;
case LOW_LAT:
cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
case AMORTIZED:
cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
default:
break;
}
} else {
// 64 bits
switch (pbs_type) {
case MULTI_BIT:
cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, grouping_factor, base_log, level_count,
input_lwe_ciphertext_count, num_luts, lwe_idx,
max_shared_memory);
break;
case LOW_LAT:
cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
case AMORTIZED:
cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
default:
break;
}
}
}
// function rotates right radix ciphertext with specific value
// grid is one dimensional
// blockIdx.x represents x_th block of radix ciphertext
template <typename Torus>
__global__ void radix_blocks_rotate_right(Torus *dst, Torus *src,
uint32_t value, uint32_t blocks_count,
uint32_t lwe_size) {
value %= blocks_count;
size_t tid = threadIdx.x;
size_t src_block_id = blockIdx.x;
size_t dst_block_id = (src_block_id + value) % blocks_count;
size_t stride = blockDim.x;
auto cur_src_block = &src[src_block_id * lwe_size];
auto cur_dst_block = &dst[dst_block_id * lwe_size];
for (size_t i = tid; i < lwe_size; i += stride) {
cur_dst_block[i] = cur_src_block[i];
}
}
// function rotates left radix ciphertext with specific value
// grid is one dimensional
// blockIdx.x represents x_th block of radix ciphertext
template <typename Torus>
__global__ void radix_blocks_rotate_left(Torus *dst, Torus *src, uint32_t value,
uint32_t blocks_count,
uint32_t lwe_size) {
value %= blocks_count;
size_t src_block_id = blockIdx.x;
size_t tid = threadIdx.x;
size_t dst_block_id = (src_block_id >= value)
? src_block_id - value
: src_block_id - value + blocks_count;
size_t stride = blockDim.x;
auto cur_src_block = &src[src_block_id * lwe_size];
auto cur_dst_block = &dst[dst_block_id * lwe_size];
for (size_t i = tid; i < lwe_size; i += stride) {
cur_dst_block[i] = cur_src_block[i];
}
}
// polynomial_size threads
template <typename Torus>
__global__ void
device_pack_bivariate_blocks(Torus *lwe_array_out, Torus *lwe_array_1,
Torus *lwe_array_2, Torus *lwe_indexes,
uint32_t lwe_dimension, uint32_t message_modulus,
uint32_t num_blocks) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < num_blocks * (lwe_dimension + 1)) {
int block_id = tid / (lwe_dimension + 1);
int coeff_id = tid % (lwe_dimension + 1);
int pos = lwe_indexes[block_id] * (lwe_dimension + 1) + coeff_id;
lwe_array_out[pos] = lwe_array_1[pos] * message_modulus + lwe_array_2[pos];
}
}
template <typename Torus>
__host__ void pack_bivariate_blocks(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_1, Torus *lwe_array_2,
Torus *lwe_indexes, uint32_t lwe_dimension,
uint32_t message_modulus,
uint32_t num_radix_blocks) {
// Left message is shifted
int num_blocks = 0, num_threads = 0;
int num_entries = num_radix_blocks * (lwe_dimension + 1);
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
device_pack_bivariate_blocks<<<num_blocks, num_threads, 0, stream->stream>>>(
lwe_array_out, lwe_array_1, lwe_array_2, lwe_indexes, lwe_dimension,
message_modulus, num_radix_blocks);
check_cuda_error(cudaGetLastError());
}
template <typename Torus>
__host__ void integer_radix_apply_univariate_lookup_table_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in, void *bsk,
Torus *ksk, uint32_t num_radix_blocks, int_radix_lut<Torus> *lut) {
// apply_lookup_table
auto params = lut->params;
auto pbs_type = params.pbs_type;
auto big_lwe_dimension = params.big_lwe_dimension;
auto small_lwe_dimension = params.small_lwe_dimension;
auto ks_level = params.ks_level;
auto ks_base_log = params.ks_base_log;
auto pbs_level = params.pbs_level;
auto pbs_base_log = params.pbs_base_log;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto grouping_factor = params.grouping_factor;
// Compute Keyswitch-PBS
cuda_keyswitch_lwe_ciphertext_vector(
stream, lut->tmp_lwe_after_ks, lut->lwe_indexes, lwe_array_in,
lut->lwe_indexes, ksk, big_lwe_dimension, small_lwe_dimension,
ks_base_log, ks_level, num_radix_blocks);
execute_pbs(stream, lwe_array_out, lut->lwe_indexes, lut->lut,
lut->lut_indexes, lut->tmp_lwe_after_ks, lut->lwe_indexes, bsk,
lut->pbs_buffer, glwe_dimension, small_lwe_dimension,
polynomial_size, pbs_base_log, pbs_level, grouping_factor,
num_radix_blocks, 1, 0,
cuda_get_max_shared_memory(stream->gpu_index), pbs_type);
}
template <typename Torus>
__host__ void integer_radix_apply_bivariate_lookup_table_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_1,
Torus *lwe_array_2, void *bsk, Torus *ksk, uint32_t num_radix_blocks,
int_radix_lut<Torus> *lut) {
// apply_lookup_table_bivariate
auto params = lut->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto message_modulus = params.message_modulus;
// Left message is shifted
pack_bivariate_blocks(stream, lut->tmp_lwe_before_ks, lwe_array_1,
lwe_array_2, lut->lwe_indexes, big_lwe_dimension,
message_modulus, num_radix_blocks);
check_cuda_error(cudaGetLastError());
// Apply LUT
integer_radix_apply_univariate_lookup_table_kb(stream, lwe_array_out,
lut->tmp_lwe_before_ks, bsk,
ksk, num_radix_blocks, lut);
}
// Rotates the slice in-place such that the first mid elements of the slice move
// to the end while the last array_length elements move to the front. After
// calling rotate_left, the element previously at index mid will become the
// first element in the slice.
template <typename Torus>
void rotate_left(Torus *buffer, int mid, uint32_t array_length) {
mid = mid % array_length;
std::rotate(buffer, buffer + mid, buffer + array_length);
}
template <typename Torus>
void generate_lookup_table(Torus *acc, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t message_modulus,
uint32_t carry_modulus,
std::function<Torus(Torus)> f) {
uint32_t modulus_sup = message_modulus * carry_modulus;
uint32_t box_size = polynomial_size / modulus_sup;
Torus delta = (1ul << 63) / modulus_sup;
memset(acc, 0, glwe_dimension * polynomial_size * sizeof(Torus));
auto body = &acc[glwe_dimension * polynomial_size];
// This accumulator extracts the carry bits
for (int i = 0; i < modulus_sup; i++) {
int index = i * box_size;
for (int j = index; j < index + box_size; j++) {
auto f_eval = f(i);
body[j] = f_eval * delta;
}
}
int half_box_size = box_size / 2;
// Negate the first half_box_size coefficients
for (int i = 0; i < half_box_size; i++) {
body[i] = -body[i];
}
rotate_left(body, half_box_size, polynomial_size);
}
template <typename Torus>
void generate_lookup_table_bivariate(Torus *acc, uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t message_modulus,
uint32_t carry_modulus,
std::function<Torus(Torus, Torus)> f) {
Torus factor_u64 = message_modulus;
auto wrapped_f = [factor_u64, message_modulus, f](Torus input) -> Torus {
Torus lhs = (input / factor_u64) % message_modulus;
Torus rhs = (input % factor_u64) % message_modulus;
return f(lhs, rhs);
};
generate_lookup_table<Torus>(acc, glwe_dimension, polynomial_size,
message_modulus, carry_modulus, wrapped_f);
}
/*
* generate bivariate accumulator for device pointer
* v_stream - cuda stream
* acc - device pointer for bivariate accumulator
* ...
* f - wrapping function with two Torus inputs
*/
template <typename Torus>
void generate_device_accumulator_bivariate(
cuda_stream_t *stream, Torus *acc_bivariate, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t message_modulus, uint32_t carry_modulus,
std::function<Torus(Torus, Torus)> f) {
// host lut
Torus *h_lut =
(Torus *)malloc((glwe_dimension + 1) * polynomial_size * sizeof(Torus));
// fill bivariate accumulator
generate_lookup_table_bivariate<Torus>(h_lut, glwe_dimension, polynomial_size,
message_modulus, carry_modulus, f);
// copy host lut and lut_indexes to device
cuda_memcpy_async_to_gpu(
acc_bivariate, h_lut,
(glwe_dimension + 1) * polynomial_size * sizeof(Torus), stream);
cuda_synchronize_stream(stream);
free(h_lut);
}
/*
* generate bivariate accumulator for device pointer
* v_stream - cuda stream
* acc - device pointer for accumulator
* ...
* f - evaluating function with one Torus input
*/
template <typename Torus>
void generate_device_accumulator(cuda_stream_t *stream, Torus *acc,
uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t message_modulus,
uint32_t carry_modulus,
std::function<Torus(Torus)> f) {
// host lut
Torus *h_lut =
(Torus *)malloc((glwe_dimension + 1) * polynomial_size * sizeof(Torus));
// fill accumulator
generate_lookup_table<Torus>(h_lut, glwe_dimension, polynomial_size,
message_modulus, carry_modulus, f);
// copy host lut and lut_indexes to device
cuda_memcpy_async_to_gpu(
acc, h_lut, (glwe_dimension + 1) * polynomial_size * sizeof(Torus),
stream);
cuda_synchronize_stream(stream);
free(h_lut);
}
template <typename Torus>
void scratch_cuda_propagate_single_carry_low_latency_kb_inplace(
cuda_stream_t *stream, int_sc_prop_memory<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
bool allocate_gpu_memory) {
*mem_ptr = new int_sc_prop_memory<Torus>(stream, params, num_radix_blocks,
allocate_gpu_memory);
}
template <typename Torus>
void host_propagate_single_carry_low_latency(cuda_stream_t *stream,
Torus *lwe_array,
int_sc_prop_memory<Torus> *mem,
void *bsk, Torus *ksk,
uint32_t num_blocks) {
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto big_lwe_size = glwe_dimension * polynomial_size + 1;
auto big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
auto generates_or_propagates = mem->generates_or_propagates;
auto step_output = mem->step_output;
auto luts_array = mem->luts_array;
auto luts_carry_propagation_sum = mem->luts_carry_propagation_sum;
auto message_acc = mem->message_acc;
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, generates_or_propagates, lwe_array, bsk, ksk, num_blocks,
luts_array);
// compute prefix sum with hillis&steele
int num_steps = ceil(log2((double)num_blocks));
int space = 1;
cuda_memcpy_async_gpu_to_gpu(step_output, generates_or_propagates,
big_lwe_size_bytes * num_blocks, stream);
for (int step = 0; step < num_steps; step++) {
auto cur_blocks = &step_output[space * big_lwe_size];
auto prev_blocks = generates_or_propagates;
int cur_total_blocks = num_blocks - space;
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, cur_blocks, cur_blocks, prev_blocks, bsk, ksk, cur_total_blocks,
luts_carry_propagation_sum);
cuda_memcpy_async_gpu_to_gpu(&generates_or_propagates[space * big_lwe_size],
cur_blocks,
big_lwe_size_bytes * cur_total_blocks, stream);
space *= 2;
}
radix_blocks_rotate_right<<<num_blocks, 256, 0, stream->stream>>>(
step_output, generates_or_propagates, 1, num_blocks, big_lwe_size);
cuda_memset_async(step_output, 0, big_lwe_size_bytes, stream);
host_addition(stream, lwe_array, lwe_array, step_output,
glwe_dimension * polynomial_size, num_blocks);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array, lwe_array, bsk, ksk, num_blocks, message_acc);
}
/*
* input_blocks: input radix ciphertext propagation will happen inplace
* acc_message_carry: list of two lut s, [(message_acc), (carry_acc)]
* lut_indexes_message_carry: lut_indexes for message and carry, should always be {0, 1}
* small_lwe_vector: output of keyswitch should have
* size = 2 * (lwe_dimension + 1) * sizeof(Torus)
* big_lwe_vector: output of pbs should have
* size = 2 * (glwe_dimension * polynomial_size + 1) * sizeof(Torus)
*/
template <typename Torus, typename STorus, class params>
void host_full_propagate_inplace(cuda_stream_t *stream, Torus *input_blocks,
int_fullprop_buffer<Torus> *mem_ptr,
Torus *ksk, void *bsk, uint32_t lwe_dimension,
uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t ks_base_log,
uint32_t ks_level, uint32_t pbs_base_log,
uint32_t pbs_level, uint32_t grouping_factor,
uint32_t num_blocks) {
int big_lwe_size = (glwe_dimension * polynomial_size + 1);
int small_lwe_size = (lwe_dimension + 1);
for (int i = 0; i < num_blocks; i++) {
auto cur_input_block = &input_blocks[i * big_lwe_size];
cuda_keyswitch_lwe_ciphertext_vector<Torus>(
stream, mem_ptr->tmp_small_lwe_vector, mem_ptr->lwe_indexes,
cur_input_block, mem_ptr->lwe_indexes, ksk,
polynomial_size * glwe_dimension, lwe_dimension, ks_base_log, ks_level,
1);
cuda_memcpy_async_gpu_to_gpu(&mem_ptr->tmp_small_lwe_vector[small_lwe_size],
mem_ptr->tmp_small_lwe_vector,
small_lwe_size * sizeof(Torus), stream);
execute_pbs<Torus>(
stream, mem_ptr->tmp_big_lwe_vector, mem_ptr->lwe_indexes,
mem_ptr->lut_buffer, mem_ptr->lut_indexes,
mem_ptr->tmp_small_lwe_vector, mem_ptr->lwe_indexes, bsk,
mem_ptr->pbs_buffer, glwe_dimension, lwe_dimension, polynomial_size,
pbs_base_log, pbs_level, grouping_factor, 2, 2, 0,
cuda_get_max_shared_memory(stream->gpu_index), mem_ptr->pbs_type);
cuda_memcpy_async_gpu_to_gpu(cur_input_block, mem_ptr->tmp_big_lwe_vector,
big_lwe_size * sizeof(Torus), stream);
if (i < num_blocks - 1) {
auto next_input_block = &input_blocks[(i + 1) * big_lwe_size];
host_addition(stream, next_input_block, next_input_block,
&mem_ptr->tmp_big_lwe_vector[big_lwe_size],
glwe_dimension * polynomial_size, 1);
}
}
}
template <typename Torus>
void scratch_cuda_full_propagation(
cuda_stream_t *stream, int_fullprop_buffer<Torus> **mem_ptr,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t pbs_level, uint32_t grouping_factor, uint32_t num_radix_blocks,
uint32_t message_modulus, uint32_t carry_modulus, PBS_TYPE pbs_type,
bool allocate_gpu_memory) {
// PBS
int8_t *pbs_buffer;
if (pbs_type == MULTI_BIT) {
uint32_t lwe_chunk_size = get_average_lwe_chunk_size(
lwe_dimension, pbs_level, glwe_dimension, num_radix_blocks);
// Only 64 bits is supported
scratch_cuda_multi_bit_pbs_64(stream, &pbs_buffer, lwe_dimension,
glwe_dimension, polynomial_size, pbs_level,
grouping_factor, num_radix_blocks,
cuda_get_max_shared_memory(stream->gpu_index),
allocate_gpu_memory, lwe_chunk_size);
} else {
// Classic
// We only use low latency for classic mode
if (sizeof(Torus) == sizeof(uint32_t))
scratch_cuda_bootstrap_low_latency_32(
stream, &pbs_buffer, glwe_dimension, polynomial_size, pbs_level,
num_radix_blocks, cuda_get_max_shared_memory(stream->gpu_index),
allocate_gpu_memory);
else
scratch_cuda_bootstrap_low_latency_64(
stream, &pbs_buffer, glwe_dimension, polynomial_size, pbs_level,
num_radix_blocks, cuda_get_max_shared_memory(stream->gpu_index),
allocate_gpu_memory);
}
// LUT
Torus *lut_buffer;
if (allocate_gpu_memory) {
// LUT is used as a trivial encryption, so we only allocate memory for the
// body
Torus lut_buffer_size =
2 * (glwe_dimension + 1) * polynomial_size * sizeof(Torus);
lut_buffer = (Torus *)cuda_malloc_async(lut_buffer_size, stream);
// LUTs
auto lut_f_message = [message_modulus](Torus x) -> Torus {
return x % message_modulus;
};
auto lut_f_carry = [message_modulus](Torus x) -> Torus {
return x / message_modulus;
};
//
Torus *lut_buffer_message = lut_buffer;
Torus *lut_buffer_carry =
lut_buffer + (glwe_dimension + 1) * polynomial_size;
generate_device_accumulator<Torus>(
stream, lut_buffer_message, glwe_dimension, polynomial_size,
message_modulus, carry_modulus, lut_f_message);
generate_device_accumulator<Torus>(stream, lut_buffer_carry, glwe_dimension,
polynomial_size, message_modulus,
carry_modulus, lut_f_carry);
}
Torus *lut_indexes;
if (allocate_gpu_memory) {
lut_indexes = (Torus *)cuda_malloc_async(2 * sizeof(Torus), stream);
Torus h_lut_indexes[2] = {0, 1};
cuda_memcpy_async_to_gpu(lut_indexes, h_lut_indexes, 2 * sizeof(Torus),
stream);
}
Torus *lwe_indexes;
if (allocate_gpu_memory) {
Torus lwe_indexes_size = num_radix_blocks * sizeof(Torus);
lwe_indexes = (Torus *)cuda_malloc_async(lwe_indexes_size, stream);
Torus *h_lwe_indexes = (Torus *)malloc(lwe_indexes_size);
for (int i = 0; i < num_radix_blocks; i++)
h_lwe_indexes[i] = i;
cuda_memcpy_async_to_gpu(lwe_indexes, h_lwe_indexes, lwe_indexes_size,
stream);
cuda_synchronize_stream(stream);
free(h_lwe_indexes);
}
// Temporary arrays
Torus *small_lwe_vector;
Torus *big_lwe_vector;
if (allocate_gpu_memory) {
Torus small_vector_size = 2 * (lwe_dimension + 1) * sizeof(Torus);
Torus big_vector_size =
2 * (glwe_dimension * polynomial_size + 1) * sizeof(Torus);
small_lwe_vector = (Torus *)cuda_malloc_async(small_vector_size, stream);
big_lwe_vector = (Torus *)cuda_malloc_async(big_vector_size, stream);
}
*mem_ptr = new int_fullprop_buffer<Torus>;
(*mem_ptr)->pbs_type = pbs_type;
(*mem_ptr)->pbs_buffer = pbs_buffer;
(*mem_ptr)->lut_buffer = lut_buffer;
(*mem_ptr)->lut_indexes = lut_indexes;
(*mem_ptr)->lwe_indexes = lwe_indexes;
(*mem_ptr)->tmp_small_lwe_vector = small_lwe_vector;
(*mem_ptr)->tmp_big_lwe_vector = big_lwe_vector;
}
// (lwe_dimension+1) threads
// (num_radix_blocks / 2) thread blocks
template <typename Torus>
__global__ void device_pack_blocks(Torus *lwe_array_out, Torus *lwe_array_in,
uint32_t lwe_dimension,
uint32_t num_radix_blocks, uint32_t factor) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < (lwe_dimension + 1)) {
for (int bid = 0; bid < (num_radix_blocks / 2); bid++) {
Torus *lsb_block = lwe_array_in + (2 * bid) * (lwe_dimension + 1);
Torus *msb_block = lsb_block + (lwe_dimension + 1);
Torus *packed_block = lwe_array_out + bid * (lwe_dimension + 1);
packed_block[tid] = lsb_block[tid] + factor * msb_block[tid];
}
if (num_radix_blocks % 2 != 0) {
// We couldn't pack the last block, so we just copy it
Torus *lsb_block =
lwe_array_in + (num_radix_blocks - 1) * (lwe_dimension + 1);
Torus *last_block =
lwe_array_out + (num_radix_blocks / 2) * (lwe_dimension + 1);
last_block[tid] = lsb_block[tid];
}
}
}
// Packs the low ciphertext in the message parts of the high ciphertext
// and moves the high ciphertext into the carry part.
//
// This requires the block parameters to have enough room for two ciphertexts,
// so at least as many carry modulus as the message modulus
//
// Expects the carry buffer to be empty
template <typename Torus>
__host__ void pack_blocks(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_in, uint32_t lwe_dimension,
uint32_t num_radix_blocks, uint32_t factor) {
assert(lwe_array_out != lwe_array_in);
int num_blocks = 0, num_threads = 0;
int num_entries = (lwe_dimension + 1);
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
device_pack_blocks<<<num_blocks, num_threads, 0, stream->stream>>>(
lwe_array_out, lwe_array_in, lwe_dimension, num_radix_blocks, factor);
}
template <typename Torus>
__global__ void
device_create_trivial_radix(Torus *lwe_array, Torus *scalar_input,
int32_t num_blocks, uint32_t lwe_dimension,
uint64_t delta) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < num_blocks) {
Torus scalar = scalar_input[tid];
Torus *body = lwe_array + tid * (lwe_dimension + 1) + lwe_dimension;
*body = scalar * delta;
}
}
template <typename Torus>
__host__ void
create_trivial_radix(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *scalar_array, uint32_t lwe_dimension,
uint32_t num_radix_blocks, uint32_t num_scalar_blocks,
uint64_t message_modulus, uint64_t carry_modulus) {
size_t radix_size = (lwe_dimension + 1) * num_radix_blocks;
cuda_memset_async(lwe_array_out, 0, radix_size * sizeof(Torus), stream);
if (num_scalar_blocks == 0)
return;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = num_scalar_blocks;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
// Value of the shift we multiply our messages by
// If message_modulus and carry_modulus are always powers of 2 we can simplify
// this
uint64_t delta = ((uint64_t)1 << 63) / (message_modulus * carry_modulus);
device_create_trivial_radix<<<grid, thds, 0, stream->stream>>>(
lwe_array_out, scalar_array, num_scalar_blocks, lwe_dimension, delta);
check_cuda_error(cudaGetLastError());
}
#endif // TFHE_RS_INTERNAL_INTEGER_CUH

View File

@@ -0,0 +1,107 @@
#include "integer/multiplication.cuh"
/*
* This scratch function allocates the necessary amount of data on the GPU for
* the integer radix multiplication in keyswitch->bootstrap order.
*/
void scratch_cuda_integer_mult_radix_ciphertext_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t message_modulus,
uint32_t carry_modulus, uint32_t glwe_dimension, uint32_t lwe_dimension,
uint32_t polynomial_size, uint32_t pbs_base_log, uint32_t pbs_level,
uint32_t ks_base_log, uint32_t ks_level, uint32_t grouping_factor,
uint32_t num_radix_blocks, PBS_TYPE pbs_type, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
polynomial_size, lwe_dimension, ks_level, ks_base_log,
pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
switch (polynomial_size) {
case 2048:
scratch_cuda_integer_mult_radix_ciphertext_kb<uint64_t>(
stream, (int_mul_memory<uint64_t> **)mem_ptr, num_radix_blocks, params,
allocate_gpu_memory);
break;
default:
break;
}
}
/*
* Computes a multiplication between two 64 bit radix lwe ciphertexts
* encrypting integer values. keyswitch -> bootstrap pattern is used, function
* works for single pair of radix ciphertexts, 'v_stream' can be used for
* parallelization
* - 'v_stream' is a void pointer to the Cuda stream to be used in the kernel
* launch
* - 'gpu_index' is the index of the GPU to be used in the kernel launch
* - 'radix_lwe_out' is 64 bit radix big lwe ciphertext, product of
* multiplication
* - 'radix_lwe_left' left radix big lwe ciphertext
* - 'radix_lwe_right' right radix big lwe ciphertext
* - 'bsk' bootstrapping key in fourier domain
* - 'ksk' keyswitching key
* - 'mem_ptr'
* - 'message_modulus' message_modulus
* - 'carry_modulus' carry_modulus
* - 'glwe_dimension' glwe_dimension
* - 'lwe_dimension' is the dimension of small lwe ciphertext
* - 'polynomial_size' polynomial size
* - 'pbs_base_log' base log used in the pbs
* - 'pbs_level' decomposition level count used in the pbs
* - 'ks_level' decomposition level count used in the keyswitch
* - 'num_blocks' is the number of big lwe ciphertext blocks inside radix
* ciphertext
* - 'pbs_type' selects which PBS implementation should be used
* - 'max_shared_memory' maximum shared memory per cuda block
*/
void cuda_integer_mult_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *radix_lwe_out, void *radix_lwe_left,
void *radix_lwe_right, void *bsk, void *ksk, int8_t *mem_ptr,
uint32_t message_modulus, uint32_t carry_modulus, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t pbs_base_log,
uint32_t pbs_level, uint32_t ks_base_log, uint32_t ks_level,
uint32_t grouping_factor, uint32_t num_blocks, PBS_TYPE pbs_type,
uint32_t max_shared_memory) {
switch (polynomial_size) {
case 2048:
host_integer_mult_radix_kb<uint64_t, int64_t, AmortizedDegree<2048>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), (int_mul_memory<uint64_t> *)mem_ptr,
num_blocks);
break;
default:
break;
}
}
void cleanup_cuda_integer_mult(cuda_stream_t *stream, int8_t **mem_ptr_void) {
int_mul_memory<uint64_t> *mem_ptr =
(int_mul_memory<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}
void cuda_small_scalar_multiplication_integer_radix_ciphertext_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint64_t scalar,
uint32_t lwe_dimension, uint32_t lwe_ciphertext_count) {
cuda_small_scalar_multiplication_integer_radix_ciphertext_64(
stream, lwe_array, lwe_array, scalar, lwe_dimension,
lwe_ciphertext_count);
}
void cuda_small_scalar_multiplication_integer_radix_ciphertext_64(
cuda_stream_t *stream, void *output_lwe_array, void *input_lwe_array,
uint64_t scalar, uint32_t lwe_dimension, uint32_t lwe_ciphertext_count) {
host_integer_small_scalar_mult_radix(
stream, static_cast<uint64_t *>(output_lwe_array),
static_cast<uint64_t *>(input_lwe_array), scalar, lwe_dimension,
lwe_ciphertext_count);
}

View File

@@ -0,0 +1,639 @@
#ifndef CUDA_INTEGER_MULT_CUH
#define CUDA_INTEGER_MULT_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "bootstrap.h"
#include "bootstrap_multibit.h"
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.h"
#include "integer/integer.cuh"
#include "linear_algebra.h"
#include "pbs/bootstrap_amortized.cuh"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#include <fstream>
#include <iostream>
#include <omp.h>
#include <sstream>
#include <string>
#include <vector>
template <typename Torus, class params>
__global__ void
all_shifted_lhs_rhs(Torus *radix_lwe_left, Torus *lsb_ciphertext,
Torus *msb_ciphertext, Torus *radix_lwe_right,
Torus *lsb_rhs, Torus *msb_rhs, int num_blocks) {
size_t block_id = blockIdx.x;
double D = sqrt((2 * num_blocks + 1) * (2 * num_blocks + 1) - 8 * block_id);
size_t radix_id = int((2 * num_blocks + 1 - D) / 2.);
size_t local_block_id =
block_id - (2 * num_blocks - radix_id + 1) / 2. * radix_id;
bool process_msb = (local_block_id < (num_blocks - radix_id - 1));
auto cur_lsb_block = &lsb_ciphertext[block_id * (params::degree + 1)];
auto cur_msb_block =
(process_msb)
? &msb_ciphertext[(block_id - radix_id) * (params::degree + 1)]
: nullptr;
auto cur_lsb_rhs_block = &lsb_rhs[block_id * (params::degree + 1)];
auto cur_msb_rhs_block =
(process_msb) ? &msb_rhs[(block_id - radix_id) * (params::degree + 1)]
: nullptr;
auto cur_ct_right = &radix_lwe_right[radix_id * (params::degree + 1)];
auto cur_src = &radix_lwe_left[local_block_id * (params::degree + 1)];
size_t tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
Torus value = cur_src[tid];
if (process_msb) {
cur_lsb_block[tid] = cur_msb_block[tid] = value;
cur_lsb_rhs_block[tid] = cur_msb_rhs_block[tid] = cur_ct_right[tid];
} else {
cur_lsb_block[tid] = value;
cur_lsb_rhs_block[tid] = cur_ct_right[tid];
}
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
Torus value = cur_src[params::degree];
if (process_msb) {
cur_lsb_block[params::degree] = cur_msb_block[params::degree] = value;
cur_lsb_rhs_block[params::degree] = cur_msb_rhs_block[params::degree] =
cur_ct_right[params::degree];
} else {
cur_lsb_block[params::degree] = value;
cur_lsb_rhs_block[params::degree] = cur_ct_right[params::degree];
}
}
}
template <typename Torus>
void compress_device_array_with_map(cuda_stream_t *stream, Torus *src,
Torus *dst, int *S, int *F, int num_blocks,
uint32_t map_size, uint32_t unit_size,
int &total_copied, bool is_message) {
for (int i = 0; i < map_size; i++) {
int s_index = i * num_blocks + S[i];
int number_of_unit = F[i] - S[i] + is_message;
auto cur_dst = &dst[total_copied * unit_size];
auto cur_src = &src[s_index * unit_size];
size_t copy_size = unit_size * number_of_unit * sizeof(Torus);
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied += number_of_unit;
}
}
template <typename Torus>
void extract_message_carry_to_full_radix(cuda_stream_t *stream, Torus *src,
Torus *dst, int *S, int *F,
uint32_t map_size, uint32_t unit_size,
int &total_copied,
int &total_radix_copied,
int num_blocks, bool is_message) {
size_t radix_size = unit_size * num_blocks;
for (int i = 0; i < map_size; i++) {
auto cur_dst_radix = &dst[total_radix_copied * radix_size];
int s_index = S[i];
int number_of_unit = F[i] - s_index + is_message;
if (!is_message) {
int zero_block_count = num_blocks - number_of_unit;
cuda_memset_async(cur_dst_radix, 0,
zero_block_count * unit_size * sizeof(Torus), stream);
s_index = zero_block_count;
}
auto cur_dst = &cur_dst_radix[s_index * unit_size];
auto cur_src = &src[total_copied * unit_size];
size_t copy_size = unit_size * number_of_unit * sizeof(Torus);
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied += number_of_unit;
++total_radix_copied;
}
}
template <typename Torus, class params>
__global__ void tree_add_chunks(Torus *result_blocks, Torus *input_blocks,
uint32_t chunk_size, uint32_t num_blocks) {
extern __shared__ Torus result[];
size_t chunk_id = blockIdx.x;
size_t chunk_elem_size = chunk_size * num_blocks * (params::degree + 1);
size_t radix_elem_size = num_blocks * (params::degree + 1);
auto src_chunk = &input_blocks[chunk_id * chunk_elem_size];
auto dst_radix = &result_blocks[chunk_id * radix_elem_size];
size_t block_stride = blockIdx.y * (params::degree + 1);
auto dst_block = &dst_radix[block_stride];
// init shared mem with first radix of chunk
size_t tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
result[tid] = src_chunk[block_stride + tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
result[params::degree] = src_chunk[block_stride + params::degree];
}
// accumulate rest of the radixes
for (int r_id = 1; r_id < chunk_size; r_id++) {
auto cur_src_radix = &src_chunk[r_id * radix_elem_size];
tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
result[tid] += cur_src_radix[block_stride + tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
result[params::degree] += cur_src_radix[block_stride + params::degree];
}
}
// put result from shared mem to global mem
tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
dst_block[tid] = result[tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
dst_block[params::degree] = result[params::degree];
}
}
template <typename Torus, class params>
__global__ void fill_radix_from_lsb_msb(Torus *result_blocks, Torus *lsb_blocks,
Torus *msb_blocks,
uint32_t glwe_dimension,
uint32_t lsb_count, uint32_t msb_count,
uint32_t num_blocks) {
size_t big_lwe_dimension = glwe_dimension * params::degree + 1;
size_t big_lwe_id = blockIdx.x;
size_t radix_id = big_lwe_id / num_blocks;
size_t block_id = big_lwe_id % num_blocks;
size_t lsb_block_id = block_id - radix_id;
size_t msb_block_id = block_id - radix_id - 1;
bool process_lsb = (radix_id <= block_id);
bool process_msb = (radix_id + 1 <= block_id);
auto cur_res_lsb_ct = &result_blocks[big_lwe_id * big_lwe_dimension];
auto cur_res_msb_ct =
&result_blocks[num_blocks * num_blocks * big_lwe_dimension +
big_lwe_id * big_lwe_dimension];
Torus *cur_lsb_radix = &lsb_blocks[(2 * num_blocks - radix_id + 1) *
radix_id / 2 * (params::degree + 1)];
Torus *cur_msb_radix = (process_msb)
? &msb_blocks[(2 * num_blocks - radix_id - 1) *
radix_id / 2 * (params::degree + 1)]
: nullptr;
Torus *cur_lsb_ct = (process_lsb)
? &cur_lsb_radix[lsb_block_id * (params::degree + 1)]
: nullptr;
Torus *cur_msb_ct = (process_msb)
? &cur_msb_radix[msb_block_id * (params::degree + 1)]
: nullptr;
size_t tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
cur_res_lsb_ct[tid] = (process_lsb) ? cur_lsb_ct[tid] : 0;
cur_res_msb_ct[tid] = (process_msb) ? cur_msb_ct[tid] : 0;
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
cur_res_lsb_ct[params::degree] =
(process_lsb) ? cur_lsb_ct[params::degree] : 0;
cur_res_msb_ct[params::degree] =
(process_msb) ? cur_msb_ct[params::degree] : 0;
}
}
template <typename Torus, typename STorus, class params>
__host__ void host_integer_mult_radix_kb(
cuda_stream_t *stream, uint64_t *radix_lwe_out, uint64_t *radix_lwe_left,
uint64_t *radix_lwe_right, void *bsk, uint64_t *ksk,
int_mul_memory<Torus> *mem_ptr, uint32_t num_blocks) {
auto glwe_dimension = mem_ptr->params.glwe_dimension;
auto polynomial_size = mem_ptr->params.polynomial_size;
auto lwe_dimension = mem_ptr->params.small_lwe_dimension;
auto message_modulus = mem_ptr->params.message_modulus;
auto carry_modulus = mem_ptr->params.carry_modulus;
int big_lwe_dimension = glwe_dimension * polynomial_size;
int big_lwe_size = big_lwe_dimension + 1;
// 'vector_result_lsb' contains blocks from all possible right shifts of
// radix_lwe_left, only nonzero blocks are kept
int lsb_vector_block_count = num_blocks * (num_blocks + 1) / 2;
// 'vector_result_msb' contains blocks from all possible shifts of
// radix_lwe_left except the last blocks of each shift. Only nonzero blocks
// are kept
int msb_vector_block_count = num_blocks * (num_blocks - 1) / 2;
// total number of blocks msb and lsb
int total_block_count = lsb_vector_block_count + msb_vector_block_count;
// buffer to keep all lsb and msb shifts
// for lsb all nonzero blocks of each right shifts are kept
// for 0 shift num_blocks blocks
// for 1 shift num_blocks - 1 blocks
// for num_blocks - 1 shift 1 block
// (num_blocks + 1) * num_blocks / 2 blocks
// for msb we don't keep track for last blocks so
// for 0 shift num_blocks - 1 blocks
// for 1 shift num_blocks - 2 blocks
// for num_blocks - 1 shift 0 blocks
// (num_blocks - 1) * num_blocks / 2 blocks
// in total num_blocks^2 blocks
// in each block three is big polynomial with
// glwe_dimension * polynomial_size + 1 coefficients
auto vector_result_sb = mem_ptr->vector_result_sb;
// buffer to keep lsb_vector + msb_vector
// addition will happen in full terms so there will be
// num_blocks terms and each term will have num_blocks block
// num_blocks^2 blocks in total
// and each blocks has big lwe ciphertext with
// glwe_dimension * polynomial_size + 1 coefficients
auto block_mul_res = mem_ptr->block_mul_res;
// buffer to keep keyswitch result of num_blocks^2 ciphertext
// in total it has num_blocks^2 small lwe ciphertexts with
// lwe_dimension +1 coefficients
auto small_lwe_vector = mem_ptr->small_lwe_vector;
// buffer to keep pbs result for num_blocks^2 lwe_ciphertext
// in total it has num_blocks^2 big lwe ciphertexts with
// glwe_dimension * polynomial_size + 1 coefficients
auto lwe_pbs_out_array = mem_ptr->lwe_pbs_out_array;
// it contains two lut, first for lsb extraction,
// second for msb extraction, with total length =
// 2 * (glwe_dimension + 1) * polynomial_size
auto luts_array = mem_ptr->luts_array;
// accumulator to extract message
// with length (glwe_dimension + 1) * polynomial_size
auto luts_message = mem_ptr->luts_message;
// accumulator to extract carry
// with length (glwe_dimension + 1) * polynomial_size
auto luts_carry = mem_ptr->luts_carry;
// to be used as default indexing
auto lwe_indexes = luts_array->lwe_indexes;
auto vector_result_lsb = &vector_result_sb[0];
auto vector_result_msb =
&vector_result_sb[lsb_vector_block_count *
(polynomial_size * glwe_dimension + 1)];
auto vector_lsb_rhs = &block_mul_res[0];
auto vector_msb_rhs = &block_mul_res[lsb_vector_block_count *
(polynomial_size * glwe_dimension + 1)];
dim3 grid(lsb_vector_block_count, 1, 1);
dim3 thds(params::degree / params::opt, 1, 1);
all_shifted_lhs_rhs<Torus, params><<<grid, thds, 0, stream->stream>>>(
radix_lwe_left, vector_result_lsb, vector_result_msb, radix_lwe_right,
vector_lsb_rhs, vector_msb_rhs, num_blocks);
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, block_mul_res, block_mul_res, vector_result_sb, bsk, ksk,
total_block_count, luts_array);
vector_result_lsb = &block_mul_res[0];
vector_result_msb = &block_mul_res[lsb_vector_block_count *
(polynomial_size * glwe_dimension + 1)];
fill_radix_from_lsb_msb<Torus, params>
<<<num_blocks * num_blocks, params::degree / params::opt, 0,
stream->stream>>>(vector_result_sb, vector_result_lsb,
vector_result_msb, glwe_dimension,
lsb_vector_block_count, msb_vector_block_count,
num_blocks);
auto new_blocks = block_mul_res;
auto old_blocks = vector_result_sb;
// amount of current radixes after block_mul
size_t r = 2 * num_blocks;
size_t total_modulus = message_modulus * carry_modulus;
size_t message_max = message_modulus - 1;
size_t chunk_size = (total_modulus - 1) / message_max;
size_t ch_amount = r / chunk_size;
int terms_degree[r * num_blocks];
int f_b[ch_amount];
int l_b[ch_amount];
for (int i = 0; i < num_blocks * num_blocks; i++) {
size_t r_id = i / num_blocks;
size_t b_id = i % num_blocks;
terms_degree[i] = (b_id >= r_id) ? 3 : 0;
}
auto terms_degree_msb = &terms_degree[num_blocks * num_blocks];
for (int i = 0; i < num_blocks * num_blocks; i++) {
size_t r_id = i / num_blocks;
size_t b_id = i % num_blocks;
terms_degree_msb[i] = (b_id > r_id) ? 2 : 0;
}
auto max_shared_memory = cuda_get_max_shared_memory(stream->gpu_index);
while (r > chunk_size) {
int cur_total_blocks = r * num_blocks;
ch_amount = r / chunk_size;
dim3 add_grid(ch_amount, num_blocks, 1);
size_t sm_size = big_lwe_size * sizeof(Torus);
cuda_memset_async(new_blocks, 0,
ch_amount * num_blocks * big_lwe_size * sizeof(Torus),
stream);
tree_add_chunks<Torus, params><<<add_grid, 256, sm_size, stream->stream>>>(
new_blocks, old_blocks, chunk_size, num_blocks);
for (int c_id = 0; c_id < ch_amount; c_id++) {
auto cur_chunk = &terms_degree[c_id * chunk_size * num_blocks];
int mx = 0;
int mn = num_blocks;
for (int r_id = 1; r_id < chunk_size; r_id++) {
auto cur_radix = &cur_chunk[r_id * num_blocks];
for (int i = 0; i < num_blocks; i++) {
if (cur_radix[i]) {
mn = min(mn, i);
mx = max(mx, i);
}
}
}
f_b[c_id] = mn;
l_b[c_id] = mx;
}
int total_copied = 0;
int message_count = 0;
int carry_count = 0;
compress_device_array_with_map<Torus>(stream, new_blocks, old_blocks, f_b,
l_b, num_blocks, ch_amount,
big_lwe_size, total_copied, true);
message_count = total_copied;
compress_device_array_with_map<Torus>(stream, new_blocks, old_blocks, f_b,
l_b, num_blocks, ch_amount,
big_lwe_size, total_copied, false);
carry_count = total_copied - message_count;
auto message_blocks_vector = old_blocks;
auto carry_blocks_vector =
&old_blocks[message_count * (glwe_dimension * polynomial_size + 1)];
cuda_keyswitch_lwe_ciphertext_vector(
stream, small_lwe_vector, lwe_indexes, old_blocks, lwe_indexes, ksk,
polynomial_size * glwe_dimension, lwe_dimension,
mem_ptr->params.ks_base_log, mem_ptr->params.ks_level, total_copied);
execute_pbs<Torus>(
stream, message_blocks_vector, lwe_indexes, luts_message->lut,
luts_message->lut_indexes, small_lwe_vector, lwe_indexes, bsk,
luts_message->pbs_buffer, glwe_dimension, lwe_dimension,
polynomial_size, mem_ptr->params.pbs_base_log,
mem_ptr->params.pbs_level, mem_ptr->params.grouping_factor,
message_count, 1, 0, max_shared_memory, mem_ptr->params.pbs_type);
execute_pbs<Torus>(stream, carry_blocks_vector, lwe_indexes,
luts_carry->lut, luts_carry->lut_indexes,
&small_lwe_vector[message_count * (lwe_dimension + 1)],
lwe_indexes, bsk, luts_carry->pbs_buffer,
glwe_dimension, lwe_dimension, polynomial_size,
mem_ptr->params.pbs_base_log, mem_ptr->params.pbs_level,
mem_ptr->params.grouping_factor, carry_count, 1, 0,
max_shared_memory, mem_ptr->params.pbs_type);
int rem_blocks = r % chunk_size * num_blocks;
int new_blocks_created = 2 * ch_amount * num_blocks;
int copy_size = rem_blocks * big_lwe_size * sizeof(Torus);
auto cur_dst = &new_blocks[new_blocks_created * big_lwe_size];
auto cur_src = &old_blocks[(cur_total_blocks - rem_blocks) * big_lwe_size];
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied = 0;
int total_radix_copied = 0;
extract_message_carry_to_full_radix<Torus>(
stream, old_blocks, new_blocks, f_b, l_b, ch_amount, big_lwe_size,
total_copied, total_radix_copied, num_blocks, true);
extract_message_carry_to_full_radix<Torus>(
stream, old_blocks, new_blocks, f_b, l_b, ch_amount, big_lwe_size,
total_copied, total_radix_copied, num_blocks, false);
std::swap(new_blocks, old_blocks);
r = (new_blocks_created + rem_blocks) / num_blocks;
}
dim3 add_grid(1, num_blocks, 1);
size_t sm_size = big_lwe_size * sizeof(Torus);
cuda_memset_async(radix_lwe_out, 0, num_blocks * big_lwe_size * sizeof(Torus),
stream);
tree_add_chunks<Torus, params><<<add_grid, 256, sm_size, stream->stream>>>(
radix_lwe_out, old_blocks, r, num_blocks);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, vector_result_sb, radix_lwe_out, bsk, ksk, num_blocks,
luts_message);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, &block_mul_res[big_lwe_size], radix_lwe_out, bsk, ksk, num_blocks,
luts_carry);
cuda_memset_async(block_mul_res, 0, big_lwe_size * sizeof(Torus), stream);
host_addition(stream, radix_lwe_out, vector_result_sb, block_mul_res,
big_lwe_size, num_blocks);
host_propagate_single_carry_low_latency<Torus>(
stream, radix_lwe_out, mem_ptr->scp_mem, bsk, ksk, num_blocks);
}
template <typename Torus>
__host__ void scratch_cuda_integer_mult_radix_ciphertext_kb(
cuda_stream_t *stream, int_mul_memory<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
bool allocate_gpu_memory) {
*mem_ptr = new int_mul_memory<Torus>(stream, params, num_radix_blocks,
allocate_gpu_memory);
}
// Function to apply lookup table,
// It has two mode
// lsb_msb_mode == true - extracts lsb and msb
// lsb_msb_mode == false - extracts message and carry
template <typename Torus, typename STorus, class params>
void apply_lookup_table(Torus *input_ciphertexts, Torus *output_ciphertexts,
int_mul_memory<Torus> *mem_ptr, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size,
uint32_t pbs_base_log, uint32_t pbs_level,
uint32_t ks_base_log, uint32_t ks_level,
uint32_t grouping_factor,
uint32_t lsb_message_blocks_count,
uint32_t msb_carry_blocks_count,
uint32_t max_shared_memory, bool lsb_msb_mode) {
int total_blocks_count = lsb_message_blocks_count + msb_carry_blocks_count;
int gpu_n = mem_ptr->p2p_gpu_count;
if (total_blocks_count < gpu_n)
gpu_n = total_blocks_count;
int gpu_blocks_count = total_blocks_count / gpu_n;
int big_lwe_size = glwe_dimension * polynomial_size + 1;
// int small_lwe_size = lwe_dimension + 1;
#pragma omp parallel for num_threads(gpu_n)
for (int i = 0; i < gpu_n; i++) {
cudaSetDevice(i);
auto this_stream = mem_ptr->streams[i];
// Index where input and output blocks start for current gpu
int big_lwe_start_index = i * gpu_blocks_count * big_lwe_size;
// Last gpu might have extra blocks to process if total blocks number is not
// divisible by gpu_n
if (i == gpu_n - 1) {
gpu_blocks_count += total_blocks_count % gpu_n;
}
int can_access_peer;
cudaDeviceCanAccessPeer(&can_access_peer, i, 0);
if (i == 0) {
check_cuda_error(
cudaMemcpyAsync(mem_ptr->pbs_output_multi_gpu[i],
&input_ciphertexts[big_lwe_start_index],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
cudaMemcpyDeviceToDevice, *this_stream));
} else if (can_access_peer) {
check_cuda_error(cudaMemcpyPeerAsync(
mem_ptr->pbs_output_multi_gpu[i], i,
&input_ciphertexts[big_lwe_start_index], 0,
gpu_blocks_count * big_lwe_size * sizeof(Torus), *this_stream));
} else {
// Uses host memory as middle ground
cuda_memcpy_async_to_cpu(mem_ptr->device_to_device_buffer[i],
&input_ciphertexts[big_lwe_start_index],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
this_stream, i);
cuda_memcpy_async_to_gpu(
mem_ptr->pbs_output_multi_gpu[i], mem_ptr->device_to_device_buffer[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus), this_stream, i);
}
// when lsb and msb have to be extracted
// for first lsb_count blocks we need lsb_acc
// for last msb_count blocks we need msb_acc
// when message and carry have tobe extracted
// for first message_count blocks we need message_acc
// for last carry_count blocks we need carry_acc
Torus *cur_lut_indexes;
if (lsb_msb_mode) {
cur_lut_indexes = (big_lwe_start_index < lsb_message_blocks_count)
? mem_ptr->lut_indexes_lsb_multi_gpu[i]
: mem_ptr->lut_indexes_msb_multi_gpu[i];
} else {
cur_lut_indexes = (big_lwe_start_index < lsb_message_blocks_count)
? mem_ptr->lut_indexes_message_multi_gpu[i]
: mem_ptr->lut_indexes_carry_multi_gpu[i];
}
// execute keyswitch on a current gpu with corresponding input and output
// blocks pbs_output_multi_gpu[i] is an input for keyswitch and
// pbs_input_multi_gpu[i] is an output for keyswitch
cuda_keyswitch_lwe_ciphertext_vector(
this_stream, i, mem_ptr->pbs_input_multi_gpu[i],
mem_ptr->pbs_output_multi_gpu[i], mem_ptr->ksk_multi_gpu[i],
polynomial_size * glwe_dimension, lwe_dimension, ks_base_log, ks_level,
gpu_blocks_count);
// execute pbs on a current gpu with corresponding input and output
cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
this_stream, i, mem_ptr->pbs_output_multi_gpu[i],
mem_ptr->lut_multi_gpu[i], cur_lut_indexes,
mem_ptr->pbs_input_multi_gpu[i], mem_ptr->bsk_multi_gpu[i],
mem_ptr->pbs_buffer_multi_gpu[i], lwe_dimension, glwe_dimension,
polynomial_size, grouping_factor, pbs_base_log, pbs_level,
grouping_factor, gpu_blocks_count, 2, 0, max_shared_memory);
// lookup table is applied and now data from current gpu have to be copied
// back to gpu_0 in 'output_ciphertexts' buffer
if (i == 0) {
check_cuda_error(
cudaMemcpyAsync(&output_ciphertexts[big_lwe_start_index],
mem_ptr->pbs_output_multi_gpu[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
cudaMemcpyDeviceToDevice, *this_stream));
} else if (can_access_peer) {
check_cuda_error(cudaMemcpyPeerAsync(
&output_ciphertexts[big_lwe_start_index], 0,
mem_ptr->pbs_output_multi_gpu[i], i,
gpu_blocks_count * big_lwe_size * sizeof(Torus), *this_stream));
} else {
// Uses host memory as middle ground
cuda_memcpy_async_to_cpu(
mem_ptr->device_to_device_buffer[i], mem_ptr->pbs_output_multi_gpu[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus), this_stream, i);
cuda_memcpy_async_to_gpu(&output_ciphertexts[big_lwe_start_index],
mem_ptr->device_to_device_buffer[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
this_stream, i);
}
}
}
template <typename T>
__global__ void device_small_scalar_radix_multiplication(T *output_lwe_array,
T *input_lwe_array,
T scalar,
uint32_t lwe_dimension,
uint32_t num_blocks) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int lwe_size = lwe_dimension + 1;
if (index < num_blocks * lwe_size) {
// Here we take advantage of the wrapping behaviour of uint
output_lwe_array[index] = input_lwe_array[index] * scalar;
}
}
template <typename T>
__host__ void host_integer_small_scalar_mult_radix(
cuda_stream_t *stream, T *output_lwe_array, T *input_lwe_array, T scalar,
uint32_t input_lwe_dimension, uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
device_small_scalar_radix_multiplication<<<grid, thds, 0, stream->stream>>>(
output_lwe_array, input_lwe_array, scalar, input_lwe_dimension,
input_lwe_ciphertext_count);
check_cuda_error(cudaGetLastError());
}
#endif

View File

@@ -0,0 +1,12 @@
#include "integer/negation.cuh"
void cuda_negate_integer_radix_ciphertext_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint32_t lwe_dimension,
uint32_t lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus) {
host_integer_radix_negation(stream, static_cast<uint64_t *>(lwe_array),
static_cast<uint64_t *>(lwe_array), lwe_dimension,
lwe_ciphertext_count, message_modulus,
carry_modulus);
}

View File

@@ -0,0 +1,79 @@
#ifndef CUDA_INTEGER_NEGATE_CUH
#define CUDA_INTEGER_NEGATE_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "device.h"
#include "integer.h"
#include "utils/kernel_dimensions.cuh"
template <typename Torus>
__global__ void
device_integer_radix_negation(Torus *output, Torus *input, int32_t num_blocks,
uint64_t lwe_dimension, uint64_t message_modulus,
uint64_t carry_modulus, uint64_t delta) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < lwe_dimension + 1) {
bool is_body = (tid == lwe_dimension);
// z = ceil( degree / 2^p ) * 2^p
uint64_t z = (2 * message_modulus - 1) / message_modulus;
__syncthreads();
z *= message_modulus;
// (0,Delta*z) - ct
output[tid] = (is_body ? z * delta - input[tid] : -input[tid]);
for (int radix_block_id = 1; radix_block_id < num_blocks;
radix_block_id++) {
tid += (lwe_dimension + 1);
// Subtract z/B to the next ciphertext to compensate for the addition of z
uint64_t zb = z / message_modulus;
uint64_t encoded_zb = zb * delta;
__syncthreads();
// (0,Delta*z) - ct
output[tid] =
(is_body ? z * delta - (input[tid] + encoded_zb) : -input[tid]);
__syncthreads();
}
}
}
template <typename Torus>
__host__ void host_integer_radix_negation(cuda_stream_t *stream, Torus *output,
Torus *input, uint32_t lwe_dimension,
uint32_t input_lwe_ciphertext_count,
uint64_t message_modulus,
uint64_t carry_modulus) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
uint64_t shared_mem = input_lwe_ciphertext_count * sizeof(uint32_t);
// Value of the shift we multiply our messages by
// If message_modulus and carry_modulus are always powers of 2 we can simplify
// this
uint64_t delta = ((uint64_t)1 << 63) / (message_modulus * carry_modulus);
device_integer_radix_negation<<<grid, thds, shared_mem, stream->stream>>>(
output, input, input_lwe_ciphertext_count, lwe_dimension, message_modulus,
carry_modulus, delta);
check_cuda_error(cudaGetLastError());
}
#endif

View File

@@ -0,0 +1,12 @@
#include "integer/scalar_addition.cuh"
void cuda_scalar_addition_integer_radix_ciphertext_64_inplace(
cuda_stream_t *stream, void *lwe_array, void *scalar_input,
uint32_t lwe_dimension, uint32_t lwe_ciphertext_count,
uint32_t message_modulus, uint32_t carry_modulus) {
host_integer_radix_scalar_addition_inplace(
stream, static_cast<uint64_t *>(lwe_array),
static_cast<uint64_t *>(scalar_input), lwe_dimension,
lwe_ciphertext_count, message_modulus, carry_modulus);
}

View File

@@ -0,0 +1,130 @@
#ifndef CUDA_INTEGER_ADD_CUH
#define CUDA_INTEGER_ADD_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "device.h"
#include "integer.h"
#include "utils/kernel_dimensions.cuh"
#include <stdio.h>
template <typename Torus>
__global__ void device_integer_radix_scalar_addition_inplace(
Torus *lwe_array, Torus *scalar_input, int32_t num_blocks,
uint32_t lwe_dimension, uint64_t delta) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < num_blocks) {
Torus scalar = scalar_input[tid];
Torus *body = lwe_array + tid * (lwe_dimension + 1) + lwe_dimension;
*body += scalar * delta;
}
}
template <typename Torus>
__host__ void host_integer_radix_scalar_addition_inplace(
cuda_stream_t *stream, Torus *lwe_array, Torus *scalar_input,
uint32_t lwe_dimension, uint32_t input_lwe_ciphertext_count,
uint32_t message_modulus, uint32_t carry_modulus) {
cudaSetDevice(stream->gpu_index);
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
// Value of the shift we multiply our messages by
// If message_modulus and carry_modulus are always powers of 2 we can simplify
// this
uint64_t delta = ((uint64_t)1 << 63) / (message_modulus * carry_modulus);
device_integer_radix_scalar_addition_inplace<<<grid, thds, 0,
stream->stream>>>(
lwe_array, scalar_input, input_lwe_ciphertext_count, lwe_dimension,
delta);
check_cuda_error(cudaGetLastError());
}
template <typename Torus>
__global__ void device_integer_radix_add_scalar_one_inplace(
Torus *lwe_array, int32_t num_blocks, uint32_t lwe_dimension,
uint64_t delta) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < num_blocks) {
Torus *body = lwe_array + tid * (lwe_dimension + 1) + lwe_dimension;
*body += delta;
}
}
template <typename Torus>
__host__ void host_integer_radix_add_scalar_one_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t lwe_dimension,
uint32_t input_lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus) {
cudaSetDevice(stream->gpu_index);
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
// Value of the shift we multiply our messages by
// If message_modulus and carry_modulus are always powers of 2 we can simplify
// this
uint64_t delta = ((uint64_t)1 << 63) / (message_modulus * carry_modulus);
device_integer_radix_add_scalar_one_inplace<<<grid, thds, 0,
stream->stream>>>(
lwe_array, input_lwe_ciphertext_count, lwe_dimension, delta);
check_cuda_error(cudaGetLastError());
}
template <typename Torus>
__global__ void device_integer_radix_scalar_subtraction_inplace(
Torus *lwe_array, Torus *scalar_input, int32_t num_blocks,
uint32_t lwe_dimension, uint64_t delta) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < num_blocks) {
Torus scalar = scalar_input[tid];
Torus *body = lwe_array + tid * (lwe_dimension + 1) + lwe_dimension;
*body -= scalar * delta;
}
}
template <typename Torus>
__host__ void host_integer_radix_scalar_subtraction_inplace(
cuda_stream_t *stream, Torus *lwe_array, Torus *scalar_input,
uint32_t lwe_dimension, uint32_t input_lwe_ciphertext_count,
uint32_t message_modulus, uint32_t carry_modulus) {
cudaSetDevice(stream->gpu_index);
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
// Value of the shift we multiply our messages by
// If message_modulus and carry_modulus are always powers of 2 we can simplify
// this
uint64_t delta = ((uint64_t)1 << 63) / (message_modulus * carry_modulus);
device_integer_radix_scalar_subtraction_inplace<<<grid, thds, 0,
stream->stream>>>(
lwe_array, scalar_input, input_lwe_ciphertext_count, lwe_dimension,
delta);
check_cuda_error(cudaGetLastError());
}
#endif

View File

@@ -0,0 +1,14 @@
#include "integer/scalar_bitops.cuh"
void cuda_scalar_bitop_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_input,
void *clear_blocks, uint32_t num_clear_blocks, int8_t *mem_ptr, void *bsk,
void *ksk, uint32_t lwe_ciphertext_count, BITOP_TYPE op) {
host_integer_radix_scalar_bitop_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_input),
static_cast<uint64_t *>(clear_blocks), num_clear_blocks,
(int_bitop_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count, op);
}

View File

@@ -0,0 +1,51 @@
#ifndef CUDA_INTEGER_SCALAR_BITWISE_OPS_CUH
#define CUDA_INTEGER_SCALAR_BITWISE_OPS_CUH
#include "integer/bitwise_ops.cuh"
#include <omp.h>
template <typename Torus>
__host__ void host_integer_radix_scalar_bitop_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_input,
Torus *clear_blocks, uint32_t num_clear_blocks,
int_bitop_buffer<Torus> *mem_ptr, void *bsk, Torus *ksk,
uint32_t num_radix_blocks, BITOP_TYPE op) {
auto lut = mem_ptr->lut;
auto params = lut->params;
auto big_lwe_dimension = params.big_lwe_dimension;
uint32_t lwe_size = big_lwe_dimension + 1;
if (num_clear_blocks == 0) {
if (op == SCALAR_BITAND) {
auto lwe_array_out_block = lwe_array_out + num_clear_blocks * lwe_size;
cuda_memset_async(lwe_array_out, 0,
num_radix_blocks * lwe_size * sizeof(Torus), stream);
} else {
cuda_memcpy_async_gpu_to_gpu(lwe_array_out, lwe_array_input,
num_radix_blocks * lwe_size * sizeof(Torus),
stream);
}
} else {
auto lut_buffer = lut->lut;
// We have all possible LUTs pre-computed and we use the decomposed scalar
// as index to recover the right one
cuda_memcpy_async_gpu_to_gpu(lut->lut_indexes, clear_blocks,
num_clear_blocks * sizeof(Torus), stream);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, lwe_array_input, bsk, ksk, num_clear_blocks,
lut);
if (op == SCALAR_BITAND) {
auto lwe_array_out_block = lwe_array_out + num_clear_blocks * lwe_size;
cuda_memset_async(lwe_array_out_block, 0,
(num_radix_blocks - num_clear_blocks) * lwe_size *
sizeof(Torus),
stream);
}
}
}
#endif

View File

@@ -0,0 +1,44 @@
#include "integer/scalar_comparison.cuh"
void cuda_scalar_comparison_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *scalar_blocks, int8_t *mem_ptr, void *bsk, void *ksk,
uint32_t lwe_ciphertext_count, uint32_t num_scalar_blocks) {
int_comparison_buffer<uint64_t> *buffer =
(int_comparison_buffer<uint64_t> *)mem_ptr;
switch (buffer->op) {
// case EQ:
// case NE:
// host_integer_radix_equality_check_kb<uint64_t>(
// stream, static_cast<uint64_t *>(lwe_array_out),
// static_cast<uint64_t *>(lwe_array_1),
// static_cast<uint64_t *>(lwe_array_2), buffer, bsk,
// static_cast<uint64_t *>(ksk), glwe_dimension, polynomial_size,
// big_lwe_dimension, small_lwe_dimension, ks_level, ks_base_log,
// pbs_level, pbs_base_log, grouping_factor, lwe_ciphertext_count,
// message_modulus, carry_modulus);
// break;
case GT:
case GE:
case LT:
case LE:
host_integer_radix_scalar_difference_check_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(scalar_blocks), buffer,
buffer->diff_buffer->operator_f, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count, num_scalar_blocks);
break;
case MAX:
case MIN:
host_integer_radix_scalar_maxmin_kb<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(scalar_blocks), buffer, bsk,
static_cast<uint64_t *>(ksk), lwe_ciphertext_count, num_scalar_blocks);
break;
default:
printf("Not implemented\n");
}
}

View File

@@ -0,0 +1,298 @@
#ifndef CUDA_INTEGER_SCALAR_COMPARISON_OPS_CUH
#define CUDA_INTEGER_SCALAR_COMPARISON_OPS_CUH
#include "integer/comparison.cuh"
#include <omp.h>
template <typename Torus>
__host__ void host_integer_radix_scalar_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> sign_handler_f, void *bsk, Torus *ksk,
uint32_t total_num_radix_blocks, uint32_t total_num_scalar_blocks) {
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
auto diff_buffer = mem_ptr->diff_buffer;
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
// Reducing the signs is the bottleneck of the comparison algorithms,
// however if the scalar case there is an improvement:
//
// The idea is to reduce the number of signs block we have to
// reduce. We can do that by splitting the comparison problem in two parts.
//
// - One part where we compute the signs block between the scalar with just
// enough blocks
// from the ciphertext that can represent the scalar value
//
// - The other part is to compare the ciphertext blocks not considered for the
// sign
// computation with zero, and create a single sign block from that.
//
// The smaller the scalar value is compared to the ciphertext num bits
// encrypted, the more the comparisons with zeros we have to do, and the less
// signs block we will have to reduce.
//
// This will create a speedup as comparing a bunch of blocks with 0
// is faster
if (total_num_scalar_blocks == 0) {
// We only have to compare blocks with zero
// means scalar is zero
host_compare_with_zero_equality(stream, mem_ptr->tmp_lwe_array_out,
lwe_array_in, mem_ptr, bsk, ksk,
total_num_radix_blocks);
auto scalar_last_leaf_lut_f = [sign_handler_f](Torus x) -> Torus {
x = (x == 1 ? IS_EQUAL : IS_SUPERIOR);
return sign_handler_f(x);
};
auto lut = mem_ptr->diff_buffer->tree_buffer->tree_last_leaf_scalar_lut;
generate_device_accumulator<Torus>(stream, lut->lut, glwe_dimension,
polynomial_size, message_modulus,
carry_modulus, scalar_last_leaf_lut_f);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, mem_ptr->tmp_lwe_array_out, bsk, ksk, 1, lut);
// The result will be in the two first block. Everything else is
// garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (total_num_radix_blocks - 1),
stream);
} else if (total_num_scalar_blocks < total_num_radix_blocks) {
// We have to handle both part of the work described above
uint32_t num_lsb_radix_blocks = total_num_scalar_blocks;
uint32_t num_msb_radix_blocks =
total_num_radix_blocks - num_lsb_radix_blocks;
auto lsb = lwe_array_in;
auto msb = lwe_array_in + num_lsb_radix_blocks * big_lwe_size;
auto lwe_array_lsb_out = mem_ptr->tmp_lwe_array_out;
auto lwe_array_msb_out = lwe_array_lsb_out + big_lwe_size;
cuda_synchronize_stream(stream);
auto lsb_stream = diff_buffer->lsb_stream;
auto msb_stream = diff_buffer->msb_stream;
#pragma omp parallel sections
{
// Both sections may be executed in parallel
#pragma omp section
{
//////////////
// lsb
Torus *lhs = diff_buffer->tmp_packed_left;
Torus *rhs = diff_buffer->tmp_packed_right;
pack_blocks(lsb_stream, lhs, lwe_array_in, big_lwe_dimension,
num_lsb_radix_blocks, message_modulus);
pack_blocks(lsb_stream, rhs, scalar_blocks, 0, total_num_scalar_blocks,
message_modulus);
// From this point we have half number of blocks
num_lsb_radix_blocks /= 2;
num_lsb_radix_blocks += (total_num_scalar_blocks % 2);
// comparisons will be assigned
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto comparisons = mem_ptr->tmp_block_comparisons;
scalar_compare_radix_blocks_kb(lsb_stream, comparisons, lhs, rhs,
mem_ptr, bsk, ksk, num_lsb_radix_blocks);
// Reduces a vec containing radix blocks that encrypts a sign
// (inferior, equal, superior) to one single radix block containing the
// final sign
tree_sign_reduction(lsb_stream, lwe_array_lsb_out, comparisons,
mem_ptr->diff_buffer->tree_buffer,
mem_ptr->cleaning_lut_f, bsk, ksk,
num_lsb_radix_blocks);
}
#pragma omp section
{
//////////////
// msb
host_compare_with_zero_equality(msb_stream, lwe_array_msb_out, msb,
mem_ptr, bsk, ksk,
num_msb_radix_blocks);
}
}
cuda_synchronize_stream(lsb_stream);
cuda_synchronize_stream(msb_stream);
//////////////
// Reduce the two blocks into one final
auto scalar_bivariate_last_leaf_lut_f =
[sign_handler_f](Torus lsb, Torus msb) -> Torus {
if (msb == 1)
return sign_handler_f(lsb);
else
return sign_handler_f(IS_SUPERIOR);
};
auto lut = diff_buffer->tree_buffer->tree_last_leaf_scalar_lut;
generate_device_accumulator_bivariate<Torus>(
stream, lut->lut, glwe_dimension, polynomial_size, message_modulus,
carry_modulus, scalar_bivariate_last_leaf_lut_f);
integer_radix_apply_bivariate_lookup_table_kb(
stream, lwe_array_out, lwe_array_lsb_out, lwe_array_msb_out, bsk, ksk,
1, lut);
// The result will be in the first block. Everything else is garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes,
stream);
} else {
// We only have to do the regular comparison
// And not the part where we compare most significant blocks with zeros
// total_num_radix_blocks == total_num_scalar_blocks
uint32_t num_lsb_radix_blocks = total_num_radix_blocks;
uint32_t num_scalar_blocks = total_num_scalar_blocks;
auto lsb = lwe_array_in;
Torus *lhs = diff_buffer->tmp_packed_left;
Torus *rhs = diff_buffer->tmp_packed_right;
pack_blocks(stream, lhs, lwe_array_in, big_lwe_dimension,
num_lsb_radix_blocks, message_modulus);
pack_blocks(stream, rhs, scalar_blocks, 0, num_scalar_blocks,
message_modulus);
// From this point we have half number of blocks
num_lsb_radix_blocks /= 2;
num_scalar_blocks /= 2;
// comparisons will be assigned
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto comparisons = mem_ptr->tmp_lwe_array_out;
scalar_compare_radix_blocks_kb(stream, comparisons, lhs, rhs, mem_ptr, bsk,
ksk, num_lsb_radix_blocks);
// Reduces a vec containing radix blocks that encrypts a sign
// (inferior, equal, superior) to one single radix block containing the
// final sign
tree_sign_reduction(stream, lwe_array_out, comparisons,
mem_ptr->diff_buffer->tree_buffer, sign_handler_f, bsk,
ksk, num_lsb_radix_blocks);
// The result will be in the first block. Everything else is garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes,
stream);
}
}
template <typename Torus>
__host__ void
scalar_compare_radix_blocks_kb(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_in, Torus *scalar_blocks,
int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
auto params = mem_ptr->params;
auto pbs_type = params.pbs_type;
auto big_lwe_dimension = params.big_lwe_dimension;
auto small_lwe_dimension = params.small_lwe_dimension;
auto ks_level = params.ks_level;
auto ks_base_log = params.ks_base_log;
auto pbs_level = params.pbs_level;
auto pbs_base_log = params.pbs_base_log;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto grouping_factor = params.grouping_factor;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
// When rhs > lhs, the subtraction will overflow, and the bit of padding will
// be set to 1
// meaning that the output of the pbs will be the negative (modulo message
// space)
//
// Example:
// lhs: 1, rhs: 3, message modulus: 4, carry modulus 4
// lhs - rhs = -2 % (4 * 4) = 14 = 1|1110 (padding_bit|b4b3b2b1)
// Since there was an overflow the bit of padding is 1 and not 0.
// When applying the LUT for an input value of 14 we would expect 1,
// but since the bit of padding is 1, we will get -1 modulus our message
// space, so (-1) % (4 * 4) = 15 = 1|1111 We then add one and get 0 = 0|0000
auto subtracted_blocks = mem_ptr->tmp_block_comparisons;
cuda_memcpy_async_gpu_to_gpu(
subtracted_blocks, lwe_array_in,
num_radix_blocks * (big_lwe_dimension + 1) * sizeof(Torus), stream);
// Subtract
// Here we need the true lwe sub, not the one that comes from shortint.
host_integer_radix_scalar_subtraction_inplace(
stream, subtracted_blocks, scalar_blocks, big_lwe_dimension,
num_radix_blocks, message_modulus, carry_modulus);
// Apply LUT to compare to 0
auto sign_lut = mem_ptr->eq_buffer->is_non_zero_lut;
integer_radix_apply_univariate_lookup_table_kb(stream, lwe_array_out,
subtracted_blocks, bsk, ksk,
num_radix_blocks, sign_lut);
// Add one
// Here Lhs can have the following values: (-1) % (message modulus * carry
// modulus), 0, 1 So the output values after the addition will be: 0, 1, 2
host_integer_radix_add_scalar_one_inplace(stream, lwe_array_out,
big_lwe_dimension, num_radix_blocks,
message_modulus, carry_modulus);
}
template <typename Torus>
__host__ void host_integer_radix_scalar_maxmin_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t total_num_radix_blocks,
uint32_t total_num_scalar_blocks) {
auto params = mem_ptr->params;
// Calculates the difference sign between the ciphertext and the scalar
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto sign = mem_ptr->tmp_lwe_array_out;
host_integer_radix_scalar_difference_check_kb(
stream, sign, lwe_array_in, scalar_blocks, mem_ptr,
mem_ptr->cleaning_lut_f, bsk, ksk, total_num_radix_blocks,
total_num_scalar_blocks);
// There is no optimized CMUX for scalars, so we convert to a trivial
// ciphertext
auto lwe_array_left = lwe_array_in;
auto lwe_array_right = mem_ptr->tmp_block_comparisons;
create_trivial_radix(stream, lwe_array_right, scalar_blocks,
params.big_lwe_dimension, total_num_radix_blocks,
total_num_scalar_blocks, params.message_modulus,
params.carry_modulus);
// Selector
// CMUX for Max or Min
host_integer_radix_cmux_kb(
stream, lwe_array_out, mem_ptr->tmp_lwe_array_out, lwe_array_left,
lwe_array_right, mem_ptr->cmux_buffer, bsk, ksk, total_num_radix_blocks);
}
#endif

View File

@@ -0,0 +1,40 @@
#include "scalar_rotate.cuh"
void scratch_cuda_integer_radix_scalar_rotate_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_TYPE shift_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_scalar_rotate_kb<uint64_t>(
stream, (int_shift_buffer<uint64_t> **)mem_ptr, num_blocks, params,
shift_type, allocate_gpu_memory);
}
void cuda_integer_radix_scalar_rotate_kb_64_inplace(cuda_stream_t *stream,
void *lwe_array, uint32_t n,
int8_t *mem_ptr, void *bsk,
void *ksk,
uint32_t num_blocks) {
host_integer_radix_scalar_rotate_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array), n,
(int_shift_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
num_blocks);
}
void cleanup_cuda_integer_radix_scalar_rotate(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_shift_buffer<uint64_t> *mem_ptr =
(int_shift_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,114 @@
#ifndef CUDA_INTEGER_SCALAR_ROTATE_OPS_CUH
#define CUDA_INTEGER_SCALAR_ROTATE_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#ifndef CUDA_INTEGER_SHIFT_OPS_CUH
#define CUDA_INTEGER_SHIFT_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
template <typename Torus>
__host__ void scratch_cuda_integer_radix_scalar_rotate_kb(
cuda_stream_t *stream, int_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, SHIFT_TYPE shift_type,
bool allocate_gpu_memory) {
*mem_ptr = new int_shift_buffer<Torus>(stream, shift_type, params,
num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_scalar_rotate_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t n,
int_shift_buffer<Torus> *mem, void *bsk, Torus *ksk, uint32_t num_blocks) {
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
size_t big_lwe_size = glwe_dimension * polynomial_size + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
size_t num_bits_in_message = (size_t)log2(message_modulus);
size_t total_num_bits = num_bits_in_message * num_blocks;
n = n % total_num_bits;
if (n == 0) {
return;
}
size_t rotations = n / num_bits_in_message;
size_t shift_within_block = n % num_bits_in_message;
Torus *rotated_buffer = mem->tmp_rotated;
auto lut_bivariate = mem->lut_buffers_bivariate[shift_within_block - 1];
// rotate right all the blocks in radix ciphertext
// copy result in new buffer
// 256 threads are used in every block
// block_count blocks will be used in the grid
// one block is responsible to process single lwe ciphertext
if (mem->shift_type == LEFT_SHIFT) {
radix_blocks_rotate_right<<<num_blocks, 256, 0, stream->stream>>>(
rotated_buffer, lwe_array, rotations, num_blocks, big_lwe_size);
cuda_memcpy_async_gpu_to_gpu(lwe_array, rotated_buffer,
num_blocks * big_lwe_size_bytes, stream);
if (shift_within_block == 0) {
return;
}
auto receiver_blocks = lwe_array;
auto giver_blocks = rotated_buffer;
radix_blocks_rotate_right<<<num_blocks, 256, 0, stream->stream>>>(
giver_blocks, lwe_array, 1, num_blocks, big_lwe_size);
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, lwe_array, receiver_blocks, giver_blocks, bsk, ksk, num_blocks,
lut_bivariate);
} else {
// left shift
radix_blocks_rotate_left<<<num_blocks, 256, 0, stream->stream>>>(
rotated_buffer, lwe_array, rotations, num_blocks, big_lwe_size);
cuda_memcpy_async_gpu_to_gpu(lwe_array, rotated_buffer,
num_blocks * big_lwe_size_bytes, stream);
if (shift_within_block == 0) {
return;
}
auto receiver_blocks = lwe_array;
auto giver_blocks = rotated_buffer;
radix_blocks_rotate_left<<<num_blocks, 256, 0, stream->stream>>>(
giver_blocks, lwe_array, 1, num_blocks, big_lwe_size);
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, lwe_array, receiver_blocks, giver_blocks, bsk, ksk, num_blocks,
lut_bivariate);
}
}
#endif // CUDA_SCALAR_OPS_CUH
#endif // CUDA_INTEGER_SCALAR_ROTATE_OPS_CUH

View File

@@ -0,0 +1,38 @@
#include "scalar_shifts.cuh"
void scratch_cuda_integer_radix_scalar_shift_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_TYPE shift_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_scalar_shift_kb<uint64_t>(
stream, (int_shift_buffer<uint64_t> **)mem_ptr, num_blocks, params,
shift_type, allocate_gpu_memory);
}
void cuda_integer_radix_scalar_shift_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint32_t shift, int8_t *mem_ptr,
void *bsk, void *ksk, uint32_t num_blocks) {
host_integer_radix_scalar_shift_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array), shift,
(int_shift_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
num_blocks);
}
void cleanup_cuda_integer_radix_scalar_shift(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_shift_buffer<uint64_t> *mem_ptr =
(int_shift_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -0,0 +1,125 @@
#ifndef CUDA_INTEGER_SHIFT_OPS_CUH
#define CUDA_INTEGER_SHIFT_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
template <typename Torus>
__host__ void scratch_cuda_integer_radix_scalar_shift_kb(
cuda_stream_t *stream, int_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, SHIFT_TYPE shift_type,
bool allocate_gpu_memory) {
*mem_ptr = new int_shift_buffer<Torus>(stream, shift_type, params,
num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_scalar_shift_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t shift,
int_shift_buffer<Torus> *mem, void *bsk, Torus *ksk, uint32_t num_blocks) {
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
size_t big_lwe_size = glwe_dimension * polynomial_size + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
size_t num_bits_in_block = (size_t)log2(message_modulus);
size_t total_num_bits = num_bits_in_block * num_blocks;
shift = shift % total_num_bits;
if (shift == 0) {
return;
}
size_t rotations = std::min(shift / num_bits_in_block, (size_t)num_blocks);
size_t shift_within_block = shift % num_bits_in_block;
Torus *rotated_buffer = mem->tmp_rotated;
auto lut_bivariate = mem->lut_buffers_bivariate[shift_within_block - 1];
auto lut_univariate = mem->lut_buffers_univariate[shift_within_block];
// rotate right all the blocks in radix ciphertext
// copy result in new buffer
// 256 threads are used in every block
// block_count blocks will be used in the grid
// one block is responsible to process single lwe ciphertext
if (mem->shift_type == LEFT_SHIFT) {
radix_blocks_rotate_right<<<num_blocks, 256, 0, stream->stream>>>(
rotated_buffer, lwe_array, rotations, num_blocks, big_lwe_size);
// create trivial assign for value = 0
cuda_memset_async(rotated_buffer, 0, rotations * big_lwe_size_bytes,
stream);
cuda_memcpy_async_gpu_to_gpu(lwe_array, rotated_buffer,
num_blocks * big_lwe_size_bytes, stream);
if (shift_within_block == 0 || rotations == num_blocks) {
return;
}
// check if we have enough blocks for partial processing
if (rotations < num_blocks - 1) {
auto partial_current_blocks = &lwe_array[(rotations + 1) * big_lwe_size];
auto partial_previous_blocks = &lwe_array[rotations * big_lwe_size];
size_t partial_block_count = num_blocks - rotations - 1;
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, partial_current_blocks, partial_current_blocks,
partial_previous_blocks, bsk, ksk, partial_block_count,
lut_bivariate);
}
auto rest = &lwe_array[rotations * big_lwe_size];
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, rest, rest, bsk, ksk, 1, lut_univariate);
} else {
// right shift
radix_blocks_rotate_left<<<num_blocks, 256, 0, stream->stream>>>(
rotated_buffer, lwe_array, rotations, num_blocks, big_lwe_size);
// rotate left as the blocks are from LSB to MSB
// create trivial assign for value = 0
cuda_memset_async(rotated_buffer + (num_blocks - rotations) * big_lwe_size,
0, rotations * big_lwe_size_bytes, stream);
cuda_memcpy_async_gpu_to_gpu(lwe_array, rotated_buffer,
num_blocks * big_lwe_size_bytes, stream);
if (shift_within_block == 0 || rotations == num_blocks) {
return;
}
// check if we have enough blocks for partial processing
if (rotations < num_blocks - 1) {
auto partial_current_blocks = lwe_array;
auto partial_next_blocks = &lwe_array[big_lwe_size];
size_t partial_block_count = num_blocks - rotations - 1;
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, partial_current_blocks, partial_current_blocks,
partial_next_blocks, bsk, ksk, partial_block_count, lut_bivariate);
}
// The right-most block is done separately as it does not
// need to recuperate the shifted bits from its right neighbour.
auto last_block = &lwe_array[(num_blocks - rotations - 1) * big_lwe_size];
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, last_block, last_block, bsk, ksk, 1, lut_univariate);
}
}
#endif // CUDA_SCALAR_OPS_CUH

View File

@@ -0,0 +1,109 @@
#include "linearalgebra/addition.cuh"
/*
* Perform the addition of two u32 input LWE ciphertext vectors.
* See the equivalent operation on u64 ciphertexts for more details.
*/
void cuda_add_lwe_ciphertext_vector_32(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in_1,
void *lwe_array_in_2,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
host_addition(stream, static_cast<uint32_t *>(lwe_array_out),
static_cast<uint32_t *>(lwe_array_in_1),
static_cast<uint32_t *>(lwe_array_in_2), input_lwe_dimension,
input_lwe_ciphertext_count);
}
/*
* Perform the addition of two u64 input LWE ciphertext vectors.
* - `v_stream` is a void pointer to the Cuda stream to be used in the kernel
* launch
* - `gpu_index` is the index of the GPU to be used in the kernel launch
* - `lwe_array_out` is an array of size
* `(input_lwe_dimension + 1) * input_lwe_ciphertext_count` that should have
* been allocated on the GPU before calling this function, and that will hold
* the result of the computation.
* - `lwe_array_in_1` is the first LWE ciphertext vector used as input, it
* should have been allocated and initialized before calling this function. It
* has the same size as the output array.
* - `lwe_array_in_2` is the second LWE ciphertext vector used as input, it
* should have been allocated and initialized before calling this function. It
* has the same size as the output array.
* - `input_lwe_dimension` is the number of mask elements in the two input and
* in the output ciphertext vectors
* - `input_lwe_ciphertext_count` is the number of ciphertexts contained in each
* input LWE ciphertext vector, as well as in the output.
*
* Each element (mask element or body) of the input LWE ciphertext vector 1 is
* added to the corresponding element in the input LWE ciphertext 2. The result
* is stored in the output LWE ciphertext vector. The two input LWE ciphertext
* vectors are left unchanged. This function is a wrapper to a device function
* that performs the operation on the GPU.
*/
void cuda_add_lwe_ciphertext_vector_64(cuda_stream_t *stream,
void *lwe_array_out,
void *lwe_array_in_1,
void *lwe_array_in_2,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
host_addition(stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_in_1),
static_cast<uint64_t *>(lwe_array_in_2), input_lwe_dimension,
input_lwe_ciphertext_count);
}
/*
* Perform the addition of a u32 input LWE ciphertext vector with a u32
* plaintext vector. See the equivalent operation on u64 data for more details.
*/
void cuda_add_lwe_ciphertext_vector_plaintext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *plaintext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
host_addition_plaintext(stream, static_cast<uint32_t *>(lwe_array_out),
static_cast<uint32_t *>(lwe_array_in),
static_cast<uint32_t *>(plaintext_array_in),
input_lwe_dimension, input_lwe_ciphertext_count);
}
/*
* Perform the addition of a u64 input LWE ciphertext vector with a u64 input
* plaintext vector.
* - `v_stream` is a void pointer to the Cuda stream to be used in the kernel
* launch
* - `gpu_index` is the index of the GPU to be used in the kernel launch
* - `lwe_array_out` is an array of size
* `(input_lwe_dimension + 1) * input_lwe_ciphertext_count` that should have
* been allocated on the GPU before calling this function, and that will hold
* the result of the computation.
* - `lwe_array_in` is the LWE ciphertext vector used as input, it should have
* been allocated and initialized before calling this function. It has the same
* size as the output array.
* - `plaintext_array_in` is the plaintext vector used as input, it should have
* been allocated and initialized before calling this function. It should be of
* size `input_lwe_ciphertext_count`.
* - `input_lwe_dimension` is the number of mask elements in the input and
* output LWE ciphertext vectors
* - `input_lwe_ciphertext_count` is the number of ciphertexts contained in the
* input LWE ciphertext vector, as well as in the output. It is also the number
* of plaintexts in the input plaintext vector.
*
* Each plaintext of the input plaintext vector is added to the body of the
* corresponding LWE ciphertext in the LWE ciphertext vector. The result of the
* operation is stored in the output LWE ciphertext vector. The two input
* vectors are unchanged. This function is a wrapper to a device function that
* performs the operation on the GPU.
*/
void cuda_add_lwe_ciphertext_vector_plaintext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_in,
void *plaintext_array_in, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
host_addition_plaintext(stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(plaintext_array_in),
input_lwe_dimension, input_lwe_ciphertext_count);
}

View File

@@ -0,0 +1,154 @@
#ifndef CUDA_ADD_CUH
#define CUDA_ADD_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "../utils/kernel_dimensions.cuh"
#include "device.h"
#include "linear_algebra.h"
#include <stdio.h>
template <typename T>
__global__ void plaintext_addition(T *output, T *lwe_input, T *plaintext_input,
uint32_t input_lwe_dimension,
uint32_t num_entries) {
int tid = threadIdx.x;
int plaintext_index = blockIdx.x * blockDim.x + tid;
if (plaintext_index < num_entries) {
int index =
plaintext_index * (input_lwe_dimension + 1) + input_lwe_dimension;
// Here we take advantage of the wrapping behaviour of uint
output[index] = lwe_input[index] + plaintext_input[plaintext_index];
}
}
template <typename T>
__host__ void host_addition_plaintext(cuda_stream_t *stream, T *output,
T *lwe_input, T *plaintext_input,
uint32_t lwe_dimension,
uint32_t lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
int num_blocks = 0, num_threads = 0;
int num_entries = lwe_ciphertext_count;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
cuda_memcpy_async_gpu_to_gpu(
output, lwe_input, (lwe_dimension + 1) * lwe_ciphertext_count, stream);
plaintext_addition<<<grid, thds, 0, stream->stream>>>(
output, lwe_input, plaintext_input, lwe_dimension, num_entries);
check_cuda_error(cudaGetLastError());
}
template <typename T>
__global__ void addition(T *output, T *input_1, T *input_2,
uint32_t num_entries) {
int tid = threadIdx.x;
int index = blockIdx.x * blockDim.x + tid;
if (index < num_entries) {
// Here we take advantage of the wrapping behaviour of uint
output[index] = input_1[index] + input_2[index];
}
}
// Coefficient-wise addition
template <typename T>
__host__ void host_addition(cuda_stream_t *stream, T *output, T *input_1,
T *input_2, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
addition<<<grid, thds, 0, stream->stream>>>(output, input_1, input_2,
num_entries);
check_cuda_error(cudaGetLastError());
}
template <typename T>
__global__ void subtraction(T *output, T *input_1, T *input_2,
uint32_t num_entries) {
int tid = threadIdx.x;
int index = blockIdx.x * blockDim.x + tid;
if (index < num_entries) {
// Here we take advantage of the wrapping behaviour of uint
output[index] = input_1[index] - input_2[index];
}
}
// Coefficient-wise subtraction
template <typename T>
__host__ void host_subtraction(cuda_stream_t *stream, T *output, T *input_1,
T *input_2, uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
subtraction<<<grid, thds, 0, stream->stream>>>(output, input_1, input_2,
num_entries);
check_cuda_error(cudaGetLastError());
}
template <typename T>
__global__ void radix_body_subtraction_inplace(T *lwe_ct, T *plaintext_input,
uint32_t input_lwe_dimension,
uint32_t num_entries) {
int tid = threadIdx.x;
int plaintext_index = blockIdx.x * blockDim.x + tid;
if (plaintext_index < num_entries) {
int index =
plaintext_index * (input_lwe_dimension + 1) + input_lwe_dimension;
// Here we take advantage of the wrapping behaviour of uint
lwe_ct[index] -= plaintext_input[plaintext_index];
}
}
template <typename T>
__host__ void host_subtraction_plaintext(cuda_stream_t *stream, T *output,
T *lwe_input, T *plaintext_input,
uint32_t input_lwe_dimension,
uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
cuda_memcpy_async_gpu_to_gpu(output, lwe_input,
input_lwe_ciphertext_count *
(input_lwe_dimension + 1) * sizeof(T),
stream);
radix_body_subtraction_inplace<<<grid, thds, 0, stream->stream>>>(
output, plaintext_input, input_lwe_dimension, num_entries);
check_cuda_error(cudaGetLastError());
}
#endif // CUDA_ADD_H

Some files were not shown because too many files have changed in this diff Show More