Compare commits


126 Commits

Author SHA1 Message Date
David Testé
c721b0fdaf chore(ci): WIP test hyperstack on pre_prod slab 2024-04-10 18:05:37 +02:00
David Testé
470667507d chore(ci): update usage of slab-github-runner to last version 2024-04-10 10:28:48 +02:00
Pedro Alves
ac424136ac chore(gpu): add lwe_chunk_size targeting RTX 4090 GPUs 2024-04-10 09:57:23 +02:00
Pedro Alves
9576c5fd77 feat(gpu): implement signed scalar ge, gt, le, lt, max, and min 2024-04-10 09:55:43 +02:00
Arthur Meyre
5df40597c2 chore(zk): add metadata for Cargo publish 2024-04-09 14:13:07 +02:00
Arthur Meyre
c807bce207 chore(tfhe): update ZK related parameters to use TUniform ones 2024-04-09 13:27:19 +02:00
Arthur Meyre
26747828eb chore(ci): add a cpu count script to avoid crashing on macOS on make -j 2024-04-09 13:27:19 +02:00
Arthur Meyre
4c645267ca chore(apis): expose TUniform 2^-40 parameters for js and C APIs 2024-04-09 13:27:19 +02:00
Arthur Meyre
bea9b77090 chore(shortint): add multi-bit GPU alias
- add easy access to compact PK TUniform params
2024-04-09 13:27:19 +02:00
David Testé
d1fe49fa2f refactor(shortint): add several p-error values for various parameter sets 2024-04-09 13:27:19 +02:00
Arthur Meyre
e5b3092414 refactor(shortint): add max noise level and p_fail fields to the parameters 2024-04-09 13:27:19 +02:00
tmontaigu
30fc8c7c74 feat(hlapi): bind cuda to FheInt 2024-04-09 07:59:35 +02:00
tmontaigu
2c106e8f01 feat(tfhe): plug zk-pok into all layers 2024-04-09 07:59:20 +02:00
Arthur Meyre
f868bb2397 feat(tfhe): add zk-pok code base
- integration of work done by Sarah in the repo

Co-authored-by: sarah el kazdadi <sarah.elkazdadi@zama.ai>
2024-04-09 07:59:20 +02:00
Arthur Meyre
691bff5970 chore(wop): remove outdated parameters and update other parameters 2024-04-09 07:57:54 +02:00
Arthur Meyre
555c984ab3 chore(docs): add information about IND CPA^D 2024-04-08 19:43:56 +02:00
Agnes Leroy
5b21363482 doc(gpu): add missing benchmark results 2024-04-08 18:16:45 +02:00
Pedro Alves
b021aa16d6 feat(gpu): implement signed if_then_else 2024-04-08 17:47:32 +02:00
dependabot[bot]
cda3f2b0ae chore(deps): bump dtolnay/rust-toolchain
Bumps [dtolnay/rust-toolchain](https://github.com/dtolnay/rust-toolchain) from be73d7920c329f220ce78e0234b8f96b7ae60248 to dc6353516c68da0f06325f42ad880f76a5e77ec9.
- [Release notes](https://github.com/dtolnay/rust-toolchain/releases)
- [Commits](be73d7920c...dc6353516c)

---
updated-dependencies:
- dependency-name: dtolnay/rust-toolchain
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-08 09:27:14 +02:00
dependabot[bot]
50df70047e chore(deps): bump codecov/codecov-action from 4.1.1 to 4.2.0
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4.1.1 to 4.2.0.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](c16abc29c9...7afa10ed9b)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-08 09:21:49 +02:00
Arthur Meyre
0e10acb9f0 chore(ci): fix clippy GPU lint 2024-04-08 09:19:29 +02:00
Arthur Meyre
3d369b0771 feat(core): add experimental fast KS primitives from keytricks 2024-04-08 09:19:29 +02:00
Arthur Meyre
8ae9c16019 test(core): add stair KS tests
- StairKS 4 parameters were optimized for norm2 = 2^4; we don't apply the
dot product, so keeping the loops is fine as the noise grows less
2024-04-08 09:19:29 +02:00
Arthur Meyre
ed8a32d106 chore(core): use proper new type to represent a message space modulus log 2024-04-08 09:19:29 +02:00
Agnes Leroy
f9a3984c7e doc(gpu): add benchmark results 2024-04-05 16:47:30 +02:00
tmontaigu
b6868e08d2 refactor(hlapi): improve conformance deserialization API
- Move safe_deserialization_conformant from being a free function
  to an associated function on each type: FheUint, FheInt,
  FheBool, Compact, Compressed, CompactList.

- Add safe_deserialization_conformant on CompactList for both Rust and
  the C API (although the C API is limited to a strict length check for now)

BREAKING_CHANGE: deserialize_safe_conformant was moved from being a free
function to being an associated method of the different types, and asks
for a &ServerKey, not conformance params

BREAKING_CHANGE: is_conformant is no longer really accessible
2024-04-05 10:13:47 +02:00
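
A minimal sketch of the new call shape, using stub types so it compiles on
its own -- the names and signatures below are illustrative stand-ins, not
tfhe-rs's exact API:

    // Stub types standing in for tfhe-rs items; illustrative only.
    struct ServerKey;
    struct FheUint8(u64);

    impl FheUint8 {
        // After the refactor: an associated function on the type itself,
        // asking for a &ServerKey instead of conformance params.
        fn safe_deserialize_conformant(bytes: &[u8], _sk: &ServerKey) -> Option<Self> {
            // A real implementation would check conformance against the
            // server key before accepting the ciphertext.
            bytes.first().map(|b| FheUint8(u64::from(*b)))
        }
    }

    fn main() {
        let sk = ServerKey;
        // Before: a free function, e.g. safe_deserialize_conformant(bytes, &params).
        // After: associated on each type (FheUint, FheInt, FheBool, ...).
        let ct = FheUint8::safe_deserialize_conformant(&[42], &sk);
        assert!(ct.is_some());
    }
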
Agnes Leroy
9ef3183d2e chore(gpu): fix multi-bit scalar mul benchmark 2024-04-05 09:20:48 +02:00
Yuxi Zhao
cdeb647629 chore(docs): update doc new structure and landing page
- update design to fix mobile display
- remove duplicates
- misc fixes and make sure user docs tests still run
- upload new designs
- add developer survey
- change designs and wordings
- delete unused images
- change page options
2024-04-04 18:57:11 +02:00
Agnes Leroy
88ff4d17cf chore(gpu): remove carry prop after scalar mul, single carry prop after scalar add/sub 2024-04-04 17:37:15 +02:00
Agnes Leroy
daadb115aa fix(gpu): fix mult 256 bit benchmark 2024-04-04 14:41:36 +02:00
Agnes Leroy
971b0cf0b6 feat(gpu): signed scalar rotate 2024-04-04 13:49:58 +02:00
Mayeul@Zama
4c8528d70d feat(hl): add boolean compression 2024-04-03 15:06:55 +02:00
Mayeul@Zama
865b1bdb7f feat(hl): add integer compression 2024-04-03 15:06:55 +02:00
Mayeul@Zama
7d2bb98893 feat(all): add conformance for compressed modulus switched 2024-04-03 15:06:55 +02:00
Mayeul@Zama
d58dd56433 refactor(all): decompress takes shared reference 2024-04-03 15:06:55 +02:00
Agnes Leroy
1fc3297af8 chore(gpu): add missing underscore in comparison tests 2024-04-03 14:38:55 +02:00
Agnes Leroy
cc72594c0d feat(gpu): signed comparisons 2024-04-03 14:38:55 +02:00
Arthur Meyre
3c39abed79 feat(core): add experimental lwe shrinking keyswitch from keytricks 2024-04-03 11:47:55 +02:00
Arthur Meyre
ab9cee529f chore(tfhe): export macro for named params to allow external use
- it is sometimes useful to be able to use the keycache mechanism from
outside the crate
2024-04-03 11:47:55 +02:00
Agnes Leroy
f98bbd9146 feat(gpu): signed eq/ne 2024-04-03 09:27:44 +02:00
Mayeul@Zama
0bad5c4b92 refactor(all): decompress takes shared reference
remove from/into decompression
2024-04-02 14:10:24 +02:00
tmontaigu
6360cbfdd1 feat(hlapi): bind sum for cuda backend 2024-04-02 10:27:17 +02:00
dependabot[bot]
d746eb8569 chore(deps): bump JS-DevTools/npm-publish from 3.1.0 to 3.1.1
Bumps [JS-DevTools/npm-publish](https://github.com/js-devtools/npm-publish) from 3.1.0 to 3.1.1.
- [Release notes](https://github.com/js-devtools/npm-publish/releases)
- [Changelog](https://github.com/JS-DevTools/npm-publish/blob/main/CHANGELOG.md)
- [Commits](79051c040d...19c28f1ef1)

---
updated-dependencies:
- dependency-name: JS-DevTools/npm-publish
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-02 09:41:54 +02:00
dependabot[bot]
6ae6a49e0d chore(deps): bump tj-actions/changed-files from 43.0.1 to 44.0.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 43.0.1 to 44.0.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](20576b4b9e...2d756ea4c5)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-02 09:41:25 +02:00
dependabot[bot]
1f8b310669 chore(deps): bump codecov/codecov-action from 4.1.0 to 4.1.1
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 4.1.0 to 4.1.1.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](54bcd8715e...c16abc29c9)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-04-02 09:40:10 +02:00
Agnes Leroy
cb1110fc79 feat(gpu): signed and unsigned scalar mul
+ remove small scalar mul
+ move around signed tests_cases
2024-03-29 11:22:51 +01:00
Arthur Meyre
80836c5dfd feat(tfhe): use concrete-fft 0.4.1 for faster pbs 128 by default 2024-03-29 10:36:42 +01:00
Pedro Alves
1b6c26994a feat(gpu): implement encrypted shift and rotate 2024-03-29 08:47:50 +01:00
tmontaigu
c20eccf248 feat(c_api): bind leading/trailing_ones/zeros and ilog2 2024-03-28 12:53:01 +01:00
tmontaigu
31302e532c feat(hlapi): bind leading/trailing_ones/zeros and ilog2 2024-03-28 12:53:01 +01:00
Mayeul@Zama
a11d690fd9 feat(integer): add modulus switch compression 2024-03-27 15:22:20 +01:00
Mayeul@Zama
d76c58c38a chore(integer): cleanup create_parametrized_test macro 2024-03-27 15:22:20 +01:00
Mayeul@Zama
5f1d6715ec feat(shortint): add modulus switch compression 2024-03-27 15:22:20 +01:00
Beka Barbakadze
1151a7c3ef fix(gpu): replace hardcoded degrees in multiplication.cuh with correct values. 2024-03-27 08:50:22 +01:00
Pedro Alves
5f975ff6f6 chore(gpu): replaces a mention of the low-latency PBS with just 'classical PBS' and removes a mention of the amortized variant 2024-03-26 12:15:24 -03:00
Arthur Meyre
fb4b975c34 feat(tfhe): add explicit decompress primitives for all CompressedServerKey
- we have a From implementation that allowed decompressing server keys,
but it was not visible enough
- make the decompress methods take &self instead of self as input: now
that we have the CUDA backend we could be performing several
decompressions, and taking self by value would force the user to clone data
2024-03-26 15:45:54 +01:00
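
A sketch of why taking &self matters here, with stand-in types (the real
CompressedServerKey lives in tfhe-rs, and decompress_to_gpu is a
hypothetical name used for illustration):

    // Stand-in types; illustrative only.
    struct CompressedServerKey(Vec<u8>);
    struct ServerKey;
    struct CudaServerKey;

    impl CompressedServerKey {
        // Taking &self lets one compressed key be decompressed several
        // times (e.g. once for the CPU and once for the CUDA backend)
        // without forcing the caller to clone the underlying data.
        fn decompress(&self) -> ServerKey {
            ServerKey
        }
        fn decompress_to_gpu(&self) -> CudaServerKey {
            CudaServerKey
        }
    }

    fn main() {
        let csk = CompressedServerKey(vec![0; 16]);
        let _cpu_key = csk.decompress();
        let _gpu_key = csk.decompress_to_gpu(); // csk is still usable here
    }
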
Arthur Meyre
0e9301cc4f chore(doc): fix incorrect comment in repo README 2024-03-26 14:30:30 +01:00
Mayeul@Zama
2469c0ffde fix(gpu): fix build.rs warning 2024-03-26 12:52:51 +01:00
Agnes Leroy
2955f0acfd fix(gpu): fix tfhe-cuda-backend release 2024-03-26 09:11:54 +01:00
David Testé
f5fb578858 chore(ci): build cuda crates on aws instead on github runner 2024-03-26 09:11:54 +01:00
Agnes Leroy
61283254f0 fix(gpu): fix gpu clippy 2024-03-26 09:11:32 +01:00
David Testé
0dce4b5e93 chore(tfhe): rename integer ilog2 operations 2024-03-26 09:11:32 +01:00
dependabot[bot]
a296f33966 chore(deps): bump rtCamp/action-slack-notify from 2.2.1 to 2.3.0
Bumps [rtCamp/action-slack-notify](https://github.com/rtcamp/action-slack-notify) from 2.2.1 to 2.3.0.
- [Release notes](https://github.com/rtcamp/action-slack-notify/releases)
- [Commits](b24d75fe0e...4e5fb42d24)

---
updated-dependencies:
- dependency-name: rtCamp/action-slack-notify
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-25 13:51:08 +01:00
Arthur Meyre
2db5ac5b3d fix(wop): fix empty extracted bits list rejected as invalid by the wopbs
- empty list is interpreted as being a trivial 0
- add non-regression test from github issue
2024-03-25 09:42:33 +01:00
dependabot[bot]
292903a24a chore(deps): bump JS-DevTools/npm-publish from 3.0.1 to 3.1.0
Bumps [JS-DevTools/npm-publish](https://github.com/js-devtools/npm-publish) from 3.0.1 to 3.1.0.
- [Release notes](https://github.com/js-devtools/npm-publish/releases)
- [Changelog](https://github.com/JS-DevTools/npm-publish/blob/main/CHANGELOG.md)
- [Commits](4b07b26a2f...79051c040d)

---
updated-dependencies:
- dependency-name: JS-DevTools/npm-publish
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-25 09:39:13 +01:00
dependabot[bot]
52bbb2d1e6 chore(deps): bump tj-actions/changed-files from 43.0.0 to 43.0.1
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 43.0.0 to 43.0.1.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](77af4bed28...20576b4b9e)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-25 09:38:52 +01:00
Agnes Leroy
8d876a988e chore(test): add a test macro with only classical params for CPU tests
Remove the macro in radix_parallel/tests_unsigned/mod.rs
2024-03-25 09:16:21 +01:00
Agnes Leroy
b57f4d6764 chore(gpu): speedup right arithmetic scalar shift 2024-03-25 09:16:21 +01:00
Agnes Leroy
82ef2cc672 feat(gpu): signed scalar shift 2024-03-25 09:16:21 +01:00
Agnes Leroy
6535bd1cca feat(gpu): add an entry point to decompress integer server key to cuda server key 2024-03-25 09:16:00 +01:00
Beka Barbakadze
8cd8f8c176 feat(gpu): implement overflowing_sub 2024-03-22 17:15:13 +01:00
Agnes Leroy
bcbab11950 fix(gpu): fix bug in integer mult when k > 1 2024-03-22 10:33:48 +01:00
Arthur Meyre
3b291ac37d chore(tfhe): make sure the GPU module is present during doc compilation
- fix lints
2024-03-22 10:12:06 +01:00
David Testé
e2f6ddbd46 chore(ci): create workflow to release tfhe-cuda-backend crate 2024-03-21 14:50:31 +01:00
Mayeul@Zama
31e2949906 style(all): regroup uses 2024-03-21 13:58:02 +01:00
Mayeul@Zama
2bf23ae9fb fix(core): fix doctest comment 2024-03-21 13:58:02 +01:00
Mayeul@Zama
edf41b5c84 fix(shortint): fix test 2024-03-21 13:58:02 +01:00
Mayeul@Zama
98ba269c1d chore(tfhe): remove useless comments 2024-03-21 13:58:02 +01:00
Mayeul@Zama
ffda4d3fbe refactor(integer): ciphertext module 2024-03-21 13:58:02 +01:00
Mayeul@Zama
4046df90e9 refactor(shortint): ciphertext module 2024-03-21 13:58:02 +01:00
Mayeul@Zama
7e723f1ec2 refactor(shortint): factorize PBS code 2024-03-21 13:58:02 +01:00
Mayeul@Zama
13f7adec66 feat(core_crypto): rename modulus switch compression 2024-03-21 13:58:02 +01:00
Mayeul@Zama
259d5b6827 chore(tfhe): cleanup unused macros 2024-03-21 13:58:02 +01:00
Mayeul@Zama
4798ee17c4 chore(tfhe): make macros scoped 2024-03-21 13:58:02 +01:00
Mayeul@Zama
1c8f6ce75d refactor(shortint): split shortint parametrized tests in 2 files 2024-03-21 13:58:02 +01:00
Mayeul@Zama
7f7591f1b4 fix(shortint): fix and rename tests 2024-03-21 13:58:02 +01:00
Arthur Meyre
d06f958990 chore(ci): force the removal of the 4090 label for PRs even for failures
- always() forces evaluation of the label-removal step even if there was
an earlier failure, which is irrelevant for removing a label
2024-03-21 10:19:31 +01:00
Pedro Alves
b4619bb745 fix(gpu): fix compilation when the user doesn't have a CUDA-capable device 2024-03-20 13:25:22 -03:00
Mayeul@Zama
f911af6e18 chore(c_api): remove useless feature flags 2024-03-20 15:07:10 +01:00
Mayeul@Zama
48309ff773 fix(c_api): run clippy on high-level-c-api 2024-03-20 15:07:10 +01:00
Pedro Alves
06af752bfc fix(gpu): include tests_and_benchmarks/include in format_tfhe_cuda_backend.sh 2024-03-20 08:58:29 +01:00
Arthur Meyre
73f8383def fix(integer): fix the CRT LUT generation 2024-03-19 19:12:59 +01:00
David Testé
edca34c2c9 chore(ci): run aws gpu benchmark only on p3 instances
p4 (A100) and p5 (H100) resources are too scarce on AWS EC2 to use.
A100 instances, for example, almost always fail on spawn requests.
2024-03-19 15:42:02 +01:00
Pedro Alves
e6fd6823de chore(gpu): implement a macro evaluated at compile time to retrieve the architecture 2024-03-19 11:47:10 +01:00
Agnes Leroy
ff8912bf66 chore(gpu): reduce scratch time 2024-03-19 11:47:10 +01:00
Pedro Alves
86e5640e06 fix(gpu): fix out-of-memory error in the custom benchmark tool 2024-03-19 03:07:36 -03:00
Agnes Leroy
0136642f89 feat(gpu): signed scalar bitop 2024-03-18 21:13:42 +01:00
tmontaigu
5a19114417 feat(integer): make bitnot a PBS-free operation
BREAKING CHANGE: bitnot_parallelized is now bitnot, as the operation
no longer requires multithreading
2024-03-18 17:36:35 +01:00
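
Why a bitwise NOT can be PBS-free, checked on clear values (an
illustration of the identity, not tfhe-rs code): for a k-bit block,
!x == (2^k - 1) - x, a purely linear operation, so no programmable
bootstrap -- and hence no multithreading -- is needed; callers simply
move from bitnot_parallelized to bitnot.

    fn main() {
        let k = 2u32; // bits per block, e.g. for 2_2-style parameters
        let modulus = 1u64 << k;
        for x in 0..modulus {
            // Bitwise NOT within a k-bit block is a subtraction from the
            // block's max value -- linear, hence PBS-free on ciphertexts.
            assert_eq!((modulus - 1) - x, !x & (modulus - 1));
        }
        println!("!x == (2^k - 1) - x holds for all k-bit x");
    }
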
Arthur Meyre
7fdcde0449 chore(ci): change slack notifications to be less confusing and more robust
- sometimes the notification would say the job failed while it did not really fail
- use the generic form, which can never be wrong
2024-03-18 13:59:15 +01:00
Arthur Meyre
8a1cc3750b chore(core): add asserts on input and output LweDimension to check they match
- we ran into an issue where the dimensions did not agree and got weird
results because of that
2024-03-18 13:59:04 +01:00
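
A sketch of the kind of check described, with a stand-in LweDimension
newtype (illustrative, not the crate's actual keyswitch code):

    #[derive(Debug, PartialEq, Clone, Copy)]
    struct LweDimension(usize);

    fn keyswitch_checked(
        in_dim: LweDimension,
        ksk_in_dim: LweDimension,
        out_dim: LweDimension,
        ksk_out_dim: LweDimension,
    ) {
        // Mismatched dimensions silently produce garbage results,
        // so fail loudly up front instead.
        assert_eq!(in_dim, ksk_in_dim, "input LweDimension mismatch");
        assert_eq!(out_dim, ksk_out_dim, "output LweDimension mismatch");
        // ... the actual keyswitch would run here ...
    }

    fn main() {
        keyswitch_checked(
            LweDimension(742),
            LweDimension(742),
            LweDimension(2048),
            LweDimension(2048),
        );
    }
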
dependabot[bot]
719bad6e7d chore(deps): bump actions/checkout from 4.1.1 to 4.1.2
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.1.1 to 4.1.2.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](b4ffde65f4...9bb56186c3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-18 13:58:46 +01:00
dependabot[bot]
a1483c6c9f chore(deps): bump tj-actions/changed-files from 42.1.0 to 43.0.0
Bumps [tj-actions/changed-files](https://github.com/tj-actions/changed-files) from 42.1.0 to 43.0.0.
- [Release notes](https://github.com/tj-actions/changed-files/releases)
- [Changelog](https://github.com/tj-actions/changed-files/blob/main/HISTORY.md)
- [Commits](aa08304bd4...77af4bed28)

---
updated-dependencies:
- dependency-name: tj-actions/changed-files
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-03-18 11:40:53 +01:00
David Testé
c15e35782d chore(ci): use new workflow fine-grained token 2024-03-18 10:10:26 +01:00
Miles
55f666d323 chore(tfhe): fix typos 2024-03-15 11:48:53 +01:00
Mayeul@Zama
155822bb99 fix(tfhe): fix formatting in macros 2024-03-15 09:26:53 +01:00
Mayeul@Zama
1647634c8e fix(script): fix formatting 2024-03-15 09:26:53 +01:00
Mayeul@Zama
26aa20a78a fix(doc): fix warning 2024-03-15 09:26:53 +01:00
Mayeul@Zama
53b89fdfae fix(doc): add syntax highlighting to rust doctests 2024-03-15 09:26:53 +01:00
Agnes Leroy
5976ba51b1 feat(gpu): signed bitops 2024-03-15 09:12:14 +01:00
Mayeul@Zama
de6db4bc9d fix(trivium): check warnings in benches 2024-03-14 13:08:38 +01:00
tmontaigu
de8568a5bb fix(integer): fix parallel carry propagation on empty input 2024-03-14 10:34:56 +01:00
David Testé
83e9671071 chore(ci): check sha256 sum for nvm installation script 2024-03-14 09:22:26 +01:00
David Testé
9efe4ac69e chore(ci): format javascript code using prettier 2024-03-14 09:22:26 +01:00
David Testé
937c364c6d chore(ci): add format recipes for javascript code 2024-03-14 09:22:26 +01:00
David Testé
b40897adbe chore(bench): benchmark server keys with wasm
Benchmarks are run for the 1_1 and 2_2 parameter sets on a compressed
server key.
2024-03-14 09:22:26 +01:00
David Testé
54ba8de83f chore(wasm): allow parallel generation of shortint server key 2024-03-14 09:22:26 +01:00
Pedro Alves
20d92afaaf feat(gpu): add support to larger polynomials on multi-bit PBS 2024-03-13 15:45:08 +01:00
Mayeul@Zama
865b667ffd feat(core): add lwe ct modulus switch compression 2024-03-13 15:25:35 +01:00
Mayeul@Zama
3b35cc8269 refactor(core): simplify fast_pbs_modulus_switch 2024-03-13 15:25:35 +01:00
tmontaigu
8e19bd1b79 feat(integer): improve propagation & sum algorithms
For full_propagation, the changes make it do the best thing
depending on the degrees of the input.

First, the sum now uses full_propagate as its last step
instead of a custom full propagation. This leads to
timing improvements for <= 8 bits, as full_propagation
selects the sequential propagation, which is always faster at
these precisions.

This will also improve any function that uses a sum with small
precision (like ilog2, leading/trailing_zeros/ones).

This will also improve performance for all precisions when computations
are done on modest hardware.

Second, the core algorithm of the sum now reasons
in terms of columns, not rows, which makes the code simpler.
This makes us make fewer mistakes when computing the range
for which we have to extract messages and carries, leading to fewer PBSes.

This leads to better performance on modest hardware, or when the
precision plus the number of elements starts to saturate the CPU threads.
2024-03-13 14:55:11 +01:00
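
The column-wise idea on clear radix digits (a toy model, not the tfhe-rs
implementation): summing per column gives an exact bound on how large
each position can grow, so it is clear which positions need a
message/carry extraction -- one PBS each in the encrypted setting:

    fn main() {
        let base = 4u64; // message modulus of one block, e.g. 2_2 params
        // Three 4-block radix integers, least significant digit first.
        let rows = [[3u64, 2, 1, 0], [1, 3, 2, 1], [2, 2, 3, 0]];

        // Reason per column: each column sum is bounded by
        // rows.len() * (base - 1) + carry, known in advance.
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for col in 0..4 {
            let col_sum: u64 = rows.iter().map(|r| r[col]).sum::<u64>() + carry;
            // In the encrypted version, extracting the message and the
            // carry of col_sum each costs one PBS; the per-column bound
            // says exactly where extraction is needed, avoiding extra PBSes.
            result[col] = col_sum % base;
            carry = col_sum / base;
        }
        // 27 + 109 + 58 = 194 in clear, for a quick sanity check.
        let total: u64 = result
            .iter()
            .enumerate()
            .map(|(i, d)| d * base.pow(i as u32))
            .sum();
        assert_eq!(total + carry * base.pow(4), 27 + 109 + 58);
        println!("column-wise sum = {total} (carry out = {carry})");
    }
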
Agnes Leroy
4e5e30550b feat(gpu): optimize gpu int mul vector add part
- reduce keyswitch operations by a factor of two, reduce PBS layers by a
  factor of two, remove compression and decompression operations,
  and remove most of the memcopies.

- expose sum ciphertexts standalone entry point
2024-03-13 16:03:42 +04:00
Agnes Leroy
ca40c8673f chore(gpu): fix compilation without a device 2024-03-13 11:30:22 +01:00
Mayeul@Zama
9f70be9c95 feat(tfhe): disable debug assertions in devo profile
this makes KS-PBS almost twice as fast
2024-03-13 09:43:22 +01:00
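
The mechanism behind the speedup, in plain Rust (the devo profile name
comes from the commit; the snippet only illustrates what disabling debug
assertions does): with debug-assertions off, debug_assert! bodies compile
to nothing, removing per-iteration checks from hot loops such as the ones
inside KS-PBS.

    fn main() {
        let data = [1u64, 2, 3, 4];
        let mut acc = 0u64;
        for (i, x) in data.iter().enumerate() {
            // Compiled out entirely when the active Cargo profile sets
            // debug-assertions = false; kept (and costly in hot loops)
            // when debug assertions are enabled.
            debug_assert!(i < data.len());
            acc = acc.wrapping_add(*x);
        }
        println!("acc = {acc}");
    }
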
Mayeul@Zama
dc44f5e517 feat(tfhe): update rust toolchain 2024-03-13 09:43:22 +01:00
Agnes Leroy
6f954bb538 feat(gpu): signed scalar sub 2024-03-12 15:32:57 +01:00
Pedro Alves
d3801446ff chore(gpu): rename the low-latency PBS to just PBS and the fast variants to cg 2024-03-12 08:50:44 -03:00
732 changed files with 67587 additions and 28460 deletions

View File

@@ -21,7 +21,7 @@ jobs:
uses: actions-ecosystem/action-remove-labels@2ce5d41b4b6aa8503e285553f75ed56e0a40bae0
with:
# We use a PAT to have the same user (zama-bot) for label deletion as for creation.
github_token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
github_token: ${{ secrets.FHE_ACTIONS_TOKEN }}
labels: approved
# Add label only if the review is approved and if the label doesn't already exist
@@ -30,5 +30,5 @@ jobs:
if: ${{ github.event_name == 'pull_request_review' && github.event.review.state == 'approved' && !contains(fromJSON(env.LABELS), 'approved') }}
with:
# We need to use a PAT to be able to trigger `labeled` event for the other workflow.
github_token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
github_token: ${{ secrets.FHE_ACTIONS_TOKEN }}
labels: approved

View File

@@ -23,17 +23,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-big
fast-tests:
@@ -45,14 +44,14 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -60,6 +59,10 @@ jobs:
run: |
make test_concrete_csprng
- name: Run tfhe-zk-pok tests
run: |
make test_zk_pok
- name: Run core tests
run: |
AVX512_SUPPORT=ON make test_core_crypto
@@ -107,7 +110,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "Fast AWS tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -120,19 +123,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (fast-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (fast-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -29,10 +29,10 @@ jobs:
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -61,7 +61,7 @@ jobs:
make test_high_level_api_gpu
- uses: actions-ecosystem/action-remove-labels@2ce5d41b4b6aa8503e285553f75ed56e0a40bae0
if: ${{ github.event_name == 'pull_request' }}
if: ${{ always() && github.event_name == 'pull_request' }}
with:
labels: 4090_test
github_token: ${{ secrets.GITHUB_TOKEN }}
@@ -69,7 +69,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "CUDA RTX 4090 tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -23,17 +23,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: gpu-test
cuda-tests-linux:
@@ -56,7 +55,7 @@ jobs:
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
@@ -113,7 +112,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "CUDA AWS tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -126,19 +125,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (cuda-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (cuda-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -24,17 +24,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-big
unsigned-integer-tests:
@@ -46,14 +45,14 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -76,7 +75,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "Unsigned Integer tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -89,19 +88,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (unsigned-integer-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (unsigned-integer-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -24,17 +24,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-big
signed-integer-tests:
@@ -46,14 +45,14 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -80,7 +79,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "Signed Integer tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -93,19 +92,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (signed-integer-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (signed-integer-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -24,17 +24,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-big
cpu-tests:
@@ -46,14 +45,14 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -61,6 +60,10 @@ jobs:
run: |
make test_concrete_csprng
- name: Run tfhe-zk-pok tests
run: |
make test_zk_pok
- name: Run core tests
run: |
AVX512_SUPPORT=ON make test_core_crypto
@@ -102,7 +105,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "CPU tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -115,19 +118,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (cpu-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (cpu-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -24,17 +24,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-small
wasm-tests:
@@ -46,30 +45,37 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
- name: Install Node
run: |
make install_node
- name: Run fmt checks
run: |
make check_fmt_js
- name: Run js on wasm API tests
run: |
make test_nodejs_wasm_api_in_docker
- name: Run parallel wasm tests
run: |
make install_node
make ci_test_web_js_api_parallel
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "WASM tests finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -82,19 +88,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (wasm-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (wasm-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -53,7 +53,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -63,7 +63,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -103,11 +103,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -126,11 +126,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Boolean benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Boolean benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -23,7 +23,7 @@ jobs:
fail-fast: false
steps:
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
- uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Install and run newline linter checks
if: matrix.os == 'ubuntu-latest'

View File

@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Get actionlint
run: |

View File

@@ -53,7 +53,7 @@ jobs:
echo "Fork git sha: ${{ inputs.fork_git_sha }}"
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: ${{ inputs.fork_repo }}
ref: ${{ inputs.fork_git_sha }}
@@ -63,13 +63,13 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
- name: Check for file changes
id: changed-files
uses: tj-actions/changed-files@aa08304bd477b800d468db44fe10f6c61f7f7b11
uses: tj-actions/changed-files@2d756ea4c53f7f6b397767d8723b3a10a9f35bf2
with:
files_yaml: |
tfhe:
@@ -99,7 +99,7 @@ jobs:
make test_shortint_cov
- name: Upload tfhe coverage to Codecov
uses: codecov/codecov-action@54bcd8715eee62d40e33596ef5e8f0f48dbbccab
uses: codecov/codecov-action@7afa10ed9b269c561c2336fd862446844e0cbf71
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
with:
token: ${{ secrets.CODECOV_TOKEN }}
@@ -113,7 +113,7 @@ jobs:
make test_integer_cov
- name: Upload tfhe coverage to Codecov
uses: codecov/codecov-action@54bcd8715eee62d40e33596ef5e8f0f48dbbccab
uses: codecov/codecov-action@7afa10ed9b269c561c2336fd862446844e0cbf71
if: steps.changed-files.outputs.tfhe_any_changed == 'true'
with:
token: ${{ secrets.CODECOV_TOKEN }}
@@ -124,7 +124,7 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}

View File

@@ -53,7 +53,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -63,7 +63,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -94,11 +94,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -117,11 +117,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "PBS benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "PBS benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -2,31 +2,8 @@
name: Core crypto GPU benchmarks
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
# This input is not used in this workflow but still mandatory since a calling workflow could
# use it. If a triggering command include a user_inputs field, then the triggered workflow
# must include this very input, otherwise the workflow won't be called.
# See start_full_benchmarks.yml as example.
user_inputs:
description: "Type of benchmarks to run"
type: string
default: "weekly_benchmarks"
env:
CARGO_TERM_COLOR: always
@@ -34,10 +11,27 @@ env:
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
jobs:
run-core-crypto-benchmarks:
name: Execute GPU core crypto benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
setup-ec2:
name: Setup EC2 instance (cuda-benchmarks)
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: hyperstack
profile: gpu-bench
core-crypto-benchmarks:
name: CUDA core crypto benchmarks
needs: setup-ec2
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
@@ -45,23 +39,29 @@ jobs:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
gcc: 11
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
CMAKE_VERSION: 3.29.1
steps:
- name: Instance configuration used
- name: Install dependencies
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
sudo apt update
sudo apt install -y checkinstall zlib1g-dev libssl-dev
wget https://github.com/Kitware/CMake/releases/download/v${{ env.CMAKE_VERSION }}/cmake-${{ env.CMAKE_VERSION }}.tar.gz
tar -zxvf cmake-${{ env.CMAKE_VERSION }}.tar.gz
cd cmake-${{ env.CMAKE_VERSION }}
./bootstrap
make -j"$(nproc)"
sudo make install
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -71,7 +71,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -124,11 +124,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -144,14 +144,39 @@ jobs:
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
# FIXME This action needs docker to be installed on the machine beforehand.
# - name: Slack Notification
# if: ${{ failure() }}
# continue-on-error: true
# uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
# env:
# SLACK_COLOR: ${{ job.status }}
# SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
# SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
# SLACK_MESSAGE: "PBS GPU benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
# SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
# SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
teardown-ec2:
name: Teardown EC2 instance (cuda-benchmarks)
if: ${{ always() && needs.setup-ec2.result != 'skipped' }}
needs: [ setup-ec2, core-crypto-benchmarks ]
runs-on: ubuntu-latest
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "PBS GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
SLACK_MESSAGE: "EC2 teardown (cuda-benchmarks) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -25,17 +25,16 @@ jobs:
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
instance-id: ${{ steps.start-instance.outputs.ec2-instance-id }}
aws-region: ${{ steps.start-instance.outputs.aws-region }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: cpu-small
csprng-randomness-tests:
@@ -47,14 +46,14 @@ jobs:
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -65,7 +64,7 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "concrete-csprng randomness check finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
@@ -78,19 +77,18 @@ jobs:
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@8562abbdc96b3619bd5debe1fb934db298f9a044
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
region: ${{ needs.setup-ec2.outputs.aws-region }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (csprng-randomness-tests) failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "EC2 teardown (csprng-randomness-tests) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -39,7 +39,7 @@ jobs:
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -52,16 +52,16 @@ jobs:
} >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run integer benchmarks
run: |
@@ -103,10 +103,10 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "Integer RTX 4090 full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Integer RTX 4090 full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
cuda-core-crypto-benchmarks:
name: Cuda core crypto benchmarks (RTX 4090)
@@ -120,7 +120,7 @@ jobs:
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -133,16 +133,16 @@ jobs:
} >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run integer benchmarks
run: |
@@ -185,14 +185,14 @@ jobs:
- name: Slack Notification
if: ${{ !success() && !cancelled() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "Core crypto RTX 4090 full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Core crypto RTX 4090 full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
remove_github_label:
name: Remove 4090 bench label
if: ${{ github.event_name == 'pull_request' }}
if: ${{ always() && github.event_name == 'pull_request' }}
needs: [cuda-integer-benchmarks, cuda-core-crypto-benchmarks]
runs-on: ["self-hosted", "4090-desktop"]
steps:

View File

@@ -46,7 +46,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -56,7 +56,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -97,11 +97,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -120,11 +120,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Integer benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -74,7 +74,7 @@ jobs:
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -92,16 +92,16 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
@@ -148,11 +148,11 @@ jobs:
steps:
- name: Notify
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Integer full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -2,23 +2,9 @@
name: Integer GPU benchmarks
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
pull_request:
env:
CARGO_TERM_COLOR: always
@@ -29,10 +15,27 @@ env:
RUST_MIN_STACK: "8388608"
jobs:
run-integer-benchmarks:
name: Execute integer benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !cancelled() }}
setup-ec2:
name: Setup EC2 instance (cuda-benchmarks)
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: hyperstack
profile: gpu-bench
cuda-integer-benchmarks:
name: CUDA integer benchmarks
needs: setup-ec2
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
@@ -40,23 +43,29 @@ jobs:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
gcc: 11
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
CMAKE_VERSION: 3.29.1
steps:
- name: Instance configuration used
- name: Install dependencies
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
sudo apt update
sudo apt install -y checkinstall zlib1g-dev libssl-dev
wget https://github.com/Kitware/CMake/releases/download/v${{ env.CMAKE_VERSION }}/cmake-${{ env.CMAKE_VERSION }}.tar.gz
tar -zxvf cmake-${{ env.CMAKE_VERSION }}.tar.gz
cd cmake-${{ env.CMAKE_VERSION }}
./bootstrap
make -j"$(nproc)"
sudo make install
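Building CMake from a freshly downloaded tarball trusts whatever the release URL serves. A checksum-pinned variant of the same step, sketched with a placeholder digest (the real value would be taken from the Kitware release assets):

```yaml
- name: Install CMake (checksum-verified sketch)
  run: |
    wget "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz"
    # <expected-sha256> is a placeholder, not a real digest.
    echo "<expected-sha256>  cmake-${CMAKE_VERSION}.tar.gz" | sha256sum -c -
    tar -zxf "cmake-${CMAKE_VERSION}.tar.gz"
    cd "cmake-${CMAKE_VERSION}" && ./bootstrap && make -j"$(nproc)" && sudo make install
```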
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -66,7 +75,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -111,7 +120,7 @@ jobs:
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--hardware "n2-H100x1" \
--backend gpu \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
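Replacing `inputs.instance_type` with the literal `n2-H100x1` pins the hardware tag to the Hyperstack profile. If the profile changes later, a repository variable would keep the tag in one place; a sketch using the `vars` context (`BENCH_HARDWARE` is a hypothetical variable name):

```yaml
env:
  # Falls back to the current profile when the variable is unset.
  BENCH_HARDWARE: ${{ vars.BENCH_HARDWARE || 'n2-H100x1' }}
steps:
  - name: Parse results
    run: |
      python3 ./ci/benchmark_parser.py target/criterion "${RESULTS_FILENAME}" \
        --hardware "${{ env.BENCH_HARDWARE }}"
```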
@@ -128,11 +137,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -148,14 +157,39 @@ jobs:
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
# FIXME This action needs docker to be installed on the machine beforehand.
# - name: Slack Notification
# if: ${{ !success() && !cancelled() }}
# continue-on-error: true
# uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
# env:
# SLACK_COLOR: ${{ job.status }}
# SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
# SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
# SLACK_MESSAGE: "Integer GPU benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
# SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
# SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
teardown-ec2:
name: Teardown EC2 instance (cuda-benchmarks)
if: ${{ always() && needs.setup-ec2.result != 'skipped' }}
needs: [ setup-ec2, cuda-integer-benchmarks ]
runs-on: ubuntu-latest
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ !success() && !cancelled() }}
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
SLACK_MESSAGE: "EC2 teardown (cuda-benchmarks) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -2,31 +2,9 @@
name: Integer GPU full benchmarks
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
# This input is not used in this workflow but still mandatory since a calling workflow could
# use it. If a triggering command includes a user_inputs field, then the triggered workflow

# must include this very input, otherwise the workflow won't be called.
# See start_full_benchmarks.yml as example.
user_inputs:
description: "Type of benchmarks to run"
type: string
default: "weekly_benchmarks"
pull_request:
env:
CARGO_TERM_COLOR: always
@@ -36,11 +14,28 @@ env:
RUST_MIN_STACK: "8388608"
jobs:
integer-benchmarks:
name: Execute integer benchmarks for all operations flavor
runs-on: ${{ github.event.inputs.runner_name }}
setup-ec2:
name: Setup EC2 instance (cuda-full-benchmarks)
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: hyperstack
profile: gpu-bench
cuda-integer-full-benchmarks:
name: CUDA integer full benchmarks
needs: setup-ec2
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
timeout-minutes: 1440 # 24 hours
if: ${{ !cancelled() }}
continue-on-error: true
strategy:
fail-fast: false
@@ -52,19 +47,25 @@ jobs:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
gcc: 11
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
CMAKE_VERSION: 3.29.1
steps:
- name: Instance configuration used
- name: Install dependencies
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
sudo apt update
sudo apt install -y checkinstall zlib1g-dev libssl-dev
wget https://github.com/Kitware/CMake/releases/download/v${{ env.CMAKE_VERSION }}/cmake-${{ env.CMAKE_VERSION }}.tar.gz
tar -zxvf cmake-${{ env.CMAKE_VERSION }}.tar.gz
cd cmake-${{ env.CMAKE_VERSION }}
./bootstrap
make -j"$(nproc)"
sudo make install
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -82,7 +83,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -107,11 +108,11 @@ jobs:
} >> "${GITHUB_ENV}"
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
@@ -121,7 +122,7 @@ jobs:
run: |
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--hardware "n2-H100x1" \
--backend gpu \
--project-version "${{ env.COMMIT_HASH }}" \
--branch ${{ github.ref_name }} \
@@ -151,19 +152,39 @@ jobs:
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
slack-notification:
name: Slack Notification
runs-on: ${{ github.event.inputs.runner_name }}
if: ${{ !success() && !cancelled() }}
needs: integer-benchmarks
# FIXME This action needs docker to be installed on the machine beforehand.
# - name: Slack Notification
# if: ${{ !success() && !cancelled() }}
# continue-on-error: true
# uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
# env:
# SLACK_COLOR: ${{ job.status }}
# SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
# SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
# SLACK_MESSAGE: "Integer GPU full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
# SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
# SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
teardown-ec2:
name: Teardown EC2 instance (cuda-full-benchmarks)
if: ${{ always() && needs.setup-ec2.result != 'skipped' }}
needs: [ setup-ec2, cuda-integer-full-benchmarks ]
runs-on: ubuntu-latest
steps:
- name: Notify
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
SLACK_MESSAGE: "EC2 teardown (cuda-full-benchmarks) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -46,7 +46,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -56,7 +56,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -97,11 +97,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -120,11 +120,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Integer benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -2,23 +2,9 @@
name: Integer GPU Multi-bit benchmarks
on:
# Allows you to run this workflow manually from the Actions tab as an alternative.
workflow_dispatch:
inputs:
instance_id:
description: "Instance ID"
type: string
instance_image_id:
description: "Instance AMI ID"
type: string
instance_type:
description: "Instance product type"
type: string
runner_name:
description: "Action runner name"
type: string
request_id:
description: "Slab request ID"
type: string
pull_request:
env:
CARGO_TERM_COLOR: always
@@ -29,11 +15,28 @@ env:
RUST_MIN_STACK: "8388608"
jobs:
cuda-integer-benchmarks:
name: Execute integer multi-bit benchmarks in EC2
runs-on: ${{ github.event.inputs.runner_name }}
setup-ec2:
name: Setup EC2 instance (cuda-multi-bit-benchmarks)
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: hyperstack
profile: gpu-bench
cuda-integer-multi-bit-benchmarks:
name: CUDA integer multi-bit benchmarks
needs: setup-ec2
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
timeout-minutes: 1440 # 24 hours
if: ${{ !cancelled() }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
@@ -41,23 +44,29 @@ jobs:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
gcc: 11
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
CMAKE_VERSION: 3.29.1
steps:
- name: Instance configuration used
- name: Install dependencies
run: |
echo "IDs: ${{ inputs.instance_id }}"
echo "AMI: ${{ inputs.instance_image_id }}"
echo "Type: ${{ inputs.instance_type }}"
echo "Request ID: ${{ inputs.request_id }}"
sudo apt update
sudo apt install -y checkinstall zlib1g-dev libssl-dev
wget https://github.com/Kitware/CMake/releases/download/v${{ env.CMAKE_VERSION }}/cmake-${{ env.CMAKE_VERSION }}.tar.gz
tar -zxvf cmake-${{ env.CMAKE_VERSION }}.tar.gz
cd cmake-${{ env.CMAKE_VERSION }}
./bootstrap
make -j"$(nproc)"
sudo make install
- name: Get benchmark date
run: |
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -67,7 +76,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -112,7 +121,7 @@ jobs:
COMMIT_HASH="$(git describe --tags --dirty)"
python3 ./ci/benchmark_parser.py target/criterion ${{ env.RESULTS_FILENAME }} \
--database tfhe_rs \
--hardware ${{ inputs.instance_type }} \
--hardware "n2-H100x1" \
--backend gpu \
--project-version "${COMMIT_HASH}" \
--branch ${{ github.ref_name }} \
@@ -129,11 +138,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -149,14 +158,39 @@ jobs:
-d @${{ env.RESULTS_FILENAME }} \
${{ secrets.SLAB_URL }}
# FIXME This action needs docker to be installed on the machine beforehand.
# - name: Slack Notification
# if: ${{ !success() && !cancelled() }}
# continue-on-error: true
# uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
# env:
# SLACK_COLOR: ${{ job.status }}
# SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
# SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
# SLACK_MESSAGE: "Integer GPU benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
# SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
# SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
teardown-ec2:
name: Teardown EC2 instance (cuda-multi-bit-benchmarks)
if: ${{ always() && needs.setup-ec2.result != 'skipped' }}
needs: [ setup-ec2, cuda-integer-multi-bit-benchmarks ]
runs-on: ubuntu-latest
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL_PRE_PROD }}
job-secret: ${{ secrets.JOB_SECRET }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ !success() && !cancelled() }}
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Integer GPU benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
SLACK_MESSAGE: "EC2 teardown (cuda-multi-bit-benchmarks) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -31,10 +31,10 @@ jobs:
timeout-minutes: 720
steps:
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
- uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Install latest stable
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
@@ -74,6 +74,10 @@ jobs:
run: |
make test_concrete_csprng
- name: Run tfhe-zk-pok tests
run: |
make test_zk_pok
- name: Run core tests
run: |
make test_core_crypto
@@ -133,7 +137,7 @@ jobs:
- name: Slack Notification
if: ${{ needs.cargo-builds.result != 'skipped' }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ needs.cargo-builds.result }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}

View File

@@ -30,7 +30,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -49,7 +49,7 @@ jobs:
- name: Publish web package
if: ${{ inputs.push_web_package }}
uses: JS-DevTools/npm-publish@4b07b26a2f6e0a51846e1870223e545bae91c552
uses: JS-DevTools/npm-publish@19c28f1ef146469e409470805ea4279d47c3d35c
with:
token: ${{ secrets.NPM_TOKEN }}
package: tfhe/pkg/package.json
@@ -65,7 +65,7 @@ jobs:
- name: Publish Node package
if: ${{ inputs.push_node_package }}
uses: JS-DevTools/npm-publish@4b07b26a2f6e0a51846e1870223e545bae91c552
uses: JS-DevTools/npm-publish@19c28f1ef146469e409470805ea4279d47c3d35c
with:
token: ${{ secrets.NPM_TOKEN }}
package: tfhe/pkg/package.json
@@ -74,7 +74,7 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}

View File

@@ -18,7 +18,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -32,7 +32,7 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}

.github/workflows/make_release_cuda.yml (new file, +129 lines)
View File

@@ -0,0 +1,129 @@
# Publish new release of tfhe-cuda-backend on crates.io.
name: Publish CUDA release
on:
workflow_dispatch:
inputs:
dry_run:
description: "Dry-run"
type: boolean
default: true
push_to_crates:
description: "Push to crate"
type: boolean
default: true
env:
ACTION_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
jobs:
setup-ec2:
name: Setup EC2 instance (publish-cuda-release)
runs-on: ubuntu-latest
outputs:
runner-name: ${{ steps.start-instance.outputs.label }}
steps:
- name: Start instance
id: start-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: start
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
backend: aws
profile: gpu-test
publish-cuda-release:
name: Publish CUDA Release
needs: setup-ec2
runs-on: ${{ needs.setup-ec2.outputs.runner-name }}
strategy:
fail-fast: false
# explicit include-based build matrix, of known valid options
matrix:
include:
- os: ubuntu-22.04
cuda: "12.2"
gcc: 9
env:
CUDA_PATH: /usr/local/cuda-${{ matrix.cuda }}
steps:
- name: Checkout
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
- name: Set up home
run: |
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install latest stable
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: stable
- name: Export CUDA variables
if: ${{ !cancelled() }}
run: |
echo "$CUDA_PATH/bin" >> "${GITHUB_PATH}"
{
echo "CUDA_PATH=$CUDA_PATH";
echo "LD_LIBRARY_PATH=$CUDA_PATH/lib:$LD_LIBRARY_PATH";
echo "CUDACXX=/usr/local/cuda-${{ matrix.cuda }}/bin/nvcc";
} >> "${GITHUB_ENV}"
# Specify the correct host compilers
- name: Export gcc and g++ variables
if: ${{ !cancelled() }}
run: |
{
echo "CC=/usr/bin/gcc-${{ matrix.gcc }}";
echo "CXX=/usr/bin/g++-${{ matrix.gcc }}";
echo "CUDAHOSTCXX=/usr/bin/g++-${{ matrix.gcc }}";
echo "HOME=/home/ubuntu";
} >> "${GITHUB_ENV}"
- name: Publish crates.io package
if: ${{ inputs.push_to_crates }}
env:
CRATES_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
DRY_RUN: ${{ inputs.dry_run && '--dry-run' || '' }}
run: |
cargo publish -p tfhe-cuda-backend --token ${{ env.CRATES_TOKEN }} ${{ env.DRY_RUN }}
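The `DRY_RUN` binding uses the expression language's short-circuit operators as a ternary: `cond && 'x' || 'y'` yields `'x'` when `cond` is true, `'y'` otherwise. The pattern is safe here but has a known pitfall worth noting:

```yaml
env:
  DRY_RUN: ${{ inputs.dry_run && '--dry-run' || '' }}  # OK: '--dry-run' is truthy
  # Pitfall: if the "true" branch were '' or 0 (falsy), the || fallback
  # would always win and the ternary emulation would silently break.
```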
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "tfhe-cuda-backend release finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
teardown-ec2:
name: Teardown EC2 instance (publish-release)
if: ${{ always() && needs.setup-ec2.result != 'skipped' }}
needs: [ setup-ec2, publish-cuda-release ]
runs-on: ubuntu-latest
steps:
- name: Stop instance
id: stop-instance
uses: zama-ai/slab-github-runner@1dced74825027fe3d481392163ed8fc56813fb5d
with:
mode: stop
github-token: ${{ secrets.SLAB_ACTION_TOKEN }}
slab-url: ${{ secrets.SLAB_BASE_URL }}
job-secret: ${{ secrets.JOB_SECRET }}
label: ${{ needs.setup-ec2.outputs.runner-name }}
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_MESSAGE: "EC2 teardown (publish-cuda-release) finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"

View File

@@ -17,10 +17,10 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
- name: Checkout lattice-estimator
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: malb/lattice-estimator
path: lattice_estimator
@@ -42,7 +42,7 @@ jobs:
- name: Slack Notification
if: ${{ always() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}

View File

@@ -45,7 +45,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -55,7 +55,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -95,11 +95,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -118,11 +118,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Shortint benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Shortint benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -53,7 +53,7 @@ jobs:
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -71,16 +71,16 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
@@ -142,11 +142,11 @@ jobs:
steps:
- name: Notify
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Shortint full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Shortint full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -46,7 +46,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -56,7 +56,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -97,11 +97,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -120,11 +120,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Signed integer benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -52,7 +52,7 @@ jobs:
echo "Request ID: ${{ inputs.request_id }}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -70,16 +70,16 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Run benchmarks with AVX512
run: |
@@ -126,11 +126,11 @@ jobs:
steps:
- name: Notify
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer full benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Signed integer full benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -46,7 +46,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -56,7 +56,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -97,11 +97,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -120,11 +120,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "Signed integer benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "Signed integer benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -58,13 +58,13 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
- name: Check for file changes
id: changed-files
uses: tj-actions/changed-files@aa08304bd477b800d468db44fe10f6c61f7f7b11
uses: tj-actions/changed-files@2d756ea4c53f7f6b397767d8723b3a10a9f35bf2
with:
files_yaml: |
common_benches:
@@ -111,11 +111,11 @@ jobs:
- .github/workflows/wasm_client_benchmark.yml
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Start AWS job in Slab
# If manually triggered check that the current bench has been requested
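With `files_yaml`, tj-actions/changed-files exposes one set of outputs per key (e.g. `common_benches_any_changed`), which is presumably what gates the Slab launch. A sketch of such a gate, assuming the step id `changed-files` from above:

```yaml
- name: Start AWS job in Slab
  if: ${{ steps.changed-files.outputs.common_benches_any_changed == 'true' }}
  run: |
    echo "benchmark-relevant files changed; launching Slab job"
```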

View File

@@ -30,16 +30,16 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout tfhe-rs
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Set benchmarks type as weekly
if: (github.event_name == 'workflow_dispatch' && inputs.benchmark_type == 'weekly') || github.event.schedule == '0 1 * * 6'
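The cron string `0 1 * * 6` reads minute 0, hour 1 UTC, any day of month, any month, weekday 6, i.e. Saturdays at 01:00 UTC. The matching trigger block would look roughly like this (the real input definition may differ):

```yaml
on:
  schedule:
    - cron: '0 1 * * 6'   # min hour dom mon dow -> Saturdays, 01:00 UTC
  workflow_dispatch:
    inputs:
      benchmark_type:
        description: "Type of benchmarks to run"
        type: string
```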

View File

@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
- name: Save repo
@@ -26,12 +26,12 @@ jobs:
with:
source_repo: "zama-ai/tfhe-rs"
source_branch: "main"
destination_repo: "https://${{ secrets.BOT_USERNAME }}:${{ secrets.CONCRETE_ACTIONS_TOKEN }}@github.com/${{ secrets.SYNC_DEST_REPO }}"
destination_repo: "https://${{ secrets.BOT_USERNAME }}:${{ secrets.FHE_ACTIONS_TOKEN }}@github.com/${{ secrets.SYNC_DEST_REPO }}"
destination_branch: "main"
- name: git-sync tags
uses: wei/git-sync@55c6b63b4f21607da0e9877ca9b4d11a29fc6d83
with:
source_repo: "zama-ai/tfhe-rs"
source_branch: "refs/tags/*"
destination_repo: "https://${{ secrets.BOT_USERNAME }}:${{ secrets.CONCRETE_ACTIONS_TOKEN }}@github.com/${{ secrets.SYNC_DEST_REPO }}"
destination_repo: "https://${{ secrets.BOT_USERNAME }}:${{ secrets.FHE_ACTIONS_TOKEN }}@github.com/${{ secrets.SYNC_DEST_REPO }}"
destination_branch: "refs/tags/*"

View File

@@ -53,7 +53,7 @@ jobs:
echo "BENCH_DATE=$(date --iso-8601=seconds)" >> "${GITHUB_ENV}"
- name: Checkout tfhe-rs repo with tags
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
fetch-depth: 0
@@ -63,7 +63,7 @@ jobs:
echo "HOME=/home/ubuntu" >> "${GITHUB_ENV}"
- name: Install rust
uses: dtolnay/rust-toolchain@be73d7920c329f220ce78e0234b8f96b7ae60248
uses: dtolnay/rust-toolchain@dc6353516c68da0f06325f42ad880f76a5e77ec9
with:
toolchain: nightly
@@ -104,11 +104,11 @@ jobs:
path: ${{ env.RESULTS_FILENAME }}
- name: Checkout Slab repo
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633
with:
repository: zama-ai/slab
path: slab
token: ${{ secrets.CONCRETE_ACTIONS_TOKEN }}
token: ${{ secrets.FHE_ACTIONS_TOKEN }}
- name: Send data to Slab
shell: bash
@@ -127,11 +127,11 @@ jobs:
- name: Slack Notification
if: ${{ failure() }}
continue-on-error: true
uses: rtCamp/action-slack-notify@b24d75fe0e728a4bf9fc42ee217caa686d141ee8
uses: rtCamp/action-slack-notify@4e5fb42d249be6a45a298f3c9543b111b02f7907
env:
SLACK_COLOR: ${{ job.status }}
SLACK_CHANNEL: ${{ secrets.SLACK_CHANNEL }}
SLACK_ICON: https://pbs.twimg.com/profile_images/1274014582265298945/OjBKP9kn_400x400.png
SLACK_MESSAGE: "WASM benchmarks failed. (${{ env.ACTION_RUN_URL }})"
SLACK_MESSAGE: "WASM benchmarks finished with status: ${{ job.status }}. (${{ env.ACTION_RUN_URL }})"
SLACK_USERNAME: ${{ secrets.BOT_USERNAME }}
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}

View File

@@ -1,6 +1,13 @@
[workspace]
resolver = "2"
members = ["tfhe", "tasks", "apps/trivium", "concrete-csprng", "backends/tfhe-cuda-backend"]
members = [
"tfhe",
"tfhe-zk-pok",
"tasks",
"apps/trivium",
"concrete-csprng",
"backends/tfhe-cuda-backend",
]
[profile.bench]
lto = "fat"
@@ -17,3 +24,4 @@ lto = "off"
inherits = "dev"
opt-level = 3
lto = "off"
debug-assertions = false

View File

@@ -3,6 +3,7 @@ OS:=$(shell uname)
RS_CHECK_TOOLCHAIN:=$(shell cat toolchain.txt | tr -d '\n')
CARGO_RS_CHECK_TOOLCHAIN:=+$(RS_CHECK_TOOLCHAIN)
TARGET_ARCH_FEATURE:=$(shell ./scripts/get_arch_feature.sh)
CPU_COUNT=$(shell ./scripts/cpu_count.sh)
RS_BUILD_TOOLCHAIN:=stable
CARGO_RS_BUILD_TOOLCHAIN:=+$(RS_BUILD_TOOLCHAIN)
CARGO_PROFILE?=release
@@ -119,7 +120,12 @@ install_wasm_pack: install_rs_build_toolchain
.PHONY: install_node # Install latest version of NodeJS via nvm
install_node:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.3/install.sh | $(SHELL)
curl -o nvm_install.sh https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.3/install.sh
@echo "2ed5e94ba12434370f0358800deb69f514e8bce90f13beb0e1b241d42c6abafd nvm_install.sh" > nvm_checksum
@sha256sum -c nvm_checksum
@rm nvm_checksum
$(SHELL) nvm_install.sh
@rm nvm_install.sh
source ~/.bashrc
$(SHELL) -i -c 'nvm install $(NODE_VERSION)' || \
( echo "Unable to install node, unknown error." && exit 1 )
@@ -149,24 +155,51 @@ check_actionlint_installed:
@actionlint --version > /dev/null 2>&1 || \
( echo "Unable to locate actionlint. Try installing it: https://github.com/rhysd/actionlint/releases" && exit 1 )
.PHONY: check_nvm_installed # Check if Node Version Manager is installed
check_nvm_installed:
@source ~/.nvm/nvm.sh && nvm --version > /dev/null 2>&1 || \
( echo "Unable to locate Node. Run 'make install_node'" && exit 1 )
.PHONY: fmt # Format rust code
fmt: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt
.PHONY: fmt_js # Format javascript code
fmt_js: check_nvm_installed
source ~/.nvm/nvm.sh && \
nvm install $(NODE_VERSION) && \
nvm use $(NODE_VERSION) && \
$(MAKE) -C tfhe/web_wasm_parallel_tests fmt
.PHONY: fmt_gpu # Format rust and cuda code
fmt_gpu: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt
cd "$(TFHECUDA_SRC)" && ./format_tfhe_cuda_backend.sh
.PHONY: fmt_c_tests # Format c tests
fmt_c_tests:
find tfhe/c_api_tests/ -regex '.*\.\(cpp\|hpp\|cu\|c\|h\)' -exec clang-format -style=file -i {} \;
.PHONY: check_fmt # Check rust code format
check_fmt: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt --check
.PHONY: check_fmt_c_tests # Check C tests format
check_fmt_c_tests:
find tfhe/c_api_tests/ -regex '.*\.\(cpp\|hpp\|cu\|c\|h\)' -exec clang-format --dry-run --Werror -style=file {} \;
.PHONY: check_fmt_gpu # Check rust and cuda code format
check_fmt_gpu: install_rs_check_toolchain
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" fmt --check
cd "$(TFHECUDA_SRC)" && ./format_tfhe_cuda_backend.sh -c
.PHONY: check_fmt_js # Check javascript code format
check_fmt_js: check_nvm_installed
source ~/.nvm/nvm.sh && \
nvm install $(NODE_VERSION) && \
nvm use $(NODE_VERSION) && \
$(MAKE) -C tfhe/web_wasm_parallel_tests check_fmt
.PHONY: clippy_gpu # Run clippy lints on tfhe with "gpu" enabled
clippy_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy \
@@ -228,7 +261,7 @@ clippy: install_rs_check_toolchain
.PHONY: clippy_c_api # Run clippy lints enabling the boolean, shortint and the C API
clippy_c_api: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api \
-p $(TFHE_SPEC) -- --no-deps -D warnings
.PHONY: clippy_js_wasm_api # Run clippy lints enabling the boolean, shortint, integer and the js wasm API
@@ -244,13 +277,13 @@ clippy_tasks:
.PHONY: clippy_trivium # Run clippy lints on Trivium app
clippy_trivium: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy \
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy --all-targets \
-p tfhe-trivium -- --no-deps -D warnings
.PHONY: clippy_all_targets # Run clippy lints on all targets (benches, examples, etc.)
clippy_all_targets:
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy --all-targets \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,zk-pok-experimental \
-p $(TFHE_SPEC) -- --no-deps -D warnings
.PHONY: clippy_concrete_csprng # Run clippy lints on concrete-csprng
@@ -259,9 +292,14 @@ clippy_concrete_csprng:
--features=$(TARGET_ARCH_FEATURE) \
-p concrete-csprng -- --no-deps -D warnings
.PHONY: clippy_zk_pok # Run clippy lints on tfhe-zk-pok
clippy_zk_pok:
RUSTFLAGS="$(RUSTFLAGS)" cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" clippy --all-targets \
-p tfhe-zk-pok -- --no-deps -D warnings
.PHONY: clippy_all # Run all clippy targets
clippy_all: clippy clippy_boolean clippy_shortint clippy_integer clippy_all_targets clippy_c_api \
clippy_js_wasm_api clippy_tasks clippy_core clippy_concrete_csprng clippy_trivium
clippy_js_wasm_api clippy_tasks clippy_core clippy_concrete_csprng clippy_zk_pok clippy_trivium
.PHONY: clippy_fast # Run main clippy targets
clippy_fast: clippy clippy_all_targets clippy_c_api clippy_js_wasm_api clippy_tasks clippy_core \
@@ -324,14 +362,14 @@ symlink_c_libs_without_fingerprint:
.PHONY: build_c_api # Build the C API for boolean, shortint and integer
build_c_api: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,$(FORWARD_COMPAT_FEATURE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,zk-pok-experimental,$(FORWARD_COMPAT_FEATURE) \
-p $(TFHE_SPEC)
@"$(MAKE)" symlink_c_libs_without_fingerprint
.PHONY: build_c_api_gpu # Build the C API for boolean, shortint and integer
build_c_api_gpu: install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) build --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,gpu \
--features=$(TARGET_ARCH_FEATURE),boolean-c-api,shortint-c-api,high-level-c-api,zk-pok-experimental,gpu \
-p $(TFHE_SPEC)
@"$(MAKE)" symlink_c_libs_without_fingerprint
@@ -347,7 +385,7 @@ build_web_js_api: install_rs_build_toolchain install_wasm_pack
cd tfhe && \
RUSTFLAGS="$(WASM_RUSTFLAGS)" rustup run "$(RS_BUILD_TOOLCHAIN)" \
wasm-pack build --release --target=web \
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api,zk-pok-experimental
.PHONY: build_web_js_api_parallel # Build the js API targeting the web browser with parallelism support
build_web_js_api_parallel: install_rs_check_toolchain install_wasm_pack
@@ -355,7 +393,7 @@ build_web_js_api_parallel: install_rs_check_toolchain install_wasm_pack
rustup component add rust-src --toolchain $(RS_CHECK_TOOLCHAIN) && \
RUSTFLAGS="$(WASM_RUSTFLAGS) -C target-feature=+atomics,+bulk-memory,+mutable-globals" rustup run $(RS_CHECK_TOOLCHAIN) \
wasm-pack build --release --target=web \
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api,parallel-wasm-api \
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api,parallel-wasm-api,zk-pok-experimental \
-Z build-std=panic_abort,std
.PHONY: build_node_js_api # Build the js API targeting nodejs
@@ -363,7 +401,7 @@ build_node_js_api: install_rs_build_toolchain install_wasm_pack
cd tfhe && \
RUSTFLAGS="$(WASM_RUSTFLAGS)" rustup run "$(RS_BUILD_TOOLCHAIN)" \
wasm-pack build --release --target=nodejs \
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api
-- --features=boolean-client-js-wasm-api,shortint-client-js-wasm-api,integer-client-js-wasm-api,zk-pok-experimental
.PHONY: build_concrete_csprng # Build concrete_csprng
build_concrete_csprng: install_rs_build_toolchain
@@ -373,10 +411,10 @@ build_concrete_csprng: install_rs_build_toolchain
.PHONY: test_core_crypto # Run the tests of the core_crypto module including experimental ones
test_core_crypto: install_rs_build_toolchain install_rs_check_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental -p $(TFHE_SPEC) -- core_crypto::
--features=$(TARGET_ARCH_FEATURE),experimental,zk-pok-experimental -p $(TFHE_SPEC) -- core_crypto::
@if [[ "$(AVX512_SUPPORT)" == "ON" ]]; then \
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_CHECK_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),experimental,$(AVX512_FEATURE) -p $(TFHE_SPEC) -- core_crypto::; \
--features=$(TARGET_ARCH_FEATURE),experimental,zk-pok-experimental,$(AVX512_FEATURE) -p $(TFHE_SPEC) -- core_crypto::; \
fi
.PHONY: test_core_crypto_cov # Run the tests of the core_crypto module with code coverage
@@ -399,7 +437,7 @@ test_cuda_backend:
mkdir -p "$(TFHECUDA_BUILD)" && \
cd "$(TFHECUDA_BUILD)" && \
cmake .. -DCMAKE_BUILD_TYPE=Release -DTFHE_CUDA_BACKEND_BUILD_TESTS=ON && \
make -j && \
make -j "$(CPU_COUNT)" && \
make test
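A bare `make -j` spawns unbounded parallel jobs, which can exhaust memory on large builds; bounding it with `$(CPU_COUNT)` fixes that. The contents of `scripts/cpu_count.sh` are not part of this diff; a portable helper might look like the following, framed as a CI step to match the workflow snippets above:

```yaml
- name: Compute CPU count (hypothetical equivalent of scripts/cpu_count.sh)
  run: |
    # nproc exists on Linux; macOS uses sysctl instead.
    if command -v nproc >/dev/null 2>&1; then
      echo "CPU_COUNT=$(nproc)" >> "$GITHUB_ENV"
    else
      echo "CPU_COUNT=$(sysctl -n hw.ncpu)" >> "$GITHUB_ENV"
    fi
```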
.PHONY: test_gpu # Run the tests of the core_crypto module including experimental on the gpu backend
@@ -547,7 +585,7 @@ test_integer_cov: install_rs_check_toolchain install_tarpaulin
.PHONY: test_high_level_api # Run all the tests for high_level_api
test_high_level_api: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache -p $(TFHE_SPEC) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,zk-pok-experimental -p $(TFHE_SPEC) \
-- high_level_api::
test_high_level_api_gpu: install_rs_build_toolchain install_cargo_nextest
@@ -558,13 +596,14 @@ test_high_level_api_gpu: install_rs_build_toolchain install_cargo_nextest
.PHONY: test_user_doc # Run tests from the .md documentation
test_user_doc: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) --doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache -p $(TFHE_SPEC) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,pbs-stats,zk-pok-experimental \
-p $(TFHE_SPEC) \
-- test_user_docs::
.PHONY: test_user_doc_gpu # Run tests for GPU from the .md documentation
test_user_doc_gpu: install_rs_build_toolchain
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) --doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,gpu -p $(TFHE_SPEC) \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,internal-keycache,gpu,zk-pok-experimental -p $(TFHE_SPEC) \
-- test_user_docs::
.PHONY: test_fhe_strings # Run tests for fhe_strings example
@@ -603,33 +642,46 @@ test_concrete_csprng:
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
--features=$(TARGET_ARCH_FEATURE) -p concrete-csprng
.PHONY: test_zk_pok # Run tfhe-zk-pok-experimental tests
test_zk_pok:
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --profile $(CARGO_PROFILE) \
-p tfhe-zk-pok
.PHONY: doc # Build rust doc
doc: install_rs_check_toolchain
@# Even though we are not in docs.rs, this allows us to "just" build the doc
DOCS_RS=1 \
RUSTDOCFLAGS="--html-in-header katex-header.html" \
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer --no-deps -p $(TFHE_SPEC)
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,gpu,internal-keycache,experimental --no-deps -p $(TFHE_SPEC)
.PHONY: docs # Build rust doc alias for doc
docs: doc
.PHONY: lint_doc # Build rust doc with linting enabled
lint_doc: install_rs_check_toolchain
@# Even though we are not in docs.rs, this allows us to "just" build the doc
DOCS_RS=1 \
RUSTDOCFLAGS="--html-in-header katex-header.html -Dwarnings" \
cargo "$(CARGO_RS_CHECK_TOOLCHAIN)" doc \
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer -p $(TFHE_SPEC) --no-deps
--features=$(TARGET_ARCH_FEATURE),boolean,shortint,integer,gpu,internal-keycache,experimental -p $(TFHE_SPEC) --no-deps
.PHONY: lint_docs # Build rust doc with linting enabled alias for lint_doc
lint_docs: lint_doc
.PHONY: format_doc_latex # Format the documentation latex equations to avoid broken rendering.
format_doc_latex:
cargo xtask format_latex_doc
RUSTFLAGS="" cargo xtask format_latex_doc
@"$(MAKE)" --no-print-directory fmt
@printf "\n===============================\n\n"
@printf "Please manually inspect changes made by format_latex_doc, rustfmt can break equations \
if the line length is exceeded\n"
@printf "\n===============================\n"
.PHONY: check_md_docs_are_tested # Checks that the rust codeblocks in our .md files are tested
check_md_docs_are_tested:
RUSTFLAGS="" cargo xtask check_tfhe_docs_are_tested
.PHONY: check_compile_tests # Build tests in debug without running them
check_compile_tests:
RUSTFLAGS="$(RUSTFLAGS)" cargo $(CARGO_RS_BUILD_TOOLCHAIN) test --no-run \
@@ -649,7 +701,7 @@ check_compile_tests_benches_gpu: install_rs_build_toolchain
mkdir -p "$(TFHECUDA_BUILD)" && \
cd "$(TFHECUDA_BUILD)" && \
cmake .. -DCMAKE_BUILD_TYPE=Debug -DTFHE_CUDA_BACKEND_BUILD_TESTS=ON -DTFHE_CUDA_BACKEND_BUILD_BENCHMARKS=ON && \
make -j
make -j "$(CPU_COUNT)"
.PHONY: build_nodejs_test_docker # Build a docker image with tools to run nodejs tests for wasm API
build_nodejs_test_docker:
@@ -890,13 +942,15 @@ sha256_bool: install_rs_check_toolchain
--features=$(TARGET_ARCH_FEATURE),boolean
.PHONY: pcc # pcc stands for pre commit checks (except GPU)
pcc: no_tfhe_typo no_dbg_log check_fmt lint_doc clippy_all check_compile_tests
pcc: no_tfhe_typo no_dbg_log check_fmt lint_doc check_md_docs_are_tested clippy_all \
check_compile_tests
.PHONY: pcc_gpu # pcc stands for pre commit checks for GPU compilation
pcc_gpu: clippy_gpu clippy_cuda_backend check_compile_tests_benches_gpu
.PHONY: fpcc # pcc stands for pre commit checks, the f stands for fast
fpcc: no_tfhe_typo no_dbg_log check_fmt lint_doc clippy_fast check_compile_tests
fpcc: no_tfhe_typo no_dbg_log check_fmt lint_doc check_md_docs_are_tested clippy_fast \
check_compile_tests
.PHONY: conformance # Automatically fix problems that can be fixed
conformance: fix_newline fmt

View File

@@ -1,6 +1,10 @@
<p align="center">
<!-- product name logo -->
<img width=600 src="https://user-images.githubusercontent.com/5758427/231206749-8f146b97-3c5a-4201-8388-3ffa88580415.png">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/zama-ai/tfhe-rs/assets/157474013/5283e0ba-da1e-43af-9f2a-c5221367a12b">
<source media="(prefers-color-scheme: light)" srcset="https://github.com/zama-ai/tfhe-rs/assets/157474013/b94a8c96-7595-400b-9311-70765c706955">
<img width=600 alt="Zama TFHE-rs">
</picture>
</p>
<hr/>
@@ -127,13 +131,13 @@ fn main() -> Result<(), Box<dyn std::error::Error>> {
// Clear equivalent computations: 1344 * 5 = 6720
let encrypted_res_mul = &encrypted_a * &encrypted_b;
// Clear equivalent computations: 1344 >> 5 = 42
// Clear equivalent computations: 6720 >> 5 = 210
encrypted_a = &encrypted_res_mul >> &encrypted_b;
// Clear equivalent computations: let casted_a = a as u8;
let casted_a: FheUint8 = encrypted_a.cast_into();
// Clear equivalent computations: min(42, 7) = 7
// Clear equivalent computations: min(210, 7) = 7
let encrypted_res_min = &casted_a.min(&encrypted_c);
// Operation between clear and encrypted data:
@@ -173,12 +177,12 @@ to run in release mode with cargo's `--release` flag to have the best performanc
<br></br>
### Tutorials
- [Homomorphic Parity Bit](https://docs.zama.ai/tfhe-rs/tutorials/parity_bit)
- [Homomorphic Case Changing on Ascii String](https://docs.zama.ai/tfhe-rs/tutorials/ascii_fhe_string)
- [[Video tutorial] Implement signed integers using TFHE-rs ](https://www.zama.ai/post/video-tutorial-implement-signed-integers-ssing-tfhe-rs)
- [Homomorphic parity bit](https://docs.zama.ai/tfhe-rs/tutorials/parity_bit)
- [Homomorphic case changing on Ascii string](https://docs.zama.ai/tfhe-rs/tutorials/ascii_fhe_string)
- [Boolean SHA256 with TFHE-rs](https://www.zama.ai/post/boolean-sha256-tfhe-rs)
- [Dark Market with TFHE-rs](https://www.zama.ai/post/dark-market-tfhe-rs)
- [Regular Expression Engine with TFHE-rs](https://www.zama.ai/post/regex-engine-tfhe-rs)
- [Dark market with TFHE-rs](https://www.zama.ai/post/dark-market-tfhe-rs)
- [Regular expression engine with TFHE-rs](https://www.zama.ai/post/regex-engine-tfhe-rs)
*Explore more useful resources in [TFHE-rs tutorials](https://docs.zama.ai/tfhe-rs/tutorials) and [Awesome Zama repo](https://github.com/zama-ai/awesome-zama)*
<br></br>
@@ -202,6 +206,12 @@ with `red_cost_model = reduction.RC.BDGL16`.
When a new update is published in the Lattice Estimator, we update parameters accordingly.
### Security Model
The default parameters for the TFHE-rs library are chosen considering the IND-CPA security model, and are selected with a bootstrapping failure probability fixed at p_error = $2^{-40}$. In particular, it is assumed that the results of decrypted computations are not shared by the secret key owner with any third parties, as such an action can lead to leakage of the secret encryption key. If you are designing an application where decryptions must be shared, you will need to craft custom encryption parameters which are chosen in consideration of the IND-CPA^D security model [1].
[1] Li, Baiyu, et al. "Securing approximate homomorphic encryption using differential privacy." Annual International Cryptology Conference. Cham: Springer Nature Switzerland, 2022. https://eprint.iacr.org/2022/816.pdf
#### Side-Channel Attacks
Mitigation for side-channel attacks has not yet been implemented in TFHE-rs,
@@ -240,7 +250,11 @@ This software is distributed under the **BSD-3-Clause-Clear** license. If you ha
## Support
<a target="_blank" href="https://community.zama.ai">
<img src="https://github.com/zama-ai/tfhe-rs/assets/157474013/8da6cf5b-51a0-4c86-9e75-fd0e4a4c64a4">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/zama-ai/tfhe-rs/assets/157474013/08656d0a-3f44-4126-b8b6-8c601dff5380">
<source media="(prefers-color-scheme: light)" srcset="https://github.com/zama-ai/tfhe-rs/assets/157474013/1c9c9308-50ac-4aab-a4b9-469bb8c536a4">
<img alt="Support">
</picture>
</a>
🌟 If you find this project helpful or interesting, please consider giving it a star on GitHub! Your support helps to grow the community and motivates further development.

View File

@@ -15,7 +15,6 @@ Example of a Rust main below:
```rust
use tfhe::{ConfigBuilder, generate_keys, FheBool};
use tfhe::prelude::*;
use tfhe_trivium::TriviumStream;
fn get_hexadecimal_string_from_lsb_first_stream(a: Vec<bool>) -> String {
@@ -139,10 +138,8 @@ Example code:
```rust
use tfhe::shortint::prelude::*;
use tfhe::shortint::CastingKey;
use tfhe::{ConfigBuilder, generate_keys, FheUint64};
use tfhe::prelude::*;
use tfhe_trivium::TriviumStreamShortint;
fn test_shortint() {

View File

@@ -1,10 +1,8 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheBool};
use tfhe_trivium::KreyviumStream;
use criterion::Criterion;
pub fn kreyvium_bool_gen(c: &mut Criterion) {
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);

View File

@@ -1,10 +1,8 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheUint64, FheUint8};
use tfhe_trivium::{KreyviumStreamByte, TransCiphering};
use criterion::Criterion;
pub fn kreyvium_byte_gen(c: &mut Criterion) {
let config = ConfigBuilder::default()
.enable_function_evaluation()

View File

@@ -1,12 +1,9 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::shortint::prelude::*;
use tfhe::shortint::KeySwitchingKey;
use tfhe::{generate_keys, ConfigBuilder, FheUint64};
use tfhe_trivium::{KreyviumStreamShortint, TransCiphering};
use criterion::Criterion;
pub fn kreyvium_shortint_warmup(c: &mut Criterion) {
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);

View File

@@ -1,10 +1,8 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheBool};
use tfhe_trivium::TriviumStream;
use criterion::Criterion;
pub fn trivium_bool_gen(c: &mut Criterion) {
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);

View File

@@ -1,10 +1,8 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheUint64, FheUint8};
use tfhe_trivium::{TransCiphering, TriviumStreamByte};
use criterion::Criterion;
pub fn trivium_byte_gen(c: &mut Criterion) {
let config = ConfigBuilder::default().build();
let (client_key, server_key) = generate_keys(config);

View File

@@ -1,12 +1,9 @@
use criterion::Criterion;
use tfhe::prelude::*;
use tfhe::shortint::prelude::*;
use tfhe::shortint::KeySwitchingKey;
use tfhe::{generate_keys, ConfigBuilder, FheUint64};
use tfhe_trivium::{TransCiphering, TriviumStreamShortint};
use criterion::Criterion;
pub fn trivium_shortint_warmup(c: &mut Criterion) {
let config = ConfigBuilder::default().build();
let (hl_client_key, hl_server_key) = generate_keys(config);

View File

@@ -2,12 +2,10 @@
//! for the representation of the inner bits.
use crate::static_deque::StaticDeque;
use rayon::prelude::*;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheBool, ServerKey};
use rayon::prelude::*;
/// Internal trait specifying which operations are necessary for KreyviumStream generic type
pub trait KreyviumBoolInput<OpOutput>:
Sized

View File

@@ -2,12 +2,10 @@
//! for the representation of the inner bits.
use crate::static_deque::{StaticByteDeque, StaticByteDequeInput};
use rayon::prelude::*;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheUint8, ServerKey};
use rayon::prelude::*;
/// Internal trait specifying which operations are necessary for KreyviumStreamByte generic type
pub trait KreyviumByteInput<OpOutput>:
Sized

View File

@@ -1,8 +1,6 @@
use crate::static_deque::StaticDeque;
use tfhe::shortint::prelude::*;
use rayon::prelude::*;
use tfhe::shortint::prelude::*;
/// KreyviumStreamShortint: a struct implementing the Kreyvium stream cipher, using a generic
/// Ciphertext for the internal representation of bits (intended to represent a single bit). To be
@@ -36,7 +34,7 @@ impl KreyviumStreamShortint {
let mut c_register: [Ciphertext; 111] = [0; 111].map(|x| sk.create_trivial(x));
for i in 0..93 {
a_register[i] = key[128 - 93 + i].clone();
a_register[i].clone_from(&key[128 - 93 + i]);
}
for i in 0..84 {
b_register[i] = sk.create_trivial(iv[128 - 84 + i]);

View File

@@ -1,8 +1,7 @@
use crate::{KreyviumStream, KreyviumStreamByte, KreyviumStreamShortint, TransCiphering};
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheBool, FheUint64, FheUint8};
use crate::{KreyviumStream, KreyviumStreamByte, KreyviumStreamShortint, TransCiphering};
// Values for these tests come from the github repo renaud1239/Kreyvium,
// commit fd6828f68711276c25f55e605935028f5e843f43

View File

@@ -1,5 +1,6 @@
#[allow(clippy::module_inception)]
mod static_deque;
pub use static_deque::StaticDeque;
mod static_byte_deque;
pub use static_byte_deque::{StaticByteDeque, StaticByteDequeInput};

View File

@@ -4,7 +4,6 @@
//! This is pretending to store bits, and allows accessing them in chunks of 8 consecutive bits.
use crate::static_deque::StaticDeque;
use tfhe::FheUint8;
/// Internal trait specifying which operations are needed by StaticByteDeque

View File

@@ -2,12 +2,10 @@
//! when trans ciphering is available to them.
use crate::{KreyviumStreamByte, KreyviumStreamShortint, TriviumStreamByte, TriviumStreamShortint};
use tfhe::shortint::Ciphertext;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheUint64, FheUint8, ServerKey};
use rayon::prelude::*;
use tfhe::prelude::*;
use tfhe::shortint::Ciphertext;
use tfhe::{set_server_key, unset_server_key, FheUint64, FheUint8, ServerKey};
/// Trait specifying the interface for trans ciphering a FheUint64 object. Since it is meant
/// to be used with stream ciphers, encryption and decryption are by default the same.

View File

@@ -1,8 +1,7 @@
use crate::{TransCiphering, TriviumStream, TriviumStreamByte, TriviumStreamShortint};
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheBool, FheUint64, FheUint8};
use crate::{TransCiphering, TriviumStream, TriviumStreamByte, TriviumStreamShortint};
// Values for these tests come from the github repo cantora/avr-crypto-lib, commit 2a5b018,
// file testvectors/trivium-80.80.test-vectors

View File

@@ -2,12 +2,10 @@
//! for the representation of the inner bits.
use crate::static_deque::StaticDeque;
use rayon::prelude::*;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheBool, ServerKey};
use rayon::prelude::*;
/// Internal trait specifying which operations are necessary for TriviumStream generic type
pub trait TriviumBoolInput<OpOutput>:
Sized

View File

@@ -2,12 +2,10 @@
//! for the representation of the inner bits.
use crate::static_deque::{StaticByteDeque, StaticByteDequeInput};
use rayon::prelude::*;
use tfhe::prelude::*;
use tfhe::{set_server_key, unset_server_key, FheUint8, ServerKey};
use rayon::prelude::*;
/// Internal trait specifying which operations are necessary for TriviumStreamByte generic type
pub trait TriviumByteInput<OpOutput>:
Sized

View File

@@ -1,8 +1,6 @@
use crate::static_deque::StaticDeque;
use tfhe::shortint::prelude::*;
use rayon::prelude::*;
use tfhe::shortint::prelude::*;
/// TriviumStreamShortint: a struct implementing the Trivium stream cipher, using a generic
/// Ciphertext for the internal representation of bits (intended to represent a single bit). To be
@@ -34,7 +32,7 @@ impl TriviumStreamShortint {
let mut c_register: [Ciphertext; 111] = [0; 111].map(|x| sk.create_trivial(x));
for i in 0..80 {
a_register[93 - 80 + i] = key[i].clone();
a_register[93 - 80 + i].clone_from(&key[i]);
b_register[84 - 80 + i] = sk.create_trivial(iv[i]);
}

View File

@@ -2,6 +2,12 @@ use std::env;
use std::process::Command;
fn main() {
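// docs.rs sets DOCS_RS=1 and provides no CUDA toolchain, so skip building the backend there.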
if let Ok(val) = env::var("DOCS_RS") {
if val.parse::<u32>() == Ok(1) {
return;
}
}
println!("Build tfhe-cuda-backend");
if env::consts::OS == "linux" {
let output = Command::new("./get_os_name.sh").output().unwrap();

View File

@@ -1 +1,2 @@
/build/
include/cuda_config.h

View File

@@ -58,10 +58,15 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler ${OpenMP_CXX_FLAGS}")
if(${CUDA_SUCCESS})
set(CMAKE_CUDA_ARCHITECTURES native)
string(REPLACE "-arch=sm_" "" CUDA_ARCH "${ARCH}")
set(CUDA_ARCH "${CUDA_ARCH}0")
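# e.g. ARCH="-arch=sm_89" yields CUDA_ARCH="890", the numeric form exposed as a compile definition below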
else()
set(CMAKE_CUDA_ARCHITECTURES 70)
set(CUDA_ARCH "700")
endif()
add_compile_definitions(CUDA_ARCH=${CUDA_ARCH})
# In production, use -arch=sm_70; add --ptxas-options=-v to see register spills and -lineinfo for better debugging
set(CMAKE_CUDA_FLAGS
"${CMAKE_CUDA_FLAGS} -ccbin ${CMAKE_CXX_COMPILER} -O3 \

View File

@@ -6,14 +6,14 @@ while getopts ":c" option; do
case $option in
c)
# code to execute when flag1 is provided
find ./{include,src,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -iregex '^.*\.\(cpp\|cu\|h\|cuh\)$' -print | xargs clang-format-15 -i -style='file' --dry-run --Werror
find ./{include,src,tests_and_benchmarks/include,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -iregex '^.*\.\(cpp\|cu\|h\|cuh\)$' -print | xargs clang-format-15 -i -style='file' --dry-run --Werror
cmake-format -i CMakeLists.txt -c .cmake-format-config.py
find ./{include,src,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -type f -name "CMakeLists.txt" | xargs -I % sh -c 'cmake-format -i % -c .cmake-format-config.py'
find ./{include,src,tests_and_benchmarks/include,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -type f -name "CMakeLists.txt" | xargs -I % sh -c 'cmake-format -i % -c .cmake-format-config.py'
git diff --exit-code
exit
;;
esac
done
find ./{include,src,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -iregex '^.*\.\(cpp\|cu\|h\|cuh\)$' -print | xargs clang-format-15 -i -style='file'
find ./{include,src,tests_and_benchmarks/include,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -iregex '^.*\.\(cpp\|cu\|h\|cuh\)$' -print | xargs clang-format-15 -i -style='file'
cmake-format -i CMakeLists.txt -c .cmake-format-config.py
find ./{include,src,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -type f -name "CMakeLists.txt" | xargs -I % sh -c 'cmake-format -i % -c .cmake-format-config.py'
find ./{include,src,tests_and_benchmarks/include,tests_and_benchmarks/tests,tests_and_benchmarks/benchmarks} -type f -name "CMakeLists.txt" | xargs -I % sh -c 'cmake-format -i % -c .cmake-format-config.py'

View File

@@ -1,155 +0,0 @@
#ifndef CUDA_MULTI_BIT_H
#define CUDA_MULTI_BIT_H
#include "bootstrap.h"
#include <cstdint>
extern "C" {
bool has_support_to_cuda_bootstrap_fast_multi_bit(uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t level_count,
uint32_t num_samples,
uint32_t max_shared_memory);
void cuda_convert_lwe_multi_bit_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size,
uint32_t grouping_factor);
void scratch_cuda_multi_bit_pbs_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory,
uint32_t chunk_size = 0);
void cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t grouping_factor, uint32_t base_log, uint32_t level_count,
uint32_t num_samples, uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, uint32_t lwe_chunk_size = 0);
void scratch_cuda_generic_multi_bit_pbs_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory,
uint32_t lwe_chunk_size = 0);
void cuda_generic_multi_bit_pbs_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t grouping_factor, uint32_t base_log, uint32_t level_count,
uint32_t num_samples, uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, uint32_t lwe_chunk_size = 0);
void cleanup_cuda_multi_bit_pbs_32(cuda_stream_t *stream, int8_t **pbs_buffer);
void cleanup_cuda_multi_bit_pbs_64(cuda_stream_t *stream, int8_t **pbs_buffer);
}
template <typename Torus, typename STorus>
void scratch_cuda_fast_multi_bit_pbs(
cuda_stream_t *stream, pbs_buffer<Torus, MULTI_BIT> **pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t grouping_factor,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory, uint32_t lwe_chunk_size = 0);
template <typename Torus>
void cuda_fast_multi_bit_pbs_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, Torus *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0);
template <typename Torus, typename STorus>
void scratch_cuda_multi_bit_pbs(
cuda_stream_t *stream, pbs_buffer<Torus, MULTI_BIT> **pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t grouping_factor,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory, uint32_t lwe_chunk_size = 0);
template <typename Torus>
void cuda_multi_bit_pbs_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, Torus *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0);
template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::MULTI_BIT> {
double2 *keybundle_fft;
Torus *global_accumulator;
double2 *global_accumulator_fft;
PBS_VARIANT pbs_variant;
pbs_buffer(cuda_stream_t *stream, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t lwe_chunk_size,
PBS_VARIANT pbs_variant, bool allocate_gpu_memory) {
this->pbs_variant = pbs_variant;
auto max_shared_memory = cuda_get_max_shared_memory(stream->gpu_index);
if (allocate_gpu_memory) {
switch (pbs_variant) {
case DEFAULT:
case FAST:
keybundle_fft = (double2 *)cuda_malloc_async(
input_lwe_ciphertext_count * lwe_chunk_size * level_count *
(glwe_dimension + 1) * (glwe_dimension + 1) *
(polynomial_size / 2) * sizeof(double2),
stream);
global_accumulator = (Torus *)cuda_malloc_async(
input_lwe_ciphertext_count * (glwe_dimension + 1) *
polynomial_size * sizeof(Torus),
stream);
global_accumulator_fft = (double2 *)cuda_malloc_async(
input_lwe_ciphertext_count * (glwe_dimension + 1) * level_count *
(polynomial_size / 2) * sizeof(double2),
stream);
break;
default:
PANIC("Cuda error (PBS): unsupported implementation variant.")
}
}
}
void release(cuda_stream_t *stream) {
cuda_drop_async(keybundle_fft, stream);
cuda_drop_async(global_accumulator, stream);
cuda_drop_async(global_accumulator_fft, stream);
}
};
#ifdef __CUDACC__
__host__ uint32_t get_lwe_chunk_size(uint32_t lwe_dimension,
uint32_t level_count,
uint32_t glwe_dimension,
uint32_t num_samples);
__host__ uint32_t get_average_lwe_chunk_size(uint32_t lwe_dimension,
uint32_t level_count,
uint32_t glwe_dimension,
uint32_t ct_count);
__host__ uint64_t get_max_buffer_size_multibit_bootstrap(
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t max_input_lwe_ciphertext_count);
#endif
#endif // CUDA_MULTI_BIT_H

File diff suppressed because it is too large

View File

@@ -1,7 +1,7 @@
#ifndef CUDA_LINALG_H_
#define CUDA_LINALG_H_
#include "bootstrap.h"
#include "programmable_bootstrap.h"
#include <cstdint>
#include <device.h>

View File

@@ -4,8 +4,8 @@
#include "device.h"
#include <cstdint>
enum PBS_TYPE { MULTI_BIT = 0, LOW_LAT = 1, AMORTIZED = 2 };
enum PBS_VARIANT { DEFAULT = 0, FAST = 1 };
enum PBS_TYPE { MULTI_BIT = 0, CLASSICAL = 1 };
enum PBS_VARIANT { DEFAULT = 0, CG = 1 };
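// Renaming: the former LOW_LAT/AMORTIZED split collapses into CLASSICAL, and the FAST variant becomes CG (cooperative groups).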
extern "C" {
void cuda_fourier_polynomial_mul(void *input1, void *input2, void *output,
@@ -13,29 +13,25 @@ void cuda_fourier_polynomial_mul(void *input1, void *input2, void *output,
uint32_t polynomial_size,
uint32_t total_polynomials);
void cuda_convert_lwe_bootstrap_key_32(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size);
void cuda_convert_lwe_programmable_bootstrap_key_32(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size);
void cuda_convert_lwe_bootstrap_key_64(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size);
void cuda_convert_lwe_programmable_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size);
void scratch_cuda_bootstrap_amortized_32(
void scratch_cuda_programmable_bootstrap_amortized_32(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory);
void scratch_cuda_bootstrap_amortized_64(
void scratch_cuda_programmable_bootstrap_amortized_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory);
void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
void cuda_programmable_bootstrap_amortized_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
@@ -43,7 +39,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
void cuda_programmable_bootstrap_amortized_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
@@ -51,22 +47,22 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cleanup_cuda_bootstrap_amortized(cuda_stream_t *stream,
int8_t **pbs_buffer);
void cleanup_cuda_programmable_bootstrap_amortized(cuda_stream_t *stream,
int8_t **pbs_buffer);
void scratch_cuda_bootstrap_low_latency_32(
void scratch_cuda_programmable_bootstrap_32(
cuda_stream_t *stream, int8_t **buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
void scratch_cuda_bootstrap_low_latency_64(
void scratch_cuda_programmable_bootstrap_64(
cuda_stream_t *stream, int8_t **buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
void cuda_programmable_bootstrap_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
@@ -74,7 +70,7 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
void cuda_programmable_bootstrap_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
@@ -82,31 +78,28 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory);
void cleanup_cuda_bootstrap_low_latency_32(cuda_stream_t *stream,
int8_t **pbs_buffer);
void cleanup_cuda_programmable_bootstrap(cuda_stream_t *stream,
int8_t **pbs_buffer);
void cleanup_cuda_bootstrap_low_latency_64(cuda_stream_t *stream,
int8_t **pbs_buffer);
uint64_t get_buffer_size_bootstrap_amortized_64(
uint64_t get_buffer_size_programmable_bootstrap_amortized_64(
uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory);
uint64_t get_buffer_size_bootstrap_low_latency_64(
uint64_t get_buffer_size_programmable_bootstrap_64(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory);
}
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_bootstrap_low_latency_step_one(
get_buffer_size_full_sm_programmable_bootstrap_step_one(
uint32_t polynomial_size) {
return sizeof(Torus) * polynomial_size + // accumulator_rotated
sizeof(double2) * polynomial_size / 2; // accumulator fft
}
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_bootstrap_low_latency_step_two(
get_buffer_size_full_sm_programmable_bootstrap_step_two(
uint32_t polynomial_size) {
return sizeof(Torus) * polynomial_size + // accumulator
sizeof(double2) * polynomial_size / 2; // accumulator fft
@@ -114,13 +107,13 @@ get_buffer_size_full_sm_bootstrap_low_latency_step_two(
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_partial_sm_bootstrap_low_latency(uint32_t polynomial_size) {
get_buffer_size_partial_sm_programmable_bootstrap(uint32_t polynomial_size) {
return sizeof(double2) * polynomial_size / 2; // accumulator fft
}
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_bootstrap_fast_low_latency(uint32_t polynomial_size) {
get_buffer_size_full_sm_programmable_bootstrap_cg(uint32_t polynomial_size) {
return sizeof(Torus) * polynomial_size + // accumulator_rotated
sizeof(Torus) * polynomial_size + // accumulator
sizeof(double2) * polynomial_size / 2; // accumulator fft
@@ -128,14 +121,13 @@ get_buffer_size_full_sm_bootstrap_fast_low_latency(uint32_t polynomial_size) {
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_partial_sm_bootstrap_fast_low_latency(
uint32_t polynomial_size) {
get_buffer_size_partial_sm_programmable_bootstrap_cg(uint32_t polynomial_size) {
return sizeof(double2) * polynomial_size / 2; // accumulator fft mask & body
}
template <typename Torus, PBS_TYPE pbs_type> struct pbs_buffer;
template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::LOW_LAT> {
template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::CLASSICAL> {
int8_t *d_mem;
Torus *global_accumulator;
@@ -155,13 +147,13 @@ template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::LOW_LAT> {
switch (pbs_variant) {
case PBS_VARIANT::DEFAULT: {
uint64_t full_sm_step_one =
get_buffer_size_full_sm_bootstrap_low_latency_step_one<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_one<Torus>(
polynomial_size);
uint64_t full_sm_step_two =
get_buffer_size_full_sm_bootstrap_low_latency_step_two<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_two<Torus>(
polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t partial_dm_step_one = full_sm_step_one - partial_sm;
@@ -193,12 +185,12 @@ template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::LOW_LAT> {
polynomial_size * sizeof(Torus),
stream);
} break;
case PBS_VARIANT::FAST: {
case PBS_VARIANT::CG: {
uint64_t full_sm =
get_buffer_size_full_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_cg<Torus>(
polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap_cg<Torus>(
polynomial_size);
uint64_t partial_dm = full_sm - partial_sm;
@@ -237,14 +229,14 @@ template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::LOW_LAT> {
};
template <typename Torus>
__host__ __device__ uint64_t get_buffer_size_bootstrap_fast_low_latency(
__host__ __device__ uint64_t get_buffer_size_programmable_bootstrap_cg(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory) {
uint64_t full_sm = get_buffer_size_full_sm_bootstrap_fast_low_latency<Torus>(
polynomial_size);
uint64_t full_sm =
get_buffer_size_full_sm_programmable_bootstrap_cg<Torus>(polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap_cg<Torus>(
polynomial_size);
uint64_t partial_dm = full_sm - partial_sm;
uint64_t full_dm = full_sm;
@@ -263,42 +255,42 @@ __host__ __device__ uint64_t get_buffer_size_bootstrap_fast_low_latency(
}
template <typename Torus>
bool has_support_to_cuda_bootstrap_fast_low_latency(uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t level_count,
uint32_t num_samples,
uint32_t max_shared_memory);
bool has_support_to_cuda_programmable_bootstrap_cg(uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t level_count,
uint32_t num_samples,
uint32_t max_shared_memory);
template <typename Torus>
void cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector(
void cuda_programmable_bootstrap_cg_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *buffer, uint32_t lwe_dimension,
pbs_buffer<Torus, CLASSICAL> *buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template <typename Torus>
void cuda_bootstrap_low_latency_lwe_ciphertext_vector(
void cuda_programmable_bootstrap_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *buffer, uint32_t lwe_dimension,
pbs_buffer<Torus, CLASSICAL> *buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template <typename Torus, typename STorus>
void scratch_cuda_fast_bootstrap_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **pbs_buffer,
void scratch_cuda_programmable_bootstrap_cg(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **pbs_buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
template <typename Torus, typename STorus>
void scratch_cuda_bootstrap_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **buffer,
void scratch_cuda_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);

View File

@@ -0,0 +1,241 @@
#ifndef CUDA_MULTI_BIT_H
#define CUDA_MULTI_BIT_H
#include "programmable_bootstrap.h"
#include <cstdint>
extern "C" {
bool has_support_to_cuda_programmable_bootstrap_cg_multi_bit(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t num_samples, uint32_t max_shared_memory);
void cuda_convert_lwe_multi_bit_programmable_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size,
uint32_t grouping_factor);
void scratch_cuda_multi_bit_programmable_bootstrap_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory,
uint32_t chunk_size = 0);
void cuda_multi_bit_programmable_bootstrap_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t grouping_factor, uint32_t base_log, uint32_t level_count,
uint32_t num_samples, uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, uint32_t lwe_chunk_size = 0);
void scratch_cuda_generic_multi_bit_programmable_bootstrap_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t grouping_factor, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory,
uint32_t lwe_chunk_size = 0);
void cuda_generic_multi_bit_programmable_bootstrap_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t grouping_factor, uint32_t base_log, uint32_t level_count,
uint32_t num_samples, uint32_t num_luts, uint32_t lwe_idx,
uint32_t max_shared_memory, uint32_t lwe_chunk_size = 0);
void cleanup_cuda_multi_bit_programmable_bootstrap(cuda_stream_t *stream,
int8_t **pbs_buffer);
}
template <typename Torus, typename STorus>
void scratch_cuda_cg_multi_bit_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<Torus, MULTI_BIT> **pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t grouping_factor,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory, uint32_t lwe_chunk_size = 0);
template <typename Torus>
void cuda_cg_multi_bit_programmable_bootstrap_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, Torus *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0);
template <typename Torus, typename STorus>
void scratch_cuda_multi_bit_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<Torus, MULTI_BIT> **pbs_buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t grouping_factor,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory, uint32_t lwe_chunk_size = 0);
template <typename Torus>
void cuda_multi_bit_programmable_bootstrap_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, Torus *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_multibit_programmable_bootstrap_keybundle(
uint32_t polynomial_size);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_multibit_programmable_bootstrap_step_one(
uint32_t polynomial_size);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_multibit_programmable_bootstrap_step_two(
uint32_t polynomial_size);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_partial_sm_multibit_programmable_bootstrap_step_one(
uint32_t polynomial_size);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap(
uint32_t polynomial_size);
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap(
uint32_t polynomial_size);
template <typename Torus> struct pbs_buffer<Torus, PBS_TYPE::MULTI_BIT> {
int8_t *d_mem_keybundle = NULL;
int8_t *d_mem_acc_step_one = NULL;
int8_t *d_mem_acc_step_two = NULL;
int8_t *d_mem_acc_cg = NULL;
double2 *keybundle_fft;
Torus *global_accumulator;
double2 *global_accumulator_fft;
PBS_VARIANT pbs_variant;
pbs_buffer(cuda_stream_t *stream, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t lwe_chunk_size,
PBS_VARIANT pbs_variant, bool allocate_gpu_memory) {
this->pbs_variant = pbs_variant;
auto max_shared_memory = cuda_get_max_shared_memory(stream->gpu_index);
uint64_t full_sm_keybundle =
get_buffer_size_full_sm_multibit_programmable_bootstrap_keybundle<
Torus>(polynomial_size);
uint64_t full_sm_accumulate_step_one =
get_buffer_size_full_sm_multibit_programmable_bootstrap_step_one<Torus>(
polynomial_size);
uint64_t partial_sm_accumulate_step_one =
get_buffer_size_partial_sm_multibit_programmable_bootstrap_step_one<
Torus>(polynomial_size);
uint64_t full_sm_accumulate_step_two =
get_buffer_size_full_sm_multibit_programmable_bootstrap_step_two<Torus>(
polynomial_size);
uint64_t full_sm_cg_accumulate =
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t partial_sm_cg_accumulate =
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
auto num_blocks_keybundle = input_lwe_ciphertext_count * lwe_chunk_size *
(glwe_dimension + 1) * (glwe_dimension + 1) *
level_count;
auto num_blocks_acc_step_one =
level_count * (glwe_dimension + 1) * input_lwe_ciphertext_count;
auto num_blocks_acc_step_two =
input_lwe_ciphertext_count * (glwe_dimension + 1);
auto num_blocks_acc_cg =
level_count * (glwe_dimension + 1) * input_lwe_ciphertext_count;
if (allocate_gpu_memory) {
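// Device-memory fallbacks below are only allocated when shared memory cannot hold the working set.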
// Keybundle
if (max_shared_memory < full_sm_keybundle)
d_mem_keybundle = (int8_t *)cuda_malloc_async(
num_blocks_keybundle * full_sm_keybundle, stream);
switch (pbs_variant) {
case DEFAULT:
// Accumulator step one
if (max_shared_memory < partial_sm_accumulate_step_one)
d_mem_acc_step_one = (int8_t *)cuda_malloc_async(
num_blocks_acc_step_one * full_sm_accumulate_step_one, stream);
else if (max_shared_memory < full_sm_accumulate_step_one)
d_mem_acc_step_one = (int8_t *)cuda_malloc_async(
num_blocks_acc_step_one * partial_sm_accumulate_step_one, stream);
// Accumulator step two
if (max_shared_memory < full_sm_accumulate_step_two)
d_mem_acc_step_two = (int8_t *)cuda_malloc_async(
num_blocks_acc_step_two * full_sm_accumulate_step_two, stream);
break;
case CG:
// Accumulator CG
if (max_shared_memory < partial_sm_cg_accumulate)
d_mem_acc_cg = (int8_t *)cuda_malloc_async(
num_blocks_acc_cg * full_sm_cg_accumulate, stream);
else if (max_shared_memory < full_sm_cg_accumulate)
d_mem_acc_cg = (int8_t *)cuda_malloc_async(
num_blocks_acc_cg * partial_sm_cg_accumulate, stream);
break;
default:
PANIC("Cuda error (PBS): unsupported implementation variant.")
}
keybundle_fft = (double2 *)cuda_malloc_async(
num_blocks_keybundle * (polynomial_size / 2) * sizeof(double2),
stream);
global_accumulator = (Torus *)cuda_malloc_async(
num_blocks_acc_step_two * polynomial_size * sizeof(Torus), stream);
global_accumulator_fft = (double2 *)cuda_malloc_async(
num_blocks_acc_step_one * (polynomial_size / 2) * sizeof(double2),
stream);
}
}
void release(cuda_stream_t *stream) {
if (d_mem_keybundle)
cuda_drop_async(d_mem_keybundle, stream);
switch (pbs_variant) {
case DEFAULT:
if (d_mem_acc_step_one)
cuda_drop_async(d_mem_acc_step_one, stream);
if (d_mem_acc_step_two)
cuda_drop_async(d_mem_acc_step_two, stream);
break;
case CG:
if (d_mem_acc_cg)
cuda_drop_async(d_mem_acc_cg, stream);
break;
default:
PANIC("Cuda error (PBS): unsupported implementation variant.")
}
cuda_drop_async(keybundle_fft, stream);
cuda_drop_async(global_accumulator, stream);
cuda_drop_async(global_accumulator_fft, stream);
}
};
#ifdef __CUDACC__
__host__ uint32_t get_lwe_chunk_size(uint32_t ct_count);
#endif
#endif // CUDA_MULTI_BIT_H

View File

@@ -216,14 +216,10 @@ void cuda_drop_async(void *ptr, cuda_stream_t *stream) {
/// Get the maximum size for the shared memory
int cuda_get_max_shared_memory(uint32_t gpu_index) {
check_cuda_error(cudaSetDevice(gpu_index));
cudaDeviceProp prop;
check_cuda_error(cudaGetDeviceProperties(&prop, gpu_index));
int max_shared_memory = 0;
if (prop.major >= 6) {
max_shared_memory = prop.sharedMemPerMultiprocessor;
} else {
max_shared_memory = prop.sharedMemPerBlock;
}
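// Query the per-block shared memory limit directly instead of deriving it from the compute capability.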
cudaDeviceGetAttribute(&max_shared_memory, cudaDevAttrMaxSharedMemoryPerBlock,
gpu_index);
check_cuda_error(cudaGetLastError());
return max_shared_memory;
}

View File

@@ -181,7 +181,7 @@ template <class params> __device__ void NSMFFT_direct(double2 *A) {
// from level 8, we need to check size of params degree, because we support
// minimum actual polynomial size = 256, when compressed size is halfed and
// minimum supported compressed size is 128, so we always need first 7
// levels of butterfy operation, since butterfly levels are hardcoded
// levels of butterfly operation, since butterfly levels are hardcoded
// we need to check if polynomial size is big enough to require specific level
// of butterfly.
if constexpr (params::degree >= 256) {
@@ -353,7 +353,7 @@ template <class params> __device__ void NSMFFT_inverse(double2 *A) {
// compressed size = 8192 is actual polynomial size = 16384.
// twiddles for this size can't fit in constant memory so
// butterfly operation for this level acess device memory to fetch
// butterfly operation for this level access device memory to fetch
// twiddles
if constexpr (params::degree >= 8192) {
// level 13
@@ -484,7 +484,7 @@ template <class params> __device__ void NSMFFT_inverse(double2 *A) {
// below level 8, we don't need to check size of params degree, because we
// support minimum actual polynomial size = 256, when compressed size is
// halved and minimum supported compressed size is 128, so we always need
// last 7 levels of butterfy operation, since butterfly levels are hardcoded
// last 7 levels of butterfly operation, since butterfly levels are hardcoded
// we don't need to check if polynomial size is big enough to require
// specific level of butterfly.
// level 7

View File

@@ -3,7 +3,7 @@
/*
* 'negtwiddles' are stored in constant memory for faster access times
* because of it's limitied size, only twiddles for up to 2^12 polynomial size
* because of its limited size, only twiddles for up to 2^12 polynomial size
* can be stored there, twiddles for 2^13 are stored in device memory
* 'negtwiddles13'
*/

View File

@@ -5,8 +5,8 @@
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "pbs/programmable_bootstrap_classic.cuh"
#include "pbs/programmable_bootstrap_multibit.cuh"
#include "polynomial/functions.cuh"
#include "utils/kernel_dimensions.cuh"
#include <omp.h>

View File

@@ -29,8 +29,8 @@ __host__ void zero_out_if(cuda_stream_t *stream, Torus *lwe_array_out,
device_pack_bivariate_blocks<<<num_blocks, num_threads, 0,
stream->stream>>>(
lwe_array_out_block, lwe_array_input_block, lwe_condition,
predicate->lwe_indexes, params.big_lwe_dimension,
lwe_array_out_block, predicate->lwe_indexes_in, lwe_array_input_block,
lwe_condition, predicate->lwe_indexes_in, params.big_lwe_dimension,
params.message_modulus, 1);
check_cuda_error(cudaGetLastError());
}

View File

@@ -5,8 +5,8 @@ void scratch_cuda_integer_radix_comparison_kb_64(
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t lwe_ciphertext_count, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, COMPARISON_TYPE op_type,
uint32_t num_radix_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, COMPARISON_TYPE op_type, bool is_signed,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
@@ -17,9 +17,9 @@ void scratch_cuda_integer_radix_comparison_kb_64(
switch (op_type) {
case EQ:
case NE:
scratch_cuda_integer_radix_equality_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr,
lwe_ciphertext_count, params, op_type, allocate_gpu_memory);
scratch_cuda_integer_radix_comparison_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr, num_radix_blocks,
params, op_type, false, allocate_gpu_memory);
break;
case GT:
case GE:
@@ -27,9 +27,9 @@ void scratch_cuda_integer_radix_comparison_kb_64(
case LE:
case MAX:
case MIN:
scratch_cuda_integer_radix_difference_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr,
lwe_ciphertext_count, params, op_type, allocate_gpu_memory);
scratch_cuda_integer_radix_comparison_check_kb<uint64_t>(
stream, (int_comparison_buffer<uint64_t> **)mem_ptr, num_radix_blocks,
params, op_type, is_signed, allocate_gpu_memory);
break;
}
}
@@ -37,7 +37,7 @@ void scratch_cuda_integer_radix_comparison_kb_64(
void cuda_comparison_integer_radix_ciphertext_kb_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_array_1,
void *lwe_array_2, int8_t *mem_ptr, void *bsk, void *ksk,
uint32_t lwe_ciphertext_count) {
uint32_t num_radix_blocks) {
int_comparison_buffer<uint64_t> *buffer =
(int_comparison_buffer<uint64_t> *)mem_ptr;
@@ -48,7 +48,7 @@ void cuda_comparison_integer_radix_ciphertext_kb_64(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer, bsk,
static_cast<uint64_t *>(ksk), lwe_ciphertext_count);
static_cast<uint64_t *>(ksk), num_radix_blocks);
break;
case GT:
case GE:
@@ -59,7 +59,7 @@ void cuda_comparison_integer_radix_ciphertext_kb_64(
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer,
buffer->diff_buffer->operator_f, bsk, static_cast<uint64_t *>(ksk),
lwe_ciphertext_count);
num_radix_blocks);
break;
case MAX:
case MIN:
@@ -67,7 +67,7 @@ void cuda_comparison_integer_radix_ciphertext_kb_64(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_array_1),
static_cast<uint64_t *>(lwe_array_2), buffer, bsk,
static_cast<uint64_t *>(ksk), lwe_ciphertext_count);
static_cast<uint64_t *>(ksk), num_radix_blocks);
break;
default:
PANIC("Cuda error: integer operation not supported")

View File

@@ -8,8 +8,8 @@
#include "integer/cmux.cuh"
#include "integer/negation.cuh"
#include "integer/scalar_addition.cuh"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "pbs/programmable_bootstrap_classic.cuh"
#include "pbs/programmable_bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/kernel_dimensions.cuh"
@@ -71,24 +71,25 @@ are_all_comparisons_block_true(cuda_stream_t *stream, Torus *lwe_array_out,
auto are_all_block_true_buffer =
mem_ptr->eq_buffer->are_all_block_true_buffer;
auto tmp_out = are_all_block_true_buffer->tmp_out;
uint32_t total_modulus = message_modulus * carry_modulus;
uint32_t max_value = total_modulus - 1;
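// e.g. message_modulus = carry_modulus = 4 gives max_value = 15 boolean blocks summed per chunk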
cuda_memcpy_async_gpu_to_gpu(
lwe_array_out, lwe_array_in,
tmp_out, lwe_array_in,
num_radix_blocks * (big_lwe_dimension + 1) * sizeof(Torus), stream);
int lut_num_blocks = 0;
uint32_t remaining_blocks = num_radix_blocks;
while (remaining_blocks > 1) {
while (remaining_blocks > 0) {
// Split in max_value chunks
uint32_t chunk_length = std::min(max_value, remaining_blocks);
int num_chunks = remaining_blocks / chunk_length;
// Since all blocks encrypt either 0 or 1, we can sum max_value of them
// as in the worst case we will be adding `max_value` ones
auto input_blocks = lwe_array_out;
auto input_blocks = tmp_out;
auto accumulator = are_all_block_true_buffer->tmp_block_accumulated;
for (int i = 0; i < num_chunks; i++) {
accumulate_all_blocks(stream, accumulator, input_blocks,
@@ -131,8 +132,15 @@ are_all_comparisons_block_true(cuda_stream_t *stream, Torus *lwe_array_out,
}
// Applies the LUT
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, accumulator, bsk, ksk, num_chunks, lut);
if (remaining_blocks == 1) {
// In the last iteration we copy the output to the final address
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, accumulator, bsk, ksk, 1, lut);
return;
} else {
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, tmp_out, accumulator, bsk, ksk, num_chunks, lut);
}
}
}
@@ -158,18 +166,18 @@ __host__ void is_at_least_one_comparisons_block_true(
uint32_t max_value = total_modulus - 1;
cuda_memcpy_async_gpu_to_gpu(
lwe_array_out, lwe_array_in,
mem_ptr->tmp_lwe_array_out, lwe_array_in,
num_radix_blocks * (big_lwe_dimension + 1) * sizeof(Torus), stream);
uint32_t remaining_blocks = num_radix_blocks;
while (remaining_blocks > 1) {
while (remaining_blocks > 0) {
// Split in max_value chunks
uint32_t chunk_length = std::min(max_value, remaining_blocks);
int num_chunks = remaining_blocks / chunk_length;
// Since all blocks encrypt either 0 or 1, we can sum max_value of them
// as in the worst case we will be adding `max_value` ones
auto input_blocks = lwe_array_out;
auto input_blocks = mem_ptr->tmp_lwe_array_out;
auto accumulator = buffer->tmp_block_accumulated;
for (int i = 0; i < num_chunks; i++) {
accumulate_all_blocks(stream, accumulator, input_blocks,
@@ -185,8 +193,16 @@ __host__ void is_at_least_one_comparisons_block_true(
int_radix_lut<Torus> *lut = mem_ptr->eq_buffer->is_non_zero_lut;
// Applies the LUT
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, accumulator, bsk, ksk, num_chunks, lut);
if (remaining_blocks == 1) {
// In the last iteration we copy the output to the final address
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, accumulator, bsk, ksk, 1, lut);
return;
} else {
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, mem_ptr->tmp_lwe_array_out, accumulator, bsk, ksk, num_chunks,
lut);
}
}
}
@@ -257,7 +273,7 @@ __host__ void host_compare_with_zero_equality(
remainder_blocks -= (chunk_size - 1);
// Update operands
chunk += chunk_size * big_lwe_size;
chunk += (chunk_size - 1) * big_lwe_size;
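// Advance by (chunk_size - 1) blocks, matching the remainder_blocks decrement above.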
sum_i += big_lwe_size;
}
}
@@ -266,11 +282,6 @@ __host__ void host_compare_with_zero_equality(
stream, sum, sum, bsk, ksk, num_sum_blocks, zero_comparison);
are_all_comparisons_block_true(stream, lwe_array_out, sum, mem_ptr, bsk, ksk,
num_sum_blocks);
// The result will be in the two first block. Everything else is
// garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (num_radix_blocks - 1), stream);
}
template <typename Torus>
@@ -279,11 +290,9 @@ __host__ void host_integer_radix_equality_check_kb(
Torus *lwe_array_2, int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t num_radix_blocks) {
cudaSetDevice(stream->gpu_index);
auto eq_buffer = mem_ptr->eq_buffer;
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
// Applies the LUT for the comparison operation
auto comparisons = mem_ptr->tmp_block_comparisons;
integer_radix_apply_bivariate_lookup_table_kb(
@@ -292,27 +301,10 @@ __host__ void host_integer_radix_equality_check_kb(
// This takes a Vec of blocks, where each block is either 0 or 1.
//
// It return a block encrypting 1 if all input blocks are 1
// It returns a block encrypting 1 if all input blocks are 1
// otherwise the block encrypts 0
are_all_comparisons_block_true(stream, lwe_array_out, comparisons, mem_ptr,
bsk, ksk, num_radix_blocks);
// Zero all blocks but the first
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (num_radix_blocks - 1), stream);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_equality_check_kb(
cuda_stream_t *stream, int_comparison_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, COMPARISON_TYPE op,
bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_comparison_buffer<Torus>(
stream, op, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
@@ -447,38 +439,45 @@ __host__ void host_integer_radix_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_left,
Torus *lwe_array_right, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> reduction_lut_f, void *bsk, Torus *ksk,
uint32_t total_num_radix_blocks) {
uint32_t num_radix_blocks) {
cudaSetDevice(stream->gpu_index);
auto diff_buffer = mem_ptr->diff_buffer;
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto big_lwe_size = big_lwe_dimension + 1;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
uint32_t num_radix_blocks = total_num_radix_blocks;
uint32_t packed_num_radix_blocks = num_radix_blocks;
auto lhs = lwe_array_left;
auto rhs = lwe_array_right;
if (carry_modulus == message_modulus) {
if (carry_modulus >= message_modulus) {
// Packing is possible
// Pack inputs
Torus *packed_left = diff_buffer->tmp_packed_left;
Torus *packed_right = diff_buffer->tmp_packed_right;
// In case the ciphertext is signed, the sign block and the one before it
// are handled separately
if (mem_ptr->is_signed) {
packed_num_radix_blocks -= 2;
}
pack_blocks(stream, packed_left, lwe_array_left, big_lwe_dimension,
num_radix_blocks, message_modulus);
packed_num_radix_blocks, message_modulus);
pack_blocks(stream, packed_right, lwe_array_right, big_lwe_dimension,
num_radix_blocks, message_modulus);
packed_num_radix_blocks, message_modulus);
// From this point we have half number of blocks
num_radix_blocks /= 2;
packed_num_radix_blocks /= 2;
// Clean noise
auto cleaning_lut = mem_ptr->cleaning_lut;
auto identity_lut = mem_ptr->identity_lut;
integer_radix_apply_univariate_lookup_table_kb(
stream, packed_left, packed_left, bsk, ksk, num_radix_blocks,
cleaning_lut);
stream, packed_left, packed_left, bsk, ksk, packed_num_radix_blocks,
identity_lut);
integer_radix_apply_univariate_lookup_table_kb(
stream, packed_right, packed_right, bsk, ksk, num_radix_blocks,
cleaning_lut);
stream, packed_right, packed_right, bsk, ksk, packed_num_radix_blocks,
identity_lut);
lhs = packed_left;
rhs = packed_right;
@@ -489,31 +488,78 @@ __host__ void host_integer_radix_difference_check_kb(
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto comparisons = mem_ptr->tmp_block_comparisons;
compare_radix_blocks_kb(stream, comparisons, lhs, rhs, mem_ptr, bsk, ksk,
num_radix_blocks);
auto num_comparisons = 0;
if (!mem_ptr->is_signed) {
// Compare packed blocks, or simply the total number of radix blocks in the
// inputs
compare_radix_blocks_kb(stream, comparisons, lhs, rhs, mem_ptr, bsk, ksk,
packed_num_radix_blocks);
num_comparisons = packed_num_radix_blocks;
} else {
// Packing is possible
if (carry_modulus >= message_modulus) {
// Compare (num_radix_blocks - 2) / 2 packed blocks
compare_radix_blocks_kb(stream, comparisons, lhs, rhs, mem_ptr, bsk, ksk,
packed_num_radix_blocks);
// Compare the last block before the sign block separately
auto identity_lut = mem_ptr->identity_lut;
Torus *last_left_block_before_sign_block =
diff_buffer->tmp_packed_left + packed_num_radix_blocks * big_lwe_size;
Torus *last_right_block_before_sign_block =
diff_buffer->tmp_packed_right +
packed_num_radix_blocks * big_lwe_size;
integer_radix_apply_univariate_lookup_table_kb(
stream, last_left_block_before_sign_block,
lwe_array_left + (num_radix_blocks - 2) * big_lwe_size, bsk, ksk, 1,
identity_lut);
integer_radix_apply_univariate_lookup_table_kb(
stream, last_right_block_before_sign_block,
lwe_array_right + (num_radix_blocks - 2) * big_lwe_size, bsk, ksk, 1,
identity_lut);
compare_radix_blocks_kb(
stream, comparisons + packed_num_radix_blocks * big_lwe_size,
last_left_block_before_sign_block, last_right_block_before_sign_block,
mem_ptr, bsk, ksk, 1);
// Compare the sign block separately
integer_radix_apply_bivariate_lookup_table_kb(
stream, comparisons + (packed_num_radix_blocks + 1) * big_lwe_size,
lwe_array_left + (num_radix_blocks - 1) * big_lwe_size,
lwe_array_right + (num_radix_blocks - 1) * big_lwe_size, bsk, ksk, 1,
mem_ptr->signed_lut);
num_comparisons = packed_num_radix_blocks + 2;
} else {
compare_radix_blocks_kb(stream, comparisons, lwe_array_left,
lwe_array_right, mem_ptr, bsk, ksk,
num_radix_blocks - 1);
// Compare the sign block separately
integer_radix_apply_bivariate_lookup_table_kb(
stream, comparisons + (num_radix_blocks - 1) * big_lwe_size,
lwe_array_left + (num_radix_blocks - 1) * big_lwe_size,
lwe_array_right + (num_radix_blocks - 1) * big_lwe_size, bsk, ksk, 1,
mem_ptr->signed_lut);
num_comparisons = num_radix_blocks;
}
}
// Reduces a vec containing radix blocks that encrypts a sign
// (inferior, equal, superior) to one single radix block containing the
// final sign
tree_sign_reduction(stream, lwe_array_out, comparisons,
mem_ptr->diff_buffer->tree_buffer, reduction_lut_f, bsk,
ksk, num_radix_blocks);
// The result will be in the first block. Everything else is garbage.
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes, stream);
ksk, num_comparisons);
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_difference_check_kb(
__host__ void scratch_cuda_integer_radix_comparison_check_kb(
cuda_stream_t *stream, int_comparison_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, COMPARISON_TYPE op,
bool allocate_gpu_memory) {
bool is_signed, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_comparison_buffer<Torus>(
stream, op, params, num_radix_blocks, allocate_gpu_memory);
stream, op, params, num_radix_blocks, is_signed, allocate_gpu_memory);
}
template <typename Torus>
@@ -523,10 +569,11 @@ host_integer_radix_maxmin_kb(cuda_stream_t *stream, Torus *lwe_array_out,
int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t total_num_radix_blocks) {
cudaSetDevice(stream->gpu_index);
// Compute the sign
host_integer_radix_difference_check_kb(
stream, mem_ptr->tmp_lwe_array_out, lwe_array_left, lwe_array_right,
mem_ptr, mem_ptr->cleaning_lut_f, bsk, ksk, total_num_radix_blocks);
mem_ptr, mem_ptr->identity_lut_f, bsk, ksk, total_num_radix_blocks);
// Selector
host_integer_radix_cmux_kb(

View File

@@ -88,12 +88,14 @@ void cleanup_cuda_full_propagation(cuda_stream_t *stream,
cuda_drop_async(mem_ptr->lut_buffer, stream);
cuda_drop_async(mem_ptr->lut_indexes, stream);
cuda_drop_async(mem_ptr->lwe_indexes, stream);
cuda_drop_async(mem_ptr->tmp_small_lwe_vector, stream);
cuda_drop_async(mem_ptr->tmp_big_lwe_vector, stream);
switch (mem_ptr->pbs_type) {
case LOW_LAT: {
auto x = (pbs_buffer<uint64_t, LOW_LAT> *)(mem_ptr->pbs_buffer);
case CLASSICAL: {
auto x = (pbs_buffer<uint64_t, CLASSICAL> *)(mem_ptr->pbs_buffer);
x->release(stream);
} break;
case MULTI_BIT: {
@@ -105,7 +107,7 @@ void cleanup_cuda_full_propagation(cuda_stream_t *stream,
}
}
void scratch_cuda_propagate_single_carry_low_latency_kb_64_inplace(
void scratch_cuda_propagate_single_carry_kb_64_inplace(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
@@ -118,22 +120,23 @@ void scratch_cuda_propagate_single_carry_low_latency_kb_64_inplace(
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_propagate_single_carry_low_latency_kb_inplace(
scratch_cuda_propagate_single_carry_kb_inplace(
stream, (int_sc_prop_memory<uint64_t> **)mem_ptr, num_blocks, params,
allocate_gpu_memory);
}
void cuda_propagate_single_carry_low_latency_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, int8_t *mem_ptr, void *bsk,
void *ksk, uint32_t num_blocks) {
host_propagate_single_carry_low_latency<uint64_t>(
void cuda_propagate_single_carry_kb_64_inplace(cuda_stream_t *stream,
void *lwe_array, int8_t *mem_ptr,
void *bsk, void *ksk,
uint32_t num_blocks) {
host_propagate_single_carry<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array),
(int_sc_prop_memory<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_propagate_single_carry_low_latency(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
void cleanup_cuda_propagate_single_carry(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_sc_prop_memory<uint64_t> *mem_ptr =
(int_sc_prop_memory<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);

View File

@@ -1,7 +1,6 @@
#ifndef CUDA_INTEGER_CUH
#define CUDA_INTEGER_CUH
#include "bootstrap.h"
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.h"
@@ -9,6 +8,8 @@
#include "linear_algebra.h"
#include "linearalgebra/addition.cuh"
#include "polynomial/functions.cuh"
#include "programmable_bootstrap.h"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#include <functional>
@@ -61,26 +62,30 @@ __global__ void radix_blocks_rotate_left(Torus *dst, Torus *src, uint32_t value,
// polynomial_size threads
template <typename Torus>
__global__ void
device_pack_bivariate_blocks(Torus *lwe_array_out, Torus *lwe_array_1,
Torus *lwe_array_2, Torus *lwe_indexes,
uint32_t lwe_dimension, uint32_t message_modulus,
uint32_t num_blocks) {
device_pack_bivariate_blocks(Torus *lwe_array_out, Torus *lwe_indexes_out,
Torus *lwe_array_1, Torus *lwe_array_2,
Torus *lwe_indexes_in, uint32_t lwe_dimension,
uint32_t shift, uint32_t num_blocks) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < num_blocks * (lwe_dimension + 1)) {
int block_id = tid / (lwe_dimension + 1);
int coeff_id = tid % (lwe_dimension + 1);
int pos = lwe_indexes[block_id] * (lwe_dimension + 1) + coeff_id;
lwe_array_out[pos] = lwe_array_1[pos] * message_modulus + lwe_array_2[pos];
int pos_in = lwe_indexes_in[block_id] * (lwe_dimension + 1) + coeff_id;
int pos_out = lwe_indexes_out[block_id] * (lwe_dimension + 1) + coeff_id;
lwe_array_out[pos_out] = lwe_array_1[pos_in] * shift + lwe_array_2[pos_in];
}
}
/* Combine lwe_array_1 and lwe_array_2 so that each pair of blocks (m1, m2)
* becomes out = m1 * shift + m2
*/
template <typename Torus>
__host__ void pack_bivariate_blocks(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_1, Torus *lwe_array_2,
Torus *lwe_indexes, uint32_t lwe_dimension,
uint32_t message_modulus,
Torus *lwe_indexes_out, Torus *lwe_array_1,
Torus *lwe_array_2, Torus *lwe_indexes_in,
uint32_t lwe_dimension, uint32_t shift,
uint32_t num_radix_blocks) {
cudaSetDevice(stream->gpu_index);
@@ -89,8 +94,8 @@ __host__ void pack_bivariate_blocks(cuda_stream_t *stream, Torus *lwe_array_out,
int num_entries = num_radix_blocks * (lwe_dimension + 1);
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
device_pack_bivariate_blocks<<<num_blocks, num_threads, 0, stream->stream>>>(
lwe_array_out, lwe_array_1, lwe_array_2, lwe_indexes, lwe_dimension,
message_modulus, num_radix_blocks);
lwe_array_out, lwe_indexes_out, lwe_array_1, lwe_array_2, lwe_indexes_in,
lwe_dimension, shift, num_radix_blocks);
check_cuda_error(cudaGetLastError());
}
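/* Illustrative plaintext-side sketch (not library code) of the packing rule
 * above, assuming shift == message_modulus: with message_modulus = 4, m1 = 2
 * and m2 = 3, packed = 2 * 4 + 3 = 11, and a bivariate LUT indexed by the
 * packed value can recover (m1, m2) as (packed / 4, packed % 4). */
static inline uint64_t pack_plaintext_pair(uint64_t m1, uint64_t m2,
                                           uint64_t shift) {
  return m1 * shift + m2;
}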
@@ -114,15 +119,15 @@ __host__ void integer_radix_apply_univariate_lookup_table_kb(
// Compute Keyswitch-PBS
cuda_keyswitch_lwe_ciphertext_vector(
stream, lut->tmp_lwe_after_ks, lut->lwe_indexes, lwe_array_in,
lut->lwe_indexes, ksk, big_lwe_dimension, small_lwe_dimension,
stream, lut->tmp_lwe_after_ks, lut->lwe_trivial_indexes, lwe_array_in,
lut->lwe_indexes_in, ksk, big_lwe_dimension, small_lwe_dimension,
ks_base_log, ks_level, num_radix_blocks);
execute_pbs<Torus>(stream, lwe_array_out, lut->lwe_indexes, lut->lut,
lut->lut_indexes, lut->tmp_lwe_after_ks, lut->lwe_indexes,
bsk, lut->buffer, glwe_dimension, small_lwe_dimension,
polynomial_size, pbs_base_log, pbs_level, grouping_factor,
num_radix_blocks, 1, 0,
execute_pbs<Torus>(stream, lwe_array_out, lut->lwe_indexes_out, lut->lut,
lut->lut_indexes, lut->tmp_lwe_after_ks,
lut->lwe_trivial_indexes, bsk, lut->buffer, glwe_dimension,
small_lwe_dimension, polynomial_size, pbs_base_log,
pbs_level, grouping_factor, num_radix_blocks, 1, 0,
cuda_get_max_shared_memory(stream->gpu_index), pbs_type);
}
@@ -133,21 +138,38 @@ __host__ void integer_radix_apply_bivariate_lookup_table_kb(
int_radix_lut<Torus> *lut) {
cudaSetDevice(stream->gpu_index);
// apply_lookup_table_bivariate
auto params = lut->params;
auto pbs_type = params.pbs_type;
auto big_lwe_dimension = params.big_lwe_dimension;
auto small_lwe_dimension = params.small_lwe_dimension;
auto ks_level = params.ks_level;
auto ks_base_log = params.ks_base_log;
auto pbs_level = params.pbs_level;
auto pbs_base_log = params.pbs_base_log;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto grouping_factor = params.grouping_factor;
auto message_modulus = params.message_modulus;
// Left message is shifted
pack_bivariate_blocks(stream, lut->tmp_lwe_before_ks, lwe_array_1,
lwe_array_2, lut->lwe_indexes, big_lwe_dimension,
message_modulus, num_radix_blocks);
auto lwe_array_pbs_in = lut->tmp_lwe_before_ks;
pack_bivariate_blocks(stream, lwe_array_pbs_in, lut->lwe_trivial_indexes,
lwe_array_1, lwe_array_2, lut->lwe_indexes_in,
big_lwe_dimension, message_modulus, num_radix_blocks);
check_cuda_error(cudaGetLastError());
// Apply LUT
integer_radix_apply_univariate_lookup_table_kb(stream, lwe_array_out,
lut->tmp_lwe_before_ks, bsk,
ksk, num_radix_blocks, lut);
cuda_keyswitch_lwe_ciphertext_vector(
stream, lut->tmp_lwe_after_ks, lut->lwe_trivial_indexes, lwe_array_pbs_in,
lut->lwe_trivial_indexes, ksk, big_lwe_dimension, small_lwe_dimension,
ks_base_log, ks_level, num_radix_blocks);
execute_pbs<Torus>(stream, lwe_array_out, lut->lwe_indexes_out, lut->lut,
lut->lut_indexes, lut->tmp_lwe_after_ks,
lut->lwe_trivial_indexes, bsk, lut->buffer, glwe_dimension,
small_lwe_dimension, polynomial_size, pbs_base_log,
pbs_level, grouping_factor, num_radix_blocks, 1, 0,
cuda_get_max_shared_memory(stream->gpu_index), pbs_type);
}
// Rotates the slice in-place such that the first mid elements of the slice move
@@ -276,7 +298,7 @@ void generate_device_accumulator(cuda_stream_t *stream, Torus *acc,
}
template <typename Torus>
void scratch_cuda_propagate_single_carry_low_latency_kb_inplace(
void scratch_cuda_propagate_single_carry_kb_inplace(
cuda_stream_t *stream, int_sc_prop_memory<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
bool allocate_gpu_memory) {
@@ -286,11 +308,9 @@ void scratch_cuda_propagate_single_carry_low_latency_kb_inplace(
}
template <typename Torus>
void host_propagate_single_carry_low_latency(cuda_stream_t *stream,
Torus *lwe_array,
int_sc_prop_memory<Torus> *mem,
void *bsk, Torus *ksk,
uint32_t num_blocks) {
void host_propagate_single_carry(cuda_stream_t *stream, Torus *lwe_array,
int_sc_prop_memory<Torus> *mem, void *bsk,
Torus *ksk, uint32_t num_blocks) {
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
@@ -341,6 +361,65 @@ void host_propagate_single_carry_low_latency(cuda_stream_t *stream,
stream, lwe_array, lwe_array, bsk, ksk, num_blocks, message_acc);
}
template <typename Torus>
void host_propagate_single_sub_borrow(cuda_stream_t *stream, Torus *overflowed,
Torus *lwe_array,
int_single_borrow_prop_memory<Torus> *mem,
void *bsk, Torus *ksk,
uint32_t num_blocks) {
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto big_lwe_size = glwe_dimension * polynomial_size + 1;
auto big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
auto generates_or_propagates = mem->generates_or_propagates;
auto step_output = mem->step_output;
auto luts_array = mem->luts_array;
auto luts_carry_propagation_sum = mem->luts_borrow_propagation_sum;
auto message_acc = mem->message_acc;
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, generates_or_propagates, lwe_array, bsk, ksk, num_blocks,
luts_array);
// compute the prefix sum with a Hillis & Steele scan
int num_steps = ceil(log2((double)num_blocks));
int space = 1;
cuda_memcpy_async_gpu_to_gpu(step_output, generates_or_propagates,
big_lwe_size_bytes * num_blocks, stream);
for (int step = 0; step < num_steps; step++) {
auto cur_blocks = &step_output[space * big_lwe_size];
auto prev_blocks = generates_or_propagates;
int cur_total_blocks = num_blocks - space;
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, cur_blocks, cur_blocks, prev_blocks, bsk, ksk, cur_total_blocks,
luts_carry_propagation_sum);
cuda_memcpy_async_gpu_to_gpu(&generates_or_propagates[space * big_lwe_size],
cur_blocks,
big_lwe_size_bytes * cur_total_blocks, stream);
space *= 2;
}
cuda_memcpy_async_gpu_to_gpu(
overflowed, &generates_or_propagates[big_lwe_size * (num_blocks - 1)],
big_lwe_size_bytes, stream);
radix_blocks_rotate_right<<<num_blocks, 256, 0, stream->stream>>>(
step_output, generates_or_propagates, 1, num_blocks, big_lwe_size);
cuda_memset_async(step_output, 0, big_lwe_size_bytes, stream);
host_subtraction(stream, lwe_array, lwe_array, step_output,
glwe_dimension * polynomial_size, num_blocks);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array, lwe_array, bsk, ksk, num_blocks, message_acc);
}
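/* Plain-integer sketch (not library code) of the Hillis & Steele inclusive
 * scan that drives the carry/borrow propagation loops above; `combine` stands
 * in for the bivariate LUT and the vector elements for the ciphertext blocks.
 * Assumes <vector> is available. */
template <typename T, typename F>
void hillis_steele_scan_sketch(std::vector<T> &blocks, F combine) {
  int n = (int)blocks.size();
  if (n < 2)
    return;
  int num_steps = (int)ceil(log2((double)n));
  std::vector<T> prev = blocks;
  for (int step = 0, space = 1; step < num_steps; step++, space *= 2) {
    // each block at index i >= space absorbs the state of block i - space
    for (int i = space; i < n; i++)
      blocks[i] = combine(blocks[i], prev[i - space]);
    prev = blocks;
  }
}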
/*
* input_blocks: input radix ciphertext; propagation happens in place
* acc_message_carry: list of two LUTs, [(message_acc), (carry_acc)]
@@ -508,7 +587,7 @@ __global__ void device_pack_blocks(Torus *lwe_array_out, Torus *lwe_array_in,
packed_block[tid] = lsb_block[tid] + factor * msb_block[tid];
}
if (num_radix_blocks % 2 != 0) {
if (num_radix_blocks % 2 == 1) {
// We couldn't pack the last block, so we just copy it
Torus *lsb_block =
lwe_array_in + (num_radix_blocks - 1) * (lwe_dimension + 1);
@@ -589,4 +668,107 @@ create_trivial_radix(cuda_stream_t *stream, Torus *lwe_array_out,
check_cuda_error(cudaGetLastError());
}
/**
* Each bit in lwe_array_in becomes an LWE ciphertext in lwe_array_out.
* Thus, lwe_array_out must be allocated with num_radix_blocks * bits_per_block
* * (lwe_dimension+1) * sizeof(Torus) bytes
*/
template <typename Torus>
__host__ void extract_n_bits(cuda_stream_t *stream, Torus *lwe_array_out,
Torus *lwe_array_in, void *bsk, Torus *ksk,
uint32_t num_radix_blocks, uint32_t bits_per_block,
int_bit_extract_luts_buffer<Torus> *bit_extract) {
integer_radix_apply_univariate_lookup_table_kb(
stream, lwe_array_out, lwe_array_in, bsk, ksk,
num_radix_blocks * bits_per_block, bit_extract->lut);
}
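/* Small helper sketching the allocation rule from the comment above; the
 * name and signature are illustrative, not an existing API. */
template <typename Torus>
size_t extracted_bits_size_bytes(uint32_t num_radix_blocks,
                                 uint32_t bits_per_block,
                                 uint32_t lwe_dimension) {
  return (size_t)num_radix_blocks * bits_per_block * (lwe_dimension + 1) *
         sizeof(Torus);
}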
template <typename Torus>
__host__ void reduce_signs(cuda_stream_t *stream, Torus *signs_array_out,
Torus *signs_array_in,
int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> sign_handler_f,
void *bsk, Torus *ksk, uint32_t num_sign_blocks) {
auto diff_buffer = mem_ptr->diff_buffer;
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
std::function<Torus(Torus)> reduce_two_orderings_function =
[diff_buffer, sign_handler_f](Torus x) -> Torus {
int msb = (x >> 2) & 3;
int lsb = x & 3;
return diff_buffer->tree_buffer->block_selector_f(msb, lsb);
};
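// The packed input word holds two 2-bit orderings (0 = inferior, 1 = equal,
// 2 = superior): the msb ordering in bits 3..2 and the lsb ordering in bits
// 1..0; e.g. x = 0b0110 encodes (equal, superior).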
auto signs_a = diff_buffer->tmp_signs_a;
auto signs_b = diff_buffer->tmp_signs_b;
cuda_memcpy_async_gpu_to_gpu(
signs_a, signs_array_in,
(big_lwe_dimension + 1) * num_sign_blocks * sizeof(Torus), stream);
if (num_sign_blocks > 2) {
auto lut = diff_buffer->reduce_signs_lut;
generate_device_accumulator<Torus>(
stream, lut->lut, glwe_dimension, polynomial_size, message_modulus,
carry_modulus, reduce_two_orderings_function);
while (num_sign_blocks > 2) {
pack_blocks(stream, signs_b, signs_a, big_lwe_dimension, num_sign_blocks,
4);
integer_radix_apply_univariate_lookup_table_kb(
stream, signs_a, signs_b, bsk, ksk, num_sign_blocks / 2, lut);
auto last_block_signs_b =
signs_b + (num_sign_blocks / 2) * (big_lwe_dimension + 1);
auto last_block_signs_a =
signs_a + (num_sign_blocks / 2) * (big_lwe_dimension + 1);
if (num_sign_blocks % 2 == 1)
cuda_memcpy_async_gpu_to_gpu(last_block_signs_a, last_block_signs_b,
(big_lwe_dimension + 1) * sizeof(Torus),
stream);
num_sign_blocks = (num_sign_blocks / 2) + (num_sign_blocks % 2);
}
}
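// e.g. 7 sign blocks reduce as 7 -> 4 -> 2 in the loop above (illustrative
// trace); the remaining pair, or a single block, is handled below.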
if (num_sign_blocks == 2) {
std::function<Torus(Torus)> final_lut_f =
[reduce_two_orderings_function, sign_handler_f](Torus x) -> Torus {
Torus final_sign = reduce_two_orderings_function(x);
return sign_handler_f(final_sign);
};
auto lut = diff_buffer->reduce_signs_lut;
generate_device_accumulator<Torus>(stream, lut->lut, glwe_dimension,
polynomial_size, message_modulus,
carry_modulus, final_lut_f);
pack_blocks(stream, signs_b, signs_a, big_lwe_dimension, 2, 4);
integer_radix_apply_univariate_lookup_table_kb(stream, signs_array_out,
signs_b, bsk, ksk, 1, lut);
} else {
std::function<Torus(Torus)> final_lut_f =
[mem_ptr, sign_handler_f](Torus x) -> Torus {
return sign_handler_f(x & 3);
};
auto lut = mem_ptr->diff_buffer->reduce_signs_lut;
generate_device_accumulator<Torus>(stream, lut->lut, glwe_dimension,
polynomial_size, message_modulus,
carry_modulus, final_lut_f);
integer_radix_apply_univariate_lookup_table_kb(stream, signs_array_out,
signs_a, bsk, ksk, 1, lut);
}
}
#endif // TFHE_RS_INTERNAL_INTEGER_CUH

View File

@@ -1,5 +1,66 @@
#include "integer/multiplication.cuh"
/*
* When adding chunk_size terms together, some blocks may see no addition or
* end up with degree zero; those blocks do not need a lookup table applied.
* We therefore find the indexes of the blocks where an addition happened and
* store them in h_lwe_idx_in. A single block may yield both a message and a
* carry (if it is not the last block), so one input block id can map to two
* output ids, which we store in h_lwe_idx_out. Blocks that do not require a
* lookup table are either copied to both the message and carry sides or
* replaced with zero ciphertexts; the indexes of such blocks are stored in
* h_smart_copy_in (input ids) and h_smart_copy_out (output ids). An input id
* of -1 means that a zero ciphertext is written at the output index.
*/
void generate_ids_update_degrees(int *terms_degree, size_t *h_lwe_idx_in,
size_t *h_lwe_idx_out,
int32_t *h_smart_copy_in,
int32_t *h_smart_copy_out, size_t ch_amount,
uint32_t num_radix, uint32_t num_blocks,
size_t chunk_size, size_t message_max,
size_t &total_count, size_t &message_count,
size_t &carry_count, size_t &sm_copy_count) {
for (size_t c_id = 0; c_id < ch_amount; c_id++) {
auto cur_chunk = &terms_degree[c_id * chunk_size * num_blocks];
for (size_t r_id = 0; r_id < num_blocks; r_id++) {
size_t new_degree = 0;
for (size_t chunk_id = 0; chunk_id < chunk_size; chunk_id++) {
new_degree += cur_chunk[chunk_id * num_blocks + r_id];
}
if (new_degree > message_max) {
h_lwe_idx_in[message_count] = c_id * num_blocks + r_id;
h_lwe_idx_out[message_count] = c_id * num_blocks + r_id;
message_count++;
} else {
h_smart_copy_in[sm_copy_count] = c_id * num_blocks + r_id;
h_smart_copy_out[sm_copy_count] = c_id * num_blocks + r_id;
sm_copy_count++;
}
}
}
for (size_t i = 0; i < sm_copy_count; i++) {
h_smart_copy_in[i] = -1;
h_smart_copy_out[i] = h_smart_copy_out[i] + ch_amount * num_blocks + 1;
}
for (size_t i = 0; i < message_count; i++) {
if (h_lwe_idx_in[i] % num_blocks != num_blocks - 1) {
h_lwe_idx_in[message_count + carry_count] = h_lwe_idx_in[i];
h_lwe_idx_out[message_count + carry_count] =
ch_amount * num_blocks + h_lwe_idx_in[i] + 1;
carry_count++;
} else {
h_smart_copy_in[sm_copy_count] = -1;
h_smart_copy_out[sm_copy_count] =
h_lwe_idx_in[i] - (num_blocks - 1) + ch_amount * num_blocks;
sm_copy_count++;
}
}
total_count = message_count + carry_count;
}
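/* Minimal host-side usage sketch with illustrative sizes (one chunk of
 * chunk_size = 3 radixes of 2 blocks each, message_max = 3): block 0
 * accumulates degree 9 and gets lookup-table indexes, block 1 stays at
 * degree 3 and is routed to the smart-copy arrays. */
static void example_generate_ids(void) {
  int terms_degree[6] = {3, 3, 3, 0, 3, 0}; // 3 radixes x 2 blocks
  size_t idx_in[4], idx_out[4];
  int32_t copy_in[4], copy_out[4];
  size_t total = 0, message = 0, carry = 0, sm_copy = 0;
  generate_ids_update_degrees(terms_degree, idx_in, idx_out, copy_in, copy_out,
                              /*ch_amount=*/1, /*num_radix=*/3,
                              /*num_blocks=*/2, /*chunk_size=*/3,
                              /*message_max=*/3, total, message, carry,
                              sm_copy);
}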
/*
* This scratch function allocates the necessary amount of data on the GPU for
* the integer radix multiplication in keyswitch->bootstrap order.
@@ -13,9 +74,9 @@ void scratch_cuda_integer_mult_radix_ciphertext_kb_64(
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
polynomial_size, lwe_dimension, ks_level, ks_base_log,
pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
polynomial_size * glwe_dimension, lwe_dimension,
ks_level, ks_base_log, pbs_level, pbs_base_log,
grouping_factor, message_modulus, carry_modulus);
switch (polynomial_size) {
case 2048:
@@ -89,21 +150,92 @@ void cleanup_cuda_integer_mult(cuda_stream_t *stream, int8_t **mem_ptr_void) {
mem_ptr->release(stream);
}
void cuda_small_scalar_multiplication_integer_radix_ciphertext_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint64_t scalar,
uint32_t lwe_dimension, uint32_t lwe_ciphertext_count) {
void scratch_cuda_integer_radix_sum_ciphertexts_vec_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t lwe_dimension, uint32_t ks_level,
uint32_t ks_base_log, uint32_t pbs_level, uint32_t pbs_base_log,
uint32_t grouping_factor, uint32_t num_blocks_in_radix,
uint32_t max_num_radix_in_vec, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, bool allocate_gpu_memory) {
cuda_small_scalar_multiplication_integer_radix_ciphertext_64(
stream, lwe_array, lwe_array, scalar, lwe_dimension,
lwe_ciphertext_count);
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
glwe_dimension * polynomial_size, lwe_dimension,
ks_level, ks_base_log, pbs_level, pbs_base_log,
grouping_factor, message_modulus, carry_modulus);
scratch_cuda_integer_sum_ciphertexts_vec_kb<uint64_t>(
stream, (int_sum_ciphertexts_vec_memory<uint64_t> **)mem_ptr,
num_blocks_in_radix, max_num_radix_in_vec, params, allocate_gpu_memory);
}
void cuda_small_scalar_multiplication_integer_radix_ciphertext_64(
cuda_stream_t *stream, void *output_lwe_array, void *input_lwe_array,
uint64_t scalar, uint32_t lwe_dimension, uint32_t lwe_ciphertext_count) {
void cuda_integer_radix_sum_ciphertexts_vec_kb_64(
cuda_stream_t *stream, void *radix_lwe_out, void *radix_lwe_vec,
uint32_t num_radix_in_vec, int8_t *mem_ptr, void *bsk, void *ksk,
uint32_t num_blocks_in_radix) {
host_integer_small_scalar_mult_radix(
stream, static_cast<uint64_t *>(output_lwe_array),
static_cast<uint64_t *>(input_lwe_array), scalar, lwe_dimension,
lwe_ciphertext_count);
auto mem = (int_sum_ciphertexts_vec_memory<uint64_t> *)mem_ptr;
int *terms_degree =
(int *)malloc(num_blocks_in_radix * num_radix_in_vec * sizeof(int));
for (int i = 0; i < num_radix_in_vec * num_blocks_in_radix; i++) {
terms_degree[i] = mem->params.message_modulus - 1;
}
switch (mem->params.polynomial_size) {
case 512:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<512>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
case 1024:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<1024>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
case 2048:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<2048>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
case 4096:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<4096>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
case 8192:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<8192>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
case 16384:
host_integer_sum_ciphertexts_vec_kb<uint64_t, AmortizedDegree<16384>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_vec), terms_degree, bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks_in_radix,
num_radix_in_vec);
break;
default:
PANIC("Cuda error (integer sum ciphertexts): unsupported polynomial size. "
"Only N = 512, 1024, 2048, 4096, 8192, 16384 is supported")
}
free(terms_degree);
}
void cleanup_cuda_integer_radix_sum_ciphertexts_vec(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_sum_ciphertexts_vec_memory<uint64_t> *mem_ptr =
(int_sum_ciphertexts_vec_memory<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -6,12 +6,12 @@
#include <cuda_runtime.h>
#endif
#include "bootstrap.h"
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.h"
#include "integer/integer.cuh"
#include "linear_algebra.h"
#include "programmable_bootstrap.h"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#include <fstream>
@@ -21,6 +21,24 @@
#include <string>
#include <vector>
template <typename Torus>
__global__ void smart_copy(Torus *dst, Torus *src, int32_t *id_out,
int32_t *id_in, size_t lwe_size) {
size_t tid = threadIdx.x;
size_t b_id = blockIdx.x;
size_t stride = blockDim.x;
auto input_id = id_in[b_id];
auto output_id = id_out[b_id];
auto cur_src = (input_id >= 0) ? &src[input_id * lwe_size] : nullptr;
auto cur_dst = &dst[output_id * lwe_size];
for (int i = tid; i < lwe_size; i += stride) {
cur_dst[i] = (input_id >= 0) ? cur_src[i] : 0;
}
}
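// Note: an input id of -1 (as produced by generate_ids_update_degrees) turns
// the destination block into a zero ciphertext; cur_src is never dereferenced
// in that case.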
template <typename Torus, class params>
__global__ void
all_shifted_lhs_rhs(Torus *radix_lwe_left, Torus *lsb_ciphertext,
@@ -74,100 +92,37 @@ all_shifted_lhs_rhs(Torus *radix_lwe_left, Torus *lsb_ciphertext,
}
template <typename Torus>
void compress_device_array_with_map(cuda_stream_t *stream, Torus *src,
Torus *dst, int *S, int *F, int num_blocks,
uint32_t map_size, uint32_t unit_size,
int &total_copied, bool is_message) {
cudaSetDevice(stream->gpu_index);
for (int i = 0; i < map_size; i++) {
int s_index = i * num_blocks + S[i];
int number_of_unit = F[i] - S[i] + is_message;
auto cur_dst = &dst[total_copied * unit_size];
auto cur_src = &src[s_index * unit_size];
size_t copy_size = unit_size * number_of_unit * sizeof(Torus);
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied += number_of_unit;
}
}
template <typename Torus>
void extract_message_carry_to_full_radix(cuda_stream_t *stream, Torus *src,
Torus *dst, int *S, int *F,
uint32_t map_size, uint32_t unit_size,
int &total_copied,
int &total_radix_copied,
int num_blocks, bool is_message) {
cudaSetDevice(stream->gpu_index);
size_t radix_size = unit_size * num_blocks;
for (int i = 0; i < map_size; i++) {
auto cur_dst_radix = &dst[total_radix_copied * radix_size];
int s_index = S[i];
int number_of_unit = F[i] - s_index + is_message;
if (!is_message) {
int zero_block_count = num_blocks - number_of_unit;
cuda_memset_async(cur_dst_radix, 0,
zero_block_count * unit_size * sizeof(Torus), stream);
s_index = zero_block_count;
}
auto cur_dst = &cur_dst_radix[s_index * unit_size];
auto cur_src = &src[total_copied * unit_size];
size_t copy_size = unit_size * number_of_unit * sizeof(Torus);
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied += number_of_unit;
++total_radix_copied;
}
}
template <typename Torus, class params>
__global__ void tree_add_chunks(Torus *result_blocks, Torus *input_blocks,
uint32_t chunk_size, uint32_t num_blocks) {
uint32_t chunk_size, uint32_t block_size,
uint32_t num_blocks) {
extern __shared__ Torus result[];
size_t stride = blockDim.x;
size_t chunk_id = blockIdx.x;
size_t chunk_elem_size = chunk_size * num_blocks * (params::degree + 1);
size_t radix_elem_size = num_blocks * (params::degree + 1);
size_t chunk_elem_size = chunk_size * num_blocks * block_size;
size_t radix_elem_size = num_blocks * block_size;
auto src_chunk = &input_blocks[chunk_id * chunk_elem_size];
auto dst_radix = &result_blocks[chunk_id * radix_elem_size];
size_t block_stride = blockIdx.y * (params::degree + 1);
size_t block_stride = blockIdx.y * block_size;
auto dst_block = &dst_radix[block_stride];
// init shared mem with first radix of chunk
size_t tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
result[tid] = src_chunk[block_stride + tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
result[params::degree] = src_chunk[block_stride + params::degree];
for (int i = tid; i < block_size; i += stride) {
result[i] = src_chunk[block_stride + i];
}
// accumulate rest of the radixes
for (int r_id = 1; r_id < chunk_size; r_id++) {
auto cur_src_radix = &src_chunk[r_id * radix_elem_size];
tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
result[tid] += cur_src_radix[block_stride + tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
result[params::degree] += cur_src_radix[block_stride + params::degree];
for (int i = tid; i < block_size; i += stride) {
result[i] += cur_src_radix[block_stride + i];
}
}
// put result from shared mem to global mem
tid = threadIdx.x;
for (int i = 0; i < params::opt; i++) {
dst_block[tid] = result[tid];
tid += params::degree / params::opt;
}
if (threadIdx.x == 0) {
dst_block[params::degree] = result[params::degree];
for (int i = tid; i < block_size; i += stride) {
dst_block[i] = result[i];
}
}
@@ -218,6 +173,142 @@ __global__ void fill_radix_from_lsb_msb(Torus *result_blocks, Torus *lsb_blocks,
(process_msb) ? cur_msb_ct[params::degree] : 0;
}
}
template <typename Torus>
__host__ void scratch_cuda_integer_sum_ciphertexts_vec_kb(
cuda_stream_t *stream, int_sum_ciphertexts_vec_memory<Torus> **mem_ptr,
uint32_t num_blocks_in_radix, uint32_t max_num_radix_in_vec,
int_radix_params params, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
size_t sm_size = (params.big_lwe_dimension + 1) * sizeof(Torus);
check_cuda_error(cudaFuncSetAttribute(
tree_add_chunks<Torus>, cudaFuncAttributeMaxDynamicSharedMemorySize,
sm_size));
cudaFuncSetCacheConfig(tree_add_chunks<Torus>, cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
*mem_ptr = new int_sum_ciphertexts_vec_memory<Torus>(
stream, params, num_blocks_in_radix, max_num_radix_in_vec,
allocate_gpu_memory);
}
template <typename Torus, class params>
__host__ void host_integer_sum_ciphertexts_vec_kb(
cuda_stream_t *stream, Torus *radix_lwe_out, Torus *terms,
int *terms_degree, void *bsk, uint64_t *ksk,
int_sum_ciphertexts_vec_memory<uint64_t> *mem_ptr,
uint32_t num_blocks_in_radix, uint32_t num_radix_in_vec) {
cudaSetDevice(stream->gpu_index);
auto new_blocks = mem_ptr->new_blocks;
auto old_blocks = mem_ptr->old_blocks;
auto small_lwe_vector = mem_ptr->small_lwe_vector;
auto luts_message_carry = mem_ptr->luts_message_carry;
auto lwe_indexes_in = luts_message_carry->lwe_indexes_in;
auto lwe_indexes_out = luts_message_carry->lwe_indexes_out;
auto d_smart_copy_in = mem_ptr->d_smart_copy_in;
auto d_smart_copy_out = mem_ptr->d_smart_copy_out;
auto message_modulus = mem_ptr->params.message_modulus;
auto carry_modulus = mem_ptr->params.carry_modulus;
auto num_blocks = num_blocks_in_radix;
auto big_lwe_size = mem_ptr->params.big_lwe_dimension + 1;
auto glwe_dimension = mem_ptr->params.glwe_dimension;
auto polynomial_size = mem_ptr->params.polynomial_size;
auto lwe_dimension = mem_ptr->params.small_lwe_dimension;
auto big_lwe_dimension = mem_ptr->params.big_lwe_dimension;
if (old_blocks != terms) {
cuda_memcpy_async_gpu_to_gpu(old_blocks, terms,
num_blocks_in_radix * num_radix_in_vec *
big_lwe_size * sizeof(Torus),
stream);
}
size_t r = num_radix_in_vec;
size_t total_modulus = message_modulus * carry_modulus;
size_t message_max = message_modulus - 1;
size_t chunk_size = (total_modulus - 1) / message_max;
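// e.g. with message_modulus = 4 and carry_modulus = 4 (illustrative values):
// total_modulus = 16, message_max = 3, so chunk_size = (16 - 1) / 3 = 5
// ciphertexts can be summed before the accumulated degree risks overflowing
// the carry space.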
size_t h_lwe_idx_in[r * num_blocks];
size_t h_lwe_idx_out[r * num_blocks];
int32_t h_smart_copy_in[r * num_blocks];
int32_t h_smart_copy_out[r * num_blocks];
auto max_shared_memory = cuda_get_max_shared_memory(stream->gpu_index);
while (r > 2) {
size_t cur_total_blocks = r * num_blocks;
size_t ch_amount = r / chunk_size;
if (!ch_amount)
ch_amount++;
dim3 add_grid(ch_amount, num_blocks, 1);
size_t sm_size = big_lwe_size * sizeof(Torus);
tree_add_chunks<Torus><<<add_grid, 512, sm_size, stream->stream>>>(
new_blocks, old_blocks, min(r, chunk_size), big_lwe_size, num_blocks);
size_t total_count = 0;
size_t message_count = 0;
size_t carry_count = 0;
size_t sm_copy_count = 0;
generate_ids_update_degrees(
terms_degree, h_lwe_idx_in, h_lwe_idx_out, h_smart_copy_in,
h_smart_copy_out, ch_amount, r, num_blocks, chunk_size, message_max,
total_count, message_count, carry_count, sm_copy_count);
size_t copy_size = total_count * sizeof(Torus);
cuda_memcpy_async_to_gpu(lwe_indexes_in, h_lwe_idx_in, copy_size, stream);
cuda_memcpy_async_to_gpu(lwe_indexes_out, h_lwe_idx_out, copy_size, stream);
copy_size = sm_copy_count * sizeof(int32_t);
cuda_memcpy_async_to_gpu(d_smart_copy_in, h_smart_copy_in, copy_size,
stream);
cuda_memcpy_async_to_gpu(d_smart_copy_out, h_smart_copy_out, copy_size,
stream);
smart_copy<<<sm_copy_count, 256, 0, stream->stream>>>(
new_blocks, new_blocks, d_smart_copy_out, d_smart_copy_in,
big_lwe_size);
if (carry_count > 0)
cuda_set_value_async<Torus>(
&(stream->stream), luts_message_carry->get_lut_indexes(message_count),
1, carry_count);
cuda_keyswitch_lwe_ciphertext_vector(
stream, small_lwe_vector, lwe_indexes_in, new_blocks, lwe_indexes_in,
ksk, polynomial_size * glwe_dimension, lwe_dimension,
mem_ptr->params.ks_base_log, mem_ptr->params.ks_level, message_count);
execute_pbs<Torus>(
stream, new_blocks, lwe_indexes_out, luts_message_carry->lut,
luts_message_carry->lut_indexes, small_lwe_vector, lwe_indexes_in, bsk,
luts_message_carry->buffer, glwe_dimension, lwe_dimension,
polynomial_size, mem_ptr->params.pbs_base_log,
mem_ptr->params.pbs_level, mem_ptr->params.grouping_factor, total_count,
2, 0, max_shared_memory, mem_ptr->params.pbs_type);
int rem_blocks = (r > chunk_size) ? r % chunk_size * num_blocks : 0;
int new_blocks_created = 2 * ch_amount * num_blocks;
copy_size = rem_blocks * big_lwe_size * sizeof(Torus);
auto cur_dst = &new_blocks[new_blocks_created * big_lwe_size];
auto cur_src = &old_blocks[(cur_total_blocks - rem_blocks) * big_lwe_size];
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
std::swap(new_blocks, old_blocks);
r = (new_blocks_created + rem_blocks) / num_blocks;
}
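// e.g. with chunk_size = 5 the radix count shrinks as 13 -> 7 -> 4 -> 2
// (each pass keeps 2 * ch_amount radixes plus the unchunked remainder);
// the final two radixes are added and carry-propagated below.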
host_addition(stream, radix_lwe_out, old_blocks,
&old_blocks[num_blocks * big_lwe_size], big_lwe_dimension,
num_blocks);
host_propagate_single_carry<Torus>(stream, radix_lwe_out, mem_ptr->scp_mem,
bsk, ksk, num_blocks);
}
template <typename Torus, typename STorus, class params>
__host__ void host_integer_mult_radix_kb(
@@ -233,7 +324,6 @@ __host__ void host_integer_mult_radix_kb(
auto carry_modulus = mem_ptr->params.carry_modulus;
int big_lwe_dimension = glwe_dimension * polynomial_size;
int big_lwe_size = big_lwe_dimension + 1;
// 'vector_result_lsb' contains blocks from all possible right shifts of
// radix_lwe_left; only nonzero blocks are kept
@@ -281,17 +371,6 @@ __host__ void host_integer_mult_radix_kb(
// 2 * (glwe_dimension + 1) * polynomial_size
auto luts_array = mem_ptr->luts_array;
// accumulator to extract message
// with length (glwe_dimension + 1) * polynomial_size
auto luts_message = mem_ptr->luts_message;
// accumulator to extract carry
// with length (glwe_dimension + 1) * polynomial_size
auto luts_carry = mem_ptr->luts_carry;
// to be used as default indexing
auto lwe_indexes = luts_array->lwe_indexes;
auto vector_result_lsb = &vector_result_sb[0];
auto vector_result_msb =
&vector_result_sb[lsb_vector_block_count *
@@ -323,144 +402,22 @@ __host__ void host_integer_mult_radix_kb(
lsb_vector_block_count, msb_vector_block_count,
num_blocks);
auto new_blocks = block_mul_res;
auto old_blocks = vector_result_sb;
// number of radixes present after block_mul
size_t r = 2 * num_blocks;
size_t total_modulus = message_modulus * carry_modulus;
size_t message_max = message_modulus - 1;
size_t chunk_size = (total_modulus - 1) / message_max;
size_t ch_amount = r / chunk_size;
int terms_degree[r * num_blocks];
int f_b[ch_amount];
int l_b[ch_amount];
int terms_degree[2 * num_blocks * num_blocks];
for (int i = 0; i < num_blocks * num_blocks; i++) {
size_t r_id = i / num_blocks;
size_t b_id = i % num_blocks;
terms_degree[i] = (b_id >= r_id) ? 3 : 0;
terms_degree[i] = (b_id >= r_id) ? message_modulus - 1 : 0;
}
auto terms_degree_msb = &terms_degree[num_blocks * num_blocks];
for (int i = 0; i < num_blocks * num_blocks; i++) {
size_t r_id = i / num_blocks;
size_t b_id = i % num_blocks;
terms_degree_msb[i] = (b_id > r_id) ? 2 : 0;
terms_degree_msb[i] = (b_id > r_id) ? message_modulus - 2 : 0;
}
auto max_shared_memory = cuda_get_max_shared_memory(stream->gpu_index);
while (r > chunk_size) {
int cur_total_blocks = r * num_blocks;
ch_amount = r / chunk_size;
dim3 add_grid(ch_amount, num_blocks, 1);
size_t sm_size = big_lwe_size * sizeof(Torus);
cuda_memset_async(new_blocks, 0,
ch_amount * num_blocks * big_lwe_size * sizeof(Torus),
stream);
tree_add_chunks<Torus, params><<<add_grid, 256, sm_size, stream->stream>>>(
new_blocks, old_blocks, chunk_size, num_blocks);
for (int c_id = 0; c_id < ch_amount; c_id++) {
auto cur_chunk = &terms_degree[c_id * chunk_size * num_blocks];
int mx = 0;
int mn = num_blocks;
for (int r_id = 1; r_id < chunk_size; r_id++) {
auto cur_radix = &cur_chunk[r_id * num_blocks];
for (int i = 0; i < num_blocks; i++) {
if (cur_radix[i]) {
mn = min(mn, i);
mx = max(mx, i);
}
}
}
f_b[c_id] = mn;
l_b[c_id] = mx;
}
int total_copied = 0;
int message_count = 0;
int carry_count = 0;
compress_device_array_with_map<Torus>(stream, new_blocks, old_blocks, f_b,
l_b, num_blocks, ch_amount,
big_lwe_size, total_copied, true);
message_count = total_copied;
compress_device_array_with_map<Torus>(stream, new_blocks, old_blocks, f_b,
l_b, num_blocks, ch_amount,
big_lwe_size, total_copied, false);
carry_count = total_copied - message_count;
auto message_blocks_vector = old_blocks;
auto carry_blocks_vector =
&old_blocks[message_count * (glwe_dimension * polynomial_size + 1)];
cuda_keyswitch_lwe_ciphertext_vector(
stream, small_lwe_vector, lwe_indexes, old_blocks, lwe_indexes, ksk,
polynomial_size * glwe_dimension, lwe_dimension,
mem_ptr->params.ks_base_log, mem_ptr->params.ks_level, total_copied);
execute_pbs<Torus>(stream, message_blocks_vector, lwe_indexes,
luts_message->lut, luts_message->lut_indexes,
small_lwe_vector, lwe_indexes, bsk, luts_message->buffer,
glwe_dimension, lwe_dimension, polynomial_size,
mem_ptr->params.pbs_base_log, mem_ptr->params.pbs_level,
mem_ptr->params.grouping_factor, message_count, 1, 0,
max_shared_memory, mem_ptr->params.pbs_type);
execute_pbs<Torus>(stream, carry_blocks_vector, lwe_indexes,
luts_carry->lut, luts_carry->lut_indexes,
&small_lwe_vector[message_count * (lwe_dimension + 1)],
lwe_indexes, bsk, luts_carry->buffer, glwe_dimension,
lwe_dimension, polynomial_size,
mem_ptr->params.pbs_base_log, mem_ptr->params.pbs_level,
mem_ptr->params.grouping_factor, carry_count, 1, 0,
max_shared_memory, mem_ptr->params.pbs_type);
int rem_blocks = r % chunk_size * num_blocks;
int new_blocks_created = 2 * ch_amount * num_blocks;
int copy_size = rem_blocks * big_lwe_size * sizeof(Torus);
auto cur_dst = &new_blocks[new_blocks_created * big_lwe_size];
auto cur_src = &old_blocks[(cur_total_blocks - rem_blocks) * big_lwe_size];
cuda_memcpy_async_gpu_to_gpu(cur_dst, cur_src, copy_size, stream);
total_copied = 0;
int total_radix_copied = 0;
extract_message_carry_to_full_radix<Torus>(
stream, old_blocks, new_blocks, f_b, l_b, ch_amount, big_lwe_size,
total_copied, total_radix_copied, num_blocks, true);
extract_message_carry_to_full_radix<Torus>(
stream, old_blocks, new_blocks, f_b, l_b, ch_amount, big_lwe_size,
total_copied, total_radix_copied, num_blocks, false);
std::swap(new_blocks, old_blocks);
r = (new_blocks_created + rem_blocks) / num_blocks;
}
dim3 add_grid(1, num_blocks, 1);
size_t sm_size = big_lwe_size * sizeof(Torus);
cuda_memset_async(radix_lwe_out, 0, num_blocks * big_lwe_size * sizeof(Torus),
stream);
tree_add_chunks<Torus, params><<<add_grid, 256, sm_size, stream->stream>>>(
radix_lwe_out, old_blocks, r, num_blocks);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, vector_result_sb, radix_lwe_out, bsk, ksk, num_blocks,
luts_message);
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, &block_mul_res[big_lwe_size], radix_lwe_out, bsk, ksk, num_blocks,
luts_carry);
cuda_memset_async(block_mul_res, 0, big_lwe_size * sizeof(Torus), stream);
host_addition(stream, radix_lwe_out, vector_result_sb, block_mul_res,
big_lwe_dimension, num_blocks);
host_propagate_single_carry_low_latency<Torus>(
stream, radix_lwe_out, mem_ptr->scp_mem, bsk, ksk, num_blocks);
host_integer_sum_ciphertexts_vec_kb<Torus, params>(
stream, radix_lwe_out, vector_result_sb, terms_degree, bsk, ksk,
mem_ptr->sum_ciphertexts_mem, num_blocks, 2 * num_blocks);
}
template <typename Torus>
@@ -469,166 +426,15 @@ __host__ void scratch_cuda_integer_mult_radix_ciphertext_kb(
uint32_t num_radix_blocks, int_radix_params params,
bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
size_t sm_size = (params.big_lwe_dimension + 1) * sizeof(Torus);
check_cuda_error(cudaFuncSetAttribute(
tree_add_chunks<Torus>, cudaFuncAttributeMaxDynamicSharedMemorySize,
sm_size));
cudaFuncSetCacheConfig(tree_add_chunks<Torus>, cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
*mem_ptr = new int_mul_memory<Torus>(stream, params, num_radix_blocks,
allocate_gpu_memory);
}
// Function to apply a lookup table.
// It has two modes:
// lsb_msb_mode == true - extracts lsb and msb
// lsb_msb_mode == false - extracts message and carry
template <typename Torus, typename STorus, class params>
void apply_lookup_table(Torus *input_ciphertexts, Torus *output_ciphertexts,
int_mul_memory<Torus> *mem_ptr, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size,
uint32_t pbs_base_log, uint32_t pbs_level,
uint32_t ks_base_log, uint32_t ks_level,
uint32_t grouping_factor,
uint32_t lsb_message_blocks_count,
uint32_t msb_carry_blocks_count,
uint32_t max_shared_memory, bool lsb_msb_mode) {
int total_blocks_count = lsb_message_blocks_count + msb_carry_blocks_count;
int gpu_n = mem_ptr->p2p_gpu_count;
if (total_blocks_count < gpu_n)
gpu_n = total_blocks_count;
int gpu_blocks_count = total_blocks_count / gpu_n;
int big_lwe_size = glwe_dimension * polynomial_size + 1;
// int small_lwe_size = lwe_dimension + 1;
#pragma omp parallel for num_threads(gpu_n)
for (int i = 0; i < gpu_n; i++) {
cudaSetDevice(i);
auto this_stream = mem_ptr->streams[i];
// Index where input and output blocks start for current gpu
int big_lwe_start_index = i * gpu_blocks_count * big_lwe_size;
// The last gpu might have extra blocks to process if the total block count
// is not divisible by gpu_n
if (i == gpu_n - 1) {
gpu_blocks_count += total_blocks_count % gpu_n;
}
int can_access_peer;
cudaDeviceCanAccessPeer(&can_access_peer, i, 0);
if (i == 0) {
check_cuda_error(
cudaMemcpyAsync(mem_ptr->pbs_output_multi_gpu[i],
&input_ciphertexts[big_lwe_start_index],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
cudaMemcpyDeviceToDevice, *this_stream));
} else if (can_access_peer) {
check_cuda_error(cudaMemcpyPeerAsync(
mem_ptr->pbs_output_multi_gpu[i], i,
&input_ciphertexts[big_lwe_start_index], 0,
gpu_blocks_count * big_lwe_size * sizeof(Torus), *this_stream));
} else {
// Uses host memory as middle ground
cuda_memcpy_async_to_cpu(mem_ptr->device_to_device_buffer[i],
&input_ciphertexts[big_lwe_start_index],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
this_stream, i);
cuda_memcpy_async_to_gpu(
mem_ptr->pbs_output_multi_gpu[i], mem_ptr->device_to_device_buffer[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus), this_stream, i);
}
// when lsb and msb have to be extracted:
// for the first lsb_count blocks we need lsb_acc
// for the last msb_count blocks we need msb_acc
// when message and carry have to be extracted:
// for the first message_count blocks we need message_acc
// for the last carry_count blocks we need carry_acc
Torus *cur_lut_indexes;
if (lsb_msb_mode) {
cur_lut_indexes = (big_lwe_start_index < lsb_message_blocks_count)
? mem_ptr->lut_indexes_lsb_multi_gpu[i]
: mem_ptr->lut_indexes_msb_multi_gpu[i];
} else {
cur_lut_indexes = (big_lwe_start_index < lsb_message_blocks_count)
? mem_ptr->lut_indexes_message_multi_gpu[i]
: mem_ptr->lut_indexes_carry_multi_gpu[i];
}
// execute keyswitch on the current gpu with the corresponding input and
// output blocks: pbs_output_multi_gpu[i] is the keyswitch input and
// pbs_input_multi_gpu[i] is the keyswitch output
cuda_keyswitch_lwe_ciphertext_vector(
this_stream, i, mem_ptr->pbs_input_multi_gpu[i],
mem_ptr->pbs_output_multi_gpu[i], mem_ptr->ksk_multi_gpu[i],
polynomial_size * glwe_dimension, lwe_dimension, ks_base_log, ks_level,
gpu_blocks_count);
// execute pbs on the current gpu with the corresponding input and output
// blocks
cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
this_stream, i, mem_ptr->pbs_output_multi_gpu[i],
mem_ptr->lut_multi_gpu[i], cur_lut_indexes,
mem_ptr->pbs_input_multi_gpu[i], mem_ptr->bsk_multi_gpu[i],
mem_ptr->pbs_buffer_multi_gpu[i], lwe_dimension, glwe_dimension,
polynomial_size, grouping_factor, pbs_base_log, pbs_level,
grouping_factor, gpu_blocks_count, 2, 0, max_shared_memory);
// the lookup table is applied; now data from the current gpu has to be
// copied back to gpu 0 into the 'output_ciphertexts' buffer
if (i == 0) {
check_cuda_error(
cudaMemcpyAsync(&output_ciphertexts[big_lwe_start_index],
mem_ptr->pbs_output_multi_gpu[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
cudaMemcpyDeviceToDevice, *this_stream));
} else if (can_access_peer) {
check_cuda_error(cudaMemcpyPeerAsync(
&output_ciphertexts[big_lwe_start_index], 0,
mem_ptr->pbs_output_multi_gpu[i], i,
gpu_blocks_count * big_lwe_size * sizeof(Torus), *this_stream));
} else {
// Uses host memory as middle ground
cuda_memcpy_async_to_cpu(
mem_ptr->device_to_device_buffer[i], mem_ptr->pbs_output_multi_gpu[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus), this_stream, i);
cuda_memcpy_async_to_gpu(&output_ciphertexts[big_lwe_start_index],
mem_ptr->device_to_device_buffer[i],
gpu_blocks_count * big_lwe_size * sizeof(Torus),
this_stream, i);
}
}
}
template <typename T>
__global__ void device_small_scalar_radix_multiplication(T *output_lwe_array,
T *input_lwe_array,
T scalar,
uint32_t lwe_dimension,
uint32_t num_blocks) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int lwe_size = lwe_dimension + 1;
if (index < num_blocks * lwe_size) {
// Here we take advantage of the wrapping behaviour of uint
output_lwe_array[index] = input_lwe_array[index] * scalar;
}
}
template <typename T>
__host__ void host_integer_small_scalar_mult_radix(
cuda_stream_t *stream, T *output_lwe_array, T *input_lwe_array, T scalar,
uint32_t input_lwe_dimension, uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
// lwe_size accounts for the body,
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
device_small_scalar_radix_multiplication<<<grid, thds, 0, stream->stream>>>(
output_lwe_array, input_lwe_array, scalar, input_lwe_dimension,
input_lwe_ciphertext_count);
check_cuda_error(cudaGetLastError());
}
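/* Plain sketch (not library API) of the per-coefficient operation performed
 * by the kernel above: with 64-bit torus elements the unsigned product wraps
 * mod 2^64, which is exactly torus scalar multiplication. */
static inline uint64_t torus_scalar_mul_sketch(uint64_t coeff,
                                               uint64_t scalar) {
  return coeff * scalar; // wraps mod 2^64
}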
#endif

View File

@@ -10,3 +10,91 @@ void cuda_negate_integer_radix_ciphertext_64_inplace(
lwe_ciphertext_count, message_modulus,
carry_modulus);
}
void scratch_cuda_integer_radix_overflowing_sub_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_overflowing_sub_kb<uint64_t>(
stream, (int_overflowing_sub_memory<uint64_t> **)mem_ptr, num_blocks,
params, allocate_gpu_memory);
}
void cuda_integer_radix_overflowing_sub_kb_64(
cuda_stream_t *stream, void *radix_lwe_out, void *radix_lwe_overflowed,
void *radix_lwe_left, void *radix_lwe_right, int8_t *mem_ptr, void *bsk,
void *ksk, uint32_t num_blocks) {
auto mem = (int_overflowing_sub_memory<uint64_t> *)mem_ptr;
switch (mem->params.polynomial_size) {
case 512:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<512>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
case 1024:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<1024>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
case 2048:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<2048>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
case 4096:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<4096>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
case 8192:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<8192>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
case 16384:
host_integer_overflowing_sub_kb<uint64_t, AmortizedDegree<16384>>(
stream, static_cast<uint64_t *>(radix_lwe_out),
static_cast<uint64_t *>(radix_lwe_overflowed),
static_cast<uint64_t *>(radix_lwe_left),
static_cast<uint64_t *>(radix_lwe_right), bsk,
static_cast<uint64_t *>(ksk), mem, num_blocks);
break;
default:
PANIC("Cuda error (integer overflowing sub): unsupported polynomial size. "
"Only N = 512, 1024, 2048, 4096, 8192, 16384 is supported")
}
}
void cleanup_cuda_integer_radix_overflowing_sub(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_overflowing_sub_memory<uint64_t> *mem_ptr =
(int_overflowing_sub_memory<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}

View File

@@ -6,9 +6,20 @@
#include <cuda_runtime.h>
#endif
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.h"
#include "integer/integer.cuh"
#include "linear_algebra.h"
#include "programmable_bootstrap.h"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#include <fstream>
#include <iostream>
#include <omp.h>
#include <sstream>
#include <string>
#include <vector>
template <typename Torus>
__global__ void
@@ -76,4 +87,32 @@ __host__ void host_integer_radix_negation(cuda_stream_t *stream, Torus *output,
check_cuda_error(cudaGetLastError());
}
template <typename Torus>
__host__ void scratch_cuda_integer_overflowing_sub_kb(
cuda_stream_t *stream, int_overflowing_sub_memory<Torus> **mem_ptr,
uint32_t num_blocks, int_radix_params params, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_overflowing_sub_memory<Torus>(stream, params, num_blocks,
allocate_gpu_memory);
}
template <typename Torus, class params>
__host__ void host_integer_overflowing_sub_kb(
cuda_stream_t *stream, Torus *radix_lwe_out, Torus *radix_lwe_overflowed,
Torus *radix_lwe_left, Torus *radix_lwe_right, void *bsk, uint64_t *ksk,
int_overflowing_sub_memory<uint64_t> *mem_ptr, uint32_t num_blocks) {
auto radix_params = mem_ptr->params;
host_unchecked_sub_with_correcting_term(
stream, radix_lwe_out, radix_lwe_left, radix_lwe_right,
radix_params.big_lwe_dimension, num_blocks, radix_params.message_modulus,
radix_params.carry_modulus, radix_params.message_modulus - 1);
host_propagate_single_sub_borrow<Torus>(
stream, radix_lwe_overflowed, radix_lwe_out, mem_ptr->borrow_prop_mem,
bsk, ksk, num_blocks);
}
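// In short: the blocks are subtracted with a correcting term, then
// host_propagate_single_sub_borrow fixes up the blocks and writes the
// borrow-out (the overflow flag) into radix_lwe_overflowed.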
#endif

View File

@@ -5,7 +5,7 @@
#include <omp.h>
template <typename Torus>
__host__ void host_integer_radix_scalar_difference_check_kb(
__host__ void integer_radix_unsigned_scalar_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> sign_handler_f, void *bsk, Torus *ksk,
@@ -22,7 +22,6 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
auto diff_buffer = mem_ptr->diff_buffer;
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
// Reducing the signs is the bottleneck of the comparison algorithms,
// however in the scalar case there is an improvement:
@@ -65,12 +64,6 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
integer_radix_apply_univariate_lookup_table_kb<Torus>(
stream, lwe_array_out, mem_ptr->tmp_lwe_array_out, bsk, ksk, 1, lut);
// The result will be in the first two blocks. Everything else is
// garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (total_num_radix_blocks - 1),
stream);
} else if (total_num_scalar_blocks < total_num_radix_blocks) {
// We have to handle both part of the work described above
@@ -78,7 +71,6 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
uint32_t num_msb_radix_blocks =
total_num_radix_blocks - num_lsb_radix_blocks;
auto lsb = lwe_array_in;
auto msb = lwe_array_in + num_lsb_radix_blocks * big_lwe_size;
auto lwe_array_lsb_out = mem_ptr->tmp_lwe_array_out;
@@ -121,7 +113,7 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
// final sign
tree_sign_reduction(lsb_stream, lwe_array_lsb_out, comparisons,
mem_ptr->diff_buffer->tree_buffer,
mem_ptr->cleaning_lut_f, bsk, ksk,
mem_ptr->identity_lut_f, bsk, ksk,
num_lsb_radix_blocks);
}
#pragma omp section
@@ -156,10 +148,6 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
stream, lwe_array_out, lwe_array_lsb_out, lwe_array_msb_out, bsk, ksk,
1, lut);
// The result will be in the first block. Everything else is garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes,
stream);
} else {
// We only have to do the regular comparison
// And not the part where we compare most significant blocks with zeros
@@ -167,8 +155,6 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
uint32_t num_lsb_radix_blocks = total_num_radix_blocks;
uint32_t num_scalar_blocks = total_num_scalar_blocks;
auto lsb = lwe_array_in;
Torus *lhs = diff_buffer->tmp_packed_left;
Torus *rhs = diff_buffer->tmp_packed_right;
@@ -195,11 +181,344 @@ __host__ void host_integer_radix_scalar_difference_check_kb(
tree_sign_reduction(stream, lwe_array_out, comparisons,
mem_ptr->diff_buffer->tree_buffer, sign_handler_f, bsk,
ksk, num_lsb_radix_blocks);
}
}
// The result will be in the first block. Everything else is garbage.
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
(total_num_radix_blocks - 1) * big_lwe_size_bytes,
stream);
template <typename Torus>
__host__ void integer_radix_signed_scalar_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> sign_handler_f, void *bsk, Torus *ksk,
uint32_t total_num_radix_blocks, uint32_t total_num_scalar_blocks) {
cudaSetDevice(stream->gpu_index);
auto params = mem_ptr->params;
auto big_lwe_dimension = params.big_lwe_dimension;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
auto carry_modulus = params.carry_modulus;
auto diff_buffer = mem_ptr->diff_buffer;
size_t big_lwe_size = big_lwe_dimension + 1;
// Reducing the signs is the bottleneck of the comparison algorithms,
// however in the scalar case there is an improvement:
//
// The idea is to reduce the number of sign blocks we have to
// reduce. We can do that by splitting the comparison problem in two parts.
//
// - One part where we compute the sign blocks between the scalar and just
//   enough blocks from the ciphertext to represent the scalar value
//
// - The other part is to compare the ciphertext blocks not considered for
//   the sign computation with zero, and create a single sign block from that.
//
// The smaller the scalar value is compared to the number of bits encrypted
// in the ciphertext, the more comparisons with zero we have to do, and the
// fewer sign blocks we will have to reduce.
//
// This creates a speedup, as comparing a batch of blocks with 0 is faster.
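// e.g. a 16-block ciphertext compared against a scalar that fits in 4 blocks
// (illustrative counts): 4 blocks feed the sign computation while the other
// 12 are only compared against zero.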
if (total_num_scalar_blocks == 0) {
// We only have to compare blocks with zero
// means scalar is zero
Torus *are_all_msb_zeros = mem_ptr->tmp_lwe_array_out;
host_compare_with_zero_equality(stream, are_all_msb_zeros, lwe_array_in,
mem_ptr, bsk, ksk, total_num_radix_blocks,
mem_ptr->is_zero_lut);
Torus *sign_block =
lwe_array_in + (total_num_radix_blocks - 1) * big_lwe_size;
auto sign_bit_pos = (int)std::log2(message_modulus) - 1;
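// e.g. with message_modulus = 4 (2-bit messages, illustrative value) the
// sign bit sits at position log2(4) - 1 = 1, the top bit of the message.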
auto scalar_last_leaf_with_respect_to_zero_lut_f =
[sign_handler_f, sign_bit_pos,
message_modulus](Torus sign_block) -> Torus {
sign_block %= message_modulus;
int sign_bit_is_set = (sign_block >> sign_bit_pos) == 1;
CMP_ORDERING sign_block_ordering;
if (sign_bit_is_set) {
sign_block_ordering = CMP_ORDERING::IS_INFERIOR;
} else if (sign_block != 0) {
sign_block_ordering = CMP_ORDERING::IS_SUPERIOR;
} else {
sign_block_ordering = CMP_ORDERING::IS_EQUAL;
}
return sign_block_ordering;
};
auto block_selector_f = mem_ptr->diff_buffer->tree_buffer->block_selector_f;
auto scalar_bivariate_last_leaf_lut_f =
[scalar_last_leaf_with_respect_to_zero_lut_f, sign_handler_f,
block_selector_f](Torus are_all_zeros, Torus sign_block) -> Torus {
// "re-code" are_all_zeros as an ordering value
if (are_all_zeros == 1) {
are_all_zeros = CMP_ORDERING::IS_EQUAL;
} else {
are_all_zeros = CMP_ORDERING::IS_SUPERIOR;
};
return sign_handler_f(block_selector_f(
scalar_last_leaf_with_respect_to_zero_lut_f(sign_block),
are_all_zeros));
};
auto lut = mem_ptr->diff_buffer->tree_buffer->tree_last_leaf_scalar_lut;
generate_device_accumulator_bivariate<Torus>(
stream, lut->lut, glwe_dimension, polynomial_size, message_modulus,
carry_modulus, scalar_bivariate_last_leaf_lut_f);
integer_radix_apply_bivariate_lookup_table_kb(
stream, lwe_array_out, are_all_msb_zeros, sign_block, bsk, ksk, 1, lut);
} else if (total_num_scalar_blocks < total_num_radix_blocks) {
// We have to handle both part of the work described above
// And the sign bit is located in the most_significant_blocks
uint32_t num_lsb_radix_blocks = total_num_scalar_blocks;
uint32_t num_msb_radix_blocks =
total_num_radix_blocks - num_lsb_radix_blocks;
auto msb = lwe_array_in + num_lsb_radix_blocks * big_lwe_size;
auto lwe_array_lsb_out = mem_ptr->tmp_lwe_array_out;
auto lwe_array_msb_out = lwe_array_lsb_out + big_lwe_size;
cuda_synchronize_stream(stream);
auto lsb_stream = mem_ptr->lsb_stream;
auto msb_stream = mem_ptr->msb_stream;
#pragma omp parallel sections
{
// Both sections may be executed in parallel
#pragma omp section
{
//////////////
// lsb
Torus *lhs = diff_buffer->tmp_packed_left;
Torus *rhs = diff_buffer->tmp_packed_right;
pack_blocks(lsb_stream, lhs, lwe_array_in, big_lwe_dimension,
num_lsb_radix_blocks, message_modulus);
pack_blocks(lsb_stream, rhs, scalar_blocks, 0, total_num_scalar_blocks,
message_modulus);
// From this point on we have half the number of blocks
num_lsb_radix_blocks /= 2;
num_lsb_radix_blocks += (total_num_scalar_blocks % 2);
// comparisons will be assigned
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto comparisons = mem_ptr->tmp_block_comparisons;
scalar_compare_radix_blocks_kb(lsb_stream, comparisons, lhs, rhs,
mem_ptr, bsk, ksk, num_lsb_radix_blocks);
// Reduce a vector of radix blocks, each encrypting a sign
// (inferior, equal, superior), to one single radix block containing the
// final sign
tree_sign_reduction(lsb_stream, lwe_array_lsb_out, comparisons,
mem_ptr->diff_buffer->tree_buffer,
mem_ptr->identity_lut_f, bsk, ksk,
num_lsb_radix_blocks);
}
#pragma omp section
{
//////////////
// msb
// We remove the last block (which is the sign)
Torus *are_all_msb_zeros = lwe_array_msb_out;
host_compare_with_zero_equality(msb_stream, are_all_msb_zeros, msb,
mem_ptr, bsk, ksk, num_msb_radix_blocks,
mem_ptr->is_zero_lut);
auto sign_bit_pos = (int)log2(message_modulus) - 1;
auto lut_f = [mem_ptr, sign_bit_pos](Torus sign_block,
Torus msb_are_zeros) {
bool sign_bit_is_set = (sign_block >> sign_bit_pos) == 1;
CMP_ORDERING sign_block_ordering;
if (sign_bit_is_set) {
sign_block_ordering = CMP_ORDERING::IS_INFERIOR;
} else if (sign_block != 0) {
sign_block_ordering = CMP_ORDERING::IS_SUPERIOR;
} else {
sign_block_ordering = CMP_ORDERING::IS_EQUAL;
}
CMP_ORDERING msb_ordering;
if (msb_are_zeros == 1)
msb_ordering = CMP_ORDERING::IS_EQUAL;
else
msb_ordering = CMP_ORDERING::IS_SUPERIOR;
return mem_ptr->diff_buffer->tree_buffer->block_selector_f(
sign_block_ordering, msb_ordering);
};
auto signed_msb_lut = mem_ptr->signed_msb_lut;
generate_device_accumulator_bivariate<Torus>(
msb_stream, signed_msb_lut->lut, params.glwe_dimension,
params.polynomial_size, params.message_modulus,
params.carry_modulus, lut_f);
Torus *sign_block = msb + (num_msb_radix_blocks - 1) * big_lwe_size;
integer_radix_apply_bivariate_lookup_table_kb(
msb_stream, lwe_array_msb_out, sign_block, are_all_msb_zeros, bsk,
ksk, 1, signed_msb_lut);
}
}
cuda_synchronize_stream(lsb_stream);
cuda_synchronize_stream(msb_stream);
//////////////
// Reduce the two blocks into one final
reduce_signs(stream, lwe_array_out, lwe_array_lsb_out, mem_ptr,
sign_handler_f, bsk, ksk, 2);
} else {
// We only have to do the regular comparison,
// not the part where we compare the most significant blocks with zero
// (total_num_radix_blocks == total_num_scalar_blocks)
uint32_t num_lsb_radix_blocks = total_num_radix_blocks;
cuda_synchronize_stream(stream);
auto lsb_stream = mem_ptr->lsb_stream;
auto msb_stream = mem_ptr->msb_stream;
auto lwe_array_ct_out = mem_ptr->tmp_lwe_array_out;
auto lwe_array_sign_out =
lwe_array_ct_out + (num_lsb_radix_blocks / 2) * big_lwe_size;
#pragma omp parallel sections
{
// Both sections may be executed in parallel
#pragma omp section
{
Torus *lhs = diff_buffer->tmp_packed_left;
Torus *rhs = diff_buffer->tmp_packed_right;
pack_blocks(lsb_stream, lhs, lwe_array_in, big_lwe_dimension,
num_lsb_radix_blocks - 1, message_modulus);
pack_blocks(lsb_stream, rhs, scalar_blocks, 0, num_lsb_radix_blocks - 1,
message_modulus);
// From this point on we have half the number of blocks
num_lsb_radix_blocks /= 2;
// comparisons will be assigned
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
scalar_compare_radix_blocks_kb(lsb_stream, lwe_array_ct_out, lhs, rhs,
mem_ptr, bsk, ksk, num_lsb_radix_blocks);
}
#pragma omp section
{
Torus *encrypted_sign_block =
lwe_array_in + (total_num_radix_blocks - 1) * big_lwe_size;
Torus *scalar_sign_block =
scalar_blocks + (total_num_scalar_blocks - 1);
auto trivial_sign_block = mem_ptr->tmp_trivial_sign_block;
create_trivial_radix(msb_stream, trivial_sign_block, scalar_sign_block,
big_lwe_dimension, 1, 1, message_modulus,
carry_modulus);
integer_radix_apply_bivariate_lookup_table_kb(
msb_stream, lwe_array_sign_out, encrypted_sign_block,
trivial_sign_block, bsk, ksk, 1, mem_ptr->signed_lut);
}
}
cuda_synchronize_stream(lsb_stream);
cuda_synchronize_stream(msb_stream);
// Reduce a vector of radix blocks, each encrypting a sign
// (inferior, equal, superior), to one single radix block containing the
// final sign
reduce_signs(stream, lwe_array_out, lwe_array_ct_out, mem_ptr,
sign_handler_f, bsk, ksk, num_lsb_radix_blocks + 1);
}
}
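// Editor's sketch (hypothetical helper, not part of the library): a plaintext
// reference model of the ordering encoding used above ("0 if lhs < rhs,
// 1 if lhs == rhs, 2 if lhs > rhs"), handy for sanity checking the
// homomorphic path on clear values.
inline int plain_difference_check(long long lhs, long long rhs) {
  if (lhs < rhs)
    return 0; // CMP_ORDERING::IS_INFERIOR
  if (lhs > rhs)
    return 2; // CMP_ORDERING::IS_SUPERIOR
  return 1;   // CMP_ORDERING::IS_EQUAL
}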
template <typename Torus>
__host__ void integer_radix_signed_scalar_maxmin_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t total_num_radix_blocks,
uint32_t total_num_scalar_blocks) {
cudaSetDevice(stream->gpu_index);
auto params = mem_ptr->params;
// Calculates the difference sign between the ciphertext and the scalar
// - 0 if lhs < rhs
// - 1 if lhs == rhs
// - 2 if lhs > rhs
auto sign = mem_ptr->tmp_lwe_array_out;
integer_radix_signed_scalar_difference_check_kb(
stream, sign, lwe_array_in, scalar_blocks, mem_ptr,
mem_ptr->identity_lut_f, bsk, ksk, total_num_radix_blocks,
total_num_scalar_blocks);
// There is no optimized CMUX for scalars, so we convert to a trivial
// ciphertext
auto lwe_array_left = lwe_array_in;
auto lwe_array_right = mem_ptr->tmp_block_comparisons;
create_trivial_radix(stream, lwe_array_right, scalar_blocks,
params.big_lwe_dimension, total_num_radix_blocks,
total_num_scalar_blocks, params.message_modulus,
params.carry_modulus);
// Selector
// CMUX for Max or Min
host_integer_radix_cmux_kb(stream, lwe_array_out, sign, lwe_array_left,
lwe_array_right, mem_ptr->cmux_buffer, bsk, ksk,
total_num_radix_blocks);
}
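// Editor's sketch (hypothetical, mirrors the structure above): max/min as a
// selection on the difference sign, matching the difference-check + CMUX
// steps of integer_radix_signed_scalar_maxmin_kb on clear values.
inline long long plain_maxmin(long long lhs, long long rhs, bool is_max) {
  // 0: lhs < rhs, 1: lhs == rhs, 2: lhs > rhs (per the comments above)
  int sign = (lhs < rhs) ? 0 : (lhs == rhs) ? 1 : 2;
  bool take_left = is_max ? (sign == 2) : (sign == 0);
  return take_left ? lhs : rhs;
}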
template <typename Torus>
__host__ void host_integer_radix_scalar_difference_check_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr,
std::function<Torus(Torus)> sign_handler_f, void *bsk, Torus *ksk,
uint32_t total_num_radix_blocks, uint32_t total_num_scalar_blocks) {
if (mem_ptr->is_signed) {
// is signed and scalar is positive
integer_radix_signed_scalar_difference_check_kb(
stream, lwe_array_out, lwe_array_in, scalar_blocks, mem_ptr,
sign_handler_f, bsk, ksk, total_num_radix_blocks,
total_num_scalar_blocks);
} else {
integer_radix_unsigned_scalar_difference_check_kb(
stream, lwe_array_out, lwe_array_in, scalar_blocks, mem_ptr,
sign_handler_f, bsk, ksk, total_num_radix_blocks,
total_num_scalar_blocks);
}
}
template <typename Torus>
__host__ void host_integer_radix_signed_scalar_maxmin_kb(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_array_in,
Torus *scalar_blocks, int_comparison_buffer<Torus> *mem_ptr, void *bsk,
Torus *ksk, uint32_t total_num_radix_blocks,
uint32_t total_num_scalar_blocks) {
if (mem_ptr->is_signed) {
// is signed and scalar is positive
integer_radix_signed_scalar_maxmin_kb(
stream, lwe_array_out, lwe_array_in, scalar_blocks, mem_ptr, bsk, ksk,
total_num_radix_blocks, total_num_scalar_blocks);
} else {
integer_radix_unsigned_scalar_maxmin_kb(
stream, lwe_array_out, lwe_array_in, scalar_blocks, mem_ptr, bsk, ksk,
total_num_radix_blocks, total_num_scalar_blocks);
}
}
@@ -270,7 +589,7 @@ __host__ void host_integer_radix_scalar_maxmin_kb(
auto sign = mem_ptr->tmp_lwe_array_out;
host_integer_radix_scalar_difference_check_kb(
stream, sign, lwe_array_in, scalar_blocks, mem_ptr,
mem_ptr->cleaning_lut_f, bsk, ksk, total_num_radix_blocks,
mem_ptr->identity_lut_f, bsk, ksk, total_num_radix_blocks,
total_num_scalar_blocks);
// There is no optimized CMUX for scalars, so we convert to a trivial
@@ -303,7 +622,6 @@ __host__ void host_integer_radix_scalar_equality_check_kb(
auto eq_buffer = mem_ptr->eq_buffer;
size_t big_lwe_size = big_lwe_dimension + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
auto scalar_comparison_luts = eq_buffer->scalar_comparison_luts;
@@ -393,11 +711,5 @@ __host__ void host_integer_radix_scalar_equality_check_kb(
default:
PANIC("Cuda error: integer operation not supported")
}
// The result will be in the first two blocks. Everything else is
// garbage.
if (num_radix_blocks > 1)
cuda_memset_async(lwe_array_out + big_lwe_size, 0,
big_lwe_size_bytes * (num_radix_blocks - 1), stream);
}
#endif


@@ -0,0 +1,89 @@
#include "integer/scalar_mul.cuh"
void scratch_cuda_integer_scalar_mul_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t lwe_dimension, uint32_t ks_level,
uint32_t ks_base_log, uint32_t pbs_level, uint32_t pbs_base_log,
uint32_t grouping_factor, uint32_t num_blocks, uint32_t message_modulus,
uint32_t carry_modulus, PBS_TYPE pbs_type, bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
glwe_dimension * polynomial_size, lwe_dimension,
ks_level, ks_base_log, pbs_level, pbs_base_log,
grouping_factor, message_modulus, carry_modulus);
scratch_cuda_integer_radix_scalar_mul_kb<uint64_t>(
stream, (int_scalar_mul_buffer<uint64_t> **)mem_ptr, num_blocks, params,
allocate_gpu_memory);
}
void cuda_scalar_multiplication_integer_radix_ciphertext_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint64_t *decomposed_scalar,
uint64_t *has_at_least_one_set, int8_t *mem, void *bsk, void *ksk,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t message_modulus,
uint32_t num_blocks, uint32_t num_scalars) {
switch (polynomial_size) {
case 512:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<512>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
case 1024:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<1024>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
case 2048:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<2048>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
case 4096:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<4096>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
case 8192:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<8192>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
case 16384:
host_integer_scalar_mul_radix<uint64_t, AmortizedDegree<16384>>(
stream, static_cast<uint64_t *>(lwe_array), decomposed_scalar,
has_at_least_one_set,
reinterpret_cast<int_scalar_mul_buffer<uint64_t> *>(mem), bsk,
static_cast<uint64_t *>(ksk), lwe_dimension, message_modulus,
num_blocks, num_scalars);
break;
default:
PANIC("Cuda error (scalar multiplication): unsupported polynomial size. "
"Only N = 512, 1024, 2048, 4096, 8192, 16384 are supported.")
}
}
void cleanup_cuda_integer_radix_scalar_mul(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
cudaSetDevice(stream->gpu_index);
int_scalar_mul_buffer<uint64_t> *mem_ptr =
(int_scalar_mul_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}
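// Usage sketch (editor's note; the "..." stands for the parameter lists
// declared above, with caller-provided values): these entry points follow a
// scratch / in-place operate / cleanup lifecycle on a single stream:
//   int8_t *mem = nullptr;
//   scratch_cuda_integer_scalar_mul_kb_64(stream, &mem, ...);
//   cuda_scalar_multiplication_integer_radix_ciphertext_64_inplace(
//       stream, lwe_array, decomposed_scalar, has_at_least_one_set, mem,
//       bsk, ksk, ...);
//   cleanup_cuda_integer_radix_scalar_mul(stream, &mem);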


@@ -0,0 +1,136 @@
#ifndef CUDA_INTEGER_SCALAR_MUL_CUH
#define CUDA_INTEGER_SCALAR_MUL_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "device.h"
#include "integer.h"
#include "multiplication.cuh"
#include "scalar_shifts.cuh"
#include "utils/kernel_dimensions.cuh"
#include <stdio.h>
template <typename T>
__global__ void device_small_scalar_radix_multiplication(T *output_lwe_array,
T *input_lwe_array,
T scalar,
uint32_t lwe_dimension,
uint32_t num_blocks) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int lwe_size = lwe_dimension + 1;
if (index < num_blocks * lwe_size) {
// Here we take advantage of the wrapping behaviour of uint
output_lwe_array[index] = input_lwe_array[index] * scalar;
}
}
template <typename T>
__host__ void scratch_cuda_integer_radix_scalar_mul_kb(
cuda_stream_t *stream, int_scalar_mul_buffer<T> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
size_t sm_size = (params.big_lwe_dimension + 1) * sizeof(T);
check_cuda_error(cudaFuncSetAttribute(
tree_add_chunks<T>, cudaFuncAttributeMaxDynamicSharedMemorySize,
sm_size));
cudaFuncSetCacheConfig(tree_add_chunks<T>, cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
*mem_ptr = new int_scalar_mul_buffer<T>(stream, params, num_radix_blocks,
allocate_gpu_memory);
}
template <typename T, class params>
__host__ void host_integer_scalar_mul_radix(
cuda_stream_t *stream, T *lwe_array, T *decomposed_scalar,
T *has_at_least_one_set, int_scalar_mul_buffer<T> *mem, void *bsk, T *ksk,
uint32_t input_lwe_dimension, uint32_t message_modulus,
uint32_t num_radix_blocks, uint32_t num_scalars) {
if (num_radix_blocks == 0 || num_scalars == 0)
return;
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
uint32_t lwe_size = input_lwe_dimension + 1;
uint32_t lwe_size_bytes = lwe_size * sizeof(T);
uint32_t msg_bits = (uint32_t)std::log2(message_modulus);
uint32_t num_ciphertext_bits = msg_bits * num_radix_blocks;
T *preshifted_buffer = mem->preshifted_buffer;
T *all_shifted_buffer = mem->all_shifted_buffer;
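// Worked example (editor's note): with 2-bit blocks (msg_bits = 2) and
// scalar = 6 = 0b110, bits 1 and 2 are set. Bit 1 uses the copy preshifted
// by 1 bit (1 % 2) rotated by 0 blocks (1 / 2); bit 2 uses the copy
// preshifted by 0 bits (2 % 2) rotated by 1 block (2 / 2). Summing the two
// selected terms in the shift-and-add below yields 6 * ct.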
for (size_t shift_amount = 0; shift_amount < msg_bits; shift_amount++) {
T *ptr = preshifted_buffer + shift_amount * lwe_size * num_radix_blocks;
if (has_at_least_one_set[shift_amount] == 1) {
cuda_memcpy_async_gpu_to_gpu(ptr, lwe_array,
lwe_size_bytes * num_radix_blocks, stream);
host_integer_radix_logical_scalar_shift_kb_inplace(
stream, ptr, shift_amount, mem->logical_scalar_shift_buffer, bsk, ksk,
num_radix_blocks);
} else {
// otherwise set the blocks to a trivial encryption of 0
cuda_memset_async(ptr, 0, num_radix_blocks * lwe_size_bytes, stream);
}
}
size_t j = 0;
for (size_t i = 0; i < min(num_scalars, num_ciphertext_bits); i++) {
if (decomposed_scalar[i] == 1) {
// Perform a block shift
T *preshifted_radix_ct =
preshifted_buffer + (i % msg_bits) * num_radix_blocks * lwe_size;
T *block_shift_buffer =
all_shifted_buffer + j * num_radix_blocks * lwe_size;
radix_blocks_rotate_right<<<num_radix_blocks, 256, 0, stream->stream>>>(
block_shift_buffer, preshifted_radix_ct, i / msg_bits,
num_radix_blocks, lwe_size);
// set the blocks vacated by the rotation to a trivial encryption of 0
cuda_memset_async(block_shift_buffer, 0, (i / msg_bits) * lwe_size_bytes,
stream);
j++;
}
}
if (j == 0) {
// lwe array = 0
cuda_memset_async(lwe_array, 0, num_radix_blocks * lwe_size_bytes, stream);
} else {
int terms_degree[j * num_radix_blocks];
for (int i = 0; i < j * num_radix_blocks; i++) {
terms_degree[i] = message_modulus - 1;
}
host_integer_sum_ciphertexts_vec_kb<T, params>(
stream, lwe_array, all_shifted_buffer, terms_degree, bsk, ksk,
mem->sum_ciphertexts_vec_mem, num_radix_blocks, j);
}
}
// Small scalar_mul is used in shift/rotate
template <typename T>
__host__ void host_integer_small_scalar_mul_radix(
cuda_stream_t *stream, T *output_lwe_array, T *input_lwe_array, T scalar,
uint32_t input_lwe_dimension, uint32_t input_lwe_ciphertext_count) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
device_small_scalar_radix_multiplication<<<grid, thds, 0, stream->stream>>>(
output_lwe_array, input_lwe_array, scalar, input_lwe_dimension,
input_lwe_ciphertext_count);
check_cuda_error(cudaGetLastError());
}
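// Example (editor's note): calling host_integer_small_scalar_mul_radix with
// scalar = 2 doubles every block in place, i.e. a 1-bit left shift of each
// block, which is how the shift/rotate code recomposes blocks from their
// extracted bits.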
#endif


@@ -6,7 +6,8 @@ void scratch_cuda_integer_radix_scalar_rotate_kb_64(
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_TYPE shift_type, bool allocate_gpu_memory) {
PBS_TYPE pbs_type, SHIFT_OR_ROTATE_TYPE shift_type,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
@@ -14,8 +15,8 @@ void scratch_cuda_integer_radix_scalar_rotate_kb_64(
message_modulus, carry_modulus);
scratch_cuda_integer_radix_scalar_rotate_kb<uint64_t>(
stream, (int_shift_buffer<uint64_t> **)mem_ptr, num_blocks, params,
shift_type, allocate_gpu_memory);
stream, (int_logical_scalar_shift_buffer<uint64_t> **)mem_ptr, num_blocks,
params, shift_type, allocate_gpu_memory);
}
void cuda_integer_radix_scalar_rotate_kb_64_inplace(cuda_stream_t *stream,
@@ -26,15 +27,15 @@ void cuda_integer_radix_scalar_rotate_kb_64_inplace(cuda_stream_t *stream,
host_integer_radix_scalar_rotate_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array), n,
(int_shift_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
num_blocks);
(int_logical_scalar_shift_buffer<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_integer_radix_scalar_rotate(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_shift_buffer<uint64_t> *mem_ptr =
(int_shift_buffer<uint64_t> *)(*mem_ptr_void);
int_logical_scalar_shift_buffer<uint64_t> *mem_ptr =
(int_logical_scalar_shift_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}


@@ -5,40 +5,28 @@
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#ifndef CUDA_INTEGER_SHIFT_OPS_CUH
#define CUDA_INTEGER_SHIFT_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "pbs/programmable_bootstrap_classic.cuh"
#include "pbs/programmable_bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
template <typename Torus>
__host__ void scratch_cuda_integer_radix_scalar_rotate_kb(
cuda_stream_t *stream, int_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, SHIFT_TYPE shift_type,
bool allocate_gpu_memory) {
cuda_stream_t *stream, int_logical_scalar_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
SHIFT_OR_ROTATE_TYPE shift_type, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_shift_buffer<Torus>(stream, shift_type, params,
num_radix_blocks, allocate_gpu_memory);
*mem_ptr = new int_logical_scalar_shift_buffer<Torus>(
stream, shift_type, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_scalar_rotate_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t n,
int_shift_buffer<Torus> *mem, void *bsk, Torus *ksk, uint32_t num_blocks) {
int_logical_scalar_shift_buffer<Torus> *mem, void *bsk, Torus *ksk,
uint32_t num_blocks) {
cudaSetDevice(stream->gpu_index);
auto params = mem->params;
@@ -111,6 +99,4 @@ __host__ void host_integer_radix_scalar_rotate_kb_inplace(
}
}
#endif // CUDA_SCALAR_OPS_CUH
#endif // CUDA_INTEGER_SCALAR_ROTATE_OPS_CUH


@@ -1,38 +1,90 @@
#include "scalar_shifts.cuh"
void scratch_cuda_integer_radix_scalar_shift_kb_64(
void scratch_cuda_integer_radix_logical_scalar_shift_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_TYPE shift_type, bool allocate_gpu_memory) {
PBS_TYPE pbs_type, SHIFT_OR_ROTATE_TYPE shift_type,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_scalar_shift_kb<uint64_t>(
stream, (int_shift_buffer<uint64_t> **)mem_ptr, num_blocks, params,
shift_type, allocate_gpu_memory);
scratch_cuda_integer_radix_logical_scalar_shift_kb<uint64_t>(
stream, (int_logical_scalar_shift_buffer<uint64_t> **)mem_ptr, num_blocks,
params, shift_type, allocate_gpu_memory);
}
void cuda_integer_radix_scalar_shift_kb_64_inplace(
/// The logical scalar shift is the one used for unsigned integers, and
/// for the left scalar shift. It consists of a rotation, followed by
/// the application of a PBS onto the rotated blocks up to num_blocks -
/// rotations - 1. The remaining blocks are padded with zeros.
void cuda_integer_radix_logical_scalar_shift_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint32_t shift, int8_t *mem_ptr,
void *bsk, void *ksk, uint32_t num_blocks) {
host_integer_radix_scalar_shift_kb_inplace<uint64_t>(
host_integer_radix_logical_scalar_shift_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array), shift,
(int_shift_buffer<uint64_t> *)mem_ptr, bsk, static_cast<uint64_t *>(ksk),
num_blocks);
(int_logical_scalar_shift_buffer<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_integer_radix_scalar_shift(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
void scratch_cuda_integer_radix_arithmetic_scalar_shift_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_OR_ROTATE_TYPE shift_type,
bool allocate_gpu_memory) {
int_shift_buffer<uint64_t> *mem_ptr =
(int_shift_buffer<uint64_t> *)(*mem_ptr_void);
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_arithmetic_scalar_shift_kb<uint64_t>(
stream, (int_arithmetic_scalar_shift_buffer<uint64_t> **)mem_ptr,
num_blocks, params, shift_type, allocate_gpu_memory);
}
/// The arithmetic scalar shift is the one used for the signed right shift.
/// It consists of a rotation, followed by the application of a PBS onto the
/// rotated blocks up to num_blocks - rotations - 2. The last rotated block
/// is the sign block: one PBS is applied to it for the shift itself, and a
/// second PBS computes the padding block, which is copied onto all remaining
/// blocks instead of padding with zeros as would be done in the logical
/// shift.
void cuda_integer_radix_arithmetic_scalar_shift_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, uint32_t shift, int8_t *mem_ptr,
void *bsk, void *ksk, uint32_t num_blocks) {
host_integer_radix_arithmetic_scalar_shift_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array), shift,
(int_arithmetic_scalar_shift_buffer<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_integer_radix_logical_scalar_shift(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
cudaSetDevice(stream->gpu_index);
int_logical_scalar_shift_buffer<uint64_t> *mem_ptr =
(int_logical_scalar_shift_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}
void cleanup_cuda_integer_radix_arithmetic_scalar_shift(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
cudaSetDevice(stream->gpu_index);
int_arithmetic_scalar_shift_buffer<uint64_t> *mem_ptr =
(int_arithmetic_scalar_shift_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}


@@ -1,31 +1,33 @@
#ifndef CUDA_INTEGER_SHIFT_OPS_CUH
#define CUDA_INTEGER_SHIFT_OPS_CUH
#ifndef CUDA_INTEGER_SCALAR_SHIFT_OPS_CUH
#define CUDA_INTEGER_SCALAR_SHIFT_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/bootstrap_low_latency.cuh"
#include "pbs/bootstrap_multibit.cuh"
#include "pbs/programmable_bootstrap_classic.cuh"
#include "pbs/programmable_bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
#include <omp.h>
template <typename Torus>
__host__ void scratch_cuda_integer_radix_scalar_shift_kb(
cuda_stream_t *stream, int_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params, SHIFT_TYPE shift_type,
bool allocate_gpu_memory) {
__host__ void scratch_cuda_integer_radix_logical_scalar_shift_kb(
cuda_stream_t *stream, int_logical_scalar_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
SHIFT_OR_ROTATE_TYPE shift_type, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_shift_buffer<Torus>(stream, shift_type, params,
num_radix_blocks, allocate_gpu_memory);
*mem_ptr = new int_logical_scalar_shift_buffer<Torus>(
stream, shift_type, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_scalar_shift_kb_inplace(
__host__ void host_integer_radix_logical_scalar_shift_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t shift,
int_shift_buffer<Torus> *mem, void *bsk, Torus *ksk, uint32_t num_blocks) {
int_logical_scalar_shift_buffer<Torus> *mem, void *bsk, Torus *ksk,
uint32_t num_blocks) {
cudaSetDevice(stream->gpu_index);
auto params = mem->params;
@@ -107,4 +109,128 @@ __host__ void host_integer_radix_scalar_shift_kb_inplace(
}
}
template <typename Torus>
__host__ void scratch_cuda_integer_radix_arithmetic_scalar_shift_kb(
cuda_stream_t *stream, int_arithmetic_scalar_shift_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
SHIFT_OR_ROTATE_TYPE shift_type, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_arithmetic_scalar_shift_buffer<Torus>(
stream, shift_type, params, num_radix_blocks, allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_arithmetic_scalar_shift_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, uint32_t shift,
int_arithmetic_scalar_shift_buffer<Torus> *mem, void *bsk, Torus *ksk,
uint32_t num_blocks) {
cudaSetDevice(stream->gpu_index);
auto params = mem->params;
auto glwe_dimension = params.glwe_dimension;
auto polynomial_size = params.polynomial_size;
auto message_modulus = params.message_modulus;
size_t big_lwe_size = glwe_dimension * polynomial_size + 1;
size_t big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
size_t num_bits_in_block = (size_t)log2(message_modulus);
size_t total_num_bits = num_bits_in_block * num_blocks;
shift = shift % total_num_bits;
if (shift == 0) {
return;
}
size_t rotations = std::min(shift / num_bits_in_block, (size_t)num_blocks);
size_t shift_within_block = shift % num_bits_in_block;
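// Example (editor's note): with 2-bit message blocks and num_blocks = 4
// (total_num_bits = 8), shift = 5 splits into rotations = 2 whole blocks
// and shift_within_block = 1 remaining bit.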
Torus *rotated_buffer = mem->tmp_rotated;
Torus *padding_block = &rotated_buffer[num_blocks * big_lwe_size];
Torus *last_block_copy = &padding_block[big_lwe_size];
auto lut_univariate_shift_last_block =
mem->lut_buffers_univariate[shift_within_block - 1];
auto lut_univariate_padding_block =
mem->lut_buffers_univariate[num_bits_in_block - 1];
auto lut_bivariate = mem->lut_buffers_bivariate[shift_within_block - 1];
if (mem->shift_type == RIGHT_SHIFT) {
radix_blocks_rotate_left<<<num_blocks, 256, 0, stream->stream>>>(
rotated_buffer, lwe_array, rotations, num_blocks, big_lwe_size);
cuda_memcpy_async_gpu_to_gpu(lwe_array, rotated_buffer,
num_blocks * big_lwe_size_bytes, stream);
if (num_bits_in_block == 1) {
// if there is only 1 bit in the msg part, it means shift_within_block is
// 0, thus only rotations are required.
// We still need to pad with the value of the sign bit,
// and since a block only has 1 bit of message here
// we can optimize things by not doing the PBS to extract this sign bit
Torus *block_src =
rotated_buffer + (num_blocks - rotations - 1) * big_lwe_size;
Torus *block_dest =
rotated_buffer + (num_blocks - rotations) * big_lwe_size;
for (uint i = 0; i < num_blocks; i++) {
cuda_memcpy_async_gpu_to_gpu(block_dest, block_src, big_lwe_size_bytes,
stream);
block_dest += big_lwe_size;
}
return;
}
// In the arithmetic shift case we have to pad with the value of the sign
// bit. This creates the need for a different shifting lut than in the
// logical shift case. We also need another PBS to create the padding block.
Torus *last_block = lwe_array + (num_blocks - rotations - 1) * big_lwe_size;
cuda_memcpy_async_gpu_to_gpu(last_block_copy,
rotated_buffer + (num_blocks - rotations - 1) *
big_lwe_size,
big_lwe_size_bytes, stream);
auto partial_current_blocks = lwe_array;
auto partial_next_blocks = &rotated_buffer[big_lwe_size];
size_t partial_block_count = num_blocks - rotations;
if (shift_within_block != 0 && rotations != num_blocks) {
integer_radix_apply_bivariate_lookup_table_kb<Torus>(
stream, partial_current_blocks, partial_current_blocks,
partial_next_blocks, bsk, ksk, partial_block_count, lut_bivariate);
}
// Since our CPU threads will be working on different streams we must
// ensure the work in the main stream is completed
stream->synchronize();
#pragma omp parallel sections
{
// All sections may be executed in parallel
#pragma omp section
{
integer_radix_apply_univariate_lookup_table_kb(
mem->local_stream_1, padding_block, last_block_copy, bsk, ksk, 1,
lut_univariate_padding_block);
// Replace blocks 'pulled' from the left with the correct padding block
for (uint i = 0; i < rotations; i++) {
cuda_memcpy_async_gpu_to_gpu(
lwe_array + (num_blocks - rotations + i) * big_lwe_size,
padding_block, big_lwe_size_bytes, mem->local_stream_1);
}
}
#pragma omp section
{
if (shift_within_block != 0 && rotations != num_blocks) {
integer_radix_apply_univariate_lookup_table_kb(
mem->local_stream_2, last_block, last_block_copy, bsk, ksk, 1,
lut_univariate_shift_last_block);
}
}
}
cuda_synchronize_stream(mem->local_stream_1);
cuda_synchronize_stream(mem->local_stream_2);
} else {
PANIC("Cuda error (scalar shift): left scalar shift is never of the "
"arithmetic type")
}
}
#endif // CUDA_INTEGER_SCALAR_SHIFT_OPS_CUH


@@ -0,0 +1,40 @@
#include "shift_and_rotate.cuh"
void scratch_cuda_integer_radix_shift_and_rotate_kb_64(
cuda_stream_t *stream, int8_t **mem_ptr, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t big_lwe_dimension,
uint32_t small_lwe_dimension, uint32_t ks_level, uint32_t ks_base_log,
uint32_t pbs_level, uint32_t pbs_base_log, uint32_t grouping_factor,
uint32_t num_blocks, uint32_t message_modulus, uint32_t carry_modulus,
PBS_TYPE pbs_type, SHIFT_OR_ROTATE_TYPE shift_type, bool is_signed,
bool allocate_gpu_memory) {
int_radix_params params(pbs_type, glwe_dimension, polynomial_size,
big_lwe_dimension, small_lwe_dimension, ks_level,
ks_base_log, pbs_level, pbs_base_log, grouping_factor,
message_modulus, carry_modulus);
scratch_cuda_integer_radix_shift_and_rotate_kb<uint64_t>(
stream, (int_shift_and_rotate_buffer<uint64_t> **)mem_ptr, num_blocks,
params, shift_type, is_signed, allocate_gpu_memory);
}
void cuda_integer_radix_shift_and_rotate_kb_64_inplace(
cuda_stream_t *stream, void *lwe_array, void *lwe_shift, int8_t *mem_ptr,
void *bsk, void *ksk, uint32_t num_blocks) {
host_integer_radix_shift_and_rotate_kb_inplace<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array),
static_cast<uint64_t *>(lwe_shift),
(int_shift_and_rotate_buffer<uint64_t> *)mem_ptr, bsk,
static_cast<uint64_t *>(ksk), num_blocks);
}
void cleanup_cuda_integer_radix_shift_and_rotate(cuda_stream_t *stream,
int8_t **mem_ptr_void) {
int_shift_and_rotate_buffer<uint64_t> *mem_ptr =
(int_shift_and_rotate_buffer<uint64_t> *)(*mem_ptr_void);
mem_ptr->release(stream);
}


@@ -0,0 +1,181 @@
#ifndef CUDA_INTEGER_SHIFT_OPS_CUH
#define CUDA_INTEGER_SHIFT_OPS_CUH
#include "crypto/keyswitch.cuh"
#include "device.h"
#include "integer.cuh"
#include "integer.h"
#include "pbs/programmable_bootstrap_classic.cuh"
#include "pbs/programmable_bootstrap_multibit.cuh"
#include "scalar_mul.cuh"
#include "types/complex/operations.cuh"
#include "utils/helper.cuh"
#include "utils/kernel_dimensions.cuh"
template <typename Torus>
__host__ void scratch_cuda_integer_radix_shift_and_rotate_kb(
cuda_stream_t *stream, int_shift_and_rotate_buffer<Torus> **mem_ptr,
uint32_t num_radix_blocks, int_radix_params params,
SHIFT_OR_ROTATE_TYPE shift_type, bool is_signed, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
*mem_ptr = new int_shift_and_rotate_buffer<Torus>(
stream, shift_type, is_signed, params, num_radix_blocks,
allocate_gpu_memory);
}
template <typename Torus>
__host__ void host_integer_radix_shift_and_rotate_kb_inplace(
cuda_stream_t *stream, Torus *lwe_array, Torus *lwe_shift,
int_shift_and_rotate_buffer<Torus> *mem, void *bsk, Torus *ksk,
uint32_t num_radix_blocks) {
uint32_t bits_per_block = std::log2(mem->params.message_modulus);
uint32_t total_nb_bits = bits_per_block * num_radix_blocks;
auto big_lwe_dimension = mem->params.big_lwe_dimension;
auto big_lwe_size = big_lwe_dimension + 1;
auto big_lwe_size_bytes = big_lwe_size * sizeof(Torus);
assert(total_nb_bits > 0);
// Extract all bits
auto bits = mem->tmp_bits;
extract_n_bits<Torus>(stream, bits, lwe_array, bsk, ksk, num_radix_blocks,
bits_per_block, mem->bit_extract_luts);
// Extract shift bits
auto shift_bits = mem->tmp_shift_bits;
auto is_power_of_two = [](uint32_t n) {
return (n > 0) && ((n & (n - 1)) == 0);
};
// This effectively means that if the block parameters
// give a total_nb_bits that is not a power of two,
// then the behaviour of shifting won't be the same
// for shift >= total_nb_bits compared to when total_nb_bits
// is a power of two, as we will 'capture' more bits in `shift_bits`
uint32_t max_num_bits_that_tell_shift = std::log2(total_nb_bits);
if (!is_power_of_two(total_nb_bits))
max_num_bits_that_tell_shift += 1;
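// Example (editor's note): total_nb_bits = 8 is a power of two, so 3 shift
// bits are read; total_nb_bits = 12 is not, so log2(12) truncates to 3 and
// one extra bit is added, i.e. 4 bits of the encrypted shift are read.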
// Extract the bits and put them at bit index 2 (=> bit number 3)
// so that they are already aligned to the correct position of the cmux input
// and noise growth is reduced
extract_n_bits<Torus>(stream, shift_bits, lwe_shift, bsk, ksk, 1,
max_num_bits_that_tell_shift,
mem->bit_extract_luts_with_offset_2);
// If signed, do an "arithmetic shift" by padding with the sign bit
auto last_bit = bits + (total_nb_bits - 1) * big_lwe_size;
// Apply op
auto rotated_input = mem->tmp_rotated;
auto input_bits_a = mem->tmp_input_bits_a;
auto input_bits_b = mem->tmp_input_bits_b;
auto mux_lut = mem->mux_lut;
auto mux_inputs = mem->tmp_mux_inputs;
cuda_memcpy_async_gpu_to_gpu(input_bits_a, bits,
total_nb_bits * big_lwe_size_bytes, stream);
for (int d = 0; d < max_num_bits_that_tell_shift; d++) {
auto shift_bit = shift_bits + d * big_lwe_size;
cuda_memcpy_async_gpu_to_gpu(input_bits_b, input_bits_a,
total_nb_bits * big_lwe_size_bytes, stream);
auto rotations = 1 << d;
switch (mem->shift_type) {
case LEFT_SHIFT:
radix_blocks_rotate_right<<<total_nb_bits, 256, 0, stream->stream>>>(
rotated_input, input_bits_b, rotations, total_nb_bits, big_lwe_size);
// A left shift always pads the vacated low bits with zeros; the sign-bit
// padding only applies to the signed right shift handled below
cuda_memset_async(rotated_input, 0, rotations * big_lwe_size_bytes,
stream);
break;
case RIGHT_SHIFT:
radix_blocks_rotate_left<<<total_nb_bits, 256, 0, stream->stream>>>(
rotated_input, input_bits_b, rotations, total_nb_bits, big_lwe_size);
if (mem->is_signed)
for (int i = 0; i < rotations; i++)
cuda_memcpy_async_gpu_to_gpu(
rotated_input + (total_nb_bits - rotations + i) * big_lwe_size,
last_bit, big_lwe_size_bytes, stream);
else
cuda_memset_async(rotated_input +
(total_nb_bits - rotations) * big_lwe_size,
0, rotations * big_lwe_size_bytes, stream);
break;
case LEFT_ROTATE:
radix_blocks_rotate_right<<<total_nb_bits, 256, 0, stream->stream>>>(
rotated_input, input_bits_b, rotations, total_nb_bits, big_lwe_size);
break;
case RIGHT_ROTATE:
radix_blocks_rotate_left<<<total_nb_bits, 256, 0, stream->stream>>>(
rotated_input, input_bits_b, rotations, total_nb_bits, big_lwe_size);
break;
default:
PANIC("Unknown operation")
}
// pack bits into one block so that we have
// control_bit|b|a
cuda_memset_async(mux_inputs, 0, total_nb_bits * big_lwe_size_bytes,
stream); // Do we need this?
pack_bivariate_blocks(stream, mux_inputs, mux_lut->lwe_indexes_out,
rotated_input, input_bits_a, mux_lut->lwe_indexes_in,
big_lwe_dimension, 2, total_nb_bits);
// The shift bit is already properly aligned/positioned
for (int i = 0; i < total_nb_bits; i++)
host_addition(stream, mux_inputs + i * big_lwe_size,
mux_inputs + i * big_lwe_size, shift_bit,
mem->params.big_lwe_dimension, 1);
// we have
// control_bit|b|a
integer_radix_apply_univariate_lookup_table_kb(
stream, input_bits_a, mux_inputs, bsk, ksk, total_nb_bits, mux_lut);
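// (Editor's note) Each iteration is one stage of a logarithmic barrel
// shifter: the mux LUT selects (control_bit ? rotated : unrotated) for
// every bit, so after stage d the result accounts for the low d + 1 bits
// of the shift value.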
}
// Initialize the output:
// copy the last (most significant) bit of each radix block
auto lwe_last_out = lwe_array;
last_bit = input_bits_a + (bits_per_block - 1) * big_lwe_size;
for (int i = 0; i < num_radix_blocks; i++) {
cuda_memcpy_async_gpu_to_gpu(lwe_last_out, last_bit, big_lwe_size_bytes,
stream);
lwe_last_out += big_lwe_size;
last_bit += bits_per_block * big_lwe_size;
}
// Bitshift and add the other bits
lwe_last_out = lwe_array;
for (int i = bits_per_block - 2; i >= 0; i--) {
host_integer_small_scalar_mul_radix<Torus>(
stream, lwe_last_out, lwe_last_out, 2, big_lwe_dimension,
num_radix_blocks);
auto block = lwe_last_out;
auto bit_to_add = input_bits_a + i * big_lwe_size;
for (int j = 0; j < num_radix_blocks; j++) {
host_addition(stream, block, block, bit_to_add, big_lwe_dimension, 1);
block += big_lwe_size;
bit_to_add += bits_per_block * big_lwe_size;
}
// To give back a clean ciphertext
auto cleaning_lut = mem->cleaning_lut;
integer_radix_apply_univariate_lookup_table_kb(
stream, lwe_last_out, lwe_last_out, bsk, ksk, num_radix_blocks,
cleaning_lut);
}
}
#endif


@@ -151,4 +151,49 @@ __host__ void host_subtraction_plaintext(cuda_stream_t *stream, T *output,
output, plaintext_input, input_lwe_dimension, num_entries);
check_cuda_error(cudaGetLastError());
}
template <typename T>
__global__ void unchecked_sub_with_correcting_term(
T *output, T *input_1, T *input_2, uint32_t num_entries, uint32_t lwe_size,
uint32_t message_modulus, uint32_t carry_modulus, uint32_t degree) {
uint32_t msg_mod = message_modulus;
uint64_t z = max((uint64_t)ceil((double)degree / msg_mod), (uint64_t)1);
z *= msg_mod;
uint64_t delta = (1ULL << 63) / (message_modulus * carry_modulus);
uint64_t w = z * delta;
int tid = threadIdx.x;
int index = blockIdx.x * blockDim.x + tid;
if (index < num_entries) {
// Here we take advantage of the wrapping behaviour of uint
output[index] = input_1[index] + ((0 - input_2[index]));
if (index % lwe_size == lwe_size - 1)
output[index] += w;
}
}
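// Worked example (editor's note): with message_modulus = 4, carry_modulus = 4
// and degree = 3, z = max(ceil(3 / 4), 1) * 4 = 4 and w = 4 * delta. Adding w
// to the body keeps (lhs - rhs) non-negative for any rhs message in [0, 3],
// and since w encodes a multiple of the message modulus the decoded result
// is unchanged modulo 4.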
template <typename T>
__host__ void host_unchecked_sub_with_correcting_term(
cuda_stream_t *stream, T *output, T *input_1, T *input_2,
uint32_t input_lwe_dimension, uint32_t input_lwe_ciphertext_count,
uint32_t message_modulus, uint32_t carry_modulus, uint32_t degree) {
cudaSetDevice(stream->gpu_index);
// lwe_size includes the presence of the body
// whereas lwe_dimension is the number of elements in the mask
int lwe_size = input_lwe_dimension + 1;
// Create a 1-dimensional grid of threads
int num_blocks = 0, num_threads = 0;
int num_entries = input_lwe_ciphertext_count * lwe_size;
getNumBlocksAndThreads(num_entries, 512, num_blocks, num_threads);
dim3 grid(num_blocks, 1, 1);
dim3 thds(num_threads, 1, 1);
unchecked_sub_with_correcting_term<<<grid, thds, 0, stream->stream>>>(
output, input_1, input_2, num_entries, lwe_size, message_modulus,
carry_modulus, degree);
check_cuda_error(cudaGetLastError());
}
#endif // CUDA_ADD_H


@@ -1 +0,0 @@
#include "bootstrap.cuh"


@@ -1,30 +1,26 @@
#include "bootstrapping_key.cuh"
void cuda_convert_lwe_bootstrap_key_32(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size) {
void cuda_convert_lwe_programmable_bootstrap_key_32(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size) {
uint32_t total_polynomials =
input_lwe_dim * (glwe_dim + 1) * (glwe_dim + 1) * level_count;
cuda_convert_lwe_bootstrap_key<uint32_t, int32_t>(
cuda_convert_lwe_programmable_bootstrap_key<uint32_t, int32_t>(
(double2 *)dest, (int32_t *)src, stream, input_lwe_dim, glwe_dim,
level_count, polynomial_size, total_polynomials);
}
void cuda_convert_lwe_bootstrap_key_64(void *dest, void *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count,
uint32_t polynomial_size) {
void cuda_convert_lwe_programmable_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size) {
uint32_t total_polynomials =
input_lwe_dim * (glwe_dim + 1) * (glwe_dim + 1) * level_count;
cuda_convert_lwe_bootstrap_key<uint64_t, int64_t>(
cuda_convert_lwe_programmable_bootstrap_key<uint64_t, int64_t>(
(double2 *)dest, (int64_t *)src, stream, input_lwe_dim, glwe_dim,
level_count, polynomial_size, total_polynomials);
}
void cuda_convert_lwe_multi_bit_bootstrap_key_64(
void cuda_convert_lwe_multi_bit_programmable_bootstrap_key_64(
void *dest, void *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size,
uint32_t grouping_factor) {


@@ -1,11 +1,11 @@
#ifndef CUDA_BSK_CUH
#define CUDA_BSK_CUH
#include "bootstrap.h"
#include "bootstrap_multibit.h"
#include "device.h"
#include "fft/bnsmfft.cuh"
#include "polynomial/parameters.cuh"
#include "programmable_bootstrap.h"
#include "programmable_bootstrap_multibit.h"
#include <atomic>
#include <cstdint>
@@ -60,12 +60,10 @@ __device__ T *get_multi_bit_ith_lwe_gth_group_kth_block(
}
////////////////////////////////////////////////
template <typename T, typename ST>
void cuda_convert_lwe_bootstrap_key(double2 *dest, ST *src,
cuda_stream_t *stream,
uint32_t input_lwe_dim, uint32_t glwe_dim,
uint32_t level_count,
uint32_t polynomial_size,
uint32_t total_polynomials) {
void cuda_convert_lwe_programmable_bootstrap_key(
double2 *dest, ST *src, cuda_stream_t *stream, uint32_t input_lwe_dim,
uint32_t glwe_dim, uint32_t level_count, uint32_t polynomial_size,
uint32_t total_polynomials) {
cudaSetDevice(stream->gpu_index);
int shared_memory_size = sizeof(double) * polynomial_size;


@@ -0,0 +1 @@
#include "programmable_bootstrap.cuh"


@@ -1,8 +1,8 @@
#include "../../include/bootstrap.h"
#include "../../include/device.h"
#include "../../include/programmable_bootstrap.h"
#include "../include/device.h"
#include "bootstrap_low_latency.cuh"
#include "bootstrap_multibit.cuh"
#include "programmable_bootstrap_classic.cuh"
#include "programmable_bootstrap_multibit.cuh"
template <typename Torus>
void execute_pbs(cuda_stream_t *stream, Torus *lwe_array_out,
@@ -21,16 +21,8 @@ void execute_pbs(cuda_stream_t *stream, Torus *lwe_array_out,
switch (pbs_type) {
case MULTI_BIT:
PANIC("Error: 32-bit multibit PBS is not supported.\n")
case LOW_LAT:
cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
case AMORTIZED:
cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
case CLASSICAL:
cuda_programmable_bootstrap_lwe_ciphertext_vector_32(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
@@ -45,23 +37,17 @@ void execute_pbs(cuda_stream_t *stream, Torus *lwe_array_out,
// 64 bits
switch (pbs_type) {
case MULTI_BIT:
cuda_multi_bit_pbs_lwe_ciphertext_vector_64(
if (grouping_factor == 0)
PANIC("Multi-bit PBS error: grouping factor should be > 0.")
cuda_multi_bit_programmable_bootstrap_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, grouping_factor, base_log, level_count,
input_lwe_ciphertext_count, num_luts, lwe_idx, max_shared_memory);
break;
case LOW_LAT:
cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, input_lwe_ciphertext_count,
num_luts, lwe_idx, max_shared_memory);
break;
case AMORTIZED:
cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
case CLASSICAL:
cuda_programmable_bootstrap_lwe_ciphertext_vector_64(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, pbs_buffer, lwe_dimension, glwe_dimension,
@@ -92,16 +78,11 @@ void execute_scratch_pbs(cuda_stream_t *stream, int8_t **pbs_buffer,
switch (pbs_type) {
case MULTI_BIT:
PANIC("Error: 32-bit multibit PBS is not supported.\n")
case LOW_LAT:
scratch_cuda_bootstrap_low_latency_32(
case CLASSICAL:
scratch_cuda_programmable_bootstrap_32(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case AMORTIZED:
scratch_cuda_bootstrap_amortized_32(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
default:
PANIC("Error: unsupported cuda PBS type.")
}
@@ -110,21 +91,18 @@ void execute_scratch_pbs(cuda_stream_t *stream, int8_t **pbs_buffer,
// 64 bits
switch (pbs_type) {
case MULTI_BIT:
scratch_cuda_multi_bit_pbs_64(
if (grouping_factor == 0)
PANIC("Multi-bit PBS error: grouping factor should be > 0.")
scratch_cuda_multi_bit_programmable_bootstrap_64(
stream, pbs_buffer, lwe_dimension, glwe_dimension, polynomial_size,
level_count, grouping_factor, input_lwe_ciphertext_count,
max_shared_memory, allocate_gpu_memory);
break;
case LOW_LAT:
scratch_cuda_bootstrap_low_latency_64(
case CLASSICAL:
scratch_cuda_programmable_bootstrap_64(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case AMORTIZED:
scratch_cuda_bootstrap_amortized_64(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
default:
PANIC("Error: unsupported cuda PBS type.")
}


@@ -1,12 +1,12 @@
#include "bootstrap_amortized.cuh"
#include "programmable_bootstrap_amortized.cuh"
/*
* Returns the buffer size for 64 bits executions
*/
uint64_t get_buffer_size_bootstrap_amortized_64(
uint64_t get_buffer_size_programmable_bootstrap_amortized_64(
uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory) {
return get_buffer_size_bootstrap_amortized<uint64_t>(
return get_buffer_size_programmable_bootstrap_amortized<uint64_t>(
glwe_dimension, polynomial_size, input_lwe_ciphertext_count,
max_shared_memory);
}
@@ -17,44 +17,51 @@ uint64_t get_buffer_size_bootstrap_amortized_64(
* configures SM options on the GPU in case FULLSM or PARTIALSM mode is going to
* be used.
*/
void scratch_cuda_bootstrap_amortized_32(
void scratch_cuda_programmable_bootstrap_amortized_32(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory) {
switch (polynomial_size) {
case 256:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<256>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<256>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 512:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<512>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<512>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 1024:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<1024>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<1024>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 2048:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<2048>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<2048>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 4096:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<4096>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<4096>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 8192:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<8192>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<8192>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 16384:
scratch_bootstrap_amortized<uint32_t, int32_t, AmortizedDegree<16384>>(
scratch_programmable_bootstrap_amortized<uint32_t, int32_t,
AmortizedDegree<16384>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
@@ -71,44 +78,51 @@ void scratch_cuda_bootstrap_amortized_32(
* configures SM options on the GPU in case FULLSM or PARTIALSM mode is going to
* be used.
*/
void scratch_cuda_bootstrap_amortized_64(
void scratch_cuda_programmable_bootstrap_amortized_64(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory) {
switch (polynomial_size) {
case 256:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<256>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<256>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 512:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<512>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<512>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 1024:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<1024>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<1024>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 2048:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<2048>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<2048>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 4096:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<4096>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<4096>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 8192:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<8192>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<8192>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 16384:
scratch_bootstrap_amortized<uint64_t, int64_t, AmortizedDegree<16384>>(
scratch_programmable_bootstrap_amortized<uint64_t, int64_t,
AmortizedDegree<16384>>(
stream, pbs_buffer, glwe_dimension, polynomial_size,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
@@ -122,7 +136,7 @@ void scratch_cuda_bootstrap_amortized_64(
/* Perform the programmable bootstrapping on a batch of input u32 LWE
* ciphertexts. See the corresponding operation on 64 bits for more details.
*/
void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
void cuda_programmable_bootstrap_amortized_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
@@ -136,7 +150,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
switch (polynomial_size) {
case 256:
host_bootstrap_amortized<uint32_t, AmortizedDegree<256>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<256>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -145,7 +159,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 512:
host_bootstrap_amortized<uint32_t, AmortizedDegree<512>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<512>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -154,7 +168,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 1024:
host_bootstrap_amortized<uint32_t, AmortizedDegree<1024>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<1024>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -163,7 +177,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 2048:
host_bootstrap_amortized<uint32_t, AmortizedDegree<2048>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<2048>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -172,7 +186,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 4096:
host_bootstrap_amortized<uint32_t, AmortizedDegree<4096>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<4096>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -181,7 +195,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 8192:
host_bootstrap_amortized<uint32_t, AmortizedDegree<8192>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<8192>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -190,7 +204,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
max_shared_memory);
break;
case 16384:
host_bootstrap_amortized<uint32_t, AmortizedDegree<16384>>(
host_programmable_bootstrap_amortized<uint32_t, AmortizedDegree<16384>>(
stream, (uint32_t *)lwe_array_out, (uint32_t *)lwe_output_indexes,
(uint32_t *)lut_vector, (uint32_t *)lut_vector_indexes,
(uint32_t *)lwe_array_in, (uint32_t *)lwe_input_indexes,
@@ -270,7 +284,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_32(
* - the constant memory (64K) is used for storing the roots of unity
* values for the FFT
*/
void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
void cuda_programmable_bootstrap_amortized_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *pbs_buffer,
@@ -284,7 +298,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
switch (polynomial_size) {
case 256:
host_bootstrap_amortized<uint64_t, AmortizedDegree<256>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<256>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -293,7 +307,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 512:
host_bootstrap_amortized<uint64_t, AmortizedDegree<512>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<512>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -302,7 +316,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 1024:
host_bootstrap_amortized<uint64_t, AmortizedDegree<1024>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<1024>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -311,7 +325,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 2048:
host_bootstrap_amortized<uint64_t, AmortizedDegree<2048>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<2048>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -320,7 +334,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 4096:
host_bootstrap_amortized<uint64_t, AmortizedDegree<4096>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<4096>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -329,7 +343,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 8192:
host_bootstrap_amortized<uint64_t, AmortizedDegree<8192>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<8192>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -338,7 +352,7 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
max_shared_memory);
break;
case 16384:
host_bootstrap_amortized<uint64_t, AmortizedDegree<16384>>(
host_programmable_bootstrap_amortized<uint64_t, AmortizedDegree<16384>>(
stream, (uint64_t *)lwe_array_out, (uint64_t *)lwe_output_indexes,
(uint64_t *)lut_vector, (uint64_t *)lut_vector_indexes,
(uint64_t *)lwe_array_in, (uint64_t *)lwe_input_indexes,
@@ -357,8 +371,8 @@ void cuda_bootstrap_amortized_lwe_ciphertext_vector_64(
* This cleanup function frees the GPU memory held by the amortized PBS
* buffer, for 32- or 64-bit inputs.
*/
void cleanup_cuda_bootstrap_amortized(cuda_stream_t *stream,
int8_t **pbs_buffer) {
void cleanup_cuda_programmable_bootstrap_amortized(cuda_stream_t *stream,
int8_t **pbs_buffer) {
// Free memory
cuda_drop_async(*pbs_buffer, stream);
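
For orientation, a minimal host-side sketch of the renamed amortized PBS lifecycle (scratch, execute, cleanup). The 64-bit scratch wrapper name scratch_cuda_programmable_bootstrap_amortized_64 and the exact tail of the execute signature are assumptions inferred from the renames in this diff, not verbatim source:

// Hedged sketch; the parameter order of the execute call follows the 32-bit
// variant shown above.
int8_t *pbs_buffer = nullptr;
scratch_cuda_programmable_bootstrap_amortized_64(
    stream, &pbs_buffer, glwe_dimension, polynomial_size,
    input_lwe_ciphertext_count, max_shared_memory,
    /*allocate_gpu_memory=*/true);
cuda_programmable_bootstrap_amortized_lwe_ciphertext_vector_64(
    stream, lwe_array_out, lwe_output_indexes, lut_vector,
    lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
    pbs_buffer, lwe_dimension, glwe_dimension, polynomial_size, base_log,
    level_count, num_samples, num_luts, /*lwe_idx=*/0, max_shared_memory);
cleanup_cuda_programmable_bootstrap_amortized(stream, &pbs_buffer);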

View File

@@ -6,7 +6,6 @@
#include <cuda_runtime.h>
#endif
#include "bootstrap.h"
#include "crypto/gadget.cuh"
#include "crypto/torus.cuh"
#include "device.h"
@@ -15,11 +14,12 @@
#include "polynomial/functions.cuh"
#include "polynomial/parameters.cuh"
#include "polynomial/polynomial_math.cuh"
#include "programmable_bootstrap.h"
#include "types/complex/operations.cuh"
template <typename Torus, class params, sharedMemDegree SMD>
/*
* Kernel launched by host_bootstrap_amortized
* Kernel launched by host_programmable_bootstrap_amortized
*
* Uses shared memory to increase performance
* - lwe_array_out: output batch of num_samples bootstrapped ciphertexts c =
@@ -46,7 +46,7 @@ template <typename Torus, class params, sharedMemDegree SMD>
* - device_memory_size_per_sample: amount of global memory to allocate if SMD
* is not FULLSM
*/
__global__ void device_bootstrap_amortized(
__global__ void device_programmable_bootstrap_amortized(
Torus *lwe_array_out, Torus *lwe_output_indexes, Torus *lut_vector,
Torus *lut_vector_indexes, Torus *lwe_array_in, Torus *lwe_input_indexes,
double2 *bootstrapping_key, int8_t *device_mem, uint32_t glwe_dimension,
@@ -211,7 +211,8 @@ __global__ void device_bootstrap_amortized(
}
template <typename Torus>
__host__ __device__ uint64_t get_buffer_size_full_sm_bootstrap_amortized(
__host__ __device__ uint64_t
get_buffer_size_full_sm_programmable_bootstrap_amortized(
uint32_t polynomial_size, uint32_t glwe_dimension) {
return sizeof(Torus) * polynomial_size * (glwe_dimension + 1) + // accumulator
sizeof(Torus) * polynomial_size *
@@ -223,19 +224,22 @@ __host__ __device__ uint64_t get_buffer_size_full_sm_bootstrap_amortized(
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_partial_sm_bootstrap_amortized(uint32_t polynomial_size) {
get_buffer_size_partial_sm_programmable_bootstrap_amortized(
uint32_t polynomial_size) {
return sizeof(double2) * polynomial_size / 2; // accumulator fft
}
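
For a sense of scale, the partial variant above keeps only the FFT accumulator in shared memory, which fits comfortably in the 48 KB static shared-memory budget of most GPUs; a worked example with an illustrative polynomial size:

// Illustrative arithmetic only, for an N chosen from the supported range.
constexpr uint32_t N = 2048;
constexpr uint64_t partial_sm_bytes = sizeof(double2) * N / 2; // 16 * 1024
static_assert(partial_sm_bytes == 16384, "16 KiB for N = 2048");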
template <typename Torus>
__host__ __device__ uint64_t get_buffer_size_bootstrap_amortized(
__host__ __device__ uint64_t get_buffer_size_programmable_bootstrap_amortized(
uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory) {
uint64_t full_sm = get_buffer_size_full_sm_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t full_sm =
get_buffer_size_full_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_amortized<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size);
uint64_t partial_dm = full_sm - partial_sm;
uint64_t full_dm = full_sm;
uint64_t device_mem = 0;
@@ -248,41 +252,45 @@ __host__ __device__ uint64_t get_buffer_size_bootstrap_amortized(
}
template <typename Torus, typename STorus, typename params>
__host__ void scratch_bootstrap_amortized(
__host__ void scratch_programmable_bootstrap_amortized(
cuda_stream_t *stream, int8_t **pbs_buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t input_lwe_ciphertext_count,
uint32_t max_shared_memory, bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
uint64_t full_sm = get_buffer_size_full_sm_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t full_sm =
get_buffer_size_full_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_amortized<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size);
if (max_shared_memory >= partial_sm && max_shared_memory < full_sm) {
cudaFuncSetAttribute(device_bootstrap_amortized<Torus, params, PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize,
partial_sm);
cudaFuncSetCacheConfig(device_bootstrap_amortized<Torus, params, PARTIALSM>,
cudaFuncCachePreferShared);
cudaFuncSetAttribute(
device_programmable_bootstrap_amortized<Torus, params, PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, partial_sm);
cudaFuncSetCacheConfig(
device_programmable_bootstrap_amortized<Torus, params, PARTIALSM>,
cudaFuncCachePreferShared);
} else if (max_shared_memory >= partial_sm) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_amortized<Torus, params, FULLSM>,
device_programmable_bootstrap_amortized<Torus, params, FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm));
check_cuda_error(cudaFuncSetCacheConfig(
device_bootstrap_amortized<Torus, params, FULLSM>,
device_programmable_bootstrap_amortized<Torus, params, FULLSM>,
cudaFuncCachePreferShared));
}
if (allocate_gpu_memory) {
uint64_t buffer_size = get_buffer_size_bootstrap_amortized<Torus>(
glwe_dimension, polynomial_size, input_lwe_ciphertext_count,
max_shared_memory);
uint64_t buffer_size =
get_buffer_size_programmable_bootstrap_amortized<Torus>(
glwe_dimension, polynomial_size, input_lwe_ciphertext_count,
max_shared_memory);
*pbs_buffer = (int8_t *)cuda_malloc_async(buffer_size, stream);
check_cuda_error(cudaGetLastError());
}
}
template <typename Torus, class params>
__host__ void host_bootstrap_amortized(
__host__ void host_programmable_bootstrap_amortized(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key, int8_t *pbs_buffer,
@@ -292,11 +300,13 @@ __host__ void host_bootstrap_amortized(
uint32_t max_shared_memory) {
cudaSetDevice(stream->gpu_index);
uint64_t SM_FULL = get_buffer_size_full_sm_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t SM_FULL =
get_buffer_size_full_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size, glwe_dimension);
uint64_t SM_PART =
get_buffer_size_partial_sm_bootstrap_amortized<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap_amortized<Torus>(
polynomial_size);
uint64_t DM_PART = SM_FULL - SM_PART;
@@ -316,14 +326,14 @@ __host__ void host_bootstrap_amortized(
// from one of three templates (no use, partial use or full use
// of shared memory)
if (max_shared_memory < SM_PART) {
device_bootstrap_amortized<Torus, params, NOSM>
device_programmable_bootstrap_amortized<Torus, params, NOSM>
<<<grid, thds, 0, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
lwe_array_in, lwe_input_indexes, bootstrapping_key, pbs_buffer,
glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, lwe_idx, DM_FULL);
} else if (max_shared_memory < SM_FULL) {
device_bootstrap_amortized<Torus, params, PARTIALSM>
device_programmable_bootstrap_amortized<Torus, params, PARTIALSM>
<<<grid, thds, SM_PART, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
lwe_array_in, lwe_input_indexes, bootstrapping_key, pbs_buffer,
@@ -335,7 +345,7 @@ __host__ void host_bootstrap_amortized(
// device then has to be allocated dynamically.
// For lower compute capabilities, this call
// just does nothing and the amount of shared memory used is 48 KB
device_bootstrap_amortized<Torus, params, FULLSM>
device_programmable_bootstrap_amortized<Torus, params, FULLSM>
<<<grid, thds, SM_FULL, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
lwe_array_in, lwe_input_indexes, bootstrapping_key, pbs_buffer,
@@ -354,8 +364,8 @@ int cuda_get_pbs_per_gpu(int polynomial_size) {
cudaDeviceProp device_properties;
cudaGetDeviceProperties(&device_properties, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&blocks_per_sm, device_bootstrap_amortized<Torus, params>, num_threads,
0);
&blocks_per_sm, device_programmable_bootstrap_amortized<Torus, params>,
num_threads, 0);
return device_properties.multiProcessorCount * blocks_per_sm;
}
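
The occupancy-based capacity estimate in cuda_get_pbs_per_gpu generalizes to any kernel; a self-contained sketch using the same CUDA runtime calls, where the kernel pointer and thread count are placeholders:

// Hedged sketch: how many blocks of `kernel` the device can keep resident.
int estimate_resident_blocks(const void *kernel, int num_threads) {
  int blocks_per_sm = 0;
  cudaDeviceProp props;
  cudaGetDeviceProperties(&props, 0);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, kernel, num_threads, /*dynamicSMemSize=*/0);
  return props.multiProcessorCount * blocks_per_sm;
}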

View File

@@ -1,5 +1,5 @@
#ifndef CUDA_FAST_LOWLAT_PBS_CUH
#define CUDA_FAST_LOWLAT_PBS_CUH
#ifndef CUDA_CG_PBS_CUH
#define CUDA_CG_PBS_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
@@ -8,7 +8,6 @@
#include "cooperative_groups.h"
#include "bootstrap.h"
#include "crypto/gadget.cuh"
#include "crypto/torus.cuh"
#include "device.h"
@@ -16,9 +15,10 @@
#include "fft/twiddles.cuh"
#include "polynomial/parameters.cuh"
#include "polynomial/polynomial_math.cuh"
#include "programmable_bootstrap.h"
#include "types/complex/operations.cuh"
// Cooperative groups are used in the low latency PBS
// Cooperative groups are used for this implementation
using namespace cooperative_groups;
namespace cg = cooperative_groups;
@@ -114,8 +114,7 @@ __device__ void mul_ggsw_glwe(Torus *accumulator, double2 *fft,
}
/*
* Kernel launched by the low latency version of the
* bootstrapping, that uses cooperative groups
* Kernel that computes the classical PBS using cooperative groups
*
* - lwe_array_out: vector of output lwe s, with length
* (glwe_dimension * polynomial_size+1)*num_samples
@@ -128,7 +127,7 @@ __device__ void mul_ggsw_glwe(Torus *accumulator, double2 *fft,
* Each y-block computes one element of the lwe_array_out.
*/
template <typename Torus, class params, sharedMemDegree SMD>
__global__ void device_bootstrap_fast_low_latency(
__global__ void device_programmable_bootstrap_cg(
Torus *lwe_array_out, Torus *lwe_output_indexes, Torus *lut_vector,
Torus *lut_vector_indexes, Torus *lwe_array_in, Torus *lwe_input_indexes,
double2 *bootstrapping_key, double2 *join_buffer, uint32_t lwe_dimension,
@@ -246,51 +245,50 @@ __global__ void device_bootstrap_fast_low_latency(
}
template <typename Torus, typename STorus, typename params>
__host__ void scratch_bootstrap_fast_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **buffer,
__host__ void scratch_programmable_bootstrap_cg(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
uint64_t full_sm = get_buffer_size_full_sm_bootstrap_fast_low_latency<Torus>(
polynomial_size);
uint64_t full_sm =
get_buffer_size_full_sm_programmable_bootstrap_cg<Torus>(polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap_cg<Torus>(
polynomial_size);
if (max_shared_memory >= partial_sm && max_shared_memory < full_sm) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_fast_low_latency<Torus, params, PARTIALSM>,
device_programmable_bootstrap_cg<Torus, params, PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, partial_sm));
cudaFuncSetCacheConfig(
device_bootstrap_fast_low_latency<Torus, params, PARTIALSM>,
device_programmable_bootstrap_cg<Torus, params, PARTIALSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else if (max_shared_memory >= partial_sm) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_fast_low_latency<Torus, params, FULLSM>,
device_programmable_bootstrap_cg<Torus, params, FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm));
cudaFuncSetCacheConfig(
device_bootstrap_fast_low_latency<Torus, params, FULLSM>,
device_programmable_bootstrap_cg<Torus, params, FULLSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
}
*buffer = new pbs_buffer<Torus, LOW_LAT>(
*buffer = new pbs_buffer<Torus, CLASSICAL>(
stream, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, PBS_VARIANT::FAST, allocate_gpu_memory);
input_lwe_ciphertext_count, PBS_VARIANT::CG, allocate_gpu_memory);
}
/*
* Host wrapper to the low latency version
* of bootstrapping
* Host wrapper
*/
template <typename Torus, class params>
__host__ void host_bootstrap_fast_low_latency(
__host__ void host_programmable_bootstrap_cg(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *buffer, uint32_t glwe_dimension,
pbs_buffer<Torus, CLASSICAL> *buffer, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t input_lwe_ciphertext_count,
uint32_t num_luts, uint32_t max_shared_memory) {
@@ -298,11 +296,11 @@ __host__ void host_bootstrap_fast_low_latency(
// With SM each block corresponds to either the mask or body, no need to
// duplicate data for each
uint64_t full_sm = get_buffer_size_full_sm_bootstrap_fast_low_latency<Torus>(
polynomial_size);
uint64_t full_sm =
get_buffer_size_full_sm_programmable_bootstrap_cg<Torus>(polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap_cg<Torus>(
polynomial_size);
uint64_t full_dm = full_sm;
@@ -333,28 +331,27 @@ __host__ void host_bootstrap_fast_low_latency(
if (max_shared_memory < partial_sm) {
kernel_args[13] = &full_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_bootstrap_fast_low_latency<Torus, params, NOSM>, grid,
(void *)device_programmable_bootstrap_cg<Torus, params, NOSM>, grid,
thds, (void **)kernel_args, 0, stream->stream));
} else if (max_shared_memory < full_sm) {
kernel_args[13] = &partial_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_bootstrap_fast_low_latency<Torus, params, PARTIALSM>,
(void *)device_programmable_bootstrap_cg<Torus, params, PARTIALSM>,
grid, thds, (void **)kernel_args, partial_sm, stream->stream));
} else {
int no_dm = 0;
kernel_args[13] = &no_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_bootstrap_fast_low_latency<Torus, params, FULLSM>, grid,
(void *)device_programmable_bootstrap_cg<Torus, params, FULLSM>, grid,
thds, (void **)kernel_args, full_sm, stream->stream));
}
check_cuda_error(cudaGetLastError());
}
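
Cooperative launches differ from ordinary <<<...>>> launches: arguments travel through an array of pointers and the whole grid must be co-resident for grid.sync() to be legal. A minimal standalone sketch of the pattern used above, with a toy kernel and illustrative sizes (assumes <cooperative_groups.h> and the check_cuda_error helper from this codebase):

__global__ void toy_cg_kernel(int *data, int n) {
  cooperative_groups::grid_group grid = cooperative_groups::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] += 1;
  grid.sync(); // valid only because every block is resident
}

void launch_toy(int *d_data, int n, cudaStream_t s) {
  void *kernel_args[] = {&d_data, &n};
  dim3 grid((n + 255) / 256), thds(256);
  check_cuda_error(cudaLaunchCooperativeKernel(
      (void *)toy_cg_kernel, grid, thds, kernel_args, /*sharedMem=*/0, s));
}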
// Verify if the grid size for the low latency kernel satisfies the cooperative
// group constraints
// Verify if the grid size satisfies the cooperative group constraints
template <typename Torus, class params>
__host__ bool verify_cuda_bootstrap_fast_low_latency_grid_size(
__host__ bool verify_cuda_programmable_bootstrap_cg_grid_size(
int glwe_dimension, int level_count, int num_samples,
uint32_t max_shared_memory) {
@@ -364,10 +361,10 @@ __host__ bool verify_cuda_bootstrap_fast_low_latency_grid_size(
// Calculate the dimension of the kernel
uint64_t full_sm =
get_buffer_size_full_sm_bootstrap_fast_low_latency<Torus>(params::degree);
get_buffer_size_full_sm_programmable_bootstrap_cg<Torus>(params::degree);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_fast_low_latency<Torus>(
get_buffer_size_partial_sm_programmable_bootstrap_cg<Torus>(
params::degree);
int thds = params::degree / params::opt;
@@ -379,17 +376,16 @@ __host__ bool verify_cuda_bootstrap_fast_low_latency_grid_size(
if (max_shared_memory < partial_sm) {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_bootstrap_fast_low_latency<Torus, params, NOSM>, thds,
0);
(void *)device_programmable_bootstrap_cg<Torus, params, NOSM>, thds, 0);
} else if (max_shared_memory < full_sm) {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_bootstrap_fast_low_latency<Torus, params, PARTIALSM>,
(void *)device_programmable_bootstrap_cg<Torus, params, PARTIALSM>,
thds, partial_sm);
} else {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_bootstrap_fast_low_latency<Torus, params, FULLSM>, thds,
(void *)device_programmable_bootstrap_cg<Torus, params, FULLSM>, thds,
full_sm);
}
@@ -399,46 +395,45 @@ __host__ bool verify_cuda_bootstrap_fast_low_latency_grid_size(
return number_of_blocks <= max_active_blocks_per_sm * number_of_sm;
}
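
A worked pass through this check, with illustrative numbers rather than measurements:

// level_count = 1, glwe_dimension = 1, num_samples = 128:
//   number_of_blocks = 1 * (1 + 1) * 128 = 256
// On a hypothetical GPU with 108 SMs fitting 4 such blocks each:
//   capacity = 4 * 108 = 432 >= 256, so the cooperative launch is accepted.
// Doubling num_samples to 256 needs 512 blocks and the check fails, which is
// why the caller falls back to the default (non-CG) classical PBS.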
// Verify if the grid size for the low latency kernel satisfies the cooperative
// group constraints
// Verify if the grid size satisfies the cooperative group constraints
template <typename Torus>
__host__ bool supports_cooperative_groups_on_lowlat_pbs(
__host__ bool supports_cooperative_groups_on_programmable_bootstrap(
int glwe_dimension, int polynomial_size, int level_count, int num_samples,
uint32_t max_shared_memory) {
switch (polynomial_size) {
case 256:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<256>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 512:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<512>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 1024:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<1024>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 2048:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<2048>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 4096:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<4096>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 8192:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<8192>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 16384:
return verify_cuda_bootstrap_fast_low_latency_grid_size<
return verify_cuda_programmable_bootstrap_cg_grid_size<
Torus, AmortizedDegree<16384>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
default:
PANIC("Cuda error (low latency PBS): unsupported polynomial size. "
PANIC("Cuda error (classical PBS): unsupported polynomial size. "
"Supported N's are powers of two"
" in the interval [256..16384].")
}
}
#endif // LOWLAT_FAST_PBS_H
#endif // CUDA_CG_PBS_CUH

View File

@@ -1,8 +1,6 @@
#ifndef CUDA_FAST_MULTIBIT_PBS_CUH
#define CUDA_FAST_MULTIBIT_PBS_CUH
#include "bootstrap.h"
#include "bootstrap_multibit.cuh"
#include "cooperative_groups.h"
#include "crypto/gadget.cuh"
#include "crypto/ggsw.cuh"
@@ -13,18 +11,21 @@
#include "polynomial/functions.cuh"
#include "polynomial/parameters.cuh"
#include "polynomial/polynomial_math.cuh"
#include "programmable_bootstrap.h"
#include "programmable_bootstrap_multibit.cuh"
#include "types/complex/operations.cuh"
#include <vector>
template <typename Torus, class params>
__global__ void device_multi_bit_bootstrap_fast_accumulate(
template <typename Torus, class params, sharedMemDegree SMD>
__global__ void device_multi_bit_programmable_bootstrap_cg_accumulate(
Torus *lwe_array_out, Torus *lwe_output_indexes, Torus *lut_vector,
Torus *lut_vector_indexes, Torus *lwe_array_in, Torus *lwe_input_indexes,
double2 *keybundle_array, double2 *join_buffer, Torus *global_accumulator,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t base_log, uint32_t level_count, uint32_t grouping_factor,
uint32_t lwe_offset, uint32_t lwe_chunk_size,
uint32_t keybundle_size_per_input) {
uint32_t keybundle_size_per_input, int8_t *device_mem,
uint64_t device_memory_size_per_block) {
grid_group grid = this_grid();
@@ -34,14 +35,21 @@ __global__ void device_multi_bit_bootstrap_fast_accumulate(
extern __shared__ int8_t sharedmem[];
int8_t *selected_memory;
selected_memory = sharedmem;
if constexpr (SMD == FULLSM) {
selected_memory = sharedmem;
} else {
int block_index = blockIdx.x + blockIdx.y * gridDim.x +
blockIdx.z * gridDim.x * gridDim.y;
selected_memory = &device_mem[block_index * device_memory_size_per_block];
}
// We always compute the pointer with most restrictive alignment to avoid
// alignment issues
double2 *accumulator_fft = (double2 *)selected_memory;
Torus *accumulator =
(Torus *)accumulator_fft +
(ptrdiff_t)(sizeof(double2) * polynomial_size / 2 / sizeof(Torus));
Torus *accumulator = (Torus *)selected_memory;
double2 *accumulator_fft =
(double2 *)accumulator +
(ptrdiff_t)(sizeof(Torus) * polynomial_size / sizeof(double2));
if constexpr (SMD == PARTIALSM)
accumulator_fft = (double2 *)sharedmem;
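  // Net effect of the carving above: FULLSM keeps the Torus accumulator and
  // its FFT counterpart back-to-back in shared memory; PARTIALSM leaves the
  // Torus accumulator in global device_mem and remaps only the FFT
  // accumulator into shared memory; NOSM keeps both in global memory.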
// The third dimension of the block is used to determine on which ciphertext
// this block is operating, in the case of batch bootstraps
@@ -128,12 +136,19 @@ __global__ void device_multi_bit_bootstrap_fast_accumulate(
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_fast_multibit_bootstrap(uint32_t polynomial_size) {
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap(
uint32_t polynomial_size) {
return sizeof(Torus) * polynomial_size; // accumulator
}
template <typename Torus>
__host__ __device__ uint64_t
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap(
uint32_t polynomial_size) {
return sizeof(Torus) * polynomial_size * 2; // accumulator + accumulator fft
}
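
The factor of two follows from the 64-bit torus this file instantiates; a quick sanity check (illustrative, assuming Torus = uint64_t):

// accumulator:     sizeof(uint64_t) * N    = 8N bytes
// accumulator fft: sizeof(double2) * N / 2 = 8N bytes
constexpr uint32_t N = 1024;
static_assert(sizeof(uint64_t) * N + sizeof(double2) * N / 2 ==
                  2 * sizeof(uint64_t) * N,
              "full = accumulator + fft accumulator");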
template <typename Torus>
__host__ __device__ uint64_t get_buffer_size_fast_multibit_bootstrap(
__host__ __device__ uint64_t get_buffer_size_cg_multibit_programmable_bootstrap(
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t input_lwe_ciphertext_count,
uint32_t grouping_factor, uint32_t lwe_chunk_size,
@@ -153,7 +168,7 @@ __host__ __device__ uint64_t get_buffer_size_fast_multibit_bootstrap(
}
template <typename Torus, typename STorus, typename params>
__host__ void scratch_fast_multi_bit_pbs(
__host__ void scratch_cg_multi_bit_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<uint64_t, MULTI_BIT> **buffer,
uint32_t lwe_dimension, uint32_t glwe_dimension, uint32_t polynomial_size,
uint32_t level_count, uint32_t input_lwe_ciphertext_count,
@@ -163,69 +178,106 @@ __host__ void scratch_fast_multi_bit_pbs(
cudaSetDevice(stream->gpu_index);
uint64_t full_sm_keybundle =
get_buffer_size_full_sm_multibit_bootstrap_keybundle<Torus>(
get_buffer_size_full_sm_multibit_programmable_bootstrap_keybundle<Torus>(
polynomial_size);
uint64_t full_sm_cg_accumulate =
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t partial_sm_cg_accumulate =
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t full_sm_accumulate =
get_buffer_size_full_sm_fast_multibit_bootstrap<Torus>(polynomial_size);
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_bootstrap_keybundle<Torus, params>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_keybundle));
cudaFuncSetCacheConfig(device_multi_bit_bootstrap_keybundle<Torus, params>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
if (max_shared_memory < full_sm_keybundle) {
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_programmable_bootstrap_keybundle<Torus, params, NOSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, 0));
cudaFuncSetCacheConfig(
device_multi_bit_programmable_bootstrap_keybundle<Torus, params, NOSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else {
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_programmable_bootstrap_keybundle<Torus, params,
FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_keybundle));
cudaFuncSetCacheConfig(
device_multi_bit_programmable_bootstrap_keybundle<Torus, params,
FULLSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
}
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_bootstrap_fast_accumulate<Torus, params>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_accumulate));
cudaFuncSetCacheConfig(
device_multi_bit_bootstrap_fast_accumulate<Torus, params>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
if (max_shared_memory < partial_sm_cg_accumulate) {
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
NOSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, 0));
cudaFuncSetCacheConfig(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
NOSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else if (max_shared_memory < full_sm_cg_accumulate) {
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, partial_sm_cg_accumulate));
cudaFuncSetCacheConfig(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
PARTIALSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else {
check_cuda_error(cudaFuncSetAttribute(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_cg_accumulate));
cudaFuncSetCacheConfig(
device_multi_bit_programmable_bootstrap_cg_accumulate<Torus, params,
FULLSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
}
if (!lwe_chunk_size)
lwe_chunk_size = get_average_lwe_chunk_size(
lwe_dimension, level_count, glwe_dimension, input_lwe_ciphertext_count);
lwe_chunk_size = get_lwe_chunk_size(input_lwe_ciphertext_count);
*buffer = new pbs_buffer<uint64_t, MULTI_BIT>(
stream, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, lwe_chunk_size, PBS_VARIANT::FAST,
input_lwe_ciphertext_count, lwe_chunk_size, PBS_VARIANT::CG,
allocate_gpu_memory);
}
template <typename Torus, typename STorus, class params>
__host__ void host_fast_multi_bit_pbs(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, uint64_t *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *pbs_buffer, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0) {
cudaSetDevice(stream->gpu_index);
template <typename Torus, class params>
__host__ void execute_external_product_loop(
cuda_stream_t *stream, Torus *lut_vector, Torus *lut_vector_indexes,
Torus *lwe_array_in, Torus *lwe_input_indexes, Torus *lwe_array_out,
Torus *lwe_output_indexes, pbs_buffer<Torus, MULTI_BIT> *buffer,
uint32_t num_samples, uint32_t lwe_dimension, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t grouping_factor, uint32_t base_log,
uint32_t level_count, uint32_t lwe_chunk_size, uint32_t max_shared_memory,
int lwe_offset) {
if (!lwe_chunk_size)
lwe_chunk_size = get_average_lwe_chunk_size(lwe_dimension, level_count,
glwe_dimension, num_samples);
//
double2 *keybundle_fft = pbs_buffer->keybundle_fft;
Torus *global_accumulator = pbs_buffer->global_accumulator;
double2 *buffer_fft = pbs_buffer->global_accumulator_fft;
//
uint64_t full_sm_keybundle =
get_buffer_size_full_sm_multibit_bootstrap_keybundle<Torus>(
uint64_t full_dm =
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t full_sm_accumulate =
get_buffer_size_full_sm_fast_multibit_bootstrap<Torus>(polynomial_size);
uint64_t partial_dm =
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap<Torus>(
polynomial_size);
uint64_t no_dm = 0;
uint32_t keybundle_size_per_input =
lwe_chunk_size * level_count * (glwe_dimension + 1) *
(glwe_dimension + 1) * (polynomial_size / 2);
//
void *kernel_args[18];
uint32_t chunk_size =
std::min(lwe_chunk_size, (lwe_dimension / grouping_factor) - lwe_offset);
auto d_mem = buffer->d_mem_acc_cg;
auto keybundle_fft = buffer->keybundle_fft;
auto global_accumulator = buffer->global_accumulator;
auto buffer_fft = buffer->global_accumulator_fft;
void *kernel_args[20];
kernel_args[0] = &lwe_array_out;
kernel_args[1] = &lwe_output_indexes;
kernel_args[2] = &lut_vector;
@@ -241,55 +293,87 @@ __host__ void host_fast_multi_bit_pbs(
kernel_args[12] = &base_log;
kernel_args[13] = &level_count;
kernel_args[14] = &grouping_factor;
kernel_args[15] = &lwe_offset;
kernel_args[16] = &chunk_size;
kernel_args[17] = &keybundle_size_per_input;
kernel_args[18] = &d_mem;
//
dim3 grid_accumulate(level_count, glwe_dimension + 1, num_samples);
dim3 thds(polynomial_size / params::opt, 1, 1);
if (max_shared_memory < partial_dm) {
kernel_args[19] = &full_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, NOSM>,
grid_accumulate, thds, (void **)kernel_args, 0, stream->stream));
} else if (max_shared_memory < full_dm) {
kernel_args[19] = &partial_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, PARTIALSM>,
grid_accumulate, thds, (void **)kernel_args, partial_dm,
stream->stream));
} else {
kernel_args[19] = &no_dm;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, FULLSM>,
grid_accumulate, thds, (void **)kernel_args, full_dm, stream->stream));
}
}
template <typename Torus, typename STorus, class params>
__host__ void host_cg_multi_bit_programmable_bootstrap(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, uint64_t *bootstrapping_key,
pbs_buffer<Torus, MULTI_BIT> *buffer, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t grouping_factor,
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory,
uint32_t lwe_chunk_size = 0) {
cudaSetDevice(stream->gpu_index);
if (!lwe_chunk_size)
lwe_chunk_size = get_lwe_chunk_size(num_samples);
for (uint32_t lwe_offset = 0; lwe_offset < (lwe_dimension / grouping_factor);
lwe_offset += lwe_chunk_size) {
uint32_t chunk_size = std::min(
lwe_chunk_size, (lwe_dimension / grouping_factor) - lwe_offset);
// Compute a keybundle
dim3 grid_keybundle(num_samples * chunk_size,
(glwe_dimension + 1) * (glwe_dimension + 1),
level_count);
device_multi_bit_bootstrap_keybundle<Torus, params>
<<<grid_keybundle, thds, full_sm_keybundle, stream->stream>>>(
lwe_array_in, lwe_input_indexes, keybundle_fft, bootstrapping_key,
lwe_dimension, glwe_dimension, polynomial_size, grouping_factor,
base_log, level_count, lwe_offset, chunk_size,
keybundle_size_per_input);
check_cuda_error(cudaGetLastError());
execute_compute_keybundle<Torus, params>(
stream, lwe_array_in, lwe_input_indexes, bootstrapping_key, buffer,
num_samples, lwe_dimension, glwe_dimension, polynomial_size,
grouping_factor, base_log, level_count, max_shared_memory,
lwe_chunk_size, lwe_offset);
kernel_args[15] = &lwe_offset;
kernel_args[16] = &chunk_size;
check_cuda_error(cudaLaunchCooperativeKernel(
(void *)device_multi_bit_bootstrap_fast_accumulate<Torus, params>,
grid_accumulate, thds, (void **)kernel_args, full_sm_accumulate,
stream->stream));
// Accumulate
execute_external_product_loop<Torus, params>(
stream, lut_vector, lut_vector_indexes, lwe_array_in, lwe_input_indexes,
lwe_array_out, lwe_output_indexes, buffer, num_samples, lwe_dimension,
glwe_dimension, polynomial_size, grouping_factor, base_log, level_count,
lwe_chunk_size, max_shared_memory, lwe_offset);
}
}
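
The chunking above bounds the keybundle working set; a worked pass with illustrative parameters:

// lwe_dimension = 742, grouping_factor = 2 => 742 / 2 = 371 keybundle columns.
// With lwe_chunk_size = 64 the loop runs ceil(371 / 64) = 6 iterations: the
// first five process 64 columns each, and the last processes
// 371 - 5 * 64 = 51 columns via the std::min clamp on chunk_size.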
// Verify if the grid size for the low latency kernel satisfies the cooperative
// group constraints
// Verify if the grid size satisfies the cooperative group constraints
template <typename Torus, class params>
__host__ bool
verify_cuda_bootstrap_fast_multi_bit_grid_size(int glwe_dimension,
int level_count, int num_samples,
uint32_t max_shared_memory) {
__host__ bool verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size(
int glwe_dimension, int level_count, int num_samples,
uint32_t max_shared_memory) {
// If Cooperative Groups is not supported, no need to check anything else
if (!cuda_check_support_cooperative_groups())
return false;
// Calculate the dimension of the kernel
uint64_t full_sm =
get_buffer_size_full_sm_fast_multibit_bootstrap<Torus>(params::degree);
uint64_t full_sm_cg_accumulate =
get_buffer_size_full_sm_cg_multibit_programmable_bootstrap<Torus>(
params::degree);
uint64_t partial_sm_cg_accumulate =
get_buffer_size_partial_sm_cg_multibit_programmable_bootstrap<Torus>(
params::degree);
int thds = params::degree / params::opt;
@@ -297,10 +381,25 @@ verify_cuda_bootstrap_fast_multi_bit_grid_size(int glwe_dimension,
int number_of_blocks = level_count * (glwe_dimension + 1) * num_samples;
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_multi_bit_bootstrap_fast_accumulate<Torus, params>, thds,
full_sm);
if (max_shared_memory < partial_sm_cg_accumulate) {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, NOSM>,
thds, 0);
} else if (max_shared_memory < full_sm_cg_accumulate) {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, PARTIALSM>,
thds, partial_sm_cg_accumulate);
} else {
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
&max_active_blocks_per_sm,
(void *)device_multi_bit_programmable_bootstrap_cg_accumulate<
Torus, params, FULLSM>,
thds, full_sm_cg_accumulate);
}
// Get the number of streaming multiprocessors
int number_of_sm = 0;
@@ -311,36 +410,36 @@ verify_cuda_bootstrap_fast_multi_bit_grid_size(int glwe_dimension,
// Verify if the grid size for the multi-bit kernel satisfies the cooperative
// group constraints
template <typename Torus>
__host__ bool supports_cooperative_groups_on_multibit_pbs(
__host__ bool supports_cooperative_groups_on_multibit_programmable_bootstrap(
int glwe_dimension, int polynomial_size, int level_count, int num_samples,
uint32_t max_shared_memory) {
switch (polynomial_size) {
case 256:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<Torus,
AmortizedDegree<256>>(
glwe_dimension, level_count, num_samples, max_shared_memory);
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<256>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 512:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<Torus,
AmortizedDegree<512>>(
glwe_dimension, level_count, num_samples, max_shared_memory);
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<512>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 1024:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<1024>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 2048:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<2048>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 4096:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<4096>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 8192:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<8192>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
case 16384:
return verify_cuda_bootstrap_fast_multi_bit_grid_size<
return verify_cuda_programmable_bootstrap_cg_multi_bit_grid_size<
Torus, AmortizedDegree<16384>>(glwe_dimension, level_count, num_samples,
max_shared_memory);
default:

View File

@@ -1,11 +1,13 @@
#include "bootstrap_fast_low_latency.cuh"
#include "bootstrap_low_latency.cuh"
#include "programmable_bootstrap_cg_classic.cuh"
#include "programmable_bootstrap_classic.cuh"
template <typename Torus>
bool has_support_to_cuda_bootstrap_fast_low_latency(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t num_samples, uint32_t max_shared_memory) {
return supports_cooperative_groups_on_lowlat_pbs<Torus>(
bool has_support_to_cuda_programmable_bootstrap_cg(uint32_t glwe_dimension,
uint32_t polynomial_size,
uint32_t level_count,
uint32_t num_samples,
uint32_t max_shared_memory) {
return supports_cooperative_groups_on_programmable_bootstrap<Torus>(
glwe_dimension, polynomial_size, level_count, num_samples,
max_shared_memory);
}
@@ -13,117 +15,117 @@ bool has_support_to_cuda_bootstrap_fast_low_latency(
/*
* Returns the buffer size for 64 bits executions
*/
uint64_t get_buffer_size_bootstrap_low_latency_64(
uint64_t get_buffer_size_programmable_bootstrap_64(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory) {
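  // Note: both branches below return the CG buffer size, so the fallback
  // (non-CG) path currently reuses the same allocation size.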
if (has_support_to_cuda_bootstrap_fast_low_latency<uint64_t>(
if (has_support_to_cuda_programmable_bootstrap_cg<uint64_t>(
glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory))
return get_buffer_size_bootstrap_fast_low_latency<uint64_t>(
return get_buffer_size_programmable_bootstrap_cg<uint64_t>(
glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory);
else
return get_buffer_size_bootstrap_fast_low_latency<uint64_t>(
return get_buffer_size_programmable_bootstrap_cg<uint64_t>(
glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory);
}
template <typename Torus, typename STorus>
void scratch_cuda_fast_bootstrap_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **pbs_buffer,
void scratch_cuda_programmable_bootstrap_cg(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **pbs_buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
switch (polynomial_size) {
case 256:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<256>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<256>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 512:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<512>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<512>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 1024:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<1024>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<1024>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 2048:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<2048>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<2048>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 4096:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<4096>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<4096>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 8192:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<8192>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<8192>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 16384:
scratch_bootstrap_fast_low_latency<Torus, STorus, AmortizedDegree<16384>>(
scratch_programmable_bootstrap_cg<Torus, STorus, AmortizedDegree<16384>>(
stream, pbs_buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
default:
PANIC("Cuda error (low latency PBS): unsupported polynomial size. "
PANIC("Cuda error (classical PBS): unsupported polynomial size. "
"Supported N's are powers of two"
" in the interval [256..16384].")
}
}
template <typename Torus, typename STorus>
void scratch_cuda_bootstrap_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **buffer,
void scratch_cuda_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
switch (polynomial_size) {
case 256:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<256>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<256>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 512:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<512>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<512>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 1024:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<1024>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<1024>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 2048:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<2048>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<2048>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 4096:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<4096>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<4096>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 8192:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<8192>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<8192>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
case 16384:
scratch_bootstrap_low_latency<Torus, STorus, AmortizedDegree<16384>>(
scratch_programmable_bootstrap<Torus, STorus, AmortizedDegree<16384>>(
stream, buffer, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory, allocate_gpu_memory);
break;
default:
PANIC("Cuda error (low latency PBS): unsupported polynomial size. "
PANIC("Cuda error (classical PBS): unsupported polynomial size. "
"Supported N's are powers of two"
" in the interval [256..16384].")
}
@@ -131,198 +133,192 @@ void scratch_cuda_bootstrap_low_latency(
/*
* This scratch function allocates the necessary amount of data on the GPU for
* the low latency PBS on 32 bits inputs, into `buffer`. It also
* the classical PBS on 32 bits inputs, into `buffer`. It also
* configures SM options on the GPU in case FULLSM or PARTIALSM mode is going to
* be used.
*/
void scratch_cuda_bootstrap_low_latency_32(
void scratch_cuda_programmable_bootstrap_32(
cuda_stream_t *stream, int8_t **buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
if (has_support_to_cuda_bootstrap_fast_low_latency<uint32_t>(
if (has_support_to_cuda_programmable_bootstrap_cg<uint32_t>(
glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory))
scratch_cuda_fast_bootstrap_low_latency<uint32_t, int32_t>(
stream, (pbs_buffer<uint32_t, LOW_LAT> **)buffer, glwe_dimension,
scratch_cuda_programmable_bootstrap_cg<uint32_t, int32_t>(
stream, (pbs_buffer<uint32_t, CLASSICAL> **)buffer, glwe_dimension,
polynomial_size, level_count, input_lwe_ciphertext_count,
max_shared_memory, allocate_gpu_memory);
else
scratch_cuda_bootstrap_low_latency<uint32_t, int32_t>(
stream, (pbs_buffer<uint32_t, LOW_LAT> **)buffer, glwe_dimension,
scratch_cuda_programmable_bootstrap<uint32_t, int32_t>(
stream, (pbs_buffer<uint32_t, CLASSICAL> **)buffer, glwe_dimension,
polynomial_size, level_count, input_lwe_ciphertext_count,
max_shared_memory, allocate_gpu_memory);
}
/*
* This scratch function allocates the necessary amount of data on the GPU for
* the low_latency PBS on 64 bits inputs, into `buffer`. It also
* configures SM options on the GPU in case FULLSM or PARTIALSM mode is going to
* be used.
* the PBS on 64 bits inputs, into `buffer`. It also configures SM options on
* the GPU in case FULLSM or PARTIALSM mode is going to be used.
*/
void scratch_cuda_bootstrap_low_latency_64(
void scratch_cuda_programmable_bootstrap_64(
cuda_stream_t *stream, int8_t **buffer, uint32_t glwe_dimension,
uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
if (has_support_to_cuda_bootstrap_fast_low_latency<uint64_t>(
if (has_support_to_cuda_programmable_bootstrap_cg<uint64_t>(
glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, max_shared_memory))
scratch_cuda_fast_bootstrap_low_latency<uint64_t, int64_t>(
stream, (pbs_buffer<uint64_t, LOW_LAT> **)buffer, glwe_dimension,
scratch_cuda_programmable_bootstrap_cg<uint64_t, int64_t>(
stream, (pbs_buffer<uint64_t, CLASSICAL> **)buffer, glwe_dimension,
polynomial_size, level_count, input_lwe_ciphertext_count,
max_shared_memory, allocate_gpu_memory);
else
scratch_cuda_bootstrap_low_latency<uint64_t, int64_t>(
stream, (pbs_buffer<uint64_t, LOW_LAT> **)buffer, glwe_dimension,
scratch_cuda_programmable_bootstrap<uint64_t, int64_t>(
stream, (pbs_buffer<uint64_t, CLASSICAL> **)buffer, glwe_dimension,
polynomial_size, level_count, input_lwe_ciphertext_count,
max_shared_memory, allocate_gpu_memory);
}
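
Scratch and execution must agree on the variant so that the buffer laid out here matches the kernel launched later; both sides guarantee this by evaluating the same support predicate with the same parameters. Condensed (a sketch, names as they appear in this diff):

bool use_cg = has_support_to_cuda_programmable_bootstrap_cg<uint64_t>(
    glwe_dimension, polynomial_size, level_count,
    input_lwe_ciphertext_count, max_shared_memory);
// scratch:  use_cg ? scratch_cuda_programmable_bootstrap_cg<...>
//                  : scratch_cuda_programmable_bootstrap<...>
// execute:  use_cg ? cuda_programmable_bootstrap_cg_lwe_ciphertext_vector<...>
//                  : cuda_programmable_bootstrap_lwe_ciphertext_vector<...>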
template <typename Torus>
void cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector(
void cuda_programmable_bootstrap_cg_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *buffer, uint32_t lwe_dimension,
pbs_buffer<Torus, CLASSICAL> *buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory) {
switch (polynomial_size) {
case 256:
host_bootstrap_fast_low_latency<Torus, AmortizedDegree<256>>(
host_programmable_bootstrap_cg<Torus, AmortizedDegree<256>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 512:
host_bootstrap_fast_low_latency<Torus, Degree<512>>(
host_programmable_bootstrap_cg<Torus, Degree<512>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 1024:
host_bootstrap_fast_low_latency<Torus, Degree<1024>>(
host_programmable_bootstrap_cg<Torus, Degree<1024>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 2048:
host_bootstrap_fast_low_latency<Torus, AmortizedDegree<2048>>(
host_programmable_bootstrap_cg<Torus, AmortizedDegree<2048>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 4096:
host_bootstrap_fast_low_latency<Torus, AmortizedDegree<4096>>(
host_programmable_bootstrap_cg<Torus, AmortizedDegree<4096>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 8192:
host_bootstrap_fast_low_latency<Torus, AmortizedDegree<8192>>(
host_programmable_bootstrap_cg<Torus, AmortizedDegree<8192>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 16384:
host_bootstrap_fast_low_latency<Torus, AmortizedDegree<16384>>(
host_programmable_bootstrap_cg<Torus, AmortizedDegree<16384>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
default:
PANIC("Cuda error (low latency PBS): unsupported polynomial size. "
PANIC("Cuda error (classical PBS): unsupported polynomial size. "
"Supported N's are powers of two"
" in the interval [256..16384].")
}
}
template <typename Torus>
void cuda_bootstrap_low_latency_lwe_ciphertext_vector(
void cuda_programmable_bootstrap_lwe_ciphertext_vector(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *buffer, uint32_t lwe_dimension,
pbs_buffer<Torus, CLASSICAL> *buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory) {
switch (polynomial_size) {
case 256:
host_bootstrap_low_latency<Torus, AmortizedDegree<256>>(
host_programmable_bootstrap<Torus, AmortizedDegree<256>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 512:
host_bootstrap_low_latency<Torus, Degree<512>>(
host_programmable_bootstrap<Torus, Degree<512>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 1024:
host_bootstrap_low_latency<Torus, Degree<1024>>(
host_programmable_bootstrap<Torus, Degree<1024>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 2048:
host_bootstrap_low_latency<Torus, AmortizedDegree<2048>>(
host_programmable_bootstrap<Torus, AmortizedDegree<2048>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 4096:
host_bootstrap_low_latency<Torus, AmortizedDegree<4096>>(
host_programmable_bootstrap<Torus, AmortizedDegree<4096>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 8192:
host_bootstrap_low_latency<Torus, AmortizedDegree<8192>>(
host_programmable_bootstrap<Torus, AmortizedDegree<8192>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
case 16384:
host_bootstrap_low_latency<Torus, AmortizedDegree<16384>>(
host_programmable_bootstrap<Torus, AmortizedDegree<16384>>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, lwe_array_in, lwe_input_indexes, bootstrapping_key,
buffer, glwe_dimension, lwe_dimension, polynomial_size, base_log,
level_count, num_samples, num_luts, max_shared_memory);
break;
default:
PANIC("Cuda error (low latency PBS): unsupported polynomial size. "
PANIC("Cuda error (classical PBS): unsupported polynomial size. "
"Supported N's are powers of two"
" in the interval [256..16384].")
}
}
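/* [Editor's sketch, not part of this diff] A caller can validate the
 * polynomial size up front instead of relying on the PANIC in the default
 * branch; this hypothetical helper mirrors the set accepted by the switch
 * above: powers of two in [256, 16384]. */
constexpr bool is_supported_pbs_polynomial_size(uint32_t n) {
  return n >= 256 && n <= 16384 && (n & (n - 1)) == 0;
}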
/* Perform bootstrapping on a batch of input u32 LWE ciphertexts.
* This function performs best for small numbers of inputs. Beyond a certain
* number of inputs (the exact number depends on the cryptographic parameters),
* the kernel cannot be launched and it is necessary to split the kernel call
* into several calls on smaller batches of inputs. For more details on this
* operation, refer to the equivalent u64 operation below.
*/
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
void cuda_programmable_bootstrap_lwe_ciphertext_vector_32(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
@@ -331,13 +327,13 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory) {
if (base_log > 32)
PANIC("Cuda error (low latency PBS): base log should be > number of bits "
PANIC("Cuda error (classical PBS): base log should be > number of bits "
"in the ciphertext representation (32)");
if (has_support_to_cuda_bootstrap_fast_low_latency<uint32_t>(
if (has_support_to_cuda_programmable_bootstrap_cg<uint32_t>(
glwe_dimension, polynomial_size, level_count, num_samples,
max_shared_memory))
cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector<uint32_t>(
cuda_programmable_bootstrap_cg_lwe_ciphertext_vector<uint32_t>(
stream, static_cast<uint32_t *>(lwe_array_out),
static_cast<uint32_t *>(lwe_output_indexes),
static_cast<uint32_t *>(lut_vector),
@@ -345,11 +341,11 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
static_cast<uint32_t *>(lwe_array_in),
static_cast<uint32_t *>(lwe_input_indexes),
static_cast<double2 *>(bootstrapping_key),
(pbs_buffer<uint32_t, LOW_LAT> *)buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, num_samples, num_luts, lwe_idx,
max_shared_memory);
(pbs_buffer<uint32_t, CLASSICAL> *)buffer, lwe_dimension,
glwe_dimension, polynomial_size, base_log, level_count, num_samples,
num_luts, lwe_idx, max_shared_memory);
else
cuda_bootstrap_low_latency_lwe_ciphertext_vector<uint32_t>(
cuda_programmable_bootstrap_lwe_ciphertext_vector<uint32_t>(
stream, static_cast<uint32_t *>(lwe_array_out),
static_cast<uint32_t *>(lwe_output_indexes),
static_cast<uint32_t *>(lut_vector),
@@ -357,16 +353,12 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
static_cast<uint32_t *>(lwe_array_in),
static_cast<uint32_t *>(lwe_input_indexes),
static_cast<double2 *>(bootstrapping_key),
(pbs_buffer<uint32_t, LOW_LAT> *)buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, num_samples, num_luts, lwe_idx,
max_shared_memory);
(pbs_buffer<uint32_t, CLASSICAL> *)buffer, lwe_dimension,
glwe_dimension, polynomial_size, base_log, level_count, num_samples,
num_luts, lwe_idx, max_shared_memory);
}
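/* [Editor's note, not part of this diff] The wrapper above prefers the
 * cooperative-groups (cg) kernel whenever
 * has_support_to_cuda_programmable_bootstrap_cg reports that a cooperative
 * launch fits the device, and otherwise falls back to the classical two-step
 * kernel. Both branches view the scratch area through the same
 * pbs_buffer<uint32_t, CLASSICAL> type, so the choice is transparent to the
 * caller. */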
/* Perform bootstrapping on a batch of input u64 LWE ciphertexts.
* This function performs best for small numbers of inputs. Beyond a certain
* number of inputs (the exact number depends on the cryptographic parameters),
* the kernel cannot be launched and it is necessary to split the kernel call
* into several calls on smaller batches of inputs.
*
* - `v_stream` is a void pointer to the Cuda stream to be used in the kernel
* launch
@@ -438,7 +430,7 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_32(
* - the constant memory (64K) is used for storing the roots of unity
* values for the FFT
*/
void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
void cuda_programmable_bootstrap_lwe_ciphertext_vector_64(
cuda_stream_t *stream, void *lwe_array_out, void *lwe_output_indexes,
void *lut_vector, void *lut_vector_indexes, void *lwe_array_in,
void *lwe_input_indexes, void *bootstrapping_key, int8_t *buffer,
@@ -446,13 +438,13 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
uint32_t base_log, uint32_t level_count, uint32_t num_samples,
uint32_t num_luts, uint32_t lwe_idx, uint32_t max_shared_memory) {
if (base_log > 64)
PANIC("Cuda error (low latency PBS): base log should be > number of bits "
PANIC("Cuda error (classical PBS): base log should be > number of bits "
"in the ciphertext representation (64)");
if (has_support_to_cuda_bootstrap_fast_low_latency<uint64_t>(
if (has_support_to_cuda_programmable_bootstrap_cg<uint64_t>(
glwe_dimension, polynomial_size, level_count, num_samples,
max_shared_memory))
cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector<uint64_t>(
cuda_programmable_bootstrap_cg_lwe_ciphertext_vector<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_output_indexes),
static_cast<uint64_t *>(lut_vector),
@@ -460,11 +452,11 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(lwe_input_indexes),
static_cast<double2 *>(bootstrapping_key),
(pbs_buffer<uint64_t, LOW_LAT> *)buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, num_samples, num_luts, lwe_idx,
max_shared_memory);
(pbs_buffer<uint64_t, CLASSICAL> *)buffer, lwe_dimension,
glwe_dimension, polynomial_size, base_log, level_count, num_samples,
num_luts, lwe_idx, max_shared_memory);
else
cuda_bootstrap_low_latency_lwe_ciphertext_vector<uint64_t>(
cuda_programmable_bootstrap_lwe_ciphertext_vector<uint64_t>(
stream, static_cast<uint64_t *>(lwe_array_out),
static_cast<uint64_t *>(lwe_output_indexes),
static_cast<uint64_t *>(lut_vector),
@@ -472,90 +464,85 @@ void cuda_bootstrap_low_latency_lwe_ciphertext_vector_64(
static_cast<uint64_t *>(lwe_array_in),
static_cast<uint64_t *>(lwe_input_indexes),
static_cast<double2 *>(bootstrapping_key),
(pbs_buffer<uint64_t, LOW_LAT> *)buffer, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, num_samples, num_luts, lwe_idx,
max_shared_memory);
(pbs_buffer<uint64_t, CLASSICAL> *)buffer, lwe_dimension,
glwe_dimension, polynomial_size, base_log, level_count, num_samples,
num_luts, lwe_idx, max_shared_memory);
}
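/* [Editor's usage sketch, not part of this diff] A typical lifecycle for the
 * 64-bit entry point: allocate scratch, bootstrap, release. The scratch
 * entry point name scratch_cuda_programmable_bootstrap_64 is assumed from
 * the rest of the backend (its rename is outside this hunk), and the
 * parameter order mirrors the templated overloads above. */
int8_t *pbs_scratch = nullptr;
scratch_cuda_programmable_bootstrap_64(
    stream, &pbs_scratch, glwe_dimension, polynomial_size, level_count,
    num_samples, max_shared_memory, /*allocate_gpu_memory=*/true);
cuda_programmable_bootstrap_lwe_ciphertext_vector_64(
    stream, lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
    lwe_array_in, lwe_input_indexes, bootstrapping_key, pbs_scratch,
    lwe_dimension, glwe_dimension, polynomial_size, base_log, level_count,
    num_samples, num_luts, /*lwe_idx=*/0, max_shared_memory);
cleanup_cuda_programmable_bootstrap(stream, &pbs_scratch);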
/*
* This cleanup function frees the data for the low latency PBS on GPU in
* buffer for 32 or 64 bits inputs.
* This cleanup function frees the GPU data held in the PBS buffer for 32- or
* 64-bit inputs.
*/
void cleanup_cuda_bootstrap_low_latency_32(cuda_stream_t *stream,
int8_t **buffer) {
auto x = (pbs_buffer<uint32_t, LOW_LAT> *)(*buffer);
x->release(stream);
}
void cleanup_cuda_bootstrap_low_latency_64(cuda_stream_t *stream,
int8_t **buffer) {
auto x = (pbs_buffer<uint64_t, LOW_LAT> *)(*buffer);
void cleanup_cuda_programmable_bootstrap(cuda_stream_t *stream,
int8_t **buffer) {
auto x = (pbs_buffer<uint64_t, CLASSICAL> *)(*buffer);
x->release(stream);
}
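/* [Editor's sketch, not part of this diff] Callers pairing scratch and
 * cleanup manually can wrap the buffer in a small RAII guard; this type is
 * hypothetical and only forwards to the cleanup entry point defined above. */
struct PbsBufferGuard {
  cuda_stream_t *stream;
  int8_t *buffer = nullptr;
  ~PbsBufferGuard() {
    if (buffer != nullptr)
      cleanup_cuda_programmable_bootstrap(stream, &buffer);
  }
};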
template bool has_support_to_cuda_bootstrap_fast_low_latency<uint64_t>(
template bool has_support_to_cuda_programmable_bootstrap_cg<uint64_t>(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t num_samples, uint32_t max_shared_memory);
template void cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector<uint64_t>(
template void cuda_programmable_bootstrap_cg_lwe_ciphertext_vector<uint64_t>(
cuda_stream_t *stream, uint64_t *lwe_array_out,
uint64_t *lwe_output_indexes, uint64_t *lut_vector,
uint64_t *lut_vector_indexes, uint64_t *lwe_array_in,
uint64_t *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<uint64_t, LOW_LAT> *pbs_buffer, uint32_t lwe_dimension,
pbs_buffer<uint64_t, CLASSICAL> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template void cuda_bootstrap_low_latency_lwe_ciphertext_vector<uint64_t>(
template void cuda_programmable_bootstrap_lwe_ciphertext_vector<uint64_t>(
cuda_stream_t *stream, uint64_t *lwe_array_out,
uint64_t *lwe_output_indexes, uint64_t *lut_vector,
uint64_t *lut_vector_indexes, uint64_t *lwe_array_in,
uint64_t *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<uint64_t, LOW_LAT> *pbs_buffer, uint32_t lwe_dimension,
pbs_buffer<uint64_t, CLASSICAL> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template void scratch_cuda_fast_bootstrap_low_latency<uint64_t, int64_t>(
cuda_stream_t *stream, pbs_buffer<uint64_t, LOW_LAT> **pbs_buffer,
template void scratch_cuda_programmable_bootstrap_cg<uint64_t, int64_t>(
cuda_stream_t *stream, pbs_buffer<uint64_t, CLASSICAL> **pbs_buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
template void scratch_cuda_bootstrap_low_latency<uint64_t, int64_t>(
cuda_stream_t *stream, pbs_buffer<uint64_t, LOW_LAT> **buffer,
template void scratch_cuda_programmable_bootstrap<uint64_t, int64_t>(
cuda_stream_t *stream, pbs_buffer<uint64_t, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
template void cuda_bootstrap_fast_low_latency_lwe_ciphertext_vector<uint32_t>(
template void cuda_programmable_bootstrap_cg_lwe_ciphertext_vector<uint32_t>(
cuda_stream_t *stream, uint32_t *lwe_array_out,
uint32_t *lwe_output_indexes, uint32_t *lut_vector,
uint32_t *lut_vector_indexes, uint32_t *lwe_array_in,
uint32_t *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<uint32_t, LOW_LAT> *pbs_buffer, uint32_t lwe_dimension,
pbs_buffer<uint32_t, CLASSICAL> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template void cuda_bootstrap_low_latency_lwe_ciphertext_vector<uint32_t>(
template void cuda_programmable_bootstrap_lwe_ciphertext_vector<uint32_t>(
cuda_stream_t *stream, uint32_t *lwe_array_out,
uint32_t *lwe_output_indexes, uint32_t *lut_vector,
uint32_t *lut_vector_indexes, uint32_t *lwe_array_in,
uint32_t *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<uint32_t, LOW_LAT> *pbs_buffer, uint32_t lwe_dimension,
pbs_buffer<uint32_t, CLASSICAL> *pbs_buffer, uint32_t lwe_dimension,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t num_samples, uint32_t num_luts,
uint32_t lwe_idx, uint32_t max_shared_memory);
template void scratch_cuda_fast_bootstrap_low_latency<uint32_t, int32_t>(
cuda_stream_t *stream, pbs_buffer<uint32_t, LOW_LAT> **pbs_buffer,
template void scratch_cuda_programmable_bootstrap_cg<uint32_t, int32_t>(
cuda_stream_t *stream, pbs_buffer<uint32_t, CLASSICAL> **pbs_buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
template void scratch_cuda_bootstrap_low_latency<uint32_t, int32_t>(
cuda_stream_t *stream, pbs_buffer<uint32_t, LOW_LAT> **buffer,
template void scratch_cuda_programmable_bootstrap<uint32_t, int32_t>(
cuda_stream_t *stream, pbs_buffer<uint32_t, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory);
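/* [Editor's note, not part of this diff] The explicit instantiations above
 * let the templated definitions stay in this translation unit while only the
 * uint32_t and uint64_t torus widths are exported; supporting another width
 * would need a matching block of instantiations. */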


@@ -1,12 +1,11 @@
#ifndef CUDA_LOWLAT_PBS_CUH
#define CUDA_LOWLAT_PBS_CUH
#ifndef CUDA_PBS_CUH
#define CUDA_PBS_CUH
#ifdef __CDT_PARSER__
#undef __CUDA_RUNTIME_H__
#include <cuda_runtime.h>
#endif
#include "bootstrap.h"
#include "crypto/gadget.cuh"
#include "crypto/torus.cuh"
#include "device.h"
@@ -14,10 +13,11 @@
#include "fft/twiddles.cuh"
#include "polynomial/parameters.cuh"
#include "polynomial/polynomial_math.cuh"
#include "programmable_bootstrap.h"
#include "types/complex/operations.cuh"
template <typename Torus, class params, sharedMemDegree SMD>
__global__ void device_bootstrap_low_latency_step_one(
__global__ void device_programmable_bootstrap_step_one(
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
Torus *global_accumulator, double2 *global_accumulator_fft,
@@ -127,7 +127,7 @@ __global__ void device_bootstrap_low_latency_step_one(
}
template <typename Torus, class params, sharedMemDegree SMD>
__global__ void device_bootstrap_low_latency_step_two(
__global__ void device_programmable_bootstrap_step_two(
Torus *lwe_array_out, Torus *lwe_output_indexes, Torus *lut_vector,
Torus *lut_vector_indexes, double2 *bootstrapping_key,
Torus *global_accumulator, double2 *global_accumulator_fft,
@@ -222,18 +222,18 @@ __global__ void device_bootstrap_low_latency_step_two(
}
template <typename Torus>
__host__ __device__ uint64_t get_buffer_size_bootstrap_low_latency(
__host__ __device__ uint64_t get_buffer_size_programmable_bootstrap(
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory) {
uint64_t full_sm_step_one =
get_buffer_size_full_sm_bootstrap_low_latency_step_one<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_one<Torus>(
polynomial_size);
uint64_t full_sm_step_two =
get_buffer_size_full_sm_bootstrap_low_latency_step_two<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_two<Torus>(
polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_low_latency<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap<Torus>(polynomial_size);
uint64_t partial_dm_step_one = full_sm_step_one - partial_sm;
uint64_t partial_dm_step_two = full_sm_step_two - partial_sm;
@@ -263,37 +263,37 @@ __host__ __device__ uint64_t get_buffer_size_bootstrap_low_latency(
}
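/* [Editor's sketch, not part of this diff] The buffer-size computation above
 * reserves device memory for whatever does not fit in shared memory:
 * partial_dm_step_x = full_sm_step_x - partial_sm. A hypothetical helper
 * makes the three launch regimes explicit. */
enum class SmRegime { NoSm, PartialSm, FullSm };
inline uint64_t spilled_bytes_per_block(SmRegime regime, uint64_t full_sm,
                                        uint64_t partial_sm) {
  switch (regime) {
  case SmRegime::NoSm:
    return full_sm; // nothing fits: the whole working set lives in d_mem
  case SmRegime::PartialSm:
    return full_sm - partial_sm; // only the overflow spills to d_mem
  case SmRegime::FullSm:
    return 0; // everything fits in dynamic shared memory
  }
  return 0;
}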
template <typename Torus, typename STorus, typename params>
__host__ void scratch_bootstrap_low_latency(
cuda_stream_t *stream, pbs_buffer<Torus, LOW_LAT> **buffer,
__host__ void scratch_programmable_bootstrap(
cuda_stream_t *stream, pbs_buffer<Torus, CLASSICAL> **buffer,
uint32_t glwe_dimension, uint32_t polynomial_size, uint32_t level_count,
uint32_t input_lwe_ciphertext_count, uint32_t max_shared_memory,
bool allocate_gpu_memory) {
cudaSetDevice(stream->gpu_index);
uint64_t full_sm_step_one =
get_buffer_size_full_sm_bootstrap_low_latency_step_one<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_one<Torus>(
polynomial_size);
uint64_t full_sm_step_two =
get_buffer_size_full_sm_bootstrap_low_latency_step_two<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_two<Torus>(
polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_low_latency<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap<Torus>(polynomial_size);
// Configure step one
if (max_shared_memory >= partial_sm && max_shared_memory < full_sm_step_one) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_low_latency_step_one<Torus, params, PARTIALSM>,
device_programmable_bootstrap_step_one<Torus, params, PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, partial_sm));
cudaFuncSetCacheConfig(
device_bootstrap_low_latency_step_one<Torus, params, PARTIALSM>,
device_programmable_bootstrap_step_one<Torus, params, PARTIALSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else if (max_shared_memory >= partial_sm) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_low_latency_step_one<Torus, params, FULLSM>,
device_programmable_bootstrap_step_one<Torus, params, FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_step_one));
cudaFuncSetCacheConfig(
device_bootstrap_low_latency_step_one<Torus, params, FULLSM>,
device_programmable_bootstrap_step_one<Torus, params, FULLSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
}
@@ -301,29 +301,29 @@ __host__ void scratch_bootstrap_low_latency(
// Configure step two
if (max_shared_memory >= partial_sm && max_shared_memory < full_sm_step_two) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_low_latency_step_two<Torus, params, PARTIALSM>,
device_programmable_bootstrap_step_two<Torus, params, PARTIALSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, partial_sm));
cudaFuncSetCacheConfig(
device_bootstrap_low_latency_step_two<Torus, params, PARTIALSM>,
device_programmable_bootstrap_step_two<Torus, params, PARTIALSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
} else if (max_shared_memory >= partial_sm) {
check_cuda_error(cudaFuncSetAttribute(
device_bootstrap_low_latency_step_two<Torus, params, FULLSM>,
device_programmable_bootstrap_step_two<Torus, params, FULLSM>,
cudaFuncAttributeMaxDynamicSharedMemorySize, full_sm_step_two));
cudaFuncSetCacheConfig(
device_bootstrap_low_latency_step_two<Torus, params, FULLSM>,
device_programmable_bootstrap_step_two<Torus, params, FULLSM>,
cudaFuncCachePreferShared);
check_cuda_error(cudaGetLastError());
}
*buffer = new pbs_buffer<Torus, LOW_LAT>(
*buffer = new pbs_buffer<Torus, CLASSICAL>(
stream, glwe_dimension, polynomial_size, level_count,
input_lwe_ciphertext_count, PBS_VARIANT::DEFAULT, allocate_gpu_memory);
}
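/* [Editor's note, not part of this diff] Summary of the shared-memory
 * configuration above, applied per step kernel:
 *   max_shared_memory <  partial_sm                -> NOSM      (0 bytes)
 *   partial_sm <= max_shared_memory < full_sm_step -> PARTIALSM (partial_sm)
 *   max_shared_memory >= full_sm_step              -> FULLSM    (full_sm_step)
 * The cudaFuncSetAttribute calls are required because a kernel may not
 * request more dynamic shared memory than the default budget unless it opts
 * in before launch. */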
template <typename Torus, class params>
__host__ void execute_low_latency_step_one(
__host__ void execute_step_one(
cuda_stream_t *stream, Torus *lut_vector, Torus *lut_vector_indexes,
Torus *lwe_array_in, Torus *lwe_input_indexes, double2 *bootstrapping_key,
Torus *global_accumulator, double2 *global_accumulator_fft,
@@ -337,21 +337,21 @@ __host__ void execute_low_latency_step_one(
dim3 grid(level_count, glwe_dimension + 1, input_lwe_ciphertext_count);
if (max_shared_memory < partial_sm) {
device_bootstrap_low_latency_step_one<Torus, params, NOSM>
device_programmable_bootstrap_step_one<Torus, params, NOSM>
<<<grid, thds, 0, stream->stream>>>(
lut_vector, lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
lwe_iteration, lwe_dimension, polynomial_size, base_log,
level_count, d_mem, full_dm);
} else if (max_shared_memory < full_sm) {
device_bootstrap_low_latency_step_one<Torus, params, PARTIALSM>
device_programmable_bootstrap_step_one<Torus, params, PARTIALSM>
<<<grid, thds, partial_sm, stream->stream>>>(
lut_vector, lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
lwe_iteration, lwe_dimension, polynomial_size, base_log,
level_count, d_mem, partial_dm);
} else {
device_bootstrap_low_latency_step_one<Torus, params, FULLSM>
device_programmable_bootstrap_step_one<Torus, params, FULLSM>
<<<grid, thds, full_sm, stream->stream>>>(
lut_vector, lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
@@ -362,7 +362,7 @@ __host__ void execute_low_latency_step_one(
}
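/* [Editor's note, not part of this diff] Step one launches one block per
 * (decomposition level, GLWE polynomial, input ciphertext) triple: with
 * level_count = 1, glwe_dimension = 1 and a batch of 4 inputs,
 * grid(level_count, glwe_dimension + 1, input_lwe_ciphertext_count) comes to
 * 1 x 2 x 4 = 8 blocks, each launched with the dynamic shared-memory size
 * selected by the NOSM / PARTIALSM / FULLSM branches above. */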
template <typename Torus, class params>
__host__ void execute_low_latency_step_two(
__host__ void execute_step_two(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, double2 *bootstrapping_key,
Torus *global_accumulator, double2 *global_accumulator_fft,
@@ -376,21 +376,21 @@ __host__ void execute_low_latency_step_two(
dim3 grid(input_lwe_ciphertext_count, glwe_dimension + 1);
if (max_shared_memory < partial_sm) {
device_bootstrap_low_latency_step_two<Torus, params, NOSM>
device_programmable_bootstrap_step_two<Torus, params, NOSM>
<<<grid, thds, 0, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
lwe_iteration, lwe_dimension, polynomial_size, base_log,
level_count, d_mem, full_dm);
} else if (max_shared_memory < full_sm) {
device_bootstrap_low_latency_step_two<Torus, params, PARTIALSM>
device_programmable_bootstrap_step_two<Torus, params, PARTIALSM>
<<<grid, thds, partial_sm, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
lwe_iteration, lwe_dimension, polynomial_size, base_log,
level_count, d_mem, partial_dm);
} else {
device_bootstrap_low_latency_step_two<Torus, params, FULLSM>
device_programmable_bootstrap_step_two<Torus, params, FULLSM>
<<<grid, thds, full_sm, stream->stream>>>(
lwe_array_out, lwe_output_indexes, lut_vector, lut_vector_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
@@ -400,15 +400,14 @@ __host__ void execute_low_latency_step_two(
check_cuda_error(cudaGetLastError());
}
/*
* Host wrapper to the low latency version
* of bootstrapping
* Host wrapper to the programmable bootstrap
*/
template <typename Torus, class params>
__host__ void host_bootstrap_low_latency(
__host__ void host_programmable_bootstrap(
cuda_stream_t *stream, Torus *lwe_array_out, Torus *lwe_output_indexes,
Torus *lut_vector, Torus *lut_vector_indexes, Torus *lwe_array_in,
Torus *lwe_input_indexes, double2 *bootstrapping_key,
pbs_buffer<Torus, LOW_LAT> *pbs_buffer, uint32_t glwe_dimension,
pbs_buffer<Torus, CLASSICAL> *pbs_buffer, uint32_t glwe_dimension,
uint32_t lwe_dimension, uint32_t polynomial_size, uint32_t base_log,
uint32_t level_count, uint32_t input_lwe_ciphertext_count,
uint32_t num_luts, uint32_t max_shared_memory) {
@@ -417,14 +416,14 @@ __host__ void host_bootstrap_low_latency(
// With shared memory (SM), each block corresponds to either the mask or the
// body, so there is no need to duplicate data for each
uint64_t full_sm_step_one =
get_buffer_size_full_sm_bootstrap_low_latency_step_one<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_one<Torus>(
polynomial_size);
uint64_t full_sm_step_two =
get_buffer_size_full_sm_bootstrap_low_latency_step_two<Torus>(
get_buffer_size_full_sm_programmable_bootstrap_step_two<Torus>(
polynomial_size);
uint64_t partial_sm =
get_buffer_size_partial_sm_bootstrap_low_latency<Torus>(polynomial_size);
get_buffer_size_partial_sm_programmable_bootstrap<Torus>(polynomial_size);
uint64_t partial_dm_step_one = full_sm_step_one - partial_sm;
uint64_t partial_dm_step_two = full_sm_step_two - partial_sm;
@@ -436,13 +435,13 @@ __host__ void host_bootstrap_low_latency(
int8_t *d_mem = pbs_buffer->d_mem;
for (int i = 0; i < lwe_dimension; i++) {
execute_low_latency_step_one<Torus, params>(
execute_step_one<Torus, params>(
stream, lut_vector, lut_vector_indexes, lwe_array_in, lwe_input_indexes,
bootstrapping_key, global_accumulator, global_accumulator_fft,
input_lwe_ciphertext_count, lwe_dimension, glwe_dimension,
polynomial_size, base_log, level_count, d_mem, max_shared_memory, i,
partial_sm, partial_dm_step_one, full_sm_step_one, full_dm_step_one);
execute_low_latency_step_two<Torus, params>(
execute_step_two<Torus, params>(
stream, lwe_array_out, lwe_output_indexes, lut_vector,
lut_vector_indexes, bootstrapping_key, global_accumulator,
global_accumulator_fft, input_lwe_ciphertext_count, lwe_dimension,
@@ -452,4 +451,4 @@ __host__ void host_bootstrap_low_latency(
}
}
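/* [Editor's note, not part of this diff] The host wrapper above performs the
 * blind rotation as lwe_dimension rounds, each round being two kernel
 * launches on the same stream: broadly, step one decomposes the accumulator
 * and moves it to the Fourier domain, and step two applies the matching
 * bootstrapping-key entry and accumulates the result. Rounds are serialized
 * by the stream, while all ciphertexts in the batch advance in parallel
 * within each round. */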
#endif // LOWLAT_PBS_H
#endif // CUDA_PBS_CUH

Some files were not shown because too many files have changed in this diff.